Deploy on GPU

The following procedure demonstrates how to upload and deploy an LLM with GPUs. The majority of these are Hugging Face LLMs packaged as Wallaroo BYOP framework models.

These upload and deploy instructions have been tested with and apply to the following LLM models:

For access to these sample models and a demonstration of using LLMs with Wallaroo:

Upload the LLM Model

LLM models are uploaded to Wallaroo via one of two methods:

  • The Wallaroo SDK wallaroo.client.Client.upload_model method.
  • The Wallaroo MLOps API POST /v1/api/models/upload_and_convert endpoint.

Upload LLM via the Wallaroo SDK

Models are uploaded with the Wallaroo SDK via the wallaroo.client.Client.upload_model method.

SDK Upload Model Parameters

wallaroo.client.Client.upload_model has the following parameters.

  • name: string (Required). The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model.
  • path: string (Required). The path to the model file being uploaded.
  • framework: string (Required). The framework of the model from wallaroo.framework.
  • input_schema: pyarrow.lib.Schema (Optional for native Wallaroo runtimes, Required for non-native Wallaroo runtimes). The input schema in Apache Arrow schema format.
  • output_schema: pyarrow.lib.Schema (Optional for native Wallaroo runtimes, Required for non-native Wallaroo runtimes). The output schema in Apache Arrow schema format.
  • convert_wait: bool (Optional).
    • True: Waits in the script for the model conversion to complete.
    • False: Proceeds with the script without waiting for the model conversion process to complete.
  • arch: wallaroo.engine_config.Architecture (Optional). The architecture the model is deployed to. If a model is intended for deployment to an ARM architecture, it must be specified during this step. Values include:
    • X86 (Default): x86 based architectures.
    • ARM: ARM based architectures.
  • accel: wallaroo.engine_config.Acceleration (Optional). The AI hardware accelerator used. If a model is intended for use with a hardware accelerator, it should be assigned at this step. Values include:
    • wallaroo.engine_config.Acceleration._None (Default): No accelerator is assigned. This works for all infrastructures.
    • wallaroo.engine_config.Acceleration.AIO: AIO acceleration for Ampere Optimized trained models, only available with ARM processors.
    • wallaroo.engine_config.Acceleration.Jetson: Nvidia Jetson acceleration used with edge deployments with ARM processors.
    • wallaroo.engine_config.Acceleration.CUDA: Nvidia CUDA acceleration supported by both ARM and X64/X86 processors. This is intended for deployment with GPUs.

SDK Upload Model Returns

wallaroo.client.Client.upload_model returns the model version. The model version refers to the version of the model object in Wallaroo. In Wallaroo, a model version update happens when we upload a new model file (artifact) against the same model object name.
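
For example, uploading a new model file under an existing model name registers it as a new version of that model rather than as a separate model. A minimal sketch, assuming the variables from the example below and a hypothetical updated artifact:

# hypothetical: upload a refreshed artifact under the same model name,
# which creates a new version of the existing model object
model_v2 = wl.upload_model(
  name = model_name,                              # same name as the earlier upload
  path = 'llama_byop_llama3_instruct_8b_v2.zip',  # hypothetical new model file
  input_schema = input_schema,
  output_schema = output_schema,
  framework = framework
)
display(model_v2)  # the version field shows the newly created model version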

SDK Upload Model Example

The following example demonstrates uploading an LLM using the Wallaroo SDK.

import wallaroo

# connect to Wallaroo

wl = wallaroo.Client()

# upload the model

model = wl.upload_model(
  name = model_name,
  path = file_path,
  input_schema = input_schema,
  output_schema = output_schema,
  framework = framework
)
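
Since this guide targets GPU deployment, the arch and accel parameters described above can also be set at upload time. The following is a minimal sketch under the same assumptions as the example above; the x86 architecture and CUDA accelerator values shown are illustrative:

from wallaroo.engine_config import Architecture, Acceleration

# sketch: upload targeting x86 hosts with CUDA GPU acceleration
model = wl.upload_model(
  name = model_name,
  path = file_path,
  input_schema = input_schema,
  output_schema = output_schema,
  framework = framework,
  arch = Architecture.X86,       # default architecture, shown for clarity
  accel = Acceleration.CUDA      # CUDA acceleration for deployment with GPUs
)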

Upload LLM via the Wallaroo MLOps API

Models are uploaded via the Wallaroo MLOps API endpoint POST /v1/api/models/upload_and_convert.

This endpoint has the following settings. For full examples of uploading models via the Wallaroo MLOps API, see Wallaroo MLOps API Essentials Guide: Model Upload and Registrations

  • Endpoint: POST /v1/api/models/upload_and_convert
  • Headers
    • Authorization: Bearer {token}: The authentication token for a Wallaroo user with access to the workspace the model is uploaded to. See How to generate the MLOps API Authentication Token below.
    • Content-Type: multipart/form-data: Files are uploaded in the multipart/form-data format with two parts:
      • metadata: Provides the model parameter data as Content-Type application/json. See Upload Model to Workspace Parameters.
      • file: The binary file (ONNX, .zip, etc) as Content-Type application/octet-stream.

How to generate the MLOps API Authentication Token

The API based upload process requires an authentication token. The following is required to retrieve the token.

  • The Wallaroo instance authentication service address: The authentication service URL for the Wallaroo instance. By default, this is keycloak.$WALLAROO_SUFFIX, where $WALLAROO_SUFFIX is the DNS suffix of the Wallaroo model ops center. For example, if the DNS address for the Wallaroo instance is wallaroo.example.com, the authentication service URL is keycloak.wallaroo.example.com. For more details on Wallaroo and DNS services, see the Wallaroo DNS Configuration Guide (https://docs.wallaroo.ai/wallaroo-platform-operations/wallaroo-platform-operations-configure/wallaroo-dns-guide/).

  • The confidential client: sdk-client.

  • The Wallaroo username making the MLOps API request: Typically this is the user’s email address.

  • The Wallaroo user’s password.

  • $WALLAROO_USERNAME: The user name of the entity authenticating to the Wallaroo model ops center.

  • $WALLAROO_PASSWORD: The password for the entity authenticating to the Wallaroo model ops center.

  • $WALLAROO_AUTH_URL: The authentication URL.

The following example shows retrieving the authentication token using curl. Update the variables based on your instance.

export WALLAROO_USERNAME="username"
export WALLAROO_PASSWORD="password"
export WALLAROO_AUTH_URL="keycloak.wallaroo.example.com"

TOKEN=$(curl -s -X POST "https://$WALLAROO_AUTH_URL/auth/realms/master/protocol/openid-connect/token" \
                 -d client_id=sdk-client \
                 -d username=${WALLAROO_USERNAME} \
                 -d password=${WALLAROO_PASSWORD} \
                 -d grant_type=password | jq -r .access_token)
echo $TOKEN

"abc123"

MLOps API Upload LLM Model Parameters

The following parameters are part of the metadata portion of the upload request.

  • name: String (Required). The model name.
  • visibility: String (Required). Either public or private.
  • workspace_id: String (Required). The numerical ID of the workspace to upload the model to.
  • conversion: Dict (Required). The conversion parameters. The following conversion.* values are parameters of this field.
  • conversion.framework: String (Required). The framework of the model being uploaded. See the list of supported models for more details.
  • conversion.python_version: String (Required). The version of Python required for the model.
  • conversion.requirements: String (Required). Required libraries. Can be [] if the requirements are default Wallaroo JupyterHub libraries.
  • conversion.input_schema: String (Optional). The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode. Only required for Containerized Wallaroo Runtime models. See How to Convert Input and Output Schema to Base64 Format below.
  • conversion.output_schema: String (Optional). The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode. Only required for non-native runtime models. See How to Convert Input and Output Schema to Base64 Format below.
  • conversion.arch: String (Optional). The architecture the model is deployed to. If a model is intended for deployment to an ARM architecture, it must be specified during this step. Values include:
    • x86 (Default): x86 based architectures.
    • arm: ARM based architectures.
  • conversion.accel: String (Optional). The AI hardware accelerator used. If a model is intended for use with a hardware accelerator, it should be assigned at this step. The values correspond to the accelerator options listed for the SDK accel parameter above.
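
As a sketch of assembling the same request in Python with the requests library (not part of the Wallaroo SDK): the wallaroo_api_url, token, workspace_id, and base64-encoded schema variables (see the next section) are assumed to already exist, and the framework string is an illustrative value for a BYOP model.

import json
import requests

# assumed variables: wallaroo_api_url, token, workspace_id,
# encoded_input_schema, encoded_output_schema
metadata = {
    "name": "llama3-instruct-8b",
    "visibility": "private",
    "workspace_id": workspace_id,
    "conversion": {
        "framework": "custom",   # assumption: framework string for a BYOP model
        "python_version": "3.8",
        "requirements": [],
        "input_schema": encoded_input_schema,
        "output_schema": encoded_output_schema,
        "arch": "x86",
    },
}

# upload the metadata and the model file as the two multipart/form-data parts
with open("llama_byop_llama3_instruct_8b.zip", "rb") as model_file:
    response = requests.post(
        f"{wallaroo_api_url}/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {token}"},
        files={
            "metadata": ("metadata", json.dumps(metadata), "application/json"),
            "file": ("llama_byop_llama3_instruct_8b.zip", model_file,
                     "application/octet-stream"),
        },
    )
print(response.json())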

How to Convert Input and Output Schema to Base64 Format

Models packaged as Wallaroo Containerized Runtimes require the input and output schema formatted in the Apache Arrow PyArrow Schema format and encoded in base64. The following demonstrates converting an Apache Arrow PyArrow Schema to base64.

import pyarrow as pa
import base64

input_schema = pa.schema([
    pa.field('input_1', pa.list_(pa.float32(), list_size=10)),
    pa.field('input_2', pa.list_(pa.float32(), list_size=5))
])
output_schema = pa.schema([
    pa.field('output_1', pa.list_(pa.float32(), list_size=3)),
    pa.field('output_2', pa.list_(pa.float32(), list_size=2))
])

encoded_input_schema = base64.b64encode(
                bytes(input_schema.serialize())
            ).decode("utf8")

encoded_output_schema = base64.b64encode(
                bytes(output_schema.serialize())
            ).decode("utf8")

MLOps API Upload LLM Model Example

The following example demonstrates uploading a model via the Wallaroo MLOps API using curl. It uses the following environment variables:

  • $TOKEN: The bearer authentication token.
  • $NAME: The name of the model.
  • $WORKSPACE_ID: The workspace to upload the model to.
  • $FRAMEWORK: The model framework.
  • $INPUT_SCHEMA: The input schema in PyArrow Schema converted to Base64 format.
  • $OUTPUT_SCHEMA: The output schema in PyArrow Schema converted to Base64 format.
  • $MODEL_PATH: The path to the model file being uploaded.
  • $URL: The base URL for the Wallaroo instance API.

curl --progress-bar -X POST \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $TOKEN" \
  -F 'metadata={
        "name": "'"$NAME"'",
        "visibility": "private",
        "workspace_id": '"$WORKSPACE_ID"',
        "conversion": {
          "framework": "'"$FRAMEWORK"'",
          "python_version": "3.8",
          "requirements": [],
          "input_schema": "'"$INPUT_SCHEMA"'",
          "output_schema": "'"$OUTPUT_SCHEMA"'",
          "arch": "x86"
        }
      };type=application/json' \
  -F "file=@$MODEL_PATH;type=application/octet-stream" \
  "$URL/v1/api/models/upload_and_convert" | cat

LLM Deploy

LLMs are deployed via the Wallaroo SDK through the following process:

  1. After the model is uploaded, get the LLM model reference from Wallaroo.
  2. Create or use an existing Wallaroo pipeline and assign the LLM as a pipeline model step.
  3. Set the deployment configuration to assign the resources including the number of CPUs, amount of RAM, etc for the LLM deployment.
  4. Deploy the LLM with the deployment configuration.

Retrieve LLM

LLMs previously uploaded to Wallaroo can be retrieved without re-uploading the LLM via the Wallaroo SDK method wallaroo.client.Client.get_model(name: String, version: String), which takes the following parameters:

  • name: The name of the model.
  • version: (Optional) The model version to retrieve. If not specified, the most recent version is returned.

The method wallaroo.client.Client.get_model retrieves the most recent model version in the current workspace that matches the provided model name unless a specific version is requested. For more details on managing ML models in Wallaroo, see Manage Models.

The following demonstrates retrieving an uploaded LLM and storing it in the variable model_version.

import wallaroo

# connect with the Wallaroo client

wl = wallaroo.Client()

llm_model = wl.get_model(name=model_name)
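
To retrieve a specific model version rather than the most recent one, the version identifier can be passed as well. A minimal sketch; the version string shown is only an example value:

# retrieve a specific version of a previously uploaded LLM
llm_model = wl.get_model(name=model_name,
                         version="a3d8e89c-f662-49bf-bd3e-0b192f70c8b6")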

Create the Wallaroo Pipeline and Add Model Step

LLMs are deployed via Wallaroo pipelines. Wallaroo pipelines are created in the current user’s workspace with the Wallaroo SDK method wallaroo.client.Client.build_pipeline(pipeline_name: String). This creates a pipeline in the user’s current workspace with the provided pipeline_name, and returns wallaroo.pipeline.Pipeline, which can be saved to a variable for other commands.

Pipeline names are unique within a workspace; using the build_pipeline method within a workspace where another pipeline with the same name exists will connect to the existing pipeline.

Once the pipeline reference is stored to a variable, LLMs are added to the pipeline as a pipeline step with the method wallaroo.pipeline.Pipeline.add_model_step(model_version: wallaroo.model_version.ModelVersion). We demonstrated retrieving the LLM model version in the Retrieve LLM step above.

This example demonstrates creating a pipeline and adding a model version as a pipeline step. For more details on managing Wallaroo pipelines for model deployment, see the Model Deploy guide.

# create the pipeline
llm_pipeline = wl.build_pipeline('sample-llm-pipeline')

# add the LLM as a pipeline model step
llm_pipeline.add_model_step(llm_model)

Set the Deployment Configuration and Deploy the Model

Before deploying the LLM, a deployment configuration is created. This sets how the cluster’s resources are allocated for the LLM’s exclusive use.

  • Pipeline deployment configurations are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
  • Various options including the number of cpus, RAM, and other resources are set for the Wallaroo Native Runtime, and the Wallaroo Containerized Runtime.
    • Typically, LLMs are deployed in the Wallaroo Containerized Runtime, which is configured through the DeploymentConfigBuilder’s sidekick options.

LLMs deployed with GPUs must include the following parameters:

  • sidekick_gpus(model: wallaroo.model.Model, core_count: int): Sets the number of GPUs allocated to the LLM.
  • deployment_label(label: string): The deployment label that matches the nodepool with the GPU nodes. This ensures that the LLM is deployed in the correct nodepool with the required hardware. For examples on setting up a nodepool with GPUs for LLM deployment, see Large Language Models Infrastructure Requirements.

Once the configuration options are set the deployment configuration is finalized with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.

The following options are available for deployment configurations for LLM deployments. For more details on deployment configurations, see Deployment Configuration guide.

  • replica_count(count: int): The number of replicas to deploy. This allows multiple deployments of the same models to be deployed to increase inferences through parallelization.
  • replica_autoscale_min_max(maximum: int, minimum: int = 0): Provides replicas to be scaled from the minimum (default 0) to the maximum number of replicas. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs.
  • autoscale_cpu_utilization(cpu_utilization_percentage: int): Sets the average CPU percentage metric for when to load or unload another replica.
  • cpus(core_count: float): Sets the number or fraction of CPUs to use for the deployment, for example: 0.25, 1, 1.5, etc. The units are similar to the Kubernetes CPU definitions.
  • gpus(core_count: int): Sets the number of GPUs to allocate for native runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have gpus allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If gpus is called, then deployment_label must also be called and match the GPU nodepool for the Wallaroo Cluster hosting the Wallaroo instance.
  • memory(memory_spec: str): Sets the amount of RAM to allocate the deployment. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are:
    • KiB (for KiloBytes)
    • MiB (for MegaBytes)
    • GiB (for GigaBytes)
    • TiB (for TeraBytes)
    The values are similar to the Kubernetes memory resource units format.
  • deployment_label(label: string): Label used to match the nodepool label used for the deployment. Required if gpus are set and must match the GPU nodepool label. See Create GPU Nodepools for Kubernetes Clusters for details on setting up GPU nodepools for Wallaroo.
  • sidekick_cpus(model: wallaroo.model.Model, core_count: float): Sets the number of CPUs to be used for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. The parameters are as follows:
    • Model model: The sidekick model to configure.
    • float core_count: Number of CPU cores to use in this sidekick.
  • sidekick_gpus(model: wallaroo.model.Model, core_count: int): Sets the number of GPUs allocated to the model’s sidekick container. GPUs are allocated in whole units. If sidekick_gpus is set, deployment_label must also be set and match the GPU nodepool label.
  • sidekick_memory(model: wallaroo.model.Model, memory_spec: str): Sets the memory available for the model’s sidekick container. The parameters are as follows:
    • Model model: The sidekick model to configure.
    • memory_spec: The amount of memory to allocate as memory unit values. The accepted unit values are:
      • KiB (for KiloBytes)
      • MiB (for MegaBytes)
      • GiB (for GigaBytes)
      • TiB (for TeraBytes)
      The values are similar to the Kubernetes memory resource units format.

Once the deployment configuration is set, the LLM is deployed via the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig]) method. This allocates resources from the cluster for the LLM’s deployment based on the DeploymentConfig settings. If the resources set in the deployment configuration are not available at deployment, an error is returned.

The following example shows setting the deployment configuration for a LLM for deployment on x86 architecture with a single GPU, then deploying a pipeline with this deployment configuration.

from wallaroo.deployment_config import DeploymentConfigBuilder

# set the deployment config with the following:
# Wallaroo Native Runtime:  0.5 cpu, 2 Gi RAM
# Wallaroo Containerized Runtime where the LLM is deployed:  2 CPUs, 1 GPU, and 40 Gi RAM
deployment_config = DeploymentConfigBuilder() \
    .cpus(0.5).memory('2Gi') \
    .sidekick_cpus(llm_model, 2) \
    .sidekick_memory(llm_model, '40Gi') \
    .sidekick_gpus(llm_model, 1) \
    .deployment_label(deployment_label) \
    .build()

llm_pipeline.deploy(deployment_config)

LLM Upload and Deploy Examples

The following example demonstrates uploading and deploying a Llama 3 8B Quantized Instruct model on x86 processors. It has the following parameters:

  • The Hugging Face LLM file packaged as a Wallaroo BYOP framework in the file llama_byop_llama3_instruct_8b.zip, with the framework set to wallaroo.framework.Framework.CUSTOM. This LLM leverages the llamacpp library.
  • The LLM model name in Wallaroo will be llama3-instruct-8b.
  • The input schema:
    • text type String.
  • The output schema:
    • generated_text type String.
  • The deployment configuration will allocate to the LLM:
    • 2 CPUs
    • 40 Gi RAM
    • 1 GPU

First we upload the model via the Wallaroo SDK.

import wallaroo
import pyarrow as pa

# connect to Wallaroo

wl = wallaroo.Client()

# set the input and output schemas
input_schema = pa.schema([
    pa.field("text", pa.string())
])

output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])

# upload the model and save the model version to the variable `model`
llm_model = wl.upload_model('llama3-instruct-8b', 
    'llama_byop_llama3_instruct_8b.zip',
    framework=wallaroo.framework.Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
)
display(llm_model)
  
Name          llama3-instruct-8b
Version       a3d8e89c-f662-49bf-bd3e-0b192f70c8b6
File Name     llama_byop_llama3_instruct_8b_new.zip
SHA           b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec
Status        ready
Image Path    proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.1.0-5190
Architecture  x86
Acceleration  none
Updated At    2024-28-May 21:00:08

Once uploaded, we create our pipeline and add the LLM as a pipeline step.

llm_pipeline = wl.build_pipeline("llama-pipeline")
llm_pipeline.add_model_step(llm_model)

We build the deployment configuration with 2 CPUs, 1 GPU, and 40 Gi RAM allocated to the LLM. Only 0.5 CPU and 2 Gi RAM are allocated to the Wallaroo Native Runtime to minimize that runtime’s resources, since it has no models in this example. Once the deployment configuration is set, the pipeline is deployed with that deployment configuration.

from wallaroo.deployment_config import DeploymentConfigBuilder

# set the deployment config with the following:
# Wallaroo Native Runtime:  0.5 cpu, 2 Gi RAM
# Wallaroo Containerized Runtime where the LLM is deployed:  2 CPUs, 1 GPU, and 40 Gi RAM
deployment_config = DeploymentConfigBuilder() \
    .cpus(0.5).memory('2Gi') \
    .sidekick_cpus(llm_model, 2) \
    .sidekick_memory(llm_model, '40Gi') \
    .sidekick_gpus(llm_model, 1) \
    .deployment_label(deployment_label) \
    .build()

llm_pipeline.deploy(deployment_config)

Once deployed, we can check the LLM’s deployment status via the wallaroo.pipeline.Pipeline.status() method.
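
llm_pipeline.status()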

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.124.6.17',
   'name': 'engine-77b97b577d-hh8pn',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llama-pipeline',
      'status': 'Running',
      'version': '57fce6fd-196c-4530-ae92-b95c923ee908'}]},
   'model_statuses': {'models': [{'name': 'llama3-instruct-8b',
      'sha': 'b92b26c9c53e32ef8d465922ff449288b8d305dd311d48f48aaef2ff3ebce2ec',
      'status': 'Running',
      'version': 'a3d8e89c-f662-49bf-bd3e-0b192f70c8b6'}]}}],
 'engine_lbs': [{'ip': '10.124.6.16',
   'name': 'engine-lb-767f54549f-gdqqd',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.124.6.19',
   'name': 'engine-sidekick-llama3-instruct-8b-234-788f9fd979-5zdxj',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

With the LLM deployed, the LLM is ready to accept inference requests through the method wallaroo.pipeline.Pipeline.infer, which accepts either a pandas DataFrame or an Apache Arrow table. The example below submits a pandas DataFrame and returns the results in the same format.

import pandas as pd

data = pd.DataFrame({'text': ['Summarize what LinkedIn is']})
result = llm_pipeline.infer(data)
result["out.generated_text"][0]

'LinkedIn is a social networking platform designed for professionals and businesses to connect, share information, and network. It allows users to create a profile showcasing their work experience, skills, education, and achievements. LinkedIn is often used for:\n\n1. Job searching: Employers can post job openings, and job seekers can search and apply for positions.\n2. Networking: Professionals can connect with colleagues, clients, and industry peers to build relationships and stay informed about industry news and trends.\n3. Personal branding: Users can showcase their skills, expertise, and achievements to establish themselves as thought leaders in their industry.\n4. Business development: Companies can use LinkedIn to promote their products or services, engage with customers, and build brand awareness.\n5. Learning and development: LinkedIn offers online courses, tutorials, and certifications to help professionals upskill and reskill.\n\nOverall, LinkedIn is a powerful tool for professionals to build their professional identity, expand their network, and advance their careers.'
