Inference with Acceleration Libraries: Deploy on CUDA

How to deploy models on CUDA.

The following shows how to upload, deploy, publish, and edge deploy a model with Wallaroo on x86 with NVIDIA CUDA acceleration. The example uses a Hugging Face summarization model.

The first step is to upload the model, setting the AI accelerator.

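import pyarrow as pa
import wallaroo
from wallaroo.framework import Framework

# Connect to the Wallaroo instance (assumes the default connection settings)
wl = wallaroo.Client()

# Input fields expected by the Hugging Face summarization model
input_schema = pa.schema([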
    pa.field('inputs', pa.string()),
    pa.field('return_text', pa.bool_()),
    pa.field('return_tensors', pa.bool_()),
    pa.field('clean_up_tokenization_spaces', pa.bool_())
])

output_schema = pa.schema([
    pa.field('summary_text', pa.string()),
])

model = wl.upload_model(name='hf-summarization', 
                        path='./models/hf-summarisation-bart-large-samsun.zip', 
                        framework=Framework.HUGGING_FACE_SUMMARIZATION, 
                        input_schema=input_schema, 
                        output_schema=output_schema,
                        accel=wallaroo.engine_config.Acceleration.CUDA)
display(model)
Name: hf-summarization
Version: 47743b5f-c88a-4150-a37f-9ad591eb4ee3
File Name: hf-summarisation-bart-large-samsun.zip
SHA: ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.2.0-main-4921
Architecture: x86
Acceleration: cuda
Updated At: 2024-03-Apr 22:13:40

With the model uploaded, we deploy it by:

  • Adding the model to a pipeline as a pipeline step.
  • Setting the deployment configuration, which defines the resources allocated from the cluster. For this example, the engine is allocated 1 CPU and 1 Gi RAM, and the model is allocated 4 CPUs, 8 Gi RAM, and 1 GPU. Note that we do not specify the accelerator type or processor architecture here, since both are set at the model level. Because the model is deployed with a GPU, the deployment_label specifies the nodepool to use.
  • Deploying the model. Once deployed, the model accepts inference requests until it is undeployed.

from wallaroo.deployment_config import DeploymentConfigBuilder

pipeline = wl.build_pipeline('hf-summarization-pipeline')

pipeline.add_model_step(model)

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('1Gi') \
    .sidekick_gpus(model, 1) \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '8Gi') \
    .deployment_label('wallaroo.ai/accelerator: a100') \
    .build()

pipeline.deploy(deployment_config = deployment_config)

Publishing the model stores a copy of the model and the inference engine in an OCI (Open Container Initiative) Registry that is set by the Wallaroo platform operations administrator. Once published, it is ready for deployment in any edge or multi-cloud environment with the same AI Accelerator and Architecture settings.

A template of the docker run command is included with the publish return.

We now publish the pipeline. Note that the Engine Config inherited the acceleration from the model.

# publish the pipeline with the deployment configuration defined above
publish = pipeline.publish(deployment_config=deployment_config)
display(publish)
Waiting for pipeline publish... It may take up to 600 sec.
Pipeline is publishing....... Published.
ID: 1
Pipeline Name: hf-summarization-pipeline
Pipeline Version: 6d453276-a4cf-4b01-90d7-78e9da1dd72a
Status: Published
Engine URL: ghcr.io/wallaroolabs/doc-samples/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2024.2.0-main-4921
Pipeline URL: ghcr.io/wallaroolabs/doc-samples/pipelines/hf-summarization-pipeline:6d453276-a4cf-4b01-90d7-78e9da1dd72a
Helm Chart URL: oci://ghcr.io/wallaroolabs/doc-samples/charts/hf-summarization-pipeline
Helm Chart Reference: ghcr.io/wallaroolabs/doc-samples/charts@sha256:a9406689f7429c16758447780c860ee41c78dc674280754eb2b377da1a9efbf4
Helm Chart Version: 0.0.1-6d453276-a4cf-4b01-90d7-78e9da1dd72a
Engine Config: {'engine': {'resources': {'limits': {'cpu': 1.0, 'memory': '512Mi'}, 'requests': {'cpu': 1.0, 'memory': '512Mi'}, 'accel': 'cuda', 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none'}, 'images': {}}}
User Images: []
Created By: john.hansarick@wallaroo.ai
Created At: 2024-04-17 19:49:32.922418+00:00
Updated At: 2024-04-17 19:49:32.922418+00:00
Replaces:
Docker Run Command:
docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=ghcr.io/wallaroolabs/doc-samples/pipelines/hf-summarization-pipeline:6d453276-a4cf-4b01-90d7-78e9da1dd72a \
    -e CONFIG_CPUS=1 ghcr.io/wallaroolabs/doc-samples/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2024.2.0-main-4921

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Helm Install Command:
helm install --atomic $HELM_INSTALL_NAME \
    oci://ghcr.io/wallaroolabs/doc-samples/charts/hf-summarization-pipeline \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version 0.0.1-6d453276-a4cf-4b01-90d7-78e9da1dd72a \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables.
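
The publish output above reports that the Engine Config carries the acceleration set on the model. As a minimal sketch, the following reads the accel and arch values out of that dictionary. The dictionary is pasted from the output above; the helper name accel_and_arch is hypothetical, and how the dictionary is retrieved from the publish object is not covered here.

```python
# Engine Config copied verbatim from the publish output above.
engine_config = {
    'engine': {
        'resources': {
            'limits': {'cpu': 1.0, 'memory': '512Mi'},
            'requests': {'cpu': 1.0, 'memory': '512Mi'},
            'accel': 'cuda',
            'arch': 'x86',
            'gpu': False,
        }
    },
    'engineAux': {'autoscale': {'type': 'none'}, 'images': {}},
}


def accel_and_arch(config):
    """Hypothetical helper: return (accel, arch) from an engine config dict.

    Missing keys yield None instead of raising, so a partial config
    can still be inspected.
    """
    resources = config.get('engine', {}).get('resources', {})
    return resources.get('accel'), resources.get('arch')


accel, arch = accel_and_arch(engine_config)
print(f"accel={accel}, arch={arch}")
```

A check like this can help confirm that the target edge device matches the published accelerator and architecture before the container is started.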

Once published, the model is deployed on edge or multi-cloud environments through the docker run template. Before deploying, set the following environment variables:

  • $EDGE_PORT: The network port used to submit inference requests to the deployed model.
  • $OCI_USERNAME: The user name or identifier to authenticate to the OCI (Open Container Initiative) Registry where the model was published.
  • $OCI_PASSWORD: The password or token to authenticate to the OCI (Open Container Initiative) Registry where the model was published.

docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=ghcr.io/wallaroolabs/doc-samples/pipelines/hf-summarization-pipeline:6d453276-a4cf-4b01-90d7-78e9da1dd72a \
    -e CONFIG_CPUS=1 ghcr.io/wallaroolabs/doc-samples/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2024.2.0-main-4921
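
Since the docker run command depends on three environment variables, a pre-flight check can catch an unset value before the container starts. The sketch below is one possible way to do that in Python. The helper names missing_vars and docker_run_args are hypothetical, and the pipeline and engine URLs are copied from the command above.

```python
import os

# Variables the docker run template expects, as listed above.
REQUIRED_VARS = ("EDGE_PORT", "OCI_USERNAME", "OCI_PASSWORD")

# Values copied from the docker run template in the publish output.
PIPELINE_URL = (
    "ghcr.io/wallaroolabs/doc-samples/pipelines/"
    "hf-summarization-pipeline:6d453276-a4cf-4b01-90d7-78e9da1dd72a"
)
ENGINE_IMAGE = (
    "ghcr.io/wallaroolabs/doc-samples/engines/proxy/wallaroo/"
    "ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2024.2.0-main-4921"
)


def missing_vars(env):
    """Hypothetical helper: list required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def docker_run_args(env):
    """Hypothetical helper: build the docker run argument list from env.

    Raises ValueError when a required variable is missing.
    """
    missing = missing_vars(env)
    if missing:
        raise ValueError("unset environment variables: " + ", ".join(missing))
    return [
        "docker", "run",
        "-p", f"{env['EDGE_PORT']}:8080",
        "-e", f"OCI_USERNAME={env['OCI_USERNAME']}",
        "-e", f"OCI_PASSWORD={env['OCI_PASSWORD']}",
        "-e", f"PIPELINE_URL={PIPELINE_URL}",
        "-e", "CONFIG_CPUS=1",
        ENGINE_IMAGE,
    ]


if __name__ == "__main__":
    problems = missing_vars(os.environ)
    if problems:
        print("Set these variables first:", ", ".join(problems))
    else:
        # The argument list contains the registry password, so it is not printed.
        print("All required variables are set.")
```

The list returned by docker_run_args can be passed to subprocess.run, which avoids shell quoting issues with the credentials.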

Once deployed, the model is ready to accept inference requests through the specified $EDGE_PORT.
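
As a rough illustration of submitting an inference request to the edge deployment, the sketch below builds a JSON payload whose fields match the input schema defined at upload and posts it to the exposed port. Several details are assumptions not confirmed by this page: the /infer endpoint path, the pandas-records Content-Type header, the flag values in the payload, and the helper names build_records, build_request, and infer. Check the endpoint details for your Wallaroo version before relying on it.

```python
import json
import urllib.request


def build_records(texts):
    """Build one record per text, using the field names from the input schema.

    The boolean flag values are illustrative assumptions.
    """
    return [
        {
            "inputs": text,
            "return_text": True,
            "return_tensors": False,
            "clean_up_tokenization_spaces": False,
        }
        for text in texts
    ]


def build_request(host, port, records, path="/infer"):
    """Create a POST request for the edge deployment.

    ASSUMPTION: the endpoint path and Content-Type are not taken from this
    page and may differ between Wallaroo versions.
    """
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}{path}",
        data=body,
        headers={"Content-Type": "application/json; format=pandas-records"},
        method="POST",
    )


def infer(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, a call would look like infer(build_request("localhost", 8080, build_records(["text to summarize"]))), where 8080 stands in for the value of $EDGE_PORT.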