Llamacpp Deploy on IBM Power10 Tutorial
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
The following demonstrates deploying a Llama V3 8B quantized with llama-cpp encapsulated in the Wallaroo Custom Model aka BYOP Framework on IBM Power10 architecture.
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Deploying on IBM Power10 is made easy with Wallaroo. Models are defined with an architecture at the upload stage, allowing the deployment of the same model on different architectures.
Tutorial Overview
This tutorial demonstrates using Wallaroo to:
- Upload a model and define the deployment architecture as Power10.
- Deploy the model on a node with Power10 chips and perform a sample inference.
- Publish the model to an OCI registry for deployment on edge and multi-cloud environments with Power10 chips.
Requirements
This tutorial requires the following:
- Wallaroo version 2024.4 or above.
- At least one Power10 node deployed in the cluster.
- Llama V3 8B quantized with llama-cpp encapsulated in the Wallaroo Custom Model aka BYOP Framework. This is available through a Wallaroo representative.
Tutorial Steps
Import libraries
The first step is to import the required libraries.
import json
import os
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
from wallaroo.engine_config import Acceleration
from wallaroo.dynamic_batching_config import DynamicBatchingConfig
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection in a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Upload Model for Power10 Deployment
The model architecture is specified during the model upload process. By default this is set to x86; for this tutorial we set it to Power10. First, we define the model's input and output schemas in Apache PyArrow format.
input_schema = pa.schema([
pa.field("text", pa.string())
])
output_schema = pa.schema([
pa.field("generated_text", pa.string())
])
Upload the Model and Set Architecture to Power10
We upload the model and specify the architecture as wallaroo.engine_config.Architecture.Power10. Once the upload is complete, the model is ready for deployment.
model = wl.upload_model('llama-cpp-sdk-power2',
'byop_llamacpp_power.zip',
framework=Framework.CUSTOM,
input_schema=input_schema,
output_schema=output_schema,
arch=Architecture.Power10
)
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime...
Model is attempting loading to a container runtime.....................................................................................................
Deployment on Power10
To deploy the model, we do the following:
- Specify the deployment configuration. This defines the resources allocated from the cluster for the model's use. Notice we do not specify the architecture in this step; it is inherited from the model. For this deployment, the model is allocated the following from the Power10 node:
  - CPUs: 4
  - Memory: 10Gi
- Build the Pipeline and model steps: The Wallaroo pipeline defines the models and order that receive inference data.
- Deploy the pipeline with the deployment configuration: This step deploys the model with the specified resources.
Once complete, the model is ready for inference requests.
deployment_config = DeploymentConfigBuilder() \
.cpus(1).memory('2Gi') \
.sidekick_cpus(model, 4) \
.sidekick_memory(model, '10Gi') \
.build()
pipeline = wl.build_pipeline("llamacpp-pipeyns-power")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)
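Deployment can take a little while, so before sending inference requests it is worth waiting until the engine reports a Running state. A minimal polling sketch, assuming pipeline.status() returns a dict with a top-level 'status' key (check your SDK version's return shape):

```python
import time

def wait_for_running(get_status, timeout_s=300, poll_s=5):
    """Poll a status callable until it reports 'Running'.

    get_status is assumed to return a dict with a top-level 'status'
    key, e.g. the dict returned by pipeline.status() (an assumption
    about the return shape).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if get_status().get("status") == "Running":
            return True
        time.sleep(poll_s)
    return False

# Against the live pipeline this would be:
# assert wait_for_running(pipeline.status)
```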
pipeline = wl.get_pipeline("llamacpp-pipeyns-power")
pipeline
name | llamacpp-pipeyns-power |
---|---|
created | 2024-12-24 19:47:42.761843+00:00 |
last_updated | 2024-12-24 19:47:43.105666+00:00 |
deployed | False |
workspace_id | 22 |
workspace_name | younes.amar@wallaroo.ai - Default Workspace |
arch | power10 |
accel | none |
tags | |
versions | f19ae86a-b7de-4922-9dc8-c52af0de3f21, a1dee100-a225-4633-9541-7daed74b541c |
steps | llama-cpp-sdk-power2 |
published | False |
Inference
Once the LLM is deployed, we'll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.
For this example, we’ll create a pandas DataFrame with a text query and submit that for our inference request, then display the output.
data = pd.DataFrame({'text': ['Describe what roland garros is']})
result = pipeline.infer(data, timeout=10000)
result["out.generated_text"][0]
' Roland-Garros, also known as the French Open, is a tennis tournament held annually in Paris, France. It is one of the four Grand Slam tennis tournaments, the most prestigious and important events in tennis. The tournament is played on clay courts, which are slower than grass or hard courts, and it is typically held in late May and early June. The tournament was founded in 1928 and is named after the French tennis player Richard "Dick" Roland-Garros, who was a prominent player in the 1920s and 1930s. The tournament has been played at the Stade Roland-Garros in Paris since 1928.'
Undeploy the Model
With the inference example complete, we undeploy the model and return the resources back to the cluster.
pipeline.undeploy()
Publish the Model
Publishing the model takes the model, pipeline, and Wallaroo engine and puts them into an OCI compliant registry for later deployment on Edge and multi-cloud environments.
Publishing the pipeline uses the wallaroo.pipeline.Pipeline.publish() method. This requires that the Wallaroo Ops instance have Edge Registry Services enabled.
When publishing, we specify the pipeline deployment configuration through wallaroo.DeploymentConfigBuilder. For our example, we do not specify the architecture; it is inherited from the model.
The following publishes the pipeline to the OCI registry and displays the container details. For more information, see Wallaroo SDK Essentials Guide: Pipeline Edge Publication.
pipeline.publish(deployment_config=deployment_config)
Waiting for pipeline publish... It may take up to 600 sec.
Pipeline is publishing..................... Published.
ID | 1 |
---|---|
Pipeline Name | llamacpp-pipeyns-power |
Pipeline Version | 4e5f120a-89ad-4233-9aac-cc25158d232e |
Status | Published |
Engine URL | sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-ppc64le:v2024.4.0-5866 |
Pipeline URL | sample.registry.example.com/uat/pipelines/llamacpp-pipeyns-power:4e5f120a-89ad-4233-9aac-cc25158d232e |
Helm Chart URL | oci://sample.registry.example.com/uat/charts/llamacpp-pipeyns-power |
Helm Chart Reference | sample.registry.example.com/uat/charts@sha256:b99e8b7e75bd714dd4820a18e7d6850496bd34b25a9e43e3aac19db9c4ae1a04 |
Helm Chart Version | 0.0.1-4e5f120a-89ad-4233-9aac-cc25158d232e |
Engine Config | {'engine': {'resources': {'limits': {'cpu': 1.0, 'memory': '2Gi'}, 'requests': {'cpu': 1.0, 'memory': '2Gi'}, 'accel': 'none', 'arch': 'power10', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none'}, 'images': {'llama-cpp-sdk-power2-62': {'resources': {'limits': {'cpu': 4.0, 'memory': '10Gi'}, 'requests': {'cpu': 4.0, 'memory': '10Gi'}, 'accel': 'none', 'arch': 'power10', 'gpu': False}}}}} |
User Images | [] |
Created By | younes.amar@wallaroo.ai |
Created At | 2024-12-24 21:25:18.860069+00:00 |
Updated At | 2024-12-24 21:25:18.860069+00:00 |
Replaces | |
Docker Run Command | Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables. |
Helm Install Command | Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables. |
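The Engine Config shown in the publish output can also be inspected programmatically to confirm that both the engine and the model's container inherit the Power10 architecture; a small sketch using that dict (abridged here to the fields checked):

```python
# Engine Config from the publish output above, abridged to the fields we check
engine_config = {
    "engine": {"resources": {"limits": {"cpu": 1.0, "memory": "2Gi"},
                             "accel": "none", "arch": "power10", "gpu": False}},
    "engineAux": {"images": {"llama-cpp-sdk-power2-62": {
        "resources": {"limits": {"cpu": 4.0, "memory": "10Gi"},
                      "accel": "none", "arch": "power10", "gpu": False}}}},
}

# The engine and every model sidekick image should report power10
assert engine_config["engine"]["resources"]["arch"] == "power10"
for image in engine_config["engineAux"]["images"].values():
    assert image["resources"]["arch"] == "power10"
```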
Edge Deployment
The publish details include instructions for deploying the model via the Docker Run and Helm Install commands with the defined deployment configuration on the Power10 architecture.
Contact Us
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today