Llamacpp Deploy on IBM Power10 Tutorial

How to deploy and publish Llamacpp LLMs on the IBM Power10 Architecture

This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

The following demonstrates deploying a Llama V3 8B quantized with llama-cpp encapsulated in the Wallaroo Custom Model aka BYOP Framework on IBM Power10 architecture.

For access to these sample models and for a demonstration, contact your Wallaroo representative.

Deploying on IBM Power10 is made easy with Wallaroo. Models are defined with an architecture at the upload stage, allowing the deployment of the same model on different architectures.

Tutorial Overview

This tutorial demonstrates using Wallaroo to:

  • Upload a model and define the deployment architecture as Power10.
  • Deploy the model on a node with the Power10 chips and perform a sample inference.
  • Publish the model to an OCI registry for deployment on edge and multi-cloud environments with Power10 chips.

Requirements

This tutorial requires the following:

  • Wallaroo version 2024.4 and above.
  • At least one Power10 node deployed in the cluster.
  • Llama V3 8B quantized with llama-cpp encapsulated in the Wallaroo Custom Model aka BYOP Framework. This is available through a Wallaroo representative.

Tutorial Steps

Import libraries

The first step is to import the libraries required.

import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
from wallaroo.engine_config import Acceleration
from wallaroo.dynamic_batching_config import DynamicBatchingConfig

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Upload Model for Power10 Deployment

The model architecture is specified during the model upload process; for this tutorial we set it to Power10. Before uploading, we define the model's input and output schemas:

input_schema = pa.schema([
    pa.field("text", pa.string())
])

output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])

Upload the Model and Set Architecture to Power10

We upload the model, and specify the architecture as wallaroo.engine_config.Architecture.Power10. Once the upload is complete, it will be ready for deployment.

model = wl.upload_model('llama-cpp-sdk-power2', 
    'byop_llamacpp_power.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
    arch=Architecture.Power10
)
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime...
Model is attempting loading to a container runtime.....................................................................................................

Deployment on Power10

To deploy the model, we do the following:

  • Specify the deployment configuration. This defines the resources allocated from the cluster for the model’s use. Notice we do not specify the architecture in this step; it is inherited from the model. For this deployment, the model is allocated the following resources from the Power10 node:
    • CPUs: 4
    • Memory: 10Gi
  • Build the Pipeline and model steps: The Wallaroo pipeline defines the models and order that receive inference data.
  • Deploy the pipeline with the deployment configuration: This step deploys the model with the specified resources.

Once complete, the model is ready for inference requests.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '10Gi') \
    .build()
pipeline = wl.build_pipeline("llamacpp-pipeyns-power")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)
pipeline = wl.get_pipeline("llamacpp-pipeyns-power")
pipeline
name: llamacpp-pipeyns-power
created: 2024-12-24 19:47:42.761843+00:00
last_updated: 2024-12-24 19:47:43.105666+00:00
deployed: False
workspace_id: 22
workspace_name: younes.amar@wallaroo.ai - Default Workspace
arch: power10
accel: none
tags:
versions: f19ae86a-b7de-4922-9dc8-c52af0de3f21, a1dee100-a225-4633-9541-7daed74b541c
steps: llama-cpp-sdk-power2
published: False

Inference

Once the LLM is deployed, we’ll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.

For this example, we’ll create a pandas DataFrame with a text query and submit that for our inference request, then display the output.

data = pd.DataFrame({'text': ['Describe what roland garros is']})
result = pipeline.infer(data, timeout=10000)
result["out.generated_text"][0]
' Roland-Garros, also known as the French Open, is a tennis tournament held annually in Paris, France. It is one of the four Grand Slam tennis tournaments, the most prestigious and important events in tennis. The tournament is played on clay courts, which are slower than grass or hard courts, and it is typically held in late May and early June. The tournament was founded in 1928 and is named after the French tennis player Richard "Dick" Roland-Garros, who was a prominent player in the 1920s and 1930s. The tournament has been played at the Stade Roland-Garros in Paris since 1928.'

Undeploy the Model

With the inference example complete, we undeploy the model and return the resources back to the cluster.

pipeline.undeploy()

Publish the Model

Publishing the model takes the model, pipeline, and Wallaroo engine and puts them into an OCI compliant registry for later deployment on Edge and multi-cloud environments.

Publishing the pipeline uses the wallaroo.pipeline.Pipeline.publish() method. This requires that the Wallaroo Ops instance have Edge Registry Services enabled.

When publishing, we specify the pipeline deployment configuration through the wallaroo.DeploymentConfigBuilder. For our example, we do not specify the architecture; the architecture is inherited from the model.

The following publishes the pipeline to the OCI registry and displays the container details. For more information, see Wallaroo SDK Essentials Guide: Pipeline Edge Publication.

pipeline.publish(deployment_config=deployment_config)
Waiting for pipeline publish... It may take up to 600 sec.
Pipeline is publishing..................... Published.
ID: 1
Pipeline Name: llamacpp-pipeyns-power
Pipeline Version: 4e5f120a-89ad-4233-9aac-cc25158d232e
Status: Published
Engine URL: sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-ppc64le:v2024.4.0-5866
Pipeline URL: sample.registry.example.com/uat/pipelines/llamacpp-pipeyns-power:4e5f120a-89ad-4233-9aac-cc25158d232e
Helm Chart URL: oci://sample.registry.example.com/uat/charts/llamacpp-pipeyns-power
Helm Chart Reference: sample.registry.example.com/uat/charts@sha256:b99e8b7e75bd714dd4820a18e7d6850496bd34b25a9e43e3aac19db9c4ae1a04
Helm Chart Version: 0.0.1-4e5f120a-89ad-4233-9aac-cc25158d232e
Engine Config: {'engine': {'resources': {'limits': {'cpu': 1.0, 'memory': '2Gi'}, 'requests': {'cpu': 1.0, 'memory': '2Gi'}, 'accel': 'none', 'arch': 'power10', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none'}, 'images': {'llama-cpp-sdk-power2-62': {'resources': {'limits': {'cpu': 4.0, 'memory': '10Gi'}, 'requests': {'cpu': 4.0, 'memory': '10Gi'}, 'accel': 'none', 'arch': 'power10', 'gpu': False}}}}}
User Images: []
Created By: younes.amar@wallaroo.ai
Created At: 2024-12-24 21:25:18.860069+00:00
Updated At: 2024-12-24 21:25:18.860069+00:00
Replaces:
Docker Run Command
docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=sample.registry.example.com/uat/pipelines/llamacpp-pipeyns-power:4e5f120a-89ad-4233-9aac-cc25158d232e \
    -e CONFIG_CPUS=1.0 --cpus=5.0 --memory=12g \
    sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-ppc64le:v2024.4.0-5866

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
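For example, the variables might be exported as follows before running the docker command above (the values here are placeholders; substitute your own edge port and OCI registry credentials):

```shell
# Placeholder values -- replace with your own edge port and registry credentials
export EDGE_PORT=8080
export OCI_USERNAME=my-registry-user
export OCI_PASSWORD=my-registry-token
```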
Helm Install Command
helm install --atomic $HELM_INSTALL_NAME \
    oci://sample.registry.example.com/uat/charts/llamacpp-pipeyns-power \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version 0.0.1-4e5f120a-89ad-4233-9aac-cc25158d232e \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables.

Edge Deployment

The publish details include instructions on deploying the model via the Docker Run and Helm Install commands with the defined deployment configuration on the Power10 architecture.

Contact Us

For access to these sample models and for a demonstration, contact your Wallaroo representative.