IBM Granite 8B Code Instruct Large Language Model (LLM) with GPU Tutorial
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
Tutorial Overview
The following demonstrates deploying and inferencing with an IBM Granite 8B Code Instruct Large Language Model (LLM) in Wallaroo.
This process shows how to:
- Retrieve a previously uploaded IBM Granite 8B Code Instruct LLM.
- Deploy the LLM and allocate resources for its exclusive use.
- Perform inference requests through the deployed LLM.
For access to these sample models and for a demonstration of how to use an LLM Validation Listener:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Prerequisites
- Wallaroo 2024.1 and above.
- A cluster with GPUs. See Create GPU Nodepools for instructions on adding a GPU enabled nodepool to a cluster hosting Wallaroo.
- The IBM Granite 8B Code Instruct LLM contained in the Wallaroo Arbitrary Python aka BYOP (Bring Your Own Predict) framework.
Tutorial Steps
Import libraries
The first step is to import the Python libraries required.
import json
import os
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
The next step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client()
command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client()
. For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Retrieve the LLM
The Wallaroo SDK method wallaroo.client.Client.list_models returns a list of models previously uploaded to Wallaroo. We then select the most recent version of the model and assign it to the model variable for later steps.
model = wl.list_models()[0].versions()[-1]
model
Name | byop-granite-instruct-8b-v2 |
---|---|
Version | 4d3f402d-e242-409f-8678-29c18f59a4a8 |
File Name | byop_granite_8b_code_instruct.zip |
SHA | ffa1a170b0e1628924c18a96c44f43c8afef1e535e378c2eb071a61dd282c669 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.1.0-5330 |
Architecture | x86 |
Acceleration | none |
Updated At | 2024-22-Jul 12:52:47 |
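If the workspace contains more than one uploaded model, indexing wl.list_models() by position can select the wrong entry. The following is a minimal sketch of selecting the model by its name instead; it assumes each model object returned by list_models exposes a name() accessor matching the Name field shown above.

# Sketch only: select the uploaded model by name rather than by list position.
# Assumes model objects from wl.list_models() expose a name() accessor.
granite = [m for m in wl.list_models() if m.name() == "byop-granite-instruct-8b-v2"][0]
model = granite.versions()[-1]  # most recent version of the named model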
IBM Granite 8B Code Instruct BYOP Template
Wallaroo BYOP models use Python scripts combined with the LLM artifacts to deploy the target model and perform inference requests.
Wallaroo BYOP models are composed of:
- One or more Python scripts.
- A requirements.txt file to specify the required libraries.
- Any model artifacts. For this example, the IBM Granite 8B Code Instruct LLM.
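For reference, the BYOP package for this example might be laid out as follows. This is an illustrative sketch based on the uploaded file name and the artifact path referenced in the script below; the Python script name shown here is a placeholder, not a required name.

byop_granite_8b_code_instruct.zip
    main.py                         # BYOP inference script (name is illustrative; see template below)
    requirements.txt                # Python dependencies, for example transformers
    artifacts/
        granite-8b-code-instruct/   # IBM Granite 8B Code Instruct model files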
The following template demonstrates the Python script used with the Wallaroo BYOP model to accept inference requests, forward them to the IBM Granite LLM, and return the responses to the requester.
import os
import numpy as np

from mac.inference import Inference
from mac.inference.creation import InferenceBuilder
from mac.types import InferenceData
from mac.config.inference import CustomInferenceConfig

from typing import Any, Set
from transformers import pipeline


class GraniteInference(Inference):
    @property
    def expected_model_types(self) -> Set[Any]:
        return {pipeline}

    @Inference.model.setter
    def model(self, model) -> None:
        # self._raise_error_if_model_is_wrong_type(model)
        self._model = model

    def _predict(self, input_data: InferenceData):
        # Generate a completion for each prompt submitted in the "text" input field.
        generated_texts = []
        prompts = input_data["text"].tolist()

        for prompt in prompts:
            messages = [
                {"role": "user", "content": prompt},
            ]
            generated_text = self.model(messages, max_new_tokens=1024, do_sample=True)[
                0
            ]["generated_text"][-1]["content"]
            generated_texts.append(generated_text)

        return {"generated_text": np.array(generated_texts)}


class GraniteInferenceBuilder(InferenceBuilder):
    @property
    def inference(self) -> GraniteInference:
        return GraniteInference()

    def create(self, config: CustomInferenceConfig) -> GraniteInference:
        inference = self.inference
        model = self._load_model(config.model_path)
        inference.model = model
        return inference

    def _load_model(self, model_path):
        # Load the Granite 8B Code Instruct artifacts as a Hugging Face
        # text-generation pipeline, mapping layers onto the available GPU.
        return pipeline(
            task="text-generation",
            model=os.path.join(model_path, "artifacts", "granite-8b-code-instruct"),
            device_map="auto",
        )
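The LLM in this tutorial was uploaded in a previous step. For reference, the following is a minimal sketch of how a BYOP package like this one might be uploaded through the Wallaroo SDK, with single string input (text) and output (generated_text) fields matching the _predict method above; the exact upload parameters may vary with your Wallaroo version.

# Illustrative sketch only: upload the BYOP package under the Custom (BYOP) framework.
input_schema = pa.schema([pa.field("text", pa.string())])
output_schema = pa.schema([pa.field("generated_text", pa.string())])

model = wl.upload_model(
    "byop-granite-instruct-8b-v2",
    "byop_granite_8b_code_instruct.zip",
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)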
Deploy the LLM
Deploying a model in Wallaroo takes the following steps:
- Create the deployment configuration. This sets the resources allocated from the cluster for the LLM's exclusive use. For this example, the following resources are allocated:
  - CPUs: 4
  - RAM: 2 Gi
  - GPUs: 1. Note that when GPUs are allocated for LLMs deployed in Wallaroo, the deployment_label setting is required to specify the nodepool with the GPUs.
- Assign the LLM to a Wallaroo pipeline as a model step, then deploy the pipeline with the deployment configuration.
Once the deployment is complete, the LLM is ready to accept inference requests.
# create the deployment configuration
deployment_config = DeploymentConfigBuilder() \
.cpus(1).memory('2Gi') \
.sidekick_cpus(model, 4) \
.sidekick_memory(model, '2Gi') \
.sidekick_gpus(model, 1) \
.deployment_label("wallaroo.ai/accelerator:a10040") \
.build()
# create the pipeline and add the LLM as a model step
pipeline = wl.build_pipeline("granite-pipe-v2")
pipeline.add_model_step(model)
# deploy the LLM with the deployment configuration
pipeline.deploy(deployment_config=deployment_config)
We verify the deployment status. Once the status is Running, the LLM is ready for inference requests.
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.240.5.6',
'name': 'engine-7bd8d4664d-69qfx',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'granite-pipe-v2',
'status': 'Running',
'version': 'c27736f6-0ee2-4ca0-9982-9845d2d5f756'}]},
'model_statuses': {'models': [{'name': 'byop-granite-instruct-8b-v2',
'sha': 'ffa1a170b0e1628924c18a96c44f43c8afef1e535e378c2eb071a61dd282c669',
'status': 'Running',
'version': '4d3f402d-e242-409f-8678-29c18f59a4a8'}]}}],
'engine_lbs': [{'ip': '10.240.5.7',
'name': 'engine-lb-776bbf49b9-rb5mt',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.240.5.8',
'name': 'engine-sidekick-byop-granite-instruct-8b-v2-99-55d95d96f5-gjml9',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
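For longer model load times, the status can be polled until the deployment reports Running. A minimal sketch, using only the pipeline.status() output shown above; the timeout and polling interval are arbitrary choices:

import time

# Poll the deployment status until the pipeline reports Running (up to ~10 minutes).
for _ in range(60):
    if pipeline.status()["status"] == "Running":
        break
    time.sleep(10)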
Submit Inference Request to Deployed LLM
Inference Requests to LLMs deployed in Wallaroo accept the following inputs:
- pandas DataFrame
- Apache Arrow Tables
Inference requests performed through the Wallaroo SDK return results in the same format they were submitted in; if the request is a pandas DataFrame, the response is returned as a pandas DataFrame.
For this example, the inference request is submitted as a pandas DataFrame, with the result returned in the same format.
data = pd.DataFrame({"text": ["Write a code to find the maximum value in a list of numbers."]})
result = pipeline.infer(data, timeout=10000)
result
  | time | in.text | out.generated_text | anomaly.count |
---|---|---|---|---|
0 | 2024-07-22 14:21:36.748 | Write a code to find the maximum value in a li... | You can use the `max()` function in Python to ... | 0 |
We isolate the generated_text output field to show the inference result generated by the IBM Granite LLM.
result["out.generated_text"].values[0]
'You can use the `max()` function in Python to find the maximum value in a given list. The `max()` function takes an iterable (such as a list) and returns the largest element.\n\nHere is a Python code snippet that finds the maximum value in a list of numbers:\n\n```python\ndef find_max(numbers):\n max_value = max(numbers)\n return max_value\n```\n\nTo use the `find_max` function, you need to pass a list of numbers as an argument. The function will return the maximum value in the list.'
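Inference requests can also be submitted as an Apache Arrow table, with the result returned as an Arrow table. A minimal sketch of the same request in that format, reusing the text input field from above:

# Sketch only: the same prompt submitted as an Apache Arrow table.
arrow_data = pa.table({"text": ["Write a code to find the maximum value in a list of numbers."]})
arrow_result = pipeline.infer(arrow_data, timeout=10000)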
When complete, we undeploy the LLM and return the resources to the cluster.
pipeline.undeploy()
Waiting for undeployment - this will take up to 45s .................................... ok
name | granite-pipe-v1 |
---|---|
created | 2024-07-22 12:09:21.555827+00:00 |
last_updated | 2024-07-22 12:09:21.605179+00:00 |
deployed | False |
arch | x86 |
accel | none |
tags | |
versions | 4b2d7802-e930-437f-a29e-05ced44eddd7, 798784e0-85d6-461a-b005-177340b48f5a |
steps | byop-granite-instruct-8b-v1 |
published | False |
This sample notebook is available through the Wallaroo Tutorials repository.