IBM Granite 8B Code Instruct Large Language Model (LLM) with GPU Tutorial


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.


Tutorial Overview

The following demonstrates deploying and inferencing with an IBM Granite 8B Code Instruct Large Language Model (LLM) in Wallaroo.

This process shows how to:

  • Retrieve a previously uploaded IBM Granite 8B Code Instruct LLM.
  • Deploy the LLM and allocate resources for its exclusive use.
  • Perform inference requests through the deployed LLM.

For access to these sample models and for a demonstration of how to use an LLM Validation Listener, contact your Wallaroo support representative.

Prerequisites

Tutorial Steps

Import libraries

The first step is to import the Python libraries required.

import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()
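
When connecting from outside the Wallaroo cluster, the client is created with the instance's API endpoint and SSO authentication. The following is a minimal sketch, assuming a hypothetical Wallaroo DNS name wallaroo.example.com; replace it with your own instance's DNS name.

# connect from outside the cluster using the instance's DNS name and SSO login
# (wallaroo.example.com is a placeholder value)
wl = wallaroo.Client(api_endpoint="https://wallaroo.example.com", auth_type="sso")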

Retrieve the LLM

The Wallaroo SDK method wallaroo.client.Client.list_models returns a list of models previously uploaded to Wallaroo. We then select the most recent version of the model and assign it to the model variable for the later steps.

model = wl.list_models()[0].versions()[-1]
model
Name           byop-granite-instruct-8b-v2
Version        4d3f402d-e242-409f-8678-29c18f59a4a8
File Name      byop_granite_8b_code_instruct.zip
SHA            ffa1a170b0e1628924c18a96c44f43c8afef1e535e378c2eb071a61dd282c669
Status         ready
Image Path     proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.1.0-5330
Architecture   x86
Acceleration   none
Updated At     2024-22-Jul 12:52:47
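
If more than one model is registered in the workspace, taking the first entry from list_models may not return the Granite model. As an alternative, the model can be selected by name; the following is a minimal sketch, assuming the Model objects returned by list_models expose a name() accessor.

# select the uploaded Granite model by name rather than by list position
granite = [m for m in wl.list_models() if m.name() == "byop-granite-instruct-8b-v2"][0]
model = granite.versions()[-1]  # most recent model version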

IBM Granite 8B Code Instruct BYOP Template

Wallaroo BYOP models use Python scripts combined with the LLM artifacts to deploy the target model and perform inference requests.

Wallaroo BYOP models are composed of a Python script that defines the inference behavior, the LLM artifacts packaged with the model, and a requirements.txt file that lists the Python libraries the script requires.

The following template demonstrates the Python script used with the Wallaroo BYOP model to accept inference requests, forward them to the IBM Granite LLM, and return the responses to the requester.

import os
import numpy as np

from mac.inference import Inference
from mac.inference.creation import InferenceBuilder
from mac.types import InferenceData
from mac.config.inference import CustomInferenceConfig

from typing import Any, Set
from transformers import pipeline

class GraniteInference(Inference):
    @property
    def expected_model_types(self) -> Set[Any]:
        return {pipeline}

    @Inference.model.setter
    def model(self, model) -> None:
        # self._raise_error_if_model_is_wrong_type(model)
        self._model = model

    def _predict(self, input_data: InferenceData):
        # collect a generated response for each incoming prompt
        generated_texts = []
        prompts = input_data["text"].tolist()

        for prompt in prompts:
            messages = [
                {"role": "user", "content": prompt},
            ]

            # run text generation and keep the assistant reply from the chat output
            generated_text = self.model(messages, max_new_tokens=1024, do_sample=True)[
                0
            ]["generated_text"][-1]["content"]
            generated_texts.append(generated_text)

        return {"generated_text": np.array(generated_texts)}

class GraniteInferenceBuilder(InferenceBuilder):
    @property
    def inference(self) -> GraniteInference:
        return GraniteInference()

    def create(self, config: CustomInferenceConfig) -> GraniteInference:
        inference = self.inference
        model = self._load_model(config.model_path)
        inference.model = model

        return inference

    def _load_model(self, model_path):
        # load the Granite model from the BYOP artifacts directory as a
        # Hugging Face text-generation pipeline, mapping layers across available GPUs
        return pipeline(
            task="text-generation",
            model=os.path.join(model_path, "artifacts", "granite-8b-code-instruct"),
            device_map="auto",
        )
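
For reference, a BYOP package like this one is uploaded through the Wallaroo SDK upload_model method with the Custom framework and input and output schemas. The following is a minimal sketch rather than a step in this tutorial; the file name matches the model listing above, and the schemas are assumptions based on the text and generated_text fields used by the template.

# input and output schemas matching the BYOP template's "text" and "generated_text" fields
input_schema = pa.schema([pa.field('text', pa.string())])
output_schema = pa.schema([pa.field('generated_text', pa.string())])

# upload the BYOP package with the Custom framework
model = wl.upload_model(
    "byop-granite-instruct-8b-v2",
    "./byop_granite_8b_code_instruct.zip",
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)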

Deploy the LLM

Deploying a model in Wallaroo takes the following steps:

  • Create the deployment configuration. This sets the number of resources allocated from the cluster for the LLM's use. For this example, the following resources are allocated:
    • CPUs: 4
    • RAM: 2 Gi
    • GPUs: 1. Note that when GPUs are allocated for LLMs deployed in Wallaroo, the deployment_label setting is required to specify the nodepool with the GPUs.
  • Assign the LLM to a Wallaroo pipeline as a model step, then deploy the pipeline with the deployment configuration.

Once the deployment is complete, the LLM is ready to accept inference requests.

# create the deployment configuration
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '2Gi') \
    .sidekick_gpus(model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a10040") \
    .build()
# create the pipeline and add the LLM as a model step
pipeline = wl.build_pipeline("granite-pipe-v2")
pipeline.add_model_step(model)

# deploy the LLM with the deployment configuration
pipeline.deploy(deployment_config=deployment_config)

We verify the deployment status; once the status is Running, the LLM is ready for inference requests.

pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.240.5.6',
   'name': 'engine-7bd8d4664d-69qfx',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'granite-pipe-v2',
      'status': 'Running',
      'version': 'c27736f6-0ee2-4ca0-9982-9845d2d5f756'}]},
   'model_statuses': {'models': [{'name': 'byop-granite-instruct-8b-v2',
      'sha': 'ffa1a170b0e1628924c18a96c44f43c8afef1e535e378c2eb071a61dd282c669',
      'status': 'Running',
      'version': '4d3f402d-e242-409f-8678-29c18f59a4a8'}]}}],
 'engine_lbs': [{'ip': '10.240.5.7',
   'name': 'engine-lb-776bbf49b9-rb5mt',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.240.5.8',
   'name': 'engine-sidekick-byop-granite-instruct-8b-v2-99-55d95d96f5-gjml9',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
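
If the deployment is still spinning up GPU resources, the status can be polled until it reports Running. The following is a minimal sketch using the same pipeline.status() call shown above.

import time

# poll the deployment status until the engine reports Running
while pipeline.status()['status'] != 'Running':
    time.sleep(10)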

Submit Inference Request to Deployed LLM

Inference Requests to LLMs deployed in Wallaroo accept the following inputs:

  • pandas DataFrame
  • Apache Arrow Tables

Inference requests performed through the Wallaroo SDK return results in the same format they were submitted in; if the request is a pandas DataFrame, the response is returned as a pandas DataFrame.

For this example, the inference request is submitted as a pandas DataFrame, with the result returned in the same format.

data = pd.DataFrame({"text": ["Write a code to find the maximum value in a list of numbers."]})
result = pipeline.infer(data, timeout=10000)
result
   time                     in.text                                             out.generated_text                                  anomaly.count
0  2024-07-22 14:21:36.748  Write a code to find the maximum value in a li...  You can use the `max()` function in Python to ...  0

We isolate the generated_text output field to show the inference result generated by the IBM Granite LLM.

result["out.generated_text"].values[0]
'You can use the `max()` function in Python to find the maximum value in a given list. The `max()` function takes an iterable (such as a list) and returns the largest element.\n\nHere is a Python code snippet that finds the maximum value in a list of numbers:\n\n```python\ndef find_max(numbers):\n    max_value = max(numbers)\n    return max_value\n```\n\nTo use the `find_max` function, you need to pass a list of numbers as an argument. The function will return the maximum value in the list.'
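
The same request can also be submitted as an Apache Arrow table. The following is a minimal sketch, assuming the output field is exposed as the out.generated_text column of the returned Arrow table, as it is in the DataFrame result above.

# submit the same prompt as an Apache Arrow table; results come back as an Arrow table
arrow_data = pa.table({"text": ["Write a code to find the maximum value in a list of numbers."]})
arrow_result = pipeline.infer(arrow_data, timeout=10000)
arrow_result["out.generated_text"][0].as_py()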

When complete, we undeploy the LLM and return the resources back to the cluster.

pipeline.undeploy()
Waiting for undeployment - this will take up to 45s .................................... ok
name          granite-pipe-v1
created       2024-07-22 12:09:21.555827+00:00
last_updated  2024-07-22 12:09:21.605179+00:00
deployed      False
arch          x86
accel         none
tags
versions      4b2d7802-e930-437f-a29e-05ced44eddd7, 798784e0-85d6-461a-b005-177340b48f5a
steps         byop-granite-instruct-8b-v1
published     False

This sample notebook is available through the Wallaroo Tutorials repository.

For access to these sample models and for a demonstration of how to use an LLM Validation Listener, contact your Wallaroo support representative.