Vertex AI Inference Endpoint Deployment and Monitoring with Wallaroo

This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

Wallaroo Deployment of Managed Inference Endpoint Models with Google Vertex

The following tutorial demonstrates uploading, deploying, running inference on, and monitoring an LLM with Managed Inference Endpoints.

These models leverage LLMs deployed in other services, with Wallaroo providing a single source for inference requests, logging results, monitoring for hate/abuse/racism and other factors, and tracking model drift through Wallaroo assays.

Provided Models

The following models are provided:

  • byop_llama2_vertex_v2_9.zip: A Wallaroo BYOP model that uses Google Vertex as a Managed Inference Endpoint.

Prerequisites

This tutorial requires:

  • Wallaroo 2024.1 and above
  • Credentials for authenticating to Google Vertex

Tutorial Steps

Import Libraries

The following libraries are used to upload and perform inferences on the LLM with Managed Inference Endpoints.

import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

A connection to Wallaroo is opened through the Wallaroo SDK client. The Python library is included in the Wallaroo install and is available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

The request_timeout option is useful for Wallaroo BYOP models where the file size may require additional time to complete the upload process. It is not set in the example below, which uses the client defaults.

wl = wallaroo.Client()

LLM with Managed Inference Endpoint Model Code

The Wallaroo BYOP model byop_llama2_vertex_v2_9.zip contains the following artifacts:

  • main.py: Python script that controls the behavior of the model.
  • requirements.txt: Python requirements file that sets the Python libraries used.

The model performs the following steps:

  1. Accepts the inference request from the requester.

  2. Loads the credentials for the Google Vertex session from the provided environment variable. This variable is supplied during the Set Deployment Configuration step. The following code shows this process.

    credentials = Credentials.from_service_account_info(
        json.loads(os.environ["GOOGLE_APPLICATION_CREDENTIALS"].replace("'", '"')),
        scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )

  3. Takes the inference request, connects to Google Vertex, and submits the request to the deployed LLM. The inference result is returned to the BYOP model, which returns it to the requester.

    def _predict(self, input_data: InferenceData):
        credentials.refresh(Request())
        token = credentials.token
    
        headers = {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        }
        prompts = input_data["text"].tolist()
        instances = [{"prompt": prompt, "max_tokens": 200} for prompt in prompts]
    
        response = requests.post(
            f"{self.model}",
            json={"instances": instances},
            headers=headers,
        )
    
        predictions = response.json()
    
        if isinstance(predictions["predictions"], str):
            generated_text = [
                prediction.split("Output:\n")[-1]
                for prediction in predictions["predictions"]
            ]
        else:
            generated_text = [
                prediction["predictions"][0].split("Output:\n")[-1]
                for prediction in predictions["predictions"]
            ]
    
        return {"generated_text": np.array(generated_text)}
    

This model is contained in a Wallaroo pipeline which accepts the inference request, then returns the final result back to the requester.
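The request and response handling in _predict can be exercised without a live endpoint. The following is a minimal sketch, not the code shipped in main.py: build_instances and extract_generated_text are hypothetical helper names, and the two response shapes (a list of strings, or a list of objects that each carry their own predictions list) are assumptions inferred from the branches shown above.

```python
from typing import Any, Dict, List

# Marker that separates the echoed prompt from the generated text.
OUTPUT_MARKER = "Output:\n"


def build_instances(prompts: List[str], max_tokens: int = 200) -> Dict[str, Any]:
    """Build the JSON body posted to the endpoint: one instance per prompt."""
    return {"instances": [{"prompt": p, "max_tokens": max_tokens} for p in prompts]}


def extract_generated_text(payload: Dict[str, Any]) -> List[str]:
    """Return the text that follows the last output marker in each prediction."""
    results = []
    for prediction in payload["predictions"]:
        if isinstance(prediction, str):
            # Assumed shape 1: the prediction is already a plain string.
            text = prediction
        else:
            # Assumed shape 2: the prediction wraps a list of strings.
            text = prediction["predictions"][0]
        results.append(text.split(OUTPUT_MARKER)[-1])
    return results


body = build_instances(["How are you doing?"])
flat = extract_generated_text({"predictions": ["Prompt:\nHi\nOutput:\nHello"]})
nested = extract_generated_text(
    {"predictions": [{"predictions": ["Prompt:\nHi\nOutput:\nHello again"]}]}
)
print(body, flat, nested)
```

Checking the type of each prediction, rather than the type of the whole predictions value, keeps the two assumed shapes from being confused with a single string being iterated character by character.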

Upload LLM with Managed Inference Endpoint Model

Uploading models uses the Wallaroo Client upload_model method, which takes the following parameters:

Parameter | Type | Description
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model.
path | string (Required) | The path to the model file being uploaded.
framework | string (Required) | The framework of the model from wallaroo.framework.
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format.
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format.

The byop_llama2_vertex_v2_9.zip Wallaroo BYOP model is uploaded with the following input and output schema:

  • Input:
    • text (String): The input text.
  • Output:
    • generated_text (String): The result returned from the LLM deployed in Google Vertex as a Managed Inference Endpoint.

The uploaded model reference is saved to the variable model.

input_schema = pa.schema([
    pa.field("text", pa.string()),
])

output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])
model = wl.upload_model('byop-llama-vertex-v1', 
    './models/byop_llama2_vertex_v2_9.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
model

Set Deployment Configuration

The deployment configuration sets the resources assigned to the LLM with Managed Inference Endpoint. For this example, the following resources are applied.

  • Wallaroo engine: 1 cpu, 2 Gi RAM.
  • byop_llama2_vertex_v2_9.zip: 2 cpus, 1 Gi RAM, plus the environment variable GOOGLE_APPLICATION_CREDENTIALS loaded from the file credentials.json.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 2) \
    .sidekick_memory(model, '1Gi') \
    .sidekick_env(model, {"GOOGLE_APPLICATION_CREDENTIALS": str(json.load(open("credentials.json", 'r')))}) \
    .build()
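The credentials travel from the deployment configuration to the model as a single string, so it helps to see why the model swaps quotes before parsing. The sketch below uses a hypothetical, heavily shortened stand-in for credentials.json; the field values are placeholders and not real credentials.

```python
import json

# Hypothetical stand-in for the contents of credentials.json; a real
# service account file holds more fields, all of them strings.
creds = {
    "type": "service_account",
    "project_id": "example-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\nabc\n-----END PRIVATE KEY-----\n",
}

# What the deployment configuration stores: the Python repr of the dict,
# which uses single quotes and is therefore not valid JSON.
env_value = str(creds)

# What the model does when it loads the variable: swap the quotes, then parse.
recovered = json.loads(env_value.replace("'", '"'))

# Possible alternative on the deployment side: json.dumps emits valid JSON
# directly, and the model's quote swap leaves it unchanged as long as no
# value contains an apostrophe.
env_value_json = json.dumps(creds)
recovered_json = json.loads(env_value_json.replace("'", '"'))

print(recovered == creds, recovered_json == creds)
```

The quote swap only round-trips cleanly when every value is a string without apostrophes. A value such as True or None, or any text containing a single quote, would no longer parse as JSON after the swap.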

Deploy Model

To deploy the model:

  1. We build a Wallaroo pipeline and assign the model as a pipeline step. For this tutorial it is called llama-vertex-pipe.
  2. The pipeline is deployed with the deployment configuration.
  3. Once the resource allocation is complete, the model is ready for inferencing.

See Model Deploy for more details on deploying LLMs in Wallaroo.

pipeline = wl.build_pipeline("llama-vertex-pipe")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)
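When a script needs to confirm that resource allocation has finished before sending requests, a small polling loop is one option. The following is a sketch under assumptions: wait_until_running is a hypothetical helper, and it assumes the status call returns a dictionary with a status key that reads Running once the deployment is ready. A stand-in status function is used so the example runs without a Wallaroo instance.

```python
import time
from typing import Callable, Dict


def wait_until_running(
    get_status: Callable[[], Dict],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
) -> Dict:
    """Poll get_status() until it reports "Running" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status.get("status") == "Running":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Pipeline not running after {timeout_s} seconds: {status}")
        time.sleep(poll_s)


# Stand-in for a real status call: reports "Starting" twice, then "Running".
_fake_states = iter([{"status": "Starting"}, {"status": "Starting"}, {"status": "Running"}])
final_status = wait_until_running(lambda: next(_fake_states), timeout_s=5.0, poll_s=0.01)
print(final_status)
```

Against a live deployment, the call would look like wait_until_running(pipeline.status), provided the status dictionary matches the assumption above.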

Generate Inference Request

The inference request is submitted as a pandas DataFrame with one text entry per row.

input_data = pd.DataFrame({'text': ['What happened to the Serge llama?', 'How are you doing?']})

Submit Inference Request

The inference request is submitted to the pipeline with the infer method, which accepts either:

  • pandas DataFrame
  • Apache Arrow Table

The results are returned in the same format as submitted. For this example, a pandas DataFrame is submitted, so a pandas DataFrame is returned. The final generated text is displayed.

pipeline.infer(input_data)

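Undeploy

With the tutorial complete, the pipeline is undeployed so that its resources are returned to the cluster.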

pipeline.undeploy()