Vertex AI Inference Endpoint Deployment and Monitoring with Wallaroo
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
Wallaroo Deployment of Managed Inference Endpoint Models with Google Vertex
The following tutorial demonstrates uploading, deploying, inferencing and monitoring an LLM with Managed Inference Endpoints.
These models leverage LLMs deployed in other services, with Wallaroo providing a single source for inference requests, logging results, monitoring for hate/abuse/racism and other factors, and tracking model drift through Wallaroo assays.
Provided Models
The following models are provided:
- byop_llama2_vertex_v2_9.zip: A Wallaroo BYOP model that uses Google Vertex as a Managed Inference Endpoint.
Prerequisites
This tutorial requires:
- Wallaroo 2024.1 and above
- Credentials for authenticating to Google Vertex
Tutorial Steps
Import Library
The following libraries are used to upload and perform inferences on the LLM with Managed Inference Endpoints.
import json
import os
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
A connection to Wallaroo is opened through the Wallaroo SDK client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
The request_timeout flag is used for Wallaroo BYOP models where the file size may require additional time to complete the upload process. The connection below does not set it.
wl = wallaroo.Client()
LLM with Managed Inference Endpoint Model Code
The Wallaroo BYOP model byop_llama2_vertex_v2_9.zip contains the following artifacts:
- main.py: Python script that controls the behavior of the model.
- requirements.txt: Python requirements file that sets the Python libraries used.
The model performs the following steps.
Accepts the inference request from the requester.
Loads the credentials for the Google Vertex session from the provided environment variable. This is supplied during the Set Deployment Configuration step. The following code shows this process.
credentials = Credentials.from_service_account_info(
    json.loads(os.environ["GOOGLE_APPLICATION_CREDENTIALS"].replace("'", '"')),
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
Takes the inference request, connects to Google Vertex, and submits the request to the deployed LLM. The inference result is returned to the BYOP model, which then returns it as the generated_text output.
def _predict(self, input_data: InferenceData):
    credentials.refresh(Request())
    token = credentials.token

    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }

    prompts = input_data["text"].tolist()
    instances = [{"prompt": prompt, "max_tokens": 200} for prompt in prompts]

    response = requests.post(
        f"{self.model}",
        json={"instances": instances},
        headers=headers,
    )

    predictions = response.json()

    if isinstance(predictions["predictions"], str):
        generated_text = [
            prediction.split("Output:\n")[-1]
            for prediction in predictions["predictions"]
        ]
    else:
        generated_text = [
            prediction["predictions"][0].split("Output:\n")[-1]
            for prediction in predictions["predictions"]
        ]

    return {"generated_text": np.array(generated_text)}
This model is contained in a Wallaroo pipeline which accepts the inference request, then returns the final result back to the requester.
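The request and response handling in _predict can be tried without a live endpoint. The following is a minimal sketch, not the code shipped in main.py: the helper names build_payload and extract_generated_text are hypothetical, and the sample response shapes are assumptions for illustration rather than confirmed Google Vertex output.

```python
# Sketch only: helper names and response shapes are assumptions for
# illustration, not confirmed output from a Google Vertex endpoint.

def build_payload(prompts, max_tokens=200):
    # Wrap each prompt in the instance format used by the tutorial's _predict.
    return {"instances": [{"prompt": p, "max_tokens": max_tokens} for p in prompts]}


def extract_generated_text(response_body):
    # Keep only the text that follows the last "Output:" marker line.
    texts = []
    for prediction in response_body["predictions"]:
        # Assumed alternative shape: the text nested one level deeper.
        if isinstance(prediction, dict):
            prediction = prediction["predictions"][0]
        texts.append(prediction.split("Output:\n")[-1])
    return texts


payload = build_payload(["How are you doing?"])

# Hypothetical response body standing in for response.json().
sample_response = {
    "predictions": ["Prompt:\nHow are you doing?\nOutput:\nI am doing well."]
}
generated = extract_generated_text(sample_response)

print(payload)
print(generated)
```

Splitting on the marker and taking the last element means a prediction without any "Output:" marker is passed through unchanged instead of raising an error.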
Upload LLM with Managed Inference Endpoint Model
Uploading models uses the Wallaroo Client upload_model
method, which takes the following parameters:
| Parameter | Type | Description |
|---|---|---|
| name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
| path | string (Required) | The path to the model file being uploaded. |
| framework | string (Required) | The framework of the model from wallaroo.framework. |
| input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. |
| output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. |
The upload parameters for the byop_llama2_vertex_v2_9.zip Wallaroo BYOP model use the following input and output schema:
- Input: text (String): The input text.
- Output: generated_text (String): The result returned from the LLM hosted in Google Vertex as a Managed Inference Endpoint.
The uploaded model reference is saved to the variable model.
input_schema = pa.schema([
    pa.field("text", pa.string()),
])
output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])
model = wl.upload_model('byop-llama-vertex-v1',
    './models/byop_llama2_vertex_v2_9.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
model
Set Deployment Configuration
The deployment configuration sets the resources assigned to the LLM with Managed Inference Endpoint. For this example, the following resources are applied:
- byop_llama2_vertex_v2_9.zip: 2 CPUs, 1 Gi RAM, plus the environment variable GOOGLE_APPLICATION_CREDENTIALS loaded from the file credentials.json.
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 2) \
    .sidekick_memory(model, '1Gi') \
    .sidekick_env(model, {"GOOGLE_APPLICATION_CREDENTIALS": str(json.load(open("credentials.json", 'r')))}) \
    .build()
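The environment variable carries the credentials as the str() of a Python dictionary, which is why the model code swaps single quotes for double quotes before parsing. The following sketch walks through that round trip with a made-up dictionary standing in for credentials.json; the values are hypothetical and not a real service account key.

```python
import json
import os

# Hypothetical stand-in for the contents of credentials.json; not a real key.
fake_info = {
    "type": "service_account",
    "project_id": "example-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\nabc\n-----END PRIVATE KEY-----\n",
}

# Deployment side, as in sidekick_env above: str() of a dict uses single quotes.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(fake_info)

# Model side, as in main.py: swap the quotes back, then parse as JSON.
raw = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
restored = json.loads(raw.replace("'", '"'))
print(restored == fake_info)

# The quote swap is fragile: a value containing an apostrophe no longer parses.
tricky = {"note": "it's"}
try:
    json.loads(str(tricky).replace("'", '"'))
    apostrophe_ok = True
except json.JSONDecodeError:
    apostrophe_ok = False
print(apostrophe_ok)
```

Storing json.dumps(...) and reading it back with json.loads alone would avoid the quote swap, but that would require changing both the deployment configuration and main.py, so it is noted here only as a possible design alternative.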
Deploy Model
To deploy the model:
- We build a Wallaroo pipeline and assign the model as a pipeline step. For this tutorial it is called llama-vertex-pipe.
- The pipeline is deployed with the deployment configuration.
- Once the resource allocation is complete, the model is ready for inferencing.
See Model Deploy for more details on deploying LLMs in Wallaroo.
pipeline = wl.build_pipeline("llama-vertex-pipe")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)
Generate Inference Request
The inference request is submitted as a pandas DataFrame with a single text column, one prompt per row.
input_data = pd.DataFrame({'text': ['What happened to the Serge llama?', 'How are you doing?']})
Submit Inference Request
The inference request is submitted to the pipeline with the infer method, which accepts either:
- pandas DataFrame
- Apache Arrow Table
The results are returned in the same format as submitted. For this example, a pandas DataFrame is submitted, so a pandas DataFrame is returned. The final generated text is displayed.
pipeline.infer(input_data)
Undeploy
pipeline.undeploy()