Managed LLM Inference Endpoints (MaaS) in Wallaroo



Wallaroo provides the ability to leverage LLMs deployed in external cloud services through Wallaroo’s Arbitrary Python (aka BYOP) models. As a result, Wallaroo users are able to:

  • Perform inference requests submitted to Wallaroo against Managed Inference Endpoints: LLMs deployed in other services such as:
    • Google Vertex
    • OpenAI
    • Azure ML Studio
  • Monitor LLM inference endpoints.

This allows organizations to use Wallaroo as a centralized location for inference requests, edge and multi-cloud deployments, and real-time and scheduled monitoring.

The following provides examples of how to:

  • Deploy BYOP models with Managed Inference Endpoints.
  • Perform inference through the Wallaroo deployed BYOP models.
  • Monitor the inference results through in-line or offline Wallaroo LLM listeners.

For access to these sample models and a demonstration on using LLMs with Wallaroo:

Deploy LLMs with Managed Inference Endpoints

LLMs with Managed Inference Endpoints are deployed in Wallaroo through Arbitrary Python aka BYOP models. Arbitrary Python or BYOP (Bring Your Own Predict) allows organizations to use Python scripts and supporting libraries as their own model.

Wallaroo BYOP models integrate externally deployed LLMs by:

  • Providing credentials to connect to the LLMs deployed in external services.
  • Using the BYOP’s _predict method to submit the inference request to the externally deployed LLM and process the received inference result.

Structure for Managed Inference Endpoints BYOP Models

The following shows how to implement Managed Inference Endpoints for BYOP models in Wallaroo.

For the BYOP model to connect with the Managed Inference Endpoint, credentials are provided to authenticate to the endpoint. These are loaded from environment variables, which are passed in during the Deploy Models stage. This allows deployment with updated credentials without changing the underlying code.

The following is a generic example of loading credentials passed during model deployment as environment variables.

    import json
    import os

    # assumes the google-auth package provides the Credentials class
    from google.oauth2.service_account import Credentials

    credentials = Credentials.from_service_account_info(
        json.loads(os.environ["APPLICATION_CREDENTIALS"].replace("'", '"')),
        scopes=["https://SERVICE_URL"],
    )
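The `.replace("'", '"')` call matters because the deployment example later in this guide passes the credentials as `str(dict)`, which yields Python-style single quotes rather than JSON. The following is a minimal sketch of that round trip using only the standard library. The credential fields are made-up placeholders, and the `json.dumps` variant at the end is a suggested alternative, not something this guide prescribes.

```python
import json
import os

# Placeholder service-account fields (hypothetical values, for illustration only).
service_account_info = {"type": "service_account", "client_email": "svc@example.com"}

# Deployment side: str() of a dict produces single-quoted keys and values.
os.environ["APPLICATION_CREDENTIALS"] = str(service_account_info)

# Model side: swap the quotes so the text parses as JSON.
raw = os.environ["APPLICATION_CREDENTIALS"]
loaded = json.loads(raw.replace("'", '"'))

# Assumed alternative: serialize with json.dumps so no quote swap is needed.
# This also survives values that themselves contain an apostrophe.
os.environ["APPLICATION_CREDENTIALS_JSON"] = json.dumps(service_account_info)
loaded_json = json.loads(os.environ["APPLICATION_CREDENTIALS_JSON"])
```

The quote swap is a plain text substitution, so it would corrupt any credential value that contains an apostrophe. That is the main reason to consider the JSON-serialized form.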

The Wallaroo BYOP _predict method uses these credentials to submit the incoming InferenceData to the Managed Inference Endpoint, then returns the results. The following example shows a generic template of that process.

    def _predict(self, input_data: InferenceData):
        # refresh the credentials to get a current bearer token
        credentials.refresh(Request())
        token = credentials.token

        headers = {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        }

        # one instance per incoming prompt
        prompts = input_data["text"].tolist()
        instances = [{"prompt": prompt, "max_tokens": 200} for prompt in prompts]

        # self.model is used as the Managed Inference Endpoint URL
        response = requests.post(
            f"{self.model}",
            json={"instances": instances},
            headers=headers,
        )
        predictions = response.json()

        # predictions are either plain strings or nested objects
        if isinstance(predictions["predictions"][0], str):
            generated_text = [
                prediction.split("Output:\n")[-1]
                for prediction in predictions["predictions"]
            ]
        else:
            generated_text = [
                prediction["predictions"][0].split("Output:\n")[-1]
                for prediction in predictions["predictions"]
            ]

        return {"generated_text": np.array(generated_text)}
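The response handling in the template can be tried without a live endpoint. The sketch below isolates it in a standalone helper and runs it against two hand-written payloads. The helper name `extract_generated_text` and both sample payloads are illustrative assumptions; the shape of a real endpoint response depends on the service hosting the LLM.

```python
from typing import Any, Dict, List


def extract_generated_text(payload: Dict[str, Any], marker: str = "Output:\n") -> List[str]:
    """Return the text after the last marker for each prediction in the payload."""
    texts = []
    for prediction in payload["predictions"]:
        if isinstance(prediction, str):
            # flat shape: each prediction is already the generated string
            raw = prediction
        else:
            # nested shape: each prediction wraps its own "predictions" list
            raw = prediction["predictions"][0]
        texts.append(raw.split(marker)[-1])
    return texts


# Hypothetical payloads covering both shapes handled by the template.
flat = {"predictions": ["Prompt:\nHi\nOutput:\nHello there"]}
nested = {"predictions": [{"predictions": ["Prompt:\nHi\nOutput:\nHello again"]}]}

flat_text = extract_generated_text(flat)
nested_text = extract_generated_text(nested)
```

Unlike the template, this helper checks the type of each entry rather than the first one only, which is a slightly more defensive choice when shapes could be mixed.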

Upload Models

BYOP models that integrate Managed Inference Endpoints are uploaded through the Wallaroo SDK or the Wallaroo MLOps API.

Upload via the Wallaroo SDK

Models are uploaded with the Wallaroo SDK via the wallaroo.client.Client.upload_model method with the following parameters. For full details, see the LLM Deploy guides.

| Parameter | Type | Description |
|---|---|---|
| name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
| path | string (Required) | The path to the model file being uploaded. |
| framework | string (Required) | The framework of the model from wallaroo.framework. |
| input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. |
| output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. |

The following template demonstrates uploading a generic model to Wallaroo.

    import wallaroo

    # connect to Wallaroo
    wl = wallaroo.Client()

    # upload the model
    llm_model = wl.upload_model(
        name = llm_model_name,
        path = llm_file_path,
        input_schema = llm_input_schema,    # defined as a pyarrow.Schema
        output_schema = llm_output_schema,  # defined as a pyarrow.Schema
        framework = llm_framework
    )

Upload via the Wallaroo MLOps API

The method wallaroo.client.Client.generate_upload_model_api_command generates a curl script for uploading models to Wallaroo via the Wallaroo MLOps API. The generated curl script is based on the Wallaroo SDK user’s current workspace. This is useful for environments that do not have the Wallaroo SDK installed, or for uploading very large models (10 gigabytes or more).

The command assumes that other upload parameters are set to default. For details on uploading models via the Wallaroo MLOps API, see Wallaroo MLOps API Essentials Guide: Model Upload and Registrations.

This method takes the following parameters:

| Parameter | Type | Description |
|---|---|---|
| base_url | String (Required) | The Wallaroo domain name. For example: wallaroo.example.com. |
| name | String (Required) | The name to assign the model at upload. This must match DNS naming conventions. |
| path | String (Required) | Path to the ML or LLM model file. |
| framework | String (Required) | The framework from wallaroo.framework.Framework. For a complete list, see Wallaroo Supported Models. |
| input_schema | String (Required) | The model’s input schema in PyArrow.Schema format. |
| output_schema | String (Required) | The model’s output schema in PyArrow.Schema format. |

This outputs a curl command in the following format (indentation added for readability). The sections marked with {} represent the values that are injected into the script from the above parameters or from the current SDK session:

  • {Current Workspace['id']}: The value of the id for the current workspace.
  • {Bearer Token}: The bearer token used to authenticate to the Wallaroo MLOps API.

    curl --progress-bar -X POST \
      -H "Content-Type: multipart/form-data" \
      -H "Authorization: Bearer {Bearer Token}" \
      -F "metadata={"name": {name}, "visibility": "private", "workspace_id": {Current Workspace['id']}, "conversion": {"arch": "x86", "accel": "none", "framework": "custom", "python_version": "3.8", "requirements": []}, \
          "input_schema": "{base64 version of input_schema}", \
          "output_schema": "{base64 version of output_schema}"};type=application/json" \
      -F "file=@{path};type=application/octet-stream" \
      https://{base_url}/v1/api/models/upload_and_convert

Once generated, users can use the script to upload the model via the Wallaroo MLOps API.
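For environments that assemble the upload request by hand, the `metadata` form field from the curl template above can be built as ordinary JSON. The following is a rough sketch using only the standard library. The helper name `build_upload_metadata` is hypothetical, the field values mirror the template, and the placeholder bytes stand in for serialized Arrow schemas, since this guide does not show how the SDK serializes them.

```python
import base64
import json


def build_upload_metadata(name, workspace_id, input_schema_bytes, output_schema_bytes):
    """Assemble the JSON text for the `metadata` form field of the upload request."""
    metadata = {
        "name": name,
        "visibility": "private",
        "workspace_id": workspace_id,
        "conversion": {
            "arch": "x86",
            "accel": "none",
            "framework": "custom",
            "python_version": "3.8",
            "requirements": [],
        },
        # schemas travel as base64 text inside the JSON document
        "input_schema": base64.b64encode(input_schema_bytes).decode("utf-8"),
        "output_schema": base64.b64encode(output_schema_bytes).decode("utf-8"),
    }
    return json.dumps(metadata)


# Placeholder bytes (assumption): real values would be serialized Arrow schemas.
metadata_json = build_upload_metadata("sample-model-name", 20, b"input-schema", b"output-schema")
parsed = json.loads(metadata_json)
```

Building the field with `json.dumps` keeps the quoting valid regardless of what the model name contains, which is harder to guarantee when editing the curl command by hand.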

The following example shows setting the parameters above and generating the model upload API command.

    import wallaroo
    from wallaroo.framework import Framework
    import pyarrow as pa

    # connect to Wallaroo
    wl = wallaroo.Client()

    # set the input and output schemas
    input_schema = pa.schema([
        pa.field("text", pa.string())
    ])

    output_schema = pa.schema([
        pa.field("generated_text", pa.string())
    ])

    # use the generate model upload api command
    # base_url is the Wallaroo domain name, per the parameter table
    wl.generate_upload_model_api_command(
        base_url='example.wallaroo.ai',
        name='sample_model_name',
        path='llama_byop.zip',
        framework=Framework.CUSTOM,
        input_schema=input_schema,
        output_schema=output_schema)

The output of this command is:

curl --progress-bar -X POST -H "Content-Type: multipart/form-data" -H "Authorization: Bearer abc123" -F "metadata={"name": "sample_model_name", "visibility": "private", "workspace_id": 20, "conversion": {"arch": "x86", "accel": "none", "framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json" -F "file=@llama_byop.zip;type=application/octet-stream" https://example.wallaroo.ai/v1/api/models/upload_and_convert'

Deploy Models

LLM deployments allocate resources from the cluster for the model’s use. Once deployed, models are ready to accept inference requests.

LLM deployments use Deployment Configurations to determine resource allocation, including:

  • The number of CPUs
  • The amount of RAM
  • The number of GPUs (if any)
  • The number of replicas
  • Etc.

The following demonstrates deploying the model via the Wallaroo SDK. For full details on deploying models in Wallaroo, see ML Operations Model Deploy. For this example, the BYOP model is deployed to the Wallaroo Containerized Runtime with 2 CPUs and 1 Gi of RAM. The sidekick_env setting provides the BYOP model with the environment variables used to connect to the service hosting the LLM.

The model is deployed by adding the model to a Wallaroo pipeline as a model step, then issuing the deploy command with the deployment configuration as a parameter.

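    import json
    from wallaroo.deployment_config import DeploymentConfigBuilder

    # read the credentials passed to the BYOP model as an environment variable
    with open("credentials.json", "r") as f:
        application_credentials = str(json.load(f))

    deployment_config = DeploymentConfigBuilder() \
        .cpus(1).memory('2Gi') \
        .sidekick_cpus(llm_model, 2) \
        .sidekick_memory(llm_model, '1Gi') \
        .sidekick_env(llm_model, {"APPLICATION_CREDENTIALS": application_credentials}) \
        .build()

    # add the BYOP model as a pipeline step, then deploy
    pipeline = wl.build_pipeline('sample-external-deployed-llm-pipeline')
    pipeline.add_model_step(llm_model)
    pipeline.deploy(deployment_config=deployment_config)

Once the deployment is complete, the model is ready to accept inference requests.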

Inference with BYOP Managed Inference Endpoint Enabled Models

Inference requests are submitted to the deployed BYOP model with the enabled Managed Inference Endpoint via either the Wallaroo SDK or the MLOps API.

Inference via the Wallaroo SDK

Inference requests via the Wallaroo SDK to deployed models use either the pipeline.Pipeline.infer or the pipeline.Pipeline.infer_from_file methods.

pipeline.Pipeline.infer accepts either a pandas DataFrame, or an Apache Arrow Table.

pipeline.Pipeline.infer_from_file accepts either a JSON file in pandas Record format or an Apache Arrow table file.

For full details on inference requests, see Model Inference.

The following example demonstrates submitting an inference request to a model deployed in Wallaroo.

    import pandas as pd

    # set the input as a pandas DataFrame
    input_data = pd.DataFrame({'text': ['What happened to the Serge llama?', 'How are you doing?']})

    # perform the inference request
    pipeline.infer(input_data)
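For `infer_from_file`, the same prompts need to be on disk in pandas Record format, that is, a JSON list with one object per row. The sketch below writes such a file using only the standard library. The file name `prompts.json` and the temporary directory are arbitrary choices for illustration, and the final call to `pipeline.infer_from_file(path)` is left out because it requires a deployed pipeline.

```python
import json
import os
import tempfile

# Column-oriented input, matching the DataFrame used in the SDK example.
columns = {"text": ["What happened to the Serge llama?", "How are you doing?"]}

# Convert to pandas Record format: one JSON object per row.
records = [dict(zip(columns, row)) for row in zip(*columns.values())]

# Write the records to a JSON file (location chosen only for this sketch).
path = os.path.join(tempfile.mkdtemp(), "prompts.json")
with open(path, "w") as f:
    json.dump(records, f)

# Read the file back to confirm what was written.
with open(path) as f:
    written = json.load(f)
```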

Inference via the Wallaroo MLOps API

Inference requests through the Wallaroo MLOps API are made against the deployed model’s Inference URL endpoint. For full details, see Perform Inference via API.

The Model’s Inference URL endpoint is in the following format:

https://{Wallaroo Domain}/v1/api/pipelines/infer/{Pipeline Name}-{Pipeline Id}/{Pipeline Name}

For example, the Inference URL for the pipeline with the Wallaroo Domain example.wallaroo.ai, the pipeline name sample_pipeline, and the pipeline id 100 is:

https://example.wallaroo.ai/v1/api/pipelines/infer/sample_pipeline-100/sample_pipeline
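The URL format above can be expressed as a small helper. This is only a sketch of the string layout described in this section; the function name `inference_url` is hypothetical and is not part of the Wallaroo SDK.

```python
def inference_url(domain: str, pipeline_name: str, pipeline_id: int) -> str:
    """Build the inference URL from the Wallaroo domain, pipeline name, and pipeline id."""
    return (
        f"https://{domain}/v1/api/pipelines/infer/"
        f"{pipeline_name}-{pipeline_id}/{pipeline_name}"
    )


url = inference_url("example.wallaroo.ai", "sample_pipeline", 100)
```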

The following headers are required for connecting to the Model’s Inference URL:

  • Authorization: This requires the JWT token in the format 'Bearer ' + token. For example:

    Authorization: Bearer abcdefg==
  • Content-Type:

    • For DataFrame formatted JSON:

      Content-Type:application/json; format=pandas-records
    • For Arrow binary files, the Content-Type is application/vnd.apache.arrow.file.

      Content-Type:application/vnd.apache.arrow.file

The following demonstrates performing an inference request via curl with JSON data in pandas Record format.

    curl $INFERENCE_URL \
      -H "Content-Type: application/json; format=pandas-records" \
      -H "Authorization: Bearer {Bearer Token}" \
      --data '[{"text": "What happened to the Serge llama?"}, {"text": "How are you doing?"}]'
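The same request can be prepared from Python without the Wallaroo SDK. The following sketch builds the request with the standard library's `urllib` and does not send it. It assumes one record per row with a `text` field, matching the input schema used in this guide; the helper name `build_inference_request`, the URL, and the token are placeholders.

```python
import json
import urllib.request


def build_inference_request(url, token, prompts):
    """Build (but do not send) a POST request with a pandas-records JSON body."""
    records = [{"text": prompt} for prompt in prompts]
    return urllib.request.Request(
        url,
        data=json.dumps(records).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; format=pandas-records",
        },
        method="POST",
    )


request = build_inference_request(
    "https://example.wallaroo.ai/v1/api/pipelines/infer/sample_pipeline-100/sample_pipeline",
    "abcdefg==",
    ["What happened to the Serge llama?", "How are you doing?"],
)
# To send it against a live deployment: urllib.request.urlopen(request).read()
```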

Monitor Inference Results from Managed Inference Endpoints

Inference results from BYOP models with Managed Inference Endpoint LLMs are monitored through:

  • LLM Validation Listeners: The LLM’s response is evaluated in real time and the evaluation is included with the final inference result.
  • LLM Monitoring Listeners: Logged inference results are evaluated offline as a separate activity, and the evaluations are stored as inference result logs in Wallaroo.

In-Line Monitoring

In-line monitoring of Managed Inference Endpoint LLMs in Wallaroo uses two models:

  • The BYOP model that utilizes the Managed Inference Endpoint to execute the inference request.
  • A validation listener model that accepts the inference results from the externally deployed LLM, and scores the results.

Both models are used in the same Wallaroo pipeline, with the outputs from the BYOP model passed to the validation listener model, and the final result passed to the requester.

The following demonstrates uploading an LLM Listener Model to Wallaroo, then applying both models to the same Wallaroo pipeline as model steps. This allows the inference request to first go to the Wallaroo BYOP model that leverages the externally deployed LLM, which then forwards those results to the validation listener model.

    import json
    import wallaroo
    from wallaroo.deployment_config import DeploymentConfigBuilder

    # connect to Wallaroo
    wl = wallaroo.Client()

    # upload the BYOP model
    llm_model = wl.upload_model(
        name = model_name,
        path = file_path,
        input_schema = input_schema,    # defined as a pyarrow.Schema
        output_schema = output_schema,  # defined as a pyarrow.Schema
        framework = framework
    )

    # upload the validation listener model
    validation_listener_model = wl.upload_model(
        name = validation_model_name,
        path = validation_file_path,
        input_schema = validation_input_schema,    # defined as a pyarrow.Schema
        output_schema = validation_output_schema,  # defined as a pyarrow.Schema
        framework = validation_framework
    )

    # sample deployment config
    # LLM model: 2 cpus, 1 Gi RAM
    # Validation model: 2 cpus, 8 Gi RAM
    deployment_config = DeploymentConfigBuilder() \
        .cpus(1).memory('2Gi') \
        .sidekick_cpus(validation_listener_model, 2) \
        .sidekick_memory(validation_listener_model, '8Gi') \
        .sidekick_cpus(llm_model, 2) \
        .sidekick_memory(llm_model, '1Gi') \
        .sidekick_env(llm_model, {"APPLICATION_CREDENTIALS": str(json.load(open("credentials.json", 'r')))}) \
        .build()

    # create the pipeline and add the models
    pipeline = wl.build_pipeline('sample-external-deployed-llm-pipeline')
    pipeline.add_model_step(llm_model)
    pipeline.add_model_step(validation_listener_model)

    # deploy the models with the deployment configuration
    pipeline.deploy(deployment_config=deployment_config)
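To make the data flow between the two model steps concrete, the sketch below chains two plain Python functions in the same order as the pipeline. It is a toy illustration and not Wallaroo's implementation: `llm_step` stands in for the BYOP model (a real one would call the Managed Inference Endpoint), `validation_listener_step` applies an arbitrary length-based score, and all three function names are hypothetical.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict[str, List]], Dict[str, List]]


def llm_step(data: Dict[str, List]) -> Dict[str, List]:
    """Stand-in for the BYOP model; echoes each prompt instead of calling an LLM."""
    return {"generated_text": [f"echo: {prompt}" for prompt in data["text"]]}


def validation_listener_step(data: Dict[str, List]) -> Dict[str, List]:
    """Stand-in listener: attaches a toy length-based score to each generated text."""
    scores = [min(len(text) / 100.0, 1.0) for text in data["generated_text"]]
    return {"generated_text": data["generated_text"], "score": scores}


def run_pipeline(steps: List[Step], data: Dict[str, List]) -> Dict[str, List]:
    """Feed the output of each step into the next, like consecutive model steps."""
    for step in steps:
        data = step(data)
    return data


result = run_pipeline([llm_step, validation_listener_step], {"text": ["How are you doing?"]})
```

The requester receives the output of the last step, so the listener passes the generated text through alongside its score.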

Offline Scoring of External Deployed LLMs

Offline scoring of Wallaroo BYOP models with externally deployed LLMs is provided through LLM Monitoring Listeners. These provide on-demand or scheduled analysis of inference results and score them for criteria including:

  • Toxicity
  • Sentiment
  • Profanity
  • Hate
  • Etc.

LLM Monitoring Listeners are composed of models trained to evaluate LLM outputs. They can be updated or refined according to the organization’s needs to add additional scoring methods.

LLM Listeners are executed in one of the following ways:

  • Run Once: The LLM Listener runs once and evaluates a set of LLM outputs.
  • Run Scheduled: The LLM Listener is executed on a user defined schedule (run once an hour, several times a day, etc).
  • Based on events: Feature coming soon!

The results of these evaluations are stored and retrieved as Inference Results, in the same way as LLM inference results, so each LLM Listener evaluation result can be used to detect LLM model drift or other issues.
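As a rough illustration of a Run Once evaluation, the sketch below scores a batch of logged outputs and summarizes how many were flagged. It is deliberately simplistic: real LLM Listeners use trained models, whereas this stand-in uses a keyword check. The names `ListenerResult` and `run_once`, the sample logs, and the blocked-term list are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class ListenerResult:
    text: str
    flagged: bool


def run_once(logged_outputs: Iterable[str], blocked_terms: Set[str]) -> List[ListenerResult]:
    """Evaluate each logged LLM output once and record whether it was flagged."""
    results = []
    for text in logged_outputs:
        # normalize words by stripping common punctuation and lowercasing
        words = {word.strip(".,!?").lower() for word in text.split()}
        results.append(ListenerResult(text=text, flagged=bool(words & blocked_terms)))
    return results


# Hypothetical logged outputs and blocked-term list.
logs = ["Have a nice day!", "That idea is stupid."]
results = run_once(logs, {"stupid"})

# Fraction of outputs flagged; tracking this over time is one way to spot drift.
flag_rate = sum(r.flagged for r in results) / len(results)
```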

See LLM Monitoring Listeners for more details.

BYOP with Managed Inference Endpoint Tutorials

For full tutorials showing how to upload, deploy, and perform inference on Wallaroo BYOP models with Managed Inference Endpoints, see Managed Inference Endpoint Inference.
