2024.3 Product Release Notes
We are pleased to announce the following product improvements in our 2024.3 release:
- Dynamic Batching Support for LLMs: Dynamic batching collects incoming inference requests and processes them as a single batch. This improves efficiency and inference performance by using resources on one accumulated batch rather than starting and stopping for each individual request. (See the configuration sketch after this list.)
- Autoscale Triggers for LLMs: Autoscale triggers based on queue depth give LLM deployments greater flexibility by scaling resources up and down as demand changes. This reduces inference latency when more requests come in, then spools idle resources back down to save on costs. Resource allocation is smoothed by optional autoscaling windows that spread scaling over a longer period of time, preventing sudden resource spikes and drops.
- Autoscale Triggers for GPUs: Queue-depth autoscale triggers also provide autoscaling for LLMs that require GPUs, independent of CPU utilization. Replicas are scaled up and down with incoming inference requests, allowing GPU resources to be allocated more efficiently based on need. (Both trigger types appear in the autoscaling sketch after this list.)
- Support for Local Wheels and PyPI Indexes: Updates to the Wallaroo frameworks for Arbitrary Python, aka Bring Your Own Predict (BYOP), and Python models allow local Python wheels and custom PyPI indexes. This gives greater flexibility in customizing which libraries these frameworks include when deploying models in Wallaroo. (See the sketch after this list.)
- Inference Automation with Run Continuously Tasks: A new Inference Automation task type, the Run Continuously task, provides tasks that keep running until explicitly killed. Use cases include polling databases for new inference data to process, updating deployed models with new versions, and other real-time or near-real-time workflows. (See the sketch after this list.)
- Deploy Pipelines Asynchronously: Pipeline deployment through the Wallaroo SDK can either block and be monitored until deployment completes, or run as a "fire and forget" operation via an async parameter. This lets organizations deploy one or more pipelines at a time without waiting for the entire deployment process to finish before taking other actions in Wallaroo. (See the sketch after this list.)
- Edit Pipeline Deployment Configuration in the Wallaroo User Interface: Deployment configurations for previously deployed, currently undeployed pipelines can be edited from the Wallaroo Dashboard, making it easy to update a pipeline's configuration and optimize its hardware usage before redeploying.
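To illustrate dynamic batching, here is a minimal SDK sketch of attaching a batching configuration to an uploaded LLM. The `DynamicBatchingConfig` field names and the model name, path, and schemas are assumptions based on the feature description above, not confirmed signatures; consult the SDK reference for your version.

```python
import wallaroo
from wallaroo.framework import Framework
from wallaroo.dynamic_batching_config import DynamicBatchingConfig  # assumed import path
import pyarrow as pa

wl = wallaroo.Client()

# Placeholder schemas for a text-generation LLM.
input_schema = pa.schema([pa.field("text", pa.string())])
output_schema = pa.schema([pa.field("generated_text", pa.string())])

# Upload the LLM as usual; the name and archive path are placeholders.
model = wl.upload_model(
    "my-llm",
    "./models/my-llm.zip",
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)

# Attach a dynamic batching configuration: requests arriving within the delay
# window are accumulated and processed as one batch. Field names here are
# assumptions drawn from the release notes.
model = model.configure(
    dynamic_batching_config=DynamicBatchingConfig(
        max_batch_delay_ms=10,   # assumed: how long to wait while filling a batch
        batch_size_target=8,     # assumed: preferred batch size
        batch_size_limit=16,     # assumed: hard upper bound on batch size
    )
)
```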
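For the queue-depth autoscale triggers (both the LLM and GPU variants), a sketch of a deployment configuration follows. `DeploymentConfigBuilder`, `replica_autoscale_min_max`, `cpus`, `memory`, `gpus`, and `deployment_label` are existing SDK constructs; the queue-depth and windowing method names are assumptions based on the descriptions above.

```python
import wallaroo
from wallaroo.deployment_config import DeploymentConfigBuilder

wl = wallaroo.Client()
pipeline = wl.build_pipeline("llm-pipeline")  # placeholder pipeline name

# Autoscaling by queue depth: replicas scale up when queued inference requests
# cross the trigger, then scale back down when the queue drains.
deploy_config = (
    DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=1, maximum=5)  # scale between 1 and 5 replicas
    .scale_up_queue_depth(5)    # assumed: queued requests per replica before scaling up
    .autoscaling_window(300)    # assumed: smoothing window in seconds to avoid spikes
    .cpus(4)
    .memory("8Gi")
    .gpus(1)                    # GPU replicas scale on queue depth, not CPU utilization
    .deployment_label("wallaroo.ai/accelerator: a100")  # hypothetical GPU node label
    .build()
)

pipeline.deploy(deployment_config=deploy_config)
```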
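For local wheels and custom PyPI indexes, the sketch below shows a BYOP model upload where the model archive bundles its own wheels and requirements. The archive layout and requirements syntax in the comments are illustrative assumptions about how the feature is packaged.

```python
import wallaroo
from wallaroo.framework import Framework
import pyarrow as pa

# A BYOP (Arbitrary Python) model is uploaded as an archive containing the
# inference code plus a requirements.txt. With this release, requirements.txt
# can reference wheels bundled in the archive and a custom PyPI index, e.g.:
#
#   ./wheels/my_private_lib-1.0.0-py3-none-any.whl
#   --extra-index-url https://pypi.internal.example.com/simple
#   some-internal-package==2.1.0
#
# The paths, index URL, and package names above are hypothetical.

wl = wallaroo.Client()

input_schema = pa.schema([pa.field("text", pa.string())])
output_schema = pa.schema([pa.field("generated_text", pa.string())])

model = wl.upload_model(
    "byop-with-local-wheels",
    "./models/byop_model.zip",  # archive includes code, wheels/, requirements.txt
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
```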
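For Run Continuously tasks, a sketch using ML Workload Orchestration follows. `upload_orchestration` and `task.kill()` are existing SDK calls; `run_continuously` and the task arguments are assumed names based on the release notes.

```python
import wallaroo

wl = wallaroo.Client()

# Upload an orchestration archive containing the task's Python entry point,
# for example a loop that polls a database for new inference data.
orchestration = wl.upload_orchestration(path="./poll_and_infer.zip")

# A Run Continuously task keeps running until explicitly killed, unlike
# run-once or scheduled tasks. Method and argument names here are assumptions.
task = orchestration.run_continuously(
    name="poll-database-task",
    json_args={"table": "incoming_inferences"},  # hypothetical task arguments
)

# ... later, stop the task explicitly.
task.kill()
```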
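Finally, for asynchronous pipeline deployment, a sketch of deploying two pipelines without blocking on either. The `wait_for_status` parameter name is an assumption based on the "fire and forget" description above.

```python
import wallaroo
from wallaroo.deployment_config import DeploymentConfigBuilder

wl = wallaroo.Client()

config = DeploymentConfigBuilder().replica_count(1).cpus(2).memory("2Gi").build()

# "Fire and forget": deploy() returns immediately instead of blocking until
# the deployment completes, so multiple deployments can proceed in parallel.
pipeline_a = wl.build_pipeline("pipeline-a")
pipeline_a.deploy(deployment_config=config, wait_for_status=False)  # assumed parameter

pipeline_b = wl.build_pipeline("pipeline-b")
pipeline_b.deploy(deployment_config=config, wait_for_status=False)

# Both deployments run in the background; poll status when needed.
print(pipeline_a.status())
print(pipeline_b.status())
```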
For access to sample models and a demonstration on deploying ML models and LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today