2024.3 Product Release Notes
We are pleased to announce the following product improvements in our 2024.3 release:
- Dynamic Batching Support for LLMs: Dynamic Batching collects incoming inference requests and processes them as a single batch. This improves efficiency and inference performance by using resources for one accumulated batch rather than starting and stopping for each individual request.
- Autoscale Triggers for LLMs: Autoscale triggers based on queue depth give LLM deployments greater flexibility by adding or removing resources in response to scale-up and scale-down triggers. This decreases inference latency when more requests come in, then spins idle resources back down to save on costs. Resource allocation is smoothed by optional autoscaling windows that allow scaling up and down over a longer period of time, preventing sudden resource spikes and drops.
- Autoscale Triggers for GPUs: Autoscale triggers based on queue depth provide autoscaling for LLMs that require GPUs, independent of CPU utilization. Replicas are scaled up and down depending on the incoming inference requests, so GPU resources are allocated more efficiently based on need.
- Support for Local Wheel and PyPI Indexes: Updates to the Wallaroo frameworks for Custom Model, also known as Bring Your Own Predict (BYOP), and Python models allow for local Python wheels and custom PyPI indexes. This allows greater flexibility and customization of which libraries are available in these frameworks when deploying models in Wallaroo.
- Inference Automation with Run Continuously Tasks: A new Inference Automation task type, the Run Continuously Task, runs until it is killed. Use cases include polling databases for new inference data to process, updating deployed models with new versions, and other real-time or near-real-time use cases.
- Deploy Pipelines Asynchronously: Pipelines deployed through the Wallaroo SDK can either be monitored until deployment completes or deployed in “fire and forget” mode through an async parameter. This allows organizations to deploy one or more pipelines at a time without waiting for the entire deployment process to finish before taking other actions in Wallaroo.
- Edit Pipeline Deployment Configuration in the Wallaroo User Interface: Deployment configurations for previously deployed and currently undeployed pipelines are editable from the Wallaroo Dashboard. This makes it easy to update a previously deployed pipeline's configuration to optimize its hardware usage.
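To illustrate the general idea behind the Dynamic Batching item above, here is a minimal conceptual sketch in Python. It is not the Wallaroo SDK or its implementation: the `DynamicBatcher` class, the `max_batch_size` setting, and the `fake_model` function are all hypothetical names chosen for illustration. This toy version also triggers only on batch size, with no time-based limit.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class DynamicBatcher:
    """Toy batcher (hypothetical): buffers requests and runs them in groups."""
    predict_batch: Callable[[Sequence[str]], List[str]]  # stand-in for a model call
    max_batch_size: int = 4                               # assumed batch-size limit
    _pending: List[str] = field(default_factory=list)
    calls: int = 0                                        # number of model invocations

    def submit(self, request: str) -> List[str]:
        """Queue one request; return results if the batch filled up, else []."""
        self._pending.append(request)
        if len(self._pending) >= self.max_batch_size:
            return self.flush()
        return []

    def flush(self) -> List[str]:
        """Run whatever is pending as a single batch."""
        if not self._pending:
            return []
        batch, self._pending = self._pending, []
        self.calls += 1
        return self.predict_batch(batch)


def fake_model(prompts: Sequence[str]) -> List[str]:
    """Stand-in for an LLM: one call handles the whole batch."""
    return [p.upper() for p in prompts]


batcher = DynamicBatcher(predict_batch=fake_model, max_batch_size=3)
results: List[str] = []
for prompt in ["a", "b", "c", "d", "e"]:
    results.extend(batcher.submit(prompt))
results.extend(batcher.flush())  # process the final partial batch
```

In this sketch, five requests are served with two model invocations instead of five, which is the efficiency gain the batching item describes.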
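The queue-depth autoscaling described above, including the smoothing effect of an autoscaling window, can be sketched conceptually as follows. This is an illustrative assumption of how such a policy might behave, not Wallaroo's actual configuration or algorithm: the `QueueDepthAutoscaler` class and every threshold, window, and replica setting in it are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class QueueDepthAutoscaler:
    """Toy policy (hypothetical): scale on the windowed average queue depth."""
    scale_up_depth: int = 5    # assumed: average depth above which a replica is added
    scale_down_depth: int = 1  # assumed: average depth below which a replica is removed
    window: int = 3            # assumed: number of recent samples averaged
    min_replicas: int = 1
    max_replicas: int = 4
    replicas: int = 1
    _samples: Deque[int] = field(default_factory=deque)

    def observe(self, queue_depth: int) -> int:
        """Record one queue-depth sample and return the new replica count."""
        self._samples.append(queue_depth)
        if len(self._samples) > self.window:
            self._samples.popleft()
        avg = sum(self._samples) / len(self._samples)
        if avg > self.scale_up_depth:
            self.replicas = min(self.replicas + 1, self.max_replicas)
        elif avg < self.scale_down_depth:
            self.replicas = max(self.replicas - 1, self.min_replicas)
        return self.replicas


scaler = QueueDepthAutoscaler()
# A burst of requests followed by an idle period.
history = [scaler.observe(depth) for depth in [0, 20, 20, 20, 0, 0, 0]]
```

Because decisions use the average over the window rather than the latest sample, replicas in this sketch are added one step at a time during the burst and are released only after the queue has stayed empty for a full window.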
For access to sample models and a demonstration on deploying ML models and LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today