Wallaroo supports Qualcomm QAIC, providing high-performance, x86-compatible processing with AI acceleration at low power cost. This improves LLM performance while reducing energy requirements.
Wallaroo supports vLLM with QAIC acceleration across two different autopackaging scenarios:

wallaroo.framework.Framework.VLLM
: Native async vLLM implementations in Wallaroo compatible with NVIDIA CUDA.

wallaroo.framework.Framework.CUSTOM
: Custom async vLLM implementations in Wallaroo using BYOP (Bring Your Own Predict) provide greater flexibility through a lightweight Python interface.

For access to these sample models and a demonstration on using LLMs with Wallaroo:
QAIC AI Acceleration delivers an x86-compatible architecture with AI acceleration at low power cost. The following Wallaroo features are supported for LLMs deployed in Wallaroo with QAIC AI acceleration:
QAIC acceleration applied to LLMs has the following exceptions:

* If the model is uploaded with the acceleration set to `QAIC`, but the architecture is set to an incompatible architecture (aka anything other than `X86`), the following error is returned: "The specified model optimization configuration is not available. Please try this operation again using a different configuration or contact Wallaroo at support@wallaroo.ai for questions or help."
* If an acceleration other than `QAIC` is specified elsewhere for a model uploaded with `QAIC`, the conflicting setting is ignored.

When supplying an acceleration configuration with `wallaroo.engine_config.Acceleration.QAIC`, the `wallaroo.engine_config.Acceleration.QAIC.with_config` parameter must be set to a `wallaroo.engine_config.QaicConfig`. If any other acceleration config type is used, the model uploads and uses the `QAIC` acceleration parameters, but the `accel_config` options default to the `wallaroo.engine_config.QaicConfig` default values.

The following compatibility matrix shows the framework configuration options, highlighting which are compatible with QAIC.
QaicConfig Parameter | QaicConfig Data Type | QaicConfig Value | Wallaroo VLLMConfig Parameter | VLLMConfig Data Type | Compatibility with Framework.VLLM + Acceleration.CUDA | Compatibility with Framework.VLLM + Acceleration.QAIC |
---|---|---|---|---|---|---|
mxfp6_matmul (quantization) | bool | True | quantization="mxfp6" | optional str (i.e. str or None) | ❌ (only supported in qaic-vllm patch) | √ |
mxfp6_matmul (quantization) | bool | False | None | optional str (i.e. str or None) | √ | √ |
mxint8_kv_cache (kv cache) | bool | True | kv_cache_dtype="mxint8" | str | ❌ (only supported in qaic-vllm patch) | √ |
mxint8_kv_cache (kv cache) | bool | False | “auto” | str | √ | √ |
full_batch_size | int | Any | max_num_seqs | int | √ | √ (for most scenarios) |
ctx_len | int | Any | max_model_len | int | √ | √ (for most scenarios) |
prefill_seq_len | int | Any | max_seq_len_to_capture | int | √ | √ (for most scenarios) |
num_devices | int | Any | device_group | List[int] | ❌ (only supported in qaic-vllm patch) | ⚠️ (length of list needs to be <= num_devices) |
Legend:

* √ : Supported.
* ❌ : Not supported.
* ⚠️ : Supported with restrictions (see the note in the cell).
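To illustrate the mapping above, the following sketch pairs a `QaicConfig` with the `VLLMConfig` options it corresponds to. The parameter names follow the matrix; the values are illustrative only and mirror the upload examples later in this guide.

```python
from wallaroo.engine_config import QaicConfig
from wallaroo.framework import VLLMConfig

# QaicConfig options (left-hand columns of the matrix)...
qaic_config = QaicConfig(
    mxfp6_matmul=True,       # corresponds to quantization="mxfp6"
    mxint8_kv_cache=True,    # corresponds to kv_cache_dtype="mxint8"
    full_batch_size=16,      # corresponds to max_num_seqs=16
    ctx_len=256,             # corresponds to max_model_len=256
    prefill_seq_len=128,     # corresponds to max_seq_len_to_capture=128
    num_devices=4,           # device_group length must be <= num_devices
)

# ...and the VLLMConfig options they line up with (right-hand columns of the matrix).
framework_config = VLLMConfig(
    quantization="mxfp6",
    kv_cache_dtype="mxint8",
    max_num_seqs=16,
    max_model_len=256,
    max_seq_len_to_capture=128,
    block_size=32,           # for QAIC, block_size=32 per the framework configuration notes
)
```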
vLLM AI acceleration settings are applied at model upload via the Wallaroo SDK. The general steps include the following:

* Set the `accel` parameter to `wallaroo.engine_config.Acceleration.QAIC`.
* Optionally fine tune the hardware settings by providing a `wallaroo.engine_config.QaicConfig` via `with_config`.

QAIC acceleration is applied to LLMs in Wallaroo at model upload through the Wallaroo SDK or the Wallaroo MLOps API.
LLMs are uploaded to Wallaroo via the Wallaroo SDK using the method wallaroo.client.Client.upload_model
.
The method wallaroo.client.Client.upload_model
takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework.Framework . For native vLLM, this framework is wallaroo.framework.Framework.VLLM . |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. |
framework_config | wallaroo.framework.VLLMConfig (Optional) | Sets the vLLM framework configuration options. |
accel | wallaroo.engine_config.Acceleration.QAIC (Required) OR wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig) (Optional) | The AI hardware accelerator used. Submitting with the with_config(QaicConfig) parameters overrides the hardware performance defaults. |
convert_wait | bool (Optional) | Whether to wait for the model conversion to complete before returning. |
wallaroo.framework.VLLMConfig
contains the following parameters. If no framework configuration is defined, then the default values are applied.
For QAIC, set `block_size=32`.
Parameters | Type |
---|---|
max_num_seqs | Integer (Default: 256) |
max_model_len | Integer (Default: None) |
max_seq_len_to_capture | Integer (Default: 8192) |
quantization | String (Default: None) |
kv_cache_dtype | String (Default: 'auto') |
gpu_memory_utilization | Float (Default: 0.9) |
block_size | Integer (Default: None) |
device_group | List[int] (Default: None) This setting is ignored for CUDA acceleration. |
QAIC hardware performance is configurable at model upload with the wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig)
. This provides additional hardware fine tuning. If no acceleration parameters are defined, the default values are applied.
wallaroo.engine_config.QaicConfig
takes the following parameters.
Parameters | Type | Description |
---|---|---|
num_cores | Integer (Default: 16 ) | Number of cores used to compile the model. |
num_devices | Integer (Default: 1 ) | Number of System-on-Chip (SoC) in a given card to compile the model for. |
ctx_len | Integer (Default: 128) | Maximum context that the compiled model remembers. |
prefill_seq_len | Integer | The length of the Prefill prompt. |
full_batch_size | Integer (Default: None ) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
mxfp6_matmul | Boolean (Default: False ) | Enable compilation for MXFP6 precision. |
mxint8_kv_cache | Boolean (Default: False ) | Compress Present/Past KV to MXINT8. |
aic_enable_depth_first | Boolean (Default: False ) | Enables DFS with default memory size. |
The following demonstrates uploading an LLM with QAIC acceleration.
import pyarrow as pa

import wallaroo.engine_config
import wallaroo.framework
from wallaroo.engine_config import Acceleration

# define the input and output parameters
input_schema = pa.schema([
pa.field('prompt', pa.string()),
pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('num_output_tokens', pa.int64())
])
# define the framework configuration. This is an **optional** step.
framework_config = wallaroo.framework.VLLMConfig(
max_num_seqs=16,
max_model_len=256,
max_seq_len_to_capture=128,
quantization="mxfp6",
kv_cache_dtype="mxint8",
gpu_memory_utilization=1,
block_size=32
)
# Set the QAIC acceleration parameters. This is an **optional** step
qaic_config = wallaroo.engine_config.QaicConfig(
num_devices=4,
full_batch_size=16,
ctx_len=256,
prefill_seq_len=128,
mxfp6_matmul=True,
mxint8_kv_cache=True
)
The following shows uploading the LLM with QAIC AI acceleration enabled without the acceleration configuration options.
llm = wl.upload_model(
"sample-model-name",
"sample-model-file.zip",
framework=wallaroo.framework.Framework.VLLM,
framework_config=framework_config,
input_schema=input_schema,
output_schema=output_schema,
accel=Acceleration.QAIC
)
The following demonstrates uploading the LLM with QAIC AI acceleration enabled with the acceleration configuration options.
llm = wl.upload_model(
"sample-model-name",
"sample-model-file.zip",
framework=wallaroo.framework.Framework.VLLM,
framework_config=framework_config,
input_schema=input_schema,
output_schema=output_schema,
accel=Acceleration.QAIC.with_config(qaic_config)
)
Models are uploaded via the Wallaroo MLOps API via the following endpoint:
/v1/api/models/upload_and_convert
This endpoint accepts the following parameters.
Field | Type | Description | |
---|---|---|---|
name | String (Required) | The model name. | |
visibility | String (Required) | Either public or private . | |
workspace_id | String (Required) | The numerical ID of the workspace to upload the model to. | |
conversion | String (Required) | The conversion parameters that include the following: | |
framework | String (Required) | The framework of the model being uploaded. For Native vLLM frameworks, this value is vllm | |
accel | String (Optional) OR Dict (Optional) | The AI accelerator used. If using qaic , this parameter is either a string to use the default parameters, or as a Dict for hardware acceleration parameters. | |
python_version | String (Required) | The version of Python required for the model. For Native vLLM frameworks, this value is 3.8 . | |
requirements | String (Required) | Required libraries. For Native vLLM frameworks, this value is [] . | |
framework_config | Dict (Optional) | The framework configuration. | |
input_schema | String (Optional) | The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for Containerized Wallaroo Runtime models. | |
output_schema | String (Optional) | The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for non-native runtime models. |
The framework_config
parameter accepts the following parameters.
Field | Type | ||
---|---|---|---|
config | Dict | The framework configuration values. The following subset are parameters of the config field. | |
max_num_seqs | Integer (Default: 256) | ||
max_model_len | Integer (Default: None) | ||
max_seq_len_to_capture | Integer (Default: 8192) | ||
quantization | (Default: None) | ||
kv_cache_dtype | (Default: 'auto' ) | ||
gpu_memory_utilization | Float (Default: 0.9) | ||
block_size | (Default: None) | ||
device_group | (Default: None) This setting is ignored for CUDA acceleration. | ||
framework | String | The framework of the framework_config type. For Native vLLM frameworks, this value is "vllm" . |
For QAIC, set `block_size=32`.
The optional acceleration configuration for qaic
includes the following parameters. If these parameters are not defined at model upload, the default values are applied.
Parameters | Type | Description |
---|---|---|
num_cores | Integer (Default: 16 ) | Number of cores used to compile the model. |
num_devices | Integer (Default: 1 ) | Number of System-on-Chip (SoC) in a given card to compile the model for. |
ctx_len | Integer (Default: 128) | Maximum context that the compiled model remembers. |
prefill_seq_len | Integer | The length of the Prefill prompt. |
full_batch_size | Integer (Default: None ) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
mxfp6_matmul | Boolean (Default: False ) | Enable compilation for MXFP6 precision. |
mxint8_kv_cache | Boolean (Default: False ) | Compress Present/Past KV to MXINT8. |
aic_enable_depth_first | Boolean (Default: False ) | Enables DFS with default memory size. |
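The `input_schema` and `output_schema` fields are the Apache Arrow schemas serialized and base64 encoded, as noted in the parameter table above. The following is a minimal sketch of producing those strings with pyarrow, assuming the same prompt/max_tokens and generated_text/num_output_tokens schemas used in this guide's SDK examples.

```python
import base64
import pyarrow as pa

input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64()),
])

# Serialize each schema to Arrow IPC bytes, then base64 encode for the API payload.
encoded_input_schema = base64.b64encode(bytes(input_schema.serialize())).decode("utf8")
encoded_output_schema = base64.b64encode(bytes(output_schema.serialize())).decode("utf8")
```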
The following example demonstrates uploading an LLM with QAIC acceleration enabled, with additional acceleration configuration options defined.
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-token-here>" \
-F 'metadata={"name": "<your model name here>", "visibility": "private", "workspace_id": <your workspace id>, "conversion": {"framework": "vllm", "framework_config": {"framework": "vllm", "config":{"max_num_seqs": 16, "max_model_len": 256, "max_seq_len_to_capture": 128, "quantization": "mxfp6", "kv_cache_dtype": "mxint8", "gpu_memory_utilization": 1, "block_size": 32}}, "accel": {"qaic":{"num_devices":4,"full_batch_size": 16, "ctx_len": 256, "prefill_seq_len": 128, "mxfp6_matmul":true,"mxint8_kv_cache":true}}, "python_version": "3.8", "requirements": []}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@<your llm file here>;type=application/octet-stream" \
https://qaic.example.wallaroo.ai/v1/api/models/upload_and_convert | cat
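The multipart request above can also be composed in Python for readability. The endpoint, metadata fields, and accel options are as documented above; the use of the requests library, the placeholder token, workspace id, and file path are assumptions for illustration.

```python
import json
import requests

# Hypothetical values - replace with your own deployment details.
base_url = "https://qaic.example.wallaroo.ai"
token = "<your-token-here>"

metadata = {
    "name": "<your model name here>",
    "visibility": "private",
    "workspace_id": 1,  # your numerical workspace id
    "conversion": {
        "framework": "vllm",
        "framework_config": {
            "framework": "vllm",
            "config": {
                "max_num_seqs": 16,
                "max_model_len": 256,
                "max_seq_len_to_capture": 128,
                "quantization": "mxfp6",
                "kv_cache_dtype": "mxint8",
                "gpu_memory_utilization": 1,
                "block_size": 32,
            },
        },
        "accel": {
            "qaic": {
                "num_devices": 4,
                "full_batch_size": 16,
                "ctx_len": 256,
                "prefill_seq_len": 128,
                "mxfp6_matmul": True,
                "mxint8_kv_cache": True,
            },
        },
        "python_version": "3.8",
        "requirements": [],
    },
    "input_schema": encoded_input_schema,    # from the base64 encoding sketch above
    "output_schema": encoded_output_schema,
}

# Send the metadata as a JSON form field and the LLM file as an octet-stream part.
with open("<your llm file here>", "rb") as f:
    response = requests.post(
        f"{base_url}/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {token}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": ("model.zip", f, "application/octet-stream"),
        },
    )
print(response.json())
```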
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model
. The retrieved model version is then used for deployment.
# Retrieve the model
vllm_model = wl.get_model("your-model-name")
Deploying LLMs with QAIC acceleration has the following steps.
* Define the deployment configuration, including the `accel` parameters - in this case, `QAIC`.
* The `gpu` value is the number of System-on-Chips (SoCs) to use.

The deployment configuration sets what resources are allocated for the LLM. For this example, the LLM is allocated the following:

* CPUs: 4
* Memory: 12Gi
* GPUs: 4 (the `gpu` parameter specifies the number of SoCs allocated)

deployment_config = DeploymentConfigBuilder() \
.replica_autoscale_min_max(minimum=1, maximum=2) \
.cpus(1).memory('1Gi') \
.sidekick_cpus(vllm_model, 4) \
.sidekick_memory(vllm_model, '12Gi') \
.sidekick_gpus(vllm_model, 4) \
.deployment_label("kubernetes.io/os:linux") \
.build()
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline
method. Pipeline steps determine how inference data is provided to the LLM.
The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.
# create the pipeline
vllm_pipeline = wl.build_pipeline('sample-vllm-pipeline')
# add the LLM as a pipeline model step
vllm_pipeline.add_model_step(vllm_model)
With the Deployment Configuration defined and the pipeline ready, the pipeline is deployed with the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig])
method. This allocates resources from the cluster for the deployment based on the DeploymentConfig
settings. If the resources requested are not available at deployment, an error is returned.
The following example demonstrates deploying the pipeline with the previously defined deployment configuration.
vllm_pipeline.deploy(deployment_config=deployment_config)
Once the deployment configuration is complete, the pipeline is ready to accept inference requests.
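With the pipeline deployed, the following is a minimal inference sketch, assuming a pandas DataFrame whose columns match the prompt and max_tokens input schema defined at upload.

```python
import pandas as pd

# Build an input DataFrame matching the prompt / max_tokens input schema.
data = pd.DataFrame({
    "prompt": ["Describe what roland garros is"],
    "max_tokens": [128],
})

# Run the inference through the deployed pipeline and view the results.
result = vllm_pipeline.infer(data)
print(result)
```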
Wallaroo Custom Models include the following artifacts.
Artifact | Type | Description |
---|---|---|
Python interface aka .py scripts with classes that extend mac.inference.AsyncInference and mac.inference.creation.InferenceBuilder | Python Script | Extend the classes mac.inference.Inference and mac.inference.creation.InferenceBuilder . These are included with the Wallaroo SDK. Note that there are no specified naming requirements for the classes that extend mac.inference.AsyncInference and mac.inference.creation.InferenceBuilder - any qualified class name is sufficient as long as these two classes are extended as defined below. |
requirements.txt | Python requirements file | This sets the Python libraries used for the Custom Model. These libraries should be targeted for Python 3.10 compliance. These requirements and the versions of libraries should be exactly the same between creating the model and deploying it in Wallaroo. This ensures that the script and methods will function exactly the same as during the model creation process. |
Other artifacts | Files | Other models, files, and other artifacts used in support of this model. |
Custom vLLM Runtime implementations in Wallaroo extend the Wallaroo SDK mac.inference.Inference
and mac.inference.creation.InferenceBuilder
. For Continuous Batching leveraging a custom vLLM runtime implementation, the following additions are required:
In the requirements.txt
file, the vllm
library must be included. For optimal performance in Wallaroo, use the version specified below.
vllm==0.6.6
Import the following libraries into the Python script that extends the mac.inference.Inference
and mac.inference.creation.InferenceBuilder
:
from vllm import AsyncLLMEngine, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
The class that extends InferenceBuilder must also implement the following to support continuous batching configurations:

* `def inference(self) -> AsyncVLLMInference`: Specifies the Inference instance used by `create`.
* `def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference`: Creates the inference subclass and specifies the vLLM used with the inference requests.

The following shows an example of extending `inference` and `create` for `AsyncVLLMInference`.
# vllm import libraries
from vllm import AsyncLLMEngine, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
class AsyncVLLMInferenceBuilder(InferenceBuilder):
"""Inference builder class for AsyncVLLMInference."""
def inference(self) -> AsyncVLLMInference: # extend mac.inference.AsyncInference
"""Returns an Inference subclass instance.
This specifies the Inference instance to be used
by create() to build additionally needed components."""
return AsyncVLLMInference()
def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference:
"""Creates an Inference subclass and assigns a model to it.
:param config: Inference configuration
:return: Inference subclass
"""
        inference = self.inference()
inference.model = AsyncLLMEngine.from_engine_args(
AsyncEngineArgs(
model=(config.model_path / "model").as_posix(),
),
)
return inference
LLMs in the custom vLLM framework are uploaded either via the Wallaroo SDK or the Wallaroo MLOps API.
LLMs are uploaded to Wallaroo via the Wallaroo SDK using the method wallaroo.client.Client.upload_model
.
The method wallaroo.client.Client.upload_model
takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework.Framework . For Custom vLLM, this framework is wallaroo.framework.Framework.CUSTOM . |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. |
framework_config | wallaroo.framework.CustomConfig (Optional) | Sets the Custom vLLM framework configuration options. |
accel | wallaroo.engine_config.Acceleration.QAIC (Required) OR wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig) (Optional) | The AI hardware accelerator used. Submitting with the with_config(QaicConfig) parameters overrides the hardware performance defaults. |
convert_wait | bool (Optional) | Whether to wait for the model conversion to complete before returning. |
wallaroo.framework.CustomConfig
contains the following parameters.
Parameters | Type |
---|---|
max_num_seqs | Integer (Default: 256) |
max_model_len | Integer (Default: None) |
max_seq_len_to_capture | Integer (Default: 8192) |
quantization | String (Default: None) |
kv_cache_dtype | String (Default: 'auto') |
gpu_memory_utilization | Float (Default: 0.9) |
block_size | Integer (Default: None) |
device_group | List[int] (Default: None) |
For QAIC, set `block_size=32`.
QAIC hardware performance is configurable at model upload with the wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig)
. This provides additional hardware fine tuning. If no acceleration configuration is defined, the default values are applied.
wallaroo.engine_config.QaicConfig
takes the following parameters.
Parameters | Type | Description |
---|---|---|
num_cores | Integer (Default: 16 ) | Number of cores used to compile the model. |
num_devices | Integer (Default: 1 ) | Number of System-on-Chip (SoC) in a given card to compile the model for. |
ctx_len | Integer (Default: 128) | Maximum context that the compiled model remembers. |
prefill_seq_len | Integer | The length of the Prefill prompt. |
full_batch_size | Integer (Default: None ) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
mxfp6_matmul | Boolean (Default: False ) | Enable compilation for MXFP6 precision. |
mxint8_kv_cache | Boolean (Default: False ) | Compress Present/Past KV to MXINT8. |
aic_enable_depth_first | Boolean (Default: False ) | Enables DFS with default memory size. |
The following demonstrates uploading an LLM with QAIC acceleration.
import pyarrow as pa

import wallaroo.engine_config
import wallaroo.framework
from wallaroo.engine_config import Acceleration

# define the input and output parameters
input_schema = pa.schema([
pa.field('prompt', pa.string()),
pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('num_output_tokens', pa.int64())
])
# define the framework configuration. This is an **optional** step.
framework_config = wallaroo.framework.CustomConfig(
max_num_seqs=16,
max_model_len=256,
max_seq_len_to_capture=128,
quantization="mxfp6",
kv_cache_dtype="mxint8",
gpu_memory_utilization=1,
block_size=32
)
# Set the QAIC acceleration parameters. This is an **optional** step
# If acceleration configuration is not defined, the default values are used
qaic_config = wallaroo.engine_config.QaicConfig(
num_devices=4,
full_batch_size=16,
ctx_len=256,
prefill_seq_len=128,
mxfp6_matmul=True,
mxint8_kv_cache=True
)
The following shows uploading the LLM with QAIC AI acceleration enabled without the acceleration configuration options.
vllm = wl.upload_model(
"sample-model-name",
"sample-model-file.zip",
framework=wallaroo.framework.Framework.CUSTOM,
framework_config=framework_config,
input_schema=input_schema,
output_schema=output_schema,
accel=Acceleration.QAIC
)
The following demonstrates uploading the LLM with QAIC AI acceleration enabled with the acceleration configuration options.
llm = wl.upload_model(
"sample-model-name",
"sample-model-file.zip",
framework=wallaroo.framework.Framework.CUSTOM,
framework_config=framework_config,
input_schema=input_schema,
output_schema=output_schema,
accel=Acceleration.QAIC.with_config(qaic_config)
)
Models are uploaded via the Wallaroo MLOps API via the following endpoint:
/v1/api/models/upload_and_convert
This endpoint accepts the following parameters.
Field | Type | Description | |
---|---|---|---|
name | String (Required) | The model name. | |
visibility | String (Required) | Either public or private . | |
workspace_id | String (Required) | The numerical ID of the workspace to upload the model to. | |
conversion | String (Required) | The conversion parameters that include the following: | |
framework | String (Required) | The framework of the model being uploaded. For Custom vLLM frameworks, this value is custom | |
accel | String (Optional) OR Dict (Optional) | The AI accelerator used. If using qaic , this parameter is either a string to use the default parameters, or as a Dict for hardware acceleration parameters. | |
python_version | String (Required) | The version of Python required for the model. For Custom vLLM frameworks, this value is 3.8 . | |
requirements | String (Required) | Required libraries. For Custom vLLM frameworks, this value is [] . | |
framework_config | Dict (Optional) | The framework configuration. | |
input_schema | String (Optional) | The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . | |
output_schema | String (Optional) | The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . |
The framework_config
parameter accepts the following parameters.
Field | Type | ||
---|---|---|---|
config | Dict | The framework configuration values. The following subset are parameters of the config field. | |
max_num_seqs | Integer (Default: 256) | ||
max_model_len | Integer (Default: None) | ||
max_seq_len_to_capture | Integer (Default: 8192) | ||
quantization | (Default: None) | ||
kv_cache_dtype | (Default: 'auto' ) | ||
gpu_memory_utilization | Float (Default: 0.9) | ||
block_size | (Default: None) | ||
device_group | (Default: None) | This setting is ignored for CUDA acceleration. | |
framework | String | The framework of the framework_config type. For Custom vLLM frameworks, this value is "custom" . |
For QAIC, set `block_size=32`.
The optional acceleration configuration for qaic
includes the following parameters. If these parameters are not defined at model upload, the default values are applied.
Parameters | Type | Description |
---|---|---|
num_cores | Integer (Default: 16 ) | Number of cores used to compile the model. |
num_devices | Integer (Default: 1 ) | Number of System-on-Chip (SoC) in a given card to compile the model for. |
ctx_len | Integer (Default: 128) | Maximum context that the compiled model remembers. |
prefill_seq_len | Integer | The length of the Prefill prompt. |
full_batch_size | Integer (Default: None ) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
mxfp6_matmul | Boolean (Default: False ) | Enable compilation for MXFP6 precision. |
mxint8_kv_cache | Boolean (Default: False ) | Compress Present/Past KV to MXINT8. |
aic_enable_depth_first | Boolean (Default: False ) | Enables DFS with default memory size. |
The following example demonstrates uploading an LLM with the Custom vLLM framework and QAIC acceleration enabled, with additional acceleration configuration options defined.
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-token-here>" \
-F 'metadata={"name": "<your model name here>", "visibility": "private", "workspace_id": <your workspace id>, "conversion": {"framework": "custom", "framework_config": {"framework": "custom", "config":{"max_num_seqs": 16, "max_model_len": 256, "max_seq_len_to_capture": 128, "quantization": "mxfp6", "kv_cache_dtype": "mxint8", "gpu_memory_utilization": 1}}, "accel": {"qaic":{"num_devices":4,"full_batch_size": 16, "ctx_len": 256, "prefill_seq_len": 128, "mxfp6_matmul":true,"mxint8_kv_cache":true}}, "python_version": "3.8", "requirements": []}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@<your llm file here>;type=application/octet-stream" \
https://qaic.example.wallaroo.ai/v1/api/models/upload_and_convert | cat
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model
. The retrieved model version is then used for deployment.
# Retrieve the model
custom_vllm_model = wl.get_model("your-model-name")
Deploying LLMs with QAIC acceleration has the following steps.
* Define the deployment configuration, including the `accel` parameters - in this case, `QAIC`.
* The `gpu` value is the number of System-on-Chips (SoCs) to use.

The deployment configuration sets what resources are allocated for the LLM. For this example, the LLM is allocated the following:

* CPUs: 4
* Memory: 12Gi
* GPUs: 4 (the `gpu` parameter specifies the number of SoCs allocated)

deployment_config = DeploymentConfigBuilder() \
.replica_autoscale_min_max(minimum=1, maximum=2) \
.cpus(1).memory('1Gi') \
.sidekick_cpus(custom_vllm_model, 4) \
.sidekick_memory(custom_vllm_model, '12Gi') \
.sidekick_gpus(custom_vllm_model, 4) \
.deployment_label("kubernetes.io/os:linux") \
.build()
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline
method. Pipeline steps determine how inference data is provided to the LLM.
The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.
# create the pipeline
vllm_pipeline = wl.build_pipeline('sample-vllm-pipeline')
# add the LLM as a pipeline model step
vllm_pipeline.add_model_step(custom_vllm_model)
With the Deployment Configuration defined and the pipeline ready, the pipeline is deployed with the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig])
method. This allocates resources from the cluster for the deployment based on the DeploymentConfig
settings. If the resources requested are not available at deployment, an error is returned.
The following example demonstrates deploying the pipeline with the previously defined deployment configuration.
vllm_pipeline.deploy(deployment_config=deployment_config)
Once the deployment configuration is complete, the pipeline is ready to accept inference requests.
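The deployed Custom vLLM pipeline accepts inference requests in the same way; the following is a brief sketch assuming a pandas DataFrame matching the upload schema.

```python
import pandas as pd

# Input columns match the prompt / max_tokens input schema defined at upload.
data = pd.DataFrame({
    "prompt": ["Summarize the rules of tennis"],
    "max_tokens": [128],
})

result = vllm_pipeline.infer(data)
print(result)
```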