Wallaroo supports Qualcomm QAIC, providing high-performance, x86-compatible processing with AI acceleration at low power cost. This improves LLM performance while reducing energy requirements.
Wallaroo supports vLLM with QAIC acceleration across two different autopackaging scenarios:

wallaroo.framework.Framework.VLLM
: Native async vLLM implementations in Wallaroo compatible with NVIDIA CUDA.

wallaroo.framework.Framework.CUSTOM
: Custom async vLLM implementations in Wallaroo using BYOP (Bring Your Own Predict) provide greater flexibility through a lightweight Python interface.

For access to these sample models and a demonstration on using LLMs with Wallaroo:
QAIC AI Acceleration delivers an x86-compatible architecture with AI acceleration at low power cost. The following Wallaroo features are supported for LLMs deployed in Wallaroo with QAIC AI acceleration:
QAIC acceleration applied to LLMs has the following exceptions:

* If the model is uploaded with the acceleration set to `QAIC`, but the architecture is set to an incompatible architecture (aka anything other than `X86`), the following error is returned: "The specified model optimization configuration is not available. Please try this operation again using a different configuration or contact Wallaroo at support@wallaroo.ai for questions or help."
* If an acceleration other than `QAIC` is specified elsewhere for a model uploaded with `QAIC`, the conflicting setting is ignored.

When supplying an acceleration configuration with `wallaroo.engine_config.Acceleration.QAIC`, the `wallaroo.engine_config.Acceleration.QAIC.with_config` parameter must be set to a `wallaroo.engine_config.QaicConfig`. If any other acceleration config type is used, the model uploads and uses the `QAIC` acceleration parameters, but the `accel_config` options default to the `wallaroo.engine_config.QaicConfig` default values.

The following compatibility matrix shows the framework configuration options, highlighting which are compatible with QAIC.
QaicConfig Parameter | QaicConfig Data Type | QaicConfig Value | Wallaroo VLLMConfig Parameter | VLLMConfig Data Type | Compatibility with Framework.VLLM + Acceleration.CUDA | Compatibility with Framework.VLLM + Acceleration.QAIC |
---|---|---|---|---|---|---|
mxfp6_matmul (quantization) | bool | True | quantization="mxfp6" | optional str (i.e. str or None) | ❌ (only supported in qaic-vllm patch) | √ |
mxfp6_matmul (quantization) | bool | False | None | optional str (i.e. str or None) | √ | √ |
mxint8_kv_cache (kv cache) | bool | True | kv_cache_dtype="mxint8" | str | ❌ (only supported in qaic-vllm patch) | √ |
mxint8_kv_cache (kv cache) | bool | False | “auto” | str | √ | √ |
full_batch_size | int | Any | max_num_seqs | int | √ | √ (for most scenarios) |
ctx_len | int | Any | max_model_len | int | √ | √ (for most scenarios) |
prefill_seq_len | int | Any | max_seq_len_to_capture | int | √ | √ (for most scenarios) |
num_devices | int | Any | device_group | List[int] | ❌ (only supported in qaic-vllm patch) | ⚠️ (length of list needs to be <= num_devices) |
Legend:

* √ : Supported.
* ❌ : Not supported.
* ⚠️ : Supported with restrictions (see the note in the cell).
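To illustrate the mapping above, the following sketch pairs a `QaicConfig` with the `VLLMConfig` options it corresponds to. The parameter names follow the matrix; the values are illustrative only and mirror the upload examples later in this guide.

```python
from wallaroo.engine_config import QaicConfig
from wallaroo.framework import VLLMConfig

# QaicConfig options (left-hand columns of the matrix)...
qaic_config = QaicConfig(
    mxfp6_matmul=True,       # corresponds to quantization="mxfp6"
    mxint8_kv_cache=True,    # corresponds to kv_cache_dtype="mxint8"
    full_batch_size=16,      # corresponds to max_num_seqs=16
    ctx_len=256,             # corresponds to max_model_len=256
    prefill_seq_len=128,     # corresponds to max_seq_len_to_capture=128
    num_devices=4,           # device_group length must be <= num_devices
)

# ...and the VLLMConfig options they line up with (right-hand columns of the matrix).
framework_config = VLLMConfig(
    quantization="mxfp6",
    kv_cache_dtype="mxint8",
    max_num_seqs=16,
    max_model_len=256,
    max_seq_len_to_capture=128,
    block_size=32,           # for QAIC, block_size=32 per the framework configuration notes
)
```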
vLLM AI acceleration settings are applied at model upload via the Wallaroo SDK. The general steps include the following:

* Set the `accel` parameter to `wallaroo.engine_config.Acceleration.QAIC`.
* Optionally fine tune the hardware settings by providing a `wallaroo.engine_config.QaicConfig` via `with_config`.

QAIC acceleration is applied to LLMs in Wallaroo at model upload through the Wallaroo SDK or the Wallaroo MLOps API.
LLMs are uploaded to Wallaroo via the Wallaroo SDK using the method wallaroo.client.Client.upload_model
.
The method wallaroo.client.Client.upload_model
takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework.Framework . For native vLLM, this framework is wallaroo.framework.Framework.VLLM . |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. |
framework_config | wallaroo.framework.VLLMConfig (Optional) | Sets the vLLM framework configuration options. |
accel | wallaroo.engine_config.Acceleration.QAIC (Required) OR wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig) (Optional) | The AI hardware accelerator used. Submitting with the with_config(QaicConfig) parameters overrides the hardware performance defaults. |
convert_wait | bool (Optional) | Whether to wait for the model conversion to complete before returning. |
wallaroo.framework.VLLMConfig
contains the following parameters. If no framework configuration is defined, then the default values are applied.
For QAIC, set `block_size=32`.
Parameters | Type |
---|---|
max_num_seqs | Integer (Default: 256) |
max_model_len | Integer (Default: None) |
max_seq_len_to_capture | Integer (Default: 8192) |
quantization | String (Default: None) |
kv_cache_dtype | String (Default: 'auto') |
gpu_memory_utilization | Float (Default: 0.9) |
block_size | Integer (Default: None) |
device_group | List[int] (Default: None) This setting is ignored for CUDA acceleration. |
QAIC hardware performance is configurable at model upload with the wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig)
. This provides additional hardware fine tuning. If no acceleration parameters are defined, the default values are applied.
wallaroo.engine_config.QaicConfig
takes the following parameters.
Parameters | Type | Description |
---|---|---|
num_cores | Integer (Default: 16 ) | Number of cores used to compile the model. |
num_devices | Integer (Default: 1 ) | Number of System-on-Chip (SoC) in a given card to compile the model for. |
ctx_len | Integer (Default: 128) | Maximum context that the compiled model remembers. |
prefill_seq_len | Integer | The length of the Prefill prompt. |
full_batch_size | Integer (Default: None ) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
mxfp6_matmul | Boolean (Default: False ) | Enable compilation for MXFP6 precision. |
mxint8_kv_cache | Boolean (Default: False ) | Compress Present/Past KV to MXINT8. |
aic_enable_depth_first | Boolean (Default: False ) | Enables DFS with default memory size. |
The following demonstrates uploading an LLM with QAIC acceleration.
import pyarrow as pa

import wallaroo.engine_config
import wallaroo.framework
from wallaroo.engine_config import Acceleration

# define the input and output parameters
input_schema = pa.schema([
pa.field('prompt', pa.string()),
pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('num_output_tokens', pa.int64())
])
# define the framework configuration. This is an **optional** step.
framework_config = wallaroo.framework.VLLMConfig(
max_num_seqs=16,
max_model_len=256,
max_seq_len_to_capture=128,
quantization="mxfp6",
kv_cache_dtype="mxint8",
gpu_memory_utilization=1,
block_size=32
)
# Set the QAIC acceleration parameters. This is an **optional** step
qaic_config = wallaroo.engine_config.QaicConfig(
num_devices=4,
full_batch_size=16,
ctx_len=256,
prefill_seq_len=128,
mxfp6_matmul=True,
mxint8_kv_cache=True
)
The following shows uploading the LLM with QAIC AI acceleration enabled without the acceleration configuration options.
llm = wl.upload_model(
"sample-model-name",
"sample-model-file.zip",
framework=wallaroo.framework.Framework.VLLM,
framework_config=framework_config,
input_schema=input_schema,
output_schema=output_schema,
accel=Acceleration.QAIC
)
The following demonstrates uploading the LLM with QAIC AI acceleration enabled with the acceleration configuration options.
llm = wl.upload_model(
"sample-model-name",
"sample-model-file.zip",
framework=wallaroo.framework.Framework.VLLM,
framework_config=framework_config,
input_schema=input_schema,
output_schema=output_schema,
accel=Acceleration.QAIC.with_config(qaic_config)
)
Models are uploaded via the Wallaroo MLOps API via the following endpoint:
/v1/api/models/upload_and_convert
This endpoint accepts the following parameters.
Field | Type | Description | |
---|---|---|---|
name | String (Required) | The model name. | |
visibility | String (Required) | Either public or private . | |
workspace_id | String (Required) | The numerical ID of the workspace to upload the model to. | |
conversion | String (Required) | The conversion parameters that include the following: | |
framework | String (Required) | The framework of the model being uploaded. For Native vLLM frameworks, this value is vllm | |
accel | String (Optional) OR Dict (Optional) | The AI accelerator used. If using qaic , this parameter is either a string to use the default parameters, or as a Dict for hardware acceleration parameters. | |
python_version | String (Required) | The version of Python required for the model. For Native vLLM frameworks, this value is 3.8 . | |
requirements | String (Required) | Required libraries. For Native vLLM frameworks, this value is [] . | |
framework_config | Dict (Optional) | The framework configuration. | |
input_schema | String (Optional) | The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for Containerized Wallaroo Runtime models. | |
output_schema | String (Optional) | The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for non-native runtime models. |
The framework_config
parameter accepts the following parameters.
Field | Type | ||
---|---|---|---|
config | Dict | The framework configuration values. The following subset are parameters of the config field. | |
max_num_seqs | Integer (Default: 256) | ||
max_model_len | Integer (Default: None) | ||
max_seq_len_to_capture | Integer (Default: 8192) | ||
quantization | (Default: None) | ||
kv_cache_dtype | (Default: 'auto' ) | ||
gpu_memory_utilization | Float (Default: 0.9) | ||
block_size | (Default: None) | ||
device_group | (Default: None) This setting is ignored for CUDA acceleration. | ||
framework | String | The framework of the framework_config type. For Native vLLM frameworks, this value is "vllm" . |
For QAIC, set `block_size=32`.
The optional acceleration configuration for qaic
includes the following parameters. If these parameters are not defined at model upload, the default values are applied.
Parameters | Type | Description |
---|---|---|
num_cores | Integer (Default: 16 ) | Number of cores used to compile the model. |
num_devices | Integer (Default: 1 ) | Number of System-on-Chip (SoC) in a given card to compile the model for. |
ctx_len | Integer (Default: 128) | Maximum context that the compiled model remembers. |
prefill_seq_len | Integer | The length of the Prefill prompt. |
full_batch_size | Integer (Default: None ) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
mxfp6_matmul | Boolean (Default: False ) | Enable compilation for MXFP6 precision. |
mxint8_kv_cache | Boolean (Default: False ) | Compress Present/Past KV to MXINT8. |
aic_enable_depth_first | Boolean (Default: False ) | Enables DFS with default memory size. |
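The `input_schema` and `output_schema` fields are the Apache Arrow schemas serialized and base64 encoded, as noted in the parameter table above. The following is a minimal sketch of producing those strings with pyarrow, assuming the same prompt/max_tokens and generated_text/num_output_tokens schemas used in this guide's SDK examples.

```python
import base64
import pyarrow as pa

input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64()),
])

# Serialize each schema to Arrow IPC bytes, then base64 encode for the API payload.
encoded_input_schema = base64.b64encode(bytes(input_schema.serialize())).decode("utf8")
encoded_output_schema = base64.b64encode(bytes(output_schema.serialize())).decode("utf8")
```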
The following example demonstrates uploading an LLM with QAIC acceleration enabled, with additional acceleration configuration options defined.
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-token-here>" \
-F 'metadata={"name": "<your model name here>", "visibility": "private", "workspace_id": <your workspace id>, "conversion": {"framework": "vllm", "framework_config": {"framework": "vllm", "config":{"max_num_seqs": 16, "max_model_len": 256, "max_seq_len_to_capture": 128, "quantization": "mxfp6", "kv_cache_dtype": "mxint8", "gpu_memory_utilization": 1, "block_size": 32}}, "accel": {"qaic":{"num_devices":4,"full_batch_size": 16, "ctx_len": 256, "prefill_seq_len": 128, "mxfp6_matmul":true,"mxint8_kv_cache":true}}, "python_version": "3.8", "requirements": []}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@<your llm file here>;type=application/octet-stream" \
https://qaic.example.wallaroo.ai/v1/api/models/upload_and_convert | cat
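The multipart request above can also be composed in Python for readability. The endpoint, metadata fields, and accel options are as documented above; the use of the requests library, the placeholder token, workspace id, and file path are assumptions for illustration.

```python
import json
import requests

# Hypothetical values - replace with your own deployment details.
base_url = "https://qaic.example.wallaroo.ai"
token = "<your-token-here>"

metadata = {
    "name": "<your model name here>",
    "visibility": "private",
    "workspace_id": 1,  # your numerical workspace id
    "conversion": {
        "framework": "vllm",
        "framework_config": {
            "framework": "vllm",
            "config": {
                "max_num_seqs": 16,
                "max_model_len": 256,
                "max_seq_len_to_capture": 128,
                "quantization": "mxfp6",
                "kv_cache_dtype": "mxint8",
                "gpu_memory_utilization": 1,
                "block_size": 32,
            },
        },
        "accel": {
            "qaic": {
                "num_devices": 4,
                "full_batch_size": 16,
                "ctx_len": 256,
                "prefill_seq_len": 128,
                "mxfp6_matmul": True,
                "mxint8_kv_cache": True,
            },
        },
        "python_version": "3.8",
        "requirements": [],
    },
    "input_schema": encoded_input_schema,    # from the base64 encoding sketch above
    "output_schema": encoded_output_schema,
}

# Send the metadata as a JSON form field and the LLM file as an octet-stream part.
with open("<your llm file here>", "rb") as f:
    response = requests.post(
        f"{base_url}/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {token}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": ("model.zip", f, "application/octet-stream"),
        },
    )
print(response.json())
```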
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model
. The retrieved model version is then used for deployment.
# Retrieve the model
vllm_model = wl.get_model("your-model-name")
Deploying LLMs with QAIC acceleration has the following steps.
* Define the deployment configuration, including the `accel` parameters - in this case, `QAIC`.
* The `gpu` value is the number of System-on-Chips (SoCs) to use.

The deployment configuration sets what resources are allocated for the LLM. For this example, the LLM is allocated the following:

* CPUs: 4
* Memory: 12Gi
* GPUs: 4 (the `gpu` parameter specifies the number of SoCs allocated)

deployment_config = DeploymentConfigBuilder() \
.replica_autoscale_min_max(minimum=1, maximum=2) \
.cpus(1).memory('1Gi') \
.sidekick_cpus(vllm_model, 4) \
.sidekick_memory(vllm_model, '12Gi') \
.sidekick_gpus(vllm_model, 4) \
.deployment_label("kubernetes.io/os:linux") \
.build()
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline
method. Pipeline steps determine how inference data is provided to the LLM.
The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.
# create the pipeline
vllm_pipeline = wl.build_pipeline('sample-vllm-pipeline')
# add the LLM as a pipeline model step
vllm_pipeline.add_model_step(vllm_model)
With the Deployment Configuration defined and the pipeline ready, the pipeline is deployed with the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig])
method. This allocates resources from the cluster for the deployment based on the DeploymentConfig
settings. If the resources requested are not available at deployment, an error is returned.
The following example demonstrates deploying the pipeline with the previously defined deployment configuration.
vllm_pipeline.deploy(deployment_config=deployment_config)
Once the deployment configuration is complete, the pipeline is ready to accept inference requests.
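With the pipeline deployed, the following is a minimal inference sketch, assuming a pandas DataFrame whose columns match the prompt and max_tokens input schema defined at upload.

```python
import pandas as pd

# Build an input DataFrame matching the prompt / max_tokens input schema.
data = pd.DataFrame({
    "prompt": ["Describe what roland garros is"],
    "max_tokens": [128],
})

# Run the inference through the deployed pipeline and view the results.
result = vllm_pipeline.infer(data)
print(result)
```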
Wallaroo Custom Models include the following artifacts.
Artifact | Type | Description |
---|---|---|
Python interface aka .py scripts with classes that extend mac.inference.AsyncInference and mac.inference.creation.InferenceBuilder | Python Script | Extend the classes mac.inference.Inference and mac.inference.creation.InferenceBuilder . These are included with the Wallaroo SDK. Note that there are no specified naming requirements for the classes that extend mac.inference.AsyncInference and mac.inference.creation.InferenceBuilder - any qualified class name is sufficient as long as these two classes are extended as defined below. |
requirements.txt | Python requirements file | This sets the Python libraries used for the Custom Model. These libraries should be targeted for Python 3.10 compliance. These requirements and the versions of libraries should be exactly the same between creating the model and deploying it in Wallaroo. This ensures that the script and methods will function exactly the same as during the model creation process. |
Other artifacts | Files | Other models, files, and other artifacts used in support of this model. |
Custom vLLM Runtime implementations in Wallaroo extend the Wallaroo SDK mac.inference.Inference
and mac.inference.creation.InferenceBuilder
. For Continuous Batching leveraging a custom vLLM runtime implementation, the following additions are required:
In the requirements.txt
file, the vllm
library must be included. For optimal performance in Wallaroo, use the version specified below.
vllm==0.6.6
Import the following libraries into the Python script that extends the mac.inference.Inference
and mac.inference.creation.InferenceBuilder
:
from vllm import AsyncLLMEngine, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
The class that extends InferenceBuilder must also implement the following to support continuous batching configurations:

* `def inference(self) -> AsyncVLLMInference`: Specifies the Inference instance used by `create`.
* `def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference`: Creates the inference subclass and specifies the vLLM used with the inference requests.

The following shows an example of extending `inference` and `create` for `AsyncVLLMInference`.
# vllm import libraries
from vllm import AsyncLLMEngine, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
class AsyncVLLMInferenceBuilder(InferenceBuilder):
"""Inference builder class for AsyncVLLMInference."""
def inference(self) -> AsyncVLLMInference: # extend mac.inference.AsyncInference
"""Returns an Inference subclass instance.
This specifies the Inference instance to be used
by create() to build additionally needed components."""
return AsyncVLLMInference()
def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference:
"""Creates an Inference subclass and assigns a model to it.
:param config: Inference configuration
:return: Inference subclass
"""
        inference = self.inference()
inference.model = AsyncLLMEngine.from_engine_args(
AsyncEngineArgs(
model=(config.model_path / "model").as_posix(),
),
)
return inference
LLMs in the custom vLLM framework are uploaded either via the Wallaroo SDK or the Wallaroo MLOps API.
LLMs are uploaded to Wallaroo via the Wallaroo SDK using the method wallaroo.client.Client.upload_model
.
The method wallaroo.client.Client.upload_model
takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework.Framework . For Custom vLLM, this framework is wallaroo.framework.Framework.CUSTOM . |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. |
framework_config | wallaroo.framework.CustomConfig (Optional) | Sets the Custom vLLM framework configuration options. |
accel | wallaroo.engine_config.Acceleration.QAIC (Required) OR wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig) (Optional) | The AI hardware accelerator used. Submitting with the with_config(QaicConfig) parameters overrides the hardware performance defaults. |
convert_wait | bool (Optional) | Whether to wait for the model conversion to complete before returning. |
wallaroo.framework.CustomConfig
contains the following parameters.
Parameters | Type |
---|---|
max_num_seqs | Integer (Default: 256) |
max_model_len | Integer (Default: None) |
max_seq_len_to_capture | Integer (Default: 8192) |
quantization | String (Default: None) |
kv_cache_dtype | String (Default: 'auto') |
gpu_memory_utilization | Float (Default: 0.9) |
block_size | Integer (Default: None) |
device_group | List[int] (Default: None) |
For QAIC, set `block_size=32`.
QAIC hardware performance is configurable at model upload with the wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig)
. This provides additional hardware fine tuning. If no acceleration configuration is defined, the default values are applied.
wallaroo.engine_config.QaicConfig
takes the following parameters.
Parameters | Type | Description |
---|---|---|
num_cores | Integer (Default: 16 ) | Number of cores used to compile the model. |
num_devices | Integer (Default: 1 ) | Number of System-on-Chip (SoC) in a given card to compile the model for. |
ctx_len | Integer (Default: 128) | Maximum context that the compiled model remembers. |
prefill_seq_len | Integer | The length of the Prefill prompt. |
full_batch_size | Integer (Default: None ) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
mxfp6_matmul | Boolean (Default: False ) | Enable compilation for MXFP6 precision. |
mxint8_kv_cache | Boolean (Default: False ) | Compress Present/Past KV to MXINT8. |
aic_enable_depth_first | Boolean (Default: False ) | Enables DFS with default memory size. |
The following demonstrates uploading an LLM with QAIC acceleration.
import pyarrow as pa

import wallaroo.engine_config
import wallaroo.framework
from wallaroo.engine_config import Acceleration

# define the input and output parameters
input_schema = pa.schema([
pa.field('prompt', pa.string()),
pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('num_output_tokens', pa.int64())
])
# define the framework configuration. This is an **optional** step.
framework_config = wallaroo.framework.CustomConfig(
max_num_seqs=16,
max_model_len=256,
max_seq_len_to_capture=128,
quantization="mxfp6",
kv_cache_dtype="mxint8",
gpu_memory_utilization=1,
block_size=32
)
# Set the QAIC acceleration parameters. This is an **optional** step
# If acceleration configuration is not defined, the default values are used
qaic_config = wallaroo.engine_config.QaicConfig(
num_devices=4,
full_batch_size=16,
ctx_len=256,
prefill_seq_len=128,
mxfp6_matmul=True,
mxint8_kv_cache=True
)
The following shows uploading the LLM with QAIC AI acceleration enabled without the acceleration configuration options.
vllm = wl.upload_model(
"sample-model-name",
"sample-model-file.zip",
framework=wallaroo.framework.Framework.CUSTOM,
framework_config=framework_config,
input_schema=input_schema,
output_schema=output_schema,
accel=Acceleration.QAIC
)
The following demonstrates uploading the LLM with QAIC AI acceleration enabled with the acceleration configuration options.
llm = wl.upload_model(
"sample-model-name",
"sample-model-file.zip",
framework=wallaroo.framework.Framework.CUSTOM,
framework_config=framework_config,
input_schema=input_schema,
output_schema=output_schema,
accel=Acceleration.QAIC.with_config(qaic_config)
)
Models are uploaded via the Wallaroo MLOps API via the following endpoint:
/v1/api/models/upload_and_convert
This endpoint accepts the following parameters.
Field | Type | Description | |
---|---|---|---|
name | String (Required) | The model name. | |
visibility | String (Required) | Either public or private . | |
workspace_id | String (Required) | The numerical ID of the workspace to upload the model to. | |
conversion | String (Required) | The conversion parameters that include the following: | |
framework | String (Required) | The framework of the model being uploaded. For Custom vLLM frameworks, this value is custom | |
accel | String (Optional) OR Dict (Optional) | The AI accelerator used. If using qaic , this parameter is either a string to use the default parameters, or as a Dict for hardware acceleration parameters. | |
python_version | String (Required) | The version of Python required for the model. For Custom vLLM frameworks, this value is 3.8 . | |
requirements | String (Required) | Required libraries. For Custom vLLM frameworks, this value is [] . | |
framework_config | Dict (Optional) | The framework configuration. | |
input_schema | String (Optional) | The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . | |
output_schema | String (Optional) | The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . |
The framework_config
parameter accepts the following parameters.
Field | Type | ||
---|---|---|---|
config | Dict | The framework configuration values. The following subset are parameters of the config field. | |
max_num_seqs | Integer (Default: 256) | ||
max_model_len | Integer (Default: None) | ||
max_seq_len_to_capture | Integer (Default: 8192) | ||
quantization | (Default: None) | ||
kv_cache_dtype | (Default: 'auto' ) | ||
gpu_memory_utilization | Float (Default: 0.9) | ||
block_size | (Default: None) | ||
device_group | (Default: None) | This setting is ignored for CUDA acceleration. | |
framework | String | The framework of the framework_config type. For Custom vLLM frameworks, this value is "custom" . |
For QAIC, set `block_size=32`.
The optional acceleration configuration for qaic
includes the following parameters. If these parameters are not defined at model upload, the default values are applied.
Parameters | Type | Description |
---|---|---|
num_cores | Integer (Default: 16 ) | Number of cores used to compile the model. |
num_devices | Integer (Default: 1 ) | Number of System-on-Chip (SoC) in a given card to compile the model for. |
ctx_len | Integer (Default: 128) | Maximum context that the compiled model remembers. |
prefill_seq_len | Integer | The length of the Prefill prompt. |
full_batch_size | Integer (Default: None ) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
mxfp6_matmul | Boolean (Default: False ) | Enable compilation for MXFP6 precision. |
mxint8_kv_cache | Boolean (Default: False ) | Compress Present/Past KV to MXINT8. |
aic_enable_depth_first | Boolean (Default: False ) | Enables DFS with default memory size. |
The following example demonstrates uploading an LLM with the Custom vLLM framework and QAIC acceleration enabled, with additional acceleration configuration options defined.
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-token-here>" \
-F 'metadata={"name": "<your model name here>", "visibility": "private", "workspace_id": <your workspace id>, "conversion": {"framework": "custom", "framework_config": {"framework": "custom", "config":{"max_num_seqs": 16, "max_model_len": 256, "max_seq_len_to_capture": 128, "quantization": "mxfp6", "kv_cache_dtype": "mxint8", "gpu_memory_utilization": 1}}, "accel": {"qaic":{"num_devices":4,"full_batch_size": 16, "ctx_len": 256, "prefill_seq_len": 128, "mxfp6_matmul":true,"mxint8_kv_cache":true}}, "python_version": "3.8", "requirements": []}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@<your llm file here>;type=application/octet-stream" \
https://qaic.example.wallaroo.ai/v1/api/models/upload_and_convert | cat
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model
. The retrieved model version is then used for deployment.
# Retrieve the model
custom_vllm_model = wl.get_model("your-model-name")
Deploying LLMs with QAIC acceleration has the following steps.
* Define the deployment configuration, including the `accel` parameters - in this case, `QAIC`.
* The `gpu` value is the number of System-on-Chips (SoCs) to use.

The deployment configuration sets what resources are allocated for the LLM. For this example, the LLM is allocated the following:

* CPUs: 4
* Memory: 12Gi
* GPUs: 4 (the `gpu` parameter specifies the number of SoCs allocated)

deployment_config = DeploymentConfigBuilder() \
.replica_autoscale_min_max(minimum=1, maximum=2) \
.cpus(1).memory('1Gi') \
.sidekick_cpus(custom_vllm_model, 4) \
.sidekick_memory(custom_vllm_model, '12Gi') \
.sidekick_gpus(custom_vllm_model, 4) \
.deployment_label("kubernetes.io/os:linux") \
.build()
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline
method. Pipeline steps determine how inference data is provided to the LLM.
The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.
# create the pipeline
vllm_pipeline = wl.build_pipeline('sample-vllm-pipeline')
# add the LLM as a pipeline model step
vllm_pipeline.add_model_step(custom_vllm_model)
With the Deployment Configuration defined and the pipeline ready, the pipeline is deployed with the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig])
method. This allocates resources from the cluster for the deployment based on the DeploymentConfig
settings. If the resources requested are not available at deployment, an error is returned.
The following example demonstrates deploying the pipeline with the previously defined deployment configuration.
vllm_pipeline.deploy(deployment_config=deployment_config)
Once the deployment configuration is complete, the pipeline is ready to accept inference requests.
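The deployed Custom vLLM pipeline accepts inference requests in the same way; the following is a brief sketch assuming a pandas DataFrame matching the upload schema.

```python
import pandas as pd

# Input columns match the prompt / max_tokens input schema defined at upload.
data = pd.DataFrame({
    "prompt": ["Summarize the rules of tennis"],
    "max_tokens": [128],
})

result = vllm_pipeline.infer(data)
print(result)
```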