Configures the minimum and maximum number of replicas for autoscaling.
Sets the average CPU utilization, as a percentage, that triggers autoscaling.
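The two settings above bound the autoscaler and give it its default CPU trigger. A minimal self-contained sketch of the chaining pattern (the class and method names mirror a typical DeploymentConfigBuilder but are illustrative stand-ins, not the SDK's actual implementation):

```python
# Toy stand-in for a DeploymentConfigBuilder, illustrating how the
# autoscaling setters compose; names are assumptions, not the SDK.
class DeploymentConfigBuilder:
    def __init__(self):
        self.config = {}

    def replica_autoscale_min_max(self, minimum, maximum):
        # Autoscaling bounds: never fewer than `minimum` or more
        # than `maximum` replicas.
        self.config["autoscale"] = {"minimum": minimum, "maximum": maximum}
        return self  # returning self is what enables chaining

    def autoscale_cpu_utilization(self, percent):
        # Average CPU utilization (as a percentage) to scale on.
        self.config.setdefault("autoscale", {})["cpu_utilization"] = percent
        return self

cfg = (
    DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=1, maximum=5)
    .autoscale_cpu_utilization(75)
    .config
)
print(cfg)
```

Each setter returns the builder itself, which is why every method below documents its return value as the builder instance "for chaining."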
Sets the number of GPUs to be used for the model's sidekick container. Only affects image-based models (e.g. MLflow models) in a deployment.
This DeploymentConfigBuilder instance for chaining.
Sets the number of CPUs to be used for the model's sidekick container. Only affects image-based models (e.g. MLflow models) in a deployment.
This DeploymentConfigBuilder instance for chaining.
Sets the memory to be used for the model's sidekick container. Only affects image-based models (e.g. MLflow models) in a deployment.
This DeploymentConfigBuilder instance for chaining.
Sets the environment variables for the model's sidekick container. Only affects image-based models (e.g. MLflow models) in a deployment.
This DeploymentConfigBuilder instance for chaining.
Sets the machine architecture for the model's sidekick container. Only affects image-based models (e.g. MLflow models) in a deployment.
This DeploymentConfigBuilder instance for chaining.
Sets the acceleration option for the model's sidekick container. Only affects image-based models (e.g. MLflow models) in a deployment.
This DeploymentConfigBuilder instance for chaining.
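The sidekick setters follow the same chaining pattern but are scoped to a specific model's container. A self-contained sketch of that shape (here the settings are keyed by model name for simplicity; the real SDK takes a model object, and all names below are illustrative):

```python
# Toy sketch of per-model sidekick configuration; keys and method
# names are illustrative, not the SDK's actual API.
class SidekickConfigBuilder:
    def __init__(self):
        self.sidekicks = {}

    def _sidekick(self, model_name):
        # One settings dict per model's sidekick container.
        return self.sidekicks.setdefault(model_name, {})

    def sidekick_cpus(self, model_name, core_count):
        self._sidekick(model_name)["cpus"] = core_count
        return self

    def sidekick_memory(self, model_name, memory_spec):
        # Memory as a Kubernetes-style quantity string, e.g. "2Gi".
        self._sidekick(model_name)["memory"] = memory_spec
        return self

    def sidekick_env(self, model_name, environment):
        self._sidekick(model_name)["env"] = dict(environment)
        return self

builder = (
    SidekickConfigBuilder()
    .sidekick_cpus("mlflow-model", 2)
    .sidekick_memory("mlflow-model", "2Gi")
    .sidekick_env("mlflow-model", {"GUNICORN_CMD_ARGS": "--timeout=180"})
)
print(builder.sidekicks)
```

Because each setter targets one model, a deployment with several image-based models can give each sidekick its own resources and environment in a single chained expression.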
Configure the scale_up_queue_depth threshold as an autoscaling trigger.
This method sets a queue depth threshold above which all pipeline components (including the engine and LLM sidekicks) will incrementally scale up.
The scale_up_queue_depth is calculated as: (number of requests in queue + requests being processed) / number of available replicas over a scaling window.
Notes:
- This parameter must be configured to activate queue-based autoscaling.
- No default value is provided.
- When configured, scale_up_queue_depth overrides the default autoscaling trigger (cpu_utilization).
- The setting applies to all components of the pipeline.
- When set, scale_down_queue_depth is automatically set to 1 if not already configured.
DeploymentConfigBuilder: The current instance for method chaining.
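The queue-depth metric defined above is a straightforward ratio. A minimal sketch of the calculation (the function name is ours, not the SDK's):

```python
def queue_depth(queued: int, in_flight: int, replicas: int) -> float:
    # (requests in queue + requests being processed) / available replicas
    return (queued + in_flight) / replicas

# 12 queued and 8 in-flight requests across 4 replicas:
depth = queue_depth(queued=12, in_flight=8, replicas=4)  # 5.0
# With scale_up_queue_depth set to 5, a sustained depth above 5
# would incrementally scale up all pipeline components.
print(depth)
```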
Configure the scale_down_queue_depth threshold as an autoscaling trigger.
This method sets a queue depth threshold below which all pipeline components (including the engine and LLM sidekicks) will incrementally scale down.
The scale_down_queue_depth is calculated as: (number of requests in queue + requests being processed) / number of available replicas over a scaling window.
Notes:
- This parameter is optional and defaults to 1 if not set.
- scale_down_queue_depth is only applicable when scale_up_queue_depth is configured.
- The setting applies to all components of the pipeline.
- This threshold helps prevent unnecessary scaling down when the workload is still significant but below the scale-up threshold.
DeploymentConfigBuilder: The current instance for method chaining.
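Together, the two thresholds form a band with hysteresis: depths above scale_up_queue_depth add replicas, depths below scale_down_queue_depth remove them, and anything in between holds steady. A sketch of that decision logic (our own illustration, not SDK code):

```python
def scaling_decision(depth: float, scale_up: float, scale_down: float = 1.0) -> str:
    # scale_down defaults to 1, matching the documented default.
    if depth > scale_up:
        return "scale up"
    if depth < scale_down:
        return "scale down"
    return "hold"

assert scaling_decision(6.0, scale_up=5.0) == "scale up"
assert scaling_decision(0.5, scale_up=5.0) == "scale down"
# Depth between the thresholds: still busy enough that removing
# a replica would be premature, so the deployment holds steady.
assert scaling_decision(3.0, scale_up=5.0) == "hold"
```

The gap between the thresholds is what prevents thrashing: without it, a deployment hovering near a single threshold would repeatedly add and remove replicas.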
Configure the autoscaling window for incrementally scaling up/down pipeline components.
This method sets the time window over which the autoscaling metrics are evaluated for making scaling decisions. It applies to all components of the pipeline, including the engine and LLM sidekicks.
Notes:
- The default value is 300 seconds if not specified.
- This setting is only applicable when scale_up_queue_depth is configured.
- The autoscaling window helps smooth out short-term fluctuations in workload and prevents rapid scaling events.
DeploymentConfigBuilder: The current instance for method chaining.
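Evaluating the metric over a window rather than instantaneously is a standard smoothing technique: brief spikes are averaged out, and only sustained load moves the metric past a threshold. A self-contained sketch of windowed averaging (the class name and sampling interval are our assumptions, not how the SDK implements it):

```python
from collections import deque

class WindowedMetric:
    """Rolling average over a fixed time window, sampled at a fixed interval."""

    def __init__(self, window_seconds: int = 300, sample_interval: int = 15):
        # e.g. 300 s / 15 s = 20 samples kept; older ones fall off the deque.
        self.samples = deque(maxlen=window_seconds // sample_interval)

    def record(self, value: float) -> None:
        self.samples.append(value)

    def average(self) -> float:
        return sum(self.samples) / len(self.samples)

metric = WindowedMetric(window_seconds=60, sample_interval=15)  # keeps 4 samples
for depth in [2.0, 2.0, 20.0, 2.0, 2.0]:  # one brief spike
    metric.record(depth)
# Only the last 4 samples remain: (2.0 + 20.0 + 2.0 + 2.0) / 4 = 6.5
print(metric.average())
```

A longer window reacts more slowly but more stably; a shorter one tracks load changes faster at the cost of more frequent scaling events.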