Dynamic Batching with vLLM Benchmarks

The following benchmarks were produced with a Llama 3 8b vLLM deployed with a DynamicBatchingConfig on an A100 a2-ultragpu-1g node.

Configuration

Llama 3 8b vLLM inference requests typically complete within a few seconds to two minutes, depending on the size of the batch. If all replicas of a deployed LLM are busy processing inference requests, additional requests are delayed until a replica becomes available.

Dynamic Batch Configurations help alleviate these bottlenecks by combining inference requests into a batch, submitting the batch as a single inference request, and then returning the individual inference results to the original senders. For more details on Dynamic Batch Configurations, see Dynamic Batching for LLMs.
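The combine, infer, and return flow described above can be sketched in a few lines. This is a minimal illustration, not the Wallaroo implementation: run_batched and fake_llm are hypothetical names, and fake_llm is a stand-in that only echoes its input.

```python
from typing import Callable, Dict, List, Tuple


def run_batched(
    requests: List[Tuple[str, str]],
    infer_batch: Callable[[List[str]], List[str]],
) -> Dict[str, str]:
    """Combine (sender_id, prompt) pairs into one batch, run a single
    inference call, and map each output back to its sender."""
    sender_ids = [sender for sender, _ in requests]
    prompts = [prompt for _, prompt in requests]
    outputs = infer_batch(prompts)  # one model call for the whole batch
    if len(outputs) != len(prompts):
        raise ValueError("model must return one output per prompt")
    return dict(zip(sender_ids, outputs))


def fake_llm(prompts: List[str]) -> List[str]:
    # Hypothetical stand-in for the deployed model: echoes each prompt.
    return [f"response to: {p}" for p in prompts]


results = run_batched(
    [
        ("client-a", "Describe what Wallaroo.AI is."),
        ("client-b", "Describe what Wallaroo.AI is."),
    ],
    fake_llm,
)
print(results["client-a"])
```

Each sender receives only its own result, even though the model was called once for both prompts.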

Model Attributes

The Llama 3 8b vLLM has the following attributes.

max_tokens=4096
temperature=0.5
top_p=0.9

The same input was used for every request: “Describe what Wallaroo.AI is.” This input is roughly 7 tokens.

Deployment Attributes

For these benchmarks, three replicas of the Wallaroo inference engine and LLM were deployed with different Dynamic Batching Configurations applied. The Wallaroo inference engine timeout has been set to 300 seconds.

All scenarios were executed in Wallaroo version 2024.3. For pre-batching scenarios, we use the same prompt on the batch to make a fair comparison with dynamic batching.

Summary

In a realistic scenario of 1 qps, a single client sends one request per second, each containing a single prompt. The first scenario uses the following Dynamic Batching Configuration.

batch_size_target=8
max_batch_delay_ms=1000

The results are reported using the following metrics:

queries per second (qps): How many queries are submitted per second.
prompt per request (prompt/req): How many prompts were submitted with each request.
seconds per request (s/req): How many seconds it took to complete the request, including 504 timeout requests.
seconds per prompt (s/prompt): How many seconds per prompt, including 504 timeout requests.
tokens per second (tok/s): How many tokens were processed each second.
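As a rough sketch of how these metrics could be derived from per-request records, consider the following. The RequestRecord type, the summarize function, and the sample values are all hypothetical; in particular, averaging s/prompt over the total number of prompts and dividing tokens by wall-clock time are assumptions, since the exact formulas behind the reported figures are not stated here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    prompts: int     # prompts carried by this request
    seconds: float   # time until a response (or a 504 timeout) arrived
    tokens: int      # tokens returned for this request (0 on timeout)


def summarize(
    records: List[RequestRecord], send_window_s: float, wall_clock_s: float
) -> Dict[str, float]:
    """Compute the five metrics from a list of per-request records."""
    total_prompts = sum(r.prompts for r in records)
    total_seconds = sum(r.seconds for r in records)
    return {
        "qps": len(records) / send_window_s,
        "prompt/req": total_prompts / len(records),
        "s/req": total_seconds / len(records),
        "s/prompt": total_seconds / total_prompts,
        "tok/s": sum(r.tokens for r in records) / wall_clock_s,
    }


# Illustrative values only; these are not measurements from the benchmark.
sample = [RequestRecord(1, 10.0, 500), RequestRecord(1, 20.0, 700)]
metrics = summarize(sample, send_window_s=2.0, wall_clock_s=20.0)
print(metrics)
```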

Dynamic Parameters	Without Dynamic Batching	With Dynamic Batching
1 qps (1 prompt/req)	Failed to keep up with incoming requests; 40 of 60 submitted prompts returned a 504 timeout error. 268.34 s/req, 261.91 s/prompt (including 504 requests), ~75 tok/s.	Successfully returned all 60 requests. 105.67 s/req, 105.4 s/prompt, 75-240 tok/s.

Detailed Load Testing Results

For these tests, rather than sending single prompts from multiple clients at once, a fixed number of queries (prompts) is submitted each second, expressed as Queries per Second (qps), to represent a real-world scenario.

Each load test set a different Dynamic Batch Configuration on the same hardware, with the same deployment configurations and the same query. Dynamic Batching Configurations are defined by the following:

Max batch delay: The amount of time in milliseconds (Default: 10) to wait before sending each batch to the model for an inference request.
Batch size target: Minimum size of a batch (Default: 4) sent to the model.
Batch size limit (Optional): The maximum size of a batch the model can process (Default: None). This is a guardrail to control the maximum batch size.
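One way to picture how the three parameters interact is the sketch below. It is not the Wallaroo SDK: BatchingSettings, should_dispatch, and dispatch_size are hypothetical names, and the dispatch rule (send once the target is reached or the delay has elapsed, capped by the limit) is an assumption based on the descriptions above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BatchingSettings:
    """Hypothetical stand-in for a Dynamic Batching Configuration."""
    max_batch_delay_ms: int = 10             # default listed above
    batch_size_target: int = 4               # default listed above
    batch_size_limit: Optional[int] = None   # optional guardrail

    def __post_init__(self) -> None:
        if self.max_batch_delay_ms < 0:
            raise ValueError("max_batch_delay_ms must not be negative")
        if self.batch_size_target <= 0:
            raise ValueError("batch_size_target must be positive")
        if self.batch_size_limit is not None and self.batch_size_limit <= 0:
            raise ValueError("batch_size_limit must be positive when set")


def should_dispatch(cfg: BatchingSettings, queued: int, waited_ms: float) -> bool:
    """Assumed rule: send once the target is reached or the delay expires."""
    if queued == 0:
        return False
    return queued >= cfg.batch_size_target or waited_ms >= cfg.max_batch_delay_ms


def dispatch_size(cfg: BatchingSettings, queued: int) -> int:
    """Cap the batch at batch_size_limit when a limit is configured."""
    if cfg.batch_size_limit is None:
        return queued
    return min(queued, cfg.batch_size_limit)


cfg = BatchingSettings(max_batch_delay_ms=1000, batch_size_target=8, batch_size_limit=10)
print(should_dispatch(cfg, queued=3, waited_ms=200))   # target not met, delay not elapsed
print(should_dispatch(cfg, queued=3, waited_ms=1000))  # delay elapsed
print(dispatch_size(cfg, queued=14))                   # capped by the limit
```

Under this assumed rule, a larger max batch delay lets more requests accumulate per batch at the cost of added waiting time for the earliest request in the queue.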

For each scenario, the following examples are run:

1 qps with a single prompt per request for one full minute.
8 qps with a single prompt per request for one full minute (first scenario only).
5 qps with a single prompt per request for one full minute (each following scenario).
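A fixed-rate load pattern like the ones listed above can be expressed as a list of send times. The sketch below is illustrative only: send_offsets is a hypothetical helper, and evenly spaced sends are an assumption about how the load generator paces its requests.

```python
from typing import List


def send_offsets(qps: int, duration_s: int = 60) -> List[float]:
    """Return the offsets in seconds at which each request is sent,
    assuming evenly spaced requests at the given rate."""
    if qps <= 0 or duration_s <= 0:
        raise ValueError("qps and duration_s must be positive")
    interval = 1.0 / qps
    return [i * interval for i in range(qps * duration_s)]


for rate in (1, 5, 8):
    offsets = send_offsets(rate)
    print(f"{rate} qps for 60 s: {len(offsets)} requests")
```

The request counts this yields (60, 300, and 480) correspond to the response totals shown in the status code distributions of the runs that follow.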

This benchmark sample set uses the following Dynamic Batch Configuration:

max_batch_delay_ms=1000
batch_size_target=8
batch_size_limit=10

1 qps over 1 minute, with a single prompt per request.

Summary:
  Success rate:	100.00%
  Total:	161.4362 secs
  Slowest:	118.2918 secs
  Fastest:	17.7672 secs
  Average:	60.6985 secs
  Requests/sec:	0.3717

  Total data:	301.10 KiB
  Size/request:	5.02 KiB
  Size/sec:	1.86 KiB

Response time histogram:
   17.767 [1]  |■■
   27.820 [5]  |■■■■■■■■■■■■■
   37.872 [7]  |■■■■■■■■■■■■■■■■■■
   47.925 [12] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
   57.977 [10] |■■■■■■■■■■■■■■■■■■■■■■■■■■
   68.029 [4]  |■■■■■■■■■■
   78.082 [3]  |■■■■■■■■
   88.134 [6]  |■■■■■■■■■■■■■■■■
   98.187 [4]  |■■■■■■■■■■
  108.239 [3]  |■■■■■■■■
  118.292 [5]  |■■■■■■■■■■■■■

Response time distribution:
  10.00% in 27.9309 secs
  25.00% in 39.2333 secs
  50.00% in 55.9209 secs
  75.00% in 82.8132 secs
  90.00% in 107.2998 secs
  95.00% in 113.2997 secs
  99.00% in 118.2918 secs
  99.90% in 118.2918 secs
  99.99% in 118.2918 secs

Status code distribution:
  [200] 60 responses
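To help read the response time histogram, the sketch below recomputes the bucket labels from the Fastest and Slowest values of this run. It assumes the load tester splits the range between the fastest and slowest response into ten equal-width buckets and labels each row with the bucket's upper edge; bucket_edges is a hypothetical helper, and the counts are copied from the histogram above.

```python
from typing import List


def bucket_edges(fastest: float, slowest: float, buckets: int = 10) -> List[float]:
    """Equal-width edges from the fastest to the slowest response time."""
    width = (slowest - fastest) / buckets
    return [fastest + i * width for i in range(buckets + 1)]


# Fastest and Slowest from the summary above (seconds).
edges = bucket_edges(17.7672, 118.2918)

# Per-row counts copied from the histogram above.
counts = [1, 5, 7, 12, 10, 4, 3, 6, 4, 3, 5]

for edge, count in zip(edges, counts):
    print(f"{edge:9.3f} [{count}]")
print("total responses:", sum(counts))
```

Under that assumption, each row counts the responses that completed at or below its label and above the previous one, and the counts add up to the total in the status code distribution.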

8 qps over 1 minute, with a single prompt per request.

Summary:
  Success rate:	100.00%
  Total:	1439.6336 secs
  Slowest:	1380.3805 secs
  Fastest:	57.0044 secs
  Average:	724.6244 secs
  Requests/sec:	0.3334

  Total data:	5.98 MiB
  Size/request:	12.76 KiB
  Size/sec:	4.25 KiB

Response time histogram:
    57.004 [1]  |
   189.342 [40] |■■■■■■■■■■■■■■■■■■■■■■
   321.680 [56] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
   454.017 [48] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
   586.355 [48] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
   718.692 [48] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
   851.030 [48] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
   983.368 [48] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1115.705 [48] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1248.043 [48] |■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1380.381 [47] |■■■■■■■■■■■■■■■■■■■■■■■■■■

Response time distribution:
  10.00% in 194.5656 secs
  25.00% in 382.5160 secs
  50.00% in 701.7780 secs
  75.00% in 1018.5019 secs
  90.00% in 1210.6469 secs
  95.00% in 1276.4725 secs
  99.00% in 1337.9496 secs
  99.90% in 1380.3805 secs
  99.99% in 1380.3805 secs

Status code distribution:
  [200] 480 responses

This benchmark sample set uses the following Dynamic Batch Configuration:

max_batch_delay_ms=1000
batch_size_target=16
batch_size_limit=20

1 qps over 1 minute, with a single prompt per request.

Summary:
  Success rate:	100.00%
  Total:	169.4051 secs
  Slowest:	133.3843 secs
  Fastest:	9.6190 secs
  Average:	58.7042 secs
  Requests/sec:	0.3542

  Total data:	361.23 KiB
  Size/request:	6.02 KiB
  Size/sec:	2.13 KiB

Response time histogram:
    9.619 [1]  |■■■
   21.996 [10] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
   34.372 [9]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■
   46.749 [8]  |■■■■■■■■■■■■■■■■■■■■■■■■■
   59.125 [6]  |■■■■■■■■■■■■■■■■■■■
   71.502 [8]  |■■■■■■■■■■■■■■■■■■■■■■■■■
   83.878 [5]  |■■■■■■■■■■■■■■■■
   96.255 [0]  |
  108.631 [3]  |■■■■■■■■■
  121.008 [3]  |■■■■■■■■■
  133.384 [7]  |■■■■■■■■■■■■■■■■■■■■■■

Response time distribution:
  10.00% in 18.3447 secs
  25.00% in 24.0884 secs
  50.00% in 53.7569 secs
  75.00% in 78.7127 secs
  90.00% in 122.2735 secs
  95.00% in 124.3839 secs
  99.00% in 133.3843 secs
  99.90% in 133.3843 secs
  99.99% in 133.3843 secs

Status code distribution:
  [200] 60 responses

5 qps over 1 minute, with a single prompt per request.

Summary:
  Success rate:	100.00%
  Total:	754.9725 secs
  Slowest:	699.6325 secs
  Fastest:	53.3764 secs
  Average:	381.6321 secs
  Requests/sec:	0.3974

  Total data:	2.27 MiB
  Size/request:	7.76 KiB
  Size/sec:	3.08 KiB

Response time histogram:
   53.376 [1]  |
  118.002 [29] |■■■■■■■■■■■■■■■■■■■■■■■■■
  182.628 [32] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  247.253 [23] |■■■■■■■■■■■■■■■■■■■■
  311.879 [35] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  376.504 [35] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  441.130 [20] |■■■■■■■■■■■■■■■■■
  505.756 [36] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  570.381 [20] |■■■■■■■■■■■■■■■■■
  635.007 [35] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  699.632 [34] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

Response time distribution:
  10.00% in 118.6393 secs
  25.00% in 230.3809 secs
  50.00% in 351.2682 secs
  75.00% in 527.8342 secs
  90.00% in 642.1149 secs
  95.00% in 646.4432 secs
  99.00% in 697.6381 secs
  99.90% in 699.6325 secs
  99.99% in 699.6325 secs

Status code distribution:
  [200] 300 responses

This benchmark sample set uses the following Dynamic Batch Configuration:

max_batch_delay_ms=2000
batch_size_target=16
batch_size_limit=20

1 qps over 1 minute, with a single prompt per request.

Summary:
  Success rate:	100.00%
  Total:	256.8241 secs
  Slowest:	209.6773 secs
  Fastest:	19.1298 secs
  Average:	110.8761 secs
  Requests/sec:	0.2336

  Total data:	460.00 KiB
  Size/request:	7.67 KiB
  Size/sec:	1.79 KiB

Response time histogram:
   19.130 [1]  |■■
   38.185 [7]  |■■■■■■■■■■■■■■■■■■
   57.239 [0]  |
   76.294 [10] |■■■■■■■■■■■■■■■■■■■■■■■■■■
   95.349 [5]  |■■■■■■■■■■■■■
  114.404 [12] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  133.458 [6]  |■■■■■■■■■■■■■■■■
  152.513 [7]  |■■■■■■■■■■■■■■■■■■
  171.568 [4]  |■■■■■■■■■■
  190.623 [0]  |
  209.677 [8]  |■■■■■■■■■■■■■■■■■■■■■

Response time distribution:
  10.00% in 35.3275 secs
  25.00% in 74.0342 secs
  50.00% in 108.0208 secs
  75.00% in 150.1338 secs
  90.00% in 203.6858 secs
  95.00% in 206.6842 secs
  99.00% in 209.6773 secs
  99.90% in 209.6773 secs
  99.99% in 209.6773 secs

Status code distribution:
  [200] 60 responses

5 qps over 1 minute, with a single prompt per request.

Summary:
  Success rate:	100.00%
  Total:	875.0367 secs
  Slowest:	815.5057 secs
  Fastest:	8.7272 secs
  Average:	396.0567 secs
  Requests/sec:	0.3428

  Total data:	2.50 MiB
  Size/request:	8.55 KiB
  Size/sec:	2.93 KiB

Response time histogram:
    8.727 [1]  |
   89.405 [22] |■■■■■■■■■■■■■■
  170.083 [33] |■■■■■■■■■■■■■■■■■■■■■
  250.761 [32] |■■■■■■■■■■■■■■■■■■■■
  331.439 [37] |■■■■■■■■■■■■■■■■■■■■■■■■
  412.116 [36] |■■■■■■■■■■■■■■■■■■■■■■■
  492.794 [25] |■■■■■■■■■■■■■■■■
  573.472 [49] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  654.150 [28] |■■■■■■■■■■■■■■■■■■
  734.828 [18] |■■■■■■■■■■■
  815.506 [19] |■■■■■■■■■■■■

Response time distribution:
  10.00% in 132.2805 secs
  25.00% in 209.4897 secs
  50.00% in 392.2650 secs
  75.00% in 570.3415 secs
  90.00% in 696.0153 secs
  95.00% in 760.4724 secs
  99.00% in 768.4708 secs
  99.90% in 815.5057 secs
  99.99% in 815.5057 secs

Status code distribution:
  [200] 300 responses
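To compare the runs side by side, the reported figures can be collected into one structure. The sketch below copies the values from the summaries above and recomputes Requests/sec as responses divided by total time, which is an assumption about how that figure is derived; RUNS and completed_per_second are hypothetical names.

```python
# Each entry: (max_batch_delay_ms, batch_size_target, batch_size_limit,
#              qps, responses, total_secs, average_secs, reported_requests_per_sec)
RUNS = [
    (1000, 8, 10, 1, 60, 161.4362, 60.6985, 0.3717),
    (1000, 8, 10, 8, 480, 1439.6336, 724.6244, 0.3334),
    (1000, 16, 20, 1, 60, 169.4051, 58.7042, 0.3542),
    (1000, 16, 20, 5, 300, 754.9725, 381.6321, 0.3974),
    (2000, 16, 20, 1, 60, 256.8241, 110.8761, 0.2336),
    (2000, 16, 20, 5, 300, 875.0367, 396.0567, 0.3428),
]


def completed_per_second(responses: int, total_secs: float) -> float:
    """Completed responses per second over the whole run."""
    return responses / total_secs


recomputed = [completed_per_second(r[4], r[5]) for r in RUNS]

print("delay_ms target limit qps  avg_s     req/s")
for run, rate in zip(RUNS, recomputed):
    delay, target, limit, qps, _, _, average, _ = run
    print(f"{delay:8d} {target:6d} {limit:5d} {qps:3d} {average:9.4f} {rate:8.4f}")
```

Laying the runs out this way makes it easier to compare average response time and completed requests per second across configurations at the same query rate.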