Is your feature request related to a problem?
Most ML services set a limit on the maximum number of texts in a single batch request. For example, Cohere explicitly sets 96 for its embedding API (ref), and OpenAI has a limit of 2048 (ref). Exceeding the limit can cause the request to fail.
With the chunking processor and batch ingestion features, we are more likely to run into the situation where the number of texts in a single batch request exceeds the server's limit. So we'd like a solution that cuts the texts into smaller batches when the total number of texts in a batch request exceeds the limit.
What solution would you like?
1. Support a new parameter in the connector to set the maximum batch size limit of the remote server.
```
POST /_plugins/_ml/connectors/_create
{
    ...
    "parameters": {
        "max_batch_size": 96
    },
    ...
}
```
If `max_batch_size` is not set, we don't cut texts into sub-batches.
2. Cut texts into small batches
If `max_batch_size` is set and the total number of texts exceeds it, we use `max_batch_size` to chunk the texts. There will be `ceil(total number of texts / max_batch_size)` batches, and each batch contains no more than `max_batch_size` texts.
3. Sort texts based on text length before cutting
As @model-collapse explained here, LLMs perform better when the texts in a batch have similar lengths. We also ran a benchmark to test the effect of sorting texts by length before cutting them into small batches and sending batch requests to a SageMaker model. We found that sorting improved ingestion latency by about 5.5%.
SageMaker host type: g5.xlarge
Processor: Sparse Encoding

Benchmark setup:
- Bulk size: 160
- Clients: 1
- Batch size: 16
| Metrics | No sort | Sort before making batches |
|---|---|---|
| Min Throughput (docs/s) | 391.85 | 205.16 |
| Mean Throughput (docs/s) | 445.6 | 466.79 |
| Median Throughput (docs/s) | 445.02 | 470.9 |
| Max Throughput (docs/s) | 494.83 | 527.66 |
| Latency P50 (ms) | 373.039 | 357.348 |
| Latency P90 (ms) | 417.837 | 394.659 |
| Latency P99 (ms) | 469.123 | 425.151 |
| Total Benchmark Time (s) | 720 | 680 |
| Error Rate (%) | 0 | 0 |
Example
In general, suppose we are supposed to send a list of texts for inference, e.g. the five texts "a", "ab", "abcde", "abcdefgh", and "abcdefghijk", and the user sets `max_batch_size` to 2. After sorting and cutting into batches, we would make three inference requests to the remote server with the following inputs, one per request:
["a", "ab"]
["abcde", "abcdefgh"]
["abcdefghijk"]
What alternatives have you considered?
N/A
Do you have any additional context?
N/A
We decided to reuse `input_docs_processed_step_size` for the batch size, and we'll implement sorting docs by length on the `TextEmbeddingProcessor` and `SparseEncodingProcessor` side to avoid sorting docs from the multimodal processor. Please refer to opensearch-project/neural-search#744 for implementation details.