Is your feature request related to a problem?
Most ML services set a limit on the maximum number of texts in a single batch request. For example, Cohere explicitly sets 96 for its embedding API (ref), and OpenAI has a limit of 2048 (ref). Exceeding the limit can cause the request to fail.
With the chunking processor and batch ingestion features, we are more likely to run into the situation where the number of texts in a single batch request exceeds the server's limit. So we'd like a solution that cuts the texts into smaller batches when the total number of texts in a batch request exceeds the limit.
What solution would you like?
1. Support a new parameter in the connector to set the maximum batch size limit of the remote server.
```
POST /_plugins/_ml/connectors/_create
{
    ...
    "parameters": {
        "max_batch_size": 96
    },
    ...
}
```
If `max_batch_size` is not set, we don't cut texts into sub-batches.
2. Cut texts into small batches
If `max_batch_size` is set and the total number of texts exceeds it, we use `max_batch_size` to chunk the texts. There will be `ceil(total number of texts / max_batch_size)` batches, and each batch contains no more than `max_batch_size` texts.
3. Sort texts based on text length before cutting
As @model-collapse explained here, LLMs perform better when the texts in a batch have similar lengths. We also ran a benchmark to test the effect of sorting texts by length before cutting them into small batches and sending batch requests to a SageMaker model. We found that sorting improved ingestion latency by about 5.5%.
SageMaker host type: g5.xlarge
Processor: Sparse Encoding

Benchmark setup:
- Bulk size: 160
- Clients: 1
- Batch size: 16
| Metrics | No sort | Sort before making batches |
|---|---|---|
| Min Throughput (docs/s) | 391.85 | 205.16 |
| Mean Throughput (docs/s) | 445.6 | 466.79 |
| Median Throughput (docs/s) | 445.02 | 470.9 |
| Max Throughput (docs/s) | 494.83 | 527.66 |
| Latency P50 (ms) | 373.039 | 357.348 |
| Latency P90 (ms) | 417.837 | 394.659 |
| Latency P99 (ms) | 469.123 | 425.151 |
| Total Benchmark Time (s) | 720 | 680 |
| Error Rate (%) | 0 | 0 |
Example
In general, suppose we are supposed to send a list of texts for inference, e.g. the five texts "a", "ab", "abcde", "abcdefgh", and "abcdefghijk", and the user sets `max_batch_size` to 2. After sorting and cutting into batches, we would make three inference requests to the remote server with the following inputs, one per request:
["a", "ab"]
["abcde", "abcdefgh"]
["abcdefghijk"]
What alternatives have you considered?
N/A
Do you have any additional context?
N/A
We decided to reuse `input_docs_processed_step_size` for the batch size, and we'll implement sorting docs by length on the `TextEmbeddingProcessor` and `SparseEncodingProcessor` side to avoid sorting docs from the multimodal processor. Please refer to opensearch-project/neural-search#744 for implementation details.