
[FEATURE] Split remote inference text list if its number exceeds user configured limitation #2428

Closed
chishui opened this issue May 9, 2024 · 1 comment
Labels: enhancement (New feature or request)

chishui commented May 9, 2024

Is your feature request related to a problem?
Most ML services set a limit on the number of texts allowed in a single batch request. For example, Cohere explicitly sets 96 for its embedding API (ref), and OpenAI has a limit of 2048 (ref). Exceeding the limit causes the request to fail.

With the chunking processor feature and the batch ingestion feature, we are more likely to run into the situation where the number of texts in a single batch request exceeds the server's limit.

So we'd like a solution that cuts the texts into smaller batches whenever their total number in a batch request exceeds the limit.

What solution would you like?

1. Support a new connector parameter that sets the maximum batch size accepted by the remote server.

POST /_plugins/_ml/connectors/_create
{
  ...
  "parameters": {
     "max_batch_size": 96
  },
  ...
}

If max_batch_size is not set, we don't cut texts into sub-batches.

2. Cut texts into small batches

If max_batch_size is set and the total number of texts exceeds it, we use max_batch_size to chunk the texts. There will be ⌈total number of texts / max_batch_size⌉ batches, and each batch contains no more than max_batch_size texts.

3. Sort texts based on text length before cutting

As @model-collapse explained here, the LLM performs better when the texts in a batch have similar lengths. We also ran a benchmark of sorting texts before cutting them into small batches and sending the batch requests to a SageMaker model, and found that sorting improved ingestion latency by 5.5%.

  • SageMaker host type: g5.xlarge
  • Processor: Sparse Encoding
  • Benchmark Setup
    • Bulk size: 160
    • client: 1
    • batch size: 16
Metric                       No sort   Sort before making batches
Min Throughput (docs/s)      391.85    205.16
Mean Throughput (docs/s)     445.6     466.79
Median Throughput (docs/s)   445.02    470.9
Max Throughput (docs/s)      494.83    527.66
Latency P50 (ms)             373.039   357.348
Latency P90 (ms)             417.837   394.659
Latency P99 (ms)             469.123   425.151
Total Benchmark Time (s)     720       680
Error Rate (%)               0         0

Example

In general, suppose we are about to send a list of texts for inference, e.g.

[
  "abcde",
  "abcdefghijk",
  "a",
  "ab",
  "abcdefgh",
]

and the user sets max_batch_size to 2. After sorting and cutting into batches, we would make three inference requests to the remote server, each with one of the following inputs (as sketched in the code after this example):

  1. ["a", "ab"]
  2. ["abcde", "abcdefgh"]
  3. ["abcdefghijk"]

What alternatives have you considered?
N/A

Do you have any additional context?
N/A

chishui added the enhancement (New feature or request) and untriaged labels on May 9, 2024
chishui added a commit to chishui/ml-commons that referenced this issue on May 17, 2024
chishui commented May 27, 2024

We decided to reuse input_docs_processed_step_size for the batch size, and we'll implement sorting documents by length on the TextEmbeddingProcessor and SparseEncodingProcessor side to avoid sorting docs in the multimodal processor. Please refer to this PR on the neural-search repo, opensearch-project/neural-search#744, for implementation details.
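
As a hedged illustration of the general idea (not the actual neural-search code), sorting by length also needs to remember each document's original position so that inference results can be mapped back to the caller's order:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical sketch: compute the length-sorted order of the inputs and
// restore results returned in that order back to the original positions.
public final class OrderPreservingSort {

    // Returns the original indices of the texts, ordered by text length.
    public static int[] sortedOrder(List<String> texts) {
        return IntStream.range(0, texts.size())
            .boxed()
            .sorted(Comparator.comparingInt(i -> texts.get(i).length()))
            .mapToInt(Integer::intValue)
            .toArray();
    }

    // sortedResults.get(k) belongs to original index order[k].
    public static <T> List<T> restoreOriginalOrder(List<T> sortedResults, int[] order) {
        List<T> restored = new ArrayList<>(sortedResults); // placeholder of the right size
        for (int k = 0; k < order.length; k++) {
            restored.set(order[k], sortedResults.get(k));
        }
        return restored;
    }
}

The sorted order drives the sub-batch requests, and the index mapping restores the results to the original document order afterwards.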

Closing this issue now.

chishui closed this as not planned on May 27, 2024