What happened?

When using ollama, health checks return an error even though the endpoint is fine. The checks run in parallel, and with too many models they time out, because ollama loads the models serially.

The proposal is to limit the number of parallel requests, ideally per provider: for ollama, 1 or 2 should be the maximum; for openai, maybe 3 or 4 at most, so health checks don't eat into production rate limits and the proxy is a better corporate citizen. I proposed the change below, but you may want to add an environment variable somewhere to capture the limits.

Maybe you would prefer a Semaphore or a Queue to schedule the work?
file: litellm/proxy/health_check.py
import asyncio
from typing import Optional

import litellm


async def _perform_health_check(
    model_list: list,
    details: Optional[bool] = True,
    max_concurrent_tasks: int = 2,
):
    """
    Perform a health check for each model in the list.

    A bounded semaphore caps the number of concurrent checks.
    """
    semaphore = asyncio.BoundedSemaphore(max_concurrent_tasks)

    async def sem_task(model):
        # At most max_concurrent_tasks checks run at once; the rest wait here.
        async with semaphore:
            litellm_params = model["litellm_params"]
            model_info = model.get("model_info", {})
            litellm_params["messages"] = _get_random_llm_message()
            mode = model_info.get("mode", None)
            return await litellm.ahealth_check(
                litellm_params,
                mode=mode,
                prompt="test from litellm",
                input=["test from litellm"],
            )

    tasks = [sem_task(model) for model in model_list]
    results = await asyncio.gather(*tasks)

    healthy_endpoints = []
    unhealthy_endpoints = []
    ...
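To sketch the per-provider limits mentioned above, here is a minimal standalone variant using one bounded semaphore per provider, with limits overridable via environment variables. The `HEALTH_CHECK_<PROVIDER>_CONCURRENCY` variable names, `check_all`, and `provider_limit` are illustrative assumptions, not existing litellm code:

```python
import asyncio
import os

# Assumed defaults, per the proposal: ollama serializes, openai gets a bit more.
DEFAULT_LIMITS = {"ollama": 1, "openai": 4}


def provider_limit(provider: str) -> int:
    # Hypothetical env var scheme, e.g. HEALTH_CHECK_OLLAMA_CONCURRENCY=2.
    env_key = f"HEALTH_CHECK_{provider.upper()}_CONCURRENCY"
    return int(os.getenv(env_key, DEFAULT_LIMITS.get(provider, 2)))


async def check_all(models, check_fn):
    # One bounded semaphore per provider: ollama checks queue behind each
    # other while openai checks still run under their own, larger limit.
    semaphores: dict[str, asyncio.BoundedSemaphore] = {}

    async def sem_task(model):
        provider = model.get("provider", "unknown")
        sem = semaphores.setdefault(
            provider, asyncio.BoundedSemaphore(provider_limit(provider))
        )
        async with sem:
            return await check_fn(model)

    # gather preserves input order, so results line up with the model list.
    return await asyncio.gather(*(sem_task(m) for m in models))


async def _fake_check(model):
    await asyncio.sleep(0)  # stand-in for litellm.ahealth_check
    return (model["name"], "healthy")


results = asyncio.run(check_all(
    [{"provider": "ollama", "name": "m1"},
     {"provider": "ollama", "name": "m2"},
     {"provider": "openai", "name": "gpt"}],
    _fake_check,
))
print(results)  # [('m1', 'healthy'), ('m2', 'healthy'), ('gpt', 'healthy')]
```

An asyncio.Queue drained by N worker tasks would cap concurrency the same way; the semaphore version is just shorter.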
Relevant log output
Error output from the health checks (the errors start appearing after a few tested models):
Note that the error text gets truncated; it may also be better for the /health endpoint to return only the timeout error rather than the full traceback.
"error": "error:litellm.APIConnectionError: \nTraceback (most recent call last):\n File \"/usr/local/lib/python3.11/site-packages/litellm/main.py\", line 427, in acompletion\n response = await init_response\n ^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py\", line 495, in ollama_acompletion\n raise e # don't use verbose_logger.exception, if exception is raised\n ^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py\", line 436, in ollama_acompletion\n resp = await session.post(url, json=data)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/client.py\", line 684, in _request\n await resp.start(conn)\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py\", line 994, in start\n with self._timer:\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/helpers.py\", line 713, in __exit__\n raise asyncio.TimeoutError from None\nTimeoutError\n. Missing `mode`. 
Set the `mode` for the model - https://docs.litellm.ai/docs/proxy/health#embedding-models \nstacktrace: Traceback (most recent call last):\n File \"/usr/local/lib/python3.11/site-packages/litellm/main.py\", line 427, in acompletion\n response = await init_response\n ^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py\", line 495, in ollama_acompletion\n raise e # don't use verbose_logger.exception, if exception is raised\n ^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py\", line 436, in ollama_acompletion\n resp = await session.post(url, json=data)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/client.py\", line 684, in _request\n await resp.start(conn)\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py\", line 994, in start\n with self._timer:\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/helpers.py\", line 713, in __exit__\n raise asyncio.TimeoutError from None\nTimeoutError\n\nDuring handling of the above exceptio"
},
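On the truncation point: a minimal sketch of trimming a traceback-style error string down to its final exception line before returning it from /health. `summarize_health_error` is a hypothetical helper, not existing litellm code:

```python
def summarize_health_error(error: str, max_len: int = 200) -> str:
    # The last non-empty line of a Python traceback is the exception itself
    # (e.g. "TimeoutError"); keep only that instead of the full stacktrace.
    lines = [ln.strip() for ln in error.strip().splitlines() if ln.strip()]
    summary = lines[-1] if lines else error
    return summary[:max_len]


tb = (
    "Traceback (most recent call last):\n"
    '  File "aiohttp/helpers.py", line 713, in __exit__\n'
    "    raise asyncio.TimeoutError from None\n"
    "TimeoutError"
)
print(summarize_health_error(tb))  # TimeoutError
```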