What happened?

When using ollama, health checks return an error even though the endpoint is fine. The checks run in parallel, and with too many models they time out, because ollama loads the models serially.

The proposal is to limit the number of parallel requests, ideally per provider: for ollama, 1 or 2 should be the maximum; for openai, maybe 3 or 4 at most, so health checks don't eat into production rate limits and the proxy is a better corporate citizen. I proposed the change below, but you may want to add an environment variable somewhere to capture the limits.

Maybe you would prefer a Semaphore or a Queue to schedule the work?
file: litellm/proxy/health_check.py
import asyncio
from typing import Optional

import litellm


async def _perform_health_check(
    model_list: list,
    details: Optional[bool] = True,
    max_concurrent_tasks: int = 2,
):
    """
    Perform a health check for each model in the list.

    A bounded semaphore caps the number of concurrent checks.
    """
    semaphore = asyncio.BoundedSemaphore(max_concurrent_tasks)

    async def sem_task(model):
        # At most max_concurrent_tasks checks run at once; the rest wait here.
        async with semaphore:
            litellm_params = model["litellm_params"]
            model_info = model.get("model_info", {})
            litellm_params["messages"] = _get_random_llm_message()
            mode = model_info.get("mode", None)
            return await litellm.ahealth_check(
                litellm_params,
                mode=mode,
                prompt="test from litellm",
                input=["test from litellm"],
            )

    tasks = [sem_task(model) for model in model_list]
    results = await asyncio.gather(*tasks)

    healthy_endpoints = []
    unhealthy_endpoints = []
    ...
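To sketch the per-provider limits mentioned above, here is a minimal standalone variant using one bounded semaphore per provider, with limits overridable via environment variables. The `HEALTH_CHECK_<PROVIDER>_CONCURRENCY` variable names, `check_all`, and `provider_limit` are illustrative assumptions, not existing litellm code:

```python
import asyncio
import os

# Assumed defaults, per the proposal: ollama serializes, openai gets a bit more.
DEFAULT_LIMITS = {"ollama": 1, "openai": 4}


def provider_limit(provider: str) -> int:
    # Hypothetical env var scheme, e.g. HEALTH_CHECK_OLLAMA_CONCURRENCY=2.
    env_key = f"HEALTH_CHECK_{provider.upper()}_CONCURRENCY"
    return int(os.getenv(env_key, DEFAULT_LIMITS.get(provider, 2)))


async def check_all(models, check_fn):
    # One bounded semaphore per provider: ollama checks queue behind each
    # other while openai checks still run under their own, larger limit.
    semaphores: dict[str, asyncio.BoundedSemaphore] = {}

    async def sem_task(model):
        provider = model.get("provider", "unknown")
        sem = semaphores.setdefault(
            provider, asyncio.BoundedSemaphore(provider_limit(provider))
        )
        async with sem:
            return await check_fn(model)

    # gather preserves input order, so results line up with the model list.
    return await asyncio.gather(*(sem_task(m) for m in models))


async def _fake_check(model):
    await asyncio.sleep(0)  # stand-in for litellm.ahealth_check
    return (model["name"], "healthy")


results = asyncio.run(check_all(
    [{"provider": "ollama", "name": "m1"},
     {"provider": "ollama", "name": "m2"},
     {"provider": "openai", "name": "gpt"}],
    _fake_check,
))
print(results)  # [('m1', 'healthy'), ('m2', 'healthy'), ('gpt', 'healthy')]
```

An asyncio.Queue drained by N worker tasks would cap concurrency the same way; the semaphore version is just shorter.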
Relevant log output
Error output from the health checks (the errors start appearing after a few tested models):
Note that the error text gets truncated; it may also be better for the /health endpoint to return only the timeout error rather than the full traceback.
"error": "error:litellm.APIConnectionError: \nTraceback (most recent call last):\n File \"/usr/local/lib/python3.11/site-packages/litellm/main.py\", line 427, in acompletion\n response = await init_response\n ^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py\", line 495, in ollama_acompletion\n raise e # don't use verbose_logger.exception, if exception is raised\n ^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py\", line 436, in ollama_acompletion\n resp = await session.post(url, json=data)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/client.py\", line 684, in _request\n await resp.start(conn)\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py\", line 994, in start\n with self._timer:\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/helpers.py\", line 713, in __exit__\n raise asyncio.TimeoutError from None\nTimeoutError\n. Missing `mode`. 
Set the `mode` for the model - https://docs.litellm.ai/docs/proxy/health#embedding-models \nstacktrace: Traceback (most recent call last):\n File \"/usr/local/lib/python3.11/site-packages/litellm/main.py\", line 427, in acompletion\n response = await init_response\n ^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py\", line 495, in ollama_acompletion\n raise e # don't use verbose_logger.exception, if exception is raised\n ^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py\", line 436, in ollama_acompletion\n resp = await session.post(url, json=data)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/client.py\", line 684, in _request\n await resp.start(conn)\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py\", line 994, in start\n with self._timer:\n File \"/usr/local/lib/python3.11/site-packages/aiohttp/helpers.py\", line 713, in __exit__\n raise asyncio.TimeoutError from None\nTimeoutError\n\nDuring handling of the above exceptio"
},
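On the truncation point: a minimal sketch of trimming a traceback-style error string down to its final exception line before returning it from /health. `summarize_health_error` is a hypothetical helper, not existing litellm code:

```python
def summarize_health_error(error: str, max_len: int = 200) -> str:
    # The last non-empty line of a Python traceback is the exception itself
    # (e.g. "TimeoutError"); keep only that instead of the full stacktrace.
    lines = [ln.strip() for ln in error.strip().splitlines() if ln.strip()]
    summary = lines[-1] if lines else error
    return summary[:max_len]


tb = (
    "Traceback (most recent call last):\n"
    '  File "aiohttp/helpers.py", line 713, in __exit__\n'
    "    raise asyncio.TimeoutError from None\n"
    "TimeoutError"
)
print(summarize_health_error(tb))  # TimeoutError
```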