Repositories starred by lihuahua123
PyTorch implementation of the paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".

Python 71 15 Updated May 23, 2023

A fast reverse proxy to help you expose a local server behind a NAT or firewall to the internet.

Go 84,599 13,170 Updated Sep 6, 2024

Library for faster pinned CPU <-> GPU transfer in PyTorch

Python 682 39 Updated Feb 21, 2020

A throughput-oriented high-performance serving framework for LLMs

Cuda 485 17 Updated Sep 21, 2024

Stateful LLM Serving

Python 25 3 Updated Jul 28, 2024

Modular and structured prompt caching for low-latency LLM inference

Python 44 3 Updated May 12, 2024

A large-scale simulation framework for LLM inference

Python 237 23 Updated Aug 24, 2024

The Open Source Survey

515 74 Updated Dec 24, 2020

SGLang is a fast serving framework for large language models and vision language models.

Python 5,199 369 Updated Sep 21, 2024

An open-source project dedicated to building a foundational large language model for natural science, mainly physics, chemistry, and materials science.

Jupyter Notebook 184 23 Updated Aug 2, 2024

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Python 2,093 138 Updated Sep 20, 2024

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Python 1,706 88 Updated Jan 21, 2024

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

1,048 22 Updated Jul 31, 2024

Serving multiple LoRA fine-tuned LLMs as one

Python 951 44 Updated May 8, 2024

A tool which profiles OpenCL devices to find their peak capacities

C++ 399 112 Updated May 20, 2024

Efficient and easy multi-instance LLM serving

Python 118 10 Updated Sep 21, 2024

Heterogeneous AI Computing Virtualization Middleware

Go 663 153 Updated Sep 20, 2024

Analyze the inference of large language models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.

Python 276 32 Updated Sep 11, 2024

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

Jupyter Notebook 72 3 Updated Mar 13, 2024
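The roofline entries above all rest on one formula: attainable throughput is the lesser of the hardware's compute peak and its memory bandwidth times the kernel's arithmetic intensity. A minimal sketch of that calculation (the hardware numbers below are assumed, A100-like figures for illustration only, not taken from these repositories):

```python
def roofline_attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Attainable throughput (FLOP/s) under the roofline model:
    capped either by the compute peak or by memory bandwidth
    multiplied by arithmetic intensity (FLOPs per byte moved)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

# Assumed, A100-like numbers: 312 TFLOP/s FP16 peak, ~2 TB/s HBM bandwidth.
peak = 312e12
bw = 2.039e12

# Decode-phase LLM inference has low arithmetic intensity -> memory-bound.
decode = roofline_attainable_flops(peak, bw, 1)
# Prefill-phase GEMMs have high arithmetic intensity -> compute-bound.
prefill = roofline_attainable_flops(peak, bw, 300)
print(decode, prefill)
```

Comparing the two regimes this way is exactly what the roofline-comparison tools above automate across hardware platforms.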

🤖 AI Gateway | AI Native API Gateway

Go 2,860 470 Updated Sep 20, 2024

The easiest way to serve AI apps and models - Build reliable Inference APIs, LLM apps, Multi-model chains, RAG service, and much more!

Python 6,992 779 Updated Sep 20, 2024

Summary of some awesome work for optimizing LLM inference

27 1 Updated Sep 18, 2024

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

C++ 1,652 223 Updated Sep 21, 2024

Latency and Memory Analysis of Transformer Models for Training and Inference

Python 337 40 Updated May 28, 2024