PyTorch implementation of the paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".
A fast reverse proxy to help you expose a local server behind a NAT or firewall to the internet.
Library for faster pinned CPU <-> GPU transfers in PyTorch (see the pinned-memory sketch after this list).
A throughput-oriented high-performance serving framework for LLMs
Modular and structured prompt caching for low-latency LLM inference
A large-scale simulation framework for LLM inference
SGLang is a fast serving framework for large language models and vision language models.
An open-source project dedicated to building foundational large language models for the natural sciences, mainly physics, chemistry, and materials science.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A tool that profiles OpenCL devices to find their peak capacities.
Efficient and easy multi-instance LLM serving
Heterogeneous AI Computing Virtualization Middleware
Analyze LLM inference along dimensions such as computation, storage, transmission, and the hardware roofline model in a user-friendly interface (a worked roofline example follows this list).
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
The easiest way to serve AI apps and models - build reliable inference APIs, LLM apps, multi-model chains, RAG services, and much more!
Summary of some awesome work for optimizing LLM inference
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
Latency and Memory Analysis of Transformer Models for Training and Inference
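
For context on the pinned-transfer library above, this is the standard PyTorch pattern such libraries aim to speed up: staging data in page-locked (pinned) host memory so host-to-device copies can run as asynchronous DMA transfers. A minimal sketch, assuming a CUDA-capable machine; the tensor shape is illustrative.

```python
import torch

# A pageable host tensor (~256 MB fp32); the shape is an arbitrary example.
src = torch.randn(64, 1024, 1024)

pinned = src.pin_memory()                   # page-locked copy, enables async DMA
gpu = pinned.to("cuda", non_blocking=True)  # asynchronous host-to-device copy
torch.cuda.synchronize()                    # wait for the copy to complete
```

Without pinning, `to("cuda", non_blocking=True)` falls back to a synchronous copy from pageable memory, which is the bottleneck these libraries target.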
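
The roofline-analysis tools above all reduce to one formula: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth. A minimal worked sketch; the hardware numbers are approximate A100-class figures used purely as an assumption.

```python
def attainable_flops(peak_flops: float, mem_bw_bytes: float, intensity: float) -> float:
    """Roofline model: min(peak compute, arithmetic intensity * memory bandwidth)."""
    return min(peak_flops, intensity * mem_bw_bytes)

# Assumed A100-class hardware: ~312 TFLOP/s FP16 peak, ~2.0e12 B/s HBM bandwidth.
# Batch-1 decoding streams every weight once per token, giving an arithmetic
# intensity of only ~1-2 FLOP/byte, so decoding sits on the memory-bound slope.
tflops = attainable_flops(312e12, 2.0e12, 2.0) / 1e12
print(f"{tflops:.1f} TFLOP/s attainable")  # ~4.0 TFLOP/s, far below the 312 peak
```

This gap between attainable and peak throughput is what motivates the batching, caching, and scheduling projects listed above.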