Stars
FlashInfer: Kernel Library for LLM Serving
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
LightLLM is a Python-based LLM inference and serving framework, notable for its lightweight design, easy scalability, and high performance.
How to optimize algorithms in CUDA.
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
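The core of Medusa-style acceleration is draft-then-verify decoding: extra heads propose several future tokens cheaply, and the target model accepts the longest matching prefix. A minimal, stdlib-only sketch of the greedy verification step (the `target_next_token` callable and toy model below are illustrative, not Medusa's API):

```python
def verify_draft(target_next_token, prefix, draft_tokens):
    """Greedy verification used in Medusa/speculative decoding:
    accept draft tokens while they match what the target model would
    emit, then append one token from the target model itself."""
    accepted = []
    for tok in draft_tokens:
        if tok != target_next_token(prefix + accepted):
            break  # first mismatch: discard the rest of the draft
        accepted.append(tok)
    # The target model always contributes at least one real token.
    accepted.append(target_next_token(prefix + accepted))
    return accepted

# Toy "target model": next token is previous token + 1.
def toy_target(seq):
    return seq[-1] + 1 if seq else 0

result = verify_draft(toy_target, prefix=[0, 1, 2], draft_tokens=[3, 4, 9])
```

Because every accepted token is exactly what the target model would have produced, the output distribution is unchanged; the speedup comes from verifying several tokens per target-model step instead of generating one.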
A high-throughput and memory-efficient inference and serving engine for LLMs
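vLLM's memory efficiency comes from PagedAttention: the KV cache is split into fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to blocks, so memory is allocated on demand and reclaimed blocks are reused across sequences. A stdlib-only sketch of that bookkeeping (class and method names are illustrative, not vLLM's actual API):

```python
# Illustrative paged KV-cache bookkeeping (not vLLM's real implementation).
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a KV slot for one new token; returns (block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:                # current block full
            table.append(self.free_blocks.pop())    # allocate on demand
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
slots = [cache.append_token(seq_id=0) for _ in range(5)]  # 5 tokens, 2 blocks
```

Unlike a contiguous per-sequence KV buffer sized for the maximum length, this scheme wastes at most one partially filled block per sequence, which is what enables the larger batch sizes behind vLLM's throughput.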
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
AISystem covers the full low-level AI systems stack, including AI chips, AI compilers, and AI inference and training frameworks.
The official GitHub page for the survey paper "A Survey of Large Language Models".
Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
microsoft / Megatron-DeepSpeed
Forked from NVIDIA/Megatron-LM. Ongoing research training transformer language models at scale, including: BERT & GPT-2.
Collective communications library with various primitives for multi-machine training.
🔮 ChatGPT Desktop Application (Mac, Windows and Linux)
Example models using DeepSpeed
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
Transformer-related optimization, including BERT and GPT.
Fast and memory-efficient exact attention
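FlashAttention is exact (not an approximation) because softmax can be computed in a streaming fashion: processing key/value tiles while maintaining a running max and normalizer gives the same result as materializing the full score matrix. A stdlib-only sketch of that online-softmax rescaling for a single scalar query (tile size and function names are illustrative):

```python
import math

def streaming_attention(q, keys, values, tile=2):
    """Exact single-query attention computed tile by tile (online softmax)."""
    m = float("-inf")   # running max of attention scores
    denom = 0.0         # running softmax normalizer
    acc = 0.0           # running weighted sum of (scalar) values
    for start in range(0, len(keys), tile):
        scores = [q * k for k in keys[start:start + tile]]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)          # rescale earlier partial sums
        denom = denom * scale + sum(math.exp(s - m_new) for s in scores)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(scores, values[start:start + tile]))
        m = m_new
    return acc / denom

def naive_attention(q, keys, values):
    """Reference: materialize all weights at once."""
    w = [math.exp(q * k) for k in keys]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)
```

The real kernel applies the same rescaling to vector-valued accumulators in SRAM tiles; the point here is only that the tiled pass reproduces the one-shot softmax exactly, which is what distinguishes FlashAttention from approximate-attention methods.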
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Universal cross-platform tokenizers binding to HF and sentencepiece
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Universal LLM Deployment Engine with ML Compilation