
FlashInfer: Kernel Library for LLM Serving

Cuda 1,198 110 Updated Sep 30, 2024

Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.

Jupyter Notebook 1,525 94 Updated Feb 16, 2024
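For the Kernl entry above, a minimal sketch of the advertised single-line usage, assuming the optimize_model entry point shown in Kernl's README (the import path is an assumption and may differ between releases; a CUDA GPU is required):

```python
# Hedged sketch: the optimize_model import path is assumed from Kernl's README
# and may differ between versions; requires a CUDA GPU.
import torch
from transformers import AutoModel
from kernl.model_optimization import optimize_model  # assumed entry point

model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
optimize_model(model)  # the advertised "single line": swaps attention/linear ops for Triton kernels

inputs = {"input_ids": torch.randint(0, 1000, (1, 128), device="cuda")}
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(**inputs)
```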

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

Python 461 21 Updated Sep 30, 2024

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Python 2,375 192 Updated Sep 30, 2024

A collection of compiler learning resources.

Python 2,076 324 Updated May 27, 2024

How to optimize common algorithms in CUDA.

Cuda 1,475 122 Updated Sep 29, 2024

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Jupyter Notebook 2,224 153 Updated Jun 25, 2024

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 27,631 4,073 Updated Sep 30, 2024
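For the vLLM entry above, a minimal offline-inference sketch (the model name is a placeholder; sampling defaults vary by version):

```python
# Minimal vLLM offline-inference sketch; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # loads weights and pre-allocates the paged KV cache
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)                       # first completion for each prompt
```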

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Python 132,787 26,459 Updated Sep 30, 2024
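For the Transformers entry above, a minimal text-generation sketch using the pipeline API (the model name is a small placeholder; any causal LM on the Hub works):

```python
# Minimal Transformers pipeline sketch; "gpt2" is just a small placeholder model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("PagedAttention reduces KV-cache fragmentation by", max_new_tokens=30)
print(result[0]["generated_text"])
```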

📖 A curated list of awesome LLM inference papers with code, covering TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

2,568 173 Updated Sep 27, 2024

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 4,314 388 Updated Sep 28, 2024
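For the LMDeploy entry above, a minimal sketch of its high-level pipeline API (the model name is a placeholder, and the pipeline entry point follows recent documentation, so treat it as an assumption on older releases):

```python
# Hedged sketch of LMDeploy's pipeline API; the model name is a placeholder
# and the entry point may differ across versions.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")        # builds the serving engine behind a simple callable
responses = pipe(["Summarize what continuous batching is."])
print(responses[0].text)
```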

A Chinese version of CLIP that supports Chinese cross-modal retrieval and representation generation.

Python 4,371 451 Updated Aug 6, 2024

AISystem mainly refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks.

Jupyter Notebook 10,628 1,533 Updated Sep 29, 2024

The official GitHub page for the survey paper "A Survey of Large Language Models".

Python 10,106 797 Updated Aug 20, 2024

LLM inference in C/C++

C++ 65,661 9,421 Updated Sep 30, 2024

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

Python 1,645 149 Updated Oct 25, 2023

Scalable, portable and distributed gradient boosting (GBDT, GBRT or GBM) library for Python, R, Java, Scala, C++ and more. Runs on a single machine, as well as Hadoop, Spark, Dask, Flink and DataFlow.

C++ 26,138 8,706 Updated Sep 30, 2024
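For the XGBoost entry above, a minimal sketch of training a gradient-boosted classifier through its scikit-learn-style estimator API:

```python
# Minimal XGBoost sketch using the scikit-learn-compatible estimator API.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```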

Port of OpenAI's Whisper model in C/C++

C 34,712 3,533 Updated Sep 27, 2024

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Python 1,850 343 Updated Sep 20, 2024

Collective communications library with various primitives for multi-machine training.

C++ 1,198 302 Updated Jun 26, 2024

🔮 ChatGPT Desktop Application (Mac, Windows and Linux)

Rust 52,436 5,898 Updated Aug 29, 2024

Example models using DeepSpeed

Python 6,015 1,020 Updated Sep 17, 2024
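For the DeepSpeed examples entry above, a minimal sketch of the deepspeed.initialize pattern those examples are built around (the model and config values are placeholders; real runs are usually launched with the deepspeed launcher):

```python
# Minimal DeepSpeed wrapping sketch; model and config values are placeholders.
import torch
import deepspeed

model = torch.nn.Linear(512, 512)
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
}

# The returned engine handles optimizer steps, mixed precision, and (if configured) ZeRO partitioning.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```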

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.

Python 15,988 1,564 Updated Sep 30, 2024
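For the PEFT entry above, a minimal LoRA sketch (the base model and target modules are placeholders chosen for GPT-2):

```python
# Minimal PEFT/LoRA sketch; base model and target_modules are placeholders for GPT-2.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], lora_dropout=0.05)

model = get_peft_model(base, config)   # wraps the base model with trainable low-rank adapters
model.print_trainable_parameters()     # only the adapter weights require gradients
```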

Transformer-related optimization, including BERT and GPT.

C++ 5,803 889 Updated Mar 27, 2024

Fast and memory-efficient exact attention

Python 13,600 1,245 Updated Sep 30, 2024
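For the FlashAttention entry above, a minimal sketch of calling the fused kernel directly (requires a CUDA GPU and half-precision tensors in (batch, seqlen, nheads, headdim) layout):

```python
# Minimal FlashAttention sketch; needs a CUDA GPU and fp16/bf16 inputs.
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")

# Exact attention computed tile-by-tile, without materializing the full score matrix.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (2, 1024, 8, 64)
```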

The Triton Inference Server provides an optimized cloud and edge inferencing solution.

Python 8,123 1,458 Updated Sep 30, 2024
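For the Triton Inference Server entry above, a minimal client-side sketch using the tritonclient HTTP API (the server URL, model name, and tensor names are placeholders and must match the deployed model's config.pbtxt):

```python
# Hedged Triton client sketch; model and tensor names are placeholders
# that must match the deployed model's configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("input__0", data.shape, "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="resnet50", inputs=[inp])
print(result.as_numpy("output__0").shape)
```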

Universal cross-platform tokenizer bindings to HF tokenizers and SentencePiece.

C++ 253 57 Updated Aug 12, 2024

ONNX Runtime: a cross-platform, high-performance ML inferencing and training accelerator.

C++ 14,223 2,867 Updated Sep 30, 2024
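For the ONNX Runtime entry above, a minimal CPU inference sketch ("model.onnx" and its input shape are placeholders):

```python
# Minimal ONNX Runtime sketch; "model.onnx" and the input shape are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name            # discover the graph's first input name

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: x})         # None -> return all graph outputs
print(outputs[0].shape)
```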

Universal LLM Deployment Engine with ML Compilation

Python 18,756 1,531 Updated Sep 28, 2024