Starred repositories
OpenAI-compatible API for the TensorRT-LLM Triton backend
An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment.
An efficient implementation of a rate limiter for asyncio.
A framework for serving and evaluating LLM routers - save LLM costs without compromising quality!
Flax is a neural network library for JAX that is designed for flexibility.
Efficient Triton Kernels for LLM Training
Advanced quantization algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs"
MSCCL++: A GPU-driven communication stack for scalable AI applications
Official inference repo for FLUX.1 models
Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use faster half-precision accumulation, yielding roughly a 2x speedup on consumer devices.
Minimalistic large language model 3D-parallelism training
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversational use.
Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
vLLM adapter for a TGIS-compatible gRPC server.
📺 An End-to-End Solution for High-Resolution and Long Video Generation Based on Transformer Diffusion
Large Language Model Text Generation Inference
NVIDIA Linux open GPU kernel modules with P2P support
Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of papers on accelerating LLMs, currently focusing mainly on inference.
😇 A PyTorch implementation of the DeepMoji model: a state-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm, etc.
🔍 An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT)
A native PyTorch Library for large model training
Examples of how to call collective operation functions in multi-GPU environments: simple examples using the broadcast, reduce, allGather, reduceScatter, and sendRecv operations.
LLM prompts, llama3 prompts, llama2 prompts