
TinyChatEngine: On-Device LLM Inference Library

Running large language models (LLMs) on the edge is useful: it enables copilot services (coding, office, smart reply) on laptops, cars, robots, and more. Users get instant responses with better privacy, as the data stays local.

This is enabled by the LLM model compression techniques SmoothQuant and AWQ (Activation-aware Weight Quantization), co-designed with TinyChatEngine, which implements the compressed low-precision models.

Demo on an NVIDIA GeForce RTX 4070 laptop: [chat_demo_gpu]

Demo on an Apple MacBook Air (M1, 2020): [chat_demo_m1]

Feel free to check out our slides for more details!

Overview

LLM Compression: SmoothQuant and AWQ

SmoothQuant: Smooths the activation outliers by migrating the quantization difficulty from activations to weights, using a mathematically equivalent transformation (100*1 = 10*10).

[Figure: smoothquant_intuition]
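
A minimal sketch of the smoothing step, assuming per-channel absolute maxima have already been collected from calibration data. The function name is illustrative and the default alpha = 0.5 follows the SmoothQuant paper's convention; this is not TinyChatEngine's actual API:

#include <cmath>
#include <cstddef>
#include <vector>

// Per-channel smoothing scales: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
// Dividing activation channel j by s_j and multiplying the matching weight
// row by s_j keeps X*W mathematically identical while flattening the
// activation outliers that make activations hard to quantize.
std::vector<float> smooth_scales(const std::vector<float>& act_absmax,
                                 const std::vector<float>& wgt_absmax,
                                 float alpha = 0.5f) {
    std::vector<float> s(act_absmax.size());
    for (std::size_t j = 0; j < s.size(); ++j)
        s[j] = std::pow(act_absmax[j], alpha) /
               std::pow(wgt_absmax[j], 1.0f - alpha);
    return s;
}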

AWQ (Activation-aware Weight Quantization): Protects salient weight channels, identified by analyzing activation magnitudes rather than the weights themselves.
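
To illustrate the channel-selection idea, here is a minimal sketch; the function name and the ~1% salient fraction are assumptions drawn from the AWQ paper, not code from this repository:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Rank weight channels by average activation magnitude and return the
// indices of the top fraction; these are the "salient" channels that AWQ
// protects (e.g., by scaling them before 4-bit quantization).
std::vector<std::size_t> salient_channels(const std::vector<float>& act_mag,
                                          double top_fraction = 0.01) {
    std::vector<std::size_t> idx(act_mag.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
        return act_mag[a] > act_mag[b];
    });
    std::size_t keep = std::max<std::size_t>(
        1, static_cast<std::size_t>(idx.size() * top_fraction));
    idx.resize(keep);
    return idx;
}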

LLM Inference Engine: TinyChatEngine

  • Universal: x86 (Intel/AMD), ARM (Apple M1/M2), CUDA (Nvidia GPU).
  • No library dependency: From-scratch C/C++ implementation.
  • High performance: Real-time on a MacBook or a GeForce laptop.
  • Easy to use: Download and compile, then ready to go!

[Figure: overview]

Prerequisites

macOS

For macOS, install boost and llvm with Homebrew:

brew install boost
brew install llvm

For M1/M2 users, install Xcode from the App Store to enable the Metal compiler for GPU support.

Windows

For Windows, download and install the GCC compiler with MSYS2. Follow this tutorial for installation: https://code.visualstudio.com/docs/cpp/config-mingw.

  • Install required dependencies with MSYS2
pacman -S --needed base-devel mingw-w64-x86_64-toolchain make unzip git
  • Add binary directories (e.g., C:\msys64\mingw64\bin and C:\msys64\usr\bin) to the environment path

Step-by-step to deploy LLaMA2-7B-chat with TinyChatEngine

Here, we provide step-by-step instructions to deploy LLaMA2-7B-chat with TinyChatEngine from scratch.

  • Download the repo.
    git clone --recursive https://github.com/mit-han-lab/TinyChatEngine
  • Download the quantized LLaMA2-7B-chat model from our model zoo.
    cd TinyChatEngine/llm
    • On an x86 device (e.g., Intel/AMD laptop)
      python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_x86
    • On an ARM device (e.g., M1/M2 MacBook, Raspberry Pi)
      python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_ARM
    • On a CUDA device (e.g., Jetson AGX Orin, PC/Server)
      python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_CUDA
    • Check the table in the Model Zoo section below for the detailed list of supported models
  • Compile and start the chat locally.
    make chat -j
    ./chat
    Using model: LLaMA2_7B_chat
    Using LLaMA's default data format: INT4
    Loading model... Finished!
    USER: Write a syllabus for Operating Systems.
    ASSISTANT:
    Of course! Here is a sample syllabus for a college-level course on operating systems:
    Course Title: Introduction to Operating Systems
    Course Description: This course provides an overview of the fundamental concepts and techniques used in modern operating systems, including process management, memory management, file systems, security, and I/O devices. Students will learn how these components work together to provide a platform for running applications and programs on a computer.
    Course Objectives:
    * Understand the basic architecture of an operating system
    * Learn about processes, threads, and process scheduling algorithms
    * Study memory management techniques such as paging and segmentation
    * Explore file systems including file organization, storage devices, and file access methods
    * Investigate security mechanisms to protect against malicious software attacks
    * Analyze input/output (I/O) operations and their handling by the operating system
    ...
    

Backend support

Precision x86 (Intel/AMD CPU) ARM (Apple M1/M2) Nvidia GPU Apple GPU
FP32
FP16
W4A16
W4A32
W4A8
W8A8

Quantization and Model Support

The goal of TinyChatEngine is to support various quantization methods on various devices. At present, it supports quantized weights for int8 OPT models that originate from SmoothQuant, using the provided conversion script opt_smooth_exporter.py. For LLaMA models, scripts are available for converting Hugging Face format checkpoints to our int4 weight format, and for quantizing them with the method that matches your device. Before converting and quantizing your models, it is recommended to apply the fake quantization from AWQ to achieve better accuracy. We are currently working on supporting more models, please stay tuned!
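
As an illustration of what an int4 weight format involves, here is a minimal sketch of symmetric group-wise int4 quantization; the group size of 32 and the helper name are assumptions for illustration, not the repository's actual converter:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric group-wise int4 quantization: one fp32 scale per group of 32
// weights, each weight mapped to [-8, 7] and stored offset-binary in one
// nibble. Assumes w.size() is a multiple of group_size.
void quantize_int4(const std::vector<float>& w, std::size_t group_size,
                   std::vector<uint8_t>& q, std::vector<float>& scales) {
    for (std::size_t g = 0; g < w.size(); g += group_size) {
        float absmax = 0.0f;
        for (std::size_t i = g; i < g + group_size; ++i)
            absmax = std::max(absmax, std::fabs(w[i]));
        float s = (absmax > 0.0f) ? absmax / 7.0f : 1.0f;  // guard all-zero group
        scales.push_back(s);
        for (std::size_t i = g; i < g + group_size; ++i) {
            int v = std::clamp(static_cast<int>(std::lround(w[i] / s)), -8, 7);
            q.push_back(static_cast<uint8_t>(v + 8));  // offset-binary nibble
        }
    }
}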

Device-specific int4 Weight Reordering

To mitigate the runtime overheads associated with weight reordering, TinyChatEngine conducts this process offline during model conversion. In this section, we will explore the weight layouts of QM_ARM and QM_x86. These layouts are tailored for ARM and x86 CPUs, supporting 128-bit SIMD and 256-bit SIMD operations, respectively. We also support QM_CUDA for Nvidia GPUs, including server and edge GPUs.

Platforms        ISA     Quantization methods
Intel/AMD        x86-64  QM_x86
Apple M1/M2 Mac  ARM     QM_ARM
Nvidia GPU       CUDA    QM_CUDA
  • Example layout of QM_ARM: For QM_ARM, consider the initial configuration of a 128-bit weight vector, [w0, w1, ..., w30, w31], where each wi is a 4-bit quantized weight. TinyChatEngine rearranges these weights into the sequence [w0, w16, w1, w17, ..., w15, w31] by interleaving the lower and upper halves of the vector. This arrangement lets both halves be decoded with a single 128-bit AND and a single shift operation, which eliminates runtime reordering overheads and improves performance; a minimal packing sketch is shown below.
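
A minimal sketch of this offline packing and of the AND/shift decode it enables; the function names are illustrative, not TinyChatEngine's actual converter code:

#include <cstdint>

// Offline packing for the QM_ARM layout: 32 int4 weights -> 16 bytes,
// pairing w[i] (low nibble) with w[i+16] (high nibble). At runtime, one
// 128-bit AND recovers w0..w15 and one 4-bit shift recovers w16..w31.
void pack_qm_arm(const uint8_t w[32], uint8_t packed[16]) {
    for (int i = 0; i < 16; ++i)
        packed[i] = static_cast<uint8_t>((w[i] & 0x0F) | ((w[i + 16] & 0x0F) << 4));
}

void unpack_qm_arm(const uint8_t packed[16], uint8_t w[32]) {
    for (int i = 0; i < 16; ++i) {
        w[i]      = packed[i] & 0x0F;                       // lower half: AND 0x0F
        w[i + 16] = static_cast<uint8_t>(packed[i] >> 4);   // upper half: shift by 4
    }
}

Interleaving at conversion time trades a one-off reordering cost for a branch-free decode in the inference hot loop.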

Download and deploy models from our Model Zoo

We offer a selection of models that have been tested with TinyChatEngine. These models can be readily downloaded and deployed on your device. To download a model, locate the target model's ID in the table below and use the associated script.

Models           Precisions  ID                        x86 backend  ARM backend  CUDA backend
LLaMA2_13B_chat  fp32        LLaMA2_13B_chat_fp32
                 int4        LLaMA2_13B_chat_awq_int4
LLaMA2_7B_chat   fp32        LLaMA2_7B_chat_fp32
                 int4        LLaMA2_7B_chat_awq_int4
LLaMA_7B         fp32        LLaMA_7B_fp32
                 int4        LLaMA_7B_awq_int4
opt-6.7B         fp32        opt_6.7B_fp32
                 int8        opt_6.7B_smooth_int8
                 int4        opt_6.7B_awq_int4
opt-1.3B         fp32        opt_1.3B_fp32
                 int8        opt_1.3B_smooth_int8
                 int4        opt_1.3B_awq_int4
opt-125m         fp32        opt_125m_fp32
                 int8        opt_125m_smooth_int8
                 int4        opt_125m_awq_int4

For instance, to download the quantized LLaMA2-7B-chat model (for int4 models, use --QM to choose the quantized model for your device):

  • On an Intel/AMD laptop:
    python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_x86
  • On an M1/M2 MacBook:
    python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_ARM
  • On an Nvidia GPU:
    python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_CUDA

To deploy a quantized model with TinyChatEngine, compile and run the chat program.

make chat -j
./chat <model_name> <precision>
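
For example, to run the int4 LLaMA2-7B-chat model downloaded above (the model name and precision strings here are assumed to match the log output shown earlier):

./chat LLaMA2_7B_chat INT4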

Experimental features

TinyChatEngine offers versatile capabilities suitable for various applications. Additionally, we introduce a voice chatbot. Explore our step-by-step guide here to deploy a chatbot locally on your device!

Related Projects

TinyEngine: Memory-efficient and High-performance Neural Network Library for Microcontrollers

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Acknowledgement

llama.cpp

whisper.cpp

transformers
