This document shows how to build the Jais model into runnable TensorRT-LLM engines on a multi-GPU node and how to perform a summarization task with those engines.

Currently it has been tested on the following models:
- `core42/jais-13b-chat`
- `core42/jais-30b-chat-v3`
The TensorRT-LLM support for Jais is based on the GPT model, whose implementation can be found in `tensorrt_llm/models/gpt/model.py`. Jais closely resembles GPT, except that it uses ALiBi position embeddings, an embedding scale, SwiGLU activation, and a logits scale. We therefore reuse the GPT example code for Jais:
- `../gpt/convert_checkpoint.py` to convert the Jais model into the TensorRT-LLM checkpoint format.

In addition, there are two shared files in the parent `examples` folder for inference and evaluation:
- `../run.py` to run inference on an input text;
- `../summarize.py` to summarize the articles in the cnn_dailymail dataset.
The tested configurations are:
- FP16
- FP8
- Inflight Batching
- Tensor Parallel
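
FP8 appears in the support matrix but is not demonstrated in the walkthrough below. As a hedged sketch (recent TensorRT-LLM releases ship a shared `../quantization/quantize.py` script; verify the flag names against your version), an FP8 checkpoint for jais-13b-chat could be produced like this and then passed to `trtllm-build` as usual:

```bash
# Hedged sketch: FP8 post-training quantization via the shared quantization
# example. Flag names may differ across TensorRT-LLM versions.
python3 ../quantization/quantize.py --model_dir core42/jais-13b-chat \
        --dtype float16 \
        --qformat fp8 \
        --kv_cache_dtype fp8 \
        --output_dir jais-13b-chat/trt_ckpt/fp8/1-gpu
```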
This section walks through the whole process: converting the HF models, building TensorRT-LLM engines, and ultimately performing summarization.

Run the following commands; TensorRT-LLM first converts an HF model into its own checkpoint format, then builds a TRT engine from that checkpoint.
```bash
# single gpu, dtype float16 for jais-13b-chat
python3 ../gpt/convert_checkpoint.py --model_dir core42/jais-13b-chat \
        --dtype float16 \
        --output_dir jais-13b-chat/trt_ckpt/fp16/1-gpu

# 2-way tensor parallelism for jais-30b-chat-v3
python3 ../gpt/convert_checkpoint.py --model_dir core42/jais-30b-chat-v3 \
        --dtype float16 \
        --tp_size 2 \
        --output_dir jais-30b-chat-v3/trt_ckpt/fp16/2-gpu
```
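
The Jais-specific settings mentioned earlier (ALiBi position embeddings, SwiGLU activation, and the scaling factors) end up in the `config.json` of the converted checkpoint. A quick way to inspect them; treat the exact field names as version-dependent:

```bash
# Pretty-print the converted checkpoint's config. Look for fields such as
# position_embedding_type (alibi) and hidden_act (swiglu); exact key names
# may vary across TensorRT-LLM versions.
python3 -m json.tool jais-13b-chat/trt_ckpt/fp16/1-gpu/config.json
```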
```bash
# Build a single-GPU float16 engine from the TensorRT-LLM checkpoint for jais-13b-chat.
# Enable the special TensorRT-LLM GPT Attention plugin (--gpt_attention_plugin) to increase runtime performance.
# It is recommended to use --remove_input_padding along with --gpt_attention_plugin for better performance.
trtllm-build --checkpoint_dir jais-13b-chat/trt_ckpt/fp16/1-gpu \
        --gpt_attention_plugin float16 \
        --remove_input_padding enable \
        --output_dir jais-13b-chat/trt_engines/fp16/1-gpu

# Build 2-way tensor parallelism engines from the TensorRT-LLM checkpoint for jais-30b-chat-v3
trtllm-build --checkpoint_dir jais-30b-chat-v3/trt_ckpt/fp16/2-gpu \
        --gpt_attention_plugin float16 \
        --remove_input_padding enable \
        --output_dir jais-30b-chat-v3/trt_engines/fp16/2-gpu
```
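
Inflight batching is also in the support matrix. Serving an engine with inflight batching (for example, through the Triton backend) additionally requires the paged KV cache; a hedged sketch, assuming your `trtllm-build` version exposes these flags:

```bash
# Hedged sketch: build an engine suitable for inflight batching by enabling
# the paged KV cache. Verify flag names and defaults against your
# TensorRT-LLM version.
trtllm-build --checkpoint_dir jais-13b-chat/trt_ckpt/fp16/1-gpu \
        --gpt_attention_plugin float16 \
        --remove_input_padding enable \
        --paged_kv_cache enable \
        --output_dir jais-13b-chat/trt_engines/fp16-ifb/1-gpu
```

The `fp16-ifb` output directory is just an illustrative name to keep this engine separate from the one built above.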
The `../run.py` script can be used to run inference with the built engine(s).

```bash
python3 ../run.py --engine_dir jais-13b-chat/trt_engines/fp16/1-gpu \
        --tokenizer_dir core42/jais-13b-chat \
        --max_output_len 10
```
If the engine runs successfully, you will see output like:

```
......
Input [Text 0]: "Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " chef in Paris before moving to England in 1816"
```
You can also pass your own prompt via `--input_text`:

```bash
python3 ../run.py --engine_dir jais-13b-chat/trt_engines/fp16/1-gpu \
        --tokenizer_dir core42/jais-13b-chat \
        --max_output_len 8 \
        --input_text "ولد في 1304 ميلادياً ابن بطوطه, لقد ذهب"
```
If the engine runs successfully, you will see output like:

```
.....
Input [Text 0]: "ولد في 1304 ميلادياً ابن بطوطه, لقد ذهب"
Output [Text 0 Beam 0]: " في جميع أنحاء العالم المعروف في ذلك الوقت"
```

(The Arabic prompt says roughly "Ibn Battuta was born in 1304 AD; he went", and the completion continues "throughout the world known at that time".)
To run a model built with 2-way tensor parallelism, launch `run.py` through `mpirun`:

```bash
mpirun -np 2 \
    python3 ../run.py --engine_dir jais-30b-chat-v3/trt_engines/fp16/2-gpu \
        --tokenizer_dir core42/jais-30b-chat-v3 \
        --max_output_len 30
```
If the engines run successfully, you will see output like:

```
Input [Text 0]: "Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " chef, working in a series of high-end establishments.
Soyer's career took him to work in a number of establishments across Europe,"
```
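
For the summarization task mentioned at the top, the shared `../summarize.py` script runs the engine over articles from the cnn_dailymail dataset and reports ROUGE scores. A minimal sketch for the single-GPU jais-13b-chat engine (verify the flags against your TensorRT-LLM version):

```bash
# Hedged sketch: summarize cnn_dailymail articles with the built engine.
# --test_trt_llm evaluates the TensorRT-LLM engine; flag names may differ
# across TensorRT-LLM versions.
python3 ../summarize.py --test_trt_llm \
        --engine_dir jais-13b-chat/trt_engines/fp16/1-gpu \
        --hf_model_dir core42/jais-13b-chat \
        --data_type fp16
```

For the 2-way tensor-parallel jais-30b-chat-v3 engine, prefix the same command with `mpirun -np 2`, as in the `run.py` example above.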