
Why is the inference speed so slow? #3

Open
khiemkhanh98 opened this issue Mar 13, 2024 · 2 comments
Labels
good first issue Good for newcomers

Comments

@khiemkhanh98

It took 30 s to generate ~100 tokens on an A6000 GPU, which I found to be around 5x slower than LLaVA of the same size and the same quantization. Why is that the case?

@ByungKwanLee
Owner

I am trying to investigate it!

Are you sure your model is loaded into GPU VRAM?
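
A quick way to confirm this (a minimal sketch; it assumes `model` is the already-loaded MoAI model object) is to check where the parameter tensors actually live:

```python
import torch

# Assumes `model` is the already-loaded MoAI model object.
device = next(model.parameters()).device
print(f"Model weights are on: {device}")  # expect e.g. cuda:0, not cpu

# Rough VRAM usage on the current CUDA device, if one is available.
if torch.cuda.is_available():
    print(f"Allocated VRAM: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```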

@ByungKwanLee ByungKwanLee added the good first issue Good for newcomers label Mar 14, 2024
@ByungKwanLee ByungKwanLee self-assigned this Mar 14, 2024
@ByungKwanLee
Owner

ByungKwanLee commented Mar 14, 2024

I think it may stem from flash attention.

The official LLaVA repository model and other Hugging Face models normally have flash attention applied.

However, I checked and MoAI does not apply it properly.

Therefore, I will try to equip it!
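
For context, recent Hugging Face Transformers versions let you request FlashAttention-2 at load time via the `attn_implementation` keyword (a minimal sketch, assuming the model loads through `AutoModelForCausalLM` and that the `flash-attn` package is installed; the checkpoint name below is illustrative, not necessarily MoAI's actual loading path):

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative checkpoint name; the real MoAI loading code may differ.
model = AutoModelForCausalLM.from_pretrained(
    "BK-Lee/MoAI-7B",
    torch_dtype=torch.float16,                # flash attention requires fp16/bf16
    attn_implementation="flash_attention_2",  # raises an error if flash-attn is not installed
    device_map="cuda",
)
```

If the repository has its own loading utilities, the same keyword can usually be passed through to the underlying `from_pretrained` call there.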

@ByungKwanLee ByungKwanLee removed their assignment Mar 23, 2024