
Why is the inference speed so slow? #3

Open
khiemkhanh98 opened this issue Mar 13, 2024 · 2 comments
Labels
good first issue Good for newcomers

Comments

@khiemkhanh98

It took 30 s to generate ~100 tokens on an A6000 GPU, which I found to be around 5x slower than LLaVA of the same size and the same quantization. Why is that the case?

@ByungKwanLee
Owner

I am trying to investigate it!

Are you sure your model is loaded into GPU VRAM?
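
A quick way to confirm this (a minimal sketch; it assumes `model` is the already-loaded MoAI model object) is to check where the parameter tensors actually live:

```python
import torch

# Assumes `model` is the already-loaded MoAI model object.
device = next(model.parameters()).device
print(f"Model weights are on: {device}")  # expect e.g. cuda:0, not cpu

# Rough VRAM usage on the current CUDA device, if one is available.
if torch.cuda.is_available():
    print(f"Allocated VRAM: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```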

@ByungKwanLee ByungKwanLee added the good first issue Good for newcomers label Mar 14, 2024
@ByungKwanLee ByungKwanLee self-assigned this Mar 14, 2024
@ByungKwanLee
Owner

ByungKwanLee commented Mar 14, 2024

I think it may stem from flash attention.

The official LLaVA repository model and other Hugging Face models normally have flash attention applied.

However, I checked and MoAI does not apply it properly.

Therefore, I will try to equip it!
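
For context, recent Hugging Face Transformers versions let you request FlashAttention-2 at load time via the `attn_implementation` keyword (a minimal sketch, assuming the model loads through `AutoModelForCausalLM` and that the `flash-attn` package is installed; the checkpoint name below is illustrative, not necessarily MoAI's actual loading path):

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative checkpoint name; the real MoAI loading code may differ.
model = AutoModelForCausalLM.from_pretrained(
    "BK-Lee/MoAI-7B",
    torch_dtype=torch.float16,                # flash attention requires fp16/bf16
    attn_implementation="flash_attention_2",  # raises an error if flash-attn is not installed
    device_map="cuda",
)
```

If the repository has its own loading utilities, the same keyword can usually be passed through to the underlying `from_pretrained` call there.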

@ByungKwanLee ByungKwanLee removed their assignment Mar 23, 2024