
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models

⭐️ Star us to follow our team's projects!


🚀🚀🚀 Official implementation of DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models.

📜 News

[2024/02/22] The paper, evaluation code and checkpoints are released!

👨‍💻 Todo

  • Release evaluation code and checkpoints
  • Release paper
  • Release training code
  • Speed up inference

🤖 Model Zoo

| Name | LLM | SEED-IMG | MMBench | GQA* | TextVQA |
|---|---|---|---|---|---|
| LLaVA-1.5-7B | Vicuna-7B | 66.2 | 64.3 | 67.2 | 58.2 |
| DualFocus-LLaVA-1.5-7B | Vicuna-7B | 68.9 (+2.7) | 66.8 (+2.5) | 69.4 (+2.2) | 62.3 (+4.1) |
| LLaVA-1.5-13B | Vicuna-13B | 68.2 | 67.7 | 69.3 | 61.3 |
| DualFocus-LLaVA-1.5-13B | Vicuna-13B | 71.0 (+2.8) | 71.4 (+3.7) | 74.5 (+5.2) | 65.7 (+4.4) |
| ShareGPT4V-13B | Vicuna-13B | 70.8 | 68.5 | 71.1 | 62.2 |
| DualFocus-ShareGPT4V-13B | Vicuna-13B | 72.9 (+2.1) | 71.0 (+2.5) | 75.7 (+4.6) | 66.7 (+4.5) |

GQA*: we converted the GQA dataset into a multiple-choice question format via GPT-3.5. Please refer to here for details.
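For intuition, below is a hedged sketch of what a single-question conversion call might look like, using the OpenAI chat completions API directly from the shell. The model name is real, but the prompt wording, the 4-option format, and the example question are our assumptions, not the exact recipe used for the released GQA* split.

# Hypothetical one-question conversion; prompt wording and option count are assumptions.
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{
          "role": "user",
          "content": "Rewrite this GQA question as a multiple-choice question with 4 options (A-D), keeping the original answer as one option. Question: Is the tray on top of the table? Answer: yes"
        }]
      }'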

Install

# clone the repo (shallow) and enter the DualFocus project
git clone https://github.com/InternLM/InternLM-XComposer --depth=1
cd InternLM-XComposer/projects/DualFocus

# create and activate a clean environment
conda create -n DualFocus python=3.9 -y
conda activate DualFocus

# install DualFocus and its training extras
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
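As a quick sanity check that the environment is usable (a sketch; it assumes the package installs under the llava namespace, as in upstream LLaVA, and that a CUDA GPU is visible):

# verify the install and GPU visibility (the `llava` package name is an
# assumption carried over from upstream LLaVA)
python -c "import torch, llava; print('CUDA available:', torch.cuda.is_available())"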

Data Preparation

Please follow the instructions in Data.md to prepare the datasets. A sketch of the expected layout is shown below.
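For reference, the evaluation steps below expect roughly the following layout under ./playground/data/eval. This is a sketch assembled from the paths mentioned in this README; the exact file sets come from Data.md, and the TextVQA image folder name is an assumption.

playground/data/eval/
├── mmbench/
│   └── mmbench_dev_20230712.tsv
├── seed_bench/
│   └── SEED-Bench-image/
├── textvqa/
│   ├── TextVQA_0.5.1_val.json
│   └── train_images/          # image folder name is an assumption
└── gqa/
    ├── data/                  # official GQA data and evaluation scripts
    └── *.json                 # the converted multiple-choice json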

Evaluation

We provide scripts for evaluation on 4 benchmarks. Here we take DualFocus-LLaVA-1.5-7B as an example. For slurm users, please configure the PARTITION, QUOTA_TYPE, and GPUS parameters yourself, for example as sketched below.
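A hedged example of a slurm invocation, assuming the scripts read these parameters as environment variables; the values are placeholders:

# hypothetical slurm invocation; PARTITION/QUOTA_TYPE/GPUS values are placeholders
PARTITION=your_partition QUOTA_TYPE=reserved GPUS=8 \
  bash scripts/eval/slurm_eval_mmbench.sh yhcao/DualFocus-LLaVA-1.5-7B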

MMBench-EN

  1. Download mmbench_dev_20230712.tsv and put it under ./playground/data/eval/mmbench.
  2. Run multi-GPU inference.
# for single node inference
bash scripts/eval/eval_mmbench.sh yhcao/DualFocus-LLaVA-1.5-7B
# for slurm inference
bash scripts/eval/slurm_eval_mmbench.sh yhcao/DualFocus-LLaVA-1.5-7B
  3. Submit the results to the evaluation server: ./playground/data/eval/mmbench/answers_upload/{res}.xlsx.

SEED-Bench-Image

  1. Follow the official instructions to download the images and put them under ./playground/data/eval/seed_bench/SEED-Bench-image.
  2. Run multi-GPU inference and evaluation.
# for single node inference
bash scripts/eval/eval_seed.sh yhcao/DualFocus-LLaVA-1.5-7B
# for slurm inference
bash scripts/eval/slurm_eval_seed.sh yhcao/DualFocus-LLaVA-1.5-7B

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images, and extract them to ./playground/data/eval/textvqa.
  2. Run multi-GPU inference and evaluation.
# for single node inference
bash scripts/eval/eval_textvqa.sh yhcao/DualFocus-LLaVA-1.5-7B
# for slurm inference
bash scripts/eval/slurm_eval_textvqa.sh yhcao/DualFocus-LLaVA-1.5-7B

GQA

  1. Download the data and evaluation scripts following the official instructions and put them under ./playground/data/eval/gqa/data. Download the json and put it under ./playground/data/eval/gqa. You may need to modify eval.py as described here, due to missing assets in the GQA v1.2 release (see the sketch after the commands below).
  2. Run multi-GPU inference and evaluation.
# for single node inference
bash scripts/eval/eval_gqa.sh yhcao/DualFocus-LLaVA-1.5-7B
# for slurm inference
bash scripts/eval/slurm_eval_gqa.sh yhcao/DualFocus-LLaVA-1.5-7B
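After inference, the score is produced by the official GQA evaluation script. A hedged sketch of its typical invocation follows; the tier name and the prediction-file convention come from the official GQA release, and the exact arguments may differ after your eval.py modifications:

# run the official GQA scorer (tier name is an assumption; it reads
# {tier}_predictions.json by default per the official GQA release)
cd ./playground/data/eval/gqa/data
python eval.py --tier testdev_balanced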

❤️ Acknowledgments

  • LLaVA: the codebase we built upon. Thanks for their wonderful work.
  • Vicuna: the amazing open-sourced large language model!

✒️ Citation

If you find our work helpful for your research, please consider giving us a star ⭐ and a citation 📝

@article{cao2024dualfocus,
  title={DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models},
  author={Yuhang Cao and Pan Zhang and Xiaoyi Dong and Dahua Lin and Jiaqi Wang},
  journal={arXiv preprint arXiv:2402.14767},
  year={2024},
}

License

Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is licensed CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.