Can We Edit Multimodal Large Language Models?

Table of Contents

  • Overview
  • 📕 MMEdit Datasets
  • 🔧 Requirements
  • 📌 Use EasyEdit
  • 🎉 Acknowledgement
  • 📖 Citation

Overview

📕 MMEdit Datasets

| Dataset | Google Drive | BaiduNetDisk | Description |
| ------- | ------------ | ------------ | ----------- |
| E-IC | [Google Drive] | [BaiduNetDisk] | Dataset for editing Image Captioning |
| E-VQA | [Google Drive] | [BaiduNetDisk] | Dataset for editing Visual Question Answering |
  • All images used in E-IC and E-VQA are available for download from Google Drive or BaiduNetDisk.
  • Locality is the same as in factual editing: it measures whether the outputs for unrelated facts are retained after editing.
  • Multimodal locality assesses the impact of editing on the visual module; it is analogous to regular locality.
Dataset description:
editing-data
├── caption
│   ├── caption_train_edit.json
│   └── caption_eval_edit.json
├── locality
│   ├── NQ dataset
│   │   ├── train.json
│   │   └── validation.json
├── multimodal_locality
│   ├── OK-VQA dataset
│   │   ├── okvqa_loc.json
└── vqa
    ├── vqa_train.json
    └── vqa_eval.json
  • Multimodal locality (evaluation data for multimodal locality; see the dataset's details in this paper). A quick way to inspect the downloaded files is sketched below.
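
Before training, you can sanity-check the downloaded editing files with a few lines of Python. The sketch below assumes only that the files are JSON and sit under data/ as in the tree above; no particular field names are assumed.

import json

# Peek at one editing file to confirm it loads and to see what a record looks like.
with open('data/vqa_train.json') as f:
    records = json.load(f)

print(type(records).__name__, len(records))
sample = records[0] if isinstance(records, list) else next(iter(records.values()))
print(list(sample.keys()) if isinstance(sample, dict) else sample)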

🔧 Requirements

Pip Installation

Note: Please use Python 3.9+ for EasyEdit. To get started, simply install conda and run:

git clone https://github.com/zjunlp/EasyEdit.git
conda create -n EasyEdit python=3.9.7
...
pip install -r requirements.txt

Checkpoints Preparation

You should configure the qformer_checkpoint and pretrained_ckpt settings; note that they deviate from the original repository's guidelines. Please refer to the Multimodal section in this file for the correct settings. pretrained_ckpt can be downloaded from here, and qformer_checkpoint can be found here.
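
It can also help to verify that both paths point to files that actually exist before training. The sketch below is a minimal check; it assumes only that the two keys named above appear in the multimodal hparams YAML you plan to use (the config path is an example).

import os
import yaml

# Check that the multimodal checkpoint paths configured in the hparams YAML exist on disk.
# Assumption: the YAML stores them under the keys qformer_checkpoint and pretrained_ckpt.
with open('hparams/TRAINING/SERAC/minigpt4.yaml') as f:
    cfg = yaml.safe_load(f)

for key in ('qformer_checkpoint', 'pretrained_ckpt'):
    path = cfg.get(key)
    print(f'{key}: {path} -> {"found" if path and os.path.isfile(path) else "missing"}')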

📌 Use EasyEdit

MultimodalTrainer

  • Meta-learning based: MEND
  • Memory-based routing: SERAC

For the above editing methods, pre-training of the corresponding meta-networks or classifiers is required. EasyEdit therefore provides a unified framework for pretraining the relevant network structures. Take training SERAC as an example:

Step1: Define the MLLM to be edited. EasyEdit currently supports a subset of multimodal models (MiniGPT-4 and BLIP2OPT so far). The corresponding configuration file directory is hparams/TRAINING/YOUR_METHOD/YOUR_MODEL.YAML for training, e.g. hparams/TRAINING/MEND/minigpt4.yaml; set model_name there to select the model to be edited. For evaluation, use hparams/YOUR_METHOD/YOUR_MODEL.YAML.

model_name: minigpt4
model_class: Blip2OPT
tokenizer_class: LlamaTokenizer
tokenizer_name: Vicuna

Step2: Choose the appropriate Editing Method. The selection of the editing method is a crucial step, as different methods have their own strengths and weaknesses. Users need to consider the trade-off among editing success rate, generalization, and preserving performance on unrelated inputs.

## In this case, we use the SERAC method, so you should import `SERACMultimodalTrainingHparams` for training
from easyeditor import SERACMultimodalTrainingHparams
## Loading config from hparams/TRAINING/SERAC/minigpt4.yaml
training_hparams = SERACMultimodalTrainingHparams.from_hparams('./hparams/TRAINING/SERAC/minigpt4.yaml')

Step3: Provide the edit training set. The currently supported and available datasets are Caption and VQA (Google Drive). Please place them in the "data" directory and initialize the dataset_class (CaptionDataset for Caption and VQADataset for VQA) to load the corresponding training set.

from easyeditor import CaptionDataset

train_ds = CaptionDataset('data/caption_train_edit.json', config=training_hparams)
eval_ds = CaptionDataset('data/caption_eval_edit.json', config=training_hparams)

Step4: Combine them into a Trainer

from easyeditor import MultimodalTrainer

trainer = MultimodalTrainer(
    config=training_hparams,
    train_set=train_ds,
    val_set=eval_ds
)

Step5: Run and Edit. Done! We can now run training and evaluation.

trainer.run()
  • Run: the CHECKPOINT will be saved to the path given by results_dir.
  • Edit: set the archive field in the hparams file to the CHECKPOINT path; EasyEdit will then automatically load the corresponding pre-trained weights during the editing process (Go to edit). A rough sketch of this step follows below.
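
As an illustration of the Edit step, the sketch below loads the evaluation hparams and points them at the saved checkpoint before building an editor. It assumes that archive is exposed as an attribute after from_hparams; equivalently, set archive directly in hparams/SERAC/minigpt4.yaml. The checkpoint path is a placeholder, not a real file name.

from easyeditor import SERACMultimodalHparams, MultimodalEditor

# A hedged sketch of wiring the trained CHECKPOINT into the editing phase.
hparams = SERACMultimodalHparams.from_hparams('hparams/SERAC/minigpt4.yaml')
hparams.archive = '<path-to-CHECKPOINT-under-results_dir>'  # placeholder path
editor = MultimodalEditor.from_hparams(hparams)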

Training Example

from easyeditor import SERACMultimodalTrainingHparams, CaptionDataset, MultimodalTrainer

training_hparams = SERACMultimodalTrainingHparams.from_hparams('hparams/TRAINING/SERAC/minigpt4.yaml')
train_ds = CaptionDataset('data/caption_train_edit.json', config=training_hparams)
eval_ds = CaptionDataset('data/caption_eval_edit.json', config=training_hparams)
trainer = MultimodalTrainer(
    config=training_hparams,
    train_set=train_ds,
    val_set=eval_ds
)

trainer.run()

Evaluating Example

from easyeditor import SERACMultimodalHparams, CaptionDataset, MultimodalTrainer

hparams = SERACMultimodalHparams.from_hparams('hparams/SERAC/minigpt4.yaml')
# train_ds = CaptionDataset('data/caption_train_edit.json', config=hparams)
eval_ds = CaptionDataset('data/caption_eval_edit.json', config=hparams)
trainer = MultimodalTrainer(
    config=hparams,
    train_set=eval_ds,
    val_set=eval_ds
)

trainer.run()

The results will include the following metrics:

  • rewrite_acc $\rightarrow$ Reliability
  • rephrase_acc $\rightarrow$ Generalization
  • image_rephrase_acc $\rightarrow$ Generalization for Multimodal
  • locality_acc $\rightarrow$ Locality
  • multimodal_locality_acc $\rightarrow$ Locality for Multimodal

MultimodalEditor

MultimodalEditor is the class for Multi-Modality Editing. You can choose the appropriate editing method (such as IKE) based on your specific needs.

  • Due to differences in the transformers library version and in GPU models, the editing results may fluctuate slightly.

Step1: Generate embedding files for IKE. You can use Generate_Embedding_for_IKE() in multimodal_edit.py to generate them directly.

## Generate embedding files for IKE

from sentence_transformers import SentenceTransformer
from easyeditor import IKEMultimodalHyperParams, VQADataset
## encode_ike_facts_multimodal is provided by EasyEdit's IKE utilities (see multimodal_edit.py for the exact import)

hparams = IKEMultimodalHyperParams.from_hparams('hparams/IKE/blip2.yaml')
train_ds = VQADataset('data/vqa_train.json', config=hparams)
sentence_model = SentenceTransformer(hparams.sentence_model_name).to(f'cuda:{hparams.device}')
encode_ike_facts_multimodal(sentence_model, train_ds, hparams)

Step2: Run and Edit! Select a specific model and dataset, then use test_IKE_MiniGPT4_Caption() in multimodal_edit.py to run it.

from easyeditor import IKEMultimodalHyperParams, MultimodalEditor, CaptionDataset

hparams = IKEMultimodalHyperParams.from_hparams('hparams/IKE/minigpt4.yaml')
editor = MultimodalEditor.from_hparams(hparams)
eval_ds = CaptionDataset('data/caption_eval_edit.json', config=hparams)
metrics, edited_model, _ = editor.edit_dataset(
    ds=eval_ds,
    train_ds=eval_ds,
    keep_original_weight=True
)

🎉 Acknowledgement

We would like to express our sincere gratitude to the excellent works LAVIS, MiniGPT-4, SERAC, and MEND.

📖 Citation

If you find this work useful for your research, please cite it as follows:

@inproceedings{DBLP:conf/emnlp/0008TL0WC023,
  author       = {Siyuan Cheng and
                  Bozhong Tian and
                  Qingbin Liu and
                  Xi Chen and
                  Yongheng Wang and
                  Huajun Chen and
                  Ningyu Zhang},
  editor       = {Houda Bouamor and
                  Juan Pino and
                  Kalika Bali},
  title        = {Can We Edit Multimodal Large Language Models?},
  booktitle    = {Proceedings of the 2023 Conference on Empirical Methods in Natural
                  Language Processing, {EMNLP} 2023, Singapore, December 6-10, 2023},
  pages        = {13877--13888},
  publisher    = {Association for Computational Linguistics},
  year         = {2023},
  url          = {https://aclanthology.org/2023.emnlp-main.856},
  timestamp    = {Wed, 13 Dec 2023 17:20:20 +0100},
  biburl       = {https://dblp.org/rec/conf/emnlp/0008TL0WC023.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}