RL-MIL-NAACL2024

Source code for Reinforced Multiple Instance Selection for Speaker Attribute Prediction accepted at NAACL 2024.

Overview

Prerequisites

python 3.10+
Ubuntu GPU-enabled server with CUDA 12.1+
- Check your GPUs with nvidia-smi
python environment with packages installed as in requirements.txt
Weights and Biases Account

Setup Environment

cd ROOT_OF_THE_REPO
mkdir -p data/jigsaw

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

To run the experiments, you need to login to your Weights and Biases account. You can do that by running the following command and following the instructions:

wandb login

Datasets

Synthetic Data

Download the train.csv.zip file from Kaggle's Toxic Comment Classification Challenge
Unzip the file resulting in train.csv
Place the file in data/jigsaw/

Congressional Speech Data

Download the hein-daily.zip file from Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts
Unzip the file resulting in hein-daily directory
Place the directory in create_dataset/Political_Dataset/
Download the congress-terms.csv file from Congress Age GitHub Repository
Place the file in create_dataset/Political_Dataset/
Run the following command to create the merged dataset (including age column):
```
cd create_dataset/Political_Dataset
./preprocess_speeches.sh
```
You can see the political_data.csv file under data directory.

Facebook Data

Dataset is private and introduced in the following paper:

Kennedy, B., Atari, M., Mostafazadeh Davani, A., Hoover, J., Omrani, A., Graham, J., & Dehghani, M. (2021). Moral concerns are differentially observable in language. Cognition, 212, 104696. doi:10.1016/j.cognition.2021.104696

Please contact the authors for more information.

Running the Experiments

Script for data preparation

The shell script scripts/prepare_dataset.sh will prepare the data for training, validation and testing. Here are the variables that you can set along with their descriptions:

Variable	Description
`dataset`	Possible values of dataset: `political_data_with_age` `facebook` `jigsaw`
`whole_bag_size`	For `jigsaw` datasets: `50` For `political_data_with_age` and `facebook` datasets: `100`
`num_pos_samples`	This argument is only used in `jigsaw` datasets, it is the number of toxic instances that will be added to the bag (5 is used for making `jigsaw_5` and 10 is used for making `jigsaw_10`)
`embedding_model`	This argument will determine the model that will be used to get the embeddings of the data. We used `roberta-base` in all settings
`data_embedded_column_name`	This argument will determine the name of the column that has the text data. For `facebook` and `political_data_with_age` datasets: `text` For `jigsaw` datasets: `comment_text`
`random_seeds`	Random seeds are an array of integers that will make different splits of the data
`gpus`	It define the array of integers that are indexes of the GPUs that will be used to create embeddings of the data

dataset="jigsaw" # or "political_data_with_age" or "facebook"
whole_bag_size=50 # or 100
num_pos_samples=5 # only used in jigsaw datasets, 5 for making `jigsaw_5` and 10 for making `jigsaw_10`

data_embedded_column_name="comment_text" # or "text"

random_seeds=(0 1 2 3 4)
gpus=(0 1 2 3 4)

After setting the variables, you can run the script as follows:

cd scripts
./prepare_dataset.sh

After that for each random seed, a folder will be created in data/ directory (e.g. data/seed_10/).

Scripts for running the MIL experiments

`scripts/run_mil.sh`

The shell script scripts/run_mil.sh will run the MIL experiment for a single dataset and random_seed 0. Here are the variables that you can set along with their descriptions:

Variable	Description
`baseline_types`	Possible values: `MeanMLP` `MaxMLP` `AttentionMLP` `repset`
`target_labels`	Facebook dataset: `care` `purity` `loyalty` `authority` `fairness` Political dataset: `age` `gender` `party` Jigsaw datasets: `hate`
`gpus`	The same description as above
`embedding_models`	The same description as above
`bag_sizes`	The bag size ($b_i$) in the paper. We used `20`` in all experiments
`autoencoder_layer_sizes`	The layer sizes of the autoencoder. We used `"768,256,768"` in all experiments
`wandb_entity`	Your Weights and Biases entity
`wandb_project`	Your Weights and Biases project name
`dataset`	The same description as above
`data_embedded_column_name`	The same description as above
`task_type`	Always `classification`

baseline_types=("MeanMLP" "MaxMLP" "AttentionMLP" "repset")

# For facebook dataset: ("care" "purity" "loyalty" "authority" "fairness")
# For political_data_with_age dataset: ("age" "gender" "party")
# For jigsaw datasets: ("hate")
target_labels=("care" "purity" "loyalty" "authority" "fairness")

gpus=(0 1 2 3 4 5 6 7)

# wandb config
wandb_entity="YOUR_WANDB_ENTITY"
wandb_project="YOUR_WANDB_PROJECT_NAME"

# Dataset is either: `political_data_with_age,` `facebook,` `jigsaw_5,` or `jigsaw_10`
dataset="facebook"

# For `facebook` and `political_data_with_age` datasets: "text"
# For `jigsaw` datasets: "comment_text"
data_embedded_column_name="text"

After setting the variables, you can run the script as follows:

cd scripts
./run_mil.sh

`scripts/run_mil_multiple.sh`

If you want to run the MIL experiments and baseline experiments multiple times with different seeds, you can use the shell script scripts/run_mil_multiple.sh. Here are the variables that you can set: (All of the variables have the same description as above)

datasets=("political_data_with_age" "facebook" "jigsaw_5" "jigsaw_10")

baselines=("MeanMLP" "MaxMLP" "AttentionMLP" "repset" "SimpleMLP") # You can pass "SimpleMLP" to run the baseline experiments

random_seeds=(0 1 2 3 4 5 6 7)
gpus=(0 1 2 3 4 5 6 7)

After setting the variables, you can run the script as follows:

cd scripts
./run_mil_multiple.sh

Scripts for running the baseline experiments

`scripts/run_baselines.sh`

The shell script scripts/run_baselines.sh will run the baseline experiment. Here are the variables that you can set: (All of the variables have the same description as above)

# For facebook dataset: ("care" "purity" "loyalty" "authority" "fairness")
# For political_data_with_age dataset: ("age" "gender" "party")
# For jigsaw datasets: ("hate")
target_labels=("hate")

gpus=(0 1 2 3 4 5 6 7)

# wandb config
wandb_entity="YOUR_WANDB_ENTITY"
wandb_project="YOUR_WANDB_PROJECT_NAME"

# Dataset is either: `political_data_with_age,` `facebook,` `jigsaw_5,` or `jigsaw_10`
dataset="jigsaw_5"

# For `facebook` and `political_data_with_age` datasets: "text"
# For `jigsaw` datasets: "comment_text"
data_embedded_column_name="comment_text"

For running the Baseline experiments multiple times with different seeds, you should pass the "SimpleMLP" to the baselines variable in scripts/run_mil_multiple.sh script.

After setting the variables, you can run the script as follows:

cd scripts
./run_baselines.sh

Scripts for running the Reinforced MIL experiments

`scripts/run_rlmil.sh`

The shell script scripts/run_rlmil.sh will run the Reinforced MIL experiment for a single dataset and random_seed 0. The variables with the same name as above have the same description. Here are the new variables that you can set along with their descriptions:

Variable	Description
`rl_task_model`	Possible values: `vanilla` `ensemble`
`sample_algorithm`	Possible values: `static` `with_replacement` `without_replacement`
`no_autoencoder_for_rl`	If you want to run the experiments without autoencoder component, you can pass this flag to the script

baseline_types=("MeanMLP" "MaxMLP" "AttentionMLP" "repset")

# For facebook dataset: ("care" "purity" "loyalty" "authority" "fairness")
# For political_data_with_age dataset: ("age" "gender" "party")
# For jigsaw datasets: ("hate")
target_labels=("hate")

gpus=(0 1 2 3 4 5 6 7)

# wandb config
wandb_entity="YOUR_WANDB_ENTITY"
wandb_project="YOUR_WANDB_PROJECT_NAME"

# Dataset is either: `political_data_with_age,` `facebook,` `jigsaw_5,` or `jigsaw_10`
dataset="jigsaw_10"

# For `facebook` and `political_data_with_age` datasets: "text"
# For `jigsaw` datasets: "comment_text"
data_embedded_column_name="comment_text"

# Possible values: "vanilla" or "ensemble". Keep in mind before running the script with rl_task_model="ensemble" you should run the `only_ensemble` setting first.
rl_task_model="vanilla"

# These are sampling strategies for selecting $|b_i|$ instances with possible values: "static" "with_replacement" "without_replacement"
sample_algorithm="without_replacement"

After setting the variables, there are three different variations of the script that you can run:

Regular RL-MIL with autoencoder component, different rl_task_model and different sample_algorithms. Uncomment the following section in the script and comment the other sections:

# different RL models running
SESSION_NAME="${gpu}_${sample_algorithm}_${dataset}_${target_label}_${baseline_type}"
screen -dmS "$SESSION_NAME" bash -c "
    CUDA_VISIBLE_DEVICES=$gpu python3 run_rlmil.py --rl --baseline $baseline_type \
                                    --autoencoder_layer_sizes $autoencoder_layer_sizes \
                                    --label $target_label \
                                    --data_embedded_column_name $data_embedded_column_name \
                                    --prefix $prefix \
                                    --dataset $dataset \
                                    --bag_size $bag_size \
                                    --batch_size 32 \
                                    --run_sweep \
                                    --embedding_model $embedding_model \
                                    --train_pool_size 1 --eval_pool_size 10 --test_pool_size 10 \
                                    --balance_dataset \
                                    --wandb_entity $wandb_entity \
                                    --wandb_project $wandb_project \
                                    --random_seed 0 \
                                    --task_type $task_type \
                                    --rl_model $rl_model \
                                    --search_algorithm $search_algorithm \
                                    --rl_task_model $rl_task_model \
                                    --sample_algorithm $sample_algorithm \
                                    --reg_alg $reg_alg ;
    exit"

# Comment all other sections below this

RL-MIL without autoencoder component:

# Comment the section above this

# different RL models without autoencoder
SESSION_NAME="${gpu}_noauto_${sample_algorithm}_${dataset}_${target_label}_${baseline_type}"
screen -dmS "$SESSION_NAME" bash -c "
    CUDA_VISIBLE_DEVICES=$gpu python3 run_rlmil.py --rl --baseline $baseline_type \
                                    --autoencoder_layer_sizes $autoencoder_layer_sizes \
                                    --label $target_label \
                                    --data_embedded_column_name $data_embedded_column_name \
                                    --prefix $prefix \
                                    --dataset $dataset \
                                    --bag_size $bag_size \
                                    --batch_size 32 \
                                    --run_sweep \
                                    --embedding_model $embedding_model \
                                    --train_pool_size 1 --eval_pool_size 10 --test_pool_size 10 \
                                    --balance_dataset \
                                    --wandb_entity $wandb_entity \
                                    --wandb_project $wandb_project \
                                    --random_seed 0 \
                                    --task_type $task_type \
                                    --rl_model $rl_model \
                                    --search_algorithm $search_algorithm \
                                    --rl_task_model $rl_task_model \
                                    --sample_algorithm $sample_algorithm \
                                    --reg_alg $reg_alg  \
                                    --no_autoencoder_for_rl ;
    exit"

# Comment the section below this

Run RL-MIL ensemble only:

# Comment all sections above this

# only ensemble running
SESSION_NAME="${gpu}_ens_${sample_algorithm}_${dataset}_${target_label}_${baseline_type}"
screen -dmS "$SESSION_NAME" bash -c "
CUDA_VISIBLE_DEVICES=$gpu python3 run_rlmil.py --rl --baseline $baseline_type \
                                                                --autoencoder_layer_sizes $autoencoder_layer_sizes \
                                                                --label $target_label \
                                                                --data_embedded_column_name $data_embedded_column_name \
                                                                --prefix $prefix \
                                                                --dataset $dataset \
                                                                --bag_size $bag_size \
                                                                --batch_size 32 \
                                                                --run_sweep \
                                                                --embedding_model $embedding_model \
                                                                --train_pool_size 1 --eval_pool_size 10 --test_pool_size 10 \
                                                                --balance_dataset \
                                                                --wandb_entity $wandb_entity \
                                                                --wandb_project $wandb_project \
                                                                --random_seed 0 \
                                                                --task_type $task_type \
                                                                --only_ensemble \
                                                                --rl_model $rl_model \
                                                                --sample_algorithm $sample_algorithm ;
                                        exit"

After setting the variables, and commenting/uncommenting the sections, you can run the script as follows:

cd scripts
./run_rlmil.sh

`scripts/run_rlmil_multiple.sh`

If you want to run the RL-MIL experiments multiple times with different seeds, you can use the shell script scripts/run_rlmil_multiple.sh. Here are the variables that you can set: (All of the variables have the same description as above)

datasets=("political_data_with_age" "facebook" "jigsaw_5" "jigsaw_10")

baselines=("MeanMLP" "MaxMLP" "AttentionMLP" "repset")

random_seeds=(0 1 2 3 4 5 6 7)
gpus=(0 1 2 3 4 5 6 7)

After setting the variables, you can run the script as follows:

cd scripts
./run_rlmil_multiple.sh

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{salkhordeh-ziabari-etal-2024-reinforced,
    title = "Reinforced Multiple Instance Selection for Speaker Attribute Prediction",
    author = "Salkhordeh Ziabari, Alireza  and
      Omrani, Ali  and
      Hejabi, Parsa  and
      Golazizian, Preni  and
      Kennedy, Brendan  and
      Piray, Payam  and
      Dehghani, Morteza",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.181",
    doi = "10.18653/v1/2024.naacl-long.181",
    pages = "3307--3321",
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
create_dataset		create_dataset
notebooks		notebooks
scripts		scripts
yaml_configs		yaml_configs
.gitignore		.gitignore
README.md		README.md
RLMIL_Datasets.py		RLMIL_Datasets.py
aggregate_results.py		aggregate_results.py
configs.py		configs.py
figure2.png		figure2.png
logger.py		logger.py
models.py		models.py
multiple_run_mil.py		multiple_run_mil.py
multiple_run_rlmil.py		multiple_run_rlmil.py
prepare_data.py		prepare_data.py
requirements.txt		requirements.txt
run_baseline.py		run_baseline.py
run_mil.py		run_mil.py
run_rlmil.py		run_rlmil.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RL-MIL-NAACL2024

Overview

Prerequisites

Setup Environment

Datasets

Synthetic Data

Congressional Speech Data

Facebook Data

Running the Experiments

Script for data preparation

Scripts for running the MIL experiments

`scripts/run_mil.sh`

`scripts/run_mil_multiple.sh`

Scripts for running the baseline experiments

`scripts/run_baselines.sh`

Scripts for running the Reinforced MIL experiments

`scripts/run_rlmil.sh`

`scripts/run_rlmil_multiple.sh`

Citations

About

Releases

Packages

Languages

AlirezaZiabari/RL-MIL

Folders and files

Latest commit

History

Repository files navigation

RL-MIL-NAACL2024

Overview

Prerequisites

Setup Environment

Datasets

Synthetic Data

Congressional Speech Data

Facebook Data

Running the Experiments

Script for data preparation

Scripts for running the MIL experiments

scripts/run_mil.sh

scripts/run_mil_multiple.sh

Scripts for running the baseline experiments

scripts/run_baselines.sh

Scripts for running the Reinforced MIL experiments

scripts/run_rlmil.sh

scripts/run_rlmil_multiple.sh

Citations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

`scripts/run_mil.sh`

`scripts/run_mil_multiple.sh`

`scripts/run_baselines.sh`

`scripts/run_rlmil.sh`

`scripts/run_rlmil_multiple.sh`

Packages