This project contains the Comet-atomic 2020 model source code modified for the Slovenian language.
Before starting the project, make sure these requirements are available:
- python, for executing the code in this project.
- git, for versioning your code.
- dvc, for versioning your data (part of the project requirements).
First, create a virtual environment in which the project will store all of its modules. Using the virtualenv command, run the following:

```bash
# install the virtualenv command
pip install virtualenv

# create a new virtual environment
virtualenv -p python ./.venv

# activate the environment (UNIX)
source ./.venv/bin/activate

# activate the environment (WINDOWS)
./.venv/Scripts/activate

# deactivate the environment (UNIX & WINDOWS)
deactivate
```
Alternatively, install conda, a program for creating Python virtual environments, and run the following commands:

```bash
# create a new virtual environment
conda create --name slomet2020 python=3.8 pip

# activate the environment
conda activate slomet2020

# deactivate the environment
deactivate
```
To install the requirements, run:

```bash
pip install -e .
```
To get the data reach out to the project's maintainer.
NOTE: The data will be made publicly available. Stay tuned for more!
To run the experiments, run the following commands:

```bash
# model training script
python scripts/train_comet_gpt2.py \
    --train_data_path=./data/atomic_train.tsv \
    --valid_data_path=./data/atomic_dev.tsv \
    --models_dir_path=./models

# model testing script
python scripts/test_comet_gpt2.py \
    --test_data_path=./data/atomic_test.tsv \
    --models_dir_path=./models/checkpoint_latest \
    --results_dir_path=./results

# model evaluation script
python scripts/eval_comet_gpt2.py \
    --pred_file_path=./results/pred_generations.jsonl
```
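The data files passed to these scripts are tab-separated knowledge triples. As an illustrative sketch (the head/relation/tail column order is an assumption, not confirmed by this README), a single record could be parsed like this:

```python
import csv
import io

# hypothetical ATOMIC-style record; the head/relation/tail column order
# is assumed here for illustration only
sample = "PersonX gre v trgovino\txNeed\timeti denar\n"

# parse the tab-separated line into its three fields
head, relation, tail = next(csv.reader(io.StringIO(sample), delimiter="\t"))
print(head, relation, tail)
```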
An alternative way of running the whole experiment is by using DVC. To do this, simply run:

```bash
dvc exp run
```

This command reads the dvc.yaml file and executes the stages accordingly, taking any dependencies into consideration.
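For orientation, a DVC stage typically looks like the sketch below. The actual stage names, dependencies, and outputs are defined in this repository's dvc.yaml, so treat this fragment as an illustrative assumption rather than the project's real pipeline:

```yaml
stages:
  train:
    cmd: >-
      python scripts/train_comet_gpt2.py
      --train_data_path=./data/atomic_train.tsv
      --valid_data_path=./data/atomic_dev.tsv
      --models_dir_path=./models
    deps:
      - data/atomic_train.tsv
      - data/atomic_dev.tsv
      - scripts/train_comet_gpt2.py
    outs:
      - models
```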
The results folder contains both the files used for evaluating the generations and the evaluation results. The file results/pred_generations_gens_scores.jsonl shows the performance of the model based on various metrics.
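Because the scores file is in JSON-lines format, each line can be read independently. The field names in this sketch ("bleu1", "rougeL") are illustrative assumptions, not the file's actual schema:

```python
import json

# one illustrative line of a JSON-lines scores file;
# the keys "bleu1" and "rougeL" are assumed, not the real schema
sample_line = '{"bleu1": 0.324, "rougeL": 0.397}'

scores = json.loads(sample_line)
print(scores["rougeL"])
```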
The table below shows the performances of the commonsense models trained using the corresponding language model and language data set.
Language Model | Language | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | METEOR | ROUGE-L |
---|---|---|---|---|---|---|---|---|
macedonizer/sl-gpt2 | Slovene | 0.297 | 0.150 | 0.086 | 0.058 | 0.487 | 0.207 | 0.383 |
gpt-janez | Slovene | 0.324 | 0.174 | 0.108 | 0.076 | 0.508 | 0.225 | 0.397 |
COMET(GPT2-XL) | English | 0.407 | 0.248 | 0.171 | 0.124 | 0.653 | 0.292 | 0.485 |
This project supports the following models:
- gpt-janez
- macedonizer/sl-gpt2
When the model is trained, use the snippet below to load the model and tokenizer:

```python
# import the GPT2 modules from huggingface/transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# define the directory path that contains the model data
MODEL_DIR_PATH = "./models/checkpoint_latest"

# initialize the model and tokenizer from the trained checkpoint
model = GPT2LMHeadModel.from_pretrained(MODEL_DIR_PATH)
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR_PATH)
```
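Once loaded, the model is queried with a head event and a relation. COMET-ATOMIC 2020 appends a [GEN] marker to the prompt; whether this project uses the same marker is an assumption, so the helper below is only a sketch of prompt assembly:

```python
# build a COMET-style generation prompt: head event + relation + marker;
# the "[GEN]" token follows COMET-ATOMIC 2020 and is assumed here
def build_prompt(head: str, relation: str, gen_token: str = "[GEN]") -> str:
    return f"{head} {relation} {gen_token}"

prompt = build_prompt("PersonX gre v trgovino", "xNeed")
print(prompt)  # → PersonX gre v trgovino xNeed [GEN]
```

The resulting prompt would then be tokenized with the tokenizer and passed to model.generate to produce tail candidates.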
This work is based on the paper:

(Comet-) Atomic 2020: On Symbolic and Neural Commonsense Knowledge Graphs.
Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, Yejin Choi.
AAAI Conference on Artificial Intelligence, 2021.

TODO
- Setup script
- Folder structure
- Code for model training
- Code for model prediction
- Code for model evaluation
- Add support for 3rd party models (outside huggingface)
- Add params.yaml and modify the scripts to read the params from the file
- Add DVC pipelines for model training and evaluation
- Add scripts for storing and retrieving the data set
This work was developed by the Department of Artificial Intelligence at the Jozef Stefan Institute.
The work is supported by the Slovenian Research Agency and the RSDO project.