Concept Frequency Estimation

This folder contains the code for estimating the frequency of concepts in an image-text dataset. The code was written at a very early stage of this project, so it is neither well organized nor optimized. We provide the results in the metadata/freqs folder for CC-12M, YFCC-15M, LAION-400M, LAION-2B, and MetaCLIP. For future use, we recommend treating this code as a reference only and considering more recent work in this direction, e.g., MetaCLIP, NeglectedTailsVLM, and frequency_determines_performance.

Requirements

nltk is required for word tokenization and lemmatization, pandas for loading image-text dataset metadata, and tqdm for progress bars. You can install them via pip. Processing large-scale datasets also requires many CPU cores and a large amount of memory, so we recommend using a high-performance server. Be aware that saving the tokenized intermediate results can take up a lot of disk space.

Usage

Download the metadata of image-text datasets containing the captions, e.g., CC-12M, YFCC-15M, LAION-400M, LAION-2B, MetaCLIP-400M, and MetaCLIP-2.5B.

The following command calculates the frequency of ImageNet classes in the LAION-400M dataset. To process another dataset, replace the --url_path value with the path to that dataset's metadata file. Use --dataset to indicate which dataset the concepts of interest come from, and --caption_col to name the column that holds the captions (TEXT in this example). --input_format specifies the format of the metadata file, which can be tsv, parquet, or json.

python calc_class_frequency.py \
--url_path ../datasets/laion/laion400m-meta \
--input_format parquet \
--caption_col TEXT \
--dataset imagenet # which dataset the concepts of interest are from
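
To make concrete what "frequency of a class" means here, the following is a minimal Python sketch and not the code in calc_class_frequency.py. It assumes the frequency of a class is the number of captions that mention the class name, and it replaces the nltk tokenization and lemmatization with plain case-insensitive whole-word matching. The function name count_class_mentions and the toy captions are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def count_class_mentions(captions, class_names):
    """Count, for each class name, how many captions mention it.

    Simplification (assumption): case-insensitive whole-word matching
    stands in for nltk tokenization and lemmatization, so inflected
    forms such as plurals are not matched here.
    """
    # One compiled pattern per class; \b prevents matches inside longer words.
    patterns = {
        name: re.compile(r"\b" + re.escape(name.lower()) + r"\b")
        for name in class_names
    }
    counts = Counter({name: 0 for name in class_names})
    for caption in captions:
        text = caption.lower()
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1  # each caption counts at most once per class
    return counts


# Toy captions standing in for the caption column of a metadata file.
captions = [
    "A tabby cat sleeping on a sofa",
    "Category theory lecture notes",
    "Golden retriever and a tabby cat playing",
    "Golden retriever puppy",
]
freqs = count_class_mentions(
    captions, ["tabby cat", "golden retriever", "goldfish"]
)
print(freqs)
```

On real metadata the captions would come from the column given by --caption_col, and lemmatization would additionally merge inflected forms of the same concept.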

For MetaCLIP, run the following command instead, because the authors have already provided a file of concept frequencies.

python calc_class_frequency_metaclip.py \
--json_path ../datasets/MetaCLIP/metaclip/datacard_400m.json \
--dataset imagenet

We also provide code for calculating the frequency of words instead of class names. See calc_word_frequency.py for details; its usage is similar to the commands above.
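
As a rough illustration of word-level counting, the sketch below tallies word frequencies over captions in chunks and merges the partial counts, which is one way to keep memory bounded on large caption lists. It is not taken from calc_word_frequency.py: the regex tokenizer, the function names, and the chunk size are assumptions, and no lemmatization is applied.

```python
import re
from collections import Counter

# Assumption: a lowercase alphabetic regex stands in for nltk tokenization.
TOKEN_RE = re.compile(r"[a-z]+")


def word_counts(captions):
    """Count word occurrences across a list of captions."""
    counts = Counter()
    for caption in captions:
        counts.update(TOKEN_RE.findall(caption.lower()))
    return counts


def word_counts_chunked(captions, chunk_size=2):
    """Count words chunk by chunk and merge the partial counters."""
    total = Counter()
    for start in range(0, len(captions), chunk_size):
        chunk = captions[start:start + chunk_size]
        total.update(word_counts(chunk))  # add the partial counts to the total
    return total


captions = ["A dog and a cat", "The dog barks", "Cats and dogs"]
word_freqs = word_counts_chunked(captions)
print(word_freqs.most_common(3))
```

Without lemmatization, "cat" and "cats" are counted as separate words, which is the kind of split that the nltk lemmatization step is meant to avoid.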
