Dataset used to evaluate Skill Extraction systems based on the ESCO skills taxonomy.

jensjorisdecorte/Skill-Extraction-benchmark

Skill Extraction: benchmarks

This is the official repository containing three skill extraction datasets, from the following papers:

  1. Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction
  2. Extreme Multi-Label Skill Extraction Training using Large Language Models

Dataset description

The TECH and HOUSE subsets form an extension of the SkillSpan [1] dataset, in which spans of skill mentions in sentences have been labeled with corresponding ESCO [2] skills.

The TECHWOLF subset, although smaller, represents a more generic distribution of job descriptions and skill spans. ESCO skills are directly annotated on the full sentence level, thus omitting the intermediate span identification step.

The ESCO skills in the dataset are referenced by their preferred label in ESCO version 1.1.0.
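The two annotation styles described above can be pictured with a small data-structure sketch. This is only an illustration: the class names, field names, and the example sentence are hypothetical and do not reflect the actual file schema of the datasets. It assumes a span may carry zero or more ESCO preferred labels, and shows how span-level annotations (TECH/HOUSE style) could be collapsed into sentence-level labels (TECHWOLF style).

```python
from dataclasses import dataclass, field


@dataclass
class SpanAnnotation:
    """One skill mention inside a sentence (hypothetical schema)."""
    text: str                                              # span text as it appears in the sentence
    esco_labels: list[str] = field(default_factory=list)   # ESCO preferred labels; empty if unlabeled


@dataclass
class Sentence:
    """A sentence with its span-level annotations (hypothetical schema)."""
    text: str
    spans: list[SpanAnnotation] = field(default_factory=list)


def sentence_level_labels(sentence: Sentence) -> set[str]:
    """Collapse span annotations into one set of sentence-level ESCO labels."""
    return {label for span in sentence.spans for label in span.esco_labels}


# Illustrative example only; not taken from the datasets.
example = Sentence(
    text="You will write Python services and mentor junior colleagues.",
    spans=[
        SpanAnnotation("write Python services", ["Python (computer programming)"]),
        SpanAnnotation("mentor junior colleagues"),  # a span without an ESCO label
    ],
)
labels = sentence_level_labels(example)
print(labels)
```

Representing the labels as a set makes it straightforward to compare predictions with annotations at the sentence level, regardless of which spans produced them.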

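| Dataset statistics | TECH (val) | TECH (test) | HOUSE (val) | HOUSE (test) | TECHWOLF (test) |
| --- | --- | --- | --- | --- | --- |
| # sentences | 470 | 1882 | 243 | 973 | 326 |
| # spans | 262 | 1024 | 191 | 786 | 588 |
| # spans with ESCO label | 152 | 644 | 131 | 532 | 588 |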
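The share of spans that carry an ESCO label follows directly from the counts above. The short sketch below computes it per subset and split; the variable and function names are illustrative, and the numbers are copied from the statistics.

```python
# (number of spans, number of spans with an ESCO label), copied from the statistics above.
SPAN_COUNTS = {
    ("TECH", "val"): (262, 152),
    ("TECH", "test"): (1024, 644),
    ("HOUSE", "val"): (191, 131),
    ("HOUSE", "test"): (786, 532),
    ("TECHWOLF", "test"): (588, 588),
}


def labeled_share(n_spans: int, n_labeled: int) -> float:
    """Fraction of spans that have an ESCO label."""
    return n_labeled / n_spans


shares = {key: labeled_share(*counts) for key, counts in SPAN_COUNTS.items()}

for (subset, split), share in shares.items():
    print(f"{subset:<9} {split:<5} {share:.1%}")
```

In TECH and HOUSE, roughly 58% to 69% of the spans have an ESCO label, whereas every TECHWOLF span is labeled.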

Usage

We recommend using the HuggingFace versions of the datasets for ease of use.

The raw dataset files are also available in the data directory.
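As a rough sketch of the HuggingFace route, the helper below wraps `datasets.load_dataset`. The function `load_subset` and the `AVAILABLE_SPLITS` mapping are hypothetical helpers written for this example. The Hub repository ID is not given here, so it is left as a parameter for the caller to supply, and the split names on the Hub are assumed to match those in the statistics, which may not be the case.

```python
# Splits listed in the dataset statistics for each subset.
AVAILABLE_SPLITS = {
    "TECH": ("val", "test"),
    "HOUSE": ("val", "test"),
    "TECHWOLF": ("test",),
}


def load_subset(repo_id: str, subset: str, split: str):
    """Load one split of one subset from the Hugging Face Hub.

    `repo_id` must be supplied by the caller (assumption: one Hub repository per subset).
    `split` is passed through unchanged; the Hub may use different split names.
    """
    if split not in AVAILABLE_SPLITS.get(subset, ()):
        raise ValueError(f"Subset {subset!r} has no split {split!r}")
    # Imported lazily so the check above works without the `datasets` package installed.
    from datasets import load_dataset
    return load_dataset(repo_id, split=split)
```

For example, `load_subset("<hub-repo-id>", "TECHWOLF", "test")` would return the test split once the placeholder is replaced with the actual repository ID, while requesting a `val` split for TECHWOLF raises a `ValueError` because that subset only has a test split.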

Cite

If you use the TECH or HOUSE dataset, please include the following reference:

@inproceedings{8770980,
  articleno    = {{4}},
  author       = {{Decorte, Jens-Joris and Van Hautte, Jeroen and Deleu, Johannes and Develder, Chris and Demeester, Thomas}},
  booktitle    = {{Proceedings of the 2nd Workshop on Recommender Systems for Human Resources (RecSys-in-HR 2022)}},
  editor       = {{Kaya, Mesut and Bogers, Toine and Graus, David and Mesbah, Sepideh and Johnson, Chris and Gutiérrez, Francisco}},
  isbn         = {{9781450398565}},
  issn         = {{1613-0073}},
  language     = {{eng}},
  location     = {{Seattle, USA}},
  pages        = {{7}},
  publisher    = {{CEUR}},
  title        = {{Design of negative sampling strategies for distantly supervised skill extraction}},
  url          = {{https://ceur-ws.org/Vol-3218/RecSysHR2022-paper_4.pdf}},
  volume       = {{3218}},
  year         = {{2022}},
}

If you use the TECHWOLF dataset, please include the following reference:

@misc{decorte2023extrememultilabelskillextraction,
      title={Extreme Multi-Label Skill Extraction Training using Large Language Models}, 
      author={Jens-Joris Decorte and Severine Verlinden and Jeroen Van Hautte and Johannes Deleu and Chris Develder and Thomas Demeester},
      year={2023},
      eprint={2307.10778},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2307.10778}, 
}

Reference

[1] Zhang, Mike, et al. "SkillSpan: Hard and soft skill extraction from English job postings." arXiv preprint arXiv:2204.12811 (2022).

[2] https://esco.ec.europa.eu/en/classification/skill_main
