For a technical description of the algorithm, please see our paper:
ERNIE-Doc: A Retrospective Long-Document Modeling Transformer
Siyu Ding*, Junyuan Shang*, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (* : equal contribution)
Preprint, December 2020; accepted by ACL 2021
ERNIE-Doc is a document-level language pretraining model. Two well-designed techniques, the retrospective feed mechanism and the enhanced recurrence mechanism, give ERNIE-Doc a much longer effective context length and enable it to capture the contextual information of a complete document. ERNIE-Doc improves the state-of-the-art perplexity on WikiText-103 language modeling to 16.8. Moreover, it outperforms competitive pretraining models by a large margin on most language understanding tasks, such as text classification, question answering, information extraction, and semantic matching.
We propose three novel methods to enhance the long-document modeling ability of Transformers:
- Retrospective Feed Mechanism: Inspired by the human reading behavior of skimming a document first and then looking back at it attentively, we design a retrospective feed mechanism in which the segments of a document are fed twice as input. As a result, each segment in the retrospective phase can explicitly fuse the semantic information of the entire document learned in the skimming phase, which prevents context fragmentation.
- Enhanced Recurrence Mechanism: a drop-in replacement for the recurrence of Recurrence Transformers (such as Transformer-XL) that changes the shifting-one-layer-downwards recurrence to same-layer recurrence. In this manner, the maximum effective context length can be expanded, and past higher-level representations can be exploited to enrich future lower-level representations. Both this and the retrospective feed mechanism are sketched in code after this list.
- Segment-reordering Objective: a document-aware pretraining task of predicting the correct order of the permuted segments of a document, to model the relationships among segments directly. This allows ERNIE-Doc to build full document representations for prediction; a toy example of this objective is given below.
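The first two mechanisms can be summarized in a few lines of pseudocode. The sketch below is illustrative only: `embed` and `transformer_layer` are hypothetical placeholders for the embedding lookup and a Transformer block that can attend over cached memory, not the released API.

```python
def encode_document(segments, layers, embed, transformer_layer):
    """Minimal sketch of the retrospective feed + enhanced recurrence.

    `embed` and `transformer_layer` are hypothetical placeholders; the
    real implementation lives in the released training code.
    """
    mems = [None] * len(layers)   # one memory slot per layer
    outputs = []
    # Retrospective feed: every segment is fed twice, first to skim the
    # document, then to re-read it with whole-document memories available.
    for phase in ("skim", "retrospective"):
        for seg in segments:
            h = embed(seg)
            new_mems = []
            for i, layer in enumerate(layers):
                # Enhanced recurrence (same-layer): layer i attends to the
                # previous segment's layer-i OUTPUT. Transformer-XL instead
                # caches layer i's input (layer i-1's output), which shifts
                # the memory one layer downwards and caps the effective
                # context at roughly num_layers * segment_length tokens.
                h = transformer_layer(layer, h, memory=mems[i])
                new_mems.append(h)  # stop-gradient applied in actual training
            mems = new_mems
            if phase == "retrospective":
                outputs.append(h)
    return outputs
```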
Figure: Illustrations of ERNIE-Doc and Recurrence Transformers, where models with three layers take as input a long document sliced into four segments.
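To make the segment-reordering objective above concrete, the following toy example builds one training instance. The permutation-scoring head is omitted, and all names are illustrative:

```python
import random

def make_reordering_example(segments, rng=random):
    """Permute a document's segments; the model must recover the order."""
    order = list(range(len(segments)))
    rng.shuffle(order)
    shuffled = [segments[i] for i in order]
    # `order` is the label: the model is trained to classify which
    # permutation of the segments restores the original document.
    return shuffled, order

shuffled, label = make_reordering_example(["seg-0", "seg-1", "seg-2"])
# e.g. shuffled = ["seg-2", "seg-0", "seg-1"], label = [2, 0, 1]
```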
We release the checkpoints for the ERNIE-Doc base_en/zh and ERNIE-Doc large_en models.
- ERNIE-Doc base_en (12-layer, 768-hidden, 12-heads)
- ERNIE-Doc base_zh (12-layer, 768-hidden, 12-heads)
- ERNIE-Doc large_en (24-layer, 1024-hidden, 16-heads)
We compare the performance of ERNIE-Doc with existing SOTA pretraining models (such as Longformer, BigBird, ETC and ERNIE 2.0) on language modeling (WikiText-103) and document-level natural language understanding tasks, including long-text classification (IMDB, HYP, THUCNews, IFLYTEK), question answering (TriviaQA, HotpotQA, DRCD, CMRC2018, DuReader, C3), information extraction (OpenKPE) and semantic matching (CAIL2019-SCM).
- Language modeling on WikiText-103 (perplexity, lower is better)

Model | Param. | PPL |
---|---|---|
Results of base models | ||
LSTM | - | 48.7 |
LSTM+Neural cache | - | 40.8 |
GCNN-14 | - | 37.2 |
QRNN | 151M | 33.0 |
Transformer-XL Base | 151M | 24.0 |
SegaTransformer-XL Base | 151M | 22.5 |
ERNIE-Doc Base | 151M | 21.0 |
Results of large models | ||
Adaptive Input | 247M | 18.7 |
Transformer-XL Large | 247M | 18.3 |
Compressive Transformer | 247M | 17.1 |
SegaTransformer-XL Large | 247M | 17.1 |
ERNIE-Doc Large | 247M | 16.8 |
- Long-text classification on IMDB

Models | Acc. | F1 |
---|---|---|
RoBERTa | 95.3 | 95.0 |
Longformer | 95.7 | - |
BigBird | - | 95.2 |
ERNIE-Doc Base | 96.1 | 96.1 |
XLNet-Large | 96.8 | - |
ERNIE-Doc Large | 97.1 | 97.1 |
- Long-text classification on HYP (Hyperpartisan News Detection)

Models | F1 |
---|---|
RoBERTa | 87.8 |
Longformer | 94.8 |
BigBird | 92.2 |
ERNIE-Doc Base | 96.3 |
ERNIE-Doc Large | 96.6 |
- Chinese long-text classification on THUCNews (THU) and IFLYTEK (IFK)

Models | THU Dev (Acc.) | THU Test (Acc.) | IFK Dev (Acc.) |
---|---|---|---|
BERT | 97.7 | 97.3 | 60.3 |
BERT-wwm-ext | 97.6 | 97.6 | 59.4 |
RoBERTa-wwm-ext | - | - | 60.3 |
ERNIE 1.0 | 97.7 | 97.3 | 59.0 |
ERNIE 2.0 | 98.0 | 97.5 | 61.7 |
ERNIE-Doc | 98.3 | 97.7 | 62.4 |
- TriviaQA on the dev set
Models | F1 |
---|---|
RoBERTa | 74.3 |
Longformer | 75.2 |
BigBird | 79.5 |
ERNIE-Doc Base | 80.1 |
Longformer Large | 77.8 |
BigBird Large | - |
ERNIE-Doc Large | 82.5 |
- HotpotQA on the dev set
Models | Span-F1 | Supp.-F1 | Joint-F1 |
---|---|---|---|
RoBERTa | 73.5 | 83.4 | 63.5 |
Longformer | 74.3 | 84.4 | 64.4 |
BigBird | 75.5 | 87.1 | 67.8 |
ERNIE-Doc Base | 79.4 | 86.3 | 70.5 |
Longformer Large | 81.0 | 85.8 | 71.4 |
BigBird Large | 81.3 | 89.4 | - |
ERNIE-Doc Large | 82.2 | 87.6 | 73.7 |
- Chinese question answering on DRCD, CMRC2018, DuReader and C3

Models | DRCD Dev (EM/F1) | DRCD Test (EM/F1) | CMRC2018 Dev (EM/F1) | DuReader Dev (EM/F1) | C3 Dev (Acc.) | C3 Test (Acc.) |
---|---|---|---|---|---|---|
BERT | 85.7/91.6 | 84.9/90.9 | 66.3/85.9 | 59.5/73.1 | 65.7 | 64.5 |
BERT-wwm-ext | 85.0/91.2 | 83.6/90.4 | 67.1/85.7 | -/- | 67.8 | 68.5 |
RoBERTa-wwm-ext | 86.6/92.5 | 85.2/92.0 | 67.4/87.2 | -/- | 67.1 | 66.5 |
MacBERT | 88.3/93.5 | 87.9/93.2 | 69.5/87.7 | -/- | - | - |
XLNet-zh | 83.2/92.0 | 82.8/91.8 | 63.0/85.9 | -/- | - | - |
ERNIE 1.0 | 84.6/90.9 | 84.0/90.5 | 65.1/85.1 | 57.9/72.1 | 65.5 | 64.1 |
ERNIE 2.0 | 88.5/93.8 | 88.0/93.4 | 69.1/88.6 | 61.3/74.9 | 72.3 | 73.2 |
ERNIE-Doc | 90.5/95.2 | 90.5/95.1 | 76.1/91.6 | 65.8/77.9 | 76.5 | 76.5 |
- Information extraction on OpenKPE

Models | F1@1 | F1@3 | F1@5 |
---|---|---|---|
BLING-KPE | 26.7 | 29.2 | 20.9 |
JointKPE | 39.1 | 39.8 | 33.8 |
ETC | - | 40.2 | - |
ERNIE-Doc | 40.2 | 40.5 | 34.4 |
- Semantic matching on CAIL2019-SCM

Models | Dev (Acc.) | Test (Acc.) |
---|---|---|
BERT | 61.9 | 67.3 |
ERNIE 2.0 | 64.9 | 67.9 |
ERNIE-Doc | 65.6 | 68.8 |
This code base has been tested with PaddlePaddle (version >= 2.0) under Python 3. The other dependencies of ERNIE-Doc are listed in requirements.txt and can be installed by

pip install -r requirements.txt
We release the finetuning code for English and Chinese classification tasks and Chinese question answering tasks. For example, you can finetune the ERNIE-Doc base model on the IMDB, IFLYTEK, and DuReader datasets by
sh script/run_imdb.sh
sh script/run_iflytek.sh
sh script/run_dureader.sh
Preprocessing code for the IMDB dataset is also provided.
The training log and evaluation results are written to log/job.log.0.
Notice: the actual total batch size is equal to the configured batch size * the number of GPUs used.
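As a worked example of this note (all values are illustrative, not the defaults of the released scripts):

```python
configured_batch_size = 8    # per-GPU batch size set in the finetuning script
num_gpus = 4                 # GPUs used by the training job
total_batch_size = configured_batch_size * num_gpus
print(total_batch_size)      # 32 samples per optimization step
```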
You can cite the paper as below:
@article{ding2020ernie,
  title={ERNIE-Doc: A Retrospective Long-Document Modeling Transformer},
  author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2012.15688},
  year={2020}
}