For a technical description of the algorithm, please see our paper:
ERNIE-Doc: A Retrospective Long-Document Modeling Transformer
Siyu Ding*, Junyuan Shang*, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (* : equal contribution)
Preprint, December 2020; accepted by ACL 2021
ERNIE-Doc is a document-level language pretraining model. Two well-designed techniques, the retrospective feed mechanism and the enhanced recurrence mechanism, give ERNIE-Doc a much longer effective context length and enable it to capture the contextual information of a complete document. ERNIE-Doc improves the state-of-the-art perplexity on WikiText-103 language modeling to 16.8. Moreover, it outperforms competitive pretraining models by a large margin on most language understanding tasks, such as text classification, question answering, information extraction, and semantic matching.
We propose three novel methods to enhance the long-document modeling ability of Transformers:
- Retrospective Feed Mechanism: Inspired by the human reading behavior of skimming a document first and then looking back at it attentively, we design a retrospective feed mechanism in which the segments of a document are fed twice as input. As a result, each segment in the retrospective phase can explicitly fuse the semantic information of the entire document learned in the skimming phase, which prevents context fragmentation.
- Enhanced Recurrence Mechanism: a drop-in replacement for the recurrence of Recurrence Transformers (such as Transformer-XL) that changes the shifting-one-layer-downwards recurrence to same-layer recurrence. In this manner, the maximum effective context length can be expanded, and past higher-level representations can be exploited to enrich future lower-level representations. Both this and the retrospective feed mechanism are sketched in code after this list.
- Segment-reordering Objective: a document-aware pretraining task of predicting the correct order of the permuted segments of a document, to model the relationships among segments directly. This allows ERNIE-Doc to build full document representations for prediction; a toy example of this objective is given below.
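The first two mechanisms can be summarized in a few lines of pseudocode. The sketch below is illustrative only: `embed` and `transformer_layer` are hypothetical placeholders for the embedding lookup and a Transformer block that can attend over cached memory, not the released API.

```python
def encode_document(segments, layers, embed, transformer_layer):
    """Minimal sketch of the retrospective feed + enhanced recurrence.

    `embed` and `transformer_layer` are hypothetical placeholders; the
    real implementation lives in the released training code.
    """
    mems = [None] * len(layers)   # one memory slot per layer
    outputs = []
    # Retrospective feed: every segment is fed twice, first to skim the
    # document, then to re-read it with whole-document memories available.
    for phase in ("skim", "retrospective"):
        for seg in segments:
            h = embed(seg)
            new_mems = []
            for i, layer in enumerate(layers):
                # Enhanced recurrence (same-layer): layer i attends to the
                # previous segment's layer-i OUTPUT. Transformer-XL instead
                # caches layer i's input (layer i-1's output), which shifts
                # the memory one layer downwards and caps the effective
                # context at roughly num_layers * segment_length tokens.
                h = transformer_layer(layer, h, memory=mems[i])
                new_mems.append(h)  # stop-gradient applied in actual training
            mems = new_mems
            if phase == "retrospective":
                outputs.append(h)
    return outputs
```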
Figure: Illustrations of ERNIE-Doc and Recurrence Transformers, where models with three layers take as input a long document sliced into four segments.
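To make the segment-reordering objective above concrete, the following toy example builds one training instance. The permutation-scoring head is omitted, and all names are illustrative:

```python
import random

def make_reordering_example(segments, rng=random):
    """Permute a document's segments; the model must recover the order."""
    order = list(range(len(segments)))
    rng.shuffle(order)
    shuffled = [segments[i] for i in order]
    # `order` is the label: the model is trained to classify which
    # permutation of the segments restores the original document.
    return shuffled, order

shuffled, label = make_reordering_example(["seg-0", "seg-1", "seg-2"])
# e.g. shuffled = ["seg-2", "seg-0", "seg-1"], label = [2, 0, 1]
```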
We release the checkpoints for the ERNIE-Doc base_en/zh and ERNIE-Doc large_en models.
- ERNIE-Doc base_en (12-layer, 768-hidden, 12-heads)
- ERNIE-Doc base_zh (12-layer, 768-hidden, 12-heads)
- ERNIE-Doc large_en (24-layer, 1024-hidden, 16-heads)
We compare the performance of ERNIE-Doc with existing SOTA pretraining models (such as Longformer, BigBird, ETC and ERNIE 2.0) on language modeling (WikiText-103) and document-level natural language understanding tasks, including long-text classification (IMDB, HYP, THUCNews, IFLYTEK), question answering (TriviaQA, HotpotQA, DRCD, CMRC2018, DuReader, C3), information extraction (OpenKPE) and semantic matching (CAIL2019-SCM).
- Language modeling on WikiText-103 (perplexity, lower is better)

Model | Param. | PPL |
---|---|---|
Results of base models | ||
LSTM | - | 48.7 |
LSTM+Neural cache | - | 40.8 |
GCNN-14 | - | 37.2 |
QRNN | 151M | 33.0 |
Transformer-XL Base | 151M | 24.0 |
SegaTransformer-XL Base | 151M | 22.5 |
ERNIE-Doc Base | 151M | 21.0 |
Results of large models | ||
Adaptive Input | 247M | 18.7 |
Transformer-XL Large | 247M | 18.3 |
Compressive Transformer | 247M | 17.1 |
SegaTransformer-XL Large | 247M | 17.1 |
ERNIE-Doc Large | 247M | 16.8 |
- Long-text classification on IMDB

Models | Acc. | F1 |
---|---|---|
RoBERTa | 95.3 | 95.0 |
Longformer | 95.7 | - |
BigBird | - | 95.2 |
ERNIE-Doc Base | 96.1 | 96.1 |
XLNet-Large | 96.8 | - |
ERNIE-Doc Large | 97.1 | 97.1 |
- Long-text classification on HYP (Hyperpartisan News Detection)

Models | F1 |
---|---|
RoBERTa | 87.8 |
Longformer | 94.8 |
BigBird | 92.2 |
ERNIE-Doc Base | 96.3 |
ERNIE-Doc Large | 96.6 |
- Chinese long-text classification on THUCNews (THU) and IFLYTEK (IFK)

Models | THU Dev (Acc.) | THU Test (Acc.) | IFK Dev (Acc.) |
---|---|---|---|
BERT | 97.7 | 97.3 | 60.3 |
BERT-wwm-ext | 97.6 | 97.6 | 59.4 |
RoBERTa-wwm-ext | - | - | 60.3 |
ERNIE 1.0 | 97.7 | 97.3 | 59.0 |
ERNIE 2.0 | 98.0 | 97.5 | 61.7 |
ERNIE-Doc | 98.3 | 97.7 | 62.4 |
- TriviaQA on the dev set
Models | F1 |
---|---|
RoBERTa | 74.3 |
Longformer | 75.2 |
BigBird | 79.5 |
ERNIE-Doc Base | 80.1 |
Longformer Large | 77.8 |
BigBird Large | - |
ERNIE-Doc Large | 82.5 |
- HotpotQA on the dev set
Models | Span-F1 | Supp.-F1 | Joint-F1 |
---|---|---|---|
RoBERTa | 73.5 | 83.4 | 63.5 |
Longformer | 74.3 | 84.4 | 64.4 |
BigBird | 75.5 | 87.1 | 67.8 |
ERNIE-Doc Base | 79.4 | 86.3 | 70.5 |
Longformer Large | 81.0 | 85.8 | 71.4 |
BigBird Large | 81.3 | 89.4 | - |
ERNIE-Doc Large | 82.2 | 87.6 | 73.7 |
- Chinese question answering on DRCD, CMRC2018, DuReader and C3

Models | DRCD Dev (EM/F1) | DRCD Test (EM/F1) | CMRC2018 Dev (EM/F1) | DuReader Dev (EM/F1) | C3 Dev (Acc.) | C3 Test (Acc.) |
---|---|---|---|---|---|---|
BERT | 85.7/91.6 | 84.9/90.9 | 66.3/85.9 | 59.5/73.1 | 65.7 | 64.5 |
BERT-wwm-ext | 85.0/91.2 | 83.6/90.4 | 67.1/85.7 | -/- | 67.8 | 68.5 |
RoBERTa-wwm-ext | 86.6/92.5 | 85.2/92.0 | 67.4/87.2 | -/- | 67.1 | 66.5 |
MacBERT | 88.3/93.5 | 87.9/93.2 | 69.5/87.7 | -/- | - | - |
XLNet-zh | 83.2/92.0 | 82.8/91.8 | 63.0/85.9 | -/- | - | - |
ERNIE 1.0 | 84.6/90.9 | 84.0/90.5 | 65.1/85.1 | 57.9/72.1 | 65.5 | 64.1 |
ERNIE 2.0 | 88.5/93.8 | 88.0/93.4 | 69.1/88.6 | 61.3/74.9 | 72.3 | 73.2 |
ERNIE-Doc | 90.5/95.2 | 90.5/95.1 | 76.1/91.6 | 65.8/77.9 | 76.5 | 76.5 |
- Information extraction on OpenKPE

Models | F1@1 | F1@3 | F1@5 |
---|---|---|---|
BLING-KPE | 26.7 | 29.2 | 20.9 |
JointKPE | 39.1 | 39.8 | 33.8 |
ETC | - | 40.2 | - |
ERNIE-Doc | 40.2 | 40.5 | 34.4 |
- Semantic matching on CAIL2019-SCM

Models | Dev (Acc.) | Test (Acc.) |
---|---|---|
BERT | 61.9 | 67.3 |
ERNIE 2.0 | 64.9 | 67.9 |
ERNIE-Doc | 65.6 | 68.8 |
This code base has been tested with PaddlePaddle (version >= 2.0) under Python 3. The other dependencies of ERNIE-Doc are listed in requirements.txt and can be installed by

pip install -r requirements.txt
We release the finetuning code for English and Chinese classification tasks and Chinese question answering tasks. For example, you can finetune the ERNIE-Doc base model on the IMDB, IFLYTEK, and DuReader datasets by
sh script/run_imdb.sh
sh script/run_iflytek.sh
sh script/run_dureader.sh
Preprocessing code for the IMDB dataset is also provided.
The training log and evaluation results are written to log/job.log.0.
Notice: the actual total batch size is equal to the configured batch size * the number of GPUs used.
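As a worked example of this note (all values are illustrative, not the defaults of the released scripts):

```python
configured_batch_size = 8    # per-GPU batch size set in the finetuning script
num_gpus = 4                 # GPUs used by the training job
total_batch_size = configured_batch_size * num_gpus
print(total_batch_size)      # 32 samples per optimization step
```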
You can cite the paper as below:
@article{ding2020ernie,
  title={ERNIE-Doc: A Retrospective Long-Document Modeling Transformer},
  author={Ding, Siyu and Shang, Junyuan and Wang, Shuohuan and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2012.15688},
  year={2020}
}