NTT123/viwik18

viwik18 dataset

Clean Vietnamese Text - Wikipedia dump 08-2018

Alphabet: aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmnoóòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz
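
A quick way to sanity-check a text sample against this alphabet is sketched below. It is only an illustration, not part of the repository: the function name `letters_outside_alphabet` is hypothetical, and the sketch assumes that the alphabet covers lowercase letters only, so it lowercases the input and ignores digits, punctuation and whitespace. Both the alphabet and the input are normalized to NFC so that decomposed diacritics compare equal to precomposed ones.

```python
import unicodedata
from collections import Counter

# Alphabet string copied from the README, normalized to NFC (precomposed form).
ALPHABET = set(unicodedata.normalize(
    "NFC",
    "aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmnoóòỏõọôốồổỗộơớờởỡợ"
    "pqrstuúùủũụưứừửữựvwxyýỳỷỹỵz",
))


def letters_outside_alphabet(text):
    """Count letters in `text` that are not in ALPHABET.

    Assumption (not stated in the README): only letters are checked, after
    lowercasing; digits, punctuation and whitespace are skipped.
    """
    normalized = unicodedata.normalize("NFC", text).lower()
    return Counter(
        ch for ch in normalized if ch.isalpha() and ch not in ALPHABET
    )
```

An empty `Counter` means every letter in the sample belongs to the alphabet; any remaining keys are the letters worth inspecting.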

Merge to single file

    $ cat dataset/viwik18_* > viwik18.txt
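
Where `cat` is not available, the same merge can be sketched in Python. This is an illustrative equivalent, not a script shipped with the repository: `merge_shards` is a hypothetical helper, and it assumes that sorting the shard file names reproduces the order in which the shell would expand `dataset/viwik18_*`.

```python
import glob
import shutil


def merge_shards(pattern, out_path):
    """Concatenate every file matching `pattern` into `out_path`.

    Files are appended in sorted name order. `out_path` should not itself
    match `pattern`. Returns the list of shard paths that were merged.
    """
    shard_paths = sorted(glob.glob(pattern))
    if not shard_paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(out_path, "wb") as out_file:
        for shard_path in shard_paths:
            # Copy raw bytes so the UTF-8 text is passed through unchanged.
            with open(shard_path, "rb") as shard:
                shutil.copyfileobj(shard, out_file)
    return shard_paths

# Example call mirroring the shell command above:
# merge_shards("dataset/viwik18_*", "viwik18.txt")
```

Working in binary mode avoids any decoding step, so the result should be byte-identical to the `cat` output as long as the shard order matches.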

Generate the dataset manually

    $ wget https://dumps.wikimedia.org/viwiki/20180801/viwiki-20180801-pages-articles.xml.bz2
    $ bzip2 -d viwiki-20180801-pages-articles.xml.bz2
    $ python WikiExtractor.py --no-templates -s --lists viwiki-20180801-pages-articles.xml -q -o - | perl -CSAD -Mutf8 cleaner.pl > viwik18.txt
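
The decompression step can also be done from Python with the standard library, which may be convenient on systems without `bzip2`. The sketch below is an assumption-laden alternative to `bzip2 -d`, not the author's method: `decompress_bz2` is a hypothetical helper, and unlike `bzip2 -d` it leaves the compressed file in place.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to `dst_path` without loading it into memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)

# Example call mirroring the bzip2 command above:
# decompress_bz2("viwiki-20180801-pages-articles.xml.bz2",
#                "viwiki-20180801-pages-articles.xml")
```

Copying in fixed-size chunks keeps memory use flat, which matters because the decompressed XML dump is much larger than the archive.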

viwik19 dataset

Check out the new dataset, viwik19, at https://github.com/NTT123/viwik18/tree/viwik19
