
TeMU-BSC/catalan_CC0_sentences


Collected CC0 sentences written in Catalan, from public-domain and/or CC0-licensed sources.

TeMU-BSC is the Text Mining Unit of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación in Barcelona (Spain) [https://www.bsc.es/discover-bsc/organisation/scientific-structure/text-mining]

Contents

catalan_government_crawling_frases_seleccionades_filtrades.txt

93691 sentences selected from the Catalan Government Crawling. Numbers have been transcribed.

The Catalan Government Crawling Corpus is a 39-million-token web corpus of Catalan. It was obtained by crawling the .gencat domain and subdomains, which belong to the Catalan Government, during September and October 2020.

Both the packaging and its content are under a CC0 Universal Licence. Please refer to web.gencat.cat/en/menu-ajuda/ajuda/avis_legal/index.html

datasets_sentences.tsv

155990 sentences extracted from the following Projecte Aina datasets:

The Language Technologies Unit agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.

edited_generated_selected_chatbot.txt

2166 sentences generated from edited_selected_chatbot.txt by semi-automatic masking with the bsc/roberta-base-ca-cased transformer model, keeping only the well-formed ones.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
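The masking step described above can be illustrated with a minimal sketch. It is not the script used for this repository: the helper names (`masked_variants`, `generate_candidates`, `toy_fill`) are hypothetical, the `<mask>` token is an assumption about the RoBERTa tokenizer, and the masked language model is replaced by a toy function so the example runs on its own. In practice the `fill` argument would wrap the model, and the manual selection of well-formed sentences would follow.

```python
from typing import Callable, Iterable, List

MASK = "<mask>"  # assumed RoBERTa-style mask token


def masked_variants(sentence: str) -> List[str]:
    """Return one copy of the sentence per word, with that word masked."""
    words = sentence.split()
    return [
        " ".join(words[:i] + [MASK] + words[i + 1:])
        for i in range(len(words))
    ]


def generate_candidates(
    sentence: str, fill: Callable[[str], Iterable[str]]
) -> List[str]:
    """Fill every masked variant and collect new, distinct sentences."""
    seen = {sentence}  # the seed sentence itself is not a new candidate
    candidates: List[str] = []
    for masked in masked_variants(sentence):
        for word in fill(masked):
            candidate = masked.replace(MASK, word, 1)
            if candidate not in seen:
                seen.add(candidate)
                candidates.append(candidate)
    return candidates


def toy_fill(masked: str) -> List[str]:
    """Hypothetical stand-in for a masked language model."""
    return ["avui", "demà"] if masked.endswith(MASK) else []


print(generate_candidates("Quin temps farà avui", toy_fill))
```

The candidates produced this way still need the human review mentioned above; the sketch only covers the automatic half of the process.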

edited_selected_chatbot.txt

873 sentences selected from our chatbot corpus, not previously published.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.

frases_agenda.txt

160665 sentences generated with substitution templates for this project, with the municipalities of all Catalan-speaking areas, published here for the first time.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.
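As a rough illustration of what generating sentences with substitution templates can look like, here is a minimal sketch. The template, the slot names and the values are invented for the example and are not taken from the actual files; the `expand` helper is hypothetical as well.

```python
from itertools import product
from typing import Dict, List


def expand(template: str, slots: Dict[str, List[str]]) -> List[str]:
    """Fill a template with every combination of slot values."""
    names = list(slots)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*(slots[name] for name in names))
    ]


# Hypothetical template and slot values, for illustration only.
templates = ["Què es pot fer a {municipality} {day}?"]
slots = {
    "municipality": ["Girona", "Alcoi"],
    "day": ["avui", "demà"],
}

sentences = [s for t in templates for s in expand(t, slots)]
for sentence in sentences:
    print(sentence)
```

With two municipalities and two day values, the single template yields four sentences; the number of outputs grows multiplicatively with each slot, which is how a short template list can produce a large file.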

frases_diccionaris_enciclopedia.txt

20k sentences from "Diccionaris de l'Enciclopèdia", published here under a CC0 licence through the included "CC0 waiver", for use on the Common Voice platform.

frases_spl.txt

1711 sentences created by the Secretaria de política lingüística (Linguistic Policy Office of the Catalan government) for this project, published here for the first time, and covered as CC0 by the included "CC0 waiver" (see SPL_CC0_waiver.pdf).

frases_toponimia_valenciana_balear.txt

15451 sentences generated with substitution templates for this project, published here for the first time.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.

generades_spl_seleccionades.txt

4469 new sentences generated from frases_spl by semi-automatic masking with the bsc/roberta-base-ca-cased transformer model, keeping only the well-formed ones.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.

literatura.txt

366 literary sentences, published here under a CC0 licence, and edited for well-formedness and idiomaticity.

marius_serra_sentences.txt

35244 sentences from Marius Serra's work, extracted with the author's permission.

These sentences are extracted from the following works:

Books:

  • La Napeu
  • Fora de joc a Montserrat
  • Jugar-s'hi la vida
  • La novel·la de Sant Jordi
  • D'on trec el temps
  • Hawaii Lima
  • Plans de futur
  • L'arca de Babel
  • De com s'escriu una novel·la
  • Enviar i rebre
  • Farsa
  • La vida normal
  • L'home del sac
  • La llegenda de Sant Jordi
  • Mon oncle
  • Quiet
  • Res no és perfecte a Hawaii
  • Tres és massa
  • Verbàlia

Mots encreuats (crosswords)

Author's blog.

marius_serra_sentences_crosswords.txt

17795 new sentences from Marius Serra's crosswords, provided by the author.

more_intents.txt

74772 new intent-like sentences, generated with substitution templates, published here for the first time.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.

new_sentences_from_catalan_newswire.txt

58310 sentences from a Catalan newswire. The owner agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode.

plantilles_intents.txt

2664 intent-like sentences, generated with substitution templates for this project, published here for the first time.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.

pujolar_sentences.txt

764 sentences from Joan Pujolar's work, extracted from here and here.

selected_club.txt

21237 sentences from our own corpora and datasets:

sentences_from_xitxat_corpus.txt

4154 new sentences from the XitXat corpus, written by our team, published here for the first time.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.

sentences_from_xitxat_corpus2.txt

3818 other sentences from the XitXat corpus, written by our team, published here for the first time.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.

They have been added to the Common Voice corpus through the Sentence Collector.

wikidata_sentences.txt

18550 sentences generated with substitution templates for this project using Wikidata data, published here for the first time.

The TeMU-BSC agrees that Mozilla may publish these contributions under the CC0 public domain dedication available at https://creativecommons.org/publicdomain/zero/1.0/legalcode. We, therefore, agree to waive all copyright and related or neighbouring rights together with all associated claims and causes of action with respect to these contributions to the extent possible under the law.

translated_wiki.es-ca.txt

49990 sentences randomly selected from wiki.es.txt and translated into Catalan. Not post-edited.

Aggregated files

Files that aggregate the files described above:

new_catalan_cc0_corpus.txt

Contains all the 107k sentences from the files:

  • edited_generated_selected_chatbot.txt
  • edited_selected_chatbot.txt
  • frases_spl.txt
  • generades_spl_seleccionades.txt
  • more_intents.txt
  • plantilles_intents.txt
  • selected_club.txt
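The 107k figure can be checked against the per-file counts given earlier in this README, and the aggregation itself can be sketched as a plain concatenation. This is an illustration only: it assumes each file holds one UTF-8 sentence per line, and the `read_sentences` and `aggregate` helpers are hypothetical, not the script used to build the file.

```python
from pathlib import Path
from typing import Iterable, List

# Sentence counts as stated in this README for each component file.
README_COUNTS = {
    "edited_generated_selected_chatbot.txt": 2166,
    "edited_selected_chatbot.txt": 873,
    "frases_spl.txt": 1711,
    "generades_spl_seleccionades.txt": 4469,
    "more_intents.txt": 74772,
    "plantilles_intents.txt": 2664,
    "selected_club.txt": 21237,
}


def read_sentences(path: Path) -> List[str]:
    """Read non-empty lines (assumed: one sentence per line, UTF-8)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def aggregate(paths: Iterable[Path], out_path: Path) -> int:
    """Concatenate the sentences of all files and return their count."""
    sentences: List[str] = []
    for path in paths:
        sentences.extend(read_sentences(path))
    Path(out_path).write_text("\n".join(sentences) + "\n", encoding="utf-8")
    return len(sentences)


print(sum(README_COUNTS.values()))  # 107892
```

The stated counts add up to 107892, which matches the rounded 107k. Calling `aggregate` on the seven files would reproduce a file of that size only if no sentences were edited or removed afterwards.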

xitxat_toponyms.txt

37428 sentences from:

  • frases_toponims_illes.txt
  • frases_toponims_valencians.txt
  • sentences_from_xitxat_corpus.txt
  • wikidata_sentences.txt

Some sentences have been edited or removed while supervising the contents.

contribution_agrmt

Contribution agreements for the previously published sentences.

Licence

  • edited_generated_selected_chatbot.txt, edited_selected_chatbot.txt, frases_spl.txt, generades_spl_seleccionades.txt, more_intents.txt, plantilles_intents.txt and selected_club.txt are owned by TeMU-BSC and published here under a CC0 licence.

  • The sources of catalan_government_crawling_frases_seleccionades_filtrades and literatura are public under a CC0 licence.
