A curated list of resources for Chinese natural language processing (NLP)

zubryan/Awesome-Chinese-NLP

awesome-chinese-nlp

Contents

1. Chinese NLP Toolkits

2. Corpus

3. Organizations and Conferences

4. Learning Materials

Chinese NLP Toolkits

General-Purpose NLP Toolkits

  • THULAC, a Chinese lexical analysis toolkit by Tsinghua University (C++/Java/Python) [link]

  • NLPIR by the Chinese Academy of Sciences (Java) [github]

  • LTP (Language Technology Platform) by Harbin Institute of Technology (C++) [github]

  • FudanNLP by Fudan University (Java) [github]

  • CoreNLP by Stanford (Java) [github]

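  • BosonNLP by Boson (commercial API service) [link]

  • HanLP (Java) [github]

  • SnowNLP (Python) [github] Python library for processing Chinese text

  • YaYaNLP (Python) [github] A Chinese natural language processing package written in pure Python, named after the idiom “牙牙学语” (a toddler babbling while learning to speak)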

  • DeepNLP (Python) [github] Deep Learning NLP Pipeline implemented on Tensorflow with pretrained Chinese models.

  • chinese_nlp (C++ & Python) [github] Chinese Natural Language Processing tools and examples

Popular NLP Toolkits for English/Multi-Language

  • CoreNLP by Stanford (Java) [github]

  • NLTK (Python) [link]

  • spaCy (Python) [link]

  • OpenNLP (Java) [link]

  • gensim (Python) [github] Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

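Chinese Word Segmentation

  • Jieba (结巴) Chinese word segmentation (Python) [github] Built to be the best Python Chinese word segmentation module

  • kcws, deep learning Chinese word segmentation (Python) [github] BiLSTM+CRF and IDCNN+CRF

  • Genius Chinese word segmentation (Python) [github] Genius is an open-source Python Chinese word segmentation component based on the CRF (Conditional Random Field) algorithm.

  • loso Chinese word segmentation (Python) [github]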
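
To illustrate what the segmenters above do, here is a minimal sketch of dictionary-based forward maximum matching, one of the simplest segmentation techniques. It is not the algorithm used by any of the listed tools, which rely on statistical or neural models such as CRF or BiLSTM+CRF. The toy dictionary and the function name `fmm_segment` are assumptions made for illustration.

```python
def fmm_segment(text, dictionary, max_word_len=4):
    """Segment text by greedily taking the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word is found.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary (hypothetical, for illustration only).
toy_dict = {"自然", "语言", "自然语言", "处理", "自然语言处理", "中文"}
print(fmm_segment("中文自然语言处理", toy_dict, max_word_len=6))
```

Greedy matching is fast but cannot resolve ambiguous splits, which is one reason the tools listed here use learned models instead.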

Information Extraction

  • MITIE (C++) [github] library and tools for information extraction

  • Duckling (Haskell) [github] Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.

  • IEPY (Python) [github] IEPY is an open source tool for Information Extraction focused on Relation Extraction.

  • Snorkel: A training data creation and management system focused on information extraction [github]

  • Neural Relation Extraction implemented with LSTM in TensorFlow [github]

  • A neural network model for Chinese named entity recognition [github]

QA & Chatbot

  • Rasa NLU (Python) [github] turn natural language into structured data

  • Chatterbot (Python) [github] ChatterBot is a machine learning, conversational dialog engine for creating chat bots.

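  • Chatbot (Python) [github] A context-aware chatbot based on vector matching

  • Tipask (PHP) [github] An open-source PHP Q&A system built on the Laravel framework; easy to extend, with strong load capacity and stability.

  • QuestionAnsweringSystem (Java) [github] A question answering system implemented in Java that automatically analyzes questions and returns candidate answers.

  • A Sequence to Sequence chatbot model implemented in TensorFlow (Python) [github]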

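Corpus

  • OpenKG.cn, an open knowledge graph [link]

  • CLDC (Chinese Linguistic Data Consortium) [link]

  • Datasets for Training Chatbot System: corpora for training Chinese and English dialogue systems [github]

  • Chinese Wikipedia Dump [link]

  • 1998 People's Daily part-of-speech tagged corpus, hosted on Baidu Pan [link]

  • Baidu Baike 100 GB corpus, hosted on Baidu Pan [link] Password: neqs. Probably originates from 梁斌penny.

  • Sogou 20061127 news corpus (includes category labels), hosted on Baidu Pan [link]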

  • UDChinese (for training spaCy POS) [github]

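  • Chinese Q&A corpus from the Gossiping board [github]

  • Chinese word2vec model [github]

  • Chinese Emergency Corpus [github]

  • dgk_lost_conv, a Chinese conversation corpus [github]

  • Chinese character decomposition dictionary [github]

  • Chinese stock market announcement crawler [github] Python scripts that fetch Chinese stock market (sz, sh) announcements, from both listed companies and regulators, from the Juchao Network (巨潮网络) servers

  • tushare financial data interface [website] TuShare is a free, open-source Python package for financial data.

  • Insurance industry corpus [github] [52nlp introduction blog] OpenData in insurance area for Machine Learning Tasks

  • The most complete database of classical Chinese poetry [github] Nearly 14,000 poets of the Tang and Song dynasties, close to 55,000 Tang poems plus 260,000 Song poems, and 21,050 ci poems by 1,564 ci poets of the Northern and Southern Song.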

Organizations and Conferences

  • Chinese Information Processing Society of China (中国中文信息学会) [website]

  • NLP Conference Calendar [website] Main conferences, journals, workshops and shared tasks in NLP community.

Learning Materials

  • Chinese translation of the Deep Learning Book [github]

  • Stanford CS224n Natural Language Processing with Deep Learning 2017 [link]

  • Oxford CS DeepNLP 2017 [github]

  • Speech and Language Processing by Dan Jurafsky and James H. Martin [link]

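  • 52nlp (我爱自然语言处理, “I Love Natural Language Processing”) [blog]

  • hankcs (码农场) [blog]

  • Text processing practice course materials [github] Lab materials covering text feature extraction (TF-IDF), text classification, text clustering, training word vectors with word2vec, Chinese word similarity computation with the Tongyici Cilin (同义词词林) thesaurus, automatic document summarization, information extraction, and sentiment analysis and opinion mining.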
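
As a rough idea of the TF-IDF feature extraction mentioned in the last entry, the following sketch computes weights for documents that are already segmented into words. It is not taken from the course materials: the function name `tf_idf`, the toy documents, and the specific variant (length-normalized term frequency with an unsmoothed logarithmic IDF) are assumptions, and libraries commonly use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Term frequency (normalized by document length) times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Toy, pre-segmented documents (hypothetical, for illustration only).
docs = [["中文", "分词", "工具"], ["中文", "语料"], ["分词", "算法", "分词"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that appear in fewer documents receive higher weights, so in the second toy document "语料" outweighs "中文".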
