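How to split English characters from whitespace characters more finely, and split numeral-classifier phrases made of Chinese characters and digits #1283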

Closed
1 task done
WuSiQingChun opened this issue Sep 18, 2019 · 5 comments

WuSiQingChun commented Sep 18, 2019

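Notes

Please confirm the following:

  • I have read the following documentation carefully and could not find an answer:
  • I have searched for my problem with Google and the issue tracker's search, and could not find an answer.
  • I understand that the open-source community is a free community brought together by shared interest and bears no responsibility or obligation. I will be polite and thank everyone who helps me.
  • I put an x in these brackets to confirm the items above.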

Version

The current latest version is: portable-1.7.4
The version I am using is: portable-1.6.4

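My question

After running the program below, "3.一" in the text is recognized as a single word, and the ASCII right parenthesis together with the line break \r\n is also recognized as a single word. How can I split them apart?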

Reproducing the problem

Steps

  1. First…
  2. Then…
  3. Next…

Triggering code

public static void main(String[] args) {
    String text = "3.一位项目经理应该做下列哪一项?(C)\r\n";

    List<Term> term = HanLP.newSegment().enableOffset(true).enableIndexMode(true).enableIndexMode(1).seg(text);

    System.out.println(term.toString());
}

Expected output

[3/m,./w,一/m, 位/q, 项目/n, 经理/n, 应该/v, 做/v, 下列/b, 哪/r, 一/m, 项/q, ?/w, (/w, C/nx, )/w, /w]

Actual output

[3.一/m, 位/q, 项目/n, 经理/n, 应该/v, 做/v, 下列/b, 哪/r, 一/m, 项/q, ?/w, (/w, C/nx, )
/w]

Other information

hankcs (Owner) commented Sep 18, 2019

        CharType.type['一'] = CharType.CT_CNUM;
        CharType.type['\r'] = CharType.CT_DELIMITER;
        CharType.type['\n'] = CharType.CT_DELIMITER;
        List<Term> termList = HanLP.segment(
            "3.一位项目经理应该做下列哪一项?(C)\r\n"
        );
        System.out.println(termList);

hankcs (Owner) commented Sep 18, 2019

Also, in version 1.7.4, deleting data/dictionary/other/CharType.bin is enough to get the correct result.
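For anyone who prefers to script that deletion, here is a minimal sketch using only the Java standard library. The class name `DeleteCharTypeCache`, the method `deleteCache`, and the idea of passing the HanLP data root as an argument are all assumptions made for illustration; only the relative path data/dictionary/other/CharType.bin comes from the comment above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of HanLP.
class DeleteCharTypeCache {
    // Deletes <root>/data/dictionary/other/CharType.bin if it exists.
    // Returns true when a file was actually removed.
    static boolean deleteCache(Path root) throws IOException {
        Path cache = root.resolve(Paths.get("data", "dictionary", "other", "CharType.bin"));
        return Files.deleteIfExists(cache);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument is the directory that contains "data".
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(deleteCache(root) ? "deleted" : "nothing to delete");
    }
}
```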

@hankcs hankcs closed this as completed in 498b6f7 Sep 19, 2019
WuSiQingChun (Author) commented, quoting hankcs:

        CharType.type['一'] = CharType.CT_CNUM;
        CharType.type['\r'] = CharType.CT_DELIMITER;
        CharType.type['\n'] = CharType.CT_DELIMITER;
        List<Term> termList = HanLP.segment(
            "3.一位项目经理应该做下列哪一项?(C)\r\n"
        );
        System.out.println(termList);

Thank you very much. I tried it, and "一" is now split off, but ")\r\n" still cannot be separated.

hankcs (Owner) commented Sep 19, 2019

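Try the patch that was just committed to the repository, or set CharType.type['\r'] = CharType.CT_OTHER;. For any character that you do not want merged with other characters, you can assign it an integer value that no other character uses.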
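To illustrate why a unique type value isolates a character, here is a simplified, self-contained sketch. It is not HanLP's implementation: the class `CharTypeSketch`, its constants, and the merge rule (adjacent characters with the same type code form one token) are assumptions chosen to mirror the idea of a per-character type table.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only; this is not HanLP's code.
class CharTypeSketch {
    // Hypothetical type codes for this sketch.
    static final byte OTHER = 0;
    static final byte DELIMITER = 1;
    static final byte UNIQUE_CR = 100; // a value no other character uses

    // Build a lookup table indexed by char value.
    static byte[] defaultTable() {
        byte[] table = new byte[Character.MAX_VALUE + 1]; // every entry starts as OTHER
        table[')'] = DELIMITER;
        table['\r'] = DELIMITER;
        table['\n'] = DELIMITER;
        return table;
    }

    // Merge each run of adjacent characters that share a type into one token.
    static List<String> split(String text, byte[] table) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            if (atEnd || table[text.charAt(i)] != table[text.charAt(start)]) {
                tokens.add(text.substring(start, i));
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        byte[] table = defaultTable();
        System.out.println(split(")\r\n", table).size()); // 1: all three share DELIMITER

        table['\r'] = UNIQUE_CR; // give '\r' a type of its own
        System.out.println(split(")\r\n", table).size()); // 3: ')', '\r', '\n'
    }
}
```

In this sketch, giving '\r' its own code breaks the run on both sides, so ')' and '\n' also end up as separate tokens even though they still share a type.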

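WuSiQingChun (Author) commented, quoting hankcs:

Try the patch that was just committed to the repository, or set CharType.type['\r'] = CharType.CT_OTHER;. For any character that you do not want merged with other characters, you can assign it an integer value that no other character uses.

OK, that does work now.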

hankcs added a commit that referenced this issue Jan 10, 2020
Montinosq added a commit to Montinosq/HanLP that referenced this issue Sep 13, 2024