Notes from Lucene in Action, Chapter 4: Analysis (tokenization)
Vocabulary:
- sophisticated: complex, refined
Common Analyzers (compared in the code sketch after this list):
- WhitespaceAnalyzer
  Simply splits the text at whitespace; each whitespace-separated chunk becomes a token.
- SimpleAnalyzer
  Splits at non-letter characters and lowercases the tokens; digits are discarded.
- StopAnalyzer
  Much like SimpleAnalyzer, but it also removes a set of common English stop words, e.g. "the", "a".
- StandardAnalyzer
  Lucene's most sophisticated core analyzer. It has enough logic to recognize particular kinds of tokens, such as company names, e-mail addresses, etc.; it also lowercases and removes stop words.
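A minimal sketch comparing these analyzers on one input string. It assumes a recent Lucene release where the no-arg constructors and package names below exist (the book itself targets Lucene 3.x, where the same classes sit in different packages and take a Version argument); `AnalyzerDemo` and `displayTokens` are just illustrative names:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {

    // Run the analyzer over the text and print every token it emits.
    static void displayTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + term + "] ");
        }
        stream.end();
        stream.close();
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        String text = "The Quick Brown Fox jumped, bob@example.com, 42 times";
        displayTokens(new WhitespaceAnalyzer(), text);  // split at whitespace only, case preserved
        displayTokens(new SimpleAnalyzer(), text);      // split at non-letters, lowercased, digits dropped
        displayTokens(new StandardAnalyzer(), text);    // grammar-based tokenization plus lowercasing
    }
}
```

StopAnalyzer can be run through the same displayTokens call; depending on the Lucene version its constructor may require an explicit stop-word set.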
TokenStream
There are two different styles of TokenStream: Tokenizer and TokenFilter. A good generalization to explain the distinction is that Tokenizers deal with individual characters, and TokenFilters deal with words. Tokenizers produce a new TokenStream, while TokenFilters simply filter the tokens from a prior TokenStream.
In other words: a Tokenizer reads raw characters and produces a fresh TokenStream, while a TokenFilter wraps a prior TokenStream and just modifies or drops the tokens coming out of it (see the sketch below).
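To make the Tokenizer/TokenFilter split concrete, here is a hedged sketch of a custom Analyzer that chains one Tokenizer (the token source, reading characters) with two TokenFilters (each wrapping the prior TokenStream). It assumes a recent Lucene release with the createComponents(String) API; MyStopLowerAnalyzer is an illustrative name, and the tokenizer/filter classes have moved between packages across Lucene versions:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

// Illustrative custom analyzer: one Tokenizer feeding a chain of TokenFilters.
public class MyStopLowerAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenizer: consumes raw characters and produces the initial TokenStream.
        Tokenizer source = new LetterTokenizer();
        // TokenFilters: each wraps the prior TokenStream and rewrites or drops its tokens.
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, EnglishAnalyzer.getDefaultStopSet());
        return new TokenStreamComponents(source, result);
    }
}
```

Functionally this matches what StopAnalyzer does: letter-based tokenization, then lowercasing, then stop-word removal.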