
lucene-analyzer

Notes from reading Lucene in Action, Chapter 4: Analysis (tokenization).

Vocabulary:

  1. sophisticated: complex, refined

Overview of the common Analyzers:

  • WhitespaceAnalyzer simply splits the text into tokens at whitespace; nothing else is changed
  • SimpleAnalyzer splits at non-letter characters and lowercases the tokens; digits are discarded
  • StopAnalyzer is similar to the previous one, but additionally removes certain common English words (stop words), e.g. "the", "a"
  • StandardAnalyzer is Lucene's most sophisticated core analyzer; it has logic to recognize particular kinds of tokens such as company names, e-mail addresses, etc., and it also lowercases tokens and removes stop words
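To make the differences concrete, here is a minimal plain-Java sketch (no Lucene dependency; the class and method names are my own, not Lucene's API) approximating what WhitespaceAnalyzer and SimpleAnalyzer do to the same input:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class AnalyzerSketch {

    // WhitespaceAnalyzer behavior: split at whitespace only; case and
    // punctuation are left untouched.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // SimpleAnalyzer behavior: split at non-letter characters and lowercase
    // the tokens; digits disappear because they act as separators.
    static List<String> simple(String text) {
        return Arrays.stream(text.split("[^a-zA-Z]+"))
                     .filter(s -> !s.isEmpty())
                     .map(String::toLowerCase)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "The Quick 42 Foxes";
        System.out.println(whitespace(text)); // [The, Quick, 42, Foxes]
        System.out.println(simple(text));     // [the, quick, foxes]
    }
}
```

StopAnalyzer and StandardAnalyzer would further drop stop words such as "the" from the second result.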

TokenStream

There are two different styles of TokenStreams: Tokenizer and TokenFilter. A good generalization to explain the distinction is that Tokenizers deal with individual characters, and TokenFilters deal with words. Tokenizers produce a new TokenStream, while TokenFilters simply filter the tokens from a prior TokenStream.

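The Tokenizer/TokenFilter split can be sketched as a toy pipeline in plain Java (the names below are my own and only mimic Lucene's design, this is not the real API): the tokenizer builds the initial stream from raw characters, and the filter wraps a prior stream and transforms whole tokens.

```java
// Tokenizer role: consumes raw characters and produces the initial stream.
// TokenFilter role: wraps a prior stream and rewrites each of its tokens.
interface ToyTokenStream {
    boolean incrementToken();
    String term();
}

// A "Tokenizer": turns raw text into a stream of whitespace-separated tokens.
class ToyWhitespaceTokenizer implements ToyTokenStream {
    private final String[] parts;
    private int i = -1;
    ToyWhitespaceTokenizer(String text) { this.parts = text.trim().split("\\s+"); }
    public boolean incrementToken() { return ++i < parts.length; }
    public String term() { return parts[i]; }
}

// A "TokenFilter": does not tokenize; it lowercases tokens from a prior stream.
class ToyLowerCaseFilter implements ToyTokenStream {
    private final ToyTokenStream input; // the prior stream being filtered
    ToyLowerCaseFilter(ToyTokenStream input) { this.input = input; }
    public boolean incrementToken() { return input.incrementToken(); }
    public String term() { return input.term().toLowerCase(); }
}

public class ToyPipeline {
    public static void main(String[] args) {
        ToyTokenStream ts = new ToyLowerCaseFilter(new ToyWhitespaceTokenizer("Hello Lucene WORLD"));
        StringBuilder out = new StringBuilder();
        while (ts.incrementToken()) out.append(ts.term()).append(' ');
        System.out.println(out.toString().trim()); // hello lucene world
    }
}
```

Real Lucene analyzers are built the same way: an Analyzer chains one Tokenizer with zero or more TokenFilters.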

Analyzer building blocks provided in Lucene’s core API

