Elasticsearch Analyzer: Built-in Analyzers (Part 2)


  • Lower Case Tokenizer: converts tokens to lowercase
Token filters
  • Stop Token Filter: filters out stop words; uses the english stop word rules by default
5.2 Configuration
  • stopwords: a pre-defined stop word list (defaults to _english_) or an array of stop words
  • stopwords_path: the path to a file containing stop words (see the sketch below)
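For stopwords_path, a minimal sketch, assuming a stop word file analysis/my_stopwords.txt exists under the Elasticsearch config directory; the index name and file path are only illustrative:

// hypothetical example: load stop words from a file under the config directory
PUT /stop_path_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}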
5.3 Experiment
The following reproduces the Stop Analyzer: the text is lowercased first, then stop words are filtered out.
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": ["english_stop"]
        }
      }
    }
  }
}

Set the stopwords parameter to specify the list of stop words to filter:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]

6. Whitespace Analyzer
The whitespace analyzer, as the name suggests, splits the text whenever it encounters whitespace. It does not lowercase the tokens.
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

6.1 Definition
Tokenizer
  • Whitespace Tokenizer
6.2 Configuration
None.
6.3 Experiment
The whitespace analyzer can be rebuilt as follows; token filters can be added as needed (see the sketch after this example).
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": []
        }
      }
    }
  }
}
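As mentioned above, token filters can be plugged into the rebuilt analyzer. A minimal sketch adding a lowercase filter; the index and analyzer names are only illustrative:

// hypothetical example: whitespace tokenization plus lowercasing
PUT /whitespace_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

POST /whitespace_lowercase_example/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "The 2 QUICK Brown-Foxes"
}

This should produce [ the, 2, quick, brown-foxes ]: still split only on whitespace, but lowercased.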
7. Keyword Analyzer
The keyword analyzer is special: it does not tokenize at all and returns the input unchanged as a single token.

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

// Note: the text is not tokenized; it is returned as-is
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]

7.1 Definition
Tokenizer
  • Keyword Tokenizer
7.2 Configuration
None.
7.3 Experiment
Rebuilt as follows, this is the Keyword Analyzer implementation:
PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": []
        }
      }
    }
  }
}
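A token filter can also be combined with the keyword tokenizer. A minimal sketch of a case-insensitive variant; the index and analyzer names are only illustrative:

// hypothetical example: single-token output, lowercased
PUT /keyword_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

POST /keyword_lowercase_example/_analyze
{
  "analyzer": "lowercase_keyword",
  "text": "The 2 QUICK Brown-Foxes"
}

This should return the whole input, lowercased, as one token: [ the 2 quick brown-foxes ].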
8. Pattern Analyzer
The pattern analyzer splits the text using a regular expression. Note that the regex matches the token separators, not the tokens themselves; the default pattern is \W+ (non-word characters).

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

// the default \W+ pattern is used here
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

8.1 Definition
Tokenizer
  • Pattern Tokenizer
Token Filters
  • Lower Case Token Filter
  • Stop Token Filter (disabled by default)
8.2 Configuration
  • pattern: a Java regular expression, defaults to \W+
  • flags: Java regular expression flags
  • lowercase: whether to lowercase the tokens, defaults to true
  • stopwords: stop word filtering, defaults to _none_ (disabled)
  • stopwords_path: the path to a file containing stop words
8.3 Experiment
The Pattern Analyzer can be rebuilt as follows:
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
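The pattern parameter from 8.2 can also be set directly on a pattern analyzer. A minimal sketch that splits comma-separated values; the index name, analyzer name, and sample text are only illustrative:

// hypothetical example: split on commas instead of \W+
PUT /csv_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST /csv_example/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "red,GREEN,blue"
}

Because lowercase defaults to true, this should produce [ red, green, blue ].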
9. Language Analyzer
Elasticsearch ships with analyzers for the following languages; english is among them:

arabic, armenian, basque, bengali, bulgarian, catalan, czech, dutch, english, finnish, french, galician, german, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, portuguese, romanian, russian, sorani, spanish, swedish, turkish.
GET _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

[ 2, quick, brown, foxes, jumped, over, lazy, dog, bone ]
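For reference, a minimal sketch of applying one of these language analyzers to a text field in a mapping; the index and field names are only illustrative:

// hypothetical example: analyze the title field with the english analyzer
PUT /blog_example
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

Both index-time and search-time analysis of the title field then use the english analyzer unless a separate search_analyzer is configured.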
10. Custom Analyzer
Not much to say here: when the built-in analyzers do not meet your needs, you can combine the following three parts.
  • Character Filters: pre-process the raw text, for example stripping HTML tags
  • Tokenizer: splits the text into terms according to a rule, i.e. tokenization
  • Token Filters: post-process the generated tokens, e.g. lowercasing, removing stop words, adding synonyms, and other expansions
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}

[ i'm, _happy_, person, you ]
