- Lower Case Tokenizer : lowercases terms as it tokenizes
- Stop Token Filter : removes stop words; uses the _english_ list by default
- stopwords : the stop word list to use, either a pre-defined list name (defaults to _english_) or an array of stop words
- stopwords_path : the path to a file containing stop words (a sketch using this parameter appears at the end of this section)
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      },
      "analyzer": {
        "rebuilt_stop": { "tokenizer": "lowercase", "filter": ["english_stop"] }
      }
    }
  }
}
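As a quick check (the index name stop_example comes from the request above), running _analyze against the rebuilt analyzer should return the same tokens as the built-in stop analyzer; note that "over" is not in the _english_ stop list, so it stays:
POST /stop_example/_analyze
{
  "analyzer": "rebuilt_stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]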
Set the stopwords parameter to specify the list of stop words to filter out:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": { "type": "stop", "stopwords": ["the", "over"] }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
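Alternatively, stopwords_path can point to a file of stop words, one per line. A minimal sketch, assuming a hypothetical file analysis/my_stopwords.txt placed under the Elasticsearch config directory on every node:
PUT /stop_path_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}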
6. Whitespace Analyzer
The whitespace analyzer, as the name suggests, splits text on whitespace and does not lowercase the resulting terms.
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
6.1 Definition
Tokenizer
- Whitespace Tokenizer
6.3 Experiment
The whitespace analyzer implementation is simply the following; you can add token filters as your use case requires (a variant with a lowercase filter is sketched right after the rebuilt example below):
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": { "tokenizer": "whitespace", "filter": [] }
      }
    }
  }
}
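For example, one common tweak is adding the built-in lowercase token filter, so the analyzer still splits only on whitespace but emits lowercased terms. A minimal sketch; the index and analyzer names here are made up for illustration:
PUT /whitespace_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": { "tokenizer": "whitespace", "filter": ["lowercase"] }
      }
    }
  }
}
POST /whitespace_lowercase_example/_analyze
{
  "analyzer": "whitespace_lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
[ the, 2, quick, brown-foxes, jumped, over, the, lazy, dog's, bone. ]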
7. Keyword Analyzer
The keyword analyzer is special: it does not tokenize at all, and outputs the input exactly as it was received.
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
// note: no tokenization happens here, the text is returned unchanged as a single token
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
7.1 Definition
Tokenizer
- Keyword Tokenizer
7.3 Experiment
Rebuilt, the keyword analyzer implementation looks like this:
PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": { "tokenizer": "keyword", "filter": [] }
      }
    }
  }
}
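To have a field analyzed this way, the rebuilt analyzer can be attached to it in the mapping. A minimal sketch against the keyword_example index above, where the field name tags is hypothetical; for pure exact-match use cases the keyword field type is usually the simpler choice:
PUT /keyword_example/_mapping
{
  "properties": {
    "tags": {
      "type": "text",
      "analyzer": "rebuilt_keyword"
    }
  }
}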
8. Pattern Analyzer
The pattern analyzer splits text using a regular expression. Note that the regular expression matches the token separators, not the tokens themselves. By default it splits on the \W+ pattern (runs of non-word characters).
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
// by default, splits on the \W+ regular expression
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
8.1 Definition
Tokenizer
- Pattern Tokenizer
Token Filters
- Lower Case Token Filter
- Stop Token Filter (disabled by default)
8.2 Configuration
- pattern : a Java regular expression, defaults to \W+
- flags : Java regular expression flags
- lowercase : whether to lowercase terms; enabled by default (true)
- stopwords : stop word filtering; defaults to _none_, i.e. disabled
- stopwords_path : the path to a file containing stop words
8.3 Experiment
The pattern analyzer implementation is the following:
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": { "type": "pattern", "pattern": "\\W+" }
      },
      "analyzer": {
        "rebuilt_pattern": { "tokenizer": "split_on_non_word", "filter": ["lowercase"] }
      }
    }
  }
}
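The pattern is, of course, configurable. A minimal sketch with a non-default pattern that splits on commas plus optional whitespace; the index name comma_example and analyzer name comma_analyzer are made up for illustration:
PUT /comma_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ",\\s*",
          "lowercase": true
        }
      }
    }
  }
}
POST /comma_example/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "Java,Python, Go,Rust"
}
[ java, python, go, rust ]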
9. Language Analyzer
Elasticsearch provides analyzers for the following languages, english among them: arabic, armenian, basque, bengali, bulgarian, catalan, czech, dutch, english, finnish, french, galician, german, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, portuguese, romanian, russian, sorani, spanish, swedish, turkish.
GET _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
// the english analyzer also stems the terms
[ 2, quick, brown, fox, jump, over, lazi, dog, bone ]
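Language analyzers also accept configuration. A hedged sketch (index and analyzer names are made up) of the english analyzer with a custom stopwords list and a stem_exclusion list of words that should be left unstemmed:
PUT /lang_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": ["the", "over"],
          "stem_exclusion": ["organization"]
        }
      }
    }
  }
}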
10. Custom Analyzer
Not much to say here: when the built-in analyzers do not meet your needs, you can build your own by combining the following three parts:
- Character Filters : pre-process the raw text, e.g. stripping HTML tags
- Tokenizer : splits the text into terms according to some rule, i.e. the actual tokenization
- Token Filters : post-process the emitted terms: lowercasing, removing stop words, adding synonyms, and so on
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": { "type": "pattern", "pattern": "[ .,!?]" }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => _happy_", ":( => _sad_"]
        }
      },
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
[ i'm, _happy_, person, you ]
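To actually use the custom analyzer at index and search time, reference it from a field mapping. A minimal sketch against the my_index above, where the text field comment is hypothetical:
PUT my_index/_mapping
{
  "properties": {
    "comment": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}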