Elasticsearch's official default analysis plugin is not ideal for Chinese word segmentation. A concrete example below shows why the analyzers that ship with ES handle Chinese poorly. Reference documentation:
https://www.elastic.co/guide/en/ ... ting_analyzers.html
https://www.elastic.co/guide/en/ ... ndices-analyze.html
We submit a piece of text to the analysis interface with a POST request to http://ip:9200/_analyze, as follows:
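A minimal sketch of such a request, assuming the 6.x _analyze API; the original request body is not shown in the source, so the Chinese sentence below is a placeholder of my own, and "ip" stands in for your Elasticsearch host:

```
# Analyze a Chinese sentence with the built-in standard analyzer.
curl -X POST "http://ip:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "我是一名架构师"
}'
```

The standard analyzer emits one token per Chinese character (我, 是, 一, 名, 架, 构, 师), which is exactly the problem described next.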
If you use Elasticsearch out of the box, you will run into an embarrassing problem with Chinese content search: Chinese text is split into individual characters, one by one. When you then build a chart in Kibana and group by term, each single Chinese character ends up as its own group.
Fortunately, there are two Chinese word segmentation plugins written by medcl (one of the earliest people in China to work on ES): one is ik and the other is mmseg. The following introduces only the usage of ik.
The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and supports custom dictionaries.
elasticsearch-analysis-ik project address: https://github.com/medcl/elasticsearch-analysis-ik
Install elasticsearch-analysis-ik
First, stop the running Elasticsearch instance: use jps to find the process ID, then kill it with kill -9 <PID>. (I did not test installing while Elasticsearch was running; stopping it first is simply safer.)
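For example (the PID below is illustrative):

```
# Find the Elasticsearch JVM process and stop it.
jps -l | grep -i elasticsearch    # e.g. 12345 org.elasticsearch.bootstrap.Elasticsearch
kill -9 12345                     # replace 12345 with the PID that jps printed
```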
Install using elasticsearch-plugin (supported since v5.5.1):
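The command from the project README, run from the Elasticsearch home directory:

```
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
```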
Note: replace 6.3.0 with your own Elasticsearch version.
The Elasticsearch on my server is version 6.5.2, so the command is as follows:
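Following the same release URL pattern (assuming a matching v6.5.2 release of the plugin exists):

```
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.5.2/elasticsearch-analysis-ik-6.5.2.zip
```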
The installation fails with the following error:

```
Exception in thread "main" java.nio.file.FileSystemException: /usr/local/elasticsearch-6.5.2/config/analysis-ik: Operation not permitted
```
The plugin installation needs write access to the Elasticsearch config directory, which on this machine requires root. So we use su root to switch to the administrator account and re-run the installation, and this time it succeeds.
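In full, the fix looks like this (a sketch; the path comes from the error message above, and the install URL repeats the command from earlier):

```
su root                             # switch to root; enter the root password when prompted
cd /usr/local/elasticsearch-6.5.2   # Elasticsearch home, per the error message
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.5.2/elasticsearch-analysis-ik-6.5.2.zip
```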
After restarting Elasticsearch, we test by submitting a POST request to the _analyze interface again; the request body is as follows:
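A sketch of the request; the original sentence is not shown in the source, so the text below is a placeholder of my own that contains the words discussed next (ik_max_word is one of the two analyzers the plugin registers):

```
curl -X POST "http://ip:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_max_word",
  "text": "漂亮的架构师研究系统架构"
}'
```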
This time the analyzer understands the semantics as expected: it keeps "architect" (架构师), "beautiful" (漂亮), and "architecture" (架构) each as a whole word instead of splitting them into characters.
What is the difference between ik_max_word and ik_smart?
ik_max_word: splits the text at the finest granularity. For example, "中华人民共和国国歌" (National Anthem of the People's Republic of China) is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting all possible combinations;
ik_smart: does the coarsest-grained splitting. The same "中华人民共和国国歌" is split into just two tokens, "中华人民共和国" and "国歌".
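You can see the difference by analyzing the same sentence with both analyzers (host placeholder as before):

```
# Fine-grained: returns all plausible sub-words.
curl -X POST "http://ip:9200/_analyze" -H 'Content-Type: application/json' -d'
{ "analyzer": "ik_max_word", "text": "中华人民共和国国歌" }'

# Coarse-grained: returns only the longest matches.
curl -X POST "http://ip:9200/_analyze" -H 'Content-Type: application/json' -d'
{ "analyzer": "ik_smart", "text": "中华人民共和国国歌" }'
```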