ES analyzer settings, at index creation/update time and at query time

Latest recommended article published on 2024-07-05 15:19:18

Original

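This article takes an in-depth look at analyzers in Elasticsearch: the standard analyzer, custom analyzers, third-party analyzers such as the IK analyzer, and how to configure an analyzer to optimize searching and indexing.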

analyzer

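An analyzer is applied in two situations:
1. Index time analysis. When a document is created or updated, its content is analyzed.
2. Search time analysis. When a query runs, the query string is analyzed.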

- Specify the analyzer at query time with the analyzer parameter

POST test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "lin",
        "analyzer": "standard"
      }
    }
  }
}
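
For readers who issue this query from application code, the request body above can be assembled as a plain dictionary. The following is only a minimal sketch: the helper name match_with_analyzer is hypothetical and not part of any Elasticsearch client, and the code builds and prints the JSON body without sending it anywhere.

```python
import json


def match_with_analyzer(field, text, analyzer):
    """Build a match query body that overrides the search-time analyzer.

    Hypothetical helper for illustration; it only constructs the body.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "analyzer": analyzer,
                }
            }
        }
    }


body = match_with_analyzer("name", "lin", "standard")
# This JSON would be the body of: POST test_index/_search
print(json.dumps(body, indent=2))
```

Keeping the analyzer as an argument makes it easy to try the same query text with different analyzers while debugging.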

- Specify search_analyzer when creating the index mapping

PUT test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title":{
          "type": "text",
          "analyzer": "whitespace",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

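Index-time analysis is configured through the analyzer parameter of each field in the index mapping. Note that the doc level under mappings is a mapping type, which is Elasticsearch 6.x syntax; from 7.0 onward, mappings are normally written without it.

If no analyzer is specified, the default standard analyzer is used.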
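
To see why the index-time and search-time analyzers have to produce compatible terms, the following sketch simulates the whitespace/standard pairing from the mapping above. It is a rough approximation written for this article, not Elasticsearch code: whitespace_analyze, standard_analyze and matches are hypothetical helpers, and the real standard analyzer uses Unicode text segmentation rather than a regular expression.

```python
import re


def whitespace_analyze(text):
    # Approximates the whitespace analyzer: split on whitespace, keep case.
    return text.split()


def standard_analyze(text):
    # Rough approximation of the standard analyzer: split on non-word
    # characters and lowercase every token.
    return [t.lower() for t in re.findall(r"\w+", text)]


def matches(index_terms, query_terms):
    # A match query with the default OR operator hits when any query term
    # equals one of the indexed terms.
    return any(term in index_terms for term in query_terms)


doc_title = "Quick Brown-Fox"

# Index time: the field's analyzer (whitespace) produces the stored terms.
indexed = set(whitespace_analyze(doc_title))

# Search time: the search_analyzer (standard) processes the query text.
query_terms = standard_analyze("quick")

print(sorted(indexed))
print(query_terms)
print(matches(indexed, query_terms))
```

Under these assumptions the lowercased query term quick does not equal the indexed term Quick, so the document is not found. Analyzing both sides with the same approximation would match, which is why a differing search_analyzer should be chosen deliberately.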

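Note:

Decide explicitly whether each field needs to be analyzed. For fields that do not, set type to keyword, which saves space and improves write performance.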
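
As an illustration of the note above, a mapping might declare an exact-value field as keyword and a free-text field as text. This sketch builds such a mapping body in Python. The field names status and title and the variable mapping_body are hypothetical, and the doc level follows the mapping-type syntax used elsewhere in this article.

```python
import json

# Hypothetical fields: "status" holds exact values and is not analyzed,
# "title" holds free text and is analyzed at index time.
mapping_body = {
    "mappings": {
        "doc": {
            "properties": {
                "status": {"type": "keyword"},
                "title": {"type": "text", "analyzer": "standard"},
            }
        }
    }
}

# This JSON would be the body of a create-index request such as: PUT test_index
print(json.dumps(mapping_body, indent=2))
```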

_analyze API

POST _analyze
{
    "analyzer": "standard",
    "text": "this is a test"
}
# Shows the tokens produced when the text is analyzed with the standard analyzer

{
    "tokens": [
        {
            "token": "this",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 5,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "test",
            "start_offset": 10,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 3
        }
    ]
}
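
The fields in this response can be understood with a small simulation. The sketch below is an approximation for illustration only: analyze_sketch is a hypothetical function, it uses a regular expression in place of the real tokenizer, and it labels every token <ALPHANUM> although the real tokenizer also reports other token types.

```python
import re


def analyze_sketch(text):
    """Return tokens shaped like an _analyze response (approximation)."""
    tokens = []
    for position, m in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": m.group().lower(),   # the emitted term
            "start_offset": m.start(),    # index of the first character
            "end_offset": m.end(),        # index just past the last character
            "type": "<ALPHANUM>",         # simplification: always ALPHANUM
            "position": position,         # ordinal of the token in the stream
        })
    return {"tokens": tokens}


result = analyze_sketch("this is a test")
for tok in result["tokens"]:
    print(tok["token"], tok["start_offset"], tok["end_offset"], tok["position"])
```

For the input "this is a test" the offsets and positions produced by this sketch agree with the response shown above: start_offset and end_offset are character indexes into the original text, and position counts tokens from 0.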

Configuring an analyzer

PUT test
{
    "settings": {
        "analysis": { # custom analysis settings
            "analyzer": { # reserved key
                "my_analyzer":{ # name of the custom analyzer
                    "type":"standard", # analyzer type: standard
                    "stopwords":"_english_" # parameter of the standard analyzer; the default stopwords value is _none_
                }
            }
        }
    },
    "mappings": {
        "doc":{
            "properties": {
                "my_text":{
                    "type": "text",
                    "analyzer": "standard", # the my_text field uses the standard analyzer
                    "fields": {
                        "english":{ # the my_text.english field uses the custom my_analyzer defined above
                            "type": "text",
                            "analyzer": "my_analyzer"
                        }
                    }
                }
            }
        }
    }
}

POST test/_analyze
{
    "field": "my_text", # the my_text field uses the standard analyzer
    "text": ["The test message."]
}

-------------->[the,test,message]
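
The effect of the stopwords parameter configured on my_analyzer can be sketched as follows. This is a simplified model, not the Elasticsearch implementation: standard_analyze is a hypothetical helper, and ENGLISH_STOPWORDS_SUBSET is a small illustrative subset of English stop words, whereas the real _english_ list is longer.

```python
import re

# Small illustrative subset of English stop words; the real _english_
# list in Elasticsearch is longer.
ENGLISH_STOPWORDS_SUBSET = frozenset(
    {"a", "an", "and", "is", "of", "the", "this", "to"}
)


def standard_analyze(text, stopwords=frozenset()):
    # Approximation: split on non-word characters, lowercase the tokens,
    # then drop any token found in the stop word set.
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [t for t in tokens if t not in stopwords]


text = "The test message."
plain = standard_analyze(text)  # models the built-in standard analyzer
filtered = standard_analyze(text, ENGLISH_STOPWORDS_SUBSET)  # models my_analyzer
print(plain)
print(filtered)
```

With no stop words the sketch keeps the token the, in line with the result shown above for the my_text field. With the stop word set it drops the and keeps only test and message.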

POST test/_analyze
{
    "field":