Implementing Chinese Tokenization in Elasticsearch with nGram

Latest recommended article published on 2025-10-22 16:21:09

This content is licensed under the CC 4.0 BY-SA license.

Tags

#elasticsearch #chinese-tokenization #jenkins

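In full-text search, nGram is a very useful tokenization technique, particularly for partial matching and fuzzy search. This article walks through configuring an nGram tokenizer in Elasticsearch and uses a Chinese example to show what it produces.

What is nGram?

nGram is a technique that breaks text into contiguous sequences of n characters. In Elasticsearch, the nGram tokenizer emits every substring whose length falls within a configured range, which is especially useful for "search-as-you-type" features.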
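To make the definition concrete, here is a minimal Python sketch of character n-gram generation. It is not Elasticsearch code; the function name `char_ngrams` is made up for illustration, and the output order (by start position, then by length) is an assumption about how the tokenizer walks the text.

```python
def char_ngrams(text: str, min_gram: int, max_gram: int) -> list[str]:
    """Return every substring of text whose length is in [min_gram, max_gram].

    Substrings are listed by start position, then by increasing length.
    """
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams would run past the end of the text
            grams.append(text[start:end])
    return grams


if __name__ == "__main__":
    print(char_ngrams("北京天安门", 2, 3))
    # ['北京', '北京天', '京天', '京天安', '天安', '天安门', '安门']
```

Because no dictionary is involved, the same routine works for any language, which is why n-grams are a common fallback for Chinese text that has no spaces between words.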

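Configuring the nGram tokenizer

Configuring an nGram tokenizer in Elasticsearch usually involves the following steps:

1. Create a custom analyzer

First, create a custom analyzer that uses nGram as its tokenizer. The following basic example shows how to define such an analyzer when creating the index: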

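PUT /my_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 2  // allow max_gram - min_gram = 2 (the default limit is 1)
    },
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {  // name of the custom analyzer
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]  // optional: add filters, e.g. lowercasing
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,  // minimum n-gram length
          "max_gram": 4,  // maximum n-gram length
          "token_chars": [  // character classes kept inside tokens
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

In this example, min_gram and max_gram set the minimum and maximum length of the generated n-grams. Because their difference here is 2, and Elasticsearch 7.x and later reject a difference larger than index.max_ngram_diff (default 1), the request also raises that index setting to 2. The token_chars parameter lists the character classes that may appear inside a token; the tokenizer splits on any character outside those classes, so with letter and digit no n-gram spans punctuation or whitespace. Chinese characters count as letters.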
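The effect of token_chars is easier to see in a small simulation. The following Python sketch only approximates the tokenizer: it treats `str.isalpha()` and `str.isdecimal()` as stand-ins for the letter and digit classes, and the helper names are hypothetical.

```python
def split_on_token_chars(text):
    """Split text into runs of letters/digits, dropping everything else."""
    runs, current = [], []
    for ch in text:
        if ch.isalpha() or ch.isdecimal():  # stand-ins for "letter" and "digit"
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


def ngram_tokenize(text, min_gram=2, max_gram=4):
    """Generate n-grams inside each run; no gram crosses a split point."""
    tokens = []
    for run in split_on_token_chars(text):
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size > len(run):
                    break
                tokens.append(run[start:start + size])
    return tokens


if __name__ == "__main__":
    print(split_on_token_chars("北京，2024年"))  # ['北京', '2024年']
    print(ngram_tokenize("我 爱北京"))           # ['爱北', '爱北京', '北京']
```

Two consequences show up in the second call: the space is a split point, and the lone character 我 is shorter than min_gram, so it produces no token at all.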

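2. Apply the analyzer to a field

Once the custom analyzer exists, you can apply it to a specific field in the index mapping. For example, to use the my_ngram_analyzer created above on the content field:

PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_ngram_analyzer"  // use the custom nGram analyzer
    }
  }
}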

3. Test the analyzer

Before relying on the analyzer, check that it behaves as expected. The _analyze API shows how a given string is tokenized:

POST /my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "我爱北京天安门"
}
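The same check can be scripted. Below is a minimal sketch using only the Python standard library; it assumes an unsecured development node at `http://localhost:9200` (an assumption, adjust as needed), and the helper names are illustrative rather than part of any Elasticsearch client.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local dev node, no auth or TLS


def build_analyze_request(index, analyzer, text, base_url=ES_URL):
    """Build (but do not send) a POST /<index>/_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response_body):
    """Return just the token strings from an _analyze JSON response."""
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]


def analyze(index, analyzer, text):
    """Send the request and return the tokens; needs a reachable cluster."""
    request = build_analyze_request(index, analyzer, text)
    with urllib.request.urlopen(request) as response:
        return extract_tokens(response.read())
```

With the index from step 1 in place, `analyze("my_index", "my_ngram_analyzer", "我爱北京天安门")` should return the token strings as a plain list, which is convenient for asserting on them in a test.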

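Chinese example

Suppose we have the Chinese string "我爱北京天安门" ("I love Beijing Tiananmen") and tokenize it with the nGram tokenizer, with min_gram set to 2 and max_gram set to 4.

Tokenization process

min_gram = 2, max_gram = 4
- 我爱
- 爱北
- 北京
- 京天
- 天安
- 安门
- 我爱北
- 爱北京
- 北京天
- 京天安
- 天安门
- 我爱北京
- 爱北京天
- 北京天安
- 京天安门

Explanation
- With min_gram set to 2, every 2-gram is included, for example 我爱 and 爱北.
- With max_gram set to 4, every 4-gram is included as well, for example 我爱北京 and 爱北京天.
- Every substring whose length is between 2 and 4 is generated as a token, which gives 6 + 5 + 4 = 15 tokens for this 7-character string.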
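The number of tokens follows directly from the lengths involved: a run of L characters contains L - n + 1 substrings of length n. A small sketch of that arithmetic (the function name is hypothetical):

```python
def ngram_count(length, min_gram, max_gram):
    """Number of n-grams produced from one unbroken run of `length` characters."""
    return sum(max(length - n + 1, 0) for n in range(min_gram, max_gram + 1))


if __name__ == "__main__":
    print(ngram_count(7, 2, 4))  # 6 + 5 + 4 = 15
    print(ngram_count(7, 1, 7))  # 7 + 6 + ... + 1 = 28
```

The second call hints at the cost of a wide min_gram/max_gram range: the token count, and with it the index size, grows quickly as the range widens.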

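Elasticsearch example

Putting the steps together, we create an index in Elasticsearch with an analyzer that uses nGram. If my_index already exists from step 1, delete it first or skip this request, because creating an index that already exists fails.

PUT /my_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 4,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

Then we apply this analyzer to the content field: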

PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_ngram_analyzer"
    }
  }
}

Next, we use the _analyze API to test the analyzer:

POST /my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "我爱北京天安门"
}

Tokenization result

Elasticsearch returns the following token list. The tokens are ordered by start offset and then by length, not grouped by length as in the list above:

{
  "tokens": [
    { "token": "我爱", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "我爱北", "start_offset": 0, "end_offset": 3, "type": "word", "position": 1 },
    { "token": "我爱北京", "start_offset": 0, "end_offset": 4, "type": "word", "position": 2 },
    { "token": "爱北", "start_offset": 1, "end_offset": 3, "type": "word", "position": 3 },
    { "token": "爱北京", "start_offset": 1, "end_offset": 4, "type": "word", "position": 4 },
    { "token": "爱北京天", "start_offset": 1, "end_offset": 5, "type": "word", "position": 5 },
    { "token": "北京", "start_offset": 2, "end_offset": 4, "type": "word", "position": 6 },
    { "token": "北京天", "start_offset": 2, "end_offset": 5, "type": "word", "position": 7 },
    { "token": "北京天安", "start_offset": 2, "end_offset": 6, "type": "word", "position": 8 },
    { "token": "京天", "start_offset": 3, "end_offset": 5, "type": "word", "position": 9 },
    { "token": "京天安", "start_offset": 3, "end_offset": 6, "type": "word", "position": 10 },
    { "token": "京天安门", "start_offset": 3, "end_offset": 7, "type": "word", "position": 11 },
    { "token": "天安", "start_offset": 4, "end_offset": 6, "type": "word", "position": 12 },
    { "token": "天安门", "start_offset": 4, "end_offset": 7, "type": "word", "position": 13 },
    { "token": "安门", "start_offset": 5, "end_offset": 7, "type": "word", "position": 14 }
  ]
}
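Tokens like these are what make partial matching possible: a query is broken into n-grams the same way, and a document matches if it shares at least one of them. The following Python sketch is a deliberately simplified in-memory model of that idea, not how Elasticsearch is implemented; it ignores scoring, and the sample documents and function names are invented for illustration.

```python
from collections import defaultdict


def ngrams(text, min_gram=2, max_gram=4):
    """All substrings of text with length between min_gram and max_gram."""
    return [
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    ]


def build_index(docs):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text):
            index[gram].add(doc_id)
    return index


def search(index, query):
    """Ids of documents sharing at least one n-gram with the query."""
    hits = set()
    for gram in ngrams(query):
        hits |= index.get(gram, set())
    return hits


if __name__ == "__main__":
    docs = {1: "我爱北京天安门", 2: "北京烤鸭", 3: "上海外滩"}  # made-up sample data
    index = build_index(docs)
    print(search(index, "天安门"))  # {1}
    print(search(index, "北京"))    # {1, 2}
    print(search(index, "门"))      # set(): shorter than min_gram, no tokens
```

The last query shows a practical limit of min_gram 2: a single-character query produces no n-grams and therefore matches nothing in this model.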