
Perform similarity vector search in Bigtable by finding the K-nearest neighbors

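Similarity vector search can help you identify similar concepts and contextual meaning in your Bigtable data, which means that it can deliver more relevant results when you filter data stored within a specified key range. Example use cases include the following:

- Semantic matching of messages for a specific user in inbox search.
- Anomaly detection within a range of sensors.
- Retrieval of the most relevant documents within a set of known keys for retrieval-augmented generation (RAG).
- Search result personalization, which improves a user's search experience by retrieving and ranking results based on their historical prompts and preferences stored in Bigtable.
- Retrieval of similar conversation threads, to find and display past conversations that are contextually similar to a user's current chat for a more personalized experience.
- Prompt deduplication, to identify identical or semantically similar prompts submitted by the same user and avoid redundant AI processing.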

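Note: This page describes how to perform similarity vector search in Bigtable by using the cosine distance and Euclidean distance vector functions in GoogleSQL for Bigtable to find the K-nearest neighbors. This approach requires you to set up the required Bigtable components manually. We recommend that you try the Bigtable vector store for LangChain. The BigtableVectorStore class provides ready-made methods for running vector K-nearest neighbor searches and adds metadata filtering.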

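Before you read this page, make sure that you understand the following concepts:

- Euclidean distance: measures the shortest (straight-line) distance between two vectors.
- Cosine distance: measures how much two vectors differ in direction, based on the cosine of the angle between them.
- K-nearest neighbors (KNN): a supervised machine learning algorithm used to solve classification or regression problems. On this page, KNN refers to finding the K stored vectors that are closest to an input vector.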
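To make the two distance measures concrete, here is a minimal pure-Python sketch. The function names `euclidean_distance` and `cosine_distance` are illustrative and are not part of any Bigtable library. The cosine version assumes the common convention of 1 minus the cosine similarity, so smaller values mean more similar vectors.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity (assumed convention); neither vector may be all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Euclidean distance is sensitive to vector magnitude, while cosine distance depends only on direction, which is why two vectors that point the same way but differ in length have a cosine distance near zero.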

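Bigtable supports the COSINE_DISTANCE() and EUCLIDEAN_DISTANCE() functions, which operate on vector embeddings and let you find the KNN of an input embedding.

You can use the Gemini Enterprise Agent Platform text embeddings API to generate and store your Bigtable data as vector embeddings. You can then provide these vector embeddings as input parameters in a query to find the closest vectors in N-dimensional space, and so search for semantically similar or related items.

Both distance functions take the arguments vector1 and vector2, which are of type array<> and must have the same dimensions and length. For more details, see the reference documentation for each function.

The code on this page demonstrates how to create embeddings, store them in Bigtable, and then perform a KNN search.

The example on this page uses COSINE_DISTANCE() and the Bigtable client library for Python. However, you can also use EUCLIDEAN_DISTANCE() and any client library that supports GoogleSQL for Bigtable, such as the Bigtable client library for Java.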
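Conceptually, a KNN search ranks every candidate vector by its distance to the input vector and keeps the K closest. The following in-memory sketch illustrates that idea only; it is not how Bigtable executes the query. The `k_nearest` helper and the sample data are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def k_nearest(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (key, distance) pairs closest to query, nearest first."""
    # math.dist computes the Euclidean distance and raises ValueError
    # when the two vectors have different lengths.
    scored = ((key, math.dist(query, vec)) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored, key=lambda item: item[1])


# Hypothetical two-dimensional embeddings keyed by row key.
docs = {
    "doc_0": [0.0, 0.0],
    "doc_1": [1.0, 1.0],
    "doc_2": [5.0, 5.0],
}
nearest = k_nearest([0.9, 1.2], docs, k=2)
```

In SQL, the same ranking is expressed by ordering rows by the distance function and limiting the result to K rows.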

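Before you begin

Complete the following before you try the code samples.

Required roles

To get the permissions that you need to read from and write to Bigtable, ask your administrator to grant you the following IAM role:

- Bigtable User (roles/bigtable.user) on the Bigtable instance that you want to send requests to

Set up your environment

Download and install the Bigtable client library for Python. To use GoogleSQL for Bigtable functions, you must use python-bigtable version 2.26.0 or later. For instructions, including how to set up authentication, see Python hello world.
If you don't have a Bigtable instance, follow the steps in Create an instance.
Identify your resource IDs. When you run the code, replace the following placeholders with the IDs of your Google Cloud project, Bigtable instance, and table: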
- PROJECT_ID
- INSTANCE_ID
- TABLE_ID

Create a table to store the text, embeddings, and search phrase

Create a table with two column families.

Python

from google.cloud import bigtable
from google.cloud.bigtable import column_family

# The admin client is required to create tables.
client = bigtable.Client(project=PROJECT_ID, admin=True)
instance = client.instance(INSTANCE_ID)
table = instance.table(TABLE_ID)

# Keep at most two versions of each cell in both column families.
column_families = {
  "docs": column_family.MaxVersionsGCRule(2),
  "search_phrase": column_family.MaxVersionsGCRule(2),
}

if not table.exists():
  table.create(column_families=column_families)
else:
  print("Table already exists")

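Embed the text with a pre-trained foundation model from Agent Platform

Generate the text and embeddings to store in Bigtable along with the associated keys. For more information, see Get text embeddings or Get multimodal embeddings.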

Python

from typing import List, Optional
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel
from vertexai.generative_models import GenerativeModel

# Defines which LLM to use to generate the text
model = GenerativeModel("gemini-1.5-pro-001")

# First, use generative AI to create a list of 10 chunks for phrases.
# This can be replaced with a static list of text items or your own data.
chunks = []
for i in range(10):
  response = model.generate_content(
      "Generate a paragraph between 10 and 20 words that is about either "
      "Bigtable or Generative AI"
  )
  # Collect every generated paragraph, not only the last one
  chunks.append(response.text)
  print(response.text)

# Create embeddings for the chunks of text
def embed_text(
  texts: List[str] = chunks,
  task: str = "RETRIEVAL_DOCUMENT",
  model_name: str = "text-embedding-004",
  dimensionality: Optional[int] = 128,
) -> List[List[float]]:
  """Embeds texts with a pre-trained, foundational model."""
  model = TextEmbeddingModel.from_pretrained(model_name)
  inputs = [TextEmbeddingInput(text, task) for text in texts]
  kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}
  embeddings = model.get_embeddings(inputs, **kwargs)
  return [embedding.values for embedding in embeddings]

embeddings = embed_text()
print("embeddings created for text phrases")

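Define a function that converts embeddings to byte objects

Bigtable is optimized for key-value pairs and generally stores data as byte objects. For more information about designing your data model for Bigtable, see Schema design best practices.

You need to convert the embeddings that are returned from Agent Platform, which are stored as a list of floating-point numbers in Python. Convert each element to 4-byte big-endian IEEE 754 floating-point format and then concatenate the results. The following function does this.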

Python

import struct
def floats_to_bytes(float_list):
  """
  Convert a list of floats to a bytes object, where each float is represented
  by 4 big-endian bytes.

  Parameters:
  float_list (list of float): The list of floats to be converted.

  Returns:
  bytes: The resulting bytes object with concatenated 4-byte big-endian
  representations of the floats.
  """
  byte_array = bytearray()

  for value in float_list:
      packed_value = struct.pack('>f', value)
      byte_array.extend(packed_value)

  # Convert bytearray to bytes
  return bytes(byte_array)
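To check the encoding, it can help to decode the bytes again and compare them with the input. The sketch below is a compact round trip that uses only the standard `struct` module; `pack_vector` and `unpack_vector` are hypothetical helper names, and `pack_vector` is assumed to produce the same byte layout as the function above.

```python
import struct
from typing import List, Sequence


def pack_vector(values: Sequence[float]) -> bytes:
    """Encode floats as concatenated 4-byte big-endian IEEE 754 values."""
    return struct.pack(f">{len(values)}f", *values)


def unpack_vector(data: bytes) -> List[float]:
    """Decode bytes produced by pack_vector back into a list of floats."""
    if len(data) % 4 != 0:
        raise ValueError("byte length must be a multiple of 4")
    return list(struct.unpack(f">{len(data) // 4}f", data))


# These sample values are exactly representable in 32-bit floats.
encoded = pack_vector([0.5, -1.25, 3.0])
decoded = unpack_vector(encoded)
```

Because each element is narrowed to 32 bits, values such as 0.1 come back slightly different from the original Python float, so compare with a tolerance rather than with exact equality.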

Write the embeddings to Bigtable

Convert the embeddings to byte objects, create the mutations, and then write the data to Bigtable.

Python

mutations = []
# Reuse the embeddings created in the previous step
for i, embedding in enumerate(embeddings):
  print(embedding)

  # Convert each embedding into a bytes object
  vector = floats_to_bytes(embedding)

  # Set the row key, which is used to pull the range of documents (for example, doc type or user ID)
  row_key = f"doc_{i}"

  row = table.direct_row(row_key)

  # Set the column for the embedding, using the bytes representation of the embedding
  row.set_cell("docs", "embedding", vector)
  # Store the text associated with the vector in the same row
  row.set_cell("docs", "text", chunks[i])
  mutations.append(row)

# Write the rows to Bigtable and report any rows that failed
statuses = table.mutate_rows(mutations)
for i, status in enumerate(statuses):
  if status.code != 0:
    print(f"Failed to write row doc_{i}: {status.message}")

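Perform a KNN search using GoogleSQL for Bigtable

The vectors are stored as binary-encoded data, which you can read from Bigtable by using a conversion function that converts the BYTES type to ARRAY<FLOAT32>.

Here is the SQL query: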

SELECT _key, TO_VECTOR32(data['embedding']) AS embedding
FROM table WHERE _key LIKE 'store123%';

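You can use the GoogleSQL COSINE_DISTANCE function to find the similarity between the text embeddings and a search phrase that you provide. Because this calculation can take some time to process, use the asynchronous data client of the Python client library to execute the SQL query.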

Python

from google.cloud.bigtable.data import BigtableDataClientAsync

# First, embed the search phrase
search_embedding = embed_text(texts=["Apache HBase"])

# TABLE_ID is the table that you created earlier on this page
query = """
      SELECT _key, docs['text'] AS description
      FROM `{table_id}`
      ORDER BY COSINE_DISTANCE(TO_VECTOR32(docs['embedding']), {search_embedding})
      LIMIT 1;
      """

async def execute_query():
  async with BigtableDataClientAsync(project=PROJECT_ID) as client:
    sql = query.format(table_id=TABLE_ID, search_embedding=search_embedding[0])
    # LIMIT 1 means that the first row returned is the nearest neighbor
    async for row in await client.execute_query(sql, INSTANCE_ID):
      return (row["_key"], row["description"])

# Top-level await works in a notebook; in a script, use asyncio.run(execute_query())
await execute_query()

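The response is the row key and generated text of the stored chunk that is nearest to the search phrase. For this search phrase, that text is a description of Bigtable.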
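To limit the search to a range of row keys, the `_key LIKE` filter from the earlier query can be combined with the distance ordering. The following sketch only builds the SQL text; `build_knn_query` is a hypothetical helper, the `docs` column family and column names are taken from this page, and the combined statement is an assumption that has not been run against a real instance. Because the values are interpolated into the SQL text, the helper accepts only a conservative set of characters.

```python
import re
from typing import Sequence


def build_knn_query(
    table_id: str,
    key_prefix: str,
    vector: Sequence[float],
    k: int = 5,
) -> str:
    """Return a KNN query limited to row keys that start with key_prefix."""
    # Reject anything that could change the meaning of the SQL statement.
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", table_id):
        raise ValueError("unexpected characters in table_id")
    if not re.fullmatch(r"[A-Za-z0-9_#-]*", key_prefix):
        raise ValueError("unexpected characters in key_prefix")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Render the embedding as an array literal, for example [0.5, 1.0].
    vector_literal = "[" + ", ".join(repr(float(v)) for v in vector) + "]"
    # In LIKE patterns, "_" matches any single character, so a prefix that
    # contains "_" matches slightly more keys than a literal prefix would.
    return (
        "SELECT _key, docs['text'] AS description "
        f"FROM `{table_id}` "
        f"WHERE _key LIKE '{key_prefix}%' "
        f"ORDER BY COSINE_DISTANCE(TO_VECTOR32(docs['embedding']), {vector_literal}) "
        f"LIMIT {int(k)}"
    )


sql = build_knn_query("knn_intro", "doc_", [0.5, 1.0], k=3)
```

The resulting string can be passed to `client.execute_query` in the same way as the query above. Designing row keys so that related documents share a prefix keeps each search within a small key range.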
