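Search multi-region lineage using server-side automation

This document describes how to use the searchLineageStreaming API to find multi-level, cross-region data lineage.

Starting from a defined set of root entities, the searchLineageStreaming API performs a breadth-first search in the specified direction (upstream or downstream) and returns a unified lineage graph as a real-time streaming response.

For more information, see Introduction to multi-region lineage search.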

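Key features

The searchLineageStreaming API provides the following features:

  • Breadth-first search: traverses the lineage graph level by level and accurately computes the depth of each connected asset.

  • Streaming responses: returns subgraphs and lineage links as the backend systems discover them. This approach is efficient for wide or deep lineage graphs and helps prevent request timeouts.

  • Multi-location and multi-project traversal: although you specify only one billing project in the request path, the API automatically discovers and traverses lineage links across multiple Google Cloud projects and geographic locations, as long as you have the required permissions.

  • Fine-grained column-level lineage: supports searching column-level dependencies between assets.

  • Wildcard lookup: lets you retrieve all column-level lineage for a specific entity by appending * to its fully qualified name (FQN).

  • Pipeline insights: optionally retrieves metadata about the transformation pipelines (processes) that created the lineage links.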
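The level-by-level depth computation mentioned in the first feature can be illustrated with a small sketch. It runs on a hypothetical in-memory adjacency map; the names `edges` and `bfs_depths` are illustrative only and are not part of the API. It shows the general idea of breadth-first traversal and makes no claim about how the service implements it.

```python
from collections import deque


def bfs_depths(edges, roots, max_depth):
    """Return {entity: depth} for entities reachable from roots within max_depth hops."""
    depths = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        if depths[current] >= max_depth:
            continue  # do not expand nodes that already sit at the depth limit
        for neighbor in edges.get(current, []):
            if neighbor not in depths:  # the first visit yields the smallest depth
                depths[neighbor] = depths[current] + 1
                queue.append(neighbor)
    return depths


# Hypothetical toy graph: table_a feeds table_b and table_c, which both feed table_d.
edges = {
    "table_a": ["table_b", "table_c"],
    "table_b": ["table_d"],
    "table_c": ["table_d"],
}
depths = bfs_depths(edges, ["table_a"], max_depth=2)
```

Because each entity is recorded the first time it is reached, an entity that is reachable along several paths keeps the depth of its shortest path from a root.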

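Before you begin

Before you make requests to the API, make sure that you meet the following security and environment prerequisites:

Required roles

To get the permissions that you need to search data lineage links, ask your administrator to grant you the Data Lineage Viewer (roles/datalineage.viewer) IAM role on the projects where the lineage links and processes are stored. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the permissions required to search data lineage links. The exact permissions that are required are listed in the Required permissions section:

Required permissions

The following permissions are required to search data lineage links:

  • Search entity-level lineage: datalineage.events.get permission on the project where the links are stored
  • Search column-level lineage: datalineage.events.getFields permission on the project where the links are stored
  • Retrieve full pipeline process details: datalineage.processes.get permission on the project where the processes are stored

You might also be able to get these permissions with custom roles or other predefined roles.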

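Resource scoping

When you configure an API request, you must distinguish between the resource used for billing and the actual locations that the API scans:

  • Billing parent path: the parent path in the request URL must use the format projects/project/locations/location. This project-location pair is used only to evaluate billing quota and API rate limits.

  • Target locations: explicitly define the regions that you want the backend to scan in the locations array in the request body.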
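The distinction between the billing parent and the scanned locations can be sketched as a small helper that builds a request body. The function name `build_search_body` and its parameters are hypothetical; the JSON field names (`parent`, `locations`, `rootCriteria`, `direction`) are taken from the REST examples in this document.

```python
def build_search_body(billing_project, billing_location, scan_locations, root_fqn, direction):
    """Build a request body that keeps the billing parent and the scanned locations separate."""
    # The parent identifies the project-location pair used for billing and quota only.
    parent = f"projects/{billing_project}/locations/{billing_location}"
    return {
        "parent": parent,
        # The regions that the backend is asked to scan; independent of the parent.
        "locations": list(scan_locations),
        "rootCriteria": {
            "entities": {"entities": [{"fullyQualifiedName": root_fqn}]}
        },
        "direction": direction,
    }


body = build_search_body(
    "my-billing-project",
    "us",
    ["us", "europe-west1"],
    "bigquery:my-project.dataset.global_table",
    "DOWNSTREAM",
)
```

Keeping the two inputs as separate arguments makes it harder to assume by accident that the location in the parent path is the only one that gets scanned.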

Authentication setup

Initialize an environment variable with a Google Cloud access token to authenticate the curl commands:

export ACCESS_TOKEN=$(gcloud auth print-access-token)

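Usage examples

The following examples use the endpoint datalineage.googleapis.com.

Search multi-level, multi-project lineage

To run a deep lineage search that traverses multiple depths of the graph and scans different Google Cloud projects, define the following variables:

  • Set limits.maxDepth to the target traversal depth (accepts values between 1 and 100).

  • Populate the locations array with the target regions that you want the backend to cross-reference (for example, ["us", "us-east1"]).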

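C#

Before trying this sample, follow the C# setup instructions in the Knowledge Catalog quickstart using client libraries. For more information, see the Knowledge Catalog C# API reference documentation.

To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.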

using Google.Api.Gax.Grpc;
using Google.Api.Gax.ResourceNames;
using Google.Cloud.DataCatalog.Lineage.V1;
using System.Threading.Tasks;

public sealed partial class GeneratedLineageClientSnippets
{
    /// <summary>Snippet for SearchLineageStreaming</summary>
    /// <remarks>
    /// This snippet has been automatically generated and should be regarded as a code template only.
    /// It will require modifications to work:
    /// - It may require correct/in-range values for request initialization.
    /// - It may require specifying regional endpoints when creating the service client as shown in
    ///   https://cloud.google.com/dotnet/docs/reference/help/client-configuration#endpoint.
    /// </remarks>
    public async Task SearchLineageStreamingRequestObject()
    {
        // Create client
        LineageClient lineageClient = LineageClient.Create();
        // Initialize request argument(s)
        SearchLineageStreamingRequest request = new SearchLineageStreamingRequest
        {
            ParentAsLocationName = LocationName.FromProjectLocation("[PROJECT]", "[LOCATION]"),
            Locations = { "", },
            RootCriteria = new SearchLineageStreamingRequest.Types.RootCriteria(),
            Direction = SearchLineageStreamingRequest.Types.SearchDirection.Unspecified,
            Filters = new SearchLineageStreamingRequest.Types.SearchFilters(),
            Limits = new SearchLineageStreamingRequest.Types.SearchLimits(),
        };
        // Make the request, returning a streaming response
        using LineageClient.SearchLineageStreamingStream response = lineageClient.SearchLineageStreaming(request);

        // Read streaming responses from server until complete
        // Note that C# 8 code can use await foreach
        AsyncResponseStream<SearchLineageStreamingResponse> responseStream = response.GetResponseStream();
        while (await responseStream.MoveNextAsync())
        {
            SearchLineageStreamingResponse responseItem = responseStream.Current;
            // Do something with streamed response
        }
        // The response stream has completed
    }
}

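Java

Before trying this sample, follow the Java setup instructions in the Knowledge Catalog quickstart using client libraries. For more information, see the Knowledge Catalog Java API reference documentation.

To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.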

import com.google.api.gax.rpc.ServerStream;
import com.google.cloud.datacatalog.lineage.v1.LineageClient;
import com.google.cloud.datacatalog.lineage.v1.LocationName;
import com.google.cloud.datacatalog.lineage.v1.SearchLineageStreamingRequest;
import com.google.cloud.datacatalog.lineage.v1.SearchLineageStreamingResponse;
import java.util.ArrayList;

public class AsyncSearchLineageStreaming {

  public static void main(String[] args) throws Exception {
    asyncSearchLineageStreaming();
  }

  public static void asyncSearchLineageStreaming() throws Exception {
    // This snippet has been automatically generated and should be regarded as a code template only.
    // It will require modifications to work:
    // - It may require correct/in-range values for request initialization.
    // - It may require specifying regional endpoints when creating the service client as shown in
    // https://cloud.google.com/java/docs/setup#configure_endpoints_for_the_client_library
    try (LineageClient lineageClient = LineageClient.create()) {
      SearchLineageStreamingRequest request =
          SearchLineageStreamingRequest.newBuilder()
              .setParent(LocationName.of("[PROJECT]", "[LOCATION]").toString())
              .addAllLocations(new ArrayList<String>())
              .setRootCriteria(SearchLineageStreamingRequest.RootCriteria.newBuilder().build())
              .setFilters(SearchLineageStreamingRequest.SearchFilters.newBuilder().build())
              .setLimits(SearchLineageStreamingRequest.SearchLimits.newBuilder().build())
              .build();
      ServerStream<SearchLineageStreamingResponse> stream =
          lineageClient.searchLineageStreamingCallable().call(request);
      for (SearchLineageStreamingResponse response : stream) {
        // Do something when a response is received.
      }
    }
  }
}

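Node.js

Before trying this sample, follow the Node.js setup instructions in the Knowledge Catalog quickstart using client libraries.

To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.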

/**
 * This snippet has been automatically generated and should be regarded as a code template only.
 * It will require modifications to work.
 * It may require correct/in-range values for request initialization.
 * TODO(developer): Uncomment these variables before running the sample.
 */
/**
 *  Required. The project and location to initiate the search from.
 */
// const parent = 'abc123'
/**
 *  Required. The locations to search in.
 */
// const locations = ['abc','def']
/**
 *  Required. Criteria for the root of the search.
 */
// const rootCriteria = {}
/**
 *  Required. Direction of the search.
 */
// const direction = {}
/**
 *  Optional. Filters for the search.
 */
// const filters = {}
/**
 *  Optional. Limits for the search.
 */
// const limits = {}

// Imports the Lineage library
const {LineageClient} = require('@google-cloud/lineage').v1;

// Instantiates a client
const lineageClient = new LineageClient();

async function callSearchLineageStreaming() {
  // Construct request
  const request = {
    parent,
    locations,
    rootCriteria,
    direction,
  };

  // Run request
  const stream = await lineageClient.searchLineageStreaming(request);
  stream.on('data', (response) => { console.log(response) });
  stream.on('error', (err) => { throw(err) });
  stream.on('end', () => { /* API call completed */ });
}

callSearchLineageStreaming();

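Python

Before trying this sample, follow the Python setup instructions in the Knowledge Catalog quickstart using client libraries. For more information, see the Knowledge Catalog Python API reference documentation.

To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.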

# This snippet has been automatically generated and should be regarded as a
# code template only.
# It will require modifications to work:
# - It may require correct/in-range values for request initialization.
# - It may require specifying regional endpoints when creating the service
#   client as shown in:
#   https://googleapis.dev/python/google-api-core/latest/client_options.html
from google.cloud import datacatalog_lineage_v1


def sample_search_lineage_streaming():
    # Create a client
    client = datacatalog_lineage_v1.LineageClient()

    # Initialize request argument(s)
    request = datacatalog_lineage_v1.SearchLineageStreamingRequest(
        parent="parent_value",
        locations=["locations_value1", "locations_value2"],
        direction="UPSTREAM",
    )

    # Make the request
    stream = client.search_lineage_streaming(request=request)

    # Handle the response
    for response in stream:
        print(response)

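Ruby

Before trying this sample, follow the Ruby setup instructions in the Knowledge Catalog quickstart using client libraries. For more information, see the Knowledge Catalog Ruby API reference documentation.

To authenticate to Knowledge Catalog, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.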

require "google/cloud/data_catalog/lineage/v1"

##
# Snippet for the search_lineage_streaming call in the Lineage service
#
# This snippet has been automatically generated and should be regarded as a code
# template only. It will require modifications to work:
# - It may require correct/in-range values for request initialization.
# - It may require specifying regional endpoints when creating the service
# client as shown in https://cloud.google.com/ruby/docs/reference.
#
# This is an auto-generated example demonstrating basic usage of
# Google::Cloud::DataCatalog::Lineage::V1::Lineage::Client#search_lineage_streaming.
#
def search_lineage_streaming
  # Create a client object. The client can be reused for multiple calls.
  client = Google::Cloud::DataCatalog::Lineage::V1::Lineage::Client.new

  # Create a request. To set request fields, pass in keyword arguments.
  request = Google::Cloud::DataCatalog::Lineage::V1::SearchLineageStreamingRequest.new

  # Call the search_lineage_streaming method to start streaming.
  output = client.search_lineage_streaming request

  # The returned object is a streamed enumerable yielding elements of type
  # ::Google::Cloud::DataCatalog::Lineage::V1::SearchLineageStreamingResponse
  output.each do |current_response|
    p current_response
  end
end

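REST

To search data lineage, use the searchLineageStreaming method.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: the ID of the Google Cloud project used for billing and quota evaluation.
  • LOCATION_ID: the Google Cloud location, for example us-central1.
  • SOURCE_PROJECT_ID: the ID of the Google Cloud project that contains the source table.
  • DATASET_ID: the BigQuery dataset ID.
  • TABLE_ID: the BigQuery table ID.

HTTP method and URL:

POST https://datalineage.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID:searchLineageStreaming

Request JSON body: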

{
  "parent": "projects/PROJECT_ID/locations/LOCATION_ID",
  "locations": [
    "LOCATION_ID",
    "us-east1",
    "us-central1"
  ],
  "rootCriteria": {
    "entities": {
      "entities": [
        {
          "fullyQualifiedName": "bigquery:SOURCE_PROJECT_ID.DATASET_ID.TABLE_ID"
        }
      ]
    }
  },
  "direction": "DOWNSTREAM",
  "limits": {
    "maxDepth": 10,
    "maxResults": 5000
  }
}

You should receive a JSON response similar to the following:

{
  "links": [
    {
      "source": {
        "fullyQualifiedName": "bigquery:project-prod.dataset.source_table"
      },
      "target": {
        "fullyQualifiedName": "bigquery:project-prod.dataset.target_table"
      },
      "depth": 1,
      "location": "us"
    }
  ]
}
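As a rough illustration of consuming such responses, the following sketch groups links by their depth value. It assumes the streamed messages have already been parsed into Python dictionaries shaped like the sample response; the function name `links_by_depth` is hypothetical, and how the stream is framed on the wire is not covered here.

```python
from collections import defaultdict


def links_by_depth(responses):
    """Group (source, target) pairs from parsed response messages by their depth."""
    grouped = defaultdict(list)
    for response in responses:
        # A message might carry no links at all, so default to an empty list.
        for link in response.get("links", []):
            source = link["source"]["fullyQualifiedName"]
            target = link["target"]["fullyQualifiedName"]
            grouped[link["depth"]].append((source, target))
    return dict(grouped)


# One parsed message, copied from the sample response shown in this document.
sample_responses = [
    {
        "links": [
            {
                "source": {"fullyQualifiedName": "bigquery:project-prod.dataset.source_table"},
                "target": {"fullyQualifiedName": "bigquery:project-prod.dataset.target_table"},
                "depth": 1,
                "location": "us",
            }
        ]
    }
]
grouped = links_by_depth(sample_responses)
```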

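Search multiple geographic locations

You can narrow or widen the scope of the lineage graph scan by changing the geographic regions passed in the repeated locations field.

For example: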

curl -H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-X POST https://datalineage.googleapis.com/v1/projects/my-billing-project/locations/us:searchLineageStreaming \
--data '{
  "parent": "projects/my-billing-project/locations/us",
  "locations": ["us", "europe-west1", "asia-south2"],
  "rootCriteria": {
    "entities": {
      "entities": [{
        "fullyQualifiedName": "bigquery:my-project.dataset.global_table"
      }]
    }
  },
  "direction": "DOWNSTREAM"
}'
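For readers who prefer Python over curl, the same call could be prepared with the standard library alone. This is a sketch only: `make_request` is a hypothetical helper, the access token is assumed to come from `gcloud auth print-access-token` as in the authentication setup, and the request object is constructed but not sent.

```python
import json
import urllib.request

ENDPOINT = "https://datalineage.googleapis.com/v1"


def make_request(access_token, billing_project, billing_location, body):
    """Prepare (but do not send) a POST request for searchLineageStreaming."""
    url = (
        f"{ENDPOINT}/projects/{billing_project}"
        f"/locations/{billing_location}:searchLineageStreaming"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),  # JSON body, as passed to curl --data
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = make_request(
    "example-token",  # placeholder, not a real token
    "my-billing-project",
    "us",
    {
        "parent": "projects/my-billing-project/locations/us",
        "locations": ["us", "europe-west1", "asia-south2"],
        "rootCriteria": {
            "entities": {
                "entities": [
                    {"fullyQualifiedName": "bigquery:my-project.dataset.global_table"}
                ]
            }
        },
        "direction": "DOWNSTREAM",
    },
)
```

Passing the prepared object to `urllib.request.urlopen` would send it; that step is left out so the sketch has no network dependency.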

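By default, the API omits process information (maxProcessPerLink defaults to 0). To retrieve the resource names of the pipelines that created the data links, set limits.maxProcessPerLink to a positive integer.

For example: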

curl -H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-X POST https://datalineage.googleapis.com/v1/projects/my-billing-project/locations/us:searchLineageStreaming \
--data '{
  "parent": "projects/my-billing-project/locations/us",
  "locations": ["us"],
  "rootCriteria": {
    "entities": {
      "entities": [{
        "fullyQualifiedName": "bigquery:my-project.dataset.target_table"
      }]
    }
  },
  "direction": "UPSTREAM",
  "limits": {
    "maxProcessPerLink": 5
  }
}'

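Response behavior: the resulting stream populates the links[].processes field with process messages that contain only their absolute system resource names (for example, projects/my-project/locations/us/processes/my-process).

Retrieve full process details using a FieldMask

If you need full structural metadata about the pipeline (such as its displayName, system attributes, or execution origin) rather than only its resource name, you must use an API FieldMask:

  1. Provide a nonzero value for limits.maxProcessPerLink.

  2. Append the fields query parameter to the URL path, specifying links.processes.process along with any other required fields.

For example: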

curl -H "Authorization: Bearer ${ACCESS_TOKEN}" \
-H "Content-Type: application/json" \
-X POST "https://datalineage.googleapis.com/v1/projects/my-billing-project/locations/us:searchLineageStreaming?fields=links.processes.process,links.source,links.target,links.depth" \
--data '{
  "parent": "projects/my-billing-project/locations/us",
  "locations": ["us"],
  "rootCriteria": {
    "entities": {
      "entities": [{
        "fullyQualifiedName": "bigquery:my-project.dataset.target_table"
      }]
    }
  },
  "direction": "UPSTREAM",
  "limits": {
    "maxProcessPerLink": 5
  }
}'
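Building the URL with the fields query parameter can be sketched as follows. The helper name `search_url` is hypothetical, and the field paths are the ones used in the curl example above; which additional paths a given use case needs is left to the reader.

```python
from urllib.parse import urlencode


def search_url(billing_project, billing_location, field_paths=()):
    """Return the searchLineageStreaming URL, optionally with a fields query parameter."""
    url = (
        "https://datalineage.googleapis.com/v1/"
        f"projects/{billing_project}/locations/{billing_location}:searchLineageStreaming"
    )
    if field_paths:
        # safe="," keeps the commas between field paths readable instead of %2C.
        url += "?" + urlencode({"fields": ",".join(field_paths)}, safe=",")
    return url


url = search_url(
    "my-billing-project",
    "us",
    ["links.processes.process", "links.source", "links.target", "links.depth"],
)
```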

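Search table-level and column-level lineage together

You can search table-level (asset-level) and column-level (field-level) lineage in a single request by providing multiple entities in the rootCriteria.entities.entities list:

  • For table-level lineage, omit the field array.

  • For column-level lineage, specify individual columns in the field array.

For example: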

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -X POST https://datalineage.googleapis.com/v1/projects/my-billing-project/locations/us:searchLineageStreaming \
     --data '{
       "parent": "projects/my-billing-project/locations/us",
       "locations": ["us"],
       "rootCriteria": {
         "entities": {
           "entities": [
             {
               "fullyQualifiedName": "bigquery:my-project.dataset.table_a"
             },
             {
               "fullyQualifiedName": "bigquery:my-project.dataset.table_b",
               "field": ["email"]
             }
           ]
         }
       },
       "direction": "DOWNSTREAM"
     }'
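The mixed root criteria in this request can also be assembled programmatically. In the sketch below, `entity` and `root_criteria` are hypothetical helper names; the JSON keys mirror the request body shown above.

```python
def entity(fqn, fields=None):
    """Build one root entity; omit the field array for table-level lineage."""
    entry = {"fullyQualifiedName": fqn}
    if fields:
        entry["field"] = list(fields)  # column-level lineage for the listed columns
    return entry


def root_criteria(*entries):
    """Wrap root entities in the nested rootCriteria structure."""
    return {"entities": {"entities": list(entries)}}


criteria = root_criteria(
    entity("bigquery:my-project.dataset.table_a"),             # table-level
    entity("bigquery:my-project.dataset.table_b", ["email"]),  # column-level
)
```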

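Use wildcards for column-level lineage

To search all available column-level lineage for a specific table without listing each column individually, use the wildcard * as the single value in the field array.

For example: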

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -X POST https://datalineage.googleapis.com/v1/projects/my-billing-project/locations/us:searchLineageStreaming \
     --data '{
       "parent": "projects/my-billing-project/locations/us",
       "locations": ["us"],
       "rootCriteria": {
         "entities": {
           "entities": [{
             "fullyQualifiedName": "bigquery:my-project.dataset.my_table",
             "field": ["*"]
           }]
         }
       },
       "direction": "DOWNSTREAM"
     }'

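Filter lineage results

You can refine lineage search results by using the filters block in the request body.

Filter by dependency type

To restrict results to specific dependency types, such as direct copies (EXACT_COPY) or transformations like filtering and grouping (OTHER), use the dependencyTypes filter.

For example: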

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -X POST https://datalineage.googleapis.com/v1/projects/my-billing-project/locations/us:searchLineageStreaming \
     --data '{
       "parent": "projects/my-billing-project/locations/us",
       "locations": ["us"],
       "rootCriteria": {
         "entities": {
           "entities": [{
             "fullyQualifiedName": "bigquery:my-project.dataset.my_table"
           }]
         }
       },
       "direction": "DOWNSTREAM",
       "filters": {
         "dependencyTypes": ["EXACT_COPY"]
       }
     }'

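Table-level lineage only

To make sure that the search returns only table-level lineage and excludes column-level lineage entirely, set the entitySet filter to ENTITIES.

For example: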

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -X POST https://datalineage.googleapis.com/v1/projects/my-billing-project/locations/us:searchLineageStreaming \
     --data '{
       "parent": "projects/my-billing-project/locations/us",
       "locations": ["us"],
       "rootCriteria": {
         "entities": {
           "entities": [{
             "fullyQualifiedName": "bigquery:my-project.dataset.my_table"
           }]
         }
       },
       "direction": "DOWNSTREAM",
       "filters": {
         "entitySet": "ENTITIES"
       }
     }'

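Filter by time range

You can restrict lineage search results to a specific time interval.

For example, to search for lineage data created after a specific timestamp, use the following request: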

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -X POST https://datalineage.googleapis.com/v1/projects/my-billing-project/locations/us:searchLineageStreaming \
     --data '{
       "parent": "projects/my-billing-project/locations/us",
       "locations": ["us"],
       "rootCriteria": {
         "entities": {
           "entities": [{
             "fullyQualifiedName": "bigquery:my-project.dataset.my_table"
           }]
         }
       },
       "direction": "DOWNSTREAM",
       "filters": {
         "timeRange": {
           "startTime": "2026-01-01T00:00:00Z"
         }
       }
     }'
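The filter examples in this section can be combined into one helper that builds the filters block. This is a sketch under assumptions: `build_filters` is a hypothetical name, the keys come from the requests shown in this section, and whether the service accepts every combination of filters in one request is not stated in this document.

```python
from datetime import datetime, timezone


def build_filters(dependency_types=None, entity_set=None, start_time=None):
    """Build a filters block from optional inputs, skipping any that are not set."""
    filters = {}
    if dependency_types:
        filters["dependencyTypes"] = list(dependency_types)
    if entity_set:
        filters["entitySet"] = entity_set
    if start_time is not None:
        # start_time must be timezone-aware; it is rendered as a UTC timestamp
        # in the same format as the example request ("2026-01-01T00:00:00Z").
        stamp = start_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        filters["timeRange"] = {"startTime": stamp}
    return filters


filters = build_filters(
    dependency_types=["EXACT_COPY"],
    entity_set="ENTITIES",
    start_time=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
```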

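Handle unreachable locations (partial results)

Because the streaming API scans a distributed set of projects and locations at the same time, some remote regions might be temporarily down, unreachable, or misconfigured during execution.

To preserve data integrity, the SearchLineageStreamingResponse stream includes a dedicated diagnostic field named unreachable:

  • Field name: unreachable (represented as a repeated string)

  • Value format: projects/PROJECT_NUMBER/locations/LOCATION (for example, projects/123456789/locations/us-east1)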
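A client might check for partial results by collecting the unreachable entries across all streamed messages. The sketch below assumes the messages are already parsed into dictionaries and that entries follow the value format listed above; `collect_unreachable` is a hypothetical helper name.

```python
import re

# Matches the documented value format projects/PROJECT_NUMBER/locations/LOCATION.
UNREACHABLE_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)$"
)


def collect_unreachable(responses):
    """Return sorted, de-duplicated (project, location) pairs reported as unreachable."""
    found = set()
    for response in responses:
        for entry in response.get("unreachable", []):
            match = UNREACHABLE_PATTERN.match(entry)
            if match:  # entries in any other format are ignored by this sketch
                found.add((match["project"], match["location"]))
    return sorted(found)


unreachable = collect_unreachable(
    [
        {"links": []},
        {"unreachable": ["projects/123456789/locations/us-east1"]},
        {"unreachable": ["projects/123456789/locations/us-east1"]},
    ]
)
```

A non-empty result indicates that the lineage graph built from the stream may be incomplete for the listed project-location pairs.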

What's next