Use data profile for unstructured data

A data profile scan for unstructured data (UnstructuredDataProfileSpec), powered by Vertex AI Gemini 2.5 Pro models, analyzes existing BigQuery object tables to transform raw, unstructured files in Cloud Storage (such as PDFs) into structured, queryable assets. This standalone workflow is designed for users who already have BigQuery object tables, and it supports guiding the extraction with a customized prompt. If you are starting with raw files in Cloud Storage and want an automated discovery workflow, see Use discovery scan for unstructured data.

This document describes how to set up the necessary permissions, prepare your object table, create a data profile scan for unstructured data using the REST API, view the generated insights, curate graph profiles, and extract the data into BigQuery.

Before you begin

Before you create a data profile scan for unstructured data, ensure you have the required permissions and APIs enabled.

Enable APIs

Enable the following APIs in your project:

  • dataplex.googleapis.com
  • bigquery.googleapis.com
  • aiplatform.googleapis.com (Vertex AI)
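
As a rough sketch, the three services listed above could be enabled in a single call to the Service Usage API's services.batchEnable method. The helper below only builds the request URL and body. The function name and the project ID used in the usage note are illustrative assumptions, and sending the request (for example with curl and an access token) is left to the caller.

```python
import json

# The three services this document asks you to enable.
REQUIRED_SERVICES = [
    "dataplex.googleapis.com",
    "bigquery.googleapis.com",
    "aiplatform.googleapis.com",
]


def build_batch_enable_request(project_id: str) -> tuple[str, str]:
    """Return (url, json_body) for a Service Usage services.batchEnable call."""
    url = (
        "https://serviceusage.googleapis.com/v1/"
        f"projects/{project_id}/services:batchEnable"
    )
    body = json.dumps({"serviceIds": REQUIRED_SERVICES})
    return url, body
```

For example, build_batch_enable_request("example-project") returns a URL and body that can be sent as an authenticated POST request.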

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Required roles and permissions

Unstructured data semantic inference is an advanced data profile scan feature that operates on BigQuery object tables. To configure and run unstructured data profiling, you must satisfy the baseline permissions for accessing the object table and grant additional roles for semantic inference across multiple service agents.

Baseline object table roles

To access and query a BigQuery object table, ensure that you and the service accounts used by Knowledge Catalog have the following baseline Identity and Access Management (IAM) roles on the project:

  • BigQuery Data Viewer (roles/bigquery.dataViewer)
  • BigQuery Connection User (roles/bigquery.connectionUser)

For a complete list of object table prerequisites, see Create object tables.

Additional roles for semantic inference

In addition to baseline table access, ensure that you and the service accounts have the following additional IAM roles.

Summary of additional identities and roles

Identity type: End user
Typical principal format: Your Google Cloud user account
Required IAM roles:
  • Dataplex DataScan Editor
  • Dataplex Catalog Editor
  • BigQuery Data Editor
  • BigQuery Job User
Core purpose: You use these additional roles to configure scans, view AI-generated results, curate graph profiles, and trigger the final data extraction.

Identity type: Dataplex Universal Catalog discovery agent
Typical principal format: service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com
Required IAM roles:
  • Vertex AI User
  • BigQuery Job User
  • BigQuery Data Viewer
Core purpose: This Google-managed service agent uses these additional roles to call Vertex AI to generate inferred schemas and metadata.

Identity type: BigQuery connection service account
Typical principal format: A unique identity associated with your connection (for example, bqcx-PROJECT_NUMBER-ID@gcp-sa-bigquery-condel.iam.gserviceaccount.com)
Required IAM roles:
  • Storage Object Viewer (on the source bucket)
  • Vertex AI User (on the project)
Core purpose: It connects BigQuery to external storage, allowing BigQuery to read the raw files, create object tables, and run AI inference without exposing your personal user credentials.

Identity type: Pipeline execution service account (Optional)
Typical principal format: A user-managed service account
Required IAM roles:
  • BigQuery Data Editor
  • BigQuery Job User
  • BigQuery User
  • Vertex AI User
Core purpose: If you choose to extract data using an automated pipeline, this identity runs the background jobs to materialize the AI-generated entities into BigQuery tables.

Identity type: Default Dataform service account (Optional)
Typical principal format: service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com
Required IAM roles:
  • Service Account Token Creator (granted on the pipeline execution service account)
Core purpose: When using the pipeline extraction method, Dataform requires permission to impersonate your pipeline execution service account to orchestrate the workflow.

End user roles and permissions

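To ensure that your user account has the necessary permissions to create scans, view insights, curate graph profiles, and extract data, ask your administrator to grant the following IAM roles to your user account on the project:

  • Dataplex DataScan Editor (roles/dataplex.dataScanEditor)
  • Dataplex Catalog Editor (roles/dataplex.catalogEditor)
  • BigQuery Data Editor (roles/bigquery.dataEditor)
  • BigQuery Job User (roles/bigquery.jobUser)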

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to create scans, view insights, curate graph profiles, and extract data. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create scans, view insights, curate graph profiles, and extract data:

  • DataScans:
    • dataplex.datascans.create
    • dataplex.datascans.get
    • dataplex.datascans.getData
    • dataplex.datascans.list
    • dataplex.datascans.update
  • Data extraction:
    • bigquery.tables.create
    • bigquery.tables.update
    • bigquery.tables.getData
    • bigquery.jobs.create

Your administrator might also be able to give your user account these permissions with custom roles or other predefined roles.

Dataplex discovery service agent roles and permissions

The Dataplex discovery service agent runs scans and performs semantic inference using Vertex AI.

To ensure that the Dataplex discovery service agent (usually service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com) has the necessary permissions, ask your administrator to grant it the following IAM roles on the project:

  • Vertex AI User (roles/aiplatform.user)
  • BigQuery Job User (roles/bigquery.jobUser)
  • BigQuery Data Viewer (roles/bigquery.dataViewer)

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to run scans and perform semantic inference using Vertex AI. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to run scans and perform semantic inference using Vertex AI:

  • All:
    • aiplatform.endpoints.predict
    • bigquery.datasets.create
    • bigquery.datasets.get
    • bigquery.tables.get
    • bigquery.tables.getData
    • storage.buckets.get
    • storage.objects.get
    • storage.objects.list

Your administrator might also be able to give the Dataplex discovery service agent these permissions with custom roles or other predefined roles.

BigQuery connection service account roles and permissions

A BigQuery Cloud resource connection lets Knowledge Catalog access unstructured data stored in Cloud Storage. When you create a connection, BigQuery automatically creates a dedicated service account on your behalf. This service account serves as the identity used to connect to your external data source.

By default, this service account doesn't have any permissions. You must explicitly grant this service account the required IAM roles on the Cloud Storage buckets containing your data. You can use an existing BigQuery connection or create a new one in the same location as your source Cloud Storage bucket. For more information about sharing connections, see Share a connection with users.

To ensure that the BigQuery connection service account has the necessary permissions to read object tables and run inference, ask your administrator to grant it the following IAM roles. You can retrieve the service account ID from the Connection info section of your connection details.

  • Storage Object Viewer (roles/storage.objectViewer) on the source Cloud Storage bucket
  • Vertex AI User (roles/aiplatform.user) on the project

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to read object tables and run inference. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to read object tables and run inference:

  • All:
    • storage.buckets.get on the bucket containing unstructured data
    • storage.objects.get on the bucket containing unstructured data
    • aiplatform.endpoints.predict on the project

Your administrator might also be able to give the BigQuery connection service account these permissions with custom roles or other predefined roles.

Pipeline execution service account roles and permissions (Optional)

If you choose to extract the inferred data using an automated pipeline, you must create or provide a dedicated service account to run the pipeline. This execution service account acts as the identity that authenticates and runs the background data extraction and analysis tasks in BigQuery. Additionally, you must grant the default Dataform service account permission to impersonate this execution service account.

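To ensure that the pipeline execution service account has the necessary permissions to extract the inferred entities and relationships using a pipeline, ask your administrator to grant the following IAM roles to the pipeline execution service account on the project:

  • BigQuery Data Editor (roles/bigquery.dataEditor)
  • BigQuery Job User (roles/bigquery.jobUser)
  • BigQuery User (roles/bigquery.user)
  • Vertex AI User (roles/aiplatform.user)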

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to extract the inferred entities and relationships using a pipeline. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to extract the inferred entities and relationships using a pipeline:

  • All:
    • bigquery.tables.create
    • bigquery.tables.update
    • bigquery.tables.get
    • bigquery.tables.getData
    • bigquery.jobs.create
    • aiplatform.endpoints.predict

Your administrator might also be able to give the pipeline execution service account these permissions with custom roles or other predefined roles.

To ensure that the default Dataform service account (service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com) has the necessary permissions to impersonate the pipeline execution service account, ask your administrator to grant it the following IAM role on the pipeline execution service account:

  • Service Account Token Creator (roles/iam.serviceAccountTokenCreator)

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to impersonate the pipeline execution service account. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to impersonate the pipeline execution service account:

  • All: iam.serviceAccounts.getAccessToken

Your administrator might also be able to give the default Dataform service account these permissions with custom roles or other predefined roles.
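
The service-account grants described in this section can be sketched as a small helper that prints the corresponding gcloud commands for review before an administrator runs them. This is only an illustration: the function name and all argument values are assumptions, and the role IDs are the standard identifiers for the role names in the summary table.

```python
def iam_grant_commands(project_id, project_number, connection_sa, bucket, pipeline_sa=None):
    """Return gcloud commands (as strings) for the service-account role grants."""
    dataplex_agent = f"service-{project_number}@gcp-sa-dataplex.iam.gserviceaccount.com"
    dataform_agent = f"service-{project_number}@gcp-sa-dataform.iam.gserviceaccount.com"

    def on_project(account, role):
        # Project-level grant for a service account.
        return (
            f"gcloud projects add-iam-policy-binding {project_id} "
            f"--member=serviceAccount:{account} --role={role}"
        )

    # Dataplex discovery service agent: Vertex AI and BigQuery access.
    commands = [
        on_project(dataplex_agent, role)
        for role in ("roles/aiplatform.user", "roles/bigquery.jobUser", "roles/bigquery.dataViewer")
    ]
    # BigQuery connection service account: bucket-level read plus Vertex AI.
    commands.append(
        f"gcloud storage buckets add-iam-policy-binding gs://{bucket} "
        f"--member=serviceAccount:{connection_sa} --role=roles/storage.objectViewer"
    )
    commands.append(on_project(connection_sa, "roles/aiplatform.user"))

    if pipeline_sa:
        # Optional pipeline execution service account.
        commands += [
            on_project(pipeline_sa, role)
            for role in (
                "roles/bigquery.dataEditor",
                "roles/bigquery.jobUser",
                "roles/bigquery.user",
                "roles/aiplatform.user",
            )
        ]
        # Dataform must be able to impersonate the pipeline service account.
        commands.append(
            f"gcloud iam service-accounts add-iam-policy-binding {pipeline_sa} "
            f"--member=serviceAccount:{dataform_agent} "
            "--role=roles/iam.serviceAccountTokenCreator"
        )
    return commands
```

Printing the returned list gives one command per line. Omit pipeline_sa if you plan to extract with SQL only.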


Prepare your object table

A data profile scan for unstructured data operates directly on an existing BigQuery object table. Before you create the scan, ensure that your unstructured data (such as PDFs) is stored in a Cloud Storage bucket and that you have created a corresponding BigQuery object table over that bucket using a Cloud resource connection.

Ensure that you and the Knowledge Catalog service account have the BigQuery Connection User (roles/bigquery.connectionUser) role on the connection used by the object table.

For more information about creating object tables and setting up the required connection, see Create object tables.
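
If the object table does not exist yet, a minimal sketch of the BigQuery DDL that creates one is shown below. The helper only renders the CREATE EXTERNAL TABLE statement as text. The connection ID and bucket URI in the usage note are hypothetical, and the Create object tables documentation remains the authoritative reference for the available options.

```python
def object_table_ddl(project_id, dataset_id, table_id, location, connection_id, bucket_uri):
    """Render a CREATE EXTERNAL TABLE statement for a BigQuery object table."""
    return (
        f"CREATE EXTERNAL TABLE `{project_id}.{dataset_id}.{table_id}`\n"
        f"WITH CONNECTION `{project_id}.{location}.{connection_id}`\n"
        "OPTIONS (\n"
        # object_metadata = 'SIMPLE' is what marks the table as an object table.
        "  object_metadata = 'SIMPLE',\n"
        f"  uris = ['{bucket_uri}']\n"
        ");"
    )
```

For example, object_table_ddl("example-retail-project", "marketplace_operations", "seller_agreements_obj_table", "us-central1", "seller-docs-connection", "gs://example-seller-agreements/*.pdf") produces a statement that you can paste into the BigQuery editor.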

Create a data profile scan for unstructured data

To extract semantic insights from your object table, you must create a data profile scan for unstructured data (UnstructuredDataProfileSpec). This scan uses Vertex AI Gemini 2.5 Pro models to analyze the unstructured files referenced by your object table and generate inferred metadata, schemas, and relationships.

For this initial release, scan creation is supported exclusively by using the REST API.

To create a data profile scan for unstructured data using the REST API, use the dataScans.create method with an unstructuredDataProfileSpec.

POST https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=DATASCAN
{
  "description": "Data profile scan for unstructured data",
  "data": {
    "resource": "//bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID/tables/TABLE_ID"
  },
  "executionSpec": {
    "trigger": {
      "onDemand": {}
    }
  },
  "unstructuredDataProfileSpec": {
    "customizedPrompt": "",
    "graphProfilePublishingEnabled": false
  }
}

Replace the following:

  • PROJECT_ID: the ID of your Google Cloud project.
  • LOCATION: the Google Cloud region (must support Gemini 2.5 Pro).
  • DATASCAN: the name of the data profile scan.
  • DATASET_ID and TABLE_ID: the BigQuery dataset and object table name.

Data profile scan specification parameters

  • customizedPrompt: Optional. A natural language prompt instructing Gemini on specific entities or domain context to extract (for example, "Focus extraction on M&A contract terms, identifying purchasing entities, target companies, and agreed escrow amounts."). By default, this is an empty string (""). Customized prompts are subject to a maximum character length.

  • graphProfilePublishingEnabled: Optional. Whether to automatically publish the inferred graph profile to the catalog upon scan completion. By default, this is false.

Knowledge Catalog runs the data profile scan and enriches the catalog entry with AI-generated metadata. This process usually takes a few minutes for standard datasets.
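
The request shape shown earlier in this section can also be assembled programmatically. The following sketch builds the dataScans.create URL and JSON body from the same placeholders. The function and parameter names are illustrative, and authenticating and sending the request is left to the caller.

```python
import json
from urllib.parse import quote


def build_create_scan_request(
    project_id,
    location,
    datascan_id,
    dataset_id,
    table_id,
    customized_prompt="",
    publish_graph_profile=False,
):
    """Return (url, json_body) for a dataScans.create call on an object table."""
    url = (
        "https://dataplex.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/dataScans"
        f"?dataScanId={quote(datascan_id)}"
    )
    body = {
        "description": "Data profile scan for unstructured data",
        "data": {
            "resource": (
                "//bigquery.googleapis.com/"
                f"projects/{project_id}/datasets/{dataset_id}/tables/{table_id}"
            )
        },
        # On-demand trigger: the scan runs only when you call dataScans.run.
        "executionSpec": {"trigger": {"onDemand": {}}},
        "unstructuredDataProfileSpec": {
            "customizedPrompt": customized_prompt,
            "graphProfilePublishingEnabled": publish_graph_profile,
        },
    }
    return url, json.dumps(body, indent=2)
```

The defaults mirror the documented defaults: an empty prompt and graph profile publishing turned off.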

Example: Extract contract terms from seller PDFs

The following example shows a REST API request for a sample retail company creating a data profile scan (seller-contracts-scan) to analyze seller agreement PDFs stored in an object table (seller_agreements_obj_table). It uses a customized prompt to instruct Gemini to extract specific business terms, such as commission rates and payment terms:

POST https://dataplex.googleapis.com/v1/projects/example-retail-project/locations/us-central1/dataScans?dataScanId=seller-contracts-scan
{
  "description": "Data profile scan for seller PDF agreements",
  "data": {
    "resource": "//bigquery.googleapis.com/projects/example-retail-project/datasets/marketplace_operations/tables/seller_agreements_obj_table"
  },
  "executionSpec": {
    "trigger": {
      "onDemand": {}
    }
  },
  "unstructuredDataProfileSpec": {
    "customizedPrompt": "Focus extraction on seller agreement terms, identifying seller business entities, commission rates, payment terms, and termination clauses in the PDFs.",
    "graphProfilePublishingEnabled": true
  }
}

Run the data profile scan

If you configured your data profile scan to run on demand, you must manually trigger the scan to analyze your unstructured data.

To run an on-demand data profile scan using the REST API, use the dataScans.run method:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN:run"

Replace the following:

  • PROJECT_ID: the ID of your Google Cloud project.
  • LOCATION: the Google Cloud region where the data profile scan is located.
  • DATASCAN: the name of the data profile scan.
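
To look up the results later, you need the ID of the job that the run call starts. The sketch below builds the run URL and pulls the job ID out of the response. It assumes the response wraps the new job in a "job" object whose "name" ends in /jobs/JOB_ID, and the helper names are illustrative.

```python
def run_scan_url(project_id, location, datascan_id):
    """Return the URL for the dataScans.run method."""
    return (
        "https://dataplex.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/dataScans/{datascan_id}:run"
    )


def job_id_from_run_response(response):
    """Extract JOB_ID from a run response.

    Assumed shape: {"job": {"name": "projects/.../dataScans/SCAN/jobs/JOB_ID"}}.
    """
    job_name = response["job"]["name"]
    return job_name.rsplit("/", 1)[-1]
```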

Explore data profile scan results

Once the data profile scan completes, Knowledge Catalog generates a graph profile containing the inferred schemas for entities and relationships. You can explore these results using the Google Cloud console or the REST API.
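
Because a scan can take a few minutes, a client typically polls the job until it finishes. The following is a minimal polling sketch. fetch_job is a caller-supplied function (an assumption of this example) that returns the job as a dictionary, for instance by calling dataScans.jobs.get, and the terminal state names other than SUCCEEDED are assumed.

```python
import time

# States after which the job no longer changes (assumed names, except SUCCEEDED).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(fetch_job, poll_seconds=30.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Poll fetch_job() until the job reaches a terminal state, then return it."""
    waited = 0.0
    while True:
        job = fetch_job()
        if job.get("state") in TERMINAL_STATES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"scan job still in state {job.get('state')!r}")
        sleep(poll_seconds)
        waited += poll_seconds
```

The sleep parameter exists so that the wait can be replaced in tests. In normal use, leave it at its default.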

Console

If you enabled graph profile publishing to the catalog (graphProfilePublishingEnabled: true), you can view the object table and its inferred semantic graphs in Knowledge Catalog:

  1. In the Google Cloud console, go to the Knowledge Catalog Search page.

    Go to Search

  2. Search for the object table that you configured in the scan.

  3. In the search results, click the table to open its entry page.

  4. On the Details tab, under Aspects, verify the presence of the Graph Profile aspect (dataplex-types.global.graph-profile). This aspect contains the inferred schemas for entities and relationships.

  5. Click the Insights tab. On the Insights tab, you can view the following information:

    • Semantic extraction. A banner indicates that extractable entities and relationships were detected. It includes an Extract button to materialize the data using SQL or pipeline deployment.

    • Description. An AI-generated, human-readable summary explains the unstructured data contents. It describes the primary nodes (entities) discovered and how they map to each other through edges (relationships).

    • Pipelines. A list of previously deployed data extraction pipelines associated with this resource. You can view the display name, region, creation time, and the user who created the pipeline.

    • Inferred entities and relationships. A visual, interactive graph displays the discovered semantic structure of your unstructured data. The graph contains nodes representing distinct entities, for example, Recipe and Ingredient, and edges representing the connections between them, for example, HasAllergenStatus. You can use the legend to filter and explore specific nodes and edges.

    • Entities. A detailed list of the discovered primary entities. You can expand each entity to view its AI-generated description and its inferred schema, which includes field names, data types, and field descriptions.

    • Relationships. A detailed list of the discovered connections between entities. You can expand each relationship to view its description and the schema defining how the entities map to one another.

REST

To retrieve the graph profile results directly from the scan job execution using the REST API, use the dataScans.jobs.get method with view=full:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN/jobs/JOB_ID?view=full"

Replace the following:

  • PROJECT_ID: the ID of your Google Cloud project.
  • LOCATION: the Google Cloud region where the data profile scan is located.
  • DATASCAN: the name of the data profile scan.
  • JOB_ID: the unique ID of the data profile scan job execution.

The following example shows the response for the seller-contracts-scan job, including the unstructuredDataProfileResult and graphProfile:

{
  "name": "projects/example-retail-project/locations/us-central1/dataScans/seller-contracts-scan/jobs/123e4567-e89b-12d3-a456-426614174000",
  "uid": "123e4567-e89b-12d3-a456-426614174000",
  "startTime": "2026-06-08T19:12:03.102Z",
  "endTime": "2026-06-08T19:15:28.415Z",
  "state": "SUCCEEDED",
  "type": "DATA_SCAN_TYPE_UNSTRUCTURED_DATA_PROFILE",
  "unstructuredDataProfileSpec": {
    "customizedPrompt": "Focus extraction on seller agreement terms, identifying seller business entities, commission rates, payment terms, and termination clauses in the PDFs.",
    "graphProfilePublishingEnabled": true
  },
  "unstructuredDataProfileResult": {
    "description": "The unstructured data contains seller agreement PDFs. The primary entities discovered are Seller Entity, Commission Rate, Payment Terms, and Termination Clause, mapped to each other through business agreement relationships.",
    "graphProfile": {
      "nodeTypes": [
        {
          "name": "Seller Entity",
          "description": "Discovered business entity representing the seller.",
          "fields": [
            {
              "name": "seller_name",
              "dataType": "STRING",
              "description": "The legal name of the seller.",
              "mode": "NULLABLE"
            },
            {
              "name": "address",
              "dataType": "STRING",
              "description": "The physical or mailing address of the seller.",
              "mode": "NULLABLE"
            }
          ]
        },
        {
          "name": "Commission Rate",
          "description": "Discovered agreed commission rate terms.",
          "fields": [
            {
              "name": "rate_percentage",
              "dataType": "NUMBER",
              "description": "The agreed commission percentage.",
              "mode": "NULLABLE"
            }
          ]
        },
        {
          "name": "Payment Terms",
          "description": "Discovered payment schedule and terms.",
          "fields": [
            {
              "name": "billing_cycle",
              "dataType": "STRING",
              "description": "The agreed billing frequency or payment schedule.",
              "mode": "NULLABLE"
            }
          ]
        }
      ],
      "edgeTypes": [
        {
          "name": "AgreedCommission",
          "description": "Defines the commission rate agreed by the seller entity.",
          "sourceNodeType": "Seller Entity",
          "targetNodeType": "Commission Rate"
        },
        {
          "name": "HasPaymentTerms",
          "description": "Defines the payment terms applicable to the seller entity.",
          "sourceNodeType": "Seller Entity",
          "targetNodeType": "Payment Terms"
        }
      ]
    }
  }
}
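
A response like the one above can be condensed into a quick overview of the inferred graph. The sketch below assumes the job has already been parsed into a dictionary with the field names shown in the example response. The helper name and the output shape are illustrative.

```python
def summarize_graph_profile(job):
    """Return node types with their fields, and edges as (source, name, target)."""
    profile = job.get("unstructuredDataProfileResult", {}).get("graphProfile", {})
    nodes = {
        node["name"]: [
            f"{field['name']}: {field['dataType']}" for field in node.get("fields", [])
        ]
        for node in profile.get("nodeTypes", [])
    }
    edges = [
        (edge["sourceNodeType"], edge["name"], edge["targetNodeType"])
        for edge in profile.get("edgeTypes", [])
    ]
    return {"nodes": nodes, "edges": edges}
```

Applied to the example response, the "Seller Entity" node maps to its two STRING fields, and the edges list contains one tuple each for AgreedCommission and HasPaymentTerms.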

Update inferred insights

Inferred insights are stored in Knowledge Catalog as an aspect attached to the object table. You can update these insights manually using the REST API.

REST

To update inferred insights using the REST API, follow these steps:

  1. Create a file named payload.json and add the JSON content of the aspect you want to update. For example:

    {
      "aspects": {
        "dataplex-types.global.graph-profile": {
          "data": {
            "nodeTypes": [],
            "edgeTypes": []
          }
        }
      }
    }
    
  2. Run the following command in your terminal:

    curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @payload.json \
    "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/entryGroups/ENTRY_GROUP_ID/entries/ENTRY_ID?updateMask=aspects"
    

    Replace the following:

    • PROJECT_ID: the ID of your project—for example, example-project
    • LOCATION: the location of the entry—for example, us-central1
    • ENTRY_GROUP_ID: the ID of the entry group—for example, example-entry-group (for BigQuery object tables, use @bigquery)
    • ENTRY_ID: the ID of the entry—for example, example-entry (retrieve this from the Overview tab of the entry details page in the Google Cloud console)

For more information and code samples in other languages, see Update an entry aspect.
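
As an illustration of curating a graph profile, the sketch below edits one node type's description and renders the content for payload.json. It assumes the aspect data uses the same nodeTypes and edgeTypes structure as the scan result, and it sends the whole profile rather than only the changed node, on the assumption that the aspect data is replaced as a unit. The helper names are hypothetical.

```python
import copy
import json

GRAPH_PROFILE_ASPECT = "dataplex-types.global.graph-profile"


def set_node_description(graph_profile, node_name, description):
    """Return a copy of graph_profile with one node type's description replaced."""
    updated = copy.deepcopy(graph_profile)
    for node in updated.get("nodeTypes", []):
        if node.get("name") == node_name:
            node["description"] = description
            return updated
    raise KeyError(f"node type not found: {node_name}")


def build_aspect_payload(graph_profile):
    """Render the JSON content of payload.json for the entry PATCH request."""
    payload = {
        "aspects": {
            GRAPH_PROFILE_ASPECT: {
                "data": {
                    "nodeTypes": graph_profile.get("nodeTypes", []),
                    "edgeTypes": graph_profile.get("edgeTypes", []),
                }
            }
        }
    }
    return json.dumps(payload, indent=2)
```

Write the returned string to payload.json, and then run the curl command from step 2.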

Extract data to BigQuery

You can materialize the inferred entities and relationships into structured tables or views in BigQuery using SQL or an automated pipeline.

  1. In the Google Cloud console, go to the Knowledge Catalog Search page.

    Go to Search

  2. Search for the object table that you configured in the scan.

  3. In the search results, click the table to open its entry page.

  4. Click the Insights tab.

  5. On the Insights tab, click Extraction.

  6. Choose one of the following methods based on your analytical needs and the scale of your unstructured data:

    • Extract by SQL: Choose this option for rapid, ad hoc analysis, small-to-medium datasets, or when you want a zero-infrastructure approach using BigQuery remote models.

      To extract using SQL, follow these steps:

      1. Select Extract by SQL.
      2. In the Extract with SQL pane, select a destination dataset. The dataset must be in the same location as the source.
      3. Click Extract.
      4. In the BigQuery Editor, a pre-populated query opens that uses the ML.PROCESS_DOCUMENT function. Run the query to create standard tables and views.

      For more information about using SQL to extract document insights, see Process documents with the ML.PROCESS_DOCUMENT function.

    • Extract by pipeline: Choose this option for massive-scale data processing or when you require robust retry logic, error handling, and automated orchestration to handle large volumes of documents.

      To extract using a pipeline, follow these steps:

      1. Select Extract by pipeline.
      2. In the Extract with pipeline pane, enter a display name for the pipeline.
      3. Select a region.
      4. Select a destination dataset. The dataset must be in the same location as the source.
      5. Click Extract. This creates a BigQuery pipeline that orchestrates the data materialization using Dataform.
      6. Run all tasks in the pipeline to generate structured node and edge views.

      For more information about running data workflows, see Introduction to Dataform.

After you extract and materialize the semantic insights into BigQuery, you can perform the following tasks:

  • Query the structured data. Run standard SQL queries against the newly created tables to analyze the extracted entities and relationships.

  • Join with existing data. Combine the qualitative insights extracted from your unstructured files with your existing structured BigQuery datasets (such as joining parsed invoice data with your accounting tables).

  • Explore data insights. Use the Data insights feature in BigQuery Studio to automatically generate natural language questions and SQL queries for your new structured assets.

  • Analyze with Gemini. Use Gemini in BigQuery to perform conversational analysis, summarize trends, or create dashboards in Data Studio based on the extracted data.

What's next