
BigQuery Data Source

The BigQuery Data Source tool enables your agents to query Google BigQuery data warehouses using natural language. Your agents can ask questions in plain English, and the tool automatically converts them to BigQuery SQL, executes the query, and returns structured results.
Currently Supported: This tool currently supports Google BigQuery only. Other data sources may be added in future releases.
This tool has Pro status and requires a professional subscription to access advanced features.

Overview

The BigQuery Data Source tool combines the power of Google BigQuery’s analytics capabilities with GPT-4’s natural language understanding to create a seamless data access experience for your agents.

Natural Language Queries

Ask questions in plain English - no SQL knowledge required

GPT-4 Powered

Automatic conversion to BigQuery SQL using advanced AI

Real-Time Data

Query live data directly from your BigQuery datasets

Secure Access

Enterprise-grade security with Google Cloud authentication

Key Features

  • Natural Language to SQL: Converts plain English questions to BigQuery Standard SQL
  • Automatic Schema Detection: Understands your table structures automatically
  • Token Usage Tracking: Monitors OpenAI API costs for each query
  • Result Limiting: Controls query size and response volume
  • Usage Analytics: Tracks query performance and usage patterns

How It Works

  1. Your agent receives a natural language question from a user
  2. The question is sent to GPT-4 (gpt-4.1-2025-04-14) along with your BigQuery schema
  3. GPT-4 generates optimized BigQuery Standard SQL
  4. The SQL query is executed on your BigQuery dataset
  5. Results are formatted as JSON and returned to the agent
  6. Token usage is tracked for billing and monitoring
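The six-step flow above can be sketched end-to-end. The helper names below (`generate_sql`, `run_bigquery`) are hypothetical stand-ins for the tool's internals, stubbed with canned values so the flow itself is runnable:

```python
# Illustrative sketch of the query pipeline -- not the tool's real internals.

def generate_sql(question: str, schema: dict) -> tuple:
    """Stub for the GPT-4 call (steps 2-3). Returns (sql, prompt_tokens, completion_tokens)."""
    sql = "SELECT COUNT(*) AS order_count FROM `proj.ds.orders`"
    return sql, 1200, 150

def run_bigquery(sql: str) -> list:
    """Stub for BigQuery execution (step 4)."""
    return [{"order_count": 42}]

def answer(question: str, schema: dict) -> dict:
    sql, p_tok, c_tok = generate_sql(question, schema)  # steps 2-3
    rows = run_bigquery(sql)                            # step 4
    return {                                            # steps 5-6: format + track tokens
        "sql_query": sql,
        "prompt_tokens": p_tok,
        "completion_tokens": c_tok,
        "rows": rows,
    }

result = answer("How many orders were placed yesterday?", {})
```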

Prerequisites

Before configuring the BigQuery Data Source tool, ensure you have:
  1. Google Cloud Platform Account: Active GCP account with billing enabled
  2. BigQuery Dataset: One or more BigQuery datasets containing your data
  3. Service Account: GCP service account with BigQuery Data Viewer permissions
  4. Service Account Key: Downloaded JSON credentials file for the service account
  5. Table Access: Service account granted access to the specific tables you want to query

Setting Up Google Cloud Service Account

  1. Create Service Account:
    • In the GCP Console, go to IAM & Admin → Service Accounts
    • Click "Create Service Account" and give it a name and description
  2. Grant BigQuery Permissions:
    • Add role: BigQuery Data Viewer (for read-only access)
    • Optionally add: BigQuery Job User (to run queries)
    • Click “Continue” then “Done”
  3. Create and Download Key:
    • Click on the created service account
    • Go to “Keys” tab → “Add Key” → “Create new key”
    • Select “JSON” format
    • Download and securely store the JSON file
Security: Never commit service account keys to version control. Store them securely and rotate regularly (recommended: every 90 days).
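Before pasting a downloaded key into the tool, it can be worth sanity-checking the file. A minimal, stdlib-only check (a sketch, not part of the product) might look like:

```python
import json

# Fields the tool's configuration documentation lists as required.
REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "auth_uri", "token_uri",
}

def check_service_account_key(raw_json: str) -> list:
    """Return a list of problems found in a downloaded key file (empty list = looks OK)."""
    creds = json.loads(raw_json)
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - creds.keys())]
    if creds.get("type") != "service_account":
        problems.append("type must be 'service_account'")
    return problems
```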

Configuration

Configuration Structure

The tool requires a specific configuration structure with your Google Cloud credentials and BigQuery dataset information:
{
  "tool_type": "EXTERNAL_DATASOURCE",
  "name": "Company Data Warehouse",
  "description": "Query company analytics data using natural language",
  "config": {
    "external_datasource_type": "BIG_QUERY",
    "big_query_config": {
      "service_account_credentials": {
        "type": "service_account",
        "project_id": "my-gcp-project-id",
        "private_key_id": "abc123def456...",
        "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvQIBADANBgkqhkiG9w0BAQ...\n-----END PRIVATE KEY-----\n",
        "client_email": "plai-bigquery@my-gcp-project.iam.gserviceaccount.com",
        "client_id": "123456789012345678901",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/plai-bigquery%40my-gcp-project.iam.gserviceaccount.com"
      },
      "dataset_id": "analytics_prod",
      "table_ids": ["customers", "orders", "products", "transactions"],
      "system_prompt": null,
      "query_max_rows": 10000,
      "output_max_rows": 10
    }
  }
}

Configuration Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `external_datasource_type` | string | Yes | - | Must be `"BIG_QUERY"` (only supported type) |
| `service_account_credentials` | object | Yes | - | Complete Google Cloud service account JSON credentials |
| `dataset_id` | string | Yes | - | BigQuery dataset ID containing your tables |
| `table_ids` | string[] | Yes | - | List of table names the agent can access and query |
| `system_prompt` | string or null | No | null | Optional custom instructions for SQL generation |
| `query_max_rows` | integer | No | 10000 | Maximum rows to retrieve from BigQuery |
| `output_max_rows` | integer | No | 10 | Maximum rows to return to the agent (limits response size) |
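As a sketch (not the product's actual code), the required keys and defaults described above could be applied like so:

```python
# Defaults documented for the big_query_config block.
DEFAULTS = {"system_prompt": None, "query_max_rows": 10000, "output_max_rows": 10}

def normalize_big_query_config(cfg: dict) -> dict:
    """Check required keys and fill in documented defaults (illustrative only)."""
    for key in ("service_account_credentials", "dataset_id", "table_ids"):
        if key not in cfg:
            raise ValueError(f"missing required key: {key}")
    return {**DEFAULTS, **cfg}
```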

Configuration Through PLai Interface

The tool configuration wizard guides you through a 4-step process:
Step 1: Settings

Configure basic tool settings and query parameters.

Required Fields:
  • Name: Descriptive name for your BigQuery tool (e.g., “Company Analytics Database”)
  • Slug: Auto-generated lowercase identifier with underscores (e.g., “company_analytics_database”)
  • Description: Explain what data this tool provides access to
Optional Fields:
  • System Prompt: Custom instructions for SQL generation (supports Markdown editor for advanced formatting)
  • Query Max Rows: Maximum rows to retrieve from BigQuery (default: 10000)
  • Output Max Rows: Maximum rows to return to agent (default: 10)
The System Prompt can be used to provide specific SQL generation guidelines, such as preferred date formats, naming conventions, or business rules.
Click Next to continue to Credentials.
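Assuming the slug rule is "lowercase, with runs of non-alphanumeric characters collapsed to single underscores" (an inference from the example above, not a confirmed spec), the auto-generation behaves roughly like:

```python
import re

def slugify(name: str) -> str:
    """Lowercase the name and replace runs of non-alphanumerics with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

slugify("Company Analytics Database")  # "company_analytics_database"
```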
Step 2: Credentials

Paste your Google Cloud service account credentials:
  • Copy the entire JSON credentials file from your downloaded service account key
  • Paste into the text area labeled “Paste your Google Cloud service account credentials JSON here”
  • The JSON must include all required fields: type, project_id, private_key, client_email, etc.
Example format:
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "service-account@project.iam.gserviceaccount.com",
  ...
}
These credentials are stored securely and encrypted. Never share or commit credentials to version control.
Click Next to continue to Datasets (the system will validate credentials automatically).
Step 3: Datasets

Select the BigQuery dataset to query:
  • The wizard automatically detects available datasets from your service account credentials
  • You’ll see a list of datasets with radio buttons to select one
  • Each dataset shows its ID and the associated project name below
  • Choose the dataset containing the tables you want to make available to your agent
You can only select one dataset per tool. To query multiple datasets, create separate tools for each dataset.
Click Next to continue to Tables.
Step 4: Tables

Select which tables the agent can access:
  • Check the boxes next to tables you want to make available to the agent
  • All tables display “Type: TABLE” to indicate they are queryable
  • Multiple tables can be selected
  • The agent will only be able to query the selected tables
  • Table schemas are automatically detected and used for SQL generation context
Security Best Practice: Only grant access to tables the agent needs. Apply the principle of least privilege.
Click Create Tool to finish setup (or Update Tool if editing an existing configuration).

Add Tool to Agent

After creating the tool, assign it to your agents:
  1. Navigate to your Agent configuration page
  2. Go to the Tools section
  3. Click Add Tool and select your newly created BigQuery tool
  4. Save the agent configuration
Your agent can now query BigQuery data using natural language questions!

Natural Language Queries

How Query Conversion Works

The tool uses OpenAI’s GPT-4 to convert natural language questions into BigQuery Standard SQL. This happens automatically and transparently.

Model Details:
  • Model: gpt-4.1-2025-04-14
  • Temperature: 0.1 (low temperature for consistent, deterministic SQL)
  • Context Provided: BigQuery schema, table definitions, example queries, and constraints
Query Processing Flow:
  1. User asks a question in natural language
  2. Tool sends question + schema context to GPT-4
  3. GPT-4 generates BigQuery-compatible SQL
  4. SQL is executed against your dataset
  5. Results are formatted and returned
  6. Token usage is recorded for billing

Example Conversions

Natural Language: “How many orders were placed yesterday?”

Generated SQL:
SELECT COUNT(*) as order_count
FROM `my-project.analytics_prod.orders`
WHERE DATE(order_date) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

Natural Language: “Show me the total revenue for last month”

Generated SQL:
SELECT SUM(total_amount) as total_revenue
FROM `my-project.analytics_prod.orders`
WHERE DATE_TRUNC(order_date, MONTH) = 
      DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH);

Tips for Effective Queries

Good: “Show total revenue by product category for Q4 2024”
Less good: “Show me some revenue data”
Specific queries help GPT-4 generate more accurate SQL with appropriate filters and groupings.

Good: “How many new customers signed up last week?”
Less good: “How many new customers?”
Including time ranges prevents queries from scanning unnecessary data and improves performance.

Good: “Which products have the highest return rate?”
Less good: “SELECT product, returns FROM…”
Let GPT-4 write the SQL: describe what you want, not how to get it.

Good: “Show me top customers from the customers and orders tables”
Less good: “Show me top customers”
Mentioning table names helps GPT-4 understand which tables to query.

BigQuery SQL Examples

While natural language queries are the primary interface, understanding BigQuery SQL syntax can help you configure custom system prompts and troubleshoot issues.

Basic Queries

-- Select with filtering
SELECT 
  customer_id,
  name,
  email,
  signup_date
FROM `project-id.analytics_prod.customers`
WHERE DATE(signup_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
ORDER BY signup_date DESC;

Aggregations and Grouping

-- Revenue by product category
SELECT 
  p.category,
  COUNT(DISTINCT o.order_id) as order_count,
  SUM(oi.quantity) as total_units_sold,
  SUM(oi.quantity * oi.price) as total_revenue,
  AVG(oi.price) as avg_price
FROM `project-id.analytics_prod.products` AS p
JOIN `project-id.analytics_prod.order_items` AS oi
  ON p.product_id = oi.product_id
JOIN `project-id.analytics_prod.orders` AS o
  ON oi.order_id = o.order_id
WHERE o.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY p.category
ORDER BY total_revenue DESC;

Time Series Analysis

-- Daily active users trend
SELECT 
  DATE(event_timestamp) as date,
  COUNT(DISTINCT user_id) as daily_active_users,
  COUNT(*) as total_events,
  COUNT(*) / COUNT(DISTINCT user_id) as avg_events_per_user
FROM `project-id.analytics_prod.user_events`
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY date
ORDER BY date;

Window Functions

-- Sales ranking with running totals
SELECT 
  sales_rep,
  sale_date,
  sale_amount,
  ROW_NUMBER() OVER (
    PARTITION BY sales_rep 
    ORDER BY sale_amount DESC
  ) as rank_within_rep,
  SUM(sale_amount) OVER (
    PARTITION BY sales_rep 
    ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) as running_total
FROM `project-id.analytics_prod.sales`
WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);

BigQuery-Specific Features

-- Using STRUCT for nested data
SELECT 
  order_id,
  customer_id,
  STRUCT(
    shipping_address,
    shipping_city,
    shipping_state,
    shipping_zip
  ) as shipping_info,
  total_amount
FROM `project-id.analytics_prod.orders`
WHERE order_date >= CURRENT_DATE();

Usage & Best Practices

Query Optimization

Use Time Filters

Always include date filters to limit data scanned and reduce costs

Limit Result Sets

Use LIMIT clauses and configure appropriate query_max_rows

Select Specific Columns

Avoid SELECT * - only query columns you need

Partition Filtering

Filter on partitioned columns (usually date fields) first

Cost Management

OpenAI Costs:
  • ~$0.005-0.030 per query for GPT-4 token usage
  • Varies based on query complexity and schema size
  • Token usage is tracked and returned in results
BigQuery Costs:
  • Charged based on data scanned (per TB)
  • Typical query: $0.001-0.050 depending on data volume
  • Use partitioned tables and clustering to reduce costs
  • Preview queries in BigQuery Console to estimate costs
Cost Optimization: Set conservative query_max_rows limits and use time-range filters in your natural language queries to minimize data scanned.
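For budgeting, the two cost components can be combined in a rough per-query estimate. The rates below are placeholder assumptions; check current OpenAI and BigQuery on-demand pricing before relying on any such figure:

```python
TIB = 1024 ** 4  # BigQuery on-demand pricing is quoted per TiB scanned

def estimate_query_cost(prompt_tokens: int, completion_tokens: int,
                        bytes_scanned: int,
                        usd_per_tib: float = 6.25,          # assumed BigQuery rate
                        usd_per_1k_prompt: float = 0.01,    # assumed OpenAI rates
                        usd_per_1k_completion: float = 0.03) -> float:
    """Rough per-query cost estimate. All rates are illustrative assumptions."""
    llm = (prompt_tokens / 1000 * usd_per_1k_prompt
           + completion_tokens / 1000 * usd_per_1k_completion)
    bq = bytes_scanned / TIB * usd_per_tib
    return round(llm + bq, 6)
```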

Security Best Practices

Recommended Roles:
  • roles/bigquery.dataViewer - Read-only access to data
  • roles/bigquery.jobUser - Ability to run queries
Avoid:
  • roles/bigquery.admin - Too broad for agent use
  • Write permissions - Unless specifically required
Grant Access:
# Grant BigQuery Data Viewer role
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
  --role="roles/bigquery.dataViewer"

# Grant BigQuery Job User role
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
  --role="roles/bigquery.jobUser"
DO:
  • Store credentials encrypted in PLai’s secure storage
  • Rotate service account keys every 90 days
  • Use separate service accounts per environment (dev/staging/prod)
  • Monitor service account usage in GCP Console
DON’T:
  • Commit credentials to version control
  • Share service account keys via email/chat
  • Use personal GCP accounts for service accounts
  • Grant excessive permissions “just in case”
Table-Level Security:
  • Only include necessary tables in table_ids configuration
  • Use BigQuery authorized views for sensitive data
  • Implement row-level security in BigQuery if needed
Dataset Security:
  • Create separate datasets for different security levels
  • Use different service accounts for different datasets
  • Enable BigQuery audit logging to track access
Track Usage:
  • Monitor query patterns in agent analytics
  • Review token usage for cost anomalies
  • Check BigQuery job history regularly
  • Set up alerts for unusual query patterns
Audit Logs:
  • All queries are logged with timestamps
  • Token usage is recorded per query
  • SQL statements are stored for review
  • Failed queries are tracked with error details

Tool Response Structure

Response Format

When a query is executed, the tool returns a structured response containing:
{
  sql_query: string,           // Generated BigQuery SQL
  prompt_tokens: number,        // Tokens used for GPT-4 request
  completion_tokens: number,    // Tokens used in GPT-4 response
  json_table: {
    columns: string[],          // Column names
    rows: any[][]               // Data rows (up to output_max_rows)
  } | null
}

Example Response

{
  "sql_query": "SELECT customer_id, name, email, SUM(total_amount) as revenue FROM `project.dataset.orders` JOIN `project.dataset.customers` USING(customer_id) WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY customer_id, name, email ORDER BY revenue DESC LIMIT 10",
  "prompt_tokens": 1247,
  "completion_tokens": 156,
  "json_table": {
    "columns": ["customer_id", "name", "email", "revenue"],
    "rows": [
      ["CUST-001", "Acme Corp", "contact@acme.com", 125000.50],
      ["CUST-002", "TechStart Inc", "hello@techstart.io", 98750.25],
      ["CUST-003", "Global Industries", "info@global.com", 87500.00]
    ]
  }
}
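Since `json_table` is columnar (parallel `columns` and `rows` arrays), agents or downstream code may find row dictionaries easier to work with. A small conversion sketch:

```python
from typing import Optional

def table_to_records(json_table: Optional[dict]) -> list:
    """Convert the tool's columnar json_table into a list of row dicts.

    Returns an empty list when json_table is null (e.g. the query produced no table).
    """
    if json_table is None:
        return []
    cols = json_table["columns"]
    return [dict(zip(cols, row)) for row in json_table["rows"]]
```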

Limitations

Be aware of these limitations when using the BigQuery Data Source tool:

Current Limitations

Database Support:
  • ✅ Google BigQuery - Fully supported
  • ❌ PostgreSQL - Not available
  • ❌ MySQL - Not available
  • ❌ MongoDB - Not available
  • ❌ Other databases - Not available
Query Constraints:
  • Maximum query result: 10,000 rows (default, configurable via query_max_rows)
  • Maximum agent output: 10 rows (default, configurable via output_max_rows)
  • Query timeout: Depends on BigQuery job limits (typically 6 hours max)
  • No support for DML operations (INSERT, UPDATE, DELETE)
  • No support for DDL operations (CREATE, ALTER, DROP)
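The read-only constraint above could be enforced client-side with a naive first-keyword check; a real implementation should prefer BigQuery's dry-run mode or a proper SQL parser. A sketch:

```python
import re

# Statement types the tool does not support (DML and DDL).
FORBIDDEN = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|MERGE|CREATE|ALTER|DROP|TRUNCATE)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Naive check that a statement is not DML/DDL (illustrative only)."""
    return FORBIDDEN.match(sql) is None
```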
Cost Considerations:
  • OpenAI API costs: ~$0.005-0.030 per query
  • BigQuery costs: Based on data scanned (varies by query)
  • No built-in cost limits or budgets
  • Must monitor usage actively
Query Generation:
  • GPT-4 may occasionally generate suboptimal SQL
  • Complex multi-table joins may require query refinement
  • Schema context limited to configured tables only
  • No support for dynamic table discovery
Performance:
  • Query latency: 2-10 seconds typical (GPT-4 + BigQuery execution)
  • Not suitable for sub-second real-time responses
  • Large result sets may impact performance
  • Concurrent query limits apply

Troubleshooting

Common Issues

Symptoms: Permission errors when querying BigQuery

Possible Causes:
  • Service account lacks required BigQuery permissions
  • Table not included in table_ids configuration
  • Dataset doesn’t exist or is in different project
  • Service account doesn’t have access to specific tables
Solutions:
  1. Verify service account has roles/bigquery.dataViewer role:
    gcloud projects get-iam-policy PROJECT_ID \
      --flatten="bindings[].members" \
      --filter="bindings.members:serviceAccount:YOUR_SERVICE_ACCOUNT"
    
  2. Add missing tables to table_ids array in configuration
  3. Check dataset exists and project ID matches credentials:
    bq ls --project_id=PROJECT_ID
    
  4. Grant table-level access if using authorized views
Symptoms: Query fails with “Invalid SQL syntax” or “Unrecognized name”

Possible Causes:
  • GPT-4 generated BigQuery-incompatible SQL
  • Table references missing backticks
  • Using functions from other SQL dialects
  • Column names don’t exist in schema
Solutions:
  1. Rephrase natural language query more clearly and specifically
  2. Mention specific table names in your question
  3. Check the generated SQL in the response and validate manually in BigQuery Console
  4. Use custom system_prompt to guide SQL generation:
    {
      "system_prompt": "Always use DATE_TRUNC for date truncation. Table references must use backticks. Follow BigQuery Standard SQL syntax strictly."
    }
    
  5. Review BigQuery Standard SQL documentation for correct syntax
Symptoms: “Quota exceeded” or “Rate limit exceeded” errors

Possible Causes:
  • BigQuery daily query quota reached
  • Too many concurrent queries
  • Exceeded slots or bytes scanned limits
Solutions:
  1. Check BigQuery quotas in GCP Console:
    • Go to IAM & Admin → Quotas
    • Filter for “BigQuery API”
    • Review usage and limits
  2. Request quota increase from Google Cloud Support
  3. Implement query result caching to reduce repeated queries
  4. Reduce query frequency or batch queries together
  5. Consider upgrading to BigQuery flat-rate pricing for predictable costs
Symptoms: Authentication failures or “Invalid credentials” errors

Possible Causes:
  • Malformed JSON credentials
  • Expired or revoked service account key
  • Missing required credential fields
  • Service account deleted from GCP
Solutions:
  1. Download fresh service account key from GCP Console
  2. Validate JSON syntax using a JSON validator
  3. Ensure all required fields are present:
    • type, project_id, private_key_id, private_key
    • client_email, client_id, auth_uri, token_uri
  4. Verify service account still exists:
    gcloud iam service-accounts describe SERVICE_ACCOUNT_EMAIL
    
  5. Create new service account if the old one was deleted
Symptoms: Query succeeds but returns no data or empty results

Possible Causes:
  • Query filters too restrictive (no matching data)
  • output_max_rows set too low
  • Data doesn’t exist in specified tables
  • WHERE clause conditions exclude all rows
Solutions:
  1. Test the generated SQL directly in BigQuery Console
  2. Remove filters temporarily to check if data exists
  3. Increase output_max_rows if expecting more results
  4. Verify data exists in specified tables:
    SELECT COUNT(*) FROM `project.dataset.table`;
    
  5. Check date filters aren’t excluding all data
  6. Review WHERE clause conditions in generated SQL
Symptoms: Queries taking longer than expected (>10 seconds)

Possible Causes:
  • Large data scan (non-partitioned queries)
  • Complex joins across multiple tables
  • Missing indexes or clustering
  • Inefficient SQL generated by GPT-4
Solutions:
  1. Add time-range filters to limit data scanned
  2. Use partitioned columns in WHERE clauses
  3. Review query execution plan in BigQuery Console
  4. Optimize table structure with partitioning and clustering
  5. Rephrase question to generate simpler SQL
  6. Consider materializing complex views for frequently-accessed data

Getting Help

If you continue to experience issues:
  1. Check BigQuery Logs: Review query history in BigQuery Console
  2. Review Token Usage: High token usage may indicate schema issues
  3. Test Manually: Run generated SQL directly in BigQuery to isolate issues
  4. Check Service Account: Verify permissions and access in GCP IAM
  5. Contact Support: Provide query details, error messages, and service account setup

Security & Compliance

Data Protection

Google Cloud Security

Inherits all Google Cloud Platform security features including encryption at rest and in transit

Service Account Auth

Uses OAuth 2.0 service account authentication with secure credential storage

Query Auditing

All queries logged with timestamps, SQL statements, and token usage

Access Controls

BigQuery IAM permissions control data access at project, dataset, and table levels

Compliance Considerations

Compliance certifications depend on your Google Cloud Platform configuration and PLai infrastructure setup. Consult with your security and compliance teams to ensure requirements are met.
Key Compliance Areas:
  • GDPR: Ensure BigQuery data processing agreements are in place with Google Cloud
  • HIPAA: Use HIPAA-compliant GCP projects if handling protected health information
  • SOC 2: Verify both PLai and Google Cloud Platform SOC 2 compliance status
  • Data Residency: Configure BigQuery datasets in appropriate geographic regions
  • Audit Trails: Enable BigQuery audit logging for compliance reporting

Encryption

  • In Transit: All data transfers use TLS 1.3 encryption
  • At Rest: BigQuery data encrypted at rest by default using Google-managed keys
  • Credentials: Service account private keys encrypted in PLai secure storage
  • Query Results: Temporary query results encrypted in BigQuery

Use Cases

Customer Support

Enable support agents to access customer data instantly:
Agent Query: "Show me the last 5 orders for customer john@example.com"

Generated SQL:
SELECT order_id, order_date, total_amount, status
FROM `project.dataset.orders`
WHERE customer_email = 'john@example.com'
ORDER BY order_date DESC
LIMIT 5;

Business Intelligence

Provide real-time analytics to business users:
Agent Query: "What was our total revenue by product category last quarter?"

Generated SQL:
SELECT 
  p.category,
  SUM(oi.quantity * oi.price) as total_revenue
FROM `project.dataset.order_items` oi
JOIN `project.dataset.products` p ON oi.product_id = p.product_id
JOIN `project.dataset.orders` o ON oi.order_id = o.order_id
WHERE DATE_TRUNC(o.order_date, QUARTER) = 
      DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 QUARTER), QUARTER)
GROUP BY p.category
ORDER BY total_revenue DESC;

Operations & Monitoring

Monitor system health and operational metrics:
Agent Query: "Show me any failed transactions in the last hour"

Generated SQL:
SELECT 
  transaction_id,
  customer_id,
  amount,
  error_message,
  timestamp
FROM `project.dataset.transactions`
WHERE status = 'FAILED'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY timestamp DESC;

Sales Analytics

Track sales performance and trends:
Agent Query: "Which sales reps exceeded their quota this month?"

Generated SQL:
WITH monthly_sales AS (
  SELECT 
    sales_rep_id,
    SUM(sale_amount) as total_sales
  FROM `project.dataset.sales`
  WHERE DATE_TRUNC(sale_date, MONTH) = DATE_TRUNC(CURRENT_DATE(), MONTH)
  GROUP BY sales_rep_id
)
SELECT 
  sr.name,
  sr.quota,
  ms.total_sales,
  ms.total_sales - sr.quota as over_quota
FROM monthly_sales ms
JOIN `project.dataset.sales_reps` sr ON ms.sales_rep_id = sr.sales_rep_id
WHERE ms.total_sales > sr.quota
ORDER BY over_quota DESC;

Next Steps