This guide covers three ways to send your Browse AI scraped data into Databricks, from no-code automation to custom API integrations. Choose the method that best fits your team's technical comfort and use case.
Prerequisites: You'll need an approved Browse AI robot with scraped data, a Databricks workspace with a SQL warehouse, and a target table in your catalog. For API-based methods, you'll also need a Browse AI API key and a Databricks personal access token.
Which method should I use?
Method | Best for | Technical level | Speed
Zapier / Make | Teams without developers | No code | Near real-time |
Webhooks + SQL connector | Real-time pipelines with custom logic | Intermediate | Real-time |
API polling + batch insert | Large-scale batch processing | Intermediate | On schedule |
Method 1: Zapier or Make (no code)
This is the fastest way to connect Browse AI to Databricks. A Zapier or Make automation runs whenever a task finishes and uses an HTTP action to insert rows through the Databricks SQL Statement Execution API.
Setting up with Zapier
Go to zapier.com and create a new Zap.
Trigger: Choose Browse AI as the trigger app, then select New Successful Task Finished.
Connect your Browse AI account and select the robot you want to sync data from.
Action: Choose Webhooks by Zapier → Custom Request to call Databricks' SQL Statement Execution API directly. Set the method to POST and the URL to https://your-workspace.cloud.databricks.com/api/2.0/sql/statements, add your access token in the Authorization header (as Bearer followed by the token), and pass your warehouse_id and an INSERT statement in the JSON body.
Test the Zap, then turn it on.
Setting up with Make (formerly Integromat)
Create a new scenario in Make.
Add a Webhooks → Custom webhook module as the trigger. Copy the webhook URL.
In Browse AI, go to your robot's Integrate tab and add the Make webhook URL under Webhooks. Select the taskFinishedSuccessfully event.
Add an HTTP → Make a request module to call the Databricks SQL Statement Execution API with an INSERT statement.
Activate the scenario.
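Both the Zapier and Make setups end with the same HTTP call, so it can help to see the request outside of either tool. The sketch below shows one way the call could look in Python using only the standard library. The function names (build_insert_request, send_statement) and the choice of three columns are illustrative assumptions, and the placeholder host, token, and warehouse ID need to be replaced with your own values.

```python
import json
import urllib.request

# Placeholder values; replace with your own workspace details.
DATABRICKS_HOST = "your-workspace.cloud.databricks.com"
DATABRICKS_TOKEN = "your_access_token"
WAREHOUSE_ID = "your_warehouse_id"


def build_insert_request(task_id, email, company_name):
    """Build the JSON body for the SQL Statement Execution API (hypothetical helper)."""
    return {
        "warehouse_id": WAREHOUSE_ID,
        "statement": (
            "INSERT INTO browse_ai_leads (task_id, email, company_name) "
            "VALUES (:task_id, :email, :company_name)"
        ),
        # Named parameters keep scraped text out of the SQL string itself.
        "parameters": [
            {"name": "task_id", "value": task_id, "type": "STRING"},
            {"name": "email", "value": email, "type": "STRING"},
            {"name": "company_name", "value": company_name, "type": "STRING"},
        ],
        # Wait up to 30 seconds for the statement to finish before returning.
        "wait_timeout": "30s",
    }


def send_statement(body):
    """POST the statement to Databricks and return the parsed JSON response."""
    req = urllib.request.Request(
        f"https://{DATABRICKS_HOST}/api/2.0/sql/statements",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

In Zapier or Make, the dictionary returned by build_insert_request corresponds to the JSON body of the HTTP step, with the scraped fields mapped into the parameter values.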
Method 2: Webhooks + Databricks SQL Connector (real-time)
Use Browse AI webhooks to push data directly into Databricks as soon as a task finishes. The Databricks SQL Connector for Python makes it easy to execute SQL against your warehouse.
Step 1: Get your Databricks access token and warehouse details
In your Databricks workspace, click your username in the top right and select Settings.
Under Developer → Access tokens, click Generate new token.
Copy the token.
Go to SQL Warehouses, click on your warehouse, and note the Server hostname and HTTP path from the Connection details tab.
Step 2: Create your target table
-- Run this in a Databricks notebook or SQL editor
CREATE TABLE IF NOT EXISTS browse_ai_leads (
task_id STRING,
robot_id STRING,
first_name STRING,
last_name STRING,
email STRING,
company_name STRING,
phone STRING,
website STRING,
job_title STRING,
origin_url STRING,
scraped_at TIMESTAMP
);
Step 3: Build your webhook endpoint
from datetime import datetime, timezone

from databricks import sql
from flask import Flask, request, jsonify

app = Flask(__name__)

# Replace these placeholders with the connection details from Step 1.
DATABRICKS_HOST = "your-workspace.cloud.databricks.com"
DATABRICKS_TOKEN = "your_access_token"
DATABRICKS_HTTP_PATH = "/sql/1.0/warehouses/your_warehouse_id"


def get_connection():
    return sql.connect(
        server_hostname=DATABRICKS_HOST,
        http_path=DATABRICKS_HTTP_PATH,
        access_token=DATABRICKS_TOKEN,
    )


@app.route("/browse-ai-webhook", methods=["POST"])
def handle_webhook():
    payload = request.get_json(silent=True) or {}

    # Acknowledge, but skip, anything other than a successful task.
    if payload.get("event") != "taskFinishedSuccessfully":
        return jsonify({"status": "ignored"}), 200

    task = payload.get("task", {})
    captured = task.get("capturedTexts", {})

    with get_connection() as conn:
        with conn.cursor() as cursor:
            # The "?" markers are bound as query parameters by the connector.
            cursor.execute(
                """INSERT INTO browse_ai_leads
                (task_id, robot_id, first_name, last_name, email,
                 company_name, phone, website, job_title, origin_url, scraped_at)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
                [
                    task.get("id", ""),
                    task.get("robotId", ""),
                    captured.get("first_name", ""),
                    captured.get("last_name", ""),
                    captured.get("email", ""),
                    captured.get("company_name", ""),
                    captured.get("phone", ""),
                    captured.get("website", ""),
                    captured.get("job_title", ""),
                    task.get("inputParameters", {}).get("originUrl", ""),
                    # Passed as a datetime so it matches the TIMESTAMP column.
                    datetime.now(timezone.utc),
                ],
            )

    return jsonify({"status": "inserted"}), 200


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Install the connector: Run pip install databricks-sql-connector. The connector works with both Databricks SQL warehouses and all-purpose compute clusters.
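It can be easier to debug the field mapping when it lives in a small function that does not touch Flask or Databricks. The following is a minimal sketch of that idea; payload_to_row and CAPTURED_FIELDS are hypothetical names, and the payload shape (event, task, capturedTexts, inputParameters) is taken from the endpoint in Step 3.

```python
from datetime import datetime, timezone

# Captured fields in the same order as the browse_ai_leads columns from Step 2.
CAPTURED_FIELDS = (
    "first_name", "last_name", "email", "company_name",
    "phone", "website", "job_title",
)


def payload_to_row(payload, now=None):
    """Return the INSERT parameters for a webhook payload, or None to ignore it."""
    if payload.get("event") != "taskFinishedSuccessfully":
        return None
    task = payload.get("task", {})
    captured = task.get("capturedTexts", {})
    scraped_at = now or datetime.now(timezone.utc)
    return [
        task.get("id", ""),
        task.get("robotId", ""),
        # Missing captured fields fall back to an empty string.
        *(captured.get(field, "") for field in CAPTURED_FIELDS),
        task.get("inputParameters", {}).get("originUrl", ""),
        scraped_at,
    ]
```

The returned list has eleven entries, one per column, so it can be handed to cursor.execute alongside the INSERT statement. Because the function takes a plain dictionary, a saved sample payload is enough to check the mapping before deploying the endpoint.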
Step 4: Register the webhook in Browse AI
Via the dashboard:
Open your robot and go to the Integrate tab.
Under Webhooks, click Add webhook.
Paste your endpoint URL and select the taskFinishedSuccessfully event.
Via the API:
curl -X POST "https://api.browse.ai/v2/robots/YOUR_ROBOT_ID/webhooks" \
-H "Authorization: Bearer YOUR_BROWSE_AI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://yourdomain.com/browse-ai-webhook",
"events": ["taskFinishedSuccessfully"]
}'
IP allowlisting: Browse AI sends webhooks from IP address 3.228.254.190. If your server has a firewall, add this to your allowlist. See Webhooks: IP address for allowlisting.
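If you would rather check the sender in application code than in a firewall, the same address can be compared against the incoming request. This is only a sketch: is_allowed_sender and BROWSE_AI_WEBHOOK_IPS are hypothetical names, and the address is the one quoted in the note above.

```python
import ipaddress

# The address from the allowlisting note; keep it in one place so it is easy to update.
BROWSE_AI_WEBHOOK_IPS = {ipaddress.ip_address("3.228.254.190")}


def is_allowed_sender(remote_addr):
    """Return True if remote_addr is one of the allowlisted addresses."""
    try:
        return ipaddress.ip_address(remote_addr) in BROWSE_AI_WEBHOOK_IPS
    except ValueError:
        # Malformed or empty address: treat it as not allowed.
        return False
```

In the Flask endpoint this could be called with request.remote_addr before processing the payload. If the server sits behind a proxy or load balancer, the address Flask reports may be the proxy's rather than the original sender's, so the check would need to be adapted to that setup.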
Method 3: API polling + batch insert (scheduled)
For large-scale data extraction, poll the Browse AI API on a schedule and batch-insert results into Databricks.
from datetime import datetime, timedelta, timezone

import requests
from databricks import sql

BROWSE_AI_API_KEY = "your_browse_ai_api_key"
ROBOT_ID = "your_robot_id"
DATABRICKS_HOST = "your-workspace.cloud.databricks.com"
DATABRICKS_TOKEN = "your_access_token"
DATABRICKS_HTTP_PATH = "/sql/1.0/warehouses/your_warehouse_id"


def finished_at(task):
    # Accepts epoch milliseconds or an ISO 8601 string; check which one your responses use.
    value = task["finishedAt"]
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def get_recent_tasks(since_hours=1):
    resp = requests.get(
        f"https://api.browse.ai/v2/robots/{ROBOT_ID}/tasks",
        headers={"Authorization": f"Bearer {BROWSE_AI_API_KEY}"},
        params={"pageSize": 100},
        timeout=30,
    )
    resp.raise_for_status()
    tasks = resp.json().get("result", {}).get("robotTasks", {}).get("items", [])
    cutoff = datetime.now(timezone.utc) - timedelta(hours=since_hours)
    # Keep only successful tasks that finished after the cutoff.
    return [t for t in tasks
            if t.get("status") == "successful" and finished_at(t) > cutoff]


def batch_insert(tasks):
    with sql.connect(server_hostname=DATABRICKS_HOST,
                     http_path=DATABRICKS_HTTP_PATH,
                     access_token=DATABRICKS_TOKEN) as conn:
        with conn.cursor() as cursor:
            for task in tasks:
                captured = task.get("capturedTexts", {})
                cursor.execute(
                    """INSERT INTO browse_ai_leads
                    (task_id, email, company_name, scraped_at)
                    VALUES (?, ?, ?, ?)""",
                    [task.get("id"), captured.get("email", ""),
                     captured.get("company_name", ""), finished_at(task)],
                )


# Run on a schedule (e.g. cron job every hour)
if __name__ == "__main__":
    batch_insert(get_recent_tasks())
For full Browse AI API details, including pagination, bulk operations, and task filtering, see the API Guide: Getting started and API Guide: Bulk operations.
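The polling script above inserts one row per statement. For larger runs, one option is to send several rows in a single multi-row INSERT. The sketch below builds such a statement with positional markers; build_batch_insert and chunked are hypothetical helpers, and the chunk size in the usage note is an arbitrary assumption, since the number of parameters allowed per statement may be capped.

```python
def build_batch_insert(rows, columns=("task_id", "email", "company_name", "scraped_at")):
    """Return (sql_text, params) for one INSERT statement covering all rows."""
    if not rows:
        raise ValueError("rows must not be empty")
    # One "(?, ?, ...)" group per row, one marker per column.
    row_marker = "(" + ", ".join("?" for _ in columns) + ")"
    sql_text = (
        f"INSERT INTO browse_ai_leads ({', '.join(columns)}) VALUES "
        + ", ".join(row_marker for _ in rows)
    )
    # Flatten the rows so the values line up with the markers in order.
    params = [value for row in rows for value in row]
    return sql_text, params


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With a cursor from the connector, this could be used as: for chunk in chunked(rows, 50): cursor.execute(*build_batch_insert(chunk)). Each row is expected to hold its values in the same order as the columns argument.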
Databricks-specific tips
Handling duplicates with MERGE
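Use the MERGE statement (available on Delta Lake tables) to insert a row only when its task_id is not already in the table, so a task that is delivered or polled twice doesn't create a duplicate: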
MERGE INTO browse_ai_leads AS target
USING (SELECT ? AS task_id, ? AS email, ? AS company_name) AS source
ON target.task_id = source.task_id
WHEN NOT MATCHED THEN
INSERT (task_id, email, company_name)
VALUES (source.task_id, source.email, source.company_name);
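To run this MERGE from the Python scripts in this guide, the statement can be passed to the same cursor used for the INSERT examples. The sketch below assumes a cursor obtained from the Databricks SQL Connector and the task shape used earlier; upsert_tasks and MERGE_SQL are hypothetical names.

```python
MERGE_SQL = """
MERGE INTO browse_ai_leads AS target
USING (SELECT ? AS task_id, ? AS email, ? AS company_name) AS source
ON target.task_id = source.task_id
WHEN NOT MATCHED THEN
  INSERT (task_id, email, company_name)
  VALUES (source.task_id, source.email, source.company_name)
"""


def upsert_tasks(cursor, tasks):
    """Run the MERGE once per task and return how many statements were executed."""
    executed = 0
    for task in tasks:
        captured = task.get("capturedTexts", {})
        cursor.execute(
            MERGE_SQL,
            [task.get("id"), captured.get("email", ""), captured.get("company_name", "")],
        )
        executed += 1
    return executed
```

Because the statement has only a WHEN NOT MATCHED clause, a task whose task_id already exists is skipped rather than updated. Adding a WHEN MATCHED THEN UPDATE clause would be the place to change that if you want later runs to overwrite earlier values.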
Delta Lake benefits
If your table is stored in Delta format (the default in Databricks), you get ACID transactions, time travel (query historical versions), and automatic schema enforcement. This makes it safe to stream Browse AI data without worrying about partial writes or data corruption.
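As an illustration of time travel, the sketch below counts the rows the table held at an earlier version, which can be handy for checking what a particular ingestion run added. It assumes a cursor from the Databricks SQL Connector; row_count_at_version is a hypothetical helper, and the version numbers themselves come from running DESCRIBE HISTORY browse_ai_leads.

```python
def row_count_at_version(cursor, version):
    """Return the number of rows in browse_ai_leads as of a given table version."""
    # int() ensures only a plain version number is placed into the SQL text.
    cursor.execute(
        f"SELECT COUNT(*) FROM browse_ai_leads VERSION AS OF {int(version)}"
    )
    return cursor.fetchone()[0]
```

Comparing the result for two consecutive versions gives a quick view of how many rows a single write added.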
Troubleshooting
Databricks returns "TEMPORARILY_UNAVAILABLE" errors
Your SQL warehouse may have auto-stopped. Warehouses can be configured to auto-start on demand, but the first query after a cold start may take 1-2 minutes. Consider keeping the warehouse running during expected data ingestion windows.
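Another option is to retry the insert while the warehouse starts. The following is a generic retry-with-exponential-backoff sketch, not part of the Databricks connector; run_with_retry is a hypothetical helper, and the attempt count and delays are assumptions to tune against your warehouse's actual start-up time.

```python
import time


def run_with_retry(operation, attempts=5, base_delay=5.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n seconds, then try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

For example, run_with_retry(lambda: batch_insert(tasks)) would wait 5, 10, 20, and 40 seconds between its five attempts. The sketch retries on any exception for brevity; in practice it is safer to catch only the error types that indicate a temporary condition.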
Connection timeout errors
Check that your server can reach your Databricks workspace. If your workspace is in a VPC with restricted access, you may need to allowlist your server's IP or use a VPN.
Webhook isn't firing
Make sure the webhook URL is publicly accessible and that your server responds with a 200 status code. See the Webhooks: Set up guide for detailed debugging steps.
