This guide covers three ways to send your Browse AI scraped data into Amazon Redshift, from no-code automation to custom API integrations. Choose the method that best fits your team's technical comfort and use case.
Prerequisites: You'll need an approved Browse AI robot with scraped data, an Amazon Redshift cluster or Redshift Serverless endpoint, and an AWS account. For API-based methods, you'll also need a Browse AI API key and AWS credentials with Redshift Data API access.
Which method should I use?
Method | Best for | Technical level | Speed
Zapier / Make | Teams without developers | No code | Near real-time |
Webhooks + Data API | Real-time pipelines with custom logic | Intermediate | Real-time |
API polling + S3 COPY | Large-scale batch loading | Advanced | On schedule |
Method 1: Zapier or Make (no code)
The fastest way to connect Browse AI to Redshift. Zapier and Make both support Redshift as a destination, though setup requires your cluster connection details.
Setting up with Zapier
Go to zapier.com and create a new Zap.
Trigger: Choose Browse AI as the trigger app, then select New Successful Task Finished as the event.
Connect your Browse AI account and select the robot you want to sync data from.
Action: Choose Amazon Redshift as the action app, then select Create Row.
Connect your Redshift cluster (you'll need the host, port, database, username, and password).
Select your table and map each Browse AI captured field to the corresponding Redshift column.
Test the Zap, then turn it on.
Setting up with Make (formerly Integromat)
Create a new scenario in Make.
Add a Webhooks → Custom webhook module as the trigger. Copy the webhook URL.
In Browse AI, go to your robot's Integrate tab and add the Make webhook URL under Webhooks. Select the taskFinishedSuccessfully event.
Add an Amazon Redshift → Execute a Query module with an INSERT statement, and map your fields.
Activate the scenario.
Method 2: Webhooks + Redshift Data API (real-time)
Use Browse AI webhooks to push data into Redshift as soon as a task finishes. The Redshift Data API lets you run SQL statements without managing database connections or drivers.
Step 1: Set up AWS access
You'll need AWS credentials with permission to use the Redshift Data API and a Redshift cluster or Serverless endpoint.
Create an IAM user or role with the AmazonRedshiftDataFullAccess policy (or a custom policy with redshift-data:ExecuteStatement permission).
Store a database password in AWS Secrets Manager (recommended) or use temporary credentials.
Note your cluster identifier (or Serverless workgroup name), database name, and the secret ARN.
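If you go the custom-policy route, the sketch below shows roughly what such a policy document could look like when built in Python. Treat it as a starting point under stated assumptions: the function name build_data_api_policy is hypothetical, the "Resource": "*" scope is a placeholder you would normally tighten, and the DescribeStatement and GetSecretValue actions are additions beyond the single permission named above (they are typically needed to check statement status and to read the stored password).

```python
import json


def build_data_api_policy(secret_arn: str) -> dict:
    """Return an IAM policy document (as a dict) for calling the Redshift Data API.

    Assumption: statements authenticate with a Secrets Manager secret,
    so the caller also needs permission to read that one secret.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RunStatements",
                "Effect": "Allow",
                "Action": [
                    "redshift-data:ExecuteStatement",
                    "redshift-data:DescribeStatement",
                ],
                # Placeholder scope; narrow this for production use
                "Resource": "*",
            },
            {
                "Sid": "ReadDatabaseSecret",
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": secret_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Print JSON you can paste into the IAM console's policy editor
    print(json.dumps(build_data_api_policy("your-secret-arn"), indent=2))
```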
Step 2: Create your Redshift table
CREATE TABLE browse_ai_leads (
task_id VARCHAR(255),
robot_id VARCHAR(255),
first_name VARCHAR(255),
last_name VARCHAR(255),
email VARCHAR(255),
company_name VARCHAR(255),
phone VARCHAR(100),
website VARCHAR(500),
job_title VARCHAR(255),
origin_url VARCHAR(1000),
scraped_at TIMESTAMP DEFAULT GETDATE()
);
Step 3: Build your webhook endpoint
import boto3
from flask import Flask, request, jsonify

app = Flask(__name__)
redshift_client = boto3.client("redshift-data", region_name="us-east-1")

CLUSTER_ID = "your-cluster-identifier"
DATABASE = "your_database"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789:secret:your-secret"


@app.route("/browse-ai-webhook", methods=["POST"])
def handle_webhook():
    payload = request.get_json()

    # Ignore every event except a successfully finished task
    if payload.get("event") != "taskFinishedSuccessfully":
        return jsonify({"status": "ignored"}), 200

    task = payload["task"]
    captured = task.get("capturedTexts", {})

    # Named parameters keep scraped values out of the SQL string
    sql = """
        INSERT INTO browse_ai_leads (task_id, robot_id, first_name, last_name,
            email, company_name, phone, website, job_title, origin_url)
        VALUES (:task_id, :robot_id, :first_name, :last_name,
            :email, :company_name, :phone, :website, :job_title, :origin_url)
    """

    parameters = [
        {"name": "task_id", "value": task.get("id", "")},
        {"name": "robot_id", "value": task.get("robotId", "")},
        {"name": "first_name", "value": captured.get("first_name", "")},
        {"name": "last_name", "value": captured.get("last_name", "")},
        {"name": "email", "value": captured.get("email", "")},
        {"name": "company_name", "value": captured.get("company_name", "")},
        {"name": "phone", "value": captured.get("phone", "")},
        {"name": "website", "value": captured.get("website", "")},
        {"name": "job_title", "value": captured.get("job_title", "")},
        {"name": "origin_url", "value": task.get("inputParameters", {}).get("originUrl", "")}
    ]

    # execute_statement is asynchronous: it returns a statement ID
    # as soon as the statement has been submitted
    response = redshift_client.execute_statement(
        ClusterIdentifier=CLUSTER_ID,
        Database=DATABASE,
        SecretArn=SECRET_ARN,
        Sql=sql,
        Parameters=parameters
    )

    return jsonify({"status": "inserted", "statement_id": response["Id"]}), 200


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Install dependencies: Run pip install boto3 flask and configure your AWS credentials via environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or an IAM role.
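Because the Data API's execute_statement call only submits the SQL, a returned statement ID does not by itself mean the row was written. One way to confirm the outcome is to poll describe_statement until the statement reaches a terminal status. The helper below is a minimal sketch of that idea: the name wait_for_statement and the timing defaults are hypothetical, and in a real webhook handler you may prefer to run this check in a background job so the HTTP response stays fast.

```python
import time


def wait_for_statement(client, statement_id, poll_seconds=0.5, timeout_seconds=30.0):
    """Poll the Redshift Data API until a statement finishes or fails.

    `client` is a boto3 "redshift-data" client (or any object exposing
    describe_statement(Id=...)). Returns the final description dict.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            return desc
        if status in ("FAILED", "ABORTED"):
            detail = desc.get("Error", "no error detail")
            raise RuntimeError(f"Statement {statement_id} {status}: {detail}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Statement {statement_id} still {status} after timeout")
        time.sleep(poll_seconds)
```

For example, after the insert in the webhook handler you could call wait_for_statement(redshift_client, response["Id"]) and log any RuntimeError it raises.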
Step 4: Register the webhook in Browse AI
Via the dashboard:
Open your robot and go to the Integrate tab.
Under Webhooks, click Add webhook.
Paste your endpoint URL and select the taskFinishedSuccessfully event.
Via the API:
curl -X POST "https://api.browse.ai/v2/robots/YOUR_ROBOT_ID/webhooks" \
-H "Authorization: Bearer YOUR_BROWSE_AI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://yourdomain.com/browse-ai-webhook",
"events": ["taskFinishedSuccessfully"]
}'
IP allowlisting: Browse AI sends webhooks from IP address 3.228.254.190. If your server has a firewall, add this to your allowlist. See Webhooks: IP address for allowlisting.
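If you would rather enforce the allowlist in application code than at the firewall, a check along these lines could work. This is a sketch, not an official Browse AI snippet: the function name is_allowed_sender is hypothetical, and it assumes your server sees the sender's real address. Behind a load balancer or reverse proxy, the address your framework reports is usually the proxy's, so you would need to read the forwarded address instead.

```python
import ipaddress

# The sending address documented above for Browse AI webhooks
BROWSE_AI_WEBHOOK_IPS = {ipaddress.ip_address("3.228.254.190")}


def is_allowed_sender(remote_addr):
    """Return True only if remote_addr parses to an allowlisted IP address."""
    try:
        return ipaddress.ip_address((remote_addr or "").strip()) in BROWSE_AI_WEBHOOK_IPS
    except ValueError:
        # Not a valid IP address at all (empty, hostname, garbage)
        return False
```

In a Flask handler, you could call is_allowed_sender(request.remote_addr) at the top of the function and return a 403 response when it is False.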
Method 3: API polling + S3 COPY (batch loading)
For the highest throughput, stage your data in S3 and use Redshift's COPY command. This is the most efficient way to load large volumes of scraped data.
import boto3
import requests
import json
from datetime import datetime, timedelta

BROWSE_AI_API_KEY = "your_browse_ai_api_key"
ROBOT_ID = "your_robot_id"
S3_BUCKET = "your-bucket"
S3_KEY = "browse-ai/leads.json"

s3 = boto3.client("s3")
redshift = boto3.client("redshift-data", region_name="us-east-1")


def get_recent_tasks(since_hours=1):
    # Fetch one page of tasks and keep the successful ones finished after the cutoff
    resp = requests.get(
        f"https://api.browse.ai/v2/robots/{ROBOT_ID}/tasks",
        headers={"Authorization": f"Bearer {BROWSE_AI_API_KEY}"},
        params={"pageSize": 100}
    )
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    tasks = resp.json().get("result", {}).get("robotTasks", {}).get("items", [])
    cutoff = datetime.utcnow() - timedelta(hours=since_hours)
    return [t for t in tasks if t.get("status") == "successful"
            and datetime.fromisoformat(t["finishedAt"].replace("Z", "")) > cutoff]


def stage_and_copy(tasks):
    # Write JSONL to S3
    lines = []
    for task in tasks:
        captured = task.get("capturedTexts", {})
        lines.append(json.dumps({
            "task_id": task.get("id"),
            "email": captured.get("email", ""),
            "company_name": captured.get("company_name", ""),
            "scraped_at": task.get("finishedAt", "")
        }))
    s3.put_object(Bucket=S3_BUCKET, Key=S3_KEY, Body="\n".join(lines))

    # COPY from S3 into Redshift
    copy_sql = f"""
        COPY browse_ai_leads (task_id, email, company_name, scraped_at)
        FROM 's3://{S3_BUCKET}/{S3_KEY}'
        IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftS3ReadRole'
        JSON 'auto'
        TIMEFORMAT 'auto';
    """
    redshift.execute_statement(
        ClusterIdentifier="your-cluster",
        Database="your_database",
        SecretArn="your-secret-arn",
        Sql=copy_sql
    )


# Run on a schedule (e.g. cron job every hour)
if __name__ == "__main__":
    recent_tasks = get_recent_tasks()
    if recent_tasks:  # skip the COPY when there is nothing new to load
        stage_and_copy(recent_tasks)
For full Browse AI API details, including pagination, bulk operations, and task filtering, see the API Guide: Getting started and API Guide: Bulk operations.
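The staging step in Method 3 writes every batch to the same S3 key, so each run overwrites the previous file. If you want to keep one file per run (for auditing or replays), you could separate the JSONL serialization from the key naming, as in the sketch below. The helper names tasks_to_jsonl and batch_key and the date-based key layout are illustrative assumptions, not part of the Browse AI or AWS APIs.

```python
import json
from datetime import datetime, timezone


def tasks_to_jsonl(tasks):
    """Serialize Browse AI task dicts to newline-delimited JSON for COPY."""
    lines = []
    for task in tasks:
        captured = task.get("capturedTexts") or {}
        lines.append(json.dumps({
            "task_id": task.get("id"),
            "email": captured.get("email", ""),
            "company_name": captured.get("company_name", ""),
            "scraped_at": task.get("finishedAt", ""),
        }))
    return "\n".join(lines)


def batch_key(prefix, now=None):
    """Build a per-run S3 key such as browse-ai/2024/01/02/leads-030405.json."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}/{now:%Y/%m/%d}/leads-{now:%H%M%S}.json"
```

You would then pass batch_key("browse-ai") as the Key to put_object and reference the same key in the COPY statement's FROM clause.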
Redshift-specific tips
Handling duplicates
Redshift doesn't enforce unique constraints. To prevent duplicates, use a staging table pattern: load new data into a temp table, then merge into your main table:
-- Load into staging, then merge
CREATE TEMP TABLE staging_leads (LIKE browse_ai_leads);

-- COPY into staging_leads...

INSERT INTO browse_ai_leads
SELECT s.*
FROM staging_leads s
LEFT JOIN browse_ai_leads m ON s.task_id = m.task_id
WHERE m.task_id IS NULL;
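If you run this pattern through the Redshift Data API, keep in mind that separate execute_statement calls do not, by default, share a session, so a temp table created in one call would not be visible in the next. One possible approach is to send all three statements together with batch_execute_statement, which runs them in order as a single transaction. The sketch below assumes that behavior; the function names and the column list are illustrative, and it does not remove duplicates that occur within a single staging file.

```python
def build_merge_statements(s3_uri, iam_role_arn):
    """Return the staging-table merge as an ordered list of SQL statements."""
    return [
        "CREATE TEMP TABLE staging_leads (LIKE browse_ai_leads);",
        (
            "COPY staging_leads (task_id, email, company_name, scraped_at) "
            f"FROM '{s3_uri}' IAM_ROLE '{iam_role_arn}' "
            "JSON 'auto' TIMEFORMAT 'auto';"
        ),
        (
            "INSERT INTO browse_ai_leads "
            "SELECT s.* FROM staging_leads s "
            "LEFT JOIN browse_ai_leads m ON s.task_id = m.task_id "
            "WHERE m.task_id IS NULL;"
        ),
    ]


def run_merge(client, cluster_id, database, secret_arn, s3_uri, iam_role_arn):
    """Submit the merge as one batch; returns the Data API statement ID.

    `client` is a boto3 "redshift-data" client.
    """
    response = client.batch_execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,
        Sqls=build_merge_statements(s3_uri, iam_role_arn),
    )
    return response["Id"]
```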
Sort and distribution keys
For best query performance, set a sort key on scraped_at (for time-based queries) and a distribution key on your most-joined column:
CREATE TABLE browse_ai_leads (
...
)
DISTSTYLE KEY
DISTKEY (email)
SORTKEY (scraped_at);
Troubleshooting
Redshift Data API returns "Access Denied"
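Verify that your IAM policy includes redshift-data:ExecuteStatement, that the calling user or role is allowed to read the secret (secretsmanager:GetSecretValue), and that the Secrets Manager secret holds the correct database credentials.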
COPY command fails with "S3ServiceException"
Check that the IAM role attached to your Redshift cluster has s3:GetObject permission on the S3 bucket and that the file path is correct.
Webhook isn't firing
Make sure the webhook URL is publicly accessible and that your server responds with a 200 status code. See the Webhooks: Set up guide for detailed debugging steps.
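To check the endpoint yourself before involving Browse AI, you could post a sample event to it and confirm that it answers with 200. The sketch below uses only the Python standard library. The payload is an assumption modeled on the fields the handler in this guide reads (event, task.id, task.robotId, task.capturedTexts, task.inputParameters); real Browse AI payloads may contain more fields. Note that pointing it at a live handler will insert a test row into your table.

```python
import json
import urllib.request


def build_sample_request(url):
    """Build a POST request carrying a minimal, made-up task event."""
    payload = {
        "event": "taskFinishedSuccessfully",
        "task": {
            "id": "test-task",
            "robotId": "test-robot",
            "capturedTexts": {"email": "test@example.com"},
            "inputParameters": {"originUrl": "https://example.com"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def endpoint_returns_200(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the endpoint answers the sample event with HTTP 200."""
    try:
        with opener(build_sample_request(url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection failures and non-2xx responses raised by urllib
        return False
```

For example, endpoint_returns_200("https://yourdomain.com/browse-ai-webhook") should return True from a machine outside your network if the URL is publicly reachable.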
