API Guide: Bulk operations and large-scale data extraction

Learn how to extract data from thousands of URLs efficiently using Browse AI's bulk run capabilities and best practices for high-volume usage.

Written by Melissa Shires
Updated today

What are bulk operations?

Bulk operations let you run your robot on thousands of URLs in a single API call, rather than creating tasks one by one. This is essential for large-scale data extraction and much more efficient than individual API calls.

Key advantages:

  • Scale: Up to 500,000 tasks per bulk operation (vs 1,000-5,000 competitor limits)

  • Performance: Much faster than individual task creation

  • Resource efficiency: Optimized processing and better success rates

  • Cost effective: More efficient use of your plan's task allowance

Browse AI API limitations and best practices

Good news: Browse AI does not impose rate limits on task volume on its side - you can run extractions around the clock if needed. (Bulk runs do have per-call size limits, covered below.)

Important considerations:

  1. Website limitations: The sites you're scraping may have rate limits or bot detection

  2. API best practices: Use bulk endpoints for high volume, not individual task calls

  3. Data retrieval: Consider table exports for large datasets vs individual task retrieval

When to use bulk operations vs individual tasks

Use bulk operations for:

  • Processing hundreds or thousands of similar URLs

  • Competitive intelligence across multiple sites

  • Lead generation from directory sites

  • Product catalog extraction

  • Large-scale monitoring setups

Use individual tasks for:

  • Testing and development

  • One-off data extractions

  • Real-time urgent requests

  • Custom parameter variations

Critical: If you need to create many tasks, always use the bulk-run endpoint. The individual task endpoint is not designed for frequent calls and may cause performance issues.

Understanding bulk run limits

  • Per API call: maximum 1,000 tasks

  • Total per bulk run: maximum 500,000 tasks

  • Strategy for larger datasets: submit multiple bulk runs sequentially
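The arithmetic behind the "multiple bulk runs" strategy is simple: the number of API calls a dataset needs is the total URL count divided by the 1,000-per-call ceiling, rounded up. A minimal illustration (the URL counts are made up):

```python
import math

def bulk_runs_needed(total_urls, per_call_limit=1000):
    """Return how many bulk-run API calls a dataset requires."""
    return math.ceil(total_urls / per_call_limit)

print(bulk_runs_needed(250_000))  # 250 calls for 250,000 URLs
print(bulk_runs_needed(1_500))    # 2 calls: one of 1,000 and one of 500
```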

Step 1: Prepare your input data

Organize your URLs and parameters into chunks of 1,000 or fewer:

# Example: Bulk extract competitor product data
urls_to_scrape = [
    "https://competitor1.com/product1",
    "https://competitor1.com/product2",
    # ... up to 1,000 URLs
]

# Convert to bulk run format
input_parameters = [{"originUrl": url} for url in urls_to_scrape]

Step 2: Create your first bulk run

curl -X POST "https://api.browse.ai/v2/robots/ROBOT_ID/bulk-runs" \
  -H "Authorization: Bearer YOUR_SECRET_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Competitor Product Analysis - Batch 1",
    "inputParameters": [
      {"originUrl": "https://competitor1.com/product1"},
      {"originUrl": "https://competitor1.com/product2"},
      {"originUrl": "https://competitor1.com/product3"}
    ]
  }'

Successful response:

{
  "statusCode": 200,
  "messageCode": "success",
  "result": {
    "bulkRun": {
      "id": "bulk-run-uuid-here",
      "title": "Competitor Product Analysis - Batch 1",
      "status": "running",
      "totalTaskCount": 3,
      "createdAt": 1678795867879
    }
  }
}

Step 3: Track bulk run progress

Monitor your bulk run status and progress:

curl -X GET "https://api.browse.ai/v2/robots/ROBOT_ID/bulk-runs/BULK_RUN_ID" \
  -H "Authorization: Bearer YOUR_SECRET_API_KEY"

Progress response:

{
  "statusCode": 200,
  "messageCode": "success",
  "result": {
    "id": "bulk-run-uuid",
    "title": "Competitor Product Analysis - Batch 1",
    "status": "completed",
    "totalTaskCount": 1000,
    "successfulTaskCount": 985,
    "failedTaskCount": 15,
    "createdAt": 1678795867879,
    "finishedAt": 1678825867879
  }
}
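For scripted workflows, the status check above can be wrapped in a simple polling loop. A minimal sketch in Python, assuming the response shape shown above (the 60-second interval is an arbitrary choice, not an API requirement):

```python
import time
import requests

def wait_for_bulk_run(robot_id, bulk_run_id, api_key, interval=60):
    """Poll a bulk run until it leaves the 'running' state, then return it."""
    url = f"https://api.browse.ai/v2/robots/{robot_id}/bulk-runs/{bulk_run_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        result = requests.get(url, headers=headers).json()["result"]
        if result["status"] != "running":
            return result
        done = result.get("successfulTaskCount", 0) + result.get("failedTaskCount", 0)
        print(f"{done}/{result['totalTaskCount']} tasks finished so far...")
        time.sleep(interval)
```

Once the returned status is no longer running, successfulTaskCount and failedTaskCount tell you whether any URLs need a retry.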

Step 4: Handle large datasets (>1,000 URLs)

For datasets larger than 1,000 URLs, submit multiple bulk runs:

import requests
import time

def submit_bulk_runs(robot_id, api_key, all_urls, chunk_size=1000):
    bulk_run_ids = []

    # Split URLs into chunks of 1,000
    for i in range(0, len(all_urls), chunk_size):
        chunk = all_urls[i:i + chunk_size]
        batch_num = (i // chunk_size) + 1

        input_params = [{"originUrl": url} for url in chunk]

        response = requests.post(
            f"https://api.browse.ai/v2/robots/{robot_id}/bulk-runs",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "title": f"Large Scale Extraction - Batch {batch_num}",
                "inputParameters": input_params
            }
        )

        if response.status_code == 200:
            bulk_run_id = response.json()["result"]["bulkRun"]["id"]
            bulk_run_ids.append(bulk_run_id)
            print(f"Submitted batch {batch_num}: {bulk_run_id}")

        # Brief pause between submissions
        time.sleep(1)

    return bulk_run_ids
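Note that the function above silently skips chunks whose submission fails. For a bit more resilience you could wrap the POST in a small retry helper; this is a sketch, and treating 429 and 5xx responses as transient is an assumption rather than documented Browse AI behavior:

```python
import time
import requests

def post_with_retry(url, headers, payload, attempts=3, backoff=2.0):
    """POST with simple exponential backoff on (assumed) transient errors."""
    for attempt in range(attempts):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code == 200:
            return resp
        if resp.status_code not in (429, 500, 502, 503):
            resp.raise_for_status()  # non-transient error: fail immediately
        time.sleep(backoff * (2 ** attempt))
    resp.raise_for_status()  # out of attempts
```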

Step 5: Efficient data retrieval for bulk operations

For large datasets, consider table exports instead of individual task retrieval:

Table exports vs individual API calls:

  • Table exports: Export all your data in bulk formats (CSV, JSON)

  • Webhooks: Get notified when table exports are ready

  • Scheduled exports: (Private beta) Automatically export data on schedule

Setting up table export webhooks:

curl -X POST "https://api.browse.ai/v2/robots/ROBOT_ID/webhooks" \
  -H "Authorization: Bearer YOUR_SECRET_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "hookUrl": "https://your-system.com/webhook/table-export",
    "eventType": "tableExportFinishedSuccessfully"
  }'

This is much more efficient than retrieving thousands of individual task results via API.
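On the receiving side, the webhook endpoint only needs to accept a JSON POST and return 200. A minimal stdlib sketch; the payload field read here (eventType) is an assumption based on the event name, so inspect a real delivery to confirm the exact shape:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ExportWebhookHandler(BaseHTTPRequestHandler):
    """Accepts the JSON POST sent when a table export finishes."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        # Field name below is an assumption; verify against a real payload.
        print("Webhook received:", event.get("eventType"))
        self.send_response(200)
        self.end_headers()

# To serve for real (blocks forever):
# HTTPServer(("0.0.0.0", 8080), ExportWebhookHandler).serve_forever()
```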

Handling website limitations and bot detection

Important: While Browse AI has no usage limits, websites you're scraping might have rate limits or bot detection.

Best practices:

  1. Start gradually: Begin with smaller batches to test website tolerance

  2. Monitor success rates: Watch for increased failure rates that might indicate detection

  3. Respect robots.txt: Follow website guidelines when possible

  4. Vary timing: Don't submit all bulk runs simultaneously

Browse AI's built-in protections:

  • Human-like behavior simulation

  • IP rotation and proxy management

  • Realistic delays and scrolling patterns

  • Cookie and session management

If you encounter issues:

  • Contact support for site-specific guidance

  • Consider managed services for enterprise scale

  • Adjust bulk run timing and size

Common bulk operation patterns

Competitive intelligence:

{
  "title": "Daily Competitor Price Check",
  "inputParameters": [
    {"originUrl": "https://competitor1.com/category/electronics"},
    {"originUrl": "https://competitor2.com/category/electronics"},
    {"originUrl": "https://competitor3.com/category/electronics"}
  ]
}

Lead generation:

{
  "title": "Business Directory Extraction",
  "inputParameters": [
    {"originUrl": "https://directory.com/page/1"},
    {"originUrl": "https://directory.com/page/2"},
    {"originUrl": "https://directory.com/page/3"}
  ]
}

Product catalog extraction:

{
  "title": "E-commerce Product Data",
  "inputParameters": [
    {"originUrl": "https://store.com/product/123", "category": "electronics"},
    {"originUrl": "https://store.com/product/124", "category": "electronics"}
  ]
}
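When a robot takes input parameters beyond originUrl, as in the catalog example above, the inputParameters list can be built programmatically. A small sketch (the category parameter is taken from the example; your robot's parameter names may differ):

```python
def build_input_parameters(products):
    """Map (url, category) pairs to the bulk-run inputParameters format."""
    return [{"originUrl": url, "category": category} for url, category in products]

params = build_input_parameters([
    ("https://store.com/product/123", "electronics"),
    ("https://store.com/product/124", "electronics"),
])
```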

Enterprise and managed services available - book a call with our sales team to learn more.


For high-scale operations:

  • Scale pricing available for enterprise volumes

  • Custom rate limiting and performance optimization

  • Dedicated support for large-scale implementations

When to consider managed services:

  • Processing 100,000+ URLs regularly

  • Mission-critical data extraction requirements

  • Complex multi-step workflows

  • Custom integration needs
