API Guide: Bulk operations and large-scale data extraction

Learn how to extract data from thousands of URLs efficiently using Browse AI's bulk run capabilities and best practices for high-volume usage.

Written by Melissa Shires
Updated today

What are bulk operations?

Bulk operations let you run your robot on thousands of URLs in a single API call, rather than creating tasks one by one. This is essential for large-scale data extraction and much more efficient than individual API calls.

Key advantages:

  • Scale: Up to 500,000 tasks per bulk operation (vs 1,000-5,000 competitor limits)

  • Performance: Much faster than individual task creation

  • Resource efficiency: Optimized processing and better success rates

  • Cost effective: More efficient use of your plan's task allowance

Browse AI API limitations and best practices

Good news: Browse AI does not impose rate limits on task volume on its side - you can run extractions around the clock if needed. (Bulk runs do have per-call size limits, covered below.)

Important considerations:

  1. Website limitations: The sites you're scraping may have rate limits or bot detection

  2. API best practices: Use bulk endpoints for high volume, not individual task calls

  3. Data retrieval: Consider table exports for large datasets vs individual task retrieval

When to use bulk operations vs individual tasks

Use bulk operations for:

  • Processing hundreds or thousands of similar URLs

  • Competitive intelligence across multiple sites

  • Lead generation from directory sites

  • Product catalog extraction

  • Large-scale monitoring setups

Use individual tasks for:

  • Testing and development

  • One-off data extractions

  • Real-time urgent requests

  • Custom parameter variations

Critical: If you need to create many tasks, always use the bulk-run endpoint. The individual task endpoint is not designed for frequent calls and may cause performance issues.

Understanding bulk run limits

  • Per API call: maximum 1,000 tasks

  • Total per bulk run: maximum 500,000 tasks

  • Strategy for larger datasets: submit multiple bulk runs sequentially
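The arithmetic behind the "multiple bulk runs" strategy is simple: the number of API calls a dataset needs is the total URL count divided by the 1,000-per-call ceiling, rounded up. A minimal illustration (the URL counts are made up):

```python
import math

def bulk_runs_needed(total_urls, per_call_limit=1000):
    """Return how many bulk-run API calls a dataset requires."""
    return math.ceil(total_urls / per_call_limit)

print(bulk_runs_needed(250_000))  # 250 calls for 250,000 URLs
print(bulk_runs_needed(1_500))    # 2 calls: one of 1,000 and one of 500
```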

Step 1: Prepare your input data

Organize your URLs and parameters into chunks of 1,000 or fewer:

# Example: Bulk extract competitor product data
urls_to_scrape = [
    "https://competitor1.com/product1",
    "https://competitor1.com/product2",
    # ... up to 1,000 URLs
]

# Convert to bulk run format
input_parameters = [{"originUrl": url} for url in urls_to_scrape]

Step 2: Create your first bulk run

curl -X POST "https://api.browse.ai/v2/robots/ROBOT_ID/bulk-runs" \
  -H "Authorization: Bearer YOUR_SECRET_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Competitor Product Analysis - Batch 1",
    "inputParameters": [
      {"originUrl": "https://competitor1.com/product1"},
      {"originUrl": "https://competitor1.com/product2"},
      {"originUrl": "https://competitor1.com/product3"}
    ]
  }'

Successful response:

{
  "statusCode": 200,
  "messageCode": "success",
  "result": {
    "bulkRun": {
      "id": "bulk-run-uuid-here",
      "title": "Competitor Product Analysis - Batch 1",
      "status": "running",
      "totalTaskCount": 3,
      "createdAt": 1678795867879
    }
  }
}

Step 3: Track bulk run progress

Monitor your bulk run status and progress:

curl -X GET "https://api.browse.ai/v2/robots/ROBOT_ID/bulk-runs/BULK_RUN_ID" \
  -H "Authorization: Bearer YOUR_SECRET_API_KEY"

Progress response:

{
  "statusCode": 200,
  "messageCode": "success",
  "result": {
    "id": "bulk-run-uuid",
    "title": "Competitor Product Analysis - Batch 1",
    "status": "completed",
    "totalTaskCount": 1000,
    "successfulTaskCount": 985,
    "failedTaskCount": 15,
    "createdAt": 1678795867879,
    "finishedAt": 1678825867879
  }
}
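For scripted workflows, the status check above can be wrapped in a simple polling loop. A minimal sketch in Python, assuming the response shape shown above (the 60-second interval is an arbitrary choice, not an API requirement):

```python
import time
import requests

def wait_for_bulk_run(robot_id, bulk_run_id, api_key, interval=60):
    """Poll a bulk run until it leaves the 'running' state, then return it."""
    url = f"https://api.browse.ai/v2/robots/{robot_id}/bulk-runs/{bulk_run_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        result = requests.get(url, headers=headers).json()["result"]
        if result["status"] != "running":
            return result
        done = result.get("successfulTaskCount", 0) + result.get("failedTaskCount", 0)
        print(f"{done}/{result['totalTaskCount']} tasks finished so far...")
        time.sleep(interval)
```

Once the returned status is no longer running, successfulTaskCount and failedTaskCount tell you whether any URLs need a retry.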

Step 4: Handle large datasets (>1,000 URLs)

For datasets larger than 1,000 URLs, submit multiple bulk runs:

import requests
import time

def submit_bulk_runs(robot_id, api_key, all_urls, chunk_size=1000):
    bulk_run_ids = []

    # Split URLs into chunks of 1,000
    for i in range(0, len(all_urls), chunk_size):
        chunk = all_urls[i:i + chunk_size]
        batch_num = (i // chunk_size) + 1

        input_params = [{"originUrl": url} for url in chunk]

        response = requests.post(
            f"https://api.browse.ai/v2/robots/{robot_id}/bulk-runs",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "title": f"Large Scale Extraction - Batch {batch_num}",
                "inputParameters": input_params
            }
        )

        if response.status_code == 200:
            bulk_run_id = response.json()["result"]["bulkRun"]["id"]
            bulk_run_ids.append(bulk_run_id)
            print(f"Submitted batch {batch_num}: {bulk_run_id}")

        # Brief pause between submissions
        time.sleep(1)

    return bulk_run_ids
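Note that the function above silently skips chunks whose submission fails. For a bit more resilience you could wrap the POST in a small retry helper; this is a sketch, and treating 429 and 5xx responses as transient is an assumption rather than documented Browse AI behavior:

```python
import time
import requests

def post_with_retry(url, headers, payload, attempts=3, backoff=2.0):
    """POST with simple exponential backoff on (assumed) transient errors."""
    for attempt in range(attempts):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code == 200:
            return resp
        if resp.status_code not in (429, 500, 502, 503):
            resp.raise_for_status()  # non-transient error: fail immediately
        time.sleep(backoff * (2 ** attempt))
    resp.raise_for_status()  # out of attempts
```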

Step 5: Efficient data retrieval for bulk operations

For large datasets, consider table exports instead of individual task retrieval:

Table exports vs individual API calls:

  • Table exports: Export all your data in bulk formats (CSV, JSON)

  • Webhooks: Get notified when table exports are ready

  • Scheduled exports: (Private beta) Automatically export data on schedule

Setting up table export webhooks:

curl -X POST "https://api.browse.ai/v2/robots/ROBOT_ID/webhooks" \
  -H "Authorization: Bearer YOUR_SECRET_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "hookUrl": "https://your-system.com/webhook/table-export",
    "eventType": "tableExportFinishedSuccessfully"
  }'

This is much more efficient than retrieving thousands of individual task results via API.
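On the receiving side, the webhook endpoint only needs to accept a JSON POST and return 200. A minimal stdlib sketch; the payload field read here (eventType) is an assumption based on the event name, so inspect a real delivery to confirm the exact shape:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ExportWebhookHandler(BaseHTTPRequestHandler):
    """Accepts the JSON POST sent when a table export finishes."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        # Field name below is an assumption; verify against a real payload.
        print("Webhook received:", event.get("eventType"))
        self.send_response(200)
        self.end_headers()

# To serve for real (blocks forever):
# HTTPServer(("0.0.0.0", 8080), ExportWebhookHandler).serve_forever()
```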

Handling website limitations and bot detection

Important: While Browse AI has no usage limits, websites you're scraping might have rate limits or bot detection.

Best practices:

  1. Start gradually: Begin with smaller batches to test website tolerance

  2. Monitor success rates: Watch for increased failure rates that might indicate detection

  3. Respect robots.txt: Follow website guidelines when possible

  4. Vary timing: Don't submit all bulk runs simultaneously

Browse AI's built-in protections:

  • Human-like behavior simulation

  • IP rotation and proxy management

  • Realistic delays and scrolling patterns

  • Cookie and session management

If you encounter issues:

  • Contact support for site-specific guidance

  • Consider managed services for enterprise scale

  • Adjust bulk run timing and size

Common bulk operation patterns

Competitive intelligence:

{
  "title": "Daily Competitor Price Check",
  "inputParameters": [
    {"originUrl": "https://competitor1.com/category/electronics"},
    {"originUrl": "https://competitor2.com/category/electronics"},
    {"originUrl": "https://competitor3.com/category/electronics"}
  ]
}

Lead generation:

{
  "title": "Business Directory Extraction",
  "inputParameters": [
    {"originUrl": "https://directory.com/page/1"},
    {"originUrl": "https://directory.com/page/2"},
    {"originUrl": "https://directory.com/page/3"}
  ]
}

Product catalog extraction:

{
  "title": "E-commerce Product Data",
  "inputParameters": [
    {"originUrl": "https://store.com/product/123", "category": "electronics"},
    {"originUrl": "https://store.com/product/124", "category": "electronics"}
  ]
}
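When a robot takes input parameters beyond originUrl, as in the catalog example above, the inputParameters list can be built programmatically. A small sketch (the category parameter is taken from the example; your robot's parameter names may differ):

```python
def build_input_parameters(products):
    """Map (url, category) pairs to the bulk-run inputParameters format."""
    return [{"originUrl": url, "category": category} for url, category in products]

params = build_input_parameters([
    ("https://store.com/product/123", "electronics"),
    ("https://store.com/product/124", "electronics"),
])
```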

Enterprise and managed services available - book a call with our sales team to learn more.


For high-scale operations:

  • Scale pricing available for enterprise volumes

  • Custom rate limiting and performance optimization

  • Dedicated support for large-scale implementations

When to consider managed services:

  • Processing 100,000+ URLs regularly

  • Mission-critical data extraction requirements

  • Complex multi-step workflows

  • Custom integration needs
