
Best practices and tips for web scraping, data extraction, and monitoring websites

Best practices for training reliable web scraping robots in Browse AI. This guide covers everything from initial setup through testing and optimization.

Written by Melissa Shires
Updated this week

Planning your data extraction strategy

Know what success looks like before you start training. Ask yourself:

  1. Is the data I need on one page, or across multiple pages?

  2. What fields do I want to extract from each page, and how do I want the data structured?

  3. How many pages and/or rows of data am I extracting?

  4. Do I need this data on an ongoing basis?

  5. How am I using this data once it's extracted?

Site structure: Single page vs. multi-level extraction

Which extraction type you need depends on how the site is structured:

  • All data is on one page type → Single-page extraction

  • Data is spread across multiple pages → Multi-level extraction

Single-page extraction

If all your data is on one page type (search results, directory listing), you'll want to:

  1. Train a robot to extract data from one of those pages.

  2. Use that robot to extract from multiple URLs that share the same page structure, either by adding new rows or changing the input parameters in the robot dashboard, or by uploading a list of URLs as a bulk run (a scripted sketch using the API follows below).
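If you'd rather script this step than add URLs in the dashboard one at a time, a minimal sketch looks like the following. It assumes you have a Browse AI API key and an approved robot whose input parameter is the Origin URL; the endpoint path and request body shown here are assumptions about the v2 REST API, so confirm them against the API reference before relying on them.

```python
import requests

API_KEY = "YOUR_BROWSE_AI_API_KEY"   # assumption: an API key from your account settings
ROBOT_ID = "YOUR_ROBOT_ID"           # assumption: the ID of your approved single-page robot

# URLs that share the page structure the robot was trained on
urls = [
    "https://example.com/directory?page=1",
    "https://example.com/directory?page=2",
    "https://example.com/directory?page=3",
]

for url in urls:
    # Assumed endpoint and body shape; check the Browse AI API reference for the exact format.
    response = requests.post(
        f"https://api.browse.ai/v2/robots/{ROBOT_ID}/tasks",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"inputParameters": {"originUrl": url}},
        timeout=30,
    )
    response.raise_for_status()
    print(f"Queued task for {url}")
```

Each call queues one task, so the result is the same as adding each URL by hand in the dashboard.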

Multi-level extraction (deep scraping)

If the data you need to extract is across multiple page types, this is called deep scraping. For example:

  • E-commerce (product list β†’ product details)

  • Real estate (listing β†’ property details)

  • Job boards (search results β†’ full descriptions)

To handle this, you'll need to:

  1. Train a robot to extract data from each page type.

  2. Connect these robots together using workflows.

Example: You want to extract e-commerce product data based on a search term, and the data you need is on both the search results page and the individual product pages.

  1. Train two robots:

    • Robot A: extracts a product list based on a search term.

    • Robot B: extracts product details from a specific product page.

  2. Connect Robot A to Robot B using a workflow so that every product detail page Robot A finds is scraped automatically.

πŸ’‘ You can use workflows to chain an unlimited number of robots together, e.g. Robot A > Robot B > Robot C > Robot D.
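Workflows handle this chaining for you inside Browse AI, so no code is required. If you ever want to reproduce the same Robot A → Robot B logic in your own script over the REST API, the flow is roughly the sketch below; the robot IDs, the search_term and originUrl input parameters, the products list name, and the response fields are illustrative assumptions rather than guaranteed API shapes.

```python
import time
import requests

API = "https://api.browse.ai/v2"   # assumed base URL; confirm in the API reference
HEADERS = {"Authorization": "Bearer YOUR_BROWSE_AI_API_KEY"}

ROBOT_A = "robot-a-id"   # extracts a product list from a search term
ROBOT_B = "robot-b-id"   # extracts product details from a single page

def run_task(robot_id: str, input_parameters: dict) -> str:
    """Queue a task on a robot and return its task ID (assumed response shape)."""
    resp = requests.post(f"{API}/robots/{robot_id}/tasks",
                         headers=HEADERS,
                         json={"inputParameters": input_parameters},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]["id"]

def wait_for_task(robot_id: str, task_id: str) -> dict:
    """Poll a task until it finishes and return its result (assumed response shape)."""
    while True:
        resp = requests.get(f"{API}/robots/{robot_id}/tasks/{task_id}",
                            headers=HEADERS, timeout=30)
        resp.raise_for_status()
        result = resp.json()["result"]
        if result["status"] in ("successful", "failed"):
            return result
        time.sleep(10)

# 1. Robot A: extract the product list for a search term.
task_a = wait_for_task(ROBOT_A, run_task(ROBOT_A, {"search_term": "standing desk"}))

# 2. Robot B: scrape every product detail page Robot A found.
#    "products" and "product_url" are hypothetical names from Robot A's captured list.
for row in task_a.get("capturedLists", {}).get("products", []):
    run_task(ROBOT_B, {"originUrl": row["product_url"]})
```

A workflow performs this same orchestration (run the first robot, then fan out to the next) on Browse AI's side, which is the simpler, supported path.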

Data structure: Extracting different data types

The kind of data you want to extract determines how you should train your robot.

How the data on the page is structured

The way data appears on the page determines which extraction method to use.

Capture List

  • Data displayed as: repeating cards or rows, search results grids, product galleries, directory listings, comment threads, table rows, or any pattern that repeats.

  • Best to use when: you see the same type of information repeated multiple times in a consistent pattern (like multiple products, each with a name, price, and image).

  • Creates: a structured table with multiple rows, one row per item in the pattern.

  • Example: a search page with 20 products → creates 20 rows of data.

  • Key advantage: automatically scales across all similar items and handles pagination.

Capture Text

  • Data displayed as: individual values, standalone elements, summary information, page headers/totals, single data points, or unique page sections.

  • Best to use when: you need specific individual elements that appear once on the page, or when building a custom data structure from scattered elements.

  • Creates: a structured table where each selection becomes a column.

  • Example: a product detail page → creates 1 row with your selected fields as columns.

  • Key advantage: complete control over what to extract and how to structure it.

πŸ“– Learn more about when to extract data as a list vs. just text.
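To make the difference concrete, here is roughly what the two output shapes look like once extracted. The field names are made up for illustration: Capture List produces one row per repeated item, while Capture Text produces a single row whose columns are the individual elements you selected.

```python
# Capture List: one row per repeated item (a search page with 20 products -> 20 rows).
capture_list_result = [
    {"name": "Standing Desk A", "price": "$299", "image": "https://example.com/a.jpg"},
    {"name": "Standing Desk B", "price": "$349", "image": "https://example.com/b.jpg"},
    # ... one entry per product card on the page
]

# Capture Text: one row, where each selected element becomes a column.
capture_text_result = {
    "product_name": "Standing Desk A",
    "price": "$299",
    "availability": "In stock",
    "rating": "4.6 out of 5",
}
```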

Form fields and search parameters

If the data you're extracting is triggered by form fields or search parameters, you'll need to configure these as input parameters:

  1. During training, naturally fill out the form or search box.

  2. After the robot is approved, these will be converted into input parameters.

  3. Use that robot to extract with different input parameters, either by changing them in the robot dashboard or by uploading a list of values as a bulk run.

πŸ’‘ Any text that you train a robot to enter as part of the training process is captured as an input parameter.

Dynamic content

Anything a human can see on a web page, you can train a robot to capture, including content that loads dynamically.

Volume of data: simple extractions, bulk runs, and service options

If you're only extracting from a few pages, you can point your robot at other pages by adding a new row in its tables or by changing the input parameters in the robot dashboard.

Bulk extraction and web scraping

You can run a robot on thousands of pages simultaneously with Bulk Runs.

  1. Go to the approved robot and select Bulk run tasks.

  2. Upload a list of input parameters.

πŸ’‘ Input parameters depend on how you trained your robot. They always include the Origin URL and, depending on the data you need, can also include search query terms or other text inputs.
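If you're assembling that list of input parameters programmatically, one simple approach is to write a CSV with one column per input parameter and one row per task, then upload it on the Bulk run screen. The column names below (originUrl and search_query) are hypothetical; use the names your own robot actually exposes, and check the bulk run screen for the expected upload format.

```python
import csv

# One row per task; column names must match your robot's input parameters.
# "originUrl" and "search_query" are hypothetical examples.
tasks = [
    {"originUrl": "https://example.com/search", "search_query": "standing desk"},
    {"originUrl": "https://example.com/search", "search_query": "ergonomic chair"},
    {"originUrl": "https://example.com/search", "search_query": "monitor arm"},
]

with open("bulk_run_inputs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["originUrl", "search_query"])
    writer.writeheader()
    writer.writerows(tasks)

print(f"Wrote {len(tasks)} rows to bulk_run_inputs.csv")
```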

Services and solutions

We also offer dedicated services and solutions for large-scale data extraction. Talk to our sales team to learn more.

Data frequency: Single use or Ongoing

If you only have a single use for the data, it's best to export or integrate the data as needed once you create your robot(s).

If, however, any (or all) of these apply to you, you'll want to set up a monitor:

  • Have an ongoing use for this data.

  • Want to keep this data up to date.

  • Want to maintain a historical database of this data.

  • Want to be alerted when this data changes.

Setting up monitoring

You can set up monitors to check for changes to website content on a schedule.

  1. Go to your robot dashboard and click Monitor.

  2. Create a new monitor and configure your frequency and alert settings.

πŸ’‘ Once created, your robot will check for content changes on the schedule you selected. Based on your settings, it will also notify you when things change and what has changed.

Using the data you've extracted

You can export, connect, or integrate data from a robot depending on your internal tools, apps, and workflows.

Exporting the data
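For example, if you export a robot's results as a CSV file, a quick way to load them for further analysis might look like the sketch below; the file name is a placeholder for whatever you actually export.

```python
import csv

# "robot_export.csv" stands in for whichever file you exported from your robot.
with open("robot_export.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(f"Loaded {len(rows)} extracted rows")
if rows:
    print("Columns:", ", ".join(rows[0].keys()))
```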

Integrating the data
