Planning your data extraction strategy
Before you start training, decide what success looks like. Ask yourself:
Is the data I need on one page, or across multiple pages?
What fields do I want to extract from each page, and how do I want them structured?
How many pages and/or rows of data am I extracting?
Do I need this data on an ongoing basis?
How am I using this data once it's extracted?
Site structure: Single page vs. multi-level extraction
| Site structure | Extraction type |
| --- | --- |
| All data is on one page type | Single-page extraction |
| Data is spread across multiple pages | Multi-level extraction |
Single-page extraction
If all your data is on one page type (e.g., search results or a directory listing), you'll want to:
Train a robot to extract data from one of those pages.
Use that robot on other URLs with the same page layout by changing the Origin URL.
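The single-page pattern above can be sketched as a simple loop: one trained robot, reused across several origin URLs. This is a conceptual sketch only; `extract_page` is a hypothetical stand-in for a robot run, not a real API call.

```python
# Conceptual sketch of single-page extraction across multiple Origin URLs.
# extract_page is a hypothetical stand-in for a trained robot run.

def extract_page(origin_url):
    """Stand-in: run the trained robot against one page and return its output."""
    return {"origin": origin_url, "rows_extracted": 20}

# Pages that share the same layout the robot was trained on.
origin_urls = [
    "https://example.com/directory?page=1",
    "https://example.com/directory?page=2",
]

# Reuse the same robot, swapping only the Origin URL.
results = [extract_page(url) for url in origin_urls]
print(len(results))  # one result set per origin URL
```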
Multi-level extraction (deep scraping)
If the data you need is spread across multiple page types, this is called deep scraping. For example:
E-commerce (product list → product details)
Real estate (listing → property details)
Job boards (search results → full descriptions)
You'll need to train a robot for each page type and connect them together using workflows:
Train a robot to extract data from each page type.
Connect these robots together using workflows.
Example: You want to extract e-commerce product data based on a search term, and the data you need lives on both the search results page and the individual product pages.
Train two robots:
Robot A: extracts a product list based on a search term.
Robot B: extracts product details from a specific product page.
Connect Robot A to Robot B using workflows to automatically scrape the product detail pages found by Robot A.
📖 Read our guide on how to extract different data structures and types across multiple web pages.
💡 You can use workflows to chain an unlimited number of robots together, e.g., Robot A > Robot B > Robot C > Robot D.
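The Robot A → Robot B example above can be sketched as two functions chained by a workflow. This is an illustrative model of the concept, not Browse AI's actual API; `robot_a`, `robot_b`, and the sample data are hypothetical.

```python
# Conceptual sketch of a two-level (deep scraping) workflow.
# robot_a and robot_b are hypothetical stand-ins for trained robots.

def robot_a(search_term):
    """Robot A: return the product list found on the search results page."""
    # In practice this would extract name + detail-page URL per result.
    return [
        {"name": "Widget", "url": "https://example.com/p/1"},
        {"name": "Gadget", "url": "https://example.com/p/2"},
    ]

def robot_b(product_url):
    """Robot B: return product details from one specific product page."""
    return {"url": product_url, "price": 9.99, "in_stock": True}

def workflow(search_term):
    """Chain Robot A into Robot B: one detail extraction per listed product."""
    return [robot_b(item["url"]) for item in robot_a(search_term)]

rows = workflow("widgets")
print(len(rows))  # one detail row per product Robot A found
```

The key idea is that Robot A's output (a list of URLs) becomes Robot B's input, which is exactly what connecting robots in a workflow automates.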
Data structure: Extracting different data types
The kind of data you want to extract shapes how you should train your robot.
How the data on the page is structured
The way data appears on the page determines which extraction method to use.
| | Capture List | Capture Text |
| --- | --- | --- |
| Data displayed as | Repeating cards or rows | Individual values |
| Best to use when | You see the same type of information repeated multiple times in a consistent pattern (like multiple products, each with name, price, image) | You need specific individual elements that appear once on the page, or when building a custom data structure from scattered elements |
| Creates | A structured table with multiple rows (one row per item in the pattern) | A structured table where each selection becomes a column |
| Example | A search page with 20 products → creates 20 rows of data | A product detail page → creates 1 row with your selected fields as columns |
| Key advantage | Automatically scales across all similar items and handles pagination | Complete control over what to extract and how to structure it |
📖 Learn more about when to extract data as a list vs. just text.
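The difference between the two output shapes is easy to see in data form. The sample values below are hypothetical; what matters is the shape: Capture List yields one row per repeated item, while Capture Text yields a single row with one column per selection.

```python
# Illustrative output shapes (hypothetical sample data, not a real robot's output).

# Capture List: a repeating pattern yields one row per item found.
capture_list_result = [
    {"name": "Widget A", "price": "$10"},
    {"name": "Widget B", "price": "$12"},
    {"name": "Widget C", "price": "$15"},
]

# Capture Text: individually selected elements yield a single row,
# with one column per selected element.
capture_text_result = [
    {"title": "Widget A Deluxe", "sku": "WA-100", "description": "A deluxe widget."},
]

print(len(capture_list_result), len(capture_text_result))  # 3 rows vs. 1 row
```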
Form fields and search parameters
If the data you're extracting is triggered by form fields or search parameters, you'll need to configure these as input parameters:
During training, fill out the form or search box naturally.
After the robot is approved, these entries are converted into input parameters.
Run the robot with different values by changing the input parameters in the robot dashboard.
💡 Any text you enter while training a robot is captured as an input parameter.
Dynamic content
Anything a human can see on a web page, including dynamically loaded content, can be captured by a trained robot.
Volume of data: simple extractions, bulk runs, and service options
If you're extracting from just a few pages, you can point your robot at additional pages by adding a new row in tables or by changing the input parameter in the robot dashboard.
Bulk extraction and web scraping
You can run a robot on thousands of pages simultaneously with Bulk Runs.
Go to the approved robot and select Bulk run tasks.
Upload a list of input parameters.
💡 Input parameters depend on how you trained your robot. They always include the Origin URL and, depending on the data, can include search query terms or other text inputs.
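The list you upload for a bulk run is typically a table of input parameters, one row per task. A minimal sketch of building such a file is below; the column names (`originUrl`, `search_query`) are illustrative assumptions, since your robot's actual parameters depend on how it was trained.

```python
# Sketch: build a CSV of input parameters for a bulk run.
# Column names are illustrative; match them to your robot's actual parameters.
import csv
import io

# One row per bulk-run task.
params = [
    {"originUrl": "https://example.com/search?q=widgets", "search_query": "widgets"},
    {"originUrl": "https://example.com/search?q=gadgets", "search_query": "gadgets"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["originUrl", "search_query"])
writer.writeheader()
writer.writerows(params)

print(buf.getvalue())  # header row plus one line per task
```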
Services and solutions
We also offer dedicated services and solutions for large-scale data extraction. Talk to our sales team to learn more.
Data frequency: Single use or ongoing
If you only have a single use for the data, it's best to export or integrate the data as needed once you create your robot(s).
If, however, any (or all) of the following apply, you'll want to set up a monitor. You:
Have an ongoing use for this data.
Want to keep this data up to date.
Want to maintain a historical database of this data.
Want to be alerted when this data changes.
Setting up monitoring
You can set up monitors to check for changes to website content on a schedule.
Go to your robot dashboard and click Monitor.
Create a new monitor and configure your frequency and alert settings.
💡 Once created, your robot will check for content changes on the schedule you selected and, based on your settings, notify you when something changes and what has changed.
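Conceptually, each scheduled check compares the latest extraction against the previous one and reports any differences. The sketch below models that idea with a content fingerprint; it is an assumption-laden illustration of change detection in general, not how the monitoring feature is implemented internally.

```python
# Conceptual sketch of what a monitor does on each scheduled run:
# fingerprint the latest extraction and compare it to the previous one.
import hashlib
import json

def fingerprint(rows):
    """Stable hash of extracted rows, used to detect content changes."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

previous = [{"name": "Widget", "price": "$10"}]  # last run's extraction
latest = [{"name": "Widget", "price": "$12"}]    # this run's extraction

changed = []
if fingerprint(latest) != fingerprint(previous):
    # Report which rows are new or modified since the last run.
    changed = [row for row in latest if row not in previous]
    print("changed rows:", changed)
```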
Using the data you've extracted
You can export, connect, or integrate data from a robot depending on your internal tools, apps, and workflows.
