
How to train a robot to scrape or monitor data using Robot Studio

Train a robot to scrape or monitor web data from any website.

Written by Melissa Shires
Updated this week

You can train a robot to extract, scrape or monitor any data from a website.

Robots can be trained to:

  • Extract structured data from a web page.

  • Automatically scrape and structure a list of items.

  • Interact with a web page (e.g., point, click, scroll) to scrape dynamic content.

  • Capture a screenshot of a page.

  • Page through results or use infinite scroll to capture a full list.

💡 If you're looking to scrape or monitor data across multiple pages (deep scraping), you'll need to create multiple robots and connect them together using workflows.

How to start training your robot

To start training your robot, all you'll need is the URL you'd like to scrape or monitor.

  1. From your Browse AI dashboard, click "Build New Robot".

  2. Select either:

    1. Extract structured data - if you'd like to scrape data from a web page (note: you can add a web monitor to any robot).

    2. Monitor site changes - if you want to create a web monitor.

  3. Enter the Origin URL you would like to scrape or monitor.

  4. Click Start Training Robot.

  5. Select Use Robot Studio and wait for your web page to load.

What is the "Origin URL"?

The Origin URL is the page where you'd like to start training your robot.

It's best to start training as close as possible to the data you want to capture.

Example: start your robot at https://www.ycombinator.com/companies rather than https://www.ycombinator.com/.

If the page you want to scrape does not have a direct URL, once in Robot Studio you can search, log in, navigate, click or scroll to reach the content.

How to train your robot to scrape or monitor basic text

Your website will be loaded into Robot Studio where you can train your robot to capture text.

There are two ways to capture text:

  • From a list - choose this option if you're capturing a list on a page.

  • Just text - choose this option to capture, label, and structure specific text elements on a page.

You can train a robot to scrape, monitor and structure as much text as you'd like (both from a list, and just text) across a single web page.

Scraping data 'From a list'

'From a list' is best for repeating information like product listings or search results. By training your robot to scrape or extract the data from a list, it will automatically structure the data into a table as well as trigger pagination options.

  1. Click on Capture Text, and select From a list.

  2. Hover over the list of items on the page until you see a dotted outline around the elements you want to capture.

  3. Click to select the list when the outline matches your desired data set.

  4. Robot Studio will automatically structure that data into a recommended dataset (you can customize this if needed, see below).

  5. Give your list a descriptive name.

  6. Select the number of items you'd like the robot to capture.

  7. Configure the pagination settings to capture additional list items. These include:

    1. Clicking through 'next' buttons.

    2. Clicking a "load more items" button.

    3. Infinite scroll (i.e. scroll up or down to load additional items)

    4. No more items to load.

  8. Click 'Save Captured List'.

  9. Click 'Finish' to finish recording your robot if you've captured all of the data you need, or keep capturing text or screenshots.
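
To make the interaction between the item count and pagination concrete, here is a minimal Python sketch. It is not Browse AI code or its API: `capture_list`, `sample_pages`, and the field names are hypothetical, and each page is simulated as an in-memory list. It only illustrates the idea that a robot keeps loading more items until it reaches the number you selected or runs out of pages, and that the result is a table of rows.

```python
from typing import Iterable


def capture_list(pages: Iterable[list[dict]], max_items: int) -> list[dict]:
    """Collect rows page by page until max_items is reached or pages run out."""
    rows: list[dict] = []
    for page in pages:  # each iteration stands in for one "next" click or scroll
        for row in page:
            if len(rows) >= max_items:
                return rows
            rows.append(row)
    return rows


# Hypothetical sample data: three pages of two items each.
sample_pages = [
    [{"name": "Item A", "price": "$10"}, {"name": "Item B", "price": "$12"}],
    [{"name": "Item C", "price": "$8"}, {"name": "Item D", "price": "$15"}],
    [{"name": "Item E", "price": "$9"}, {"name": "Item F", "price": "$11"}],
]

captured = capture_list(sample_pages, max_items=5)
print(len(captured))         # 5
print(captured[-1]["name"])  # Item E
```

With a limit of 5, the sketch stops partway through the third page. With "No more items to load", the equivalent would be passing only the first page.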

How to customize the scraped list data (optional)

If the automatically structured data doesn't meet your needs, you can also customize the structure of the list data.

  1. Hover over each item you'd like to extract.

  2. Click to select them.

  3. When finished, click Confirm.

  4. Label each data point (press Enter after each one to move to the next).

From there, follow the same steps as above to set the number of list items and the pagination settings.
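
As a rough illustration of what labeling does, the Python sketch below treats each label as a column header and each list item as a row. The labels and values are made up, and `to_csv` is a hypothetical helper, not part of Browse AI. It simply shows the table shape you can expect from a labeled list.

```python
import csv
import io

# Hypothetical labels entered during training, one per selected data point.
labels = ["company", "location"]

# Hypothetical values captured for two list items, in label order.
items = [
    ["Acme", "Austin"],
    ["Globex", "Berlin"],
]


def to_csv(labels: list[str], items: list[list[str]]) -> str:
    """Render labels as the header row and each captured item as a data row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(labels)
    writer.writerows(items)
    return buffer.getvalue()


table = to_csv(labels, items)
print(table)
```

Clear, distinct labels matter because they become the column names in any export or integration that uses the data.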

Scraping 'Just text' to capture and structure text

If you want to scrape, extract or monitor specific text or elements from a page, you'll want to capture 'Just text'. This feature lets you not only extract specific elements, but also allows you to structure the data you're scraping.

  1. Click 'Capture Text' and select 'Just text'.

  2. Hover and click to select what you want to capture.

  3. Select all text on the page you want to scrape or monitor by clicking on it.

  4. Click Confirm when you're done.

  5. Label your captured text. Each label will be a column of data.

  6. Save the captured text.

  7. Click 'Finish' to finish recording your robot if you've captured all of the data you need, or keep capturing text or screenshots.

Note that when capturing, you can often choose between scraping the visible text, HTML, or link depending on what you've selected.
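
The difference between these three options is easiest to see on a link element. The Python sketch below uses only the standard library's `html.parser` and a made-up HTML snippet. It is an illustration of the three kinds of value, not a description of how Robot Studio extracts them, and `LinkCapture` is a hypothetical helper.

```python
from html.parser import HTMLParser


class LinkCapture(HTMLParser):
    """Record the visible text and href of the first <a> element seen."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text_parts = []
        self._inside_link = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self._done:
            self._inside_link = True
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_link:
            self._inside_link = False
            self._done = True

    def handle_data(self, data):
        if self._inside_link:
            self.text_parts.append(data)


# Hypothetical element selected on a page.
snippet = '<a href="https://example.com/pricing">See <b>pricing</b></a>'

parser = LinkCapture()
parser.feed(snippet)
parser.close()

capture = {
    "text": "".join(parser.text_parts),  # what a visitor reads
    "html": snippet,                     # the raw markup
    "link": parser.href,                 # the URL the element points to
}
print(capture["text"])  # See pricing
print(capture["link"])  # https://example.com/pricing
```

Capturing the link rather than the text is typically what you want when the value will feed another robot that visits each URL.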

How to train your robot to capture a screenshot

In addition to scraping and monitoring content on a webpage, you can capture screenshots.

There are three types of screenshots you can capture:

  • Selections: screenshot a specific part of a page.

  • Entire page: capture the complete webpage.

  • Visible part: capture only what's currently visible without scrolling.

A single robot can capture both screenshots and text, and it can capture more than one type of screenshot.

  1. Select Capture Screenshot.

  2. Choose which type of screenshot you'd like to capture.

  3. Name your screenshot.

  4. Click 'Finish' to finish recording your robot if you've captured all of the data you need, or keep capturing text or screenshots.

Editing during recording

Robot Studio allows you to modify your data extraction while recording.

Removing a data point

  1. In the sidebar, hover over the data point you want to remove.

  2. Click the trash can icon to delete it.

  3. Confirm the change in the data preview table.

Removing an entire list

  1. In the Output Data Preview table, hover over the list name.

  2. Click the trash can icon below the list name.

  3. Confirm the deletion when prompted.

How to train your robot to scrape or monitor dynamic text

In addition to scraping or monitoring static HTML or text on a webpage, you can also train your robot to capture dynamic content on a page.

To do so, perform the same actions you would when navigating the content yourself, including:

  • Clicking, scrolling or navigating.

  • Filling out forms, text fields, or search bars.

  • Logging in.

Your robot will be trained to follow this behavior based on your recording.

Approving your robot

When you've finished training your robot to scrape and capture all of the data you need, you'll need to approve your robot.

  1. Click 'Finish Recording'.

  2. Give your robot a clear, descriptive name.

  3. Review the extracted data (note: this is a preview of what and how your robot has been trained to scrape and is not the full dataset).

  4. Choose from the options at the bottom of the screen:

    • Yes, looks good to approve your robot.

    • No, let me re-train the robot to start over.

    • No, report an issue to instantly trigger a ticket to our support team.

    • Delete this robot to remove it completely.

Once your robot is approved, you'll be able to configure monitoring, workflows, integrations, export the data, and more.
