
How can I extract data from lists and their associated details pages? (aka Deep scraping)

A common data scraping use case is to capture a list (e.g., product names and links on an e-commerce site) and then capture each list item's details from its dedicated page (product availability at a store, for example). There are a few ways to do that.

Common Mistake: Doing It All At Once

Use cases like this are challenging at first because people naturally think of them this way: I need a robot that goes to a website showing a list, clicks on each list item to open its details, captures the details, goes back to the list, and repeats this for every other list item.

This workflow might be what a person would typically do; however, it has many issues at scale, such as the following:

  1. What if the list changes while you are on a details page, such as new items being added or the order shifting? It would be hard to continue while ensuring you don't miss any data.
  2. What if it is an infinite-scroll list (like a Twitter feed)? Clicking an item and going back resets you to the top, and you have to scroll for a while to get back to where you were.
  3. What if it is a long list (say, 10,000 items) and you click through the items so fast that the site assumes you are trying to take down its server with a DDoS attack and blocks you in the middle of your workflow?

Because of these issues, we intentionally did not provide a way to automate a workflow like this during the recording experience.

A Better Solution: Use Two Separate Robots

If each list item has a link, which is usually the case, you can avoid all these issues by taking a different two-step approach:

Step 1: Extract links to all list items using Robot A

First, collect the links to all detail pages. You just need to build a Robot like "Extract product links from walmart.com" (let's call it Robot A). Then you can download the list of links as a CSV from your Browse AI dashboard.
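To make the hand-off between the two steps concrete, here is a minimal sketch of what the downloaded links CSV might contain and how you could load it for inspection. The column name `product_link` and the URLs are made-up examples; the actual header depends on what you named the extracted field in Robot A.

```python
import csv
import io

# Hypothetical contents of the CSV downloaded from Robot A's results.
# In practice you would open the downloaded file instead of this string.
links_csv = """product_link
https://www.walmart.com/ip/item-one/111
https://www.walmart.com/ip/item-two/222
"""

# Each row holds one detail-page URL; Robot B will later run once per row.
links = [row["product_link"] for row in csv.DictReader(io.StringIO(links_csv))]
print(links)
```

Checking that every row contains a distinct, well-formed URL at this point saves you from wasted runs in step 2.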

Step 2: Extract item details from all links in step 1 using Robot B

Build a data extraction Robot like "Extract a single product's details on walmart.com" on an item's details page (we'll call it Robot B).

If some item details are optional and do not always exist, we recommend recording the task on an item's page that contains all the possible information you need to extract.

Then go to Robot B's Run Task tab on your dashboard and click on Bulk Run.

Upload the Links CSV from step 1. Map Robot B's variable columns to the columns imported from the CSV.

Review the imported links and make sure each row contains a different link.

Then scroll down and make sure you have synced this Robot with a Google Sheet so that all its extracted data can easily be retrieved there.

Once you're ready, press the Run Task button. Robot B then runs for every detail page link you extracted with Robot A, and you can see the results gradually added to the Google Sheet. How long this takes depends on how many links you provided.

See this video on how to bulk run tasks.


Bulk Running Tasks FAQ

Why does the bulk run take so long?

By default, each Robot you build has a concurrency limit of 10 active task executions at any given time. That means if you bulk run a task 20 times, the first 10 times start immediately, but the 11th one will begin after one of the previous tasks is finished. This way, we avoid putting too much pressure on the task's origin website.
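The effect of this limit can be sketched with a small Python simulation: a worker pool capped at 10 concurrent executions, where the 11th "task" only starts once an earlier one finishes. The sleep is a stand-in for a real task execution; nothing here calls Browse AI.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

CONCURRENCY_LIMIT = 10  # the default per-Robot limit described above

active = 0  # tasks currently "running"
peak = 0    # highest concurrency observed
lock = threading.Lock()

def run_task(link):
    """Stand-in for one Robot B task execution."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # simulated work
    with lock:
        active -= 1
    return link

# 20 queued executions, but at most 10 run at any given time.
links = [f"https://example.com/item/{i}" for i in range(20)]
with ThreadPoolExecutor(max_workers=CONCURRENCY_LIMIT) as pool:
    results = list(pool.map(run_task, links))

print(peak)  # never exceeds CONCURRENCY_LIMIT
```

The cap throttles load on the origin website the same way regardless of how many rows your bulk run contains.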

If you believe this concurrency limit is unnecessary for your task, please send us your use case and the desired concurrency limit.

How many credits does a bulk run take?

It depends on how many rows there are in the bulk run. For example, if there are 90 rows, the bulk run will take 90 credits from your monthly quota.

If any task executions fail, they will not count towards your quota.

Can I bulk run a task using a single API call?

Not yet! But it is something we are actively working on adding to the API.
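In the meantime, a workaround is to trigger one task per link yourself. The sketch below only builds the per-link requests rather than sending them; the base URL, endpoint path, and payload shape (`inputParameters`, `originUrl`) are assumptions modeled on a typical REST task API, not a confirmed Browse AI contract, so check the current API documentation before using them.

```python
import json

API_ROOT = "https://api.browse.ai/v2"  # assumed base URL; verify in the API docs
ROBOT_ID = "robot-b-id"                # placeholder for Robot B's actual ID

def build_run_request(link):
    # One request per detail-page link. The endpoint path and body shape
    # here are illustrative assumptions, not the documented API.
    return {
        "method": "POST",
        "url": f"{API_ROOT}/robots/{ROBOT_ID}/tasks",
        "body": json.dumps({"inputParameters": {"originUrl": link}}),
    }

links = [
    "https://www.walmart.com/ip/item-one/111",
    "https://www.walmart.com/ip/item-two/222",
]
requests_to_send = [build_run_request(l) for l in links]
print(len(requests_to_send))
```

Sending these sequentially (or with modest client-side concurrency) keeps you within the per-Robot concurrency limit discussed above.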