How can I extract data from lists and their associated details pages? (Deep scraping)
A common data scraping use case is to capture a list (e.g., product names and links on an e-commerce site) and then capture each list item's details from its dedicated page (product availability at a store, for example). There are a few ways to do that.
Common Mistake: Doing It All At Once
Use cases like this are challenging at first because people naturally think of it this way: I need a robot that goes to a website that shows a list, clicks on each list item and opens the details, captures the details, goes back to the list, and repeats this for every other list item.
This workflow might be what a person would typically do; however, it runs into many issues at scale, such as the following:
- What if the list has changed by the time you come back from the details page? For example, new items may have been added or the order of items may have changed. It would be pretty hard to continue while ensuring you don't miss any data.
- What if it's an infinite-scroll list (like a Twitter feed)? If you click on an item and then go back, the list resets to the top, and you have to scroll for a while to get back to where you were.
- What if it's a long list (like 10,000 items), and you click on the list items too fast, and the site assumes you're trying to take down their server with a DDoS attack and blocks you in the middle of your workflow?
Because of these issues, we intentionally did not provide a way to automate a workflow like this during the recording experience.
A Better Solution: Use Two Separate Robots
If each list item has a link, which is usually the case, you can avoid all these issues by taking a different two-step approach:
Step 1: Extract links to all list items using Robot A
First, you can collect a list of links to all detail pages. You just need to build a Robot like "Extract product links from walmart.com" (let's call it Robot A) that extracts all product links. Then you can download the list of links as a CSV on your Browse AI dashboard.
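To make the idea behind Step 1 concrete, here is a minimal Python sketch of what "collecting links to all detail pages" means in general. It is not how Robot A is implemented; it only illustrates the concept with the standard library. The function name `extract_links`, the `marker` parameter, and the `shop.example` URLs are all assumptions made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects unique absolute URLs from <a href> tags containing a marker."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # hypothetical: substring identifying detail pages
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.links:  # keep first occurrence only
                self.links.append(url)


def extract_links(html, base_url, marker):
    """Return detail-page links found in the list page's HTML, in page order."""
    collector = LinkCollector(base_url, marker)
    collector.feed(html)
    return collector.links
```

Because the links are collected in a single pass over the list page, later changes to the list (new items, reordering) cannot affect which detail pages get visited in Step 2.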
Step 2: Extract item details from all links in step 1 using Robot B
Build a data extraction Robot like "Extract a single product's details on walmart.com" on an item's details page (we'll call it Robot B).
If some item details are optional and do not always exist, we recommend recording the task on an item's page that contains all the possible information you need to extract.
Then go to Robot B's Run Task tab on your dashboard and click on Bulk Run.
Upload the Links CSV from step 1. Map Robot B's variable columns to the columns imported from the CSV.
Review the imported links and make sure each row contains a different link.
Then scroll down and make sure you have synced this Robot with a Google Sheet so that all its extracted data can easily be retrieved from the Google Sheet.
Once you're ready, press the Run Task button. Robot B then runs for every detail page link you extracted with Robot A, and you can see the results gradually added to the Google Sheet. How long this takes depends on how many links you provided.
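If the links CSV is long, checking by eye that each row contains a different link can be tedious. The following Python sketch shows one way to run that check on the CSV before uploading it in Bulk Run. The column name `Link` is an assumption; use whatever header your exported CSV actually has.

```python
import csv
import io
from collections import Counter


def find_duplicate_links(csv_text, column="Link"):
    """Return the links that appear in more than one row of the CSV text.

    `column` is a hypothetical header name; adjust it to match your export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Count each non-empty link, ignoring surrounding whitespace.
    counts = Counter(row[column].strip() for row in reader if row.get(column))
    return [link for link, n in counts.items() if n > 1]
```

To check a downloaded file, read its contents into a string first (for example with `open(path, newline="", encoding="utf-8").read()`) and pass that string to `find_duplicate_links`. An empty result means every row has a different link.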