How can I extract data from lists and their associated details pages? (Deep scraping)

Deep scraping is a technique that goes beyond a single page by systematically following links (the URLs of the detail pages) within a website. It mimics how a human user navigates through the different sections of a site.

A common data scraping use case is to capture a list (e.g. product names and links on an e-commerce site) and then capture each list item's details from its dedicated page (product availability at a store, for example). There are a few ways to do that.

💡 In a hurry?

  • Jump to the video about deep scraping using Bulk Run. It might clear up your questions 🙂
  • Try the interactive demo on our Workflows feature (chaining 2 robots together)

Common Mistake: Doing It All At Once

Use cases like this are challenging at first because people naturally think of it this way: I need a robot that goes to a website that shows a list, clicks on each list item and opens the details, captures the details, goes back to the list, and repeats this for every other list item.

This workflow might be what a person would typically do. At scale, however, it runs into several problems:

  1. What if the list changes while you are on a details page? New items may have been added, or the order of items may have changed, by the time you come back. It would be hard to continue without missing any data.
  2. What if it's an infinite-scroll list (like a Twitter feed)? When you click on an item and then go back, the list resets to the top, and you have to scroll for a while to get back to where you were.
  3. What if it's a long list (like 10,000 items)? If you click on the list items too fast, the site may assume you're trying to take down its server with a DDoS attack and block you in the middle of your workflow.

Because of these issues, we intentionally did not provide a way to automate a workflow like this during the recording experience.

A Better Solution: Use Two Separate Robots

If each list item has a link, which is usually the case, you can avoid all these issues by taking a different 2-step approach:

Step 1: Extract links to all list items using Robot A

First, you can collect a list of links to all detail pages. You just need to build a Robot like "Extract product links from" (let's call it Robot A) that extracts all product links. Then you can download the list of links as a CSV on your Browse AI dashboard.

Step 2: Extract item details from all links in step 1 using Robot B

Build a data extraction Robot like "Extract a single product's details on" on an item's details page (we'll call it Robot B).

If some item details are optional and do not always exist, we recommend recording the task on an item's page that contains all the possible information you need to extract.

Then go to Robot B's Run Task tab on your dashboard and click on Bulk Run.

Upload the Links CSV from step 1. Map Robot B's variable columns to the columns imported from the CSV.

Review the imported links and make sure each row contains a different link.
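If the list is long, reviewing it by eye gets tedious. As an optional extra, the check can be done on the CSV before uploading it. The following is a minimal Python sketch, not part of Browse AI: the column name "Product Link" and the sample rows are assumptions made for illustration, so substitute the column header from your own CSV export.

```python
import csv
import io
from collections import Counter


def check_links(csv_text, column="Product Link"):
    """Return (blank_rows, duplicates) for the link column of a CSV.

    `column` is a hypothetical header name; use the one in your own file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    links = [(row.get(column) or "").strip() for row in reader]
    # Data rows start at line 2 because line 1 is the header
    # (this assumes no field spans multiple lines).
    blank_rows = [i for i, link in enumerate(links, start=2) if not link]
    counts = Counter(link for link in links if link)
    duplicates = sorted(link for link, n in counts.items() if n > 1)
    return blank_rows, duplicates


# Made-up sample data standing in for a downloaded links CSV.
sample = (
    "Product Name,Product Link\n"
    "Mug,https://example.com/p/1\n"
    "Cup,https://example.com/p/2\n"
    "Mug (copy),https://example.com/p/1\n"
    "Bowl,\n"
)

blank_rows, duplicates = check_links(sample)
print("Rows with no link:", blank_rows)
print("Links that appear more than once:", duplicates)
```

To run it against a real file, read the file's contents (for example with `open(path, newline="", encoding="utf-8").read()`) and pass them to `check_links` in place of `sample`.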

Then scroll down and make sure you have synced this Robot with a Google Sheet so that all its extracted data can easily be retrieved from the Google Sheet.

Once you're ready, press the Run Task button. Robot B then runs once for every detail page link you extracted with Robot A, and the results are gradually added to the Google Sheet. How long this takes depends on how many links you provided.
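For readers who think in code, the two-step pattern can be sketched generically. This is only an illustration of the idea, not how Browse AI is implemented: `deep_scrape`, `list_links`, `get_details`, and `delay_seconds` are hypothetical names, and the two stand-in functions return made-up data instead of visiting a real site.

```python
import time
from typing import Callable, Dict, Iterable, List


def deep_scrape(
    list_links: Callable[[], Iterable[str]],
    get_details: Callable[[str], Dict[str, str]],
    delay_seconds: float = 1.0,
) -> List[Dict[str, str]]:
    """Collect all links first, then visit each detail page on its own."""
    # Step 1 (Robot A's role): gather every link up front and drop repeats,
    # so later changes to the list cannot cause missed or repeated items.
    links = list(dict.fromkeys(list_links()))

    # Step 2 (Robot B's role): process each link independently, pausing
    # between requests so the site is not hit too quickly.
    results = []
    for index, link in enumerate(links):
        if index > 0:
            time.sleep(delay_seconds)
        results.append({"link": link, **get_details(link)})
    return results


# Stand-ins for illustration only; a real version would load pages.
def fake_list_links():
    return [
        "https://example.com/p/1",
        "https://example.com/p/2",
        "https://example.com/p/1",  # repeated link, removed in step 1
    ]


def fake_get_details(link):
    return {"availability": "in stock" if link.endswith("/1") else "sold out"}


rows = deep_scrape(fake_list_links, fake_get_details, delay_seconds=0)
for row in rows:
    print(row)
```

Because the link list is fixed before any detail page is opened, a failed or blocked detail page affects only its own row, and the remaining links can still be processed or retried later.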

We've got a video about deep scraping using Bulk Run

Can this be automated?

Yes. Browse AI's Workflows feature connects Robot A and Robot B, automating the process described in this article so that you don't have to link the two robots by hand.

See it in action!
