Browse AI is one of the easiest and most user-friendly web scraping tools on the market. Still, following a few best practices when training your web scraping robot will help make the process as smooth as possible.

Move your mouse to ensure correct data selection

Starting with an accurate selection improves the quality of the data you extract.

Move your mouse slowly across the page to see how the selection boxes appear

By moving your mouse carefully and slowly, you can see how the selection box changes. Even a slight movement can change what is selected, so watching the box helps you avoid capturing data that doesn't belong.

Notice which elements are grouped together

Sometimes elements are grouped together (categories, tags, and so on). By slightly shifting your mouse, you can isolate the exact element you want. Getting this right at this early stage will make your data cleaner and more accurate.

Pay attention to how different websites structure their data differently

Not all websites of the same "type" structure their data identically. One great example is real estate websites. There are countless ways to organize and display property listings and details. Be aware of this as you select the data.

Test your selection to ensure data accuracy

By verifying data accuracy, you reduce the need to make changes later.

After selecting data, run a test to ensure you're capturing the right information

When training your robot, you will be shown a data preview after you've selected your desired items. Review it carefully to make sure it's accurate.

If you've taken advantage of the Recommended Dataset, it's especially important to make sure the AI-powered suggestions are accurate. If they're not, you can try running it again or selecting the data manually.

Verify that all required fields are being extracted

Whether you rely on the Recommended Dataset or make the selections yourself, it's possible that a desired field was not captured. Making sure that you've trained the robot to extract the proper data will reduce the need for changes later.
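If you export the extracted rows, this check can also be scripted. The following is a minimal sketch, not a Browse AI feature: it assumes the rows are available as Python dictionaries, and the field names (`address`, `price`, `square_feet`) and the `missing_fields` helper are hypothetical.

```python
def missing_fields(rows, required):
    """Map each row index to the required fields that are absent or blank."""
    problems = {}
    for index, row in enumerate(rows):
        missing = []
        for name in required:
            value = row.get(name)
            # Treat a missing key, None, or whitespace-only text as "not captured".
            if value is None or not str(value).strip():
                missing.append(name)
        if missing:
            problems[index] = missing
    return problems


# Hypothetical sample rows; the second one is missing its price.
rows = [
    {"address": "12 Oak St", "price": "$450,000", "square_feet": "2,350 sq ft"},
    {"address": "98 Elm Ave", "price": "", "square_feet": "1,800 sq ft"},
]
print(missing_fields(rows, ["address", "price", "square_feet"]))
```

An empty result means every required field was filled in every row.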

Check that the data format matches your needs

While there are ways to format extracted data after Browse AI has scraped it, it's best to capture the data in the desired format from the start whenever possible. For square footage, this might mean targeting only the number (e.g. capturing just the 2,350 in "2,350 sq ft" rather than including the words).
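When the number cannot be isolated during selection, it can be cleaned up after export. The sketch below is one possible approach under stated assumptions: the value arrives as text such as "2,350 sq ft", and `parse_square_feet` is a hypothetical helper, not part of Browse AI.

```python
import re


def parse_square_feet(text):
    """Return the first number in text as an int, or None if there is none."""
    # Match a digit followed by any run of digits and thousands separators.
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


print(parse_square_feet("2,350 sq ft"))  # prints 2350
```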

Watch out for these common issues

Even with careful selection, there are a few things to look out for.

Some websites may group different pieces of information in the same box

There are times when it simply isn't possible to isolate a piece of data because the website has placed it inside the same element as something else. An example might be a person's name and job title in the same box. Our Recommended Datasets do their best to intelligently separate these, but there are times when disparate data gets lumped together.
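If combined values do end up in your export, they can sometimes be split afterwards. This is a minimal sketch that assumes the two parts are divided by a consistent separator (a comma here); the separator and the `split_name_and_title` helper are illustrative assumptions, and real pages may need a different rule.

```python
def split_name_and_title(combined, separator=","):
    """Split 'Name, Title' text at the first separator into (name, title)."""
    name, _, title = combined.partition(separator)
    # If the separator is absent, title comes back as an empty string.
    return name.strip(), title.strip()


print(split_name_and_title("Jane Doe, Product Manager"))
```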

Others may separate related information into different boxes

Conversely, you may encounter websites where seemingly related information has been separated. This may mean that a person's first and last name are not in the same box, even though you would expect them to be.
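In that case, you can capture each part as its own field and combine them after export. A short sketch, assuming hypothetical `first_name` and `last_name` values and a hypothetical `join_name` helper:

```python
def join_name(first_name, last_name):
    """Join name parts with a single space, skipping any that are blank."""
    parts = (first_name, last_name)
    return " ".join(part.strip() for part in parts if part and part.strip())


print(join_name("Jane", "Doe"))  # prints Jane Doe
```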

Test your robot on multiple pages to ensure consistent extraction

In a perfect world, pages that look the same to a human would also be structured the same in the underlying data. While it's unlikely, pages can differ behind the scenes, which leads to inconsistent results. Running your robot on several pages helps you catch this early.
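One way to spot such differences is to compare which fields came back filled on each page. The sketch below assumes you have one extracted record per page as a Python dictionary; the page labels, field names, and the `inconsistent_pages` helper are all hypothetical.

```python
def filled_fields(record):
    """Return the set of field names that hold a non-blank value."""
    return {
        name
        for name, value in record.items()
        if value is not None and str(value).strip()
    }


def inconsistent_pages(results):
    """Report pages whose filled fields differ from those of the first page."""
    pages = list(results.items())
    if not pages:
        return {}
    baseline = filled_fields(pages[0][1])
    report = {}
    for page, record in pages[1:]:
        # Symmetric difference: fields filled on one page but not the other.
        difference = baseline ^ filled_fields(record)
        if difference:
            report[page] = sorted(difference)
    return report


# Hypothetical results from three test runs; the second is missing a title.
results = {
    "listing-1": {"name": "Jane Doe", "title": "Product Manager"},
    "listing-2": {"name": "Sam Lee", "title": ""},
    "listing-3": {"name": "Ana Ruiz", "title": "Engineer"},
}
print(inconsistent_pages(results))
```

Any page listed in the result is worth opening and comparing against the first page by hand.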
