
Extracting data with different formats on the same page

Learn how to handle pages with mixed content types - text, numbers, dates, links, and currencies - all in one extraction.

Written by Melissa Shires
Updated today

Many web pages display the same type of data in different formats, or mix various data types together.

Common mixed format scenarios

  • Product pages with prices in different currencies

  • Dates shown in multiple formats

  • Numbers with and without units

  • Links mixed with plain text

  • Optional fields that may or may not appear

Capturing different data types

During robot training

When a page mixes formats, Browse AI offers capture options suited to each data type:

| Data type | What you see | Capture options |
|---|---|---|
| Text with link | Clickable text | Text only; Link URL; Both (as HTML) |
| Numbers with units | "2,350 sq ft" | Full text; Number only; Remove commas |
| Formatted prices | "$1,299.99" | With currency; Number only |
| Dates | "Nov 21, 2024" | As shown; Standardized format |
| Mixed content | Text + image + link | Visible text; HTML with all elements |

💡 You are not limited to a single capture option. For example, for a link on a page you can capture the text as one variable and the link URL as a second variable.

Choosing the right format

When capturing, you'll see options like:

  • Capture as Text

  • Capture as Link

  • Capture as HTML

Choose based on your needs:

  • Text: Just the visible content

  • Link: The URL behind clickable text

  • HTML: Everything, including formatting and links
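
If a field was captured as HTML, the visible text and the link can still be separated afterwards. The following is a minimal sketch using Python's standard `html.parser` module. The `captured` value is a made-up example, not real Browse AI output, and the `TextAndLinks` and `split_html` names are hypothetical helpers.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collects visible text and href values from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the URL behind every anchor tag that has an href.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep non-empty text nodes, trimmed of surrounding whitespace.
        if data.strip():
            self.text_parts.append(data.strip())


def split_html(fragment):
    """Return (visible_text, list_of_urls) for an HTML fragment."""
    parser = TextAndLinks()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical captured value, for illustration only.
captured = '<a href="https://example.com/p/42">Blue <b>Widget</b></a>'
text, links = split_html(captured)
```

Capturing as HTML keeps both pieces available in a single field, at the cost of this extra parsing step. Capturing text and link as two variables during training avoids it.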

Practical examples

E-commerce product page

Mixed content present:

  • Product name (text)

  • Price (currency)

  • Rating (number)

  • Review count (number with text)

  • Availability (text)

  • Product link (URL)

Training approach:

1. Product name → Capture as Text 
2. Price → Capture number only (for calculations)
3. Rating → Capture as number
4. Reviews → "245 reviews" → Capture full text
5. Availability → Capture as Text
6. Link → Capture as Link
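
If the price or the review count was captured as full text, the number can be recovered later. The sketch below assumes US-style formatting (comma as thousands separator, dot as decimal separator). The `parse_price` and `parse_review_count` names are hypothetical helpers, not part of Browse AI.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the numeric part of a price such as "$1,299.99" as a Decimal.

    Assumes a comma for thousands and a dot for decimals.
    Returns None when the text contains no digits.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


def parse_review_count(text):
    """Return the first integer in text such as "245 reviews"."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))
```

`Decimal` is used for prices so that later sums do not pick up floating-point rounding errors.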

Real estate listing

Mixed formats present:

  • Price: "$450,000"

  • Size: "2,350 sq ft" and "218 m²"

  • Year: "Built in 2019" and "2019"

  • Features: Some with icons, some text only

Training approach:

  1. Be consistent within each field: use the same capture option for every listing

  2. Choose the most useful format for analysis

  3. Train with several listings that show the different formats
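
When sizes and years were captured as full text, the variations in this listing example can be brought into one format afterwards. This is a sketch under stated assumptions: sizes appear only as "sq ft" or "m²" (or "m2"), the conversion factor is rounded, and years fall between 1800 and 2099. The function names are hypothetical.

```python
import re

# Approximate conversion factor: 1 square metre is about 10.7639 square feet.
SQFT_PER_M2 = 10.7639


def size_in_sqft(text):
    """Convert "2,350 sq ft" or "218 m²" to a whole number of square feet."""
    match = re.search(
        r"(\d[\d,]*(?:\.\d+)?)\s*(sq\.?\s*ft|m²|m2)", text, re.IGNORECASE
    )
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower()
    if unit.startswith("m"):
        # Metric value: convert to square feet.
        value *= SQFT_PER_M2
    return round(value)


def year_built(text):
    """Extract a four-digit year from "Built in 2019" or "2019"."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", text)
    return int(match.group(1)) if match else None
```

Converting everything to a single unit makes listings comparable, but keep the original text as well if the exact source value matters.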

News article page

Different elements:

  • Headline (text)

  • Author (text with profile link)

  • Date (various formats)

  • Article body (HTML with formatting)

  • Related links (URLs)

Extraction strategy:

  • Headline → Text only

  • Author → Text (ignore profile link)

  • Date → Capture as shown

  • Body → HTML to preserve formatting

  • Links → Capture URLs separately
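
Dates captured as shown can be standardized after extraction. The sketch below tries a short list of formats in order. The list in `KNOWN_FORMATS` is an assumption and should be adjusted to the formats the site actually uses. Month names are assumed to be in English.

```python
from datetime import datetime

# Assumed formats, tried in order. Slash-separated dates are read day-first,
# so "03/04/2024" is treated as 3 April 2024.
KNOWN_FORMATS = ("%b %d, %Y", "%B %d, %Y", "%Y-%m-%d", "%d/%m/%Y")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` for unrecognized values makes it easy to filter the rows that still need manual review.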

Format selection guide

| If you need to... | Choose this format |
|---|---|
| Do calculations | Numbers only |
| Preserve styling | HTML |
| Simple analysis | Plain text |
| Follow links | URL extraction |
| Keep everything | Full HTML |

Handling special characters

For currency symbols, units, and other special characters:

  • Decide if you need them before training

  • Be consistent across similar fields

  • Consider post-processing needs

Troubleshooting format issues

Common problems and solutions

| Issue | Cause | Solution |
|---|---|---|
| Missing currency symbols | Extracted as number only | Retrain to capture full text |
| Broken dates | Mixed format parsing | Capture as text, parse later |
| Lost links | Captured as text only | Retrain to capture URLs |
| Garbled special characters | Encoding issues | Use HTML capture |

Post-extraction formatting

Sometimes it's better to extract everything and format later:

Extract raw → Process in spreadsheet:

  • More flexible

  • Easier updates

  • Better for complex transformations

Example workflow:

  1. Extract all prices with currency symbols

  2. Export to CSV

  3. Use spreadsheet formulas to standardize

  4. Create clean dataset
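
The same workflow can also be scripted instead of using spreadsheet formulas. The sketch below reads exported CSV text and adds a currency code column and a numeric amount column. The column name `price`, the symbol mapping, and the sample rows are assumptions for illustration. It handles only US-style numbers, so a value like "1.299,99" would need extra handling.

```python
import csv
import io
import re
from decimal import Decimal

# Assumed mapping from currency symbol to currency code.
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}


def split_price(raw):
    """Return (currency_code, amount) for text such as "$1,299.99"."""
    raw = raw.strip()
    code = next((c for s, c in SYMBOL_TO_CODE.items() if s in raw), "")
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    amount = Decimal(match.group().replace(",", "")) if match else None
    return code, amount


def standardize(csv_text, price_column="price"):
    """Add 'currency' and 'amount' columns to every row of the CSV text."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        code, amount = split_price(row[price_column])
        row["currency"] = code
        row["amount"] = "" if amount is None else str(amount)
        rows.append(row)
    return rows


# Hypothetical export, for illustration only.
sample = 'name,price\nLamp,"$1,299.99"\nChair,€85\n'
clean = standardize(sample)
```

The original `price` column is left untouched, so the raw value stays available if the cleanup rules change later.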
