Many web pages display the same type of data in different formats, or mix various data types together.

Common mixed format scenarios

Product pages with prices in different currencies
Dates shown in multiple formats
Numbers with and without units
Links mixed with plain text
Optional fields that may or may not appear

Capturing different data types

During robot training

When the same page shows data in different formats, Browse AI offers several capture options for each data type:

Data type	What you see	Capture options
Text with link	Clickable text	• Text only • Link URL • Both (as HTML)
Numbers with units	"2,350 sq ft"	• Full text • Number only • Remove commas
Formatted prices	"$1,299.99"	• With currency • Number only
Dates	"Nov 21, 2024"	• As shown • Standardized format
Mixed content	Text + image + link	• Visible text • HTML with all elements

💡 You are not limited to a single capture option. For example, for a link on a page, you can capture the text as one variable and the link URL as a second variable.
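If you capture the full text and want the numeric value afterwards, the cleanup behind "Number only" and "Remove commas" can be approximated outside the robot. The Python sketch below is only an illustration under that assumption, not Browse AI's implementation; `number_only` is a hypothetical helper name.

```python
import re

def number_only(raw: str) -> float:
    """Return the first number in the captured text, dropping commas, symbols and units."""
    # Matches digits with optional thousands separators and an optional decimal part.
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    return float(match.group().replace(",", ""))

print(number_only("2,350 sq ft"))  # 2350.0
print(number_only("$1,299.99"))    # 1299.99
```

This sketch assumes a comma is a thousands separator. Text such as "1.299,99" (decimal comma) would need a different rule.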

Choosing the right format

When capturing, you'll see options like:

Capture as Text
Capture as Link
Capture as HTML

Choose based on your needs:

Text: Just the visible content
Link: The URL behind clickable text
HTML: Everything, including formatting and links
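To see how the three formats relate, the Python sketch below takes a piece of markup like an HTML capture might contain and derives the visible text and the link URL from it. The sample markup, the `example.com` URL and the names `LinkAndText` and `split_capture` are made up for illustration; the exact output Browse AI produces may differ.

```python
from html.parser import HTMLParser

class LinkAndText(HTMLParser):
    """Collects visible text and href values from a fragment of HTML."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the URL behind each anchor tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Everything between tags is visible text.
        self.text_parts.append(data)

def split_capture(html: str):
    parser = LinkAndText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the text reads as it does on the page.
    text = " ".join("".join(parser.text_parts).split())
    return text, parser.links

html_capture = '<a href="https://example.com/item/42"><b>Blue</b> widget</a>'
text, links = split_capture(html_capture)
print(text)   # Blue widget
print(links)  # ['https://example.com/item/42']
```

The HTML form carries enough information to recover the other two, which is why it is the safest choice when you are unsure what you will need later.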

Practical examples

E-commerce product page

Mixed content present:

Product name (text)
Price (currency)
Rating (number)
Review count (number with text)
Availability (text)
Product link (URL)

Training approach:

1. Product name → Capture as Text 
2. Price → Capture number only (for calculations) 
3. Rating → Capture as number 
4. Reviews → "245 reviews" → Capture full text 
5. Availability → Capture as Text 
6. Link → Capture as Link
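Once the robot has run, a row captured with the approach above can be turned into typed values for analysis. The following Python sketch assumes the row arrives as a dictionary of strings; the column names (`name`, `price`, `rating`, `reviews`, `availability`, `link`) and the sample values are hypothetical and depend on how you name your captured fields.

```python
import re
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: float
    rating: float
    review_count: int
    availability: str
    url: str

def parse_row(row: dict) -> Product:
    # "245 reviews" was captured as full text, so pull the leading count out of it.
    reviews = re.search(r"\d[\d,]*", row["reviews"])
    return Product(
        name=row["name"].strip(),
        price=float(row["price"].replace(",", "")),  # captured as number only
        rating=float(row["rating"]),
        review_count=int(reviews.group().replace(",", "")) if reviews else 0,
        availability=row["availability"].strip(),
        url=row["link"],
    )

row = {
    "name": " Wireless Mouse ",
    "price": "1299.99",
    "rating": "4.5",
    "reviews": "245 reviews",
    "availability": "In stock",
    "link": "https://example.com/p/1",
}
print(parse_row(row).review_count)  # 245
```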

Real estate listing

Mixed formats present:

Price: "$450,000"
Size: "2,350 sq ft" and "218 m²"
Year: "Built in 2019" and "2019"
Features: Some with icons, some text only

Training approach:

Be consistent within each field
Choose the most useful format for analysis
Train with various examples
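If a listing page mixes units, one way to stay consistent is to capture the full text and convert everything to a single unit afterwards. The Python sketch below handles only the formats shown in this example ("2,350 sq ft", "218 m²", "Built in 2019", "2019"); `size_in_m2` and `year_built` are hypothetical helpers, and other sites will likely need extra patterns.

```python
import re

SQFT_TO_M2 = 0.09290304  # exact value of one square foot in square metres

def size_in_m2(raw: str) -> float:
    """Convert a size string in sq ft or m² to square metres."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(sq\s*ft|m²|m2)", raw, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {raw!r}")
    value = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower()
    return value * SQFT_TO_M2 if unit.startswith("sq") else value

def year_built(raw: str) -> int:
    """Pull a four-digit year (1800-2099) out of text such as 'Built in 2019'."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", raw)
    if match is None:
        raise ValueError(f"no year found in {raw!r}")
    return int(match.group(1))

print(round(size_in_m2("2,350 sq ft"), 1))  # 218.3
print(size_in_m2("218 m²"))                 # 218.0
print(year_built("Built in 2019"))          # 2019
```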

News article page

Different elements:

Headline (text)
Author (text with profile link)
Date (various formats)
Article body (HTML with formatting)
Related links (URLs)

Extraction strategy:

Headline → Text only
Author → Text (ignore profile link)
Date → Capture as shown
Body → HTML to preserve formatting
Links → Capture URLs separately

Format selection guide

If you need to...	Choose this format
Do calculations	Numbers only
Preserve styling	HTML
Simple analysis	Plain text
Follow links	URL extraction
Keep everything	Full HTML

Handling special characters

For currency symbols, units, and other special characters:

Decide if you need them before training
Be consistent across similar fields
Consider post-processing needs

Troubleshooting format issues

Common problems and solutions

Issue	Cause	Solution
Missing currency symbols	Extracted as number only	Retrain to capture full text
Broken dates	Mixed format parsing	Capture as text, parse later
Lost links	Captured as text only	Retrain to capture URLs
Garbled special characters	Encoding issues	Use HTML capture
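The "capture as text, parse later" fix for dates can look like the Python sketch below, which tries a short list of known formats in order. The format list is an assumption for illustration; adjust it to the formats your pages actually use. Month names are parsed according to the system locale (English is assumed here), and a value like "03/04/2024" is ambiguous between day-first and month-first, so only one of those conventions should be in the list.

```python
from datetime import datetime

# Assumed formats, tried in order. Month-first is chosen for slash dates.
KNOWN_FORMATS = ["%b %d, %Y", "%B %d, %Y", "%Y-%m-%d", "%m/%d/%Y"]

def to_iso(raw: str):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = " ".join(raw.split())  # normalize stray whitespace
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(to_iso("Nov 21, 2024"))  # 2024-11-21
print(to_iso("11/21/2024"))    # 2024-11-21
print(to_iso("yesterday"))     # None
```

Returning `None` instead of raising keeps unparseable rows visible, so you can review them rather than lose them.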

Post-extraction formatting

Sometimes it's better to extract everything and format later:

Extract raw → Process in spreadsheet:

More flexible
Easier updates
Better for complex transformations

Example workflow:

Extract all prices with currency symbols
Export to CSV
Use spreadsheet formulas to standardize
Create clean dataset
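If you prefer a script to spreadsheet formulas, the same workflow can be sketched in Python. The example below reads exported CSV text, splits each price into a currency code and an amount, and writes a clean CSV. The column names (`name`, `price`) and the symbol-to-currency mapping are assumptions for illustration; in particular, "$" is treated as USD, which may not hold for every site.

```python
import csv
import io
import re

# Assumed mapping; "$" is ambiguous across countries.
CURRENCY_BY_SYMBOL = {"$": "USD", "€": "EUR", "£": "GBP"}

def standardize_price(raw: str):
    """Return (currency_code, amount), or None if the price cannot be parsed."""
    symbol = next((s for s in CURRENCY_BY_SYMBOL if s in raw), None)
    number = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if symbol is None or number is None:
        return None
    return CURRENCY_BY_SYMBOL[symbol], float(number.group().replace(",", ""))

def clean_csv(source_text: str) -> str:
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["name", "currency", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        parsed = standardize_price(row["price"])
        if parsed is None:
            continue  # skip rows with no recognizable price
        currency, amount = parsed
        writer.writerow(
            {"name": row["name"], "currency": currency, "amount": f"{amount:.2f}"}
        )
    return out.getvalue()

exported = 'name,price\nLaptop,"$1,299.99"\nMouse,€25\n'
print(clean_csv(exported))
```

With a real export you would read the file from disk instead of the inline `exported` string. Prices that use a decimal comma are not handled by this sketch.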
