Many web pages display the same type of data in different formats, or mix various data types together.
Common mixed format scenarios
Product pages with prices in different currencies
Dates shown in multiple formats
Numbers with and without units
Links mixed with plain text
Optional fields that may or may not appear
Capturing different data types
During robot training
When you encounter different formats on the same page, Browse AI offers several data capture options:
Data type | What you see | Capture options |
Text with link | Clickable text | • Text only |
Numbers with units | "2,350 sq ft" | • Full text |
Formatted prices | "$1,299.99" | • With currency |
Dates | "Nov 21, 2024" | • As shown |
Mixed content | Text + image + link | • Visible text |
💡 You are not limited to a single capture option. For example, for a link on a page, you can capture the text as one variable and the link URL as a second variable.
Choosing the right format
When capturing, you'll see options like:
Capture as Text
Capture as Link
Capture as HTML
Choose based on your needs:
Text: Just the visible content
Link: The URL behind clickable text
HTML: Everything, including formatting and links
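To make the difference between the three options concrete, here is a minimal Python sketch that takes one HTML fragment and derives the text, link, and HTML views from it. It only illustrates what each option yields; it is not how Browse AI works internally, and the names `AnchorParser`, `capture_views`, and the example.com URL are hypothetical.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collects the visible text and the first href found in a fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Remember only the first link in the fragment.
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text between tags is what a visitor actually sees.
        self.text_parts.append(data)


def capture_views(html_fragment):
    """Return the fragment as it would look under each capture option."""
    parser = AnchorParser()
    parser.feed(html_fragment)
    parser.close()
    return {
        "text": "".join(parser.text_parts).strip(),  # visible content only
        "link": parser.href,                         # URL behind the text
        "html": html_fragment,                       # everything, unchanged
    }


# Hypothetical fragment for illustration.
sample = '<a href="https://example.com/item/42"><b>Blue</b> Widget</a>'
views = capture_views(sample)
```

Here `views["text"]` keeps only the wording, `views["link"]` keeps only the destination, and `views["html"]` keeps the bold markup and the link together.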
Practical examples
E-commerce product page
Mixed content present:
Product name (text)
Price (currency)
Rating (number)
Review count (number with text)
Availability (text)
Product link (URL)
Training approach:
1. Product name → Capture as Text
2. Price → Capture number only (for calculations)
3. Rating → Capture as number
4. Reviews → "245 reviews" → Capture full text
5. Availability → Capture as Text
6. Link → Capture as Link
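Once a row like this is extracted, the fields captured as full text can be converted to numbers afterwards. The following is a rough sketch in Python, assuming the captured values arrive as strings; the `row` dictionary and the helper `parse_review_count` are hypothetical and only mirror the six fields above.

```python
import re


def parse_review_count(text):
    """Return the first integer in text such as '245 reviews', or None."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


# Hypothetical captured row; every value is a string as extracted.
row = {
    "name": "Blue Widget",
    "price": "1,299.99",        # captured as number only
    "rating": "4.5",
    "reviews": "245 reviews",   # captured as full text
    "availability": "In stock",
    "link": "https://example.com/item/42",
}

typed = {
    **row,
    # Assumes a comma is a thousands separator, as in US-style prices.
    "price": float(row["price"].replace(",", "")),
    "rating": float(row["rating"]),
    "review_count": parse_review_count(row["reviews"]),
}
```

Keeping the original "reviews" text alongside the derived `review_count` means nothing is lost if the site later changes its wording.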
Real estate listing
Mixed formats present:
Price: "$450,000"
Size: "2,350 sq ft" and "218 m²"
Year: "Built in 2019" and "2019"
Features: Some with icons, some text only
Training approach:
Be consistent within each field
Choose the most useful format for analysis
Train with various examples
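If the listings are captured as full text, the mixed sizes and years can be brought to one format after extraction. This is a small Python sketch under stated assumptions: sizes use either "sq ft" or "m²", commas are thousands separators, and square feet is the target unit. The function names are hypothetical.

```python
import re

SQFT_PER_M2 = 10.7639  # approximate conversion factor


def parse_area_sqft(text):
    """Convert '2,350 sq ft' or '218 m²' to square feet, or return None."""
    match = re.search(
        r"(\d[\d,]*(?:\.\d+)?)\s*(sq\s*ft|m²|m2)", text, flags=re.IGNORECASE
    )
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower()
    if unit.startswith("sq"):
        return value
    return value * SQFT_PER_M2


def parse_year(text):
    """Pull a four-digit year out of 'Built in 2019' or '2019'."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", text)
    return int(match.group(1)) if match else None
```

With these helpers, "2,350 sq ft" and "218 m²" land in the same numeric column, and "Built in 2019" and "2019" both become the integer 2019.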
News article page
Different elements:
Headline (text)
Author (text with profile link)
Date (various formats)
Article body (HTML with formatting)
Related links (URLs)
Extraction strategy:
Headline → Text only
Author → Text (ignore profile link)
Date → Capture as shown
Body → HTML to preserve formatting
Links → Capture URLs separately
Format selection guide
If you need to... | Choose this format |
Do calculations | Numbers only |
Preserve styling | HTML |
Simple analysis | Plain text |
Follow links | URL extraction |
Keep everything | Full HTML |
Handling special characters
For currency symbols, units, and other special characters:
Decide if you need them before training
Be consistent across similar fields
Consider post-processing needs
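One way to keep both options open is to capture the full text and separate the symbol from the amount in post-processing. Below is a minimal Python sketch, assuming US-style formatting (comma for thousands, period for decimals); `split_price` is a hypothetical helper, not a Browse AI feature.

```python
import re
from decimal import Decimal


def split_price(text):
    """Split '$1,299.99' into ('$', Decimal('1299.99')); None if no number."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    amount = Decimal(match.group().replace(",", ""))
    # Whatever surrounds the number is treated as the currency marker.
    symbol = (text[:match.start()] + text[match.end():]).strip()
    return symbol, amount
```

`Decimal` is used instead of `float` so that amounts such as 1299.99 stay exact in later calculations.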
Troubleshooting format issues
Common problems and solutions
Issue | Cause | Solution |
Missing currency symbols | Extracted as number only | Retrain to capture full text |
Broken dates | Mixed format parsing | Capture as text, parse later |
Lost links | Captured as text only | Retrain to capture URLs |
Garbled special characters | Encoding issues | Use HTML capture |
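The "capture as text, parse later" fix for broken dates can be sketched in Python as follows. The list of formats is an assumption and should be adjusted to the formats that actually appear on the page; month names are matched using the default English locale.

```python
from datetime import datetime

# Assumed formats; extend this list to match the site being extracted.
KNOWN_FORMATS = ["%b %d, %Y", "%B %d, %Y", "%Y-%m-%d", "%d/%m/%Y"]


def parse_date(text):
    """Try each known format in turn; return a date or None."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # not this format, try the next one
    return None
```

Values that come back as `None` are the ones to inspect by hand, which is usually easier than debugging a robot that tried to interpret the date during capture.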
Post-extraction formatting
Sometimes it's better to extract everything and format later:
Extract raw → Process in spreadsheet:
More flexible
Easier updates
Better for complex transformations
Example workflow:
Extract all prices with currency symbols
Export to CSV
Use spreadsheet formulas to standardize
Create clean dataset
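The same workflow can also be scripted instead of using spreadsheet formulas. This is a rough Python sketch that assumes the exported CSV has a "price" column with US-style amounts; the column names and sample rows are hypothetical.

```python
import csv
import io
import re


def standardize_prices(csv_text, column="price"):
    """Add a '<column>_amount' field holding only digits and the decimal point."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    fieldnames = list(reader.fieldnames) + [column + "_amount"]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Keep the raw value and write the cleaned number next to it.
        row[column + "_amount"] = re.sub(r"[^\d.]", "", row[column])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical export for illustration.
raw = 'name,price\nBlue Widget,"$1,299.99"\nRed Widget,$45\n'
clean = standardize_prices(raw)
```

The original price column is left untouched, so the standardization rule can be changed later without extracting the data again.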
