Handling multi-language content

Browse AI can extract data from websites in any language, handling different character sets, writing directions, and regional formats automatically.

Written by Melissa Shires
Updated over a week ago

How Browse AI handles languages

Browse AI extracts content exactly as it appears on the website, regardless of language:

  • Any language: Chinese, Arabic, Russian, Japanese, etc.

  • Any script: Latin, Cyrillic, Arabic, Han, Devanagari, etc.

  • Any direction: Left-to-right (LTR) and right-to-left (RTL)

  • Mixed content: Pages with multiple languages

  • Special characters: Accents, umlauts, tildes, etc.

💡 You don't need special configuration - Browse AI preserves the original text.

Extracting from different language websites

Training your robot

  1. Navigate to the website in any language.

  2. Train normally - click, select, extract.

  3. Label in your language - field names can be in English or any language.

  4. Extract - robot captures text exactly as displayed.

What if the language I want to extract doesn't appear naturally?

  • Train your robot to select the language on the website using a dropdown, link, or button, then continue with the normal training steps.

  • Change the country of an approved robot to trigger a different language.

  • Some websites set the language through the URL. If so, use the URL for your desired language as the Origin URL.

Example multilingual extraction:

Product name (Arabic): منتج رائع
Price (Arabic-Indic numerals): ٥٠٠ ريال
Description (English): Amazing product
Tags (Chinese): 高质量, 快速配送
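
The price in this example uses Arabic-Indic digits, which are extracted as displayed. If a downstream system expects ASCII digits, a post-processing step along these lines could convert them. This is a minimal sketch and not part of Browse AI; the function name `to_ascii_digits` is an assumption made for illustration.

```python
import re
import unicodedata


def to_ascii_digits(text: str) -> str:
    """Replace every Unicode decimal digit (e.g. Arabic-Indic) with its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


# Extracted value, kept exactly as the website displayed it.
price = "٥٠٠ ريال"

normalized = to_ascii_digits(price)

# Pull out the numeric part regardless of where it sits in the string.
match = re.search(r"\d+", normalized)
amount = int(match.group()) if match else None
print(amount)
```

Keeping the original string next to the normalized value makes it easy to audit the conversion later.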

What gets preserved

| Content type | How it's handled |
|---|---|
| Original text | Extracted exactly as shown |
| Character encoding | UTF-8 preserved automatically |
| Number formats | Regional formats maintained |
| Date formats | Kept as displayed |
| Currency symbols | Preserved with text |
| RTL text | Direction maintained |

Common multi-language scenarios

E-commerce sites with multiple languages

Product names in local language, descriptions in English.

  • Extract each field as it appears.

  • Don't try to translate during extraction.

  • Post-process if translation needed.

  • Keep original for reference.
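
The "post-process and keep the original" advice above can be sketched as follows. This is only an illustration: `add_translations` and `demo_translate` are hypothetical names, and the glossary lookup stands in for whatever translation service you actually use.

```python
from typing import Callable


def add_translations(
    record: dict[str, str],
    fields: list[str],
    translate: Callable[[str], str],
) -> dict[str, str]:
    """Return a copy of the record with '<field>_translated' keys added.

    The original fields are left untouched so they remain available for reference.
    """
    result = dict(record)
    for field in fields:
        if field in record:
            result[f"{field}_translated"] = translate(record[field])
    return result


# Stand-in translator for illustration only; swap in a real translation service.
demo_glossary = {"منتج رائع": "Amazing product"}


def demo_translate(text: str) -> str:
    return demo_glossary.get(text, text)


row = {"product_name": "منتج رائع", "description": "Amazing product"}
enriched = add_translations(row, ["product_name"], demo_translate)
```

Translating after extraction, rather than during it, means a bad translation never overwrites the text the robot captured.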

Mixed script content

Example: Japanese sites mixing Kanji, Katakana, and English.

商品名: ノートパソコン (Laptop)
価格: ¥125,000
仕様: Intel Core i7
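
If you need to know which scripts an extracted value contains, for example to route it to the right post-processing step, a rough check is possible with Python's standard library. The sketch below is an approximation: it uses the first word of each character's Unicode name as a script label, and the helper name `scripts_in` is an assumption.

```python
import unicodedata


def scripts_in(text: str) -> set[str]:
    """Return approximate script labels for the letters in the text."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # The first word of the Unicode name approximates the script,
            # e.g. "KATAKANA LETTER NO" or "CJK UNIFIED IDEOGRAPH-5546".
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0].split("-")[0])
    return found


print(sorted(scripts_in("商品名: ノートパソコン (Laptop)")))
# ['CJK', 'KATAKANA', 'LATIN']
```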

Right-to-left languages

Examples include: Arabic, Hebrew, Persian, and Urdu.

  • Text flows right-to-left

  • Numbers still read left-to-right

  • Mixed with LTR elements (URLs, English terms)

Browse AI automatically preserves direction and formatting.
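
Direction matters mostly when you display the extracted text again, for example on a web page. The sketch below shows one way to detect right-to-left content and set the HTML `dir` attribute accordingly. It is an illustration, not a Browse AI feature, and the helper names `contains_rtl` and `wrap_for_html` are assumptions.

```python
import html
import unicodedata

# Unicode bidirectional classes for strong right-to-left characters:
# "R" covers Hebrew-style letters, "AL" covers Arabic-style letters.
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """True if the text contains at least one strong right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def wrap_for_html(text: str) -> str:
    """Wrap the text in a span whose dir attribute matches its content."""
    direction = "rtl" if contains_rtl(text) else "ltr"
    return f'<span dir="{direction}">{html.escape(text)}</span>'


print(wrap_for_html("منتج رائع"))
print(wrap_for_html("Amazing product"))
```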

Regional number and date formats

Different regions display numbers and dates differently. Browse AI extracts these values as displayed, and you can standardize them later if needed:

| Region | Number format | Date format |
|---|---|---|
| US | 1,234.56 | 12/31/2024 |
| Europe | 1.234,56 | 31/12/2024 |
| India | 1,234.56 | 31-12-2024 |
| ISO | 1234.56 | 2024-12-31 |
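
Standardizing later can be as simple as telling your own code which convention each source site uses. The sketch below assumes you know the region for each robot; the `FORMATS` mapping and the helper names are illustrative assumptions, since the extracted data does not state which format it uses.

```python
from datetime import date, datetime
from decimal import Decimal

# Assumed per-source conventions, chosen by you based on the website extracted.
FORMATS = {
    "US": {"thousands": ",", "decimal": ".", "date": "%m/%d/%Y"},
    "Europe": {"thousands": ".", "decimal": ",", "date": "%d/%m/%Y"},
    "India": {"thousands": ",", "decimal": ".", "date": "%d-%m-%Y"},
    "ISO": {"thousands": "", "decimal": ".", "date": "%Y-%m-%d"},
}


def parse_number(text: str, region: str) -> Decimal:
    """Convert a regionally formatted number string into a Decimal."""
    fmt = FORMATS[region]
    cleaned = text.strip()
    if fmt["thousands"]:
        cleaned = cleaned.replace(fmt["thousands"], "")
    return Decimal(cleaned.replace(fmt["decimal"], "."))


def parse_date(text: str, region: str) -> date:
    """Convert a regionally formatted date string into a date object."""
    return datetime.strptime(text.strip(), FORMATS[region]["date"]).date()


print(parse_number("1.234,56", "Europe"))           # 1234.56
print(parse_date("12/31/2024", "US").isoformat())   # 2024-12-31
```

Note that a value such as 01/02/2024 is ambiguous on its own, which is why the region has to come from what you know about the source rather than from the text.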

Post-extraction processing

Exporting multi-language data

CSV export:

  • UTF-8 encoding preserved

  • Open with Excel using UTF-8 option

  • Google Sheets handles automatically

JSON export:

  • Perfect for preserving encoding

  • Maintains all special characters

  • Better for programmatic processing
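
If Excel still misreads a CSV file when it is opened by double-clicking, one common workaround is to re-save the file with a UTF-8 byte order mark (BOM), which Excel uses as a hint for the encoding. The sketch below does this with Python's standard library. The sample rows and helper names are made up for illustration, and Python's `utf-8-sig` codec is what writes the BOM.

```python
import csv
import os
import tempfile


def save_csv_for_excel(rows: list[dict[str, str]], path: str) -> None:
    """Write rows as UTF-8 with a BOM so Excel detects the encoding."""
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def load_csv(path: str) -> list[dict[str, str]]:
    """Read a UTF-8 CSV, with or without a BOM."""
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))


# Made-up sample data standing in for an exported table.
rows = [{"name": "منتج رائع", "tags": "高质量, 快速配送"}]
path = os.path.join(tempfile.mkdtemp(), "export.csv")

save_csv_for_excel(rows, path)

with open(path, "rb") as handle:
    has_bom = handle.read(3) == b"\xef\xbb\xbf"

round_trip = load_csv(path)
```

Reading with `utf-8-sig` is safe for files without a BOM as well, so the same loader works for both cases.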

Common issues

| Problem | Cause | Solution |
|---|---|---|
| Characters show as ??? | Wrong encoding in viewer | Use UTF-8 compatible software |
| Excel shows gibberish | Default encoding wrong | Import as UTF-8 |
| Database errors | Column encoding mismatch | Set database to UTF-8 |
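
To see why these symptoms occur, the short sketch below reproduces the first two in Python. It is purely diagnostic and uses made-up sample text: question marks appear when text is forced into an encoding that lacks the characters, and gibberish appears when UTF-8 bytes are decoded with the wrong encoding.

```python
text = "Café 高质量"

# "???" appears when text is forced into an encoding that lacks the characters.
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossy)     # Caf? ???

# Gibberish (mojibake) appears when UTF-8 bytes are decoded as Windows-1252.
garbled = "Café".encode("utf-8").decode("cp1252")
print(garbled)   # CafÃ©

# If the bytes were only misread, reversing the wrong decode restores the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # Café
```

The first case loses information permanently, while the second is usually recoverable, so it helps to work out which one you are looking at before re-exporting.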
