Can I extract data from PDF files?

Not directly. However, you can convert a PDF to HTML first, then use Browse AI to extract the data from the HTML version.

Written by Nick Simard
Updated over a month ago

Why PDFs aren't directly supported

Browse AI robots are designed to work with websites by interacting with their HTML structure. PDFs use a different format that our robots can't process directly.

PDF extraction workaround

  1. Convert your PDF to HTML format using one of these methods:

    • Online converters like Adobe's PDF to HTML service, Zamzar, or PDF2Go

    • Software tools like Adobe Acrobat Pro (paid) or Calibre (free)

    • Programming libraries if you're technical (e.g., pdf2htmlEX or pdftohtml)

  2. Save or host the HTML file where your robot can access it:

    • Upload it to a web hosting service

    • Store it in a cloud storage service with public access

    • Use a temporary file hosting service

  3. Create a Browse AI robot targeting the HTML version:

    • Enter the URL where your HTML file is hosted

    • Build your robot as you normally would for any webpage

    • Test to ensure the data is being extracted correctly

  4. Extract and export your data in your preferred format (CSV, JSON, etc.)
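
For the programming-library route in step 1, the conversion can be scripted. The following is a minimal sketch, not an official Browse AI method. It assumes Poppler's pdftohtml command-line tool is installed and on your PATH, and the function names and the file name report.pdf are hypothetical. The flags shown (-noframes, -i, -q) are standard pdftohtml options, but the exact name of the output file can differ between versions, so check the output folder after running it.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path: str, output_base: str) -> list[str]:
    """Build the argument list for Poppler's pdftohtml CLI (hypothetical helper)."""
    return [
        "pdftohtml",
        "-noframes",  # write a single HTML file instead of a frameset
        "-i",         # ignore images and keep only the text
        "-q",         # suppress informational messages
        pdf_path,
        output_base,  # base name for the output, without the .html extension
    ]


def convert_pdf_to_html(pdf_path: str, output_base: str) -> Path:
    """Run pdftohtml and return the path where the HTML is expected to be."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found; install Poppler's utilities first")
    subprocess.run(build_pdftohtml_command(pdf_path, output_base), check=True)
    # Assumption: with -noframes the tool writes <output_base>.html.
    return Path(output_base + ".html")


if __name__ == "__main__":
    # Hypothetical input file; replace with your own PDF.
    html_file = convert_pdf_to_html("report.pdf", "report")
    print(f"Converted file expected at: {html_file}")
```

Once the HTML file exists, host it as described in step 2 and point your robot at its URL.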

Important considerations

  • Conversion quality varies: Different tools produce different HTML results. If one converter doesn't work well, try another.

  • Complex formatting may be lost: Tables, charts, and complex layouts might not convert perfectly.

  • Text-heavy PDFs work best: PDFs that are mostly text typically convert more successfully than highly designed documents.
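
One way to compare converters before building a robot is to check how much structure survived the conversion. The sketch below is only an illustration and uses Python's standard library alone. The class and function names are hypothetical, and the counts are a rough signal, not a measure of extraction quality. It counts tables, table rows, and visible text characters in an HTML string, so a converted file that reports zero tables when the PDF clearly had one is a hint to try another tool.

```python
from html.parser import HTMLParser


class ConversionSummary(HTMLParser):
    """Count tables, table rows, and visible text characters in HTML."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.rows = 0
        self.text_chars = 0
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "table":
            self.tables += 1
        elif tag == "tr":
            self.rows += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style content; count only non-whitespace-padded text.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def summarize_html(html: str) -> dict:
    """Return a small summary of the structure found in an HTML string."""
    parser = ConversionSummary()
    parser.feed(html)
    parser.close()
    return {
        "tables": parser.tables,
        "rows": parser.rows,
        "text_chars": parser.text_chars,
    }
```

To use it on a converted file, read the file's contents into a string and pass that string to summarize_html, then compare the results across the converters you tried.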
