
Can I extract data from PDF files?

Not directly. However, you can convert PDFs to HTML first, then use Browse AI to extract the data from the HTML version.

Written by Melissa Shires
Updated over 8 months ago

Why PDFs aren't directly supported

Browse AI robots are designed to work with websites by interacting with their HTML structure. PDFs use a different format that our robots can't process directly.

PDF extraction workaround

  1. Convert your PDF to HTML format using one of these methods:

    • Online converters like Adobe's PDF to HTML service, Zamzar, or PDF2Go

    • Software tools like Adobe Acrobat Pro (paid) or Calibre (free)

    • Command-line tools if you're technical (e.g., pdf2htmlEX or pdftohtml)

  2. Save or host the HTML file where your robot can access it:

    • Upload it to a web hosting service

    • Store it in a cloud storage service with public access

    • Use a temporary file hosting service

  3. Create a Browse AI robot targeting the HTML version:

    • Enter the URL where your HTML file is hosted

    • Build your robot as you normally would for any webpage

    • Test to ensure the data is being extracted correctly

  4. Extract and export your data in your preferred format (CSV, JSON, etc.)
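If you take the technical route in step 1, the conversion can be scripted. The following is a minimal sketch, assuming the pdftohtml tool from Poppler is installed and on your PATH. The function names and the report.pdf file name are placeholders chosen for illustration, and the exact output file name may differ between pdftohtml versions, so check the folder after running it.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_base):
    """Return the pdftohtml argument list for a single-file HTML conversion."""
    # -noframes asks pdftohtml for one HTML file instead of a frameset.
    return ["pdftohtml", "-noframes", str(pdf_path), str(out_base)]


def convert_pdf(pdf_path):
    """Run pdftohtml on pdf_path and return the path where the HTML is expected."""
    pdf_path = Path(pdf_path)
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH; install Poppler first")
    out_base = pdf_path.with_suffix("")  # e.g. report.pdf -> report
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_command(pdf_path, out_base), check=True)
    # Assumption: pdftohtml writes <out_base>.html when -noframes is used.
    return pdf_path.with_suffix(".html")


# Example call (placeholder file name):
# html_path = convert_pdf("report.pdf")
```

The resulting HTML file still needs to be hosted at a URL your robot can reach, as described in step 2.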

Important considerations

  • Conversion quality varies: Different tools produce different HTML results. If one converter doesn't work well, try another.

  • Complex formatting may be lost: Tables, charts, and complex layouts might not convert perfectly.

  • Text-heavy PDFs work best: PDFs that are mostly text typically convert more successfully than highly designed documents.
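Before building a robot, it can help to check whether a converter preserved the text and tables you care about. The sketch below uses only Python's standard library to count table elements and visible text characters in a converted file. The class and function names are illustrative, and these counts are only a rough signal, not a measure of conversion quality.

```python
from html.parser import HTMLParser


class ConversionStats(HTMLParser):
    """Counts <table> tags and visible text characters in an HTML document."""

    def __init__(self):
        super().__init__()
        self.table_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style contents; count everything else.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def summarize_html(html_text):
    """Return table and text counts for a converted HTML string."""
    parser = ConversionStats()
    parser.feed(html_text)
    parser.close()
    return {"tables": parser.table_count, "text_chars": parser.text_chars}
```

Running summarize_html on the output of two different converters gives a quick comparison: a result with zero tables for a PDF that clearly contains tables suggests trying another tool.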
