ParseHawk is a local-first document AI solution that extracts structured data from unstructured documents. It processes PDFs, scans, images, text files and Markdown, and returns validated JSON through a REST API, a CLI and a Web UI.
Needs addressed:
Public administrations and regulated organizations often receive forms, reports, referrals, invoices, legacy PDFs and scanned documents that need to be transformed into structured data. These documents may contain sensitive personal, medical, financial or administrative data and should not be sent to third-party AI APIs.
Features:
- Extract structured JSON from PDFs, scans, images, text files and Markdown
- Define custom extraction schemas
- Run locally by default on macOS (Apple Silicon) and Linux (NVIDIA GPUs)
- Access through a REST API, CLI and Web UI
- Released under the open-source Apache-2.0 licence
- Serve models locally using open-source infrastructure
Intended audience:
Developers, public-sector IT teams, municipalities, healthcare IT providers, DMS/ECM vendors and regulated organizations that need document extraction while retaining control over sensitive data.
Reuse:
ParseHawk can be installed from its GitHub repository and run locally. Users can define schemas and instructions for the document types they need to process, then integrate the REST API into existing document, case-management or data workflows.
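As a rough illustration of that workflow, the Python sketch below defines a small invoice schema and builds an HTTP request that could carry it to a locally running instance. The base URL, the `/extract` path and the request field names (`schema`, `instructions`, `document_base64`) are assumptions made for this example and are not taken from ParseHawk's documentation; the schema's property names are likewise illustrative.

```python
import base64
import json
import urllib.request

# Hypothetical values: consult the ParseHawk documentation for the real
# base URL, endpoint path and request field names.
BASE_URL = "http://localhost:8000"
EXTRACT_PATH = "/extract"

# A custom extraction schema for invoices (illustrative property names).
INVOICE_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total_amount": {"type": "number", "minimum": 0},
    },
    "required": ["invoice_number", "total_amount"],
}


def build_extraction_request(
    document: bytes, schema: dict, instructions: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying document and schema."""
    payload = {
        "schema": schema,
        "instructions": instructions,
        # Binary documents are base64-encoded so they fit in a JSON body.
        "document_base64": base64.b64encode(document).decode("ascii"),
    }
    return urllib.request.Request(
        BASE_URL + EXTRACT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_extraction_request(
    b"%PDF-1.7 placeholder bytes",
    INVOICE_SCHEMA,
    "Extract the invoice number, issue date and total amount.",
)

# To send the request to a running local instance:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Building the request separately from sending it keeps the sketch runnable without a server and makes the payload easy to inspect before it is wired into a document or case-management workflow.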
Standards and interoperability:
ParseHawk returns JSON and validates the output against schemas written in JSON Schema Draft 2020-12. It exposes an HTTP API suitable for integration into existing digital public-service workflows.
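A consuming system may want to re-check extracted JSON before storing it. The Python sketch below shows the idea with a deliberately tiny checker that covers only `required` and per-property `type` on a flat object; it is not a JSON Schema Draft 2020-12 implementation and does not reflect how ParseHawk validates internally. The function name `check_flat_object` and the sample data are hypothetical. For full Draft 2020-12 validation, a dedicated library such as the `jsonschema` package (which provides `Draft202012Validator`) is the better choice.

```python
# Maps a subset of JSON Schema type names to Python types.
_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_flat_object(instance: dict, schema: dict) -> list:
    """Return error messages; an empty list means the check passed.

    Covers only `required` and per-property `type` on a flat object.
    """
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required property: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in instance or "type" not in rules:
            continue
        value = instance[name]
        type_name = rules["type"]
        # bool is a subclass of int in Python, so exclude it from "number".
        is_bool_as_number = type_name == "number" and isinstance(value, bool)
        if is_bool_as_number or not isinstance(value, _TYPES[type_name]):
            errors.append(f"{name}: expected {type_name}")
    return errors


# Illustrative schema and extraction results (not real ParseHawk output).
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

good = {"invoice_number": "INV-001", "total_amount": 119.0}
bad = {"total_amount": "119.00"}

good_errors = check_flat_object(good, schema)
bad_errors = check_flat_object(bad, schema)
```

Returning a list of messages instead of raising on the first failure lets a caller log every problem with a document in one pass.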
Policy contribution:
ParseHawk supports European digital sovereignty, open-source reuse and privacy-conscious AI adoption by enabling local document extraction without dependence on external AI APIs.