A fast, helpful, and open-source document parser
Updated Jun 8, 2026 - Rust
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
Pure Rust PDF library for AI/RAG: structure-aware chunking, no ML, no C deps.
3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.
Rust CLI implementing the Recursive Language Model (RLM) pattern for Claude Code. Process documents 100x larger than context windows through intelligent chunking, SQLite persistence, and recursive sub-LLM orchestration.
The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.
Fast pure-Rust PDF extraction library and CLI by Clark Labs Inc. — 10–50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction.
Cloud-native document extraction platform — SaaS at kreuzberg.dev or self-host on any Kubernetes cluster. 90+ formats, REST API, webhooks. Built on Kreuzberg.
Native Rust port of IBM's Docling document processing library. Convert PDF, DOCX, XLSX, PPTX, HTML, Markdown, and CSV to structured data for RAG applications.
Convert, extract, and process documents from the command line
PDF- and image-to-text converter with XFA forms support. It extracts embedded text and/or renders pages into upscaled images so that OCR can handle complex layouts and scans. Single static binary, reads stdin/writes stdout. Built for n8n, Power Automate, and containerized workflows.
Pure Rust OCR inference engine powered by GLM-OCR vision-language model. No Python. No PyTorch. Just cargo build and go.
Logstash-like RAG ingestion daemon for local files, chunking, embeddings, and Qdrant indexing.
Local, private AI for legal documents: summarize, translate, anonymize, and simplify sensitive material directly on your Mac. Nothing leaves the computer. (macOS · Ollama)
📄 Ingest documents into structured datasets for LLMs, ensuring numeric integrity and easy export across multiple frameworks with doc2dataset.
Convert scans of handwritten notes to PDF.
Local-first AI document agent — generate, read, modify, and convert Word/Excel/PPT/PDF/Markdown through natural language conversations. Built with Tauri 2 + Rust + React.