document-processing

Here are 30 public repositories matching this topic...

run-llama / liteparse

A fast, helpful, and open-source document parser

pdf ocr text-extraction ocr-recognition pdf-parser document-processing document-ocr

Updated Jun 8, 2026
Rust

yfedoseev / pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

python markdown rust fast pdf text-extraction data-extraction pdf-generation pdf-to-text pdf-library pdf-parser document-processing rag pyo3 pdf-editor image-extraction llm pdf-to-markdown

Updated Jun 7, 2026
Rust

bzsanti / oxidizePdf

Pure Rust PDF library for AI/RAG: structure-aware chunking, no ML, no C deps.

Updated Jun 8, 2026
Rust

3DCF-Labs / doc2dataset

3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.

nlp rust cli machine-learning ocr evaluation dataset-generation data-pipeline document-processing fine-tuning rag document-understanding llm 3dcf doc2dataset numguard

Updated Feb 10, 2026
Rust

zircote / rlm-rs

Rust CLI implementing the Recursive Language Model (RLM) pattern for Claude Code. Process documents 100x larger than context windows through intelligent chunking, SQLite persistence, and recursive sub-LLM orchestration.

Updated Jun 8, 2026
Rust

yfedoseev / office_oxide

The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.

Updated Jun 4, 2026
Rust

clark-labs-inc / pdfsink-rs

Fast pure-Rust PDF extraction library and CLI by Clark Labs Inc. — 10–50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction.

rust pdf text-extraction rust-library pdf-to-text rust-crate table-extraction pdf-parser document-processing layout-analysis pdf-to-json pdf-extraction pdfplumber document-ai clark-labs

Updated Jun 6, 2026
Rust

kreuzberg-dev / kreuzberg-cloud

Cloud-native document extraction platform — SaaS at kreuzberg.dev or self-host on any Kubernetes cluster. 90+ formats, REST API, webhooks. Built on Kreuzberg.

api kubernetes rust pdf microservices ocr nextjs helm postgresql self-hosted saas nats text-extraction cloud-native document-processing document-extraction busl axum kreuzberg

Updated Jun 8, 2026
Rust

carles-abarca / docling-rs

Native Rust port of IBM's Docling document processing library. Convert PDF, DOCX, XLSX, PPTX, HTML, Markdown, and CSV to structured data for RAG applications.

text-extraction document-processing pdf-document-processor xlsx-parser docx-converter rag-pipeline docling

Updated Dec 15, 2025
Rust

KimSeogyu / undocx

Extract clean, structured Markdown from DOCX for LLM and RAG contexts.

python markdown rust converter docx document-processing rag llm

Updated Mar 23, 2026
Rust

dipankar / datalab-cli

Convert, extract, and process documents from the command line

rust cli developer-tools data-extraction crates-io ai-agents datalab document-processing agent-tools agentic-ai agent-first json-native dipankar

Updated Apr 1, 2026
Rust

wmahfoudh / crabocr

PDF- and image-to-text converter with XFA forms support. It extracts embedded text and/or renders pages into upscaled images for OCR to handle complex layouts and scans. Single static binary, reads stdin/writes stdout. Built for n8n, Power Automate, and containerized workflows.

Updated Feb 20, 2026
Rust

saravananravi08 / glm-ocr-rs

Pure Rust OCR inference engine powered by GLM-OCR vision-language model. No Python. No PyTorch. Just cargo build and go.

rust ocr cuda candle vlm document-processing vision-language-model glm-ocr

Updated Mar 4, 2026
Rust

laofahai / linch-docx-rs

A reliable DOCX reading and writing library for Rust with round-trip preservation

rust word office docx ooxml document-processing

Updated Mar 20, 2026
Rust

ragloom / ragloom

Logstash-like RAG ingestion daemon for local files, chunking, embeddings, and Qdrant indexing.

rust embeddings ingestion document-processing rag vector-database qdrant retrieval-augmented-generation

Updated Jun 8, 2026
Rust

johanolofsson72 / juradrop

Local, private AI for legal documents: summarize, translate, anonymize, and simplify sensitive material directly on your Mac. Nothing leaves the computer. (macOS · Ollama)

react desktop-app macos rust privacy typescript offline-first swedish gdpr document-processing tauri privacy-by-design legal-tech llm local-llm local-ai ollama juridik

Updated Jun 8, 2026
Rust

reisel-g / doc2dataset

📄 Ingest documents into structured datasets for LLMs, ensuring numeric integrity and easy export across multiple frameworks with doc2dataset.

nlp rust cli machine-learning ocr big-data text evaluation dataset document data-pipeline document-processing fine-tuning interleaved multimodal document-understanding 3dcf numguard

Updated Jun 8, 2026
Rust

gumienny / cn

Convert scans of handwritten notes to PDF.

rust cli entropy notes image-processing clean image-thresholding k-means document-processing separation foreground-background tsallis

Updated Sep 5, 2018
Rust

System32manager / liteparse-574

A fast, helpful, and open-source document parser

pdf ocr text-extraction ocr-recognition pdf-parser document-processing document-ocr

Updated May 31, 2026
Rust

XuMingKe-06 / DocAgent

Local-first AI document agent — generate, read, modify, and convert Word/Excel/PPT/PDF/Markdown through natural language conversations. Built with Tauri 2 + Rust + React.

react desktop-app markdown rust pdf excel word powerpoint document-processing tauri ai-agent llm tool-calling

Updated Jun 8, 2026
Rust
