Enterprise-Grade Forensic Data Analysis & Logic Engine
Transparent, Traceable, AI-Powered Document Intelligence
Features • Installation • Quick Start • Documentation • Contributing
- Overview
- Key Features
- System Requirements
- Installation
- Quick Start
- Usage Guide
- Architecture
- Troubleshooting
- Use Cases
- Contributing
- License
- Acknowledgments
LLM-CerebroScope is a forensic data analysis platform that combines Large Language Models (LLMs) with vector search. Designed for professionals who need transparent, auditable, and traceable document analysis, it provides:
- Document Intelligence: Multi-format ingestion with intelligent chunking
- Semantic Retrieval: Advanced vector search using ChromaDB
- LLM-Powered Analysis: Integration with Ollama for local, privacy-preserving inference
- Source Attribution: Automatic citation tracking with chunk-level granularity
- Conflict Detection: AI-powered contradiction and inconsistency identification
- Reliability Scoring: Metadata-based source credibility assessment
✅ Transparency: Every answer includes traceable citations to source documents
✅ Privacy: Fully local processing with Ollama (no cloud dependencies)
✅ Accuracy: Reliability scoring and conflict detection ensure high-quality results
✅ Flexibility: Dual interfaces (CLI & Web GUI) for different workflows
✅ Extensibility: Modular architecture designed for customization
- Multi-Format Support: PDF, CSV, TXT, XLSX/XLS
- Intelligent Chunking: Configurable chunk size (default: 800 characters) and overlap (default: 100 characters)
- Metadata Preservation: Source, page numbers, timestamps, and file modification dates
- Incremental Ingestion: Automatic detection of new/modified files
- Vector Database: ChromaDB-powered persistent storage
- Embedding Generation: Automatic text embeddings using default models
- Source Filtering: Search within specific documents or collections
- Top-K Retrieval: Configurable result ranking (default: 5 chunks)
- Model Agnostic: Works with any Ollama-compatible LLM
- Citation Tracking: Automatic `[ID: xxxxxxxx]` format citations in responses
- Context Awareness: Prioritizes newer sources when conflicts arise
- Prompt Engineering: Optimized forensic analysis prompts
- Conflict Detection: LLM-powered contradiction identification
- Reliability Scoring: Heuristic algorithm considering:
  - File format (structured data preferred)
  - Document recency (time-based decay)
  - Metadata completeness
- Evidence Heatmaps: Visual highlighting of used vs. ignored chunks
- Interactive Graphs: Knowledge graph visualization using spaCy NER
- Entity Extraction: Automatic identification of organizations, people, locations, dates, monetary values
- Evidence Cards: Color-coded display (green = used, gray = ignored)
- Markdown Reports: Comprehensive, timestamped analysis reports
- Rich CLI: Terminal interface built with the Rich library
- Streamlit GUI: Modern web dashboard with drag-and-drop file upload
- Feature Parity: Both interfaces support all core functionality
- Python: 3.8+
- RAM: 4 GB minimum (8 GB recommended)
- Storage: 500 MB + space for documents and database
- Ollama: Install from ollama.ai
- Optional: spaCy model for graph visualization (`python -m spacy download en_core_web_sm`)
# 1. Clone repository
git clone https://github.com/oskarbrzycki/llm-cerebroscope.git
cd llm-cerebroscope # git names the directory after the repository URL
# 2. Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install ollama chromadb streamlit streamlit-agraph rich pandas pypdf spacy
python -m spacy download en_core_web_sm
# 4. Setup Ollama
ollama serve # Keep running in separate terminal
ollama pull llama3 # Download a model
# 5. Initialize data directory
mkdir -p data/raw

- Add documents to the `data/raw/` directory
- Launch an interface:
  - CLI: `python -m cerebro.cli`
  - GUI: `streamlit run cerebro/gui.py` → http://localhost:8501
- Query documents: Ask questions in natural language
- Review results: Check `[ID: xxxxxxxx]` citations and conflicts, and download reports
python -m cerebro.cli
Workflow: Select model → Auto-ingest documents → Query in natural language → View formatted results → Type `exit` to quit

streamlit run cerebro/gui.py
Features: Model selection, file upload, source filtering, chat interface, graph visualization, evidence cards, report download
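The evidence cards separate retrieved chunks that the answer cites from those it ignores. The sketch below shows one way that split could be derived from the `[ID: xxxxxxxx]` citations in a response. It assumes chunk IDs are eight hexadecimal characters (the README only shows the placeholder `xxxxxxxx`), and `split_evidence` is a hypothetical helper, not part of the project's API.

```python
import re

# Assumption: chunk IDs are 8 hex characters; the README only shows "xxxxxxxx".
CITATION_RE = re.compile(r"\[ID:\s*([0-9a-fA-F]{8})\]")


def split_evidence(answer, retrieved_ids):
    """Return (used, ignored) chunk IDs, preserving retrieval order."""
    cited = {match.lower() for match in CITATION_RE.findall(answer)}
    used = [cid for cid in retrieved_ids if cid.lower() in cited]
    ignored = [cid for cid in retrieved_ids if cid.lower() not in cited]
    return used, ignored


# Illustrative answer text and IDs (made up for this example).
answer = "Revenue rose 4% [ID: a1b2c3d4], contradicting the draft [ID: 9f8e7d6c]."
used, ignored = split_evidence(answer, ["a1b2c3d4", "deadbeef", "9f8e7d6c"])
print("used:", used)
print("ignored:", ignored)
```

Chunks in `used` would map to the green cards and chunks in `ignored` to the gray ones.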
┌─────────────────────────────────────────────────────────┐
│                  User Interface Layer                   │
├──────────────────────┬──────────────────────────────────┤
│      Rich CLI        │        Streamlit GUI             │
└──────────┬───────────┴────────────┬─────────────────────┘
           │                        │
           └────────────┬───────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                    Application Core                     │
├──────────────┬──────────────┬───────────────────────────┤
│  Ingester    │ VectorStore  │      Tracer               │
│              │              │                           │
│ • PDF        │ • ChromaDB   │ • LLM Query Analysis      │
│ • CSV/TXT    │ • Embeddings │ • Citation Generation     │
│ • Excel      │ • Metadata   │ • Context Formatting      │
└──────────────┴──────────────┴───────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│              Validation & Reporting Layer               │
├──────────────┬──────────────┬───────────────────────────┤
│  Validator   │  Reporter    │      Graph                │
│              │              │                           │
│ • Conflicts  │ • Markdown   │ • Entity Extraction       │
│ • Reliability│ • Timestamps │ • Knowledge Graphs        │
│ • Scoring    │ • Evidence   │ • Visualization           │
└──────────────┴──────────────┴───────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                    External Services                    │
├──────────────────────┬──────────────────────────────────┤
│     Ollama API       │         ChromaDB                 │
│  (LLM Inference)     │    (Vector Storage)              │
└──────────────────────┴──────────────────────────────────┘
┌──────────────┐
│  Documents   │  (PDF/CSV/TXT/XLSX)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Ingester   │  → Chunks (800 chars, 100 overlap)
└──────┬───────┘    Metadata (source, page, timestamp)
       │
       ▼
┌──────────────┐
│ VectorStore  │  → Embeddings → ChromaDB
└──────┬───────┘
       │
       │ Query
       ▼
┌──────────────┐
│ VectorStore  │  → Top-K Retrieval → Chunks
└──────┬───────┘
       │
       ├─────────────────┐
       ▼                 ▼
┌──────────────┐  ┌──────────────┐
│   Tracer     │  │  Validator   │
│  (Analysis)  │  │ (Conflicts)  │
└──────┬───────┘  └──────┬───────┘
       │                 │
       └────────┬────────┘
                ▼
        ┌──────────────┐
        │   Reporter   │  → Markdown Report
        └──────────────┘
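The Ingester step in the flow above splits each document into 800-character chunks with a 100-character overlap. The following is a minimal sketch of that kind of sliding-window chunking, not the project's actual implementation: `chunk_text` is a hypothetical name, and splitting on raw character offsets (rather than sentence or page boundaries) is an assumption.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=100):
    """Split text into character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 2000 characters of repeating digits, so the overlap is easy to verify.
sample = "".join(str(i % 10) for i in range(2000))
pieces = chunk_text(sample)
print([len(piece) for piece in pieces])
```

Each chunk starts 700 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.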
- Ingester: Document processing and chunking (PDF, CSV, TXT, Excel)
- CerebroVectorStore: ChromaDB vector database management
- CerebroTracer: LLM-powered query analysis with citation generation
- CerebroValidator: Conflict detection and reliability scoring
- CerebroReporter: Markdown report generation
- CerebroGraph: Knowledge graph visualization (spaCy NER)
- CerebroFormatter: Rich CLI formatting
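Reliability scoring is described as a heuristic over file format (structured data preferred), document recency (time-based decay), and metadata completeness. The sketch below illustrates how three such signals could be combined into one score. The weights, the one-year half-life, the per-format values, and the metadata field names are all illustrative assumptions, and `reliability_score` is a hypothetical function rather than the CerebroValidator API.

```python
from datetime import datetime, timezone

# Every constant below is an assumption chosen for illustration.
FORMAT_SCORES = {".csv": 1.0, ".xlsx": 1.0, ".xls": 1.0, ".pdf": 0.7, ".txt": 0.5}
EXPECTED_FIELDS = ("source", "page", "timestamp", "modified")
HALF_LIFE_DAYS = 365.0


def reliability_score(extension, modified, metadata, now=None):
    """Combine format, recency, and metadata completeness into a 0..1 score."""
    now = now or datetime.now(timezone.utc)
    format_score = FORMAT_SCORES.get(extension.lower(), 0.3)
    age_days = max((now - modified).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / HALF_LIFE_DAYS)  # halves every HALF_LIFE_DAYS
    present = sum(1 for name in EXPECTED_FIELDS if metadata.get(name) not in (None, ""))
    completeness = present / len(EXPECTED_FIELDS)
    return 0.4 * format_score + 0.4 * recency + 0.2 * completeness


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
fresh_csv = reliability_score(
    ".csv", now,
    {"source": "a.csv", "page": 1, "timestamp": "t", "modified": "m"},
    now=now,
)
old_txt = reliability_score(
    ".txt", datetime(2022, 1, 1, tzinfo=timezone.utc), {"source": "b.txt"}, now=now
)
print(round(fresh_csv, 2), round(old_txt, 2))
```

A weighted sum keeps each signal's contribution easy to audit, which fits the tool's emphasis on traceability. A multiplicative form would penalize a single weak signal more heavily.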
| Issue | Solution |
|---|---|
| Ollama connection error | Run `ollama serve` in a separate terminal |
| No models available | Download one with `ollama pull llama3` |
| spaCy model missing | Run `python -m spacy download en_core_web_sm` |
| ChromaDB errors | Reset the database: `rm -rf data/chroma_db` |
| Import errors | Reinstall dependencies: `pip install -r requirements.txt` |
| Memory issues | Reduce `chunk_size` or process in batches |
- Batch Processing: Process documents in smaller batches

      # Process files in batches of 10
      from more_itertools import chunked  # assumed source of chunked(); pip install more-itertools

      for batch in chunked(files, 10):
          chunks = ingester.ingest_directory(batch)
          vector_db.add_chunks(chunks)

- Optimize Chunking: Adjust chunk size based on document type

      # For technical documents
      ingester = Ingester(chunk_size=1000, chunk_overlap=150)
      # For structured data (CSV)
      ingester = Ingester(chunk_size=500, chunk_overlap=50)

- Database Indexing: ChromaDB indexes automatically, but you can:
  - Use persistent storage (default)
  - Monitor database size with `du -sh data/chroma_db`

- Model Selection: Use smaller models for faster inference

      # Faster but less accurate
      ollama pull phi
      # Balanced
      ollama pull llama3
      # Slower but more accurate
      ollama pull mistral
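One way to provide a `chunked` helper like the one used in the batch-processing tip, using only the standard library and compatible with Python 3.8. This is a sketch: the file names are made up, and the real ingestion calls are left out because their signatures are not documented here.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical file list; in practice this would come from data/raw/.
files = [f"doc_{i}.pdf" for i in range(25)]
batch_sizes = [len(batch) for batch in chunked(files, 10)]
print(batch_sizes)
```

Because the helper is a generator, only one batch is held in memory at a time, which is the point of batching on memory-constrained machines.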
- Use Production Ollama: Deploy Ollama as a service
- Database Backup: Regularly back up `data/chroma_db`
- Monitor Resources: Track CPU, RAM, and disk usage
- Caching: Implement result caching for frequent queries
- Load Balancing: Use multiple Ollama instances for high traffic
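The caching tip can be prototyped with `functools.lru_cache`. In this sketch `run_analysis` is a hypothetical stand-in for the real retrieval and LLM call, and the normalization rule (lowercase, collapsed whitespace) is an assumption about which queries should count as identical.

```python
from functools import lru_cache


def run_analysis(question, model):
    """Stand-in for the expensive retrieval + LLM call (hypothetical)."""
    run_analysis.calls += 1
    return f"[{model}] answer to: {question}"


run_analysis.calls = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=256)
def _cached(normalized_question, model):
    return run_analysis(normalized_question, model)


def cached_query(question, model="llama3"):
    """Normalize case and whitespace so near-identical queries share one entry."""
    return _cached(" ".join(question.lower().split()), model)


first = cached_query("What changed in Q3?")
second = cached_query("  what changed   in q3? ")
print(run_analysis.calls)
```

Cached answers go stale when the document set changes, so `_cached.cache_clear()` should be called after each ingestion run.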
- Legal Analysis: Search case files, identify precedents, detect contradictions
- Research Review: Analyze papers, detect conflicting methodologies, extract entities
- Business Intelligence: Query financial reports, extract metrics, generate insights
- Compliance Auditing: Check policy violations, generate audit trails
- Knowledge Base: Internal Q&A system with traceable citations
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m "Add amazing feature"`
- Push to the branch: `git push origin feature/amazing-feature`
- Open a Pull Request
Code Style: Follow PEP 8, use type hints, add docstrings.
This project is licensed under the MIT License - see the LICENSE file for details.
- Ollama - Local LLM inference engine
- ChromaDB - Open-source vector database
- Streamlit - Rapid web app development
- Rich - Beautiful terminal formatting
- spaCy - Natural language processing
Built with ❤️ for professionals who need transparent, traceable, and auditable AI-powered document analysis.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made by the CerebroScope Team
Empowering transparent AI-powered document intelligence