Vector DB setup is fairly standardized now, but getting high-quality, consistent text + metadata into it still takes a lot of brittle glue code. ragctl aims to make that “pre-vector” step repeatable: turn messy documents into retrieval-ready chunks in a few commands.
Features
• Multi-format input: PDF, DOCX, HTML, images
• OCR for scanned/image-based docs
• Semantic chunking (LangChain)
• Batch runs with retries + error handling
• Output: direct ingestion into Qdrant (for now)
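For anyone unfamiliar with what the "pre-vector" step involves, here is a rough, stdlib-only Python sketch of two of the ideas in the feature list: chunks that carry metadata, and a batch run that retries failures and collects errors instead of aborting. This is not ragctl's code or API. The names `Chunk`, `chunk_text`, `run_batch`, and `load` are hypothetical, and the paragraph packing is only a naive stand-in for LangChain-based semantic chunking.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One retrieval-ready unit: text plus the metadata stored alongside it."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_text(text, source, max_chars=200):
    """Naive stand-in for semantic chunking: pack paragraphs up to max_chars."""
    chunks, buf = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if buf and len(buf) + len(para) + 1 > max_chars:
            # Current buffer is full: close it and start a new chunk.
            chunks.append(buf)
            buf = para
        else:
            buf = f"{buf}\n{para}" if buf else para
    if buf:
        chunks.append(buf)
    return [
        Chunk(c, {"source": source, "chunk_index": i})
        for i, c in enumerate(chunks)
    ]


def run_batch(docs, load, max_attempts=3, base_delay=0.0):
    """Chunk every document; retry failures and record errors per document.

    `load` is a hypothetical callable that returns the extracted text for a
    document name (in a real pipeline this is where parsing/OCR would happen).
    """
    results, errors = {}, {}
    for name in docs:
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = chunk_text(load(name), source=name)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this document but keep the batch going.
                    errors[name] = repr(exc)
                else:
                    # Exponential backoff between attempts.
                    time.sleep(base_delay * 2 ** (attempt - 1))
    return results, errors
```

In this sketch, one unreadable file ends up in `errors` while the rest of the batch still produces chunks, and each chunk's `metadata` dict is the kind of payload a vector store such as Qdrant can hold next to the embedding.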
Looking for feedback
• DX: is the CLI intuitive?
• Performance / edge cases: weird PDFs, mixed layouts, tables
• Roadmap: which connectors (S3, Slack, Notion) or vector stores should be next?
Repo: https://github.com/datallmhub/ragstudio
Happy to answer questions about the architecture and chunking approach.