Building a Vision-Language API to Convert PDFs into Markdown with SmolDocling
Building a blazing-simple API on top of SmolDocling, an ultra-compact vision-language model that turns messy PDFs into clean, structured Markdown: tables, equations, code blocks, and all.
Have you ever wished you could convert a messy PDF into a clean, structured Markdown, complete with tables, lists, code, equations, and all, with one simple API call?
Well, now you can. Thanks to SmolDocling, an ultra-compact vision-language model, we’ve built a blazing-simple API that ingests a PDF or image and returns beautifully structured Markdown.
What is SmolDocling?
SmolDocling is a 256M parameter document understanding model that can:
- Understand full pages of documents as images
- Capture visual structure like tables, equations, code blocks, forms, and charts
- Output a universal markup format called DocTags
- Convert PDFs or scans into DocTags, which we then parse into Markdown
Unlike traditional OCR + NLP pipelines, SmolDocling is end-to-end vision-to-text. And it’s small enough to run on a Mac.
How It Works Under the Hood
- PDF to Image: We use pdf2image to render pages as high-quality PNGs.
- Model Inference: The page image is paired with a prompt (“Convert this page to docling.”) and passed to SmolDocling.
- DocTags Parsing: The model returns DocTags (like HTML/XML). We parse it with docling_core.
- Markdown Output: We convert the document into Markdown via
doc.export_to_markdown().
What SmolDocling Can Understand
| Category | Capabilities |
|---|---|
| Text structure | Lists, Headings, Code blocks |
| Visual elements | Tables, Equations, Charts (text labels) |
| Document layouts | Multi-column, Business forms |
It’s trained on diverse real-world document types, not just academic papers.
Local Setup
Cloning & Running the Repository
git clone https://github.com/mustafa-zidan/document-converter.git
cd document-converter
Set Up the Environment
./setup.sh
Also install Poppler for pdf2image:
# macOS
brew install poppler
# Ubuntu
sudo apt install poppler-utils
Run the API
python run.py
Test It
python examples/client_example.py --api-version=v2 your_document.pdf
Tips for Usage
- SmolDocling is lightweight, but still uses a transformer. Keep PDF pages under 1500px wide for speed.
- On Mac, it runs with Apple MPS (Metal GPU); not as fast as CUDA, but it works.
- Best results on scanned documents, forms, technical reports, and structured layouts.
Anyone who’s tried pulling clean data out of PDFs, scanned reports, or invoices knows how painful it gets.
SmolDocling changes that. It’s an end-to-end vision-language model that, paired with a simple Python API, lets you:
- Convert PDFs straight into Markdown
- Keep tables, lists, equations, and the rest of the document structure intact
- Skip fragile OCR pipelines — one model handles it all
- Run it however you like: locally, on Apple Silicon, or scaled up on GPUs
Whether you’re building a document ingestion system, automating reports, or just wrestling content out of messy PDFs, this gives you a solid starting point.