Building a Vision-Language API to Convert PDFs into Markdown with SmolDocling

Have you ever wished you could convert a messy PDF into a clean, structured Markdown, complete with tables, lists, code, equations, and all, with one simple API call?

Well, now you can. Thanks to SmolDocling, an ultra-compact vision-language model, we’ve built a blazing-simple API that ingests a PDF or image and returns beautifully structured Markdown.

What is SmolDocling?

SmolDocling is a 256M parameter document understanding model that can:

Understand full pages of documents as images
Capture visual structure like tables, equations, code blocks, forms, and charts
Output a universal markup format called DocTags
Convert PDFs or scans into DocTags, which we then parse into Markdown

Unlike traditional OCR + NLP pipelines, SmolDocling is end-to-end vision-to-text. And it’s small enough to run on a Mac.

How It Works Under the Hood

PDF to Image: We use pdf2image to render pages as high-quality PNGs.
Model Inference: The page image is paired with a prompt (“Convert this page to docling.”) and passed to SmolDocling.
DocTags Parsing: The model returns DocTags (like HTML/XML). We parse it with docling_core.
Markdown Output: We convert the document into Markdown via doc.export_to_markdown().

What SmolDocling Can Understand

Category	Capabilities
Text structure	Lists, Headings, Code blocks
Visual elements	Tables, Equations, Charts (text labels)
Document layouts	Multi-column, Business forms

It’s trained on diverse real-world document types, not just academic papers.

Local Setup

Cloning & Running the Repository

git clone https://github.com/mustafa-zidan/document-converter.git
cd document-converter

Set Up the Environment

./setup.sh

Also install Poppler for pdf2image:

# macOS
brew install poppler
# Ubuntu
sudo apt install poppler-utils

Run the API

python run.py

Test It

python examples/client_example.py --api-version=v2 your_document.pdf

Tips for Usage

SmolDocling is lightweight, but still uses a transformer. Keep PDF pages under 1500px wide for speed.
On Mac, it runs with Apple MPS (Metal GPU); not as fast as CUDA, but it works.
Best results on scanned documents, forms, technical reports, and structured layouts.

Anyone who’s tried pulling clean data out of PDFs, scanned reports, or invoices knows how painful it gets.

SmolDocling changes that. It’s an end-to-end vision-language model that, paired with a simple Python API, lets you:

Convert PDFs straight into Markdown
Keep tables, lists, equations, and the rest of the document structure intact
Skip fragile OCR pipelines — one model handles it all
Run it however you like: locally, on Apple Silicon, or scaled up on GPUs

Whether you’re building a document ingestion system, automating reports, or just wrestling content out of messy PDFs, this gives you a solid starting point.