
Docling: Convert PDFs to AI-ready data for RAG

Most enterprise data lives in PDFs, DOCX, and HTML. Here’s how Docling turns messy, unstructured files into clean, AI-ready context that improves RAG accuracy and reliability.

Tech Team
September 1, 2025
10 min read

Estimates consistently suggest that most organizational data—often cited at 80–90%—is unstructured. Think PDFs, DOCX, and HTML files stuffed with tables, headers, images, and footnotes. Generative AI and retrieval‑augmented generation (RAG) can only be as accurate as the context you feed them, and these formats don’t always translate cleanly into high‑quality, machine‑consumable text. The result is brittle pipelines, garbled tables, and hallucination‑prone answers.

Docling is an open‑source project designed to solve that last‑mile problem. It parses common document formats and reconstructs a clean, structured representation—complete with hierarchy and provenance—so your RAG system can retrieve the right chunks and ground answers reliably. This guide explains how it works, why it matters, how to get started, and how it compares to other options in the ecosystem.

Why unstructured documents break RAG

Traditional text extraction rarely preserves structure. Consider cross‑page tables, multi‑column layouts, nested lists, image captions, or footnotes. Off‑the‑shelf OCR or naive PDF‑to‑text conversion typically loses cell boundaries, column order, and reading order. That hurts retrieval quality, increases chunking errors, and ultimately degrades answer accuracy.

  • Tables that span pages: Row/column boundaries get lost, breaking semantic continuity.
  • Complex layouts: Multi‑column text can interleave incorrectly, scrambling reading order.
  • Images and figures: Captions, legends, and referenced labels are often dropped.
  • Annotations and lists: Bullets, headings, and footnotes frequently flatten into ambiguous text.
  • Compliance constraints: Many teams cannot send sensitive documents to third‑party cloud services.

For background on just how much critical information lives outside databases, see this overview of unstructured data. For an accessible primer on retrieval‑augmented generation itself, this RAG explainer is a clear starting point.

What Docling does

Docling converts PDFs, DOCX, and HTML into a single, unified document model that preserves hierarchy (sections, paragraphs, tables, figures) and provenance (page numbers, geometric locations). It can export to formats such as Markdown or JSON and integrate directly with RAG frameworks. Because it runs locally, it suits teams with strict data governance, and it avoids per‑page processing fees associated with some cloud pipelines.

  • Formats: PDFs, HTML, DOCX, and other common sources.
  • Outputs: Structured document model with export options for RAG ingestion.
  • Integrations: Works with popular frameworks like LangChain and LlamaIndex.
  • Deployment: Use it as a CLI, a Python library, or behind a local REST API (see the one-line CLI example below).
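
For example, converting a single file from the command line can be as simple as the call below (a minimal sketch; verify flag names with docling --help for your installed version):

# Minimal sketch: convert a PDF to Markdown locally (check docling --help for current flags)
docling report.pdf --to md --output ./out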

Explore the project home at Docling and the source code on GitHub.

How it works under the hood

Docling’s design revolves around three ideas: a parser backend for initial file reading, modular pipelines that enrich the document representation, and a final structured output (the Docling document) that captures the full hierarchy and provenance.

1) Parser backend

For low‑level formats like PDFs—originally optimized for printing rather than machine reading—the parser extracts characters, objects, and properties while preserving coordinates and font cues. That foundation allows higher‑level analysis to rebuild logical structure.

2) Enrichment pipelines

Layout analysis predicts bounding boxes for titles, paragraphs, lists, and figures. A specialized table model reconstructs rows and columns so the logical grid is retained. If needed, a vision model can annotate images and captions so referenced content isn’t lost during ingestion. For already‑structured inputs (HTML or DOCX), the pipeline leverages robust parsing libraries to normalize and enrich the existing hierarchy.

  • Layout understanding: Tools such as LayoutParser illustrate the modern layout-analysis approaches widely used in document AI.
  • Table structure recognition: Models like Microsoft’s Table Transformer show how row/column structure can be recovered from PDFs.
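
As a concrete example, the PDF pipeline exposes switches for OCR and table-structure recovery. A minimal sketch, assuming the PdfPipelineOptions API described in the Docling docs (verify names against the version you install):

# Minimal sketch: configure the PDF pipeline (names follow current Docling docs; verify before use)
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True               # run OCR for scanned pages
pipeline_options.do_table_structure = True   # reconstruct table rows and columns

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("report.pdf")
print(result.document.export_to_markdown()[:200])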

3) Unified document model with provenance

The final output is a structured, Pydantic-based document representation that retains page numbers and geometric locations for every element. That provenance enables targeted redaction (e.g., of PII), accurate citations in RAG answers, and precise chunking per element type.
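
A minimal sketch of reading that provenance, assuming the iterate_items and prov fields exposed by the Docling document model (check the docs for current attribute names):

# Minimal sketch: walk the converted document and print per-element provenance
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("report.pdf").document
for item, level in doc.iterate_items():
    label = getattr(item, "label", type(item).__name__)
    for prov in getattr(item, "prov", []):          # page number and bounding box per element
        print(label, "page", prov.page_no, "bbox", prov.bbox)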

Why this matters for RAG quality

Chunking strategy is one of the biggest drivers of RAG accuracy. When the source stays hierarchical—sections, tables, figures, lists—you can create one chunk per detected element and store metadata like page and section. That improves retrieval granularity, reduces context dilution, and makes citations both accurate and trustworthy.

  • Hybrid chunking: Combine semantic text chunks with structure‑aware chunks for tables and figures.
  • Provenance‑aware retrieval: Keep page, section, and table IDs in metadata for precise referencing.
  • Table‑first logic: Index tables separately from prose, because their retrieval semantics are different.
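
Docling ships a chunker that implements this structure-aware pattern. A minimal sketch, assuming the HybridChunker API from the project docs:

# Minimal sketch: structure-aware chunking with Docling's HybridChunker
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("report.pdf").document
chunker = HybridChunker()                 # optionally pass a tokenizer matching your embedding model
for chunk in chunker.chunk(doc):
    print(chunk.text[:80])                # chunk.meta carries headings and source items for provenance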

If you’re building with RAG frameworks, see the getting‑started guides for LangChain and LlamaIndex for ingestion, metadata, and vector store patterns that pair well with Docling’s structured output.

Quick start

Install

Install the CLI and Python package locally:

pip install docling

Parse a document and export

The general flow is: parse a PDF/DOCX/HTML file, obtain the structured document, then export to Markdown or JSON for indexing. Check the project documentation for the latest API surface and CLI flags.

# Minimal example (see the project docs for the latest API surface)
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert('report.pdf')
doc = result.document

md = doc.export_to_markdown()   # Markdown for embedding and indexing
js = doc.export_to_dict()       # JSON-style dict, including provenance fields

Serve as a local service

Many teams expose Docling behind a lightweight REST endpoint so apps and workers can submit documents for conversion without sharing data externally.

# Example: run a local REST service with the companion docling-serve package
# (refer to the official docs for exact install steps and flags)
pip install docling-serve
docling-serve run --host 0.0.0.0 --port 8080

Integrating with LangChain or LlamaIndex

Once you have Markdown or JSON with element‑level metadata, create structure‑aware chunks and index them.

# Example outline with LangChain
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Combine text with metadata pulled from Docling elements
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=120)
docs = text_splitter.create_documents(
    texts=[md],
    metadatas=[{'page': 3, 'element': 'table'}],
)
# Upsert into your vector store here

A similar pattern works with LlamaIndex using Document objects and metadata fields. See the LlamaIndex docs for loaders and node parsers that respect metadata and enable citation‑rich responses.
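
A minimal sketch of that pattern, assuming the llama_index.core Document and VectorStoreIndex APIs (an embedding model must be configured separately):

# Example outline with LlamaIndex (verify against the current LlamaIndex docs)
from llama_index.core import Document, VectorStoreIndex

docs = [Document(text=md, metadata={'page': 3, 'element': 'table'})]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()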

Performance and platform considerations

Docling emphasizes CPU‑friendly performance so you can run it on developer workstations or inside on‑prem containers. In public benchmarks reported by the project, Docling processes pages on commodity CPUs at roughly a second per page (around 1.26 seconds per page in one test set) while preserving high‑quality structure for tables and layout. See the project’s repository for the latest benchmark details and datasets.

Source and updates: Docling on GitHub.

How it compares to other options

  • Unstructured: A widely used toolkit for parsing many file types with flexible connectors. Great ecosystem; structure fidelity varies by format and configuration. See Unstructured on GitHub.
  • MinerU: An open‑source PDF/Office parser focused on turning documents into structured data for AI pipelines. See MinerU.
  • OCR stacks: Traditional OCR, such as Tesseract, can extract text but often loses table and layout structure unless paired with a layout model.

Where Docling stands out is its tight focus on logical reconstruction (especially of tables) and a unified document model designed for downstream AI workflows.

Best practices for AI‑ready documents

  • Prefer structure‑aware extraction: Avoid naive PDF‑to‑text for anything with tables or multi‑column layouts.
  • Chunk by element type: Create one chunk per section, paragraph, table, or figure. Keep provenance metadata like page and element IDs.
  • Index tables separately: Use a dedicated embedding model for tables or store them in a tabular‑aware index if available.
  • Normalize headings and lists: Preserve heading levels (h1–h3) and list types so retrievers can leverage hierarchy.
  • Redact early: Use the preserved coordinates and provenance to remove PII before indexing.
  • Evaluate retrieval quality: Track hit rate, MRR, and grounded‑answer accuracy; iterate on chunking and metadata (a minimal metric sketch follows this list).
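
A minimal, framework-agnostic sketch of the hit-rate and MRR calculations mentioned above (plain Python; the helper name is illustrative, not part of Docling):

# Minimal sketch: hit rate and mean reciprocal rank over an evaluation set
def hit_rate_and_mrr(results, k=5):
    # results: list of (retrieved_ids, relevant_id) pairs
    hits, rr_sum = 0, 0.0
    for retrieved_ids, relevant_id in results:
        top_k = retrieved_ids[:k]
        if relevant_id in top_k:
            hits += 1
            rr_sum += 1.0 / (top_k.index(relevant_id) + 1)
    n = len(results)
    return hits / n, rr_sum / n

print(hit_rate_and_mrr([(['c2', 'c7', 'c1'], 'c7')]))  # -> (1.0, 0.5)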

Troubleshooting tips

  • Interleaved columns: Verify layout detection; consider adjusting pipeline parameters or upgrading your layout model.
  • Broken table cells: Ensure a table‑structure model is active; tune table detection thresholds.
  • Image captions lost: Enable image annotation in the pipeline and capture captions as separate elements with references.
  • Large documents: Batch processing and streaming exports can keep memory usage predictable.

Key takeaways

  • RAG quality depends on structure fidelity, not just text extraction.
  • Docling produces a provenance‑rich document model that’s ideal for chunking and citation‑aware retrieval.
  • Local execution helps with compliance and cost control.
  • Benchmarks indicate competitive speed on CPUs while preserving tables and layout.
  • Integrations with LangChain and LlamaIndex make it straightforward to plug into existing AI pipelines.

If your AI answers falter on PDFs or complex reports, improving document structure—not just bigger models—is often the simplest, highest‑leverage fix. Docling gives you that structure.

