In the era of “Agentic AI,” privacy is the new gold. You want to use LLMs to summarize your bank statements, legal contracts, and medical records. But uploading them to ChatGPT or Claude feels… risky. Even with “Enterprise Privacy” promises, once data leaves your machine, it’s out of your control.

The solution? Build your own Local AI Document Processor.

It sounds hard, but thanks to two open-source tools—PikePDF and Ollama—it is now trivially easy.

This guide will show you how to build a Python script that:

  • Takes a messy PDF folder as input.
  • Uses PikePDF to clean, decrypt, and split the files.
  • Uses Ollama (Llama 3.2) to read and summarize them locally.
  • Costs $0.00 and sends zero data to the internet.

The Stack: Why This Combo?

Tool 1: PikePDF (The Surgeon)

Most Python PDF libraries (like PyPDF2) are “readers.” They break easily on complex, encrypted, or malformed PDFs.

PikePDF is different. It is a Python wrapper around QPDF, a C++ library used by governments and archives. It doesn’t just read PDFs; it performs surgery on them.

  • It can strip passwords (if you know them; quick sketch below).
  • It repairs broken metadata.
  • It handles “Linearized” (Fast Web View) PDFs that confuse other parsers.
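
Here is a minimal sketch of the password case (the file name and password are placeholders):

```python
import pikepdf

# Open an encrypted PDF with a known password; saving without an
# encryption argument writes an unencrypted copy
with pikepdf.open("locked.pdf", password="hunter2") as pdf:
    pdf.save("unlocked.pdf")
```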

Tool 2: Ollama (The Brain)

We’ve covered Ollama before. It is the easiest way to run LLMs on macOS/Linux.

For this tutorial, we will use Llama 3.2 (3B). Why the 3B model? Because it is:

  • Fast: quick enough to process 100 pages/minute on a MacBook Air.
  • Smart Enough: Great at summarization and entity extraction.
  • Small: Only 2GB of RAM required.

Step 1: The Setup

First, let’s install the “Surgeon” and the “Brain.”

Install Ollama

Download it from ollama.com. Once installed, pull the model:

```bash
ollama pull llama3.2
```
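
To sanity-check the install, list your local models and fire off a one-line prompt (the prompt text is arbitrary):

```bash
ollama list                       # llama3.2 should appear here
ollama run llama3.2 "Say hello"   # quick smoke test
```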

Install Python Libraries

We need pikepdf for the surgery and langchain-community to glue it to Ollama.

```bash
pip install pikepdf langchain-community langchain-core pypdf
```

A taste of PikePDF’s elegant, Pythonic API:

```python
import pikepdf

# Open a PDF, count its pages, drop the last one, and save a copy
with pikepdf.open("input.pdf") as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("output.pdf")
```

(Note: We still use pypdf for simple text extraction, but PikePDF handles the file management.)

Step 2: The "PDF Hygiene" Script

Before you feed a PDF to an AI, you must clean it. AI models hate:

1. Corrupt EOF markers.

2. Encryption/Passwords.

3. 1000-page dumps (context window overflow; see the splitting sketch below).
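
A minimal splitting sketch with PikePDF (the file name and CHUNK size are placeholders; tune CHUNK to your model’s context window):

```python
import pikepdf

CHUNK = 50  # pages per output file

with pikepdf.open("huge.pdf") as pdf:
    # Copy pages in fixed-size slices into fresh PDFs
    for start in range(0, len(pdf.pages), CHUNK):
        part = pikepdf.Pdf.new()
        part.pages.extend(pdf.pages[start:start + CHUNK])
        part.save(f"huge_part_{start // CHUNK + 1}.pdf")
```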

Here is a cleaner.py script using PikePDF:

```python
import os

import pikepdf

os.makedirs("clean_docs", exist_ok=True)

def clean_pdf(input_path, output_path):
    try:
        # Re-saving forces QPDF to rebuild the file structure,
        # repairing corrupt EOF markers and broken cross-reference tables
        with pikepdf.open(input_path) as pdf:
            pdf.save(output_path)
        print(f"✓ Cleaned: {input_path}")
    except pikepdf.PasswordError:
        print(f"✗ Locked: {input_path} (Needs Password)")
    except Exception as e:
        print(f"✗ Error: {e}")

for file in os.listdir("raw_docs"):
    if file.endswith(".pdf"):
        clean_pdf(f"raw_docs/{file}", f"clean_docs/{file}")
```

Why this matters: If you try to load a slightly corrupt PDF directly into LangChain, it crashes completely. PikePDF acts as the "Sanitization Layer," ensuring the AI pipeline never chokes.

---

Step 3: The "Local Brain" Script

Now that we have clean PDFs, let's feed them to Ollama.

We will write a script analyze.py that summarizes each document.

```python
import os

from langchain_community.llms import Ollama
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains.summarize import load_summarize_chain

llm = Ollama(model="llama3.2")

def summarize_doc(filepath):
    # Load the PDF and split it into chunks that fit the context window
    loader = PyPDFLoader(filepath)
    docs = loader.load_and_split()

    # map_reduce: summarize each chunk, then summarize the summaries
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    print(f"🧠 Reading {filepath}...")
    summary = chain.run(docs)
    return summary

for file in os.listdir("clean_docs"):
    path = f"clean_docs/{file}"
    summary = summarize_doc(path)
    print(f"\n📄 SUMMARY for {file}:\n{summary}\n" + "-" * 40)
```

The "Agentic" Upgrade: Auto-Sorting

Summarization is boring. Let's make it Agentic.

We can change the prompt to ask Ollama to classify the document, then move the file to the correct folder.

Modified logic:

```python
import os
import shutil

def classify_and_move(filepath):
    # The first two pages are usually enough signal to classify a document
    text = extract_first_n_pages(filepath, 2)

    prompt = f"""
    Analyze this text. Return ONLY one word: 'INVOICE', 'CONTRACT', 'RESUME', or 'OTHER'.
    Text: {text[:2000]}
    """

    category = llm.invoke(prompt).strip()

    new_dir = f"organized_docs/{category}"
    os.makedirs(new_dir, exist_ok=True)
    shutil.move(filepath, f"{new_dir}/{os.path.basename(filepath)}")
    print(f"🚚 Moved to {category}")
```

Now you have a system that:

1. Takes a dump of 500 random files.

2. Cleans them (PikePDF).

3. Reads them (Ollama).

4. Sorts them into folders ("Invoices", "Contracts").

Total Cost: $0.
Privacy: 100%.
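
Want it all in one pass? A minimal driver, assuming clean_pdf and classify_and_move from the scripts above are in scope:

```python
import os

# Hypothetical glue: clean every raw PDF, then classify the cleaned copies
for file in os.listdir("raw_docs"):
    if file.endswith(".pdf"):
        clean_pdf(f"raw_docs/{file}", f"clean_docs/{file}")

for file in os.listdir("clean_docs"):
    if file.endswith(".pdf"):
        classify_and_move(f"clean_docs/{file}")
```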

---

Performance Tuning: Choosing the Right Model

Everything depends on your hardware.

  • 8GB RAM (Standard Laptop): Use llama3.2:1b or phi-3. They are blazing fast but occasionally misclassify complex legal docs.
  • 16GB RAM (MacBook Air): Use llama3.2:3b or mistral. This is the sweet spot.
  • 64GB+ RAM (Workstation): Use llama3:70b (quantized). This gives you GPT-4-class understanding for intricate analysis.
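
Switching tiers is just a different model tag. Pull the one that fits your machine, then point the model string in analyze.py at it (e.g., Ollama(model="llama3.2:1b")):

```bash
ollama pull llama3.2:1b   # 8GB machines
ollama pull llama3.2:3b   # 16GB sweet spot
ollama pull llama3:70b    # 64GB+ workstations
```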

Conclusion

We are entering a world where "Software" is just "AI glued together with Python."

Two years ago, building a document sorting engine required Google Vision API, AWS Textract, and thousands of dollars in credits.

Today, it requires `pip install ollama` and 20 lines of code.

This PikePDF + Ollama stack is a foundational skill for the AI Engineer. It teaches you the most important lesson of 2026: Data Quality (PikePDF) matters as much as Model Intelligence (Ollama).
