In the era of “Agentic AI,” privacy is the new gold. You want to use LLMs to summarize your bank statements, legal contracts, and medical records. But uploading them to ChatGPT or Claude feels… risky. Even with “Enterprise Privacy” promises, once data leaves your machine, it’s out of your control.
The solution? Build your own Local AI Document Processor.
It sounds hard, but thanks to two open-source tools—PikePDF and Ollama—it is now trivially easy.
This guide will show you how to build a Python script that:
- Takes a messy PDF folder as input.
- Uses PikePDF to clean, decrypt, and split the files.
- Uses Ollama (Llama 3.2) to read and summarize them locally.
- Costs $0.00 and sends zero data to the internet.
The Stack: Why This Combo?
Tool 1: PikePDF (The Surgeon)
Most Python PDF libraries (like PyPDF2) are “readers.” They break easily on complex, encrypted, or malformed PDFs.
PikePDF is different. It is a Python wrapper around QPDF, a C++ library used by governments and archives. It doesn’t just read PDFs; it performs surgery on them.
- It can strip passwords, if you know them (a quick sketch follows this list).
- It repairs broken metadata.
- It handles “Linearized” (Fast Web View) PDFs that confuse other parsers.
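For that first point, stripping a known password is nearly a two-liner. A minimal sketch; the file names and password are placeholders:

```python
import pikepdf

# Open with the known password; pikepdf decrypts the file on open.
with pikepdf.open("locked.pdf", password="secret123") as pdf:
    # By default, save() writes the copy without encryption.
    pdf.save("unlocked.pdf")
```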
Tool 2: Ollama (The Brain)
We’ve covered Ollama before. It is the easiest way to run LLMs on macOS/Linux.
For this tutorial, we will use Llama 3.2 (3B). Why the 3B model? Because it is:
- Fast: It processes around 100 pages/minute on a MacBook Air.
- Smart Enough: It is great at summarization and entity extraction.
- Small: It needs only about 2GB of RAM.
Step 1: The Setup
First, let’s install the “Surgeon” and the “Brain.”
Install Ollama
Download it from ollama.com. Once installed, pull the model:
```bash
ollama pull llama3.2
```
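To confirm the model is ready, you can hand it a one-off prompt straight from the terminal (any test sentence will do):

```bash
ollama run llama3.2 "Summarize this in one sentence: the quick brown fox jumps over the lazy dog."
```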
Install Python Libraries
We need pikepdf for the surgery and langchain-community to glue it to Ollama.
```bash
pip install pikepdf langchain-community langchain-core pypdf
```

(Note: We still use pypdf for simple text extraction, but PikePDF handles the file management.)
Step 2: The "PDF Hygiene" Script
Before you feed a PDF to an AI, you must clean it. AI models hate:
1. Corrupt EOF markers.
2. Encryption/Passwords.
3. 1000-page dumps (context-window overflow; see the splitting sketch after the cleaner script).
Here is a cleaner.py script using PikePDF:
```python
import pikepdf
import os

def clean_pdf(input_path, output_path):
    try:
        # Opening and re-saving forces QPDF to rewrite the file structure,
        # repairing broken cross-reference tables and EOF markers.
        with pikepdf.open(input_path) as pdf:
            pdf.save(output_path)
        print(f"✓ Cleaned: {input_path}")
    except pikepdf.PasswordError:
        print(f"✗ Locked: {input_path} (Needs Password)")
    except Exception as e:
        print(f"✗ Error: {e}")

os.makedirs("clean_docs", exist_ok=True)  # pdf.save() fails if the folder is missing
for file in os.listdir("raw_docs"):
    if file.endswith(".pdf"):
        clean_pdf(f"raw_docs/{file}", f"clean_docs/{file}")
```
Why this matters: If you try to load a slightly corrupt PDF directly into LangChain, it crashes completely. PikePDF acts as the "Sanitization Layer," ensuring the AI pipeline never chokes.
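Hazard 3 (the 1000-page dump) is also PikePDF's job. A minimal sketch that splits an oversized file before it ever reaches the model; the 50-page chunk size and file names are illustrative choices, not fixed requirements:

```python
import pikepdf

CHUNK_SIZE = 50  # pages per output file; tune to your model's context window

with pikepdf.open("clean_docs/huge_dump.pdf") as pdf:
    for start in range(0, len(pdf.pages), CHUNK_SIZE):
        part = pikepdf.Pdf.new()
        # Copying pages between Pdf objects is a core pikepdf operation.
        part.pages.extend(pdf.pages[start:start + CHUNK_SIZE])
        part.save(f"clean_docs/huge_dump_part{start // CHUNK_SIZE + 1}.pdf")
```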
---
Step 3: The "Local Brain" Script
Now that we have clean PDFs, let's feed them to Ollama.
We will write a script analyze.py that summarizes each document.
```python
from langchain_community.llms import Ollama
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains.summarize import load_summarize_chain
import os

llm = Ollama(model="llama3.2")

def summarize_doc(filepath):
    print(f"🧠 Reading {filepath}...")
    loader = PyPDFLoader(filepath)
    # Split into page-sized chunks so long documents fit the context window.
    docs = loader.load_and_split()
    # "map_reduce": summarize each chunk, then summarize the summaries.
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return chain.run(docs)

for file in os.listdir("clean_docs"):
    if file.endswith(".pdf"):
        path = f"clean_docs/{file}"
        summary = summarize_doc(path)
        print(f"\n📄 SUMMARY for {file}:\n{summary}\n" + "-" * 40)
```
The "Agentic" Upgrade: Auto-Sorting
Summarization is boring. Let's make it Agentic.
We can change the prompt to ask Ollama to classify the document, and then move the file to the correct folder.
Modified logic:
```python
import os
import shutil
from pypdf import PdfReader

def extract_first_n_pages(filepath, n):
    # Grab raw text from the first n pages -- usually enough to classify.
    reader = PdfReader(filepath)
    return "\n".join(page.extract_text() or "" for page in reader.pages[:n])

def classify_and_move(filepath):
    text = extract_first_n_pages(filepath, 2)
    prompt = f"""
    Analyze this text. Return ONLY one word: 'INVOICE', 'CONTRACT', 'RESUME', or 'OTHER'.
    Text: {text[:2000]}
    """
    # The model's one-word answer becomes the folder name.
    category = llm.invoke(prompt).strip()
    new_dir = f"organized_docs/{category}"
    os.makedirs(new_dir, exist_ok=True)
    shutil.move(filepath, f"{new_dir}/{os.path.basename(filepath)}")
    print(f"🚚 Moved to {category}")
```
Now you have a system that:
1. Takes a dump of 500 random files.
2. Cleans them (PikePDF).
3. Reads them (Ollama).
4. Sorts them into folders ("Invoices", "Contracts").
Total Cost: $0.
Privacy: 100%.
---
Performance Tuning: Choosing the Right Model
Everything depends on your hardware.
- 8GB RAM (Standard Laptop): Use llama3.2:1b or phi-3. They are blazing fast but occasionally misclassify complex legal docs.
- 16GB RAM (MacBook Air): Use llama3.2:3b or mistral. This is the sweet spot.
- 64GB+ RAM (Workstation): Use llama3:70b (quantized). This gives you GPT-4 class understanding for intricate analysis.
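If you want the script to pick a model on its own, here is a rough sketch; the psutil dependency and the RAM thresholds are my assumptions (mirroring the tiers above), not part of the original stack:

```python
import psutil  # assumption: not in the tutorial's stack (pip install psutil)
from langchain_community.llms import Ollama

# Pick a model tag from total system RAM; thresholds mirror the tiers above.
ram_gb = psutil.virtual_memory().total / 1e9
if ram_gb >= 64:
    model = "llama3:70b"
elif ram_gb >= 16:
    model = "llama3.2:3b"
else:
    model = "llama3.2:1b"

llm = Ollama(model=model)  # you still need to `ollama pull` the chosen tag first
```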
Conclusion
We are entering a world where "Software" is just "AI glued together with Python."
Two years ago, building a document sorting engine required Google Vision API, AWS Textract, and thousands of dollars in credits.
Today, it requires `pip install ollama` and 20 lines of code.
This PikePDF + Ollama stack is a foundational skill for the AI Engineer. It teaches you the most important lesson of 2026: Data Quality (PikePDF) matters as much as Model Intelligence (Ollama).
