
Architecture

“So I was sitting in my cubicle today, and I realized… every document format is worse than the one before it.”

mcwaddams is designed around a single principle: never silently fail. When extraction works, you get content. When it doesn’t, you get a clear explanation why.


┌─────────────────────────────────────────────────────────────┐
│                         MCP Client                          │
│                (Claude Code, Claude Desktop)                │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                       FastMCP Server                        │
│                                                             │
│     ┌─────────────┐   ┌─────────────┐   ┌─────────────┐     │
│     │  Universal  │   │    Word     │   │    Excel    │     │
│     │    Mixin    │   │    Mixin    │   │    Mixin    │     │
│     └─────────────┘   └─────────────┘   └─────────────┘     │
│            │                 │                 │            │
│            └─────────────────┼─────────────────┘            │
│                              ▼                              │
│   ┌─────────────────────────────────────────────────────┐   │
│   │             Format Detection & Routing              │   │
│   └─────────────────────────────────────────────────────┘   │
│                              │                              │
│            ┌─────────────────┼─────────────────┐            │
│            ▼                 ▼                 ▼            │
│      ┌───────────┐     ┌───────────┐     ┌───────────┐      │
│      │  python-  │     │ openpyxl  │     │  python-  │      │
│      │   docx    │     │  pandas   │     │   pptx    │      │
│      └───────────┘     └───────────┘     └───────────┘      │
│            │                 │                 │            │
│            └─────────────────┼─────────────────┘            │
│                              ▼                              │
│   ┌─────────────────────────────────────────────────────┐   │
│   │                  Fallback Handlers                  │   │
│   │      mammoth · olefile · xlrd · legacy parsers      │   │
│   └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

mcwaddams uses mixins to organize tools by document type:

class MCWaddamsServer(
    UniversalMixin,  # extract_text, extract_images, etc.
    WordMixin,       # convert_to_markdown, extract_tables, etc.
    ExcelMixin,      # analyze_data, extract_formulas, etc.
):
    """The main server combines all capabilities."""
    pass

Benefits:

  1. Separation of concerns — Word tools don’t clutter Excel code
  2. Independent testing — Each mixin can be tested in isolation
  3. Easy extension — Add PowerPoint tools without touching existing code
  4. Clear ownership — Easy to find where a tool is defined
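
As a rough illustration of the isolation benefit, here is a minimal sketch with stand-in mixins. The method bodies and the `tool_names` helper are placeholders invented for this example, not mcwaddams code; the point is only that one mixin can be instantiated and exercised without the others.

```python
class WordMixin:
    """Stand-in mixin; the real one would parse the document."""

    def extract_tables(self, file_path: str) -> dict:
        return {"tool": "extract_tables", "file": file_path}


class ExcelMixin:
    """Stand-in mixin for spreadsheet tools."""

    def analyze_data(self, file_path: str) -> dict:
        return {"tool": "analyze_data", "file": file_path}


class DemoServer(WordMixin, ExcelMixin):
    """Combines both mixins, mirroring how the main server is composed."""


def tool_names(cls: type) -> list[str]:
    """List public attribute names, a hypothetical helper for this sketch."""
    return sorted(name for name in dir(cls) if not name.startswith("_"))


# A mixin can be tested on its own, with no server and no other mixins.
word_only = WordMixin()
word_result = word_only.extract_tables("report.docx")
```

The combined class exposes the tools of every mixin, while a test for Word behavior never needs to import the Excel code.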

Every request starts with format detection:

from pathlib import Path

async def detect_format(file_path: str) -> FormatInfo:
    # 1. Check file extension
    ext = Path(file_path).suffix.lower()
    # 2. Validate file exists and is readable
    validate_file(file_path)
    # 3. Detect actual format (extension might lie)
    with open(file_path, 'rb') as f:
        magic_bytes = f.read(8)
    # 4. Check for encryption
    is_encrypted = check_encryption(file_path)
    return FormatInfo(
        extension=ext,
        actual_format=detect_from_magic(magic_bytes),
        is_legacy=(ext in ['.doc', '.xls', '.ppt']),
        is_encrypted=is_encrypted
    )

We don’t trust file extensions alone:

| Magic Bytes                 | Format                            |
|-----------------------------|-----------------------------------|
| 50 4B 03 04                 | ZIP-based (OOXML: docx/xlsx/pptx) |
| D0 CF 11 E0                 | OLE Compound (doc/xls/ppt)        |
| EF BB BF or starts with text | CSV/plain text                   |
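
The signatures above could be matched with something like the following. This is a minimal sketch: the return labels (`"ooxml"`, `"ole"`, `"text"`, `"unknown"`) and the printable-text heuristic are assumptions made for illustration, and the real `detect_from_magic` may behave differently.

```python
ZIP_MAGIC = bytes.fromhex("504B0304")  # 50 4B 03 04
OLE_MAGIC = bytes.fromhex("D0CF11E0")  # D0 CF 11 E0
UTF8_BOM = bytes.fromhex("EFBBBF")     # EF BB BF


def detect_from_magic(magic_bytes: bytes) -> str:
    """Classify a file from its leading bytes (labels are hypothetical)."""
    if magic_bytes.startswith(ZIP_MAGIC):
        return "ooxml"  # ZIP container: docx/xlsx/pptx
    if magic_bytes.startswith(OLE_MAGIC):
        return "ole"    # OLE Compound File: doc/xls/ppt
    if magic_bytes.startswith(UTF8_BOM):
        return "text"
    # No known signature: accept the sample as text only if it decodes
    # cleanly and contains nothing but printable characters and whitespace.
    # Caveat: a multi-byte character cut off at the end of the sample
    # fails to decode and is reported as "unknown".
    try:
        sample = magic_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    if sample and all(ch.isprintable() or ch in "\r\n\t" for ch in sample):
        return "text"
    return "unknown"
```

A ZIP signature only says the file is a ZIP container; telling docx from xlsx or pptx would still require looking inside the archive.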

Primary methods are tried first, then fallbacks:

async def extract_text(file_path: str, method: str = "auto"):
    if method == "auto":
        # Try primary method
        try:
            return await primary_extraction(file_path)
        except ExtractionError:
            pass
        # Try fallback
        try:
            return await fallback_extraction(file_path)
        except ExtractionError:
            pass
        # All methods failed: return helpful error
        return {
            "error": "Extraction failed",
            "tried": ["python-docx", "mammoth"],
            "hint": "File may be corrupted. Try analyze_document_health."
        }
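
| Format | Primary     | Fallback 1   | Fallback 2 |
|--------|-------------|--------------|------------|
| .docx  | python-docx | mammoth      |            |
| .doc   | olefile     | textract*    | raw OLE    |
| .xlsx  | openpyxl    | pandas       |            |
| .xls   | xlrd        | pandas       | olefile    |
| .pptx  | python-pptx |              |            |
| .csv   | pandas      | built-in csv |            |

*textract only if installed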

For large documents, the resource system enables on-demand fetching:

┌─────────────────────────────────────────────────────────────┐
│                      index_document()                       │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                       Resource Store                        │
│                                                             │
│  doc_id: "abc123def456"                                     │
│                                                             │
│  chapters: [                                                │
│    {id: 1, uri: "chapter://abc123/1", content: "..."},      │
│    {id: 2, uri: "chapter://abc123/2", content: "..."},      │
│  ]                                                          │
│                                                             │
│  images: [                                                  │
│    {id: 0, uri: "image://abc123/0", data: <bytes>},         │
│  ]                                                          │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    MCP Resource Protocol                    │
│                                                             │
│  Client requests: chapter://abc123/1.md                     │
│  Server returns:  Markdown-formatted Chapter 1              │
└─────────────────────────────────────────────────────────────┘

  • chapter://doc_id/N — Single chapter
  • chapter://doc_id/N.md — Chapter as Markdown
  • chapter://doc_id/N.txt — Chapter as plain text
  • chapter://doc_id/N.html — Chapter as HTML
  • chapters://doc_id/N-M — Range of chapters
  • image://doc_id/N — Single image
  • sheet://doc_id/name — Excel sheet
  • slide://doc_id/N — PowerPoint slide
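
To make the URI shapes concrete, here is a small parser sketch. `ResourceRef` and `parse_resource_uri` are hypothetical names for this example; it assumes that only `chapter://` URIs carry a format suffix, as in the list above, and it leaves ranges such as `N-M` as an unparsed string.

```python
import re
from typing import NamedTuple, Optional


class ResourceRef(NamedTuple):
    kind: str           # chapter, chapters, image, sheet, or slide
    doc_id: str
    target: str         # chapter number, range, image index, sheet name, ...
    fmt: Optional[str]  # md, txt, html, or None when no suffix is given


_URI_PATTERN = re.compile(r"^(chapter|chapters|image|sheet|slide)://([^/]+)/(.+)$")
_CHAPTER_FORMATS = ("md", "txt", "html")


def parse_resource_uri(uri: str) -> ResourceRef:
    """Split a resource URI into scheme, document ID, target and format."""
    match = _URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Unrecognized resource URI: {uri}")
    kind, doc_id, target = match.groups()
    fmt = None
    if kind == "chapter":
        # Only chapters take a format suffix; sheet names may contain dots.
        target, dot, suffix = target.partition(".")
        if dot:
            if suffix not in _CHAPTER_FORMATS:
                raise ValueError(f"Unsupported chapter format: {suffix}")
            fmt = suffix
    return ResourceRef(kind, doc_id, target, fmt)


ref = parse_resource_uri("chapter://abc123/1.md")
```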

Documents from URLs are downloaded and cached:

import aiohttp

CACHE_DURATION = 3600  # 1 hour

async def resolve_file_path(path_or_url: str) -> str:
    if not path_or_url.startswith(('http://', 'https://')):
        return path_or_url  # Local file
    # Check cache
    cache_key = hash_url(path_or_url)
    cached = get_from_cache(cache_key)
    if cached and not expired(cached):
        return cached.path
    # Download
    async with aiohttp.ClientSession() as session:
        async with session.get(path_or_url) as response:
            response.raise_for_status()  # Don't cache an HTTP error page
            content = await response.read()
    # Save to temp and cache
    temp_path = save_to_temp(content, extension_from_url(path_or_url))
    add_to_cache(cache_key, temp_path)
    return temp_path
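
The cache helpers used above are not shown in this document. One plausible in-memory version is sketched here; the `CacheEntry` dataclass, the SHA-256 key, and the module-level dictionary are all assumptions made for illustration.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Optional

CACHE_DURATION = 3600  # 1 hour, matching the constant above


@dataclass
class CacheEntry:
    path: str
    created_at: float = field(default_factory=time.time)


_cache: dict[str, CacheEntry] = {}  # Hypothetical in-memory store


def hash_url(url: str) -> str:
    """Derive a stable cache key from the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def add_to_cache(cache_key: str, path: str) -> None:
    _cache[cache_key] = CacheEntry(path)


def get_from_cache(cache_key: str) -> Optional[CacheEntry]:
    return _cache.get(cache_key)


def expired(entry: CacheEntry, now: Optional[float] = None) -> bool:
    """True once the entry is older than CACHE_DURATION seconds."""
    current = time.time() if now is None else now
    return current - entry.created_at > CACHE_DURATION


key = hash_url("https://example.com/report.docx")
add_to_cache(key, "/tmp/report.docx")
entry = get_from_cache(key)
```

The optional `now` argument exists only to make expiry testable without waiting an hour.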

Every error must be:

  1. Clear — What went wrong
  2. Actionable — What can be done about it
  3. Helpful — Suggestions for next steps

from typing import Optional

class ExtractionError(Exception):
    def __init__(self, message: str, hint: Optional[str] = None, tried: Optional[list] = None):
        super().__init__(message)  # Keep str(error) and tracebacks informative
        self.message = message
        self.hint = hint or "Check file integrity with analyze_document_health"
        self.tried = tried or []

    def to_dict(self):
        return {
            "error": self.message,
            "hint": self.hint,
            "tried_methods": self.tried
        }

Large documents are paginated at ~25,000 tokens:

MAX_TOKENS = 25000
TOKENS_PER_CHAR = 0.25  # Rough estimate

def should_paginate(content: str) -> bool:
    estimated_tokens = len(content) * TOKENS_PER_CHAR
    return estimated_tokens > MAX_TOKENS
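
At 0.25 tokens per character, the 25,000-token budget works out to about 100,000 characters per page. A splitting routine built on that estimate could look like the sketch below. `paginate` is a hypothetical helper, and preferring paragraph breaks as cut points is an assumption of this example, not documented behavior.

```python
MAX_TOKENS = 25000
TOKENS_PER_CHAR = 0.25  # Rough estimate, same as above


def paginate(content: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split content into pages that each fit the estimated token budget."""
    max_chars = max(1, int(max_tokens / TOKENS_PER_CHAR))
    pages: list[str] = []
    start = 0
    while start < len(content):
        end = min(start + max_chars, len(content))
        if end < len(content):
            # Prefer to cut just after the last paragraph break in the window.
            cut = content.rfind("\n\n", start, end)
            if cut > start:
                end = cut + 2
        pages.append(content[start:end])
        start = end
    return pages


pages = paginate("a" * 250_000)
```

Joining the pages reproduces the original text exactly, so no content is lost at page boundaries.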

Resources aren’t loaded until requested:

# index_document only creates references
index = await index_document("huge-book.docx")
# Returns immediately with URIs
# Content loaded only when requested
chapter_1 = await mcp_resources.read("chapter://abc123/1")
# Now the content is loaded
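
One way to get this deferred behavior is to store a loader callable next to each URI and call it on first read. The `LazyResource` class below is a hypothetical sketch of that idea, not the server's resource store.

```python
from typing import Callable, Optional


class LazyResource:
    """Holds a URI and a loader; content is fetched on first read only."""

    def __init__(self, uri: str, loader: Callable[[], str]):
        self.uri = uri
        self._loader = loader
        self._content: Optional[str] = None
        self.loaded = False

    def read(self) -> str:
        if not self.loaded:
            self._content = self._loader()  # Expensive work happens here
            self.loaded = True
        return self._content


load_calls = []


def load_chapter_one() -> str:
    load_calls.append("chapter 1")  # Track how often the loader runs
    return "Chapter 1 text"


chapter = LazyResource("chapter://abc123/1", load_chapter_one)
calls_before_read = len(load_calls)  # Indexing alone loads nothing
first = chapter.read()
second = chapter.read()              # Served from memory, loader not re-run
```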

Want to add new functionality?

# In the appropriate mixin
@mcp.tool()
async def my_new_tool(file_path: str) -> dict:
    """Description for MCP discovery."""
    path = await resolve_file_path(file_path)
    validate_file(path)
    # Your logic here
    return {"result": "..."}

# In format detection
FORMAT_HANDLERS = {
    '.docx': DocxHandler,
    '.doc': DocHandler,
    '.rtf': RtfHandler,  # Add new format
}

“What would you say… you do here?”


We extract documents. It’s pretty straightforward.