Architecture

“So I was sitting in my cubicle today, and I realized… every document format is worse than the one before it.”

mcwaddams is designed around a single principle: never silently fail. When extraction works, you get content. When it doesn’t, you get a clear explanation why.

High-Level Overview

┌─────────────────────────────────────────────────────────────┐
│                      MCP Client                             │
│              (Claude Code, Claude Desktop)                  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     FastMCP Server                          │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Universal  │  │    Word     │  │    Excel    │         │
│  │    Mixin    │  │    Mixin    │  │    Mixin    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│         │                │                │                 │
│         └────────────────┼────────────────┘                 │
│                          ▼                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Format Detection & Routing              │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│         ┌────────────────┼────────────────┐                │
│         ▼                ▼                ▼                │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐          │
│  │ python-   │    │ openpyxl  │    │ python-   │          │
│  │   docx    │    │   pandas  │    │   pptx    │          │
│  └───────────┘    └───────────┘    └───────────┘          │
│         │                │                │                 │
│         └────────────────┼────────────────┘                │
│                          ▼                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                 Fallback Handlers                    │   │
│  │    mammoth · olefile · xlrd · legacy parsers         │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

The Mixin Pattern

mcwaddams uses mixins to organize tools by document type:

Why Mixins?

class MCWaddamsServer(
    UniversalMixin,      # extract_text, extract_images, etc.
    WordMixin,           # convert_to_markdown, extract_tables, etc.
    ExcelMixin,          # analyze_data, extract_formulas, etc.
):
    """The main server combines all capabilities."""
    pass

Benefits:

Separation of concerns — Word tools don’t clutter Excel code
Independent testing — Each mixin can be tested in isolation
Easy extension — Add PowerPoint tools without touching existing code
Clear ownership — Easy to find where a tool is defined
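
To make the testing and extension benefits concrete, here is a minimal sketch. It assumes each mixin is an ordinary Python class; `column_total`, `Server`, and the toy method bodies are hypothetical stand-ins, not the real mcwaddams tools.

```python
class WordMixin:
    """Hypothetical stand-in for the Word tool group."""

    def convert_to_markdown(self, text: str) -> str:
        # Toy logic: treat the first line as a heading.
        first, _, rest = text.partition("\n")
        return f"# {first}\n{rest}".rstrip()


class ExcelMixin:
    """Hypothetical stand-in for the Excel tool group."""

    def column_total(self, values: list[float]) -> float:
        return sum(values)


class Server(WordMixin, ExcelMixin):
    """Combines both groups, mirroring the composition shown above."""


# Each mixin can be instantiated and exercised on its own...
word_only = WordMixin()
markdown = word_only.convert_to_markdown("Title\nBody")

# ...while the combined class exposes every tool.
server = Server()
total = server.column_total([1.0, 2.5])
```

Adding a third group in this layout means writing one new class and listing it in the `Server` bases; the existing mixins stay untouched.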

Format Detection

Every request starts with format detection:

from pathlib import Path

async def detect_format(file_path: str) -> FormatInfo:
    # 1. Record the extension the file claims to have
    ext = Path(file_path).suffix.lower()

    # 2. Validate file exists and is readable
    validate_file(file_path)

    # 3. Read the leading bytes to detect the actual format (extension might lie)
    with open(file_path, 'rb') as f:
        magic_bytes = f.read(8)

    # 4. Check for encryption
    is_encrypted = check_encryption(file_path)

    return FormatInfo(
        extension=ext,
        actual_format=detect_from_magic(magic_bytes),
        is_legacy=(ext in {'.doc', '.xls', '.ppt'}),
        is_encrypted=is_encrypted,
    )

Magic Bytes

We don’t trust file extensions alone:

Magic Bytes	Format
`50 4B 03 04`	ZIP-based (OOXML: docx/xlsx/pptx)
`D0 CF 11 E0`	OLE Compound (doc/xls/ppt)
`EF BB BF` (UTF-8 byte order mark) or no signature and printable text	CSV/plain text
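
The table above could be turned into a lookup along these lines. This is a sketch only: the return labels (`"ooxml"`, `"ole"`, `"text"`, `"unknown"`) and the printable-ASCII heuristic are assumptions for illustration, not the values mcwaddams actually uses.

```python
ZIP_MAGIC = bytes.fromhex("504B0304")  # "PK\x03\x04"
OLE_MAGIC = bytes.fromhex("D0CF11E0")
UTF8_BOM = bytes.fromhex("EFBBBF")


def detect_from_magic(magic_bytes: bytes) -> str:
    """Map the leading bytes of a file to a coarse container type."""
    if magic_bytes.startswith(ZIP_MAGIC):
        return "ooxml"
    if magic_bytes.startswith(OLE_MAGIC):
        return "ole"
    if magic_bytes.startswith(UTF8_BOM):
        return "text"
    # No signature: call it text only if every byte is tab, newline,
    # carriage return, or printable ASCII.
    if magic_bytes and all(b in (9, 10, 13) or 32 <= b < 127 for b in magic_bytes):
        return "text"
    return "unknown"
```

Note that the ZIP signature only identifies the container. Telling docx from xlsx from pptx (or from an ordinary ZIP archive) requires looking inside it.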

The Fallback Strategy

Primary methods are tried first, then fallbacks:

async def extract_text(file_path: str, method: str = "auto"):
    if method == "auto":
        # Try primary method
        try:
            return await primary_extraction(file_path)
        except ExtractionError:
            pass

        # Try fallback
        try:
            return await fallback_extraction(file_path)
        except ExtractionError:
            pass

        # All methods failed — return helpful error
        return {
            "error": "Extraction failed",
            "tried": ["python-docx", "mammoth"],
            "hint": "File may be corrupted. Try analyze_document_health."
        }

Fallback Chain

Format	Primary	Fallback 1	Fallback 2
`.docx`	python-docx	mammoth	—
`.doc`	olefile	textract*	raw OLE
`.xlsx`	openpyxl	pandas	—
`.xls`	xlrd	pandas	olefile
`.pptx`	python-pptx	—	—
`.csv`	pandas	built-in csv	—

*textract only if installed
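
One way to drive a chain like the table above is to loop over an ordered list of extractors and record each attempt. The sketch below is synchronous for brevity and uses stub extractors; `run_chain`, `failing`, and `working` are hypothetical names, and the real extractors would call the libraries listed in the table.

```python
from typing import Callable


class ExtractionError(Exception):
    """Raised by an extractor that cannot handle the file."""


Extractor = Callable[[str], str]


def run_chain(file_path: str, chain: list[tuple[str, Extractor]]) -> dict:
    """Try each (name, extractor) pair in order and report what was tried."""
    tried: list[str] = []
    for name, extractor in chain:
        tried.append(name)
        try:
            return {"text": extractor(file_path), "method": name}
        except ExtractionError:
            continue  # fall through to the next method in the chain
    return {
        "error": "Extraction failed",
        "tried": tried,
        "hint": "File may be corrupted. Try analyze_document_health.",
    }


# Stub extractors standing in for real library calls.
def failing(path: str) -> str:
    raise ExtractionError(path)


def working(path: str) -> str:
    return f"contents of {path}"


result = run_chain("report.docx", [("python-docx", failing), ("mammoth", working)])
```

Because the `tried` list is built as the loop runs, the error payload always reflects the methods that were actually attempted for that format.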

Resource System

For large documents, the resource system enables on-demand fetching:

┌─────────────────────────────────────────────────────────────┐
│                    index_document()                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Resource Store                            │
│                                                              │
│   doc_id: "abc123def456"                                     │
│                                                              │
│   chapters: [                                                │
│     {id: 1, uri: "chapter://abc123/1", content: "..."},     │
│     {id: 2, uri: "chapter://abc123/2", content: "..."},     │
│   ]                                                          │
│                                                              │
│   images: [                                                  │
│     {id: 0, uri: "image://abc123/0", data: <bytes>},        │
│   ]                                                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                 MCP Resource Protocol                        │
│                                                              │
│   Client requests: chapter://abc123/1.md                     │
│   Server returns:  Markdown-formatted Chapter 1              │
└─────────────────────────────────────────────────────────────┘

URI Schemes

chapter://doc_id/N — Single chapter
chapter://doc_id/N.md — Chapter as Markdown
chapter://doc_id/N.txt — Chapter as plain text
chapter://doc_id/N.html — Chapter as HTML
chapters://doc_id/N-M — Range of chapters
image://doc_id/N — Single image
sheet://doc_id/name — Excel sheet
slide://doc_id/N — PowerPoint slide
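
A parser for these schemes might look like the following. This is a sketch under assumptions: `ResourceRef` and `parse_resource_uri` are hypothetical names, and the rule that only `chapter://` URIs carry a format suffix is inferred from the list above.

```python
import re
from dataclasses import dataclass
from typing import Optional

_URI = re.compile(
    r"^(?P<scheme>chapters|chapter|image|sheet|slide)://"
    r"(?P<doc_id>[^/]+)/(?P<target>.+)$"
)
_CHAPTER_FORMATS = ("md", "txt", "html")


@dataclass(frozen=True)
class ResourceRef:
    scheme: str
    doc_id: str
    target: str                # chapter number, "N-M" range, sheet name, ...
    fmt: Optional[str] = None  # only set for chapter URIs with a suffix


def parse_resource_uri(uri: str) -> ResourceRef:
    match = _URI.match(uri)
    if match is None:
        raise ValueError(f"Unrecognized resource URI: {uri!r}")
    scheme, doc_id, target = match.group("scheme", "doc_id", "target")
    fmt = None
    if scheme == "chapter":
        # Split off an optional format suffix such as ".md".
        number, dot, suffix = target.partition(".")
        if dot:
            if suffix not in _CHAPTER_FORMATS:
                raise ValueError(f"Unsupported chapter format: {suffix!r}")
            target, fmt = number, suffix
    return ResourceRef(scheme, doc_id, target, fmt)
```

Sheet names are left unsplit on purpose, since a name such as `Q1.Sales` may legitimately contain a dot.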

URL Processing & Caching

Documents from URLs are downloaded and cached:

import aiohttp

CACHE_DURATION = 3600  # 1 hour, in seconds

async def resolve_file_path(path_or_url: str) -> str:
    if not path_or_url.startswith(('http://', 'https://')):
        return path_or_url  # Local file

    # Check cache
    cache_key = hash_url(path_or_url)
    cached = get_from_cache(cache_key)
    if cached and not expired(cached):
        return cached.path

    # Download; raise on HTTP errors so an error page is never cached as a document
    async with aiohttp.ClientSession() as session:
        async with session.get(path_or_url) as response:
            response.raise_for_status()
            content = await response.read()

    # Save to temp and cache
    temp_path = save_to_temp(content, extension_from_url(path_or_url))
    add_to_cache(cache_key, temp_path)

    return temp_path
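
The cache helpers used above are not shown in this document. A minimal in-memory version could look like the sketch below; the `CacheEntry` type, the SHA-256 key, and the module-level dictionary are all assumptions, and the real implementation may persist entries differently.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Optional

CACHE_DURATION = 3600  # seconds, matching the value above


@dataclass
class CacheEntry:
    path: str
    created_at: float = field(default_factory=time.time)


_cache: dict[str, CacheEntry] = {}


def hash_url(url: str) -> str:
    """Derive a stable cache key from the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def expired(entry: CacheEntry, now: Optional[float] = None) -> bool:
    """True once the entry is older than CACHE_DURATION seconds."""
    now = time.time() if now is None else now
    return now - entry.created_at > CACHE_DURATION


def add_to_cache(key: str, path: str) -> None:
    _cache[key] = CacheEntry(path)


def get_from_cache(key: str) -> Optional[CacheEntry]:
    return _cache.get(key)
```

Passing `now` explicitly keeps the expiry check testable without waiting an hour or patching the clock.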

Error Handling Philosophy

Every error must be:

Clear — What went wrong
Actionable — What can be done about it
Helpful — Suggestions for next steps

from typing import Optional

class ExtractionError(Exception):
    def __init__(self, message: str, hint: Optional[str] = None, tried: Optional[list] = None):
        super().__init__(message)  # so str(exc) and tracebacks show the message
        self.message = message
        self.hint = hint or "Check file integrity with analyze_document_health"
        self.tried = tried or []

    def to_dict(self) -> dict:
        return {
            "error": self.message,
            "hint": self.hint,
            "tried_methods": self.tried
        }

Performance Considerations

Token Limits

Large documents are paginated at ~25,000 tokens:

MAX_TOKENS = 25000
TOKENS_PER_CHAR = 0.25  # Rough estimate

def should_paginate(content: str) -> bool:
    estimated_tokens = len(content) * TOKENS_PER_CHAR
    return estimated_tokens > MAX_TOKENS
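
Once `should_paginate` returns true, the content has to be split somewhere. The sketch below shows one possible approach, reusing the same character-based estimate; the `paginate` function and its preference for paragraph boundaries are assumptions, not the documented behavior.

```python
MAX_TOKENS = 25000
TOKENS_PER_CHAR = 0.25  # same rough estimate as above


def paginate(content: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split content into pages that each fit the estimated token budget."""
    max_chars = max(1, int(max_tokens / TOKENS_PER_CHAR))
    pages: list[str] = []
    start = 0
    while start < len(content):
        end = min(start + max_chars, len(content))
        if end < len(content):
            # Prefer to break just after the last blank line in the window.
            boundary = content.rfind("\n\n", start, end)
            if boundary > start:
                end = boundary + 2
        pages.append(content[start:end])
        start = end
    return pages
```

Concatenating the pages reproduces the original content exactly, so no text is lost at page boundaries.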

Lazy Loading

Resources aren’t loaded until requested:

# index_document only creates references
index = await index_document("huge-book.docx")
# Returns immediately with URIs

# Content loaded only when requested
chapter_1 = await mcp_resources.read("chapter://abc123/1")
# Now the content is loaded
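
The deferred loading described here can be sketched with a small wrapper that stores a loader callback and runs it on first read. `LazyResource` and `load_chapter` are hypothetical names used only for illustration.

```python
from typing import Callable, Optional


class LazyResource:
    """Holds a URI and a loader; the loader runs on the first read only."""

    def __init__(self, uri: str, loader: Callable[[], str]) -> None:
        self.uri = uri
        self._loader = loader
        self._content: Optional[str] = None
        self.loaded = False

    def read(self) -> str:
        if not self.loaded:
            self._content = self._loader()
            self.loaded = True
        return self._content


calls: list[str] = []


def load_chapter() -> str:
    calls.append("load")  # record each time the expensive step runs
    return "Chapter 1 text"


# Creating the reference is cheap: nothing has been loaded yet.
chapter = LazyResource("chapter://abc123/1", load_chapter)
```

Indexing can then hand out many such references up front, and only the chapters a client actually reads pay the extraction cost.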

Extending mcwaddams

Want to add new functionality?

Add a New Tool

# In the appropriate mixin
@mcp.tool()
async def my_new_tool(file_path: str) -> dict:
    """Description for MCP discovery."""
    path = await resolve_file_path(file_path)
    validate_file(path)

    # Your logic here

    return {"result": "..."}

Add Format Support

# In format detection
FORMAT_HANDLERS = {
    '.docx': DocxHandler,
    '.doc': DocHandler,
    '.rtf': RtfHandler,  # Add new format
}
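
A registry like this is typically paired with a lookup that fails loudly for unknown extensions, in keeping with the "never silently fail" principle. The sketch below assumes handlers are plain classes keyed by lowercase extension; `handler_for` is a hypothetical helper and the handler bodies are empty placeholders.

```python
from pathlib import Path


class DocxHandler:
    """Placeholder for the real .docx handler."""


class RtfHandler:
    """Placeholder for a newly added .rtf handler."""


FORMAT_HANDLERS = {
    ".docx": DocxHandler,
    ".rtf": RtfHandler,
}


def handler_for(file_path: str) -> type:
    """Return the handler class registered for the file's extension."""
    ext = Path(file_path).suffix.lower()
    try:
        return FORMAT_HANDLERS[ext]
    except KeyError:
        supported = ", ".join(sorted(FORMAT_HANDLERS))
        raise ValueError(
            f"Unsupported format {ext!r}. Supported: {supported}"
        ) from None
```

Listing the supported extensions in the error message gives the caller something actionable rather than a bare failure.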

“What would you say… you do here?”

We extract documents. It’s pretty straightforward.
