Architecture
“So I was sitting in my cubicle today, and I realized… every document format is worse than the one before it.”
mcwaddams is designed around a single principle: never silently fail. When extraction works, you get content. When it doesn’t, you get a clear explanation why.
High-Level Overview
    ┌─────────────────────────────────────────────────────────────┐
    │                         MCP Client                          │
    │                (Claude Code, Claude Desktop)                │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                       FastMCP Server                        │
    │                                                             │
    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
    │  │  Universal  │  │    Word     │  │    Excel    │          │
    │  │    Mixin    │  │    Mixin    │  │    Mixin    │          │
    │  └─────────────┘  └─────────────┘  └─────────────┘          │
    │         │                │                │                 │
    │         └────────────────┼────────────────┘                 │
    │                          ▼                                  │
    │  ┌─────────────────────────────────────────────────────┐    │
    │  │             Format Detection & Routing              │    │
    │  └─────────────────────────────────────────────────────┘    │
    │                          │                                  │
    │         ┌────────────────┼────────────────┐                 │
    │         ▼                ▼                ▼                 │
    │   ┌───────────┐    ┌───────────┐    ┌───────────┐           │
    │   │  python-  │    │ openpyxl  │    │  python-  │           │
    │   │   docx    │    │  pandas   │    │   pptx    │           │
    │   └───────────┘    └───────────┘    └───────────┘           │
    │         │                │                │                 │
    │         └────────────────┼────────────────┘                 │
    │                          ▼                                  │
    │  ┌─────────────────────────────────────────────────────┐    │
    │  │                  Fallback Handlers                  │    │
    │  │      mammoth · olefile · xlrd · legacy parsers      │    │
    │  └─────────────────────────────────────────────────────┘    │
    └─────────────────────────────────────────────────────────────┘

The Mixin Pattern
mcwaddams uses mixins to organize tools by document type.
Why Mixins?
    class MCWaddamsServer(
        UniversalMixin,  # extract_text, extract_images, etc.
        WordMixin,       # convert_to_markdown, extract_tables, etc.
        ExcelMixin,      # analyze_data, extract_formulas, etc.
    ):
        """The main server combines all capabilities."""
        pass

Benefits:
- Separation of concerns — Word tools don’t clutter Excel code
- Independent testing — Each mixin can be tested in isolation
- Easy extension — Add PowerPoint tools without touching existing code
- Clear ownership — Easy to find where a tool is defined
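The independent-testing benefit can be illustrated with a minimal sketch in plain Python. Only the mixin class names come from the example above; the tool bodies, the `WordOnlyServer` harness, and the `tool_names` helper are hypothetical stand-ins, not mcwaddams code.

```python
class UniversalMixin:
    """Hypothetical stand-in: tools that apply to every format."""

    def extract_text(self, file_path: str) -> dict:
        return {"tool": "extract_text", "file": file_path}


class WordMixin:
    """Hypothetical stand-in: Word-specific tools."""

    def convert_to_markdown(self, file_path: str) -> dict:
        return {"tool": "convert_to_markdown", "file": file_path}


class ExcelMixin:
    """Hypothetical stand-in: Excel-specific tools."""

    def analyze_data(self, file_path: str) -> dict:
        return {"tool": "analyze_data", "file": file_path}


class MCWaddamsServer(UniversalMixin, WordMixin, ExcelMixin):
    """Full server: inherits every tool from every mixin."""


class WordOnlyServer(WordMixin):
    """Test harness: exposes only the Word tools."""


def tool_names(cls: type) -> list:
    """List the public attributes a class exposes, sorted by name."""
    return sorted(name for name in dir(cls) if not name.startswith("_"))
```

A test can instantiate `WordOnlyServer` and call `convert_to_markdown` without importing any Excel dependency, while `MCWaddamsServer` still exposes all three tools.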
Format Detection
Every request starts with format detection:

    from pathlib import Path

    async def detect_format(file_path: str) -> FormatInfo:
        # 1. Check file extension
        ext = Path(file_path).suffix.lower()

        # 2. Validate file exists and is readable
        validate_file(file_path)

        # 3. Detect actual format (extension might lie)
        with open(file_path, 'rb') as f:
            magic_bytes = f.read(8)

        # 4. Check for encryption
        is_encrypted = check_encryption(file_path)

        return FormatInfo(
            extension=ext,
            actual_format=detect_from_magic(magic_bytes),
            is_legacy=(ext in ['.doc', '.xls', '.ppt']),
            is_encrypted=is_encrypted
        )

Magic Bytes
We don’t trust file extensions alone:

| Magic Bytes | Format |
|---|---|
| 50 4B 03 04 | ZIP-based (OOXML: docx/xlsx/pptx) |
| D0 CF 11 E0 | OLE Compound (doc/xls/ppt) |
| EF BB BF or starts with text | CSV/plain text |
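The table above could be turned into a lookup along these lines. This is a sketch under assumptions, not the project's `detect_from_magic`: the returned labels (`"ooxml"`, `"ole"`, `"text"`, `"unknown"`) and the "starts with text" heuristic are invented for illustration.

```python
ZIP_MAGIC = bytes.fromhex("504B0304")  # ZIP local file header (OOXML containers)
OLE_MAGIC = bytes.fromhex("D0CF11E0")  # start of the OLE Compound File signature
UTF8_BOM = bytes.fromhex("EFBBBF")     # UTF-8 byte order mark


def detect_from_magic(magic_bytes: bytes) -> str:
    """Classify a file by its first few bytes (labels are assumed names)."""
    if magic_bytes.startswith(ZIP_MAGIC):
        return "ooxml"
    if magic_bytes.startswith(OLE_MAGIC):
        return "ole"
    if magic_bytes.startswith(UTF8_BOM):
        return "text"

    # Rough "starts with text" heuristic: the bytes decode as UTF-8 and
    # contain only printable characters or common whitespace. A multi-byte
    # character cut off at the read boundary is reported as "unknown".
    try:
        head = magic_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    if head and all(ch.isprintable() or ch in "\r\n\t" for ch in head):
        return "text"
    return "unknown"
```

Because the check looks only at content, a `.docx` file that is really an OLE document would be reported as `"ole"` whatever its extension says.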
The Fallback Strategy
Primary methods are tried first, then fallbacks:

    async def extract_text(file_path: str, method: str = "auto"):
        if method == "auto":
            # Try primary method
            try:
                return await primary_extraction(file_path)
            except ExtractionError:
                pass

            # Try fallback
            try:
                return await fallback_extraction(file_path)
            except ExtractionError:
                pass

            # All methods failed: return a helpful error
            return {
                "error": "Extraction failed",
                "tried": ["python-docx", "mammoth"],
                "hint": "File may be corrupted. Try analyze_document_health."
            }

Fallback Chain
| Format | Primary | Fallback 1 | Fallback 2 |
|---|---|---|---|
| .docx | python-docx | mammoth | - |
| .doc | olefile | textract* | raw OLE |
| .xlsx | openpyxl | pandas | - |
| .xls | xlrd | pandas | olefile |
| .pptx | python-pptx | - | - |
| .csv | pandas | built-in csv | - |
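A chain like the rows above can be driven by a single loop that records every method it tried. The sketch below assumes each extractor is an async callable that raises `ExtractionError` on failure; `run_chain`, `failing`, and `working` are hypothetical names, and `ExtractionError` is redefined locally only so the example is self-contained.

```python
import asyncio
from typing import Awaitable, Callable


class ExtractionError(Exception):
    """Local stand-in for the project's error type."""


Extractor = Callable[[str], Awaitable[str]]


async def run_chain(file_path: str, chain: list) -> dict:
    """Try each (name, extractor) pair in order; report what was tried."""
    tried = []
    for name, extractor in chain:
        tried.append(name)
        try:
            text = await extractor(file_path)
        except ExtractionError:
            continue  # Move on to the next method in the chain
        return {"text": text, "method": name, "tried": tried}

    # Every method failed: return an explanation instead of failing silently
    return {
        "error": "Extraction failed",
        "tried": tried,
        "hint": "File may be corrupted. Try analyze_document_health.",
    }


async def failing(file_path: str) -> str:
    """Hypothetical extractor that always fails."""
    raise ExtractionError("primary failed")


async def working(file_path: str) -> str:
    """Hypothetical extractor that always succeeds."""
    return f"text from {file_path}"


result = asyncio.run(
    run_chain("report.docx", [("python-docx", failing), ("mammoth", working)])
)
```

Building the `tried` list as the loop runs keeps the error message accurate for every format, rather than hard-coding the method names per file type.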
Resource System
For large documents, the resource system enables on-demand fetching:

    ┌─────────────────────────────────────────────────────────────┐
    │                      index_document()                       │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                       Resource Store                        │
    │                                                             │
    │  doc_id: "abc123def456"                                     │
    │                                                             │
    │  chapters: [                                                │
    │    {id: 1, uri: "chapter://abc123/1", content: "..."},      │
    │    {id: 2, uri: "chapter://abc123/2", content: "..."},      │
    │  ]                                                          │
    │                                                             │
    │  images: [                                                  │
    │    {id: 0, uri: "image://abc123/0", data: <bytes>},         │
    │  ]                                                          │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                    MCP Resource Protocol                    │
    │                                                             │
    │  Client requests: chapter://abc123/1.md                     │
    │  Server returns: Markdown-formatted Chapter 1               │
    └─────────────────────────────────────────────────────────────┘

URI Schemes
- chapter://doc_id/N: Single chapter
- chapter://doc_id/N.md: Chapter as Markdown
- chapter://doc_id/N.txt: Chapter as plain text
- chapter://doc_id/N.html: Chapter as HTML
- chapters://doc_id/N-M: Range of chapters
- image://doc_id/N: Single image
- sheet://doc_id/name: Excel sheet
- slide://doc_id/N: PowerPoint slide
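One way to split these URIs into their parts is sketched below. The `ResourceRef` dataclass and `parse_resource_uri` function are hypothetical names chosen for illustration; how mcwaddams actually parses resource URIs is not shown in this document.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Schemes listed above; anything else is rejected rather than guessed at.
_URI = re.compile(
    r"^(?P<scheme>chapter|chapters|image|sheet|slide)://"
    r"(?P<doc_id>[^/]+)/(?P<target>.+)$"
)
_CHAPTER_FORMATS = {"md", "txt", "html"}


@dataclass(frozen=True)
class ResourceRef:
    scheme: str
    doc_id: str
    target: str                # chapter/image/slide number, range, or sheet name
    fmt: Optional[str] = None  # output format suffix, chapters only


def parse_resource_uri(uri: str) -> ResourceRef:
    """Split a resource URI into scheme, document id, target and format."""
    match = _URI.match(uri)
    if match is None:
        raise ValueError(f"Unrecognized resource URI: {uri!r}")
    scheme, doc_id, target = match.group("scheme", "doc_id", "target")

    fmt = None
    if scheme == "chapter" and "." in target:
        # Peel off a known format suffix such as ".md"
        stem, _, suffix = target.rpartition(".")
        if suffix in _CHAPTER_FORMATS:
            target, fmt = stem, suffix
    return ResourceRef(scheme, doc_id, target, fmt)
```

Raising `ValueError` on an unknown scheme keeps the parser in line with the rule that nothing fails silently.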
URL Processing & Caching
Documents from URLs are downloaded and cached:

    import aiohttp

    CACHE_DURATION = 3600  # 1 hour

    async def resolve_file_path(path_or_url: str) -> str:
        if not path_or_url.startswith(('http://', 'https://')):
            return path_or_url  # Local file

        # Check cache
        cache_key = hash_url(path_or_url)
        cached = get_from_cache(cache_key)
        if cached and not expired(cached):
            return cached.path

        # Download
        async with aiohttp.ClientSession() as session:
            async with session.get(path_or_url) as response:
                response.raise_for_status()  # Don't cache an HTTP error page
                content = await response.read()

        # Save to temp and cache
        temp_path = save_to_temp(content, extension_from_url(path_or_url))
        add_to_cache(cache_key, temp_path)

        return temp_path

Error Handling Philosophy
Every error must be:
- Clear — What went wrong
- Actionable — What can be done about it
- Helpful — Suggestions for next steps

    from typing import Optional

    class ExtractionError(Exception):
        def __init__(self, message: str, hint: Optional[str] = None,
                     tried: Optional[list] = None):
            super().__init__(message)  # So str(error) shows the message
            self.message = message
            self.hint = hint or "Check file integrity with analyze_document_health"
            self.tried = tried or []

        def to_dict(self):
            return {
                "error": self.message,
                "hint": self.hint,
                "tried_methods": self.tried
            }

Performance Considerations
Token Limits
Large documents are paginated at ~25,000 tokens:

    MAX_TOKENS = 25000
    TOKENS_PER_CHAR = 0.25  # Rough estimate

    def should_paginate(content: str) -> bool:
        estimated_tokens = len(content) * TOKENS_PER_CHAR
        return estimated_tokens > MAX_TOKENS

Lazy Loading
Resources aren’t loaded until requested:

    # index_document only creates references
    index = await index_document("huge-book.docx")
    # Returns immediately with URIs

    # Content loaded only when requested
    chapter_1 = await mcp_resources.read("chapter://abc123/1")
    # Now the content is loaded

Extending mcwaddams
Want to add new functionality?
Add a New Tool

    # In the appropriate mixin
    @mcp.tool()
    async def my_new_tool(file_path: str) -> dict:
        """Description for MCP discovery."""
        path = await resolve_file_path(file_path)
        validate_file(path)

        # Your logic here

        return {"result": "..."}

Add Format Support

    # In format detection
    FORMAT_HANDLERS = {
        '.docx': DocxHandler,
        '.doc': DocHandler,
        '.rtf': RtfHandler,  # Add new format
    }

“What would you say… you do here?”
We extract documents. It’s pretty straightforward.