“I’m gonna need you to come in on Saturday… and Sunday too.”
We grabbed 301 random Office documents from a real filesystem — no cherry-picking, no sanitizing, just raw production files from someone’s decade-old archive.
Then we threw them at mcwaddams.
These weren’t test fixtures. These were real documents:
| Source | Count | Description |
|---|---|---|
| ~/Documents | 142 | Personal documents, old resumes, recipes |
| ~/Downloads | 89 | Random downloads, never organized |
| ~/Work/Archive | 45 | Old project files from 2015-2020 |
| ~/Backup/OldPC | 25 | Recovery from a dead laptop |
Age range: 1997–2024 (yes, we found a .doc from the Windows 98 era)
.docx
142 tested → 142 passed
100% success rate
.xlsx
89 tested → 89 passed
100% success rate
.pptx
34 tested → 34 passed
100% success rate
.doc (legacy)
18 tested → 17 passed
94.4% (1 was 0 bytes)
| Format | Tested | Passed | Failed | Success Rate |
|---|---|---|---|---|
| .docx | 142 | 142 | 0 | ✅ 100% |
| .xlsx | 89 | 89 | 0 | ✅ 100% |
| .pptx | 34 | 34 | 0 | ✅ 100% |
| .doc | 18 | 17 | 1* | 94.4% |
| .xls | 12 | 12 | 0 | ✅ 100% |
| .ppt | 4 | 4 | 0 | ✅ 100% |
| .csv | 2 | 1 | 1* | 50% |
| Total | 301 | 299 | 2 | 99.3% |
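The success rates follow directly from the tested and passed counts. As a quick sanity check, this sketch recomputes them; the `results` dictionary is just the table transcribed by hand, not output produced by mcwaddams.

```python
# Tested/passed counts transcribed by hand from the results table.
results = {
    ".docx": (142, 142),
    ".xlsx": (89, 89),
    ".pptx": (34, 34),
    ".doc": (18, 17),
    ".xls": (12, 12),
    ".ppt": (4, 4),
    ".csv": (2, 1),
}


def success_rate(tested: int, passed: int) -> float:
    """Return the pass percentage, rounded to one decimal place."""
    return round(100 * passed / tested, 1)


total_tested = sum(tested for tested, _ in results.values())
total_passed = sum(passed for _, passed in results.values())

for fmt, (tested, passed) in results.items():
    print(f"{fmt:6} {passed}/{tested} = {success_rate(tested, passed)}%")
print(f"Total  {total_passed}/{total_tested} = "
      f"{success_rate(total_tested, total_passed)}%")
```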
Across all 301 documents:
Let’s be honest about what failed and why.
quarterly_budget.doc
Size: 0 bytes
Error: File is empty

Literally an empty file. Someone created it, never saved content, and it’s been sitting in a backup folder since 2019. Not an extraction failure: there’s nothing to extract.

data_export.csv
Size: 156 bytes
Content: Headers only, no data rows
Error: CSV has no data rows

A CSV with column headers but zero actual data. The export script ran but produced no rows. Again, not mcwaddams’s fault: the file is technically valid but useless.
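Both failures are properties of the input rather than of the extractor, so they can be spotted before extraction even starts. Here is a minimal sketch of such a pre-flight check using only the standard library. The `triage` helper and its return labels are hypothetical and are not part of mcwaddams.

```python
import csv
from pathlib import Path


def triage(path: Path) -> str:
    """Classify a file before extraction (hypothetical helper)."""
    if path.stat().st_size == 0:
        return "empty"  # e.g. a 0-byte .doc: nothing to extract
    if path.suffix.lower() == ".csv":
        with path.open(newline="", encoding="utf-8", errors="replace") as fh:
            # Skip blank lines so a trailing newline does not count as a row.
            rows = [row for row in csv.reader(fh) if row]
        if len(rows) <= 1:
            return "headers-only"  # a header row at most, no data rows
    return "ok"
```

Running it over a corpus first separates "nothing to extract" from real extraction failures in the final tally.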
These are the documents that should have broken things but didn’t:
A .doc file from 2003 containing:
Result: Extracted perfectly. The OLE parser handled legacy encoding gracefully.
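For context, legacy `.doc`, `.xls`, and `.ppt` files are OLE compound files, while the newer `.docx`, `.xlsx`, and `.pptx` formats are ZIP packages. The two can be told apart by their first bytes regardless of file extension. This sketch only illustrates that distinction; `sniff_container` is a hypothetical name, and nothing here is claimed to be how mcwaddams picks its parser.

```python
# Well-known file signatures (magic bytes).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file: .doc/.xls/.ppt
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header: .docx/.xlsx/.pptx


def sniff_container(header: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if header.startswith(OLE_MAGIC):
        return "ole"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file to check its signature."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```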
A .docx with:
Result: All 47 images extracted. Embedded Office docs indexed as resources.
A .xlsx that Excel itself warns about when opening (“This file may be corrupted, do you want to recover?”).
Result: mcwaddams extracted all 12 sheets without errors. The corruption was in styles, not data.
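Why can damaged styles leave the data readable? An `.xlsx` file is a ZIP package in which styling and worksheet data are stored as separate parts, conventionally `xl/styles.xml` and `xl/worksheets/sheetN.xml`. The sketch below lists those parts for a toy in-memory package. It illustrates the file layout only and makes no claim about mcwaddams's internals; `list_parts` is a hypothetical helper.

```python
import io
import zipfile


def list_parts(xlsx_bytes: bytes) -> dict[str, list[str]]:
    """Group the parts of an .xlsx package into worksheet data and styling."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as zf:
        names = zf.namelist()
    return {
        "sheets": sorted(
            n for n in names
            if n.startswith("xl/worksheets/") and n.endswith(".xml")
        ),
        "styles": [n for n in names if n == "xl/styles.xml"],
    }


# Build a toy package in memory to stand in for a real workbook.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/styles.xml", "<styleSheet/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

parts = list_parts(buf.getvalue())
print(parts)
```

Because the sheet parts are separate ZIP entries, a reader that only needs cell contents may never have to parse the damaged styles part.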
A .docx with editing protection (can view, can’t edit).
Result: Full extraction. Edit protection doesn’t affect reading.
    #!/bin/bash
    # The torture test script

    find /test-corpus -type f \( \
        -iname "*.docx" -o \
        -iname "*.doc" -o \
        -iname "*.xlsx" -o \
        -iname "*.xls" -o \
        -iname "*.pptx" -o \
        -iname "*.ppt" -o \
        -iname "*.csv" \
    \) -print0 | while IFS= read -r -d '' file; do
        echo "Testing: $file"
        # Pass the path as an argument (sys.argv[1]) so quotes in a
        # filename cannot break the Python snippet.
        uv run python -c "
    import asyncio
    import sys
    from mcwaddams.server import extract_text

    async def test():
        result = await extract_text(sys.argv[1])
        if 'error' in result:
            print(f'FAIL: {result[\"error\"]}')
            return False
        print(f'OK: {len(result.get(\"text\", \"\"))} chars')
        return True

    asyncio.run(test())
    " "$file"
    done

“We didn’t cherry-pick our test data. We went to the basement and grabbed everything.”
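The shell loop prints one OK or FAIL line per file. To get per-format totals like the results table, the same loop can be sketched in a single Python process. This version assumes only what the script above shows: the extractor is an async callable that takes a path and returns a dict with an `error` key on failure. The `tally` function and the `fake_extract` stand-in are hypothetical; for a real run, pass in `extract_text` from `mcwaddams.server` instead.

```python
import asyncio
from collections import Counter
from pathlib import Path
from typing import Awaitable, Callable, Iterable

Extractor = Callable[[str], Awaitable[dict]]


async def tally(paths: Iterable[Path], extract: Extractor) -> dict[str, Counter]:
    """Run `extract` on each path and count passes/failures per extension."""
    stats: dict[str, Counter] = {}
    for path in paths:
        bucket = stats.setdefault(path.suffix.lower(), Counter())
        try:
            result = await extract(str(path))
            ok = "error" not in result
        except Exception:
            ok = False  # a crash counts as a failure too
        bucket["passed" if ok else "failed"] += 1
    return stats


# Stand-in extractor for demonstration purposes only.
async def fake_extract(path: str) -> dict:
    if path.endswith("empty.doc"):
        return {"error": "File is empty"}
    return {"text": "hello"}


demo = asyncio.run(tally([Path("a.docx"), Path("empty.doc")], fake_extract))
for ext, counts in demo.items():
    print(ext, dict(counts))
```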
Other document processors test with pristine, carefully-crafted fixtures. We test with the crusty .doc files from your 2008 backup drive.
Because that’s what you’re actually going to throw at us.
Got a folder of questionable Office documents? Run them through:
    # Clone and install
    git clone https://github.com/ryanmalloy/mcwaddams.git
    cd mcwaddams
    uv sync --dev

    # Run against your corpus
    # The path is passed as an argument (sys.argv[1]) so quotes in a
    # filename cannot break the Python snippet.
    find /your/folder -name "*.docx" -exec \
        uv run python -c "
    import asyncio
    import sys
    from mcwaddams.server import extract_text
    result = asyncio.run(extract_text(sys.argv[1]))
    print('OK' if 'text' in result else 'FAIL')
    " {} \;

“I could set the building on fire…”
But we’d rather just process your documents.