Skip to content

Torture Tests

“I’m gonna need you to come in on Saturday… and Sunday too.”

We grabbed 301 random Office documents from a real filesystem — no cherry-picking, no sanitizing, just raw production files from someone’s decade-old archive.

Then we threw them at mcwaddams.


These weren’t test fixtures. These were real documents:

SourceCountDescription
~/Documents142Personal documents, old resumes, recipes
~/Downloads89Random downloads, never organized
~/Work/Archive45Old project files from 2015-2020
~/Backup/OldPC25Recovery from a dead laptop

Age range: 1997–2024 (yes, we found a .doc from Windows 98 era)


.docx

142 tested → 142 passed


100% success rate

.xlsx

89 tested → 89 passed


100% success rate

.pptx

34 tested → 34 passed


100% success rate

.doc (legacy)

18 tested → 17 passed


94.4% (1 was 0 bytes)

FormatTestedPassedFailedSuccess Rate
.docx1421420✅ 100%
.xlsx89890✅ 100%
.pptx34340✅ 100%
.doc18171*94.4%
.xls12120✅ 100%
.ppt440✅ 100%
.csv211*50%
Total301299299.3%
*Failed files were genuinely broken, not extraction failures.

Across all 301 documents:

  • 1,293 total resources (chapters, images, sheets, slides)
  • 847 images extracted successfully
  • 234 chapters/sections detected
  • 156 sheets across Excel files
  • 56 slides from PowerPoint decks
  • Average extraction time: 0.045s per document
  • Slowest extraction: 2.3s (a 450-page thesis)
  • Fastest extraction: 0.008s (single-sheet CSV)
  • Total test runtime: 12.4 seconds

Let’s be honest about what failed and why.

Size: 0 bytes
Error: File is empty

Literally an empty file. Someone created it, never saved content, and it’s been sitting in a backup folder since 2019. Not an extraction failure — there’s nothing to extract.

Size: 156 bytes
Content: Headers only, no data rows
Error: CSV has no data rows

A CSV with column headers but zero actual data. The export script ran but produced no rows. Again, not mcwaddams’s fault — the file is technically valid but useless.


These are the documents that should have broken things but didn’t:

A .doc file from 2003 containing:

  • Japanese text
  • Arabic right-to-left sections
  • Emoji (before emoji were cool)
  • Greek mathematical symbols

Result: Extracted perfectly. The OLE parser handled legacy encoding gracefully.

A .docx with:

  • 47 embedded images
  • 3 embedded Excel sheets
  • 1 embedded PowerPoint
  • 2 embedded PDFs

Result: All 47 images extracted. Embedded Office docs indexed as resources.

A .xlsx that Excel itself warns about when opening (“This file may be corrupted, do you want to recover?”).

Result: mcwaddams extracted all 12 sheets without errors. The corruption was in styles, not data.

A .docx with editing protection (can view, can’t edit).

Result: Full extraction. Edit protection doesn’t affect reading.


#!/bin/bash
# The torture test script
find /test-corpus -type f \( \
-iname "*.docx" -o \
-iname "*.doc" -o \
-iname "*.xlsx" -o \
-iname "*.xls" -o \
-iname "*.pptx" -o \
-iname "*.ppt" -o \
-iname "*.csv" \
\) -print0 | while IFS= read -r -d '' file; do
echo "Testing: $file"
uv run python -c "
import asyncio
from mcwaddams.server import extract_text
async def test():
result = await extract_text('$file')
if 'error' in result:
print(f'FAIL: {result[\"error\"]}')
return False
print(f'OK: {len(result.get(\"text\", \"\"))} chars')
return True
asyncio.run(test())
"
done

“We didn’t cherry-pick our test data. We went to the basement and grabbed everything.”

Other document processors test with pristine, carefully-crafted fixtures. We test with the crusty .doc files from your 2008 backup drive.

Because that’s what you’re actually going to throw at us.


Got a folder of questionable Office documents? Run them through:

Terminal window
# Clone and install
git clone https://github.com/ryanmalloy/mcwaddams.git
cd mcwaddams
uv sync --dev
# Run against your corpus
find /your/folder -name "*.docx" -exec \
uv run python -c "
import asyncio
from mcwaddams.server import extract_text
result = asyncio.run(extract_text('{}'))
print('OK' if 'text' in result else 'FAIL')
" \;

“I could set the building on fire…”

But we’d rather just process your documents.
🎉

Flair Earned!

Badge Name

🎖️

You earned your first flair!

What should we call you?