Extract Tables from Word Documents
“We need to talk about your table extraction…”
Word documents often contain important data locked in tables. mcwaddams extracts them as structured data you can actually use.
The Problem
You have a Word document with a table like this:
| Product | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Widget A | 1,234 | 1,456 | 1,678 | 1,890 |
| Widget B | 987 | 1,012 | 1,045 | 1,089 |
| Widget C | 456 | 489 | 512 | 534 |
You need that data in a format you can process, not trapped in a binary .docx file.
The Solution
    Extract tables from quarterly-report.docx as markdown

Result:

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|-----|-----|-----|-----|
| Widget A | 1,234 | 1,456 | 1,678 | 1,890 |
| Widget B | 987 | 1,012 | 1,045 | 1,089 |
| Widget C | 456 | 489 | 512 | 534 |

Output Formats
    Extract tables from report.docx as markdown

Perfect for documentation, README files, or pasting into other documents.
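If the markdown output later needs to go back into code, a pipe table is simple to split apart. The sketch below is a minimal illustration, not part of mcwaddams: `parse_markdown_table` is a hypothetical helper, and it assumes a simple table with one header row, one separator row, and no escaped `|` characters inside cells.

```python
def parse_markdown_table(text):
    """Split a simple pipe table into (headers, rows). Hypothetical helper."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    cells = []
    for line in lines:
        # Drop one leading and one trailing pipe, then split on the inner ones.
        inner = line.removeprefix("|").removesuffix("|")
        cells.append([c.strip() for c in inner.split("|")])
    headers, body = cells[0], cells[2:]  # cells[1] is the |---|---| separator
    return headers, body


sample = """
| Product | Q1 | Q2 |
|---------|-----|-----|
| Widget A | 1,234 | 1,456 |
| Widget B | 987 | 1,012 |
"""

headers, rows = parse_markdown_table(sample)
print(headers)
print(rows[0])
```

Cell values stay as strings here, so thousands separators survive exactly as they appear in the table. `str.removeprefix` and `str.removesuffix` need Python 3.9 or later.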
    Extract tables from report.docx as JSON

Returns structured data:

    {
      "tables": [
        {
          "index": 0,
          "headers": ["Product", "Q1", "Q2", "Q3", "Q4"],
          "rows": [
            ["Widget A", "1,234", "1,456", "1,678", "1,890"],
            ["Widget B", "987", "1,012", "1,045", "1,089"]
          ]
        }
      ]
    }

    Extract tables from report.docx as CSV

Ready for spreadsheet import:

    Product,Q1,Q2,Q3,Q4
    Widget A,1234,1456,1678,1890
    Widget B,987,1012,1045,1089

    Extract tables from report.docx

Full metadata with styling information:

    {
      "tables": [
        {
          "index": 0,
          "position": "after paragraph 12",
          "headers": ["Product", "Q1", "Q2", "Q3", "Q4"],
          "rows": [...],
          "styling": {
            "has_header_row": true,
            "borders": "all",
            "alignment": "left"
          }
        }
      ]
    }

Handling Complex Tables
Merged Cells
Tables with merged cells are normalized:

    Extract tables, preserve merged cells

The tool maintains cell relationships while making the data usable.
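This page does not spell out how normalization works. One common approach, shown here purely as an illustration and not as mcwaddams's actual behavior, is to repeat a merged cell's value across every grid position it covers, so that each row ends up with the same number of columns. The input shape (`text`, `colspan`, `rowspan` keys) and the `expand_merged` helper are assumptions made up for this sketch.

```python
def expand_merged(rows):
    """Expand cells with colspan/rowspan into a rectangular grid of strings.

    `rows` is a list of rows; each cell is a dict with "text" and optional
    "colspan"/"rowspan" keys (hypothetical shape, for illustration only).
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(cell.get("rowspan", 1)):
                for dc in range(cell.get("colspan", 1)):
                    grid[(r + dr, c + dc)] = cell["text"]
            c += cell.get("colspan", 1)
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell become empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


merged = [
    [{"text": "Region", "rowspan": 2}, {"text": "2024", "colspan": 2}],
    [{"text": "H1"}, {"text": "H2"}],
    [{"text": "North"}, {"text": "10"}, {"text": "12"}],
]
for row in expand_merged(merged):
    print(row)
```

Repeating values keeps every row the same width, which is convenient for CSV or DataFrame loading, at the cost of no longer recording which cells were originally merged.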
Nested Tables
Tables within tables? We handle those too:

    {
      "tables": [
        {
          "index": 0,
          "nested_tables": [
            {
              "cell": [1, 2],
              "data": [...]
            }
          ]
        }
      ]
    }

Large Tables
Tables spanning multiple pages are extracted completely, with no truncation.
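For the nested-table shape shown under Nested Tables, a short loop can list where each inner table sits. This is a sketch under assumptions: the example output elides the contents of `data`, so the sample below treats it as a list of rows and treats `cell` as a `[row, column]` pair. Neither reading is confirmed, and `list_nested` is a hypothetical helper.

```python
import json

# Hypothetical result: the "data" contents are invented for illustration.
result = """
{
  "tables": [
    {
      "index": 0,
      "nested_tables": [
        {"cell": [1, 2], "data": [["Part", "Qty"], ["Bolt", "4"]]}
      ]
    }
  ]
}
"""


def list_nested(result_text):
    """Return (outer table index, cell position, nested data) per nested table."""
    found = []
    for table in json.loads(result_text)["tables"]:
        # .get() keeps tables without nested content from raising KeyError.
        for nested in table.get("nested_tables", []):
            found.append((table["index"], tuple(nested["cell"]), nested["data"]))
    return found


for outer, cell, data in list_nested(result):
    print(f"table {outer}, cell {cell}: {len(data)} rows")
```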
Multiple Tables
Documents with several tables return all of them:

    Extract all tables from multi-section-report.docx

    {
      "tables": [
        {"index": 0, "context": "Executive Summary", "rows": 5},
        {"index": 1, "context": "Financial Data", "rows": 24},
        {"index": 2, "context": "Appendix A", "rows": 100}
      ],
      "total_tables": 3
    }

Specific Table Selection
Extract only the table you need:

    Extract the second table from report.docx

Or by context:

    Extract the table under "Financial Summary"

Common Issues
Empty Cells
Empty cells are preserved as empty strings, not null:

    {
      "rows": [
        ["Widget A", "", "1,456", "1,678", "1,890"]
      ]
    }

Formatting in Cells
Bold, italic, and other formatting is stripped for clean data. If you need styled output, use markdown format.
Numeric Detection
Numbers remain as strings to preserve formatting (commas, currency symbols). Convert as needed:

    import json

    # `result` holds the JSON text returned by the extraction
    data = json.loads(result)

    # Convert the Q1 column (index 1) from strings to numbers.
    # Note: int() raises ValueError on empty cells, so skip or default those.
    for row in data["tables"][0]["rows"]:
        row[1] = int(row[1].replace(",", ""))

Legacy Documents
Tables in .doc files work the same way:

    Extract tables from old-report.doc as JSON

The OLE parser handles legacy table structures automatically.
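Returning to the empty-cell and numeric-detection notes earlier in this section: because empty cells arrive as `""` and `int("")` raises `ValueError`, a conversion pass needs to handle blanks explicitly. The following is a minimal sketch, assuming the JSON shape shown under Output Formats (`headers` plus `rows` of strings). `to_number` and `table_to_records` are hypothetical helper names, and the currency handling covers only a leading `$`, `€`, or `£`.

```python
import json


def to_number(cell):
    """Convert "1,234" or "$1,234.50" to a number; return None for empty cells.

    Cells that do not parse as numbers (such as "Widget A") are returned unchanged.
    """
    text = cell.strip()
    if text == "":
        return None
    cleaned = text.lstrip("$€£").replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        try:
            return float(cleaned)
        except ValueError:
            return cell


def table_to_records(table):
    """Pair each row with the headers, converting numeric-looking cells."""
    return [
        dict(zip(table["headers"], (to_number(c) for c in row)))
        for row in table["rows"]
    ]


# Hypothetical extraction result, built inline so the example is self-contained.
result = json.dumps({
    "tables": [{
        "index": 0,
        "headers": ["Product", "Q1", "Q2"],
        "rows": [["Widget A", "", "1,456"], ["Widget B", "$987.50", "1,012"]],
    }]
})

records = table_to_records(json.loads(result)["tables"][0])
print(records)
```

Mapping blanks to `None` rather than `0` keeps "no value" distinct from a real zero, which matters when averaging or summing a column.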
Next Steps
- Analyze Excel Data: Statistical analysis of spreadsheets
- Convert to Markdown: Full document conversion
- All Tools Reference: Complete documentation
“PC Load Letter? What does that mean?”
We don’t know either, but at least tables make sense now.