Extract Tables from Word Documents

“We need to talk about your table extraction…”

Word documents often contain important data locked in tables. mcwaddams extracts them as structured data you can actually use.

The Problem

You have a Word document with a table like this:

Product	Q1	Q2	Q3	Q4
Widget A	1,234	1,456	1,678	1,890
Widget B	987	1,012	1,045	1,089
Widget C	456	489	512	534

You need that data in a format you can process, not trapped in a binary .docx file.

The Solution

Extract tables from quarterly-report.docx as markdown

Result:

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|-----|-----|-----|-----|
| Widget A | 1,234 | 1,456 | 1,678 | 1,890 |
| Widget B | 987 | 1,012 | 1,045 | 1,089 |
| Widget C | 456 | 489 | 512 | 534 |

Output Formats

Extract tables from report.docx as markdown

Perfect for documentation, README files, or pasting into other documents.

Extract tables from report.docx as JSON

Returns structured data:

{
  "tables": [
    {
      "index": 0,
      "headers": ["Product", "Q1", "Q2", "Q3", "Q4"],
      "rows": [
        ["Widget A", "1,234", "1,456", "1,678", "1,890"],
        ["Widget B", "987", "1,012", "1,045", "1,089"]
      ]
    }
  ]
}
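Once you have the JSON, pairing each row with the headers gives you records keyed by column name. The following is a minimal sketch that assumes the output arrives as a JSON string in the shape shown above; `records_from_table` is a hypothetical helper for illustration, not part of mcwaddams.

```python
import json

# Sample payload in the shape shown above; in practice this would be the tool's JSON output.
result = """
{
  "tables": [
    {
      "index": 0,
      "headers": ["Product", "Q1", "Q2", "Q3", "Q4"],
      "rows": [
        ["Widget A", "1,234", "1,456", "1,678", "1,890"],
        ["Widget B", "987", "1,012", "1,045", "1,089"]
      ]
    }
  ]
}
"""


def records_from_table(table):
    """Pair each row with the table's headers (hypothetical helper)."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


data = json.loads(result)
records = records_from_table(data["tables"][0])

# Values are still strings at this point, exactly as extracted.
print(records[0]["Product"], records[0]["Q1"])
```

Keeping the values as strings here is deliberate: number conversion is a separate step, so the original formatting is never lost by accident.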

Extract tables from report.docx as CSV

Ready for spreadsheet import:

Product,Q1,Q2,Q3,Q4
Widget A,1234,1456,1678,1890
Widget B,987,1012,1045,1089
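If you would rather stay in Python than open a spreadsheet, the standard `csv` module reads this output directly. This sketch assumes the CSV text looks like the sample above (a `Product` column followed by plain integer columns); `load_quarters` is a hypothetical helper name.

```python
import csv
import io

# CSV text in the shape shown above; in practice this would be the tool's CSV output.
csv_text = """Product,Q1,Q2,Q3,Q4
Widget A,1234,1456,1678,1890
Widget B,987,1012,1045,1089
"""


def load_quarters(text):
    """Parse the CSV into dicts, converting every non-Product column to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        product = row.pop("Product")
        rows.append({"Product": product, **{k: int(v) for k, v in row.items()}})
    return rows


rows = load_quarters(csv_text)

# Yearly total per product: sum every column except the label.
totals = {
    r["Product"]: sum(v for k, v in r.items() if k != "Product")
    for r in rows
}
print(totals)
```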

Extract tables from report.docx

Full metadata with styling information:

{
  "tables": [
    {
      "index": 0,
      "position": "after paragraph 12",
      "headers": ["Product", "Q1", "Q2", "Q3", "Q4"],
      "rows": [...],
      "styling": {
        "has_header_row": true,
        "borders": "all",
        "alignment": "left"
      }
    }
  ]
}

Handling Complex Tables

Merged Cells

Tables with merged cells are normalized:

Extract tables, preserve merged cells

The tool maintains cell relationships while making the data usable.
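To make "normalized" concrete, one common approach is to repeat a merged cell's value across every grid position it spans, so each row ends up with the same number of columns. The sketch below illustrates that general technique only. The input format (tuples of row, column, rowspan, colspan, text) and the `expand_merged` helper are assumptions for this example, and this page does not say whether mcwaddams normalizes this way internally.

```python
def expand_merged(cells, n_rows, n_cols):
    """Build a full grid, copying each cell's text into every position it spans.

    cells: iterable of (row, col, rowspan, colspan, text) tuples (hypothetical format).
    """
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, rowspan, colspan, text in cells:
        for r in range(row, row + rowspan):
            for c in range(col, col + colspan):
                grid[r][c] = text
    return grid


cells = [
    (0, 0, 1, 1, "Product"),
    (0, 1, 1, 2, "H1"),        # header cell merged across two columns
    (1, 0, 1, 1, "Widget A"),
    (1, 1, 1, 1, "1,234"),
    (1, 2, 1, 1, "1,456"),
]

grid = expand_merged(cells, n_rows=2, n_cols=3)
for line in grid:
    print(line)
```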

Nested Tables

Tables within tables? We handle those too:

{
  "tables": [
    {
      "index": 0,
      "nested_tables": [
        {
          "cell": [1, 2],
          "data": [...]
        }
      ]
    }
  ]
}
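A quick way to work with this shape is to index the nested tables by the cell that contains them. This is a sketch under two assumptions not confirmed above: that `cell` is a `[row, column]` pair, and that `data` is a list of rows (the sample rows below are made up for illustration). `nested_by_cell` is a hypothetical helper.

```python
# Parsed output in the shape shown above, with illustrative data filled in.
sample = {
    "tables": [
        {
            "index": 0,
            "nested_tables": [
                {"cell": [1, 2], "data": [["a", "b"], ["c", "d"]]},
            ],
        }
    ]
}


def nested_by_cell(table):
    """Map each (row, col) position to the nested table data found in that cell."""
    return {
        tuple(nested["cell"]): nested["data"]
        for nested in table.get("nested_tables", [])
    }


lookup = nested_by_cell(sample["tables"][0])
print(lookup.get((1, 2)))

# Tables without nested content simply produce an empty mapping.
print(nested_by_cell({"index": 1}))
```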

Large Tables

Tables spanning multiple pages extract completely — no truncation.

Multiple Tables

Documents with several tables return all of them:

Extract all tables from multi-section-report.docx

{
  "tables": [
    {"index": 0, "context": "Executive Summary", "rows": 5},
    {"index": 1, "context": "Financial Data", "rows": 24},
    {"index": 2, "context": "Appendix A", "rows": 100}
  ],
  "total_tables": 3
}
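A summary like this makes it easy to decide which table to pull next. The sketch below assumes the summary has already been parsed into a Python dict with the fields shown above; `find_tables` is a hypothetical helper, not a mcwaddams function.

```python
# Parsed summary in the shape shown above.
summary = {
    "tables": [
        {"index": 0, "context": "Executive Summary", "rows": 5},
        {"index": 1, "context": "Financial Data", "rows": 24},
        {"index": 2, "context": "Appendix A", "rows": 100},
    ],
    "total_tables": 3,
}


def find_tables(summary, context_substring):
    """Return the indexes of tables whose context contains the given text (case-insensitive)."""
    needle = context_substring.lower()
    return [
        t["index"]
        for t in summary["tables"]
        if needle in t["context"].lower()
    ]


financial = find_tables(summary, "financial")

# The largest table is often the one holding the raw data.
largest = max(summary["tables"], key=lambda t: t["rows"])

print(financial, largest["index"])
```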

Specific Table Selection

Extract only the table you need:

Extract the second table from report.docx

Or by context:

Extract the table under "Financial Summary"

Common Issues

Empty Cells

Empty cells are preserved as empty strings, not null:

{
  "rows": [
    ["Widget A", "", "1,456", "1,678", "1,890"]
  ]
}

Formatting in Cells

Bold, italic, and other formatting are stripped for clean data. If you need styled output, use markdown format.

Numeric Detection

Numbers remain as strings to preserve formatting (commas, currency symbols). Convert as needed:

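import json

# `result` is the JSON string returned by the extraction request
data = json.loads(result)

# Convert the numeric columns (assumed to be comma-grouped integers), skipping empty cells
for row in data["tables"][0]["rows"]:
    for i in range(1, len(row)):
        if row[i]:
            row[i] = int(row[i].replace(",", ""))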
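The snippet above assumes plain comma-grouped integers. Real tables tend to mix in empty cells, currency symbols, decimals, and the occasional "n/a", so a more forgiving converter can help. This is a sketch, not tool behavior: `parse_cell` is a hypothetical helper, and the set of currency symbols it strips is an assumption you would adjust for your data.

```python
from decimal import Decimal, InvalidOperation


def parse_cell(text):
    """Return a Decimal for numeric-looking cells, None for empty cells, else the original text."""
    stripped = text.strip()
    if not stripped:
        return None  # empty cells arrive as "" and become None here
    # Drop thousands separators and a leading currency symbol (assumed set).
    cleaned = stripped.replace(",", "").lstrip("$€£")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return text  # labels and placeholders such as "n/a" pass through unchanged
    # Decimal also accepts "NaN" and "Infinity"; treat those as text, not numbers.
    return value if value.is_finite() else text


row = ["Widget A", "", "$1,456", "1,678.50", "n/a"]
parsed = [parse_cell(cell) for cell in row]
print(parsed)
```

`Decimal` is used rather than `float` so that currency values keep their exact decimal representation.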

Legacy Documents

Tables in .doc files work the same way:

Extract tables from old-report.doc as JSON

The OLE parser handles legacy table structures automatically.

Next Steps

Analyze Excel Data — Statistical analysis of spreadsheets
Convert to Markdown — Full document conversion
All Tools Reference — Complete documentation

“PC Load Letter? What does that mean?”

We don’t know either, but at least tables make sense now.
