PDF Table Extraction Tool

Sometimes, while working on one project, you need a tool that doesn’t exist in the form you want, and you end up building it yourself. That happened recently while we were building our open-source STM32F libraries. The result turned out to be kind of cool, and we figured it might be useful to others, so we put it in a GitHub repo.

The code helps pdftotext extract well-delineated table cells (those bounded by black rectangles) and outputs them in a number of different formats for use with downstream tools. One of those formats is a copy of the scanned table with colour coding applied to what the code thinks are distinct cells, so you can check that it’s getting things right (see figure, below).
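
To give a feel for the general idea, here is a rough Python sketch of how cell rectangles might be handed to pdftotext one at a time. It is not the code from our repo: the function names, the grid coordinates, and the assumption that the rule lines have already been located are all hypothetical. The only real interface it relies on is pdftotext's crop options (-f, -l, -x, -y, -W, -H).

```python
import subprocess
from typing import List, Sequence, Tuple

# A cell is (x, y, width, height), measured from the top-left of the page.
Cell = Tuple[int, int, int, int]


def cells_from_grid(xs: Sequence[int], ys: Sequence[int]) -> List[Cell]:
    """Turn rule-line positions into cell rectangles, row by row.

    Assumes a complete grid with no merged cells (a simplification).
    """
    xs = sorted(set(xs))
    ys = sorted(set(ys))
    return [
        (x0, y0, x1 - x0, y1 - y0)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]


def pdftotext_args(pdf_path: str, page: int, cell: Cell) -> List[str]:
    """Build a pdftotext command that crops a single page to one cell."""
    x, y, w, h = cell
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),  # restrict to a single page
        "-x", str(x), "-y", str(y),        # top-left corner of the crop area
        "-W", str(w), "-H", str(h),        # crop width and height
        pdf_path,
        "-",                               # write the text to stdout
    ]


def extract_cell_text(pdf_path: str, page: int, cell: Cell) -> str:
    """Run pdftotext on one cell; requires pdftotext to be installed."""
    result = subprocess.run(
        pdftotext_args(pdf_path, page, cell),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Calling cells_from_grid with the x positions of the vertical rules and the y positions of the horizontal rules yields one rectangle per cell, and extract_cell_text can then be run on each in turn. Finding those rule positions in the first place is the hard part, and this sketch does not attempt it.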
