Extracting Tables from PDF


PDF (Portable Document Format) is one of the most popular document formats in the world for writing and sharing documents. Despite its popularity, extracting records from a PDF file programmatically can be tricky. This post looks at one specific problem: extracting tables from PDF documents.

Apache Tika is an open-source tool that extracts metadata and content from documents as text. Tika is an excellent tool for pulling records out of documents, but it does not detect tables or tabular records in a PDF. This is where Tabula comes into the picture.

Tabula is an open-source app that helps you detect tables in a PDF file. You can detect a table in a PDF document and save its records in CSV, JSON, or TSV format. Tabula comes with a web interface that you can start up to do manual extraction.

Tabula also exposes a Java API for table detection. The library is called tabula-java and is available from the Maven repository. Let us assume we want to extract the table on page 1 of a sample PDF document like the one below:

Sample Input PDF Document

To extract the table from this document, open Eclipse and use Maven to import the tabula-java jar:

<dependency>
   <groupId>technology.tabula</groupId>
   <artifactId>tabula</artifactId>
   <version>1.0.3</version>
</dependency>

Next, we will write a Java class to open and read a PDF document. PDDocument (from Apache PDFBox) is a helpful class for opening a PDF file:

PDDocument pd = PDDocument.load(new File(FILENAME));

Next comes the bit of magic that Tabula provides. SpreadsheetExtractionAlgorithm is the class that detects tables in the PDF document.

ObjectExtractor oe = new ObjectExtractor(pd);

SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm(); // Tabula algo.

Page page = oe.extract(1); // extract only the first page
List<Table> table = sea.extract(page);

log.info("Tables detected: " + table);

Our document has a very simple table format: its tables have clearly defined boundaries. SpreadsheetExtractionAlgorithm works like a charm in such cases. For documents whose tables have complex boundaries or headers, you will have to use a slightly more exhaustive approach: NurminenDetectionAlgorithm. With this algorithm, you first need to establish the table boundaries and then extract the cells from the tables.
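To make the "boundaries first, cells second" idea concrete, here is a toy sketch using only the JDK. It is not Tabula's implementation: TextChunk and textInside are hypothetical names, and the coordinates are made up. It only illustrates that once a table boundary is known, extraction can be restricted to the text lying inside it.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TableRegionFilter {

    // Hypothetical stand-in for a positioned piece of text on a page.
    static final class TextChunk {
        final String text;
        final Rectangle2D.Float bounds;

        TextChunk(String text, float x, float y, float width, float height) {
            this.text = text;
            this.bounds = new Rectangle2D.Float(x, y, width, height);
        }
    }

    // Keep only the chunks that lie fully inside the given table boundary.
    static List<String> textInside(Rectangle2D boundary, List<TextChunk> chunks) {
        List<String> inside = new ArrayList<>();
        for (TextChunk chunk : chunks) {
            if (boundary.contains(chunk.bounds)) {
                inside.add(chunk.text);
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        // Assumed boundary of one table: x from 50 to 450, y from 100 to 300.
        Rectangle2D boundary = new Rectangle2D.Float(50, 100, 400, 200);

        List<TextChunk> chunks = Arrays.asList(
                new TextChunk("Report title", 50, 40, 120, 12),  // above the table
                new TextChunk("Name", 60, 110, 40, 12),          // inside the table
                new TextChunk("City", 200, 110, 40, 12),         // inside the table
                new TextChunk("Page 1", 280, 700, 40, 12));      // footer, below the table

        System.out.println(textInside(boundary, chunks)); // prints [Name, City]
    }
}
```

In a real project the boundary rectangles would come from the detection step of the library rather than being hard-coded as they are here.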

Once you have detected the tables in the document, you need to iterate over each table and print, extract, or store its rows.
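The storing step can be sketched with plain JDK classes. The sketch below assumes the cell text has already been copied into a List<List<String>>, for example by looping over the rows of a Tabula Table and reading the text of each cell (the exact accessor names depend on the tabula-java version, so check them against the one you use). The class and method names here (TableRowsToCsv, flattenCell, escapeCsv, toCsvLines) are made up for illustration. Line breaks inside a cell are collapsed so that each cell comes out as a single value rather than line by line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TableRowsToCsv {

    // Collapse line breaks inside one cell so the whole cell reads as a single value.
    static String flattenCell(String cellText) {
        return cellText.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim();
    }

    // Quote a value when it contains a comma, a double quote, or a line break.
    static String escapeCsv(String value) {
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        return needsQuotes ? "\"" + value.replace("\"", "\"\"") + "\"" : value;
    }

    // Turn one table row into one CSV line.
    static String toCsvLine(List<String> row) {
        List<String> cells = new ArrayList<>();
        for (String cell : row) {
            cells.add(escapeCsv(flattenCell(cell)));
        }
        return String.join(",", cells);
    }

    // Turn all rows of a table into CSV lines, one line per row.
    static List<String> toCsvLines(List<List<String>> rows) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(toCsvLine(row));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up rows standing in for the text of an extracted table.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "City"),
                Arrays.asList("Asha\rRao", "Pune, India"));

        for (String line : toCsvLines(rows)) {
            System.out.println(line);
        }
        // prints:
        // Name,City
        // Asha Rao,"Pune, India"
    }
}
```

To store the table instead of printing it, the same lines can be passed to java.nio.file.Files.write(Paths.get("table.csv"), toCsvLines(rows)), where table.csv is a placeholder file name.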

The output for the above table in a Java console is shown below:

Output of the above sample input file

The gist of the code is available here. Write in with your comments and queries.



6 thoughts on “Extracting Tables from PDF”

  1. Rakesh

    Hi Sir,
    How do I define the table boundaries and then extract the cells from the tables?


    1. Hello Rakesh, you could use the NurminenDetectionAlgorithm in the Tabula library to define the boundaries and then extract the content. An alternative, slightly more exhaustive approach would be to extract all the contents of the table and then use Tika or some other CSV library to extract the fields needed.

      Hope this helps.


  2. Naveed Jameel

    Hello sir!
    How do we extract the entire text within a cell at once? You use a for loop at the end to get the text within each cell, but the text does not come out properly: it comes line by line, so first I get the first line of text from all the cells. I want the data of each entire cell, one by one, so that when the for loop runs it gets all the text of the 1st cell, then all the text of the 2nd cell, and so on. Please show this with example code. Thank you very much in advance; I hope for a reply with example code soon.


  3. Naveed Jameel

    Actually, sir, I want to read a table in a PDF file and write the same table with the same data to a text file. I tried my best but was unable to do it, so kindly show it with example code. Thanks in advance.


  4. Akin

    Thank you… this really helps… at least to start with…


    1. Akin

      Sir, can you please write a code example showing how and where “NurminenDetectionAlgorithm” can be used?

