Thursday, March 14, 2013

Content Metadata UDJC step (using Apache Tika)

I recently stumbled across the Apache Tika project, which is a content analysis toolkit that offers great capabilities such as extracting metadata from various documents.  Depending on the document type, various kinds of metadata are available.  Some examples of metadata include MIME type, last modified date, author, publisher, etc.

I think (more) content analysis would be a great capability to add to the Pentaho suite (especially Pentaho Data Integration), so I set out to write a quick UDJC step using Tika, followed by a sample transformation that uses the step to extract document metadata.



The first thing I noticed when I started writing the UDJC code to interface with Tika is that most of the useful code for recognizing document types and outputting to various formats is buried in private members of a class called TikaCLI.  In its current state, Tika appears to be meant for use as a command-line tool that can be extended by adding your own content types, output formats, etc.  However, for this test I just wanted to be able to use Tika programmatically from my code.  Since Tika is licensed under Apache 2.0, I simply grabbed the sections of code I needed from TikaCLI and pasted them into my UDJC step.

The actual UDJC step processing basically does the following:

  1. Reads in a filename or URL and converts it (if necessary) to a URL
  2. Calls Tika methods to extract metadata from the document at the URL
  3. For each metadata item (a property/value pair), creates a new row and adds the property and value
  4. If Tika throws any errors and the step is connected to an error-handling step, sends the error row(s) there
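
To make steps 1 and 3 above more concrete, here is a minimal standalone sketch in plain Java. It is not the code from my UDJC step: the class name MetadataRows, the method names, and the {location, property, value} row layout are all made up for illustration, and the Tika call from step 2 is replaced by a hard-coded map of invented values so the example runs without any Tika or PDI jars.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Tika or PDI.
public class MetadataRows {

    /** Step 1: turn a filename or URL string into a URL. */
    public static URL toUrl(String location) {
        try {
            // Require a scheme of two or more characters so that a Windows
            // drive letter such as "C:/docs" is not mistaken for a URL.
            if (location.matches("[A-Za-z][A-Za-z0-9+.-]+:.*")) {
                return URI.create(location).toURL();
            }
            // Otherwise treat the string as a local file path.
            return Paths.get(location).toAbsolutePath().toUri().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable location: " + location, e);
        }
    }

    /** Step 3: emit one row per metadata property: {location, property, value}. */
    public static List<String[]> toRows(String location, Map<String, String> metadata) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, String> item : metadata.entrySet()) {
            rows.add(new String[] {location, item.getKey(), item.getValue()});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for the metadata Tika would return in step 2 (invented values).
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");
        metadata.put("Author", "Jane Doe");

        String location = toUrl("report.pdf").toString();
        for (String[] row : toRows(location, metadata)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

In a real UDJC step the rows would of course be written to the PDI output stream rather than printed, and a failure in step 2 would be routed to the error-handling hop instead of being thrown.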

I ran the sample transformation on my Downloads directory; here is a snippet of the output:


If you know ahead of time which metadata properties you want, you can use a Row Denormaliser step to turn those properties into fields, with each property's value becoming the value of its field.  This reduces the amount of output, since the Row Denormaliser emits one row per document, whereas the UDJC step emits one row per metadata property per document.  For my example transformation (see above), I chose the "Content-Type" property to denormalise.  Here is the output snippet corresponding to the same run as above:
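
For anyone who has not used the Row Denormaliser, the pivot it performs can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how PDI implements the step: the class ContentTypePivot, the {document, property, value} row layout, and the sample rows are all invented, and I am assuming a document without the requested property should still produce a row with an empty (null) value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the denormalise idea, not PDI code.
public class ContentTypePivot {

    /**
     * Collapses {document, property, value} rows into one entry per document,
     * keeping only the value of the requested property.
     */
    public static Map<String, String> denormalise(List<String[]> rows, String wantedProperty) {
        Map<String, String> byDocument = new LinkedHashMap<>();
        for (String[] row : rows) {
            String document = row[0];
            String property = row[1];
            String value = row[2];
            // Every document gets an entry, even if the property never shows up.
            if (!byDocument.containsKey(document)) {
                byDocument.put(document, null);
            }
            if (wantedProperty.equals(property)) {
                byDocument.put(document, value);
            }
        }
        return byDocument;
    }

    public static void main(String[] args) {
        // Invented sample rows in the shape the UDJC step emits:
        // one row per metadata property per document.
        List<String[]> rows = Arrays.asList(
            new String[] {"a.pdf", "Content-Type", "application/pdf"},
            new String[] {"a.pdf", "Author", "Jane Doe"},
            new String[] {"b.txt", "Content-Type", "text/plain"},
            new String[] {"b.txt", "Content-Encoding", "UTF-8"});

        // Four input rows collapse to one entry per document.
        denormalise(rows, "Content-Type")
            .forEach((document, type) -> System.out.println(document + " | " + type));
    }
}
```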


Tika does a lot more than just metadata extraction.  It can extract text from document formats such as Microsoft Word and PDF, and it can even guess the language (e.g., English or French) of the content.  Adding these features to PDI would be a great thing, and if I ever get the time, I will create a proper "Analyze Content" step, using as many of Tika's features as I can pack in :)  We could even integrate the Automatic Documentation Output functionality by adding content recognizers and such for PDI artifacts like jobs and transformations.

The sample transformation is on GitHub here.  As always, I welcome your questions, comments, and suggestions. If you try this out, let me know how it works for you. Cheers!
