Thursday, March 14, 2013

Content Metadata UDJC step (using Apache Tika)

I recently stumbled across the Apache Tika project, a content analysis toolkit that can (among other things) extract metadata from a wide variety of documents.  The kinds of metadata available depend on the document type; examples include MIME type, last modified date, author, and publisher.

I think (more) content analysis would be a great capability to add to the Pentaho suite (especially Pentaho Data Integration), so I set out to write a quick UDJC (User Defined Java Class) step using Tika, followed by a sample transformation that uses the step to extract document metadata:



The first thing I noticed when I started writing the UDJC code to interface with Tika is that most of the useful code for recognizing document types and outputting to various formats is buried as private members of a class called TikaCLI.  It appears that Tika is currently meant to be used as a command-line tool that you extend by adding your own content types, output formats, etc.  However, for this test I just wanted to use Tika programmatically from my own code.  Since Tika is licensed under Apache 2.0, I simply grabbed the sections of code I needed from TikaCLI and pasted them into my UDJC step.

The actual UDJC step processing basically does the following (a rough sketch of the underlying Tika calls appears after the list):

  1. Reads in a filename or URL and converts it (if necessary) to a URL
  2. Calls Tika methods to extract metadata from the document at that URL
  3. For each metadata item (a property/value pair), creates a new output row containing the property and the value
  4. If Tika throws any errors and the step is connected to an error-handling step, sends the error row(s) there
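
Here is that rough sketch, as a minimal standalone example.  It is not the step code itself (which borrows from TikaCLI, as described above); it just shows one straightforward way to get the same metadata programmatically with an AutoDetectParser and a Metadata object.  The class name and the -1 "no write limit" argument are my choices:

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaMetadataSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: treat the argument as a URL (e.g. file:///home/user/somedoc.pdf)
        URL url = new URL(args[0]);

        // Step 2: let Tika detect the document type and fill in the metadata
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        InputStream stream = url.openStream();
        try {
            // -1 disables the write limit on the (ignored) body text
            parser.parse(stream, new BodyContentHandler(-1), metadata, new ParseContext());
        } finally {
            stream.close();
        }

        // Step 3: one "row" per property/value pair, like the UDJC step's output
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
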
I ran the sample transformation on my Downloads directory; here is a snippet of the output:


If you know ahead of time which metadata properties you want, you can use a Row Denormaliser step to turn those properties into field names, with their values becoming the values of those fields.  This helps reduce the amount of output, since the Row Denormaliser outputs one row per document, whereas the UDJC step outputs one row per metadata property per document.  For my example transformation (see above), I chose the "Content-Type" property to denormalise (a trivial code sketch of this pivot follows the output snippet).  Here is the output snippet corresponding to the same run as above:


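As an aside, here is that trivial sketch of the pivot the Row Denormaliser performs in this transformation.  It is only an illustration of the idea, using a simplified (filename, property, value) row layout; it is not how PDI implements the step:

import java.util.LinkedHashMap;
import java.util.Map;

public class DenormaliseSketch {
    // Input rows are (filename, property, value) triples, one per metadata item;
    // the result holds one Content-Type value per filename, i.e. one "row" per document.
    public static Map<String, String> contentTypePerFile(String[][] rows) {
        Map<String, String> contentTypes = new LinkedHashMap<String, String>();
        for (String[] row : rows) {
            if ("Content-Type".equals(row[1])) {
                contentTypes.put(row[0], row[2]);
            }
        }
        return contentTypes;
    }
}
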
Tika does a lot more than metadata extraction: it can extract text from document formats such as Microsoft Word and PDF, and it can even guess the language (English or French, for example) of the content.  Adding these features to PDI would be a great thing, and if I ever get the time, I will create a proper "Analyze Content" step, using as many of Tika's features as I can pack in :)  We could even integrate the Automatic Documentation Output functionality by adding content recognizers and such for PDI artifacts like jobs and transformations.
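
For reference, here is roughly what those two features look like through Tika's API.  This is just a sketch (the UDJC step above doesn't do any of this), using Tika's facade class and its LanguageIdentifier; the class name is mine:

import java.io.File;

import org.apache.tika.Tika;
import org.apache.tika.language.LanguageIdentifier;

public class TikaLanguageSketch {
    public static void main(String[] args) throws Exception {
        // Extract plain text from Word, PDF, etc. via the Tika facade
        String text = new Tika().parseToString(new File(args[0]));

        // Guess the language of the extracted text (ISO 639 code, e.g. "en" or "fr")
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        System.out.println(identifier.getLanguage());
    }
}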

The sample transformation is on GitHub here.  As always, I welcome your questions, comments, and suggestions. If you try this out, let me know how it works for you. Cheers!

Friday, March 8, 2013

Pentaho Data Integration 4.4 and Hadoop 1.0.3

While working with a few new Hadoop-based technologies (blog posts to come later), the need arose to get Pentaho Data Integration (PDI) and its Big Data plugin (source available on GitHub) working with an Apache Hadoop 1.0.3 cluster.  Currently, PDI 4.4 only supports the following distributions (and any distributions compatible with them):


  • Apache Hadoop 0.20.x (hadoop-20)
  • Cloudera CDH3u4 (cdh3u4)
  • Cloudera CDH4 (cdh4)
  • MapR (mapr)


The values in parentheses in the list above are the folder names under the Big Data plugin's "hadoop-configurations" folder; each of these folders contains the JARs and other resources needed to run PDI against a particular distribution.  To select a distribution for PDI to use, you edit the plugin.properties file in the Big Data plugin's root folder and set the "active.hadoop.configuration" property to one of the folder names above.  The default setting is for Apache Hadoop 0.20.x:

active.hadoop.configuration=hadoop-20

Apache Hadoop 1.0.3 is not compatible with the Apache Hadoop 0.20.x line, and thus PDI doesn't work with 1.0.3 out of the box.  So I set out to find a way to make it work.

First, I simply copied the hadoop-20 folder to a "hadoop-103" folder in the same directory (pentaho-big-data-plugin/hadoop-configurations/).  Then I replaced the following JARs in the client/ subfolder with the versions from the Apache Hadoop 1.0.3 distribution:

commons-codec-<version>.jar
hadoop-core-<version>.jar

and I added the following JAR from the Hadoop 1.0.3 distribution to the client/ subfolder as well:

commons-configuration-<version>.jar

Then I changed the property in plugin.properties to point to my new folder:

active.hadoop.configuration=hadoop-103
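
Putting it all together, the relevant part of the plugin folder should now look roughly like this (version numbers omitted):

pentaho-big-data-plugin/
  plugin.properties                          (active.hadoop.configuration=hadoop-103)
  hadoop-configurations/
    hadoop-20/
    cdh3u4/
    cdh4/
    mapr/
    hadoop-103/                              (copy of hadoop-20)
      client/
        hadoop-core-<version>.jar            (replaced with the 1.0.3 version)
        commons-codec-<version>.jar          (replaced with the 1.0.3 version)
        commons-configuration-<version>.jar  (added from the 1.0.3 distribution)
        ...                                  (everything else as copied from hadoop-20)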

Then I started PDI and was able to use steps like Hadoop Copy Files and Pentaho MapReduce (see the Wiki for How-Tos).

NOTE: I didn't try to get all functionality working, nor did I test everything.  Specifically, I didn't try anything related to Hive, HBase, Sqoop, or Oozie.  For Hive, I'm hoping the PDI client will work against any Hive server running on an Apache Hadoop 0.20.x cluster, or any compatible configuration.  If I test any of these Hadoop technologies, I will update this blog post.

If you try this procedure (for 1.0.3, 1.0.x, or any other Hadoop distribution), let me know if it works for you, especially if you had to do anything I haven't listed here :)  Cheers!