Fun with Pentaho Data Integration: September 2014

Monday, September 22, 2014

Groovy Datasources with Pentaho Report Designer

Ok, so this blog is called "Fun with Pentaho Data Integration", but I recently fielded a question about using scriptable data sources in Pentaho Report Designer (PRD), and rather than start a whole new blog, I thought I'd just post it here. The techniques are generally usable for all Pentaho design tools that support scripting, including PDI :)

I had the following goals with this POC:

1) Instead of just using Groovy to build a TypedTableModel with static values, I wanted to get the data from an external source. I chose a Twitter search with the #Pentaho hashtag.

2) I want to showcase the ability to use the Groovy Grapes/Grab functionality, to minimize/remove the number of JARs needed in the PRD classpath. I was able to get it to zero, with one caveat (see below)

3) I wanted to show off Groovy's idioms and syntactic sugar to show how easy it is to generate a scripted data source for PRD. I got it down to 17 SLOC :)

Goals #1 and #2 led me to Twitter4J, a Java client for the Twitter 1.1 API. In order to use it in a Groovy script, I needed Twitter4J and its dependencies. The most straighforward way to do this is to use Groovy's Grab annotation:

@Grab(group="org.twitter4j", module="twitter4j-core", version="4.0.2")

However, PRD doesn't include Ivy (or at least the parts of Ivy) that Grab needs to resolve dependencies. Rather than add the JAR(s) to PRD, I just installed Ivy separately and added the folder to the "libraries" property of the launcher.properties file:

libraries=lib:lib/jdbc:lib/bigdata:docs:../../apache-ivy-2.4.0-rc1

I think there's only one JAR of interest so it might be just as easy to toss it in report-designer/lib, but I like to keep a clean shop where possible :)

NOTE to Mac users: I don't know how to do this with the Mac application distribution, it's probably not difficult but I usually just download the regular distribution and run the shell script to start PRD.

Once this was done, I started PRD, created a new report, and created a new Advanced->Scripting data source:

I selected "groovy" as the engine, created a new query called "Get Pentaho Tweets", and entered the following script (gist here):

@Grab(group="org.twitter4j", module="twitter4j-core", version="4.0.2")
import twitter4j.*
import twitter4j.api.*
import twitter4j.conf.*
import org.pentaho.reporting.engine.classic.core.util.TypedTableModel;

colNames = ['Screen Name', 'Tweet Date', 'Text'] as String[]
colTypes = [String, Date, String] as Class[]
model = new TypedTableModel(colNames, colTypes);

ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(true)
  .setOAuthConsumerKey(System.getenv('TWITTER_CONSUMER_TOKEN'))
  .setOAuthConsumerSecret(System.getenv('TWITTER_CONSUMER_TOKEN_SECRET'))
  .setOAuthAccessToken(System.getenv('TWITTER_ACCESS_TOKEN'))
  .setOAuthAccessTokenSecret(System.getenv('TWITTER_ACCESS_TOKEN_SECRET'));
TwitterFactory tf = new TwitterFactory(cb.build());
Twitter twitter = tf.getInstance();

def query = new Query('#pentaho')
def result     = twitter.search(query)
result?.tweets.each { tweet ->
  model.addRow( [tweet.user.screenName, tweet.createdAt, tweet.text] as Object[] )
}
return model;

Note that I am reading in token info from the environment, you can also change this to use System properties or just hard code them in, although I (and the folks at Twitter4J) highly recommend against it. Perhaps the best approach is to require some/all of the tokens as report parameters. I didn't try this but I suspect it's quite possible with a scripted datasource.

For goal #3, I think both the creation of the model and the population of the model with data speak to how easy and succinct it is. Take a look at the BeanShell sample (which *almost compiles to Groovy) for the Sample report with a Scripted Datasource to see how much more verbose it is. I use the "as" keyword/idiom quite often in Groovy, usually when interacting with a Java API.

Anyway, that's all there is to it! The script creates a 3-column table containing the last 100 tweets mentioning "#pentaho", with the columns being the tweeter's screen name, the date the tweet was created, and the text of the tweet itself. Be careful not to exceed your rate limit; I didn't put in a try/catch for the search call, but you can do that in order to return an empty table or something as you like.

As I said before, this approach should be very similar for the Script step in PDI, although you won't be populating a model; rather, you'll specify the fields in the step dialog and fill them with the script. For consistency and maintainability, you could always use the PDI transformation as the datasource in your PRD report, but that's not what I was trying to do for this post :)

Cheers!

Saturday, September 20, 2014

Using Apache Drill with PDI

One of the non-Pentaho side projects I've become interested in is Apache Drill, I like all the different aspects of it and hope to contribute in some meaningful way shortly :) As a first step, without touching any code, I thought I'd see if I could configure PDI and Drill to play nicely together. My Proof-Of-Concept was a single Table Input step in PDI, using a Generic driver to point at my local Drill instance. I was able to query the datasources as described in the Drill wiki.

At first I wanted to be able to start Drill in embedded mode, as my end goal was to be able to provide a PDI plugin so "Apache Drill" would show up on the Database types list in the Database Connection dialog. However I ran into a bunch of classloading issues (see previous post), so I thought I'd try a different approach that worked much better. Here's my procedure:

1) Install Drill and Zookeeper.
2) Start Zookeeper (I used Standalone mode)
3) Start Drill in Distributed mode (I used runbit). By default this should look for a localhost:2181 ZK
4) Add a relative path pointing at the <path_to_drill>/jars/jdbc-driver folder to the "libraries" property in PDI's data-integration/launcher/launcher.properties file. For my install, I used ../../../../apache-drill-0.5.0-incubating/jars/jdbc-driver
4) Open PDI and create a new Transformation with a Table Input step.
5) In the Table Input step dialog, create a new Connection with the following properties:

Click the Test button, it may take a while the first time but it should come back with Success. Don't click the Explore button or try to look for tables or columns, there is an Unsupported Operation in a few of the JDBC calls. The reason I've found is usually that a Drill class has subclassed an Avatica class and hasn't overridden the methods that PDI is calling. This is common with newer technologies that provide JDBC drivers; the JDBC API is huge so many young drivers don't implement all the methods.

6) Click OK to exit the Database Connection dialog
7) Enter your SQL query in the Table Input step
8) Click Preview or run the transformation

I was able to run queries like the ones in the wiki (see link above) as well as things like:

SELECT GENDER, AVG(salary) AS AVG_SALARY FROM cp.`employee.json` GROUP BY GENDER

I haven't tried this with anything but Table Input. In my experience if I am getting UnsupportedOperationExceptions in the Database Connection Dialog, I won't get very far with other Pentaho tools. This is due to the lack of implemented methods in the driver. As a Pentaho employee, I've done a few things to "fix" these on the fly. I looked into doing the same for Drill while trying to create a plugin for it, but I spent too many hours in Dependency Hell and eventually gave up (so far) trying to embed a Drill client/instance in a PDI plugin. Now that I have it working with Drill in Distributed mode, perhaps I will give it another try. If I can get around the classloading / service locator problems with Jersey, perhaps I'll achieve my end goal :)

Cheers!

PDI plugins and Dependency Hell

I've written quite a few plugins for Pentaho Data Integration, some are "finished" in terms of being in the PDI Marketplace, and some are still works in progress, Proofs of Concept, etc. The usual pattern for my plugins is to integrate some 3rd party open-source tech with PDI. For one-JAR solutions, this is pretty straightforward and rarely causes problems. For larger projects, however, this can be quite painful.

When a plugin is loaded into PDI, a new self-first ClassLoader is created, the JARs at the root of the plugin folder are scanned for plugin type annotations and plugin.xml files, and then the JARs in the lib/ folder of the plugin are added to the ClassLoader. This allows plugins to (sometimes) override the versions of JARs that are provided with PDI, and to provide their own JARs in a sandbox of sorts, so as not to infiltrate or influence other plugins.

However there are a few JARs that always seem to cause problems, where a plugin's version of the JAR will not play well with the version provided by PDI. Slf4J is one of these, as are the Jersey JARs. A lot of it has to do with which PDI classes are trying to use which of its own loaded classes while operating using a plugin which might have the same class in its own classloader. In this case, there are two separate versions of the class in the JVM, one loaded by the App classloader, and one by the plugin's classloader. As long as it's only the plugin referencing the class, there should be no problem; however, if the "duplicate" class is passed to a PDI-loaded class, then a ClassCastException can occur, as even though they appear to be the same class (same fully-qualified class name, e.g.), they are different due to the different classloaders.

I've run into this a few times in my projects, so I finally wised up and wrote a small Groovy script to look for common JARs between two paths. This script ignores the versions on the JARs (assuming they're named in "normal" style (name-version.jar). It doesn't work for JARs with classifiers (yet), it just does a quick substring looking for the last dash in the name. For each JAR "common" to the two paths, it then prints the versions in each path. This allows me to use transitive dependencies when building my plugin package, but then I can inspect which JARs might cause problems as they will already be provided by the PDI runtime when the plugin is deployed.

Here's the script code (Gist here):

commonLibs = [:]
leftLibs = []
rightLibs = []
if(args?.length > 1) {
    new File(args[0]).eachFileRecurse {
     fileName = (it.name - '.jar')
     fileNameNoVersion = fileName[0..fileName.lastIndexOf('-')-1]
     leftLibs += fileNameNoVersion
     commonLibs.putAt(fileNameNoVersion, commonLibs[fileNameNoVersion] ? (commonLibs[fileNameNoVersion] + [1:fileName]) : [1:fileName])
    }
    new File(args[1]).eachFileRecurse {
     fileName = (it.name - '.jar')
     fileNameNoVersion = fileName[0..fileName.lastIndexOf('-')-1]
     rightLibs += fileNameNoVersion
     commonLibs.putAt(fileNameNoVersion,  commonLibs[fileNameNoVersion] ? (commonLibs[fileNameNoVersion] + [2:fileName]) : [2:fileName])
    }
    leftLibs.intersect(rightLibs).each { lib ->
      println "${commonLibs[lib]}"
    }
}
else {
  // print usage
  println 'Usage: jardiff  \nDisplays the JARs common to both paths (ignoring version) and the versions of those JARs in each path.'
}

When I have more time and interest, I will switch the lastIndexOf() to a regex instead, to make it more robust, and I'm sure I can make things more "Groovy". However I just needed something to get me unstuck so I can start integrating more cool tech into PDI. Stay tuned to this blog and the Marketplace, hopefully I'll have something new soon :)

Cheers!

Fun with Pentaho Data Integration

Monday, September 22, 2014

Groovy Datasources with Pentaho Report Designer

Saturday, September 20, 2014

Using Apache Drill with PDI

PDI plugins and Dependency Hell

About Me

What I'm Reading