Monday, September 22, 2014

Groovy Datasources with Pentaho Report Designer

Ok, so this blog is called "Fun with Pentaho Data Integration", but I recently fielded a question about using scriptable data sources in Pentaho Report Designer (PRD), and rather than start a whole new blog, I thought I'd just post it here. The techniques are generally usable for all Pentaho design tools that support scripting, including PDI :)

I had the following goals with this POC:

1) Instead of just using Groovy to build a TypedTableModel with static values, I wanted to get the data from an external source. I chose a Twitter search with the #Pentaho hashtag.

2) I wanted to showcase Groovy's Grapes/Grab functionality to minimize (or eliminate) the number of JARs needed in the PRD classpath. I was able to get it down to zero, with one caveat (see below).

3) I wanted to use Groovy's idioms and syntactic sugar to demonstrate how easy it is to write a scripted data source for PRD. I got it down to 17 SLOC :)

Goals #1 and #2 led me to Twitter4J, a Java client for the Twitter 1.1 API. In order to use it in a Groovy script, I needed Twitter4J and its dependencies on the classpath. The most straightforward way to do this is with Groovy's Grab annotation:

@Grab(group="org.twitter4j", module="twitter4j-core", version="4.0.2")

However, PRD doesn't include Ivy (or at least the parts of Ivy that Grab needs to resolve dependencies). Rather than add the JAR(s) to PRD, I installed Ivy separately and added its folder to the "libraries" property of the launcher.properties file:

libraries=lib:lib/jdbc:lib/bigdata:docs:../../apache-ivy-2.4.0-rc1

I think there's only one JAR of interest, so it might be just as easy to toss it into report-designer/lib, but I like to keep a clean shop where possible :)

NOTE to Mac users: I don't know how to do this with the Mac application distribution. It's probably not difficult, but I usually just download the regular distribution and run the shell script to start PRD.


Once this was done, I started PRD, created a new report, and added a new Advanced->Scripting data source.

I selected "groovy" as the engine, created a new query called "Get Pentaho Tweets", and entered the following script (gist here):




@Grab(group="org.twitter4j", module="twitter4j-core", version="4.0.2")
import twitter4j.*
import twitter4j.api.*
import twitter4j.conf.*
import org.pentaho.reporting.engine.classic.core.util.TypedTableModel;

colNames = ['Screen Name', 'Tweet Date', 'Text'] as String[]
colTypes = [String, Date, String] as Class[]
model = new TypedTableModel(colNames, colTypes);

ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(true)
  .setOAuthConsumerKey(System.getenv('TWITTER_CONSUMER_TOKEN'))
  .setOAuthConsumerSecret(System.getenv('TWITTER_CONSUMER_TOKEN_SECRET'))
  .setOAuthAccessToken(System.getenv('TWITTER_ACCESS_TOKEN'))
  .setOAuthAccessTokenSecret(System.getenv('TWITTER_ACCESS_TOKEN_SECRET'));
TwitterFactory tf = new TwitterFactory(cb.build());
Twitter twitter = tf.getInstance();

def query = new Query('#pentaho')
def result     = twitter.search(query)
result?.tweets.each { tweet ->
  model.addRow( [tweet.user.screenName, tweet.createdAt, tweet.text] as Object[] )
}
return model;



Note that I read the token info in from the environment; you can change this to use system properties or just hard-code the values, although I (and the folks at Twitter4J) highly recommend against that. Perhaps the best approach is to require some or all of the tokens as report parameters. I didn't try this, but I suspect it's quite possible with a scripted data source.
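A minimal sketch of that parameter-based approach might look like the snippet below. I'm assuming the scripted data source exposes a dataRow binding for reading parameter values, and the parameter names (consumerKey, etc.) are made up for illustration, so treat this as a starting point rather than tested code:

// Hypothetical: read the OAuth tokens from report parameters (via the
// assumed dataRow binding) instead of the environment. Parameter names
// here are invented for illustration.
cb.setDebugEnabled(true)
  .setOAuthConsumerKey(dataRow.get('consumerKey'))
  .setOAuthConsumerSecret(dataRow.get('consumerSecret'))
  .setOAuthAccessToken(dataRow.get('accessToken'))
  .setOAuthAccessTokenSecret(dataRow.get('accessTokenSecret'))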

For goal #3, I think both the creation of the model and its population with data speak to how easy and succinct this is. Take a look at the BeanShell sample (which *almost* compiles as Groovy) for the sample report with a scripted data source to see how much more verbose it is. I use the "as" keyword/idiom quite often in Groovy, usually when interacting with a Java API.
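If you haven't run into it before, here's a quick standalone illustration of what "as" buys you compared to building the array the long way:

// Groovy: coerce a List literal into the array type a Java API expects
def colNames = ['Screen Name', 'Tweet Date', 'Text'] as String[]

// The equivalent without "as" is noticeably more verbose
String[] colNamesVerbose = new String[3]
colNamesVerbose[0] = 'Screen Name'
colNamesVerbose[1] = 'Tweet Date'
colNamesVerbose[2] = 'Text'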

Anyway, that's all there is to it!  The script creates a 3-column table of the most recent tweets mentioning "#pentaho" (up to 100), with the columns being the tweeter's screen name, the date the tweet was created, and the text of the tweet itself.  Be careful not to exceed your rate limit; I didn't put a try/catch around the search call, but you can add one to return an empty table (or whatever you like) if the call fails.
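Here's a rough sketch of what that defensive version might look like, reusing the model and twitter objects from the script above:

// Sketch: if the search fails (e.g. rate limit exceeded), return the
// empty-but-correctly-typed model instead of propagating the exception
try {
    def result = twitter.search(new Query('#pentaho'))
    result?.tweets?.each { tweet ->
        model.addRow([tweet.user.screenName, tweet.createdAt, tweet.text] as Object[])
    }
} catch (TwitterException e) {
    // model still has its column definitions, so the report gets an empty
    // (but valid) table; you could also inspect e.errorMessage here
}
return model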

As I said before, this approach should be very similar for the Script step in PDI, although you won't be populating a model; rather, you'll specify the fields in the step dialog and fill them from the script.  For consistency and maintainability, you could always use a PDI transformation as the data source for your PRD report, but that's not what I was trying to do for this post :)

Cheers!
