Tuesday, December 4, 2012

GroovyConsoleSpoonPlugin with JSR-223 Support

I recently gave a presentation of my GroovyConsoleSpoonPlugin (see earlier posts) to the Pentaho crew, and I got a lot of great feedback on it. Specifically, Pentaho Architect Nick Baker suggested it would be good to have the Groovy-ized API available as a step, so we could leverage it during transformation execution.  For the presentation I had made changes to the Kettle code to allow adding RowListeners and TransListeners, so the Groovy Console plugin could interact with a running transformation.  However, his suggestion makes a lot of sense, as it lets me keep the plugin a proper plugin, with no changes to Kettle code needed.

I thought about creating a Groovy scripting step and adding my staging code to make scripting easier.  However, that involves a lot of setup and boilerplate code that already exists in other steps (such as the JavaScript and Script steps).  I may still do that, in order to leverage the full power of the Groovy language, but in the meantime it occurred to me that I could just create a script engine wrapper (using the javax.script.ScriptEngine interface, etc.) and use the experimental Script step as the vehicle.

So to that end I added GroovySpoonScriptEngine and GroovySpoonScriptEngineFactory classes, which wrap the existing plugin code inside a JSR-223 compliant scripting engine.  Then the Script step can execute Groovy code (with my staging code already injected) during a transformation.
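
JSR-223 engines are discovered through the standard service-provider mechanism, so the plugin JAR presumably carries a provider-configuration file naming the factory class, something like this (the package shown here is hypothetical; the real one is in the plugin source):

    # META-INF/services/javax.script.ScriptEngineFactory
    com.example.groovyconsole.GroovySpoonScriptEngineFactory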

To get Spoon to recognize my scripting engine, I had to add my plugin directory to the list of libraries when launching Spoon.  This is done in launcher/launcher.properties like so:
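
(Illustrative only: the existing entries in the libraries list vary by PDI version and are abbreviated here, and the plugin directory name assumes the project folder was unzipped into plugins/spoon.)

    # launcher/launcher.properties -- existing entries abbreviated
    libraries=../lib:../libext:../plugins/spoon/GroovyConsoleSpoonPlugin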



After starting Spoon, I created a transformation with a Generate Rows step (producing one row) containing a field called "fld1" with the value "hello".  Then I wired it to a Script step, which in turn is wired to a Dummy step.

The key to selecting a scripting engine in the Script step is to name the step with an extension that corresponds to the engine name.  So for the Groovy scripting engine the step name ends in ".groovy", and for my GroovySpoonScriptEngine the step name must end in ".groovyspoon".  My sample transformation looks like this:

Inside the Script step I put some very simple code showing the use of the row data in GStrings, output variables, local variables (which are not returned to the step processor), and the Groovy-ized API available through the plugin:
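
(An illustrative script along those lines; the names are made up, and "greeting" would need to be declared in the step's Fields table to make it onto the output row.)

    // fld1 comes in on the row from the Generate Rows step
    def note = "local variables declared with def stay local"   // not returned to the step processor
    greeting = "${fld1}, world!"                                 // GString using row data; output field
    // the plugin's Groovy-ized API is available too, e.g.:
    // println step('Generate Rows')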



The Script step does have a bit of a limitation (due to its generic nature): the output fields must be specified in the Fields table at the bottom of the dialog.  This is in contrast, for example, to the User Defined Java Class (UDJC) step, which can add values to the row on the fly (see previous posts).

Previewing the Dummy step gives the following result:


So now the plugin supports the Groovy-ized API in the Groovy Console, the command line, and the Script step.  With these three entry points, hopefully interacting with PDI will become a lot easier and even more powerful!

I hope to get time to document the Groovy-ized API on the project's wiki.  In the meantime, take a look in the code at staging.groovy, which contains most of the available methods, properties, etc.  In addition, of course, there's the full Kettle API and all Groovy language features, so the sky's the limit when it comes to what can be done :)

The project is on GitHub under GroovyConsoleSpoonPlugin. As always, I welcome all comments, questions, and suggestions.  Cheers!

Wednesday, November 7, 2012

UDJC to Verify Transformations

The "Verify Transformation" capability of Pentaho Data Integration (aka Kettle) is very handy for spotting issues with your transformations before running them.  As a sanity check or as an auditing feature, I thought it would be nice to verify all transformations in the repository using a Kettle transformation.

To do this, I wrote a User Defined Java Class (UDJC) step to call the TransMeta.checkSteps() API method, just as the Spoon GUI's "Verify transformation" button does. However, instead of displaying a dialog, I put the same information out onto the stream.

In order to get access to the repository's transformations, I started with a Get Repository Names step.  Since Kettle jobs don't (currently) have the same verification functionality, I set the filter to only return transformations:




NOTE: The UDJC step will accept all incoming rows (Jobs, Transformations, etc.) but will only process those whose "meta" object is of type TransMeta.
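
The heart of the check, sketched here in Groovy for brevity (the actual UDJC is Java and also pushes these values onto the output row; the checkSteps() argument list and the remark property names are from the 4.x API and may differ in other versions):

    if (meta instanceof TransMeta) {
        def remarks = []                        // collects CheckResultInterface objects
        meta.checkSteps(remarks, false, null)   // same check Spoon's "Verify transformation" runs
        remarks.each { remark ->
            // each remark carries a type (OK, warning, error) and a message
            println "${remark.type}: ${remark.text}"
        }
    }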

To get access to the TransMeta object for each transformation, I used the Auto Documentation Output step, with the output type set to METADATA. This puts (among other things) a field on the stream containing the TransMeta object associated with each transformation:



I wired the autodoc step to my UDJC step and output the same fields as are available in the GUI version:


If this turns out to be a useful step, I may make it a proper step plugin, which would remove the need for the "show_successful" field and instead provide a dialog box to let the user choose which fields (and their names) to put out on the stream.  UDJC steps are just an easy way for me to get basic functionality out there and try to get early feedback.

I created a dummy repository on GitHub so I could have a Downloads area where I will start storing sample transformations that contain UDJC steps, etc.  This is slightly easier than putting the UDJC code on Gist or Pastebin, especially in this case since there are multiple steps involved.  The direct link to the above transformation is here.

If you give this a try or otherwise have comments, I'm eager to hear them :)

Cheers!

Thursday, October 18, 2012

Groovy Console Spoon Plugin Update

I've been working on the Groovy Console plugin for Spoon, and I seem to have sidestepped the PermGen issues (at least the ones I was having at the time of my last post).  I also added some more functionality, such as the ability to get JobMeta and JobGraph information using job, jobGraph, activeJob(), and activeJobGraph().

I really wanted to have the ability to do "automated" testing from this console, so I added a runTrans() helper method.  If you pass in a TransMeta object, it will run that transformation; otherwise, it will try to run the active transformation.

However, in implementing the runTrans() method I ran into a couple of issues:

1) The execution settings dialog pops up if you run Spoon.executeTransformation() with no arguments.

2) If you are looping and trying to run the transformation once during each loop, the executions will step on each other as they are executed asynchronously.

To fix 1, I faked a call to executeTransformation() using a replay date of the current time, which prevents the execution settings dialog from being displayed.  As far as getting the settings configured, that will be a little trickier. So runTrans() at the time of this writing works on transformations with no execution-specific settings, or on transformations that have already been run with execution-specific settings.  This is because the transformation is effectively replayed.

To fix 2, I put in a hack after the call to executeTransformation() to sleep for 10 msec while the trans is NOT running, then to sleep for 100 msec while the trans IS running.  This is a classic no-no for concurrency but it's in there to get things working in the happy-path case.  This turns runTrans() into a synchronous call, which is fine for my purposes in the console.
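
A rough sketch of that synchronous wrapper (startTrans and isRunning are placeholders for the real calls, i.e. Spoon.executeTransformation() with the replay date, and Trans.isRunning() on the running transformation):

    def runTransSync = { Closure startTrans, Closure isRunning ->
        startTrans()                         // kicks off the asynchronous execution
        while (!isRunning()) { sleep 10 }    // wait for the Trans to spin up
        while (isRunning())  { sleep 100 }   // then poll until it finishes
    }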

Using the database() and runTrans() methods, I was able to automate the testing of a single transformation using two separate database connections:


One thing to note is that I'm changing the step metadata using "back-door" API calls, so the transformation graph is not marked as "dirty".  If for some reason you need to save the changes, you should be able to call one of the Spoon save API methods.

The updated code and downloads are on GitHub. Have fun!

Wednesday, October 10, 2012

Getting Groovy with PDI

I'm a huge fan of Pentaho Data Integration and also of the Groovy programming language.  Since PDI is written in Java, and Groovy and Java are so tightly integrated, it seems only natural to bring some Groovy-ness into PDI :)

Some Groovy support already exists in the form of the Script step in the Experimental category.  This step allows any language that supports JSR-223 to be used in scripting a PDI step.  Also I believe there is some Groovy support in the Data Mining product in the Pentaho suite.

What I wanted was to offer a Groovy console to the Spoon user where he/she could interact with the PDI environment.  It turns out that Groovy's groovy.ui.Console class provides this capability.  So I created a Spoon plugin with a Tools menu overlay offering the Groovy console:



Before it opens (the first time takes a while, as it loads all the classes it needs, including Swing), the plugin adds bindings to the singletons in the PDI environment:

- spoon: this variable is bound to Spoon.getInstance(), and thus offers the full Spoon API such as executeTransformation(), getActiveDatabases(), etc.

- pluginRegistry: this variable is bound to PluginRegistry.getInstance(), and offers the Plugin Registry API, such as findPluginWithName(), etc.

- kettleVFS: this variable is bound to KettleVFS.getInstance(), and offers such methods as getTextFileContent(), etc.

- slaveConnectionManager: this variable is bound to SlaveConnectionManager.getInstance(), and offers such methods as createHttpClient(), etc.

In addition to the above singletons, I also added the following variables that are resolved when the console comes up:

- trans: This variable is resolved to Spoon.getInstance().getActiveTransformation()


- defaultVarMap: This variable is resolved to KettleVariablesList.getInstance().getDefaultValueMap()

- defaultVarDescMap: This variable is resolved to KettleVariablesList.getInstance().getDescriptionMap()

- transGraph: This variable is resolved to Spoon.getInstance().getActiveTransGraph()

Then I added a few helper methods:

- methods(Object o): This method returns o.getClass().getDeclaredMethods()

- printMethods(Object o): This method will print a line with information for each of the methods obtained from calling methods() on the specified object

- props(Object o): This is an alias for the properties() method

- properties(Object o): This method returns o.getClass().getDeclaredFields()

- printProperties(Object o): This method will print a line with information for each of the fields obtained from calling properties() on the specified object

- activeTrans: This method calls Spoon.getInstance().getActiveTransformation(), and is thus more up-to-date than using the "trans" variable.


- activeTransGraph: This method calls Spoon.getInstance().getActiveTransGraph(), and is thus more up-to-date than using the "transGraph" variable.


- database(String dbName): This method returns the DatabaseMeta object for the database connection with the specified name. It searches the active databases (retrieved from Spoon.getInstance().getActiveDatabases())


- step(String stepName): This method returns the StepMeta object for the step in the active transformation with the specified name.

- createdb(Map args): This method takes named parameters (passed to Java as a Map) of name, dbType, dbName, host, port, user, and password, and creates and returns a DatabaseMeta object

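To give a flavor of how these fit together, a quick console session might look like the following (the step and connection names are made up):

    printMethods(trans)                      // inspect the methods on the active TransMeta
    println database('staging_db')           // look up a shared connection by name
    println step('Read customers').name      // grab a StepMeta from the active transformation
    printProperties(activeTransGraph())      // fields on the TransGraph currently shown in Spoon
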
Once all the bindings are done, the console is displayed:


This allows Groovy-style inspection of the PDI environment, but since almost all of the API is exposed in one way or another, you can also edit metadata.  For example, suppose I'd like to run the above transformation against two different database connections.  Without this plugin you would run the transformation manually against one database connection, then edit the steps to switch the connection, then run the transformation again.  With the Groovy Console you can automate this:
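
Something along these lines (the step and connection names are made up, and setting databaseMeta this way assumes a step type, such as Table Input, whose metadata exposes it):

    def inputStep = step('Read customers').stepMetaInterface   // e.g. a Table Input step's metadata
    inputStep.databaseMeta = database('warehouse_b')            // point it at the other connection
    spoon.executeTransformation()                                // then run the transformation again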


Right now I'm having PermGen memory issues when running transformations, so until I get that figured out the console is probably best used for looking at and editing metadata rather than executing "power methods" like executeTransformation().

Besides swapping database connections, you can also create a new one using the createdb() helper method. Here's an example of this being done, as well as testing the connection and adding it to the active transformation:
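
An illustrative version (the connection details are made up; testConnection() and addDatabase() are the standard DatabaseMeta/TransMeta calls):

    def db = createdb(name: 'scratch', dbType: 'MYSQL', dbName: 'testdb',
                      host: 'localhost', port: '3306', user: 'pdi', password: 'secret')
    println db.testConnection()   // DatabaseMeta's connection-test report
    trans.addDatabase(db)         // make it available to the active transformation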


Using the API, you should also be able to create new steps, transformations, etc.  I'd be interested to hear any uses you come up with, so please comment here if and how you've used this plugin.

The code and downloads are up on GitHub; just unzip the folder into plugins/spoon.  Cheers!

UPDATE: It appears that in PDI 4.3.0 there is an older version of the Groovy JAR in libext/reporting, which causes the console plugin not to work. You can try replacing that JAR with the one included in the plugin, but I'm not sure if the reporting stuff will continue to work.  In the upcoming release, the version of the Groovy JAR has been updated to 1.8.0.

Sunday, September 30, 2012

Bucket Partitioner plugin

Lately I've been fooling around with the various PDI/Kettle plugin types while reading Matt Casters et al.'s book "Pentaho Kettle Solutions", and I reached the section on partitioners, which intrigued me.  Around the same time, someone had asked a StackOverflow and/or Pentaho Forums question on how to get at certain data in a "flat" XML file, meaning the data he was interested in was not contained in a parent tag; rather, it appeared a certain number of fields AFTER a particular tag.

The solution I proposed was to use the row number to create a sort of "bucket", where the first N rows would be in the first bucket, the next N in the second bucket, and so on.  Then it occurred to me that this is just a different form of the Mod Partitioner, except you use the quotient of the division (over the number of partitions) rather than the remainder.

This seemed like an easy thing to implement since all the infrastructure code was already done by the ModPartitioner classes.  So I copied the source files into a new project, using a @PartitionerPlugin annotation on my BucketPartitioner class (the ModPartitioner, being part of Kettle core, has its plugin attributes located in the kettle-partition-plugins.xml file).

All I had to change was to add a "/ nrPartitions" expression to the getPartition() code :)
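
In other words (a sketch of the arithmetic only, not the actual plugin source), where the Mod Partitioner returns the remainder, the bucket version takes the quotient first and then wraps it with the modulo:

    // Groovy's '/' does decimal division, so intdiv() gives the integer quotient
    int bucketFor(long rowNr, int nrPartitions) {
        return (int) (rowNr.intdiv(nrPartitions) % nrPartitions)   // quotient, then wrap into a valid partition
    }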

Anyway, the code is up on GitHub (with the plugin JAR).  When I ran it against the sample customers-100.txt CSV file, writing to partitioned Text File Output steps:



I got the results I desired for a partition size of 4, with the first 4 rows (starting with row number 0, so the first "group" only had 3 rows in the bucket versus the usual 4) in file-A.txt, the second 4 rows in file-B.txt, etc.  I left the (% nrPartitions) expression in (as suggested by the authors) so that row 16 would be in bucket A and so on.

Now that I got my feet wet with partitioner plugins, I'm thinking about the following partitioners:

- A Number Range partitioner (much like -- and re-using code from -- the Number Range step)
- A String Difference partitioner (same algorithms available in Fuzzy Match & Calculator steps)
- An XPath-based partitioner?
- A dynamic partitioner of some kind

As always, I welcome all comments, questions, and suggestions :)  Cheers!

Thursday, September 20, 2012

Creating a reproduction transformation using UDJC

I got some pretty good feedback about my previous UDJC step that creates a Data Grid model of a Kettle stream, but I noticed right away that the process for using it is a bit long.  You have to paste in the step, wire it into the trans, run the trans, remove the steps before the UDJC, then wire the Data Grid in place of the UDJC, delete the UDJC, and save the trans with a different name.

The Kettle API provides all of the above operations, so I thought I should upgrade the step to create a new transformation with the Data Grid step as the source, connected to the rest of the transformation (after the UDJC step).  Along the way I also found some issues with threading and metadata in the previous UDJC step, so I fixed them in the new one.
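
Roughly speaking, the wiring looks something like this Groovy sketch (the real step is Java; class names are from the 4.x API, the step names are made up, and populating the DataGridMeta with the captured rows is elided):

    import org.pentaho.di.trans.TransMeta
    import org.pentaho.di.trans.TransHopMeta
    import org.pentaho.di.trans.step.StepMeta
    import org.pentaho.di.trans.steps.datagrid.DataGridMeta
    import org.pentaho.di.trans.steps.dummytrans.DummyTransMeta

    def newTrans = new TransMeta()
    newTrans.name = 'Reproduction transformation'
    def gridMeta = new DataGridMeta()                        // would be filled with the captured meta/data
    def gridStep = new StepMeta('DataGrid', 'Captured rows', gridMeta)
    def nextStep = new StepMeta('Dummy', 'Rest of transformation', new DummyTransMeta())
    [gridStep, nextStep].each { newTrans.addStep(it) }
    newTrans.addTransHop(new TransHopMeta(gridStep, nextStep))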

The code for the new "Create Reproduction Transformation" UDJC step is located on Gist here.  You "install" it the same way as the previous version, either by pasting the code into a new UDJC step, or by pasting it in from a transformation that already contains the filled-in UDJC step.  Then you wire it up in the stream:


Then run the transformation.  The transformation runs as if the step weren't there (it passes through all data), but a new transformation is created, containing the Data Grid step hooked up to the "rest of the transformation":


Looking at the Data Grid step, you can see the data from the stream:


One thing to remember is that the step the Data Grid is wired to might have referenced the previous step(s).  Because the metadata is different for each step type, I couldn't generically interrogate the step to see whether it referenced previous steps, and thus I couldn't change that setting.  This means that if the step after the Data Grid references previous steps (like Merge Rows, for example), you will have to edit it to reference the Data Grid instead.

Anyway, I hope this helps!  If you try it out, please let me know how/if it works, if it's useful, and if you'd like to see any other features.  Cheers!

Tuesday, September 18, 2012

Create a Data Grid from a PDI step's output

Sometimes when folks write Jira cases against PDI, they post sample transformations that reproduce the problem, which can be very helpful.  However, often the transformation references a database that PDI developers can't access, which renders the transformation not-so-helpful for debugging and reproduction purposes. We usually ask that the case writer create a Data Grid step or something that will recreate the offending data without requiring a database connection.  This can get complex and may be beyond the skills of some users.

To that end, I wrote a User Defined Java Class (UDJC) that will read in rows from a step and create a Data Grid step in the transformation containing the meta and data necessary to recreate the problem.  The user can paste the UDJC step into their transformation, insert it after the step they wish to model, then run the transformation.  At the end, a Data Grid step will be added to the transformation, and the user can substitute it for the original step as the row provider.

As an example, let's use the Get Step Plugins transformation from my previous post:



I can create a UDJC step and paste in the code (actually, I keep a transformation handy that contains only this step so I can easily cut-n-paste into transformations), then add it to the stream after Get Step Plugins:


After running or previewing the transformation, the Data Grid Step is created:


Then I insert the Data Grid after the Create Data Grid UDJC and remove the earlier steps:


Previewing the Input Steps step, I get the same results as I did with the Get Step Plugins source!



I added Undo support so you can use Ctrl-Z to undo the addition of the Data Grid step if you wish.  The code for the UDJC is located on Gist here.  I've already got ideas on how to make the "creating a reproduction transformation" process even easier, but I think this is a decent first step towards that.  Let me know what you think!