Thursday, October 18, 2012

Groovy Console Spoon Plugin Update

I've been working on the Groovy Console plugin for Spoon, and I seem to have been able to sidestep the PermGen issues (at least the ones I was having during my last post).  Also I added some more functionality, such as the ability to get JobMeta and JobGraph information using job, jobGraph, activeJob(), and activeJobGraph().

I really wanted to have the ability to do "automated" testing from this console, so I added a runTrans() helper method.  If you pass in a TransMeta object, it will run that transformation; otherwise, it will try to run the active transformation.

However, in implementing the runTrans() method I ran into a couple of issues:

1) The execution settings dialog pops up if you run Spoon.executeTransformation() with no arguments.

2) If you are looping and trying to run the transformation once during each loop, the executions will step on each other as they are executed asynchronously.

To fix 1, I faked a call to executeTransformation() using a replay date of the current time, which prevents the execution settings dialog from being displayed.  As far as getting the settings configured, that will be a little trickier. So runTrans() at the time of this writing works on transformations with no execution-specific settings, or on transformations that have already been run with execution-specific settings.  This is because the transformation is effectively replayed.

To fix 2, I put in a hack after the call to executeTransformation() to sleep for 10 msec while the trans is NOT running, then to sleep for 100 msec while the trans IS running.  This is a classic no-no for concurrency but it's in there to get things working in the happy-path case.  This turns runTrans() into a synchronous call, which is fine for my purposes in the console.

Using the database() and runTrans() methods, I was able to automate the testing of a single transformation using two separate database connections:

One thing to note is that I'm changing the step metadata using "back-door" API calls, so the transformation graph is not marked as "dirty".  If for some reason you need to save the changes you should be able to call some form of API method.

The updated code and downloads are on GitHub. Have fun!

Wednesday, October 10, 2012

Getting Groovy with PDI

I'm a huge fan of Pentaho Data Integration and also of the Groovy programming language.  Since PDI is written in Java and Groovy and Java are so tightly integrated, it seems only natural to bring some Groovy-ness into PDI :)

Some Groovy support already exists in the form of the Script step in the Experimental category.  This step allows any language that supports JSR-223 to be used in scripting a PDI step.  Also I believe there is some Groovy support in the Data Mining product in the Pentaho suite.

What I wanted was to offer a Groovy console to the Spoon user where he/she could interact with the PDI environment.  It turns out that Groovy's groovy.ui.Console class provides this capability.  So I created a Spoon plugin with a Tools menu overlay offering the Groovy console:

And before it opens (the first time takes a while as it loads all the classes it needs including Swing), it adds bindings to the singletons in the PDI environment:

- spoon: this variable is bound to Spoon.getInstance(), and thus offers the full Spoon API such as executeTransformation(), getActiveDatabases(), etc.

- pluginRegistry: this variable is bound to PluginRegistry.getInstance(), and offers the Plugin Registry API, such as findPluginWithName(), etc.

- kettleVFS: this variable is bound to KettleVFS.getInstance(), and offers such methods as getTextFileContent(), etc.

- slaveConnectionManager: this variable is bound to SlaveConnectionManager.getInstance(), and offers such methods as createHttpClient

In addition to the above singletons, I also added the following variables that are resolved when the console comes up:

- trans: This variable is resolved to Spoon.getInstance().getActiveTransformation()

- defaultVarMap: This variable is resolved to KettleVariablesList.getInstance().getDefaultValueMap()

- defaultVarDescMap: This variable is resolved to KettleVariablesList.getInstance().getDescriptionMap()

- transGraph: This variable is resolved to Spoon.getInstance().getActiveTransGraph());

Then I added a few helper methods:

- methods(Object o): This method returns o.getClass().getDeclaredMethods()

- printMethods(Object o): This method will print a line with information for each of the methods obtained from calling methods() on the specified object

- props(Object o): This is an alias for the properties() method

- properties(Object o): This method returns o.getClass().getDeclaredFields()

- printProperties: This method will print a line with information for each of the fields obtained from calling properties() on the specified object

- activeTrans: This method calls Spoon.getInstance().getActiveTransformation(), and is thus more up-to-date than using the "trans" variable.

- activeTransGraph: This method calls Spoon.getInstance().getActiveTransGraph(), and is thus more up-to-date than using the "transGraph" variable.

- database(String dbName): This method returns the DatabaseMeta object for the database connection with the specified name. It searches the active databases (retrieved from Spoon.getInstance().getActiveDatabases())

- step(String stepName): This method returns the StepMeta object for the step in the active transformation with the specified name.

- createdb(Map args): This method takes named parameters (passed to Java as a Map) of name, dbType, dbName, host, port, user, and password, and creates and returns a DatabaseMeta object

Once all the bindings are done, the console is displayed:

This allows a Groovy way of doing inspection of the PDI environment, but since almost all of the API is exposed in one way or another, you can also edit metadata.  For example, in the above transformation I'd like to run against two different database connections.  Without this plugin you run the transformation manually against one database connection, then edit the steps to switch the connection, then run the transformation again.  With the Groovy Console you can automate this:

Right now I'm having PermGen memory issues when running transformations, so until I get that figured out the console is probably best used for looking at and editing metadata rather than executing "power methods" like executeTransformation().

Besides swapping database connections, you can also create a new one using the createdb() helper method. Here's an example of this being done, as well as testing the connection and adding it to the active transformation:

Using the API, you should also be able to create new steps, transformations, etc.  I'd be interested to hear any uses you come up with, so please comment here if and how you've used this plugin.

The code and downloads are up on GitHub, just unzip the folder into plugins/spoon.  Cheers!

UPDATE: It appears that in PDI 4.3.0 there is an older version of the Groovy JAR in libext/reporting, which causes the console plugin not to work. You can try replacing that JAR with the one included in the plugin, but I'm not sure if the reporting stuff will continue to work.  In the upcoming release, the version of the Groovy JAR has been updated to 1.8.0.