Fun with Pentaho Data Integration: April 2014

I've been trying to figure out ways to make it dead-simple to create new plugins for Kettle / Pentaho Data Integration, and as a result I've got some GitHub projects using various approaches:

- pentaho-plugin-skeletons: Skeleton projects with the build artifacts and classes in place, with heredoc explaining how to accomplish certain tasks. So far I only have the pdi-step-plugin-skeleton with a Gradle build, and the heredoc is fairly lacking as I haven't had the time to fill in sample snippets. I also want to add a .travis.yml for folks who would like to use Travis for their CI needs.

- GroovyConsoleSpoonPlugin: This project serves three purposes: first, it can build a Spoon plugin that allows the user to bring up a Groovy Console within Spoon, which can access the entire environment and manipulate it in fairly cool ways (see my previous posts). Second, it can start a standalone Groovy Console that brings in Kettle dependencies in order to prototype features without a full running PDI environment. Third, it offers an internal DSL (based on Groovy of course) to make exploration of the PDI environment as simple as possible. This includes overloaded operators, additions to the Kettle API/SDK, and all sorts of helper methods/classes to make life easier. At some point I will likely move the DSL out of this project to a standalone project, but that won't be anytime soon I'm afraid.

- pdi-pojo: This project is the real subject of this blog post, and it aims to allow the PDI plugin developer to create a single class that extends some pdi-pojo class which provides all the boilerplate and "normal" handling of PDI plugin issues.

The pdi-pojo approach is to provide superclasses for common plugins (such as StepPluginPOJO) which provide delegates, default implementations, etc. for said plugins, thereby reducing the boilerplate code needed to get up and running with a new PDI plugin. This is accomplished by annotating fields in the subclass to indicate how they should be handled by the framework.

Here is an example (taken from the sample code in the project itself) for a StepPluginPOJO subclass that declares its fields as public members (you can also make them protected/private if there are bean getter/setter methods):

@Step( id = "TestStepPluginPOJO",
       image = "test-step-plugin-pojo.png",
       name = "StepPluginPOJO Test",
       description = "Tests the StepPluginPOJO",
       categoryDescription = "Experimental" )
public class TestStepPluginPOJO extends StepPluginPOJO {

  @UI( label = "Enter value" )
  public String testString;

  @ExcludeMeta
  public String testExcludeString;

  public int testInt;

  public boolean testBool;

  @UI( hint = "Checkbox" )
  public boolean testBoolAsText;

  @UI( label = "Cool bool", value = "true" )
  public boolean testBoolWithLabel;

  @UI( label = "Start date", hint = "Date" )
  public Date startDate;

  @UI( label = "End TOD", hint = "Time" )
  public Date endTime;

  @NewField
  public String status;

  // processRow() is shown later in this post
}

The public member variables will be processed on initialization of the plugin to determine which are metadata fields, which need UI representation in the plugin's dialog, etc. By default, all public member variables are treated as metadata fields; to exclude a variable you use the ExcludeMeta annotation (see above).
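To make the idea concrete, here is a minimal sketch of how a framework could scan a class for public member variables and skip the excluded ones. This is not the pdi-pojo implementation: the ExcludeMeta annotation below is a locally defined stand-in, and MetaFieldScanSketch, ExampleStep, and findMetaFields() are hypothetical names used only for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

class MetaFieldScanSketch {

  // Stand-in for pdi-pojo's ExcludeMeta annotation (assumption: the real one may differ)
  @Retention( RetentionPolicy.RUNTIME )
  @Target( ElementType.FIELD )
  @interface ExcludeMeta { }

  // Hypothetical plugin-like class with public member variables
  static class ExampleStep {
    public String testString;

    @ExcludeMeta
    public String testExcludeString;

    public int testInt;

    public static String sharedState; // static, so not treated as a metadata field here
  }

  // Collects the names of public, non-static fields that are not marked @ExcludeMeta
  static List<String> findMetaFields( Class<?> clazz ) {
    List<String> names = new ArrayList<>();
    for ( Field field : clazz.getFields() ) {
      if ( Modifier.isStatic( field.getModifiers() ) ) {
        continue;
      }
      if ( field.isAnnotationPresent( ExcludeMeta.class ) ) {
        continue;
      }
      names.add( field.getName() );
    }
    return names;
  }

  public static void main( String[] args ) {
    // Lists testString and testInt; the order returned by getFields() is not guaranteed
    System.out.println( findMetaFields( ExampleStep.class ) );
  }
}
```

Skipping static fields is a choice made for this sketch only; the post does not say how the framework treats them.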

Perhaps the most helpful of the annotations is the UI annotation, as this will indicate to the framework how to display and handle the graphical user interface element(s) associated with the member. If a UI annotation (or a hint as to how it should be displayed) is absent, the framework will choose a default representation. For example, a boolean member will be (by default) displayed as a checkbox, but a String is displayed as a text field (with Kettle variables allowed within).

For the members above, the following dialog is displayed:

Note how the "value" attribute of the UI annotation will set the default value (or a default is determined based on type).
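The fallback behavior described above (an explicit hint or value wins, otherwise the type decides) could look roughly like the sketch below. The class and method names are hypothetical, and the type-based defaults ("false", "0", empty string) and the text-field fallback for all other types are my assumptions for illustration, not the framework's documented behavior.

```java
import java.util.Date;

class UiDefaultsSketch {

  // An explicit hint wins; otherwise booleans get a checkbox and everything else
  // falls back to a text field (assumption: the real framework may map more types)
  static String chooseWidget( Class<?> type, String hint ) {
    if ( hint != null && !hint.isEmpty() ) {
      return hint;
    }
    if ( type == boolean.class || type == Boolean.class ) {
      return "Checkbox";
    }
    return "Text";
  }

  // An explicit "value" attribute wins; otherwise pick a default based on the type
  static String chooseDefault( Class<?> type, String value ) {
    if ( value != null && !value.isEmpty() ) {
      return value;
    }
    if ( type == boolean.class || type == Boolean.class ) {
      return "false";
    }
    if ( type == int.class || type == Integer.class ) {
      return "0";
    }
    return "";
  }

  public static void main( String[] args ) {
    System.out.println( chooseWidget( boolean.class, null ) );    // prints Checkbox
    System.out.println( chooseWidget( Date.class, "Time" ) );     // prints Time
    System.out.println( chooseDefault( boolean.class, "true" ) ); // prints true
  }
}
```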

These members are just for testing the various UI components and annotations; the only one I'm using in the code is the "status" field, which is annotated as a NewField. This means the framework will add the metadata to the outgoing row, so all you have to do is find the index of that field by name using indexOfValue(), then store your value into the output row at that index (see code below).
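Stripped of the Kettle classes, the lookup-by-name pattern is small. The sketch below models row metadata as a plain list of field names, which is a simplification for illustration only; NewFieldRowSketch, addField(), and writeField() are hypothetical names and not part of Kettle or pdi-pojo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NewFieldRowSketch {

  // Returns a copy of the input field names with the new field appended
  // (a stand-in for the framework adding metadata to the outgoing row)
  static List<String> addField( List<String> inputFields, String newField ) {
    List<String> outputFields = new ArrayList<>( inputFields );
    outputFields.add( newField );
    return outputFields;
  }

  // Finds the field's index by name, widens the row if needed, and stores the value there
  static Object[] writeField( Object[] row, List<String> outputFields, String name, Object value ) {
    int index = outputFields.indexOf( name );
    if ( index == -1 ) {
      throw new IllegalStateException( "Couldn't find field '" + name + "' in output row!" );
    }
    Object[] newRow = Arrays.copyOf( row, Math.max( row.length, outputFields.size() ) );
    newRow[index] = value;
    return newRow;
  }

  public static void main( String[] args ) {
    List<String> outputFields = addField( Arrays.asList( "x" ), "status" );
    Object[] newRow = writeField( new Object[] { "Hello" }, outputFields, "status", "Not sure" );
    System.out.println( Arrays.toString( newRow ) ); // prints [Hello, Not sure]
  }
}
```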

Although almost all of the boilerplate for a step plugin is taken care of by StepPluginPOJO, one method remains abstract: processRow(). This is to ensure that your subclass is actually doing something :)

For the example, I'm basically creating a simplified version of the Add Constants step, by adding a field called "status" to the row and setting its value to the value specified in the dialog box. Here is the body of the processRow() method, followed by a screen shot of the test transformation:

@Override
public boolean processRow( StepMetaInterface smi, StepDataInterface sdi ) throws KettleException {
  Object[] r = getRow(); // get row, set busy!
  // no more input to be expected...
  if ( r == null ) {
    setOutputDone();
    return false;
  }

  // On the first row, build the output metadata (input fields plus any new fields)
  if ( first ) {
    first = false;
    outputRowMeta = getInputRowMeta().clone();
    getFields( outputRowMeta, getStepname(), null, null, this, repository, getMetaStore() );
  }

  // Allocate room on the row for the new fields
  Object[] newRow = RowDataUtil.resizeArray( r, r.length + getNewFields().size() );

  // Do processing here, add new field(s)
  int newFieldIndex = outputRowMeta.indexOfValue( "status" );
  if ( newFieldIndex == -1 ) {
    throw new KettleException( "Couldn't find field 'status' in output row!" );
  }
  newRow[newFieldIndex] = status;

  putRow( outputRowMeta, newRow ); // copy row to possible alternate rowset(s).

  if ( checkFeedback( getLinesRead() ) ) {
    if ( isBasic() ) {
      logBasic( "Lines read: " + getLinesRead() );
    }
  }

  return true;
}

In this case I set the status variable to "Not sure", and my Generate Rows step created 10 rows with a String field named "x" containing the value "Hello". Here are the results, as expected:

I wondered what the performance impact would be for creating a step plugin with StepPluginPOJO versus writing the plugin by hand. I tried to keep the processRow() as simple as possible while trying to match it up to an existing step (hence the Add Constants example). I ran both the above transformation and its counterpart (with Add Constants replacing my test step) with a million rows, and for the most part I got the same performance. The biggest difference was a single run where the Add Constants version ran a million rows in 0.6 seconds and the TestStepPluginPOJO ran in 0.7 seconds.

Because most of the reflection is done at initialization and/or UI time, the performance should be close to writing the same plugin by hand. Of course, if performance is your goal you should write the plugin by hand to fine-tune all aspects of its execution. This project is here for rapidly prototyping plugins.

Having said that, you can use pdi-pojo and incrementally move away from the provided code by overriding whichever methods you choose. For example, if the auto-generated UI is too ugly or primitive for you, you can override the getDialogClassName() method and return one you implement by hand. If you want your own initialization code, override the init() method, and so on.
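As a rough illustration of that incremental approach, the sketch below overrides the two methods mentioned above. To keep it self-contained, BasePluginSketch is a hypothetical stand-in for StepPluginPOJO, the no-argument init() is a simplification (the real Kettle init() takes step meta and data arguments), and the dialog class names are made up.

```java
class OverrideSketch {

  // Hypothetical stand-in for StepPluginPOJO; the real class has a much larger API
  static abstract class BasePluginSketch {
    public String getDialogClassName() {
      return "generated.DefaultDialog"; // placeholder for the auto-generated dialog
    }

    public boolean init() {
      return true; // placeholder for the framework's default initialization
    }
  }

  static class MyStep extends BasePluginSketch {
    boolean customInitDone = false;

    // Point the framework at a hand-written dialog instead of the generated one
    @Override
    public String getDialogClassName() {
      return "com.example.MyStepDialog"; // made-up class name
    }

    // Run the default initialization first, then add custom setup
    @Override
    public boolean init() {
      boolean ok = super.init();
      if ( ok ) {
        customInitDone = true;
      }
      return ok;
    }
  }

  public static void main( String[] args ) {
    MyStep step = new MyStep();
    System.out.println( step.getDialogClassName() ); // prints com.example.MyStepDialog
    System.out.println( step.init() );               // prints true
  }
}
```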

So there's lots more work to do, but I figure I have a good enough start to blog about this and to welcome any contributions to the project. If you get a chance to try it out, please let me know how it works for you! I will try to get the JAR onto Bintray or Sonatype or something like that shortly, but in the meantime just fork and clone my repo, run "gradle clean plugin", and then you can drop the JAR into whatever project you want (or publish it to your local Maven repo or whatever).

Cheers!

Monday, April 28, 2014

PDI Plugin POJOs
