Tuesday, December 30, 2014

SuperScript PDI plugin

As readers of my blog know, I'm a huge fan of scripting languages on the JVM (especially Groovy), and of course I'm a huge fan of Pentaho Data Integration :)  While using the (experimental) Script step to do various things, I saw a few places where a script step could be improved for easier use and more powerful features.  Specifically I wanted:


  • A drop-down UI for selecting the scripting engine
  • Support for non-compilable JSR-223 scripting engines (such as AppleScript)
  • The ability to use the script's returned value as an output field
  • The ability to use the script step as an input step (doesn't need a trigger)

To that end, I set out to refactor the existing Script step, and I'm happy to announce that the SuperScript step is now available in the PDI Marketplace:


As you can see from the screenshot above, I get a list of the output fields, which are the input fields plus the fields I specified in the table at the bottom of the dialog. Also notice that I did not define the variable/field "c" in the script, but I set "Script result?" to Y for "c", and thus "c" will contain the result of the script evaluation (in this case, b * rowNumber).

The following variables are available to each script:

  • step - a reference to the SuperScript step object
  • stepName - the name of the SuperScript step
  • transName - the name of the transformation
  • row - the current row's data
  • lastRow - the last row's data
  • rowMeta - the metadata about the rows (field types, e.g.)
  • rowNumber - the current row's number (starts with 1 like the rest of PDI)
  • SKIP_TRANSFORMATION
  • ABORT_TRANSFORMATION
  • ERROR_TRANSFORMATION
  • CONTINUE_TRANSFORMATION

These last four function the same way as they do in the Script step. Users of the Script (or Modified JavaScript) step will notice that I removed the leading underscores from some of the variables; this is to support script engines that don't allow leading underscores in variable names.

A noticeable addition is the "lastRow" variable. It will contain null (or be undefined) for the first row, but will contain the previous row's data for all subsequent rows. This opens the door for more powerful processing, such as filling empty fields with the previous row's value, changing script behavior based on whether a field value has changed since the last row, etc.  UPDATE: Here is a screenshot of an example script that will fill the field (if null) with the previous row's value for that field (if not null):
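
For readers who want the fill-down idea in text form, here is a minimal standalone Python sketch of the same logic. It is not the script from the screenshot and does not use the PDI API: rows are modeled as plain dicts, and the names fill_forward and last_row are made up for illustration, with last_row playing the role of lastRow. It also assumes that the previous row is the already-filled one, so a run of nulls keeps inheriting the last known value; the step itself may behave differently.

```python
def fill_forward(rows, field):
    """Fill a null field with the previous row's value for that field.

    Hypothetical helper: rows are plain dicts, not PDI row objects.
    """
    filled = []
    last_row = None  # stands in for lastRow: nothing before the first row
    for row in rows:
        current = dict(row)  # copy so the caller's rows are not modified
        if (current.get(field) is None
                and last_row is not None
                and last_row.get(field) is not None):
            current[field] = last_row[field]
        filled.append(current)
        # Assumption: the "previous row" seen by the next row is the filled one.
        last_row = current
    return filled


rows = [{"city": "Oslo"}, {"city": None}, {"city": None}, {"city": "Rome"}]
print([r["city"] for r in fill_forward(rows, "city")])
```

Note that a null in the very first row stays null, since there is no previous row to copy from.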



Other helpful API calls include the following:

step.getTrans() (or just step.trans in Groovy) - Gets a reference to the Transformation object
step.getTrans().findStepInterface("another step") - Gets a reference to a different step
step.putRow(rowMeta, rowData) - Adds a row to the output stream

As I mentioned before, the existing Script step must be "primed" so that it will run at least once. Usually this is done with a Generate Rows step that puts out 1 or more rows. Sometimes the script itself will generate rows, so I wanted SuperScript to run at least once, whether there was an incoming row or not:



Perhaps the most fun and powerful addition is the ability of SuperScript to use any JSR-223 script engine. The existing Script step requires that the script engine produce CompiledScript(s), which is of course the fastest option but is not always available. To that end, SuperScript will attempt to compile the script first, and if it cannot, it will fall back to evaluating (i.e. interpreting) the script(s).  This opens the door for a lot of new scripting languages, such as Jython, AppleScript, and Renjin (an R ScriptEngine for the JVM).
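
The compile-first, interpret-as-fallback idea can be sketched roughly as follows. This is only an illustration of the pattern in plain Python, not the plugin's actual code: the two engine classes and the prepare function are hypothetical stand-ins, and Python's built-in compile() and eval() play the part of a script engine. In JSR-223 terms, the capability check loosely corresponds to asking whether an engine implements the Compilable interface.

```python
class CompilableEngine:
    """Hypothetical engine that can pre-compile a script once."""

    def compile(self, source):
        code = compile(source, "<script>", "eval")
        # The returned callable reuses the compiled code for every row.
        return lambda bindings: eval(code, {}, dict(bindings))

    def eval(self, source, bindings):
        return eval(source, {}, dict(bindings))


class InterpretOnlyEngine:
    """Hypothetical engine that can only evaluate source text."""

    def eval(self, source, bindings):
        return eval(source, {}, dict(bindings))


def prepare(engine, source):
    """Return (runner, was_compiled), preferring compilation when offered."""
    compile_fn = getattr(engine, "compile", None)
    if compile_fn is not None:
        return compile_fn(source), True
    # Fallback: re-interpret the source text each time the runner is called.
    return (lambda bindings: engine.eval(source, bindings)), False


for engine in (CompilableEngine(), InterpretOnlyEngine()):
    run, was_compiled = prepare(engine, "b * rowNumber")
    print(type(engine).__name__, was_compiled, run({"b": 3, "rowNumber": 2}))
```

Either way the caller gets back something it can run once per row; the only difference is whether the parsing cost is paid once or on every call.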

To illustrate this, and to have "Fun with Pentaho Data Integration",  I created a transformation with two SuperScripts running AppleScript, one just after a Generate Rows step, and one just before the last step. The first AppleScript starts iTunes, and the last one quits it. For long-running transformations, this pattern can be used to provide some musical entertainment while you wait for the transformation to finish :)


NOTE: As far as I know, the AppleScript script engine is only available on the Mac, where it comes with the JDK.

I hope you find this plugin helpful, and I'd love to know how/if you are using it, and also how to improve it. One improvement I hope to add is the ability to choose different plugin(s) to link to, in order to use their capabilities. This would likely be done by creating a chained self-first classloader, and could be useful for things like Big Data capabilities from inside the SuperScript step.

The code is Apache 2.0-licensed and available on GitHub at https://github.com/mattyb149/pdi-scriptengine-plugin

As always, I welcome all comments, questions, and suggestions. Until next time, have fun with Pentaho Data Integration :)

Cheers!

Monday, December 8, 2014

How sorted (or sordid) is your data?

I've spent quite a bit of time looking at Pentaho Data Integration (aka Kettle) and trying to make it do things with external technologies and idioms, everything from Groovy, Drill, memcached, Redis, and Hazelcast to Markov Chains. Recently though, I've started to focus on the data coming in and out of PDI, and what I could learn from it (#datadiscovery). I'll be spending a lot more time with R and Drill soon, but as a small example of data discovery, I thought I'd look at "how sorted" data is.

Basically I wanted to know, for an input stream (consisting of CSV files or database tables or whatever), whether the stream is close to being in a sorted state or not.  I am currently looking into approximate and probabilistic methods (like Longest Increasing Subsequence and an interesting "multiplayer" version here), but this post is about a brute-force method of finding the variance of distance between an element in a stream and where it would be if the stream were sorted.

Specifically, I took the rank (aka row number) of each incoming row to be its row number in the raw input, then in parallel I sorted on the desired columns and ranked the sorted rows. I was looking for how far each row was from its sorted position.  My research (read: Google search and Wikipedia) brought me to the Spearman rank correlation coefficient.

For this I would need to sort the rows, then find the delta between the position of each desired column in the original rowset and the sorted rowset, then find the statistical dependence of the ranked values. There are more sophisticated techniques to determine the relationships between ranked items, but this one suited my purpose :)
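
To make the arithmetic concrete, here is a small Python sketch of the same idea outside of PDI. It uses the common simplified Spearman formula, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between a row's original position and its sorted position. That formula assumes distinct ranks, so in this sketch ties are simply broken by arrival order. The function name spearman_sortedness is made up for illustration, and the PDI transformation may arrive at the number differently.

```python
def spearman_sortedness(values):
    """Spearman rho between arrival order and sorted order (no-ties formula)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two rows")
    # order[k] is the original index of the row that lands at sorted position k.
    # sorted() is stable, so equal values keep their arrival order.
    order = sorted(range(n), key=lambda i: values[i])
    sorted_rank = [0] * n
    for position, original_index in enumerate(order):
        sorted_rank[original_index] = position + 1
    # d is (original rank) - (sorted rank), with both ranks starting at 1.
    sum_d2 = sum((i + 1 - sorted_rank[i]) ** 2 for i in range(n))
    return 1.0 - (6.0 * sum_d2) / (n * (n * n - 1))


print(spearman_sortedness(["a", "b", "c", "d", "e"]))  # already sorted
print(spearman_sortedness(["e", "d", "c", "b", "a"]))  # reverse sorted
print(spearman_sortedness(["c", "a", "e", "b", "d"]))  # somewhere in between
```

An already-sorted list gives 1.0 and a reverse-sorted list gives -1.0, so taking the absolute value treats both as "sorted".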

Once I found the algorithm I was looking for, I set out to create an example using only PDI steps, with the following caveats:

1) No scripting steps: Of course you can do whatever you like with the Scripting steps, but if you don't know those programming languages, you're left with the rest of what PDI offers. Luckily the choices are plentiful and powerful.

2) No SQL steps: Most databases probably offer the kind of expressive power you'd need to write a "Spearman rho" function inline, and to be honest that's probably the best option performance-wise; but I was looking to create a data-agnostic, language-agnostic way to calculate the "sortedness" of a data set in PDI, as this could be used in a blending or refinery situation.

I decided to use the "customer-100.txt" sample file in PDI, and sorted on full name, in order to determine "how sorted" the customer data was in terms of customers' names.  I designed the following transformation:




This transformation is on Gist and Box.  The results:




The absolute value of the Spearman rho for customer-100.txt (when sorted on name) is 0.001272127. I used absolute value because I didn't care whether the stream was close to being sorted or reverse-sorted; if you care about that in your usage, then leave out the ABS(rho) calculation in the "Spearman's rho" step above.

Because the value is so close to zero, we can conclude that the data is not very well sorted: the Spearman rho is telling us that the raw data and the sorted data show no monotonic trend (ascending or descending) relative to each other. If the value were closer to 1 (or to -1, for reverse-sorted data, if not using the absolute value), then the stream would be closer to its sorted state and thus "more sorted".  I set up a rudimentary Value Mapper step ("How sorted is the data?" above) to indicate whether the data was well-sorted or not.  If you disable the sort path and enable the direct path around it, then the two rowsets will match and you will get 1.0 as the Spearman rho.

This might not be very useful to the PDI user-at-large, but I learned a lot while working through it, so I thought I'd share it here. Stay tuned for more Fun with Pentaho Data Integration ;)