Sunday, September 30, 2012

Bucket Partitioner plugin

Lately I've been fooling around with the various PDI/Kettle plugin types while reading Matt Casters et al. book "Pentaho Kettle Solutions", and I reached the section on partitioners, which intrigued me.  Around the same time, someone had asked a StackOverflow and/or Pentaho Forums question on how to get at certain data from a "flat" XML file, meaning the data he was interested in was not contained in a parent tag, rather it was a certain number of fields AFTER a particular tag.

The solution I proposed was to use the row number to create a sort of "bucket", where the first N rows would be in the first bucket, the next N in the second bucket, and so on.  Then it occurred to me that this is just a different form of the Mod Partitioner, except you use the quotient of the division (over the number of partitions) rather than the remainder.

This seemed like an easy thing to implement since all the infrastructure code was already done by the ModPartitioner classes.  So I copied the source files into a new project, using a @PartitionerPlugin annotation on my BucketPartitioner class (the ModPartitioner, being part of Kettle core, has its plugin attributes located in the kettle-partition-plugins.xml file).

All I had to change was to add a "/ nrPartitions" expression to the getPartition() code :)

Anyway, the code is up on GitHub (with the plugin JAR), and when I ran it against the sample customers-100.txt CSV file into partitioned Text File Output steps:



I got the results I desired for a partition size of 4, with the first 4 rows (starting with row number 0, so the first "group" only had 3 rows in the bucket versus the usual 4) in file-A.txt, the second 4 rows in file-B.txt, etc.  I left the (% nrPartitions) expression in (as suggested by the authors) so that row 16 would be in bucket A and so on.

Now that I got my feet wet with partitioner plugins, I'm thinking about the following partitioners:

- A Number Range partitioner (much like -- and re-using code from -- the Number Range step)
- A String Difference partitioner (same algorithms available in Fuzzy Match & Calculator steps)
- An XPath-based partitioner?
- A dynamic partitioner of some kind

As always, I welcome all comments, questions, and suggestions :)  Cheers!

1 comment:

  1. Nice, I added this post to the list of available PDI plug-ins on the Pentaho Wiki:
    http://wiki.pentaho.com/display/EAI/List+of+Available+Pentaho+Data+Integration+Plug-Ins

    ReplyDelete