To achieve this (and to prove my point from a previous post on using Groovy Grape in PDI), I wrote a Groovy script to flatten a JSON document into key/value pairs in PDI. I needed the following elements:
1) An Apache Ivy installation/JAR. In a previous post I added the Ivy folder to the launcher.properties; for this I just dropped the single JAR into data-integration/lib before starting Spoon.
2) A JSON document read into the PDI stream and passed to a Script step
To use Groovy with the (still experimental) Script step, ensure the step name ends in ".groovy". This indicates to the Script step to find a JSR-223 Scripting Engine with the name "groovy". Since PDI comes with the Groovy script engine, you can use this out-of-the-box. To use other script engines, simply add the JAR(s) to the classpath (I usually drop them in data-integration/lib).
NOTE: The Script step does not find the Groovy script engine with Java 7 on Mac. This is documented in PDI-13074. You can use Java 6 but that is not officially supported by Pentaho for PDI 5.0+
For the Groovy script, I decided to use Jackson Databind to parse the JSON, using an example I found on Stack Overflow (here). Databind is not included with PDI, so I used the technique from my PRD post to @Grab it:
@Grab(group='com.fasterxml.jackson.core', module='jackson-databind', version='2.3.3')
Then I needed the ability to add more than one output row per input row. The script step was designed to operate on a row of data and add fields to that row by setting variables in the script (and specifying those variables/fields in the step dialog). Since I needed one input row to generate multiple output rows (one per JSON scalar object), I created the output row I wanted by explicitly adding the two fields I intended to add:
outputRowMeta = _step_.getInputRowMeta().clone(); _step_.stepMeta.stepMetaInterface.getFields( outputRowMeta, _step_.getStepname(), null, null, _step_, _step_.repository, _step_.metaStore ); outputRowMeta.addValueMeta(new ValueMetaString("key")) outputRowMeta.addValueMeta(new ValueMetaString("value")) outputRow = RowDataUtil.resizeArray( row, outputRowMeta.size()+2 )
Note that the Script step does this for you if you have 1 output row for every input row. As a result, I added all but the last JSON scalar, then let the Script step do the last one for me:
int outputIndex = rowMeta.size() int count = 1 int numProps = map.size() key = null value = null map.each {kv -> if(count < numProps) { keyIndex = outputIndex valueIndex = outputIndex+1 if(keyIndex >= 0 && valueIndex >= 0) { outputRow[keyIndex] = kv.key outputRow[valueIndex] = kv.value } _step_.putRow(outputRowMeta, outputRow) } else { key = kv.key value = kv.value } count++ }
The entire script is a Gist located here, and here's a screenshot of the step dialog:
You can see where I added ".groovy" to the step name, as well as specifying the output fields in the table below (and using them in the else loop above).
I ran the step against the following JSON doc:
{ "Port": { "@alias": "defaultHttp", "Enabled": "true", "Number": "10092", "Protocol": "http", "KeepAliveTimeout": "20000", "ThreadPool": { "@enabled": "false", "Max": "150", "ThreadPriority": "5" }, "ExtendedProperties": { "Property": [ { "@name": "connectionTimeout", "$": "20000" } ] } } }
And got the following results:
Perhaps this will be helpful for you, either by using the script to flatten JSON, or as an example of using Groovy in the Script step, and/or using @Grab to get dependencies on-the-fly in PDI.
Cheers!
No comments:
Post a Comment