To achieve this (and to prove my point from a previous post on using Groovy Grape in PDI), I wrote a Groovy script to flatten a JSON document into key/value pairs in PDI. I needed the following elements:
1) An Apache Ivy installation/JAR. In a previous post I added the Ivy folder to the launcher.properties; for this I just dropped the single JAR into data-integration/lib before starting Spoon.
2) A JSON document read into the PDI stream and passed to a Script step
To use Groovy with the (still experimental) Script step, ensure the step name ends in ".groovy". This indicates to the Script step to find a JSR-223 Scripting Engine with the name "groovy". Since PDI comes with the Groovy script engine, you can use this out-of-the-box. To use other script engines, simply add the JAR(s) to the classpath (I usually drop them in data-integration/lib).
NOTE: The Script step does not find the Groovy script engine with Java 7 on Mac. This is documented in PDI-13074. You can use Java 6 but that is not officially supported by Pentaho for PDI 5.0+
For the Groovy script, I decided to use Jackson Databind to parse the JSON, using an example I found on Stack Overflow (here). Databind is not included with PDI, so I used the technique from my PRD post to @Grab it:
@Grab(group='com.fasterxml.jackson.core', module='jackson-databind', version='2.3.3')
Then I needed the ability to add more than one output row per input row. The script step was designed to operate on a row of data and add fields to that row by setting variables in the script (and specifying those variables/fields in the step dialog). Since I needed one input row to generate multiple output rows (one per JSON scalar object), I created the output row I wanted by explicitly adding the two fields I intended to add:
outputRowMeta = _step_.getInputRowMeta().clone();
_step_.stepMeta.stepMetaInterface.getFields( outputRowMeta, _step_.getStepname(), null, null, _step_, _step_.repository, _step_.metaStore );
outputRowMeta.addValueMeta(new ValueMetaString("key"))
outputRowMeta.addValueMeta(new ValueMetaString("value"))
outputRow = RowDataUtil.resizeArray( row, outputRowMeta.size()+2 )
Note that the Script step does this for you if you have 1 output row for every input row. As a result, I added all but the last JSON scalar, then let the Script step do the last one for me:
int outputIndex = rowMeta.size()
int count = 1
int numProps = map.size()
key = null
value = null
map.each {kv ->
if(count < numProps) {
keyIndex = outputIndex
valueIndex = outputIndex+1
if(keyIndex >= 0 && valueIndex >= 0) {
outputRow[keyIndex] = kv.key
outputRow[valueIndex] = kv.value
}
_step_.putRow(outputRowMeta, outputRow)
}
else {
key = kv.key
value = kv.value
}
count++
}
The entire script is a Gist located here, and here's a screenshot of the step dialog:
You can see where I added ".groovy" to the step name, as well as specifying the output fields in the table below (and using them in the else loop above).
I ran the step against the following JSON doc:
{
"Port":
{
"@alias": "defaultHttp",
"Enabled": "true",
"Number": "10092",
"Protocol": "http",
"KeepAliveTimeout": "20000",
"ThreadPool":
{
"@enabled": "false",
"Max": "150",
"ThreadPriority": "5"
},
"ExtendedProperties":
{
"Property":
[
{
"@name": "connectionTimeout",
"$": "20000"
}
]
}
}
}
And got the following results:
Perhaps this will be helpful for you, either by using the script to flatten JSON, or as an example of using Groovy in the Script step, and/or using @Grab to get dependencies on-the-fly in PDI.
Cheers!


No comments:
Post a Comment