Friday, October 17, 2014

Flatten JSON to key-value pairs in PDI

I've heard a number of comments regarding JSON and PDI, most of them having to do with difficulties parsing nested documents, using JSONPath, etc.  Personally, I've had a JSON doc I'd like to fetch fields from but I didn't want to try to figure out the JSONPath or document structure, I just wanted to get right to the values.

To achieve this (and to prove my point from a previous post on using Groovy Grape in PDI), I wrote a Groovy script to flatten a JSON document into key/value pairs in PDI. I needed the following elements:

1) An Apache Ivy installation/JAR. In a previous post I added the Ivy folder to the; for this I just dropped the single JAR into data-integration/lib before starting Spoon.

2) A JSON document read into the PDI stream and passed to a Script step

To use Groovy with the (still experimental) Script step, ensure the step name ends in ".groovy". This indicates to the Script step to find a JSR-223 Scripting Engine with the name "groovy". Since PDI comes with the Groovy script engine, you can use this out-of-the-box. To use other script engines, simply add the JAR(s) to the classpath (I usually drop them in data-integration/lib).

NOTE: The Script step does not find the Groovy script engine with Java 7 on Mac. This is documented in PDI-13074.  You can use Java 6 but that is not officially supported by Pentaho for PDI 5.0+

For the Groovy script, I decided to use Jackson Databind to parse the JSON, using an example I found on Stack Overflow (here). Databind is not included with PDI, so I used the technique from my PRD post to @Grab it:

@Grab(group='com.fasterxml.jackson.core', module='jackson-databind', version='2.3.3')

Then I needed the ability to add more than one output row per input row. The script step was designed to operate on a row of data and add fields to that row by setting variables in the script (and specifying those variables/fields in the step dialog).  Since I needed one input row to generate multiple output rows (one per JSON scalar object), I created the output row I wanted by explicitly adding the two fields I intended to add:

outputRowMeta = _step_.getInputRowMeta().clone();
_step_.stepMeta.stepMetaInterface.getFields( outputRowMeta, _step_.getStepname(), null, null, _step_, _step_.repository, _step_.metaStore );
outputRowMeta.addValueMeta(new ValueMetaString("key"))
outputRowMeta.addValueMeta(new ValueMetaString("value"))
outputRow = RowDataUtil.resizeArray( row, outputRowMeta.size()+2 )

Note that the Script step does this for you if you have 1 output row for every input row.  As a result, I added all but the last JSON scalar, then let the Script step do the last one for me:

int outputIndex = rowMeta.size()
int count = 1
int numProps = map.size()
key = null
value = null
map.each {kv ->
  if(count < numProps) {
    keyIndex = outputIndex
    valueIndex = outputIndex+1
    if(keyIndex >= 0 && valueIndex >= 0) {
      outputRow[keyIndex] = kv.key
      outputRow[valueIndex] = kv.value
    _step_.putRow(outputRowMeta, outputRow)
  else {
    key = kv.key
    value = kv.value

The entire script is a Gist located here, and here's a screenshot of the step dialog:

You can see where I added ".groovy" to the step name, as well as specifying the output fields in the table below (and using them in the else loop above).

I ran the step against the following JSON doc:

       "@alias": "defaultHttp",
       "Enabled": "true",
       "Number": "10092",
       "Protocol": "http",
       "KeepAliveTimeout": "20000",
           "@enabled": "false",
           "Max": "150",
           "ThreadPriority": "5"
                   "@name": "connectionTimeout",
                   "$": "20000"

And got the following results:

Perhaps this will be helpful for you, either by using the script to flatten JSON, or as an example of using Groovy in the Script step, and/or using @Grab to get dependencies on-the-fly in PDI.


No comments:

Post a Comment