Monday, December 8, 2014

How sorted (or sordid) is your data?

I've spent quite a bit of time looking at Pentaho Data Integration (aka Kettle) and trying to make it do things with external technologies and idioms, anywhere from Groovy, Drill, memcached, Redis, Hazelcast, and even Markov Chains. Recently though, I've been started to focus on the data coming in and out of PDI, and what I could learn from it (#datadiscovery). I'll be spending a lot more time with R and Drill soon, but as a small example of data discovery, I thought I'd look at "how sorted" data is.

Basically I wanted to know for an input stream (consisting of CSV files or database tables or whatever), is the stream close to being in a sorted state or not?  I am currently looking into approximate and probabilistic methods (like Longest Increasing Sequence and an interesting "multiplayer" version here), but this post is about a brute-force method of finding the variance of distance between an element in a stream and where it would be if the stream were sorted.

Specifically, I looked at the rank (aka row number) of the incoming data as the row number of the raw input, then in parallel I sorted on the desired columns and ranked the sorted rows. I was looking for the distance between each row's value(s) and how far the rows were from their sorted position(s).  My research (read: Google search and Wikipedia) brought me to the Spearman rank correlation coefficient.

For this I would need to sort the rows, then find the delta between the position of each desired column in the original rowset and the sorted rowset, then find the statistical dependence of the ranked values. There are more sophisticated techniques to determine the relationships between ranked items, but this one suited my purpose :)

Once I found the algorithm I was looking for, I set out to create an example using only PDI steps, with the following caveats:

1) No scripting steps: Of course you can do whatever you like with the Scripting steps, but if you don't know those programming languages, you're left with the rest of what PDI offers. Luckily the choices are plentiful and powerful.

2) No SQL steps: Most databases probably offer the kind of expressive power you'd need to write a "Spearman rho" function inline, and to be honest that's probably the best option performance-wise; but I was looking to create a data-agnostic, language-agnostic way to calculate the "sortedness" of a data set in PDI, as this could be used in a blending or refinery situation.

I decided to use the "customer-100.txt" sample file in PDI, and sorted on full name, in order to determine "how sorted" the customer data was in terms of customers' names.  I designed the following transformation:

This transformation is on Gist and Box.  The results:

The absolute value of the Spearman rho for customer-100.txt (when sorted on name) is 0.001272127. I used absolute value because I didn't care whether the stream was close to being sorted or reverse-sorted; if you care about that in your usage, then leave out the ABS(rho) calculation in the "Spearman's rho" step above.

Being so close to zero, we can determine that the data is not very well sorted, as a result of the Spearman rho telling us that there is no tendency for the raw data and the sorted data to follow any sort of trend of monotonicity (ascending or descending). If the values were to get closer to 1 (or to -1 if not using absolute value), then the stream would be closer to its sorted state and thus "more sorted".  I set up a rudimentary Value Mapper step ("How sorted is the data?" above) to indicate whether the data was well-sorted or not.  If you disable the sort path and enable the direct path around it, then the two rowsets will match and you will get 1.0 as the Spearman rho.

This might not be very useful to the PDI user-at-large, but I learned alot while working through it so I thought I'd share it here. Stay tuned for more Fun with Pentaho Data Integration ;)

No comments:

Post a Comment