Saturday, September 20, 2014

Using Apache Drill with PDI

One of the non-Pentaho side projects I've become interested in is Apache Drill, I like all the different aspects of it and hope to contribute in some meaningful way shortly :) As a first step, without touching any code, I thought I'd see if I could configure PDI and Drill to play nicely together. My Proof-Of-Concept was a single Table Input step in PDI, using a Generic driver to point at my local Drill instance. I was able to query the datasources as described in the Drill wiki.

At first I wanted to be able to start Drill in embedded mode, as my end goal was to be able to provide a PDI plugin so "Apache Drill" would show up on the Database types list in the Database Connection dialog. However I ran into a bunch of classloading issues (see previous post), so I thought I'd try a different approach that worked much better. Here's my procedure:

1) Install Drill and Zookeeper.
2) Start Zookeeper (I used Standalone mode)
3) Start Drill in Distributed mode (I used runbit). By default this should look for a localhost:2181 ZK
4) Add a relative path pointing at the <path_to_drill>/jars/jdbc-driver folder to the "libraries" property in PDI's data-integration/launcher/launcher.properties file. For my install, I used ../../../../apache-drill-0.5.0-incubating/jars/jdbc-driver
4) Open PDI and create a new Transformation with a Table Input step.
5) In the Table Input step dialog, create a new Connection with the following properties:

Click the Test button, it may take a while the first time but it should come back with Success. Don't click the Explore button or try to look for tables or columns, there is an Unsupported Operation in a few of the JDBC calls. The reason I've found is usually that a Drill class has subclassed an Avatica class and hasn't overridden the methods that PDI is calling. This is common with newer technologies that provide JDBC drivers; the JDBC API is huge so many young drivers don't implement all the methods.

6) Click OK to exit the Database Connection dialog
7) Enter your SQL query in the Table Input step
8) Click Preview or run the transformation

I was able to run queries like the ones in the wiki (see link above) as well as things like:

SELECT GENDER, AVG(salary) AS AVG_SALARY FROM cp.`employee.json` GROUP BY GENDER

I haven't tried this with anything but Table Input. In my experience if I am getting UnsupportedOperationExceptions in the Database Connection Dialog, I won't get very far with other Pentaho tools. This is due to the lack of implemented methods in the driver. As a Pentaho employee, I've done a few things to "fix" these on the fly. I looked into doing the same for Drill while trying to create a plugin for it, but I spent too many hours in Dependency Hell and eventually gave up (so far) trying to embed a Drill client/instance in a PDI plugin. Now that I have it working with Drill in Distributed mode, perhaps I will give it another try. If I can get around the classloading / service locator problems with Jersey, perhaps I'll achieve my end goal :)

Cheers!

No comments:

Post a Comment