Friday, September 7, 2012

ANTLR Recognizer step plugin (v1)

While reading "Pentaho Kettle Solutions" Chapter 20 ("Handling Complex Data Formats"), I noticed that Matt Casters et al. point out that some data is semi-structured or unstructured, and that the rules for defining the structure "may be very complex and hard to formulate". They go on to offer regular expressions as a solution, but note that some regular expressions (such as the one for RFC 2822 email addresses) are "quite convoluted".

Fortunately there are better ways to define grammars for complicated data formats, and my favorite is ANTLR. Grammars are specified in a BNF-like format, and ANTLR generates Java source code for the lexer and parser(s). There is also a slick IDE for ANTLR grammars called ANTLRWorks; both tools are released under the BSD license.

Given the user-friendliness and Java-friendliness of these tools, I decided to build a validator step for Pentaho Data Integration where you specify an ANTLR grammar file; the step generates the Java source for the grammar, compiles it, and loads it for use in validating content provided on the PDI stream:


Originally I wanted to use Janino to compile the Java source, but this posed two problems related to how ANTLR generates its Java. First, ANTLR inserts @SuppressWarnings annotations, which Janino does not support; I had code to strip those lines out. Second, I ran into a Janino bug that prevents the compilation of ANTLR-generated Java source files. For that reason I switched to javac for compilation, which does require that you have the JDK installed and the javac command on your path.
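If you're curious what the compile-and-load part looks like, here is a minimal sketch using the JDK's javax.tools API. This is my own illustration rather than the plugin's actual code, and the class and file names (GrammarCompiler, MyGrammarLexer.java, MyGrammarParser.java) are hypothetical:

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class GrammarCompiler {

    // Compiles the ANTLR-generated source in genDir and loads the parser class.
    public static Class<?> compileAndLoad(File genDir, String parserClassName) throws Exception {
        // ToolProvider returns null on a plain JRE -- hence the JDK requirement.
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("javac not found; a JDK is required");
        }

        // Compile the generated lexer and parser; a non-zero return code means failure.
        int rc = javac.run(null, null, null,
                "-cp", System.getProperty("java.class.path"),
                new File(genDir, "MyGrammarLexer.java").getAbsolutePath(),
                new File(genDir, "MyGrammarParser.java").getAbsolutePath());
        if (rc != 0) {
            throw new IllegalStateException("javac returned " + rc);
        }

        // Load the freshly compiled classes from the generation directory.
        URLClassLoader loader = new URLClassLoader(new URL[] { genDir.toURI().toURL() },
                GrammarCompiler.class.getClassLoader());
        return loader.loadClass(parserClassName);
    }
}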

Once the plugin is installed (by extracting the ANTLRRecognizer folder into the plugins/steps folder under the PDI location), you can add the step from the Transform group. Double-clicking the step brings up the configuration dialog, which looks like the following:


Here I have chosen to validate CSV files using the grammar here, providing the content on the stream, the entry rule name, and the result fieldname. Currently the result field will contain a 'Y' if the content was valid or an 'N' if it was not.

Unfortunately ANTLR's default error handling does not throw exceptions. For that reason, each grammar needs to override the reportError method to throw an AntlrParseException. The code to add to the grammar looks like this:


@header {
// Make the plugin's exception visible to the generated parser
import org.pentaho.di.trans.steps.antlrrecognizer.AntlrParseException;
}

@members {
    // Fail fast instead of using ANTLR's default error recovery,
    // so the step can flag the content as invalid
    @Override
    public void reportError(RecognitionException e) {
        throw new AntlrParseException(e.getMessage());
    }
}


The AntlrParseException class is provided in ANTLRRecognizer.jar in the plugin folder. This exception will be caught by the ANTLR Recognizer step, and the result field will contain an 'N'.
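To make the whole flow concrete, here is a rough sketch of how the dynamically loaded lexer and parser might be driven with the ANTLR 3 runtime and the rule name from the step dialog. Again, this is my illustration rather than the plugin source; the point is that the AntlrParseException thrown by reportError bubbles up (wrapped by reflection) and turns into an 'N':

import java.lang.reflect.Method;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.TokenSource;
import org.antlr.runtime.TokenStream;

public class RecognizerSketch {

    // Returns "Y" if the content matches the entry rule, "N" otherwise.
    public static String recognize(Class<?> lexerClass, Class<?> parserClass,
                                   String ruleName, String content) {
        try {
            // Feed the stream field's content into the generated lexer.
            TokenSource lexer = (TokenSource) lexerClass
                    .getConstructor(CharStream.class)
                    .newInstance(new ANTLRStringStream(content));
            CommonTokenStream tokens = new CommonTokenStream(lexer);

            // Instantiate the generated parser and invoke the configured entry rule by name.
            Object parser = parserClass.getConstructor(TokenStream.class).newInstance(tokens);
            Method rule = parserClass.getMethod(ruleName);
            rule.invoke(parser);
            return "Y";
        } catch (Exception e) {
            // reportError throws AntlrParseException, which arrives here wrapped in an
            // InvocationTargetException; any failure marks the content as invalid.
            return "N";
        }
    }
}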

Notes

  • The Java source and class files are only generated when the grammar file has been modified since the last time the source and bytecode were generated (see the sketch after this list). See the Future Work section for planned enhancements.
  • Many line-based grammars have a rule to parse a single line and a rule that parses every line in the input. Using the Rule name field, the ANTLR Recognizer step can validate either the whole input (if you are loading content from a file, for example) or line-by-line (if the lines are rows on the stream). If the grammar has both rules, it does not need to be regenerated to use either one.
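For the first note above, the decision can be as simple as a timestamp comparison; here is a minimal sketch (illustrative only, with hypothetical file arguments):

import java.io.File;

public class RegenerationCheck {

    // Regenerate if nothing has been generated yet, or the grammar file is
    // newer than the previously generated parser source.
    public static boolean needsRegeneration(File grammarFile, File generatedParserSource) {
        return !generatedParserSource.exists()
                || grammarFile.lastModified() > generatedParserSource.lastModified();
    }
}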

Future Work

  • Having a validator is all well and good, but if you need to get at specific data in complex structures, you still need regular expressions with grouping or some other complicated solution. I'm planning on specifying parser patterns and/or providing helper classes so that the ANTLR Recognizer can put recognized rules out on the stream. This would work similarly to the XPath functionality in the XML steps, although any hierarchy would have to be baked into the grammar, since a rule matcher doesn't necessarily know which parent rule it is part of. I may suggest (or require) that grammars be written this way so a future version of the step can be more flexible.
  • If/When Janino 2.6.2 is released, I will try to switch to Janino in order to remove the JDK dependency.
  • It should be possible to select a grammar that has already been compiled, rather than a grammar definition file. The step could search the usual PDI directories for JARs containing classes that extend the ANTLR lexer and parser classes. The user could then select either the filename of an ANTLR grammar file (*.g), a recognized grammar from a dropdown, or possibly even the filename of a grammar JAR that is not in the searched PDI directories.

Downloads

The project, including source and the plugin package (ready to drop into your PDI instance), is located on GitHub here.


