turbolent: Entee

Entee


Back in May I wanted to analyze dumps from various knowledge bases such as Freebase and DBpedia. I searched for a fast and simple RDF parsing library, but found that options like Jena or Raptor were too complex and too resource-intensive when parsing multi-gigabyte files.

As almost all RDF datasets are encoded as N-Triples, the simplest of all RDF serialization formats, I estimated that writing a reasonably fast lexer complying with the specification would be neither time-consuming nor difficult. It would also be a good opportunity to finally learn about Ragel, the finite-state machine compiler.

Even though setting up a scanner using Ragel was quite cumbersome, its extensive user guide and examples helped a lot. Translating the BNF of the specification was straightforward, except for two minor issues. One was an ambiguity in the BNF, for which the guide provided several solutions.

The other problem was that Ragel operates on raw bytes, i.e., rules cannot simply specify Unicode code point ranges. In the case of N-Triples, the specification defines UTF-8 as the only valid encoding, so I was able to translate the code point ranges given in the BNF directly into UTF-8 byte sequences. As the ranges were rather extensive, I adapted the script unicode2ragel.rb, which is distributed with Ragel, into ragel_utf8_range.rb: it accepts the start and end of a Unicode code point range as arguments and prints the corresponding UTF-8 byte ranges for use in Ragel rules.
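To make the translation concrete, here is a minimal Python sketch of how a code point range can be split into sequences of UTF-8 byte ranges. It is not the code from ragel_utf8_range.rb: the function names `utf8_ranges` and `to_ragel` are hypothetical, the splitting strategy is one common approach, and the printed notation is only an assumption about what is convenient to paste into a Ragel rule.

```python
def utf8_ranges(lo, hi):
    """Split the code point range [lo, hi] into sequences of byte ranges.

    Each result is a list of (min_byte, max_byte) pairs, one pair per byte
    of the UTF-8 encoding. Hypothetical helper, for illustration only.
    """
    if not 0 <= lo <= hi <= 0x10FFFF:
        raise ValueError("invalid code point range")
    if lo <= 0xDFFF and hi >= 0xD800:
        raise ValueError("range must not include surrogates (U+D800..U+DFFF)")
    # Split wherever the encoded length changes (1, 2, 3, or 4 bytes).
    for boundary in (0x7F, 0x7FF, 0xFFFF):
        if lo <= boundary < hi:
            return utf8_ranges(lo, boundary) + utf8_ranges(boundary + 1, hi)
    # A pure ASCII range is already a single byte range.
    if hi <= 0x7F:
        return [[(lo, hi)]]
    # Split until the trailing continuation bytes either share a prefix
    # or span their full range, so per-byte ranges describe the set exactly.
    for i in (1, 2, 3):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if (lo & mask) != 0:
                return utf8_ranges(lo, lo | mask) + utf8_ranges((lo | mask) + 1, hi)
            if (hi & mask) != mask:
                return utf8_ranges(lo, (hi & ~mask) - 1) + utf8_ranges(hi & ~mask, hi)
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    return [list(zip(first, last))]


def to_ragel(sequences):
    """Format byte-range sequences as alternatives (assumed notation)."""
    def fmt(a, b):
        return f"0x{a:02X}" if a == b else f"0x{a:02X}..0x{b:02X}"
    return " | ".join(" ".join(fmt(a, b) for a, b in seq) for seq in sequences)


# U+00C0..U+00D6 is one of the ranges allowed in N-Triples blank node labels.
print(to_ragel(utf8_ranges(0xC0, 0xD6)))  # prints: 0xC3 0x80..0x96
```

A range that crosses an encoding-length boundary, such as U+007F..U+0080, comes out as several alternatives, one per byte-sequence shape.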

If you need an RDF 1.1 N-Triples compliant parser, take a look at entee on GitHub. So far I have used it successfully to parse large dumps from the previously mentioned knowledge bases and am pleased with the speed. The repository also contains an example which demonstrates how to parse gzip or bzip2 compressed files.
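For readers who just want to see the general idea of reading compressed dumps without unpacking them first, here is a rough Python sketch. It does not use entee and is not the example from the repository. The helper names `open_dump` and `count_statements` are made up for illustration, and the "parsing" is reduced to counting lines that are neither blank nor comments.

```python
import bz2
import gzip


def open_dump(path):
    """Open a plain, gzip, or bzip2 compressed dump as a UTF-8 text stream.

    The compression format is guessed from the file extension (an assumption
    made for this sketch).
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def count_statements(path):
    """Count lines that are neither blank nor comments, decompressing lazily."""
    count = 0
    with open_dump(path) as stream:
        for line in stream:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

Because the file is decompressed line by line, memory use stays flat even for multi-gigabyte dumps. A call such as `count_statements("dump.nt.gz")` would be the entry point, with the file name being a placeholder.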