winterkoninkje | Entries tagged with java

I've been working on a tagging library (and executable) for a bit over a year now. When the project started I had the advantage of being able to choose the language to do it in. Naturally I chose Haskell. There are numerous reasons for this decision, some of which have been derided as "philosophical concerns". Certainly some of the reasons why Haskell is superior to other languages do border on the philosophical. Y'know, silly little things like the belief that type systems should prevent errors rather than encouraging them to proliferate. I'm sure you've heard the arguments before. They're good arguments, and maybe they'll convince you to try out Haskell in your basement. But in many so-called "enterprise" settings, anything that even smells like it might have basis in theoretical fact is automatically wrong or irrelevant; whatever you do in the privacy of your basement is your own business, but heaven forbid it have any influence on how decisions are made in the workplace! So, here is a short list of entirely pragmatic, practical, and non-theoretical reasons why Haskell is superior to Java for implementing enterprise programs. More specifically, these are reasons why Haskell is superior for my project. Perhaps they don't matter for your project, or perhaps they'll be enough to convince your boss to let you give Haskell a try. Because design decisions are often project-specific, each point explains why they matter for Posta in particular.

Haskell has powerful frameworks for defining modular, high-performance, non-trivial parsers (e.g., Attoparsec). In natural language processing (NLP), just like system administration, over half of the work you do involves dealing with a handful of different ad-hoc poorly defined file formats. Reading them; generating them; converting from one format to another; etc. Because every one of these formats grew out of a slow accretion of features for one particular project, they're riddled with inconsistencies, idiosyncratic magic values, corner cases, and non-context-free bits that require special handling. In Java the premiere tool (so far as I know) for defining parsers is JavaCC. (Like the C tools lex and yacc, JavaCC uses its own special syntax and requires a preprocessor, whereas Attoparsec and the like don't. However, this may be a "philosophical" issue.) However, as of last time I used it, JavaCC is designed for dealing with nice clean grammars used by programming languages and it doesn't handle inconsistent and irregular grammars very well.
Posta uses a system of coroutines (called "iteratees") in order to lazily stream data from disk, through the parsers, and into the core algorithms, all while maintaining guarantees about how long resources (e.g., file handles, memory) are held for. This allows handling large files, because we don't need to keep the whole file in memory at once, either in its raw form or in the AST generated by parsing it. For modern enterprise-scale NLP, dealing with gigabyte-sized files is a requirement; because many NLP projects are not enterprise-scale, you get to spend extra time chopping up and reformatting files to fit their limitations. Last time I used JavaCC it did not support incremental parsing, and according to the advertised features it still doesn't. In addition, implementing coroutines is problematic because Java's security model precludes simple things like tail-call optimization--- meaning that you can only support this kind of streaming when the control flow is simple enough to avoid stack overflows.
Haskell has awesome support for parallelism. One version, called STM, provides composeable atomic blocks (which matches the way we naturally think about parallelism) combined with lightweight threads (which make it cheap and easy). Java has no support for STM. I am unaware of any support for lightweight threads in Java. The only parallelism I'm aware of in Java is the monitor-style lock-based system with OS threads. As with all lock-based systems, it is non-composeable and difficult to get right; and as with using OS threads anywhere else, there is high overhead which removes the benefits of parallelizing many programs.
Posta makes extensive use of partial evaluation for improving performance; e.g., lifting computations out of loops. When doing NLP you are often dealing with triply-nested loops, so loop-invariant code motion is essential for performance. In my benchmarks, partial evaluation reduces the total running time by 10%. If raw numbers don't convince you: using partial evaluation allows us to keep the code legible, concise, modular, and maintainable. The primary use of partial evaluation is in a combinator library defining numerous smoothing methods for probability distributions; the results of which are called from within those triply-nested loops. Without partial evaluation, the only way to get performant code is to write a specialized version of the triply-nested loop for every different smoothing method you want to support. That means duplicating the core algorithm and a lot of tricky math, many times over. There's no way to implement this use of partial evaluation in anything resembling idiomatic Java.
Posta uses an implementation of persistent asymptotically optimal priority queues which come with proofs of correctness. A persistent PQ is necessary for one of the tagger's core algorithms. Since the PQ methods are called from within nested loops, performance is important. Since we're dealing with giga-scale data, asymptotics are important. A log factor here or there means more than a 10% increase in total running time. In Java there's java.util.PriorityQueue but it has inferior asymptotic performance guarantees and is neither persistent nor synchronized. I'm sure there are other PQ libraries out there, but I doubt anyone's implemented the exact version we need and shown their implementation to be correct.

I'll admit I'm not up to date on state-of-the-art Java, and I'd love to be proven wrong about these things being unavailable. But a couple years ago when I returned to Java after a long time away, I learned that all the hype I'd heard about Java improving over the preceding decade was just that: hype. I have been disappointed every time I hoped Java has some trivial thing. The most recent one I've run into is Java's complete refusal to believe in the existence of IPC (no, not RPC), but that's hardly the tip of the iceberg.

Last week I gave a debugging problem. Well, now it's time for the big reveal. If you'd like to try your hand at it, you should follow that link and stop reading this post. It took me a couple weeks to figure this one out. Given the nature of the symptoms (failure only on messages of lengths 255 and 383) it was pretty clear that this was some obscure and annoying bug, like an off-by-one error. I read through the relevant code for the whole communication stack (Haskell protobufs, Haskell FIFO IO, Java FIFO IO, Java protobufs) with a inkling of what the problem might be, but I still missed it the first time through.

In the hints I suggested that length 511 should also be problematic, and that we were using the length-delimited version of protobufs. For those who aren't familiar with protobufs there are only two relevant details of the encoding. The first is that length-delimited protobufs write the length before the payload (as mentioned in the hint), and the second is how that length is encoded. In order to reduce the message size, protobufs use a variable length encoding for integers. The encoding says that we read a byte at a time (little-endian order), keeping the low-order 7 bits as part of the number. If the 8th bit is unset, then we're done; if it's set, then we continue reading the next byte. Thus, the length 255 is encoded as 0xFF 0x01 ((0xFF .&. 0x7F) .|. (0x01 `shiftL` 7) == 255), 383 is encoded as 0xFF 0x02, and 511 is encoded as 0xFF 0x03. Conversely, the non-problematic 254 is encoded as 0xFE 0x01, and 256 is encoded as 0x80 0x02.

The crucial knowledge, however, is knowing that for some unfathomable reason Java decided to define their byte type as a signed numeric type! On the one hand, this is perverse because bytes are always used when dealing with binary formats and almost never used for actual numerical computation. On the other hand, it is sadistic because every other language in common use (and most of the ones not in common use) have unsigned bytes, so Java's choice to use signed bytes is exquisitely designed to confound any programmer who has ever used another language. More particularly, this means that the implicit conversions between bytes and other numeric types will preserve the signed value, not the representation. Thus, since 0xFF has the signed byte value -1, whenever you use this byte in, say, an int context, it will be silently and implicitly converted into the int -1 as opposed to the int 255 any sane person would expect. Further, since Java comes from the C tradition of languages, it likes to use magical values to indicate errors (e.g., using -1 when only positive values are valid). I'm sure you can see where this is headed.

The InputStream class is one of the plethora of classes for reading from a file. In particular, it's the one used by the CLIPC library for reading from Posix FIFOs. Another awful thing about Java is that it refuses to believe in anything that systems programmers use, because by definition these things aren't portable to every operating system. I believe I've alluded to this before. Among other things it means that Java steadfastly refuses to acknowledge the fact that separate processes can communicate without going through the network stack. If you want RPCs, then there are numerous options available to you. But heaven help you if you want to use IPC in order to avoid that overhead. And don't get me started on signal handling. Luckily for me, CLIPC does the nasty JNI coding necessary to convince Java to believe in FIFOs (and it even works for Windows FIFOs).

Subclasses of InputStream provide a method which allows you to read one byte at a time from wherever it reads from. However, this method has the type int read() where the return value -1 indicates end of file and values 0 through 255 indicate the byte that was read. This is just begging for bugs. You can't just return a byte, since all the bytes from 0x80 to 0xFE (-128 to -2) will be implicitly converted into invalid return values, and the byte 0xFF (-1) will be implicitly converted to EOF.

Using the standard trick for amortizing IO costs, the CLIPC library reads a buffer's worth of data whenever it needs to, and then returns that data as requested. That buffer has the exceptionally reasonable type byte[]. Unfortunately, when reading a single byte it does this innocuous little thing: int returnValue = buffer[i];. Very easy to miss when scanning though hundreds of lines of Java, even when you have an inkling that signed bytes may be at fault. This is why implicit conversion between numeric types is Evil. Haskell does the right thing in requiring all coercions to be explicit. To fix the bug we must do: int returnValue = ((int) buffer[i]) & 0xFF;. Unfortunately, the buffer field is private in FIFOInputStream so I can't make a subclass to fix the bug.

Surprisingly, except for the interpretation of 0xFF as EOF when reading the length header, Google's protobuf code works perfectly fine with all the other errorful byte values since it uses bytewise operations instead of arithmetic. And reads chunks at a time elsewhere, thus avoiding the bug. Good job Google :)

I've sent a bug report to the author of CLIPC, but I've yet to hear back, and cursory googling indicates he may have dropped off the internet a couple years ago. If I haven't heard back by the time I make the first public release of Posta (estimated mid--late May 2011), then I'll have to fork CLIPC since I don't want Posta's license to be dictated by this externality. LGPL 2.1+ is great and all, but this bug is a silly thing to base a license around.

I offer you a thought experiment. For your current project you've set up some inter-process communication. Nothing tricky involved, just your standard client–server deal. You've even outsourced the protocol design and are using Google's code to handle all the grody details of serializing and deserializing. Well, okay, on the server side you're using someone else's code but it implements the same protocol, right? Now you run into a bug.

The vast majority of the time everything works smoothly— even verified by taking SHA1 hashes of the messages on both sides and comparing them. But every so often the Java client crashes. In particular, it crashes whenever reading a result message (from the server) of length 255 or 383 and maybe some larger sizes. It does, however, work perfectly fine for intervening message lengths (including 254 and 256). So what's wrong?

( A hint ) ( Another hint ) ( One last hint )

Solution posted here.

This weekend I've been doing a solo hackathon to try to get Posta integrated with our variant of the Mink parser. All the core algorithms have already been implemented, so it's just been a whole lot of yak shaving. Namely, I have to define an IPC protocol for Haskell to talk to Java, implement the (Haskell) server executable and (Java) client stubs, and then try to shake the thing out to find bugs and performance holes. Ideally, by tuesday morning.

Unfortunately, Java doesn't believe in inter-process communication, so all the libraries out there are for doing RPCs. Since the parser and tagger will be operating interactively, and on the same machine, it's quite silly to go through the network stack just to pass a few bytes back and forth. Thankfully I found CLIPC which should do the heavy lifting of getting Java to believe in POSIX named pipes. In order to handle the "on the wire" de/encoding, I've decided to go with Google's protocol buffers since there's already a protobuf compiler for Haskell. I was considering using MessagePack (which also has Haskell bindings), but protobuf seemed a bit easier to install and work with.

For all the plumbing code I decided to try working with iteratees, which have lots of nice performance guarantees. The protobuf libraries don't have integrated support for iteratees, but the internal model is a variant of iteratees so I was able to write some conversion functions. Attoparsec also uses an iteratee-like model internally, and there's integration code available. For my uses I actually need an enumeratee instead of an iteratee, so I had to roll one of my own.

TODO: (easy) move the CG2 training module into the library
TODO: (low priority) write a CG3 training module
DONE: ~~write an unparser for TnT lexicon and trigram files~~
TODO: (easy) write function to feed trained models into the unparser
TODO: (postponed) write wrapper executable to train models and print them to files
TODO: (postponed) write function to read TnT parser output into models directly (the TnT file parsers were done previously)
DONE: ~~choose a library for commandline argument munging~~ cmdargs
TODO: add commandline arguments for passing models to the server
DONE: ~~write protocol buffer spec for IPC protocol~~
DONE: ~~write Java client handlers for the IPCs~~
TODO: (low priority) write Haskell client handlers for debugging/verification of Java
TODO: write Haskell code for dumping intern tables to files, and reading them back in
TODO: write Java code for reading intern table files, so the client can dereference the ints
DONE: ~~write functions for converting the protobuf Get monad into an iteratee or enumeratee~~
TODO: write Haskell server handlers for the IPCs
TODO: write STM code for parallelizing the tagging and IPC handling
DONE: ~~write function for converting attoparsec parsers into enumeratees~~
TODO: (low priority) integrate attoparsec enumeratees into model training, etc, to replace top-level calls to many
DONE: ~~write lots of other auxiliary functions for bytestring, attoparsec, and iteratee~~