Last week I gave a debugging problem. Well, now it's time for the big reveal. If you'd like to try your hand at it, you should follow that link and stop reading this post. It took me a couple weeks to figure this one out. Given the nature of the symptoms (failure only on messages of lengths 255 and 383) it was pretty clear that this was some obscure and annoying bug, like an off-by-one error. I read through the relevant code for the whole communication stack (Haskell protobufs, Haskell FIFO IO, Java FIFO IO, Java protobufs) with a inkling of what the problem might be, but I still missed it the first time through.
In the hints I suggested that length 511 should also be problematic, and that we were using the length-delimited version of protobufs. For those who aren't familiar with protobufs there are only two relevant details of the encoding. The first is that length-delimited protobufs write the length before the payload (as mentioned in the hint), and the second is how that length is encoded. In order to reduce the message size, protobufs use a variable length encoding for integers. The encoding says that we read a byte at a time (little-endian order), keeping the low-order 7 bits as part of the number. If the 8th bit is unset, then we're done; if it's set, then we continue reading the next byte. Thus, the length 255 is encoded as 0xFF 0x01 (
(0xFF .&. 0x7F) .|. (0x01 `shiftL` 7) == 255), 383 is encoded as 0xFF 0x02, and 511 is encoded as 0xFF 0x03. Conversely, the non-problematic 254 is encoded as 0xFE 0x01, and 256 is encoded as 0x80 0x02.
The crucial knowledge, however, is knowing that for some unfathomable reason Java decided to define their
byte type as a signed numeric type! On the one hand, this is perverse because bytes are always used when dealing with binary formats and almost never used for actual numerical computation. On the other hand, it is sadistic because every other language in common use (and most of the ones not in common use) have unsigned bytes, so Java's choice to use signed bytes is exquisitely designed to confound any programmer who has ever used another language. More particularly, this means that the implicit conversions between bytes and other numeric types will preserve the signed value, not the representation. Thus, since 0xFF has the signed byte value -1, whenever you use this byte in, say, an
int context, it will be silently and implicitly converted into the int -1 as opposed to the int 255 any sane person would expect. Further, since Java comes from the C tradition of languages, it likes to use magical values to indicate errors (e.g., using -1 when only positive values are valid). I'm sure you can see where this is headed.
The InputStream class is one of the plethora of classes for reading from a file. In particular, it's the one used by the CLIPC library for reading from Posix FIFOs. Another awful thing about Java is that it refuses to believe in anything that systems programmers use, because by definition these things aren't portable to every operating system. I believe I've alluded to this before. Among other things it means that Java steadfastly refuses to acknowledge the fact that separate processes can communicate without going through the network stack. If you want RPCs, then there are numerous options available to you. But heaven help you if you want to use IPC in order to avoid that overhead. And don't get me started on signal handling. Luckily for me, CLIPC does the nasty JNI coding necessary to convince Java to believe in FIFOs (and it even works for Windows FIFOs).
Subclasses of InputStream provide a method which allows you to read one byte at a time from wherever it reads from. However, this method has the type
int read() where the return value -1 indicates end of file and values 0 through 255 indicate the byte that was read. This is just begging for bugs. You can't just return a byte, since all the bytes from 0x80 to 0xFE (-128 to -2) will be implicitly converted into invalid return values, and the byte 0xFF (-1) will be implicitly converted to EOF.
Using the standard trick for amortizing IO costs, the CLIPC library reads a buffer's worth of data whenever it needs to, and then returns that data as requested. That buffer has the exceptionally reasonable type
byte. Unfortunately, when reading a single byte it does this innocuous little thing:
int returnValue = buffer[i];. Very easy to miss when scanning though hundreds of lines of Java, even when you have an inkling that signed bytes may be at fault. This is why implicit conversion between numeric types is Evil. Haskell does the right thing in requiring all coercions to be explicit. To fix the bug we must do:
int returnValue = ((int) buffer[i]) & 0xFF;. Unfortunately, the buffer field is private in FIFOInputStream so I can't make a subclass to fix the bug.
Surprisingly, except for the interpretation of 0xFF as EOF when reading the length header, Google's protobuf code works perfectly fine with all the other errorful byte values since it uses bytewise operations instead of arithmetic. And reads chunks at a time elsewhere, thus avoiding the bug. Good job Google :)
I've sent a bug report to the author of CLIPC, but I've yet to hear back, and cursory googling indicates he may have dropped off the internet a couple years ago. If I haven't heard back by the time I make the first public release of Posta (estimated mid--late May 2011), then I'll have to fork CLIPC since I don't want Posta's license to be dictated by this externality. LGPL 2.1+ is great and all, but this bug is a silly thing to base a license around.