Shadowcrane
Only recently, thanks to the computer, has it become feasible to solve real, nontrivial problems of reasoning from incomplete information, in which we use probability theory as a form of logic in situations where both intuition and "random variable" probability theory would be helpless. This has brought out the facts in a way that can no longer be obscured by arguments over philosophy. One can always argue with a philosophy; it is not so easy to argue with a computer printout, which says to us: "Independently of all your philosophy, here are the facts about what this method actually gives when applied."

Daaamn. That's some gettin' told right there.

The above quote (emphasis added) is from the eminently readable Probability in Quantum Theory by E.T. Jaynes, which presents a critique and alternative perspective on the role of probability within quantum mechanics. If you've any interest in philosophy of science or the philosophical disputes between frequentism and Bayesianism, even if you've no real knowledge of physics, then I highly recommend reading it. While the frequentist vs Bayesianist argument is well-known of, the details of what is actually at stake are less well-known and often quite subtle. I think the author does a good job of bringing out and highlighting what the argument is about, and why it is relevant to the future of science (especially physics).

For my part, I've been well indoctrinated into the Bayesian philosophy. This semester I'm taking a course on frequentism, or rather on "experimental methods" as they call it. A professor here has been pushing hard for Bayesian methods in behavioral sciences, and the professor of my class delights in teasing him about it (though he admits to no investment in the philosophical debate). It's been a very long time since I've seen the frequentist perspective, and I'm always of the opinion that it's good to keep an eye on one's philosophical enemies. I've known that frequentism has long dominated the behavioral sciences, but I must shamefully admit that I've attributed this to them being "soft" (even for as much as I identify with my undergrad training in anthropology and humanities). However, coming from machine learning where Bayesianism is de rigueur, one thing I found startling about the article is that, apparently, in physics too it's the frequentists who've dominated the conversation for decades. Indeed, as Jaynes portrays it, it's the frequentists who ousted Laplace, rather than the other way around as is portrayed in AI/ML circles.

Shadowcrane

For those who haven't heard, John McCarthy (1927–2011) (of Lisp and Artificial Intelligence fame) passed on last night.

Edit: And since I hadn't mentioned it, Dennis Ritchie (1941–2011) (of C and Unix fame) passed on a short while back.

Shadowcrane

I finally got around to posting the slides for a talk I gave twice this summer: Probability Smoothing for NLP: A case study for functional programming and little languages. The first version of the talk was presented at the McMaster Workshop on Domain Specific Lanaguages (and Ed Kmett has posted a video of that version on YouTube) with the presentation focused on EDSLs, with smoothing given as an example. The second version was presented at the AMMCS minisymposium on Progress and Prospects in Model-Based Scientific Software Development, where the focus was more on the domain itself and how the use of a DSL allows ensuring correctness, modularity, and maintainability of code for developing probability models. The slides are essentially the same for both talks, with the benchmarks updated a bit in the latter.

As you may have surmised, this is but a small facet of the Posta project I was working on last year. I had meant to submit it as a functional pearl for ICFP, but the timing didn't work out for that. After giving the McMaster version of the talk, Ed convinced me that I should publish the code for the smoothing DSL separately from the rest of Posta. So he's the one to blame about my being so slow in releasing the Posta code I promised this summer. Though seriously, I'd been considering breaking up and reorganizing the code anyways. Now that I'm back from ICFP and all my traveling over the summer, I hope to get that code pushed out soon. Sorry for the delay y'all.

Shadowcrane

All last week I was in Tokyo to attend ICFP and associated workshops. It was nice to finally meet a bunch of people I've been talking with online for the last few years. And I met up with Ken Shan and Oleg Kiselyov again, which is always a pleasure. Unlike last time I was in Japan, I didn't get too much time to explore and go sightseeing. I got to explore Chiyoda, which I missed last time around, and I made sure to do two of the most important things: (1) eat some okonomiyaki, (2) visit Akihabara to buy some new manga.

My newest acquisition is 銃姫 ("Gun Princess") Phantom Pain, which I'm rather enjoying so far. Anything that starts off with an execution, spell-casting based on Buddhist mantras, and a prolonged diatribe on why one of the characters is a good-for-nothing incompetent layabout, can't be half bad :) Unfortunately, I only got the first two volumes, so I'll finish them all too soon. So far it's proving easier to read than my previous acquisition (Peace Maker 鐵), though I'm not sure if that's due to getting better at Japanese or because 鐵 is written in a particular style. I definitely noticed my deterioration in fluency since five years ago; grammar's fine, but my vocab is abysmal. I need to find a decent way to work on that.

Shadowcrane

Via [personal profile] silmaril: If you're a dancer, you've heard about isolation. If you're a belly dancer, you've heard a lot about isolation. Well.... Watch this, and weep. In envy, I won't lie.

Oh yes, weeping shall commence.

Shadowcrane

Ten years ago a large number of people died tragically. After a decade of colonial warfare, torture, and oppression, Americans are starting finally to come to the realization that maybe it would be good of us to stop killing. As so eloquently phrased in On 9/11 and the War on "Terror": Names, Numbers and Events:

The events that have been taking place since 9/11 are not something that came out of the blue, but rather they are best understood as a continuation of a long history of deception, racism of Western modernity, and the ways in which those who are not white/westerners have figured into this history.

On this day, so full of jingoistic pride at "liberating" humans from their life on Earth, you should read that article and take Ibn Khaldoun's message to heart. Do not think that you are somehow special and exempt. Who funded and armed Al-Qaeda in order to oppose the Soviets? Who put Saddam Hussein into power and supported his regime? Who supported the Shah in Iran and opposed the democratic revolution in the 1970s? Who has been a close ally and long-time supporter of Mubarak? Whose officials helped Gaddafi cling to power and advised him in stomping out the beginning of Libya's uprising? The uprisings throughout the Arabic world are not uprisings against "Muslim extremists". The uprisings are Arabic peoples who are finally able to free themselves from the shackles of western hegemony.

Today, right now, a large number of people are dying tragically. They need our help; and the help they need is for us to stop killing them.

Shadowcrane

Last week was a whirlwind. It was the first week of classes, which normally wouldn't be a big thing, except this semester I'm teaching a course. The first couple months of summer were pretty sedate up in Canada. But the last month, leading into the start of fall term, was full of traveling. I came back from Canada for a couple weeks, then left for a week with Licia, came back for a couple days (literally) and then flew off to California for Lindsey and Alex's wedding, arriving home the night before I needed to teach my first 9:30am class. Things've settled down now, though I'm heading off to ICFP next friday.

One thing traveling is good for is getting caught up on pleasure reading. In addition to the Vinge mentioned last time, I also got to read some new C.S. Friedman. After returning from Canada I got a bunch of new games for the PS3 too. Portal 2 is good fun, though the atmosphere feels like a bizarre hybrid between the first Portal and the Fallout franchise; fitting in its way, but very strange. I've also been playing through El Shaddai and reveling in the beauty of Amaros. Unlike a lot of Japanese games, the US version lets you keep the original voice acting, which is fabulous. Dunno how good the English voices are actually; maybe next time I play through it I'll find out. And then there's Catherine: an adult romantic horror by the team who did the Persona series. It's actually a puzzle game, where you're trying to climb a tower that crumbles beneath you. Both the puzzling and the plot are top rate, as to be expected from Atlus and SMT. There are other books and other games, but I'm not feeling like doing any proper reviews just yet.

In addition to teaching, I'm taking two courses this term. Advanced Phonetics, continuing from the Phonetics course I took last spring. Back at Reed for my undergrad we didn't have any phonetics courses, only phonology; so I've been getting caught up on that, as well as filling out the requirements for the Linguistics half of my dual PhD. The other course (Q551) is an intro to cognitive neuropsychology. It's something of a psychology methods course, with a bit of neuroanatomy and the briefest mention of how the imaging technology works. Last spring I took a course on neuroscience for speech and hearing, and up in Canada I spent the summer with a bunch of computer scientists who work on optimizing the algorithms behind the imaging technology; so I'm not sure how much I'll get out of Q551, but it's a requirement for the CogSci half of the dual PhD. As a (meta)theoretical computational linguist, neuroimaging isn't really my area; but as it turns out there are some interesting problems there and plenty of room for theoretical mathematics. Even after the imaging is done, interpreting the images runs into a lot of the same statistical problems that you get in NLP. Both fields are in need of a new statistics, one which doesn't break down when you have enormous data sets. Maybe one day I'll try working on that.

Shadowcrane

So, one of the things I've been up to in this long silence since posting regularly is getting caught up on my Vernor Vinge reading. I first got started reading Vinge a couple years back, picking up A Deepness in the Sky whilst traveling through Union Station. I fell in love with Deepness and kept meaning to read some of his other work, but found it oddly difficult to locate it in the local bookstores. At the beginning of the summer I picked up copies of Marooned in Realtime and A Fire Upon the Deep from Amazon. Marooned reminds me a lot of G.R.R. Martin's Dying of the Light (which I very highly recommend). I'm still reading through Fire, which has a lot of what I loved about Deepness: namely detailed consideration of the cognitive nature of alien life, especially the effects of alien bodies on cognition, as opposed to the "everyone's human(oid)" perspective familiar from Star Trek and most SF.

For those unfamiliar with Vinge, one of the major themes in his works is the idea of the Singularity. Much of this was novel when he was first writing about it, though it's a mainstream idea these days. There's been a lot of discussion on the technical, technological, and philosophical considerations behind Singularities; just google for transhumanism and you'll run into it. However, I just ran across a post by Elizabeth Bear which comes at it from, IMO, a more interesting direction: namely, analyzing the Singularity as an artistic movement in literature and analyzing it through the lens of critical theory, feminism, etc. I definitely believe that SF is, and has always been, a tool for exploring the current world around us and especially for trying to interpret the effects that current technologies have on social life; but the problems we're working out are not always obvious at the time. Perhaps the Singularity is now old enough that we can start to untangle all the concerns it was invented to address. Bear's posts (both the one I linked to, and the 2006 post cited therein) are a good start in that direction.

Shadowcrane

An Atlanta area mother was recently convicted of vehicular homicide. Convicted for the crime of being a pedestrian hit by a drunken driver, a driver who was also on painkillers, also half blind, also convicted of two previous hit-and-runs. Her child was also hit, and killed, which is why she's now a criminal. In truth, she was convicted for the crime of being black and poor in America. I haven't been in the country six hours and this is the news story that greets me. Racism, the othering of people who take public transit, and the deadly violent car culture that dominates the US.

A (white) friend of mine was killed in a crosswalk in Portland years ago. The SUV driver couldn't be bothered to check if it was safe when making a left-hand turn across a busy street, at full speed without slowing at all. There were witnesses. He also fled the scene. She was headed to the corner to next to her apartment to buy a mop.

I've been hit three times in crosswalks, all three with the walk sign on and a red light for the cars. Only one of those times was it serious. But I was saved by the fact that the rich white man was driving a sporty little thing so I went over the hood instead of under.

What a welcome home.

Shadowcrane

As Glenn Greenwald helpfully pointed out, the editors of the NYT — America's allegedly liberal newspaper — reserve the word "terrorist" solely for use in conjunction with the word "Muslim".

All the hate mongering confirmation bias in the wake of the tragedy in Oslo just shows how much the extreme terrorism of the right wing has corrupted American culture. As Ahmed Moor says in zir excellent commentary on the situation,

But not all liberals are created equal.

It is a credit to the Norwegian people that their prime minister did not respond to the terror attack with scorched-earth rhetoric or a carpet-bombing campaign. A real liberal with strong principles, he did not succumb to fear or vicious speculation.

(As always, the hat-tip goes to [profile] homasse for cutting to the heart of the issue, as well as informing me of world news in my last week here in Canada.)

Shadowcrane

unification-fd 0.5.0

The unification-fd package offers generic functions for first-order structural unification (think Prolog programming or Hindley–Milner type inference). I've had this laying around for a few years, so I figured I might as well publish it.

An effort has been made to try to make this package as portable as possible. However, because it uses the ST monad and the mtl-2 package it can't be H98 nor H2010. However, it only uses the following common extensions which should be well supported[1]:

Rank2Types
MultiParamTypeClasses
FunctionalDependencies
FlexibleContexts
FlexibleInstances
UndecidableInstances

[1] With the exception of fundeps which are notoriously difficult to implement. However, they are supported by Hugs and GHC 6.6, so I don't feel bad about requiring it. Once the API stabilizes a bit more I plan to release a unification-tf package which uses type families instead, for those who feel type families are easier to implement or use.

Description

The unification API is generic in the type of the structures being unified and in the implementation of unification variables, following the two-level types pearl of Sheard (2001). This style mixes well with Swierstra (2008), though an implementation of the latter is not included in this package.

That is, all you have to do is define the functor whose fixed-point is the recursive type you're interested in:

    -- The non-recursive structure of terms
    data S a = ...
    
    -- The recursive term type
    type PureTerm = Fix S

And then provide an instance for Unifiable, where zipMatch performs one level of equality testing for terms and returns the one-level spine filled with pairs of subterms to be recursively checked (or Nothing if this level doesn't match).

    class (Traversable t) => Unifiable t where
        zipMatch :: t a -> t b -> Maybe (t (a,b))

The choice of which variable implementation to use is defined by similarly simple classes Variable and BindingMonad. We store the variable bindings in a monad, for obvious reasons. In case it's not obvious, see Dijkstra et al. (2008) for benchmarks demonstrating the cost of naively applying bindings eagerly.

There are currently two implementations of variables provided: one based on STRefs, and another based on a state monad carrying an IntMap. The former has the benefit of O(1) access time, but the latter is plenty fast and has the benefit of supporting backtracking. Backtracking itself is provided by the logict package and is described in Kiselyov et al. (2005).

In addition to this modularity, unification-fd implements a number of optimizations over the algorithm presented in Sheard (2001)— which is also the algorithm presented in Cardelli (1987).

  • Their implementation uses path compression, which we retain. Though we modify the compression algorithm in order to make sharing observable.
  • In addition, we perform aggressive opportunistic observable sharing, a potentially novel method of introducing even more sharing than is provided by the monadic bindings. Basically, we make it so that we can use the observable sharing provided by the previous optimization as much as possible (without introducing any new variables).
  • And we remove the notoriously expensive occurs-check, replacing it with visited-sets (which detect cyclic terms more lazily and without the asymptotic overhead of the occurs-check). A variant of unification which retains the occurs-check is also provided, in case you really need to fail fast for some reason.
  • Finally, a highly experimental branch of the API performs weighted path compression, which is asymptotically optimal. Unfortunately, the current implementation is quite a bit uglier than the unweighted version, and I haven't had a chance to perform benchmarks to see how the constant factors compare. Hence moving it to an experimental branch.

I haven't had a chance to fully debug these optimizations, though they pass some of the obvious tests. If you find any bugs, do be sure to let me know. Also, if you happen to have a test suite or benchmark suite for unification on hand, I'd love to get a copy.

References

Luca Cardelli (1987)
Basic polymorphic typechecking. Science of Computer Programming, 8(2): 147–172.
Atze Dijkstra, Arie Middelkoop, S. Doaitse Swierstra (2008)
Efficient Functional Unification and Substitution, Technical Report UU-CS-2008-027, Utrecht University.
Oleg Kiselyov, Chung-chieh Shan, Daniel P. Friedman, and Amr Sabry (2005)
Backtracking, Interleaving, and Terminating Monad Transformers, ICFP.
Tim Sheard (2001)
Generic Unification via Two-Level Types and Paramterized Modules, Functional Pearl, ICFP.
Tim Sheard and Emir Pasalic (2004)
Two-Level Types and Parameterized Modules. JFP 14(5): 547–587. This is an expanded version of Sheard (2001) with new examples.
Wouter Swierstra (2008)
Data types a la carte, Functional Pearl. JFP 18: 423–436.

Links

To Canada!

Jun. 3rd, 2011 08:04 am
Shadowcrane

Hello all. This summer (that is, next week) I'm heading up to Canada to teach a DSL bootcamp at McMaster along with Edward Kmett, and staying afterwards for a couple months. Rather than dealing with the TSA's bullshit I've decided to take the train up, which is cheaper and only slightly slower once you account for all the nonsense. But, this means I'll be spending a good deal of time laying over in Chicago and Buffalo. I know a bunch of you have traveled around, and some of you may happen to be there when I am, so: anyone want to meet up while I'm there? or knows of some good places to eat and spend time when visiting?

I'll be in Chicago around 4pm–9:30pm on June 10, and 9:45am–3:20pm on July 29. And in Buffalo around 9am–3pm on June 11, and 1:30pm–midnight on July 28. So that's about five hours each stop, ten hours coming back through Buffalo. Amtrak being what it is, some of this time might get eaten up by poor track scheduling, but I figure I'll still have a good deal of time regardless.

Also, anyone have recommendations for places to eat in downtown Indianapolis? I have an hour to kill around noon when heading up.

Shadowcrane

I've mentioned these a few times in different places, most recently on Reddit. So I figured I should repost them here for better googleability for folks just starting to learn about dependent types and type theory.

PSA #1: Beware: "type theory" and "Type Theory" are not the same thing, note the caps. The former describes the entire field of enquiry which explores possible theories about types; the latter is the name of one particular theory/system, namely the one Per Martin-Löf popularized (and others have extended and reinvented since then).

Yes, this is an extremely obnoxious detail, but it's one that often confuses newcomers; e.g., learning MLTT and thinking that applies to all type systems, or learning other type systems and then being confused when people make flagrantly false statements (when interpreted as statements about type theory as a whole, though they're true of MLTT). This is why I prefer to refer to Type Theory as "TT" or "MLTT", to avoid confusion, and because I am interested in the entire field of enquiry and in comparing different systems rather than focusing on just one.

ObTangent: There are similar reasons for why ML is called "ML" instead of "metalanguage".

PSA #2: There are conflicting meanings for the terms "sum" and "product" in dependent type land. Those coming from category theory and functional programming tend to say "sum" to mean tagged unions, "product" to mean pairs (e.g, cartesian products or similar), and "exponentials" to mean functions as objects— all of these exactly as in non-dependent languages. However, those coming more from the set-theory side of things use "product" to mean functions (whence the capital Pi), "sum" to mean pairs (whence the capital Sigma), and have no common term for unions.

Shadowcrane

I've been working on a tagging library (and executable) for a bit over a year now. When the project started I had the advantage of being able to choose the language to do it in. Naturally I chose Haskell. There are numerous reasons for this decision, some of which have been derided as "philosophical concerns". Certainly some of the reasons why Haskell is superior to other languages do border on the philosophical. Y'know, silly little things like the belief that type systems should prevent errors rather than encouraging them to proliferate. I'm sure you've heard the arguments before. They're good arguments, and maybe they'll convince you to try out Haskell in your basement. But in many so-called "enterprise" settings, anything that even smells like it might have basis in theoretical fact is automatically wrong or irrelevant; whatever you do in the privacy of your basement is your own business, but heaven forbid it have any influence on how decisions are made in the workplace! So, here is a short list of entirely pragmatic, practical, and non-theoretical reasons why Haskell is superior to Java for implementing enterprise programs. More specifically, these are reasons why Haskell is superior for my project. Perhaps they don't matter for your project, or perhaps they'll be enough to convince your boss to let you give Haskell a try. Because design decisions are often project-specific, each point explains why they matter for Posta in particular.

  • Haskell has powerful frameworks for defining modular, high-performance, non-trivial parsers (e.g., Attoparsec). In natural language processing (NLP), just like system administration, over half of the work you do involves dealing with a handful of different ad-hoc poorly defined file formats. Reading them; generating them; converting from one format to another; etc. Because every one of these formats grew out of a slow accretion of features for one particular project, they're riddled with inconsistencies, idiosyncratic magic values, corner cases, and non-context-free bits that require special handling. In Java the premiere tool (so far as I know) for defining parsers is JavaCC. (Like the C tools lex and yacc, JavaCC uses its own special syntax and requires a preprocessor, whereas Attoparsec and the like don't. However, this may be a "philosophical" issue.) However, as of last time I used it, JavaCC is designed for dealing with nice clean grammars used by programming languages and it doesn't handle inconsistent and irregular grammars very well.
  • Posta uses a system of coroutines (called "iteratees") in order to lazily stream data from disk, through the parsers, and into the core algorithms, all while maintaining guarantees about how long resources (e.g., file handles, memory) are held for. This allows handling large files, because we don't need to keep the whole file in memory at once, either in its raw form or in the AST generated by parsing it. For modern enterprise-scale NLP, dealing with gigabyte-sized files is a requirement; because many NLP projects are not enterprise-scale, you get to spend extra time chopping up and reformatting files to fit their limitations. Last time I used JavaCC it did not support incremental parsing, and according to the advertised features it still doesn't. In addition, implementing coroutines is problematic because Java's security model precludes simple things like tail-call optimization--- meaning that you can only support this kind of streaming when the control flow is simple enough to avoid stack overflows.
  • Haskell has awesome support for parallelism. One version, called STM, provides composeable atomic blocks (which matches the way we naturally think about parallelism) combined with lightweight threads (which make it cheap and easy). Java has no support for STM. I am unaware of any support for lightweight threads in Java. The only parallelism I'm aware of in Java is the monitor-style lock-based system with OS threads. As with all lock-based systems, it is non-composeable and difficult to get right; and as with using OS threads anywhere else, there is high overhead which removes the benefits of parallelizing many programs.
  • Posta makes extensive use of partial evaluation for improving performance; e.g., lifting computations out of loops. When doing NLP you are often dealing with triply-nested loops, so loop-invariant code motion is essential for performance. In my benchmarks, partial evaluation reduces the total running time by 10%. If raw numbers don't convince you: using partial evaluation allows us to keep the code legible, concise, modular, and maintainable. The primary use of partial evaluation is in a combinator library defining numerous smoothing methods for probability distributions; the results of which are called from within those triply-nested loops. Without partial evaluation, the only way to get performant code is to write a specialized version of the triply-nested loop for every different smoothing method you want to support. That means duplicating the core algorithm and a lot of tricky math, many times over. There's no way to implement this use of partial evaluation in anything resembling idiomatic Java.
  • Posta uses an implementation of persistent asymptotically optimal priority queues which come with proofs of correctness. A persistent PQ is necessary for one of the tagger's core algorithms. Since the PQ methods are called from within nested loops, performance is important. Since we're dealing with giga-scale data, asymptotics are important. A log factor here or there means more than a 10% increase in total running time. In Java there's java.util.PriorityQueue but it has inferior asymptotic performance guarantees and is neither persistent nor synchronized. I'm sure there are other PQ libraries out there, but I doubt anyone's implemented the exact version we need and shown their implementation to be correct.

I'll admit I'm not up to date on state-of-the-art Java, and I'd love to be proven wrong about these things being unavailable. But a couple years ago when I returned to Java after a long time away, I learned that all the hype I'd heard about Java improving over the preceding decade was just that: hype. I have been disappointed every time I hoped Java has some trivial thing. The most recent one I've run into is Java's complete refusal to believe in the existence of IPC (no, not RPC), but that's hardly the tip of the iceberg.

Shadowcrane

Immutability and Blocks, Lambdas and Closures (André Pang): Mr. Python, please call the office. Your abstractions are leaking.

Shadowcrane

stm-chans 1.0.0

The stm-chans package offers a collection of channel types, similar to TChan but with additional features. In particular it offers these types:

TBChan: Bounded FIFO channels.
When the channel is full, writers will block/retry. This ensures that the writers do not get too far ahead of the readers, which helps to make sure that memory and cpu resources are used responsibly.
TMChan: Closeable FIFO channels.
This is like TChan (Maybe a) but with a monotonicity guarantee that once Nothing is returned all future reads will be Nothing as well.
TBMChan: Bounded Closeable FIFO channels.
This combines the capabilities of TBChan and TMChan.

In addition, the stm-chans package offers a (partial) compatibility layer for some API improvements still making their way into the stm package[1]. These new functions include:

tryReadTChan :: TChan a -> STM (Maybe a)
A version of readTChan which does not retry. Instead it returns Nothing if no value is available.
peekTChan :: TChan a -> STM a
Get the next value from the TChan without removing it, retrying if the channel is empty.
tryPeekTChan :: TChan a -> STM (Maybe a)
A version of peekTChan which does not retry. Instead it returns Nothing if no value is available.

Links

Shadowcrane

Last week I gave a debugging problem. Well, now it's time for the big reveal. If you'd like to try your hand at it, you should follow that link and stop reading this post. It took me a couple weeks to figure this one out. Given the nature of the symptoms (failure only on messages of lengths 255 and 383) it was pretty clear that this was some obscure and annoying bug, like an off-by-one error. I read through the relevant code for the whole communication stack (Haskell protobufs, Haskell FIFO IO, Java FIFO IO, Java protobufs) with a inkling of what the problem might be, but I still missed it the first time through.

In the hints I suggested that length 511 should also be problematic, and that we were using the length-delimited version of protobufs. For those who aren't familiar with protobufs there are only two relevant details of the encoding. The first is that length-delimited protobufs write the length before the payload (as mentioned in the hint), and the second is how that length is encoded. In order to reduce the message size, protobufs use a variable length encoding for integers. The encoding says that we read a byte at a time (little-endian order), keeping the low-order 7 bits as part of the number. If the 8th bit is unset, then we're done; if it's set, then we continue reading the next byte. Thus, the length 255 is encoded as 0xFF 0x01 ((0xFF .&. 0x7F) .|. (0x01 `shiftL` 7) == 255), 383 is encoded as 0xFF 0x02, and 511 is encoded as 0xFF 0x03. Conversely, the non-problematic 254 is encoded as 0xFE 0x01, and 256 is encoded as 0x80 0x02.

The crucial knowledge, however, is knowing that for some unfathomable reason Java decided to define their byte type as a signed numeric type! On the one hand, this is perverse because bytes are always used when dealing with binary formats and almost never used for actual numerical computation. On the other hand, it is sadistic because every other language in common use (and most of the ones not in common use) have unsigned bytes, so Java's choice to use signed bytes is exquisitely designed to confound any programmer who has ever used another language. More particularly, this means that the implicit conversions between bytes and other numeric types will preserve the signed value, not the representation. Thus, since 0xFF has the signed byte value -1, whenever you use this byte in, say, an int context, it will be silently and implicitly converted into the int -1 as opposed to the int 255 any sane person would expect. Further, since Java comes from the C tradition of languages, it likes to use magical values to indicate errors (e.g., using -1 when only positive values are valid). I'm sure you can see where this is headed.

The InputStream class is one of the plethora of classes for reading from a file. In particular, it's the one used by the CLIPC library for reading from Posix FIFOs. Another awful thing about Java is that it refuses to believe in anything that systems programmers use, because by definition these things aren't portable to every operating system. I believe I've alluded to this before. Among other things it means that Java steadfastly refuses to acknowledge the fact that separate processes can communicate without going through the network stack. If you want RPCs, then there are numerous options available to you. But heaven help you if you want to use IPC in order to avoid that overhead. And don't get me started on signal handling. Luckily for me, CLIPC does the nasty JNI coding necessary to convince Java to believe in FIFOs (and it even works for Windows FIFOs).

Subclasses of InputStream provide a method which allows you to read one byte at a time from wherever it reads from. However, this method has the type int read() where the return value -1 indicates end of file and values 0 through 255 indicate the byte that was read. This is just begging for bugs. You can't just return a byte, since all the bytes from 0x80 to 0xFE (-128 to -2) will be implicitly converted into invalid return values, and the byte 0xFF (-1) will be implicitly converted to EOF.

Using the standard trick for amortizing IO costs, the CLIPC library reads a buffer's worth of data whenever it needs to, and then returns that data as requested. That buffer has the exceptionally reasonable type byte[]. Unfortunately, when reading a single byte it does this innocuous little thing: int returnValue = buffer[i];. Very easy to miss when scanning though hundreds of lines of Java, even when you have an inkling that signed bytes may be at fault. This is why implicit conversion between numeric types is Evil. Haskell does the right thing in requiring all coercions to be explicit. To fix the bug we must do: int returnValue = ((int) buffer[i]) & 0xFF;. Unfortunately, the buffer field is private in FIFOInputStream so I can't make a subclass to fix the bug.

Surprisingly, except for the interpretation of 0xFF as EOF when reading the length header, Google's protobuf code works perfectly fine with all the other errorful byte values since it uses bytewise operations instead of arithmetic. And reads chunks at a time elsewhere, thus avoiding the bug. Good job Google :)

I've sent a bug report to the author of CLIPC, but I've yet to hear back, and cursory googling indicates he may have dropped off the internet a couple years ago. If I haven't heard back by the time I make the first public release of Posta (estimated mid--late May 2011), then I'll have to fork CLIPC since I don't want Posta's license to be dictated by this externality. LGPL 2.1+ is great and all, but this bug is a silly thing to base a license around.

Shadowcrane

I offer you a thought experiment. For your current project you've set up some inter-process communication. Nothing tricky involved, just your standard client–server deal. You've even outsourced the protocol design and are using Google's code to handle all the grody details of serializing and deserializing. Well, okay, on the server side you're using someone else's code but it implements the same protocol, right? Now you run into a bug.

The vast majority of the time everything works smoothly— even verified by taking SHA1 hashes of the messages on both sides and comparing them. But every so often the Java client crashes. In particular, it crashes whenever reading a result message (from the server) of length 255 or 383 and maybe some larger sizes. It does, however, work perfectly fine for intervening message lengths (including 254 and 256). So what's wrong?

A hint ) Another hint ) One last hint )

Solution posted here.

Shadowcrane

unix-bytestring 0.3.2

The unix-bytestring package offers a full selection of Unix/Posix-specific functions for reading and writing ByteStrings to file descriptors.

The Story Behind It All

A while back I needed a ByteString-based version of the System.Posix.IO String-based API. I coded up the new versions and submitted a patch to the unix package for adding them in. But apparently, noone's much pleased with the System.Posix.IO API! In the ensuing discussion folks brought up their specific complaints and offered suggestions. So now, the couple of functions have become enough to release as its own package— with the hope that one day it'll be rolled into the unix package.

That's where you come in. The current API is more than good enough for anything I've needed, so I need your feedback on what you want from a ByteString-based Unix/Posix I/O library.

I've looked through Roel van Dijk's reverse dependency analysis to find the folks currently using the unix package. The maintainers of the following packages may be particularly interested in unix-bytestring:

HFuse has complained about the lack of bindings for pread(2) and pwrite(2), which are included in unix-bytestring. The only standing complaint I haven't addressed is one expressed in the iteratee, iteratee-mtl, and liboleg packages which would prefer a return type of (Either Errno _) instead of throwing exceptions.

Links

Shadowcrane

I've said it all before (and been harangued for doing so), but maybe it'll be heard better coming from someone else's mouth. Here's the shortest excerpt I can give from Why I'm leaving feminism:

My ‘issues’ being things like the rape of people in institutions, the fact that the average transgender person can expect to live for 23 years, forcible institutionalisation of people whom society doesn’t want to look at, ridiculously high domestic violence and sexual assault rates for transgender people and people with disabilities. The widening pay gap between white women and women of colour, the fact that the median net worth for Black women is $5. The fact that fat patients die without treatment due to fat hatred in the medical community. The fact that industrial pollution disproportionately impacts communities of colour, that class mobility is at an all time low, that the rich are getting richer while the poor get poorer, that protections for worker safety are steadily being eroded, that unions are under attack in the United States.

These barely scratch the surface of ‘my issues.’ Because I believe that no human is free until all humans are free, no human is equal until all humans are equal, no gains for one group at the cost of another are acceptable. I believe in social justice, in liberty for all. These are my issues. And many people who identify themselves as feminists tell me the issues need to wait. They pay lip service to them until something more important comes along and then it becomes all-consuming. They repeat the same mistakes make by older generations and appear surprised at the inevitable outcome.

[...]

People who continue to be celebrated as feminist heroes leave a legacy of ableism, racism, classism, transphobia in their wake. The feminist movement has never gotten away from this, despite the best attempts of many of its members.

For a long time, I genuinely believed I could change the feminist movement from within. I thought if I fought hard enough, and long enough, feminism would make a place at the table for me, that I would be welcome in the feminist community. But it’s painfully evident I am not wanted, not in mainstream feminism, which is the ‘feminism’ most people are exposed to. I know well enough to know where I’m not wanted. The leaders of the feminist movement don’t just have a lack of interest in ‘my issues,’ they actively want to suppress my voice, and the voices of people like me. They want us to shut up and go away. It’s evident from the palpable sighs of relief when they manage to quash us, it’s evident from the total silence when a disabled women talks about why she is leaving feminism and not one person, not one, says anything about it.

So many disabled people, nonwhite people, transgender people, people of colour, poor people, adamantly refuse to identify with feminism in its current incarnation in the United States. ‘Feminists’ talk about this in the sense that we’re all really feminist in how we think, behave, and act, we just have some irrational resistance to the label. No, we’re not really feminist. The model of feminism we see is one where oppression perpetrated in the name of ‘activism’ is acceptable, where casual ableism, racism, classism, transphobia run so deep that many of us don’t even bother to point it out anymore. The model of feminism we see is one where a handful of people profit at the expense of others. And that’s not how we think, behave, and act. That is not what we believe.

Profile

Shadowcrane
wren ng thornton

Syndicate

RSS Atom

January 2012

S M T W T F S
1234567
891011121314
15161718192021
22232425 262728
293031    

Most Popular Tags

Expand Cut Tags

No cut tags

Style Credit

Page generated Jan. 28th, 2012 05:05 pm
Powered by Dreamwidth Studios