
Last Friday I passed my qualifying examinations! So now, all I have left is a bunch of paperwork about that, and then proposing, writing, and defending the dissertation itself. So, in about a year or so I'll be on the job market. And, much as I despise job hunting, I can't wait!

Since defending the quals I've been spending far too much time playing Persona 3 Portable. I've played P3FES, but P3P adds a female protagonist option which changes a bunch of the social interactions, so I've been playing through that side of things. Other than the heterosexual assumptions about the relationships, I've been loving it. More RPGs should have female protagonists. That's one of the reasons I've always loved FF6. Also a big part of why I found FF13 compelling. (Though, tbh: while Lightning is awesome as a protagonist, Vanille is definitely my favorite character :) And a big part of the power of Kreia as a character in KotOR2 stems from her interactions with the canonically-female protagonist.

Speaking of women. I've been presenting as female for a couple months now, and since I have no intention of stopping nor hiding that fact, I've decided to move T-Day forward. Basically, for those who haven't already switched over to the right pronouns etc: T-Day is today. I've sent emails to the department heads in order to get them to send out the "official" memo; so if you haven't gotten it yet, it should show up on Monday or Tuesday.

The next couple months are going to be hectic with paper writing. I'm hoping to get a paper on syntax-based sentiment analysis using matrix-space semantics into one of the CL conferences with deadlines this March. No Haskell involved in that one, though I'll probably spend a few posts discussing the semantic model, which may be of interest to y'all. I'm also planning on getting the work from my first qual paper published; that paper was about Posta, a functional library for interactive/online/incremental tagging with HMMs. Here I'm planning to target journals rather than conferences, and to spread the work over a few papers: one on the overall system (which I need to actually push up to Hackage), one on the higher-order anytime n-best extraction algorithm, and one on reformulating HMM algorithms in terms of foldl and scanl (this may be combined with the HO-AnB paper, length permitting). All of these would target the linguistics audience. Using folds and scans is old hat in functional programming; my particular goal with that paper is to expose linguists to the tools of FP and show how they can greatly simplify how we describe our algorithms. Once those are out of the way I might also see about writing up a functional pearl on the smoothing library I presented at AMMCS a few years back.
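
To give the flavor of that foldl/scanl reformulation, here's a sketch (with invented types, not the actual Posta API): once the one-column trellis update is factored out, the forward pass of an HMM is literally a strict left fold over the sentence, and keeping the whole trellis around is the corresponding scan.

    import Data.List (foldl')

    -- Invented types, for illustration only.
    type Tag    = Int
    type Token  = String
    type Prob   = Double
    type Column = [(Tag, Prob)]   -- one column of the trellis

    -- One step of the forward recurrence:
    --   alpha'[t'] = sum_t alpha[t] * trans t t' * emit t' w
    stepFwd :: (Tag -> Tag -> Prob) -> (Tag -> Token -> Prob) -> [Tag]
            -> Column -> Token -> Column
    stepFwd trans emit tags alpha w =
        [ (t', sum [ a * trans t t' * emit t' w | (t, a) <- alpha ])
        | t' <- tags ]

    -- The forward pass is a fold; keeping every column of the
    -- trellis (e.g. for the backward pass) is the scan.
    forward :: (Column -> Token -> Column) -> Column -> [Token] -> Column
    forward = foldl'

    trellis :: (Column -> Token -> Column) -> Column -> [Token] -> [Column]
    trellis = scanl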


I finally got around to posting the slides for a talk I gave twice this summer: Probability Smoothing for NLP: A case study for functional programming and little languages. The first version was presented at the McMaster Workshop on Domain Specific Languages (Ed Kmett has posted a video of that version on YouTube); that presentation focused on EDSLs, with smoothing as the running example. The second version was presented at the AMMCS minisymposium on Progress and Prospects in Model-Based Scientific Software Development, where the focus was more on the domain itself and on how using a DSL helps ensure the correctness, modularity, and maintainability of code for developing probability models. The slides are essentially the same for both talks, with the benchmarks updated a bit in the latter.

As you may have surmised, this is but a small facet of the Posta project I was working on last year. I had meant to submit it as a functional pearl for ICFP, but the timing didn't work out for that. After the McMaster version of the talk, Ed convinced me that I should publish the code for the smoothing DSL separately from the rest of Posta. So he's the one to blame for my being so slow in releasing the Posta code I promised this summer. Though seriously, I'd been considering breaking up and reorganizing the code anyways. Now that I'm back from ICFP and all my traveling over the summer, I hope to get that code pushed out soon. Sorry for the delay, y'all.


I've been working on a tagging library (and executable) for a bit over a year now. When the project started I had the advantage of being able to choose the language to do it in. Naturally I chose Haskell. There are numerous reasons for this decision, some of which have been derided as "philosophical concerns". Certainly some of the reasons why Haskell is superior to other languages do border on the philosophical. Y'know, silly little things like the belief that type systems should prevent errors rather than encouraging them to proliferate. I'm sure you've heard the arguments before. They're good arguments, and maybe they'll convince you to try out Haskell in your basement. But in many so-called "enterprise" settings, anything that even smells like it might have a basis in theoretical fact is automatically wrong or irrelevant; whatever you do in the privacy of your basement is your own business, but heaven forbid it have any influence on how decisions are made in the workplace! So, here is a short list of entirely pragmatic, practical, and non-theoretical reasons why Haskell is superior to Java for implementing enterprise programs. More specifically, these are reasons why Haskell is superior for my project. Perhaps they don't matter for yours, or perhaps they'll be enough to convince your boss to let you give Haskell a try. Because design decisions are often project-specific, each point explains why it matters for Posta in particular.

  • Haskell has powerful frameworks for defining modular, high-performance, non-trivial parsers (e.g., Attoparsec; see the first sketch after this list). In natural language processing (NLP), just like system administration, over half of the work you do involves dealing with a handful of different ad-hoc, poorly defined file formats: reading them, generating them, converting from one format to another, etc. Because every one of these formats grew out of a slow accretion of features for one particular project, they're riddled with inconsistencies, idiosyncratic magic values, corner cases, and non-context-free bits that require special handling. In Java the premier tool (so far as I know) for defining parsers is JavaCC. (Like the C tools lex and yacc, JavaCC uses its own special syntax and requires a preprocessor, whereas Attoparsec and the like don't; though perhaps that's a "philosophical" issue.) More to the point, as of the last time I used it, JavaCC was designed for the nice clean grammars of programming languages, and it doesn't handle inconsistent and irregular grammars very well.
  • Posta uses a system of coroutines (called "iteratees") in order to lazily stream data from disk, through the parsers, and into the core algorithms, all while maintaining guarantees about how long resources (e.g., file handles, memory) are held. This allows handling large files, because we never need the whole file in memory at once, neither in its raw form nor as the AST generated by parsing it. For modern enterprise-scale NLP, dealing with gigabyte-sized files is a requirement; and because most NLP tools aren't built for that scale, you get to spend extra time chopping up and reformatting files to fit their limitations. Last time I used JavaCC it did not support incremental parsing, and according to the advertised features it still doesn't. In addition, implementing coroutines in Java is problematic because Java's security model precludes simple things like tail-call optimization, meaning you can only support this kind of streaming when the control flow is simple enough to avoid stack overflows.
  • Haskell has awesome support for parallelism. One approach, software transactional memory (STM), provides composable atomic blocks (which match the way we naturally think about parallelism) combined with lightweight threads (which make parallelism cheap and easy); the second sketch after this list shows it in miniature. Java has no support for STM. I am unaware of any support for lightweight threads in Java. The only parallelism I'm aware of in Java is the monitor-style, lock-based system with OS threads. As with all lock-based systems, it is non-composable and difficult to get right; and as with using OS threads anywhere else, the high overhead erases the benefits of parallelizing many programs.
  • Posta makes extensive use of partial evaluation for improving performance, e.g., lifting computations out of loops. When doing NLP you are often dealing with triply-nested loops, so loop-invariant code motion is essential for performance. In my benchmarks, partial evaluation reduces the total running time by 10%. If raw numbers don't convince you: partial evaluation also lets us keep the code legible, concise, modular, and maintainable. Its primary use is in a combinator library defining numerous smoothing methods for probability distributions, the results of which are called from within those triply-nested loops (the third sketch after this list shows the flavor). Without partial evaluation, the only way to get performant code is to write a specialized version of the triply-nested loop for every smoothing method you want to support. That means duplicating the core algorithm, and a lot of tricky math, many times over. There's no way to implement this use of partial evaluation in anything resembling idiomatic Java.
  • Posta uses an implementation of persistent, asymptotically optimal priority queues which come with proofs of correctness. A persistent PQ is necessary for one of the tagger's core algorithms. Since the PQ methods are called from within nested loops, performance is important. Since we're dealing with giga-scale data, asymptotics are important. A log factor here or there means more than a 10% increase in total running time. In Java there's java.util.PriorityQueue, but it has inferior asymptotic performance guarantees and is neither persistent nor synchronized. I'm sure there are other PQ libraries out there, but I doubt anyone's implemented the exact version we need and shown their implementation to be correct.
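
First sketch: here's roughly what a parser for a small ad-hoc format looks like in Attoparsec. The format is invented for illustration (a word field, tab-separated tag/count pairs, "%%" comment lines mixed in); it is not one of Posta's real parsers.

    {-# LANGUAGE OverloadedStrings #-}
    import Control.Applicative (many, (<|>))
    import Data.Attoparsec.ByteString.Char8
    import qualified Data.ByteString.Char8 as B
    import Data.Maybe (catMaybes)

    -- One record: a word, then one or more tag/count pairs.
    data Entry = Entry B.ByteString [(B.ByteString, Int)] deriving Show

    -- A field is anything up to a tab or end of line.
    field :: Parser B.ByteString
    field = takeWhile1 (\c -> c /= '\t' && c /= '\n' && c /= '\r')

    entry :: Parser Entry
    entry = Entry <$> field <*> many1 pair <* endOfLine
      where pair = (,) <$> (char '\t' *> field) <*> (char '\t' *> decimal)

    -- Comment lines start with "%%" and run to end of line.
    comment :: Parser ()
    comment = string "%%" *> skipWhile (/= '\n') *> endOfLine

    -- A whole file: entries and comments interleaved.
    file :: Parser [Entry]
    file = catMaybes <$> many ((Nothing <$ comment) <|> (Just <$> entry))

    -- e.g. parseOnly file (B.pack "the\tDT\t42\n%% note\ncat\tNN\t7\n")

Each piece is an ordinary value, so the weird corner cases of a format get their own named, testable combinators instead of being smeared across a grammar file.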
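Second sketch: what STM's composability buys you, in miniature. This is plain stm-package code, nothing Posta-specific.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.STM
    import Control.Monad (replicateM_)

    -- A composed transaction: both updates commit or retry as a
    -- unit, with no locks to acquire in the right order.
    transfer :: TVar Int -> TVar Int -> Int -> STM ()
    transfer from to n = do
        modifyTVar' from (subtract n)
        modifyTVar' to   (+ n)

    main :: IO ()
    main = do
        a    <- newTVarIO 1000
        b    <- newTVarIO 0
        done <- newTVarIO (0 :: Int)
        replicateM_ 10 $ forkIO $ do          -- lightweight threads
            replicateM_ 10 (atomically (transfer a b 1))
            atomically (modifyTVar' done (+ 1))
        -- Block until all ten workers finish: 'check' aborts and
        -- re-runs the transaction whenever 'done' changes.
        atomically (readTVar done >>= check . (== 10))
        print =<< atomically ((,) <$> readTVar a <*> readTVar b)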
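Third sketch: the flavor of that staging, using a hypothetical add-lambda smoothing combinator (invented for illustration, not Posta's library). Everything that depends only on the hyperparameters is computed once, up front; the loops only ever see the cheap residual function.

    -- A hypothetical add-lambda smoother, staged so the
    -- hyperparameter-dependent work is hoisted out of the loops.
    smoothAddLambda :: Double -> Int -> (Int -> Int -> Double)
    smoothAddLambda lambda vocabSize =
        let denomAdj = lambda * fromIntegral vocabSize   -- computed once
        in \count total ->
            (fromIntegral count + lambda)
            / (fromIntegral total + denomAdj)

    -- At the call site, the partial application happens outside the
    -- triply-nested loops, so the hoisted work is shared:
    --   let p = smoothAddLambda 0.5 50000
    --   in {- three nested loops, each calling p c t -}

Swapping in a different smoothing method means passing a different staged function; the core loop is written exactly once.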

I'll admit I'm not up to date on state-of-the-art Java, and I'd love to be proven wrong about these things being unavailable. But a couple years ago, when I returned to Java after a long time away, I learned that all the hype I'd heard about Java improving over the preceding decade was just that: hype. Every time I've hoped Java would have some trivial thing, I've been disappointed. The most recent example is Java's complete refusal to believe in the existence of IPC (no, not RPC), and that's just the tip of the iceberg.


This weekend I've been doing a solo hackathon to try to get Posta integrated with our variant of the Malt parser. All the core algorithms have already been implemented, so it's just been a whole lot of yak shaving. Namely, I have to define an IPC protocol for Haskell to talk to Java, implement the (Haskell) server executable and (Java) client stubs, and then try to shake the thing out to find bugs and performance holes. Ideally, by Tuesday morning.

Unfortunately, Java doesn't believe in inter-process communication, so all the libraries out there are for doing RPCs. Since the parser and tagger will be operating interactively, and on the same machine, it's quite silly to go through the network stack just to pass a few bytes back and forth. Thankfully I found CLIPC, which should do the heavy lifting of getting Java to believe in POSIX named pipes. For the "on the wire" de/encoding, I've decided to go with Google's protocol buffers, since there's already a protobuf compiler for Haskell. I was considering MessagePack (which also has Haskell bindings), but protobuf seemed a bit easier to install and work with.
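
Since protobuf messages aren't self-delimiting, each one goes over the pipe with a length prefix. Here's a minimal sketch of that framing on the Haskell side using Data.Binary (the pipe path is a placeholder, and the real code plugs the protobuf de/encoders in around this):

    import qualified Data.ByteString.Lazy as BL
    import Data.Binary.Get (getWord32be, runGet)
    import Data.Binary.Put (putWord32be, runPut)
    import System.IO (Handle)

    -- Prefix an encoded message with its length, 4 bytes big-endian.
    frame :: BL.ByteString -> BL.ByteString
    frame msg = runPut (putWord32be (fromIntegral (BL.length msg))) <> msg

    -- Read one frame back off a handle, e.g. a POSIX named pipe
    -- opened with: openBinaryFile "/tmp/posta.pipe" ReadMode
    readFrame :: Handle -> IO BL.ByteString
    readFrame h = do
        header <- BL.hGet h 4
        let len = fromIntegral (runGet getWord32be header)
        BL.hGet h len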

For all the plumbing code I decided to try working with iteratees, which have lots of nice performance guarantees. The protobuf libraries don't have integrated support for iteratees, but their internal model is a variant of iteratees, so I was able to write some conversion functions. Attoparsec also uses an iteratee-like model internally, and there's integration code available. For my uses I actually need an enumeratee rather than an iteratee, so I had to roll my own.
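
For the uninitiated, here's a toy model of the shapes involved (mine, not the iteratee package's actual API): an iteratee is a consumer fed chunks until it yields a value, and an enumeratee is a stream transformer sitting between an outer stream and an inner iteratee.

    -- A toy model; real libraries add chunking, errors, monads, etc.
    data Iteratee i a
        = Done a
        | Cont (Maybe i -> Iteratee i a)   -- Nothing signals EOF

    -- Drive an iteratee from a list.
    run :: [i] -> Iteratee i a -> Maybe a
    run _      (Done a) = Just a
    run []     (Cont k) = case k Nothing of
                            Done a -> Just a
                            Cont _ -> Nothing    -- demanded input past EOF
    run (x:xs) (Cont k) = run xs (k (Just x))

    -- An enumeratee turns an iteratee over o-chunks into an iteratee
    -- over i-chunks, returning the (possibly unfinished) inner one.
    type Enumeratee i o a = Iteratee o a -> Iteratee i (Iteratee o a)

    -- The simplest example: the "map" enumeratee.
    mapE :: (i -> o) -> Enumeratee i o a
    mapE _ inner@(Done _) = Done inner
    mapE f (Cont k) = Cont step
      where
        step Nothing  = Done (k Nothing)
        step (Just i) = mapE f (k (Just (f i)))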

  • TODO: (easy) move the CG2 training module into the library
  • TODO: (low priority) write a CG3 training module
  • DONE: write an unparser for TnT lexicon and trigram files
  • TODO: (easy) write function to feed trained models into the unparser
  • TODO: (postponed) write wrapper executable to train models and print them to files
  • TODO: (postponed) write function to read TnT parser output into models directly (the TnT file parsers were done previously)
  • DONE: choose a library for commandline argument munging (cmdargs)
  • TODO: add commandline arguments for passing models to the server
  • DONE: write protocol buffer spec for IPC protocol
  • DONE: write Java client handlers for the IPCs
  • TODO: (low priority) write Haskell client handlers for debugging/verification of Java
  • TODO: write Haskell code for dumping intern tables to files, and reading them back in
  • TODO: write Java code for reading intern table files, so the client can dereference the ints
  • DONE: write functions for converting the protobuf Get monad into an iteratee or enumeratee
  • TODO: write Haskell server handlers for the IPCs
  • TODO: write STM code for parallelizing the tagging and IPC handling
  • DONE: write function for converting attoparsec parsers into enumeratees
  • TODO: (low priority) integrate attoparsec enumeratees into model training, etc, to replace top-level calls to many
  • DONE: write lots of other auxiliary functions for bytestring, attoparsec, and iteratee

Classes have started up again, whence my month of absence. So I figure it's time to mention what I've been up to.

Over the summer I was working on developing an HMM-based part-of-speech tagger in Haskell. Most NLP folks consider POS tagging to be a "solved problem", and despite growing long in the teeth, TnT (which uses second-order HMMs) is still very close to state-of-the-art; so why bother? Two reasons. First, contrary to public opinion, POS tagging is not a solved problem. We can get good accuracy for English, which has fixed word order and impoverished morphology, but we still don't really know how to handle morphologically rich languages with free word order. Moreover, the taggers we do have have all been tested extensively on English and similar languages, but we don't really know how well different approaches apply to, say, Turkish, Hungarian, Japanese, Korean, Tzotzil, Quechua,...

The second reason is that my real goal is to handle supertagging for CCG, and in particular to explore online and interactive algorithms for tagging. Most of the current technology is focused on batch processing and offline algorithms, which means it isn't terribly useful for developing, say, an online system for real-time human-robot interaction, nor for exploring questions about the cognitive plausibility of something like supertagging playing a role in human processing of language. For this sort of research, TnT is too old and crotchety to work with, and the standard CCG supertaggers (OpenCCG, C&C Tools) are too tightly integrated into their CCG parsing projects to be very amenable either. So, a new tagger writes I.

It is well known that many common NLP algorithms for HMMs and chart parsing can be generalized to operate over arbitrary semirings. Before this was realized, the same algorithms were invented over and over, specialized to different semirings. But while the generalization is common in the lore, I've yet to see any codebase that actually takes advantage of it by providing implementations that are generic over semirings. So one of my secondary goals has been to make this parameterization explicit, and to do so in a way that doesn't diminish the performance of the tagger. Making the code modular in this way should also help when implementing variations on HMMs, like higher-order HMMs, autoregressive HMMs, etc. And to do this sort of thing right, you really need a type system you can trust, which means Haskell (or Agda or Coq). Also, most of the current work has been done in imperative languages only, so using a functional language opens up a whole new arena of research on optimizations and the like.
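
To make the parameterization concrete, here's the shape of the idea in miniature (an illustrative class of my own, not Posta's actual code): the trellis recurrence is written once against a semiring interface, and forward versus Viterbi is just a choice of instance.

    class Semiring r where
        zero, one    :: r
        (<+>), (<.>) :: r -> r -> r

    -- Forward probabilities: the (+,*) semiring.
    newtype Prob = Prob Double deriving Show
    instance Semiring Prob where
        zero = Prob 0
        one  = Prob 1
        Prob x <+> Prob y = Prob (x + y)
        Prob x <.> Prob y = Prob (x * y)

    -- Viterbi over log-probabilities: the (max,+) semiring.
    newtype MaxPlus = MaxPlus Double deriving Show
    instance Semiring MaxPlus where
        zero = MaxPlus (-1/0)   -- log 0
        one  = MaxPlus 0        -- log 1
        MaxPlus x <+> MaxPlus y = MaxPlus (max x y)
        MaxPlus x <.> MaxPlus y = MaxPlus (x + y)

    -- One trellis step, written once for every semiring:
    --   alpha'[s'] = bigsum_s alpha[s] <.> trans s s' <.> emit s' w
    step :: Semiring r
         => [s] -> (s -> s -> r) -> (s -> w -> r)
         -> [(s, r)] -> w -> [(s, r)]
    step states trans emit alpha w =
        [ (s', foldr (<+>) zero
                 [ a <.> trans s s' <.> emit s' w | (s, a) <- alpha ])
        | s' <- states ]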

So, that was the summer. Towards the end of it I did a writeup, though it's not entirely finished yet (i.e., not ready for publicity/publication). I've continued developing the tagger for my research assistantship this year, which means integrating it with a variant of the Malt parser and seeing how well we can do online interactive semantic parsing of military data (which also presents a language-modeling problem, due to the huge number of OOVs, acronyms, and the use of terms like "green 42" as names).


It's a well-known fact that many algorithms in NLP are the same algorithm, just parameterized by a different semiring. For instance, the Viterbi algorithm for getting the probability of the best path in an HMM uses the (max,+) semiring. The forward part of the forward-backward algorithm is the same algorithm, just using the (+,*) semiring. (And sometimes you want the backward part of Viterbi too.)

Today's moment of duh: the usual implementation of Viterbi, with a sparse map, is just the Max semiring coalesced into the MaxPriority semiring, i.e., where we map Max minBound to Nothing in order to remove the extra bottom in the lattice. And the Viterbi algorithm with backpointers, so that you can extract the best path, is just the Arg Max semiring (or Args Max if you want to keep ties).
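
In code, using the types already in Data.Semigroup (this is my gloss, not any particular library's Viterbi):

    import Data.Semigroup (Arg(..), Max(..))

    -- The sparse-map trick: an absent key plays the role of Nothing
    -- in Maybe (Max a), i.e. Max minBound coalesced away; Maybe's
    -- Monoid instance adjoins that bottom back for free.
    type MaxPriority a = Maybe (Max a)

    -- Viterbi with backpointers: Arg orders by the score alone,
    -- carrying the payload (a backpointer, or a whole path) along.
    type Cell path = Maybe (Max (Arg Double path))

    best :: [Cell path] -> Cell path
    best = mconcat

    -- best [Just (Max (Arg 0.3 "N")), Nothing, Just (Max (Arg 0.7 "V"))]
    --   == Just (Max (Arg 0.7 "V"))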
