This weekend I've been doing a solo hackathon to try to get Posta integrated with our variant of the Mink parser. All the core algorithms have already been implemented, so it's just been a whole lot of yak shaving. Namely, I have to define an IPC protocol for Haskell to talk to Java, implement the (Haskell) server executable and (Java) client stubs, and then try to shake the thing out to find bugs and performance holes. Ideally, by Tuesday morning.
Unfortunately, Java doesn't believe in inter-process communication, so all the libraries out there are for doing RPCs. Since the parser and tagger will be operating interactively, and on the same machine, it's quite silly to go through the network stack just to pass a few bytes back and forth. Thankfully, I found CLIPC, which should do the heavy lifting of getting Java to believe in POSIX named pipes. To handle the "on the wire" de/encoding, I've decided to go with Google's protocol buffers, since there's already a protobuf compiler for Haskell. I was considering using MessagePack (which also has Haskell bindings), but protobuf seemed a bit easier to install and work with.
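A pipe is just a byte stream, so both sides need some way to tell where one message ends and the next begins. The following is a minimal sketch of one common approach, assuming each message is prefixed with its length as a base-128 varint (the same framing protobuf's Java API uses in writeDelimitedTo and parseDelimitedFrom). The class and method names are hypothetical, it does not use CLIPC or the protobuf library, and nothing here is taken from the actual Posta protocol.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper: length-delimited frames over any byte stream. */
final class PipeFraming {
    private PipeFraming() {}

    /** Writes the payload length as a base-128 varint, then the payload. */
    static void writeFrame(OutputStream out, byte[] payload) throws IOException {
        int len = payload.length;
        while ((len & ~0x7F) != 0) {
            out.write((len & 0x7F) | 0x80); // low 7 bits, continuation bit set
            len >>>= 7;
        }
        out.write(len);                     // final byte, continuation bit clear
        out.write(payload);
        out.flush();
    }

    /** Reads one frame; returns null if the stream ends cleanly between frames. */
    static byte[] readFrame(InputStream in) throws IOException {
        int len = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) {
                if (shift == 0) return null;
                throw new EOFException("stream ended inside a length prefix");
            }
            len |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 28) throw new IOException("length prefix too long");
        }
        if (len < 0) throw new IOException("negative frame length");
        byte[] payload = new byte[len];
        int off = 0;
        while (off < len) {                 // a single read may return fewer bytes
            int n = in.read(payload, off, len - off);
            if (n < 0) throw new EOFException("stream ended inside a frame");
            off += n;
        }
        return payload;
    }
}
```

On a POSIX system, a FIFO created with mkfifo can be opened like an ordinary file, so the streams passed in could be a FileInputStream and FileOutputStream on the pipe's path; the payload would then be the serialized protobuf message.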
For all the plumbing code I decided to try working with iteratees, which have lots of nice performance guarantees. The protobuf libraries don't have integrated support for iteratees, but the internal model is a variant of iteratees so I was able to write some conversion functions. Attoparsec also uses an iteratee-like model internally, and there's integration code available. For my uses I actually need an enumeratee instead of an iteratee, so I had to roll one of my own.
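The core idea behind iteratees is that a consumer is fed input one chunk at a time and suspends whenever it needs more, so it never matters where the chunk boundaries fall. Haskell's iteratee libraries express this with continuations. The Java sketch below only illustrates the chunk-driven idea using a mutable state machine, and it is not an iteratee library. It turns a stream of byte chunks into a stream of complete frames, which is roughly the role an enumeratee plays. It assumes varint length-prefixed frames, and all names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical push-style decoder: byte chunks in, complete frames out. */
final class ChunkedFrameDecoder {
    private int length = 0;            // length prefix decoded so far
    private int shift = 0;             // bit offset of the next varint byte
    private boolean inPayload = false; // false while still reading the prefix
    private ByteArrayOutputStream payload = new ByteArrayOutputStream();

    /** Feeds one chunk and returns the frames it completed (possibly none). */
    List<byte[]> feed(byte[] chunk) {
        List<byte[]> done = new ArrayList<>();
        int i = 0;
        while (i < chunk.length) {
            if (!inPayload) {
                int b = chunk[i++] & 0xFF;
                length |= (b & 0x7F) << shift;
                if ((b & 0x80) != 0) {           // more prefix bytes follow
                    shift += 7;
                    if (shift > 28) throw new IllegalStateException("length prefix too long");
                    continue;
                }
                if (length < 0) throw new IllegalStateException("negative frame length");
                inPayload = true;
            } else {
                // Take as much of the payload as this chunk can supply.
                int take = Math.min(length - payload.size(), chunk.length - i);
                payload.write(chunk, i, take);
                i += take;
            }
            if (inPayload && payload.size() == length) {
                done.add(payload.toByteArray()); // frame complete, reset state
                payload = new ByteArrayOutputStream();
                length = 0;
                shift = 0;
                inPayload = false;
            }
        }
        return done;
    }

    /** True when no partially received frame is buffered. */
    boolean isIdle() {
        return !inPayload && shift == 0;
    }
}
```

Feeding the same bytes one at a time or all at once yields the same frames, which is the property that makes this style of plumbing easy to compose with whatever chunk sizes the pipe happens to deliver.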
- TODO: (easy) move the CG2 training module into the library
- TODO: (low priority) write a CG3 training module
- DONE: write an unparser for TnT lexicon and trigram files
- TODO: (easy) write function to feed trained models into the unparser
- TODO: (postponed) write wrapper executable to train models and print them to files
- TODO: (postponed) write function to read TnT parser output into models directly (the TnT file parsers were done previously)
- DONE: choose a library for commandline argument munging: cmdargs
- TODO: add commandline arguments for passing models to the server
- DONE: write protocol buffer spec for IPC protocol
- DONE: write Java client handlers for the IPCs
- TODO: (low priority) write Haskell client handlers for debugging/verification of Java
- TODO: write Haskell code for dumping intern tables to files, and reading them back in
- TODO: write Java code for reading intern table files, so the client can dereference the ints
- DONE: write functions for converting the protobuf Get monad into an iteratee or enumeratee
- TODO: write Haskell server handlers for the IPCs
- TODO: write STM code for parallelizing the tagging and IPC handling
- DONE: write function for converting attoparsec parsers into enumeratees
- TODO: (low priority) integrate attoparsec enumeratees into model training, etc, to replace top-level calls to many
- DONE: write lots of other auxiliary functions for bytestring, attoparsec, and iteratee
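
For the intern table items, the Java side only needs a mapping from ints back to strings. Here is a minimal sketch of what that reader could look like, assuming a purely hypothetical file format of one interned string per line, where the zero-based line number is the id. The real dump format is not decided above, so the format, class name, and method names are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical read-only intern table: id n maps to line n of the dump. */
final class InternTable {
    private final List<String> strings;

    private InternTable(List<String> strings) {
        this.strings = strings;
    }

    /** Reads the assumed one-string-per-line format from any Reader. */
    static InternTable read(Reader source) throws IOException {
        List<String> strings = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            strings.add(line);
        }
        return new InternTable(strings);
    }

    /** Dereferences an interned id back to its string. */
    String lookup(int id) {
        if (id < 0 || id >= strings.size()) {
            throw new IllegalArgumentException("unknown interned id: " + id);
        }
        return strings.get(id);
    }

    /** Number of interned strings in the table. */
    int size() {
        return strings.size();
    }
}
```

A line-based format would break on strings containing newlines, so a real dump would likely want length-prefixed or escaped entries instead; the lookup side would stay the same.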