winterkoninkje: shadowcrane (clean) (Default)

This weekend I've been doing a solo hackathon to try to get Posta integrated with our variant of the Mink parser. All the core algorithms have already been implemented, so it's just been a whole lot of yak shaving. Namely, I have to define an IPC protocol for Haskell to talk to Java, implement the (Haskell) server executable and (Java) client stubs, and then try to shake the thing out to find bugs and performance holes. Ideally, by tuesday morning.

Unfortunately, Java doesn't believe in inter-process communication, so all the libraries out there are for doing RPCs. Since the parser and tagger will be operating interactively, and on the same machine, it's quite silly to go through the network stack just to pass a few bytes back and forth. Thankfully I found CLIPC which should do the heavy lifting of getting Java to believe in POSIX named pipes. In order to handle the "on the wire" de/encoding, I've decided to go with Google's protocol buffers since there's already a protobuf compiler for Haskell. I was considering using MessagePack (which also has Haskell bindings), but protobuf seemed a bit easier to install and work with.

For all the plumbing code I decided to try working with iteratees, which have lots of nice performance guarantees. The protobuf libraries don't have integrated support for iteratees, but the internal model is a variant of iteratees so I was able to write some conversion functions. Attoparsec also uses an iteratee-like model internally, and there's integration code available. For my uses I actually need an enumeratee instead of an iteratee, so I had to roll one of my own.

  • TODO: (easy) move the CG2 training module into the library
  • TODO: (low priority) write a CG3 training module
  • DONE: write an unparser for TnT lexicon and trigram files
  • TODO: (easy) write function to feed trained models into the unparser
  • TODO: (postponed) write wrapper executable to train models and print them to files
  • TODO: (postponed) write function to read TnT parser output into models directly (the TnT file parsers were done previously)
  • DONE: choose a library for commandline argument munging cmdargs
  • TODO: add commandline arguments for passing models to the server
  • DONE: write protocol buffer spec for IPC protocol
  • DONE: write Java client handlers for the IPCs
  • TODO: (low priority) write Haskell client handlers for debugging/verification of Java
  • TODO: write Haskell code for dumping intern tables to files, and reading them back in
  • TODO: write Java code for reading intern table files, so the client can dereference the ints
  • DONE: write functions for converting the protobuf Get monad into an iteratee or enumeratee
  • TODO: write Haskell server handlers for the IPCs
  • TODO: write STM code for parallelizing the tagging and IPC handling
  • DONE: write function for converting attoparsec parsers into enumeratees
  • TODO: (low priority) integrate attoparsec enumeratees into model training, etc, to replace top-level calls to many
  • DONE: write lots of other auxiliary functions for bytestring, attoparsec, and iteratee
From:
Anonymous( )Anonymous This account has disabled anonymous posting.
OpenID( )OpenID You can comment on this post while signed in with an account from many other sites, once you have confirmed your email address. Sign in using OpenID.
User
Account name:
Password:
If you don't have an account you can create one now.
Subject:
HTML doesn't work in the subject.

Message:

If you are unable to use this captcha for any reason, please contact us by email at support@dreamwidth.org


 
Notice: This account is set to log the IP addresses of everyone who comments.
Links will be displayed as unclickable URLs to help prevent spam.

June 2017

S M T W T F S
    123
45678910
11121314151617
18192021 222324
252627282930 

Tags

Page generated 27 Jul 2017 06:39 pm
Powered by Dreamwidth Studios