winterkoninkje | 2 Oct 2010

Classes have started up again, whence my month of absence. So I figure it's time to mention what I've been up to.

Over the summer I was working on developing an HMM-based part of speech tagger in Haskell. Most NLP folks consider POS tagging to be a "solved problem", and despite growing long in the teeth TnT (which uses second-order HMMs) is still very close to state-of-the-art; so why bother? Two reasons. Contrary to public opinion, POS tagging is not a solved problem. We can get good accuracy for English which has fixed word order and impoverished morphology, but we still don't really know how to handle morphological languages with free word order. Moreover, the taggers we have, have all been tested extensively on English and similar languages, but we don't really know how well different approaches apply to, say, Turkish, Hungarian, Japanese, Korean, Tzotzil, Quechua,...

The second reason is that my real goal is to handle supertagging for CCG, and in particular to do this for exploring online and interactive algorithms for tagging. Most of the current technology is focused on batch processing and off-line algorithms, which means that it isn't terribly useful for developing, say, an online system for real-time human--robot interaction, nor for exploring questions re the cognitive plausibility of something like supertagging serving a role in human processing of language. For doing this sort of research, TnT is too old and crotchety to work with, and the standard CCG supertaggers (OpenCCG, C&C Tools) are too integrated into their CCG parsing projects to be very amenable either. So, a new tagger writes I.

It is well-known that many common NLP algorithms for HMMs and chart parsing can be generalized to operate over arbitrary semirings. Before realizing this, some algorithms were invented over and over, specialized to different semirings. While it's common in the lore, I've yet to see any codebase that actually takes advantage of this to provide implementations that are generic over different semirings. So one of my secondary goals has been to make this parameterization explicit, and to make sure to do so in a way that doesn't diminish the performance of the tagger. By making the code modular in this way, it should also help when implementing variations on HMMs like higher-order HMMs, autoregressive HMMs, etc. And for doing this sort of thing right, you really need a type system you can trust, which means Haskell (or Agda or Coq). Also, most of the current work has been done in imperative languages only, so using a functional language provides a whole new arena of research on optimizations and the like.

So, that was the summer. Towards the end of the summer I did a writeup for it, though it's not entirely finished yet (i.e., ready for publicity/publication). I've continued developing it for my research assistanceship this year, which means integrating it with a variant of the Malt parser and seeing how well we can do online interactive semantic parsing of military data (which also presents an LM problem due to the huge number of OOVs, acronyms, and the use of terms like "green 42" as names).

Other than my research assistantship, I've been taking some cool classes. Larry Moss is teaching a course on category theory for coalgebra (yes, that Larry; I realized last xmas when my copy arrived). While I have a decent background in CT from being an experienced Haskell hacker and looking into things in that direction, it's nice to see it presented in the classroom. Also, we're using Adámek's Joy of Cats which gives a very different presentation than other books I've read (e.g., Pierce) since it's focused on concrete categories from mathematics (topology, group theory, Banach spaces, etc) instead of the CCC focus common in computer science.

Sandra's teaching a course on NLP for understudied and low-resource languages. As you may have discerned from my previous post, agglutinative languages and low-resource languages are the ones I'm particularly interested in. Both because they are understudied and therefore there is much new research to be done, but also because of political reasons (alas, Mike seems to have taken down the original manifesto). We've already read a bunch of great papers, and my term paper will be working on an extension of a book that was published less than a year ago; and I should be done in time to submit it to ACL this year, which would be awesome.

My last class is in historical linguistics. I never got to take one during my undergrad, which is why I signed up for it. Matt offered one my senior year, but I was one of only two people who signed up for it, so it was cancelled. It used to be that people equated linguistics with historical, though that has been outmoded for quite some time. Unfortunately it seems that the field hasn't progressed much since then, however. Oh wells, the class is full of amusing anecdotes about language change, and the prof is very keen to impress upon us the (radically modern) polysynchronic approach to language change, as opposed to taking large diachronic leaps or focusing on historical reconstruction. And I'm rather keen on polysynchrony.

S	M	T	W	T	F	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

Renaissance Grrrl

2 Oct 2010

2 Oct 2010

Pre-announcing Posta: POS tagging and CCG supertagging in Haskell

What I've been up to, pt.II

Profile

April 2019

Posts

Where next?

Renaissance Grrrl

2 Oct 2010

2 Oct 2010

Pre-announcing Posta: POS tagging and CCG supertagging in Haskell

What I've been up to, pt.II

Profile

April 2019

Tags

Posts

Where next?