
For all you local folks, I'll be giving a talk about my dissertation on November 5th at 4:00–5:00 in Ballantine Hall 011. For those who've heard me give talks about it before, not much has changed since NLCS 2013. But the majority of current CL/NLP, PL, and logic folks haven't seen the talk, so do feel free to stop by.

Abstract: Many natural languages allow scrambling of constituents, or so-called "free word order". However, most syntactic formalisms are designed for English first and foremost. They assume that word order is rigidly fixed, and consequently these formalisms cannot handle languages like Latin, German, Russian, or Japanese. In this talk I introduce a new calculus —the chiastic lambda-calculus— which allows us to capture both the freedoms and the restrictions of constituent scrambling in Japanese. In addition to capturing these syntactic facts about free word order, the chiastic lambda-calculus also captures semantic issues that arise in Japanese verbal morphology. Moreover, chiastic lambda-calculus can be used to capture numerous non-linguistic phenomena, such as: justifying notational shorthands in category theory, providing a strong type theory for programming languages with keyword-arguments, and exploring metatheoretical issues around the duality between procedures and values.

Edit 2014.11.05: The slides from the talk are now up.

This summer I've been working on optimizing compilation for a linear algebra DSL. This is an extension of Jeremy Siek's work on Built-to-Order BLAS functions. Oftentimes it's more efficient to have a specialized function which fuses two or more BLAS functions. The idea behind BTO is that we'd like to specify these functions at a high level (i.e., with linear algebra expressions) and then automatically perform the optimizing transformations which have made BLAS such a central component of linear algebra computations.

The current/prior version of BTO already handles loop fusion, memory bandwidth constraints, and more. However, it is not currently aware of any high-level algebraic laws, such as the fact that matrix multiplication is associative, addition is associative and commutative, transposition reverse-distributes over multiplication, etc. My goal is to make it aware of these sorts of things.
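
To give a concrete flavor of what "aware of algebraic laws" means, here's a minimal sketch in Haskell. The expression type and function are my own toy illustration, not BTO's actual intermediate representation; a real optimizer would treat laws like these as bidirectional equations and search for the cheapest equivalent form.

    -- A toy expression type for linear algebra terms (my own illustration).
    data Expr
        = Var String        -- a named matrix
        | Add Expr Expr     -- matrix addition
        | Mul Expr Expr     -- matrix multiplication
        | Trans Expr        -- transposition
        deriving (Eq, Show)

    -- One law as a directed rewrite: (A*B)^T = B^T * A^T,
    -- plus the involution (A^T)^T = A.
    pushTrans :: Expr -> Expr
    pushTrans (Trans (Mul a b)) = Mul (pushTrans (Trans b)) (pushTrans (Trans a))
    pushTrans (Trans (Trans e)) = pushTrans e
    pushTrans (Trans e)         = Trans (pushTrans e)
    pushTrans (Add a b)         = Add (pushTrans a) (pushTrans b)
    pushTrans (Mul a b)         = Mul (pushTrans a) (pushTrans b)
    pushTrans (Var x)           = Var x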

Along the way, one thing to do is solve the matrix chain multiplication problem: given an expression like ∏[x1,x2...xN], figure out the most efficient associativity for implementing it via binary multiplication. The standard solution is a CKY-like dynamic programming algorithm which constructs a tree covering the sequence [x1,x2...xN]. This is easy to implement, but it takes O(n^3) time and O(n^2) space.
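
For reference, the standard dynamic program is compact enough to show in full. Here's a version in Haskell (the function name and list-based setup are mine); the table is a lazily self-referential array, which gives the O(n^3) time and O(n^2) space directly:

    import Data.Array

    -- Minimum scalar-multiplication cost for a chain of N matrices,
    -- where dims has length N+1 and matrix i is dims!!(i-1) by dims!!i.
    chainCost :: [Int] -> Int
    chainCost dims = table ! (1, n)
      where
        n = length dims - 1
        d = listArray (0, n) dims
        table = listArray ((1, 1), (n, n))
                  [ cost i j | i <- [1 .. n], j <- [1 .. n] ]
        cost i j
          | i >= j    = 0
          | otherwise = minimum [ table ! (i, k) + table ! (k + 1, j)
                                  + d ! (i - 1) * d ! k * d ! j
                                | k <- [i .. j - 1] ]

For example, chainCost [10,30,5,60] is 4500: doing the 10×30 by 30×5 product first beats the alternative's 27000 multiplications. Returning the tree itself rather than just the cost is the usual extra bookkeeping.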

I found a delicious alternative algorithm which solves the problem in O(n*log n) time and O(n) space! The key to this algorithm is to view the problem as finding an optimal triangulation of a convex polygon. That is, we can view [x0,x1,x2...xN] as the edges of a convex polygon, where x0 is the result of computing ∏[x1,x2...xN]. This amazing algorithm is described in the tech report by Hu and Shing (1981a), which includes a reference implementation in Pascal. Unfortunately the TR contains a number of typos and typesetting issues, but it's still pretty legible. A cleaner version of Part I is available here. And pay-walled, presumably cleaner versions of Part I and Part II are available from SIAM.

Hu and Shing (1981b) also have an algorithm which is simpler to implement and returns a heuristic answer in O(n) time, with the error ratio bounded by 15%. So if compile times are more important than running times, you can use this version instead. A pay-walled version of the article is available from Elsevier.

Last friday I passed my qualifying examinations! So now, all I have left is a bunch of paperwork about that and then proposing, writing, and defending the dissertation itself. So, in about a year or so I'll be on the job market. And, much as I despise job hunting, I can't wait!

Since defending the quals I've been spending far too much time playing Persona 3 Portable. I've played P3FES, but P3P adds a female protagonist option which changes a bunch of the social interactions, so I've been playing through that side of things. Other than the heterosexual assumptions about the relationships, I've been loving it. More RPGs should have female protagonists. That's one of the reasons I've always loved FF6. Also a big part of why I found FF13 compelling. (Though, tbh: while Lightning is awesome as a protagonist, Vanille is definitely my favorite character :) And a big part of Kreia's power as a character in KotOR2 stems from her interactions with the canonically-female protagonist.

Speaking of women. I've been presenting as female for a couple months now, and since I have no intention of stopping nor hiding that fact, I've decided to move T-Day forward. Basically, for those who haven't already switched over to the right pronouns etc: T-Day is today. I've sent emails to the department heads in order to get them to send out the "official" memo; so if you haven't gotten it yet, that should show up on monday or tuesday.

The next couple months are going to be hectic with paper writing. I'm hoping to get a paper on syntax-based sentiment-analysis using matrix-space semantics into one of the CL conferences with deadlines this March. No Haskell involved in that one, though I'll probably spend a few posts discussing the semantic model, which may be of interest to y'all. I'm also planning on getting the work from my first qual paper published; that paper was about Posta, a functional library for interactive/online/incremental tagging with HMMs. Here I'm planning to target journals rather than conferences, and it'll spread out over a few papers: one on the overall system (which I need to actually push up to Hackage), one on the higher-order anytime n-best extraction algorithm, and one on reformulating HMM algorithms in terms of foldl and scanl (this may be combined with the HO-AnB paper, length permitting). All of these would be targeting the linguistics audience. Using folds and scans is old-hat in functional programming; my particular goal with that paper is exposing linguists to the tools of FP and how they can be used to greatly simplify how we describe our algorithms. Once those are out of the way I might also see about writing up a functional pearl on the smoothing library I presented at AMMCS a few years back.
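
As a teaser for that foldl/scanl claim: the forward algorithm for HMMs really is just a left scan. Here's a minimal list-based sketch, with names of my own choosing rather than anything from the papers themselves:

    import Data.List (transpose)

    -- One step of the forward algorithm: push the previous alpha vector
    -- through the transition matrix, then weight by emission probabilities.
    -- trans !! i !! j is P(state j | state i); emit o is P(o | state), per state.
    forwardStep :: [[Double]] -> (obs -> [Double]) -> [Double] -> obs -> [Double]
    forwardStep trans emit alpha o =
        zipWith (*) (emit o)
            [ sum (zipWith (*) alpha col) | col <- transpose trans ]

    -- The whole forward pass is a scanl over the observations; folding
    -- (+) over the final alpha vector gives the sequence likelihood.
    forward :: [[Double]] -> (obs -> [Double]) -> [Double] -> [obs] -> [[Double]]
    forward trans emit alpha0 = scanl (forwardStep trans emit) alpha0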

Things done today
  • Gave my advisors The Letter. The public announcement is scheduled for monday
  • Handed in the revised version of qual #2. Clocked in at 59 pages. I'm not entirely satisfied with it, but sometimes y'just gotta let it go.
What remains before I'm ABD
  • P550 final paper. Target length 10 pages.
  • Qual #3, due by x-mas. This one is just a lit-review, and I've already read the lit, so it shouldn't take too long (I hope!)
  • Qual oral defense, the first week or two of Spring. Cakewalk expected
  • Dissertation proposal. Aiming to get it done by January

Bitties

2/7/13 23:18

Just got back from MFPS-LICS-CSF saturday night. 'Twas the first LICS I've been to, and my first time in the deep south. I had fun overall. Definitely enjoyed the French Quarter with its narrower streets, delightful architecture, and other non-American features. And I ran into the Pride parade the day after arriving; I seem to have a knack for that ;) The humidity was killer though.

The slides from my NLCS talk are available here. I've been having some issues with my bibtex2html script, so they're not linked to on the publications page yet; but they will be once I get that issue fixed.

In less happy news, I got some bloodwork back today. Cholesterol is far far too high, and I'm getting into the pre-diabetic range for blood sugar levels. So, I'm starting a major diet change in hopes of getting those under control. Apparently lack of protein is a big part of the problem (for me), which is ironic since most Americans get far too much. Damn midwestern genes. Went grocery shopping today; it's profoundly difficult to get a 1:1 carbs-to-protein ratio as a vegetarian.

Next month I'll be giving a talk at the NLCS workshop, on the chiastic lambda-calculi I first presented at NASSLLI 2010 (slides[1]). After working out some of the metatheory for one of my quals, I gave more recent talks at our local PL Wonks and CLingDing seminars (slides). The NASSLLI talk was more about the linguistic motivations and the general idea, whereas the PLWonks/CLingDing talks were more about the formal properties of the calculus itself. For NLCS I hope to combine these threads a bit better, which has always been the challenge with this work.

NLCS is co-located with this year's LICS (and MFPS and CSF). I'll also be around for LICS itself, and in town for MFPS though probably not attending. So if you're around, feel free to stop by and chat.

[1] N.B., the NASSLLI syntax is a bit different from the newer version: square brackets were used instead of angle brackets (the latter were chosen because they typeset better in general); juxtaposition was just juxtaposition rather than being made explicit; and the left- vs right-chiastic distinction was called chi vs ksi (however, it turns out that ksi already has an important meaning in type theory).

Edit 2013.07.02: the slides are available here.

Hmm, so I was hoping to get back into blogging over the summer. Y'know, a sort of liveblogging on quals, like how people blog about their Google Summer of Code projects these days. Turns out, some part of my brain is all, "oh noes! blogging is a waste of time! You could be working or, y'know, playing videogames or something!" So that sucks. In part because my brain is stupid, in part because I've been reading so many awesome blogs by people at least as busy as I am, and now I feel bad for not measuring up to some impossible standard I made up out of thin air.

So the summer was fun, albeit not relaxing in the slightest. I really need to work on that whole Day of Rest idea. Things that happened this summer:

  • Grace and Jason got married! So that's the second California wedding I've been to in as many years. Now I can quit complaining about never having gone to a wedding. It was great to see friends from college again. Of the ones I lost touch with over the years, one works in the game industry (both indie and corporate), and another works for Wolfram Alpha (indeed, with Mr. Wolfram himself). So that's pretty cool.
  • Went to NASSLLI in Austin. There were some awesome classes there. Craige Roberts is fabulous; definitely someone to keep an eye on. Got to meet Adam Lopez, who was recently working on stuff related to one of my quals. Adam was part of the Edinburgh SMT crew, who came to JHU shortly after I left, so I hadn't met him before. And, of course, got to hang out with Ken and Oleg again. Also, awesomely enough, someone there remembered my talk from NASSLLI 2010 and asked about follow-up work on it.
  • Read a bunch of fun books, or rather had them read to me. Licia got a Kindle and loves reading aloud; she's the best ever. Fun books include: Pride and Prejudice and Jane Eyre. Seriously, they are both delightful and if you haven't read them you should. Competent women are the best. Also, Look Me in the Eye: My Life with Asperger's (dude, what a life!), and most of God, No! and Drop Dead Healthy.
  • Bought Tales of Graces F on a whim and loved every minute of it. It starts off as your very standard JRPG about childhood friends, but then jumps ahead a few years after everyone has separated and grown up. The prologue is, as the reviews say, the least entertaining part; but it does a good job of setting the background for making the main plot poignant. Just saying people were childhood friends pales in comparison to seeing it and then seeing how they've grown apart. I haven't played any of the other Tales games, but the system is pretty similar to the Star Ocean system. Better done though, IMO. You have the fusion/cooking thing, but it's done in a way that's both extremely helpful and not obnoxious, so you use it regularly and actually care. The combat system is vibrant and engaging, and the system of badges is really cool. Overall the system has a lot of depth but doesn't get in the way of just playing. Some of the reviews complained about uneven difficulty, but I have no idea what they're on about. 10/10
  • And in a few weeks I'll be heading off to ICFP again. It'll be the first time I've been to Europe, can't wait.

This past semester was a real doozy, for a number of reasons. But now that classes are over, maybe I'll get a chance to talk about some of them. In any case, at least it's done. Now I get to do quals: three months to write three papers good enough to convince people I can write a thesis. I'm looking forward to it; it's been so long since I've been free to do my own research without feeling bad about it encroaching on the work I 'should' be doing.

All last week I was in Tokyo to attend ICFP and associated workshops. It was nice to finally meet a bunch of people I've been talking with online for the last few years. And I met up with Ken Shan and Oleg Kiselyov again, which is always a pleasure. Unlike last time I was in Japan, I didn't get too much time to explore and go sightseeing. I got to explore Chiyoda, which I missed last time around, and I made sure to do two of the most important things: (1) eat some okonomiyaki, (2) visit Akihabara to buy some new manga.

My newest acquisition is 銃姫 ("Gun Princess") Phantom Pain, which I'm rather enjoying so far. Anything that starts off with an execution, spell-casting based on Buddhist mantras, and a prolonged diatribe on why one of the characters is a good-for-nothing incompetent layabout, can't be half bad :) Unfortunately, I only got the first two volumes, so I'll finish them all too soon. So far it's proving easier to read than my previous acquisition (Peace Maker 鐵), though I'm not sure if that's due to getting better at Japanese or because 鐵 is written in a particular style. I definitely noticed my deterioration in fluency since five years ago; grammar's fine, but my vocab is abysmal. I need to find a decent way to work on that.

Last week was a whirlwind. It was the first week of classes, which normally wouldn't be a big thing, except this semester I'm teaching a course. The first couple months of summer were pretty sedate up in Canada. But the last month, leading into the start of fall term, was full of traveling. I came back from Canada for a couple weeks, then left for a week with Licia, came back for a couple days (literally) and then flew off to California for Lindsey and Alex's wedding, arriving home the night before I needed to teach my first 9:30am class. Things've settled down now, though I'm heading off to ICFP next friday.

One thing traveling is good for is getting caught up on pleasure reading. In addition to the Vinge mentioned last time, I also got to read some new C.S. Friedman. After returning from Canada I got a bunch of new games for the PS3 too. Portal 2 is good fun, though the atmosphere feels like a bizarre hybrid between the first Portal and the Fallout franchise; fitting in its way, but very strange. I've also been playing through El Shaddai and reveling in the beauty of Armaros. Unlike a lot of Japanese games, the US version lets you keep the original voice acting, which is fabulous. Dunno how good the English voices are actually; maybe next time I play through it I'll find out. And then there's Catherine: an adult romantic horror by the team who did the Persona series. It's actually a puzzle game, where you're trying to climb a tower that crumbles beneath you. Both the puzzling and the plot are top-rate, as is to be expected from Atlus and SMT. There are other books and other games, but I'm not feeling like doing any proper reviews just yet.

In addition to teaching, I'm taking two courses this term. Advanced Phonetics, continuing from the Phonetics course I took last spring. Back at Reed for my undergrad we didn't have any phonetics courses, only phonology; so I've been getting caught up on that, as well as filling out the requirements for the Linguistics half of my dual PhD. The other course (Q551) is an intro to cognitive neuropsychology. It's something of a psychology methods course, with a bit of neuroanatomy and the briefest mention of how the imaging technology works. Last spring I took a course on neuroscience for speech and hearing, and up in Canada I spent the summer with a bunch of computer scientists who work on optimizing the algorithms behind the imaging technology; so I'm not sure how much I'll get out of Q551, but it's a requirement for the CogSci half of the dual PhD. As a (meta)theoretical computational linguist, neuroimaging isn't really my area; but as it turns out there are some interesting problems there and plenty of room for theoretical mathematics. Even after the imaging is done, interpreting the images runs into a lot of the same statistical problems that you get in NLP. Both fields are in need of a new statistics, one which doesn't break down when you have enormous data sets. Maybe one day I'll try working on that.

I've been working on a tagging library (and executable) for a bit over a year now. When the project started I had the advantage of being able to choose the language to do it in. Naturally I chose Haskell. There are numerous reasons for this decision, some of which have been derided as "philosophical concerns". Certainly some of the reasons why Haskell is superior to other languages do border on the philosophical. Y'know, silly little things like the belief that type systems should prevent errors rather than encouraging them to proliferate. I'm sure you've heard the arguments before. They're good arguments, and maybe they'll convince you to try out Haskell in your basement. But in many so-called "enterprise" settings, anything that even smells like it might have a basis in theoretical fact is automatically wrong or irrelevant; whatever you do in the privacy of your basement is your own business, but heaven forbid it have any influence on how decisions are made in the workplace! So, here is a short list of entirely pragmatic, practical, and non-theoretical reasons why Haskell is superior to Java for implementing enterprise programs. More specifically, these are reasons why Haskell is superior for my project. Perhaps they don't matter for your project, or perhaps they'll be enough to convince your boss to let you give Haskell a try. Because design decisions are often project-specific, each point explains why it matters for Posta in particular.

  • Haskell has powerful frameworks for defining modular, high-performance, non-trivial parsers (e.g., Attoparsec; see the first sketch after this list). In natural language processing (NLP), just like system administration, over half of the work you do involves dealing with a handful of different ad-hoc, poorly defined file formats. Reading them; generating them; converting from one format to another; etc. Because every one of these formats grew out of a slow accretion of features for one particular project, they're riddled with inconsistencies, idiosyncratic magic values, corner cases, and non-context-free bits that require special handling. In Java the premier tool (so far as I know) for defining parsers is JavaCC. (Like the C tools lex and yacc, JavaCC uses its own special syntax and requires a preprocessor, whereas Attoparsec and the like don't. However, this may be a "philosophical" issue.) As of the last time I used it, JavaCC was designed for the nice clean grammars used by programming languages, and it didn't handle inconsistent and irregular grammars very well.
  • Posta uses a system of coroutines (called "iteratees") in order to lazily stream data from disk, through the parsers, and into the core algorithms, all while maintaining guarantees about how long resources (e.g., file handles, memory) are held. This allows handling large files, because we don't need to keep the whole file in memory at once, either in its raw form or in the AST generated by parsing it. For modern enterprise-scale NLP, dealing with gigabyte-sized files is a requirement; and because many NLP tools are not built for that scale, you get to spend extra time chopping up and reformatting files to fit their limitations. Last time I used JavaCC it did not support incremental parsing, and according to the advertised features it still doesn't. In addition, implementing coroutines in Java is problematic because Java's security model precludes simple things like tail-call optimization, meaning that you can only support this kind of streaming when the control flow is simple enough to avoid stack overflows.
  • Haskell has awesome support for parallelism. One approach, called STM, provides composable atomic blocks (which match the way we naturally think about parallelism) combined with lightweight threads (which make parallelism cheap and easy); see the second sketch after this list. Java has no support for STM. I am unaware of any support for lightweight threads in Java. The only parallelism I'm aware of in Java is the monitor-style lock-based system with OS threads. As with all lock-based systems, it is non-composable and difficult to get right; and as with using OS threads anywhere else, there is high overhead which removes the benefits of parallelizing many programs.
  • Posta makes extensive use of partial evaluation for improving performance, e.g., lifting computations out of loops (see the third sketch after this list). When doing NLP you are often dealing with triply-nested loops, so loop-invariant code motion is essential for performance. In my benchmarks, partial evaluation reduces the total running time by 10%. If raw numbers don't convince you: using partial evaluation allows us to keep the code legible, concise, modular, and maintainable. The primary use of partial evaluation is in a combinator library defining numerous smoothing methods for probability distributions, the results of which are called from within those triply-nested loops. Without partial evaluation, the only way to get performant code is to write a specialized version of the triply-nested loop for every smoothing method you want to support. That means duplicating the core algorithm and a lot of tricky math, many times over. There's no way to implement this use of partial evaluation in anything resembling idiomatic Java.
  • Posta uses an implementation of persistent asymptotically optimal priority queues which come with proofs of correctness. A persistent PQ is necessary for one of the tagger's core algorithms. Since the PQ methods are called from within nested loops, performance is important. Since we're dealing with giga-scale data, asymptotics are important. A log factor here or there means more than a 10% increase in total running time. In Java there's java.util.PriorityQueue but it has inferior asymptotic performance guarantees and is neither persistent nor synchronized. I'm sure there are other PQ libraries out there, but I doubt anyone's implemented the exact version we need and shown their implementation to be correct.
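
To make the first point concrete, here is the shape of such a parser in Attoparsec, for a hypothetical "word TAB tag, blank line between sentences" corpus format. The format is a made-up stand-in for the usual ad-hoc ones, and the definitions are my own sketch rather than Posta's actual parsers:

    import Control.Applicative ((<*))
    import Data.Attoparsec.ByteString.Char8
    import qualified Data.ByteString.Char8 as B

    -- One "word<TAB>tag" line.
    tokenLine :: Parser (B.ByteString, B.ByteString)
    tokenLine = do
        word <- takeWhile1 (/= '\t')
        _    <- char '\t'
        tag  <- takeWhile1 (\c -> c /= '\n' && c /= '\r')
        endOfLine
        return (word, tag)

    -- A sentence is a run of token lines; a blank line ends it.
    sentence :: Parser [(B.ByteString, B.ByteString)]
    sentence = many1 tokenLine <* endOfLine

    corpus :: Parser [[(B.ByteString, B.ByteString)]]
    corpus = many1 sentence

You run this with parseOnly for whole inputs; the same parser also runs incrementally over chunked input via parse and feed, which is what makes the streaming in the second point work.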
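
And the third point, in its canonical example: two STM operations compose into a single atomic transaction, something lock-based code can't do without exposing its locking discipline. This is stock STM usage, not code from Posta:

    import Control.Concurrent.STM

    -- Debiting and crediting are separate STM actions, yet they compose
    -- into a single atomic transfer; no locks, no deadlock ordering.
    transfer :: TVar Int -> TVar Int -> Int -> STM ()
    transfer from to amount = do
        x <- readTVar from
        writeTVar from (x - amount)
        y <- readTVar to
        writeTVar to (y + amount)

    main :: IO ()
    main = do
        a <- newTVarIO (100 :: Int)
        b <- newTVarIO 0
        -- in real code this would be called from many lightweight
        -- threads via forkIO; atomically makes each call indivisible
        atomically (transfer a b 10)
        balances <- atomically (do x <- readTVar a
                                   y <- readTVar b
                                   return (x, y))
        print balances  -- prints (90,10)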
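
And the fourth: the whole trick of partial evaluation here is that a combinator can do its expensive setup once, before returning the function that gets called inside the loops. A hypothetical add-lambda smoother, with names of my own invention rather than Posta's actual API:

    import qualified Data.Map as Map

    -- The normalization constants are computed once, when the smoother
    -- is built, rather than on every lookup inside the triply-nested loop.
    addLambda :: Double -> Map.Map String Int -> (String -> Double)
    addLambda lambda counts =
        let total = fromIntegral (sum (Map.elems counts))
            v     = fromIntegral (Map.size counts)
        in \w -> (fromIntegral (Map.findWithDefault 0 w counts) + lambda)
                 / (total + lambda * v)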

I'll admit I'm not up to date on state-of-the-art Java, and I'd love to be proven wrong about these things being unavailable. But a couple years ago when I returned to Java after a long time away, I learned that all the hype I'd heard about Java improving over the preceding decade was just that: hype. I have been disappointed every time I hoped Java would have some trivial thing. The most recent one I've run into is Java's complete refusal to believe in the existence of IPC (no, not RPC), and that's just the tip of the iceberg.

NIH

12/3/11 23:55

I find it terribly unfortunate how susceptible academics are to Not Invented Here syndrome. Especially in disciplines like computer science, where one of the primary acts of research is the creation of artifacts, a great amount of time and money is wasted replicating free, publicly available programs. Worse than the effort wasted constructing the initial artifact is the continuous supply of effort it takes to maintain and debug these copies of the original. It's no wonder that so much academic software is unreliable, unmaintained, and usable only by the developing team.

It's for reasons like this that I support the free/open-source development model, demonstrated in academic projects like Joshua and GHC. The strong infusion of real-world software engineering methodologies that comes from designing reliable software in F/OSS and industry seems to be the only way to save academia from itself.

This weekend I've been doing a solo hackathon to try to get Posta integrated with our variant of the Mink parser. All the core algorithms have already been implemented, so it's just been a whole lot of yak shaving. Namely, I have to define an IPC protocol for Haskell to talk to Java, implement the (Haskell) server executable and (Java) client stubs, and then try to shake the thing out to find bugs and performance holes. Ideally, by tuesday morning.

Unfortunately, Java doesn't believe in inter-process communication, so all the libraries out there are for doing RPCs. Since the parser and tagger will be operating interactively, and on the same machine, it's quite silly to go through the network stack just to pass a few bytes back and forth. Thankfully I found CLIPC, which should do the heavy lifting of getting Java to believe in POSIX named pipes. In order to handle the "on the wire" de/encoding, I've decided to go with Google's protocol buffers, since there's already a protobuf compiler for Haskell. I was considering using MessagePack (which also has Haskell bindings), but protobuf seemed a bit easier to install and work with.

For all the plumbing code I decided to try working with iteratees, which have lots of nice performance guarantees. The protobuf libraries don't have integrated support for iteratees, but their internal model is a variant of iteratees, so I was able to write some conversion functions. Attoparsec also uses an iteratee-like model internally, and there's integration code available. For my uses I actually need an enumeratee instead of an iteratee, so I had to roll my own.

  • TODO: (easy) move the CG2 training module into the library
  • TODO: (low priority) write a CG3 training module
  • DONE: write an unparser for TnT lexicon and trigram files
  • TODO: (easy) write function to feed trained models into the unparser
  • TODO: (postponed) write wrapper executable to train models and print them to files
  • TODO: (postponed) write function to read TnT parser output into models directly (the TnT file parsers were done previously)
  • DONE: choose a library for commandline argument munging (went with cmdargs)
  • TODO: add commandline arguments for passing models to the server
  • DONE: write protocol buffer spec for IPC protocol
  • DONE: write Java client handlers for the IPCs
  • TODO: (low priority) write Haskell client handlers for debugging/verification of Java
  • TODO: write Haskell code for dumping intern tables to files, and reading them back in
  • TODO: write Java code for reading intern table files, so the client can dereference the ints
  • DONE: write functions for converting the protobuf Get monad into an iteratee or enumeratee
  • TODO: write Haskell server handlers for the IPCs
  • TODO: write STM code for parallelizing the tagging and IPC handling
  • DONE: write function for converting attoparsec parsers into enumeratees
  • TODO: (low priority) integrate attoparsec enumeratees into model training, etc, to replace top-level calls to many
  • DONE: write lots of other auxiliary functions for bytestring, attoparsec, and iteratee

Other than my research assistantship, I've been taking some cool classes. Larry Moss is teaching a course on category theory for coalgebra (yes, that Larry; I realized last xmas when my copy arrived). While I have a decent background in CT from being an experienced Haskell hacker and looking into things in that direction, it's nice to see it presented in the classroom. Also, we're using Adámek's Joy of Cats which gives a very different presentation than other books I've read (e.g., Pierce) since it's focused on concrete categories from mathematics (topology, group theory, Banach spaces, etc) instead of the CCC focus common in computer science.

Sandra's teaching a course on NLP for understudied and low-resource languages. As you may have discerned from my previous post, agglutinative languages and low-resource languages are the ones I'm particularly interested in: both because they are understudied, and therefore there is much new research to be done, and because of political reasons (alas, Mike seems to have taken down the original manifesto). We've already read a bunch of great papers, and my term paper will extend work from a book published less than a year ago; I should be done in time to submit it to ACL this year, which would be awesome.

My last class is in historical linguistics. I never got to take one during my undergrad, which is why I signed up for it. Matt offered one my senior year, but I was one of only two people who signed up for it, so it was cancelled. It used to be that people equated linguistics with historical linguistics, though that view has been outmoded for quite some time. Unfortunately it seems the field hasn't progressed much since then. Oh wells, the class is full of amusing anecdotes about language change, and the prof is very keen to impress upon us the (radically modern) polysynchronic approach to language change, as opposed to taking large diachronic leaps or focusing on historical reconstruction. And I'm rather keen on polysynchrony.

Classes have started up again, whence my month of absence. So I figure it's time to mention what I've been up to.

Over the summer I was working on developing an HMM-based part of speech tagger in Haskell. Most NLP folks consider POS tagging to be a "solved problem", and despite growing long in the tooth, TnT (which uses second-order HMMs) is still very close to state-of-the-art; so why bother? Two reasons. Contrary to public opinion, POS tagging is not a solved problem. We can get good accuracy for English, which has fixed word order and impoverished morphology, but we still don't really know how to handle morphologically rich languages with free word order. Moreover, the taggers we have have all been tested extensively on English and similar languages, but we don't really know how well different approaches apply to, say, Turkish, Hungarian, Japanese, Korean, Tzotzil, Quechua,...

The second reason is that my real goal is to handle supertagging for CCG, and in particular to do this for exploring online and interactive algorithms for tagging. Most of the current technology is focused on batch processing and off-line algorithms, which means that it isn't terribly useful for developing, say, an online system for real-time human--robot interaction, nor for exploring questions re the cognitive plausibility of something like supertagging serving a role in human processing of language. For doing this sort of research, TnT is too old and crotchety to work with, and the standard CCG supertaggers (OpenCCG, C&C Tools) are too integrated into their CCG parsing projects to be very amenable either. So, a new tagger writes I.

It is well-known that many common NLP algorithms for HMMs and chart parsing can be generalized to operate over arbitrary semirings. Before this was realized, some algorithms were invented over and over, specialized to different semirings. While the generalization is common in the lore, I've yet to see any codebase that actually takes advantage of it to provide implementations that are generic over different semirings. So one of my secondary goals has been to make this parameterization explicit (sketched below), and to make sure to do so in a way that doesn't diminish the performance of the tagger. By making the code modular in this way, it should also help when implementing variations on HMMs like higher-order HMMs, autoregressive HMMs, etc. And for doing this sort of thing right, you really need a type system you can trust, which means Haskell (or Agda or Coq). Also, most of the current work has been done in imperative languages only, so using a functional language provides a whole new arena of research on optimizations and the like.
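
A minimal sketch of what that parameterization looks like; the class and instance names here are mine, not Posta's actual API:

    import Data.List (transpose)

    -- A semiring: two monoids on one carrier, with <.> distributing over <+>.
    class Semiring r where
        zero, one    :: r
        (<+>), (<.>) :: r -> r -> r

    -- Sum-product over probabilities gives the forward algorithm.
    instance Semiring Double where
        zero = 0
        one  = 1
        (<+>) = (+)
        (<.>) = (*)

    -- Max-product gives Viterbi scores, with no change to the algorithm.
    newtype MaxProb = MaxProb Double deriving (Eq, Ord, Show)
    instance Semiring MaxProb where
        zero = MaxProb 0
        one  = MaxProb 1
        MaxProb x <+> MaxProb y = MaxProb (max x y)
        MaxProb x <.> MaxProb y = MaxProb (x * y)

    -- One step of the HMM recurrence, written once for every semiring.
    step :: Semiring r => [[r]] -> [r] -> [r] -> [r]
    step trans emit alpha =
        zipWith (<.>) emit
            [ foldr (<+>) zero (zipWith (<.>) alpha col)
            | col <- transpose trans ]

Swapping Double for MaxProb (or log-domain numbers, booleans, derivation forests...) changes which algorithm you're running without touching step at all.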

So, that was the summer. Towards the end of the summer I did a writeup for it, though it's not entirely finished yet (i.e., ready for publicity/publication). I've continued developing it for my research assistantship this year, which means integrating it with a variant of the Malt parser and seeing how well we can do online interactive semantic parsing of military data (which also presents a language-modeling problem, due to the huge number of OOVs, acronyms, and the use of terms like "green 42" as names).

The Real Science Gap

If you're anything like me, you've heard tell of the shortage of American scientists and the failing standards of our education system for decades. This article offers a detailed rebuttal of the party line. The problem isn't too few talented individuals, it argues; it's too few career prospects and a grist mill devoted to the exploitation of young scientists. This certainly reflects my experiences, both at a top research university and at a more vocational state school. I've always been more motivated by intellectual challenges than monetary rewards (in contrast to many of my colleagues), but it's always unsettling to look up at the house of cards. Will I actually be able to make a career of the research I so enjoy, or is this just a brief respite from the drudgery of a workaday programming job?

So it looks like I'll be giving a talk at NASSLLI next week. It's a 30-minute talk, which seems rather short coming from CLSP's hour-long NLP seminars, though apparently it counts as long by conference standards. The Midwest Theory Day talks were also half an hour, and nobody got much beyond their background material, which seems ludicrous as a means of disseminating new research. Oh well.

The talk is on a new semantic logic for handling constrained free word order in CCG. The logic is based on lambda calculus extended with something like a type system, though the extension is orthogonal to most type systems. Outside of CCG and linguistics, the logic seems like it would be helpful for typing the free ordering of keyword arguments or for capturing certain kinds of staged computation. I'm (re)writing up the paper this summer, so hopefully I'll be able to point y'all at a publication early next year. If you're interested and in the area you should stop by. Lots of other interesting things at NASSLLI this year too.

I'm sitting here, the last night before, and cooking dinner. It's funny how the before always comes a few days ahead of the end itself. Tonight is Lici's last night of work. It's about a week until the drive to Indiana.

I had some music on as I was finishing up some prepacking —books and such— and unintentionally, unexpectedly, came some songs with old memories. Old memories from other befores: CTY and Reed and the Plumtree. Isn't it strange how the memory of old nostalgia can lend a spirit of nostalgia to the present? It's no secret that I was never a fan of Bal'mer, but I did do a lot of growing here. Maybe I won't miss the place, but I will miss some of the folks and the simplicity of being tied to neither past nor future.

The last couple weeks have been nice. In addition to the Buffy/Angel, B5, and PS2 overload, [livejournal.com profile] misshepeshu and [livejournal.com profile] leensterama came out to visit so I took a couple trips to DC. I was reminded how not all the East Coast is like Baltimore, but I was also reminded how long it's been since I've lived in the District. DC was never really quite a home, but it was my escape-home for years before it grew into a home-in-transition for the couple years before moving to p-town and the Plumtree. It's not that things have changed so much as the friends I had then moved on to other cities and other lives. But Baltimore never was even a home-in-transition, it was only an in-transition. I came for a year, stayed for two, but never could settle into the rhythms and flows of the place.

I think "home" is never so much a place as it is a time, a moment, a feeling. We belie this with aphorisms on our inability to return there. We try to make the home into a place, but we can never return in time and so returning to the place once left can bring only sorrow. So too can we not hold time still, whence the solastalgia of remaining too long after the party has gone. We have words like mamihlapinatapai for the yearning and never taking, but what words are there for the never having and finally letting go?

Here's a quick update on my life as it stands. I seem to be building up a directory of abortive notes like this, so I'm typing this one in directly in hopes of actually posting it. Apologies for the lack of editing or, y'know, cohesion.

For those I haven't told yet, I got accepted to Indiana for a PhD in cognitive science (to be amended into a dual PhD in cogsci and computational linguistics), working with Mike Gasser and Sandra Kübler (along with Matthias Scheutz, most likely). The current plan is to move to Bloomington circa July 1st, with a previsit around June 17th to finalize leases and the like. That way I have some time to get settled and take a break before classes start. Now I need to find a place...

Employment-wise, the week before the previsit is the NIST eval for MT09, which will be the last huzzah before signing off on my Joshua and GALE Rosetta work. That means I have about a month to finish it, in tandem with the house hunting. One of the deliverables should be pretty easy to finish off, though it remains to explain to everyone how it works (yay monads!). Another I've done some mindcoding on, but don't have any actual code to show for it; I have the unsettling prediction that Java isn't going to let me do things in as clean a way as I'd like.

Research-wise, I've finished off my post-graduation Dyna involvement to buy time for other things. Jason still wants a meeting to discuss my involvement in the future, which is sensible. The research topics are interesting and'll probably influence my PL research for the next while, though I don't know how much of that will carry over to Dyna in the end. (And there's non-PL research I should be devoting more time to, methinks.) I'll miss working with [livejournal.com profile] qedragon, though we're planning to keep in touch.

Otherwise-wise, things are going a bit better now than they were. 'Tis still hard getting motivated, but the early summer days and the slow unwinding of obligations are doing some good. Lici says I tap the energy of my surroundings, and that that's why I was in such high spirits after my last visit to Bloomington. Considering how I go on about the dying of Baltimore et al, I can't help but think she's right. To that end, I've only a couple busy months left before I can bask in that relaxation once more.

Enough for now, work beckons once more. (And many thanks to [livejournal.com profile] altrus for Schinji Mix 2008.)

Last night I went to a farewell dinner for Micha, who is heading back to Germany after a couple months at CLSP. About a dozen of us had delicious Ethiopian, and half hung around for drinks afterwards. Both establishments were quite nice, reminding me I should hang out in Mt Vernon more often. Micha's specialty is in "Deep MT", a variety of machine translation which makes use of linguistic factors rather than being purely statistical. Or to wit: MT done right. So there was some self-selection involved but the company was, as always, what made the night.

Three of the folks who stuck around for drinks were the first-years at CLSP: two from CS who share my MT seminar, and one from ECE who seemed more grounded than most ;) Add to that Micha, myself, and one of the old-timers. It's amazing what people'll say once you get them off campus, or once you get a few drinks in 'em. On campus it's all business all the time. Which is fitting, it's a job after all; but it does leave things rather dreary. And somehow it seems to lead to never really knowing what other folks are working on, or what they're interested in. It's nice to see the human side of people. It's also nice to see the business side of the business. But no, I need more humans in my life.

At Brewers Art I spent most of my time talking with A. She was sitting next to me and I could hear her, two excellent points in her favor. At some point we got onto that topic: what we're really interested in. I said I just finished my degree and was sticking around for a year working on GALE, "so that's why you're always so together at MT seminar," and I'm working on PhD apps for next year. The follow-on question: the wheres and whys. I began to give the other face of my last rant, a presentation I've been polishing for those selfsame apps. I'm interested in morphology and its interfaces with syntax, semantics, and phonology; and I think we need to be working on linguistically-aware tools, since SMT's ignorance of morphosyntax is one of its principal failures (a point Micha demonstrated fabulously in his seminar last friday); and I think we need to be working on languages with few resources, for political reasons and also because tying ourselves to megacorpora means we will never break away from the need to invest millions to get enough training data to simulate knowledge, badly.

Shortly into my rant she said, "that's my soapbox!" For her undergrad thesis she worked on computational typology: measuring the distances between languages in typological space. It's the sort of work that would be essential for L3: using a known system for translating between two languages to bootstrap translations between similar languages. When I told her the places I was thinking of heading, she was surprised there were people working on our domain; she'd spent so long justifying this empirical-yet-linguistic approach, and I too know how hard it can be to convince the devout statisticians or the non-computationalists. Typology, even more than morphology, is a domain that gets a passing mention in undergrad years and yet never sees the light of day in modern research.

For her part she tried convincing me I should stick around CLSP, to join her in the battle. A tempting thought, though I worry it may be more uphill a battle than at the schools I've been thinking of. Though maybe it's worth another thought. All in all great food, great beer, great discussions, and intellectual vindication. What more could you ask for in a night?
