- Things done today
  - Gave my advisors The Letter. The public announcement is scheduled for Monday.
  - Handed in the revised version of qual #2. Clocked in at 59 pages. I'm not entirely satisfied with it, but sometimes y'just gotta let it go.
- What remains before I'm ABD
  - P550 final paper. Target length 10 pages.
  - Qual #3, due by x-mas. This one is just a lit-review, and I've already read the lit, so it shouldn't take too long (I hope!)
  - Qual oral defense, the first week or two of Spring. Cakewalk expected.
  - Dissertation proposal. Aiming to get it done by January.
Just got back from MFPS-LICS-CSF Saturday night. 'Twas the first LICS I've been to, and my first time in the Deep South. I had fun overall. Definitely enjoyed the French Quarter with its narrower streets, delightful architecture, and other non-American features. And I ran into the Pride parade the day after arriving; I seem to have a knack for that ;) The humidity was killer though.
The slides from my NLCS talk are available here. I've been having some issues with my bibtex2html script, so they're not linked to on the publications page yet; but they will be once I get that issue fixed.
In less happy news, I got some bloodwork back today. Cholesterol is far far too high, and I'm getting into the pre-diabetic range for blood sugar levels. So, I'm starting a major diet change in hopes of getting those under control. Apparently lack of protein is a big part of the problem (for me), which is ironic since most Americans get far too much. Damn Midwestern genes. Went grocery shopping today; it's profoundly difficult to get a 1:1 carbs-to-protein ratio as a vegetarian.
Next month I'll be giving a talk at the NLCS workshop, on the chiastic lambda-calculi I first presented at NASSLLI 2010 (slides). After working out some of the metatheory for one of my quals, I gave more recent talks at our local PL Wonks and CLingDing seminars (slides). The NASSLLI talk was more about the linguistic motivations and the general idea, whereas the PLWonks/CLingDing talks were more about the formal properties of the calculus itself. For NLCS I hope to combine these threads a bit better— which has always been the challenge with this work.
NLCS is co-located with this year's LICS (and MFPS and CSF). I'll also be around for LICS itself, and in town for MFPS though probably not attending. So if you're around, feel free to stop by and chat.
 N.B., the NASSLLI syntax is a bit different than the newer version: square brackets were used instead of angle brackets (the latter were chosen because they typeset better in general); juxtaposition was just juxtaposition rather than being made explicit; and the left- vs right-chiastic distinction was called chi vs ksi (however, it turns out that ksi already has an important meaning in type theory).
Edit 2013.07.02: the slides are available here.
Hmm, so I was hoping to get back into blogging over the summer. Y'know, a sort of liveblogging on quals, like how people blog about their Google Summer of Code projects these days. Turns out, some part of my brain is all, "oh noes! blogging is a waste of time! You could be working or, y'know, playing videogames or something!" So that sucks. In part because my brain is stupid, in part because I've been reading so many awesome blogs by people at least as busy as I am and now I feel bad for not measuring up to some impossible standard I made up out of thin air.
So the summer was fun, albeit not relaxing in the slightest. I really need to work on that whole Day of Rest idea. Things that happened this summer:
- Grace and Jason got married! So that's the second California wedding I've been to in as many years. Now I can quit complaining about never having gone to a wedding. It was great to see friends from college again. Of the ones I lost touch with over the years, one works in the game industry (both indie and corporate), and another works for Wolfram Alpha (indeed, with Mr. Wolfram himself). So that's pretty cool.
- Went to NASSLLI in Austin. There were some awesome classes there. Craige Roberts is fabulous; definitely someone to keep an eye on. Got to meet Adam Lopez, who was recently working on stuff related to one of my quals. Adam was part of the Edinburgh SMT crew, who came to JHU shortly after I left so I hadn't met him before. And, of course, got to hang out with Ken and Oleg again. Also, awesome, someone there remembered my talk from NASSLLI 2010 and asked about followup work on it.
- Read a bunch of fun books, or rather had them read to me. Licia got a kindle and loves reading aloud; she's the best ever. Fun books include: Pride and Prejudice and Jane Eyre. Seriously, they are both delightful and if you haven't read them you should. Competent women are the best. Also, Look Me in the Eye: My Life with Asperger's (dude, what a life!), and most of God, No! and Drop Dead Healthy.
- Bought Tales of Graces F on a whim and loved every minute of it. It starts off as your very standard JRPG about childhood friends, but then jumps ahead a few years after everyone has separated and grown up. The prologue is, as the reviews say, the least entertaining part; but it does a good job of setting the background for making the main plot poignant. Just saying people were childhood friends pales in comparison to seeing it and then seeing how they've grown apart. I haven't played any of the other Tales games, but the system is pretty similar to the Star Ocean system. Better done though, IMO. You have the fusion/cooking thing, but it's done in a way that's both extremely helpful and not obnoxious, so you use it regularly and actually care. The combat system is vibrant and engaging, and the system of badges is really cool. Overall the system has a lot of depth but doesn't get in the way of just playing. Some of the reviews complained about uneven difficulty, but I have no idea what they're on about. 10/10
- And in a few weeks I'll be heading off to ICFP again. It'll be the first time I've been to Europe, can't wait.
This past semester was a real doozy, for a number of reasons. But now that classes are over, maybe I'll get a chance to talk about some of them. In any case, at least it's done. Now I get to do quals: three months to write three papers good enough to convince people I can write a thesis. I'm looking forward to it; it's been so long since I've been free to do my own research without feeling bad about it encroaching on the work I 'should' be doing.
All last week I was in Tokyo to attend ICFP and associated workshops. It was nice to finally meet a bunch of people I've been talking with online for the last few years. And I met up with Ken Shan and Oleg Kiselyov again, which is always a pleasure. Unlike last time I was in Japan, I didn't get too much time to explore and go sightseeing. I got to explore Chiyoda, which I missed last time around, and I made sure to do two of the most important things: (1) eat some okonomiyaki, (2) visit Akihabara to buy some new manga.
My newest acquisition is 銃姫 ("Gun Princess") Phantom Pain, which I'm rather enjoying so far. Anything that starts off with an execution, spell-casting based on Buddhist mantras, and a prolonged diatribe on why one of the characters is a good-for-nothing incompetent layabout, can't be half bad :) Unfortunately, I only got the first two volumes, so I'll finish them all too soon. So far it's proving easier to read than my previous acquisition (Peace Maker 鐵), though I'm not sure if that's due to getting better at Japanese or because 鐵 is written in a particular style. I definitely noticed my deterioration in fluency since five years ago; grammar's fine, but my vocab is abysmal. I need to find a decent way to work on that.
Last week was a whirlwind. It was the first week of classes, which normally wouldn't be a big thing, except this semester I'm teaching a course. The first couple months of summer were pretty sedate up in Canada. But the last month, leading into the start of fall term, was full of traveling. I came back from Canada for a couple weeks, then left for a week with Licia, came back for a couple days (literally) and then flew off to California for Lindsey and Alex's wedding, arriving home the night before I needed to teach my first 9:30am class. Things've settled down now, though I'm heading off to ICFP next Friday.
One thing traveling is good for is getting caught up on pleasure reading. In addition to the Vinge mentioned last time, I also got to read some new C.S. Friedman. After returning from Canada I got a bunch of new games for the PS3 too. Portal 2 is good fun, though the atmosphere feels like a bizarre hybrid between the first Portal and the Fallout franchise; fitting in its way, but very strange. I've also been playing through El Shaddai and reveling in the beauty of Amaros. Unlike a lot of Japanese games, the US version lets you keep the original voice acting, which is fabulous. Dunno how good the English voices are actually; maybe next time I play through it I'll find out. And then there's Catherine: an adult romantic horror by the team who did the Persona series. It's actually a puzzle game, where you're trying to climb a tower that crumbles beneath you. Both the puzzling and the plot are top rate, as to be expected from Atlus and SMT. There are other books and other games, but I'm not feeling like doing any proper reviews just yet.
In addition to teaching, I'm taking two courses this term. Advanced Phonetics, continuing from the Phonetics course I took last spring. Back at Reed for my undergrad we didn't have any phonetics courses, only phonology; so I've been getting caught up on that, as well as filling out the requirements for the Linguistics half of my dual PhD. The other course (Q551) is an intro to cognitive neuropsychology. It's something of a psychology methods course, with a bit of neuroanatomy and the briefest mention of how the imaging technology works. Last spring I took a course on neuroscience for speech and hearing, and up in Canada I spent the summer with a bunch of computer scientists who work on optimizing the algorithms behind the imaging technology; so I'm not sure how much I'll get out of Q551, but it's a requirement for the CogSci half of the dual PhD. As a (meta)theoretical computational linguist, neuroimaging isn't really my area; but as it turns out there are some interesting problems there and plenty of room for theoretical mathematics. Even after the imaging is done, interpreting the images runs into a lot of the same statistical problems that you get in NLP. Both fields are in need of a new statistics, one which doesn't break down when you have enormous data sets. Maybe one day I'll try working on that.
I've been working on a tagging library (and executable) for a bit over a year now. When the project started I had the advantage of being able to choose the language to do it in. Naturally I chose Haskell. There are numerous reasons for this decision, some of which have been derided as "philosophical concerns". Certainly some of the reasons why Haskell is superior to other languages do border on the philosophical. Y'know, silly little things like the belief that type systems should prevent errors rather than encouraging them to proliferate. I'm sure you've heard the arguments before. They're good arguments, and maybe they'll convince you to try out Haskell in your basement. But in many so-called "enterprise" settings, anything that even smells like it might have basis in theoretical fact is automatically wrong or irrelevant; whatever you do in the privacy of your basement is your own business, but heaven forbid it have any influence on how decisions are made in the workplace! So, here is a short list of entirely pragmatic, practical, and non-theoretical reasons why Haskell is superior to Java for implementing enterprise programs. More specifically, these are reasons why Haskell is superior for my project. Perhaps they don't matter for your project, or perhaps they'll be enough to convince your boss to let you give Haskell a try. Because design decisions are often project-specific, each point explains why it matters for Posta in particular.
- Haskell has powerful frameworks for defining modular, high-performance, non-trivial parsers (e.g., Attoparsec). In natural language processing (NLP), just like system administration, over half of the work you do involves dealing with a handful of different ad-hoc poorly defined file formats. Reading them; generating them; converting from one format to another; etc. Because every one of these formats grew out of a slow accretion of features for one particular project, they're riddled with inconsistencies, idiosyncratic magic values, corner cases, and non-context-free bits that require special handling. In Java the premier tool (so far as I know) for defining parsers is JavaCC. (Like the C tools lex and yacc, JavaCC uses its own special syntax and requires a preprocessor, whereas Attoparsec and the like don't. However, this may be a "philosophical" issue.) More to the point, as of last time I used it, JavaCC is designed for dealing with the nice clean grammars used by programming languages, and it doesn't handle inconsistent and irregular grammars very well.
- Posta uses a system of coroutines (called "iteratees") in order to lazily stream data from disk, through the parsers, and into the core algorithms, all while maintaining guarantees about how long resources (e.g., file handles, memory) are held for. This allows handling large files, because we don't need to keep the whole file in memory at once, either in its raw form or in the AST generated by parsing it. For modern enterprise-scale NLP, dealing with gigabyte-sized files is a requirement; because many NLP projects are not enterprise-scale, you get to spend extra time chopping up and reformatting files to fit their limitations. Last time I used JavaCC it did not support incremental parsing, and according to the advertised features it still doesn't. In addition, implementing coroutines is problematic because Java's security model precludes simple things like tail-call optimization, meaning that you can only support this kind of streaming when the control flow is simple enough to avoid stack overflows.
- Haskell has awesome support for parallelism. One version, called STM, provides composable atomic blocks (which matches the way we naturally think about parallelism) combined with lightweight threads (which make it cheap and easy). Java has no support for STM. I am unaware of any support for lightweight threads in Java. The only parallelism I'm aware of in Java is the monitor-style lock-based system with OS threads. As with all lock-based systems, it is non-composable and difficult to get right; and as with using OS threads anywhere else, there is high overhead which removes the benefits of parallelizing many programs.
- Posta makes extensive use of partial evaluation for improving performance; e.g., lifting computations out of loops. When doing NLP you are often dealing with triply-nested loops, so loop-invariant code motion is essential for performance. In my benchmarks, partial evaluation reduces the total running time by 10%. If raw numbers don't convince you: using partial evaluation allows us to keep the code legible, concise, modular, and maintainable. The primary use of partial evaluation is in a combinator library defining numerous smoothing methods for probability distributions; the results of which are called from within those triply-nested loops. Without partial evaluation, the only way to get performant code is to write a specialized version of the triply-nested loop for every different smoothing method you want to support. That means duplicating the core algorithm and a lot of tricky math, many times over. There's no way to implement this use of partial evaluation in anything resembling idiomatic Java.
- Posta uses an implementation of persistent asymptotically optimal priority queues which come with proofs of correctness. A persistent PQ is necessary for one of the tagger's core algorithms. Since the PQ methods are called from within nested loops, performance is important. Since we're dealing with giga-scale data, asymptotics are important. A log factor here or there means more than a 10% increase in total running time. In Java there's java.util.PriorityQueue but it has inferior asymptotic performance guarantees and is neither persistent nor synchronized. I'm sure there are other PQ libraries out there, but I doubt anyone's implemented the exact version we need and shown their implementation to be correct.
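To make the partial evaluation point a bit more concrete, here's a minimal sketch of the idea. This is not Posta's actual code: the names `Smoother`, `mle`, `addLambda`, and `interpolate` are made up for illustration, and counts are kept in a plain association list to stay within `base`. Each smoother does its expensive work (summing counts, etc.) when it is handed the counts, and returns a cheap lookup function for use inside the loops.

```haskell
module Main where

-- Hypothetical types, for illustration only.
type Counts   = [(String, Int)]
type Smoother = Counts -> (String -> Double)

-- Maximum-likelihood estimate. `total` is bound outside the returned
-- lambda, so it is shared across every call to that lambda.
mle :: Smoother
mle counts =
  let total = fromIntegral (sum (map snd counts)) :: Double
  in \w -> maybe 0 fromIntegral (lookup w counts) / total

-- Add-lambda smoothing over the closed vocabulary of seen words.
-- Again the denominator is computed once per set of counts.
addLambda :: Double -> Smoother
addLambda lam counts =
  let total = fromIntegral (sum (map snd counts)) :: Double
      vocab = fromIntegral (length counts) :: Double
      denom = total + lam * vocab
  in \w -> (maybe 0 fromIntegral (lookup w counts) + lam) / denom

-- A combinator: linear interpolation of two smoothers. Both arguments
-- are specialized to the counts before any word is looked up.
interpolate :: Double -> Smoother -> Smoother -> Smoother
interpolate alpha s1 s2 counts =
  let p1 = s1 counts
      p2 = s2 counts
  in \w -> alpha * p1 w + (1 - alpha) * p2 w

main :: IO ()
main = do
  let counts = [("the", 6), ("cat", 3), ("sat", 1)]
      -- Specialize once, outside the loop...
      p = interpolate 0.5 mle (addLambda 1) counts
  -- ...then call the specialized function from inside it.
  mapM_ (\w -> putStrLn (w ++ "\t" ++ show (p w))) ["the", "cat", "sat"]
```

The core loop only ever sees a `String -> Double`, so it doesn't need to know which smoothing method was chosen, and it doesn't pay for recomputing the totals on every iteration.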
I'll admit I'm not up to date on state-of-the-art Java, and I'd love to be proven wrong about these things being unavailable. But a couple years ago when I returned to Java after a long time away, I learned that all the hype I'd heard about Java improving over the preceding decade was just that: hype. I have been disappointed every time I hoped Java had some trivial thing. The most recent one I've run into is Java's complete refusal to believe in the existence of IPC (no, not RPC), but that's hardly the tip of the iceberg.
I find it terribly unfortunate how susceptible academics are to Not Invented Here syndrome. Especially in disciplines like computer science where one of the primary acts of research is the creation of artifacts, a great amount of time and money are wasted replicating free publicly available programs. Worse than the effort wasted constructing the initial artifact is the continuous supply of effort it takes to maintain and debug these copies of the original. It's no wonder that so much of academic software is unreliable, unmaintained, and usable only by the developing team.
It's reasons like this why I support the free/open-source development model, demonstrated in academic projects like Joshua and GHC. The strong infusion of real-world software engineering methodologies that come from designing reliable software in F/OSS and industry seems to be the only way to save academia from itself.
This weekend I've been doing a solo hackathon to try to get Posta integrated with our variant of the Mink parser. All the core algorithms have already been implemented, so it's just been a whole lot of yak shaving. Namely, I have to define an IPC protocol for Haskell to talk to Java, implement the (Haskell) server executable and (Java) client stubs, and then try to shake the thing out to find bugs and performance holes. Ideally, by Tuesday morning.
Unfortunately, Java doesn't believe in inter-process communication, so all the libraries out there are for doing RPCs. Since the parser and tagger will be operating interactively, and on the same machine, it's quite silly to go through the network stack just to pass a few bytes back and forth. Thankfully I found CLIPC which should do the heavy lifting of getting Java to believe in POSIX named pipes. In order to handle the "on the wire" de/encoding, I've decided to go with Google's protocol buffers since there's already a protobuf compiler for Haskell. I was considering using MessagePack (which also has Haskell bindings), but protobuf seemed a bit easier to install and work with.
For all the plumbing code I decided to try working with iteratees, which have lots of nice performance guarantees. The protobuf libraries don't have integrated support for iteratees, but the internal model is a variant of iteratees so I was able to write some conversion functions. Attoparsec also uses an iteratee-like model internally, and there's integration code available. For my uses I actually need an enumeratee instead of an iteratee, so I had to roll one of my own.
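For anyone who hasn't run into iteratees before, here's a toy version of the three pieces, using only `base`. It's a sketch of the general idea, not the API of the iteratee library nor the code I wrote for Posta; all the names (`Stream`, `Iteratee`, `enumList`, `linesE`, `collect`, `run`) are hypothetical. The enumeratee here stands in for a parser: it turns a chunked stream of characters into a stream of complete lines for an inner iteratee.

```haskell
module Main where

-- A chunk of input, or the end of the stream.
data Stream a = Chunk [a] | EOF

-- An iteratee is either finished (result plus leftover input)
-- or waiting for the next chunk.
data Iteratee a b
  = Done b (Stream a)
  | Cont (Stream a -> Iteratee a b)

-- An enumerator: feed a list to an iteratee in chunks of (at least 1) n.
enumList :: Int -> [a] -> Iteratee a b -> Iteratee a b
enumList n = go
  where
    size = max 1 n
    go [] it       = it
    go xs (Cont k) = let (h, t) = splitAt size xs in go t (k (Chunk h))
    go _  done     = done   -- iteratee finished early; stop feeding

-- Signal EOF and extract the result, if the iteratee agrees to finish.
run :: Iteratee a b -> Maybe b
run (Done b _) = Just b
run (Cont k)   = case k EOF of
  Done b _ -> Just b
  Cont _   -> Nothing

-- An iteratee that collects everything it is fed.
collect :: Iteratee a [a]
collect = go []
  where
    go acc = Cont step
      where
        step (Chunk xs) = go (reverse xs ++ acc)
        step EOF        = Done (reverse acc) EOF

-- Split off the complete lines, returning the unfinished remainder.
splitLines :: String -> ([String], String)
splitLines s = case break (== '\n') s of
  (l, '\n' : more) -> let (ls, rest) = splitLines more in (l : ls, rest)
  (partial, _)     -> ([], partial)

-- An enumeratee: consume characters, feed whole lines to the inner
-- iteratee, and buffer any partial line between chunks. (If the inner
-- iteratee stops early, unconsumed outer input is simply dropped here.)
linesE :: Iteratee String r -> Iteratee Char (Iteratee String r)
linesE = go ""
  where
    go _   done@(Done _ _) = Done done (Chunk [])
    go buf (Cont k)        = Cont step
      where
        step (Chunk cs) =
          let (complete, rest) = splitLines (buf ++ cs)
          in go rest (k (Chunk complete))
        step EOF =
          -- flush a final unterminated line, if there is one
          Done (if null buf then Cont k else k (Chunk [buf])) EOF

main :: IO ()
main = do
  let input = "the\tDT\ncat\tNN\nsat\tVBD"
  -- Chunk boundaries (every 4 characters) don't line up with records.
  print (run (enumList 4 input (linesE collect)) >>= run)
```

The point is that the inner iteratee only ever sees whole records, no matter how the bytes happened to be chunked on their way off the disk or the pipe.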
- TODO: (easy) move the CG2 training module into the library
- TODO: (low priority) write a CG3 training module
- write an unparser for TnT lexicon and trigram files
- TODO: (easy) write function to feed trained models into the unparser
- TODO: (postponed) write wrapper executable to train models and print them to files
- TODO: (postponed) write function to read TnT parser output into models directly (the TnT file parsers were done previously)
- choose a library for commandline argument munging: cmdargs
- TODO: add commandline arguments for passing models to the server
- write protocol buffer spec for IPC protocol
- write Java client handlers for the IPCs
- TODO: (low priority) write Haskell client handlers for debugging/verification of Java
- TODO: write Haskell code for dumping intern tables to files, and reading them back in
- TODO: write Java code for reading intern table files, so the client can dereference the ints
- write functions for converting the protobuf Get monad into an iteratee or enumeratee
- TODO: write Haskell server handlers for the IPCs
- TODO: write STM code for parallelizing the tagging and IPC handling
- write function for converting attoparsec parsers into enumeratees
- TODO: (low priority) integrate attoparsec enumeratees into model training, etc, to replace top-level calls to many
- write lots of other auxiliary functions for bytestring, attoparsec, and iteratee
Other than my research assistantship, I've been taking some cool classes. Larry Moss is teaching a course on category theory for coalgebra (yes, that Larry; I realized last xmas when my copy arrived). While I have a decent background in CT from being an experienced Haskell hacker and looking into things in that direction, it's nice to see it presented in the classroom. Also, we're using Adámek's Joy of Cats which gives a very different presentation than other books I've read (e.g., Pierce) since it's focused on concrete categories from mathematics (topology, group theory, Banach spaces, etc) instead of the CCC focus common in computer science.
Sandra's teaching a course on NLP for understudied and low-resource languages. As you may have discerned from my previous post, agglutinative languages and low-resource languages are the ones I'm particularly interested in. Both because they are understudied and therefore there is much new research to be done, but also because of political reasons (alas, Mike seems to have taken down the original manifesto). We've already read a bunch of great papers, and my term paper will be working on an extension of a book that was published less than a year ago; and I should be done in time to submit it to ACL this year, which would be awesome.
My last class is in historical linguistics. I never got to take one during my undergrad, which is why I signed up for it. Matt offered one my senior year, but I was one of only two people who signed up for it, so it was cancelled. It used to be that people equated linguistics with historical linguistics, though that has been outmoded for quite some time. Unfortunately it seems that the field hasn't progressed much since then, however. Oh wells, the class is full of amusing anecdotes about language change, and the prof is very keen to impress upon us the (radically modern) polysynchronic approach to language change, as opposed to taking large diachronic leaps or focusing on historical reconstruction. And I'm rather keen on polysynchrony.
Classes have started up again, whence my month of absence. So I figure it's time to mention what I've been up to.
Over the summer I was working on developing an HMM-based part of speech tagger in Haskell. Most NLP folks consider POS tagging to be a "solved problem", and despite getting long in the tooth TnT (which uses second-order HMMs) is still very close to state-of-the-art; so why bother? Two reasons. Contrary to public opinion, POS tagging is not a solved problem. We can get good accuracy for English which has fixed word order and impoverished morphology, but we still don't really know how to handle morphologically rich languages with free word order. Moreover, the taggers we have, have all been tested extensively on English and similar languages, but we don't really know how well different approaches apply to, say, Turkish, Hungarian, Japanese, Korean, Tzotzil, Quechua,...
The second reason is that my real goal is to handle supertagging for CCG, and in particular to do this for exploring online and interactive algorithms for tagging. Most of the current technology is focused on batch processing and off-line algorithms, which means that it isn't terribly useful for developing, say, an online system for real-time human--robot interaction, nor for exploring questions re the cognitive plausibility of something like supertagging serving a role in human processing of language. For doing this sort of research, TnT is too old and crotchety to work with, and the standard CCG supertaggers (OpenCCG, C&C Tools) are too integrated into their CCG parsing projects to be very amenable either. So, a new tagger writes I.
It is well-known that many common NLP algorithms for HMMs and chart parsing can be generalized to operate over arbitrary semirings. Before realizing this, some algorithms were invented over and over, specialized to different semirings. While it's common in the lore, I've yet to see any codebase that actually takes advantage of this to provide implementations that are generic over different semirings. So one of my secondary goals has been to make this parameterization explicit, and to make sure to do so in a way that doesn't diminish the performance of the tagger. By making the code modular in this way, it should also help when implementing variations on HMMs like higher-order HMMs, autoregressive HMMs, etc. And for doing this sort of thing right, you really need a type system you can trust, which means Haskell (or Agda or Coq). Also, most of the current work has been done in imperative languages only, so using a functional language provides a whole new arena of research on optimizations and the like.
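Here's roughly what I mean by being generic over semirings, as a minimal sketch rather than the tagger's actual code. The `Semiring` class, the `HMM` record, and the toy model are all made up for illustration, and this is only a first-order HMM over lists with no attention paid to performance. The same `forward` function computes the total probability of an observation sequence under the sum-times semiring, and the probability of the best tag sequence under the max-times (Viterbi) semiring.

```haskell
module Main where

import Data.List (foldl')

infixl 6 <+>
infixl 7 <.>

class Semiring r where
  zero, one    :: r
  (<+>), (<.>) :: r -> r -> r

-- Sum-times: adds up the probability of all paths.
newtype Prob = Prob Double deriving (Show, Eq)
instance Semiring Prob where
  zero = Prob 0
  one  = Prob 1
  Prob a <+> Prob b = Prob (a + b)
  Prob a <.> Prob b = Prob (a * b)

-- Max-times: keeps only the best path.
newtype Viterbi = Viterbi Double deriving (Show, Eq)
instance Semiring Viterbi where
  zero = Viterbi 0
  one  = Viterbi 1
  Viterbi a <+> Viterbi b = Viterbi (max a b)
  Viterbi a <.> Viterbi b = Viterbi (a * b)

-- A first-order HMM with weights drawn from some semiring r.
data HMM s o r = HMM
  { states  :: [s]
  , initial :: s -> r
  , trans   :: s -> s -> r
  , emit    :: s -> o -> r
  }

srSum :: Semiring r => [r] -> r
srSum = foldr (<+>) zero

-- The forward recursion, written once for every semiring.
forward :: Semiring r => HMM s o r -> [o] -> r
forward _   []       = one
forward hmm (o : os) = srSum (map snd (foldl' step start os))
  where
    start = [ (s, initial hmm s <.> emit hmm s o) | s <- states hmm ]
    step alphas o' =
      [ (s', srSum [ a <.> trans hmm s s' | (s, a) <- alphas ] <.> emit hmm s' o')
      | s' <- states hmm ]

-- A toy two-tag model; the numbers are arbitrary.
data Tag = N | V deriving Show

initP :: Tag -> Double
initP N = 0.75
initP V = 0.25

transP :: Tag -> Tag -> Double
transP N N = 0.25
transP N V = 0.75
transP V _ = 0.5

emitP :: Tag -> String -> Double
emitP N "fish" = 0.5
emitP N "swim" = 0.5
emitP V "fish" = 0.25
emitP V "swim" = 0.75
emitP _ _      = 0

toy :: (Double -> r) -> HMM Tag String r
toy lift = HMM [N, V] (lift . initP) (\s s' -> lift (transP s s')) (\s o -> lift (emitP s o))

main :: IO ()
main = do
  print (forward (toy Prob)    ["fish", "swim"])
  print (forward (toy Viterbi) ["fish", "swim"])
```

Swapping in a log-space or counting semiring is just another instance declaration; the recursion itself never changes.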
So, that was the summer. Towards the end of the summer I did a writeup for it, though it's not entirely finished yet (i.e., ready for publicity/publication). I've continued developing it for my research assistantship this year, which means integrating it with a variant of the Malt parser and seeing how well we can do online interactive semantic parsing of military data (which also presents an LM problem due to the huge number of OOVs, acronyms, and the use of terms like "green 42" as names).
If you're anything like me, you've heard tale of the shortage of American scientists and the failing standards of our education system for decades. This article offers a detailed rebuttal of the party line. The problem isn't too few talented individuals, they say, it's too few career prospects and a grist mill devoted to the exploitation of young scientists. This certainly reflects my experiences, both at a top research university and at a more vocational state school. I've always been more motivated by intellectual challenges than monetary rewards (in contrast to many of my colleagues), but it's always unsettling to look up at the house of cards. Will I actually be able to make a career of the research I so enjoy, or is this just a brief respite from the drudgery of a work-a-day programming job?
The talk is on a new semantic logic for handling constrained free word order in CCG. The logic is based on lambda calculus extended with something like a type system, though the extension is orthogonal to most type systems. Outside of CCG and linguistics, the logic seems like it would be helpful for typing the free ordering of keyword arguments or for capturing certain kinds of staged computation. I'm (re)writing up the paper this summer, so hopefully I'll be able to point y'all at a publication early next year. If you're interested and in the area you should stop by. Lots of other interesting things at NASSLLI this year too.
I'm sitting here, the last night before, and cooking dinner. It's funny how the before always comes a few days ahead of the end itself. Tonight is Lici's last night of work. It's about a week until the drive to Indiana.
I had some music on as I was finishing up some prepacking —books and such— and unintentionally, unexpectedly, came some songs with old memories. Old memories from other befores: CTY and Reed and the Plumtree. Isn't it strange how the memory of old nostalgia can lend a spirit of nostalgia to the present? It's no secret that I was never a fan of Bal'mer, but I did do a lot of growing here. Maybe I won't miss the place, but I will miss some of the folks and the simplicity of being tied to neither past nor future.
The last couple weeks have been nice. In addition to the Buffy/Angel, B5, and PS2 overload, misshepeshu and leensterama came out to visit so I took a couple trips to DC. I was reminded how not all the East Coast is like Baltimore, but I was also reminded how long it's been since I've lived in the District. DC was never really quite a home, but it was my escape-home for years before it grew into a home-in-transition for the couple years before moving to p-town and the Plumtree. It's not that things have changed so much as the friends I had then moved on to other cities and other lives. But Baltimore never was even a home-in-transition, it was only an in-transition. I came for a year, stayed for two, but never could settle into the rhythms and flows of the place.
I think "home" is never so much a place as it is a time, a moment, a feeling. We belie this with aphorisms on our inability to return there. We try to make the home into a place, but we can never return in time and so returning to the place once left can bring only sorrow. So too can we not hold time still, whence the solastalgia of remaining too long after the party has gone. We have words like mamihlapinatapai for the yearning and never taking, but what words are there for the never having and finally letting go?
Here's a quick update on my life as it stands. I seem to be building up a directory of abortive notes like this, so I'm typing this one in directly in hopes of actually posting it. Apologies for the lack of editing or, y'know, cohesion.
For those I haven't told yet, I got accepted to Indiana for a PhD in cognitive science (to be amended into a dual PhD in cogsci and computational linguistics), working with Mike Gasser and Sandra Kübler (along with Matthias Scheutz, most likely). The current plan is to move to Bloomington circa July 1st, with a previsit around June 17th to finalize leases and the like. That way I have some time to get settled and take a break before classes start. Now I need to find a place...
Employment-wise, the week before the previsit is the NIST eval for MT09. Which will be the last hurrah before signing off on my Joshua and GALE Rosetta work. Which means I have about a month to finish that, in tandem with the house hunting. One of the deliverables should be pretty easy to finish off, though it remains to explain to everyone how it works (yay monads!). Another I've done some mindcoding on, but don't have any actual code to show for; I have the unsettling prediction that Java isn't going to let me do things in as clean of a way as I'd like.
Research-wise, I've finished off my post-graduation Dyna involvement to buy time for other things. Jason still wants a meeting to discuss my involvement in the future, which is sensible. The research topics are interesting and'll probably influence my PL research for the next while, though I don't know how much of that will carry over to Dyna in the end. (And there's non-PL research I should be devoting more time to, methinks.) I'll miss working with qedragon, though we're planning to keep in touch.
Otherwise-wise, things are going a bit better now than they were. Tis still hard getting motivated, but the early summer days and the slow unwinding of obligations are doing some good. Lici says I tap the energy of my surroundings and that that's why I was in such higher spirits after my last visit to Bloomington. Considering how I go on about the dying of Baltimore et al, I can't help but think she's right. To that end, I've only a couple busy months left before I can bask in that relaxation once more.
Last night I went to a farewell dinner for Micha, who is heading back to Germany after a couple months at CLSP. About a dozen of us had delicious Ethiopian, and half hung around for drinks afterwards. Both establishments were quite nice, reminding me I should hang out in Mt Vernon more often. Micha's specialty is in "Deep MT", a variety of machine translation which makes use of linguistic factors rather than being purely statistical. Or to wit: MT done right. So there was some self-selection involved but the company was, as always, what made the night.
Three of the folks who stuck around for drinks were the first years at CLSP: two from CS who share my MT seminar, and one from ECE who seemed more grounded than most ;) Add to that Micha, myself, and one of the old-timers. It's amazing what people'll say once you get them off campus, or once you get a few drinks in 'em. On campus it's all business all the time. Which is fitting, it's a job after all; but it does leave things rather dreary. And somehow it seems to lead to never really knowing what other folks are working on, or what they're interested in. It's nice to see the human side of people. It's also nice to see the business side of the business. But no, I need more humans in my life.
At Brewers Art I spent most of my time talking with A. She was sitting next to me and I could hear her, two excellent points in her favor. At some point we got onto that topic: what we're really interested in. I said I just finished my degree and was sticking around for a year working on GALE, "so that's why you're always so together at MT seminar," and I'm working on PhD apps for next year. The follow-on question: the wheres and whys. I began to give the other face of my last rant, a presentation I've been polishing for those selfsame apps. I'm interested in morphology and its interfaces with syntax, semantics, and phonology; and I think we need to be working on linguistically-aware tools, since SMT's ignorance of morphosyntax is one of its principal failures (a point Micha demonstrated fabulously in his seminar last Friday); and I think we need to be working on languages with few resources, for political reasons and also because tying ourselves to megacorpora means we will never break away from the need to invest millions to get enough training data to simulate knowledge, badly.
Shortly into my rant she said, "that's my soapbox!" For her undergrad thesis she worked on computational typology: measuring the distances between languages in typological space. The sort of work that would be essential for L3 to use a known system for translating between two languages to bootstrap translations between similar languages. When I told her the places I was thinking of heading she was surprised there were people working on our domain; she'd spent so long justifying this empirical-yet-linguistic approach, and I too know how hard it can be to convince the devout statisticians or the non-computationalists. Typology, more even than morphology, is a domain that gets a passing mention in undergrad years and yet never sees the light of day in modern research.
For her part she tried convincing me I should stick around CLSP, to join her in the battle. A tempting thought, though I worry it may be more uphill a battle than at the schools I've been thinking of. Though maybe it's worth another thought. All in all great food, great beer, great discussions, and intellectual vindication. What more could you ask for in a night?
It would seem over the last year or two my blog has lapsed from obscurity into death. Not being one to let things rest, I figure this horse still has some beating left in it. About, what, a month ago I handed in the final project for my MSE and so I am now a masterful computer scientist. This means, in short, that I now know enough to bore even other computer scientists on at least one topic.
The funny thing is that both topics of my project (category theory and unification) are topics I knew essentially nothing about when I transferred to JHU from PSU a year ago. Of course now, I know enough to consider myself a researcher in both fields, and hence know more than all but my peers within the field. I know enough to feel I know so little only because I have a stack of theses on my desk that I haven't finished reading yet. I'm thinking I should finish reading those before recasting my project into a submission to a conference/journal. Since the project is more in the vein of figuring out how a specific language should work, rather than general theoretical work, I'm not sure exactly how that casting into publishable form should go; it seems too... particular to be worth publishing. But then maybe I'm just succumbing to the academic demon that tells me my work is obvious to everyone since it is to me.
One thing that still disappoints me is that, much as I do indeed love programming languages and type theory, when I transferred here my goal was to move away from programming languages and more towards computational linguistics. (If I were to stick with PL, I could have been working with the eminent Mark Jones or Tim Sheard back at PSU.) To be fair, I've also learned an enormous amount about computational linguistics, but I worry that my final project does not convey that learning to the admission committees for the PhD programs I'll be applying to over the next few months. Another problem that has me worried about those applications is, once again, in the demesne of internecine politics. For those who aren't aware, years ago a line was drawn in the dirt between computationally-oriented linguists and linguistically-oriented computer scientists, and over the years that line has evolved into trenches and concertina wire. To be fair, the concertina seems to have been taken down over the last decade, though there are still bundles of it lying around for the unwary (such as myself) to stumble into. There are individuals on both sides who are willing to reach across the divide, but from what I've seen the division is still ingrained for the majority of both camps.
My ultimate interests lie precisely along that division, but given the choice between the two I'd rather be thrown in with the linguists. On the CS side of things, what interests me most has always been the math: type theory, automata theory, etc. These are foundational to all of CS and so everyone at least dabbles, but the NLP and MT folks (in the States, less so in Europe) seem to focus instead on probabilistic models for natural language. I don't like statistics. I can do them, but I'm not fond of them. Back in my undergraduate days this is part of why I loved anthropology but couldn't stand sociology (again, barring the exceptional individual who crosses state lines). While in some sense stats are math too, they're an entirely different kind of math than the discrete and algebraic structures that entertain me. I can talk categories and grammars and algebra and models and logic, but the terminology and symbology of stats are Greek to me. Tied in somehow with the probabilistic models is a general tendency towards topics like data mining, information extraction, and text classification. And while I enjoy machine learning, once again, I prefer artificial intelligence. And none of these tendencies strike me as meaningfully linguistic.
More than the baroque obfuscatory traditions of their terminology, my distaste for statistics is more a symptom than a cause. A unifying theme among all these different axes —computational linguistics vs NLP, anthropology vs sociology, mathematics vs statistics, AI vs machine learning — is that I prefer deep theoretical explanations of the universe over attempts to model observations about the universe. Sociology can tell you that some trend exists in a population, but it can make no predictions about an individual's behavior. Machine learning can generate correct classifications, but it rarely explains anything about category boundaries or human learning. An n-gram language model for machine translation can generate output that looks at least passingly like the language, but it can't generalize to new lexemes or to complex dependencies.
My latest pleasure reading is Karen Armstrong's The Battle for God: A history of fundamentalism. In the first few chapters Armstrong presents a religious lens on the history of the late-fifteenth through nineteenth centuries. Towards the beginning of this history the concepts of mythos and logos are considered complementary forces each with separate spheres of prevalence. However, as Western culture is constructed over these centuries, logos becomes ascendant and mythos is cast aside and denigrated as falsity and nonsense. Her thesis is that this division is the origin of fundamentalist movements in the three branches of the Abrahamic tradition. It's an excellent book and you should read it, but I mention it more because it seems to me that my academic interests have a similar formulation.
One of the reasons I've been recalcitrant about joining the ranks of computer scientists is that, while I love the domain, I've always been skeptical of the people. When you take a group of students from the humanities they're often vibrant and interesting; multifaceted, whether you like them or not. But when you take a group of students from engineering and mathematical sciences, there tends to be a certain... soullessness that's common there. Some of this can be attributed to purely financial concerns: students go into engineering to make money, not because they love it; students go into humanities to do something interesting before becoming a bartender. When pitting workplace drudgery against passionate curiosity, it's no wonder the personalities are different. But I think there's a deeper difference. The mathematical sciences place a very high premium on logos and have little if any room for mythos, whereas the humanities place great importance on mythos (yet they still rely on logos as a complementary force). In the open source movement, the jargon file, and other esoterica we can see that geeks have undeniably constructed countless mythoi. And yet the average computer geek is an entirely different beast than the average computer scientist or electrical engineer. I love computer geeks like I love humanists and humanitarians, so they're not the ones I'm skeptical of, though they seem to be sparse in academia.
I've always felt that it is important to have Renaissance men and women, and that modern science's focus on hyperspecialization is an impediment to the advancement of knowledge. This is one of the reasons I love systems theory (at least as Martin Zwick teaches it). While I think it's an orthogonal consideration, this breadth seems to be somewhat at odds with logocentric (pure) computer science. The disciplines that welcome diversity —artificial intelligence/life, cognitive science, systems theory, computational linguistics— seem to constantly become marginalized, even within the multidisciplinary spectrum of linguistics, computer science, et al. Non-coincidentally these are the same disciplines I'm most attracted to. It seems to me that the Renaissance spirit requires the complementary fusion of mythos and logos, which is why it's so rare in logocentric Western society.
I finally had a few spare moments to clean up my computer a bit. It's been a couple years since I've really had the time to do a complete reorg. I've needed this for a while. I didn't finish the reorg, just cleaned up some of the cruftier corners, but still it's nice. I noticed long ago that the organization of my computer has a direct link to how organized my life feels, and hence my happiness. As far as OCD goes, it's a simple enough burden and frees me from worrying about so many other things. Unfortunately, the main impetus behind the organizing is so I can get a nice clean fresh backup. My battery's been going on the fritz and I blame the manufacturer. Might be some other power issues too. In any case, xenobia's still under warranty, but in the event I need to use those backups, I'd rather have them be nice and pretty.
Taxes are done. Note to self: never be self-employed. Or a farmer.
I'll be attending Johns Hopkins in the fall. I've talked myself out of that pesky CS PhD again, which makes things easier. I'll have to check to see if I can apply my excess CS credits to the cog.sci PhD if I end up going for that over linguistics (which depends mainly on the school I go to, e.g. at JHU cog.sci subsumes linguistics).
Still no word from Seoul.
Classes are good this term. Much nicer than last, even if one does have a fucktonne of work. Note to self: if you've never needed programs to randomly partition and shuffle files before, you've obviously never created training sets for machine learning.
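For the curious, the sort of thing I mean looks roughly like the sketch below. It is only a minimal illustration: the function name `shuffle_and_partition`, the 80/10/10 split, and the fixed seed are all assumptions made up for the example, not the actual programs I wrote for class.

```python
import random


def shuffle_and_partition(lines, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle a copy of `lines`, then cut it into consecutive parts.

    `fractions` gives the approximate share of each part (hypothetical
    defaults: train/dev/test). The last part takes whatever remains, so
    no item is ever dropped by rounding.
    """
    rng = random.Random(seed)      # private RNG so the split is repeatable
    shuffled = list(lines)         # copy; the caller's data is left untouched
    rng.shuffle(shuffled)

    parts = []
    start = 0
    for i, frac in enumerate(fractions):
        if i == len(fractions) - 1:
            end = len(shuffled)    # remainder goes to the final part
        else:
            end = start + int(len(shuffled) * frac)
        parts.append(shuffled[start:end])
        start = end
    return parts
```

To use it on a file, read the lines with something like `open(path).readlines()`, pass them in, and write each returned part to its own file. Fixing the seed means rerunning the script reproduces the same partition, which matters when you want to compare experiments on identical training sets.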
It's time to try to break the coffeeine addiction again. Boozeohol might be the key.
The light of my life is getting depressed again. I feel bad because I don't know what to do about it, how to help. Being all too familiar with depression, I know what doesn't work, but that doesn't tell me what does, and the only things I know require being there in person. It doesn't help that I've been stumbling up and down over my own for the last year or so, from the look of my posts. Back in days of yore, she was the one who helped me through so much, but whenever she needed help she would run away and hide. But how can you help someone who doesn't want it, how can you take away the barbs and scars of history, of others' failings, of a world cruel and harsh and unforgiving?
Now, back into the fold. I'll try to send a missive next we meet civilization.
I know I've been gone for a long while now. Next term is looking to be a lot nicer than this term was. Hopefully I'll be able to get back into writing (and reading) LJ then; there're quite a few posts I have built up if I ever get the chance to write them out. Well, now for the last few months in review:
My schedule was updated, though I never got around to mentioning it. Well, it's been updated again! As always, that link is reachable from my profile and will be the place I try to keep all that info up to date and disseminated.
Accepted: Indiana (3/7)
Red Rubber Ducky Stamp of Doom: Berkeley (2/21), MIT (~3/5), Urbana–Champaign (3/5)
(Dates are when the letters were written and are for the benefit of those doing statistical analysis.)
Well, it looks like the winnowing process is well on its way. Indiana only offered me a nominal award which is unfortunate (though I may get to work with Michael Gasser or Doug Hofstadter, both of which would rock). Still waiting to hear from Johns Hopkins. In other news, I've been offered a programming(!) assistantship at PSU were I to stick around here. Tempting, but I'm pretty sure in the long run Indiana would make up for the price.
Of course now I'm getting tempted to just go for a PhD again, but then what the hell would I do with two PhDs? From the looks of it IU limits transfer credits for MS to 8 (a little more than a quarter) of the required 30 meaning I'd have ~7 classes to go, or a year. The PhD program requires 90 hours, but the transfer cap is 30 (of which I'll be at 21.3~23.3 at the end of next quarter) which'd be 22~23 classes, or just under four years taking it easy (possibly as low as two if I go for summer courses). And of those 90 credits, that includes a required minor which could mean 4 or more of those classes in linguistics (or cognitive sci, or systems sci). Awfully tempting if I could do it in just an extra year, even if only an extra two with either of the guys noted above and a decent fellowship or assistantship.
JHU only allows two classes to transfer (of 6~8 required) for the MSE, which means a year or just under. Whereas the PhD program requires two more classes, and doesn't specify the transfer cap. Hells, at this point it feels like with all I've put in I might as well just go for the doctorate rather than being overeducated for the slip of paper I'll have to wave around, especially if I'm going to go through the trouble of transferring (I'll have 28~32 of the 45 credits required at PSU after this spring, meaning I'm 5~6 classes away from the degree (i.e. barely two quarters) only one of which is required to be CS, so the rest could be ling if I can get my advisor to swing it).
Bah, just the same temptation as when I was filling out those damn apps. Why must I keep convincing myself I'm not really a computer scientist? Just because I'm good at it, I must keep remembering that linguistics is my passion. Fuckin' A, mate. What do y'all think I should do? Think it's worth an extra year or two knowing I'll have to go through the whole dissertation thing again another four years later?
Just finished "Surely You're Joking, Mr. Feynman!" not too long back. For those who haven't read it yet, it's a fabulous book. Hilarious, engaging, quirky, intelligent. Everything you'd ever want from a scientist. This was my bus-time reading for a few weeks, and let me tell you how hard it was to keep from laughing out loud every page or two turned. All the same, my smiles were deafening and turned many a head in those few weeks. There's a spot around two thirds on where his inveterate womanizing tendencies bleed through into something of a bitter cynicism, but he cheers up again by the end. In all, an excellent read and highly recommended.
As I mentioned in a screened post a while back, another application I filled out recently is for going off to Seoul for a couple months this summer to learn Korean. I should hear back from them mid- to late-April just a bit after I accept my grad school. Depending on finances, timing, and the details of the program exit, I may also try to stop over in Japan for a brief trip on the way back. In addition to going back to the country, there're a few friends out there it'd be nice to see. Of course all that depends on when and where I'll be moving in the fall and arranging housing there et al.
And now I must be off to sleep. Else I'll piss off the sexiest vixen by being late to pick her up tomorrow morn. I hope to be able to write a bit more afore classes start up again, though somehow I doubt that'll happen :)