winterkoninkje: shadowcrane (clean) (Default)

I finally got around to posting the slides for a talk I gave twice this summer: Probability Smoothing for NLP: A case study for functional programming and little languages. The first version of the talk was presented at the McMaster Workshop on Domain Specific Lanaguages (and Ed Kmett has posted a video of that version on YouTube) with the presentation focused on EDSLs, with smoothing given as an example. The second version was presented at the AMMCS minisymposium on Progress and Prospects in Model-Based Scientific Software Development, where the focus was more on the domain itself and how the use of a DSL allows ensuring correctness, modularity, and maintainability of code for developing probability models. The slides are essentially the same for both talks, with the benchmarks updated a bit in the latter.

As you may have surmised, this is but a small facet of the Posta project I was working on last year. I had meant to submit it as a functional pearl for ICFP, but the timing didn't work out for that. After giving the McMaster version of the talk, Ed convinced me that I should publish the code for the smoothing DSL separately from the rest of Posta. So he's the one to blame about my being so slow in releasing the Posta code I promised this summer. Though seriously, I'd been considering breaking up and reorganizing the code anyways. Now that I'm back from ICFP and all my traveling over the summer, I hope to get that code pushed out soon. Sorry for the delay y'all.

winterkoninkje: shadowcrane (clean) (Default)

All last week I was in Tokyo to attend ICFP and associated workshops. It was nice to finally meet a bunch of people I've been talking with online for the last few years. And I met up with Ken Shan and Oleg Kiselyov again, which is always a pleasure. Unlike last time I was in Japan, I didn't get too much time to explore and go sightseeing. I got to explore Chiyoda, which I missed last time around, and I made sure to do two of the most important things: (1) eat some okonomiyaki, (2) visit Akihabara to buy some new manga.

My newest acquisition is 銃姫 ("Gun Princess") Phantom Pain, which I'm rather enjoying so far. Anything that starts off with an execution, spell-casting based on Buddhist mantras, and a prolonged diatribe on why one of the characters is a good-for-nothing incompetent layabout, can't be half bad :) Unfortunately, I only got the first two volumes, so I'll finish them all too soon. So far it's proving easier to read than my previous acquisition (Peace Maker 鐵), though I'm not sure if that's due to getting better at Japanese or because 鐵 is written in a particular style. I definitely noticed my deterioration in fluency since five years ago; grammar's fine, but my vocab is abysmal. I need to find a decent way to work on that.

winterkoninkje: shadowcrane (clean) (Default)

unification-fd 0.5.0

The unification-fd package offers generic functions for first-order structural unification (think Prolog programming or Hindley–Milner type inference). I've had this laying around for a few years, so I figured I might as well publish it.

An effort has been made to try to make this package as portable as possible. However, because it uses the ST monad and the mtl-2 package it can't be H98 nor H2010. However, it only uses the following common extensions which should be well supported[1]:

Rank2Types
MultiParamTypeClasses
FunctionalDependencies
FlexibleContexts
FlexibleInstances
UndecidableInstances

[1] With the exception of fundeps which are notoriously difficult to implement. However, they are supported by Hugs and GHC 6.6, so I don't feel bad about requiring it. Once the API stabilizes a bit more I plan to release a unification-tf package which uses type families instead, for those who feel type families are easier to implement or use.

Description

The unification API is generic in the type of the structures being unified and in the implementation of unification variables, following the two-level types pearl of Sheard (2001). This style mixes well with Swierstra (2008), though an implementation of the latter is not included in this package.

That is, all you have to do is define the functor whose fixed-point is the recursive type you're interested in:

    -- The non-recursive structure of terms
    data S a = ...
    
    -- The recursive term type
    type PureTerm = Fix S

And then provide an instance for Unifiable, where zipMatch performs one level of equality testing for terms and returns the one-level spine filled with pairs of subterms to be recursively checked (or Nothing if this level doesn't match).

    class (Traversable t) => Unifiable t where
        zipMatch :: t a -> t b -> Maybe (t (a,b))

The choice of which variable implementation to use is defined by similarly simple classes Variable and BindingMonad. We store the variable bindings in a monad, for obvious reasons. In case it's not obvious, see Dijkstra et al. (2008) for benchmarks demonstrating the cost of naively applying bindings eagerly.

There are currently two implementations of variables provided: one based on STRefs, and another based on a state monad carrying an IntMap. The former has the benefit of O(1) access time, but the latter is plenty fast and has the benefit of supporting backtracking. Backtracking itself is provided by the logict package and is described in Kiselyov et al. (2005).

In addition to this modularity, unification-fd implements a number of optimizations over the algorithm presented in Sheard (2001)— which is also the algorithm presented in Cardelli (1987).

  • Their implementation uses path compression, which we retain. Though we modify the compression algorithm in order to make sharing observable.
  • In addition, we perform aggressive opportunistic observable sharing, a potentially novel method of introducing even more sharing than is provided by the monadic bindings. Basically, we make it so that we can use the observable sharing provided by the previous optimization as much as possible (without introducing any new variables).
  • And we remove the notoriously expensive occurs-check, replacing it with visited-sets (which detect cyclic terms more lazily and without the asymptotic overhead of the occurs-check). A variant of unification which retains the occurs-check is also provided, in case you really need to fail fast for some reason.
  • Finally, a highly experimental branch of the API performs weighted path compression, which is asymptotically optimal. Unfortunately, the current implementation is quite a bit uglier than the unweighted version, and I haven't had a chance to perform benchmarks to see how the constant factors compare. Hence moving it to an experimental branch.

I haven't had a chance to fully debug these optimizations, though they pass some of the obvious tests. If you find any bugs, do be sure to let me know. Also, if you happen to have a test suite or benchmark suite for unification on hand, I'd love to get a copy.

References

Luca Cardelli (1987)
Basic polymorphic typechecking. Science of Computer Programming, 8(2): 147–172.
Atze Dijkstra, Arie Middelkoop, S. Doaitse Swierstra (2008)
Efficient Functional Unification and Substitution, Technical Report UU-CS-2008-027, Utrecht University.
Oleg Kiselyov, Chung-chieh Shan, Daniel P. Friedman, and Amr Sabry (2005)
Backtracking, Interleaving, and Terminating Monad Transformers, ICFP.
Tim Sheard (2001)
Generic Unification via Two-Level Types and Paramterized Modules, Functional Pearl, ICFP.
Tim Sheard and Emir Pasalic (2004)
Two-Level Types and Parameterized Modules. JFP 14(5): 547–587. This is an expanded version of Sheard (2001) with new examples.
Wouter Swierstra (2008)
Data types a la carte, Functional Pearl. JFP 18: 423–436.

Links

To Canada!

3 Jun 2011 08:04 am
winterkoninkje: shadowcrane (clean) (Default)

Hello all. This summer (that is, next week) I'm heading up to Canada to teach a DSL bootcamp at McMaster along with Edward Kmett, and staying afterwards for a couple months. Rather than dealing with the TSA's bullshit I've decided to take the train up, which is cheaper and only slightly slower once you account for all the nonsense. But, this means I'll be spending a good deal of time laying over in Chicago and Buffalo. I know a bunch of you have traveled around, and some of you may happen to be there when I am, so: anyone want to meet up while I'm there? or knows of some good places to eat and spend time when visiting?

I'll be in Chicago around 4pm–9:30pm on June 10, and 9:45am–3:20pm on July 29. And in Buffalo around 9am–3pm on June 11, and 1:30pm–midnight on July 28. So that's about five hours each stop, ten hours coming back through Buffalo. Amtrak being what it is, some of this time might get eaten up by poor track scheduling, but I figure I'll still have a good deal of time regardless.

Also, anyone have recommendations for places to eat in downtown Indianapolis? I have an hour to kill around noon when heading up.

winterkoninkje: shadowcrane (clean) (Default)

I've been working on a tagging library (and executable) for a bit over a year now. When the project started I had the advantage of being able to choose the language to do it in. Naturally I chose Haskell. There are numerous reasons for this decision, some of which have been derided as "philosophical concerns". Certainly some of the reasons why Haskell is superior to other languages do border on the philosophical. Y'know, silly little things like the belief that type systems should prevent errors rather than encouraging them to proliferate. I'm sure you've heard the arguments before. They're good arguments, and maybe they'll convince you to try out Haskell in your basement. But in many so-called "enterprise" settings, anything that even smells like it might have basis in theoretical fact is automatically wrong or irrelevant; whatever you do in the privacy of your basement is your own business, but heaven forbid it have any influence on how decisions are made in the workplace! So, here is a short list of entirely pragmatic, practical, and non-theoretical reasons why Haskell is superior to Java for implementing enterprise programs. More specifically, these are reasons why Haskell is superior for my project. Perhaps they don't matter for your project, or perhaps they'll be enough to convince your boss to let you give Haskell a try. Because design decisions are often project-specific, each point explains why they matter for Posta in particular.

  • Haskell has powerful frameworks for defining modular, high-performance, non-trivial parsers (e.g., Attoparsec). In natural language processing (NLP), just like system administration, over half of the work you do involves dealing with a handful of different ad-hoc poorly defined file formats. Reading them; generating them; converting from one format to another; etc. Because every one of these formats grew out of a slow accretion of features for one particular project, they're riddled with inconsistencies, idiosyncratic magic values, corner cases, and non-context-free bits that require special handling. In Java the premiere tool (so far as I know) for defining parsers is JavaCC. (Like the C tools lex and yacc, JavaCC uses its own special syntax and requires a preprocessor, whereas Attoparsec and the like don't. However, this may be a "philosophical" issue.) However, as of last time I used it, JavaCC is designed for dealing with nice clean grammars used by programming languages and it doesn't handle inconsistent and irregular grammars very well.
  • Posta uses a system of coroutines (called "iteratees") in order to lazily stream data from disk, through the parsers, and into the core algorithms, all while maintaining guarantees about how long resources (e.g., file handles, memory) are held for. This allows handling large files, because we don't need to keep the whole file in memory at once, either in its raw form or in the AST generated by parsing it. For modern enterprise-scale NLP, dealing with gigabyte-sized files is a requirement; because many NLP projects are not enterprise-scale, you get to spend extra time chopping up and reformatting files to fit their limitations. Last time I used JavaCC it did not support incremental parsing, and according to the advertised features it still doesn't. In addition, implementing coroutines is problematic because Java's security model precludes simple things like tail-call optimization--- meaning that you can only support this kind of streaming when the control flow is simple enough to avoid stack overflows.
  • Haskell has awesome support for parallelism. One version, called STM, provides composeable atomic blocks (which matches the way we naturally think about parallelism) combined with lightweight threads (which make it cheap and easy). Java has no support for STM. I am unaware of any support for lightweight threads in Java. The only parallelism I'm aware of in Java is the monitor-style lock-based system with OS threads. As with all lock-based systems, it is non-composeable and difficult to get right; and as with using OS threads anywhere else, there is high overhead which removes the benefits of parallelizing many programs.
  • Posta makes extensive use of partial evaluation for improving performance; e.g., lifting computations out of loops. When doing NLP you are often dealing with triply-nested loops, so loop-invariant code motion is essential for performance. In my benchmarks, partial evaluation reduces the total running time by 10%. If raw numbers don't convince you: using partial evaluation allows us to keep the code legible, concise, modular, and maintainable. The primary use of partial evaluation is in a combinator library defining numerous smoothing methods for probability distributions; the results of which are called from within those triply-nested loops. Without partial evaluation, the only way to get performant code is to write a specialized version of the triply-nested loop for every different smoothing method you want to support. That means duplicating the core algorithm and a lot of tricky math, many times over. There's no way to implement this use of partial evaluation in anything resembling idiomatic Java.
  • Posta uses an implementation of persistent asymptotically optimal priority queues which come with proofs of correctness. A persistent PQ is necessary for one of the tagger's core algorithms. Since the PQ methods are called from within nested loops, performance is important. Since we're dealing with giga-scale data, asymptotics are important. A log factor here or there means more than a 10% increase in total running time. In Java there's java.util.PriorityQueue but it has inferior asymptotic performance guarantees and is neither persistent nor synchronized. I'm sure there are other PQ libraries out there, but I doubt anyone's implemented the exact version we need and shown their implementation to be correct.

I'll admit I'm not up to date on state-of-the-art Java, and I'd love to be proven wrong about these things being unavailable. But a couple years ago when I returned to Java after a long time away, I learned that all the hype I'd heard about Java improving over the preceding decade was just that: hype. I have been disappointed every time I hoped Java has some trivial thing. The most recent one I've run into is Java's complete refusal to believe in the existence of IPC (no, not RPC), but that's hardly the tip of the iceberg.

winterkoninkje: shadowcrane (clean) (Default)

stm-chans 1.0.0

The stm-chans package offers a collection of channel types, similar to TChan but with additional features. In particular it offers these types:

TBChan: Bounded FIFO channels.
When the channel is full, writers will block/retry. This ensures that the writers do not get too far ahead of the readers, which helps to make sure that memory and cpu resources are used responsibly.
TMChan: Closeable FIFO channels.
This is like TChan (Maybe a) but with a monotonicity guarantee that once Nothing is returned all future reads will be Nothing as well.
TBMChan: Bounded Closeable FIFO channels.
This combines the capabilities of TBChan and TMChan.

In addition, the stm-chans package offers a (partial) compatibility layer for some API improvements still making their way into the stm package[1]. These new functions include:

tryReadTChan :: TChan a -> STM (Maybe a)
A version of readTChan which does not retry. Instead it returns Nothing if no value is available.
peekTChan :: TChan a -> STM a
Get the next value from the TChan without removing it, retrying if the channel is empty.
tryPeekTChan :: TChan a -> STM (Maybe a)
A version of peekTChan which does not retry. Instead it returns Nothing if no value is available.

Links

winterkoninkje: shadowcrane (clean) (Default)

I offer you a thought experiment. For your current project you've set up some inter-process communication. Nothing tricky involved, just your standard client–server deal. You've even outsourced the protocol design and are using Google's code to handle all the grody details of serializing and deserializing. Well, okay, on the server side you're using someone else's code but it implements the same protocol, right? Now you run into a bug.

The vast majority of the time everything works smoothly— even verified by taking SHA1 hashes of the messages on both sides and comparing them. But every so often the Java client crashes. In particular, it crashes whenever reading a result message (from the server) of length 255 or 383 and maybe some larger sizes. It does, however, work perfectly fine for intervening message lengths (including 254 and 256). So what's wrong?

A hint ) Another hint ) One last hint )

Solution posted here.

winterkoninkje: shadowcrane (clean) (Default)

unix-bytestring 0.3.2

The unix-bytestring package offers a full selection of Unix/Posix-specific functions for reading and writing ByteStrings to file descriptors.

The Story Behind It All

A while back I needed a ByteString-based version of the System.Posix.IO String-based API. I coded up the new versions and submitted a patch to the unix package for adding them in. But apparently, noone's much pleased with the System.Posix.IO API! In the ensuing discussion folks brought up their specific complaints and offered suggestions. So now, the couple of functions have become enough to release as its own package— with the hope that one day it'll be rolled into the unix package.

That's where you come in. The current API is more than good enough for anything I've needed, so I need your feedback on what you want from a ByteString-based Unix/Posix I/O library.

I've looked through Roel van Dijk's reverse dependency analysis to find the folks currently using the unix package. The maintainers of the following packages may be particularly interested in unix-bytestring:

HFuse has complained about the lack of bindings for pread(2) and pwrite(2), which are included in unix-bytestring. The only standing complaint I haven't addressed is one expressed in the iteratee, iteratee-mtl, and liboleg packages which would prefer a return type of (Either Errno _) instead of throwing exceptions.

Links

winterkoninkje: shadowcrane (clean) (Default)

This weekend I've been doing a solo hackathon to try to get Posta integrated with our variant of the Mink parser. All the core algorithms have already been implemented, so it's just been a whole lot of yak shaving. Namely, I have to define an IPC protocol for Haskell to talk to Java, implement the (Haskell) server executable and (Java) client stubs, and then try to shake the thing out to find bugs and performance holes. Ideally, by tuesday morning.

Unfortunately, Java doesn't believe in inter-process communication, so all the libraries out there are for doing RPCs. Since the parser and tagger will be operating interactively, and on the same machine, it's quite silly to go through the network stack just to pass a few bytes back and forth. Thankfully I found CLIPC which should do the heavy lifting of getting Java to believe in POSIX named pipes. In order to handle the "on the wire" de/encoding, I've decided to go with Google's protocol buffers since there's already a protobuf compiler for Haskell. I was considering using MessagePack (which also has Haskell bindings), but protobuf seemed a bit easier to install and work with.

For all the plumbing code I decided to try working with iteratees, which have lots of nice performance guarantees. The protobuf libraries don't have integrated support for iteratees, but the internal model is a variant of iteratees so I was able to write some conversion functions. Attoparsec also uses an iteratee-like model internally, and there's integration code available. For my uses I actually need an enumeratee instead of an iteratee, so I had to roll one of my own.

  • TODO: (easy) move the CG2 training module into the library
  • TODO: (low priority) write a CG3 training module
  • DONE: write an unparser for TnT lexicon and trigram files
  • TODO: (easy) write function to feed trained models into the unparser
  • TODO: (postponed) write wrapper executable to train models and print them to files
  • TODO: (postponed) write function to read TnT parser output into models directly (the TnT file parsers were done previously)
  • DONE: choose a library for commandline argument munging cmdargs
  • TODO: add commandline arguments for passing models to the server
  • DONE: write protocol buffer spec for IPC protocol
  • DONE: write Java client handlers for the IPCs
  • TODO: (low priority) write Haskell client handlers for debugging/verification of Java
  • TODO: write Haskell code for dumping intern tables to files, and reading them back in
  • TODO: write Java code for reading intern table files, so the client can dereference the ints
  • DONE: write functions for converting the protobuf Get monad into an iteratee or enumeratee
  • TODO: write Haskell server handlers for the IPCs
  • TODO: write STM code for parallelizing the tagging and IPC handling
  • DONE: write function for converting attoparsec parsers into enumeratees
  • TODO: (low priority) integrate attoparsec enumeratees into model training, etc, to replace top-level calls to many
  • DONE: write lots of other auxiliary functions for bytestring, attoparsec, and iteratee
winterkoninkje: shadowcrane (clean) (Default)

Classes have started up again, whence my month of absence. So I figure it's time to mention what I've been up to.

Over the summer I was working on developing an HMM-based part of speech tagger in Haskell. Most NLP folks consider POS tagging to be a "solved problem", and despite growing long in the teeth TnT (which uses second-order HMMs) is still very close to state-of-the-art; so why bother? Two reasons. Contrary to public opinion, POS tagging is not a solved problem. We can get good accuracy for English which has fixed word order and impoverished morphology, but we still don't really know how to handle morphological languages with free word order. Moreover, the taggers we have, have all been tested extensively on English and similar languages, but we don't really know how well different approaches apply to, say, Turkish, Hungarian, Japanese, Korean, Tzotzil, Quechua,...

The second reason is that my real goal is to handle supertagging for CCG, and in particular to do this for exploring online and interactive algorithms for tagging. Most of the current technology is focused on batch processing and off-line algorithms, which means that it isn't terribly useful for developing, say, an online system for real-time human--robot interaction, nor for exploring questions re the cognitive plausibility of something like supertagging serving a role in human processing of language. For doing this sort of research, TnT is too old and crotchety to work with, and the standard CCG supertaggers (OpenCCG, C&C Tools) are too integrated into their CCG parsing projects to be very amenable either. So, a new tagger writes I.

It is well-known that many common NLP algorithms for HMMs and chart parsing can be generalized to operate over arbitrary semirings. Before realizing this, some algorithms were invented over and over, specialized to different semirings. While it's common in the lore, I've yet to see any codebase that actually takes advantage of this to provide implementations that are generic over different semirings. So one of my secondary goals has been to make this parameterization explicit, and to make sure to do so in a way that doesn't diminish the performance of the tagger. By making the code modular in this way, it should also help when implementing variations on HMMs like higher-order HMMs, autoregressive HMMs, etc. And for doing this sort of thing right, you really need a type system you can trust, which means Haskell (or Agda or Coq). Also, most of the current work has been done in imperative languages only, so using a functional language provides a whole new arena of research on optimizations and the like.

So, that was the summer. Towards the end of the summer I did a writeup for it, though it's not entirely finished yet (i.e., ready for publicity/publication). I've continued developing it for my research assistanceship this year, which means integrating it with a variant of the Malt parser and seeing how well we can do online interactive semantic parsing of military data (which also presents an LM problem due to the huge number of OOVs, acronyms, and the use of terms like "green 42" as names).

winterkoninkje: shadowcrane (clean) (Default)

In formal treatments of lazy functional languages like Haskell, appeal is often made to domain theory. For those who aren't familiar, a domain is nothing more than a set equipped with an ordering relation that obeys certain laws. The ordering relation captures a notion of "informativeness" whereby "greater" elements are more defined or more known-about than "lesser" elements. One of the particular laws is that the domain must have an element, ⊥, which is less than all others (and therefore contains no information other than what domain you're in).

Another way to think about domain theory is from the context of logic programming. The domains we often encounter in functional programming are defined structurally. That is, given an algebraic data type, we can construct a domain for the type where the elements of the domain are generated by its initial algebra on a set of variables representing unknown substructure. Indeed, this is usually the domain we are most interested in. In this case, ⊥ is simply a free variable (and thus, is actually ⊤). If we treat the variables as logic variables, then the domain ordering is defined by the subsumption relation induced by unification (aka pattern matching). Thus, a domain is simply the set of all patterns we could write for some type, and the subsumption relation between those patterns.

In Haskell we often conflate "⊥" as a domain element we know nothing about, and "⊥" as a non-terminating computation. There are natural reasons for doing so, but it's unfortunate because it loses a critical distinction between what is already-known and what is knowable. This raises questions about how much of our domain-theoretic training we can carry over to total lazy languages, a topic I've become quite interested in lately.

The idea that non-termination and free variables are the same comes from the idea that if we try to case match on an unbound variable then we will deadlock, waiting for that variable to become bound to something in order to proceed. In a fully interpreted language this holds up fairly well. However, in languages which aren't fully interpreted (e.g. dependently typed languages, languages with non-ground terms, parallel data-flow languages) it doesn't hold up because variables serve multiple roles. Depending on the context, we will want to think of a variable as either being bound to no element of the domain, or as being bound nondeterministically to every element of the domain. The former matches our functional programming intuitions about non-termination, whereas the latter matches our logic programming intuitions about non-ground terms presenting the type of all the terms they subsume.

In particular, the idea of variables bound to everything allows us to consider the idea of a hypothetical input: an input about which we know nothing, but which we are permitted to inspect for more information. Case analysis on hypothetical variables is the same as reasoning by cases in proof theory. These are different from abstract variables, which are values about which we know nothing and which we are not permitted to inspect. Loosely speaking, hypothetical variables are those introduced by universal quantification, whereas abstract variables are those introduced by existential quantification. (However, existentially bound variables may still by hypothetically analyzable rather than being held abstract; it depends on the situation.)

In a total language we are not permitted to write computations that we cannot prove will terminate. Thus, we can never obtain a nonterminating 'value' inhabiting the bottom element of a domain. However, we can still use bottom to represent things we hold abstract or which we don't force the evaluation of. Thus, in a total lazy language, domain theory should still have a role in strictness analysis and certain other formal reasoning. Given a domain we can interpret it hypothetically, as we do when performing strictness analysis or reasoning by cases, or we can interpret it abstractly, as we do when performing termination analysis. This conflict feels very similar to the conflict between parametricity and dependency. I wonder if any work has been done on this domain theoretic duality?

winterkoninkje: shadowcrane (clean) (Default)
This is a followup to a recent post on Parameterized Monads which I discovered independently, along with everyone else :)

For anyone interested in reading more about them, Oleg Kiselyov also discovered them independently, and developed some Haskell supporting code. Coq supporting code by Matthieu Sozeau is also available. And apparently Robert Atkey investigated them thoroughly in 2006.

If people are interested in more Coq support, I've been working on a Coq library for basic monadic coding in a desperate attempt to make programming (rather than theorem proving) viable in Coq. Eventually I'll post a link to the library which will include parameterized monads as well as traditional monads, applicative functors, and other basic category theoretic goodies. This library along with the Vecs library I never announced stemmed from work last term on proving compiler correctness for a dependently typed language. Hopefully there'll be more news about that this fall.
winterkoninkje: shadowcrane (clean) (Default)

I like Smalltalk. Of any of the OO options it's by far my favorite. And yet, this most powerful language of the '70s has been relegated to oblivion. Robert Martin of Object Mentor Inc. gives a talk at Rails Conf 2009, "What Killed Smalltalk Could Kill Ruby, Too", which is well worth watching. I've since abandoned the whole OO paradigm in favor of functionalism, but I think this talk also has a good deal to say to the Haskell community (in fact, hat tip to Nick Mudge on Planet Haskell).

In particular, around 37:00 to 41:00, Martin talks about one of the three major things to kill Smalltalk. This one is the greatest danger for the Haskell community: arrogance and parochialism as a result of an emphasis on purity. The complaint is a common one, though I think the mention of purity is something which should be taken with depth. (Certainly purity is one of the highest horses we Haskellers will climb upon.) In an interesting addition to the usual dialogue, Martin posits professionalism as the countervailing force we need to maintain in the face of the growth of the community.

I highly recommend the video. The actual talk starts about six minutes in, and after the ending at 50:00 there's a Q&A session with a couple good questions.

winterkoninkje: shadowcrane (clean) (Default)

I've been working on sprucing up some of my libraries on Hackage. In particular I've been working on making sure the code is portable to Hugs as well as GHC, since I'm not relying on GHC-specific extensions. This is made tricky, however, because I am using both CPP and the FFI. Along the way I've been stumbling into Dead Code Land, the place where code that was never documented goes when the project dies. Perhaps my charting of these territories will help anyone else misguided by portability.

Some background. The latest release of Hugs is September 2006, which was contemporary with, though prior to, GHC 6.6. Now that GHC 6.10 has been released, GHC 6.8 is quickly nearing the end of it's life cycle, though many people still use it (since installation is always a headache, or since 6.10 still has a few unfortunate performance regressions). For my code I'd like to support Hugs Sept06, GHC 6.8, and GHC 6.10. I'm fine to call GHC 6.6 dead; porting the code to it shouldn't be too difficult, but there are some things like LANGUAGE pragma which are just too nice to loose. There are other compilers out there like nhc98, yhc, jhc, and lhc though they tend to be niche and don't support all the standard extensions to Haskell 98.

Beyond compilers. Cabal is the de-facto packaging tool for Haskell. On GHC 6.8 we have Cabal 1.2.3.0, though because of the rapid evolution of the project, this version is ancient and all it's version-specific documentation has been blown away by the documentation for newer versions. Haddock is the de-facto documentation tool. Here we have version 2.0.0.0 which is pretty recent, though all the documentation is on how to write doc code, rather than how to run the program that compiles that into html.

And now on down to the dirt. The use of CPP for conditional compilation of Haskell has a long, if dubious, history. There's a cpphs project which aims to clean up C's original cpp, in order to make it easier to use for Haskell code which is sensitive to whitespace and other things that C isn't. The only other project even vaguely in this domain is Template Haskell, but that's GHC-only and it does far more besides. Using CPP with GHC is trivially easy. For Hugs it's a bit more difficult. Hugs doesn't have a --cpp flag, but instead has flags for being able to filter your code with any arbitrary program. A canonical invocation looks something like this:

hugs -F'cpp -P -traditional -D__HUGS__=200609' MyProgram.hs

Some tricky things to note. First is that C's cpp requires the -P flag in order to work in "pipe mode" rather than reading a file directly from disk. Second is the -traditional flag which turns off some newer cpp features which can interfere with Haskell code. And finally, for our conditional code, we define a macro that says we're running under Hugs Sept06.

Another tricky thing to note is that this will work on the command line, but Cabal 1.2.3.0 is hard-coded to use cpphs instead. In order to mix things up, cpphs has a different set of flags than cpp does, and so telling Cabal to just use cpp instead won't work. Another tricky bit is that Hugs installs a version under the name cpphs-hugs, so you may need to set up an alias or tell cabal --with-cpphs=/blah/bin/cpphs-hugs during configure.

Haddock (2.0.0.0 re Cabal 1.2.3.0) also has some misfeatures regarding CPP. In your .cabal file you can set the field cpp-options in order to define your macros, and from here they get included in the preprocessor runs for GHC or Hugs. They do not, however, get called when haddock is called. Thus, since haddock interprets the code in order to extract type information, you'll need to set the ghc-options field, which does get passed to haddock. To complicate things further, recent updates to Hackage have filters to detect setting CPP macros in ghc-options and will reject the package, telling you to use the cpp-options field instead. (A good practice in general, but doesn't work as well as it should.)

Update (Sunday, 15 March 2009): This misfeature re Haddock has been fixed in Cabal 1.6.

The foreign function interface was an early extension to the venerable Haskell 98 spec. Compiling to use the FFI is again trivial in GHC. For Hugs it's a bit more complicated since Hugs is an interpreter, not a compiler. The solution is, before calling hugs, just call ffihugs with all the same options and it will compile and link a DLL that gets loaded in when the source code is interpreted.

Again, this will work on the command line, but Cabal 1.2.3.0 has other ideas. To be fair, Cabal does automatically detect if you need to call ffihugs and will do so on your behalf. The problems are, again, more to do with option passing. The call to ffihugs does not inherit the hugs-options field in your .cabal file. There is no ffihugs-options field, but you can pass the --ffihugs-options="-98 +o" flag to configure. (Similarly there is no haddock-options field, but you can pass --haddock-options=-D__BLAH__ in order to get around the Hackage restriction.) The other problem is that cpphs is not called before ffihugs is called, therefore you must pass in the flag for the filter. This part is especially tricksy.

Scripting 101: if something has too many (or too few) levels of quoting, blame your shell first. Blame the program later, if ever.

The secret incantation is --ffihugs-option=-F'cpp -P -traditional -D__HUGS__=200609'. First trick is to use --ffihugs-option instead of --ffihugs-options. The plural does some whitespace tokenizing for being able to pass multiple flags at once, but it goes awry here. Second trick is to use single ticks in order to protect the spaces, but do not embed those within double quotes (like we usually do with --prog-options="..."). If you have that second level of quoting, then Cabal will end up looking for the program which is named by the whole quoted token, instead of evaluating it as cpp and some arguments passed to it.

Through much trial and tribulation, I can finally get Cabal to compile my packages for Hugs. Haddock still doesn't work, since it must be run with the same version (of GHC) as it was compiled with, but that's a whole other can of worms. The big sticking point now, is that all those CPP macros should be set in the .cabal file once and for all (using Cabal flags), and the user shouldn't have to know anything about them. At this point I'm thinking that there's simply no way around it without requiring a newer version of Cabal which fixes the option-passing bugs. So I guess it's back to README files to tell folks how to compile it. That's too bad really.

Moving!

17 Aug 2008 03:20 am
winterkoninkje: shadowcrane (clean) (Default)

So it's official now, I shall be moving in a couple-few weeks. Out of suburbia and into the dangerous city. We managed to find a nice little pocket that looks clean and safe enough. Just off the lightrail and only three miles by bike from campus. Also near a little neighborhood that reminds me of Alberta.

That school thing seems to be coming along. This week my advisor and I'll be discussing where to place the final nails in the coffin. My post-graduation work is starting to get underway as well. And I've started releasing various bits of my Haskell code into the wild. Eventually I'll get around to updating my main homepage again to include all that.

Speaking of which, my posts have been rather programming-centric of late. I know many of y'all are into that sort of thing, but I don't want to scare the rest off. I've been thinking about joining Planet Haskell, though I'm wondering if I should spawn off another blog more focused for that (or, y'know, get the RSS feeds from my main site working again; though that doesn't allow for comments). Thoughts?

winterkoninkje: shadowcrane (clean) (Default)

So I ran into a discussion on Lambda the Ultimate wherein was raised the topic of what the "real" mathematical type for (integral) division should be since :: Int -> Int -> Int doesn't cut it. Someone proposed :: Int -> Int -> Int+1, which is the normal way it's implemented— that is, the result is an integer or a singleton value representing (the single divide-by-zero) error it can generate. In a total functional world this has the problem of how to percolate exceptions up, since the caller isn't allowed to fail either.

That type is flagrantly wrong.

Invoking the abstract of the paper presented there (the site seems offline, so I can't read the paper), the reason it's flagrantly wrong is because it's not mathematical. Mathematics doesn't define an "undefined" value which would be permissible to return, it simply fails to define any value whatsoever. So if extending the result type is both mathematically incorrect and also a programming infrastructure nightmare, then what is the right answer?

The right answer is to restrict the inputs to the function. The usual way of doing this involves using dependent types to carry extra information about values in the types. Dependent types have many benefits (and many many issues), but we don't need even need them for this. All we need is difference types :: Int -> (Int \\ 0) -> Int.

Read more... )
winterkoninkje: shadowcrane (clean) (Default)

Okay, so my plans for weekly postings about the papers I've been reading failed. Mostly because I haven't been reading as many these last few weeks than I had been. But I left off with a comment about testing in Haskell and I figured I should pop that off my stack before going on to a different post.

One of the glorious things about having a pure, side-effect free, language is that you can state properties of your program as mathematical laws. When you, the programmer, know your code should adhere to some law like x + y == y + x, the usual thing most languages force you to do is enumerate a few hundred unit tests stating the law over and over with different values for x and y. But what if you could just state the law itself and let your test harness make up the examples? While mathematics is obviously amenable to such laws, the realm of lawful programming extends far beyond mathematics.

Consider for example that you have some function you want to write which has an implementation that's entirely straightforward yet painfully slow, and another implementation that's horrifically complex but fast. Naturally we'd like to use the fast version, but how can we know it's implemented correctly? The slow implementation is easy to decipher so it's easy to spot check for being correct. The easiest way to test the complex version is to implement both, dictate a law stating they're equal (i.e. they give the same output for the same inputs), and then let our test harness try a few thousand arbitrary inputs and check that the outputs are indeed the same.

If that idea sounds so outrageous, so unfathomable, so simple that it just might work then QuickCheck is the test harness that made it happen. For those who've been around the Haskell community for a while, you may be wondering why I'd bring up this oft-trotted-out example of the beauty of pure functional programming. Well, other than introducing the new folks, QuickCheck ain't the only lawful kid on the block.

SmallCheck, LazySmallCheck, SparseCheck, StrictCheck,... ChasingBottoms, HUnit, HTF )

Edit 2008.07.15: Oh, and don't forget HPC

winterkoninkje: shadowcrane (clean) (Default)

Now that the summer's here I've been hoping to get back into posting more regularly. Over the last year I've accumulated a large number of "to read" papers on various topics having to do with functional and logic programming. And now that the summer's here I'm finally getting a chance to go back and read them. That sounds like two great tastes that taste great together, so let the linkblogging begin.

Functional programming is historically maligned with having less efficient implementations than imperative languages do, mostly because historically that's been the case. Note the emphasis on "historically". However, if you're not careful, you may be perpetuating that sad state of history. If you think —being a non-academic programmer— that academic papers don't affect you, quoth dons on #haskell: programming is not for the naïve.

In functional languages the list data structure is foundationally ubiquitous, and in lazy languages like Haskell they also abstract over the recursion pattern of loops in imperative languages. Unfortunately this excessive use of lists places an enormous burden on allocating and reclaiming memory, which is a major source of historical inefficiencies in functional language implementations. Fortunately, much research has been done to overcome these issues... if you take advantage of it.

One of the great claims of high-level (mathematically-rigorous) languages is that the altitude gives the programmer the ability to program prettily and idiomatically, while making the true invariants more visible to the compiler thus allowing it the freedom to optimize as far as (or further than) illegible programming would've let you. Today's papers are about approaching that goal through "deforestation" techniques which are a subset of "fusion" optimizations. All of today's papers are about automating these optimizations, but compilers are complex and it takes time to work current research into them. So today's post is also a PSA: only you can start forest fires, use libraries which optimize your lists away.

Three papers to start you on your way to deforestation )
RSS Atom

March 2017

S M T W T F S
    1234
567891011
12131415161718
19202122232425
262728293031 

Tags

Page generated 28 Mar 2017 05:43 pm
Powered by Dreamwidth Studios