wren romano ([personal profile] winterkoninkje) wrote, 2005-07-07 01:29 am

Social Fallacies, Millibits, Information Entropy, and the purity of the English language

As I mentioned in my last post, I spent much of yesterday in an extended web of internet spidering. Below are some of the paths this vaguery followed.

Geeks are renowned for their... peculiar social skills. And even within the realm of geekdom, some are considered particularly lacking. Michael Suileabhain-Wilson wrote an opinion piece discussing the Five Geek Social Fallacies that lead to some of the more egregious examples of this.

In a public service announcement on LiveJournal, [livejournal.com profile] cerebrate mentioned that "mb" is an abbreviation for millibit, which sparked a discussion about whether such a thing is even possible. That in turn prompted another post, by [livejournal.com profile] lederhosen, on information entropy, which I found surprisingly interesting. I say surprising because I usually have little interest in higher-level (read: post-calculus, particularly logical) mathematics, since it tends to be overly omphaloskeptic and offers very little of use to the non-theoretical (read: real) world. I'm not sure exactly what I found so interesting about it, but it was... pleasant. If it strikes your fancy, Lederhosen also posted some follow-up links to Wikipedia and to Shannon's seminal paper (Shannon being the one who came up with the whole idea).
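
For the curious, a millibit isn't as absurd as it sounds: Shannon's entropy measures the average information per outcome, and nothing says that has to be a whole number of bits. Here's a minimal sketch in Python (my own illustration, not anything from the linked posts) showing a heavily loaded coin whose flips each carry only a couple of millibits:

    import math

    def entropy(p):
        """Shannon entropy, in bits, of a coin that lands heads with probability p."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    print(entropy(0.5))      # 1.0 bit per flip for a fair coin
    print(entropy(0.0001))   # ~0.0015 bits per flip, i.e. about 1.5 millibits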

One possible reason I found it interesting has to do with another post by [livejournal.com profile] lederhosen, on the linguistic side of information theory. In that post he lays out an example of designing a language of semaphores, which I found particularly interesting for (a) its resemblance to "designing" the phonetics of a natural language and (b) the way it seems to imply an emergent linguistic component to information.

As an example of the former: in developing a language, the more possible "settings" each symbol can take, the higher the potential bandwidth. It takes fewer symbols (that is, digits) to write out (and hence transmit, receive, etc.) a number in decimal notation than in binary notation, because each of those smallest chunks of information has more potential settings and hence carries more information. So in the semaphore example, the more colors of flag you have, the higher the bandwidth. But there's a limit: there comes a point where the different colors become indistinguishable, or where the cost in time and effort of distinguishing a single information-rich symbol exceeds the cost of distinguishing a series of less informative symbols that convey the same total amount of information.
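
To put numbers on that: the information carried by one symbol is just the base-two logarithm of how many settings it can take, so a decimal digit carries log2(10), about 3.32 bits, where a binary digit carries exactly one. A quick sketch (mine, with an arbitrarily chosen number, just to show the arithmetic):

    import math

    def digits_needed(n, base):
        """How many base-`base` digits it takes to write out the number n."""
        return math.ceil(math.log(n + 1, base))

    print(math.log2(10))               # ~3.32 bits per decimal digit
    print(digits_needed(1000000, 10))  # 7 decimal digits
    print(digits_needed(1000000, 2))   # 20 binary digits for the same number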

In natural language there is a phenomenal number of different sounds one can make, and different languages use different subsets of them. So why isn't there a language that uses all the possible sounds, to achieve the highest bandwidth? For the same reasons: if there were a language that used aspirated, unaspirated, ejective, and injective stops (let alone other consonants), it would take a lot of effort to distinguish all those sounds, remember which sounds are associated with which words, pair the word heard with the meaning in your head, and so on. Which isn't to say that no languages use those distinctions, but those that do usually limit themselves to one or two of them, which makes it much easier to discriminate between them.

Another similarity to the semaphore example is that linguistic information can be transmitted along multiple different axes. With semaphores you're not limited to the color of the flags: you can also use their quantity (probably no more than two, given human physiology), different motions of waving them, different positions of holding them (overhead, out to the sides, ...), etc. In natural languages there's an analogous list of axes: place of articulation, manner of articulation, voicing, ...
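
If the axes really are independent then their contributions multiply, or, counted in bits, simply add. A tiny sketch with invented axis sizes (the numbers are mine, purely for illustration):

    import math

    # Hypothetical inventory sizes for each independent axis of a semaphore signal.
    axes = {"flag color": 3, "flag count": 2, "motion": 4, "position": 3}

    total_settings = math.prod(axes.values())              # 3*2*4*3 = 72 distinct signals
    total_bits = sum(math.log2(n) for n in axes.values())  # ~6.17 bits per signal
    print(total_settings, total_bits)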

But perhaps the most intriguing thing is that certain linguistic phenomena seem almost to emerge from problems in information theory. In the semaphore example we have WWWWWWWWWW meaning "Happy birthday Captain Hornblower, also do you happen to have any spare lime juice? Ours has leaked.", and RRBWRRBWR meaning "Happy birthday Captain Sparrow, also do you happen to have any spare lime juice? Ours has leaked.", based on their relative importance compared to all the other messages encoded by the system. However, while computers may be able to interpret those codes quickly and accurately, WWWWWWWWWW and RRBWRRBWR share no structural similarity; nothing about them suggests that the two messages are at all alike. They're not at all user-friendly. Not to mention that assigning every possible message in a given language an arbitrary number based on its frequency, urgency, etc. is an inherently shortsighted approach: it provides no mechanism for creating novel constructions. So if Captain Hornblower got promoted to Admiral, we'd have to invent a new message meaning the same thing as WWWWWWWWWW but with "Captain" replaced by "Admiral". One can quickly see that this would lead to an exponentially increasing number of messages while never solving the problem of being able to say anything you like.
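
Ranking whole messages by how often they're needed and handing the most frequent ones the shortest signals is essentially what a Huffman code does, and it reproduces exactly this unfriendliness: related meanings get unrelated codewords. A small sketch (the message set and frequencies are invented; the heap-based construction is the standard algorithm, not anything from the linked posts):

    import heapq

    # Invented message frequencies, purely for illustration.
    freqs = {
        "happy birthday Capt. Hornblower, any spare lime juice?": 40,
        "happy birthday Capt. Sparrow, any spare lime juice?": 5,
        "enemy sighted": 30,
        "all clear": 25,
    }

    def huffman(freqs):
        """Build a binary Huffman code: a dict from message to codeword."""
        heap = [(f, i, {msg: ""}) for i, (msg, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {m: "0" + code for m, code in c1.items()}
            merged.update({m: "1" + code for m, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, counter, merged))
            counter += 1
        return heap[0][2]

    for msg, code in huffman(freqs).items():
        print(code, msg)
    # The frequent greeting gets a one-symbol codeword, the rare one a
    # three-symbol codeword, and the two share no structure at all despite
    # meaning nearly the same thing.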

One solution is to break the message up into a series of "words"; frex, RWRRB could mean "lime juice", and so any message related to lime juice would have at least that much in common. By having these separate words it becomes possible to mix and match them to create novel messages. Ironically, this actually reduces the compactness of the language, introducing inefficiency.
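
The cost and the payoff are both easy to put numbers on. With a vocabulary of, say, 50 words and four-word messages (invented figures, just for the sketch), spelling the message out word by word takes slightly more room than handing each possible message its own dedicated codeword, but any new combination of words comes for free:

    import math

    vocab_size = 50           # invented vocabulary size
    words_per_message = 4

    # Compositional scheme: one codeword per word, spelled out in sequence.
    bits_per_word = math.ceil(math.log2(vocab_size))           # 6 bits per word
    compositional = words_per_message * bits_per_word          # 24 bits per message

    # Whole-message scheme: one dedicated codeword per possible message.
    n_messages = vocab_size ** words_per_message               # 6,250,000 messages
    dedicated = math.ceil(math.log2(n_messages))               # 23 bits per message

    print(compositional, dedicated)
    # 24 vs 23 here, and a frequency-tuned whole-message code would beat the
    # compositional one by far more on the most common messages, at the price
    # of keeping a list of all 6,250,000 of them.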

Another problem is that if every possible combination of bits/sounds represents a legitimate word or message, then there's no margin for error, no room for mistakes, since both WWWWWWWWWW and WWWWWWWWWWW are valid and quite possibly have entirely different meanings. The solution here is, again, to reduce the efficiency of the encoding by introducing certain rules of well-formedness. So we can't have, say, five consonants in a row, ruling out words like "ktfpra". (For natural language there's another reason to forbid words like that, namely the limits of physiology, a.k.a. pronounceability.)

Of course, by adding in these phonetic rules we not only strengthen the ability to transmit information by having some measure of error-checking, but there's another interesting side effect: subsequent sounds become (to some degree) predictable. That predictability lets us understand others more readily in noisy environments; because not every sequence is allowed, we can separate signal from noise. It lets us figure out the second half of a word when the phone cuts out, or finish each other's sentences. Or, getting back to where this thread started: it decreases the entropy of the language.
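
Here's a toy version of that tradeoff, with a made-up four-sound alphabet and a single made-up well-formedness rule (nothing linguistically real about either; it's just to show the arithmetic): imposing the rule costs some entropy per sound, and in exchange ill-formed transmissions become detectably wrong.

    from itertools import product
    from math import log2

    consonants, vowels = "pt", "ai"
    alphabet = consonants + vowels

    def well_formed(word):
        """Toy phonotactic rule: no two consonants in a row."""
        return not any(a in consonants and b in consonants
                       for a, b in zip(word, word[1:]))

    length = 6
    words = ["".join(w) for w in product(alphabet, repeat=length)]
    valid = [w for w in words if well_formed(w)]

    print(log2(len(words)) / length)   # 2.00 bits per sound with no rules
    print(log2(len(valid)) / length)   # ~1.73 bits per sound under the rule

    # The payoff: a received word that breaks the rule is immediately suspect.
    print(well_formed("patipa"))   # True  -> plausible word
    print(well_formed("pattia"))   # False -> something got garbled in transit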