winterkoninkje | Eng: On tensors: are lists and axes the same? (Reply)

There's a conceptual problem that's been plaguing me with Eng. For a first writing on the language, this isn't the area I would normally choose. But the most logical starting points — e.g. lexical aliasing, code metastructure — I've a pretty good handle on. Those are topics I will delve into in the future, but for now, this is my problem.

One of the design considerations for Eng is to be able to provide crossplatform natively parallel code. [1] When I say this, one should bear in mind that this goal alone is as large or larger than all the other goals for the language combined. There's a specific facet to a specific part of my attempt to do this which is causing me troubles, but before that, some history.

Back in the 1980s parallelism was very much in vogue. Then, come the late-1980s/early-1990s when microprocessor speeds started bounding through the roof, everyone decided parallelism was stupid and abandoned most research on it. (Though of course, there are always a few diehards for any technology.) These days, as we're starting to reach the limits of current fabrication technologies, and with the shape of the curve between speed and cost and how it's evolved over time, the field of parallelism is starting to simmer again with a passion fit to boil over the coming decade.

To see this, just take a look at the current market: basic home computers are shipping with 2~4 processors, many of those are dual- or (will soon be) quad-core processors, or have hyperthreading. And if that home market doesn't convince you, then take a look at console game platforms like the xbox360 or the ps3 (or their previous counterparts). And this doesn't even get into the server market which is growing ever faster as computers become persistantly more ubiquitous tools for every endeavor and every business. Nor does it get into distributed systems like the microchip in your keyboard and mouse. Let alone the scientific markets for weather modelling, VLSI, and AI — i.e. those diehards who never left.

...

Now that you're convinced that a natively-parallel language is essential, the first part of my problem. There are two basic varieties of data in the universe: scalar data, and aggregate data; though the lines between them may not always be clear (cf. records or objects). We can clear that line up a bit if we discount records, or call them scalars, and only consider serialized aggregate data (e.g. arrays, matricies, etc). For this discussion we'll ignore the differences between different kinds of scalars, and just consider them unary widgets which can be put into serialized aggregate forms.

For reasons immediately apparent to any who've worked on SIMD systems, and perhaps even to those who've worked sufficiently on highly data-parallel programs on conventional SISD (aka non-parallel) systems, the normal presentation of arrays and other serialized aggregations is woefully simplistic and insufficient for dealing with massively data-parallel programs, and even for many other programs. Furthermore, that some languages have separate datatypes/syntaxes for first-order aggregations (arrays, vectors, etc) and second-order ones (matricies, 2D arrays, etc) is a kludge at best.

Luckily, there's a readily conceivable solution, though I don't know that anyone else has come up with it. The reasons built-in matrices, or the equally infamous arrays-of-arrays, and their ilk seem so kludgey is because as we look at them we can see that they're just an extended case of our basic dependable array. So we ask ourselves, why then do we treat them so differently? The ready solution is: we don't.

I've already started using some terminology which does this. In mathematics there's a concept called a tensor which is a generalization and abstraction of matrices and the like. A tensor is a multi-dimensional "thing" where the minimum number of dimensions needed to describe it is it's order (or rank). So when we're talking about arrays or vectors, we can also describe them as first-order tensors. And matrices are just second-order tensors. We can of course extend this upwards for 3rd-order tensors (cubes), 4th-order tensors (hypercubes), all the way up to infinity — though it becomes hard to visualize after four orders or so. We can also extend this downwards: a scalar, is simply a zeroth-order tensor.

In Eng, I propose that the built-in serialized datastructure is not an array nor a matrix, but rather that it is a tensor in all its abstraction. Which of course means, that we have no scalars, since scalars are just one extremum of a tensor. Now, maybe we don't want to call them tensors — maybe another name which doesn't bring to mind the horrors of linear algebra would be desirable — but the idea is the same, whatever you call it. For a fuller argument on the complexities of having tensors as your basic data type, what the benefits are, and why I've been motivated to head that direction, I will write another essay at some point.

Some more terminology is helpful for talking about tensors. The order of a tensor can also be described as the count of its axes. And every axis has a length or dimension describing how many elements it has along that axis. [2] The shape of a tensor is a full description of it's axes and each of their dimensions; this is helpful because we can have two kth-order tensors which still have a different shape (i.e. arrays can be of different lengths, matrices can be square or long or wide, etc). When performing operations on tensors their shapes must be compatible, not necessarily the same shape but shapes which "fit" together for the operation in question. For an example of different but compatible shapes, consider matrix multiplication which takes an AxB input and a BxA input.

...

Another topic I'll need to justify the importance of in a separate essay is, there's an irony in making a natively-parallel mid-level language. The idea behind MLLs is to minimize the abstraction in syntax (when compared to HLLs; they provide plenty of abstraction when compared to LLLs). And yet, say we have an array of elements that we want to perform some function on. With a conventional MLL we'd write a for-loop and iterate over all the elements. The problem is for parallel programs, the compiler has no way to look at that loop and determine that those operations could be done in parallel. And so, for parallelism, it's helpful to abstract the parallelism into the syntax.

In APL there's the notion of "verbs" and "adverbs" [3]. Verbs include functions, methods, and operators — they're things you do, operations to perform. Adverbs instead are things which affect how you perform those operations. I like the notion of adverbs because they allow for some very sophisticated code (cf. APL's generalized dot-product), and so I'd like to have them in some form or another. [4]

For an example, in most languages addition is a binary infix operator, so it takes arguments like a + b. If we want to get the summation of a list, we could use more than one addition as so: a + b +...+ z. Unfortunately this isn't always possible and seems over-redundant. In a language with adverbs we could use something like a "list-valence adverb" to modify the addition operator so that it takes a list of arguments instead of just two. So we could change the example to something like: +\ a, b, ..., z. [5]

Another nice thing that this sort of list-valence adverb provides is an abstraction for parallelism. It's easy for the compiler to look at one operation taking a large number of arguments and to decide that this should be done in parallel. With many individual calls where associativity and precedence and the like come into play, it's much harder.

...

An aside so the quandry presented later on will make sense. Now that we have these tensors, we need some way to extract elements from them to perform actions on. We'd have to be able to pull out individual elements, as a trivial requirement. But it'd also be nice to be able to pull out "slices" of data as well. For example, say we have a 3x3x3 cube of elements (that is, a third-order tensor with all axes having a dimension of 3). Perhaps this represents the contents of a room. Now, say we want to find out something about the room based on height, e.g. to discover that there are more objects in the lower portion of the room than the upper. It'd be helpful to be able to slice that cube up by its height into a number of "stacked plates" [6] so that we can find the number of objects in each "plate" to compare them. Naturally, each "plate" can be manipulated in parallel so it'd be nice to return a list of them to operate on at once, rather than needing a loop to iterate over them. Additionally, it'd be nice not to need to specify anything about any of the other dimensions than the one we're slicing along, perhaps trivial for our cube, but what if we were working with a 256th-order tensor?

Here's some working syntax I've been using, though like any working syntax the exact symbols could be changed to something better later on. So first we have the tensor in it's entirety, which could be referred to with A. To specify things about the dimensions of the tensor we use a suffix like [pos:axis] where pos is the position along the axis, and axis specifies which axis we're referring to. For example, we could get a corner element from A by typing A[0:0][0:1][0:2]. Since most tensors will be of low order and their lowest axes will be accessed the most frequently, the :axis portion can be optional with the implicit dimension being 0 for the first one, then 1 for the second, 2 for the third, etc; thus simplifying that corner element to A[0][0][0]. [7]

So far so good. Now, what about our example of wanting to get a stack of plates from a cube (that is, a list of second-order tensors corresponding to the third-order tensor). Well, what if we use the normal syntax for specifying details about an axis, but leave off the position, as in A[:0]? That seems intuitive enough. Now, if we want just one of those slices instead of all of them we could use A[0] (or A[0:0]). Note the difference between this and the fully-qualified A[0][0][0]; the first returns a 2nd-order tensor, whereas the latter returns a 0th-order tensor; we could also do A[0][0] to return a 1st-order tensor. And if we wanted more than one slice but not all of them, we could use A[0,1] (or A[0,1:0]) to return the first and the second (but discard the third).

For convenience of programming, we can also take a note from Perl. In Perl, since the length of arrays is known, we can use negative indices to count from the end of a dimension of a tensor. So -1 would be the last element, -2 the next to last, etc. For statically allocated tensors this ability is trivial to add, but even for dynamically allocated ones it should not be too difficult to implement.

...

Enter my quandry. So we have this list-valence adverb, \, which changes our binary infix operator for addition, + and turns it into an n-ary prefix operator for the sum of a list, +\. [8] And we have this notion of slices which returns a list of lower order tensors. So, if we had an array of numbers and wanted their sum, we could do something like +\ A[:0]. Or if we had a 2nd-order tensor and wanted to get a 1st-order tensor with the sums of the columns, we could do +\ A[:1] with the addition operator performing "straight-through" addition. (More on the issue of dot-products and tensor-products in a later essay.)

Now, at a first glance, a list isn't anything different than an array, and so shouldn't be any different than an axis in a tensor. Conceptually that's a very elegant idea, and in my minimalistic mindset I like it. However, there are some problems it raises. We introduced the list-valence adverb to simplify getting the sum (or product, or...) of a number of elements, both to improve parallelism and optimization, but also for easing clarity and code maintenance. So what if we wanted to do that column sum for a large number of tensors? We can't just do +\ A[:1], B[:1],... since that would give us the sum of the whole list (that is, lists are flat). We could use a sequence of +\ A[:1], +\ B[:1], etc or put it in a loop, but that's disgustingly kludgey. Maybe we could introduce a new slicewise adverb so we could do +:1 A, B,... to return a list of the 1st-order tensors.

But if a list is the same thing as a tensor's dimension, then the list-valence adverb is really just turns a binary operation which takes kth-order tensors into a unary operator which takes a single (k+1)th-order tensor with the new axis having a dimension equal to the length of the list. If that's the case, then the list-valence adverb is nearly identical to this new adverb. But that also throws off which number we use to refer to which axis, and what number do we use for the list axis? Zero and positive numbers are taken for indices, and the negatives are used for the convenient count-from-the-end indices so we can't use them either.

They both look like reduction adverbs. List-valence takes a list of kth-order tensors (or a (k+1)th-order tensor), and returns a single kth-order tensor. This new slicewise adverb takes a list of kth-order tensors (or one (k+1)th-order), and returns a list of (k-1)th-order tensors (or a kth-order tensor adding a new axis with dimenstion equal to the length of the input list, and removing the axis specified [9]). Whether we consider a list the same as an axis or not, there's a subtle distinction there, and I'm not sure whether it's relevant or not.

Of course, if we consider a list and an axis the same, then there's also potential for confusion with the list-valence adverb. What does +\ A mean? Is it the straight-through summation of a list of tensors (of which there happens to only be one and so the output is A), or does it slice A by it's first axis? This confusion is largely predicated on whether we consider the commata delimiting a list to be a constructor for this new larger tensor or not.

And then there's always the issue that in a tensor, the dimension of an axis is the same regardless of the position on other axes, i.e. any way we slice a tensor we'll get a list of lower order tensors each of which has the same shape (though shapes may differ between different lists). While this is handy for many purposes it's inconvenient for others, such as times when we may need a list of tensors with different shapes.

I suppose, in the end, it's probably not an issue where there's a "right" and a "wrong" answer. But rather I'll just have to pick one or the other — either a list is an axis, or it is not — and stick to it. Having talked through the complications raised by the question, namely the last one about lists of different shaped tensors, I'm thinking it's probably best to treat them as separate. But just as we have the slicing mechanism to convert an axis to a list, we should have some construction mechanism for turning a list into an axis.

[1] Specifically for "typical" home machines and servers rather than scientific machines, but good solutions to the former will doubtless lead to good solutions for the latter, even if they are not the same, as scientific computers are often a "simpler" form of parallelism than home machines, though there's much more of it.

[2] Of course it's possible to think of continuous tensors rather than discrete ones, in which case the axis' dimension would be a description of the range of values that exist along that axis. And we can have non-numerical or unordered axes as well, axes whose positions are names rather than integers.

[3] That's J terminology. In APL they're called "functions" and "operators" respectively, an unfortunate choice of words since it confuses those familiar with the conventional definitions, and doesn't distinguish user-defined functions like foo() and built-in 'functions' like +. Because of this confusion, J which is a modern successor to APL changed the jargon, and I'll use theirs to minimize the confusion.

[4] Aficionados of functional programming languages may point out that these sound simply like higher-order functions. And, I suppose, in a sense they are, and that may be a good way to implement the functionality. However I think that just as it's helpful to distinguish user-defined functions and operators, it's helpful to distinguish user-defined higher-order functions and adverbs.

[5] We could also just change the addition operator to be a prefix operator which takes a list, as it is in Lisp. However, there's something about the infix form which seems more intuitive and natural, and so it'd be nice to keep it around if possible.

[6] That is — depending on how we have the tensor organized — slicing along the first axis. We could also slice along the other axes as well. Since the original tensor was a cube, slicing along any axis will create a list of the same length comprised of 2nd-order tensors of the same shape (square). However, as we visualize this, it's apparent that there's a subtle difference between the items in these lists — namely, their orientation. Capturing this will require that we conceptualize all kth-order tensors as existing in an infinite-order universe within which they can be rotated. Yet again, the details of this are fodder for another essay.

[7] This simple explanation could also be extrapolated. So say we have that 256th-order tensor and we want to specify the 70th through 80th axes. We could use something like A[0:70][0]...[0], where the lack of :axis means use the previous axis plus one, or use the 0th axis if there are no previously specified axes.

[8] We could do similar things with -\, *\, or even some user-defined function foo\.

[9] Or possibly it returns a list of kth-order tensors or a single (k+1)th-order tensor, with the specified axis having a dimension of 1, to preserve sanity of axis numbering.