This summer I've been working on optimizing compilation for a linear algebra DSL. This is an extension of Jeremy Siek's work on Built-to-Order BLAS functions. Often times it's more efficient to have a specialized function which fuses two or more BLAS functions. The idea behind BTO is that we'd like to specify these functions at a high level (i.e., with liner algebra expressions) and then automatically perform the optimizing transformations which have made BLAS such a central component of linear algebra computations.

The current/prior version of BTO already handles loop fusion, memory bandwidth constraints, and more. However, it is not currently aware any high-level algebraic laws such as the fact that matrix multiplication is associative, addition is associative and commutative, transposition reverse-distributes over multiplication, etc. My goal is to make it aware of these sort of things.

Along the way, one thing to do is solve the chain multiplication problem: given an expression like `∏[x1,x2...xN]`

figure out the most efficient associativity for implementing it via binary multiplication. The standard solution is to use a CKY-like dynamic programming algorithm to construct a tree covering the sequence `[x1,x2...xN]`

. This is easy to implement, but it takes `O(n^3)`

time and `O(n^2)`

space.

I found a delicious alternative algorithm which solves the problem in `O(n*log n)`

time and `O(n)`

space! The key to this algorithm is to view the problem as determining a triangulation of convex polygons. That is, we can view `[x0,x1,x2...xN]`

as the edges of a convex polygon, where `x0`

is the result of computing `∏[x1,x2...xN]`

. This amazing algorithm is described in the tech report by Hu and Shing (1981a), which includes a reference implementation in Pascal. Unfortunately the TR contains a number of typos and typesetting issues, but it's still pretty legible. A cleaner version of Part I is available here. And pay-walled presumably-cleaner versions of Part I and Part II are available from SIAM.

Hu and Shing (1981b) also have an algorithm which is simpler to implement and returns a heuristic answer in `O(n)`

time, with the error ratio bounded by 15%. So if compile times are more important than running times, then you can use this version as well. A pay-walled version of the article is available from Elsevier.