I've been mucking about with real analysis lately. One of the things I've always hated about real analysis (and complex analysis, and other calculus) is the unrepentant lack of types. It's easy to get confused about what's going on and what goes where when you think everything is a real number (or complex number, or tensors thereof). In particular, because of this "everything is a real array" assumption, libraries for performing optimization and the like end up with utterly inscrutable APIs and other nastiness, like needing to manually flatten your parameters into a single array. So this has got me thinking: what should the API for optimization libraries look like? Which in turn brings me back to an age-old conundrum I've toyed around with before, namely: what is the type of the derivative operator?
From the "everything is a real number" perspective we'd say deriv : (ℝ ↝ ℝ) → ℝ → ℝ
, where the A ↝ B
function space are "nice" functions (e.g., continuous, smooth, analytic, whatever). However, this only works out because we're conflating a number of important differences. For example, for any affine space A
we have the difference space ∆A
— that is, values in ∆A
denote differences or differentials of values in A
(e.g., for any a : A
and b : A
we have a-b : ∆A
). However, it turns out that there's a bijection between ℝ
and ∆ℝ
. It also turns out that the difference of arrays, ∆(ℝn)
, is isomorphic to an array of differences, (∆ℝ)n
. Putting these together we get the common conflation between points (ℝn
) and vectors (∆(ℝn)
). For another example, consider linear transformations, written A ⊸ B
. We have a bijection between linear transformations (ℝm ⊸ ℝn
) and matrices (ℝn×m
), and hence more specifically a bijection between ℝ
and ℝ ⊸ ℝ
.
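To make these distinctions concrete, here's a small Haskell sketch. Every name in it (Delta, Affine, Lin, toSlope, fromSlope) is made up for illustration; this isn't any existing library's API, just one way the types above could be spelled out:

```haskell
{-# LANGUAGE TypeFamilies #-}

-- | The difference space ∆A of an affine space A (hypothetical encoding).
type family Delta a

-- For the reals, ∆ℝ happens to be in bijection with ℝ itself...
type instance Delta Double = Double

-- ...and a difference of arrays is an array of differences, ∆(ℝⁿ) ≅ (∆ℝ)ⁿ
-- (plain lists stand in for arrays here).
type instance Delta [Double] = [Delta Double]

-- | Affine spaces: subtract two points to get a difference,
-- or offset a point by a difference.
class Affine a where
  diff   :: a -> a -> Delta a   -- a - b : ∆A
  offset :: a -> Delta a -> a   -- a + d : A

instance Affine Double where
  diff   = (-)
  offset = (+)

-- | A linear transformation A ⊸ B, kept abstract rather than
-- collapsed to its matrix/scalar representation.
newtype Lin a b = Lin { applyLin :: a -> b }

-- The bijection between ℝ ⊸ ℝ and ℝ: a linear map on the reals
-- is determined by its slope.
toSlope :: Lin Double Double -> Double
toSlope (Lin f) = f 1

fromSlope :: Double -> Lin Double Double
fromSlope s = Lin (s *)
```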
So here's what I'm currently thinking:
deriv : (F X ↝ Y) → F X → F(∆X ⊸ ∆Y)
For now, just ignore F; instantiate it as the identity functor. Thus, the total derivative of f : X ↝ Y at the point x : X is a linear transformation from minor perturbations about x to minor perturbations about f x. We can think of ∆X ⊸ ∆Y as being the "slope" of this mapping between minor perturbations. In particular, when X = ℝ and Y = ℝ, we get a bijection between ∆ℝ ⊸ ∆ℝ and ℝ, so this coincides with the traditional notion of the slope of a one-dimensional curve.
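As a quick sanity check on that reading, here's a minimal sketch of the identity-functor case, with the linear map estimated numerically by central differences. Lin, deriv1, and the step size are all invented for the example:

```haskell
-- (Lin repeated from the sketch above so this snippet stands alone.)
newtype Lin a b = Lin { applyLin :: a -> b }

-- deriv : (X ↝ Y) → X → (∆X ⊸ ∆Y), specialized to X = Y = ℝ with ∆ℝ ≅ ℝ,
-- and the "slope" estimated by a central difference.
deriv1 :: (Double -> Double) -> Double -> Lin Double Double
deriv1 f x = Lin (\dx -> slope * dx)
  where
    h     = 1e-6
    slope = (f (x + h) - f (x - h)) / (2 * h)

-- Example: the derivative of x² at 3 maps a perturbation dx to (≈6)·dx.
example :: Double
example = applyLin (deriv1 (\x -> x * x) 3) 0.5   -- ≈ 3.0
```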
Now, let's think about the gradient rather than the total derivative. Assume that F is a "nice" functor (i.e., representable, traversable, etc.). Because the functor is representable, we have an isomorphism between F A and I → A (natural in A), where I is the type of positions/indices in F. Thus, the gradient of a function g : F X ↝ Y at the point z : F X is essentially a function from F-indices, i : I, to the i-th partial derivative of g at z. Why partial derivatives? Because the linear transformation only takes ∆X as an argument, not ∆(F X), so we can only perturb a single "scalar" at a time; the other scalars in z must remain fixed.
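Here's a rough sketch of that reading, using a hand-rolled stand-in for the Representable class (the real one lives in the adjunctions package) and central differences for the partials. It also glosses over the covector issue raised in the first edit below; all names are illustrative:

```haskell
{-# LANGUAGE TypeFamilies, FlexibleContexts, DeriveFunctor #-}

-- A stand-in for Data.Functor.Rep.Representable from the adjunctions package.
class Functor f => Representable f where
  type Rep f
  index    :: f a -> Rep f -> a
  tabulate :: (Rep f -> a) -> f a

-- A toy container: pairs, indexed by Bool.
data Pair a = Pair a a deriving (Show, Functor)

instance Representable Pair where
  type Rep Pair = Bool
  index (Pair x _) False = x
  index (Pair _ y) True  = y
  tabulate g = Pair (g False) (g True)

-- | Perturb only the i-th "scalar" of the container; all others stay fixed.
perturbAt :: (Representable f, Eq (Rep f))
          => Rep f -> Double -> f Double -> f Double
perturbAt i h z = tabulate (\j -> index z j + if j == i then h else 0)

-- | gradient g z: for each index i, the i-th partial derivative of g at z.
gradient :: (Representable f, Eq (Rep f))
         => (f Double -> Double) -> f Double -> f Double
gradient g z = tabulate partial
  where
    h         = 1e-6
    partial i = (g (perturbAt i h z) - g (perturbAt i (-h) z)) / (2 * h)

-- Example: g (x, y) = x·y² at (3, 2) has partials (y², 2xy) = (4, 12).
exampleGrad :: Pair Double
exampleGrad = gradient (\(Pair x y) -> x * y * y) (Pair 3 2)
```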
Edit 2014.02.17 (#1): Actually, things are a bit more complicated than the above. For example, consider a function f : ℝⁿ ↝ ℝ. The derivative of f with respect to its vector of inputs is not a vector of partial derivatives; rather, it's a covector of partial derivatives! (I.e., deriv (f : ℝⁿˣ¹ ↝ ℝ) : ℝⁿˣ¹ → ℝ¹ˣⁿ.) Thus, we should really have that the return type of deriv is some Fᵒᵖ(∆X ⊸ ∆Y), where Fᵒᵖ is some sort of "dual" of F. It's not clear a priori which notion of "duality" is in play here, however.
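One possible (unverified) reading: if F ℝ is a vector space, then Fᵒᵖ at the scalars could be its dual space of linear functionals, so a "covector of partials" is just a linear map eating a perturbation of the whole input vector. A sketch under that assumption, with made-up names and numeric partials:

```haskell
newtype Lin a b = Lin { applyLin :: a -> b }

-- | The derivative of f : ℝⁿ ↝ ℝ at a point, packaged as a covector:
-- a linear functional on perturbations of the whole input vector
-- (plain lists stand in for ℝⁿ; partials via central differences).
derivCovector :: ([Double] -> Double) -> [Double] -> Lin [Double] Double
derivCovector f xs = Lin (\dxs -> sum (zipWith (*) partials dxs))
  where
    h        = 1e-6
    partials = [ (f (bump i h) - f (bump i (-h))) / (2 * h)
               | i <- [0 .. length xs - 1] ]
    bump i d = [ if j == i then x + d else x | (j, x) <- zip [0 ..] xs ]

-- For f (x, y) = x·y at (3, 2), the covector sends [1, 0] to ≈2 and [0, 1] to ≈3.
exampleCo :: Double
exampleCo = applyLin (derivCovector (\[x, y] -> x * y) [3, 2]) [1, 0]
```

Note that this collapses Fᵒᵖ(∆X ⊸ ∆Y) wholesale into ∆(F X) ⊸ ∆Y, which may or may not be the notion of duality intended above.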
Edit 2014.02.17 (#2): And here's what I'm currently thinking about how to incorporate the Jacobian into the above. Consider this particular instance of the derivative operator, deriv : (F X ↝ G Y) → F X → Fᵒᵖ(∆X ⊸ ∆(G Y)). When G is some type of vectors (e.g., G Y = Yⁿ for some fixed n), we get ∆(G Y) = G(∆Y), as mentioned before. And let's assume ∆X ⊸ G(∆Y) = G(∆X ⊸ ∆Y); presumably this follows immediately by linearity, though I haven't verified that yet. And finally, in our standard real analysis we get that G∘Fᵒᵖ and Fᵒᵖ∘G are the same — that is, vectors-of-covectors and covectors-of-vectors are essentially the same thing; they're just the row-major and column-major representations of matrices, respectively. Putting all these together, we get that the original return type Fᵒᵖ(∆X ⊸ ∆(G Y)) is isomorphic to G(Fᵒᵖ(∆X ⊸ ∆Y)). So when X = ℝ and Y = ℝ we get the expected deriv : (ℝᵐ ↝ ℝⁿ) → ℝᵐ → ℝⁿˣᵐ. Now, it remains to show how this extends to other choices of F and G...
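As a numeric check of that specialization, here's a throwaway sketch that lays the result out row-major, i.e. as G(Fᵒᵖ(∆X ⊸ ∆Y)): one covector of partials per output component. Everything here (plain lists for ℝᵐ and ℝⁿ, the step size, the names) is just for illustration:

```haskell
-- | jacobian f x : one row per output component, one column per input.
jacobian :: ([Double] -> [Double]) -> [Double] -> [[Double]]
jacobian f xs =
  [ [ (comp i (bump j h) - comp i (bump j (-h))) / (2 * h)
    | j <- [0 .. length xs - 1] ]           -- columns: the ℝᵐ inputs
  | i <- [0 .. length (f xs) - 1] ]         -- rows: the ℝⁿ outputs
  where
    h         = 1e-6
    comp i ys = f ys !! i
    bump j d  = [ if k == j then x + d else x | (k, x) <- zip [0 ..] xs ]

-- Example: f (x, y) = (x·y, x + y) at (3, 2) gives ≈ [[2, 3], [1, 1]].
exampleJac :: [[Double]]
exampleJac = jacobian (\[x, y] -> [x * y, x + y]) [3, 2]
```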
Edit 2014.02.17 (#3): It was pointed out on reddit that the above is "well known"— at least for those who are familiar with differentiable manifolds and tangent spaces. Now, I don't know much about manifolds, but it certainly wasn't well-known to me— which, in itself, says a lot about how little this knowledge has spread to the rest of mathematics. I'm glad to hear there's some followup reading to be done here, but I still don't think the necessary work has been done in terms of turning this into a decent API for optimization libraries.
no subject
Date: 2014-02-18 05:08 am (UTC)
If you're interested in dimensional analysis problems, you might also be interested in Maryam Moghadas' MSc thesis on dimensional analysis of inverse problems for fMRI. Most of the bugs in fMRI code come from dimension/type errors, so she was trying to rule those out by performing dimensional analysis in Haskell's type system. The key contribution was figuring out what those units/dimensions actually are (rather than doing the dimensional type-checking per se, which has been done before for standard physical units).
no subject
Date: 2014-02-18 05:16 am (UTC)
FWIW, this one is mostly off-the-cuff rambling in order to get my thoughts down for future reference. It's very much a work in progress. So I'd be curious about that other 50%, and how much is due to my taking shortcuts in writing things down vs. how much is due to conceptual issues.