graph_differentiation

Algebra

When we first learn about functions in algebra class, teachers often describe them as "black boxes" that transform inputs to outputs. This is usually accompanied by a picture of a box (representing the function) and some arrows (representing the inputs/outputs) like this,

[figure: a function drawn as a box that turns an input arrow x into an output arrow y = f(x)]

to help students make sense of the new abstract notation y=f(x). After that, the following math classes tend to use the algebraic notation almost exclusively. However, in many cases the graph representation of calculations (where functions are nodes, and values are edges) is the more useful of the two.

In this document we'll cover how thinking about programs as graphs helps us understand and implement algorithmic differentiation.


Calculus

Several years after that algebra class we wind up taking calculus, where we study how small changes in the inputs, $dx$, effect small changes in the output of the function, $dy$. When these changes in the input are "small enough", $dy$ and $dx$ are proportional to each other. The constant of proportionality, $\frac{dy}{dx}$, gets a special name ("the derivative") and it lets us write

$$dy = \frac{dy}{dx}\,dx = \frac{df}{dx}\,dx$$

If we take that output y and feed it into another function, g

[figure: composition graph x → f → y → g → z]

then the output z is (indirectly) a function of x as well: z(x)=g(f(x)). The "Chain Rule" describes how to differentiate nested functions like this:

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx} = g'(y)\,f'(x)$$

That is, the derivative of the composite function is the product of the derivatives of the individual steps.

 

Forward Mode Interpretation

Let's take that chain rule expression above and multiply both sides of the equation by $dx$ in order to write

$$dz = \frac{dz}{dy}\,\frac{dy}{dx}\,dx = g'(y)\,f'(x)\,dx$$

From here, we can group certain terms together to keep track of the intermediate differential quantities

$$dz = \frac{dz}{dy}\,\underbrace{\left(\frac{dy}{dx}\,dx\right)}_{dy} = g'(y)\,\underbrace{\left(f'(x)\,dx\right)}_{dy}$$

With all of the intermediate differential quantities accounted for now, we can draw a graph representation for them just like we did for the original calculation:

[figure: forward mode graph, where the edges carry differentials (dx, dy, dz) and the nodes multiply by the local derivatives f'(x) and g'(y)]

This is the graph associated with forward mode differentiation. There are a few important things to notice about it:

If we want to compute derivatives of quantities w.r.t. x, we can evaluate this graph with the input $dx = 1$ (or the appropriate identity element for more complicated types).
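
To make this concrete, here's a small C++ sketch of forward mode for the two-function chain above. The particular choices of f and g are just illustrative, and the `Dual` type is a hand-rolled stand-in for what an AD library would provide:

```cpp
#include <cmath>
#include <cstdio>

// A value bundled with its differential; this pair travels along each edge
// of the forward mode graph.
struct Dual {
  double value;
  double d;  // differential of `value` w.r.t. the chosen input
};

// Example nodes (arbitrary choices): each computes its output value and also
// multiplies the incoming differential by its local derivative.
Dual f(Dual x) { return {std::sin(x.value), std::cos(x.value) * x.d}; }  // f(x) = sin(x)
Dual g(Dual y) { return {y.value * y.value, 2.0 * y.value * y.d}; }      // g(y) = y^2

int main() {
  Dual x{0.5, 1.0};  // seed dx = 1 to get derivatives w.r.t. x
  Dual z = g(f(x));  // z.d now holds dz/dx = g'(y) * f'(x)
  std::printf("z = %f, dz/dx = %f\n", z.value, z.d);
}
```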

Reverse Mode Interpretation

Now let's go back to the chain rule equation and multiply both sides by $\frac{d\square}{dz}$ on the left, where $\square$ is some placeholder representing any possible quantity

$$\frac{d\square}{dx} = \frac{d\square}{dz}\,\frac{dz}{dx} = \frac{d\square}{dz}\,\frac{dz}{dy}\,\frac{dy}{dx} = \frac{d\square}{dz}\,g'(y)\,f'(x)$$

Since this equation is the same for any such quantity $\square$, let's just leave it out of the expressions to make things a little more succinct.

$$\frac{d}{dx} = \frac{d}{dz}\,\frac{dz}{dy}\,\frac{dy}{dx} = \frac{d}{dz}\,g'(y)\,f'(x)$$

Just like with the forward mode derivation, we'll group terms together. Except this time, let's see what happens when grouping terms from the left, rather than the right:

$$\frac{d}{dx} = \underbrace{\left(\frac{d}{dz}\,\frac{dz}{dy}\right)}_{\frac{d}{dy}}\frac{dy}{dx} = \underbrace{\left(\frac{d}{dz}\,g'(y)\right)}_{\frac{d}{dy}}f'(x)$$

Again, this grouping of terms reveals information about the intermediate quantities. In this case, those intermediates are the derivatives w.r.t. edge data. I think this is easier to understand as a graph, so let's draw a picture

[figure: reverse mode graph, with the edges reversed, carrying derivatives w.r.t. z, y, and x, and the nodes multiplying by g'(y) and f'(x)]

This is the graph associated with reverse mode differentiation (also referred to as "the adjoint method" and "backpropagation"). There are a few important things to notice about it:

If we want to compute the derivative of some quantity w.r.t. x, we can evaluate this graph with the "input" derivative w.r.t. z. If the quantity we want to differentiate is z itself, then we feed in $\frac{dz}{dz} = 1$ as the input to the reverse mode graph.
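
Here's the same two-function chain written as a hand-rolled reverse mode pass in C++ (again with arbitrary f and g): run the original calculation forward, remember the intermediate values, then sweep backward multiplying by each node's local derivative.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Forward pass: evaluate the original graph and keep the intermediates.
  double x = 0.5;
  double y = std::sin(x);  // y = f(x)
  double z = y * y;        // z = g(y)

  // Reverse pass: seed with dz/dz = 1 (we're differentiating z itself),
  // then walk the graph right-to-left, multiplying by local derivatives.
  double d_dz = 1.0;
  double d_dy = d_dz * (2.0 * y);    // d/dy = (d/dz) * g'(y)
  double d_dx = d_dy * std::cos(x);  // d/dx = (d/dy) * f'(x)

  std::printf("z = %f, dz/dx = %f\n", z, d_dx);
}
```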


Multivariable Calculus

1D calculus is a good start, but in practice, calculations frequently involve functions with multiple inputs and outputs. So, let's briefly review how derivatives work for that case. For example, here's an arbitrary C++ function that has 2 inputs and 3 outputs:
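
For concreteness, it might look something like this (the specific operations are arbitrary placeholders; only the 2-inputs/3-outputs shape matters):

```cpp
#include <array>
#include <cmath>

// An arbitrary function with 2 inputs (x1, x2) and 3 outputs (y1, y2, y3).
std::array<double, 3> f(double x1, double x2) {
  double y1 = x1 * x2;
  double y2 = std::sin(x1) + x2;
  double y3 = std::exp(x1 - x2);
  return {y1, y2, y3};
}
```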

In this case, there are a total of 6 independent partial derivatives: $\frac{\partial y_1}{\partial x_1}, \frac{\partial y_1}{\partial x_2}, \frac{\partial y_2}{\partial x_1}, \frac{\partial y_2}{\partial x_2}, \frac{\partial y_3}{\partial x_1}, \frac{\partial y_3}{\partial x_2}$.

Technically, partial derivatives use the symbol $\partial$ instead of $d$, but for the rest of the document I'll just use $d$ so all the notation on the graphs is consistent.

A natural way to arrange these partial derivatives is in a matrix (called the Jacobian) with as many rows as outputs and as many columns as inputs.

$$\frac{d\{y_1,y_2,y_3\}}{d\{x_1,x_2\}} = J(x_1,x_2) := \begin{bmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} \\ \frac{dy_3}{dx_1} & \frac{dy_3}{dx_2} \end{bmatrix}$$

The reason for doing this is that it lets us use matrix multiplication notation to compactly represent both forward mode and reverse mode differentiation operations:

 

Forward mode: $$\begin{bmatrix} dy_1 \\ dy_2 \\ dy_3 \end{bmatrix} = d\{y_1,y_2,y_3\} = \frac{d\{y_1,y_2,y_3\}}{d\{x_1,x_2\}}\, d\{x_1,x_2\} = \underbrace{\begin{bmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} \\ \frac{dy_3}{dx_1} & \frac{dy_3}{dx_2} \end{bmatrix}}_{J} \begin{bmatrix} dx_1 \\ dx_2 \end{bmatrix}$$

 

Reverse mode: $$\begin{bmatrix} \frac{d}{dx_1} & \frac{d}{dx_2} \end{bmatrix} = \frac{d}{d\{x_1,x_2\}} = \frac{d}{d\{y_1,y_2,y_3\}}\,\frac{d\{y_1,y_2,y_3\}}{d\{x_1,x_2\}} = \begin{bmatrix} \frac{d}{dy_1} & \frac{d}{dy_2} & \frac{d}{dy_3} \end{bmatrix} \underbrace{\begin{bmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} \\ \frac{dy_3}{dx_1} & \frac{dy_3}{dx_2} \end{bmatrix}}_{J}$$

 

Some AD libraries use the terms "JVP" and "VJP" as an alternate way of referring to forward and reverse mode, respectively. Those terms stand for "Jacobian-Vector Product" and "Vector-Jacobian Product", referring to whether the vector appears on the right or left of the Jacobian in the product.
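
To make the JVP/VJP distinction concrete, here's a small sketch using plain arrays for the 3×2 Jacobian shape above (not tied to any particular AD library):

```cpp
#include <array>
#include <cstdio>

using Jacobian = std::array<std::array<double, 2>, 3>;  // 3 outputs x 2 inputs

// Forward mode / JVP: J (3x2) times a direction in input space (2x1) -> 3x1.
std::array<double, 3> jvp(const Jacobian& J, const std::array<double, 2>& dx) {
  std::array<double, 3> dy{};
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 2; ++j) dy[i] += J[i][j] * dx[j];
  return dy;
}

// Reverse mode / VJP: a row vector in output space (1x3) times J (3x2) -> 1x2.
std::array<double, 2> vjp(const std::array<double, 3>& d_dy, const Jacobian& J) {
  std::array<double, 2> d_dx{};
  for (int j = 0; j < 2; ++j)
    for (int i = 0; i < 3; ++i) d_dx[j] += d_dy[i] * J[i][j];
  return d_dx;
}

int main() {
  Jacobian J = {{{1, 2}, {3, 4}, {5, 6}}};
  auto dy = jvp(J, {1.0, 0.0});       // picks out the first column of J
  auto dx = vjp({1.0, 0.0, 0.0}, J);  // picks out the first row of J
  std::printf("JVP: %g %g %g   VJP: %g %g\n", dy[0], dy[1], dy[2], dx[0], dx[1]);
}
```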

 


 

More Complicated Example

Let's apply our new intuition for forward and reverse mode differentiation to a slightly more involved calculation given in C++ below:
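
Something with the following shape works for this discussion: two inputs, one output, with x1 used in two places. The specific operations here are placeholders chosen only for illustration:

```cpp
#include <cmath>

// Two inputs, one output. Note that x1 is used by two different operations,
// which will matter in the "Variable Reuse" aside below.
double calc(double x1, double x2) {
  double y = x1 * x2;           // first use of x1
  double z = std::sin(x1) + y;  // second use of x1
  double w = z * z;
  return w;
}
```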

Start off by drawing the graph representation of the calculation, where edges are the variables and nodes are the calculations

[figure: graph of the example calculation, with variables as edges and operations as nodes]

and then we can apply the rules we learned about forward and reverse mode from the earlier example.


Forward Mode

Recall that in order to make the graph for forward-mode, we replace each edge by a new edge carrying the differential of the original quantity and replace the nodes by their derivatives (evaluated at their respective function's inputs) to get:

[figure: forward mode graph for the example calculation]

From here, we can compute different kinds of derivatives by feeding in different values for $dx_1, dx_2$: seeding the graph with $(dx_1, dx_2) = (1, 0)$ gives derivatives w.r.t. $x_1$, while $(dx_1, dx_2) = (0, 1)$ gives derivatives w.r.t. $x_2$.
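
For instance, here's a hand-written forward mode version of the placeholder calculation sketched above, where the seed $(dx_1, dx_2)$ determines which derivative comes out the other end:

```cpp
#include <cmath>
#include <cstdio>

// Forward mode for the placeholder calculation: each value is carried along
// with its differential, and the seed (dx1, dx2) selects the derivative.
double calc_fwd(double x1, double x2, double dx1, double dx2, double* dw) {
  double y = x1 * x2;
  double dy = dx1 * x2 + x1 * dx2;

  double z = std::sin(x1) + y;
  double dz = std::cos(x1) * dx1 + dy;

  double w = z * z;
  *dw = 2.0 * z * dz;
  return w;
}

int main() {
  double dw_dx1 = 0.0, dw_dx2 = 0.0;
  calc_fwd(0.5, 2.0, 1.0, 0.0, &dw_dx1);  // seed (1, 0) -> dw/dx1
  calc_fwd(0.5, 2.0, 0.0, 1.0, &dw_dx2);  // seed (0, 1) -> dw/dx2
  std::printf("dw/dx1 = %f, dw/dx2 = %f\n", dw_dx1, dw_dx2);
}
```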

Reverse Mode

In reverse mode, we replace each edge by a new edge (in the opposite direction) that carries the derivative w.r.t. the original quantity and replace the nodes by their derivatives (evaluated at their respective function's inputs) to get:

[figure: reverse mode graph for the example calculation]

In reverse mode, if we provide a single value of $\frac{d}{dw}$ as the "input" (from the right), then when we carry out the calculations on this graph, we will have computed both $\frac{d}{dx_1}$ and $\frac{d}{dx_2}$. This is different from forward mode, which only gives a derivative w.r.t. one quantity at a time.

If we're after the quantities $\frac{dw}{dx_1}, \frac{dw}{dx_2}$, then we can just set $\square = w$ and pass in the number 1 as our input from the right.
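
And here's the corresponding hand-written reverse mode pass for the same placeholder calculation: one forward sweep to record values, then one backward sweep seeded with 1 that produces both derivatives at once. Note the `+=` where the two uses of x1 come back together; the aside below explains why.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  double x1 = 0.5, x2 = 2.0;

  // Forward pass: evaluate and remember the intermediates.
  double y = x1 * x2;
  double z = std::sin(x1) + y;
  double w = z * z;

  // Reverse pass: seed with dw/dw = 1, then walk the graph backwards.
  double d_dw = 1.0;
  double d_dz = d_dw * (2.0 * z);      // w = z^2

  double d_dy = d_dz;                  // z = sin(x1) + y, so dz/dy = 1
  double d_dx1 = d_dz * std::cos(x1);  // ...and dz/dx1 = cos(x1)

  d_dx1 += d_dy * x2;                  // y = x1 * x2: second contribution to x1
  double d_dx2 = d_dy * x1;

  std::printf("w = %f, dw/dx1 = %f, dw/dx2 = %f\n", w, d_dx1, d_dx2);
}
```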


Aside: Variable Reuse

If you were looking closely, you may have noticed that there was a small problem with the way I drew the graph for this "more complicated" example: the edge associated with input value x1 splits and goes off to two different functions. Technically, the edges in a graph should have a single starting point and a single endpoint. But in practice, that requirement feels awkward because programming languages have no problem letting you use a variable more than once.

The way I think about it is that drawings like the one below

[figure: graph in which the edge for x1 splits and feeds two different nodes]

are really just a convenient short-hand notation for a more "proper" graph definition, where the splitting point is itself a node in the graph as well:

[figure: the same graph with an explicit copy node where x1 splits]

The actual calculation happening at that node is trivial: it's just copying the input value to the outputs. We could define this copy function as

$$\mathrm{copy}(x) := \begin{bmatrix} x \\ x \\ \vdots \\ x \end{bmatrix}, \qquad \frac{d\,\mathrm{copy}}{dx} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$

This way, our graph is still well-formed. This also has some implications for reverse-mode, since it means that, when visiting a copy node on the reverse pass, we need to sum the relevant derivative terms:

$$\frac{d}{dx} = \begin{bmatrix} \frac{d}{dx^{(1)}} & \frac{d}{dx^{(2)}} & \cdots & \frac{d}{dx^{(n)}} \end{bmatrix} \underbrace{\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}}_{\frac{d\,\mathrm{copy}}{dx}} = \frac{d}{dx^{(1)}} + \frac{d}{dx^{(2)}} + \cdots + \frac{d}{dx^{(n)}}$$

Or, in graph form

[figure: reverse mode view of the copy node, summing the derivatives arriving from each output edge]
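
In code, the copy node's two rules look like the sketch below (the helper names are just for illustration); the reverse rule is exactly what the `+=` in the earlier reverse mode example was doing implicitly.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Forward rule for a copy node: one input fans out to n identical outputs.
std::vector<double> copy_node(double x, int n) {
  return std::vector<double>(n, x);
}

// Reverse rule for the same node: the derivatives arriving along the n output
// edges are summed to get the derivative w.r.t. the single input edge.
double copy_node_reverse(const std::vector<double>& d_outputs) {
  return std::accumulate(d_outputs.begin(), d_outputs.end(), 0.0);
}

int main() {
  std::vector<double> fan_out = copy_node(3.0, 4);         // x used in 4 places
  double d_dx = copy_node_reverse({0.5, 1.0, 2.0, 0.25});  // sum the 4 incoming terms
  std::printf("%zu copies, d/dx = %f\n", fan_out.size(), d_dx);
}
```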

 


When Should I Use Forward vs. Reverse Mode?

So far, we've only discussed how forward and reverse mode work, but haven't really described their strengths and weaknesses. Let's look at a couple specific examples to better understand the performance implications of forward vs. reverse mode.

Few Inputs, Many Outputs

Here is a graph featuring a scalar-valued input, scalar-valued intermediates and a vector-valued output.

[figure: function composition with a scalar input x, scalar intermediates y and z, and a vector-valued output {w1, …, wn}]

For this graph, the chain rule tells us that $\frac{d\{w_1,\ldots,w_n\}}{dx} = \frac{d\{w_1,\ldots,w_n\}}{dz}\,\frac{dz}{dy}\,\frac{dy}{dx}$.

 

Evaluating this expression with forward-mode means evaluating the products from right to left:

$$\frac{d\{w_1,\ldots,w_n\}}{dx} = \left(\frac{d\{w_1,\ldots,w_n\}}{dz}\left(\frac{dz}{dy}\,\frac{dy}{dx}\right)\right)$$

which involves multiplying two scalars $\frac{dz}{dy}\,\frac{dy}{dx}$ first (1 op), and then multiplying a vector by a scalar ($n$ ops) for a total of $n+1$ operations. In contrast, reverse-mode evaluates the products from left to right:

$$\frac{d\{w_1,\ldots,w_n\}}{dx} = \left(\left(\frac{d\{w_1,\ldots,w_n\}}{dz}\,\frac{dz}{dy}\right)\frac{dy}{dx}\right)$$

That means first multiplying a vector by a scalar ($n$ ops), and then multiplying a vector by a scalar ($n$ ops) for a total of $2n$ operations.

 

So, in this case, where the inputs and intermediates are smaller than the outputs, forward mode differentiation is preferred.


Many Inputs, Few Outputs

Here is another graph, except this one features a vector-valued input, vector-valued intermediates, and a scalar-valued output.

[figure: function composition with a vector-valued input {x1, …, xn}, vector-valued intermediates {y1, …, yn} and {z1, …, zn}, and a scalar output w]

For this graph, the chain rule tells us that $\frac{dw}{d\{x_1,\ldots,x_n\}} = \frac{dw}{d\{z_1,\ldots,z_n\}}\,\frac{d\{z_1,\ldots,z_n\}}{d\{y_1,\ldots,y_n\}}\,\frac{d\{y_1,\ldots,y_n\}}{d\{x_1,\ldots,x_n\}}$.

 

Evaluating this expression with forward-mode performs the matrix-matrix multiplication on the right first ($\sim 2n^3$ ops) and then a vector-matrix multiplication ($\sim 2n^2$ ops) for a total of about $2n^3$ operations.

$$\frac{dw}{d\{x_1,\ldots,x_n\}} = \left(\frac{dw}{d\{z_1,\ldots,z_n\}}\left(\frac{d\{z_1,\ldots,z_n\}}{d\{y_1,\ldots,y_n\}}\,\frac{d\{y_1,\ldots,y_n\}}{d\{x_1,\ldots,x_n\}}\right)\right)$$

With reverse mode, grouping from the left side leads to two vector-matrix products ($\sim 2n^2$ ops each) for a total of $\sim 4n^2$ operations. That is: for this kind of graph, where there is a single output and many intermediates, reverse mode is asymptotically faster by a factor of $n$.

This kind of graph structure is extremely important, as it describes practically every optimization problem (since they typically involve many input variables and a scalar-valued objective function as the output). This also includes machine learning training workflows, which are essentially just big optimization problems.

This is the main reason that reverse-mode differentiation is such an important part of machine learning: forward mode would be incredibly slow for training.


Similar Number of Inputs, Intermediates, Outputs

In this case, the operation counts of forward and reverse mode are roughly comparable. So, the tie is broken by the fact that forward-mode is generally simpler to implement and has a smaller memory footprint (since intermediate quantities can be discarded after use).


Summary

We covered a lot in this document, so let's recap some of the important ideas: