In [1]:

```
import gtn
import nb_utils
nb_utils.init()
```

$\DeclareMathOperator*{\LSE}{\textrm{LSE}} \newcommand{\vx}{{\bf x}} \newcommand{\vy}{{\bf y}} \newcommand{\vz}{{\bf z}} \newcommand{\gA}{\mathcal{A}} \newcommand{\gB}{\mathcal{B}} \newcommand{\gC}{\mathcal{C}} \newcommand{\gG}{\mathcal{G}} \newcommand{\gX}{\mathcal{X}}$Modern deep learning relies on the fact that core functions are differentiable (or sub-differentiable). This is important to any type of gradient-based numerical optimization, of which the most commonly used are variations of stochastic gradient descent. The parameters, $\theta$, of the model are specified in tensors. The objective function, $f(\theta, \vx, \vy)$, is a function of the parameters and the input data $\vx$ and $\vy$. The parameters can then be optimized to improve the objective with variations of a simple update rule:

$$ \theta_t = \theta_{t-1} + \alpha \, \nabla_{\theta} f(\theta_t, \vx, \vy), $$where $\alpha$ is the learning rate, and $\nabla_{\theta}$ is the gradient operator which we describe in more detail below. This update applies equally well to parameters in graphs as it does to tensors. The main challenge is computing gradients, the subject of this section.

Many operations used with vectors, matrices, and $n$-dimensional tensors are differentiable. This means we can compute the change in any of the output elements with respect to an infinitesimal change in any of the input elements. For example, consider a vector $\vz = f(\vx, \vy)$ which is the output of a function of two vectors $\vx$ and $\vy$. The Jacobian of $\vz$ with respect to $\vx$ is the natrix of partial derivatives with entries $\frac{\partial \vz_i}{\partial \vx_j}$. The gradient is defined as the tensor of partial derivatives of a scalar function. So if $f(\vx) \in \mathbb{R}$ is a scalar function, then the gradient is:

$$ \nabla f(\vx) = \left[ \frac{\partial f(\vx)}{\partial \vx_{1}}, \ldots, \frac{\partial f(\vx)}{\partial \vx_{n}} \right]^\top. $$In the same way, we can compute partial derivatives of the arc weights of an output graph for a given operation with respect to the arc weights of any of the input graphs. Take the concatenation operation as an example. Suppose we are given two graphs, $\gA$ and $\gB$, and we construct the concatenated graph $\gC = \gA \gB$ as in the figure below.

For each of the arc weights $C_i$ in the concatenated graph $\gC$, we can compute the partial derivative with respect to the arc weights of $\gA$ and $\gB$. For any arc in $\gC$, it either has a corresponding arc in $\gA$ or $\gB$ from which it gets its weight, or it has a weight of zero. The partial derivative of an output arc weight $C_i$ with respect to an input arc weight $A_j$ or $B_j$ is $1$ if the two arcs correspond and $0$ otherwise. For example, in the above graphs we have:

$$ \frac{\partial C_1}{\partial A_1} = 1, \quad \frac{\partial C_2}{\partial A_2} = 1, \quad \frac{\partial C_4}{\partial B_1} = 1, \quad\textrm{and}\quad \frac{\partial C_5}{\partial B_2} = 1. $$The remaining partial derivatives are all $0$. For example, for $C_1$ we have:

$$ \frac{\partial C_1}{\partial A_2} = 0, \quad \frac{\partial C_1}{\partial B_1} = 0, \quad \textrm{and} \quad \frac{\partial C_1}{\partial B_2} = 0. $$In the following, I use the notation $\frac{\partial \gC}{\partial \gA}$ to generalize the Jacobian to graphs. This Jacobian is a data structure which contains the partial derivatives $\frac{\partial C_i}{\partial A_j}$ for all arc weights $C_i$ in $\gC$ and $A_j$ in $\gA$. Another way to view this Jacobian is as a set of graphs $\frac{\partial C_i}{\partial \gA}$ indexed by $i$ which are the same size as $\gA$. Alternatively we can view the Jacobian as a set of graphs $\frac{\partial \gC}{\partial A_j}$ indexed by $j$ which are the same size as $\gC$. This is analogous to viewing the Jacobian of a vector-valued function either as a set of columns or a set of rows.

Compute the partial derivatives of the arc weights of the closure of the graph $\gA$ with respect to the input arc weights.

The non-zero partial derivatives are:

$$ \frac{\partial w_2}{\partial A_1} = 1 \quad \textrm{and} \quad \frac{\partial w_3}{\partial A_2} = 1. $$The remaining partial derivatives are zero:

$$ \frac{\partial w_1}{\partial A_1} = 0, \quad \frac{\partial w_1}{\partial A_2} = 0, \quad \frac{\partial w_2}{\partial A_2} = 0, \quad \frac{\partial w_3}{\partial A_1} = 0, \quad \frac{\partial w_4}{\partial A_1} = 0, \quad \textrm{and} \quad \frac{\partial w_4}{\partial A_2} = 0. $$Compute the partial derivatives of the intersected automata weights $w_i$ with respect to the input arc weights $A_j$ for graph $\gA$ and $B_k$ for graph $\gB$ shown in the figure below.

The partial derivative $\frac{\partial w_i}{\partial A_j}$ is $1$ if the weight $w_i$ came from $A_j$ and zero otherwise. The derivatives with a value of $1$ for graph $\gA$ are:

$$ \frac{\partial w_1}{\partial A_1}, \; \frac{\partial w_2}{\partial A_2}, \; \frac{\partial w_3}{\partial A_1}, \; \frac{\partial w_4}{\partial A_2}, \; \frac{\partial w_5}{\partial A_3}, \; \frac{\partial w_6}{\partial A_2}, \; \textrm{and} \;\; \frac{\partial w_7}{\partial A_3}. $$The derivatives with a value of $1$ for graph $\gB$ are:

$$ \frac{\partial w_1}{\partial B_1}, \; \frac{\partial w_2}{\partial B_1}, \; \frac{\partial w_3}{\partial B_4}, \; \frac{\partial w_4}{\partial B_5}, \; \frac{\partial w_5}{\partial B_6}, \; \frac{\partial w_6}{\partial B_8}, \; \textrm{and} \;\; \frac{\partial w_7}{\partial B_9}. $$The remaining derivatives for both graphs are zero.

In the previous section we saw how to compute derivatives for some common automata operations. Automatic differentiation greatly simplifies the process of computing derivatives for arbitrary compositions of operations. In this section, we will discuss *reverse-mode* automatic differentiation at a high-level.

Reverse-mode automatic differentiation proceeds in two steps. First, a forward pass computes all of the operations. Then, a backward pass computes the gradients. During the forward pass the composition of operations is stored in a computation graph (not to be confused with an automata). Data and metadata are also cached during the forward pass to make the gradient comptuation more efficient. There is often a trade-off between memory and compute in that we can save more intermediate data to reduce the computation required during but increase the memory required during the bakcward pass.

Consider the graph equation:

$$ \LSE\left[ \left(\left(\gA_1 + \gA_2\right) \circ \gX^*\right) \right]. $$This equation can be represented in the computation graph below:

The leaves of the computation graph (the solid circular nodes with no incoming arrows) are either parameter graphs or input data graphs. In this case, let's assume $\gA_1$ and $\gA_2$ are the parameter graphs, and $\gX$ is the graph of input data. The square nodes are operations, and they are always followed by output graphs, which are dashed circular nodes.

During the backward pass, the gradients are computed from the output ($\gG_4$ in the graph above) following the arrows in the computation graph backwards. A graph can only compute its gradient once all of the graphs downstream of it have had their gradients computed. Thus the backward pass must be done as a reverse topological traversal of the computation graph.

For example, assume the gradient of the output $\gG_4$ with respect to graph $\gG_1$ has been computed. This gradient $\frac{\partial \gG_4}{\partial \gG_1}$ is then used to compute $\frac{\partial \gG_4}{\partial \gA_1}$ and $\frac{\partial \gG_4}{\partial \gA_2}$. Assuming we know how to differentiate $\gG_1$ with respect to $\gA_1$ and $\gA_2$, then we can compute the desired gradients essentially using the chain rule. Assume $g_i$ are the arc weights of $\gG_1$ and $a_j$ are the arc weights of $\gA_1$, then the chain rule gives:

$$ \frac{\partial \gG_4}{\partial a_j} = \sum_{i} \frac{\partial \gG_4}{\partial g_i} \frac{\partial g_i}{\partial a_j}. $$This is done recursively at every node in the computation graph until the gradients for all the leaf graphs are available.

At a high-level any implementation of reverse-mode automatic differentiation is the same. However, implementations often differ in the details. One simple approach is to have every output graph record the input graphs from which it was generated as well as a gradient computation function. The input graphs and gradient computation function (and any other metadata) can be set during the forward pass by the operation itself.

For example, after the execution of union, the data structure which holds the output graph $\gG_1$ could also hold pointers to the inputs $\gA_1$ and $\gA_2$ and a pointer to the gradient computation function for union. A simplified example of what this data structure might look like is shown below.

Once the gradient for $\gG_1$ is available, the union's gradient function is called with the inputs $\gA_1$, $\gA_2$, and $\frac{\partial \gG_4}{\partial \gG_1}$. For $\gA_1$, the union gradient function will compute $\frac{\partial \gG_1}{\partial \gA_1}$ and use the chain rule as described above to assemble the desired gradient $\frac{\partial \gG_4}{\partial \gA_1}$. It will do the same for $\gA_2$.

Let's next look at some actual code which uses automatic differentiation.

In [2]:

```
# Request the gradient for A1 and A2 by setting
# `calc_grad=True`:
A1 = gtn.Graph(calc_grad=True)
A1.add_node(start=True)
A1.add_node()
A1.add_node(accept=True)
for i in range(3):
A1.add_arc(
src_node=0, dst_node=1, ilabel=i, olabel=i, weight=i)
A1.add_arc(
src_node=1, dst_node=2, ilabel=i, olabel=i, weight=i)
A2 = gtn.Graph(calc_grad=True)
A2.add_node(start=True)
A2.add_node()
A2.add_node(accept=True)
for i in range(3, 5):
A2.add_arc(
src_node=0, dst_node=1, ilabel=i, olabel=i, weight=1)
A2.add_arc(
src_node=1, dst_node=2, ilabel=i, olabel=i, weight=1)
# Don't request gradients for X:
X = gtn.Graph(calc_grad=False)
X.add_node(start=True)
X.add_node()
X.add_node(accept=True)
X.add_arc(src_node=0, dst_node=1, ilabel=2, olabel=2, weight=1.5)
X.add_arc(src_node=0, dst_node=1, ilabel=4, olabel=4, weight=1.5)
X.add_arc(src_node=1, dst_node=2, ilabel=0, olabel=0, weight=2.5)
# Compute the function:
G1 = gtn.union([A1, A2])
G2 = gtn.closure(X)
G3 = gtn.intersect(G1, G2)
G4 = gtn.forward_score(G3)
# Attempting to access gradients before they are computed
# raises an exception:
try:
A1.grad()
except RuntimeError as e:
print(e)
```

[Graph::grad] Gradient not calculated yet.

In [3]:

```
# Compute the gradients with respect to the inputs:
gtn.backward(G4)
# A1_grad is a graph which has the same structure as A1 and
# with arc weights which contain the derivatives of G4 with
# respect to each of A1's arc weights.
A1_grad = A1.grad()
# We can also access any of the output graph gradients:
G1_grad = G1.grad()
# Attempting to access X's gradient raises an exception:
try:
X.grad()
except RuntimeError as e:
print(e)
```

[Graph::grad] Gradient calculation disabled.

In [4]:

```
# If none of a graphs inputs requires a gradient
# then the gradient for the graph will not be computed.
# Attempting to access G2's gradient fails since none of its inputs
# (in this case just X) requires a gradient:
try:
G2.grad()
except RuntimeError as e:
print(e)
```

[Graph::grad] Gradient calculation disabled.

In [5]:

```
# Clear the gradient on A1 and A2 for future gradient computations:
A1.zero_grad()
A2.zero_grad()
# The compute graph is cleared on the first backward pass,
# so we cannot call backward twice:
try:
gtn.backward(G4)
except ValueError as e:
print(e)
# Reconstruct the graph:
G4 = gtn.forward_score(
gtn.intersect(
gtn.union([A1, A2]), gtn.closure(X)))
# To compute the gradient multiple times without reconstructing
# the graph, use `retain_graph`:
gtn.backward(G4, retain_graph=True)
# Note, gradients accumulate, so they will not be correct
# after the second call to backward unless they are all
# explicitly cleared.
gtn.backward(G4, retain_graph=True)
```

[autograd::backward] Cannot Backward twice without retaining the graph.