
Derivatives of all layers

Derivatives with Respect to Individual Parameters

Feedforward Step: For a given input value $x$, the output of the neural network is computed. During this computation, the activations $a^{l}$ at each layer are stored for later use.

For each unit $j$ in the output layer, compute the error term:

$$e^{L}_j = \frac{\partial J}{\partial z^{L}_j}$$

From this, we can derive the following:

$$\frac{\partial J}{\partial w^{L}_{ij}} = a^{L-1}_i e^{L}_j, \qquad \frac{\partial J}{\partial b^{L}_j} = e^{L}_j$$

For layers $l = L-1, L-2, \ldots, 1$, we compute:

$$e^{l}_j = \left( w^{l+1}_{j:} e^{l+1} \right) f'(z^{l}_j)$$

where $w^{l+1}_{j:}$ denotes the $j$-th row of the weight matrix $W^{l+1}$.

Updating the derivatives for each parameter yields:

$$\frac{\partial J}{\partial w^{l}_{ij}} = a^{l-1}_i e^{l}_j, \qquad \frac{\partial J}{\partial b^{l}_j} = e^{l}_j$$
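As a concrete illustration of these per-parameter formulas, below is a minimal NumPy sketch for a network with one hidden layer, written with explicit loops over the unit indices $i$ and $j$. The sigmoid activation, squared-error loss, and all sizes are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: input d0, one hidden layer d1, output layer d2
d0, d1, d2 = 3, 4, 2
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d0, d1)), np.zeros(d1)   # W^l has shape d^{l-1} x d^l
W2, b2 = rng.standard_normal((d1, d2)), np.zeros(d2)

x = rng.standard_normal(d0)          # one training point
y = rng.standard_normal(d2)          # its target

# Feedforward: store the activations a^l
a0 = x
z1 = W1.T @ a0 + b1; a1 = sigmoid(z1)
z2 = W2.T @ a1 + b2; a2 = sigmoid(z2)

# Output-layer error term e^L_j = dJ/dz^L_j, here for J = 0.5 * ||a^L - y||^2
e2 = (a2 - y) * a2 * (1 - a2)

# Per-parameter gradients at the output layer
dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
for j in range(d2):
    db2[j] = e2[j]                   # dJ/db^L_j = e^L_j
    for i in range(d1):
        dW2[i, j] = a1[i] * e2[j]    # dJ/dw^L_{ij} = a^{L-1}_i e^L_j

# Hidden-layer error term: e^l_j = (w^{l+1}_{j:} e^{l+1}) f'(z^l_j)
e1 = np.zeros(d1)
for j in range(d1):
    e1[j] = (W2[j, :] @ e2) * a1[j] * (1 - a1[j])

dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
for j in range(d1):
    db1[j] = e1[j]
    for i in range(d0):
        dW1[i, j] = a0[i] * e1[j]
```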

Derivatives with Respect to Matrices

The computation of derivatives for individual parameters as described above is useful for understanding the principles involved. However, in practice, we need to optimize the calculations by expressing them in vector and matrix forms to accelerate the algorithm. Let’s define:

$$e^{l} = \begin{bmatrix} e^{l}_1 \\ e^{l}_2 \\ \vdots \\ e^{l}_{d^{l}} \end{bmatrix} \in \mathbb{R}^{d^{l} \times 1}$$

Feedforward Step: For a given input value $x$, compute the network output while storing the activations $a^{l}$ at each layer.

For the output layer, compute:

$$e^{L} = \frac{\partial J}{\partial z^{L}}$$

From this, we deduce:

$$\frac{\partial J}{\partial W^{L}} = a^{L-1} (e^{L})^T, \qquad \frac{\partial J}{\partial b^{L}} = e^{L}$$

For layers $l = L-1, L-2, \ldots, 1$:

$$e^{l} = \left( W^{l+1} e^{l+1} \right) \odot f'(z^{l})$$

Here, $\odot$ denotes the element-wise product (Hadamard product): corresponding components of the two vectors are multiplied together, yielding a vector of the same size.
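In NumPy, for example, the Hadamard product is simply the element-wise `*` operator (a small illustrative snippet):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
print(u * v)   # element-wise (Hadamard) product: [ 4. 10. 18.]
```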

Updating the Derivatives for the Weight Matrices and Bias Vectors:

$$\frac{\partial J}{\partial W^{l}} = a^{l-1} (e^{l})^T, \qquad \frac{\partial J}{\partial b^{l}} = e^{l}$$
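Putting the vector form together, here is a minimal NumPy sketch of one backward pass for a single input, again assuming a sigmoid activation and squared-error loss purely for illustration; vectors are stored as columns so that the outer product $a^{l-1}(e^{l})^T$ comes out directly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical two-layer network: W^l has shape d^{l-1} x d^l, vectors are columns
d0, d1, d2 = 3, 4, 2
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d0, d1)), np.zeros((d1, 1))
W2, b2 = rng.standard_normal((d1, d2)), np.zeros((d2, 1))
x = rng.standard_normal((d0, 1)); y = rng.standard_normal((d2, 1))

# Feedforward, storing the activations a^l
a0 = x
z1 = W1.T @ a0 + b1; a1 = sigmoid(z1)
z2 = W2.T @ a1 + b2; a2 = sigmoid(z2)

# Backward pass in vector/matrix form
e2 = (a2 - y) * a2 * (1 - a2)    # e^L = dJ/dz^L for J = 0.5 * ||a^L - y||^2
dW2 = a1 @ e2.T                  # dJ/dW^L = a^{L-1} (e^L)^T
db2 = e2                         # dJ/db^L = e^L

e1 = (W2 @ e2) * a1 * (1 - a1)   # e^l = (W^{l+1} e^{l+1}) * f'(z^l), element-wise
dW1 = a0 @ e1.T
db1 = e1
```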

Note: The formula $\frac{\partial J}{\partial W^{L}} = a^{L-1} (e^{L})^T$ might raise a question: why is it $a^{L-1} (e^{L})^T$ and not, say, $(a^{L-1})^T e^{L}$ or some other combination? A key rule to remember is that the dimensions of the two sides must match. The left-hand side is the derivative with respect to $W^{L}$, which lives in $\mathbb{R}^{d^{L-1} \times d^{L}}$. Since $e^{L} \in \mathbb{R}^{d^{L} \times 1}$ and $a^{L-1} \in \mathbb{R}^{d^{L-1} \times 1}$, the only product with the right shape is $a^{L-1} (e^{L})^T$. More generally, the derivative of a scalar-valued function with respect to a matrix has the same dimensions as that matrix.
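This dimension argument is easy to check numerically; in the small NumPy snippet below the sizes are chosen arbitrarily.

```python
import numpy as np

d_prev, d_out = 4, 3                      # d^{L-1} and d^{L}, chosen arbitrarily
a = np.random.randn(d_prev, 1)            # a^{L-1}
e = np.random.randn(d_out, 1)             # e^{L}
print((a @ e.T).shape)                    # (4, 3): the shape of W^L in R^{d^{L-1} x d^L}
# Other orderings fail the shape test; e.g. a.T @ e is not even defined unless d^{L-1} = d^{L}.
```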

Backpropagation for Batch (Mini-Batch) Gradient Descent

What happens when we want to implement Batch or mini-batch Gradient Descent? In practice, mini-batch Gradient Descent is the most commonly used method. When the dataset is small, Batch Gradient Descent can be applied directly.

In this case, the pair $(X, Y)$ will be in matrix form. Suppose that each iteration of the computation processes $N$ data points. Then we have:

$$X \in \mathbb{R}^{d^{0} \times N}, \quad Y \in \mathbb{R}^{d^{L} \times N}$$

where $d^{0} = d$ is the dimension of the input data (excluding biases).

Consequently, the activation matrices and error matrices at each layer will have the following forms:

$$A^{l} \in \mathbb{R}^{d^{l} \times N}, \quad E^{l} \in \mathbb{R}^{d^{l} \times N}$$
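These shapes can be checked quickly in NumPy before turning to the update formulas; the sizes below are arbitrary and the sigmoid activation is only an example.

```python
import numpy as np

d0, d1, dL, N = 3, 4, 2, 5                   # layer sizes and batch size, chosen arbitrarily
X = np.random.randn(d0, N)                   # each column of X is one data point
Y = np.random.randn(dL, N)
W1, b1 = np.random.randn(d0, d1), np.zeros((d1, 1))
A1 = 1.0 / (1.0 + np.exp(-(W1.T @ X + b1)))  # A^1 has shape (d^1, N); b1 broadcasts over columns
print(X.shape, Y.shape, A1.shape)            # (3, 5) (2, 5) (4, 5)
```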

We can derive the update formulas as follows.

Feedforward Step: With a complete dataset (batch) or a mini-batch of inputs $X$, compute the output of the network while storing the activations $A^{l}$ at each layer. Each column of $A^{l}$ corresponds to a data point in $X$.

For the output layer, compute:

$$E^{L} = \frac{\partial J}{\partial Z^{L}}$$

From this, we derive:

$$\frac{\partial J}{\partial W^{L}} = A^{L-1} (E^{L})^T, \qquad \frac{\partial J}{\partial b^{L}} = \sum_{n=1}^{N} e^{L}_n$$
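Here $e^{L}_n$ is the $n$-th column of $E^{L}$, so the bias gradient is just the sum of the columns of $E^{L}$. In NumPy that is a single sum over the batch axis, for example:

```python
import numpy as np

EL = np.random.randn(2, 5)              # a stand-in E^L with d^L = 2 and N = 5 (arbitrary)
dbL = EL.sum(axis=1, keepdims=True)     # sum of the columns e^L_n; shape (2, 1), like b^L
```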

For layers $l = L-1, L-2, \ldots, 1$:

$$E^{l} = \left( W^{l+1} E^{l+1} \right) \odot f'(Z^{l})$$

Here, $\odot$ again signifies the element-wise product: corresponding elements of the two matrices are multiplied to produce a matrix of the same size.

Updating the Derivatives for the Weight Matrices and Bias Vectors:

$$\frac{\partial J}{\partial W^{l}} = A^{l-1} (E^{l})^T, \qquad \frac{\partial J}{\partial b^{l}} = \sum_{n=1}^{N} e^{l}_n$$
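Collecting the batch formulas, a compact NumPy sketch of one feedforward/backpropagation pass over a mini-batch might look as follows. The sigmoid activation, squared-error loss $J = \tfrac{1}{2}\sum_n \|a^{L}_n - y_n\|^2$, sizes, and learning rate are all illustrative assumptions, not fixed by the text.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

# Hypothetical two-layer network; W^l has shape d^{l-1} x d^l, columns of X are data points
d0, d1, d2, N = 3, 4, 2, 8
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d0, d1)), np.zeros((d1, 1))
W2, b2 = rng.standard_normal((d1, d2)), np.zeros((d2, 1))
X = rng.standard_normal((d0, N)); Y = rng.standard_normal((d2, N))

# Feedforward, storing A^l (one column per data point); the biases broadcast over columns
A0 = X
Z1 = W1.T @ A0 + b1; A1 = sigmoid(Z1)
Z2 = W2.T @ A1 + b2; A2 = sigmoid(Z2)

# Output layer: E^L = dJ/dZ^L for the squared-error loss above
E2 = (A2 - Y) * A2 * (1 - A2)
dW2 = A1 @ E2.T                        # dJ/dW^L = A^{L-1} (E^L)^T
db2 = E2.sum(axis=1, keepdims=True)    # sum of e^L_n over the batch

# Hidden layer: E^l = (W^{l+1} E^{l+1}) * f'(Z^l), element-wise
E1 = (W2 @ E2) * A1 * (1 - A1)
dW1 = A0 @ E1.T
db1 = E1.sum(axis=1, keepdims=True)

# One gradient-descent step (learning rate chosen arbitrarily)
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```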

Writing backpropagation in this matrix form, for both stochastic and (mini-)batch gradient descent, makes the algorithm easier to reason about and substantially faster to compute when training neural networks.
