Matrix Differentiation

Machine Learning with Python

Lecture 6. Matrix Differentiation

Alex Avdiushenko
October 17, 2023

Function Differentiation

The derivative of a function $f(x): \mathbb{R} \to \mathbb{R}$ with respect to $x$ is typically denoted as $f'(x)$ or $\frac{df}{dx}$ \[ f'(x) = \lim_{{h \to 0}} \frac{f(x+h) - f(x)}{h} \]

The differential is a linear approximation of the function $f(x)$ at the point $x$: \[ f(x+h) - f(x) = L(x)[h] + o(\|h\|) \\ L(x) = Df(x): \mathbb{R} \to \mathbb{R}\ \text{ — linear function} \]

1. For $f(x): \mathbb{R} \to \mathbb{R}$

\[ Df(x)[h] = f'(x) \cdot h \]

2. For $f(x): \mathbb{R}^n \to \mathbb{R}$

\[ Df(x)[h] = \sum\limits_i \frac{\partial f}{\partial x_i} \cdot h_i = \left< \nabla f(x), h \right> \] $\nabla f$ — gradient of function $f$

The shape of gradient must be the same, as shape of argument $h$!

3. For $f(x): \text{Matrix}_{m\times n} \to \mathbb{R}$

\[ Df(x)[H] = \sum\limits_{i,j} \frac{\partial f}{\partial x_{ij}} \cdot h_{ij} = \mathrm{tr}(\nabla f^T H) = \left< \nabla f(x), H \right> \]

As \[ \mathrm{tr}(A^TB) = \sum\limits_i (A^TB)_{ii} = \sum\limits_i \sum\limits_j (A^T)_{ij}B_{ji} \]

4. For $f(x): \mathbb{R}^m \to \mathbb{R}^m$, defined as $f(x) = (g(x_1), \dots, g(x_m))$

\[ f(x+h)-f(x) = (\dots, g(x_i+h_i) - g(x_i) ,\dots) \approx \\ (\dots, g'(x_i) \cdot h_i ,\dots) = f'(x) \odot h \]

5. For $f(X): \text{Matrix}_{m\times n} \to \text{Matrix}_{m\times k}$, defined as $f(X) = XW$

\[ f(X+H)-f(X) = HW = Df(X)[H]\text{ — is already linear function} \]

Warning: there is no $\nabla f(x)$ in 4 and 5 item and actually you don't need it.

Matrix differential properties

$D \text{const} = 0$
$D L(x)[h] = L(x)[h]$
$D (f_1 + f_2) = D f_1 + D f_2$
$D (uv)(x)[h] = Du(x)[h] \cdot v(x) + u(x) \cdot Dv(x)[h]$
$D v(u(x)) [h] = Dv (u(x)) [D u(x)[h]]$

Since \[ D v(u(x)) [h] \approx v(u(x+h)) - v(u(x)) \approx \\ Dv (u(x)) [u(x+h)-u(x)] \approx Dv (u(x)) [D u(x)[h]] \]

Tasks to differentiate

1. $f(x) = \left< a, x \right>$

$$Df(x)[h] = \left< 0, x \right> + \left< a, dx \right> = \left< a, h \right> \\ \Rightarrow \nabla f = a$$

2. $f(x) = \left< Ax, x \right>$

$$Df(x)[h] = \left< Adx, x \right> + \left< Ax, dx \right> = \left< Ah, x \right> + \left< Ax, h \right> \\ \Rightarrow \nabla f = (A^T+A)x $$

3. $f(X) = X^{-1}$

$$I = X X^{-1} \Rightarrow 0 = (dX) X^{-1} + X d(X^{-1}) \Rightarrow d(X^{-1}) = - X^{-1} (dX) X^{-1}$$

4. $f(X) = \det X$

$$(\nabla f)_{ij} = \frac{\partial \det X}{\partial x_{ij}} = C_{ij} \Rightarrow \nabla f = \det(X) \cdot X^{-T} $$ where cofactor $C_{ij}$ is defined as $C_{ij} = (−1)^{i+j} \det(M_{ij})$

5. $f(x) = \| Ax-b \|^2$

$$ f(x) = \left< Ax-b, Ax-b \right> \Rightarrow Df(x)[h] = 2\left< Ah, Ax-b \right> \\ \Rightarrow \nabla f = 2A^T(Ax-b)$$

6. $f(X) = \log \det X$

$$f(X) = u(v(X)) \Rightarrow u(y) = \log y, Du = \frac{1}{y} \cdot Dy \\ v(X) = \det X, Dv = \left< \det X \cdot X^{-T}, DX \right> \\ \Rightarrow Df = \left< X^{-T}, DX \right> \Rightarrow \nabla f = X^{-T}$$

7. $f(X) = \mathrm{tr}(AX^TX)$

$$ f(X) = \mathrm{tr}(AX^TX) = \mathrm{tr}(X^TXA) = \left< X, XA \right> \\ \Rightarrow Df(x) = \left< DX, XA \right> + \left< X, D(X) A \right> = \\ \Rightarrow \left< DX, XA \right> + \mathrm{tr}(X^T D(X)A) = \left< DX, XA \right> + \mathrm{tr}(AX^T D(X)) = \\ = \left< XA, DX \right> + \left< XA^T, DX \right> \Rightarrow \nabla f = XA + XA^T $$

8. $f(X) = \det(AX^{-1}B)$ and $A$, $B$ are not necessary square

$$f(X) = u(v(X)) \Rightarrow u(Y) = \det Y, Du = \left<\det(Y) Y^{-T}, DY \right> \\ v(X) = A X^{-1} B, Dv = -A X^{-1} D(X) X^{-1} B \\ Y = v(X) \Rightarrow Df = \left<\det(A X^{-1} B) (A X^{-1} B)^{-T}, -A X^{-1} D(X) X^{-1} B \right>$$

9. $ \begin{cases} \| X - A \|_{fro} \to \min \\ X = X^T \end{cases}$ where $\| Y \|_{fro} = \sqrt{\mathrm{tr}(Y^TY)}$

Let's write the Lagrangian for square:

$$ \mathcal{L} = \mathrm{tr}\left( (X-A)^T(X-A) \right) + \sum\limits_{ij} \lambda_{ji}(x_{ji}-x_{ij}) = \\ = \mathrm{tr}\left( (X-A)^T(X-A) \right) + \mathrm{tr}\left( \Lambda^T (X^T-X) \right) \\ D\mathcal{L} = \mathrm{tr}\left( (DX)^TX+X^TDX - A^TDX - (DX)^TA + \Lambda^T(DX)^T - \Lambda^T DX\right) \\ \mathrm{tr}\left([\quad ?\quad]DX^T \right) \Rightarrow \nabla \mathcal{L} = X+X-A-A+\Lambda^T-\Lambda \\ \nabla \mathcal{L} = 2X-2A+\Lambda^T-\Lambda = 0, \nabla \mathcal{L}^T = 2X-2A^T+\Lambda-\Lambda^T = 0 \\ \nabla \mathcal{L} + \nabla \mathcal{L}^T = 0 \Rightarrow 4X = 2A+2A^T \Rightarrow \boxed{X = \frac{A+A^T}{2}}$$

Hessian — a square matrix of second-order partial derivatives of a scalar-valued function

\[ f(x+h) - f(x) = Df(x)[h] + \frac12 D^2f(x)[h, h] + o(\|h\|^2) \] \[ D^2f(x)[h, h] = h^T \nabla^2 f(x) h \] $\nabla^2 f(x)$ is Hessian

Let $f(x) = \| Ax-b \|^2$

$$ Df(x)[h_1] = \left< 2A^T (Ax-b), h_1 \right> = g(x) \\ Dg(x)[h_2] = \left< 2A^T Ah_2, h_1 \right> \Rightarrow \nabla^2 f(x) = 2A^T A $$

When our solution $x$ of $f(x) = 0$ is unique?