
In linear regression, the loss function is expressed as

$$\frac1N \left\|XW-Y\right\|_{\text{F}}^2$$

where $X, W, Y$ are matrices. Taking the derivative w.r.t. $W$ yields

$$\frac 2N \, X^T(XW-Y)$$

Why is this so?


4 Answers


Let

$$\begin{array}{rl} f (\mathrm W) &:= \| \mathrm X \mathrm W - \mathrm Y \|_{\text{F}}^2 = \mbox{tr} \left( (\mathrm X \mathrm W - \mathrm Y)^{\top} (\mathrm X \mathrm W - \mathrm Y) \right)\\ &\,= \mbox{tr} \left( \mathrm W^{\top} \mathrm X^{\top} \mathrm X \mathrm W - \mathrm Y^{\top} \mathrm X \mathrm W - \mathrm W^{\top} \mathrm X^{\top} \mathrm Y + \mathrm Y^{\top} \mathrm Y \right)\end{array}$$

Differentiating with respect to $\mathrm W$,

$$\nabla_{\mathrm W} f (\mathrm W) = 2 \, \mathrm X^{\top} \mathrm X \mathrm W - 2 \, \mathrm X^{\top} \mathrm Y = \color{blue}{2 \, \mathrm X^{\top} \left( \mathrm X \mathrm W - \mathrm Y \right)}$$
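The closed-form gradient above is easy to sanity-check numerically. The following sketch (shapes and data are illustrative, not from the question) compares it against a central finite-difference approximation:

```python
import numpy as np

# Numeric sanity check of grad f(W) = 2 X^T (XW - Y).
# Shapes are illustrative: X is n×d, W is d×m, Y is n×m.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
W = rng.standard_normal((3, 2))
Y = rng.standard_normal((5, 2))

def f(W):
    return np.linalg.norm(X @ W - Y, "fro") ** 2

grad = 2 * X.T @ (X @ W - Y)  # closed-form gradient

# Central finite differences, one entry of W at a time.
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        num[i, j] = (f(W + E) - f(W - E)) / (2 * eps)

print(np.allclose(grad, num, atol=1e-4))  # True
```

Since $f$ is quadratic in $W$, the central difference is exact up to floating-point rounding, so the two matrices agree to high precision.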



Let $X=(x_{ij})_{ij}$ and similarly for the other matrices. We are trying to differentiate $$ \|XW-Y\|^2=\sum_{i,j}\Big(\sum_k x_{ik}w_{kj}-y_{ij}\Big)^2\qquad (\star) $$ with respect to $W$. The result will be a matrix whose $(i,j)$ entry is the derivative of $(\star)$ with respect to the variable $w_{ij}$.

So think of $(i,j)$ as being fixed now. Only the terms in $(\star)$ with column index $j$ depend on $w_{ij}$. Taking their derivative gives $$ \frac{\partial\|XW-Y\|^2}{\partial w_{ij}}=\sum_{k}2x_{ki}\Big(\sum_l x_{kl}w_{lj}-y_{kj}\Big)=\sum_k 2x_{ki}(XW-Y)_{kj}=\left[2X^T(XW-Y)\right]_{i,j}. $$
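The entrywise claim can be verified directly for one fixed $(i,j)$; this short sketch (with made-up shapes) perturbs a single entry of $W$ and compares against the $(i,j)$ entry of $2X^T(XW-Y)$:

```python
import numpy as np

# Verify d||XW-Y||^2 / dw_{ij} = [2 X^T (XW - Y)]_{ij} for one fixed (i, j).
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
Y = rng.standard_normal((4, 2))
i, j = 1, 0  # the fixed entry of W

def f(W):
    return np.sum((X @ W - Y) ** 2)

# Central difference in the single coordinate w_{ij}.
eps = 1e-6
E = np.zeros_like(W)
E[i, j] = eps
numeric = (f(W + E) - f(W - E)) / (2 * eps)

closed = (2 * X.T @ (X @ W - Y))[i, j]
print(np.isclose(numeric, closed, atol=1e-6))  # True
```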


To spell out the process in more detail: denote $X = [x_{ij}], W = [w_{ij}], Y = [y_{ij}]$. Then
$$ \left \| XW - Y \right \|^{2} = \sum_{k, j} \Big(\sum_{l} x_{kl} w_{lj} - y_{kj}\Big)^{2}. $$
This is a scalar, and taking its derivative w.r.t. the matrix $W$ yields a matrix. Fixing the indices $i, j$, we get
$$ \frac{\partial \left \| XW - Y \right \|^{2}}{\partial w_{ij}} = \sum_{k} 2x_{ki} \Big(\sum_{l} x_{kl} w_{lj} - y_{kj}\Big) = \sum_{k} 2x_{ki} (XW - Y)_{kj} = [2 X^{T} (XW - Y)]_{ij}. $$
Thus we have
$$ \frac{d \left \| XW - Y \right \|^{2}}{d W} = 2 X^{T} (XW - Y). $$
First time answering a question, hope it is right, thanks!


Roughly speaking, the $\textbf{Jacobian}$ of $f$ at a point $x$ is the matrix/tensor $B$ such that
\begin{equation}
f(x+\delta)=f(x) + B\delta + o(\|\delta\|).
\end{equation}
So, if
$$f(W)=\|XW-Y\|_F^2,$$
then
\begin{equation}
f(W+\delta)=\|X(W+\delta)-Y\|_F^2=\|XW-Y+X\delta\|_F^2=\|XW-Y\|_F^2+2\langle XW-Y,\,X\delta \rangle +\|X\delta\|_F^2.
\end{equation}
Note that we then have
\begin{equation}
f(W+\delta)=f(W)+2\langle X^T(XW-Y),\,\delta \rangle +o(\|\delta\|)= f(W)+2\left(X^T(XW-Y)\right)^T\delta +o(\|\delta\|).
\end{equation}
So the Jacobian of $f$ is $2\left(X^T(XW-Y)\right)^T$, implying that the gradient is its transpose, $2X^T(XW-Y)$.

This Taylor expansion idea is a smart trick to make your life easier while taking derivatives.
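The expansion can also be checked numerically. Since $f$ is quadratic, the remainder beyond the first-order term is exactly $\|X\delta\|_F^2$; the sketch below (with illustrative shapes) confirms this:

```python
import numpy as np

# Check the expansion f(W + d) = f(W) + <2 X^T (XW - Y), d> + ||X d||_F^2.
rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))
W = rng.standard_normal((4, 3))
Y = rng.standard_normal((6, 3))
delta = 1e-3 * rng.standard_normal((4, 3))  # a small perturbation

def f(W):
    return np.linalg.norm(X @ W - Y, "fro") ** 2

grad = 2 * X.T @ (X @ W - Y)
linear = f(W) + np.sum(grad * delta)  # first-order (Taylor) prediction
exact = f(W + delta)

# The leftover term is exactly ||X delta||_F^2, since f is quadratic.
remainder = np.linalg.norm(X @ delta, "fro") ** 2
print(np.isclose(exact - linear, remainder))  # True
```

Here `np.sum(grad * delta)` computes the Frobenius inner product $\langle \nabla f, \delta\rangle$.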

