Neural Network
Feature Transformation
For example, suppose the original space is a Cartesian coordinate system. After a mathematical transformation we can turn it into a polar coordinate system, called the feature space.
In this situation, a nonlinear classifier in the original space can become a linear classifier in the feature space, which is easier to implement.
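A minimal sketch of this idea (the circular boundary and helper names are assumptions for illustration): a circle-shaped decision boundary is nonlinear in Cartesian coordinates, but after the Cartesian-to-polar feature transform it becomes a linear threshold on the radius feature alone.

```python
import math

# Hypothetical setup: points inside the unit circle are class 0, outside
# are class 1 -- a nonlinear boundary in the original Cartesian space.
def to_polar(x, y):
    """Feature transform: Cartesian (x, y) -> polar (r, theta)."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x)
    return r, theta

def classify(x, y):
    # In the polar feature space the boundary is simply r = 1:
    # a linear (axis-aligned) decision rule on a single feature.
    r, _ = to_polar(x, y)
    return 0 if r < 1.0 else 1
```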
An application is the Histogram of Oriented Gradients (HoG). The method:
- Compute the edge direction/strength at each pixel
- Divide the image into 8×8 regions
- Within each region, compute a histogram of edge directions weighted by the edge strength
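The steps above can be sketched as follows (a simplified illustration, not the full HoG descriptor; the function name, cell size, and bin count are assumptions):

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Minimal HoG sketch: per-pixel edge direction/strength, then a
    per-cell histogram of directions weighted by edge strength."""
    # Edge direction and strength at each pixel via finite differences.
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                         # edge strength
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned direction

    h, w = img.shape
    out = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            # Histogram of directions in this cell, weighted by strength.
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hist, _ = np.histogram(ang[sl], bins=bins, range=(0, 180),
                                   weights=mag[sl])
            out[i, j] = hist
    return out
```

The real descriptor additionally normalizes histograms over overlapping blocks; this sketch stops at the per-cell histograms described above.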
Fully-connected Neural Network

The max function in $f$ is called the activation function; a classical implementation is ReLU:
```python
# Rectified Linear Unit
def ReLU(z):
    return max(0, z)
```
Q: What if we build a neural network without an activation function?
A: In this situation $f = W_2 W_1 x$; letting $W_3 = W_2 W_1$, we get $f = W_3 x$. This is still a linear classifier!
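This collapse can be checked numerically (shapes and random values are chosen arbitrarily for illustration):

```python
import numpy as np

# Sketch: two linear layers with no activation collapse into one.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)         # input vector
W1 = rng.standard_normal((5, 4))   # first layer weights
W2 = rng.standard_normal((3, 5))   # second layer weights

f_two_layers = W2 @ (W1 @ x)       # "network" without activation
W3 = W2 @ W1                       # a single equivalent weight matrix
f_one_layer = W3 @ x

assert np.allclose(f_two_layers, f_one_layer)  # still a linear classifier
```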
Why does ReLU work?


The two figures above show that ReLU can transform a nonlinear boundary in the original space into a linear boundary in the feature space.
Back Propagation
An example of backpropagation
Take $f=(x+y)z$ as an example. The forward pass can be expressed as $q=x+y,\quad f=qz$, and the goal of backpropagation is to compute the partial derivatives $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, and $\frac{\partial f}{\partial z}$. By the chain rule, $\frac{\partial f}{\partial x}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial x}$. From the perspective of the operation $q=x+y$, $\frac{\partial f}{\partial x}$ is called the Downstream Gradient, $\frac{\partial q}{\partial x}$ the Local Gradient, and $\frac{\partial f}{\partial q}$ the Upstream Gradient.
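With concrete numbers (the input values here are assumed for illustration), the forward and backward passes for $f=(x+y)z$ look like this:

```python
# Worked example for f = (x + y) * z with assumed inputs.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule)
df_dq = z                  # upstream gradient arriving at the add node
dq_dx, dq_dy = 1.0, 1.0    # local gradients of q = x + y
df_dx = df_dq * dq_dx      # downstream gradient: upstream * local
df_dy = df_dq * dq_dy
df_dz = q                  # local gradient of f = q * z w.r.t. z
```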
Sigmoid Function
$\sigma(x)=\large \frac{1}{1+e^{-x}}$
$\frac{\partial \sigma(x)}{\partial x}=(1-\sigma(x))\sigma(x)$
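The identity above can be verified against a numerical derivative (a small sanity-check sketch; the test point and step size are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Analytic form: sigma'(x) = (1 - sigma(x)) * sigma(x)
    s = sigmoid(x)
    return (1.0 - s) * s

# Compare against a central finite difference at an arbitrary point.
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert abs(sigmoid_grad(x) - numeric) < 1e-8
```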
Pattern Gradient

Matrix Operation

In a real neural network, the inputs and outputs are usually given as matrices, and in that case the computation of $\frac{\partial L}{\partial x}$ is as shown in the figure above. Because the matrix dimensions D and M are large, computing the full gradient directly is infeasible (out of memory), so we compute it slice by slice, one part at a time. For example, to compute $\frac{\partial L}{\partial x_{1,1}}$ we need $\frac{\partial L}{\partial y}\frac{\partial y}{\partial x_{1,1}}$. The upstream gradient $\frac{\partial L}{\partial y}$ is known, and $\frac{\partial y}{\partial x_{1,1}}$ has the same shape as $y$, so we only need to compute each $\frac{\partial y_{i,j}}{\partial x_{1,1}}$ in turn. The figure shows how $\frac{\partial y_{1,2}}{\partial x_{1,1}}$ is computed; as for $\frac{\partial y_{2,3}}{\partial x_{1,1}}$, since $y_{2,3}=x_{2,1}w_{1,3}+x_{2,2}w_{2,3}+x_{2,3}w_{3,3}$ does not involve $x_{1,1}$, we have $\frac{\partial y_{2,3}}{\partial x_{1,1}}=0$.
From the derivation above, we can summarize a general rule:
$\LARGE \frac{\partial L}{\partial x_{i,j}}=\frac{\partial L}{\partial y}\frac{\partial y}{\partial x_{i,j}}=(w_{j,:})\cdot (\frac{\partial L}{\partial y_{i,:}})$
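This rule can be checked numerically; note that stacking it over all $(i,j)$ gives the whole-matrix form $\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}w^{T}$ (shapes and random values below are assumptions for illustration):

```python
import numpy as np

# Sketch: for y = x @ w, verify dL/dx[i, j] = w[j, :] . dL/dy[i, :],
# which in matrix form is dL/dx = dL/dy @ w.T.
rng = np.random.default_rng(1)
N, D, M = 2, 3, 4
x = rng.standard_normal((N, D))
w = rng.standard_normal((D, M))
dL_dy = rng.standard_normal((N, M))   # upstream gradient, assumed given

dL_dx = dL_dy @ w.T                   # whole-matrix form

# Element-by-element check of the summarized rule.
for i in range(N):
    for j in range(D):
        assert np.isclose(dL_dx[i, j], w[j, :] @ dL_dy[i, :])
```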