Recurrent Neural Networks
Key idea
RNNs have an “internal state” that is updated as a sequence is processed; the recurrence formula is $h_t = f_W(h_{t-1},x_t)$.
Here $h_t$ is the state at time step $t$, $x_t$ is the input at time step $t$, and $f_W$ is a function with parameters $W$.
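As a quick illustration (a minimal sketch; the names `run_rnn` and `f_W` are placeholders, not from any library), unrolling the recurrence just means applying the same function with the same parameters at every time step:

```python
# Minimal sketch of the generic recurrence: the same function f_W
# (with the same parameters W) is applied at every time step.
def run_rnn(f_W, x_sequence, h0):
    h = h0
    states = []
    for x_t in x_sequence:   # process the sequence one step at a time
        h = f_W(h, x_t)      # h_t = f_W(h_{t-1}, x_t)
        states.append(h)
    return states
```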
Vanilla RNNs
The state consists of a single hidden vector $h$
$h_t = tanh(W_{hh}h_{t-1}+W_{xh}x_t)$
$y_t = W_{hy}h_t$
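A minimal NumPy sketch of one vanilla RNN step and its unrolling (the sizes, weight initialization, and toy data are illustrative, not from any particular implementation):

```python
import numpy as np

# Illustrative sizes: hidden dim H, input dim D, output dim V.
H, D, V = 4, 3, 5
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(H, H))
W_xh = rng.normal(scale=0.1, size=(H, D))
W_hy = rng.normal(scale=0.1, size=(V, H))

def rnn_step(h_prev, x_t):
    """One vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

# Unroll over a toy sequence of 6 inputs.
h = np.zeros(H)
for x_t in rng.normal(size=(6, D)):
    h, y = rnn_step(h, x_t)
```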
Backpropagation Through Time
Forward through entire sequence to compute loss
Backward through entire sequence to compute gradient
But: a very long sequence takes a lot of memory $\rightarrow$ use chunks of the sequence instead of the entire sequence (Truncated Backpropagation Through Time), and use the final hidden state of the previous chunk as the first hidden state of the next chunk ($\large h_{T,i}=h_{0,i+1}$, where $T$ is the last time step of chunk $i$). A sketch of this is shown below.
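A rough sketch of truncated BPTT, assuming PyTorch (the sizes, chunk length, and toy data are illustrative):

```python
import torch
import torch.nn as nn

# Toy setup: one long sequence processed in chunks.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 1)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()))
sequence = torch.randn(1, 1000, 8)   # one sequence of length 1000
targets = torch.randn(1, 1000, 1)
chunk_len = 100

h = torch.zeros(1, 1, 16)            # initial hidden state
for start in range(0, sequence.size(1), chunk_len):
    x = sequence[:, start:start + chunk_len]
    y = targets[:, start:start + chunk_len]
    out, h = rnn(x, h)
    loss = nn.functional.mse_loss(readout(out), y)
    optimizer.zero_grad()
    loss.backward()                  # backprop only through this chunk
    optimizer.step()
    h = h.detach()                   # keep the value, but cut the graph to earlier chunks
```

The `detach()` call is what makes this “truncated”: the hidden state’s value carries forward across chunks, but gradients never flow back into earlier chunks.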
Image Captioning
Image $\rightarrow$ CNN $\rightarrow$ RNN
Transfer learning : Take a CNN trained on ImageNet and chop off the last FC layer. In the RNN, the image feature enters as an extra additive input: \(h_t = tanh(W_{xh}x_t + W_{hh}h_{t-1} + W_{ih}v)\). Here $v$ is the feature vector extracted by the CNN.
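A minimal sketch of one captioning step, assuming the additive-feature formulation above (the sizes and the name `caption_step` are illustrative):

```python
import numpy as np

# Illustrative sizes: word-embedding dim D, hidden dim H, CNN feature dim F.
D, H, F = 3, 4, 6
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, D))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_ih = rng.normal(scale=0.1, size=(H, F))

def caption_step(h_prev, x_t, v):
    """One captioning-RNN step: the CNN feature v enters as an extra additive input."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + W_ih @ v)

v = rng.normal(size=F)                # feature from the chopped-off CNN
h = np.zeros(H)
for x_t in rng.normal(size=(5, D)):   # embeddings of the words generated so far
    h = caption_step(h, x_t, v)
```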
Vanilla RNN Gradient Flow
\[h_t = tanh([W_{hh} \quad W_{xh}] \begin{bmatrix}h_{t-1}\\ x_t\end{bmatrix}) = tanh(W\begin{bmatrix}h_{t-1}\\ x_t\end{bmatrix})\]When we stack this over many time steps, a big problem appears: computing the gradient of $h_0$ involves many repeated factors of $W$ and repeated tanh. If the largest singular value of $W$ is > 1, gradients explode; if it is < 1, gradients vanish.
For exploding gradients we can simply rescale the gradient whenever its norm exceeds a threshold (gradient clipping). This gives a biased gradient, but at least it prevents the explosion.
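A minimal sketch of gradient clipping by norm (the threshold value is illustrative):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient so its norm never exceeds `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)  # keep the direction, shrink the magnitude
    return grad
```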
For vanishing gradients, we need to change the RNN architecture; this is why the Long Short-Term Memory (LSTM) was invented.
LSTM
LSTM uses two states: the hidden state $h$ and the cell state $c$.
$i$ : Input gate, whether to write cell
$f$ : Forget gate, whether to erase cell
$o$ : Output gate, how much to reveal cell
$g$ : Gate gate, how much to write to cell
\(\begin{bmatrix}i\\f\\o\\g\end{bmatrix}= \begin{bmatrix}\sigma \\ \sigma \\ \sigma \\ tanh\end{bmatrix}W\begin{bmatrix}h_{t-1}\\x_t\end{bmatrix}\) , Here $\sigma$ is the sigmoid function $\large\sigma(x) = \frac{1}{1+e^{-x}}$
$c_t = f \odot c_{t-1} + i \odot g$
$h_t = o \odot tanh(c_t)$
So $i, f, o \in (0,1)$ and $g \in (-1,1)$. Here $\odot$ denotes elementwise multiplication.
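A minimal NumPy sketch of one LSTM step following the equations above (the sizes and toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: hidden dim H, input dim D; W maps [h_{t-1}; x_t] to the 4 gates.
H, D = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * H, H + D))

def lstm_step(h_prev, c_prev, x_t):
    """One LSTM step: compute i, f, o, g, then update the cell and hidden states."""
    z = W @ np.concatenate([h_prev, x_t])   # pre-activations for i, f, o, g
    i = sigmoid(z[0:H])                     # input gate
    f = sigmoid(z[H:2*H])                   # forget gate
    o = sigmoid(z[2*H:3*H])                 # output gate
    g = np.tanh(z[3*H:4*H])                 # gate gate (candidate values)
    c_t = f * c_prev + i * g                # erase / write the cell (elementwise)
    h_t = o * np.tanh(c_t)                  # reveal part of the cell
    return h_t, c_t

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(6, D)):
    h, c = lstm_step(h, c, x_t)
```

Note how the cell update is additive ($c_t = f \odot c_{t-1} + i \odot g$), which is what lets gradients flow back through many steps without repeated multiplications by $W$.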