Oxford Deep NLP 2017 course - Practical 3: Text Classification with RNNs
[Chris Dyer, Phil Blunsom, Yannis Assael, Brendan Shillingford, Yishu Miao]
In this practical, you can explore one of two applications of RNNs: text classification or language modelling (you are welcome to try both). We will be using the training/dev/test splits that we created in Practical 2.
Last week’s practical introduced text classification as a problem that could be solved with deep learning. The document representation function we used was very simple: an average over the word embeddings in the document. This week, you will use RNNs to compute the representations of the documents.
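For reference, last week's baseline can be written in a few lines. This is a minimal sketch, assuming PyTorch; the sizes, and the use of PyTorch itself, are illustrative assumptions rather than part of the practical.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; substitute your own vocabulary/embedding dimensions.
vocab_size, embed_dim, num_classes = 10_000, 128, 8

embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(embed_dim, num_classes)

tokens = torch.randint(0, vocab_size, (1, 50))  # a dummy 50-token document
x = embedding(tokens).mean(dim=1)               # average over word embeddings
logits = classifier(x)                          # class scores for the document
```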
In the figure below, on the left we show the document representation function that was used in last week’s practical. Your goal in this task is to adapt your code to use the architecture on the right.
Note that in Practical 3, x is defined to be the average of the RNN hidden states (the h_t’s), not their sum.
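A minimal sketch of the new representation function, again assuming PyTorch: the only change from the baseline is that the embeddings are first fed through an RNN, and it is the hidden states h_1, …, h_T that are averaged.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, num_classes = 10_000, 128, 256, 8

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, num_classes)

tokens = torch.randint(0, vocab_size, (1, 50))   # a dummy 50-token document
hidden_states, _ = rnn(embedding(tokens))        # h_1 ... h_T, shape (1, T, hidden_dim)
x = hidden_states.mean(dim=1)                    # x = average of the h_t's
logits = classifier(x)
```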
(Optional, for enthusiastic students) RNNs are expensive to use as “readers” on long sequences. Truncated backpropagation through time (truncated BPTT) can be used to get better parallelism, and you are encouraged to use it for better computational efficiency.
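One common implementation of truncated BPTT is to process a long sequence in fixed-length chunks, carrying the hidden state across chunk boundaries but detaching it from the computation graph so that gradients never flow more than one chunk back. The sketch below reuses `embedding` and `rnn` from above and assumes PyTorch; the chunk length is an arbitrary choice.

```python
bptt_len = 35                                          # arbitrary truncation length
long_tokens = torch.randint(0, vocab_size, (1, 1000))  # one long dummy sequence

hidden = None
for start in range(0, long_tokens.size(1), bptt_len):
    chunk = long_tokens[:, start:start + bptt_len]
    output, hidden = rnn(embedding(chunk), hidden)
    hidden = hidden.detach()  # truncate: no gradient flows into earlier chunks
    # ... compute the loss on this chunk and take an optimiser step here ...
```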
As covered in lecture last week, RNN language models use the chain rule to decompose the probability of a sequence into a product of per-word probabilities, each conditioned on the previously generated words; for a sequence w_1, …, w_T:
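$$p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})$$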
To avoid problems with floating-point underflow, it is customary to compute this quantity in log space.
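Equivalently, the product becomes a sum of log probabilities:

$$\log p(w_1, \dots, w_T) = \sum_{t=1}^{T} \log p(w_t \mid w_1, \dots, w_{t-1})$$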
Given a training sequence, the training graph for a language model looks like this:
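In code, this graph amounts to reading w_t at each step and scoring the prediction of w_{t+1} with a cross-entropy loss. This is a self-contained sketch, assuming PyTorch; all sizes are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 128, 256

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
output_layer = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

sentence = torch.randint(0, vocab_size, (1, 21))     # dummy token ids
inputs, targets = sentence[:, :-1], sentence[:, 1:]  # predict the next word

hidden_states, _ = rnn(embedding(inputs))            # (1, T, hidden_dim)
logits = output_layer(hidden_states)                 # (1, T, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                      # gradients for one update
```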
Your task is to train an RNN language model on the training portion of the TED data, using the validation set to determine when to stop optimising the parameters of the model.
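One standard way to use the validation set for this is early stopping on validation perplexity. In the sketch below, `model`, `train_data`, `valid_data`, `train_one_epoch` (one pass over the training data) and `mean_nll` (average per-word negative log-likelihood, in nats) are all hypothetical names, not functions from any particular library.

```python
import math

best_val_ppl, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    train_one_epoch(model, train_data)               # hypothetical helper
    val_ppl = math.exp(mean_nll(model, valid_data))  # hypothetical helper
    if val_ppl < best_val_ppl:
        best_val_ppl, bad_epochs = val_ppl, 0
        # save a checkpoint of the best model here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation perplexity has stopped improving
```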
A language model can be evaluated quantitatively by computing the (per-word) perplexity of the model on a held-out test corpus,
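$$\text{ppl} = \exp\left(-\frac{1}{|\text{test set}|}\sum_{t=1}^{|\text{test set}|} \log p(w_t \mid w_{<t})\right)$$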
where |test set| is the length of the test set in words, including any <UNK> tokens. (Note: you can measure length in terms of any units, including characters, words, or sentences; these are just different ways of quantifying how much uncertainty the model has about the corresponding units.)
To evaluate the model qualitatively, generate random samples from the model by sampling from p(w_t | w_{<t}) and then feeding the sampled value of w_t back into the RNN at time t+1.
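A minimal sampling loop, reusing the `embedding`, `rnn`, and `output_layer` pieces sketched above; `bos_id` is an assumed start-of-sequence token id.

```python
bos_id, max_len = 0, 50                      # assumed <s> id and sample length
token = torch.tensor([[bos_id]])
hidden, sample = None, []
for _ in range(max_len):
    output, hidden = rnn(embedding(token), hidden)
    probs = torch.softmax(output_layer(output[:, -1]), dim=-1)
    token = torch.multinomial(probs, num_samples=1)  # w_t ~ p(w_t | w_{<t})
    sample.append(token.item())                      # feed w_t back in at t+1
```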
Show your responses, on paper, to a practical demonstrator to get signed off.