# 13 Natural Language Processing with Deep Learning

Welcome to the 13th session of this series on Practical Machine Learning, where we continue our discussion of Deep Learning. So far we have covered a lot of Computer Vision, both within the Deep Learning sessions and across the series as a whole. In this session, we will focus on *Natural Language Processing* (*NLP* for short), specifically from the perspective of Deep Learning.

Before we proceed with NLP, let us discuss what NLP even is, and what it implies in today's world. 

We have already picked up some basics of Natural Language Processing (in Session 4, on Bayesian Learning). But what does the term mean in the first place?

Natural Language Processing is a broad field in Artificial Intelligence that deals with understanding language. This includes building models that understand speech, audio, text, etc. For example, Google Translate, an AI model, somehow knows how to convert a piece of text/speech into another language. You say "Hi Siri" or "Hey Google" or "Alexa", and your home/phone assistant responds to you in an intelligent way. This is where the state of NLP lies today. But there is so much that is yet to be developed and discovered. Let us talk about the aspirations of this field.

NLP is not only about identifying language, but also about understanding it, and currently NLP has not quite reached there. For example, we still don't know how to build models that actually *understand* the semantics of a language. Apart from identifying English, we don't know how to build models that reliably understand sarcasm, emotions, or slightly misspelt/mistyped words. Today's models don't understand context well. In fact, there are models now that can generate entire passages of text. On a superficial inspection, it *seems* that the model generates sensible text, but on closer inspection you would find that much of the text, in most cases, doesn't make great sense. That is why, even in 2021, most AI chatbots don't work well.

To be fair, this is not so much a fault in the modeling techniques used in NLP as a consequence of our (humans') own lack of understanding of how to model language at a very broad level. I expect that in a few years we will be able to develop much better modeling methods to understand language, and with that there will also be new types of models that are more suitable for NLP tasks. Present-day models and optimizers just about do the job, but it is possible that NLP techniques will see some major changes in the coming few years, from the very fundamentals, including optimization techniques. The very basis of communication in human society is language, and unless we figure out how AI can process language effectively, it is hard for it to make its place in society.

But we shouldn't deny that NLP has made great progress over the last few years, and Deep Learning is the reason behind it. If you remember building a text generator in Session 4, you will see how much Deep Learning can improve on that model. In this session, we will learn how to build state-of-the-art language models that generate text and are also capable of classifying texts into categories. We will learn how to build Deep Learning models that are capable of handling text (called *Recurrent Neural Networks*, or *RNNs*), and also about some variants that have been known to perform effectively, including *LSTMs (Long Short Term Memory Models)* and *Transformers*.

Let us begin this session by installing the necessary libraries. We'll be using the fastai library to handle data pipelining and training, so that we can really focus on building our models. And under the hood, we'll essentially be using the PyTorch library, and when you import the fastai library, you're automatically importing all necessary functionalities from the PyTorch library.


In [1]:
!pip install sentencepiece >./temp
!pip install --upgrade fastai >./temp
from fastai.text.all import *

ERROR: torchtext 0.9.1 has requirement torch==1.8.1, but you'll have torch 1.7.1 which is incompatible.


## Setting up the Data

Before starting to explore Language Models, we need a language (text) dataset on which to perform our experiments. We'll be using the [IMDb Movie Reviews](https://ai.stanford.edu/~amaas/data/sentiment/) dataset from Stanford University, which contains movie reviews from IMDb's website. It contains 50,000 labeled datapoints, each with a binary label depicting the sentiment of the review (whether the review is 'positive' or 'negative'), and an additional 50,000 unlabeled movie reviews. You might have guessed that (at least) one of the things that we will build will be a text classifier using this data. But that's not all! We will discuss how to use these unlabeled datapoints as well. We will go much beyond a text classifier, and essentially build a model that understands language in general!

Fastai gives cloud access to download famous datasets, this being one of them. The cloud link can be found by typing `URLs.IMDB`. We can then download the dataset by using the `untar_data` function, which not only downloads the dataset, but also uncompresses the data. Datasets can be in the form of zip files, tar files, 7z files, gzip files, etc.,  but `untar_data` automatically handles all these internally. And finally the function returns the local path where the dataset is stored. 

In [43]:
path=untar_data(URLs.IMDB)

And then we can simply see what is present in the path, by using a custom method of the Path class, `ls`, that enumerates all items in a particular path location. 

In [45]:
path.ls()

(#8) [Path('/root/.fastai/data/imdb/README'),Path('/root/.fastai/data/imdb/tmp_clas'),Path('/root/.fastai/data/imdb/imdb.vocab'),Path('/root/.fastai/data/imdb/train'),Path('/root/.fastai/data/imdb/unsup'),Path('/root/.fastai/data/imdb/tmp_lm'),Path('/root/.fastai/data/imdb/models'),Path('/root/.fastai/data/imdb/test')]

The text files are present in three folders, namely `train`, `test` and `unsup` (for unsupervised, or unlabeled). So before we start talking about any distinction between these folders, let us create a list of *all* text files. Essentially we are also throwing away all labels for now, and just collecting the raw text files. We will discuss why in the following section.

In [4]:
files=get_text_files(path,folders=['train','test','unsup'])

Just for visualization, let us see what these reviews look like.

In [5]:
txt=files[0].open().read()
txt

'I liked this movie I remember there was one very well done scene in this movie where Riff Randell (played by P.J. Soles) is lying in her bed smoking pot and then she begins to visualize that the Ramon'

Using these reviews, we will discuss some preprocessing techniques that are essential to understand before we even jump into building models. Why do we need preprocessing? Because models don't understand words; they only understand numbers. So one logical solution is to map each word in our entire *corpus* (meaning our collection of documents/text files) to a unique number, replace each word with its corresponding number, and use that to build our models.

There are two steps involved in this whole process. First, we need to identify the different words in our text files, and then break our sentences into lists of these individual words, or components. Once we do this, we no longer treat our data as strings containing sentences and paragraphs, but as sequences of small chunks (words). These small chunks are called *Tokens*, and this whole process is called *Tokenization*.

Second, we create a mapping between these words and numbers. But there are a few nuances in these processes that need to be discussed.

So, let us start by creating Tokens of our text. 

### Tokenizing our text

By default, fastai uses a library called [*SpaCy*](https://spacy.io/api/tokenizer) to handle the tokenization process. You might think that the process is as simple as separating words at spaces or punctuation marks. But in practice, it's not as simple as it seems. Here are a few examples of the problems we may face.

* How do we deal with a word like “don’t”? Is it one word or two? 

* What about long medical or chemical words? Should they be split into their separate pieces of meaning? How about hyphenated words? 

* What about languages like German and Polish, which can create really long words from many, many pieces? 


These are just a few examples of the problems that we may face. 
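To see why naive splitting falls short, here is a minimal sketch that uses only Python's `re` module. The function name `naive_tokenize` and the sample sentence are made up for illustration; this is not how SpaCy works internally.

```python
import re

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "I don't live in the U.S. anymore."
naive_tokens = naive_tokenize(sample)
print(naive_tokens)
```

The contraction is torn into `don`, `'` and `t`, and the abbreviation into `U`, `.`, `S`, `.`, so the full stop that ends the sentence can no longer be told apart from the dots inside the abbreviation.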

#### Word Tokenization

SpaCy has a sophisticated API for creating word tokens that can handle special English words (for example, don't is separated into do and n't, and a `.` is treated as a full stop between sentences but not inside special words such as 'U.S.'), URLs, etc. This API can be easily accessed through a fastai class called `WordTokenizer`.

In [6]:
spacy=WordTokenizer() #we store the tokenizer instance in a variable called spacy
toks=first(spacy([txt])) # spacy([txt]) returns a generator over the tokenized texts; first() pulls out the token list for our single review
toks #display the tokens

(#146) ['I','liked','this','movie','I','remember','there','was','one','very'...]

You can see below that SpaCy knows how to handle full stops and the dots inside special abbreviations such as 'U.S.', and that it also handles words like don't properly.

In [7]:
print(first(spacy(["The U.S. dollar $1 is 75.00 rupees today. I don't think you should convert your rupees right now" ]))) # the full stop is a separate token, but U.S. is a single token 

['The', 'U.S.', 'dollar', '$', '1', 'is', '75.00', 'rupees', 'today', '.', 'I', 'do', "n't", 'think', 'you', 'should', 'convert', 'your', 'rupees', 'right', 'now']



But we will be using fastai's `Tokenizer` class, which builds on top of the `WordTokenizer` class. It contains some additional rules (like adding tokens that represent the beginning or the end of the stream, replacing meaningless repetitions of characters, denoting capital letters by a separate token and converting the word to lowercase, etc.).

This is because we would not only like to remove redundant information, but also like to provide our text models with some additional information: when the stream ends (which may indicate to the model to forget whatever it has learnt till now), when the stream begins (which may indicate to the model to start afresh), where a capitalized word is present, or where an unknown (or very rare) word has been replaced with a special token (like `xxunk`).
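As a rough illustration of what such rules do, the sketch below adds a beginning-of-stream marker and flags capitalization with the `xxbos`, `xxmaj` and `xxup` tokens that show up in fastai's output. The helper `add_special_tokens` is hypothetical and heavily simplified; fastai's own rules are implemented differently and cover more cases.

```python
def add_special_tokens(tokens):
    """Simplified stand-in for a few fastai-style rules (not the library's code)."""
    out = ["xxbos"]                          # mark the beginning of the stream
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            out += ["xxup", tok.lower()]     # the whole word was in capitals
        elif len(tok) > 1 and tok[0].isupper():
            out += ["xxmaj", tok.lower()]    # only the first letter was capitalized
        else:
            out.append(tok.lower())          # nothing to flag, just lowercase it
    return out

marked = add_special_tokens(["This", "movie", "is", "GREAT", "I", "think"])
print(marked)
```

The model then sees a single lowercase form of every word, while the information about capitalization is preserved in the extra tokens.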

In [8]:
tkn=Tokenizer(spacy)
tkn(txt)

(#165) ['xxbos','i','liked','this','movie','i','remember','there','was','one'...]

You can see all the rules that fastai applies here. You can find what each one of these does, through the [documentation's page](https://docs.fast.ai/text.core.html#Preprocessing-rules). 

In [9]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]

Or, for those who like to be more hands-on, you can dive right into the source code like so...

In [10]:
??core.replace_all_caps 

So what we've done is essentially built a Word Tokenizer. But now let us talk about a different kind of problem. 

#### Subword Tokenization
 
What if we're dealing with languages like Japanese and Chinese, that don’t use spaces at all, and don’t really have a well-defined idea of a word?

In that case, it may be a better idea to tokenize not words, but commonly occurring parts of words. The problem is, how do we make a function that can identify parts of a word? We can easily make a function to identify spaces, or at the other extreme, a function that separates each character, but what rules do we use to group a few characters? One way is to group the most commonly occurring sequences of characters. This is a lengthy process, because you need to go through the entire corpus multiple times in order to count the frequencies of character sequences.

Now we obviously need to specify some limit on the number of tokens in our vocabulary. Without a limit, the procedure would keep adding ever rarer character sequences as tokens, and the final vocabulary could grow impractically large. So in essence, it is important to set a limit on the vocabulary size, so that it contains only the most commonly occurring subwords. Anything that cannot be represented with this vocabulary is mapped to the unknown (`xxunk`) token.
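The idea of repeatedly grouping the most frequent character sequences can be sketched in a few lines of plain Python. Note the assumptions: this is a byte-pair-encoding style merge loop chosen for its simplicity, the names `learn_merges`, `most_frequent_pair` and `merge_pair` are made up, and the SentencePiece library used later in this session works differently in its details.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def learn_merges(corpus, n_merges):
    """Start from single characters and perform `n_merges` greedy merges."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(n_merges):      # n_merges plays the role of the vocabulary limit
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair[0] + pair[1])
    return merges, words

merges, segmented = learn_merges("low low low lower lowest", n_merges=2)
print(merges, dict(segmented))
```

Each merge requires a fresh pass over the corpus to recount pairs, which is why the process is slow, and the number of merges allowed directly bounds how many subword tokens end up in the vocabulary.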

In [11]:
txts=L(o.open().read() for o in files[:2000]) #reading the text of the first 2000 reviews

We define a function called `subword` that trains a subword tokenizer on our texts (counting up the frequencies of commonly occurring subwords) and then tokenizes our sample review. The exact working is quite complex, but feel free to go into the source code of the `SubwordTokenizer` class.

In [12]:
def subword(sz): #sz is the maximum number of tokens that will be created 
    sp=SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts) #this creates tokens
    return ' '.join(first(sp([txt]))[:40])

In [13]:
subword(1000)

'▁I ▁like d ▁this ▁movie ▁I ▁remember ▁there ▁was ▁one ▁very ▁well ▁done ▁scene ▁in ▁this ▁movie ▁where ▁R i ff ▁R and ell ▁( play ed ▁by ▁P . J . ▁So le s ) ▁is ▁ ly ing'

Here we use a vocabulary limit of 1000 tokens. The `▁` characters (which look like underscores) represent spaces that occur in the original text. This differentiates the spaces that naturally occur in the text documents from the spaces that merely separate subword tokens in the output above.
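Because the original spaces are marked explicitly, the text can be recovered from the subword pieces. A minimal sketch of that round trip is shown here; `detokenize` is a hypothetical helper written for this illustration, not a fastai or SentencePiece function.

```python
def detokenize(pieces):
    """Glue subword pieces together and turn the '▁' markers back into spaces."""
    return "".join(pieces).replace("▁", " ").strip()

pieces = ["▁I", "▁like", "d", "▁this", "▁movie"]
restored = detokenize(pieces)
print(restored)
```

Pieces without a leading marker (such as `d`) are simply attached to the piece before them, which is how `▁like` and `d` become `liked` again.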

Just for comparison, let us compare the tokens that are generated when we change the vocab limit to a shorter length. 

In [15]:
subword(200)

'▁I ▁ li k ed ▁this ▁movie ▁I ▁re m e m b er ▁the re ▁was ▁on e ▁ ver y ▁w e ll ▁ d on e ▁ s ce ne ▁in ▁this ▁movie ▁w h er e'

You may notice that the tokens in the latter case are generally shorter than in the former. That's because when we increase the vocabulary size, we also make room for less and less frequent substrings, which are more likely to be whole words. When we restrict the vocabulary to a smaller size, short subwords (like "on" or "ed") are more likely to make the cut, since they are more frequent than whole words.

Just for a better understanding, we will use an even larger vocabulary size, and you will notice, that it is almost the same as counting entire words as tokens.

In [16]:
subword(10000)

'▁I ▁liked ▁this ▁movie ▁I ▁remember ▁there ▁was ▁one ▁very ▁well ▁done ▁scene ▁in ▁this ▁movie ▁where ▁Riff ▁Rand ell ▁( played ▁by ▁P . J . ▁Soles ) ▁is ▁lying ▁in ▁her ▁bed ▁ smoking ▁pot ▁and ▁then ▁she'

Once we've set up our tokens, we need to *numericalize* them, i.e., convert each token to a unique representative number, because machines can only understand numbers. For the rest of the session, we will be using word tokens, rather than subword tokens.

### Numericalizing the tokens

For simplicity, let us take a smaller corpus and convert it to tokens using the word tokenizer provided by fastai.

In [18]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#165) ['xxbos','i','liked','this','movie','i','remember','there','was','one'...]

To map each token to a unique number, we can use fastai's `Numericalize` class. This class also provides us with an attribute `vocab` to map the numericalized tokens back to their original values (words). 

In [19]:
num=Numericalize()
num.setup(toks200) #the setup function does the mapping

Let us see what the numericalized tokens look like. The `num` object takes a list of tokens and, using the vocabulary built from the corpus (`toks200`), replaces each token with its corresponding number. Any token that isn't in the vocabulary is replaced with the unknown token (`xxunk`).

In [24]:
nums=num(toks)
nums

TensorText([   0,  220,   21,   34,    0,  442,   64,   32,   40,   54,   75,  313,
         184,   17,   21,   34,  135,    0,    0,   33,  157,   42,    0,    0,
          31,   16,    0,   17,   63, 1275,    0,    0,   12,  129,   80,  677,
          15,    0,   20,    9,    0,   39,   17,    9,  678,   29,   63, 1058,
           9,  769,   25,    0,    0,    0,    0,   25,    0,   54,   54,  415,
           0,    0,    0,   32,  198,   10,    0,   10, 1610,   12,  415,   11,
           0,    0,  272,  888,   20,    9,  263,   16,  101,   24,  101,  158,
           9,  264,   12,  299,    0,    0,   28,   18,  102,   43,  612,   96,
          18,   16,  198,   21,   16,   13,   54,  198,   34,   11,    0,   22,
           0,   10,    0,   12,    0,    0,    0, 1278,    0,    0,    0,    0,
         285,   20,    0,    0,   32,    9,  890,   50,   32,    0,   15,  392,
          17,   21,  443,    0,    0,   79,   41,  153,   71,   21,   16,  273,
          69,   41])

We can even map back these tokens to the original text using the `vocab` attribute of the Numericalize class.

In [25]:
' '.join(num.vocab[o] for o in nums) #notice the frequent xxunk's. That is because of the words that were present in toks, but not in the corpus (toks200)

'xxunk liked this movie xxunk remember there was one very well done scene in this movie where xxunk xxunk ( played by xxunk xxunk ) is xxunk in her bed xxunk xxunk and then she begins to xxunk that the xxunk are in the room with her sing the song " xxunk xxunk xxunk xxunk " xxunk very very cool xxunk xxunk xxunk was fun , xxunk , quirky and cool . xxunk xxunk \'ll admit that the ending is way - way over the top and far xxunk xxunk but it does n\'t matter because it is fun this is a very fun movie . xxunk \'s xxunk , xxunk and xxunk xxunk xxunk forever xxunk xxunk xxunk xxunk read that xxunk xxunk was the band who was xxunk to star in this .. xxunk xxunk do not know if this is true or not'
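The round trip above can be mimicked in plain Python to make the mechanics concrete. This is only a sketch under stated assumptions: `build_vocab` and `numericalize` are made-up helpers, the tiny corpus is invented, and the frequency cutoff of 2 is arbitrary (fastai's `Numericalize` exposes a similar cutoff through its `min_freq` parameter).

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2):
    """Keep tokens seen at least `min_freq` times; index 0 is reserved for xxunk."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, n in counts.most_common() if n >= min_freq]
    return ["xxunk"] + kept

def numericalize(tokens, vocab):
    """Replace each token with its index; unseen tokens map to 0 (xxunk)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

corpus = [["i", "liked", "this", "movie"], ["i", "hated", "this", "movie"]]
vocab = build_vocab(corpus)
ids = numericalize(["I", "liked", "this", "movie"], vocab)
decoded = " ".join(vocab[i] for i in ids)
print(vocab, ids, decoded)
```

Two different things produce `xxunk` here: `liked` is too rare to make it into the vocabulary, and the capitalized `I` never appeared in the lowercased corpus at all. The second effect is the same one visible in the output above, where the tokens came from the plain word tokenizer while the vocabulary was built from lowercased text.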

## Creating our Language Model

We have understood the key ideas behind preprocessing our text. Now we need to feed the data into a model, with the eventual goal of creating our text classifier.

The fundamental argument is as follows. We know that transfer learning generally helps improve model performance. In the case of images, we used a pretrained model trained on ImageNet or some other dataset, the reason being that we know that training on a larger dataset helps the model understand basic image features that we would expect from the model over any standard image classification problem, like, features of humans, or trees in general. 

The NLP equivalent of that would be to use weights from a model that already knows the innards of the English language, like, for example, basic grammar, sentence construction, the meanings of punctuation marks, context and so on. This is called a *Language Model*, i.e., a model that understands language in a very general sense. The pretrained weights can then be used to fine-tune our text classifier.

In the case of images, we pretrain our model on a supervised learning dataset, ie a dataset that itself brings forth a supervised learning problem (classification). Turns out that you don't need explicit labels to create language models that understand language. You can train the model without any explicit labels. Essentially we're training the language model using a *self supervised learning* approach. Self Supervised Learning is a type of learning where the task is to learn the general features of the data, rather than doing a mapping from some input to some output. Let us discuss how this is done. 

One of the most common datasets used for pretraining language models is [WikiText103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/), a dataset containing lots of text from random articles on the Wikipedia website (over 100 million tokens in total). The idea behind this dataset is that, no matter what your final task is, if it involves English, your model needs to understand the basic grammatical, syntactical and logical rules of the language. Even though Wikipedia may use more formal language than the task you wish to do, it is safe to assume that there is still a lot to learn about the basic rules of English from Wikipedia, and this will help our final model, no matter what. So our task is to get a model that has been pretrained on a dataset like WikiText103.

But how is this model even trained in the first place? As we mentioned, we use a self-supervised learning technique. Obviously the neural net that was used to train this model uses some sort of input/output mapping (because that is how neural nets work, there is no other way). But then, how do we make neural nets work without explicitly providing an output? Well, this is done by making the output the same as the input (or of the same type as the input), in which case the neural net is essentially learning a mapping from the input to the input itself. This is another broad field in Machine Learning, known as Representation Learning.

In our specific context, we feed a series of tokens as our input, and the target output is the same text, just offset by one token. So our model tries to predict the next token at every position whenever a piece of text is given. This is how the self-supervised learning approach works in the context of NLP.
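The offset-by-one construction is easy to reproduce by hand. The sketch below is a toy version with made-up names (`next_token_pairs`, `stream`); it only illustrates how inputs and targets are derived from the same text without any labels.

```python
def next_token_pairs(tokens, seq_len):
    """Cut a token stream into (input, target) windows; target = input shifted by one."""
    pairs = []
    for start in range(0, len(tokens) - seq_len, seq_len):
        x = tokens[start:start + seq_len]
        y = tokens[start + 1:start + seq_len + 1]
        pairs.append((x, y))
    return pairs

stream = "xxbos i liked this movie because it was fun".split()
pairs = next_token_pairs(stream, seq_len=4)
for x, y in pairs:
    print(x, "->", y)
```

At every position the target is simply the token that follows in the text, so the text itself supplies the "labels".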

Here is what this looks like. This mapping from a piece of text to the same text offset by one token is done automatically by the `LMDataLoader` class. It is essentially a child class of the DataLoader class, whose job is to break the data into batches and feed it into the model.

One more important thing to note is that, in the case of images, the dataloader usually also shuffles the data, so that the model doesn't learn the order of the images. But here, we can't shuffle the text tokens, or they won't make sense anymore. `LMDataLoader` takes care of that for us.
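One way to batch text without breaking its order is to cut the token stream into as many contiguous rows as the batch size, and then read all rows in parallel, a fixed number of tokens at a time. The sketch below illustrates that idea with integers standing in for tokens; `lm_batches` is a hypothetical helper, and the real `LMDataLoader` handles details that are omitted here.

```python
def lm_batches(ids, bs, seq_len):
    """Split a stream into `bs` contiguous rows, then slice rows into batches."""
    n = (len(ids) - 1) // bs                  # tokens per row (one extra kept for targets)
    rows = [ids[i * n:(i + 1) * n + 1] for i in range(bs)]
    batches = []
    for start in range(0, n, seq_len):
        end = min(start + seq_len, n)
        x = [row[start:end] for row in rows]            # inputs
        y = [row[start + 1:end + 1] for row in rows]    # same text, offset by one
        batches.append((x, y))
    return batches

batches = lm_batches(list(range(13)), bs=2, seq_len=3)
for x, y in batches:
    print(x, y)
```

Row `i` of each batch picks up exactly where row `i` of the previous batch stopped, so the model always reads tokens in their original order even though the data arrives in batches.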

In [26]:
nums200=toks200.map(num)
dl=LMDataLoader(nums200)

In [27]:
x,y=first(dl)
x.shape,y.shape #each batch has 64 items, each containing a fixed number of tokens

(torch.Size([64, 72]), torch.Size([64, 72]))

You can see that x and y are offset by only one token. Our Neural Net only needs to learn how to take in a sequence of text, and predict the next sequence of words.

In [None]:
x[0],y[0]

(LMTensorText([   2,    8,    0,    8, 1066,  249,  147,  611,   10,   40, 1600,   30,   40,    0,  163,   12,    9,    0,   14,  667,   10,   12,    0,  108,  254,   10,  668,  104,  667, 1601,    9,  767,
           14,   12,    0,    0,    8,  366,   11,   25,    8,    9,   95,   16, 1262,   13,  266,    0,   10,   12,  151,   60, 1602,    0,   39,   20,   10,   19,  288,   20,   18,   68, 1067,    9,
          120,    0,  255,   53,   79, 1263,   12, 1603]),
 TensorText([   8,    0,    8, 1066,  249,  147,  611,   10,   40, 1600,   30,   40,    0,  163,   12,    9,    0,   14,  667,   10,   12,    0,  108,  254,   10,  668,  104,  667, 1601,    9,  767,   14,
           12,    0,    0,    8,  366,   11,   25,    8,    9,   95,   16, 1262,   13,  266,    0,   10,   12,  151,   60, 1602,    0,   39,   20,   10,   19,  288,   20,   18,   68, 1067,    9,  120,
            0,  255,   53,   79, 1263,   12, 1603,   14]))

Let us also map these tokens back to words, because we can't visualize numbers as well as words themselves. 

In [28]:
' '.join(num.vocab[o] for o in x[0])

'xxbos i liked this movie i remember there was one very well done scene in this movie where xxmaj xxunk xxmaj xxunk ( played by xxup xxunk . xxmaj soles ) is xxunk in her bed xxunk xxunk and then she begins to xxunk that the xxmaj ramones are in the room with her sing the song " i xxmaj want xxmaj you xxmaj around " … very very cool stuff .'

In [29]:
' '.join(num.vocab[o] for o in y[0])

'i liked this movie i remember there was one very well done scene in this movie where xxmaj xxunk xxmaj xxunk ( played by xxup xxunk . xxmaj soles ) is xxunk in her bed xxunk xxunk and then she begins to xxunk that the xxmaj ramones are in the room with her sing the song " i xxmaj want xxmaj you xxmaj around " … very very cool stuff . \n\n'


Now once you have some pretrained weights, you can use those in the final task, which in this case is text classification (classification of IMDb reviews into positive and negative reviews). All you have to do now is change the task. Now instead of the target output being a sequence of text, it will be classes. Sounds straightforward!

Let us go one step beyond.

We will be discussing an interesting Language Modeling approach called *ULMFiT* (Universal Language Model Fine-tuning), which was introduced in [this](https://arxiv.org/abs/1801.06146) paper. Let us discuss the key ideas from this paper through our example of creating an IMDb text classifier. 

![](https://drive.google.com/uc?id=1itzIngHaDIyVWhyBz2oEia8tN_OsegvZ)

After creating the WikiText language model, which understands the English language, our problem is this: the model has also, in a sense, overfitted to Wikipedia text. Its style of language is more formal, and it's more likely to predict text that you would see on Wikipedia than movie-review-style text. The paper introduced a clever solution to that.

As an extension to the self-supervised pretraining stage, why not *fine-tune* the language model on our specific dataset, so that the language model learns the style of text that is more common in the target dataset (IMDb, in this case), and carry on with the classification task from that point onwards? Let us see how to do that!


### Finetuning a Language Model

Just for the purpose of language modeling, we don't need any explicit labels. So we gather *all* the data that we can find, whether it is from the training set, the validation set, or the extra unlabeled data that is present in the IMDb dataset.

Now, you may be wondering how come we're training the model on the validation set. That would be a concern if we were performing the classification task, but here we are creating a language model, which does not involve the sentiment labels at all. The classification labels of those reviews are never seen at this stage (we still hold out a random 10% of the pooled text to monitor the language model itself). In this case, it is wise to simply pool all the data you can find and build the language model on top of that.

In [30]:
get_imdb=partial(get_text_files,folders=['train','test','unsup'])

In [31]:
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path,is_lm=True), #is_lm tells that explicit labels are not required, and the target would simply be text offset by one token
    get_items=get_imdb, splitter = RandomSplitter(0.1)
).dataloaders(path,path=path,bs=128,seq_len=80) #each batch holds 128 sequences, and each sequence is 80 tokens long

We can see what the data looks like, along with the target texts.

In [32]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj this is one of those movies that i can watch again and again and not get tired of it . xxmaj it is by far one of the best comic book adaptations ever . i liked this one even more than xxmaj x - men . xxmaj in fact , this movie is sort of a cross between xxmaj x - men and the matrix and it came out before either . xxmaj wesley xxmaj snipes does a","xxmaj this is one of those movies that i can watch again and again and not get tired of it . xxmaj it is by far one of the best comic book adaptations ever . i liked this one even more than xxmaj x - men . xxmaj in fact , this movie is sort of a cross between xxmaj x - men and the matrix and it came out before either . xxmaj wesley xxmaj snipes does a great"
1,"in two parts : the first one tells the ' saga ' of two morons hitchhiking and being kidnapped by a psychotic killer . xxmaj the second one is about a group of persons , including the two morons , trapped in a bar , surrounded by the legion of the dead leaded by xxmaj xxunk . i believe this flick was supposed to be funny , but it is not . xxmaj the viewer can only laugh how imbecile","two parts : the first one tells the ' saga ' of two morons hitchhiking and being kidnapped by a psychotic killer . xxmaj the second one is about a group of persons , including the two morons , trapped in a bar , surrounded by the legion of the dead leaded by xxmaj xxunk . i believe this flick was supposed to be funny , but it is not . xxmaj the viewer can only laugh how imbecile a"


After that, we can simply train this model as before, using the Learner class. We will be using an advanced architecture called *AWD-LSTM* (which we will be implementing later on); the weights that fastai loads for it have already been pretrained on WikiText103.

We use a new metric here called Perplexity, which is often used in NLP tasks. It is nothing but the exponential of the cross-entropy loss (`torch.exp(cross_entropy)`), and depicts how well the model has been able to predict the next token.
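To get a feel for the numbers, here is a small worked example in plain Python. The probabilities are invented for illustration, and `perplexity` is a hand-written helper, not fastai's `Perplexity` metric.

```python
import math

def perplexity(correct_token_probs):
    """exp of the mean negative log-probability given to the correct next tokens."""
    cross_entropy = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(cross_entropy)

uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])  # model torn between 4 options
confident = perplexity([0.9, 0.8, 0.95])              # model mostly sure of the next token
print(uniform_guess, confident)
```

A model that always spreads its belief evenly over four candidates has a perplexity of 4, so perplexity can be read as the effective number of choices the model is hesitating between. Lower is better, and a perfect model would score 1.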


In [33]:
learn=language_model_learner(
    dls_lm,arch=AWD_LSTM,drop_mult=0.3,
    metrics=[accuracy,Perplexity()]).to_fp16()

Before we begin fine tuning our model, let us see how it performs using the pretrained WIKITEXT weights. We would expect it not to be like Movie reviews, but a more formal language. 

In [34]:
TEXT=" I liked this movie because"
N_WORDS=40
N_SENTENCES=2
preds=[learn.predict(TEXT,N_WORDS,temperature=0.75) for _ in range(N_SENTENCES)]
print("\n".join(preds))

i liked this movie because it was about " a book with a lot of horror or about it . " This was in reality a psychology game with a prominent hero . These movies revolved around a gangster who falls in love
i liked this movie because it was a " true . " The movie was a success and a success . It was the first of many films to be produced by Warner Brothers . The film was shot in
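The `temperature=0.75` argument passed to `learn.predict` controls how adventurous the sampling of the next token is. The sketch below shows the common formulation of temperature (dividing the scores by the temperature before a softmax); the scores are hypothetical and this is not fastai's internal code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate next tokens
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.75)
print(p_default, p_sharp)
```

With a temperature below 1, the most likely token receives an even larger share of the probability, so the generated text is less random; a temperature above 1 has the opposite effect.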


Now let us perform the fine-tuning task. It's a slightly lengthy process, just because of the sheer size of the data, so you can run the next few cells, leave your PC, have lunch or take a walk, and come back.

In [35]:
learn.fit_one_cycle(1,2e-2) #As we've seen before, fastai trains the model in two stages. The first stage involves finetuning only the final few layers, which have been randomly initialized

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.019379,3.892854,0.301122,49.050701,27:08


In [None]:
learn.save('1epoch') #saving the model state to a file
learn.load('1epoch')

<fastai.text.learner.LMLearner at 0x7fa20660d250>

In [None]:
learn.unfreeze() #making the rest of the model trainable, by setting their `requires_grad` as True (since only then can their parameters be updated)
learn.fit_one_cycle(10,2e-3) #this is the second stage of training

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.752873,3.757712,0.316988,42.850266,23:43
1,3.709863,3.698074,0.324047,40.369473,23:44
2,3.622598,3.649691,0.329118,38.462784,23:46
3,3.570662,3.61911,0.332994,37.304348,23:51
4,3.484094,3.598005,0.335988,36.525288,23:59
5,3.426205,3.580078,0.337797,35.876343,23:55
6,3.374929,3.572237,0.339738,35.596115,23:57
7,3.297278,3.5671,0.340749,35.413754,23:48
8,3.239034,3.571592,0.340771,35.573185,23:58
9,3.21679,3.575822,0.340604,35.723976,23:57


In [36]:
path_mdl=Path('.')

You can optionally save it to your Google Drive to access it later. If you wish to do this, uncomment and run the following lines of code, and change the path to your desired location on your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/gdrive')
# path_mdl=Path('/content/gdrive/My Drive/') # modify this to change the location

In [37]:
learn.save_encoder(path_mdl/'finetuned_encoder')

Before building the *classifier*, let's see how the fine-tuned language model performs.

In [38]:
TEXT=" I liked this movie because"
N_WORDS=40
N_SENTENCES=2
preds=[learn.predict(TEXT,N_WORDS,temperature=0.75) for _ in range(N_SENTENCES)]
print("\n".join(preds))

i liked this movie because it was so interesting . i was very impressed by it , did it a lot more better . Yes , i was able to get a second chance . First , that was what i actually thought
i liked this movie because it was true to the myths of Clive Barker and many of Clive Barker 's stories . The story was good . 

 The characters are quite well developed . 

 This is


You can see how well this performs in comparison to the text generation model we created in Session 4, on Bayesian Learning. Not only are the grammar and syntax much better, but the text also makes far more sense, and you have built a much stronger text generator with only a few hours of training.

The next step is to build our text classifier. 

## Text classifier

Now we need to reinitialize the dataloaders, because we need to tell them that our target is no longer a sequence of tokens, but a category.

In [46]:
dls_classification = DataBlock(
    blocks=(TextBlock.from_folder(path,vocab=dls_lm.vocab),CategoryBlock), #we don't pass is_lm=True to TextBlock.from_folder
    #we also pass in the vocab here, because otherwise the vocab generated here may not match the vocab used in the fine-tuning task, which would lead to results that make no sense
    get_y=parent_label,
    get_items=partial(get_text_files,folders=['train','test']), #now we don't use the unsupervised text anymore
    splitter=GrandparentSplitter(valid_name='test') 
).dataloaders(path,path=path,bs=128,seq_len=72)

And let's see what our data will look like now, along with the targets.

In [47]:
dls_classification.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj some have praised xxunk xxmaj lost xxmaj xxunk as a xxmaj disney adventure for adults . i do n't think so -- at least not for thinking adults . \n\n xxmaj this script suggests a beginning as a live - action movie , that struck someone as the type of crap you can not sell to adults anymore . xxmaj the "" crack staff "" of many older adventure movies has been done well before , ( think xxmaj the xxmaj dirty xxmaj dozen ) but xxunk represents one of the worse films in that motif . xxmaj the characters are weak . xxmaj even the background that each member trots out seems stock and awkward at best . xxmaj an xxup md / xxmaj medicine xxmaj man , a tomboy mechanic whose father always wanted sons , if we have not at least seen these before ,",neg
2,"xxbos xxmaj warning : xxmaj does contain spoilers . \n\n xxmaj open xxmaj your xxmaj eyes \n\n xxmaj if you have not seen this film and plan on doing so , just stop reading here and take my word for it . xxmaj you have to see this film . i have seen it four times so far and i still have n't made up my mind as to what exactly happened in the film . xxmaj that is all i am going to say because if you have not seen this film , then stop reading right now . \n\n xxmaj if you are still reading then i am going to pose some questions to you and maybe if anyone has any answers you can email me and let me know what you think . \n\n i remember my xxmaj grade 11 xxmaj english teacher quite well . xxmaj",pos


Now that we've changed the dataloaders, we can train the model in the same way. We do need to import the pretrained network weights, which will help us identify *features* of the language. So essentially, the pretrained model will act as the encoder. We can load the weights into a new learner using the `load_encoder` method of the learner class. 

In [48]:
learn=text_classifier_learner(dls_classification,AWD_LSTM,drop_mult=0.5, metrics=accuracy).to_fp16()

In [49]:
learn = learn.load_encoder(path_mdl/'finetuned_encoder')

And let us finally train the model. 

In [None]:
learn.fit_one_cycle(1,2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.256699,0.189843,0.92588,01:11


In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1,slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.237436,0.168869,0.9342,01:14


In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(2,slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.199224,0.159543,0.9386,01:31
1,0.188962,0.157887,0.93896,01:32


In [None]:
learn.unfreeze()
learn.fit_one_cycle(2,slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.185853,0.153275,0.9418,01:52
1,0.166478,0.154529,0.94292,01:53


That's a wonderful accuracy, and it is close to the best accuracy reported on this dataset. The best results are about 96%, and involve some very complex data augmentation techniques, including translating the text to another language and then translating it back to the original language. But let us not go into such complicated endeavors for now. This accuracy is wonderful in itself!


Now that we have learnt how to use these models, let us build them from scratch. We'll start with a simple neural net that can handle sequences of text, called an *RNN*, or Recurrent Neural Network. From there, we will go on to build more advanced models.

## Creating RNNs from scratch

Before we create an RNN, let us first set up some basic type of data. The IMDb dataset is too large for experiments, and as you may have seen, even simple models take many hours to train because of the size of the dataset. So for simplicity, we choose a dataset that simply includes the first 10,000 numbers written sequentially in word form. 

### Setting up a very basic dataset 

In [39]:
path=untar_data(URLs.HUMAN_NUMBERS)
path.ls()

(#2) [Path('/root/.fastai/data/human_numbers/valid.txt'),Path('/root/.fastai/data/human_numbers/train.txt')]

In [40]:
#initially lets just join both the training and the validation dataset
lines=L() #empty list
with open(path/'train.txt') as f: lines+=L(*f.readlines())
with open(path/'valid.txt') as f: lines+=L(*f.readlines())
lines

(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

Instead of spaces and newline characters, let us replace all separators with one common separator, ' . '.

In [52]:
text = ' . '.join([l.strip() for l in lines]) #strip removes leading and trailing whitespace (spaces, newlines, tabs, etc.)
text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

In [53]:
#for demonstration purpose
'  asd   \n \t'.strip()

'asd'

And let us now split this text into tokens. Note that the full stop is meant to be part of the sequence as it tells the model when one number finishes, and the next starts. 

In [54]:
tokens=text.split(' ')
tokens[:10]

['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

And since this text contains many repetitions of words, we will first create a vocabulary of unique words. 

In [55]:
#to numericalize, we need to create a list of all unique words (aka vocab)
vocab=L(*tokens).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

So this tells us, that in the first 10000 numbers written in words, there are only 29 unique words, plus one for the full stop. 

In [56]:
print(vocab)

['one', '.', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', 'hundred', 'thousand']


And now we need to create a mapping from words to numbers. We store these mappings in a dictionary, called `word2idx`. 

In [57]:
word2idx={w:i for i,w in enumerate(vocab)}
nums=L(word2idx[i] for i in tokens)
nums

(#63095) [0,1,2,1,3,1,4,1,5,1...]
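To make the mapping concrete, here is a minimal sketch in plain Python on a toy token list (the `toy_` names and the token list are made up for illustration; they are not part of the notebook's data). It also shows that decoding needs nothing more than the vocab list itself.

```python
# A toy token stream standing in for the real `tokens` list (illustrative only)
toy_tokens = ['one', '.', 'two', '.', 'three', '.', 'one', '.']

# Unique tokens in order of first appearance, like the vocab we built above
toy_vocab = list(dict.fromkeys(toy_tokens))

toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}   # word -> index
toy_nums = [toy_word2idx[w] for w in toy_tokens]         # numericalized text

# Decoding only needs the vocab list, since position i holds the word for index i
decoded = [toy_vocab[i] for i in toy_nums]

print(toy_vocab)   # ['one', '.', 'two', 'three']
print(toy_nums)    # [0, 1, 2, 1, 3, 1, 0, 1]
```

Going from indexes back to words recovers the original token stream exactly, so no information is lost by numericalizing.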

### Building a simple RNN

Just for experimentation and comparison, let us build a model that resembles the one we built in Session 4. If you remember, the exercise involved building a model that predicted the next word given the 2 previous words. In this case, we will build a model that predicts the next word given the 3 previous words. For that, all we need to do is arrange the dataset so that the input x is a tensor of 3 word indexes and the output y is the index of the next word.

In [58]:
# creating a simple Neural Net that predicts next word on the basis of the previous three words. Let us set up the dataset for that purpose

#right now lets just test it out on tokens, to see what it will look like in the end
L((tokens[i:i+3],tokens[i+3]) for i in range(0,len(tokens)-4,3))

(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

In [59]:
#now let us do it for real. Remember we need tensors for PyTorch. Also tensors will take the numericalized tokens
seqs=L((tensor(nums[i:i+3]),nums[i+3]) for i in range(0,len(nums)-4,3))
seqs 
#it's just a classification problem now, with a tensor of 3 token indexes as input and a single number ranging from 0 to len(vocab)-1 as output

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]

Once we set up the dataset, we need to put them in DataLoaders. 

In [62]:
#feed these into a Dataloader.
bs=64
cut=int(len(seqs)*0.8)
dls=DataLoaders.from_dsets(seqs[:cut],seqs[cut:],bs=64,shuffle=False) #in language models, we shouldnt shuffle the sequence of words

Let us understand how an RNN is built. An RNN is essentially a looping Neural Network. It takes in one input at a time, and passes it through a Linear Layer to get activations. These activations are added to the next word in the sequence, and the result is again passed through the Linear Layer to get new activations. Here we have three words as input, so after repeating this process for a total of three times, we will get some activations from the Linear Layers, which can be passed to a final linear layer (that outputs some probability about which token is the final prediction). This is what an RNN is.

![](https://drive.google.com/uc?id=1SprFLnlXXIvTarkaR5XSGrMShrUNdCLe)

In comparison to a standard fully connected Neural Network, where we had an input layer, multiple hidden layers, and one final output layer, an RNN is simply the input layer, the output layer, and one single hidden layer, the output of which is fed into itself again and again. Hence the name *Recurrent* Neural Network. 

There is one small tweak in this. 

Instead of directly feeding the tokens as the input, what if we had the ability to represent each token as a much more sophisticated tensor of features? We call this an Embedding Matrix, which basically maps each token to a new tensor containing different features. And these features can be learned through standard optimization techniques. 


![](https://drive.google.com/uc?id=1U6EXZ8uFikW-i12qet9sIi5dn37GeWS2)

This basically helps us represent more information in comparison to one single token number. So this is essentially a mapping from a token number to an entire tensor of learnable parameters.
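One way to see that an embedding is simply a learnable lookup table: picking row `i` of the matrix gives the same result as multiplying a one-hot vector for token `i` by that matrix. The sketch below checks this in plain Python. The matrix values and sizes are made up for illustration, not learned weights, and `one_hot` and `vec_mat` are small helper functions written just for this example.

```python
# Toy embedding matrix: 4 tokens in the vocab, 3 features per token (values are made up)
emb = [
    [0.1, 0.2, 0.3],   # token 0
    [0.4, 0.5, 0.6],   # token 1
    [0.7, 0.8, 0.9],   # token 2
    [1.0, 1.1, 1.2],   # token 3
]

def one_hot(idx, size):
    """Vector of zeros with a single 1.0 at position idx."""
    return [1.0 if i == idx else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector (length n) by an n x m matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

token = 2
by_lookup = emb[token]                               # what an embedding layer does
by_matmul = vec_mat(one_hot(token, len(emb)), emb)   # the equivalent matrix product

print(by_lookup)
print(by_matmul)
```

The lookup is far cheaper than the matrix product, which is why an embedding layer indexes into the matrix directly instead of building one-hot vectors.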

In [60]:
#refactoring the first LM model
class LMModel1(Module):
    def __init__(self,vocab_sz,n_hidden):
        self.i_h=nn.Embedding(vocab_sz,n_hidden) #embedding layer
        self.h_h=nn.Linear(n_hidden,n_hidden) #linear layer: to create activations for the next word
        self.h_o=nn.Linear(n_hidden,vocab_sz) #final layer to predict the fourth word in the end

    def forward(self,x):
        h=0
        for i in range(3):
            h=h+self.i_h(x[:,i])
            h=F.relu(self.h_h(h))
        return self.h_o(h)

Let us train this model using our Learner class, and see how it performs.

In [63]:
learn=Learner(dls,LMModel1(len(vocab),64),loss_func=F.cross_entropy,metrics=accuracy)
learn.fit_one_cycle(3,1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.634714,1.883425,0.464702,00:01
1,1.420086,1.704659,0.480628,00:01
2,1.418769,1.593465,0.489422,00:01


This may look like a terrible performance, but let us compare this with a model that would make completely random predictions. 

In [None]:
1/len(vocab)

0.03333333333333333

So you can see that the model has learnt at least something.

Now let us compare this with a slightly more sophisticated dummy model, which always predicts the most commonly occurring word.

In [64]:
#checking if this model is any good at all
#comparing with a model that would only predict the most commonly occurring word
n,counts=0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n+=y.shape[0]
    for i in range_of(vocab):counts[i]+=(y==i).long().sum()
idx=torch.argmax(counts)
idx,vocab[idx.item()],counts[idx].item()/n


(tensor(29), 'thousand', 0.15165200855716662)

`thousand` is the most common word. If we always predicted `thousand`, we would get an accuracy of only about 15%, so the RNN is doing reasonably well.

### Improving the RNN

Let us first make a simple tweak to this model: instead of resetting the hidden state to zero for every batch, we keep it in `self.h`, so the model can carry context from one batch to the next. The catch is that PyTorch keeps track of every computation that produced a tensor, so the computation history of `self.h` would keep growing with each batch and eat up GPU memory. Once a batch has been propagated through the three steps, we don't need the computational history from previous batches. So we simply *detach* the hidden state from that history, and it will no longer occupy space on the GPU.

In [69]:
class LMModel2(Module):
    def __init__(self,vocab_sz,n_hidden):
        self.i_h=nn.Embedding(vocab_sz,n_hidden) #embedding layer
        self.h_h=nn.Linear(n_hidden,n_hidden) #linear layer: to create activations for the next word
        self.h_o=nn.Linear(n_hidden,vocab_sz) #final layer to predict the fourth word in the end
        self.h=0

    def forward(self,x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h=F.relu(self.h_h(self.h))
        out =self.h_o(self.h)
        self.h=self.h.detach()
        return out
    
    def reset(self): self.h =0

To use this model, we need to make sure that all batches are sequentially ordered. So let us build this functionality for our purpose. 


In [65]:
bs=64
m=len(seqs)//bs
m,bs,len(seqs)

(328, 64, 21031)

In [66]:
def group_chunks(ds,bs):
    m=len(ds)//bs
    new_ds=L()
    for i in range(m): new_ds+=L(ds[i + m*j] for j in range(bs))
    return new_ds

In [67]:
cut = int(len(seqs)*0.8)
dls=DataLoaders.from_dsets(
    group_chunks(seqs[:cut],bs),
    group_chunks(seqs[cut:],bs),
    bs=bs,drop_last=True,shuffle=False)
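To see what this reordering achieves, here is a small sketch in plain Python. `group_chunks_py` is a hypothetical plain-list copy of `group_chunks` (same indexing, no fastai needed), and the numbers 0 to 11 stand in for 12 consecutive sequences.

```python
def group_chunks_py(ds, bs):
    """Plain-list version of group_chunks (same indexing, for illustration)."""
    m = len(ds) // bs
    new_ds = []
    for i in range(m):
        new_ds += [ds[i + m * j] for j in range(bs)]
    return new_ds

ds = list(range(12))      # stand-in for 12 consecutive sequences
bs = 3                    # toy batch size
ordered = group_chunks_py(ds, bs)

# Cut the reordered list into batches, as an unshuffled DataLoader would
batches = [ordered[k:k + bs] for k in range(0, len(ordered), bs)]
for b in batches:
    print(b)
# [0, 4, 8]
# [1, 5, 9]
# [2, 6, 10]
# [3, 7, 11]
```

Reading down any one position, say the first item of every batch, gives 0, 1, 2, 3. Each row of the batch continues where it left off in the previous batch, which is exactly what the stored hidden state relies on.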

Let us train our model using this little tweak. You will also notice a callback (`cbs`) named `ModelResetter`. A callback is essentially a function called at a certain stage of training (like after an epoch ends, or after the loss is calculated, or after the parameters are updated, and so on). In this case, `ModelResetter` calls the model's `reset` method before each training/validation cycle begins, and also at the very end of training. This helps the model start afresh. Don't worry about the details for now.

In [74]:
??ModelResetter

In [70]:
learn=Learner(dls,LMModel2(len(vocab),64),loss_func=F.cross_entropy,metrics=accuracy,cbs=ModelResetter)
learn.fit_one_cycle(10,3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.719994,1.873316,0.464663,00:01
1,1.274188,1.764337,0.439423,00:01
2,1.083032,1.676496,0.521635,00:01
3,0.988174,1.657305,0.538221,00:01
4,0.946185,1.811629,0.557452,00:01
5,0.900589,1.713292,0.552885,00:01
6,0.856381,1.791603,0.58125,00:01
7,0.821015,1.769067,0.595673,00:01
8,0.789512,1.822121,0.595913,00:01
9,0.776309,1.781443,0.598798,00:01


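This is already better. The gain most likely comes from keeping the hidden state between batches, so the model can use context from earlier tokens instead of starting from zero every three tokens. Detaching the history simply keeps the memory cost of doing so bounded.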

Now let us make another small tweak which will help us improve our accuracy further. But first, let us also make our task more challenging. Instead of 3 tokens, we will feed 16 tokens at once into the model, and predict the next token after each of them. The sequence length is a variable, so you are free to change it.

In [71]:
sl=16 #sequence length
seqs=L((tensor(nums[i:i+sl]),tensor(nums[i+1:i+sl+1])) for i in range(0,len(nums)-sl-1,sl))

cut=int(len(seqs)*0.8)
dls=DataLoaders.from_dsets(group_chunks(seqs[:cut],bs),
                           group_chunks(seqs[cut:],bs),
                           bs=bs,drop_last=True,shuffle=False) #drop_last=True so that every batch has exactly bs rows, matching the batch size of the stored hidden state

In [72]:
#lets see what this dataset will look like
[L(vocab[o] for o in s) for s in seqs[0]]

[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]

You can see that not only is the input sequence of length 16, but the output sequence too is of length 16, which is offset by only one token. This is already a more complicated task.

For this task, we will modify the model so that it outputs a prediction after every token in the sequence, instead of only after the last one.

In [77]:
class LMModel2(Module):
    def __init__(self,vocab_sz,n_hidden):
        self.i_h=nn.Embedding(vocab_sz,n_hidden) #embedding layer
        self.h_h=nn.Linear(n_hidden,n_hidden) #linear layer: to create activations for the next word
        self.h_o=nn.Linear(n_hidden,vocab_sz) #final layer, now applied at every step to predict the next word
        self.h=0
    
    def forward(self,x):
        outs=[]
        for i in range(sl):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))
        self.h = self.h.detach()
        return torch.stack(outs,dim=1) #shape bs,sl,vocab_sz
    
    def reset(self): self.h = 0

The output is of shape `(bs,sl,vocab_sz)` while the target is of shape `(bs,sl)`. So we need to change the loss function definition, otherwise it will throw a compatibility issue. All we need to do is to flatten the output before feeding it to the Cross Entropy loss.

In [78]:
#we have to modify the loss_func because we stacked the outputs on dimension 1
def loss_func(inp,target): return F.cross_entropy(inp.view(-1,len(vocab)),target.view(-1))

And then we can follow the same procedure. Let us also try training for a bit longer and see the results. 

In [79]:
learn = Learner(dls,LMModel2(len(vocab),64), loss_func=loss_func,metrics=accuracy,cbs=ModelResetter)
learn.fit_one_cycle(15,3e-3) #train for longer because it's a more complicated task

epoch,train_loss,valid_loss,accuracy,time
0,3.1993,2.998189,0.326497,00:00
1,2.30689,2.035344,0.469564,00:00
2,1.752365,1.836322,0.467122,00:00
3,1.465121,1.853118,0.490967,00:00
4,1.304587,1.776199,0.501546,00:00
5,1.175223,1.715397,0.522135,00:00
6,1.059304,1.672083,0.532878,00:00
7,0.950992,1.693639,0.565348,00:00
8,0.874582,1.823799,0.577881,00:00
9,0.812443,1.850246,0.581624,00:00


These are much better results. Now let us make another tweak to our model. What if, instead of a single hidden layer, we stacked two hidden layers, so that the output of the first hidden layer is passed as the input to the second? Essentially we're making the model deeper, which should help it learn more complex features.

### Multilayer RNNs

Let us save ourselves the trouble of writing this from scratch, and simply use PyTorch's `nn.RNN` class, which can create a sequence of these hidden states. The rest will be the same. 

In [80]:
class LMModel3(Module):
    def __init__(self,vocab_sz,n_hidden,n_layers):
        self.i_h = nn.Embedding(vocab_sz,n_hidden)
        self.rnn = nn.RNN(n_hidden,n_hidden,n_layers,batch_first=True) #if we want to create 2 layers stacked together, set n_layers as 2
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = torch.zeros(n_layers,bs,n_hidden)

    def forward(self,x):
        res,h = self.rnn(self.i_h(x),self.h)
        self.h = h.detach()
        return self.h_o(res)
    
    def reset(self): self.h.zero_()

In [82]:
learn=Learner(dls,LMModel3(len(vocab),64,2), #for now we are using only 2 layers in the RNN
              loss_func=CrossEntropyLossFlat(),
              metrics=accuracy,cbs=ModelResetter)

learn.fit_one_cycle(15,3e-3) #deeper model, exploding/vanishing gradients

epoch,train_loss,valid_loss,accuracy,time
0,3.03959,2.616475,0.462077,00:01
1,2.133121,1.805064,0.470296,00:01
2,1.700932,1.930571,0.322103,00:01
3,1.489591,1.882982,0.460531,00:01
4,1.345939,1.931348,0.468506,00:01
5,1.218477,2.034932,0.503906,00:01
6,1.099908,1.957535,0.545654,00:01
7,0.989068,1.781985,0.549886,00:01
8,0.881065,1.782087,0.545247,00:01
9,0.788906,1.801425,0.533447,00:01


That's weird. Why is this model performing worse than a single layer model? That's because of the exploding/vanishing gradient problem, that arises in deep Neural Networks. 

Essentially we're multiplying an input by the same matrices again and again. Imagine what happens when you multiply a number again and again by the same number.

There can be two cases.

1. If you're multiplying again and again by a number greater than 1, you will reach an extremely large value soon. And this means, the gradients too will explode, or become very large, in other terms. This may make the model erratic and the performance will deteriorate. 

2. If you're multiplying again and again by a number less than 1 (in absolute value), you will reach an extremely small value soon. This means, the gradients too will be close to zero, and essentially no learning will take place.
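The two cases are easy to reproduce with plain numbers. The sketch below uses a single scalar as a stand-in for the weight matrix (a simplification: with real matrices the behaviour depends on the matrix as a whole, not on one number), and `repeated_product` is a helper written only for this illustration.

```python
def repeated_product(factor, steps, start=1.0):
    """Multiply `start` by `factor` a total of `steps` times."""
    value = start
    for _ in range(steps):
        value *= factor
    return value

for factor in (1.5, 1.0, 0.5):
    print(factor, repeated_product(factor, steps=50))
```

After only 50 steps, a factor of 1.5 has grown past a hundred million, while a factor of 0.5 has shrunk to well under a trillionth. Only a factor of exactly 1 stays put, which is why the problem is so hard to avoid.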

There are a few methods to prevent these. In the context of Sequence Models, one of the research developments that addressed this issue was the LSTM, or *Long Short Term Memory* Models.

## Long Short Term Memory (LSTM) Models

LSTM is an architecture that was introduced back in 1997 by Jürgen Schmidhuber and Sepp Hochreiter. In this architecture, there are not one, but two, hidden states. In our base RNN, the hidden state is the output of the RNN at the previous time step. That hidden state is then responsible for two things:

* Having the right information for the output layer to predict the correct next token
* Retaining memory of everything that happened in the sentence

Consider, for example, the sentences “Henry has a dog and he likes his dog very much” and “Sophie has a dog and she likes her dog very much.” It’s very clear that the RNN needs to remember the name at the beginning of the sentence to be able to predict he/she or his/her.

In practice, RNNs are really bad at retaining memory of what happened much earlier in the sentence, which is the motivation to have another hidden state (called cell state) in the LSTM. The cell state will be responsible for keeping long short-term memory, while the hidden state will focus on the next token to predict. Let’s take a closer look at how this is achieved and build an LSTM from scratch.

![](https://drive.google.com/uc?id=1ll89W_Fy0zJN7LNVEqcMvjk7eHmTKk4L)

In this picture, our input $x_{t}$ enters on the left with the previous hidden state ($h_{t-1}$) and cell state ($c_{t-1}$). The four orange boxes represent four layers (our neural nets) with the activation being either sigmoid ($\sigma$) or tanh. tanh is just a sigmoid function rescaled to the range -1 to 1. Its mathematical expression can be written like this:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x}+e^{-x}} = 2 \sigma(2x) - 1$$

where $\sigma$ is the sigmoid function. The green circles are elementwise operations. What goes out on the right is the new hidden state ($h_{t}$) and new cell state ($c_{t}$), ready for our next input. The new hidden state is also used as output, which is why the arrow splits to go up.
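The identity between tanh and the sigmoid can be checked numerically in a few lines of plain Python. `sigmoid` and `tanh_via_sigmoid` are helpers defined here for the check.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={x:+.1f}  math.tanh={math.tanh(x):+.6f}  via sigmoid={tanh_via_sigmoid(x):+.6f}")
```

The two columns agree, and every value lies strictly between -1 and 1, whereas the sigmoid itself only produces values between 0 and 1.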

Let's go over the four neural nets (called *gates*) one by one and explain the diagram—but before this, notice how very little the cell state (at the top) is changed. It doesn't even go directly through a neural net! This is exactly why it will carry on a longer-term state.

First, the arrows for input and old hidden state are joined together. In the RNN we wrote earlier in this chapter, we were adding them together. In the LSTM, we stack them in one big tensor. This means the dimension of our embeddings (which is the dimension of $x_{t}$) can be different than the dimension of our hidden state. If we call those `n_in` and `n_hid`, the arrow at the bottom is of size `n_in + n_hid`; thus all the neural nets (orange boxes) are linear layers with `n_in + n_hid` inputs and `n_hid` outputs.

The first gate (looking from left to right) is called the *forget gate*. Since it’s a linear layer followed by a sigmoid, its output will consist of scalars between 0 and 1. We multiply this result by the cell state to determine which information to keep and which to throw away: values closer to 0 are discarded and values closer to 1 are kept. This gives the LSTM the ability to forget things about its long-term state. For instance, when crossing a period or an `xxbos` token, we would expect it to (have learned to) reset its cell state.

The second gate is called the *input gate*. It works with the third gate (which doesn't really have a name but is sometimes called the *cell gate*) to update the cell state. For instance, we may see a new gender pronoun, in which case we'll need to replace the information about gender that the forget gate removed. Similar to the forget gate, the input gate decides which elements of the cell state to update (values close to 1) or not (values close to 0). The third gate determines what those updated values are, in the range of –1 to 1 (thanks to the tanh function). The result is then added to the cell state.

The last gate is the *output gate*. It determines which information from the cell state to use to generate the output. The cell state goes through a tanh before being combined with the sigmoid output from the output gate, and the result is the new hidden state.
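The description above can be summarized as equations. Here $[h_{t-1}, x_t]$ denotes the hidden state and the input stacked into one tensor, and $\odot$ denotes elementwise multiplication. The weight and bias names ($W_f$, $b_f$, and so on) are our own labels for the four linear layers, not notation taken from the diagram.

$$\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)}\\
g_t &= \tanh(W_g [h_{t-1}, x_t] + b_g) && \text{(cell gate)}\\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

Notice that the cell state $c_t$ is only ever scaled and added to, never passed through a linear layer, which matches the observation that it changes very little from step to step.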

In terms of code, we can write the same steps like this:

(Cited from *Deep Learning for Coders with fastai and PyTorch*.)


In [None]:
class LSTMCell(Module):
    def __init__(self,ni,nh):
        self.forget_gate = nn.Linear(ni+nh,nh)
        self.input_gate  = nn.Linear(ni+nh,nh)
        self.cell_gate   = nn.Linear(ni+nh,nh)
        self.output_gate = nn.Linear(ni+nh,nh)
    
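    def forward(self,input,state):
        h,c = state
        h=torch.cat([h,input],dim=1) #join hidden state and input into one tensor of shape (bs, nh+ni)
        forget = torch.sigmoid(self.forget_gate(h)) #value between 0 and 1
        c=c*forget #throw away the parts of the cell state that the forget gate marks as close to 0
        inp=torch.sigmoid(self.input_gate(h)) #which elements of the cell state to update
        cell=torch.tanh(self.cell_gate(h)) #the new candidate values, between -1 and 1
        c=c+ inp*cell
        out=torch.sigmoid(self.output_gate(h)) #which parts of the cell state to use for the output
        h=out*torch.tanh(c)

        return h,(h,c)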

This is almost the same functionality that you will find in PyTorch's `nn.LSTM`, which can easily replace the `nn.RNN` layer in our model. Let's see if that improves the performance.

In [83]:
class LMModel4(Module):
    def __init__(self,vocab_sz,n_hidden,n_layers):
        self.i_h = nn.Embedding(vocab_sz,n_hidden)
        self.rnn=nn.LSTM(n_hidden,n_hidden,n_layers,batch_first=True)
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = [torch.zeros(n_layers,bs,n_hidden) for _ in range(2)] #an LSTM has 2 states (hidden and cell), and we store both of them here

    def forward(self,x):
        res,h = self.rnn(self.i_h(x),self.h)
        self.h=[h_.detach() for h_ in h]
        return self.h_o(res)

    def reset(self): 
        for h in self.h: h.zero_()

In [85]:
learn=Learner(dls,LMModel4(len(vocab),64,2),
              loss_func=CrossEntropyLossFlat(),
              metrics=accuracy,cbs=ModelResetter)

learn.fit_one_cycle(15,1e-2) #we can train it at a higher learning rate, for a shorter time, and get better accuracy , since it now does not suffer from exploding/vanishing grads

epoch,train_loss,valid_loss,accuracy,time
0,3.013325,2.733004,0.432454,00:02
1,2.199492,1.907451,0.305827,00:02
2,1.611354,1.687851,0.478923,00:02
3,1.298396,2.002836,0.468099,00:02
4,1.013445,1.985033,0.536051,00:02
5,0.686701,1.803597,0.622721,00:02
6,0.412793,1.687815,0.716797,00:02
7,0.234133,1.502601,0.756836,00:02
8,0.130238,1.461595,0.7736,00:02
9,0.074822,1.649599,0.786214,00:02


This is much better! The improvement most likely comes from the LSTM being able to limit vanishing/exploding gradients to some extent.

But LSTMs are not completely free from the vanishing/exploding gradient problem. We need to make a few tweaks, as introduced in [this](https://arxiv.org/abs/1708.02182) paper. The resulting LSTM is called the *AWD-LSTM*, which we used to build the text classifier earlier. It involves applying certain regularization techniques to the Neural Net layers inside the LSTM, as well as a weight tying method.

### Regularizing LSTMs

The paper implements several other techniques as well, but here we will only implement dropout and weight tying. Weight tying essentially means assigning the same weight matrix to the input layer and the output layer.

The idea is as follows. The input layer's job is to map a token to an embedding vector, and the output layer's job is to map a vector of activations back to a token. These are essentially inverse mappings between the same two spaces, so they don't need two different matrices to perform this task.
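Tying is only possible because the two weight matrices have the same shape. The sketch below does the bookkeeping in plain Python for the sizes used in this notebook (30 tokens, 64 hidden units). The shapes in the comments follow how PyTorch stores the weights of `nn.Embedding` and `nn.Linear`. Note that this works here because the embedding size equals the hidden size; the variable names are just for illustration.

```python
vocab_sz, n_hidden = 30, 64   # sizes used in this notebook

# Weight shapes as PyTorch stores them
embedding_weight_shape = (vocab_sz, n_hidden)   # nn.Embedding(vocab_sz, n_hidden).weight
output_weight_shape    = (vocab_sz, n_hidden)   # nn.Linear(n_hidden, vocab_sz).weight is (out_features, in_features)

# Same shape, so one matrix can serve both layers
can_tie = embedding_weight_shape == output_weight_shape

params_untied = 2 * vocab_sz * n_hidden   # two separate matrices
params_tied = vocab_sz * n_hidden         # one shared matrix
print(can_tie, params_untied, params_tied)   # True 3840 1920
```

With a vocabulary this small the saving is modest, but for a real vocabulary of tens of thousands of tokens the embedding matrix is often the largest part of the model, so sharing it matters a lot more.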

#### Dropout
Below is a simple demonstration of how to implement dropout in PyTorch. It does essentially the same thing as PyTorch's `nn.Dropout`. (Though PyTorch's native implementation is written in C++, not Python.)

In [86]:
#implementation in PyTorch is really simple. Though PyTorch's native layer is written in C++, not Python
class Dropout(Module):
    def __init__(self,p): self.p=p

    def forward(self,x):
        if not self.training: return x #every PyTorch module keeps a `training` flag (whether we are in the training process, or the validation process). Dropout is one of the main reasons for this
        mask=x.new(*x.shape).bernoulli_(1-self.p) #keep each activation with probability 1-p, drawn independently (a Bernoulli trial)
        return x*mask.div_(1-self.p) #rescale the kept activations so that their expected value is unchanged
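To see why the kept activations are divided by `1-p`, here is a sketch of the same idea in plain Python, working on a list of floats instead of a tensor. The function name, the `training` and `rng` arguments, and the toy input are our own choices for illustration.

```python
import random

def dropout(xs, p, training=True, rng=random):
    """Inverted dropout on a list of floats: zero each value with probability p,
    and scale the survivors by 1/(1-p)."""
    if not training:
        return list(xs)                       # no-op at validation/inference time
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in xs]

rng = random.Random(0)                        # fixed seed so the run is repeatable
xs = [1.0] * 10_000
dropped = dropout(xs, p=0.5, rng=rng)

print(sorted(set(dropped)))                   # only two distinct values: 0.0 and 2.0
print(sum(dropped) / len(dropped))            # close to 1.0, the mean of the input
```

About half of the values are zeroed and the rest are doubled, so the average stays close to what it was before dropout. That is why the next layer sees activations on the same scale during training and validation.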

In [None]:
??nn.Dropout

In [None]:
??F.dropout
#torch._VF or torch. variable functions module is an alias of torch._C._VariableFunctions
# https://github.com/pytorch/pytorch/blob/master/torch/_VF.py

And integrating this into our model is as simple as follows

In [87]:
class LMModel5(Module):
    def __init__(self,vocab_sz,n_hidden,n_layers,p):
        self.i_h = nn.Embedding(vocab_sz,n_hidden)
        self.rnn = nn.LSTM(n_hidden,n_hidden,n_layers,batch_first=True)
        self.drop=nn.Dropout(p)
        self.h_o=nn.Linear(n_hidden,vocab_sz)
        self.h_o.weight=self.i_h.weight #weight tying
        self.h=[torch.zeros(n_layers,bs,n_hidden) for _ in range(2)]

    def forward(self,x):
        raw,h = self.rnn(self.i_h(x),self.h)
        out=self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out),raw,out #we've to return 3 things
    
    def reset(self):
        for h in self.h: h.zero_()

Let us build our model using this architecture

In [92]:
learn=TextLearner(dls,LMModel5(len(vocab),64,2,0.5), loss_func=CrossEntropyLossFlat(), metrics=accuracy)

In [93]:
learn.fit_one_cycle(15,1e-2,wd=0.1) #adding some additional regularization

epoch,train_loss,valid_loss,accuracy,time
0,2.554272,1.922771,0.486735,00:02
1,1.642464,1.769457,0.537028,00:02
2,0.923154,1.00023,0.746745,00:02
3,0.475945,0.808373,0.79598,00:02
4,0.253969,0.663987,0.843018,00:02
5,0.139685,0.645504,0.864583,00:02
6,0.088738,0.730432,0.850016,00:02
7,0.058592,0.680262,0.861979,00:02
8,0.043683,0.618799,0.866455,00:02
9,0.034281,0.636209,0.863851,00:02


This is a big improvement: the validation accuracy is now around 86%. With this, we finish our discussion on LSTMs. Now let us move to the present state-of-the-art architecture used in NLP. This architecture is called the *Transformer*, and it originated in [this](https://arxiv.org/abs/1706.03762) paper. We won't be explaining each module step by step, but if you are interested, you can find an intuitive explanation [here](https://www.youtube.com/watch?v=4Bdc55j80l8).

You may have heard about the famous GPT models from OpenAI, which have performed some really complex tasks, like holding conversations, writing code, building websites from vague ideas given as inputs, creating art, and more. You can find a brief demo [here](https://www.youtube.com/watch?v=PqbB07n_uQ4), in which a person interviews an AI model. I recommend you watch it fully.

## Transformers

We'll be using pretrained models that are provided by HuggingFace's transformers library. You can find the source code [here](https://github.com/huggingface/transformers) and the documentation [here](https://huggingface.co/transformers/pretrained_models.html). Let us start by building our models and fine-tuning transformers for our specific tasks.

In [94]:
!pip install -Uq transformers >./temp

In [95]:
from transformers import GPT2LMHeadModel,GPT2TokenizerFast

We'll be using the basic version of a Transformer model called GPT2, which is itself quite memory intensive. You can explore other models by looking them up in the documentation.

In [96]:
pretrained_weights='gpt2' #basic version, which itself takes a lot of memory
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model=GPT2LMHeadModel.from_pretrained(pretrained_weights) #this is a pretrained model.


Just to get a brief idea about what our model looks like, let us look into the representation of the model

In [97]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

As we saw before, we need to take a piece of text, tokenize it, and feed it to the model. The transformers library already provides us with the vocabulary that was used to train the model, so let us see how the model performs without any fine-tuning.

In [105]:
#lets see how the tokenizer works
ids=tokenizer.encode('Welcome to the 13th session on Practical Machine Learning')
ids

[14618, 284, 262, 1511, 400, 6246, 319, 13672, 605, 10850, 18252]

In [106]:
#you can even decode the tokens back to original words
tokenizer.decode(ids)

'Welcome to the 13th session on Practical Machine Learning'
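
Notice that the nine words above became eleven ids: some words are split into several sub-word pieces. GPT2's real tokenizer uses byte-level byte pair encoding, which is more involved than what we can show here. The sketch below is only a toy stand-in with a hypothetical seven-entry vocabulary and a greedy longest-match rule, meant to show how an encode/decode round trip over sub-word pieces works.

```python
# Toy sub-word tokenizer. The vocabulary is hypothetical and this is
# NOT the byte-level BPE algorithm that GPT2 actually uses.
toy_vocab = ["Machine", "Learning", " ", "13", "th", "Pract", "ical"]
token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}

def encode(text):
    """Greedily split text into the longest known pieces and return their ids."""
    ids, pos = [], 0
    while pos < len(text):
        candidates = (t for t in toy_vocab if text.startswith(t, pos))
        match = max(candidates, key=len, default=None)
        if match is None:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
        ids.append(token_to_id[match])
        pos += len(match)
    return ids

def decode(ids):
    """Map ids back to their pieces and glue them together."""
    return "".join(toy_vocab[i] for i in ids)

text = "13th Practical Machine Learning"
toy_ids = encode(text)
print(toy_ids)
print(decode(toy_ids))
```

Here four words turn into nine ids, because `13th` and `Practical` are each split in two and every space is its own piece. Decoding simply joins the pieces again, which is why the round trip returns the original string.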

In [107]:
#since our model is pretrained, you can directly use it to get predictions
t=torch.LongTensor(ids)[None]
preds=model.generate(t,max_length=50) #by default, generation stops at a total length of 20 tokens (not words). You can change it using the max_length parameter

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [108]:
preds.shape,preds[0] #50 tokens in total (prompt included), because we set max_length=50

(torch.Size([1, 50]),
 tensor([14618,   284,   262,  1511,   400,  6246,   319, 13672,   605, 10850,
         18252,    13,   198,   198,   464,  6246,   481,   307,  2714,   379,
           262,  2059,   286,  3442,    11, 14727,    11,   319,  2693,  1367,
            11,  2177,    13,   198,   198,   464,  6246,   481,   307,  2714,
           379,   262,  2059,   286,  3442,    11, 14727,    11,   319,  2693]))

The output is in the form of numerical tokens. Let us decode these into words, to see what the model has predicted. 

In [109]:
tokenizer.decode(preds[0].numpy())

'Welcome to the 13th session on Practical Machine Learning.\n\nThe session will be held at the University of California, Berkeley, on September 11, 2017.\n\nThe session will be held at the University of California, Berkeley, on September'

This is pretty impressive on its own. But let us fine-tune this model on our own dataset. We'll be using a smaller version of WIKITEXT, called WIKITEXT2, just to speed up the training process.

### Finetuning the Transformers Model

Let us first download the dataset and set it up. Our task is to build a language model that takes in a sequence of tokens and outputs the same sequence, just offset by one token.
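
As a rough illustration of what "offset by one token" means, the sketch below builds input/target pairs from a short list of token ids. The helper name `lm_pairs` is hypothetical, and this is only a simplified picture of what a language model dataloader does, not fastai's implementation. The ids are the eleven tokens we got from the tokenizer earlier.

```python
def lm_pairs(token_ids, seq_len):
    """Cut a token stream into (input, target) pairs where target is input shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - 1, seq_len):
        x = token_ids[start:start + seq_len]
        y = token_ids[start + 1:start + seq_len + 1]
        if len(x) == len(y):   # skip a trailing chunk that has no full target
            pairs.append((x, y))
    return pairs

ids = [14618, 284, 262, 1511, 400, 6246, 319, 13672, 605, 10850, 18252]
pairs = lm_pairs(ids, seq_len=5)
for x, y in pairs:
    print(x, "->", y)
```

At every position the target is simply the next token of the input, so the text itself provides the labels and no manual annotation is needed.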

In [110]:
path = untar_data(URLs.WIKITEXT_TINY)
path.ls()

(#2) [Path('/root/.fastai/data/wikitext-2/test.csv'),Path('/root/.fastai/data/wikitext-2/train.csv')]

In [111]:
df_train=pd.read_csv(path/'train.csv',header=None)
df_valid=pd.read_csv(path/'test.csv',header=None)
df_train.head()

Unnamed: 0,0
0,"\n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z..."
1,"\n = Big Boy ( song ) = \n \n "" Big Boy "" <unk> "" I 'm A Big Boy Now "" was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including "" Big Boy "" . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re..."
2,"\n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit..."
3,"\n = New Year 's Eve ( Up All Night ) = \n \n "" New Year 's Eve "" is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch..."
4,"\n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers ...."


As we mentioned before, when building language models, we don't care which data is in the training set and which is in the validation set. So we simply club all the data together.

In [112]:
#concatenating all texts in one array
all_texts = np.concatenate([df_train[0].values,df_valid[0].values])

Now let us build a transformation function that takes in text and converts it into numericalized tokens. This is a more efficient way of doing things: previously, we converted all the text to numbers beforehand, whereas now we will do it on the fly while running the model.

In a fastai Transform you can define:

* an encodes method that is applied when you call the transform (a bit like the forward method in a nn.Module)

* a decodes method that is applied when you call the decode method of the transform, if you need to decode anything for showing purposes (like converting ids to a text here)

* a setups method that sets some inner state of the Transform (not needed here so we skip it)

In [113]:
class TransformersTokenizer(Transform):
    def __init__(self,tokenizer): self.tokenizer=tokenizer
    def encodes(self,x):
        toks=self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self,x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

In [114]:
splits=[range_of(df_train),list(range(len(df_train),len(all_texts)))]
tls=TfmdLists(all_texts,TransformersTokenizer(tokenizer),splits=splits,dl_type=LMDataLoader) #LMDataloader, because this is a LM problem, dependent var itself being a sequence

#this is only a warning, and it can be ignored here

Token indices sequence length is longer than the specified maximum sequence length for this model (4576 > 1024). Running this sequence through the model will result in indexing errors


In [115]:
tls.train[0],tls.valid[0] #look the same, but only begin and end the same way

(tensor([220, 198, 796,  ..., 198, 220, 198]),
 tensor([220, 198, 796,  ..., 198, 220, 198]))

In [116]:
tls.tfms(tls.train.items[0]).shape, tls.tfms(tls.valid.items[0]).shape #see? they're not exactly the same. They in fact have different shapes

(torch.Size([4576]), torch.Size([1485]))

Let us see how we can use the Transform's decode function to decode the numericalised tokens back to words, using the `show_at` function in the fastai library. 

In [117]:
#we can have a look at decodes using the show_at function
show_at(tls.train,0)

 
 = 2013 – 14 York City F.C. season = 
 
 The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club, a professional football club based in York, North Yorkshire, England. Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two. The season ran from 1 July 2013 to 30 June 2014. 
 Nigel Worthington, starting his first full season as York manager, made eight permanent summer signings. By the turn of the year York were only above the relegation zone on goal difference, before a 17 @-@ match unbeaten run saw the team finish in seventh @-@ place in the 24 @-@ team 2013 – 14 Football League Two. This meant York qualified for the play @-@ offs, and they were eliminated in the semi @-@ final by Fleetwood Town. York were knocked out of the 2013 – 14 FA Cup, Football League Cup and Football League Trophy in their opening round matches. 
 35 players made at least

Now that our dataset is set up, let us put the data into dataloaders.

In [118]:
bs,sl = 4,256
dls = tls.dataloaders(bs=bs,seq_len=sl)

In [119]:
dls.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"\n = Ed Barrow = \n \n Edward Grant Barrow ( May 10, 1868 – December 15, 1953 ) was an American manager and front office executive in Major League Baseball. He served as the field manager of the Detroit Tigers and Boston Red Sox. He served as business manager ( de facto general manager ) of the New York Yankees from 1921 to 1939 and as team president from 1939 to 1945, and is credited with building the Yankee dynasty. Barrow was elected to the Baseball Hall of Fame in 1953. \n Born in a covered wagon in Springfield, Illinois, Barrow worked as a journalist and soap salesman before entering the business of baseball by selling concessions at games. From there, Barrow purchased minor league baseball teams, also serving as team manager, and served as president of the Atlantic League. After managing the Tigers in 1903 and 1904","\n = Ed Barrow = \n \n Edward Grant Barrow ( May 10, 1868 – December 15, 1953 ) was an American manager and front office executive in Major League Baseball. He served as the field manager of the Detroit Tigers and Boston Red Sox. He served as business manager ( de facto general manager ) of the New York Yankees from 1921 to 1939 and as team president from 1939 to 1945, and is credited with building the Yankee dynasty. Barrow was elected to the Baseball Hall of Fame in 1953. \n Born in a covered wagon in Springfield, Illinois, Barrow worked as a journalist and soap salesman before entering the business of baseball by selling concessions at games. From there, Barrow purchased minor league baseball teams, also serving as team manager, and served as president of the Atlantic League. After managing the Tigers in 1903 and 1904 and"
1,"defined by demarcation commissions in 1947, were the Yugoslav federal constitutional amendments of 1971 and 1974, granting that sovereign rights were exercised by the federal units, and that the federation had only the authority specifically transferred to it by the constitution. \n Germany advocated quick recognition of Croatia, stating that it wanted to stop ongoing violence in Serb @-@ inhabited areas. It was opposed by France, the United Kingdom, and the Netherlands, but the countries agreed to pursue a common approach and avoid unilateral actions. On 10 October, two days after the Croatian Parliament confirmed the declaration of independence, the EEC decided to postpone any decision to recognize Croatia for two months, deciding to recognize Croatian independence in two months if the war had not ended by then. As the deadline expired, Germany presented its decision to recognize Croatia as its policy and duty — a position supported by","by demarcation commissions in 1947, were the Yugoslav federal constitutional amendments of 1971 and 1974, granting that sovereign rights were exercised by the federal units, and that the federation had only the authority specifically transferred to it by the constitution. \n Germany advocated quick recognition of Croatia, stating that it wanted to stop ongoing violence in Serb @-@ inhabited areas. It was opposed by France, the United Kingdom, and the Netherlands, but the countries agreed to pursue a common approach and avoid unilateral actions. On 10 October, two days after the Croatian Parliament confirmed the declaration of independence, the EEC decided to postpone any decision to recognize Croatia for two months, deciding to recognize Croatian independence in two months if the war had not ended by then. As the deadline expired, Germany presented its decision to recognize Croatia as its policy and duty — a position supported by Italy"


There is one small tweak that needs to be done. 

The HuggingFace model will return a tuple in outputs, with the actual predictions and some additional activations (should we want to use them in some regularization scheme). To work inside the fastai training loop, we will need to drop those using a Callback: we use those to alter the behavior of the training loop.

Here we need to write the event `after_pred` and replace `self.learn.pred` (which contains the predictions that will be passed to the loss function) by just its first element. In callbacks, there is a shortcut that lets you access any of the underlying Learner attributes, so we can write `self.pred[0]` instead of `self.learn.pred[0]`. That shortcut only works for read access, not write, so we have to write `self.learn.pred` on the left side of the assignment (otherwise we would set a `pred` attribute on the Callback itself).
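
The read-only nature of that shortcut can be reproduced with a few lines of plain Python. The classes `ToyLearner` and `ToyCallback` below are hypothetical stand-ins, not fastai's actual implementation. They only mimic the behaviour by forwarding failed attribute lookups to `self.learn`.

```python
class ToyLearner:
    def __init__(self):
        # stands in for the tuple returned by the HuggingFace model
        self.pred = ("logits", "extra activations")

class ToyCallback:
    def __init__(self, learn):
        self.learn = learn

    def __getattr__(self, name):
        # Python calls this only when normal lookup on the callback fails,
        # so reads fall through to the learner but writes do not.
        return getattr(self.learn, name)

learn = ToyLearner()
cb = ToyCallback(learn)

read_via_shortcut = cb.pred    # read: forwarded to learn.pred
cb.pred = cb.pred[0]           # write: creates a NEW attribute on the callback
learner_unchanged = learn.pred == ("logits", "extra activations")

del cb.pred                    # undo the mistake
cb.learn.pred = cb.pred[0]     # correct: assign through self.learn
print(learner_unchanged, learn.pred)
```

The first assignment silently leaves the learner's predictions untouched, which is exactly the bug the explicit `self.learn.pred = ...` avoids.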

In [120]:
class DropOutput(Callback):
    def after_pred(self): self.learn.pred = self.pred[0]

Let us build our learner class.

In [121]:
learn = Learner(dls,model,loss_func=CrossEntropyLossFlat(),cbs=[DropOutput,RNNRegularizer],metrics=Perplexity()).to_fp16() 
#not using accuracy because it's a poor metric for language models. Even for very good models, the accuracy will remain only around 30%

Actually, before training, let us see how the model is working as is. 

In [122]:
learn.validate()
#seeing model performance even without training. It's actually good. In the vanilla RNN, the perplexity was around 50

(#2) [3.6962380409240723,40.29542922973633]

This is pretty good. The perplexity is much lower than the value we saw in LSTM models. And we haven't even trained yet!
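
The two numbers returned by `learn.validate()` are the validation loss and the perplexity, and they are tied together: perplexity is the exponential of the cross-entropy loss. The short check below uses only the loss values printed in this notebook. The helper name `perplexity` and the uniform-guess baseline are additions made for illustration.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the average cross-entropy loss (in nats)."""
    return math.exp(cross_entropy_loss)

before = perplexity(3.6962380409240723)  # validation loss before fine-tuning
after = perplexity(2.846317)             # validation loss after one epoch
uniform = perplexity(math.log(50257))    # guessing uniformly over GPT2's 50257 tokens

print(f"before fine-tuning: {before:.2f}")
print(f"after one epoch:    {after:.2f}")
print(f"uniform guessing:   {uniform:.0f}")
```

One way to read these numbers: a perplexity of about 40 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 40 tokens at each step, compared with 50257 for a model that knows nothing.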

Let us train our model for one epoch, and see how it performs. 

In [None]:
learn.fit_one_cycle(1, 1e-4)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.11119,2.846317,0.457277,17.22423,13:09


The perplexity has improved quite a lot. Let us see how the model performs now. 

In [123]:
prompt = "\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn"
prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None].cuda()
inp.shape

torch.Size([1, 21])

In [124]:
preds=learn.model.generate(inp,max_length=50,num_beams=5,temperature=1.5)
tokenizer.decode(preds[0].cpu().numpy())

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn on its head.\n\nA unicorn is a magical creature with a rainbow tail and a horn on its head.\n\nA unicorn is a'

## Review

In this session, we learnt about NLP in Deep Learning. We learnt how to train NLP models ranging from vanilla RNNs and LSTMs to state-of-the-art Transformers. We learnt about self-supervised learning techniques in NLP as well. Hopefully this was an informative session.

## Exercise 
Your task is to train the Transformer model on the text classification task using the ULMFiT approach. You can use the word tokenization approach. The Transformer model was not trained specifically on WIKITEXT data, but you can use the model as is, since it has already been pretrained. So you need