# 09 Convolutional Neural Networks
> Before starting, change the accelerator of this notebook from CPU to GPU, since we'll be running our Neural Networks on the GPU.

> Go to Runtime --> Change Runtime Type --> GPU 

Welcome to the 9th session in Practical Machine Learning. In this session, we continue our journey in Deep Learning and learn about one of the most interesting domains of AI - *Computer Vision*.

![](https://drive.google.com/uc?id=1wYMDn_gWGMeH7OLV3kbO9zlf42f283sF)


Computer Vision is the domain that focuses on machines' ability to perceive the physical space around us through vision, just as we do through our eyes. The idea and importance of Computer Vision existed even before there were sophisticated algorithms to perform Computer Vision tasks. Vision is one of the capabilities required by the extended ("total") version of the Turing Test.

Computer Vision is among the areas of Machine Learning that have advanced the most. Every year, we see more and more advanced research in Computer Vision tasks. One of the most influential conferences in AI is CVPR, which stands for *Computer Vision and Pattern Recognition*.

Computer Vision is used in very sophisticated systems nowadays, such as self-driving cars and autonomous robots. Clearly, Computer Vision cannot be neglected in AI, because we let it make decisions that affect things as precious as human life!

Much of this is credited to one technique in Deep Learning - *Convolutional Neural Networks*. They are a variant of traditional Neural Networks that can handle images as a whole.

So, let us actually begin by building our own model, to see what Convolutional Neural Networks are capable of, and what potential they have. This is the session where we will be learning how to build the best models among all types of models we've built, in terms of performance. 

We'll build our models using PyTorch. However, PyTorch only provides built-in functionality for basic (and frankly slightly archaic) training methods. Instead of plain PyTorch, we recommend using a library called *fastai*, which is built on top of PyTorch. We used fastai in the introductory session, Introduction to Python and ML. *fastai* allows users to apply world-class research practices in Deep Learning without having to deal with all the technical details. Let us begin by installing this library.


In [None]:
from IPython.display import YouTubeVideo

In [None]:
!pip install --upgrade fastai >./tmp
from fastai.vision.all import *
#Note: If you're using Colab, you may see an error message about incompatibility. Ignore that!

Let us revisit the problem of the Cat vs Dog classifier. This time, however, we will also come across various concepts that you may have learnt in the previous sessions. If you don't recognise them, that's okay. You are now ready to learn them. We will look at the components that fastai needs to build a model. However, you are not required to understand or memorize the internal workings of the library at this point. We will guide you on how to build the model. All the steps and documentation can be found [here](https://docs.fast.ai/).

We start by downloading the dataset. *fastai* has the `untar_data` function, which automatically downloads the data from a link and unzips it. It returns the path of the location where the dataset is stored.


In [None]:
path = untar_data(URLs.PETS)
path

So the dataset is stored at this location. *fastai* has a wonderful API that can dynamically retrieve data in the form of tensors, given just a list of the paths of all data points and a method to label them.

In [None]:
files = get_image_files(path/"images")
files

This dataset contains images not only of cats and dogs, but of different breeds of each. You can either use it to classify the different breeds of cats and dogs, or simply to classify between cats and dogs. File names of cat images start with a capital letter, while those of dog images start with a lowercase letter. *fastai* needs to know how the label of an image is determined from its path. We tell it by defining a function `label_func`.

In [None]:
def label_func(f): return f[0].isupper() #Cats will be labeled as True, and Dogs as False
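
To see what this rule does, here is a small sketch that applies the same function to a few file names. The file names below are made up for illustration; they only assume the naming pattern described above (capital first letter for cats, lowercase for dogs).

```python
def label_func(f):
    # True for cats (name starts with a capital letter), False for dogs
    return f[0].isupper()

# Hypothetical file names, chosen only for illustration
examples = ["Bengal_12.jpg", "beagle_7.jpg", "Persian_101.jpg", "pug_3.jpg"]

labels = {name: label_func(name) for name in examples}
for name, is_cat in labels.items():
    print(name, "-> cat" if is_cat else "-> dog")
```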

Finally, let us finish building our dataset by creating dataloaders, which we learnt about in the previous lab session. *fastai* is built on top of PyTorch, so dataloaders are used there as well.

In [None]:
#Each image is resized to 224x224 pixels, since images must be of uniform shape to be put in a tensor
dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(224))
dls.show_batch()

`dls` simply holds the *train_dl* and the *valid_dl* that we used in the last lab session. So if we unpack this variable called `dls` as

In [None]:
train_dl,valid_dl = dls

We reach the same point!

Next, we create a *Learner*, which is nothing but a Python class, similar to the `NeuralNetwork` class in the previous session. Its job is simply to keep track of the data, model, optimizer, and the state of the parameters. We can train the model using the `fine_tune` method of the class.

In [None]:
learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(1) #Make sure your GPU is enabled, otherwise this may take an hour or so

Wow! In less than 2 minutes, we have built a model that is about 98% accurate in telling dogs from cats. If you set the hyperparameters of this model carefully, you may achieve even more.

We have not been able to achieve this kind of accuracy with any of the models we built previously. This is the current state of Computer Vision, and as you can see, there are great tools out there, like the fastai library, which bring all these wonders to your fingertips. In return, you don't have to spend years learning the math or the theory first. You can build your own world-class models with minimal effort.


In fact, let us build our own classifier!

I very recently came across [this](https://joedockrill.github.io/jmd_imagescraper/) tool, which lets us easily build our own datasets by scraping images from the Internet. It uses a DuckDuckGo-based image search to download images for a search query. So using this, you can create a dataset of images easily.

In [None]:
!pip install jmd_imagescraper >./tmp
from jmd_imagescraper.core import *

Build your own Image Classifier, not necessarily with only 2 classes.



In the example below, we pick a random but slightly harder problem - classifying between 2 very similar monuments - the *India Gate*, in New Delhi, and the *Arc de Triomphe*, in Paris. In fact, India Gate was inspired by the Arc de Triomphe, and the two actually look quite similar, at least from a distance. Can a Convolutional Neural Network identify which one is which?

You can change the queries to anything else. Duck Vs Goose, Mickey Mouse vs Tom the cat, Horse vs Donkey. There can even be more than 2 classes, not an issue. Try building your own Image Classifier.

In [None]:
root = Path().cwd()/"monuments/"
duckduckgo_search(root, "India Gate", "India Gate", max_results=100);
duckduckgo_search(root, "Arc de Triomphe", "Arc de Triomphe", max_results=100);

In [None]:
files = get_image_files(root)
files

In [None]:
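#Use `root` (the folder of scraped images) as the base path, not the Pets dataset `path` from earlier
dls = ImageDataLoaders.from_path_func(root, files, label_func=parent_label, item_tfms=Resize(224))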
dls.show_batch()
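
Here the label comes from the folder that holds each image, which is what `parent_label` is used for. A minimal sketch of that idea using only `pathlib` is given below. The function `parent_label_sketch` and the file names are hypothetical and only stand in for the real scraped files.

```python
from pathlib import Path

def parent_label_sketch(p):
    # The label is the name of the folder that directly contains the image
    return Path(p).parent.name

# Hypothetical paths, laid out as <root>/<class folder>/<image file>
paths = [
    Path("monuments/India Gate/001_example.jpg"),
    Path("monuments/Arc de Triomphe/002_example.jpg"),
]

labels = [parent_label_sketch(p) for p in paths]
print(labels)
```

This is why each search query above is saved into its own folder: the folder name becomes the class name.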

In [None]:
learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(5)

So, in just a few lines of code, you have built your own image classifier, with world-class accuracy!

## What is a Convolutional Neural Network?

We've now seen the potential of Convolutional Neural Nets (CNNs). For those who have no idea what a Convolutional Neural Net is or looks like, below is a great video to help you.


In [None]:
YouTubeVideo('K_BHmztRTpA',700,450)


Let us learn what exactly they are. Let us look at the representation of the model we used above.

In [None]:
learn.model

A few observations! This is an `nn.Sequential` object, which we already know about. This particular model is a very famous architecture in Computer Vision, called the *ResNet* architecture. We'll look into it in detail in a later section.

As we learnt in the previous session, models are composed of blocks called *modules*. Let us look at one of these modules.

In [None]:
learn.model[0]

Turns out even this module is composed of many blocks. Let us look at an even more basic block within this module.

In [None]:
learn.model[0][0]

Here we have a new PyTorch object called `nn.Conv2d`, which stands for the 2D convolutional block (since images are 2D). It is analogous to the linear layer we learnt about. If you look at the model, you will find both conv blocks and linear blocks in this ResNet model.
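
One way to compare the two kinds of layer is to count their parameters. The arithmetic below is a rough sketch under stated assumptions: a 224x224 RGB input, 64 outputs, and a 7x7 convolution with 64 kernels and no bias (these numbers are assumed here for illustration; compare them with what the cell above prints for your model).

```python
# Assumed shapes, for illustration only
height, width, channels = 224, 224, 3
out_features = 64

# Linear layer on the flattened image:
# one weight per (input value, output) pair, plus one bias per output
linear_params = height * width * channels * out_features + out_features

# Convolutional layer: each of the 64 kernels has 7x7 weights per input channel, no bias
kernel_size = 7
conv_params = out_features * channels * kernel_size * kernel_size

print(linear_params)  # 9633856
print(conv_params)    # 9408
```

The convolutional layer needs far fewer parameters because the same small kernel is reused at every position of the image.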

### What is a Convolutional Block? 

To answer that, let us start by understanding what a convolution is. A convolution is a feature engineering technique by which we derive information from an image. One kind of convolution may help identify the edges in an image, another may help identify where the color green is present, and so on. Visually, this is how convolutions are performed.

![](https://drive.google.com/uc?id=1J92FpdbRlXnd1Mr5tZSYtMrhGZSqbeA3)

![](https://drive.google.com/uc?id=14Dbf6jLFMSVvpbjBm9KUkGNk8lEVye2f)

The 3x3 matrix you see, over which convolution takes place is called the *kernel*. Each kernel is responsible for deriving a feature. 

There may be multiple kernels working on the same image per layer. This resulting matrix (on the right) is the output of one layer, and represents some feature. This is also the input to the next layer of the convNet. The next layer will learn even more complex features, until it can reach a point, where it can learn about features that differentiate between dogs and cats, for example.
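
The sliding-kernel operation can be sketched in a few lines of plain Python. This is a minimal illustration, not how PyTorch implements it: it uses no padding and a stride of 1, and the kernel values are hand-picked here to respond to a dark-to-bright vertical edge (in a CNN they would be learnt). Like deep learning libraries, it does not flip the kernel.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply the kernel with the patch under it and sum everything up
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# A 5x5 image: dark (0) on the left, bright (1) on the right
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked kernel that responds to a dark-to-bright vertical edge
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [3, 3, 0]
```

The output is large where the patch contains the edge and zero where the patch is uniformly bright, so this kernel acts as a simple edge detector.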

There is a great [visualization tool](https://www.cs.ryerson.ca/~aharley/vis/conv/) created by Adam Harley to visualize how CNNs process image data.

### Why does a Convolutional Network work better than MultiLayer Perceptrons?

In the real world, individual pixels don't really mean much. It is a *group* of pixels that has a meaning. A group of pixels forms the eyes, the nose, the ears of a cat. And a group of these groups collectively forms the face of the cat. MLPs don't consider this spatial relationship between pixels. They simply break the image apart and treat each pixel independently, and so lose spatial information. CNNs retain this spatial relationship. They work on the principle of finding features from a group of pixels without disturbing their spatial arrangement. This is the reason CNNs work better on visual data than straightforward MLP Neural Networks.

### History of CNNs

![](https://drive.google.com/uc?id=1aliUnCaqWQURkSYN5MVsm3bUBpMFjPRU)

CNNs were first popularized by Yann LeCun in 1989. At that time, the research community did not believe in the potential of Neural Nets. Yann LeCun was one of the few researchers who did, and he kept working on their development. Here is a video of him demonstrating the first ConvNet for handwritten digit recognition while he was at Bell Labs.

This model of his was used by Banks to read cheques and the Post Office to read postal codes. 

In [None]:
YouTubeVideo('FwFduRA_L6Q',width=700,height=450)

Today, LeCun works as the Director of Facebook AI Research, a pioneer in many AI techniques that other researchers throughout the world use as baselines to build their own work upon!

## The ResNet Architecture and other Image Classification Models

ResNets are among the most standard models used for image classification. They were introduced in 2015 by Kaiming He et al.
They are particularly effective due to the skip connection feature, which looks something like this.

![](https://drive.google.com/uc?id=1TCdBQjz8qEegGfOpPi_VLQUtBACiPuYh)

This is a great regularization feature. It serves the model in two ways.

* It prevents the model from overfitting, since the data can to some extent be directed to skip over layers.
* Most importantly, it is a great way to propagate activation distributions, and prevents them from collapsing to zero mean/zero standard deviation.

Zero Activations are a huge problem in Neural Networks. If a model is not trained carefully, the activations can collapse very fast, and at that point, there will be no learning, since the gradients will be zero too, and there will be no significant parameter update. 
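
The effect of a skip connection on collapsing activations can be illustrated with a toy calculation. This is only a sketch under strong simplifying assumptions: each "layer" is a single multiplication by a small weight, whereas real ResNet layers are convolutions followed by normalization and non-linearities. The function name `run_stack` is made up for this example.

```python
def run_stack(x, weight, depth, skip):
    """Pass a number through `depth` toy layers, each of which multiplies by `weight`."""
    for _ in range(depth):
        layer_out = weight * x
        # With a skip connection the input is added back to the layer's output
        x = (x + layer_out) if skip else layer_out
    return x

plain = run_stack(1.0, weight=0.1, depth=20, skip=False)
residual = run_stack(1.0, weight=0.1, depth=20, skip=True)

print(plain)     # shrinks towards zero (about 1e-20)
print(residual)  # the signal survives all 20 layers
```

Without the skip connection the signal all but vanishes after 20 layers, while with it the original input is carried forward at every step.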

This is what the ResNet architecture looks like in total.

![](https://drive.google.com/uc?id=1JD56yCUf_GoyQxy69l_jQ_hPImW6OCrd)

This Skip Connection Technique became so popular, that many modern Image architectures use this. For example, here is the EfficientNet architecture.

![](https://drive.google.com/uc?id=1-mcRP0oDyyfkqBS4cPQJPBM37nL418ie)

And the MobileNet architecture!

![](https://drive.google.com/uc?id=1_5_HT1RXCAgkfBs_Zu5NA9p5rXZG16vi)

Let's actually look inside the ResNet architecture!

The ResNet architecture has many variants, based on the number of layers. There is ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152. If you scroll up, you'll notice we used the ResNet34 architecture, because it falls in between the two extremes, and works satisfactorily.

Torchvision provides us with the architectures of these models, along with *pretrained* weights. [Link](https://pytorch.org/vision/stable/models.html) for all models provided by PyTorch. 

In [None]:
import torchvision

Here is the implementation of the resnet34 architecture. 

In [None]:
model=torchvision.models.resnet34(pretrained=True)

Let's look up the source code of the implementation of this model.

In [None]:
??model

You would have noticed the argument called `pretrained`. What's that?

### Pretrained Weights and Transfer Learning

The *fastai* `Learner` class provides a method called `summary`, which gives a summary of the layers the model contains. It provides information about each layer in the model, the number of parameters it has, as well as the input and output shapes of the data that passes through the model.

It also provides information about the total number of parameters in the model. Let us try to find the total number of parameters in the ResNet34 architecture, which is, by the way, on the lower end of the complexity spectrum of models.

In [None]:
learn.summary()

There are 21 million parameters in the model. It turns out that it is very computationally expensive to train this model from scratch for whatever task we have (cat vs dog in our example). Most of us do not have the resources to train models from scratch. It's very common for a model to need multiple GPUs running over multiple days to train from scratch.

So, how did we bring down the training time from multiple days to less than 2 minutes?

*Transfer Learning* is the answer. 

The way CNNs work is that the earlier layers of the model learn basic features, such as an edge or the presence of a colour. As you progress deeper into the model, the layers learn more complex features. Here is a great representation of this from a 2013 paper by Zeiler and Fergus.

![](https://drive.google.com/uc?id=1zvOzZbvCCP-jsZL2oj_4K30BOdBWyBEc)

The basic features are more or less common for all Image recognition tasks. For example, every image recognition task involves the model needing to identify edges, colours, basic shapes such as squares, circles, etc. It is only the later layers of the model that need to be different in order for the model to adapt for the particular problem that you're trying to solve. 

So, researchers train these architectures in advance on standard image datasets (such as ImageNet, a corpus of 1.4 million images containing 1000 categories of everyday objects, or the COCO dataset), and provide the parameters of the model publicly, so that we don't have to train them from scratch. These parameters are called pretrained weights, and reusing them for a new task is called *Transfer Learning*.

Now, for any task, all we need to do is *fine-tune* the later layers of the model to adapt it to our task.
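
The idea of training only the later layers can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not what `fine_tune` does internally: the tiny network below is a stand-in for a real pretrained model, and the names `body` and `head` are chosen here for illustration. The data is random.

```python
import torch
from torch import nn

# Stand-in for a pretrained feature extractor (a real one would have many more layers)
body = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 2)  # new final layer for our 2 classes
model = nn.Sequential(body, head)

# Freeze the body so that its weights are not updated during training
for p in body.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable, frozen)  # 18 224

# One training step on random data: only the head changes
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)
x = torch.randn(4, 3, 32, 32)
y = torch.tensor([0, 1, 0, 1])
body_before = body[0].weight.clone()

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()

print(torch.equal(body_before, body[0].weight))  # True: the frozen layer is unchanged
```

Because only the small head receives gradient updates, each training step is much cheaper than training the whole network.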




## Using external models for transfer learning

There is a practitioner named [Ross Wightman](https://github.com/rwightman), who provides PyTorch-compatible ImageNet-pretrained weights for many modern architectures - even those architectures which are not present by default in PyTorch or Keras. He provides us with a [library](https://github.com/rwightman/pytorch-image-models/tree/master/timm) called `timm`, which facilitates retrieving such models.

We would like to integrate these models with a fastai-style learner. But these models are not immediately ready to go into the learner object. That's because by default, all the layers are clubbed into a single module. To use all the functionality that fastai provides, you first need to split the layers into separate modules, and also make a few architectural changes to make the model compatible with the data that you're training with (ImageNet has 1000 classes, we have only 2)!

Zach Mueller is a very active contributor to the fastai community, and created a library that contains this functionality. You can find it [here](https://walkwithfastai.com/vision.external.timm).

Our purpose here is not to learn all the technical details at once, but rather to provide you with the tools to build your own models. Often, you will want to use models other than the standard ResNets or Inceptions. Such open-source work from the AI community is a great way to learn and grow.


In [None]:
!pip install timm >./tmp
!pip install wwf >./tmp

In [None]:
from wwf.vision.timm import *
import timm

These are all the models that are available in the timm library.

In [None]:
timm.list_models(pretrained=True)

Let us build a model with another architecture, say `inception_v4`. `wwf` provides us with a convenient function that takes in the model architecture, splits it into modules, modifies the architecture according to our data input and output dimensions, and returns a fastai-style learner object.

In [None]:
dls.show_batch()

In [None]:
learn = timm_learner(dls, 'inception_v4', metrics=accuracy)

In [None]:
learn.summary()

Then you can simply follow the same steps to train the model.

In [None]:
learn.fine_tune(4)

In [None]:
#This will be used in the exercise. This line can be used to get the final accuracy of the model
learn.recorder.values[-1][-1]
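
The double `[-1]` can be read as "last epoch, last column". The sketch below mimics this with a plain list, assuming (as the line above does) that the recorder keeps one row per epoch, with the losses first and the metric last. The numbers are made up for illustration and are not real results.

```python
# Hypothetical log of a 3-epoch run; each row is [train_loss, valid_loss, accuracy]
values = [
    [0.52, 0.31, 0.88],
    [0.29, 0.22, 0.93],
    [0.18, 0.19, 0.95],
]

last_epoch = values[-1]          # row of the final epoch
final_accuracy = last_epoch[-1]  # last column of that row, i.e. the metric
print(final_accuracy)  # 0.95
```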

That's it. Now you have all the tools to build your own Image Classification models. There are a few tasks that you should do before heading to the Exercises.

Finally, we would like to talk about the fastai library, which we have used extensively in this lab. The reason we did not use PyTorch directly is that there are way too many concepts to learn before implementing state-of-the-art models. It also does not make sense to implement watered-down or substandard models. One of the objectives of this course is to teach real-world Machine Learning, rather than Machine Learning from 5 or 10 years ago. Fastai is a pioneer in teaching Deep Learning in a top-down fashion - where you first learn how to implement models, and then learn the details.

So a lot of the details that we have not covered in this notebook are meant to be learnt slowly. We really recommend the fastai course to learn how to implement your own models, and also to learn the theory behind a lot of the techniques used in Deep Learning today.

## Additional Tasks
### Task 1
On the dataset you built, try implementing another model from `timm`. Make sure the model is not one which starts with `tf_` or `hr` as recommended by the wwf library. Train your model on this dataset.

### Task 2
Additional Research:
Many models have modules like Dropout, BatchNorm, and Pooling. Search on Google to understand what each one of these does. It is recommended that you go through the original research papers that introduced these, or that studied these topics in depth.

## Exercise

Build a new dataset with 5 categories (any classification problem will work), using `duckduckgo_search`. For every model that is available with pretrained weights and whose name does not start with `tf_` or `hr`, train using `fine_tune` for 4 epochs. Plot all the accuracies on a graph, with y denoting the accuracy and x fixed at 0. Annotate the points with the name of the architecture used, using `plt.annotate`.

In [None]:
import matplotlib.pyplot as plt

In [None]:
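architectures = timm.list_models(pretrained=True)
max_acc = (architectures[0],0) #(best architecture so far, its accuracy)
#Skip names starting with 'tf_' or 'hr'; [:2] keeps only the first two for a quick run, remove it to try them all
for arch in [o for o in architectures if not(o.startswith('tf_') or o.startswith('hr'))][:2]:
    learn = timm_learner(dls, arch, metrics=accuracy) #build the learner for the current architecture
    learn.fine_tune(4)
    acc=learn.recorder.values[-1][-1] #accuracy after the final epoch
    if max_acc[-1]<acc: max_acc=(arch,acc)
    plt.scatter(0,acc)
    plt.annotate(arch,(0.1,acc))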

In [None]:
max_acc