In this series of posts I will be going through how convolutional neural networks (CNNs) do what they do. I will attempt to explain it in a way that is both accessible and comprehensive. I will not be assuming any prior knowledge about neural networks, nor will I be discussing the history of these networks. This series is primarily for those who have heard about CNNs and need a primer in order to effectively use them for their research and/or commercial product, or just want to impress their friends by using big words.
This will be a three-part series, and here is where I see it going (*you are here):
Overview of CNNs*
What they are good at
What a basic CNN looks like
The basic mechanisms involved
Opening the black box
What exactly is happening inside CNNs
What do we have control over?
Introduction to parameters that we can directly change (Hyperparameters)
All right, without further ado, why don't we begin the series?
Overview of Convolutional Neural Networks (CNNs)
In my previous post I stated that CNNs are very good at categorization and that their preferred input is image files. Indeed, CNNs do a great job of leveraging spatial invariance: patterns that are shared within a group and that are unique to that group. What I mean by this is that CNNs capitalize on areas of images that are unique to a given category to help them classify the image as such. In order to get a handle on how they do this, it may be helpful to first see a visualization of a basic CNN.
I realize that this may be a touch confusing when first looking at it, so let's go through each stage in turn.
In the above image, the input layer is the picture of the boats on the water. Now, I think it is safe to say that computers don't see images the way people do. Maybe. When we see an image, all the processing is already done: we see an entire scene and can say with a certain level of assuredness what is in the picture. Computers don't come with the implicit knowledge that we have built up over the years and only see a smattering of numbers, in the form of a matrix (intuitive example of computer vision: name this movie), representing color hues and the presence or absence of color.
Once the image is read into the computer in a way that it can interpret (i.e., a series of matrices), we can begin to perform the computations that occur in the subsequent layers.
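To make the "series of matrices" idea concrete, here is a minimal sketch in plain Python. The 2x2 image and the helper name `split_channels` are made up for illustration; a real image would simply have many more rows and columns.

```python
# A tiny, made-up 2x2 color image. Each pixel is (red, green, blue),
# with every value between 0 (none of that color) and 255 (full intensity).
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def split_channels(img):
    """Return one matrix (a list of rows) per color channel."""
    n_channels = len(img[0][0])
    return [
        [[pixel[c] for pixel in row] for row in img]
        for c in range(n_channels)
    ]


red, green, blue = split_channels(image)
print(red)    # [[255, 0], [0, 255]]
print(green)  # [[0, 255], [0, 255]]
print(blue)   # [[0, 0], [255, 255]]
```

Each of the three matrices is what the computer actually "sees": just numbers, one grid per color channel.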
In CNNs there is typically a convolution layer that immediately follows the input layer. These convolution layers are what make CNNs...well...convolutional. What is happening in this layer is that a sliding window is passing over the image from top to bottom and from left to right. Within this window, math is happening on the raw pixel information (the math that is being applied to the pixel window is determined by a small grid of weights, usually called a filter or kernel, and the result is then passed through your activation function, which we will discuss in part 2 of this series). The result of the math is then saved as a feature or activation map, which is comprised of a series of matrices that are a little bit smaller than the original input image. For those of you who are visual learners, as I am, below is a lovely gif (pronounced with a hard G) that shows what is happening in the convolution layer.
Oddly therapeutic, isn't it? Hopefully the image cleared up the mess that is my written explanation. While it looks like the numbers that appear in the convolved feature do so magically, they are actually determined by the filter that is applied during the convolution (I'll touch further on this in part 2). You may have noticed that there are red numbers in the bottom right corner of the sliding window in figure 3. These are the weights of the filter that is applied to the window. Each pixel is multiplied by either 1 or 0, the results within the window are then added up, and that is what nets you the activation value in the corresponding feature map. Simple, no? Let us continue through the network.
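If you would rather read the sliding window than watch it, here is a rough sketch in plain Python. The 5x5 image, the 3x3 grid of 1s and 0s (called `kernel` here), and the function name `convolve2d` are all illustrative choices of mine, and the sketch assumes the window moves one pixel at a time with no padding around the image.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image`, multiplying and summing inside each window.

    Assumes a step of one pixel and no padding. The kernel is not flipped,
    which is the usual convention in deep learning libraries.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for top in range(out_rows):          # move the window top to bottom
        row = []
        for left in range(out_cols):     # move the window left to right
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Made-up single-channel image (1 = color present, 0 = absent).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
# Made-up 3x3 window of multipliers.
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

Notice that the 5x5 input comes out as a 3x3 feature map, which is the "a little bit smaller than the original" effect in action.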
Following a convolution layer, there is typically a pooling layer. There are different kinds of pooling layer, but the most used and generally most effective one is max pooling. The mechanism behind max pooling is very simple: apply a sliding window to the convolved features, keep the biggest number in the window, and throw away the others.
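Max pooling really is as simple as it sounds, and a minimal sketch in plain Python shows it. The 4x4 feature map below is made up, the function name `max_pool2d` is my own, and the 2x2 window that jumps two cells at a time is a common choice rather than a rule.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the largest value inside each size x size window."""
    out_rows = (len(feature_map) - size) // stride + 1
    out_cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Collect every value that falls inside the current window.
            window = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))  # keep the biggest, throw away the rest
        pooled.append(row)
    return pooled


# Made-up 4x4 convolved feature map.
convolved = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool2d(convolved)
print(pooled)  # [[6, 8], [3, 4]]
```

Sixteen numbers go in and four come out, so each pooling layer shrinks the amount of information the later layers have to deal with.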
The pattern of convolution layer followed by max pooling is repeated as many times as is needed (generally more when using large or particularly complex images), until you finally want to use the information to make category predictions. At this time you unravel the activation maps into a fully connected vector.
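To get a feel for how many times you can repeat the pattern, it helps to track how the image shrinks along the way. The sketch below is only arithmetic and rests on assumptions of mine: a made-up 32x32 input, 3x3 convolution windows that move one pixel at a time with no padding, and 2x2 max pooling that jumps two cells at a time.

```python
def conv_output_size(n, kernel=3):
    """Side length after a convolution with step 1 and no padding."""
    return n - kernel + 1


def pool_output_size(n, size=2, stride=2):
    """Side length after pooling (leftover edge cells are dropped)."""
    return (n - size) // stride + 1


side = 32          # hypothetical 32x32 input image
sizes = [side]
for _ in range(3):  # three rounds of convolution followed by max pooling
    side = pool_output_size(conv_output_size(side))
    sizes.append(side)

print(sizes)  # [32, 15, 6, 2]
```

After three rounds the 32x32 image is down to 2x2, which is why larger images leave room for more convolution and pooling stages.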
Fully connected layer
So what do I mean by a fully connected vector? I believe the best way to describe it is by way of a wooden fidget puzzle.
The puzzles are a bunch of little wooden blocks that are connected by a piece of elastic string. Imagine that your activation map is represented by the setup on the left side of figure 5 and each wooden block represents a pooled convolved feature (a single block on the right side of figure 4). In this case you would have a 4x3x1 (4 rows, 3 columns, 1 deep) matrix. When you go from the pooling layer, or convolution layer, to a fully connected layer, you unravel the matrix into a single line. In our case, our 4x3x1 matrix would become a 1x12 (1 row, 12 columns) vector. This elongated shape is maintained, but the length is reduced, with each subsequent layer until we have a vector with the same number of blocks as the number of categories we have in the data set.
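Here is the fidget puzzle in code form, as a rough sketch in plain Python. The 4x3 block of numbers, the layer sizes (12 to 6 to 2), and the fixed weights are all made up for illustration; a real network learns its weights during training, and real layers also add a bias and an activation function, which I leave out here.

```python
def flatten(matrix):
    """Unravel a matrix into a single line, row by row."""
    return [value for row in matrix for value in row]


def fully_connected(inputs, weights):
    """Each output is a weighted sum of every input, hence "fully connected"."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


# Made-up 4x3 block of pooled features (4 rows, 3 columns).
pooled = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12],
]

vector = flatten(pooled)  # 1x12 vector

# Hypothetical fixed weights: 12 inputs -> 6 values -> 2 categories.
weights_1 = [[0.1] * 12 for _ in range(6)]
weights_2 = [[1.0] * 6 for _ in range(2)]

hidden = fully_connected(vector, weights_1)
scores = fully_connected(hidden, weights_2)

print(len(vector), len(hidden), len(scores))  # 12 6 2
```

The line of blocks gets shorter at every step until there is exactly one block per category, two in this pretend data set.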
That is all there really is to CNNs. The basic architecture is pretty straightforward and follows a series of conventions that have sprung up over the years as their use has become more and more popular. Just to review what we have gone through until now:
What are CNNs good at?
Image classification based on spatial invariance
What does a basic CNN look like?
Input layer -> Convolution layer -> Max pooling layer -> Convolution + pooling .....
-> Fully connected layer -> Fully connected layer -> Classification layer (Output)
What are the basic mechanisms involved?
Presenting images in a way that a computer can use
Multi-channel pixel information
Applying a filter, followed by an activation function, to distinct windows across the pixel information
Max pooling: taking the largest number in a defined window applied to an activation map
Fully connected layer
Unraveling the activation maps, like interconnected strings of spaghetti
A fully connected layer that is the same size as the number of categories you have
In my next post, part 2 of CNNs 101, we will delve into the black box, take a look at some of the maths involved in the success of CNNs, and maybe even learn a little bit about ourselves. Now go impress your friends with some of your newfound knowledge.
I'm out chyea.