CNNs 101: How do they work? (Part 1 of 3)

In this series of posts I will be going through how convolutional neural networks (CNNs) do what they do. I will attempt to explain it in a way that is both accessible and comprehensive. I will not be assuming any prior knowledge about neural networks, nor will I be discussing the history of these networks. This series is primarily for those who have heard about CNNs and need a primer in order to effectively use them for their research and/or commercial product, or just want to impress their friends by using big words.

This will be a three-part series, and here is where I see it going (*you are here):

  1. Overview of CNNs*

    • What they are good at

    • What a basic CNN looks like

    • The basic mechanisms involved

  2. Opening the black box

    • What exactly is happening inside CNNs

    • What do we have control over?

      • Introduction to parameters that we can directly change (Hyperparameters)

  3. Practical applications

All right, without further ado, let's begin the series.


Overview of Convolutional Neural Networks (CNNs)

In my previous post I stated that CNNs are very good at categorization and that their preferred input is image files. Indeed, CNNs do a great job of leveraging spatial features that are shared within a category and that set it apart from other categories. What I mean by this is that CNNs capitalize on areas of images that are unique to a given category to help them classify the image as such. To get a handle on how they do this, it may be helpful to first see a visualization of a basic CNN.

Figure 1: A visualization of a very basic CNN architecture. From left to right: the input layer displaying an image of a couple of boats; convolutions are performed to give rise to a series of activation maps; pooling is then performed on the activation maps, which reduces the overall number of activations; convolution and pooling are done again before unraveling and fully connecting the activation maps; the values from the fully connected activation maps are then ultimately used for determining the category of the image.

This image is from a post from WildML about applying CNNs to natural language processing.

I realize that this may be a touch confusing when first looking at it, so let's go through each stage in turn.

Input layer

In the above image, the input layer is the picture of the boats on the water. Now, I think it is safe to say that computers don't see images the way people do. Maybe. When we see an image, all the processing is already done: we see an entire scene and can say with a certain level of assuredness what is in the picture. Computers don't come with the implicit knowledge that we have built up over the years and only see a smattering of numbers, in the form of a matrix (intuitive example of computer vision: name this movie), representing color hues and the presence or absence of color.

Figure 2: An example of how an image may be read by a computer using a 3-channel format (RGB).

This picture was made using Inkscape. You may click the image to download the .svg file if you would like to edit and use the theme yourself; just right-click and save the image as whatever you would like.

Once the image is read into the computer in a way that it can interpret (i.e., a series of matrices), we can begin to perform the computations that occur in the subsequent layers.
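To make this concrete, here is a minimal sketch of reading an image into a stack of matrices using NumPy and Pillow. The filename "boats.jpg" is just a placeholder for any RGB image you have on hand.

```python
import numpy as np
from PIL import Image

# Read an image and convert it to a height x width x 3 array:
# one matrix of intensities (0-255) per color channel.
img = np.array(Image.open("boats.jpg").convert("RGB"))

print(img.shape)            # e.g. (480, 640, 3)
print(img[0, 0])            # the top-left pixel as [R, G, B] values
red_channel = img[:, :, 0]  # a single height x width matrix, as in figure 2
```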

Convolution layer

In CNNs there is typically a convolution layer immediately following the input layer. These convolution layers are what make CNNs...well...convolutional. What is happening in this layer is that a sliding window passes over the image from top to bottom and from left to right. Within this window, math is happening on the raw pixel information (the math applied to the pixel window is determined by a small matrix of weights called a filter, or kernel, and the result is passed through an activation function, which we will discuss in part 2 of this series). The result of the math is then saved in a feature or activation map, a matrix that is a little bit smaller than the original input image; running several filters gives you a series of these maps. For those of you who are visual learners, as I am, below is a lovely gif (pronounced with a hard G) that shows what is happening in the convolution layer.

Figure 3: This is an animation of what occurs in a convolution layer. The pixel information of the original image is shown on the green background and the convolved information on the red background. The 3x3 sliding window is outlined in yellow. This window has a stride length of one, as it moves over one column or one row as it progresses. Source.

Oddly therapeutic, isn't it? Hopefully the image cleared up the mess that is my written explanation. While it looks like the numbers that appear in the convolved feature do so magically, they are actually determined by the filter that is applied during the convolution (I'll touch further on this in part 2). You may have noticed that there are red numbers in the bottom right corner of the sliding window in figure 3. These are the filter values applied to that window: each pixel is multiplied by either 1 or 0, the results within the window are added up, and that sum is the activation value of the corresponding cell in the feature map. Simple, no? Let us continue through the network.
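If you would rather see the sliding window as code, here is a small sketch of the convolution in figure 3 in plain NumPy. The image and the 0/1 filter follow the animation as best I can read it; the function name and shapes are just illustrative.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum
    the element-wise products at each window position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# The 5x5 image and 3x3 filter of 1s and 0s from the figure 3 animation.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

print(convolve2d(image, kernel))
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```

Notice how the 5x5 image shrinks to a 3x3 feature map: the window can only sit in 3 positions along each axis, which is why the maps come out a little smaller than the input.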

Pooling layer

Following a convolution layer, there is typically a pooling layer. There are different kinds of pooling layers, but the most commonly used, and generally most effective, is max pooling. The mechanism behind max pooling is very simple: apply a sliding window to the convolved features, keep the biggest number in the window, and throw away the others.

Figure 4: A visualization of what is occurring in a max pooling layer. In this case, a 2x2 sliding window with a stride of 2 (i.e., it moves 2 columns when it progresses right and 2 rows when it progresses down) is applied to a feature or activation map. The left side shows what the activation map looks like before max pooling. The right side shows what the activation map looks like after max pooling. Source.
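Here is the same idea as a short NumPy sketch. The 4x4 map and its values are illustrative stand-ins, not taken from the figure.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window,
    moving `stride` rows/columns between windows."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1, 1, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])

print(max_pool(fmap))
# [[6. 8.]
#  [3. 4.]]
```

Note how the 4x4 map shrinks to 2x2: three quarters of the activations are thrown away, which is exactly the reduction described above.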

The pattern of convolution layer followed by max pooling is repeated as many times as is needed (generally more when using large or particularly complex images), until you finally want to use the information to make category predictions. At this time you unravel the activation maps into a fully connected vector.

Fully connected layer

So what do I mean by a fully connected vector? I believe the best way to describe it is by way of a wooden fidget puzzle.

Figure 5: Source.

These puzzles are a bunch of little wooden blocks that are connected by a piece of elastic string. Imagine that your activation map is represented by the setup on the left side of figure 5 and each wooden block represents a pooled convolved feature (a single block on the right side of figure 4). In this case you would have a 4x3x1 (4 rows, 3 columns, 1 channel) matrix. When you go from the pooling layer, or convolution layer, to a fully connected layer, you unravel the matrix into a single line. In our case, our 4x3x1 matrix would become a 1x12 (1 row, 12 columns) vector. This elongated shape is maintained, but the length is reduced, with each subsequent layer until we have a vector with the same number of blocks as the number of categories in the data set.
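As a quick sketch, here is that unraveling in NumPy, followed by what a fully connected layer does with the result: a matrix multiply in which every flattened value connects to every output unit. The shapes match the fidget-puzzle example; the weights are random stand-ins for values the network would learn.

```python
import numpy as np

# A hypothetical 4x3x1 stack of pooled activations (4 rows, 3 columns, 1 channel).
pooled = np.arange(12, dtype=float).reshape(4, 3, 1)

flat = pooled.reshape(1, -1)   # "unravel" into a 1x12 vector
print(flat.shape)              # (1, 12)

# A fully connected layer: every input connects to every output unit.
n_categories = 3               # assume a 3-category data set
rng = np.random.default_rng(0)
W = rng.standard_normal((12, n_categories))   # learned in a real network
b = np.zeros(n_categories)

scores = flat @ W + b          # one score per category
print(scores.shape)            # (1, 3)
```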

Recap

That is all there really is to CNNs. The basic architecture is pretty straightforward and follows a series of conventions that have sprung up over the years as their use has become more and more popular. Just to review what we have gone through until now (a short sketch tying all the stages together follows this list):

  1. What are CNNs good at?

    • Image classification based on spatial invariance

  2. What does a basic CNN look like?

    • Input layer -> Convolution layer -> Max pooling layer -> Convolution + pooling .....
      -> Fully connected layer -> Fully connected layer -> Classification layer (Output)

  3. What are the basic mechanisms involved?

    • Input layer

      • Presenting images in a way that a computer can use

        • Multi-channel pixel information

    • Convolution layer

      • Applying an activation function to distinct windows across the pixel information

    • Pooling layer

      • Max pooling: taking the largest number in a defined window applied to an activation map

    • Fully connected layer

      • Unraveling the activation maps, like interconnected strings of spaghetti

    • Output layer

      • A fully connected layer that is the same size as the number of categories you have
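To tie the stages together, here is a toy forward pass that strings the earlier sketches into the pipeline from point 2. It reuses the convolve2d and max_pool functions defined above, and the image, filter, and category count are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

image = rng.random((8, 8))            # stand-in for a grayscale input image
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)

fmap = convolve2d(image, kernel)      # 8x8 image -> 6x6 activation map
pooled = max_pool(fmap)               # 6x6 -> 3x3 after 2x2, stride-2 pooling
flat = pooled.reshape(1, -1)          # 3x3 -> 1x9 vector

n_categories = 3
W = rng.standard_normal((flat.shape[1], n_categories))
scores = flat @ W                     # one score per category
print(scores.shape)                   # (1, 3)
```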

In my next post, part 2 of CNNs 101, we will delve into the black box, take a look at some of the maths involved in the success of CNNs, and maybe even learn a little bit about ourselves. Now go impress your friends with some of your newfound knowledge.

I'm out chyea.