Convolutional Neural Networks and Image Classification: Computer Vision in Deep Learning

Updated: Jul 26, 2020

Vision is arguably the most complex and fascinating of all the human senses. Neurons form networks of networks to capture light from the environment and relay the signal to the brain, where the information is eventually processed. What would happen if machines could “see” and identify different things, just as we do? How would this work? The answer lies in the core principles of Convolutional Neural Networks (CNNs), a type of deep learning model.

A simple representation of the basic structure of CNNs

What is Deep Learning?

Deep learning is a specialized form of machine learning that uses more complex architectures and much larger neural networks. Unlike many traditional machine learning algorithms, which often take a linear approach to the task, deep learning models stack nonlinear transformations, which can substantially improve their capacity and accuracy on large data sets. One of the most important abilities of deep learning models is automatic feature extraction: the model learns useful features directly from the data instead of relying on hand-designed ones. This allows it to perform tasks such as image classification and detection more effectively than traditional machine learning models. But how do deep learning models gain this ability? For images, the unique architecture, or structure, of CNNs answers this question.

Basic Structure and Components

When an image is fed into the system as an input, the model treats it as a matrix of pixel values. A grayscale image usually has only one channel, while a color (RGB) image usually has three, one for each color (red, green, blue). A traditional Multi-Layer Perceptron (MLP) network may be adequate for detecting or classifying small images, but for images with a very large number of pixels it becomes inefficient because of the amount of data it has to process: every input pixel needs its own weight for every neuron in the first layer. This is where CNNs come into play. Unlike an MLP network, a CNN uses small, reusable filters that are shared across the whole image, so it needs far fewer learnable parameters (weights and biases) while still processing the data accurately. The entire system comprises multiple layers, each with neurons or nodes that receive inputs and compute outputs.
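Before going through the layers one by one, here is a rough sketch of the parameter savings mentioned above. The image size (224 x 224), the 128 hidden units, and the 32 filters of size 3 x 3 are hypothetical numbers chosen only for illustration, and the helper names are made up for this example.

```python
import numpy as np

# Hypothetical image sizes, chosen only for illustration.
gray = np.zeros((224, 224))      # grayscale image: a single channel
rgb = np.zeros((224, 224, 3))    # RGB image: three channels

def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected (MLP) layer."""
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of one convolutional layer.

    Each filter is reused at every image position, so the count
    does not depend on the image size.
    """
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

# First layer of an MLP with 128 units fed with the flattened RGB image.
mlp_params = dense_params(rgb.size, 128)

# First layer of a CNN with 32 filters of size 3 x 3 over 3 channels.
cnn_params = conv_params(3, 3, 3, 32)

print(gray.shape, rgb.shape)
print("MLP first layer:", mlp_params)
print("CNN first layer:", cnn_params)
```

Under these assumptions the fully connected layer needs about 19 million parameters, while the convolutional layer needs fewer than a thousand.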

  • Input Layer: Takes the image data as inputs.

  • Convolutional Layer: Extracts specific features from the image using different filters that scan across the image. Each filter produces a “feature map” from the receptive fields it covers.

Consider this 5 x 5 image as an example. It is a special case: a matrix whose pixel values are only 0 or 1.

Now consider this 3 x 3 matrix below.

The convolution of the 5 x 5 image with the 3 x 3 matrix is essentially the filter scanning across the image, as shown by this animation. The filter only “sees” part of the image at a time, and a single output value is calculated by multiplying each filter value by the pixel beneath it and adding up the products. Filter matrices with different values will therefore produce different feature maps, and the size of the feature maps is determined by parameters such as depth, stride, and padding.
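The scanning operation can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation: the function name `conv2d` is made up for this example, and the specific 0/1 pixel values and filter values are assumptions that are not necessarily the ones shown in the figures. As in most deep learning libraries, the filter is applied without flipping it.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image`; each output is the sum of element-wise products."""
    if padding > 0:
        # Surround the image with a border of zeros.
        image = np.pad(image, padding, mode="constant")
    kh, kw = kernel.shape
    # Output size along each axis: floor((N - F) / stride) + 1, after padding.
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=np.result_type(image, kernel))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]   # the part the filter "sees"
            out[i, j] = np.sum(patch * kernel)
    return out

# Illustrative 5 x 5 binary image and 3 x 3 filter (assumed values).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = conv2d(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

With a stride of 1 and no padding the 5 x 5 input shrinks to a 3 x 3 feature map. Calling `conv2d(image, kernel, padding=1)` keeps the output at 5 x 5, and `conv2d(image, kernel, stride=2)` reduces it to 2 x 2.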

ReLU (Rectified Linear Unit): a commonly used activation function with a threshold of 0, which replaces all negative values with 0. It introduces the nonlinearity that neurons need in order to learn from real-world data, so it is often applied right after convolutional layers.

Graph of ReLU function
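
As a small sketch, ReLU can be written as an element-wise maximum with 0. The input values below are made up for illustration.

```python
import numpy as np

def relu(x):
    """Replace every negative value with 0 and leave the rest unchanged."""
    return np.maximum(x, 0)

# Hypothetical feature-map values, including some negatives.
feature_map = np.array([[-3.0, 1.5],
                        [0.0, -0.2]])
activated = relu(feature_map)
print(activated)
```

Only the positive entry (1.5) survives; every other entry becomes 0.
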
  • Pooling Layer: reduces the dimensionality of the feature maps while retaining their most important features. Also known as downsampling, this makes data processing more manageable and helps limit overfitting by reducing the number of values, and therefore parameters, that later layers must handle. It also makes detection less sensitive to an object's exact location, giving a more invariant representation of the image. Two of the most common pooling algorithms are max pooling and average pooling. Max pooling takes the maximum value within each pooling window of a feature map as the output, as demonstrated in the figure below. Similarly, average pooling takes the average value within each window. Pooling operates independently on each input feature map, so the number of output feature maps equals the number of input feature maps.

Example of Max Pooling
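
Both pooling variants can be sketched with the same loop, changing only the function applied to each window. The 2 x 2 window, the stride of 2, the helper name `pool2d`, and the input values are assumptions made for this illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample one feature map by taking the max or mean of each window."""
    reduce = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = feature_map[r:r + size, c:c + size]
            out[i, j] = reduce(window)
    return out

# Hypothetical 4 x 4 feature map.
fm = np.array([[1, 1, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]])

max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="average")
print(max_pooled)   # [[6. 8.] [3. 4.]]
print(avg_pooled)   # [[3.25 5.25] [2. 2.]]
```

Each 4 x 4 map is reduced to 2 x 2. In a layer with several feature maps, the same function would be applied to each map separately.
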
  • Fully Connected Layer (FC Layer): the “endgame” of the network, in which every neuron is connected to every neuron in the previous layer. It combines all the features extracted earlier to compute a probability or confidence value for classifying the image. FC layers usually use a softmax activation function for multi-class classification tasks and a sigmoid (logistic) function for binary classification tasks. Both turn raw scores into probabilities, which makes them well suited to image detection and classification.

  • Output Layer: gives the final output of the system, which is based on the prediction values computed by the preceding layers. In image classification, for example, the final output will be the identified class of the image.
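
The last two steps in the list above can be sketched as follows: the raw scores from the fully connected layer are turned into probabilities, and the class with the highest probability becomes the output. The three scores are hypothetical values, not the result of a trained model.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def sigmoid(z):
    """Logistic function, used for binary classification."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores from the FC layer for three classes.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
predicted_class = int(np.argmax(probs))   # index of the most likely class

print(probs, predicted_class)
print(sigmoid(0.0))   # a score of 0 maps to a probability of 0.5
```

Here the first class has the highest score, so it is the predicted class. In the binary case a single sigmoid output is usually compared with a threshold such as 0.5.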

An example of a CNN applied to identify images of handwritten numbers.

Note: There can be multiple convolutional and pooling layers in a CNN model, and the arrangements of these layers can be very different. In fact, it is these different arrangements that constitute the unique architecture for different CNN models that I will discuss later in this blog.

Training and Learning

Now with a brief overview of the structure and basic components of CNNs, it will be easier to understand how specifically these networks “learn” to detect objects or classify image