Convolutional Neural Networks and Image Classification: Computer Vision in Deep Learning
Updated: Jul 25, 2020
Vision is arguably the most complex and fascinating of all the human senses. Networks of neurons capture light from the environment and relay the signal to the brain, where the information is eventually processed. What would happen if machines could “see” and identify different things, just like we do? How would this work? Well, the answer lies in the core principles of Convolutional Neural Networks (CNNs), a form of deep learning.

What is Deep Learning?
Deep learning is a specialized form of machine learning built on larger and more complex neural network architectures. Unlike traditional machine learning algorithms, which usually take a linear approach to the task, deep learning can tackle the problem through nonlinear transformations, which substantially improves its capacity and accuracy when analyzing big data sets. One of the most important abilities of deep learning models is automatic feature extraction: the model learns relevant features on its own, which allows it to perform tasks such as image classification and detection more efficiently than traditional machine learning models. But how do deep learning models gain this ability? The unique architecture, or structure, of CNNs answers this question.
Basic Structure and Components
When an image is fed into the system as an input, the model represents it as a matrix of pixel values. A gray-scale image usually has only one channel, while a colored image, or RGB image, usually has three, one for each color (red, green, blue). A traditional Multi-Layer Perceptron (MLP) network may be feasible for detecting or classifying small images, but for high-resolution images an MLP quickly becomes inefficient because of the enormous amount of data it needs to process. This is where CNNs come into play. Unlike an MLP network, a CNN can effectively reduce the spatial volume of the network by incorporating fewer learnable parameters (weights and biases) and reusable filters, while still processing the data accurately. The entire system comprises multiple layers, each with neurons, or nodes, that receive inputs and compute outputs.
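To make that parameter saving concrete, here is a minimal NumPy sketch; the 28 x 28 image size and the 100-unit hidden layer are illustrative assumptions, not values from any particular model.

```python
import numpy as np

# A gray-scale image: one channel, here 28 x 28 pixels (illustrative size)
gray = np.zeros((28, 28))

# An RGB image: three channels (red, green, blue) of the same height and width
rgb = np.zeros((28, 28, 3))

# A fully connected (MLP) layer mapping this image to 100 hidden units
# needs one weight per pixel per unit:
mlp_weights = rgb.size * 100   # 28 * 28 * 3 * 100 = 235,200 weights

# A convolutional layer reuses one small filter across the whole image,
# e.g. a 3 x 3 filter applied across all 3 channels:
conv_weights = 3 * 3 * 3       # 27 weights, regardless of image size

print(mlp_weights, conv_weights)
```

The gap only widens as images grow, which is exactly why CNNs scale to high-resolution inputs where MLPs do not.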
Input Layer: Takes the image data as inputs.
Convolutional Layer: Extracts specific features from the image using different filters that scan across it, creating a “feature map” from the connected receptive fields.
Consider this 5 x 5 image as an example. It is a special case: a matrix with only two pixel values, 0 and 1.

Now consider this 3 x 3 matrix below.

The convolution of the 5 x 5 image with the 3 x 3 matrix is basically the filter scanning across the image, shown by this animation.

The filter only “sees” part of the image at a time, and each single output value is computed by summing the element-wise products of the filter and the image patch it covers. Therefore, filter matrices with different values will produce different feature maps, and the size of a feature map is determined by various parameters, such as depth, stride, and padding.
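As a sketch of this operation, the snippet below convolves a binary 5 x 5 image with a 3 x 3 filter in plain NumPy, using stride 1 and no padding; the specific pixel and filter values are assumptions for illustration.

```python
import numpy as np

# A 5 x 5 binary image (values chosen for illustration)
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])

# A 3 x 3 filter (also illustrative)
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

# Slide the filter across the image with stride 1 and no padding:
# output size = (5 - 3 + 1) x (5 - 3 + 1) = 3 x 3
feature_map = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]
        # One output value = sum of the element-wise products
        feature_map[i, j] = np.sum(patch * kernel)

print(feature_map)
```

The printed 3 x 3 feature map is exactly what gets passed along to the next layer.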
ReLU: Rectified Linear Unit; a commonly used activation function with a threshold of 0 that replaces all negative values with 0. It introduces the nonlinearity that neurons need as they “study” real-world data, so it is often included after convolutional layers.
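In code, ReLU is essentially a one-liner; here is a minimal NumPy sketch:

```python
import numpy as np

def relu(x):
    # Element-wise: keep positive values, replace negatives with 0
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0. 0. 0. 1.5 3.]
```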

Pooling Layer: reduces the dimensionality of the feature maps while retaining the most important features of the image. Also known as downsampling, this tactic makes data processing more manageable and prevents overfitting by reducing the number of parameters. It also allows objects to be detected regardless of their location, establishing a more invariant representation of the image. Two of the most common pooling algorithms are Max Pooling and Average Pooling. Max pooling takes the maximum pixel value of each window of the feature map as the output, demonstrated in the figure below. Similarly, average pooling takes the average pixel value of each window as the output. Pooling operates independently on each input feature map, so the number of output feature maps equals the number of input maps.
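As a sketch, here is 2 x 2 max pooling with stride 2 in NumPy; the input values are made up for illustration.

```python
import numpy as np

# A 4 x 4 feature map (illustrative values)
fmap = np.array([
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

# 2 x 2 max pooling with stride 2: keep the largest value in each window,
# halving the height and width of the feature map
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 8]
               #  [3 4]]
```

Swapping `max` for `mean` in the same line would give average pooling instead.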

Fully Connected Layer (FC Layer): the “endgame” of this network, in which every neuron is connected to every neuron in the previous layer. It compiles all the data and features extracted before to compute a probability, or confidence value, for classifying the image. FC layers usually use a softmax activation function for multi-class classification tasks and a sigmoid (logistic) function for binary classification tasks. Both are effective classifiers that are extremely helpful in image detection and classification.
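As a sketch, softmax converts the raw scores (logits) of the final layer into class probabilities that sum to 1; the scores below are made up for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize the exponentials
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw scores for 3 hypothetical classes
probs = softmax(scores)
print(probs)           # ~[0.659 0.242 0.099], sums to 1
print(probs.argmax())  # 0 -> index of the predicted class
```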
Output Layer: gives the final output of the system based on the prediction values computed by the previous layers. In image classification, for example, the final output is the predicted class of the image.

Note: There can be multiple convolutional and pooling layers in a CNN model, and these layers can be arranged in many different ways. In fact, it is these different arrangements that constitute the unique architectures of the different CNN models that I will discuss later in this blog.
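To show how these pieces stack, here is a minimal PyTorch sketch of a CNN with two convolution/pooling stages followed by a fully connected classifier; the layer sizes, the 32 x 32 input, and the 10-class output are illustrative assumptions, not a specific published architecture.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Stage 1: convolution -> ReLU -> max pooling
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 input channels (RGB)
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            # Stage 2: convolution -> ReLU -> max pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)           # flatten all but the batch dimension
        return self.classifier(x)  # raw class scores (softmax applied in the loss)

model = SimpleCNN()
dummy = torch.randn(1, 3, 32, 32)  # one 32x32 RGB image (illustrative size)
print(model(dummy).shape)          # torch.Size([1, 10])
```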
Training and Learning
Now, with a brief overview of the structure and basic components of CNNs, it will be easier to understand how specifically these networks “learn” to detect objects or classify images. Just like humans, machines also need to “train” in order to learn a task fully. Here’s a summary of how a CNN model trains with datasets (a minimal training-loop sketch follows the list):
1. Initialization: randomly assign weight values for each input and set up the filters and parameters.
2. Forward propagation: pass the input through the network to produce an output probability for each category; at first these numbers are essentially random and inaccurate.
3. Error calculation: compute the total error by summing the error probability over all classes.
4. Backpropagation (BackProp): propagate the errors back through each layer and adjust/update the weights and biases to improve the final predicted probability. The whole process can be viewed as “learning from mistakes.”
5. Optimization: use gradient descent, where the parameters are adjusted step by step to find the minimum of the output error. You can picture this process like a ball rolling into a valley, like the diagram below.

6. Testing: evaluate the model’s accuracy with a new set of data.

In summary, the entire learning process of CNN models is adjusting parameters until a desirable accuracy is reached.
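Here is that loop as a minimal PyTorch sketch, reusing the SimpleCNN sketch from earlier; the learning rate, batch size, and labels are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = SimpleCNN()                                        # from the sketch above
criterion = nn.CrossEntropyLoss()                          # total error across classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent

# One illustrative batch: 8 random 32x32 RGB "images" with made-up labels
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))

for step in range(5):
    outputs = model(images)             # 2. forward propagation
    loss = criterion(outputs, labels)   # 3. calculate the total error
    optimizer.zero_grad()
    loss.backward()                     # 4. backpropagation
    optimizer.step()                    # 5. gradient descent update
    print(step, loss.item())
```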
Specific Models
Since its first successful model, LeNet-5, developed in 1998, CNN has come a long way, with larger and more efficient models whose innovative architectures perform tasks more effectively. Here are a few representative examples:
AlexNet (2012): popularized deep CNNs by winning the ImageNet competition by a wide margin, using stacked convolutional layers, ReLU activations, and dropout.
VGGNet (2014): demonstrated that depth matters by stacking many small 3 x 3 filters into 16- and 19-layer networks.
GoogLeNet (2014): introduced the Inception module, which applies filters of several sizes in parallel within a single layer.
ResNet (2015): introduced residual (skip) connections, making it possible to train networks more than a hundred layers deep.
Applications
Because of their broad use in image identification and classification, CNN models can be incorporated into sustainable waste treatment and recycling programs, especially in countries such as South Korea and Japan that have already implemented policies requiring residents to manually separate their household waste and dispose of it properly. For example, a CNN could power an app that identifies the type of each piece of waste, thereby helping residents sort and discard their waste accurately. In conclusion, CNN is truly an extremely powerful technology that has enormous potential to aid innovative developments in the future. It’s important to note that this blog only scratches the surface of CNNs, and many complex mathematical details have been omitted. Nevertheless, CNN is undoubtedly a vision of the future.