Introduction to LeNet
• Developed by Yann LeCun et al., 1998, for
handwritten digit recognition (MNIST dataset)
• One of the first CNN architectures
• Input: 32×32 grayscale image
• Output: 10 classes (digits 0–9)
Applications
• Handwriting recognition in postal services and banking.
• Object and face recognition in images and
videos.
• Autonomous driving systems for recognizing
and interpreting road signs.
Layer C1 (Convolutional Layer)
• Feature Maps: 6 feature maps.
• Connections: Each unit is connected to a 5x5 neighborhood in the input, producing 28x28 feature maps so that receptive fields stay inside the image boundary.
• Parameters: 156 trainable parameters and 122,304 connections (28×28×6 units × 26 inputs each, including the bias).
Layer S2 (Subsampling Layer)
• Feature Maps: 6 feature maps.
• Size: 14x14 (each unit connected to a 2x2 neighborhood in C1).
• Operation: Each unit adds four inputs, multiplies by a trainable coefficient, adds a
bias, and applies a sigmoid function.
• Parameters: 12 trainable parameters and 5,880 connections.
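This trainable pooling differs from the plain average or max pooling of later networks. A minimal PyTorch sketch of the operation described above (the class name and the avg_pool2d trick are ours, not the paper's):

```python
import torch
import torch.nn as nn

class Subsample(nn.Module):
    """LeNet-style subsampling: sum each 2x2 neighborhood, scale by a
    trainable coefficient, add a trainable bias, apply a sigmoid.
    One coefficient and one bias per map: 2 * 6 = 12 parameters for S2."""
    def __init__(self, channels: int):
        super().__init__()
        self.coeff = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        # avg_pool2d over a 2x2 window times 4 equals the sum of the 4 inputs
        summed = nn.functional.avg_pool2d(x, 2) * 4.0
        return torch.sigmoid(self.coeff * summed + self.bias)
```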
Layer C3 (Convolutional Layer)
• Feature Maps: 16 feature maps.
• Connections: Each unit is connected to several 5x5 neighborhoods at identical locations in a subset of S2's feature maps.
• Parameters and Connections: C3 is only partially connected to S2, which breaks symmetry and forces different feature maps to learn different features; this gives 1,516 trainable parameters and 151,600 connections.
Layer S4 (Subsampling Layer)
• Feature Maps: 16 feature maps.
• Size: 5x5 (each unit connected to a 2x2 neighborhood in C3's 10x10 maps).
• Parameters: 32 trainable parameters and 2,000 connections.
Layer C5 (Convolutional Layer)
• Feature Maps: 120 feature maps.
• Size: 1x1 (each unit connected to a 5x5 neighborhood on all 16 of S4’s
feature maps, effectively fully connected due to input size).
• Parameters: 48,120 trainable parameters and 48,120 connections (120 × (16×5×5 + 1); one connection per weight, since C5 is effectively fully connected).
Layer F6 (Fully Connected Layer)
• Units: 84 units.
• Connections: Each unit is fully connected to C5, resulting in 10,164 trainable parameters.
• Activation: Uses a scaled hyperbolic tangent function f(a) = A·tanh(S·a), where A = 1.7159 and S = 2/3.
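As a quick illustration, the scaled tanh can be written directly from the constants above; the constants are chosen so that f(±1) = ±1:

```python
import math

A, S = 1.7159, 2.0 / 3.0

def scaled_tanh(a: float) -> float:
    # f(a) = A * tanh(S * a); saturates near +/-1.7159 for large |a|
    return A * math.tanh(S * a)

print(scaled_tanh(1.0))  # ~1.0, by design of A and S
```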
Output Layer
In the output layer of LeNet, each class is represented by a Euclidean Radial Basis Function (RBF) unit.
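Putting the layers together, here is a minimal PyTorch sketch of the network described above. It follows common modern reimplementations rather than the paper exactly: plain average pooling plus tanh stands in for the trainable subsampling, C3 is fully (not partially) connected to S2, and a linear softmax head replaces the RBF output units:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),     # C1: 32x32 -> 6 @ 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S2: -> 6 @ 14x14
            nn.Conv2d(6, 16, kernel_size=5),    # C3: -> 16 @ 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S4: -> 16 @ 5x5
            nn.Conv2d(16, 120, kernel_size=5),  # C5: -> 120 @ 1x1
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                 # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),         # stands in for the RBF output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# quick shape check on a dummy 32x32 grayscale batch
print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```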
Key Features & Advantages
• Weight sharing reduces parameters
• Local receptive fields capture spatial patterns
• Pooling layers provide a degree of translation invariance
• Foundation for modern CNN architectures
AlexNet Architecture
• Overview:
• Developed by Alex Krizhevsky et al. in 2012.
• Won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 with a top-5 error rate of 15.3%, far ahead of the runner-up's 26.2%.
• It became famous for its ability to classify images accurately.
• Total 8 layers:
– 5 Convolutional Layers (feature extraction)
– 3 Fully Connected Layers (classification)
• Input: RGB image 227×227×3
• Output: Softmax over 1000 classes
• Key Features:
• ReLU Activation → Faster convergence than sigmoid/tanh
• Max Pooling (with overlapping pooling) → Reduces
spatial size, increases invariance
• Local Response Normalization (LRN) → Improves
generalization
• Dropout in Fully Connected Layers → Prevents overfitting
• GPU Parallelization → Two GPUs for training due to
VRAM limits
• SGD with Momentum & Data Augmentation
AlexNet Architecture Overview
• Architecture in a Nutshell
• Layers: 8 layers in total—5 convolutional
layers for feature extraction, followed by 3
fully connected layers for classification.
• Input & Output: Processes 227×227×3 RGB crops (taken from images resized to 256×256) and outputs a distribution over 1000 classes via a Softmax layer.
Core Components & Innovations
• ReLU Activation: Applied after every convolutional and
fully connected layer to accelerate convergence and
mitigate vanishing gradients.
• Max Pooling (including overlapping pooling): Used after
certain convolutional layers to reduce spatial dimensions
and improve invariance and generalization.
• Local Response Normalization (LRN): Boosts generalization
by normalizing neuron activities across adjacent channels.
• Dropout in FC Layers: Dropout applied to first two fully
connected layers helps prevent overfitting.
Layer-by-Layer Breakdown
Layer                  Details
Conv1 → Pool → LRN     96 filters of size 11×11, stride 4
Conv2 → Pool → LRN     256 filters of size 5×5
Conv3 → Conv4 → Conv5  384, 384, 256 filters of size 3×3
Pool                   After Conv5
FC1 → FC2 → FC3        Two 4096-unit layers (with Dropout), followed by a 1000-unit Softmax output
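A compact PyTorch sketch of this breakdown, as a single-GPU model (the paper split channels across two GPUs, and it applies LRN before pooling, which is the ordering used here; the LRN hyperparameters below are library defaults, not the paper's):

```python
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(inplace=True),    # 227 -> 55
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),       # overlapping pool: 55 -> 27
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),       # 27 -> 13
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),                                # 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # softmax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```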
Training Highlights
• GPU Acceleration: Training was distributed across two GPUs due to limited VRAM (~3 GB each).
• Optimization: Employed SGD with momentum,
weight decay, and data augmentation techniques like
cropping, flipping, and color jittering to improve
generalization.
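As an illustration of the augmentation step, a torchvision pipeline in this spirit might look as follows. The parameter values are ours, and ColorJitter is a simpler stand-in for the paper's PCA-based color perturbation:

```python
import torchvision.transforms as T

train_tf = T.Compose([
    T.Resize(256),              # resize the short side to 256
    T.RandomCrop(227),          # random cropping
    T.RandomHorizontalFlip(),   # flipping
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # color jittering
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```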
CIFAR-10
• CIFAR stands for the Canadian Institute For Advanced Research, which funded the dataset's collection.
• A few datasets ship with TensorFlow and are widely used in machine learning. CIFAR-10 and CIFAR-100 are labeled subsets of the 80 Million Tiny Images collection.
• They were originally collected by Alex Krizhevsky, Geoffrey Hinton, and Vinod Nair. Each dataset contains a total of 60,000 images with the following composition:
• 10,000 test images, 1,000 per class, randomly selected from each class.
• 50,000 training images, 5,000 per class; these are the images left over once the test set is removed. Individual training batches may, however, contain more images from one class than another.
• The classes in the dataset are entirely mutually exclusive.
• CIFAR-10 consists of 60,000 low-resolution 32×32 images.
• They are mostly used with Convolutional Neural Network (CNN) models.
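Since the dataset ships with TensorFlow's Keras API, loading it takes only a couple of lines (a quick sketch):

```python
from tensorflow.keras.datasets import cifar10

# Downloads CIFAR-10 on first use: 50,000 training and 10,000 test images,
# each a 32x32 RGB array with an integer label in 0..9.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape, x_test.shape)  # (50000, 32, 32, 3) (10000, 32, 32, 3)
```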
ZFNet Architecture
• Developed by Matthew Zeiler and Rob Fergus in 2013.
• Winner of ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) 2013.
• Improvement over AlexNet through
hyperparameter tuning and visualization.
Key Features:
• Smaller receptive field in the first convolutional layer: 7x7 filters with stride 2 (vs. AlexNet's 11x11, stride 4).
• Better preservation of spatial information.
• Deconvolutional visualization to understand
feature maps.
• Enhanced depth and fine-tuning for better
accuracy.
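The first-layer change is easy to see side by side (PyTorch, illustrative; both networks use 96 filters in their first layer):

```python
import torch.nn as nn

# First convolutional layer: AlexNet vs. ZFNet
alexnet_conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)  # large filters, coarse stride
zfnet_conv1   = nn.Conv2d(3, 96, kernel_size=7,  stride=2)  # smaller filters, denser stride
```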
VGG-Net Architecture
• The Visual Geometry Group (VGG) models, particularly VGG-16 and VGG-19, have significantly influenced the field of computer vision since their inception.
• Introduced by the Visual Geometry Group at the University of Oxford, they stood out in the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) for their deep convolutional neural networks (CNNs) with a uniform architecture.
• VGG-19, the deeper variant of the VGG models, has
garnered considerable attention due to its simplicity and
effectiveness.
VGG-19 Architecture
• VGG-19 is a deep convolutional neural network with 19 weight layers, comprising 16 convolutional layers and 3 fully connected layers.
• The architecture follows a straightforward and
repetitive pattern, making it easier to
understand and implement.
Detailed Layer-by-Layer Architecture of VGG-Net 19
1. Convolutional Layers: 3x3 filters with a stride of 1 and
padding of 1 to preserve spatial resolution.
2. Activation Function: ReLU (Rectified Linear Unit) applied
after each convolutional layer to introduce non-linearity.
3. Pooling Layers: Max pooling with a 2x2 filter and a stride
of 2 to reduce the spatial dimensions.
4. Fully Connected Layers: Three fully connected layers at
the end of the network for classification.
5. Softmax Layer: Final layer for outputting class
probabilities.
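The repetitive pattern makes the convolutional part easy to generate from a configuration list. A sketch in PyTorch (the cfg encoding is a common convention, not from the paper):

```python
import torch.nn as nn

# VGG-19 convolutional configuration: numbers are output channels,
# "M" marks a 2x2 stride-2 max pool (16 conv layers + 3 FC layers = 19).
VGG19_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
             512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

def make_vgg19_features(cfg=VGG19_CFG):
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(2, stride=2))
        else:
            # 3x3 filters, stride 1, padding 1 preserve spatial resolution
            layers += [nn.Conv2d(in_ch, v, 3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)
```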
Information about VGGNet-19
• Model Simplicity and Effectiveness: The VGG-19
architecture's simplicity, characterized by its uniform use of
3x3 convolution filters and repetitive block structure,
makes it a highly effective and easy-to-implement model
for various computer vision tasks.
• Computational Requirements: One of the key trade-offs of
the VGG-19 model is its computational demand.
• Due to its depth and the use of small filters, it requires
significant memory and computational power, making it
more suited for environments with robust hardware
capabilities.
• Robust Feature Extraction: The depth of the VGG-19 model allows it to
capture intricate features in images, making it an excellent feature
extractor. This capability is particularly useful in transfer learning, where
pre-trained VGG-19 models are fine-tuned for specific tasks, leveraging
the rich feature representations learned from large datasets.
• Data Augmentation: To enhance the performance and generalization
capability of VGG-19, data augmentation techniques such as random
cropping, horizontal flipping, and color jittering are often employed
during training. These techniques help the model to better handle
variations and improve its robustness.
• Influence on Network Design: The principles established by the VGG-19
architecture, such as the use of small convolution filters and deep
networks, have influenced the design of subsequent state-of-the-art
models. Researchers have built upon these concepts to develop more
advanced architectures that continue to push the boundaries of what is
possible in computer vision.
Introduction to GoogleNet
• Developed by Szegedy et al. at Google in 2014.
• Winner of ILSVRC 2014 with top-5 error rate of 6.67%.
• Introduced the Inception module for efficient computation.
• Deeper network with fewer parameters compared to AlexNet and VGG.
Key Features of GoogleNet
• Inception Modules for multi-scale feature extraction.
• 22 layers deep (27 with pooling layers).
• Uses 1x1 convolutions for dimensionality reduction.
• Global Average Pooling instead of fully connected layers.
• Auxiliary classifiers for training stabilization.
Inception Module
• Combines multiple convolution filters (1x1, 3x3, 5x5) in parallel.
• Includes pooling layer in parallel paths.
• 1x1 convolutions reduce depth before costly convolutions.
• Outputs concatenated to form final feature map.
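A minimal PyTorch sketch of such a module (the class and argument names are ours; channel counts are passed in per block rather than fixed):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1, 3x3, 5x5 convolutions plus a pooled path, with 1x1
    bottleneck convolutions reducing depth before the costly 3x3/5x5 filters."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # concatenate the four branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# e.g. GoogleNet's inception(3a): maps 192 channels to 64+128+32+32 = 256
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
```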
GoogleNet Architecture
• Input: 224x224 RGB image.
• Initial convolution and pooling layers.
• Stack of Inception modules with occasional pooling.
• Auxiliary classifiers at intermediate layers.
• Global Average Pooling and softmax output.
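The last bullet translates to a very small classification head; a sketch, assuming GoogleNet's final feature depth of 1024 channels:

```python
import torch.nn as nn

# Global Average Pooling head: one scalar per feature map, then a single
# linear layer -- far fewer parameters than AlexNet/VGG-style FC stacks.
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1024, 1000))
```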
Advantages of GoogleNet
• High accuracy with fewer parameters (~5 million).
• Computationally efficient due to 1x1 convolutions.
• Good generalization capability.
• Scalable design with modular Inception blocks.
Applications of GoogleNet
• Image classification.
• Object detection.
• Medical image analysis.
• Feature extraction for transfer learning.
Introduction to ResNet
• Developed by Microsoft Research in 2015.
• Winner of ILSVRC 2015 with 3.57% top-5 error
rate.
• Introduced residual learning framework.
• Allows training of extremely deep networks
(over 100 layers).
Key Features
• Residual blocks with identity shortcut connections.
• Mitigates vanishing gradient problem.
• Enables deeper networks without
performance degradation.
• Common variants: ResNet-18, ResNet-34,
ResNet-50, ResNet-101, ResNet-152.
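The core idea fits in a few lines. A PyTorch sketch of a basic (ResNet-18/34 style) block, where the convolutional layers learn a residual F(x) and the identity shortcut carries x through unchanged:

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Basic residual block: output is F(x) + x, so the stacked layers
    only need to learn the residual F(x) rather than the full mapping."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity shortcut connection
```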
Applications of ResNet
• Image classification
• Object detection (e.g., Faster R-CNN, Mask R-
CNN)
• Face recognition (e.g., ArcFace, FaceNet)
• Medical image analysis
• Transfer learning in various AI domains
Different Types of CNN Architectures
• LeNet (1998): First successful application of CNNs; 5 layers alternating between convolutional and pooling; used tanh/sigmoid activation functions. Use case: recognizing handwritten and machine-printed characters.
• AlexNet (2012): Deeper and wider than LeNet; used the ReLU activation function; implemented dropout layers; used GPUs for training. Use case: large-scale image recognition tasks.
• ZFNet (2013): Similar architecture to AlexNet, but with different filter sizes and numbers of filters; visualization techniques for understanding the network. Use case: ImageNet classification.
• VGGNet (2014): Deeper networks with smaller filters; all convolutional layers use the same 3×3 filter size; multiple configurations (VGG16, VGG19). Use case: large-scale image recognition.
• ResNet (2015): Introduced "skip connections" or "shortcuts" to enable training of deeper networks; multiple configurations (ResNet-50, ResNet-101, ResNet-152). Use case: large-scale image recognition; won 1st place in ILSVRC 2015.
• GoogLeNet (2014): Introduced the Inception module, which allows for more efficient computation and deeper networks; multiple versions (Inception v1, v2, v3, v4). Use case: large-scale image recognition; won 1st place in ILSVRC 2014.
• MobileNets (2017): Designed for mobile and embedded vision applications; uses depthwise separable convolutions to reduce model size and complexity (see the sketch below). Use case: mobile and embedded vision applications, real-time object detection.
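For the MobileNets entry above, a PyTorch sketch of the depthwise separable convolution it relies on (the helper name and use of BatchNorm follow common practice, not a specific paper listing):

```python
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, stride: int = 1):
    """A per-channel 3x3 depthwise convolution followed by a 1x1 pointwise
    convolution: far fewer multiply-adds and parameters than a standard
    3x3 convolution over all channels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),      # depthwise: one filter per channel
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),  # pointwise: mix channels
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```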