DARPA Neural Network Study (1989)
“Over the history of computing science, two advances have matured: High
speed numerical processing and knowledge processing (Artificial Intelligence).
Neural networks seem to offer the next necessary ingredient for intelligent
machines − namely, knowledge formation and organization.”
DARPA Neural Network Study (1989)
Two key features which, it is widely believed, distinguish neural
networks from any other sort of computing developed thus far:
• Neural networks are adaptive, or trainable. Neural networks are not so much
programmed as they are trained with data; thus many believe that the use of
neural networks can relieve today's computer programmers of a significant
portion of their present programming load. Moreover, neural networks are said
to improve with experience: the more data they are fed, the more accurate or
complete their response.
• Neural networks are naturally massively parallel. This suggests they should
be able to make decisions at high speed and be fault tolerant.
History
Early work (1940-1960)
• McCulloch & Pitts (Boolean logic)
• Rosenblatt (Learning)
• Hebb (Learning)
Transition (1960-1980)
• Widrow-Hoff (LMS rule)
• Anderson (Associative memories)
• Amari
Resurgence (1980-1990s)
• Hopfield (Associative memories / Optimization)
• Rumelhart et al. (Back-prop)
• Kohonen (Self-organizing maps)
• Hinton, Sejnowski (Boltzmann machine)
New resurgence (2012-)
• CNNs, Deep learning, GANs, …
A Few Figures
The human cerebral cortex is composed of about 100 billion (10^11) neurons
of many different types.
Each neuron is connected to 1,000-10,000 other neurons, which yields
10^14-10^15 connections.
The cortex covers about 0.15 m² and is 2-5 mm thick.
The Neuron
Cell Body (Soma): 5-10 microns in diameter
Axon: output mechanism of the neuron; one axon per cell, but a single axon may
have thousands of branches reaching thousands of cells
Dendrites: receive incoming signals from other neurons' axons via synapses
Neural Dynamics
The transmission of signals in the cerebral cortex is a complex process:
electrical → chemical → electrical
Simplifying:
1) The cellular body performs a “weighted sum” of the incoming signals
2) If the result exceeds a certain threshold value, then it produces an
“action potential” which is sent down the axon (cell has “fired”),
otherwise it remains in a rest state
3) When the electrical signal reaches the synapse, it allows the
“neuro-transmitter” to be released and these combine with the
“receptors” in the post-synaptic membrane
4) The post-synaptic receptors provoke the diffusion of an electrical
signal in the post-synaptic neuron
Synapses
The SYNAPSE is the relay point where information is conveyed by chemical transmitters from neuron
to neuron. A synapse consists of two parts: the knoblike tip of an axon terminal and the receptor
region on the surface of another neuron. The membranes are separated by a synaptic cleft some
200 nanometers across. Molecules of chemical transmitter, stored in vesicles in the axon terminal,
are released into the cleft by arriving nerve impulses. The transmitter changes the electrical state of the
receiving neuron, making it either more likely or less likely to fire an impulse.
Synaptic Efficacy
It is the amount of current that enters the post-synaptic neuron,
relative to the action potential of the pre-synaptic neuron.
Learning takes place by modifying the synaptic efficacy.
Two types of synapses:
• Excitatory: favor the generation of an action potential
in the post-synaptic neuron
• Inhibitory: hinder the generation of an action potential
The McCulloch and Pitts Model (1943)
The McCulloch-Pitts (MP) Neuron is modeled as a binary threshold unit
The unit "fires" if the net input $\sum_j w_j I_j$ reaches (or exceeds) the unit's threshold T:
$$y = g\Big( \sum_j w_j I_j - T \Big)$$
If the neuron is firing, its output y is 1; otherwise it is 0.
g is the unit step function:
$$g(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0 \end{cases}$$
Weights $w_{ij}$ represent the strength of the synapse between neuron j and neuron i.
Properties of McCulloch-Pitts Networks
By properly combining MP neurons one can simulate the behavior of any
Boolean circuit.
Three elementary logical operations: (a) negation, (b) AND, (c) OR. In each diagram
the states of the neurons on the left are at time t and those on the right at time t+1.
The construction for the exclusive OR (XOR).
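For illustration, a minimal sketch of such gates built from the mp_neuron unit sketched above; the weights and thresholds are one possible hand-picked choice, and XOR uses the two-step construction referred to in the caption:

```python
# Boolean gates as single M&P units (hand-picked weights and thresholds):
NOT = lambda x: mp_neuron(np.array([x]), np.array([-1.0]), -0.5)      # fires iff x == 0
AND = lambda x, y: mp_neuron(np.array([x, y]), np.array([1.0, 1.0]), 1.5)
OR  = lambda x, y: mp_neuron(np.array([x, y]), np.array([1.0, 1.0]), 0.5)

# XOR requires two layers of units: x XOR y = (x OR y) AND NOT(x AND y)
XOR = lambda x, y: AND(OR(x, y), NOT(AND(x, y)))

for x in (0, 1):
    for y in (0, 1):
        print(x, y, XOR(x, y))   # prints the XOR truth table
```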
Network Topologies and Architectures
• Feedforward only vs. Feedback loop (Recurrent networks)
• Fully connected vs. sparsely connected
• Single layer vs. multilayer
Multilayer perceptrons, Hopfield networks,
Boltzmann machines, Kohonen networks, …
(a) A feedforward network and (b) a recurrent network
Classification Problems
Given:
1) some "features": $f_1, f_2, \ldots, f_n$
2) some "classes": $c_1, \ldots, c_m$
Problem:
To classify an "object" according to its features.
Example #1
To classify an "object" as:
$c_1$ = "watermelon"
$c_2$ = "apple"
$c_3$ = "orange"
according to the following features:
$f_1$ = "weight"
$f_2$ = "color"
$f_3$ = "size"
Example:
weight = 80 g, color = green, size = 10 cm³  →  "apple"
Example #2
Problem: Establish whether a patient has the flu
• Classes: { "flu", "non-flu" }
• (Potential) Features:
$f_1$: Body temperature
$f_2$: Headache? (yes / no)
$f_3$: Throat is red? (yes / no / medium)
$f_4$: …
Neural Networks for Classification
A neural network can be used as a classification device .
Input ≡ features values
Output ≡ class labels
Example : 3 features , 2 classes
Thresholds
We can get rid of the thresholds associated with neurons by adding an
extra unit permanently clamped at -1.
In so doing, thresholds become weights and can be adaptively adjusted
during learning.
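Written out (the index 0 used for the extra unit is an illustrative convention, not from the slides):

$$y = g\Big( \sum_{j=1}^{n} w_j x_j - T \Big) = g\Big( \sum_{j=0}^{n} w_j x_j \Big), \qquad \text{with } x_0 = -1,\; w_0 = T.$$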
The Perceptron
A network consisting of one layer of M&P neurons connected in a
feedforward way (i.e., no lateral or feedback connections).
• Discrete output (+1 / -1)
• Capable of "learning" from examples (Rosenblatt)
• Suffers from serious computational limitations
Linear Separability
A classification problem is said to be linearly separable if the decision regions
can be separated by a hyperplane.
Example: AND
X Y X AND Y
0 0 0
0 1 0
1 0 0
1 1 1
Limitations of Perceptrons
It has been shown that perceptrons can only solve linearly separable
problems.
Example: XOR (exclusive OR)
X Y X XOR Y
0 0 0
0 1 1
1 0 1
1 1 0
The Perceptron Convergence Theorem
Theorem (Rosenblatt, 1960)
If the training set is linearly separable, the perceptron learning algorithm
always converges to a consistent hypothesis after a finite number of
epochs, for any η > 0.
If the training set is not linearly separable, after a certain number of
epochs the weights start oscillating.
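For concreteness, here is a minimal sketch of the perceptron learning rule the theorem refers to, assuming targets in {-1, +1} and the threshold absorbed as an extra input clamped at -1 (all names are illustrative):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Rosenblatt perceptron learning: X has one example per row (with a
    constant -1 column for the threshold), y holds targets in {-1, +1}.
    Weights are updated only on misclassified examples."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, y):
            if np.sign(np.dot(w, x)) != target:
                w += eta * target * x          # move the separating hyperplane toward x
                errors += 1
        if errors == 0:                        # converged: every example is classified correctly
            break
    return w

# AND is linearly separable, so the algorithm converges:
X = np.array([[0, 0, -1], [0, 1, -1], [1, 0, -1], [1, 1, -1]], dtype=float)
y = np.array([-1, -1, -1, +1])
print(train_perceptron(X, y))
```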
Multi–Layer Feedforward Networks
• Limitation of the simple perceptron: it can implement only linearly separable
functions
• Add "hidden" layers between the input and output layers. A network
with just one hidden layer can represent any Boolean function, including
XOR
• The power of multilayer networks was known long ago, but algorithms for
training or learning, e.g. the back-propagation method, became available
only recently (invented several times, popularized in 1986)
• Universal approximation power: a two-layer network can approximate
any smooth function (Cybenko, 1989; Funahashi, 1989; Hornik et al.,
1989)
• Static (no feedback)
Back-propagation Learning Algorithm
• An algorithm for learning the weights in a feed-forward network,
given a training set of input-output pairs
• The algorithm is based on the gradient-descent method.
Supervised Learning
Supervised learning algorithms require the presence of a "teacher" who
provides the right answers to the input questions.
Technically, this means that we need a training set of the form
$$L = \left\{ \left( x^1, y^1 \right), \ldots, \left( x^p, y^p \right) \right\}$$
where:
$x^{\mu}$ ($\mu = 1 \ldots p$) is the network input vector
$y^{\mu}$ ($\mu = 1 \ldots p$) is the desired network output vector
Supervised Learning
The learning (or training) phase consists of determining a configuration of
weights in such a way that the network output is as close as possible to the
desired output, for all the examples in the training set.
Formally, this amounts to minimizing an error function such as (not the only
possible one):
$$E = \frac{1}{2} \sum_{\mu} \sum_{k} \left( y_k^{\mu} - O_k^{\mu} \right)^2$$
where $O_k^{\mu}$ is the output provided by output unit k when the network is
given example μ as input.
Back-Propagation
To minimize the error function E we can use the classic gradient-descent
algorithm:
$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}, \qquad \eta = \text{"learning rate"}$$
To compute the partial derivatives we use the error back-propagation
algorithm.
It consists of two stages:
Forward pass: the input to the network is propagated
layer after layer in the forward direction
Backward pass: the "error" made by the network is
propagated backward, and the weights
are updated properly
Notations
Given pattern μ, hidden unit j receives a net input
$$h_j^{\mu} = \sum_k w_{jk}\, x_k^{\mu}$$
and produces as output:
$$V_j^{\mu} = g(h_j^{\mu}) = g\Big( \sum_k w_{jk}\, x_k^{\mu} \Big)$$
Back-Prop:
Updating Hidden-to-Output Weights
$$\begin{aligned}
\Delta W_{ij} &= -\eta \frac{\partial E}{\partial W_{ij}}
= -\eta \frac{\partial}{\partial W_{ij}} \left[ \frac{1}{2} \sum_{\mu} \sum_{k} \left( y_k^{\mu} - O_k^{\mu} \right)^2 \right]
= \eta \sum_{\mu} \sum_{k} \left( y_k^{\mu} - O_k^{\mu} \right) \frac{\partial O_k^{\mu}}{\partial W_{ij}} \\
&= \eta \sum_{\mu} \left( y_i^{\mu} - O_i^{\mu} \right) \frac{\partial O_i^{\mu}}{\partial W_{ij}}
= \eta \sum_{\mu} \left( y_i^{\mu} - O_i^{\mu} \right) g'(h_i^{\mu})\, V_j^{\mu}
= \eta \sum_{\mu} \delta_i^{\mu}\, V_j^{\mu}
\end{aligned}$$
where
$$\delta_i^{\mu} = g'(h_i^{\mu}) \left( y_i^{\mu} - O_i^{\mu} \right)
\qquad \text{and, as before,} \qquad
E = \frac{1}{2} \sum_{\mu} \sum_{k} \left( y_k^{\mu} - O_k^{\mu} \right)^2 .$$
Back-Prop:
Updating Input-to-Hidden Weights (1)
$$\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}}
= \eta \sum_{\mu} \sum_{i} \left( y_i^{\mu} - O_i^{\mu} \right) \frac{\partial O_i^{\mu}}{\partial w_{jk}}
= \eta \sum_{\mu} \sum_{i} \left( y_i^{\mu} - O_i^{\mu} \right) g'(h_i^{\mu}) \frac{\partial h_i^{\mu}}{\partial w_{jk}}$$
with
$$\frac{\partial h_i^{\mu}}{\partial w_{jk}}
= \frac{\partial}{\partial w_{jk}} \sum_{l} W_{il}\, V_l^{\mu}
= W_{ij} \frac{\partial V_j^{\mu}}{\partial w_{jk}}
= W_{ij}\, g'(h_j^{\mu}) \frac{\partial h_j^{\mu}}{\partial w_{jk}}$$
and, as before,
$$E = \frac{1}{2} \sum_{\mu} \sum_{k} \left( y_k^{\mu} - O_k^{\mu} \right)^2 .$$
Back-Prop:
Updating Input-to-Hidden Weights (2)
Since $h_j^{\mu} = \sum_m w_{jm}\, x_m^{\mu}$, we have
$$\frac{\partial h_j^{\mu}}{\partial w_{jk}} = x_k^{\mu}.$$
Hence, we get:
$$\Delta w_{jk}
= \eta \sum_{\mu} \sum_{i} \left( y_i^{\mu} - O_i^{\mu} \right) g'(h_i^{\mu})\, W_{ij}\, g'(h_j^{\mu})\, x_k^{\mu}
= \eta \sum_{\mu} \sum_{i} \delta_i^{\mu}\, W_{ij}\, g'(h_j^{\mu})\, x_k^{\mu}
= \eta \sum_{\mu} \hat{\delta}_j^{\mu}\, x_k^{\mu}$$
where
$$\hat{\delta}_j^{\mu} = g'(h_j^{\mu}) \sum_{i} \delta_i^{\mu}\, W_{ij}.$$
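Putting the two updates together, here is a minimal NumPy sketch of one batch back-propagation step for a single-hidden-layer network, assuming a logistic activation g and omitting thresholds for brevity (they can be added with the -1 trick described earlier); all names are illustrative:

```python
import numpy as np

def g(x):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(h):
    """Derivative of g, written in terms of the net input h."""
    s = g(h)
    return s * (1.0 - s)

def backprop_epoch(x, y, w, W, eta=0.5):
    """One batch update using the rules derived above.
    x: (P, n_in) input patterns      y: (P, n_out) desired outputs
    w: (n_hid, n_in) input-to-hidden weights
    W: (n_out, n_hid) hidden-to-output weights
    Returns the error E measured on the forward pass."""
    # Forward pass
    h_hid = x @ w.T                            # h_j^mu = sum_k w_jk x_k^mu
    V = g(h_hid)                               # V_j^mu
    h_out = V @ W.T                            # net input of output units
    O = g(h_out)                               # O_i^mu
    # Backward pass
    delta = g_prime(h_out) * (y - O)           # delta_i^mu
    delta_hat = g_prime(h_hid) * (delta @ W)   # hat-delta_j^mu
    W += eta * delta.T @ V                     # Delta W_ij = eta sum_mu delta_i^mu V_j^mu
    w += eta * delta_hat.T @ x                 # Delta w_jk = eta sum_mu hat-delta_j^mu x_k^mu
    return 0.5 * np.sum((y - O) ** 2)

# Toy usage: P = 4 patterns, 3 inputs, 2 hidden units, 1 output
rng = np.random.default_rng(0)
x, y = rng.random((4, 3)), rng.random((4, 1))
w, W = rng.normal(size=(2, 3)), rng.normal(size=(1, 2))
for _ in range(200):
    E = backprop_epoch(x, y, w, W)
print("final error:", E)
```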
The Momentum Term
Gradient descent may:
• Converge too slowly if η is too small
• Oscillate if η is too large
Simple remedy: add a momentum term to the weight update, e.g.
$$\Delta w(t+1) = -\eta \frac{\partial E}{\partial w} + \alpha\, \Delta w(t)$$
The momentum term allows us to use large values of η, thereby avoiding
oscillatory phenomena.
Typical choice: α = 0.9, η = 0.5
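A minimal sketch of this update for a generic weight array, where grad stands for the back-propagated ∂E/∂w (illustrative, not a specific library API):

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.5, alpha=0.9):
    """Gradient-descent step with a momentum term:
    Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t)."""
    velocity = -eta * grad + alpha * velocity   # new Delta w
    return w + velocity, velocity

# Usage: keep a velocity array (initialized to zero) alongside each weight array
w = np.zeros(3)
velocity = np.zeros_like(w)
grad = np.array([0.2, -0.1, 0.05])              # placeholder gradient
w, velocity = momentum_step(w, grad, velocity)
```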
The Problem of Local Minima
Back-prop cannot avoid local minima.
The choice of initial weights is important:
if they are too large, the nonlinearities
tend to saturate from the very beginning of
the learning process.
NETtalk
• A network to pronounce English text
• 7 x 29 (= 203) input units
• 1 hidden layer with 80 units
• 26 output units encoding phonemes
• Trained on 1024 words in context
• Produces intelligible speech after 10 training epochs
• Functionally equivalent to DEC-talk
• The rule-based DEC-talk was the result of a decade of effort by many linguists
• NETtalk learns from examples and requires no linguistic knowledge
Theoretical / Practical Questions
▪ How many layers are needed for a given task?
▪ How many units per layer?
▪ To what extent does representation matter?
▪ What do we mean by generalization?
▪ What can we expect a network to generalize?
• Generalization: performance of the network on data not
included in the training set
• Size of the training set: how large a training set should be for
“good” generalization?
• Size of the network: too many weights in a network result in
poor generalization
True vs Sample Error
The true error is unknown (and will remain so forever…).
On which sample should I compute the sample error?

Overfitting
(a) A good fit to noisy data. (b) Overfitting of the same data: the fit is perfect on the
"training set" (x's), but is likely to be poor on the "test set" represented by the circle.

Size Matters
• The size (i.e., the number of hidden units) of an artificial
neural network affects both its functional capabilities and
its generalization performance
• Small networks may not be able to realize the desired
input / output mapping
• Large networks lead to poor generalization performance
The Pruning Approach
Train an over-dimensioned net and then remove redundant nodes
and / or connections:
• Sietsma & Dow (1988, 1991)
• Mozer & Smolensky (1989)
• Burkitt (1991)
Advantages:
• arbitrarily complex decision regions
• faster training
• independence of the training algorithm
An Iterative Pruning Algorithm
Consider (for simplicity) a net with one hidden layer:
Suppose that unit h is to be removed:
IDEA: Remove unit h (and its in/out connections) and adjust the
remaining weights so that the I/O behavior is the same
G. Castellano, A. M. Fanelli, and M. Pelillo, An iterative pruning algorithm for feedforward neural networks, IEEE
Transactions on Neural Networks 8(3):519-531, 1997.
This is equivalent to solving the system (before vs. after removal):
$$\sum_{j=1}^{n_h} w_{ij}\, y_j^{(\mu)} = \sum_{\substack{j=1 \\ j \ne h}}^{n_h} \left( w_{ij} + \delta_{ij} \right) y_j^{(\mu)}, \qquad i = 1 \ldots n_O,\ \mu = 1 \ldots P$$
which is equivalent to the following linear system (in the unknown δ's):
$$\sum_{\substack{j=1 \\ j \ne h}}^{n_h} \delta_{ij}\, y_j^{(\mu)} = w_{ih}\, y_h^{(\mu)}, \qquad i = 1 \ldots n_O,\ \mu = 1 \ldots P$$
In a more compact notation:
$$A x = b, \qquad A \in \mathbb{R}^{P n_O \times n_O (n_h - 1)}$$
But a solution does not always exist.
Least-squares solution:
$$\min_x \; \lVert A x - b \rVert$$
Detecting Excessive Units
• Residual-reducing methods for LLSPs start with an initial solution $x_0$
and produce a sequence of points $\{x_k\}$ so that the residuals
$$r_k = b - A x_k$$
decrease: $\lVert r_{k+1} \rVert \le \lVert r_k \rVert$
• Starting point: $x_0 = 0$, so that $r_0 = b$
• Excessive units can be detected as those for which $\lVert b \rVert$ is minimum
The Pruning Algorithm
1) Start with an over-sized trained network
2) Repeat
   2.1) find the hidden unit h for which ‖b‖ is minimum
   2.2) solve the corresponding system
   2.3) remove unit h
   until Perf(pruned) - Perf(original) < epsilon
3) Reject the last reduced network
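A rough sketch of one pruning step under these assumptions, using an ordinary least-squares solve (np.linalg.lstsq) in place of the residual-reducing solver used in the paper; all matrix names are illustrative:

```python
import numpy as np

def prune_one_unit(Y, W_out):
    """One pruning step for a single-hidden-layer net (illustrative sketch).
    Y:     (P, n_h) hidden-unit outputs on the training set
    W_out: (n_out, n_h) hidden-to-output weights
    Picks the hidden unit h with the smallest ||b||, solves the least-squares
    system for the weight corrections, and returns the reduced quantities."""
    P, n_h = Y.shape
    # ||b||^2 for candidate h: sum over outputs i and patterns mu of (w_ih y_h^mu)^2
    norms = np.array([np.sum(W_out[:, h] ** 2) * np.sum(Y[:, h] ** 2)
                      for h in range(n_h)])
    h = int(np.argmin(norms))
    keep = [j for j in range(n_h) if j != h]
    Y_rest = Y[:, keep]                                  # (P, n_h - 1)
    # For each output unit i: Y_rest @ delta_i ~= W_out[i, h] * Y[:, h]
    B = np.outer(Y[:, h], W_out[:, h])                   # (P, n_out) right-hand sides
    Delta, *_ = np.linalg.lstsq(Y_rest, B, rcond=None)   # (n_h - 1, n_out)
    W_new = W_out[:, keep] + Delta.T                     # adjusted remaining weights
    return Y_rest, W_new, h
```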
Example: 4-bit parity
Ten initial 4-10-1 networks.
Pruned nets had 5 hidden nodes on average: nine 4-5-1 and one 4-4-1.
[Plot: recognition rate (%) and MSE vs. number of hidden units (from 10 down to 1); the minimum net is marked.]
The Deep Learning "Philosophy"
• Learn a feature hierarchy all the way from pixels to classifier
• Each layer extracts features from the output of previous layer
• Train all layers jointly
Shallow vs Deep Networks
From R. E. Turner
Shallow architectures are inefficient at representing deep functions
Old Idea… Why Now?
1. We have more data - from Lena to ImageNet.
2. We have more computing power; GPUs are really good at this.
3. Last but not least, we have new ideas.
Image Classification
Predict a single label (or a distribution over labels as shown here to indicate our confidence)
for a given image. Images are 3-dimensional arrays of integers from 0 to 255, of size Width x
Height x 3. The 3 represents the three color channels Red, Green, Blue.
From: A. Karpathy
The Data-Driven Approach
An example training set for four visual categories.
In practice we may have thousands of categories and
hundreds of thousands of images for each category. From: A. Karpathy
Cellular Recordings
Kuffler, Hubel, Wiesel, …
1953: Discharge patterns and functional organization of mammalian retina
1959: Receptive fields of single neurones in the cat's striate cortex
1962: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex
1968: …
Simple Cells
Orientation selectivity: Most V1 neurons are orientation selective, meaning that they
respond strongly to lines, bars, or edges of a particular orientation (e.g., vertical) but
not to the orthogonal orientation (e.g., horizontal).
Fully- vs Locally-Connected Networks
From M. A. Ranzato
Fully-connected: 400,000 hidden units ⇒ 16 billion parameters
Locally-connected: 400,000 hidden units with 10 x 10 receptive fields ⇒ 40 million parameters
Local connections capture local dependencies
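As a sanity check on these figures (assuming, for illustration, a roughly 200 x 200 single-channel input, which is not stated on the slide):

$$4\times 10^{5}\ \text{units} \times (200 \times 200)\ \text{inputs} = 1.6\times 10^{10}\ \text{weights (fully connected)}$$
$$4\times 10^{5}\ \text{units} \times (10 \times 10)\ \text{weights each} = 4\times 10^{7}\ \text{weights (locally connected)}$$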
Using Several Trainable Filters
Normally, several filters are packed together and learnt automatically
during training
Pooling
Max pooling is a way to simplify the network architecture, by reducing
(downsampling) the number of neurons resulting from the filtering operations.
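A minimal sketch of 2 x 2 max pooling on a single feature map (illustrative, not tied to any particular framework):

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a single feature map: keeps the
    largest activation in each non-overlapping 2x2 block, halving the
    spatial resolution."""
    H, W = fmap.shape
    assert H % 2 == 0 and W % 2 == 0, "illustrative version: even sizes only"
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(fmap))   # -> [[ 5.  7.] [13. 15.]]
```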
AlexNet (2012)
• 8 layers total
• Trained on the ImageNet dataset (1000
categories, 1.2M training images, 150k
test images)
AlexNet Architecture
• 1st layer: 96 kernels (11 x 11 x 3)
• Normalized, pooled
• 2nd layer: 256 kernels (5 x 5 x 48)
• Normalized, pooled
• 3rd layer: 384 kernels (3 x 3 x 256)
• 4th layer: 384 kernels (3 x 3 x 192)
• 5th layer: 256 kernels (3 x 3 x 192)
• Followed by 2 fully connected layers, 4096 neurons each
• Followed by a 1000-way SoftMax layer
650,000 neurons
60 million parameters
Rectified Linear Units (ReLUs)
Problem: The sigmoid activation takes values in (0,1). When propagating the
gradient back to the initial layers, it tends to become 0 (the vanishing
gradient problem).
From a practical perspective, this slows down the training procedure of
the initial layers of the network.
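A small sketch contrasting the two activations and their gradients (illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, and near 0 for |x| large

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for every positive input

x = np.array([-5.0, 0.5, 5.0])
print(sigmoid_grad(x))            # small values: gradients shrink layer after layer
print(relu_grad(x))               # 0 or 1: gradients do not vanish for active units
```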
Mini-batch Stochastic Gradient Descent
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
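A minimal sketch of this loop, where loss_and_grads is a placeholder for the network's forward and backward pass (all names are illustrative):

```python
import numpy as np

def sgd_train(params, data, labels, loss_and_grads, eta=0.01,
              batch_size=64, steps=1000, seed=0):
    """Mini-batch SGD: sample a batch, forward-prop to get the loss,
    backprop to get gradients, update the parameters.
    `loss_and_grads(params, x, y)` must return (loss, grads) with grads
    matching the structure of `params`."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(len(data), size=batch_size, replace=False)  # 1. sample a batch
        x, y = data[idx], labels[idx]
        loss, grads = loss_and_grads(params, x, y)                   # 2.-3. forward + backprop
        for k in params:                                             # 4. parameter update
            params[k] -= eta * grads[k]
    return params

# Toy usage: a linear-regression loss as a stand-in for a network's forward/backward pass
def toy_loss(params, x, y):
    err = x @ params["w"] - y
    return 0.5 * np.mean(err ** 2), {"w": x.T @ err / len(x)}

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 3))
labels = data @ np.array([1.0, -2.0, 0.5])
print(sgd_train({"w": np.zeros(3)}, data, labels, toy_loss, steps=500)["w"])
```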
Data Augmentation
The easiest and most common method to reduce overfitting on
image data is to artificially enlarge the dataset using label-preserving
transformations
AlexNet uses two forms of this data augmentation.
• The first form consists of generating image translations and
horizontal reflections.
• The second form consists of altering the intensities of the RGB
channels in training images.
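A minimal sketch of such label-preserving transformations (crop size and jitter are illustrative; AlexNet's second form actually uses a PCA-based RGB alteration, simplified here to a random per-channel shift):

```python
import numpy as np

def augment(img, rng, crop=224, jitter=0.1):
    """Label-preserving augmentation of one H x W x 3 image with values in [0, 1]:
    random translation via cropping, random horizontal flip, and a random
    per-channel intensity shift."""
    H, W, _ = img.shape
    top = rng.integers(0, H - crop + 1)            # random translation
    left = rng.integers(0, W - crop + 1)
    out = img[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:                         # horizontal reflection
        out = out[:, ::-1, :]
    out = out + jitter * rng.normal(size=3)        # shift each RGB channel
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
print(augment(img, rng).shape)    # -> (224, 224, 3)
```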
Dropout
Set to zero the output of each hidden neuron with probability 0.5.
The neurons which are “dropped out” in this way do not contribute to
the forward pass and do not participate in backpropagation.
So every time an input is presented, the neural network samples a
different architecture, but all these architectures share weights.
Dropout reduces complex co-adaptations of neurons, since a neuron cannot
rely on the presence of particular other neurons.
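A minimal sketch of dropout applied to a layer's activations at training time (this uses the common "inverted dropout" scaling rather than AlexNet's test-time rescaling):

```python
import numpy as np

def dropout(activations, p_drop=0.5, train=True, rng=None):
    """Zero each activation with probability p_drop during training.
    Dropped units take no part in the forward or backward pass for this
    presentation of the input; the surviving ones are rescaled so the
    expected activation is unchanged (inverted-dropout variant)."""
    if not train:
        return activations
    rng = rng or np.random.default_rng()
    mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
    return activations * mask / (1.0 - p_drop)

h = np.ones((2, 8))
print(dropout(h, rng=np.random.default_rng(0)))
```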
Feature Analysis
• A well-trained ConvNet is an excellent feature extractor.
• Chop the network at the desired layer and use the output as a feature
representation to train an SVM on some other dataset (Zeiler & Fergus, 2013).
• Improve further by taking a pre-trained ConvNet and re-training it on a
different dataset (fine tuning).
Other Success Stories of Deep Learning
Today deep learning, in its several manifestations, is being applied
in a variety of different domains besides computer vision, such as:
• Speech recognition
• Optical character recognition
• Natural language processing
• Autonomous driving
• Game playing (e.g., Google's AlphaGo)
• …