Machine Learning
Neural Networks
Portland Data Science Group
Created by Andrew Ferlitsch
Community Outreach Officer
August, 2017
Initial History
• Neural Networks have been around a long time.
• 1943 - Warren McCulloch, a neurophysiologist, and Walter Pitts,
a mathematician, published a paper on how neurons might work.
They modeled a simple neural network with electrical circuits.
• 1949 - The Organization of Behavior, by Donald Hebb, reinforced
the concept of neurons.
• 1950s - Nathaniel Rochester of the IBM research laboratories
led the first effort to simulate a neural network.
• 1959 - Bernard Widrow and Marcian Hoff of Stanford developed
the first real neural network – MADALINE.
• 1969 - Marvin Minsky and Seymour Papert's book Perceptrons
(which demonstrated that the Perceptron could not model an XOR operation)
kicked off the disillusionment period, during which little research
continued until 1981.
Neuron
Neural Networks consist of Neurons.
[Diagram: inputs X1, X2, X3, each with a weight W1, W2, W3, feeding a single Neuron that produces an Output Value.]
• Inputs – come from the features (independent variables) in the dataset.
• Weights – the importance of each feature, i.e., how much it contributes to the output.
• Neuron – the model (predictor).
• Output Value – the prediction. Can be: a real value, a probability, binary, or categorical.
Neuron – Categorical Output
Neural Networks consist of Neurons.
[Diagram: inputs X1, X2, X3 with weights W1, W2, W3 feed a single Neuron, whose output fans out to output nodes Y1, Y2, Y3.]
• A neuron outputs only a single value.
• For categorical outputs (e.g., Apple, Pear, Banana), the output nodes Y1, Y2 and Y3 each weight the output from the neuron and make a separate calculation for their final output.
Neuron - Details
Neural Networks consist of Neurons.
[Diagram: inputs X1, X2, X3 with weights W1, W2, W3 feed a single Neuron that produces an Output Value.]
• Normalize (0..1) or standardize the inputs (feature scaling) so no input dominates another.
• The neuron applies an activation function to the summation of the weighted inputs:
    Ø( ∑ᵢ₌₀ⁿ wᵢ ∗ xᵢ )
• The higher the weight, the more that input contributes to the outcome (prediction).
• Backward propagation adjusts (learns) the weights (e.g., Gradient Descent).
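A minimal sketch of a single neuron's forward pass in Python, assuming NumPy; the feature values, weights, and the choice of a sigmoid activation are illustrative, not taken from the slides.

```python
import numpy as np

def sigmoid(x):
    # Squashing activation: converges to 0 for x << 0 and 1 for x >> 0.
    return 1.0 / (1.0 + np.exp(-x))

def neuron(x, w, activation=sigmoid):
    # Ø( sum_i w_i * x_i ): weighted sum of the inputs, then the activation.
    return activation(np.dot(w, x))

# Three (already feature-scaled) inputs X1..X3 and illustrative weights W1..W3.
x = np.array([0.2, 0.7, 0.1])
w = np.array([0.5, -0.3, 0.8])
print(neuron(x, w))  # a single output value in (0, 1)
```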
Activation Functions
• Most Common
• Threshold – Either a zero or one is outputted (binary).
    Ø(x) = { 1 if x ≥ 0
             0 if x < 0 }
• Sigmoid – A curve that converges exponentially towards 0 for x < 0
(convergence to zero) and 1 for x > 0 (convergence to one):
    Ø(x) = 1 / (1 + e⁻ˣ)
  Also referred to as a squashing function, squashing the output between
  0 and 1. Popularly used in output nodes for probability prediction.
Activation Functions
• Most Common
• Hyperbolic Tangent – converges to -1 for x < 0 and 1 for x > 0.
    Ø(x) = (1 − e⁻²ˣ) / (1 + e⁻²ˣ)
  Also referred to as a squashing function, squashing the output
  between -1 and 1.
• Rectifier – 0 if x ≤ 0, otherwise x.
    Ø(x) = { 0 if x ≤ 0
             x if x > 0 }
    Alternate representation: Ø(x) = max(0, x)
  Popularly used in hidden layers for outputting to the next layer.
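A minimal sketch of these four activation functions in Python, assuming NumPy; the sample inputs are illustrative.

```python
import numpy as np

def threshold(x):
    # Binary: 1 if x >= 0, otherwise 0.
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # Squashes output into (0, 1); popular in output nodes for probabilities.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes output into (-1, 1); same as (1 - e^-2x) / (1 + e^-2x).
    return np.tanh(x)

def relu(x):
    # Rectifier: 0 for x <= 0, otherwise x; popular in hidden layers.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (threshold, sigmoid, tanh, relu):
    print(fn.__name__, fn(x))
```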
Fully Connected Neural Network (FCNN)
• A Fully Connected Neural Network consists of:
• Input Layer – inputs from the data (samples).
• Output Layer – the predictions.
• Hidden Layer(s) – Between the input and output layers,
where the learning occurs.
• Every node in a layer is connected to every node in the next layer.
• Activation Functions – where outputs are binary, squashed, or
rectified.
• Forward Feeding and Backward Propagation - for learning the
weights.
Fully Connected Neural Network (FCNN)
[Diagram: Input Layer X1, X2, … Xn → one Hidden Layer → Output Layer ŷ.]
Simple FCNN:
- One Hidden Layer – Rectifier Activation Function (ReLU): if the weighted sum is below zero, output no signal.
- One Output Node – Sigmoid Activation Function: squash the output into a probability.
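A minimal sketch of this simple FCNN's forward pass in Python, assuming NumPy; the layer sizes, random weight initialization, and sample input are illustrative assumptions, and the weights are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden = 3, 4                    # illustrative sizes
W1 = rng.normal(size=(n_hidden, n_inputs))   # input layer -> hidden layer weights
W2 = rng.normal(size=(1, n_hidden))          # hidden layer -> output node weights

def relu(x):
    return np.maximum(0.0, x)                # no signal below zero

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squash into a probability

def forward(x):
    hidden = relu(W1 @ x)                    # hidden layer (ReLU)
    return sigmoid(W2 @ hidden)              # output node ŷ (sigmoid)

x = np.array([0.2, 0.7, 0.1])                # one (already scaled) sample
print(forward(x))                            # ŷ, a probability in (0, 1)
```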
Deep Neural Network (FCNN)
[Diagram: Input Layer X1, X2, … Xn → multiple Hidden Layers → Output Layer ŷ.]
It’s a Deep Neural Network if it has more than one hidden layer – That’s It!
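The same idea as a minimal sketch in Keras (an assumed library choice, not one named in the slides): stacking more than one hidden Dense layer is all it takes to make the FCNN "deep". The layer sizes are illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two hidden layers -> a deep, fully connected neural network.
model = keras.Sequential([
    layers.Input(shape=(3,)),               # illustrative number of input features
    layers.Dense(8, activation="relu"),     # hidden layer 1
    layers.Dense(8, activation="relu"),     # hidden layer 2
    layers.Dense(1, activation="sigmoid"),  # output node ŷ
])
model.summary()
```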
Hidden Nodes are Specialized Learners
Each Node in the Hidden Layer Specializes.
[Diagram: inputs Age and Income, with weights W1-1 and W2-1, feed a hidden node labeled "18-25 (low income)", which contributes to the output ŷ Spending.]
• This node learns weights to best predict when age is young and income is low (i.e., they spend their parents' money).
• For a sample with Age < 25 and Income < 1000, the node outputs a high signal; otherwise it outputs a low or no signal.
• The more hidden nodes, the more specialized learners.
Cost Function
Calculate the Cost (Loss) During Training.
[Diagram: inputs Age and Income, with weights W1-1 and W2-1, feed the hidden node (Age < 25, Income < 1000), producing the predicted ŷ Spending, which is compared against the actual label y from the data.]
• The cost compares the predicted value ŷ with the actual value y (the label):
    C = ½ (y − ŷ)²
• This is one of the most commonly used cost functions for neural networks.
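A minimal sketch of this squared-error cost in Python, assuming NumPy; the sample labels and predictions are illustrative.

```python
import numpy as np

def cost(y, y_hat):
    # C = 1/2 * (y - ŷ)^2, averaged here over a small batch of samples.
    return np.mean(0.5 * (y - y_hat) ** 2)

y = np.array([1.0, 0.0, 1.0])       # actual labels y from the data
y_hat = np.array([0.9, 0.2, 0.7])   # predicted values ŷ
print(cost(y, y_hat))
```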
Feed Forward - Training
Feed Forward Training Loop:
• Training Data – feed a single row of data at a time through the Neural Network.
• Calculate the cost (loss): C = ½ (y − ŷ)², and its summation ∑ C over the training samples.
• Converged? Convergence means the cost function can’t be minimized any more.
  • No – Adjust Weights: make small adjustments to the weights in the neural network, then run the training set again through the neural network. Each run is called an Epoch.
  • Yes – Stop: the result is a Trained Neural Network.
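A minimal training-loop sketch in Python for a single sigmoid neuron, assuming NumPy and plain gradient descent on the summed squared-error cost; the synthetic data, learning rate, epoch count, and batch updates (rather than feeding one row at a time) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative training data: 2 scaled features per sample, binary labels.
X = rng.random((100, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

w = rng.normal(size=2)   # weights to learn
lr = 0.1                 # learning rate (size of the small adjustments)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

prev_cost = np.inf
for epoch in range(1000):                             # each pass over the data is an Epoch
    y_hat = sigmoid(X @ w)                            # feed forward
    total_cost = np.sum(0.5 * (y - y_hat) ** 2)       # summation of C = 1/2 (y - ŷ)²
    grad = X.T @ ((y_hat - y) * y_hat * (1 - y_hat))  # backward propagation (gradient of the cost)
    w -= lr * grad                                    # small adjustment to the weights
    if abs(prev_cost - total_cost) < 1e-6:            # converged: can't minimize the cost any more
        break
    prev_cost = total_cost

print(epoch, total_cost, w)                           # trained neuron's final epoch, cost, and weights
```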
Multiple Output Nodes - Softmax
• Squashes a set of input values into 0 and 1 (probabilities), all adding up to 1.
[Diagram: Input Layer x1, x2, x3 (features) → Hidden Layer → Output Layer z1, z2, z3, … zk (predicted real values) → Softmax → f(z1), f(z2), f(z3), … f(zk), each a probability in (0, 1).]
• Each output node specializes on a different classification.
• The softmax outputs are the classification probabilities, e.g., 90% apple, 6% pear, 3% orange, 1% banana.
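A minimal sketch of softmax in Python, assuming NumPy; the z values are illustrative output-layer scores chosen to roughly reproduce the example probabilities above.

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([3.0, 0.3, -0.4, -1.5])   # z1..zk from the output layer
p = softmax(z)
print(p, p.sum())                      # ≈ [0.90, 0.06, 0.03, 0.01], sums to 1.0
```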
Final Note – Training vs. Prediction
• Once we have trained the neural network, we do not have to
repeat the training steps when using the model for prediction.
• No repeating of Epochs, Gradient Descent and Backward Propagation.
• The model will run much faster than during training.