Machine Learning pt1: Classification, Regression,
and Artificial Neural Networks
Self-Driving-Cars Los Angeles
By Jonathan Mitchell
github.com/jonathancmitchell
linkedin.com/in/jonathancmitchell
jmitchell1991@gmail.com
Self Driving Cars Los Angeles
https://www.meetup.com/Los-Angeles-Self-Driving-Car-Meetup/
Welcome to Machine
Learning
aka computational statistics
How did I learn this?
Sources:
● Udacity’s Self-Driving Car Nanodegree program (udacity.com/drive)
● MIT Self-Driving Car program (selfdrivingcars.mit.edu)
● Stanford’s cs-231n (cs231n.github.io)
Topics
A) Probability basics - Basics to Logits
B) Linear Classification/ Logistic regression overview
C) Perceptron
D) Perceptron (biological inspiration)
E) Neuron
F) Forward Pass
G) Computing a loss function
H) Visualizing Hidden Layers
I) Setting up training data
J) Preprocessing / Normalization
K) Overfitting / Hyperparameter intro
L) Epochs
M) Minibatch
N) Gradient Descent / Stochastic Gradient Descent
O) Backpropagation
P) Cross Entropy Loss
Probability basics
Probability p = (outcomes of interest) / (all possible outcomes)
P ranges from 0 to 1 inclusive. P = 1 means 100% likelihood; P = 0 means 0% likelihood.
Odds: the ratio of the probability that an event happens to the probability that it does not: Odds = P / (1 - P).
Coin Toss: Toss a coin in the air, it can either be heads or tails.
P_heads = 0.5 => P_tails = 0.5 = 1 - P_heads
P1 = Heads Probability (0, 1), P0 = Tails Probability (0,1)
Odds ratio can be written as Odds1 : Odds2. In this case 1:1. Equal chance of getting
Heads or tails
Odds_heads = P(Heads) / P(Not Heads) = 0.5 / 0.5 = 1
Odds_tails = P(Tails) / P(Not Tails) = 0.5 / 0.5 = 1
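As a quick sketch (illustrative, not from the slides), the coin-toss odds can be computed directly:

p_heads = 0.5
p_tails = 1 - p_heads
odds_heads = p_heads / (1 - p_heads)   # 1.0
odds_tails = p_tails / (1 - p_tails)   # 1.0
# odds ratio odds_heads : odds_tails = 1 : 1 -> equal chance of heads or tails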
Bernoulli Probability (A specific case of binomial distribution)
Bernoulli Probability: A yes or no question. Two possible outcomes: Success and Fail
p = probability of success (of one trial)
q = probability of failure (of one trial) = 1 - p
A Bernoulli trial is a single trial (N = 1) with an unknown probability of success p.
Binomial Probability: the probability of K successes in N trials is
P(K) = C(N, K) * p^K * q^(N - K)
A Bernoulli distribution is the special case of a Binomial Distribution with N = 1 trial (so K is either 0 or 1).
Probability Basics -> Logistic Regression
Goal: Estimate an unknown probability p for any given linear combination of the
independent variables.
Link independent variables to the Ber(p) distribution.
Logistic regression: estimate an unknown probability p for any given linear
combination of the independent variables.
Our estimate of p is written p̂.
We need a function that maps a linear combination of the variables (which can take on any value) onto the Bernoulli probability p, whose domain is 0 to 1.
Use the Logit: the natural log of the odds, logit(p) = ln(p / (1 - p)).
Logistic Regression
Logit: the natural log of the odds, ln(p / (1 - p))
Undefined at P = 0 and P = 1; valid P domain is the open interval (0, 1).
Its output spans all real values, which matches the range of a Linear Combination.
graph from http://www.graphpad.com/support/faqid/1465/
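A minimal NumPy sketch of the logit and its inverse (the sigmoid); the numbers are only illustrative:

import numpy as np

def logit(p):
    # natural log of the odds; defined only for 0 < p < 1
    return np.log(p / (1 - p))

def inv_logit(a):
    # inverse logit (the sigmoid): maps any real value back into (0, 1)
    return 1 / (1 + np.exp(-a))

print(logit(0.5))             # 0.0 -> even odds
print(inv_logit(2.3))         # ~0.909
print(inv_logit(logit(0.7)))  # ~0.7, round trip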
Logistic Regression
Logit: α will be the linear combination of the independent variables and their coefficients.
Recall: logit(p) = ln(p) - ln(1 - p) = ln(p / (1 - p))
The inverse logit gives us the probability that the dependent variable is a "1":
p = logit⁻¹(α) = 1 / (1 + e^(-α))
Linear Combination: the probability of x under the linear-combination mapping (β and β₀) is
p(x; β, β₀) = 1 / (1 + e^(-(β·x + β₀)))
Binary Output variable Y. We
want to model the conditional
probability Pr(Y = 1 | X = x) as a
function of x; any unknown
parameters are to be estimated
by max likelihood
graph from http://www.graphpad.com/support/faqid/1465/
Logistic Regression -> Linear Classification
To classify we seek a binary output variable Y = 1 or 0.
Recall Pr(Y = 1 | X = x). We modeled this as p(x; β, β₀).
Predict Y = 1 when p >= 0.5. Y = 1 = Class A
Predict Y = 0 when p < 0.5. Y = 0 = Class B
Guess 1 when β·x + β₀ is non-negative
Guess 0 when β·x + β₀ is negative
This is a linear classifier.
We can also infer that the probabilities depend on the distance from the boundary.
This is known as a Binary Logistic Classifier (Binary = 2 options, Class A or Class B).
The decision boundary separates the two predicted classes and is the solution to the equation β·x + β₀ = 0.
Graph from http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2010/AN/b918972f/b918972f-f7.gif
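A minimal sketch of that decision rule in NumPy (the 2-feature data and coefficients are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def predict_class(x, beta, beta0):
    # Binary logistic classifier: Class A (1) if p >= 0.5, else Class B (0)
    p = sigmoid(np.dot(beta, x) + beta0)
    return 1 if p >= 0.5 else 0          # equivalently: beta.x + beta0 >= 0

beta, beta0 = np.array([1.5, -2.0]), 0.25                  # hypothetical coefficients
print(predict_class(np.array([2.0, 1.0]), beta, beta0))    # 1 -> Class A
print(predict_class(np.array([0.0, 1.0]), beta, beta0))    # 0 -> Class B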
Neuron: Building block of a neural network
src: MIT-Self-Driving-Cars, Fridman
A Neuron is a computational building block of the brain.
An Artificial Neuron is a computational building block of an artificial neural network.
Human brain: ~1,000T synapses. Artificial neural networks: ~1-10B synapses, roughly five orders of magnitude fewer.
A neuron:
* Takes a set of inputs
* Places a weight on each input
* Sums the weighted inputs together
* Adds a bias value
* Applies an activation function to the sum plus bias, squeezing the result into the range (0, 1)
In short: it takes a few inputs and produces an output.
Classification output: 1 or 0, so a single neuron can serve as a linear classifier.
src: MIT-Self-Driving-Cars, Fridman
Perceptron Algorithm
(Diagram: inputs X1, X2, X3 feed into the perceptron, which produces an Output.)
1. Initialize the perceptron with random weights.
2. Compute the perceptron's output.
3. If the output does not match the known output:
a. if the output should have been 0 but was 1, decrease the weights that had an input of 1
b. if the output should have been 1 but was 0, increase the weights that had an input of 1
4. Move on to the next example in the training set, until the perceptron makes no more mistakes.
src: MIT-Self-Driving-Cars, Fridman
If the output does not match the expected output: Punish! (A minimal sketch of this update rule follows below.)
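A minimal NumPy sketch of that punish/reward rule (the AND data, unit-step activation, and ±1 updates are illustrative assumptions, not from the slides):

import numpy as np

def train_perceptron(X, y, epochs=10):
    # punish/reward update rule from the slide, with a unit-step activation
    rng = np.random.default_rng(0)
    w = rng.random(X.shape[1])                      # 1. random weights
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, target in zip(X, y):
            output = int(np.dot(w, xi) + b >= 0)    # 2. compute the output
            if output != target:                    # 3. punish mistakes
                mistakes += 1
                if output == 1:   # should have been 0: decrease weights of active inputs
                    w -= xi
                    b -= 1
                else:             # should have been 1: increase weights of active inputs
                    w += xi
                    b += 1
        if mistakes == 0:                           # 4. stop when no more mistakes
            break
    return w, b

# toy example: learn logical AND of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([int(np.dot(w, xi) + b >= 0) for xi in X])    # [0, 0, 0, 1]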
Your output neurons didn’t match the expected output.
(Diagram: Training Images feed into a Perceptron with inputs X1, X2, X3 and an Output. Expected output: Cat, but we got Burrito.)
Why Neural Networks are great.
(Diagram: a network with inputs X1, X2, X3, a hidden layer, and an Output; labeled Perceptron.)
We can use the Hidden Layer to approximate any function.
Universality: We can closely approximate any function f(x) with a single hidden layer.
Driving: Input (sensor data from the world); Output: Drive (use steering data etc.)
src: MIT-Self-Driving-Cars, Lex Fridman
Dual class Linear Classification with Binary Logistic Regression
Goal: To predict class A or class B from the input data. Two possible outputs!
Pipeline: Input Data x -> Linear Combination (scores) -> Logistic Regression (squeezes values between 0 and 1, into the (0, 1) range between P = 0 and P = 1) -> Predictor.
Predictor: Class A is Y >= 0.5, Class B is Y < 0.5.
Notation changeup: logit-1 (the inverse logit) -> sigmoid.
Same pipeline: Input Data x -> Linear Combination -> Scores (unnormalized log probabilities) -> Sigmoid (squeezes values between 0 and 1, putting them into a probability distribution) -> Predictor: Class A is Y >= 0.5, Class B is Y < 0.5.
Generalizing Logistic Regression to multiple classes
If we have two classes we can have two possible outputs: 1 or 0. What if we have 10 classes?
Binary: two outputs, Y either 1 or 0.
Suppose we have k classes. Let's switch up some notation: set each score s_k to the result of the linear function for class k, and get the probability that the output Y = class k by performing a softmax on the scores over the J possible classes:
P(Y = k) = e^(s_k) / Σ_j e^(s_j)
The Softmax Classifier is Binary Logistic Regression applied to multiple classes. Output = scores b/w 0 and 1.
(The same pipeline again, now annotated: the output of the Linear function, aka the Linear Scores, are the unnormalized log probabilities; the sigmoid or softmax squeezes them into the (0, 1) range before the Predictor picks Class A if Y >= 0.5 or Class B if Y < 0.5.)
Linear(x) = xW + b or Wx + b
Textbooks: Wx + b
TensorFlow: xW + b
Computing derivatives is easier for xW + b.
A few notes
f(xi, W, b) = xW + b
Assume image x has all of its pixels flattened out into a single row vector: x = [x[0], x[1], x[2], x[3], x[4], ...], each pixel a value in (0, 255).
X’s size is [n x m]. n: # examples/images, m: # features (pixels, in this case, per image)
Matrix W of size [m x k]. m = # features, k = # classes
Bias b of size [1 x k] (a row vector with one entry per class)
Consider our input data (xi, yi) as being fixed. We can set W and b to approximate any function (remember the universality principle).
We use the training data to learn W and b. Once our model has been trained, we can discard the training data and test our model on test data. Or anything for that matter.
W and b will be tensors if you are using TensorFlow. They can be arrays if you are using Numpy.
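A minimal NumPy sketch of these shapes (n = 2, m = 4, k = 3 here are illustrative values, not from the slides):

import numpy as np

n, m, k = 2, 4, 3                     # examples, features (pixels), classes
rng = np.random.default_rng(0)

X = rng.integers(0, 256, size=(n, m)).astype(np.float32)   # n x m pixel rows
W = rng.random((m, k)).astype(np.float32)                   # m x k weights
b = np.zeros((1, k), dtype=np.float32)                      # 1 x k bias row vector

scores = X @ W + b                    # f(x, W, b) = xW + b
print(scores.shape)                   # (2, 3): one score per class, per example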
Example: The biases allow us to have these lines NOT all cross through the origin.
W causes the lines to rotate about our pixel space; b pushes the lines away from the origin.
src: Andrej Karpathy
Bias Trick (in practice)
It would be annoying to worry about the Bias term separately during classification.
Therefore we simply append the bias row vector to the end of our Weights matrix.
Weights (4 x 3):
0.1    0.25   0.3
0.63   0.12  -0.64
0.26   0.62   0.58
0.99  -0.14   0.333
Bias (1 x 3):
0.12   3.1   -0.5
After the trick, the bias row vector is appended as the last row of the Weights matrix (and a constant 1 is appended to each input row), giving Weights (5 x 3):
0.1    0.25   0.3
0.63   0.12  -0.64
0.26   0.62   0.58
0.99  -0.14   0.333
0.12   3.1   -0.5
You may see this in the code as: logits = tf.add(tf.matmul(x, weights), bias) OR
logits = tf.matmul(x, weights)
logits = tf.nn.bias_add(logits, bias)
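And a small NumPy sketch of the bias trick itself, using the matrices above:

import numpy as np

W = np.array([[0.1,   0.25,  0.3 ],
              [0.63,  0.12, -0.64],
              [0.26,  0.62,  0.58],
              [0.99, -0.14,  0.333]])        # 4 x 3 weights
b = np.array([[0.12, 3.1, -0.5]])            # 1 x 3 bias
x = np.array([[20., 254., 40., 1.]])         # 1 x 4 input row

W_ext = np.vstack([W, b])                    # 5 x 3: bias appended as the last row
x_ext = np.hstack([x, np.ones((1, 1))])      # 1 x 5: constant 1 appended to the input

print(np.allclose(x @ W + b, x_ext @ W_ext)) # True: both give the same linear scores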
Input image img1 (pretend this image only has 4 pixels): stretch the pixels into a single row.
X: n x m = 1 x 4 -> [20, 254, 40, 1]
1 image, 4 pixels. Each pixel is a feature, so 1 image, 4 features. Pixels range (0, 255).
n: # images = 1, m: # features (pixels per img) = 4, k: # classes = 3 (Cat, Car, Dog)
Weights W: m x k = 4 x 3
0.1    0.25   0.3
0.63   0.12  -0.64
0.26   0.62   0.58
0.99  -0.14   0.333
Bias b: 1 x k = 1 x 3 -> [0.12, 3.1, -0.5]
Linear Scores = xW + b -> Output: 1 x 3 -> Cat 3.2, Car 5.1, Dog -1.7
values from Andrej Karpathy
Initialize weights with values b/w 0 and 1. You can initialize biases to start at 0 or very small values if you like.
Applying softmax to the Linear Scores f(x; W, b):
softmax(s)_k = e^(s_k) / Σ_j e^(s_j), where k indexes a specific class (different from the k on the last slide) and J = # classes.
                                                  Cat     Car     Dog
Linear scores (unnormalized log probabilities):   3.2     5.1    -1.7
Apply exponential (unnormalized probabilities):   24.5    164     0.18
Normalize so sum = 1 (probabilities):             0.13    0.87    0.00
values from Andrej Karpathy
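The same computation as a small NumPy sketch, using the slide's scores:

import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # Cat, Car, Dog linear scores

exp_scores = np.exp(scores)            # unnormalized probabilities
probs = exp_scores / exp_scores.sum()  # normalize so they sum to 1

print(np.round(exp_scores, 2))         # [ 24.53 164.02   0.18]
print(np.round(probs, 2))              # [0.13 0.87 0.  ]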
Process so far: each pixel can be considered a neuron.
Input image (stretch pixels to a single row) -> X: n x m = 1 x 4 = [20, 254, 40, 1] -> Weights (m x k = 4 x 3) and Bias (1 x k = 1 x 3 = [0.12, 3.1, -0.5]) -> Linear Scores (1 x 3): Cat 3.2, Car 5.1, Dog -1.7 -> softmax -> Normalized Probabilities (1 x 3): Cat 0.13, Car 0.87, Dog 0.00.
values from Andrej Karpathy
Process so far: the Forward Pass.
Input image -> X (1 x 4) -> xW + b with Weights (4 x 3) and Bias (1 x 3) -> Linear Scores [3.2, 5.1, -1.7] -> softmax -> Normalized Probabilities [0.13, 0.87, 0.00] (Cat, Car, Dog).
Loss Function: How we learn
Recall:
Your output
neurons
didn’t match
the expected
output.
(The forward pass again, for reference: input image -> X -> xW + b -> Linear Scores -> softmax -> Normalized Probabilities [0.13, 0.87, 0.00].)
Loss Function: How we learn
Normalized Probabilities (1 x 3): Cat 0.13, Car 0.87, Dog 0.00 (values from Andrej Karpathy)
Maximize the log likelihood of the true class, OR minimize the negative log likelihood of the true class (it is easier to build a negative feedback loop than a positive feedback loop).
Use the loss to manipulate the weights of the incorrectly classified inputs.
There are many different types of loss functions. More on this later.
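For example, a tiny sketch of the negative log likelihood using the numbers above (assuming the true class is Cat):

import numpy as np

probs = np.array([0.13, 0.87, 0.00])   # Cat, Car, Dog (softmax output)
true_class = 0                          # assume the true label is Cat

nll = -np.log(probs[true_class])        # negative log likelihood of the true class
print(round(nll, 2))                    # 2.04 -> a large loss, because Cat only got 0.13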
Visualizing a hidden layer
Graph: X -> Linear L1 (W1, b1) -> Linear L2 (W2, b2) -> Softmax
Shape conventions: X is n x m (examples x features), W is m x k (features x classes or hidden units), b is 1 x k (a row vector).
X: 4 x 10, W1: 10 x 100, b1: 1 x 100. xW1 is 4 x 100; since b1 has a dimension of size 1, its values are broadcast across the xW1 product automatically, so L1 is 4 x 100.
W2: 100 x 10, b2: 1 x 10. L2 = L1·W2 + b2 is 4 x 10.
We can add a wide layer by adding columns to W1, and then a skinny layer by giving W2 k columns, so that our output still has the desired shape of examples x classes: 4 x 10.
These layers are hidden because we cannot see their output as we run the graph.
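A minimal NumPy sketch of those shapes (random illustrative values):

import numpy as np

rng = np.random.default_rng(0)
X  = rng.random((4, 10))     # 4 examples x 10 features
W1 = rng.random((10, 100))   # wide hidden layer
b1 = np.zeros((1, 100))
W2 = rng.random((100, 10))   # skinny layer back down to 10 classes
b2 = np.zeros((1, 10))

L1 = X @ W1 + b1             # 4 x 100 (b1 is broadcast across the rows)
L2 = L1 @ W2 + b2            # 4 x 10 -> examples x classes
probs = np.exp(L2) / np.exp(L2).sum(axis=1, keepdims=True)   # row-wise softmax

print(L1.shape, L2.shape, probs.shape)   # (4, 100) (4, 10) (4, 10)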
Neuron
Neurons are not classes or objects; they are values. They are the values that are moving through the pipeline. Follow a pixel of an example image through a network and consider it to be a neuron.
When we implement a neural network we use a graph.
Graph: Training Data Images X -> Linear L1 (W1, b1) -> Softmax S1 -> Probabilities, e.g. [0.13, 0.87, 0.00].
Labels tell you the true class of each image.
Note: Softmax S1 and Sigmoid S1 play the same role here (for two classes the softmax reduces to the sigmoid), so the next slides use the two names interchangeably.
Graph: Training Data Images X -> Linear L1 (W1, b1) -> Sigmoid S1 -> Probabilities (logits) [0.13, 0.87, 0.00].
Labels tell you the true class of each image. One-Hot Training Labels:
'Cat' -> 0 0 1
'Car' -> 0 1 0
'Dog' -> 1 0 0
If we run our network on just one image in the training set and take its corresponding label (one-hot 0 1 0), the output [0.13, 0.87, 0.00] does not match it. Error! It should be 0 1 0.
Graph: Training Data Images X -> Linear L1 (W1, b1) -> Sigmoid S1 -> logits.
Run the network on all training data and training labels. One-Hot Training Labels:
'Cat' -> 0 0 1
'Car' -> 0 1 0
'Dog' -> 1 0 0
Run the network on 3 images; the predicted scores (columns: Cat, Car, Dog) are:
Cat image: 0.13  0.87  0.12
Car image: 0.55  0.91  0.2
Dog image: 0.88  0.66  0.11
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Find accuracy of our training network. More of this later.
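The same accuracy check in plain NumPy, a sketch using the 3-image table above:

import numpy as np

logits = np.array([[0.13, 0.87, 0.12],    # Cat image
                   [0.55, 0.91, 0.20],    # Car image
                   [0.88, 0.66, 0.11]])   # Dog image
labels = np.array([[0, 0, 1],             # 'Cat' one-hot
                   [0, 1, 0],             # 'Car' one-hot
                   [1, 0, 0]])            # 'Dog' one-hot

correct = np.argmax(logits, axis=1) == np.argmax(labels, axis=1)
print(correct)          # [False  True  True] -> the first image is misclassified
print(correct.mean())   # 0.666... accuracy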
Setting up training data
Before: Training Data | Test Data
After:  Training Data | Validation Data | Test Data
Split up your training data into validation data and
training data. Use validation data as test data as
you train and tune your network.
Train Data: 80% of original training data
Validation Data: 20% of original training data
Then shuffle!
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# this splits train / validation 80 / 20
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, train_size=0.80, test_size=0.20)

# this shuffles and keeps label indices intact
X_train, y_train = shuffle(X_train, y_train)
Preprocessing: Normalization
In our examples we used raw pixel values (0, 255) as our inputs to train our network.
In practice, we preprocess this data before running it through our network.
Mean-centered normalization: we subtract the mean pixel value from each pixel and divide by the standard deviation. This gives us a roughly Gaussian, zero-centered distribution of values, mostly in [-1, 1]: (0, 255) => (-1, 1)
Min-max normalization: we subtract the min and divide by the difference (max - min). This gives us a domain of (0, 1): (0, 255) => (0, 1)
img from Wikipedia
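A minimal NumPy sketch of both normalizations (the pixel values are illustrative):

import numpy as np

pixels = np.array([20., 254., 40., 1.])            # raw pixel values in (0, 255)

# min-max normalization -> values in (0, 1)
minmax = (pixels - pixels.min()) / (pixels.max() - pixels.min())

# mean-centered normalization -> roughly zero-centered values
standardized = (pixels - pixels.mean()) / pixels.std()

print(np.round(minmax, 2))         # [0.08 1.   0.15 0.  ]
print(np.round(standardized, 2))   # [-0.58  1.72 -0.38 -0.76]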
Overfitting - Introduction to Hyperparameters
The goal of building an artificial neural network is to generalize.
● We want to apply new data to our network and classify inputs
● If we overtrain / overfit our network to our training data then our accuracy will be
deceiving. It might work very well for training data, but will not work on test data.
● In order to prevent overfitting we implement preprocessing techniques and tune our hyperparameters.
● Tuning hyperparameters is basically all that we can do after we set up our network architecture.
● It should be the last step in setting up your network.
● Test on validation data while you tune (don’t touch the test data).
http://docs.aws.amazon.com/machine-learning/latest/dg/images/mlconcepts_image5.png
Epochs
An Epoch is one forward pass and one backward pass over the entire training set.
It is a hyperparameter, and we must tune the number of epochs to fit our data / increase our accuracy.
The larger the number of epochs, the longer it takes to train.
We increase epochs to increase the number of training passes; if we increase them too much we may overfit. Stop early!
(Diagram: X -> Linear L1 (W1, b1) -> Softmax S1 -> logits, with forward and backward arrows. More on the backward pass to come.)
https://qph.ec.quoracdn.net/main-qimg-d23fbbc85b7d18b4e07b7942ecdfd856?convert_to_webp=true
Minibatch
We don’t feed all of the training data into our network at once. Instead, we choose a batch of examples and feed them in, perform forward and backward propagation on them, and then feed in the next batch.
We do this so we can perform Stochastic Gradient Descent, and to help prevent our network from overfitting.
So in mini-batch gradient descent, you process a small subset of the training data forwards and backwards and update the weights / biases with the gradient update formula (shown on the next page).
Mini batch
We feed only segments of the training data into our neural network at a time.
(Diagram: Training Data split into Batches -> Network.)
● The number of examples in each batch is a hyperparameter.
● This also depends on GPU memory size.
● Typically use 128 or 256.
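A minimal sketch of mini-batching in Python (the 1000-example dataset and batch size of 256 are illustrative; they match the example a few slides ahead):

import numpy as np

def minibatches(X, y, batch_size=256):
    # yield successive (features, labels) batches; the last one may be smaller
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

X = np.random.rand(1000, 4)              # 1000 examples, 4 features
y = np.random.randint(0, 3, size=1000)   # 3 classes

for X_batch, y_batch in minibatches(X, y):
    print(len(X_batch))                  # 256, 256, 256, 232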
Gradient Descent
The “Learning” in Machine Learning.
Update the values of X (punish it) when it is wrong: X -= η∇(x)
X: weights or biases
η: Learning Rate (typically 0.01 to 0.001), the rate at which our network learns. This can change over time with methods such as Adam, Adagrad etc. (hyperparameter)
∇(x): gradient of the loss with respect to X
We seek to update the weights and biases by a value indicating how “off” they were from their target.
The gradient points in the direction of steepest increase of the loss, so we put a negative sign in front of it to go downhill.
Stochastic Gradient Descent
Recall Gradient Descent: X -= η∇(x) (eq 1)
Stochastic Gradient Descent (SGD) is a version of Gradient Descent where on
each forward pass, a batch of data is randomly sampled from the total dataset and
gradient descent is performed on that batch.
The more batches processed by the network, the better the approximation of the full-dataset gradient.
1. Randomly sample a batch of data (1) from the total dataset
2. Run the network forward and backward to calculate the gradient from data (1)
3. Apply the gradient descent update (eq 1)
4. Repeat 1-3 until convergence or epoch limit
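Putting those steps together, a minimal NumPy sketch of SGD; the toy quadratic loss and its gradient are illustrative assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
X_data = rng.random((1000, 4))            # toy dataset
W = rng.random(4)                         # parameters to learn
lr = 0.01                                 # learning rate (eta)

def gradient(W, batch):
    # gradient of a toy loss mean((batch @ W - 1)^2) with respect to W
    return 2 * batch.T @ (batch @ W - 1) / len(batch)

for step in range(100):
    idx = rng.choice(len(X_data), size=256, replace=False)  # 1. random batch
    grad = gradient(W, X_data[idx])                         # 2. forward/backward on the batch
    W -= lr * grad                                          # 3. X -= eta * grad(X)  (eq 1)
print(W)                                                    # 4. repeat until convergence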
Visualizing Batch and SGD
If we start out with 1000 training images and use a batch size of 256, we get batches of 256, 256, 256, and a final batch of 232 images.
Stochastic Gradient Descent sample sizes: maybe take ~5 images from the 256-image batch at a time and run SGD on them, then go back and select 5 more.
X -= η∇(x), applied for each image in the SGD sample.
Backpropagation
We need to figure out how to alter a parameter to minimize the cost (loss). First we must find out what
effect that parameter has on the cost.
(we can’t just blindly change parameter values and hope that our network converges)
The gradient tells us the effect each parameter has on the cost.
How to determine the effect of a parameter on the cost?
We use Backpropagation - which is an application of the chain rule from calculus
Did somebody
say Chain
Rule?
Backpropagation
Derivative Review: for a composed function f(g(x)), in order to know the effect x has on f, we must first find the effect g has on f, then the effect x has on g, and multiply them (the chain rule):
df/dx = (df/dg) · (dg/dx)
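A tiny numeric sketch of that rule (the functions are chosen just for illustration):

import numpy as np

# f(g(x)) with g(x) = 3x and f(g) = g**2, so df/dx = (2g) * 3 = 18x
def g(x): return 3 * x
def f(v): return v ** 2

x = 2.0
analytic = (2 * g(x)) * 3                               # df/dg * dg/dx (chain rule)
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)   # central-difference check
print(analytic, round(numeric, 3))                      # 36.0 36.0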
Backpropagation
You want to stage backpropagation at each gate level, locally. This is much easier to implement than storing each weight value and trying to compute everything at the end. Simply add up the gradients along an individual neuron's path.
Andrej Karpathy
More Backpropagation
(Diagram: a gate f with inputs X and Y and output Z.)
The change in the Loss L with respect to X is the change in L with respect to Z times the change in Z with respect to X:
∂L/∂X = (∂L/∂Z) · (∂Z/∂X)
(Diagram: X, W1, b1 -> Linear L1 -> S1. It is called S1 because it goes to the sigmoid; S = xW + b. We want the Loss with respect to X.)
This comes together on the next slide!
(Diagram: X, W1, b1 -> Linear L1 -> Sigmoid S1 -> Any Gate -> Output. S1 because it goes to the sigmoid; S = xW + b.)
X has a relationship to L1, and S1 has a relationship to L1. We can use that relationship in an application of the chain rule to compute the change in the Loss at L1 with respect to X, then perform a gradient descent update on X:
∂Loss/∂X = (accumulated gradient up to the L1 gate) · (∂L1/∂X)
X -= η · ∂Loss/∂X (the gradient descent update on X)
The accumulated gradient is the accumulator of all the gradients up to the L1 gate (the sum of all gradients along the path from the output), aka the Accumulated Loss.
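A minimal NumPy sketch of backprop through this linear -> sigmoid pair, checked against a numerical derivative (the values and toy loss are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])        # one flattened example
W = np.array([[0.1], [0.4], [-0.2]])
b = np.array([0.05])

# forward pass: Linear L1 -> Sigmoid S1 -> (toy loss = sum of the sigmoid output)
L1 = x @ W + b
S1 = sigmoid(L1)
loss = S1.sum()

# backward pass: multiply the local gradients along the path (chain rule)
dloss_dS1 = np.ones_like(S1)          # upstream (accumulated) gradient
dS1_dL1 = S1 * (1 - S1)               # local gradient of the sigmoid gate
dL1_dx = W[:, 0]                      # local gradient of the linear gate w.r.t. x
dloss_dL1 = dloss_dS1 * dS1_dL1       # gradient arriving at the linear gate
dloss_dx = dloss_dL1[0] * dL1_dx      # accumulated gradient at x

# numerical check on the first component of x
eps = 1e-6
x_eps = x.copy(); x_eps[0] += eps
num = (sigmoid(x_eps @ W + b).sum() - loss) / eps
print(round(float(dloss_dx[0]), 6), round(float(num), 6))   # both ~0.0222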
Backpropagation cont
Andrej Karpathy
Run the network on all training data and training labels, exactly as before: Training Data Images X -> Linear L1 (W1, b1) -> Sigmoid S1 -> logits. For the 3-image example above (Cat/Car/Dog scores vs. one-hot labels), the predicted scores and the labels are then fed into the Cross Entropy.
Cross Entropy (a distance)
Pipeline: Input X -> Linear (Wx + b) -> Logits y = [2.0, 1.0, 0.1] -> Softmax S(Y) = [0.7, 0.2, 0.1] -> compared against the one-hot Labels L = [1.0, 0.0, 0.0] with the Cross Entropy D(S, L) = -Σ_i L_i · log(S_i).
Cross entropy tells us how accurate we are. Minimize the cross entropy:
● Want a high distance for an incorrect class
● Want a low distance for the correct class
● Training loss = average cross entropy over the entire training set
● Want all the distances to be small
● Want the loss to be small
● So we attempt to minimize this function.
(Plot: Training Loss as a function of weight 1 and weight 2. src: Udacity)
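A one-line sketch of D(S, L) with the slide's numbers:

import numpy as np

S = np.array([0.7, 0.2, 0.1])     # softmax output S(Y)
L = np.array([1.0, 0.0, 0.0])     # one-hot labels

D = -np.sum(L * np.log(S))        # cross entropy D(S, L)
print(round(D, 3))                # 0.357 -> only the true class's log-prob contributes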
Cross Entropy Loss (continued)
(Plot: Training Loss as a function of weight 1 and weight 2. src: Udacity)
We want to find the weights that make this loss the smallest; this turns the ML problem into numerical optimization.
Training Loss: the average cross entropy over the entire training set. Minimize this function.
● Take the derivative of the Loss with respect to the parameters and follow the derivative by taking a step in the downhill direction.
● Repeat until you get to the bottom.
● In this case we have 2 parameters (w1, w2); typically we have millions of parameters.
cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))
Installing Dependencies
You can use pip3 or pip. I recommend using an anaconda environment with python3:
https://www.continuum.io/downloads to Download Anaconda, (get Python 3.4+ version)
conda create --name=IntroToTensorFlow python=3 anaconda
source activate IntroToTensorFlow (Your conda environment is named “IntroToTensorFlow”)
conda install -c anaconda numpy=1.11.3
conda install -c conda-forge matplotlib=2.0.0
conda install -c anaconda scipy=0.18.1
conda install scikit-learn
or: pip install -U scikit-learn
conda install -c conda-forge tensorflow
conda install -c menpo opencv3=3.2.0
jupyter notebook (to run in browser)
git clone https://github.com/JonathanCMitchell/TensorFlowLab.git
Installing TensorFlow
Recommended: Python 3.4 or higher and Anaconda
Install TensorFlow
conda create --name=IntroToTensorFlow python=3 anaconda
source activate IntroToTensorFlow
conda install -c conda-forge tensorflow
docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow (Docker if you need it)
# Hello World!
import tensorflow as tf

# create a tensorflow constant tensor
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)
git clone https://github.com/JonathanCMitchell/TensorFlowLab.git
If you have questions here is my info:
Jonathan Mitchell
github.com/jonathancmitchell
linkedin.com/in/jonathancmitchell
jmitchell1991@gmail.com
Self Driving Cars Los Angeles
https://www.meetup.com/Los-Angeles-Self-Driving-Car-Meetup/
Thank you!
