Machine Learning pt1: Classification, Regression,
and Artificial Neural Networks
Self-Driving-Cars Los Angeles
By Jonathan Mitchell
github.com/jonathancmitchell
linkedin.com/in/jonathancmitchell
jmitchell1991@gmail.com
Self Driving Cars Los Angeles
https://www.meetup.com/Los-Angeles-Self-Driving-Car-Meetup/
Welcome to Machine
Learning
aka computational statistics
How did I learn this?
Sources:
● Udacity’s Self-Driving Car Nanodegree program (udacity.com/drive)
● MIT Self-Driving Car program (selfdrivingcars.mit.edu)
● Stanford’s cs-231n (cs231n.github.io)
Topics
A) Probability basics - Basics to Logits
B) Linear Classification/ Logistic regression overview
C) Perceptron
D) Perceptron (biological inspiration)
E) Neuron
F) Forward Pass
G) Computing a loss function
H) Visualizing Hidden Layers
I) Setting up training data
J) Preprocessing / Normalization
K) Overfitting / Hyperparameter intro
L) Epochs
M) Minibatch
N) Gradient Descent / Stochastic Gradient Descent
O) Backpropagation
P) Cross Entropy Loss
Probability basics
Probability p = (outcomes of interest) / (all possible outcomes)
P ranges from 0 to 1 inclusive. P = 1 means 100% likelihood; P = 0 means 0% likelihood.
Odds: the ratio of the probability that an event happens to the probability that it does not: Odds = P / (1 - P).
Coin Toss: Toss a coin in the air, it can either be heads or tails.
P_heads = 0.5 => P_tails = 0.5 = 1 - P_heads
P1 = Heads Probability (0, 1), P0 = Tails Probability (0,1)
Odds ratio can be written as Odds1 : Odds2. In this case 1:1. Equal chance of getting
Heads or tails
Odds_heads = P(Heads) / P(Not Heads) = 0.5 / 0.5 = 1
Odds_tails = P(Tails) / P(Not Tails) = 0.5 / 0.5 = 1
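As a quick sketch (illustrative, not from the slides), the coin-toss odds can be computed directly:

p_heads = 0.5
p_tails = 1 - p_heads
odds_heads = p_heads / (1 - p_heads)   # 1.0
odds_tails = p_tails / (1 - p_tails)   # 1.0
# odds ratio odds_heads : odds_tails = 1 : 1 -> equal chance of heads or tails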
Bernoulli Probability (A specific case of binomial distribution)
Bernoulli Probability: A yes or no question. Two possible outcomes: Success and Fail
p = probability of success (of one trial)
q = probability of failure (of one trial) = 1 - p
A Bernoulli trial is a single trial (N = 1) with an unknown probability of success p.
Binomial Probability: the probability of K successes in N trials is
P(K) = C(N, K) * p^K * q^(N - K)
A Bernoulli distribution is the special case of a Binomial Distribution with N = 1 trial (so K is either 0 or 1).
Probability Basics -> Logistic Regression
Goal: Estimate an unknown probability p for any given linear combination of the
independent variables.
Link independent variables to the Ber(p) distribution.
Logistic regression: estimate an unknown probability p for any given linear
combination of the independent variables.
Our estimate of p is written p̂.
We need a function that maps a linear combination of the variables (which can take on any value) onto the Bernoulli probability p, whose domain is 0 to 1.
Use the Logit: the natural log of the odds, logit(p) = ln(p / (1 - p)).
Logistic Regression
Logit: the natural log of the odds, ln(p / (1 - p))
Undefined at P = 0 and P = 1; valid P domain is the open interval (0, 1).
Its output spans all real values, which matches the range of a Linear Combination.
graph from http://www.graphpad.com/support/faqid/1465/
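A minimal NumPy sketch of the logit and its inverse (the sigmoid); the numbers are only illustrative:

import numpy as np

def logit(p):
    # natural log of the odds; defined only for 0 < p < 1
    return np.log(p / (1 - p))

def inv_logit(a):
    # inverse logit (the sigmoid): maps any real value back into (0, 1)
    return 1 / (1 + np.exp(-a))

print(logit(0.5))             # 0.0 -> even odds
print(inv_logit(2.3))         # ~0.909
print(inv_logit(logit(0.7)))  # ~0.7, round trip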
Logistic Regression
Logit: α will be the linear combination of the independent variables and their coefficients.
Recall: logit(p) = ln(p) - ln(1 - p) = ln(p / (1 - p))
The inverse logit gives us the probability that the dependent variable is a "1":
p = logit⁻¹(α) = 1 / (1 + e^(-α))
Linear Combination: the probability of x under the linear-combination mapping (β and β₀) is
p(x; β, β₀) = 1 / (1 + e^(-(β·x + β₀)))
Binary Output variable Y. We
want to model the conditional
probability Pr(Y = 1 | X = x) as a
function of x; any unknown
parameters are to be estimated
by max likelihood
graph from http://www.graphpad.com/support/faqid/1465/
Logistic Regression -> Linear Classification
To classify we seek a binary output variable Y = 1 or 0.
Recall Pr(Y = 1 | X = x). We modeled this as p(x; β, β₀).
Predict Y = 1 when p >= 0.5. Y = 1 = Class A
Predict Y = 0 when p < 0.5. Y = 0 = Class B
Guess 1 when β·x + β₀ is non-negative
Guess 0 when β·x + β₀ is negative
This is a linear classifier.
We can also infer that the probabilities depend on the distance from the boundary.
This is known as a Binary Logistic Classifier (Binary = 2 options, Class A or Class B).
The decision boundary separates the two predicted classes and is the solution to the equation β·x + β₀ = 0.
Graph from http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2010/AN/b918972f/b918972f-f7.gif
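A minimal sketch of that decision rule in NumPy (the 2-feature data and coefficients are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def predict_class(x, beta, beta0):
    # Binary logistic classifier: Class A (1) if p >= 0.5, else Class B (0)
    p = sigmoid(np.dot(beta, x) + beta0)
    return 1 if p >= 0.5 else 0          # equivalently: beta.x + beta0 >= 0

beta, beta0 = np.array([1.5, -2.0]), 0.25                  # hypothetical coefficients
print(predict_class(np.array([2.0, 1.0]), beta, beta0))    # 1 -> Class A
print(predict_class(np.array([0.0, 1.0]), beta, beta0))    # 0 -> Class B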
Neuron: Building block of a neural network
src: MIT-Self-Driving-Cars, Fridman
A Neuron is a computational building block of the brain.
An Artificial Neuron is a computational building block of an artificial neural network.
Human brain: ~1,000T synapses. Artificial neural networks: ~1-10B synapses, roughly five orders of magnitude fewer.
A neuron:
* Takes a set of inputs
* Places a weight on each input
* Sums the weighted inputs together
* Adds a bias value
* Applies an activation function to the sum plus bias, squeezing the result into the range (0, 1)
In short: it takes a few inputs and produces an output.
Classification output: 1 or 0, so a single neuron can serve as a linear classifier.
src: MIT-Self-Driving-Cars, Fridman
Perceptron Algorithm
(Diagram: inputs X1, X2, X3 feed into the perceptron, which produces an Output.)
1. Initialize the perceptron with random weights.
2. Compute the perceptron's output.
3. If the output does not match the known output:
a. if the output should have been 0 but was 1, decrease the weights that had an input of 1
b. if the output should have been 1 but was 0, increase the weights that had an input of 1
4. Move on to the next example in the training set, until the perceptron makes no more mistakes.
src: MIT-Self-Driving-Cars, Fridman
If the output does not match the expected output: Punish! (A minimal sketch of this update rule follows below.)
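A minimal NumPy sketch of that punish/reward rule (the AND data, unit-step activation, and ±1 updates are illustrative assumptions, not from the slides):

import numpy as np

def train_perceptron(X, y, epochs=10):
    # punish/reward update rule from the slide, with a unit-step activation
    rng = np.random.default_rng(0)
    w = rng.random(X.shape[1])                      # 1. random weights
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, target in zip(X, y):
            output = int(np.dot(w, xi) + b >= 0)    # 2. compute the output
            if output != target:                    # 3. punish mistakes
                mistakes += 1
                if output == 1:   # should have been 0: decrease weights of active inputs
                    w -= xi
                    b -= 1
                else:             # should have been 1: increase weights of active inputs
                    w += xi
                    b += 1
        if mistakes == 0:                           # 4. stop when no more mistakes
            break
    return w, b

# toy example: learn logical AND of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([int(np.dot(w, xi) + b >= 0) for xi in X])    # [0, 0, 0, 1]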
Your output neurons didn’t match the expected output.
(Diagram: Training Images feed into a Perceptron with inputs X1, X2, X3 and an Output. Expected output: Cat, but we got Burrito.)
Why Neural Networks are great.
(Diagram: a network with inputs X1, X2, X3, a hidden layer, and an Output; labeled Perceptron.)
We can use the Hidden Layer to approximate any function.
Universality: We can closely approximate any function f(x) with a single hidden layer.
Driving: Input (sensor data from the world); Output: Drive (use steering data etc.)
src: MIT-Self-Driving-Cars, Lex Fridman
Dual class Linear Classification with Binary Logistic Regression
Goal: To predict class A or class B from the input data. Two possible outputs!
Pipeline: Input Data x -> Linear Combination (scores) -> Logistic Regression (squeezes values between 0 and 1, into the (0, 1) range between P = 0 and P = 1) -> Predictor.
Predictor: Class A is Y >= 0.5, Class B is Y < 0.5.
Notation changeup: logit-1 (the inverse logit) -> sigmoid.
Same pipeline: Input Data x -> Linear Combination -> Scores (unnormalized log probabilities) -> Sigmoid (squeezes values between 0 and 1, putting them into a probability distribution) -> Predictor: Class A is Y >= 0.5, Class B is Y < 0.5.
Generalizing Logistic Regression to multiple classes
If we have two classes we can have two possible outputs: 1 or 0. What if we have 10 classes?
Binary: two outputs, Y either 1 or 0.
Suppose we have k classes. Let's switch up some notation: set each score s_k to the result of the linear function for class k, and get the probability that the output Y = class k by performing a softmax on the scores over the J possible classes:
P(Y = k) = e^(s_k) / Σ_j e^(s_j)
The Softmax Classifier is Binary Logistic Regression applied to multiple classes. Output = scores b/w 0 and 1.
(The same pipeline again, now annotated: the output of the Linear function, aka the Linear Scores, are the unnormalized log probabilities; the sigmoid or softmax squeezes them into the (0, 1) range before the Predictor picks Class A if Y >= 0.5 or Class B if Y < 0.5.)
Linear(x) = xW + b or Wx + b
Textbooks: Wx + b
TensorFlow: xW + b
Computing derivatives is easier for xW + b.
A few notes
f(xi, W, b) = xW + b
Assume image x has all of its pixels flattened out into a single row vector: x = [x[0], x[1], x[2], x[3], x[4], ...], each pixel a value in (0, 255).
X’s size is [n x m]. n: # examples/images, m: # features (pixels, in this case, per image)
Matrix W of size [m x k]. m = # features, k = # classes
Bias b of size [1 x k] (a row vector with one entry per class)
Consider our input data (xi, yi) as being fixed. We can set W and b to approximate any function (remember the universality principle).
We use the training data to learn W and b. Once our model has been trained, we can discard the training data and test our model on test data. Or anything for that matter.
W and b will be tensors if you are using TensorFlow. They can be arrays if you are using Numpy.
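A minimal NumPy sketch of these shapes (n = 2, m = 4, k = 3 here are illustrative values, not from the slides):

import numpy as np

n, m, k = 2, 4, 3                     # examples, features (pixels), classes
rng = np.random.default_rng(0)

X = rng.integers(0, 256, size=(n, m)).astype(np.float32)   # n x m pixel rows
W = rng.random((m, k)).astype(np.float32)                   # m x k weights
b = np.zeros((1, k), dtype=np.float32)                      # 1 x k bias row vector

scores = X @ W + b                    # f(x, W, b) = xW + b
print(scores.shape)                   # (2, 3): one score per class, per example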
Example: The biases allow us to have these lines NOT all cross through the origin.
W causes the lines to rotate about our pixel space; b pushes the lines away from the origin.
src: Andrej Karpathy
Bias Trick (in practice)
It would be annoying to worry about the Bias term separately during classification.
Therefore we simply append the bias row vector to the end of our Weights matrix.
Weights (4 x 3):
0.1    0.25   0.3
0.63   0.12  -0.64
0.26   0.62   0.58
0.99  -0.14   0.333
Bias (1 x 3):
0.12   3.1   -0.5
After the trick, the bias row vector is appended as the last row of the Weights matrix (and a constant 1 is appended to each input row), giving Weights (5 x 3):
0.1    0.25   0.3
0.63   0.12  -0.64
0.26   0.62   0.58
0.99  -0.14   0.333
0.12   3.1   -0.5
You may see this in the code as: logits = tf.add(tf.matmul(x, weights), bias) OR
logits = tf.matmul(x, weights)
logits = tf.nn.bias_add(logits, bias)
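And a small NumPy sketch of the bias trick itself, using the matrices above:

import numpy as np

W = np.array([[0.1,   0.25,  0.3 ],
              [0.63,  0.12, -0.64],
              [0.26,  0.62,  0.58],
              [0.99, -0.14,  0.333]])        # 4 x 3 weights
b = np.array([[0.12, 3.1, -0.5]])            # 1 x 3 bias
x = np.array([[20., 254., 40., 1.]])         # 1 x 4 input row

W_ext = np.vstack([W, b])                    # 5 x 3: bias appended as the last row
x_ext = np.hstack([x, np.ones((1, 1))])      # 1 x 5: constant 1 appended to the input

print(np.allclose(x @ W + b, x_ext @ W_ext)) # True: both give the same linear scores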
Input image img1 (pretend this image only has 4 pixels): stretch the pixels into a single row.
X: n x m = 1 x 4 -> [20, 254, 40, 1]
1 image, 4 pixels. Each pixel is a feature, so 1 image, 4 features. Pixels range (0, 255).
n: # images = 1, m: # features (pixels per img) = 4, k: # classes = 3 (Cat, Car, Dog)
Weights W: m x k = 4 x 3
0.1    0.25   0.3
0.63   0.12  -0.64
0.26   0.62   0.58
0.99  -0.14   0.333
Bias b: 1 x k = 1 x 3 -> [0.12, 3.1, -0.5]
Linear Scores = xW + b -> Output: 1 x 3 -> Cat 3.2, Car 5.1, Dog -1.7
values from Andrej Karpathy
Initialize weights with values b/w 0 and 1. You can initialize biases to start at 0 or very small values if you like.
Applying softmax to the Linear Scores f(x; W, b):
softmax(s)_k = e^(s_k) / Σ_j e^(s_j), where k indexes a specific class (different from the k on the last slide) and J = # classes.
                                                  Cat     Car     Dog
Linear scores (unnormalized log probabilities):   3.2     5.1    -1.7
Apply exponential (unnormalized probabilities):   24.5    164     0.18
Normalize so sum = 1 (probabilities):             0.13    0.87    0.00
values from Andrej Karpathy
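The same computation as a small NumPy sketch, using the slide's scores:

import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # Cat, Car, Dog linear scores

exp_scores = np.exp(scores)            # unnormalized probabilities
probs = exp_scores / exp_scores.sum()  # normalize so they sum to 1

print(np.round(exp_scores, 2))         # [ 24.53 164.02   0.18]
print(np.round(probs, 2))              # [0.13 0.87 0.  ]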
Process so far: each pixel can be considered a neuron.
Input image (stretch pixels to a single row) -> X: n x m = 1 x 4 = [20, 254, 40, 1] -> Weights (m x k = 4 x 3) and Bias (1 x k = 1 x 3 = [0.12, 3.1, -0.5]) -> Linear Scores (1 x 3): Cat 3.2, Car 5.1, Dog -1.7 -> softmax -> Normalized Probabilities (1 x 3): Cat 0.13, Car 0.87, Dog 0.00.
values from Andrej Karpathy
Process so far: the Forward Pass.
Input image -> X (1 x 4) -> xW + b with Weights (4 x 3) and Bias (1 x 3) -> Linear Scores [3.2, 5.1, -1.7] -> softmax -> Normalized Probabilities [0.13, 0.87, 0.00] (Cat, Car, Dog).
Loss Function: How we learn
Recall:
Your output
neurons
didn’t match
the expected
output.
(The forward pass again, for reference: input image -> X -> xW + b -> Linear Scores -> softmax -> Normalized Probabilities [0.13, 0.87, 0.00].)
Loss Function: How we learn
Normalized Probabilities (1 x 3): Cat 0.13, Car 0.87, Dog 0.00 (values from Andrej Karpathy)
Maximize the log likelihood of the true class, OR minimize the negative log likelihood of the true class (it is easier to build a negative feedback loop than a positive feedback loop).
Use the loss to manipulate the weights of the incorrectly classified inputs.
There are many different types of loss functions. More on this later.
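For example, a tiny sketch of the negative log likelihood using the numbers above (assuming the true class is Cat):

import numpy as np

probs = np.array([0.13, 0.87, 0.00])   # Cat, Car, Dog (softmax output)
true_class = 0                          # assume the true label is Cat

nll = -np.log(probs[true_class])        # negative log likelihood of the true class
print(round(nll, 2))                    # 2.04 -> a large loss, because Cat only got 0.13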
Visualizing a hidden layer
Graph: X -> Linear L1 (W1, b1) -> Linear L2 (W2, b2) -> Softmax
Shape conventions: X is n x m (examples x features), W is m x k (features x classes or hidden units), b is 1 x k (a row vector).
X: 4 x 10, W1: 10 x 100, b1: 1 x 100. xW1 is 4 x 100; since b1 has a dimension of size 1, its values are broadcast across the xW1 product automatically, so L1 is 4 x 100.
W2: 100 x 10, b2: 1 x 10. L2 = L1·W2 + b2 is 4 x 10.
We can add a wide layer by adding columns to W1, and then a skinny layer by giving W2 k columns, so that our output still has the desired shape of examples x classes: 4 x 10.
These layers are hidden because we cannot see their output as we run the graph.
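A minimal NumPy sketch of those shapes (random illustrative values):

import numpy as np

rng = np.random.default_rng(0)
X  = rng.random((4, 10))     # 4 examples x 10 features
W1 = rng.random((10, 100))   # wide hidden layer
b1 = np.zeros((1, 100))
W2 = rng.random((100, 10))   # skinny layer back down to 10 classes
b2 = np.zeros((1, 10))

L1 = X @ W1 + b1             # 4 x 100 (b1 is broadcast across the rows)
L2 = L1 @ W2 + b2            # 4 x 10 -> examples x classes
probs = np.exp(L2) / np.exp(L2).sum(axis=1, keepdims=True)   # row-wise softmax

print(L1.shape, L2.shape, probs.shape)   # (4, 100) (4, 10) (4, 10)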
Neuron
Neurons are not classes or objects; they are values. They are the values that are moving through the pipeline. Follow a pixel of an example image through a network and consider it to be a neuron.
When we implement a neural network we use a graph.
Graph: Training Data Images X -> Linear L1 (W1, b1) -> Softmax S1 -> Probabilities, e.g. [0.13, 0.87, 0.00].
Labels tell you the true class of each image.
Note: Softmax S1 and Sigmoid S1 play the same role here (for two classes the softmax reduces to the sigmoid), so the next slides use the two names interchangeably.
Graph: Training Data Images X -> Linear L1 (W1, b1) -> Sigmoid S1 -> Probabilities (logits) [0.13, 0.87, 0.00].
Labels tell you the true class of each image. One-Hot Training Labels:
'Cat' -> 0 0 1
'Car' -> 0 1 0
'Dog' -> 1 0 0
If we run our network on just one image in the training set and take its corresponding label (one-hot 0 1 0), the output [0.13, 0.87, 0.00] does not match it. Error! It should be 0 1 0.
Graph: Training Data Images X -> Linear L1 (W1, b1) -> Sigmoid S1 -> logits.
Run the network on all training data and training labels. One-Hot Training Labels:
'Cat' -> 0 0 1
'Car' -> 0 1 0
'Dog' -> 1 0 0
Run the network on 3 images; the predicted scores (columns: Cat, Car, Dog) are:
Cat image: 0.13  0.87  0.12
Car image: 0.55  0.91  0.2
Dog image: 0.88  0.66  0.11
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Find accuracy of our training network. More of this later.
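The same accuracy check in plain NumPy, a sketch using the 3-image table above:

import numpy as np

logits = np.array([[0.13, 0.87, 0.12],    # Cat image
                   [0.55, 0.91, 0.20],    # Car image
                   [0.88, 0.66, 0.11]])   # Dog image
labels = np.array([[0, 0, 1],             # 'Cat' one-hot
                   [0, 1, 0],             # 'Car' one-hot
                   [1, 0, 0]])            # 'Dog' one-hot

correct = np.argmax(logits, axis=1) == np.argmax(labels, axis=1)
print(correct)          # [False  True  True] -> the first image is misclassified
print(correct.mean())   # 0.666... accuracy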
Setting up training data
Before: Training Data | Test Data
After:  Training Data | Validation Data | Test Data
Split up your training data into validation data and
training data. Use validation data as test data as
you train and tune your network.
Train Data: 80% of original training data
Validation Data: 20% of original training data
Then shuffle!
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# this splits train / validation 80 / 20
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, train_size=0.80, test_size=0.20)

# this shuffles and keeps label indices intact
X_train, y_train = shuffle(X_train, y_train)
Preprocessing: Normalization
In our examples we used raw pixel values (0, 255) as our inputs to train our network.
In practice, we preprocess this data before running it through our network.
Mean-centered normalization: we subtract the mean pixel value from each pixel and divide by the standard deviation. This gives us a roughly Gaussian, zero-centered distribution of values, mostly in [-1, 1]: (0, 255) => (-1, 1)
Min-max normalization: we subtract the min and divide by the difference (max - min). This gives us a domain of (0, 1): (0, 255) => (0, 1)
img from Wikipedia
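A minimal NumPy sketch of both normalizations (the pixel values are illustrative):

import numpy as np

pixels = np.array([20., 254., 40., 1.])            # raw pixel values in (0, 255)

# min-max normalization -> values in (0, 1)
minmax = (pixels - pixels.min()) / (pixels.max() - pixels.min())

# mean-centered normalization -> roughly zero-centered values
standardized = (pixels - pixels.mean()) / pixels.std()

print(np.round(minmax, 2))         # [0.08 1.   0.15 0.  ]
print(np.round(standardized, 2))   # [-0.58  1.72 -0.38 -0.76]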
Overfitting - Introduction to Hyperparameters
The goal of building an artificial neural network is to generalize.
● We want to apply new data to our network and classify inputs
● If we overtrain / overfit our network to our training data then our accuracy will be
deceiving. It might work very well for training data, but will not work on test data.
● In order to prevent overfitting we implement preprocessing techniques and tune our hyperparameters.
● Tuning hyperparameters is basically all that we can do after we set up our network architecture.
● It should be the last step in setting up your network.
● Test on validation data while you tune (don’t touch the test data).
http://docs.aws.amazon.com/machine-learning/latest/dg/images/mlconcepts_image5.png
Epochs
An Epoch is one forward pass and one backward pass over the entire training set.
It is a hyperparameter, and we must tune the number of epochs to fit our data / increase our accuracy.
The larger the number of epochs, the longer it takes to train.
We increase epochs to increase the number of training passes; if we increase them too much we may overfit. Stop early!
(Diagram: X -> Linear L1 (W1, b1) -> Softmax S1 -> logits, with forward and backward arrows. More on the backward pass to come.)
https://qph.ec.quoracdn.net/main-qimg-d23fbbc85b7d18b4e07b7942ecdfd856?convert_to_webp=true
Minibatch
We don’t feed all of the training data into our network at once. Instead, we choose a batch of examples and feed them in, perform forward and backward propagation on them, and then feed in the next batch.
We do this so we can perform Stochastic Gradient Descent, and to help prevent our network from overfitting.
So in mini-batch gradient descent, you process a small subset of the training data forwards and backwards and update the weights / biases with the gradient update formula (shown on the next page).
Mini batch
We feed only segments of the training data into our neural network at a time.
(Diagram: Training Data split into Batches -> Network.)
● The number of examples in each batch is a hyperparameter.
● This also depends on GPU memory size.
● Typically use 128 or 256.
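A minimal sketch of mini-batching in Python (the 1000-example dataset and batch size of 256 are illustrative; they match the example a few slides ahead):

import numpy as np

def minibatches(X, y, batch_size=256):
    # yield successive (features, labels) batches; the last one may be smaller
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

X = np.random.rand(1000, 4)              # 1000 examples, 4 features
y = np.random.randint(0, 3, size=1000)   # 3 classes

for X_batch, y_batch in minibatches(X, y):
    print(len(X_batch))                  # 256, 256, 256, 232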
Gradient Descent
The “Learning” in Machine Learning.
Update the values of X (punish it) when it is wrong: X -= η∇(x)
X: weights or biases
η: Learning Rate (typically 0.01 to 0.001), the rate at which our network learns. This can change over time with methods such as Adam, Adagrad etc. (hyperparameter)
∇(x): gradient of the loss with respect to X
We seek to update the weights and biases by a value indicating how “off” they were from their target.
The gradient points in the direction of steepest increase of the loss, so we put a negative sign in front of it to go downhill.
Stochastic Gradient Descent
Recall Gradient Descent: X -= η∇(x) (eq 1)
Stochastic Gradient Descent (SGD) is a version of Gradient Descent where on
each forward pass, a batch of data is randomly sampled from the total dataset and
gradient descent is performed on that batch.
The more batches processed by the network, the better the approximation of the full-dataset gradient.
1. Randomly sample a batch of data (1) from the total dataset
2. Run the network forward and backward to calculate the gradient from data (1)
3. Apply the gradient descent update (eq 1)
4. Repeat 1-3 until convergence or epoch limit
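Putting those steps together, a minimal NumPy sketch of SGD; the toy quadratic loss and its gradient are illustrative assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
X_data = rng.random((1000, 4))            # toy dataset
W = rng.random(4)                         # parameters to learn
lr = 0.01                                 # learning rate (eta)

def gradient(W, batch):
    # gradient of a toy loss mean((batch @ W - 1)^2) with respect to W
    return 2 * batch.T @ (batch @ W - 1) / len(batch)

for step in range(100):
    idx = rng.choice(len(X_data), size=256, replace=False)  # 1. random batch
    grad = gradient(W, X_data[idx])                         # 2. forward/backward on the batch
    W -= lr * grad                                          # 3. X -= eta * grad(X)  (eq 1)
print(W)                                                    # 4. repeat until convergence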
Visualizing Batch and SGD
If we start out with 1000 training images and use a batch size of 256, we get batches of 256, 256, 256, and a final batch of 232 images.
Stochastic Gradient Descent sample sizes: maybe take ~5 images from the 256-image batch at a time and run SGD on them, then go back and select 5 more.
X -= η∇(x), applied for each image in the SGD sample.
Backpropagation
We need to figure out how to alter a parameter to minimize the cost (loss). First we must find out what
effect that parameter has on the cost.
(we can’t just blindly change parameter values and hope that our network converges)
The gradient tells us the effect each parameter has on the cost.
How to determine the effect of a parameter on the cost?
We use Backpropagation - which is an application of the chain rule from calculus
Did somebody
say Chain
Rule?
Backpropagation
Derivative Review: for a composed function f(g(x)), in order to know the effect x has on f, we must first find the effect g has on f, then the effect x has on g, and multiply them (the chain rule):
df/dx = (df/dg) · (dg/dx)
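A tiny numeric sketch of that rule (the functions are chosen just for illustration):

import numpy as np

# f(g(x)) with g(x) = 3x and f(g) = g**2, so df/dx = (2g) * 3 = 18x
def g(x): return 3 * x
def f(v): return v ** 2

x = 2.0
analytic = (2 * g(x)) * 3                               # df/dg * dg/dx (chain rule)
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)   # central-difference check
print(analytic, round(numeric, 3))                      # 36.0 36.0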
Backpropagation
You want to stage backpropagation at each gate level, locally. This is much easier to implement than storing each weight value and trying to compute everything at the end. Simply add up the gradients along an individual neuron's path.
Andrej Karpathy
More Backpropagation
(Diagram: a gate f with inputs X and Y and output Z.)
The change in the Loss L with respect to X is the change in L with respect to Z times the change in Z with respect to X:
∂L/∂X = (∂L/∂Z) · (∂Z/∂X)
(Diagram: X, W1, b1 -> Linear L1 -> S1. It is called S1 because it goes to the sigmoid; S = xW + b. We want the Loss with respect to X.)
This comes together on the next slide!
(Diagram: X, W1, b1 -> Linear L1 -> Sigmoid S1 -> Any Gate -> Output. S1 because it goes to the sigmoid; S = xW + b.)
X has a relationship to L1, and S1 has a relationship to L1. We can use that relationship in an application of the chain rule to compute the change in the Loss at L1 with respect to X, then perform a gradient descent update on X:
∂Loss/∂X = (accumulated gradient up to the L1 gate) · (∂L1/∂X)
X -= η · ∂Loss/∂X (the gradient descent update on X)
The accumulated gradient is the accumulator of all the gradients up to the L1 gate (the sum of all gradients along the path from the output), aka the Accumulated Loss.
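A minimal NumPy sketch of backprop through this linear -> sigmoid pair, checked against a numerical derivative (the values and toy loss are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])        # one flattened example
W = np.array([[0.1], [0.4], [-0.2]])
b = np.array([0.05])

# forward pass: Linear L1 -> Sigmoid S1 -> (toy loss = sum of the sigmoid output)
L1 = x @ W + b
S1 = sigmoid(L1)
loss = S1.sum()

# backward pass: multiply the local gradients along the path (chain rule)
dloss_dS1 = np.ones_like(S1)          # upstream (accumulated) gradient
dS1_dL1 = S1 * (1 - S1)               # local gradient of the sigmoid gate
dL1_dx = W[:, 0]                      # local gradient of the linear gate w.r.t. x
dloss_dL1 = dloss_dS1 * dS1_dL1       # gradient arriving at the linear gate
dloss_dx = dloss_dL1[0] * dL1_dx      # accumulated gradient at x

# numerical check on the first component of x
eps = 1e-6
x_eps = x.copy(); x_eps[0] += eps
num = (sigmoid(x_eps @ W + b).sum() - loss) / eps
print(round(float(dloss_dx[0]), 6), round(float(num), 6))   # both ~0.0222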
Backpropagation cont
Andrej Karpathy
Run the network on all training data and training labels, exactly as before: Training Data Images X -> Linear L1 (W1, b1) -> Sigmoid S1 -> logits. For the 3-image example above (Cat/Car/Dog scores vs. one-hot labels), the predicted scores and the labels are then fed into the Cross Entropy.
Cross Entropy (a distance)
Pipeline: Input X -> Linear (Wx + b) -> Logits y = [2.0, 1.0, 0.1] -> Softmax S(Y) = [0.7, 0.2, 0.1] -> compared against the one-hot Labels L = [1.0, 0.0, 0.0] with the Cross Entropy D(S, L) = -Σ_i L_i · log(S_i).
Cross entropy tells us how accurate we are. Minimize the cross entropy:
● Want a high distance for an incorrect class
● Want a low distance for the correct class
● Training loss = average cross entropy over the entire training set
● Want all the distances to be small
● Want the loss to be small
● So we attempt to minimize this function.
(Plot: Training Loss as a function of weight 1 and weight 2. src: Udacity)
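A one-line sketch of D(S, L) with the slide's numbers:

import numpy as np

S = np.array([0.7, 0.2, 0.1])     # softmax output S(Y)
L = np.array([1.0, 0.0, 0.0])     # one-hot labels

D = -np.sum(L * np.log(S))        # cross entropy D(S, L)
print(round(D, 3))                # 0.357 -> only the true class's log-prob contributes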
Cross Entropy Loss (continued)
(Plot: Training Loss as a function of weight 1 and weight 2. src: Udacity)
We want to find the weights that make this loss the smallest; this turns the ML problem into numerical optimization.
Training Loss: the average cross entropy over the entire training set. Minimize this function.
● Take the derivative of the Loss with respect to the parameters and follow the derivative by taking a step in the downhill direction.
● Repeat until you get to the bottom.
● In this case we have 2 parameters (w1, w2); typically we have millions of parameters.
cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))
Installing Dependencies
You can use pip3 or pip. I recommend using an anaconda environment with python3:
https://www.continuum.io/downloads to Download Anaconda, (get Python 3.4+ version)
conda create --name=IntroToTensorFlow python=3 anaconda
source activate IntroToTensorFlow (Your conda environment is named “IntroToTensorFlow”)
conda install -c anaconda numpy=1.11.3
conda install -c conda-forge matplotlib=2.0.0
conda install -c anaconda scipy=0.18.1
conda install scikit-learn
or: pip install -U scikit-learn
conda install -c conda-forge tensorflow
conda install -c menpo opencv3=3.2.0
jupyter notebook (to run in browser)
git clone https://github.com/JonathanCMitchell/TensorFlowLab.git
Installing TensorFlow
Recommended: Python 3.4 or higher and Anaconda
Install TensorFlow
conda create --name=IntroToTensorFlow python=3 anaconda
source activate IntroToTensorFlow
conda install -c conda-forge tensorflow
docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow (Docker if you need it)
# Hello World!
import tensorflow as tf

# create a tensorflow constant tensor
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)
git clone https://github.com/JonathanCMitchell/TensorFlowLab.git
If you have questions here is my info:
Jonathan Mitchell
github.com/jonathancmitchell
linkedin.com/in/jonathancmitchell
jmitchell1991@gmail.com
Self Driving Cars Los Angeles
https://www.meetup.com/Los-Angeles-Self-Driving-Car-Meetup/
Thank you!
