Intro to Neural Networks
Eng. Abdallah Bashir
Session topics:
1. Introduction to Neural Networks.
2. Neural Networks Basics.
3. Shallow neural networks.
4. Deep Neural Networks.
1. Introduction to Neural Networks
1.1 What is a neuron?
• The input is the size of the house (x).
• The output is the price (y).
• It is a linear regression problem because the price as a function
of size is a continuous output.
• We know prices can never be negative, so we use a function called
the Rectified Linear Unit (ReLU), which starts at zero.
• Single neuron = linear regression
1.2 Neural Network Architecture
• The price of a house can be affected by
other features such as size, number of
bedrooms, zip code and wealth.
• The role of the neural network is to
predict the price, and it will automatically
generate the hidden units. We only need to
give the inputs x and the output y.
A NN is organized into an input layer, hidden layers, and an output layer.
Each Input will be connected to the
hidden layer and the NN will decide
the connections.
Supervised learning means we have
the (X,Y) and we need to get the
function that maps X to Y.
1.3 Supervised Learning with Neural Networks
Different types of neural networks for supervised learning include:
• Standard NN (Useful for Structured data)
• CNN or convolutional neural networks (Useful in computer vision)
• RNN or Recurrent neural networks (Useful in Speech recognition or NLP)
• Hybrid/custom NNs, or a collection of NN types
1.4 Structured vs Unstructured Data
• Structured data is data such as databases and tables.
• Unstructured data is data such as images, video, audio, and text.
1.5 Why is deep learning taking off?
Deep learning is taking off for 3 reasons:
1. Data
•For small datasets, a NN can perform like linear regression
or an SVM (support vector machine).
•For big data, a small NN is better than an SVM.
•For big data, a big NN is better than a medium NN, which is
better than a small NN.
2. Computation:
•GPUs.
•Powerful CPUs.
•Distributed computing.
3. Algorithms:
Creative algorithms have
appeared that changed
the way NNs work.
2. Neural Networks Basics
2.1 Binary Classification
In a binary classification problem, the
result is a discrete value output.
For example:
• account hacked (1) or not hacked (0)
•Object is a cat (1) or no cat (0)
Example: Cat vs Non-Cat
The goal is to train a classifier whose input is an image
represented by a feature vector x, and which predicts whether the
corresponding label y is 1 or 0; in this case, whether this is a cat
image (1) or a non-cat image (0).
The value in a cell represents the pixel
intensity, which will be used to create a
feature vector of dimension n. In pattern
recognition and machine learning, a
feature vector represents an object, in
this case, a cat or no cat.
To create the feature vector x, the pixel
intensity values are “unrolled” or
“reshaped” for each color channel. The dimension
of the input feature vector x is Nx = 64 × 64 × 3 = 12,288.
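As a minimal sketch of this unrolling step (the image array below is a made-up example, not data from the slides), the reshape looks like this in NumPy:

import numpy as np

# Hypothetical 64x64 RGB image with integer pixel intensities.
image = np.random.randint(0, 256, size=(64, 64, 3))

# Unroll ("reshape") the pixel values into a single column vector x.
x = image.reshape(-1, 1)   # shape (12288, 1), since 64 * 64 * 3 = 12288
print(x.shape)             # (12288, 1)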
2.1.1 Neural Network Notations
Here are some of the notations:
• m is the number of examples in the dataset.
• Nx is the size of the input vector.
• Ny is the size of the output vector.
• x(1) is the first input vector.
• y(1) is the first output vector.
• X = [x(1) x(2) … x(m)], a matrix of shape (Nx, m).
• Y = [y(1) y(2) … y(m)], a matrix of shape (Ny, m).
• L is the number of layers.
2.2 Logistic Regression
Logistic regression is a learning algorithm used in a supervised learning
problem when the output labels y are all either zero or one. The goal of
logistic regression is to minimize the error between its predictions and the
training data.
Example: Cat vs Non-Cat
Given an image represented by a feature vector x, the algorithm will
evaluate the probability of a cat being in that image.
The parameters used in logistic regression are the weights w (a vector of
size Nx) and the bias b; the prediction is y' = σ(wᵀx + b), where σ is the
sigmoid function.
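A minimal NumPy sketch of how this prediction could be computed (the values of w, b, and x below are placeholders, not data from the slides):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Nx = 12288                            # size of the input feature vector
w = np.zeros((Nx, 1))                 # weight vector (illustrative initial values)
b = 0.0                               # bias

x = np.random.rand(Nx, 1)             # one hypothetical unrolled input image
y_hat = sigmoid(np.dot(w.T, x) + b)   # predicted probability that y = 1
print(y_hat.shape)                    # (1, 1)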
2.2.1 Cost Function
To train the parameters w and b we need to define a cost function.
Loss function:
The loss function measures the discrepancy between the prediction y'
and the desired output y:
L(y', y) = -(y * log(y') + (1 - y) * log(1 - y'))
To explain this loss function, consider the two cases:
• if y = 1 ==> L(y', 1) = -log(y') ==> we want y' to be as large (close to 1) as possible.
• if y = 0 ==> L(y', 0) = -log(1 - y') ==> we want y' to be as small (close to 0) as possible.
• Then the cost function will be:
J(w, b) = (1/m) * Σ L(y'(i), y(i)), summed over i = 1, ..., m
• The loss function computes the error for a single training
example.
• The cost function is the average of the loss functions over the
entire training set.
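A hedged NumPy sketch of this loss/cost relationship (the function name compute_cost is illustrative, not from the slides):

import numpy as np

def compute_cost(A, Y):
    # A: predictions of shape (1, m), values strictly between 0 and 1.
    # Y: labels of shape (1, m), values 0 or 1.
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # loss per example
    return np.sum(losses) / m                            # cost = average loss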
2.2.2 Gradient Descent
• The goal is to find w, b that minimize the cost function J(w, b).
• First we initialize w and b to 0,0, or initialize them to
random values, and then try to improve
the values step by step.
• In logistic regression people usually use 0,0 instead of
random values.
•The gradient descent algorithm repeats:
• w = w - alpha * dw, where alpha is the
learning rate and dw is the derivative of the cost
with respect to w (the change to apply to w); the derivative is also the
slope of the cost in the w direction.
• w = w - alpha * dJ(w,b)/dw (how much
the function slopes in the w direction)
• b = b - alpha * dJ(w,b)/db (how much
the function slopes in the b direction)
[Figure: gradient descent on the cost function J(w), stepping downhill toward the minimum.]
Computing derivatives (computation graph example):
u = bc,  v = a + u,  J = 3v
With a = 5, b = 3, c = 2: u = 6, v = 11, J = 33.
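A small Python check of this computation graph, including the derivatives obtained by the chain rule (this is the same backward reasoning that backpropagation automates on a larger scale):

# Forward pass through the graph: u = b*c, v = a + u, J = 3*v.
a, b, c = 5, 3, 2
u = b * c        # 6
v = a + u        # 11
J = 3 * v        # 33

# Backward pass: derivatives of J via the chain rule.
dJ_dv = 3                 # J = 3v
dJ_da = dJ_dv * 1         # dv/da = 1  -> 3
dJ_du = dJ_dv * 1         # dv/du = 1  -> 3
dJ_db = dJ_du * c         # du/db = c  -> 6
dJ_dc = dJ_du * b         # du/dc = b  -> 9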
2.2.3 Vectorizing Logistic Regression
• As input we have a matrix X of shape (Nx, m) and a matrix Y of shape
(Ny, m).
• We then compute, in one step, [z(1), z(2), ..., z(m)] = wᵀX + [b, b, ..., b].
This can be written in Python as:
Z = np.dot(w.T, X) + b       # Z shape is (1, m)
A = 1 / (1 + np.exp(-Z))     # A shape is (1, m)
Vectorizing Logistic Regression's Gradient Output:
• dz = A - Y # dz shape is (1, m)
• dw = np.dot(X, dz.T) / m #dw shape is (Nx, 1)
• db = dz.sum() / m # db shape is (1, 1)
Side Notes
The main steps for building a Neural Network
are:
•Define the model structure (such as number of
input features and outputs)
•Initialize the model's parameters.
•Loop.
• Calculate current loss (forward propagation)
• Calculate current gradient (backward propagation)
• Update parameters (gradient descent)
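Putting the steps above together for logistic regression, a minimal NumPy sketch of the loop could look like this (the function name and default hyperparameter values are illustrative assumptions, not the course's exact code):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, Y, num_iterations=1000, learning_rate=0.01):
    # X: inputs of shape (Nx, m); Y: labels of shape (1, m).
    Nx, m = X.shape
    w = np.zeros((Nx, 1))           # initialize the parameters
    b = 0.0
    for _ in range(num_iterations):
        # Forward propagation: current predictions and cost
        Z = np.dot(w.T, X) + b      # (1, m)
        A = sigmoid(Z)              # (1, m)
        cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m  # can be printed to monitor training
        # Backward propagation: current gradients
        dZ = A - Y                  # (1, m)
        dw = np.dot(X, dZ.T) / m    # (Nx, 1)
        db = np.sum(dZ) / m
        # Gradient-descent update
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w, b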
Side Notes
•Preprocessing the dataset is important.
Tuning the learning rate (which is an example of a "hyperparameter")
can make a big difference to the algorithm.
kaggle.com is a good place for datasets and competitions.
3. Shallow
Neural Networks
3.1 Neural Networks Overview
• In logistic regression we had, for an input x with parameters w and b:
z = wᵀx + b,  a = σ(z),  ℒ(a, y)
• In a neural network with one hidden layer we will have:
z[1] = W[1]x + b[1],  a[1] = σ(z[1])
z[2] = W[2]a[1] + b[2],  a[2] = σ(z[2]),  ℒ(a[2], y)
3.2 Shallow Neural Network Representation
• We will define a neural network that has one hidden layer.
• A NN consists of an input layer, hidden layers, and an output layer.
• "Hidden" means we can't see the values of those layers in the training set.
• a[0] = x (the input layer).
• a[1] will represent the activations of the hidden neurons.
• a[2] will represent the output layer.
• We call this a 2-layer NN; the input layer isn't counted.
3.3 Forward Propagation
Stacking the m training examples as columns:
X = [x(1) x(2) … x(m)]
A[1] = [a[1](1) a[1](2) … a[1](m)]
Z[1] = W[1]X + b[1]
A[1] = σ(Z[1])
Z[2] = W[2]A[1] + b[2]
A[2] = σ(Z[2])
Here is some information about the last image:
1) Nh = 4
2) Nx = 3
3) Shapes of the variables:
I. W1 is the weight matrix of the first hidden layer; it has a shape of
(noOfHiddenNeurons, nx)
II. b1 is the bias vector of the first hidden layer; it has a shape of
(noOfHiddenNeurons, 1)
III. z1 is the result of the equation z1 = W1*x + b1; it has a shape of
(noOfHiddenNeurons, 1)
IV. a1 is the result of the equation a1 = sigmoid(z1); it has a shape of
(noOfHiddenNeurons, 1)
V. W2 is the weight matrix of the second layer; it has a shape of (1, noOfHiddenNeurons)
VI. b2 is the bias of the second layer; it has a shape of (1, 1)
VII. z2 is the result of the equation z2 = W2*a1 + b2; it has a shape of (1, 1)
VIII. a2 is the result of the equation a2 = sigmoid(z2); it has a shape of (1, 1)
•Pseudocode for forward propagation for the 2-layer
NN. Let's say we have X of shape (Nx, m):
Z1 = np.dot(W1, X) + b1    # shape of Z1 is (noOfHiddenNeurons, m)
A1 = sigmoid(Z1)           # shape of A1 is (noOfHiddenNeurons, m)
Z2 = np.dot(W2, A1) + b2   # shape of Z2 is (1, m)
A2 = sigmoid(Z2)           # shape of A2 is (1, m)
3.4 Activation Functions
• In computational networks, the activation function of a node defines
the output of that node given an input or set of inputs. A standard
computer chip circuit can be seen as a digital network of activation
functions that can be "ON" (1) or "OFF" (0)
• So far we are using sigmoid, but in some cases other functions can be
a lot better.
• Sigmoid can lead to a slow gradient descent problem, where the updates
become very small.
• The sigmoid activation function's range is [0, 1]: A = 1 / (1 + np.exp(-z)),
where z is the input matrix.
• The tanh activation function's range is [-1, 1] (a shifted and scaled
version of the sigmoid function).
• It turns out that the tanh activation usually works better than the
sigmoid activation function for hidden units.
• A disadvantage of the sigmoid and tanh functions is that if the input is
very small or very large, the slope will be near zero, which causes
the slow gradient descent problem.
• One of the popular activation functions that solved the slow
gradient descent problem is the ReLU function: ReLU = max(0, z), so if z
is negative the slope is 0, and if z is positive the slope is 1.
• A basic rule for choosing activation functions: if
your classification output is between 0 and 1, use sigmoid for the output
activation and ReLU for the others (sketched in code below).
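The activation functions discussed above can be written in NumPy roughly as follows (a sketch, not the course's exact code):

import numpy as np

# Common activation functions; z can be a scalar or a NumPy array.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # range (0, 1); typical choice for the output layer

def tanh(z):
    return np.tanh(z)              # range (-1, 1); usually better than sigmoid for hidden units

def relu(z):
    return np.maximum(0, z)        # slope 0 for z < 0, slope 1 for z > 0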
Side Notes
• In a NN you have to make many choices, such as:
• Number of hidden layers.
• Number of neurons in each hidden layer.
• Learning rate (the most important parameter).
• Activation functions.
• And others.
3.5 Backpropagation
• This is where all the magic happens!
NN parameters:
o n[0] = Nx
o n[1] = NoOfHiddenNeurons
o n[2] = NoOfOutputNeurons = 1
o W1 shape is (n[1],n[0])
o b1 shape is (n[1],1)
o W2 shape is (n[2],n[1])
o b2 shape is (n[2],1)
Then Gradient descent:
Repeat:
Compute predictions (y'[i], i = 0,...m)
Get derivatives: dW1, db1, dW2, db2
Update: W1 = W1 - LearningRate * dW1
b1 = b1 - LearningRate * db1
W2 = W2 - LearningRate * dW2
b2 = b2 - LearningRate * db2
Forward propagation:
oZ1 = W1A0 + b1 # A0 is X
oA1 = g1(Z1)
oZ2 = W2A1 + b2
oA2 = Sigmoid(Z2) # Sigmoid because the output is between 0 and 1
Backpropagation:
o dZ2 = A2 - Y
o dW2 = (dZ2 * A1.T) / m
o db2 = Sum(dZ2) / m
o dZ1 = (W2.T * dZ2) * g[1]'(Z1)  # the second * is an element-wise product
o dW1 = (dZ1 * A0.T) / m  # A0 = X
o db1 = Sum(dZ1) / m
(A runnable sketch of one full iteration is given below.)
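Under the assumption that g[1] is tanh for the hidden layer (so g[1]'(Z1) = 1 - A1²), one full forward/backward/update iteration could be sketched in NumPy like this (function and variable names are illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def one_iteration(X, Y, W1, b1, W2, b2, learning_rate=0.01):
    # X: (Nx, m) inputs; Y: (1, m) labels; shapes of W1, b1, W2, b2 as listed above.
    m = X.shape[1]
    # Forward propagation
    Z1 = np.dot(W1, X) + b1                       # (n1, m)
    A1 = np.tanh(Z1)                              # (n1, m)
    Z2 = np.dot(W2, A1) + b2                      # (1, m)
    A2 = sigmoid(Z2)                              # (1, m)
    # Backward propagation
    dZ2 = A2 - Y                                  # (1, m)
    dW2 = np.dot(dZ2, A1.T) / m                   # (1, n1)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m  # (1, 1)
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)       # tanh'(Z1) = 1 - A1^2
    dW1 = np.dot(dZ1, X.T) / m                    # (n1, Nx)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m  # (n1, 1)
    # Gradient-descent update
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    return W1, b1, W2, b2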
3.6 Random Initialization
• In logistic regression it wasn't important to initialize the
weights randomly, but in a NN we have to initialize them
randomly.
• If we initialize all the weights with zeros in a NN, it won't
work (initializing the bias with zeros is OK):
• All hidden units will be completely identical (symmetric) and
compute exactly the same function.
• On each gradient descent iteration, all the hidden units will
update in exactly the same way.
• To solve this we initialize the W's with small random numbers:
• W1 = np.random.randn(2, 2) * 0.01  # multiply by 0.01 to keep the weights small
• b1 = np.zeros((2, 1))  # it's OK to have b as zeros; it doesn't affect
the symmetry breaking
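Following this rule for both layers of the shallow network (with the shapes listed in the backpropagation section), a sketch of the initialization could be (the function name and arguments are assumptions):

import numpy as np

def initialize_parameters(n_x, n_h, n_y=1):
    W1 = np.random.randn(n_h, n_x) * 0.01   # small random weights, shape (n[1], n[0])
    b1 = np.zeros((n_h, 1))                 # zero bias, shape (n[1], 1)
    W2 = np.random.randn(n_y, n_h) * 0.01   # shape (n[2], n[1])
    b2 = np.zeros((n_y, 1))                 # shape (n[2], 1)
    return W1, b1, W2, b2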
4. Deep Neural
Networks
4.1 Deep L-layer neural network
•Shallow NN is a NN with one or two layers.
•Deep NN is a NN with three or more layers.
•We will use the notation L to denote the number
of layers in a NN.
•n[l] is the number of neurons in a specific layer l.
•n[0] denotes the number of neurons in the input layer,
and n[L] denotes the number of neurons in the output
layer.
•g[l] is the activation function of layer l.
4.2 Forward Propagation in a Deep Network
The forward propagation general rule for m inputs is:
•Z[l] = W[l]A[l-1] + b[l]
•A[l] = g[l](Z[l])
4.2.1 Matrix Dimensions
•The dimension of W[l] is (n[l], n[l-1]); it can be read
right to left (from layer l-1 to layer l).
•The dimension of b[l] is (n[l], 1).
•dW has the same shape as W, and db has the
same shape as b.
•The dimensions of Z[l], A[l], dZ[l], and dA[l] are (n[l], m).
(A sketch of the forward-propagation loop with these shapes follows below.)
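A hedged sketch of the general rule as a loop over the layers (the parameters dict, the function names, and the ReLU/sigmoid choice for hidden/output layers are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def deep_forward(X, parameters, L):
    # parameters holds W1..WL and b1..bL, where W{l} has shape (n[l], n[l-1])
    # and b{l} has shape (n[l], 1). X has shape (n[0], m).
    A = X                                         # A[0] = X
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]
        b = parameters["b" + str(l)]
        Z = np.dot(W, A) + b                      # Z[l], shape (n[l], m)
        A = sigmoid(Z) if l == L else relu(Z)     # A[l] = g[l](Z[l])
    return A                                      # A[L], shape (n[L], m)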
4.3 Intuition about deep representation
• Deep networks build up their representation layer by layer.
• Face recognition application: Image ==> Edges ==> Face parts ==> Faces ==> desired face.
• Audio recognition application: Audio ==> Low-level sound features (like "sss", "bb") ==> Phonemes ==> Words ==> Sentences.
4.4 Parameters vs Hyperparameters
• The main parameters of the NN are W and b.
• Hyperparameters (parameters that control the algorithm) include:
• Learning rate.
• Number of iterations.
• Number of hidden layers L.
• Number of hidden units n.
• Choice of activation functions.
• You have to try out hyperparameter values yourself.
4.5 NN and the Human Brain!
•The analogy "it is like the brain" has become a really
oversimplified explanation.
•There is a very simplistic analogy between a single
logistic unit and a single neuron in the brain.
•No human today understands exactly how a neuron in the
human brain works.
•No human today knows exactly how many neurons are in
the brain.
