Speech Processing with deep learning

Introduction to Speech Recognition
with Deep Learning
Speech Processing
Prof. Mahmoud Gad Allah

"Computers are able to see, hear
and learn. Welcome to the Future"
- Dave Waters

Speech recognition, or speech-to-text, is the ability of a machine or program to
identify words spoken aloud and convert them into readable text.

Speech Recognition with Deep Learning
Automatic speech recognition (ASR) is the task of recognizing
human speech and transcribing it into text.
• The field has received a great deal of attention over the last few decades. It is an important
research area for human-to-machine communication.
• Early methods relied on manual feature extraction and conventional techniques
such as Gaussian Mixture Models (GMMs), the Dynamic Time Warping
(DTW) algorithm and Hidden Markov Models (HMMs).
• More recently, neural networks such as recurrent neural networks (RNNs),
convolutional neural networks (CNNs) and, in the last few years, Transformers have been
applied to ASR and have achieved strong performance.

What is Deep Learning?
Deep learning is a subset of machine learning methods. Machine learning is the field dedicated to
the study and development of machines that can learn (sometimes with the goal of
eventually attaining general artificial intelligence).

Why Call it “Deep Learning”?
Why Not Just “Artificial Neural Networks”?
• Geoffrey Hinton is a pioneer in the field of artificial neural networks and co-authored
one of the early, influential papers on the backpropagation algorithm for training multilayer
perceptron networks.
• He may have introduced the term “deep” to describe the
development of large artificial neural networks.
• In deep learning, the model is a deep neural network: one with many hidden layers, often with many nodes
in each hidden layer.

Speech Recognition with Deep Learning
The overall flow of ASR can be represented as shown below:
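
As a rough illustration of the front end of such a flow, the sketch below assumes the waveform is first cut into overlapping frames and turned into a magnitude spectrogram, whose columns are the "slices" a network would consume. The function name `magnitude_spectrogram`, the frame length, the hop size and the synthetic test tone are all illustrative assumptions, not values from the slides, and plain Python is used only to keep the example self-contained.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_len=64, hop=32):
    """Return one magnitude spectrum per overlapping frame of the waveform.

    frame_len and hop are illustrative defaults (assumptions).
    """
    # Hann window: tapers the frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive discrete Fourier transform, non-negative frequency bins only.
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Synthetic input (an assumption for the demo): a pure tone that falls on bin 8.
sample_rate = 1024
tone_hz = 128  # 128 * 64 / 1024 = bin 8
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = magnitude_spectrogram(signal)
# Index of the strongest frequency bin in every frame.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

Each inner list in `spec` is one time slice. A real system would normally use an FFT from a numerical library and often a mel-scaled spectrogram instead of this naive transform.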

Deep Learning Architecture
There are many variations of deep learning architecture for ASR. Two commonly used
approaches are:
• A CNN (Convolutional Neural Network) plus RNN-based (Recurrent Neural Network)
architecture that uses the CTC (Connectionist Temporal Classification) loss to demarcate each
character of the words in the speech, e.g. Baidu’s Deep Speech model.
• An RNN-based sequence-to-sequence network that treats each ‘slice’ of the
spectrogram as one element in a sequence, e.g. Google’s Listen Attend Spell (LAS)
model.
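
To make the idea of "demarcating" characters more concrete, here is a minimal sketch of greedy CTC decoding: the network is assumed to output one probability vector per frame over the alphabet plus a special blank symbol, repeated symbols are merged, and blanks are then dropped. This shows only the decoding rule, not the CTC loss itself, and it is not claimed to be the decoder used by Deep Speech. The function name, the toy alphabet and the one-hot frames are assumptions made for illustration.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Collapse per-frame predictions into text: merge repeats, then drop blanks."""
    # Most likely symbol index for every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = None
    for idx in best_path:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)


def one_hot(index, size):
    """Toy stand-in for a network output that is certain about one symbol."""
    return [1.0 if i == index else 0.0 for i in range(size)]


alphabet = ["-", "a", "c", "t"]  # index 0 is the blank symbol (hypothetical alphabet)

# Frames "c c - a a t t": repeats merge and the blank is dropped.
word_one = ctc_greedy_decode([one_hot(i, 4) for i in [2, 2, 0, 1, 1, 3, 3]], alphabet)

# Frames "a t - t": the blank between the two t's keeps them as separate letters.
word_two = ctc_greedy_decode([one_hot(i, 4) for i in [1, 3, 0, 3]], alphabet)
```

The blank symbol is what lets the model output a doubled letter: without it, consecutive identical frames would always merge into one character.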

Deep Learning Architecture
Does a CNN perform better than an RNN in speech recognition?
CNNs tend to do better in image processing and in speech emotion recognition (and speech
recognition generally) because they apply filters and max pooling. In image processing this
eliminates lighter pixels (compared to darker ones); in speech recognition it eliminates noise
and ... (compared to the voice phonemes). Accordingly, CNNs work by reducing an
image or speech signal to its key features and using the combined probabilities of the identified
features appearing together to determine a classification. RNNs, by contrast, deal only with
sequential data and perform no such elimination or filtering, which is why they fall behind
CNNs in the tasks above. A brief description of CNNs follows; the links under References give more detail.

Convolutional Neural Network (ConvNet/CNN)
A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm which can take in an input image,
assign importance (learnable weights and biases) to various aspects/objects in the image and be able to
differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other
classification algorithms. While in primitive methods filters are hand-engineered, with enough training, ConvNets
have the ability to learn these filters/characteristics.

The architecture of a ConvNet is analogous to that of the connectivity pattern of Neurons in the Human Brain
and was inspired by the organization of the Visual Cortex. Individual neurons respond to stimuli only in a
restricted region of the visual field known as the Receptive Field. A collection of such fields overlap to cover the
entire visual area.

Convolutional neural networks are a sub-category of neural networks and therefore have all the
characteristics of neural networks. However, CNNs are specifically designed to process input images, so their
architecture is more specific: it is composed of two main blocks.

The first block is what distinguishes this type of neural network, since it functions as a feature extractor. To
do this, it performs template matching by applying convolution filtering operations. The first layer filters the
image with several convolution kernels and returns “feature maps”, which are then normalized (with an
activation function) and/or resized.
This process can be repeated several times: we filter the feature maps obtained with new kernels, which gives
us new feature maps to normalize and resize, and we can filter again, and so on. Finally, the values of the last
feature maps are concatenated into a vector. This vector defines the output of the first block and the input of the
second.

The second block is not specific to CNNs: it is found at the end of any neural network used for
classification. The input vector values are transformed (with several linear combinations and activation functions)
to return a new output vector. This last vector contains as many elements as there are classes: element i
represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and the
elements sum to 1. These probabilities are calculated by the last layer of this block (and therefore of the
network), which uses a logistic function (binary classification) or a softmax function (multi-class classification) as
an activation function.
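
The two output activations mentioned above can be sketched as follows. The function names and the example scores are illustrative assumptions; the formulas themselves are the standard logistic and softmax definitions.

```python
import math


def logistic(z):
    """Map a single score to a probability in (0, 1) for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Map a list of class scores to probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The class with the highest raw score also receives the highest probability, so softmax changes the scale of the outputs but not their ranking.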

The different layers of a CNN
1. The convolutional layer
• The convolutional layer is the key component of convolutional neural networks, and is always at least
their first layer.
Its purpose is to detect the presence of a set of features in the images received as input. This is done by
convolution filtering: the principle is to “drag” a window representing the feature across the image, and to
calculate the convolution product between the feature and each portion of the scanned image. A feature is
then seen as a filter: the two terms are equivalent in this context.
• The convolutional layer thus receives several images as input, and calculates the convolution of each of
them with each filter. The filters correspond exactly to the features we want to find in the images.
For each pair (image, filter) we get a feature map, which tells us where the features are in the image: the
higher the value, the more the corresponding place in the image resembles the feature.
• Unlike traditional methods, features are not pre-defined according to a particular formalism (for
example SIFT), but learned by the network during the training phase. The filter kernels are the
convolution layer weights.
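
The "dragging a window" operation described above can be sketched in a few lines. This is a minimal version assuming a single-channel image, no padding and a stride of 1. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation. The function name, the toy image and the hand-written edge kernel are assumptions for illustration; in a real CNN the kernel values would be learned.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and return the resulting feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch currently under the kernel.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        feature_map.append(row)
    return feature_map


# Toy image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-written kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, kernel)
```

The feature map is largest in the middle column, which is exactly where the edge sits in the image.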

2. The pooling layer
• This type of layer is often placed between two convolution layers: it receives several feature maps and
applies the pooling operation to each of them.
The pooling operation consists of reducing the size of the images while preserving their important
characteristics.
To do this, we cut the image into regular cells, then we keep the maximum value within each cell. In practice,
small square cells are often used to avoid losing too much information. The most common choices are 2x2
adjacent cells that don’t overlap, or 3x3 cells, separated from each other by a step of 2 pixels
(thus overlapping).
• The pooling layer reduces the number of parameters and calculations in the network. This improves the
efficiency of the network and helps limit overfitting.
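
Both pooling choices mentioned above (2x2 cells without overlap, and 3x3 cells with a step of 2) can be tried with the short sketch below. The function name and the toy feature maps are assumptions for illustration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum value inside each size x size cell, moving by `stride`."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        pooled.append(row)
    return pooled


# 2x2 cells, step 2: adjacent cells that do not overlap.
fm4 = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled_2x2 = max_pool2d(fm4, size=2, stride=2)

# 3x3 cells, step 2: neighbouring cells share one row or column.
fm5 = [
    [1, 0, 2, 0, 1],
    [0, 3, 0, 1, 0],
    [2, 0, 4, 0, 2],
    [0, 1, 0, 5, 0],
    [1, 0, 2, 0, 6],
]
pooled_3x3 = max_pool2d(fm5, size=3, stride=2)
```

In the overlapping case, the central value 4 falls inside all four cells, so it shows up in three of the four outputs.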

3. The ReLU correction layer
• ReLU (Rectified Linear Unit) refers to the real non-linear function defined by ReLU(x) = max(0, x). Its graph
is flat at zero for negative inputs and follows the line y = x for positive inputs.
• The ReLU correction layer replaces all negative values received as inputs by zeros. It acts as
an activation function.
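
Applied to a feature map, the ReLU layer is an element-wise operation, as in this minimal sketch (the function names and the input values are illustrative assumptions):

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def relu_layer(feature_map):
    """Replace every negative value in a 2D feature map with zero."""
    return [[relu(value) for value in row] for row in feature_map]


activated = relu_layer([[-2.0, 0.5], [3.0, -0.1]])
```

Positive values pass through unchanged, so the layer adds non-linearity without changing the size of the feature map.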

4. The fully-connected layer
• The fully-connected layer is always the last layer of a neural network, convolutional or not, so it is not
characteristic of a CNN.
This type of layer receives an input vector and produces a new output vector. To do this, it applies a linear
combination and then possibly an activation function to the input values received.
• The last fully-connected layer classifies the image given as input to the network: it returns
a vector of size N, where N is the number of classes in our image classification problem.
Each element of the vector indicates the probability that the input image belongs to a
class.
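
The linear combination performed by a fully-connected layer can be sketched as below. The weights, biases and input features are made-up values for a hypothetical problem with N = 3 classes; in practice they would be learned, and the raw scores would still need an output activation such as softmax to become probabilities.

```python
def fully_connected(inputs, weights, biases, activation=None):
    """Compute one output per weight row: a weighted sum of all inputs plus a bias."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    # The activation is optional, matching "possibly an activation function".
    if activation is not None:
        outputs = [activation(value) for value in outputs]
    return outputs


features = [1.0, 2.0]           # flattened feature vector (hypothetical)
weights = [
    [1.0, 0.0],                 # class 0 looks only at the first feature
    [0.0, 1.0],                 # class 1 looks only at the second feature
    [1.0, 1.0],                 # class 2 combines both
]
biases = [0.0, 0.0, -1.0]

scores = fully_connected(features, weights, biases)
```

The output has one element per weight row, so the number of rows in `weights` fixes the number of classes N.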

References
• https://www.analyticsvidhya.com/blog/2020/10/what-is-the-convolutional-neural-network-architecture/
• https://towardsdatascience.com/understand-the-architecture-of-cnn-90a25e244c7
• https://www.analyticsvidhya.com/blog/2021/05/convolutional-neural-networks-cnn/
• https://clevertap.com/blog/neural-networks/
• https://towardsdatascience.com/audio-deep-learning-made-simple-automatic-speech-recognition-asr-how-it-works-716cfce4c706
