Want to run GPU-accelerated applications inside a Windows 11 Hyper-V virtual machine? This guide walks you through enabling GPU Paravirtualization (GPU-PV), which allows the guest VM to use the host's graphics card for acceleration. We will use Hyper-V, the native hypervisor developed by Microsoft for creating virtual machines on x86-64 systems running Windows.
In this guide, we refer to the "host" as your main physical computer running Windows 11, and the "guest" as the virtual machine (VM) you intend to run.
Before starting, ensure you have the following:
Installing Hyper-V is straightforward, but if you are running the Home edition of Windows, check the dedicated instructions.
You will also need to download a Windows 11 ISO; look under the "Download Windows 11 Disk Image (ISO) for x64 devices" section.
Follow the official Microsoft tutorial to create a VM in Hyper-V. Make sure to disable Secure Boot and enable TPM in the VM settings, otherwise the installation will not succeed (the ISO is not detected).
Once your VM is created but before installing the guest OS (or after, depending on your workflow), we need to prepare the host for GPU partitioning.
Open PowerShell as Administrator on the host and run:
Get-VMHostPartitionableGpu
For NVIDIA GPUs, the output should include the vendor identifier VEN_10DE.
We will use a community script to set the necessary VM parameters. To download and run it:
Invoke-WebRequest -Uri "https://gist.githubusercontent.com/neggles/e35793da476095beac716c16ffba1d23/raw/0500355ed003e441a0ab2785ee5aa4a33a7ec8ab/New-GPUPVirtualMachine.ps1" -OutFile $env:USERPROFILE\Downloads\New-GPUVirtualMachine.ps1
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
& "$env:USERPROFILE\Downloads\New-GPUVirtualMachine.ps1"
To prevent issues with GPU state, disable automatic checkpoints and make the VM turn off instead of saving state:
Set-VM -Name "YourVMName" -AutomaticCheckpointsEnabled $false -AutomaticStopAction TurnOff
Note: While official Microsoft documentation mentions hardware partitioning, it describes it only for Windows Server. This guide is for Windows 11 clients.
Check that the VM has the correct graphics adapter:
Get-VMGpuPartitionAdapter -VMName "YourVMName"
GPU-PV does not expose a standard PCIe device to the VM. Instead, the VM sees a virtual adapter that relies on specific host-side driver components, so we must package these drivers and transfer them to the guest.
To package the GPU drivers from the host and send them to the guest, download and run the following script, replacing YourVMName and GuestUsername accordingly:
Invoke-WebRequest -Uri "https://gist.githubusercontent.com/neggles/e35793da476095beac716c16ffba1d23/raw/0500355ed003e441a0ab2785ee5aa4a33a7ec8ab/New-GPUPDriverPackage.ps1" -OutFile $env:USERPROFILE\Downloads\New-GPUDriverPackage.ps1
& "$env:USERPROFILE\Downloads\New-GPUDriverPackage.ps1"
Set-VM -Name YourVMName -GuestControlledCacheTypes $true
Enable-VMIntegrationService -VMName "YourVMName" -Name "Guest Service Interface"
Copy-VMFile -Name "YourVMName" -SourcePath .\GPUPDriverPackage.zip -DestinationPath "C:\Users\GuestUsername\Downloads\GPUPDriverPackage.zip" -FileSource Host
Inside the VM, navigate to C:\Users\GuestUsername\Downloads\ and extract GPUPDriverPackage.zip into C:\Windows\ as explained here.
Once installed, the VM should recognize the GPU in Windows Settings.
If you need to change the VM's display resolution, you can use the following PowerShell command on the host:
Set-VMVideo -VMName "YourVMName" -HorizontalResolution 2560 -VerticalResolution 1440 -ResolutionType Default
Learn more about the differences between GPU-PV and true PCIe passthrough here.
How to avoid registering a Google account on your Android smart TV.
Forced registration is a dark pattern that forces users to create an account before they can use any kind of service. Google is known for this kind of practice, and uses it a lot on Android-powered devices (like your smart TV).
People are not necessarily aware that it is possible to avoid registering with Google entirely and still use your TV's Android features, especially installing apps.
To use this tutorial, you will need:
The following instructions are for Linux, but should be broadly similar on Windows.
We first need to install the Android Debug Bridge (adb) on your host PC.
sudo apt-get -y install android-tools-adb
In the Android settings, find the section with your Android build information, and press OK on it a few times.

Quite a weird way to enable dev mode; it reminds me of the old days of entering cheat codes in my Game Boy games...
If done correctly, a message should appear saying "you are now a developer". There should also be a new menu item "Developer options" where you can enable "USB debugging".
If you run into any trouble, check this website.
Android developers package their applications using the .apk file format. Make sure to always select the 32-bit (armeabi-v7a) versions if you are not using an NVIDIA Shield.
Where to find them? Many websites host repositories that distribute them, but I personally use apkmirror.com. Always prefer developer mirrors or GitHub/GitLab release pages (which are common); SmartTube does that, for example.
Installing an app directly onto a device is called "side-loading". After you have downloaded a bunch of .apk files on your PC, it is time to install them!
Warning
Always proceed with caution when downloading files from an untrusted source!
First, identify the IP address of your TV; it should be somewhere in Settings > Network > About.
Then, connect to your TV with adb and check that the connection is successful:
adb connect <IP>
adb devices
Accept the USB debugging prompt that will appear on your Android TV.

You can now sideload the package!
adb install /path/to/apk/file
A whole new world is now open to you: you are not limited to installing apks, you can actually customize your TV!
For example, you can access and send files to the TV storage with adb shell (I use that to back up my Kodi library environment).
This is quite useful if you don't want to install additional apps on your TV.
This is a guide explaining how to dual-boot a system with both Windows and Linux, using grub.
Dual-booting is quite common and really useful: the idea is to be able to launch multiple OSes from a single computer (motherboard). The main reason I personally use dual-boot is that I prefer to work under Linux for its security, convenience and open-source nature. When I need to play games, I boot Windows. Let's hope the Steam Deck's popularity makes the Linux video game ecosystem a reality, thanks to the Proton tool!
Before everything else, you first need to choose between using one or multiple physical drives.
If you go with physical drives, you will need at least two disks (HDD or SSD). SSDs are more costly than HDDs and more performant, but have a shorter lifespan and less storage capacity.
I recommend buying physical drives for efficiency, provided of course that you have access to the motherboard. It should be easy to plug in two physical drives: almost all of today's motherboards support at least 2 SATA ports (for up to two SATA drives), and often some M.2 ports. For a good overview of the different SSDs, check this article.
On the other hand, you can use logical drives, i.e. decompose a single drive into multiple independent partitions.
This is less performant than separate physical disks, but you can in theory emulate any number of disks.
To create new partitions, you can use the pre-installed Disks tool on Ubuntu.
From there, you will need access to a desktop connected to the internet. We will create two installation media (two USB sticks with at least 8 GB each) for Windows and Ubuntu, each of which requires an ISO image. These USB sticks will later be used by your desktop to install the different OSes. For the Windows ISO, download it through that page, and for Ubuntu you can find it here.
Now, the rest depends on which system you used to download the ISO images.
If the desktop you are currently working on is running Windows, use the following guides to make the Ubuntu installation media and the Windows installation media.
Otherwise, if the current desktop is running Ubuntu, use the built-in Startup Disk Creator for the Ubuntu installation media and WoeUSB for Windows.

I recommend starting with Windows, because it has the bad habit of messing with other disks.
Plug the Windows installation USB into your desktop, reboot and enter the BIOS menu. It should be accessible by hitting the DEL key on your keyboard during boot.
From this menu, select UEFI boot and make sure not to select LEGACY. We are making sure the firmware uses the UEFI boot method, as it is the standard nowadays.
Go to the boot order menu and check that your Windows USB key appears with top priority. Now restart the computer again and you should boot into the USB installation media.
Note
Each manufacturer has its own menu, so double-check how to access your BIOS and modify the settings, for example DELL or MSI.
Once Windows is installed, shut down the desktop, remove the USB stick, power the desktop back on and check in the BIOS that the Windows drive has the highest boot priority; restart, and you should boot into Windows.
Now repeat the same process to install Ubuntu on the second drive. Make sure the Ubuntu USB stick is plugged in and has the highest boot priority. When asked, choose the UEFI installation (it should be the default) instead of BIOS/LEGACY.
Note
This is beyond the scope of this post, but if you need to set up disk encryption, you should do it during installation.
From there you should have a working setup, congratulations! But how can you easily switch between the different OSes?
If you hit the F11 key after powering on the PC, you should be able to access the BIOS boot menu. From there you can select whether to boot Windows or Ubuntu. Double-check that the two OSes work as expected.
This guide is not over though: we don't want to open the BIOS boot menu every time we switch OS, right? Let me introduce GNU GRUB.
The grub tool helps you manage your multi-boot configuration from your Ubuntu OS. Once everything is set up, you can give the Ubuntu drive top priority in the BIOS and forget about the Windows boot selection. We will see that it offers more customization than the raw BIOS boot menu.
First of all, boot into Ubuntu; we will check that the different disks are set up accordingly.
Run the following:
sudo fdisk -l
You should see multiple outputs, but we want to check the Windows disk, which should look like this:
Disk /dev/nvme0n1: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: ****
Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 206847 204800 100M EFI System
/dev/nvme0n1p2 206848 239615 32768 16M Microsoft reserved
/dev/nvme0n1p3 239616 1000212589 999972974 476.8G Microsoft basic data
The most important thing for Linux is to be able to detect the first partition (EFI System).
It should be the first partition, 100 MB in size, with type EFI System.
If the type is different, for example Microsoft basic data (which can happen after a Windows boot repair), you can change it
using the Disks utility from Ubuntu.
Click on the 100 MB partition of the Windows disk, open the options and select Edit Partition.
Now change the type to EFI System and optionally add a name, e.g. Windows EFI system.
Now we will use a tool so grub can detect the Windows partition.
sudo apt install os-prober
Running os-prober should output something like this:
$ sudo os-prober
/dev/nvme0n1p1@/efi/Microsoft/Boot/bootmgfw.efi:Windows Boot Manager:Windows:efi
If the output is empty or does not mention Windows, check this thread.
Now we can update the grub configuration file:
sudo update-grub
Reboot your PC; you should see the grub menu with a choice between Ubuntu and Windows.

If the grub menu does not appear, set the GRUB_TIMEOUT_STYLE parameter to menu in /etc/default/grub, and re-run update-grub.
It is possible to customize some of grub's behaviour.
All the options live in the /etc/default/grub file.
Let's say that on Windows you have applications that launch on startup (for example Steam and Big Picture mode for gamers).
You can make Windows boot by default by changing GRUB_DEFAULT, for example:
GRUB_DEFAULT="Windows Boot Manager (on /dev/nvme0n1p1)"
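For instance, an /etc/default/grub could contain the following (the values here are illustrative; keep the other options your distribution ships):

```
# /etc/default/grub (excerpt)
GRUB_DEFAULT="Windows Boot Manager (on /dev/nvme0n1p1)"  # default menu entry
GRUB_TIMEOUT_STYLE=menu  # always display the boot menu
GRUB_TIMEOUT=10          # seconds before the default entry boots
```

Remember to re-run sudo update-grub after any change to this file.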
There are plenty of configurations to play with, check here for more details.
Even custom themes exist!
Some additional notes regarding Windows disk migration.
I recently ran into issues when I upgraded my Windows disk. Basically, I used clonezilla to copy from my old (smaller) disk to the new (bigger) one.
Unfortunately it did not work as expected: I was no longer able to boot into Windows, and neither the Windows boot-repair nor bootrec helped. Fortunately, I was able to re-create the EFI/MSR partitions using diskpart.
Make sure to play with the offset parameter so the partitions are well aligned:
Device Start End Sectors Size Type
/dev/*** 2048 206847 204800 100M EFI System
/dev/*** 206848 239615 32768 16M Microsoft reserved
/dev/*** 239616 1000212589 999972974 476.8G Microsoft basic data
Then bcdboot helps populate the EFI files. A complete guide is available on this page.
You have all heard about deep neural networks, and maybe even used one. In this post, I will try to explain the mathematics behind neural nets with a basic classification example. This tutorial has two parts; check out part 2 here!
Deep learning is a tool that scientists use to address a central problem in data science: how to model from data? Depending on the context, data can be very complex and comes in different forms: 1D signals (audio, finance), 2D images (camera, ultrasound), 3D volumes (X-ray, computer graphics) or videos, and 4D moving volumes (fMRI).
One way of inferring predictions from such data is machine learning. After designing a feature extractor by hand, you choose a system that predicts from the features, depending on whether the feature space is linearly separable or not (e.g. an SVM with the kernel trick). Deep learning, on the other hand, strives to combine these two steps at once! It extracts the high-level, abstract features from the data itself and uses them to make the prediction.
This of course comes at a cost, both in terms of mathematical understanding and even money.
McCulloch and Pitts [1] were the first to introduce the idea of a neural network as a computing machine. It is a simple model defined by two free parameters, the weight vector $\mathbf{w}$ and the bias $b$:
$$ \begin{equation}\label{eq:perceptron} f(\mathbf{x}) = \mathbf{x}\mathbf{w} + b \end{equation} $$where the input vector $\mathbf{x}$ has size $[1\times D]$ with $D$ the number of features, and $\mathbf{w}$ has size $[D\times 1]$. This simple model can classify input features that are linearly separable.
Looking at its derivatives w.r.t. the parameters $\mathbf{w}$ and $b$ is useful when we want to analyse the impact of the parameters on the model.
$$ \begin{equation} \frac{\partial f(\mathbf{x})}{\partial \mathbf{w}} = \mathbf{x};\quad \frac{\partial f(\mathbf{x})}{\partial b} = 1 \end{equation} $$Things get more complicated for multiple neurons $C$ with parameters $\mathbf{W}$ of size $[D\times C]$ and $\mathbf{b}$ of size $[1\times C]$. Because the neuron activation is a function $f(\mathbf{W}):\mathbb{R}^{D\times C}\rightarrow \mathbb{R}^C$, the resulting Jacobian would be a tensor of size $[C\times D\times C]$. There are different strategies to handle this, but one way is to flatten $\mathbf{W}$ into a vector $\mathbf{w}$ of size $DC$ (stacking the columns); the Jacobians are then: $$ \begin{equation} \frac{\partial f(\mathbf{W})}{\partial \mathbf{W}} = I_C \otimes \mathbf{x} = \begin{bmatrix} x_1 \cdots x_D & & \\ & \ddots & \\ & & x_1 \cdots x_D \\ \end{bmatrix}_{[C \times DC]} \quad \frac{\partial f(\mathbf{b})}{\partial \mathbf{b}} = I_C = \begin{bmatrix} 1\\ & \ddots\\ & & 1 \end{bmatrix}_{[C \times C]} \end{equation} $$where $I_C$ is the $[C\times C]$ identity matrix and $\otimes$ the Kronecker product.
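To make the flattened Jacobian concrete, here is a small numerical sanity check (a sketch added for illustration, not part of the original derivation); since $f$ is linear in $\mathbf{W}$, central finite differences should match the Kronecker form up to floating-point error:

```python
import numpy as np

# Sanity check: for f(x) = x @ W + b with D features and C neurons,
# the Jacobian of f w.r.t. the column-stacked (flattened) W is I_C ⊗ x.
D, C = 2, 3
rng = np.random.default_rng(0)
x = rng.standard_normal(D)
W = rng.standard_normal((D, C))
b = rng.standard_normal(C)

analytic = np.kron(np.eye(C), x)  # shape [C, D*C]

# Numerical Jacobian by central finite differences over the flattened W
eps = 1e-6
numeric = np.zeros((C, D * C))
for k in range(D * C):
    dW = np.zeros(D * C)
    dW[k] = eps
    # order='F' stacks columns, matching the flattening convention above
    Wp = W + dW.reshape(D, C, order='F')
    Wm = W - dW.reshape(D, C, order='F')
    numeric[:, k] = ((x @ Wp + b) - (x @ Wm + b)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```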
Having a model is not enough to come up with a complete algorithm; we need a classification rule that computes the response based on the model output. Rosenblatt [2] used a simple step activation function (also referred to as the Heaviside function), so an input is classified as target $t=1$ if $\mathbf{x}\mathbf{w} + b > 0$ and as target $t=0$ otherwise:
$$ \begin{equation} f(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \end{equation} $$Nowadays, we use the widely adopted softmax [3] (sigmoid in the binary case). It has two nice properties that make it a good function to model probability distributions: each value ranges between 0 and 1, and the sum of all values is always 1. Given an input vector $\mathbf{x}$ of size $[C\times 1]$ (one entry per class):
$$ \begin{equation}\label{eq:softmax} y(\mathbf{x}) = \frac{1}{\sum_{c=1}^C e^{x_c}} \begin{bmatrix} e^{x_{1}} \\ \vdots \\ e^{x_C} \\ \end{bmatrix} \end{equation} $$To compute the Jacobian matrix, we need the derivative of each output $y_{1}(\mathbf{x}) \dots y_C(\mathbf{x})$ (rows) w.r.t. each input $x_{1} \dots x_{C}$ (columns). The resulting Jacobian matrix for the softmax function is:
$$ \begin{equation} \frac{\partial y(\mathbf{x})}{\partial \mathbf{x}} = \begin{bmatrix} y_{1}(\mathbf{x})\cdot (1 - y_{1}(\mathbf{x})) & -y_1(\mathbf{x})\cdot y_2(\mathbf{x}) & \dots & -y_1(\mathbf{x})\cdot y_C(\mathbf{x}) \\ -y_2(\mathbf{x})\cdot y_1(\mathbf{x}) & \ddots & & \vdots \\ \vdots & & \ddots & \\ -y_C(\mathbf{x})\cdot y_1(\mathbf{x}) & & & y_C(\mathbf{x})\cdot (1 - y_C(\mathbf{x})) \\ \end{bmatrix}_{[C\times C]} \end{equation} $$
The fitness of the data to the model is defined by the cost function, which can be derived from the likelihood. Given the weight vector $\mathbf{w}$ (which defines our model), the ground-truth vector $\mathbf{t}$ of size $[1\times C]$ (whose entries are either $0$ or $1$ in classification), the $c$-th activation of the softmax $y_c(\mathbf{x})$ and the input data $\mathbf{x}$, the likelihood is: $$ \begin{equation} \mathcal{L}(\mathbf{w}|\mathbf{t}, \mathbf{x}) = \prod_{c=1}^{C} y_c(\mathbf{x})^{t_c} \end{equation} $$
Taking the negative $\log$ of the likelihood, the loss function over one sample becomes:
$$ \begin{equation}\label{eq:cost} \xi(\mathbf{t}, y(\mathbf{x})) = - \sum_{c=1}^{C} t_{c} \cdot \log( y_c(\mathbf{x})) \end{equation} $$As we could expect, this is exactly the cross-entropy function!
We can also look at the derivative of $\xi$ w.r.t. the output of the model $y(\mathbf{x})$ (the softmax function in our case):
$$ \begin{equation}\label{eq:dv_cost} \frac{\partial \xi(\mathbf{t}, y(\mathbf{x}))}{\partial y(\mathbf{x})} = (-1)\cdot \begin{bmatrix} t_{1}/y_{1}(\mathbf{x}) & \cdots & t_{C}/y_{C}(\mathbf{x}) \end{bmatrix}_{[1 \times C]} \end{equation} $$For multiple samples $n$ (each with its own features $\mathbf{x}_n$), there are different strategies. A basic one is to sum each sample's loss; the same goes for the Jacobian of the cost function.
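Writing this summed strategy out explicitly for $N$ samples (an expanded form added here for reference), the total loss is:

$$ \begin{equation} \xi_{\text{total}} = \sum_{n=1}^{N} \xi(\mathbf{t}_n, y(\mathbf{x}_n)) = - \sum_{n=1}^{N} \sum_{c=1}^{C} t_{n,c} \cdot \log( y_c(\mathbf{x}_n)) \end{equation} $$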
The standard approach to extracting the optimal weights of the model is the gradient descent algorithm. Knowing the gradient of the cost function w.r.t. each parameter, $\frac{\partial \xi}{\partial \mathbf{w}}$, one can extract the optimal set of weights. This gradient can be decomposed into multiple simpler derivatives by looking at how the data flows from the input $\mathbf{x}$ to the model output $y(\mathbf{x})$.
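Concretely, each gradient descent iteration applies the update rule below, where $\lambda$ denotes the learning rate (the step size):

$$ \begin{equation} \mathbf{w} \leftarrow \mathbf{w} - \lambda \frac{\partial \xi}{\partial \mathbf{w}}, \qquad \mathbf{b} \leftarrow \mathbf{b} - \lambda \frac{\partial \xi}{\partial \mathbf{b}} \end{equation} $$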
We will write the complete equation later, after defining the whole model. For now, just remember that to compute $\frac{\partial \xi}{\partial \mathbf{w}}$, we need the intermediate derivative of $\xi$ w.r.t. the outputs $y(\mathbf{x})$ that we just defined.
The process of computing all the gradients and updating the parameters of the neural network is called backpropagation.
Our goal will be to predict three different classes $C : \{c_a, c_b, c_c\}$, given some data with two features $x_1$ and $x_2$. We suppose that the features can indeed be used to discriminate the samples, and are statistically independent. The data will be an $N\times 2$ matrix where each row is a sample $X = [x_1, x_2]$. The three classes are drawn from three different noisy Gaussian distributions.
### imports
import numpy as np
import copy
import matplotlib
from matplotlib.colors import colorConverter, ListedColormap
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
### Data modeling
# Fix random state
np.random.seed(0)
# Distribution of the classes
n = 20
a_mean = [-3, 0]
b_mean = [0, 0]
c_mean = [2, 0]
std_dev = 0.8
# Generate samples from classes
X_a = np.random.randn(n, 2) * std_dev + a_mean + np.random.rand(n,2)
X_b = np.random.randn(n, 2) * std_dev + b_mean + np.random.rand(n,2)
X_c = np.random.randn(n, 2) * std_dev + c_mean + np.random.rand(n,2)
# Create inputs X and targets C
X = np.vstack((X_a, X_b, X_c))
C = np.vstack( (np.zeros((n,1), dtype=int), np.ones((n,1), dtype=int), 2*np.ones((n,1), dtype=int)) )
# random permutation
idx = np.arange(X.shape[0])
idx = np.random.permutation(idx)
X = X[idx,]
C = C[idx,]
# one hot encoding
one_hot = np.zeros((len(C), np.max(C)+1))
one_hot[np.arange(len(C)), C.flatten().tolist()] = 1
C = one_hot
Let's take a look at the data.
### Plot both classes on the x1, x2 space
plt.plot(X_a[:,0], X_a[:,1], 'ro', label='class $c_a$')
plt.plot(X_b[:,0], X_b[:,1], 'b*', label='class $c_b$')
plt.plot(X_c[:,0], X_c[:,1], 'g+', label='class $c_c$')
plt.grid()
plt.legend()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Class a, b and c in the space $X$')
plt.show()
As the name suggests, a neural network is an architecture combining many neurons! Because the 2D input data is linearly separable, we will design a model with just one layer of three neurons (one for each class). We will then compute the probabilities of the three classes through the softmax activation function; the highest probability gives us our class.
We will start by coding the different blocks we need for this task. First, the perceptron model from eq.\ref{eq:perceptron}, which will allow us to adjust our model (via the weight and bias parameters).
# Neuron class definition
class Neuron:
    def __init__(self, weights, bias):
        # first dim is the number of features
        # second dim is the number of neurons
        self.weights = weights
        self.bias = bias

    def output(self, input):
        return input @ self.weights + self.bias

    def grad(self, input):
        D = np.kron(np.eye(self.weights.shape[1]), input)
        return [D, np.eye(self.weights.shape[1])]
Now we can code the softmax activation function using eq.\ref{eq:softmax} (so we can predict the class), and its derivative (used for the gradient descent).
# Activation function
def softmax(input):
    return np.exp(input) / np.sum(np.exp(input))

# Derivative of the activation function
def dv_softmax(input):
    diag = softmax(input)*(1-softmax(input))
    xj, xi = np.meshgrid(softmax(input), softmax(input))
    jacob = (-1)*xi*xj
    np.fill_diagonal(jacob, diag)
    return jacob
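As a quick confidence check on the hand-derived Jacobian (an extra snippet, made self-contained by reproducing the two functions above), we can compare dv_softmax against central finite differences:

```python
import numpy as np

def softmax(input):
    return np.exp(input) / np.sum(np.exp(input))

def dv_softmax(input):
    # Jacobian: diag(y) - y y^T, written entry-wise
    diag = softmax(input) * (1 - softmax(input))
    xj, xi = np.meshgrid(softmax(input), softmax(input))
    jacob = (-1) * xi * xj
    np.fill_diagonal(jacob, diag)
    return jacob

# Central finite differences of softmax at an arbitrary point
z = np.array([0.2, -1.0, 0.7])
eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(dv_softmax(z), numeric, atol=1e-6))  # True
```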
Let's take a look at the activation function:
### Plot the softmax function (for one class)
n_points=100
x = np.linspace(-10, 10, n_points)
xj, xi = np.meshgrid(x, x)
y = np.zeros((n_points, n_points, 2))
# Softmax output
for i in range(n_points):
    for j in range(n_points):
        y[i,j,:] = softmax(np.array([xj[i,j], xi[i,j]])).flatten()
# Plot the activation for input
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
surf = ax.plot_surface(xj, xi, y[:,:,0], cmap="plasma")
ax.view_init(elev=30, azim=70)
cbar = fig.colorbar(surf)
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.set_zlabel('$y_1$')
ax.set_title ('$y(\mathbf{x})_1$')
cbar.ax.set_ylabel('$y_1$', fontsize=12)
plt.show()
It is time to define the whole model described previously! Given the weights $\mathbf{W}$ and bias $\mathbf{b}$ from the three neurons, the output $y(\mathbf{x})$ over the input vector $\mathbf{x}=\{x_1, x_2\}$ is:
$$ \begin{equation} y(\mathbf{x}) = \frac{1}{e^{x_1w_{11} + x_2w_{21} + b_1} + e^{x_1w_{12} + x_2w_{22} + b_2} + e^{x_1w_{13} + x_2w_{23} + b_3}} \begin{bmatrix} e^{x_1w_{11} + x_2w_{21} + b_1}\\ e^{x_1w_{12} + x_2w_{22} + b_2}\\ e^{x_1w_{13} + x_2w_{23} + b_3}\\ \end{bmatrix} \end{equation} $$This is called the feed-forward pass of the model, i.e. the evaluation of the output probabilities given the input(s).
### feed-forward definition
def feed_forward(inputs, weights, bias):
    n_neurons = weights.shape[1]
    n_samples = inputs.shape[0]
    activations = np.zeros((n_samples, n_neurons))
    features = Neuron(weights, bias).output(inputs)
    for i in range(n_samples):
        activations[i, :] = softmax(features[i, :])
    return activations
Let's compute the feed-forward pass of our model, with zero bias and the same weights $[1, 0.5]$ for every neuron.
# Feedforward pass
weights = np.array([[1, 1, 1], [0.5, 0.5, 0.5]])
bias = np.array([0, 0, 0])
feed_forward(inputs=X, weights=weights, bias=bias)
The output probability is $33.3\%$ for each class! This is because each neuron contributes equally to the output (they have the same weights), so there is no information in this model.
We will use gradient descent to infer the best model parameters. Even if we compute the gradient by hand for the purpose of this example, this is of course not feasible in practice when you have billions of parameters. Efficiently estimating the derivatives of a (very) deep neural network numerically is possible thanks to a major achievement in AI: automatic differentiation.
Given the ground truth class $\mathbf{t}$ and the model output $\mathbf{y}$ (which is the input of the cost function), we can code the cost function from eq.\ref{eq:cost} and its derivative from eq.\ref{eq:dv_cost} to extract the optimal parameters.
# Optimization of the model's parameters
# cost function given ground truth t and model output y (here referred to as input)
def cost_function(input, t):
    return (-1)*np.sum(t * np.log(input))

# derivative of the cost function for gradient descent
def dv_cost_function(input, t):
    return (-1)*(t/input)
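To see these two functions in action on a single sample, here is a self-contained toy example (the probability values are made up for illustration, and the definitions above are reproduced so the snippet runs on its own):

```python
import numpy as np

def cost_function(input, t):
    return (-1) * np.sum(t * np.log(input))

def dv_cost_function(input, t):
    return (-1) * (t / input)

# One sample, three classes: the true class is the second one
t = np.array([0.0, 1.0, 0.0])
y = np.array([0.2, 0.7, 0.1])  # model output (softmax probabilities)

print(cost_function(y, t))     # -log(0.7), about 0.357
print(dv_cost_function(y, t))  # nonzero only for the true class: -1/0.7 ≈ -1.43
```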
The derivative of the cost function w.r.t. the output of the model, $\frac{\partial \xi}{\partial \mathbf{y}}$, is not sufficient on its own. Remember, for gradient descent we need the derivative of the cost function w.r.t. the model parameters, $\frac{\partial \xi}{\partial \mathbf{W}}$. Using the chain rule, one can decompose this derivative into "smaller" consecutive ones, from the output of the network $\mathbf{y}$ back to the input $\mathbf{x}$:
$$ \begin{equation} \frac{\partial \xi(\mathbf{t}, y(\mathbf{x}))}{\partial \mathbf{W}} = \frac{\partial \xi(\mathbf{t}, y(\mathbf{x}))}{\partial y(\mathbf{x})} \frac{\partial y(\mathbf{x})}{\partial z(\mathbf{x})} \frac{\partial z(\mathbf{x})}{\partial \mathbf{W}}, \end{equation} $$where $z(\mathbf{x})$ is the output of the neurons (just before the activation function). The same formula holds for the bias parameters $\mathbf{b}$.
The process of computing the gradient for a given neural network is called the backward pass (more often referred to as backpropagation). Let's create a new class for our model,
### Model class definition
class LinearModel:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias
        self.n_neurons = weights.shape[1]

    def update_params(self, weights=None, bias=None):
        if weights is not None:
            self.weights = weights
        if bias is not None:
            self.bias = bias

    def feed_forward(self, inputs):
        n_samples = inputs.shape[0]
        activations = np.zeros((n_samples, self.n_neurons))
        features = Neuron(self.weights, self.bias).output(inputs)
        for i in range(n_samples):
            activations[i, :] = softmax(features[i, :])
        return activations

    def back_propagation(self, inputs, t):
        n_samples = inputs.shape[0]
        g_w = np.zeros((self.weights.shape[0], self.weights.shape[1], n_samples))
        g_b = np.zeros((self.weights.shape[1], n_samples))
        feed_forwards = self.feed_forward(inputs)
        neuron = Neuron(self.weights, self.bias)
        for i in range(n_samples):
            grad_cost = dv_cost_function(feed_forwards[i, :], t[i, :])
            grad_activation = dv_softmax(neuron.output(inputs[i, :]))
            grad_neuron = neuron.grad(inputs[i, :])
            # here we reshape the Jacobian w.r.t. W so it can easily be subtracted from W
            g_w[:, :, i] = np.reshape(grad_cost @ grad_activation @ grad_neuron[0], self.weights.shape, order='F')
            g_b[:, i] = grad_cost @ grad_activation @ grad_neuron[1]
        # sum of each sample's gradient
        return [np.sum(g_w, axis=-1), np.sum(g_b, axis=-1)]
Now, we can try to update our parameters and compute the new probabilities.
# backpropagation
weights = np.array([[1, 1, 1], [0.5, 0.5, 0.5]])
bias = np.array([0, 0, 0])
my_model = LinearModel(weights=weights, bias=bias)
g_w, g_b = my_model.back_propagation(inputs=X, t=C)
weights = weights - g_w
bias = bias - g_b
my_model.update_params(weights, bias)
my_model.feed_forward(X)
We see that updating the parameters starts to give different probabilities for each sample. In the next section, we will update the parameters inside a loop to gradually improve the weights.
We will first check the cost at each epoch (step) during the training of the model. An important hyperparameter is the learning rate: it impacts the size of the gradient step at each iteration, and too large a value introduces a lot of noise during training. Choosing the optimal learning rate is part of what we call hyperparameter selection.
Here, we will fix it at $\lambda=0.025$.
# learning phase
# hyper-parameters and model instanciation
lr = 0.025
n_iter = 50
weights = np.array([[1, 1, 1], [0.5, 0.5, 0.5]])
bias = np.array([0, 0, 0])
my_model = LinearModel(weights=weights, bias=bias)
cost=np.array([])
for i in range(n_iter):
    # backpropagation step
    g_w, g_b = my_model.back_propagation(inputs=X, t=C)
    weights = weights - lr*g_w
    bias = bias - lr*g_b
    my_model.update_params(weights=weights, bias=bias)
    # cost function
    probs = my_model.feed_forward(X)
    cost = np.append(cost, cost_function(input=probs, t=C))
### plotting cost
plt.plot(cost, 'b-+')
plt.grid()
plt.xlabel('iter')
plt.ylabel('cost')
plt.title('Cost value over time')
plt.show()
Now let's take a look at the decision boundary of our classifier, by creating inputs ranging from $-5$ to $5$ in the $x_1$, $x_2$ space. It is also interesting to see the importance of the bias parameter at the same time.
### Decision boundary on the x1, x2 space
# decision function
n_samples = 200
Xl = np.linspace(-5, 5, num=n_samples)
Xm, Ym = np.meshgrid(Xl, Xl)
Df = np.zeros((n_samples, n_samples))
for i in range(n_samples):
    for j in range(n_samples):
        prob = my_model.feed_forward( np.array([Xm[i, j], Ym[i, j]], ndmin=2) )
        Df[i, j] = np.argmax(prob, axis=1)
cmap = ListedColormap([
colorConverter.to_rgba('r', alpha=0.3),
colorConverter.to_rgba('b', alpha=0.3),
colorConverter.to_rgba('g', alpha=0.3)])
plt.subplot(1, 2, 1)
plt.contourf(Xm, Ym, Df, cmap=cmap)
# ground truth for the inputs
plt.plot(X_a[:,0], X_a[:,1], 'ro', label='class $c_a$')
plt.plot(X_b[:,0], X_b[:,1], 'b*', label='class $c_b$')
plt.plot(X_c[:,0], X_c[:,1], 'g+', label='class $c_c$')
plt.grid()
plt.legend()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Model with bias')
# decision function for model without bias
my_model_without_bias = copy.deepcopy(my_model)
my_model_without_bias.update_params(bias=np.array([0, 0, 0]))
for i in range(n_samples):
    for j in range(n_samples):
        prob = my_model_without_bias.feed_forward( np.array([Xm[i, j], Ym[i, j]], ndmin=2) )
        Df[i, j] = np.argmax(prob, axis=1)
cmap = ListedColormap([
colorConverter.to_rgba('r', alpha=0.3),
colorConverter.to_rgba('b', alpha=0.3),
colorConverter.to_rgba('g', alpha=0.3)])
plt.subplot(1, 2, 2)
plt.contourf(Xm, Ym, Df, cmap=cmap)
# ground truth for the inputs
plt.plot(X_a[:,0], X_a[:,1], 'ro', label='class $c_a$')
plt.plot(X_b[:,0], X_b[:,1], 'b*', label='class $c_b$')
plt.plot(X_c[:,0], X_c[:,1], 'g+', label='class $c_c$')
plt.grid()
plt.legend()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Model without bias')
plt.show()
It is clear that without the bias, some samples of class $c_b$ (blue stars) are not well classified.
By comparing the accuracy of the models with and without bias, the difference is even more drastic.
# compute probabilities
probs = my_model.feed_forward(X)
probs_no_bias = my_model_without_bias.feed_forward(X)
Y = np.argmax(probs, axis=1)
Y_no_bias = np.argmax(probs_no_bias, axis=1)
Y_truth = np.argmax(C, axis=1)
acc_Y = np.sum(Y == Y_truth)/len(Y_truth)
acc_Y_no_bias = np.sum(Y_no_bias == Y_truth)/len(Y_truth)
print("Accuracy for model with bias: %.2f" %acc_Y)
print("Accuracy for model without bias: %.2f" %acc_Y_no_bias)
Indeed, having the bias gives the model a better decision boundary, since the bias offsets the decision function.
We made our own neural network from end to end, isn't that fantastic!? We saw the different fundamental steps, and how to compute the gradient w.r.t. the parameters. The data used here was quite simple since it was linearly separable in space (Gaussian blobs), but what happens if this is not the case? If you are curious, check out the second part on classifying non-linear data!
A first and one of the most important resources is the official deep learning book, co-authored by the people who put this field forward. There is more on the perceptron model in this book. Also check this nice overview of how one can compute derivatives automatically here, which is most important for deep learning.
Thanks to Peter Roelants, who owns a nice blog on machine learning. It helped me gain a deeper understanding of the mathematics behind neural networks. Some code was also inspired from his work. To design the networks, I used this web tool developed by Alexander Lenail. There are also other styles available: LeNet or AlexNet.
1. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics. 5, 115–133 (1943).
2. Rosenblatt, F.: The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory (1957).
3. Bridle, J.S.: Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. Neurocomputing. 227–236 (1990).
In the previous post, we saw an example of a linear neural network where the data was clearly linearly separable. This time, we will try to classify non-linearly separable data. Before showing how to do that, we need to introduce some key functions.
Considering the basic McCulloch model (with parameters $w$ and $b$) [1], one might have the idea that stacking multiple linear layers could introduce non-linearity into the system. But it is important to understand that stacking multiple linear layers has no effect on the linearity of the system. Indeed, imagine that you stack $3$ linear layers together; the resulting output would be:
$$ \begin{equation} \begin{split} f_1\circ f_2\circ f_3(x) & = ((x w_3 + b_3) w_2 + b_2) w_1 + b_1 \\ & = x w_3 w_2 w_1 + b_3 w_2 w_1 + b_2 w_1 + b_1 \end{split} \end{equation} $$This is exactly the same as a simple linear layer with $w = w_3 w_2 w_1$ and $b = b_3 w_2 w_1 + b_2 w_1 + b_1$! Hence the importance of introducing activation functions, which project the data into a space where it becomes linearly separable.
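We can verify this collapse numerically; the following sketch uses arbitrary random parameters (the shapes and values are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# three arbitrary 2x2 linear layers f_k(x) = x @ w_k + b_k
w3, w2, w1 = rng.standard_normal((3, 2, 2))
b3, b2, b1 = rng.standard_normal((3, 2))

x = rng.standard_normal((5, 2))  # a batch of 5 inputs

# stacked layers: f1(f2(f3(x)))
stacked = ((x @ w3 + b3) @ w2 + b2) @ w1 + b1

# single equivalent linear layer
w = w3 @ w2 @ w1
b = b3 @ w2 @ w1 + b2 @ w1 + b1
single = x @ w + b

print(np.allclose(stacked, single))  # True
```

No matter how many linear layers we stack, the composition is always expressible as one linear layer.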
We already used the softmax, popularized by Bridle et al. [2]. If we stacked multiple layers with softmax outputs, our model would indeed be non-linear. The issue is that the softmax derivative is close to zero when one input dominates the others (the output saturates). This can lead to the vanishing gradient problem, where learning slows down because the gradients become too small to update the parameters.
### imports
import numpy as np
import matplotlib
from matplotlib.colors import colorConverter, ListedColormap
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
### functions
class Neuron:
def __init__(self, weights, bias):
# first dim is the number of features
# second dim is the number of neurons
self.weights = weights
self.bias = bias
def output(self, input):
return input @ self.weights + self.bias
def grad(self, input):
D = np.kron(np.eye(self.weights.shape[1]), input)
return [D, np.eye(self.weights.shape[1])]
def softmax(input):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(input - np.max(input))
    return e / np.sum(e)
# Derivative of the activation function
def dv_softmax(input):
diag = softmax(input)*(1-softmax(input))
xj, xi = np.meshgrid(softmax(input), softmax(input))
jacob = (-1)*xi*xj
np.fill_diagonal(jacob, diag)
return jacob
def cost_function(input, t):
return (-1)*np.sum(t * np.log(input))
# derivative of the cost function for gradient descent
def dv_cost_function(input, t):
return (-1)*(t/input)
class LinearModel:
def __init__(self, weights, bias):
self.weights = weights
self.bias = bias
self.n_neurons = weights.shape[1]
def update_model(self, weights=[], bias=[]):
if not isinstance(weights, list):
self.weights = weights
if not isinstance(bias, list):
self.bias = bias
def feed_forward(self, inputs):
n_samples = inputs.shape[0]
activations = np.zeros((n_samples, self.n_neurons))
features = Neuron(self.weights, self.bias).output(inputs)
for i in range(n_samples):
activations[i, :] = softmax(features[i, :])
return activations
def back_propagation(self, inputs, t):
n_samples = inputs.shape[0]
g_w = np.zeros((self.weights.shape[0], self.weights.shape[1], n_samples))
g_b = np.zeros((self.bias.shape[0], n_samples))
feed_forwards = self.feed_forward(inputs)
neuron = Neuron(self.weights, self.bias)
for i in range(n_samples):
grad_cost = dv_cost_function(feed_forwards[i, :], t[i, :])
grad_activation = dv_softmax(neuron.output(inputs[i, :]))
grad_neuron = neuron.grad(inputs[i, :])
# here we reshape the jacobian w.r.t. W so it can easily be subtracted from W
g_w[:, :, i] = np.reshape(grad_cost @ grad_activation @ grad_neuron[0], self.weights.shape, order='F')
g_b[:, i] = grad_cost @ grad_activation @ grad_neuron[1]
# sum of each sample's gradient
return [np.sum(g_w, axis=-1), np.sum(g_b, axis=-1)]
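A quick sanity check on `dv_softmax`: because the softmax outputs always sum to one, every row of its Jacobian must sum to zero. A small numerical verification (the functions are re-defined so the snippet stands alone; the test point `z` is arbitrary):

```python
import numpy as np

def softmax(input):
    return np.exp(input) / np.sum(np.exp(input))

def dv_softmax(input):
    diag = softmax(input) * (1 - softmax(input))
    xj, xi = np.meshgrid(softmax(input), softmax(input))
    jacob = (-1) * xi * xj
    np.fill_diagonal(jacob, diag)
    return jacob

z = np.array([0.5, -1.2, 2.0])  # arbitrary pre-activation values
J = dv_softmax(z)

# pushing probability mass towards one class removes exactly as
# much from the others, so each Jacobian row sums to zero
print(np.allclose(J.sum(axis=1), 0.0))  # True
```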
### Plot softmax derivative
n_points=100
x = np.linspace(-10, 10, n_points)
xj, xi = np.meshgrid(x, x)
y = np.zeros((n_points, n_points, 2))
# Softmax output
for i in range(n_points):
for j in range(n_points):
y[i,j,:] = np.diag(dv_softmax(np.array([xj[i,j], xi[i,j]])))
# Plot the activation for input
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection=...) was removed in Matplotlib 3.6
surf = ax.plot_surface(xj, xi, y[:,:,0], cmap="plasma")
ax.view_init(elev=30, azim=70)
cbar = fig.colorbar(surf)
ax.set_xlabel(r"$x_1$")
ax.set_ylabel(r"$x_2$")
ax.set_zlabel(r"$\partial y_1$")
ax.set_title(r"Diagonal of $\partial y(\mathbf{x})_1$")
cbar.ax.set_ylabel(r"$\partial y_1$", fontsize=12)
plt.show()
The vanishing gradient issue brought the idea of introducing the ReLU activation function back in 2010 [3]. It is defined as follows,
$$ \begin{equation} f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \end{equation} $$One clear advantage is that this activation is fast, easy to use and not computationally intensive. The gradient is always one for positive values, but the neuron can still "die" if its input is negative. That is why some use the leaky ReLU.
Be careful: the ReLU is not differentiable at zero, hence the introduction of the softplus activation.
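To make the comparison concrete, here is a minimal sketch of the three activations mentioned above. The names `relu_plain`, `leaky_relu` and `softplus`, and the 0.01 leak slope, are illustrative choices of mine, not the post's code (the post's `relu` below is itself a leaky variant with slope 0.1):

```python
import numpy as np

def relu_plain(x):
    # gradient is exactly zero for negative inputs: the neuron can "die"
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # keeps a small gradient (alpha) for negative inputs
    return np.where(x > 0, x, alpha * x)

def softplus(x):
    # smooth, everywhere-differentiable approximation of ReLU
    return np.log1p(np.exp(x))

print(relu_plain(np.array([-2.0, 3.0])))   # negative part clipped to 0
print(leaky_relu(np.array([-2.0, 3.0])))   # small negative slope kept
print(softplus(np.array([0.0])))           # log(2) ≈ 0.693
```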
### Data
# Fix random state
np.random.seed(0)
# Distribution of the classes
n = 50
std_dev = 0.2
# Generate samples from classes
linspace = np.linspace(0, 2 * np.pi, n, endpoint=False)
X_a = np.random.randn(n, 2) * std_dev
X_b = np.vstack( (np.cos(linspace), np.sin(linspace))).T * 0.75 + np.random.randn(n, 2) * std_dev
X_c = np.vstack( (np.cos(linspace), np.sin(linspace))).T * 1.25 + np.random.randn(n, 2) * std_dev
# Create inputs X and targets C
X = np.vstack((X_a, X_b, X_c))
C = np.vstack( (np.zeros((n,1), dtype=int), np.ones((n,1), dtype=int), 2*np.ones((n,1), dtype=int)) )
# random permutation
idx = np.arange(X.shape[0])
idx = np.random.permutation(idx)
X = X[idx,]
C = C[idx,]
# one hot encoding
one_hot = np.zeros((len(C), np.max(C)+1))
one_hot[np.arange(len(C)), C.flatten().tolist()] = 1
C = one_hot
Let's take a look at the data.
### Plot both classes on the x1, x2 space
plt.figure()
plt.plot(X_a[:,0], X_a[:,1], 'ro', label='class $c_a$')
plt.plot(X_b[:,0], X_b[:,1], 'b*', label='class $c_b$')
plt.plot(X_c[:,0], X_c[:,1], 'g+', label='class $c_c$')
plt.grid()
plt.legend()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Class a, b and c in the space $X$')
plt.show()
It is quite clear that the linear model from part 1 isn't adapted to this data. As an exercise, you can try feeding this data to the linear model and see how it performs!
The model that we will design is close to the previous one; the only difference is the addition of one layer with the ReLU activation. Such layers are called hidden layers in the deep learning community. Afterwards, we compute the probabilities for the three classes through the softmax activation function. Our final classification rule is that the highest probability gives us our class.
We first need to implement the ReLU function and its derivative, which are quite easy to code. It is also worth comparing it with the logistic activation.
# Activation function (a leaky ReLU: slope 0.1 for negative inputs)
def relu(input):
    return input * (input > 0) + 0.1 * input * (input < 0)
# Derivative of the activation function: the Jacobian of an element-wise
# activation is diagonal (1 for positive inputs, the 0.1 leak otherwise)
def dv_relu(input):
    diag = np.where(input > 0, 1.0, 0.1)
    return np.eye(input.shape[-1]) * diag
def logistic(input):
    """Logistic function."""
    return 1. / (1. + np.exp(-input))
def dv_logistic(input):
    """Derivative of the logistic function (also a diagonal Jacobian)."""
    return np.eye(input.shape[-1]) * (logistic(input) * (1 - logistic(input)))
Given the weights $\mathbf{W_h}$ and bias $\mathbf{b_h}$ from the two hidden neurons, the relu activation of the hidden layer $h(\mathbf{x})$ over the input vector $\mathbf{x}=\{x_1, x_2\}$ is:
$$ \begin{equation} h(\mathbf{x}) = \begin{bmatrix} relu(x_1w_{h_{11}} + x_2w_{h_{21}} + b_{h_1}) \\ relu(x_1w_{h_{12}} + x_2w_{h_{22}} + b_{h_2}) \end{bmatrix} \end{equation} $$The softmax activation of the output layer $y(h(\mathbf{x}))$ can be calculated given the weights $W_o$ and bias $\mathbf{b_o}$ from the three output neurons:
$$ \begin{equation} y(\mathbf{x}) = \frac{1}{e^{h_1w_{o_{11}} + h_2w_{o_{21}} + b_{o_1}} + e^{h_1w_{o_{12}} + h_2w_{o_{22}} + b_{o_2}} + e^{h_1w_{o_{13}} + h_2w_{o_{23}} + b_{o_3}}} \begin{bmatrix} e^{h_1w_{o_{11}} + h_2w_{o_{21}} + b_{o_1}}\\ e^{h_1w_{o_{12}} + h_2w_{o_{22}} + b_{o_2}}\\ e^{h_1w_{o_{13}} + h_2w_{o_{23}} + b_{o_3}}\\ \end{bmatrix} \end{equation} $$where $h_1, h_2$ are the hidden-layer activations. Given the targets $\mathbf{t}$ and following the chain rule, we can decompose the derivative of the cost function $\xi(\mathbf{t}, y)$ w.r.t. the output neuron parameters:
$$ \begin{equation} \frac{\partial \xi(\mathbf{t}, y)}{\partial \mathbf{W_o}} = \frac{\partial \xi(\mathbf{t}, y)}{\partial y} \frac{\partial y}{\partial z_o} \frac{\partial z_o}{\partial \mathbf{W_o}}, \end{equation} $$where $z_o$ is the output of the neurons for the output layer (just before the activation function).
The derivative is quite different for $\mathbf{W_h}$ since we need to go "deeper" into the model to compute it. But it is still possible to reuse some previous results to avoid redundant computation:
$$ \begin{equation} \begin{split} \frac{\partial \xi(\mathbf{t}, y)}{\partial \mathbf{W_h}} & = \frac{\partial \xi(\mathbf{t}, y)}{\partial h} \frac{\partial h}{\partial z_h} \frac{\partial z_h}{\partial \mathbf{W_h}} \\ & = \frac{\partial \xi(\mathbf{t}, y)}{\partial z_o} \frac{\partial z_o}{\partial h} \frac{\partial h}{\partial z_h} \frac{\partial z_h}{\partial \mathbf{W_h}}, \end{split} \end{equation} $$with $$ \begin{equation} \frac{\partial z_o}{\partial h} = \mathbf{W_o} \end{equation} $$
The same process stands for the bias parameters of the output $\mathbf{b_o}$ and hidden layer $\mathbf{b_h}$.
With all the previous code, we can now design the entire model:
### Model class definition
class NoneLinearModel:
def __init__(self, weights, bias, hidden_activation="relu"):
self.n_layers = len(weights)
self.n_neurons = [weight.shape[-1] for weight in weights]
self.hidden_activation = hidden_activation
self.update_model(weights, bias)
def update_model(self, weights=[], bias=[]):
self.weights = weights
self.bias = bias
self.neurons = [Neuron(self.weights[i], self.bias[i]) for i in range(self.n_layers)]
def feed_forward(self, inputs):
n_samples = inputs.shape[0]
activations = [inputs]
activations += [np.zeros((n_samples, self.n_neurons[i])) for i in range(self.n_layers)]
for i in range(self.n_layers):
features = self.neurons[i].output(activations[i])
if i < (self.n_layers - 1):
activations[i+1] = relu(features) if self.hidden_activation == "relu" else logistic(features)
else:
for j in range(n_samples):
activations[i+1][j, :] = softmax(features[j, :])
return activations[1::]
def back_propagation(self, inputs, t):
n_samples = inputs.shape[0]
Jw = [np.zeros((weight.shape[0], weight.shape[1], n_samples)) for weight in self.weights]
Jb = [np.zeros((weight.shape[1], n_samples)) for weight in self.weights]
activations = self.feed_forward(inputs)
for i in range(n_samples):
# output layer
grad_output_cost = dv_cost_function(activations[1][i, :], t[i, :])
grad_output_activation = dv_softmax(self.neurons[1].output(activations[0][i, :]))
grad_output_neuron = self.neurons[1].grad(activations[0][i, :])
# hidden layer
grad_hidden_cost_zo = grad_output_cost @ grad_output_activation
grad_hidden_neuron_h = self.weights[1].T
grad_hidden_activation = dv_relu(self.neurons[0].output(inputs[i, :])) if self.hidden_activation == "relu" \
else dv_logistic(self.neurons[0].output(inputs[i, :]))
grad_hidden_neuron = self.neurons[0].grad(inputs[i, :])
# here we reshape the jacobian w.r.t. W so it can easily be subtracted from W
Jw[0][:, :, i] = np.reshape(grad_hidden_cost_zo @ grad_hidden_neuron_h @ grad_hidden_activation @ grad_hidden_neuron[0], self.weights[0].shape, order='F')
Jb[0][:, i] = grad_hidden_cost_zo @ grad_hidden_neuron_h @ grad_hidden_activation @ grad_hidden_neuron[1]
Jw[1][:, :, i] = np.reshape(grad_output_cost @ grad_output_activation @ grad_output_neuron[0], self.weights[1].shape, order='F')
Jb[1][:, i] = grad_output_cost @ grad_output_activation @ grad_output_neuron[1]
# sum of each sample's gradient, for each layer
Jw = [np.sum(Jw[l], axis=-1) for l in range(self.n_layers)]
Jb = [np.sum(Jb[l], axis=-1)[np.newaxis, :] for l in range(self.n_layers)]
return Jw, Jb
It is now time to train the model!
# learning phase
# hyper-parameters and model instantiation
lr = 0.01
n_iter = 1000
weights = [np.random.randn(2, 3), np.random.randn(3, 3)]
bias = [np.zeros((1, 3)), np.zeros((1, 3))]
cost_relu = np.array([])
cost_logits = np.array([])
relu_model = NoneLinearModel(weights=weights, bias=bias, hidden_activation="relu")
logits_model = NoneLinearModel(weights=weights, bias=bias, hidden_activation="logits")
for i in range(n_iter):
# backpropagation
Jw, Jb = relu_model.back_propagation(inputs=X, t=C)
weights = [relu_model.weights[l] - lr * Jw[l] for l in range(relu_model.n_layers)]
bias = [relu_model.bias[l] - lr * Jb[l] for l in range(relu_model.n_layers)]
relu_model.update_model(weights=weights, bias=bias)
# cost function
probs = relu_model.feed_forward(X)[-1]
cost_relu = np.append(cost_relu, cost_function(input=probs, t=C))
# backpropagation
Jw, Jb = logits_model.back_propagation(inputs=X, t=C)
weights = [logits_model.weights[l] - lr * Jw[l] for l in range(logits_model.n_layers)]
bias = [logits_model.bias[l] - lr * Jb[l] for l in range(logits_model.n_layers)]
logits_model.update_model(weights=weights, bias=bias)
# cost function
probs = logits_model.feed_forward(X)[-1]
cost_logits = np.append(cost_logits, cost_function(input=probs, t=C))
### plotting cost
plt.figure()
plt.subplot(1, 2, 1)
plt.plot(cost_relu, 'b-+')
plt.grid()
plt.xlabel('iter')
plt.ylabel('cost')
plt.title('Relu model cost')
plt.subplot(1, 2, 2)
plt.plot(cost_logits, 'b-+')
plt.grid()
plt.xlabel('iter')
plt.ylabel('cost')
plt.title('Logits model cost')
plt.show()
By comparing the decision functions of the relu and logits models, we see that the relu model converges faster but its decision boundary is made of straight segments. Thanks to the smoothness of the logistic activation, the logits model's decision function is less prone to error.
### decision functions
# decision function for logits model
n_samples = 200
Xl = np.linspace(-5, 5, num=n_samples)
Xm, Ym = np.meshgrid(Xl, Xl)
Df = np.zeros((n_samples, n_samples))
for i in range(n_samples):
for j in range(n_samples):
prob = logits_model.feed_forward( np.array([Xm[i, j], Ym[i, j]], ndmin=2) )[-1]
Df[i, j] = np.argmax(prob, axis=1)
cmap = ListedColormap([
colorConverter.to_rgba('r', alpha=0.3),
colorConverter.to_rgba('b', alpha=0.3),
colorConverter.to_rgba('g', alpha=0.3)])
plt.subplot(1, 2, 1)
plt.contourf(Xm, Ym, Df, cmap=cmap)
# ground truth for the inputs
plt.plot(X_a[:,0], X_a[:,1], 'ro', label='class $c_a$')
plt.plot(X_b[:,0], X_b[:,1], 'b*', label='class $c_b$')
plt.plot(X_c[:,0], X_c[:,1], 'g+', label='class $c_c$')
plt.grid()
plt.legend()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Logits model')
# decision function for relu model
for i in range(n_samples):
for j in range(n_samples):
prob = relu_model.feed_forward( np.array([Xm[i, j], Ym[i, j]], ndmin=2) )[-1]
Df[i, j] = np.argmax(prob, axis=1)
cmap = ListedColormap([
colorConverter.to_rgba('r', alpha=0.3),
colorConverter.to_rgba('b', alpha=0.3),
colorConverter.to_rgba('g', alpha=0.3)])
plt.subplot(1, 2, 2)
plt.contourf(Xm, Ym, Df, cmap=cmap)
# ground truth for the inputs
plt.plot(X_a[:,0], X_a[:,1], 'ro', label='class $c_a$')
plt.plot(X_b[:,0], X_b[:,1], 'b*', label='class $c_b$')
plt.plot(X_c[:,0], X_c[:,1], 'g+', label='class $c_c$')
plt.grid()
plt.legend()
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Relu model')
plt.show()
You should now have a deeper understanding of the mathematics behind a fully-connected neural network! In the real world, dense layers alone are rarely practical for large inputs because of the huge number of parameters they require. This is why convolutional neural networks are so widely used nowadays: they decrease the number of parameters and hence improve learning!
Writing the gradients out mathematically yourself, like we did, is prone to errors (especially for huge networks); that is why it is interesting to compute the gradients automatically. This is what all deep learning packages (like TensorFlow or PyTorch) do. Peter Roelants's blog post has a good explanation about that!
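A standard way to catch such errors is a finite-difference gradient check: perturb each parameter slightly, re-evaluate the cost, and compare the slope with the analytic gradient. A generic sketch (the quadratic toy cost here is just for illustration; in this post you would plug in `cost_function` composed with `feed_forward`):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences of a scalar function f at point w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d.flat[i] = eps
        g.flat[i] = (f(w + d) - f(w - d)) / (2 * eps)
    return g

# toy cost with a known analytic gradient: f(w) = ||w||^2 / 2, df/dw = w
f = lambda w: 0.5 * np.sum(w ** 2)
w = np.array([1.0, -2.0, 0.5])

analytic = w
numeric = numerical_grad(f, w)
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

If the two gradients disagree beyond the tolerance, the analytic derivation (or its implementation) has a bug.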
Check these nice animations if you want to understand how convolutions are used in neural networks.
The topic of understanding deep learning is hot; if you are interested you should definitely check the Distill blog.
Thanks to Peter Roelants, who owns a nice blog on machine learning. It helped me gain a deeper understanding of the mathematics behind neural networks. Some code was also inspired from his work.
1. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics. 5, 115–133 (1943).
2. Bridle, J.S.: Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. Neurocomputing. 227–236 (1990).
3. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Icml (2010).
PokéClicker is a famous online clicker game which has attracted a lot of players recently. Today we will focus on a specific Pokémon that can be obtained in a surprising way: MissingNo.
In the Pokémon series, there was a strange monster that appeared to some players: MissingNo. It could not be encountered in the wild, but rather after a bug like the old man glitch. This Pokémon resulted from a wrong Pokédex entry during a fight; it is actually not unique and has a lot of variations.
This Pokémon was added to PokéClicker mostly as an easter egg and cannot be obtained through normal means. It is used for error handling (if there is a bug within the game), and its Pokédex entry is #0. Fortunately, we can easily access the game's data to find a way to obtain it.
PokéClicker is an open-source, community-driven game that is mostly developed in JavaScript. The source code is available on GitHub if you are curious.
Most of the important functions are public, hence it is really easy to access the in-game memory using any web browser. Indeed, nowadays most if not all browsers act as an IDE and give access to a debugging environment. Thanks to that, anyone can modify the data completely offline! In the following, I will give the instructions for Firefox only, but they should be easily transferable to Chrome or Edge.
Press F12 to access the web console, or click on the settings icon (top right) then More tools / Web Developer Tools. Then call the function App.game.party.gainPokemonById with id=0 as a parameter: App.game.party.gainPokemonById(0)
Congratulations, you just obtained MissingNo. !
Note:
I should not tell you this, but you can of course change the id to another Pokémon...
There is also another optional parameter: if you want the Pokémon to be shiny, set it to 1: App.game.party.gainPokemonById(0, 1)
Obviously, this is not the only function you have access to!
If you want to cheat, type App.game and you will see the list of available classes/functions.
Check for example App.game.wallet.gainMoney 💰 😈 💰
Use it sparingly (or not at all) if you don't want to ruin your gameplay.
Human societies have always relied on social connection. Today, with the web, this is even more true: isn't the internet a huge nebula of data? Networks are also present in our brain, and are used to model energy transit.
The goal of this post is to explain how we can work with graphs supporting one-dimensional signals: from the data to the processing.
Mathematically, a graph is a pair $G = (V, E)$ where $V$ is the set of vertices and $E$ the set of edges. The vertices (or nodes) represent the actual data that is broadcast through the graph: each node sends its own signal that propagates through the network. The edges define the relations between these nodes through the weights $w$; the larger the weight of an edge, the stronger the relation between the two nodes. Examples of weights are the distance between vertices, signal correlation or even mutual information...
A graph has many properties, but the most important one is the direction of its edges. We say that a graph is undirected if for any two nodes $a$ and $b$, $w_{ab} = w_{ba}$.
Let's look at the following example which is an unweighted undirected graph (if it was directed we would add arrows instead of lines).

We can use this adjacency matrix $A$ to represent it numerically: \begin{equation} A = \begin{pmatrix} 0 & 1 & 0 & 0 & 1 & 0\\ 1 & 0 & 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 1 & 1\\ 1 & 1 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\\ \end{pmatrix} \end{equation}
Each row corresponds to the starting node, each column to the ending node. Because the graph is unweighted, all weights are fixed to $1$, and because it is undirected, the adjacency matrix is symmetric.
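In NumPy, the example adjacency matrix and its undirectedness check look like this:

```python
import numpy as np

# adjacency matrix of the example graph (6 nodes, unweighted)
A = np.array([
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
])

# undirected <=> symmetric adjacency matrix
print(np.array_equal(A, A.T))  # True

# number of edges incident to each node
print(A.sum(axis=1))  # [2 3 2 3 3 1]
```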
In graph signal processing (GSP), the object of study is not the network itself but a signal supported on the vertices of the graph. The graph provides a structure that can be exploited during data processing. The signal can be of any dimension: a single pixel value in an image, images captured by different people, or traffic data from radars as we will see later!
If we would assign a random gaussian number for each node of the previous example, the input graph signal $X$ would look like this:
\begin{equation} X = \begin{pmatrix} -1.1875 \\ -0.1841 \\ -1.1499 \\ -1.8298 \\ 1.9561 \\ 0.6322 \end{pmatrix} \end{equation}For multi-dimensional signals (images, volumes), it is common to embed them into a one-dimensional latent space to simplify the graph processing. For example, if you are working with text features (a tweet from one user, or e-mails), one may use word2vec [1].
What we want is to be able to perform operations on those node signals, taking into account the structure of the graph. The most common operation is to filter the signal, or in mathematical term a convolution between a filter $h$ and a function $y$. For example in the temporal domain $t$: \begin{equation} y(t) \circledast h(t) = \int_{-\infty}^{+\infty}{y(t-\tau) h(\tau) d\tau} \end{equation}
Sadly, this is not directly possible in the graph (vertex) domain because the translation of a signal by $\tau$ is not defined in the context of graphs [2]. How do you perform the convolution then?

Fortunately, it is much simpler to convolve in the frequency (spectral) domain because we do not need the shift operator there. This is also useful in signal processing in general because it is computationally less intensive: \begin{equation} \label{fourierconvolution} y(t) \circledast h(t) = Y(\omega) \cdot H(\omega) \end{equation}
To transform the input data (function) into the frequency domain, we use the widely known fourier transform.
It is a reversible, linear transformation that is simply the decomposition of a signal into a new basis formed by complex exponentials. The set of complex exponentials is orthogonal (independent), which is a fundamental property to form a basis.
\begin{equation} y(t) \xrightarrow{\mathscr{F}} Y(\omega) = \int_{-\infty}^{\infty} \! y(t) \mathrm{e}^{-i\omega t}\, dt \end{equation}Note:
In practice, it is not perfectly reversible because we lose information when applying a Fourier transform numerically: we cannot afford, computationally, to integrate the function to infinity! We usually use the DFT (Discrete Fourier Transform), which makes sense numerically with digital data.
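As a concrete illustration, the DFT of a sampled sinusoid concentrates its energy at the signal's frequency. A sketch using `numpy.fft` (the 100 Hz sampling rate and the 5 Hz sinusoid are arbitrary choices):

```python
import numpy as np

fs = 100                        # sampling frequency (Hz)
t = np.arange(0, 1, 1 / fs)     # 1 second of samples
y = np.sin(2 * np.pi * 5 * t)   # 5 Hz sinusoid

Y = np.fft.fft(y)
freqs = np.fft.fftfreq(len(y), d=1 / fs)

# the spectrum peaks at +/- 5 Hz, the frequency of the sinusoid
peak = freqs[np.argmax(np.abs(Y))]
print(abs(peak))  # 5.0
```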
However, this formula works in the temporal domain and cannot be applied as-is to signals in the graph domain $G$. So how do we define a graph Fourier transform? For that, we first need to understand the eigendecomposition of the Laplacian and its connection to the Fourier transform...
The eigendecomposition is a process that is heavily used in dimensionality reduction algorithms. For any linear transformation $T$, there exists a non-zero vector (function) $\textbf{v}$, called an eigenvector (eigenfunction), such that:
\begin{equation} \label{eq:eigenfunction} T(\textbf{v}) = \lambda \textbf{v} \end{equation}where $\lambda$ is a scalar called the eigenvalue, i.e. the importance of each eigenvector. This formula basically means that when we apply the matrix $T$ to $\mathbf{v}$, the resulting vector is collinear to $\mathbf{v}$. The eigenvectors $\mathbf{v}$ can thus be used to define a new basis for $T$!
What is of interest for us is that the complex exponential $\mathrm{e}^{i\omega t}$ is also an eigenfunction of the Laplacian operator $\Delta$. Following eq \ref{eq:eigenfunction}, we can derive: \begin{equation} \Delta(e^{i \omega t}) = \frac{\partial^2}{\partial{t^2}} e^{i \omega t} = -\omega^2 e^{i\omega t} \end{equation}
with $\mathbf{v}=e^{i \omega t}$ and eigenvalues $\lambda = -\omega^2$, which makes sense since we are decomposing the temporal signal into the frequency domain!
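We can verify this eigenfunction property numerically, approximating the second derivative with central finite differences (the value $\omega = 3$ and the sampling grid are arbitrary):

```python
import numpy as np

omega = 3.0
t = np.linspace(0, 2 * np.pi, 2001)
dt = t[1] - t[0]
v = np.exp(1j * omega * t)   # the candidate eigenfunction e^{i omega t}

# second derivative by central finite differences
d2v = (v[2:] - 2 * v[1:-1] + v[:-2]) / dt ** 2

# it should match -omega^2 * v (up to discretization error)
print(np.allclose(d2v, -omega ** 2 * v[1:-1], atol=1e-3))  # True
```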
We can rewrite the fourier transform $Y(\omega)$ using the conjugate of the eigen function of the laplacian $\mathbf{v}^*$:
\begin{equation} \label{eq:fouriereigen} Y(\omega) = \int_{-\infty}^{\infty} \! y(t) \mathbf{v}^*\,dt \end{equation}In other words, the expansion of $y$ in terms of complex exponentials (the Fourier transform) is analogous to the expansion of $y$ in terms of the eigenvectors of the Laplacian. Still, that does not tell us how to apply this to graphs!
There is one specific operation that is well defined for a graph: the Laplacian operator. This is the connection we were looking for, and we now have a way to define the graph Fourier transform $\mathscr{G}_\mathscr{F}$!
The graph Laplacian $L$ is the analogue of the second-order derivative for a graph, and is simply defined as:
\begin{equation} L = D-A \end{equation}where $D$ is the degree matrix, whose diagonal entries count the number of edges connected to each node.
Intuitively, the graph Laplacian shows in what directions and how smoothly the “energy” will diffuse over a graph if we put some “potential” in node $i$. With the previous example it would be:
\begin{equation} L = \begin{pmatrix} 2 & 0 & 0 & 0 & 0 & 0\\ 0 & 3 & 0 & 0 & 0 & 0\\ 0 & 0 & 2 & 0 & 0 & 0\\ 0 & 0 & 0 & 3 & 0 & 0\\ 0 & 0 & 0 & 0 & 3 & 0\\ 0 & 0 & 0 & 0 & 0 & 1\\ \end{pmatrix} - \begin{pmatrix} 0 & 1 & 0 & 0 & 1 & 0\\ 1 & 0 & 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 1 & 1\\ 1 & 1 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\\ \end{pmatrix} \end{equation}A node that is connected to many other nodes will have a much bigger influence than its neighbours. To mitigate this, it is common to normalize the Laplacian matrix using the following formula:
\begin{equation}\label{eq:normlaplacian} L = I - D^{-1/2}AD^{-1/2} \end{equation}with the identity matrix $I$.
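A sketch of both Laplacians for the example graph in NumPy (here the degree matrix counts only incident edges, the standard convention):

```python
import numpy as np

A = np.array([
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
])

deg = A.sum(axis=1)           # number of edges per node
L = np.diag(deg) - A          # combinatorial Laplacian L = D - A

# normalized Laplacian: I - D^{-1/2} A D^{-1/2}
d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_norm = np.eye(len(deg)) - d_inv_sqrt @ A @ d_inv_sqrt

# both are symmetric; the combinatorial Laplacian's rows sum to zero
print(np.allclose(L, L.T), np.allclose(L.sum(axis=1), 0))  # True True
```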
After computing the eigenvectors $\mathbf{v}$ of the graph Laplacian $L$, we can derive from eq \ref{eq:fouriereigen} the numerical version of the graph Fourier transform $\mathscr{G}_\mathscr{F}$ for $N$ vertices:
\begin{equation} \label{eq:graphfouriernumeric} x(i) \xrightarrow{\mathscr{G}_\mathscr{F}} X(\lambda_l) =\sum_{i=0}^{N-1} x(i)\mathbf{v}_l^T(i) \end{equation}with $x(i)$ being the signal of node $i$ in the graph/vertex domain, which was a single value in our first example.
The matrix form is:
\begin{equation} \label{eq:matrixgraphfouriernumeric} \hat{X} = V^TX \end{equation}Now that we can transform the input data from the graph/vertex domain (node $i$) to the spectral domain (frequency $\lambda_l$), we can apply eq \ref{fourierconvolution} to perform the graph convolution. We convolve the data $x$ with the filter $h$ in the spectral domain $\lambda_l$, but we want our output back in the graph domain $i$:
\begin{align} x(i) \circledast h(i) & = \mathcal{G}^{-1}(\mathcal{G}(x(i) \circledast h(i))) & \\ & = \sum_{l=0}^{N-1} \hat{x}(\lambda_l) \cdot \hat{g}(\lambda_l) \cdot \mathbf{v}_l^T(i) & \\ \end{align}and its matrix form [3]:
\begin{align}\label{eq:graphconv} X \circledast H & = V \hat{H} * (V^T X) \end{align}Warning:
Be careful: the multiplication between the filter $\hat{H}$ and the transformed data $V^TX$ in the spectral domain is element-wise!
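Putting the graph Fourier transform and the spectral convolution together, here is a minimal low-pass filtering sketch; the graph is the example above, the node signal is random, and the filter $\hat{H}$ (keeping only the two lowest graph frequencies) is an arbitrary illustrative choice:

```python
import numpy as np

A = np.array([
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
])
L = np.diag(A.sum(axis=1)) - A

# eigendecomposition of the Laplacian: the columns of V form the
# graph Fourier basis, lam are the graph frequencies (ascending)
lam, V = np.linalg.eigh(L.astype(float))

rng = np.random.default_rng(0)
X = rng.standard_normal(6)                 # one scalar signal per node

X_hat = V.T @ X                            # graph Fourier transform
H_hat = (np.arange(6) < 2).astype(float)   # keep the 2 lowest frequencies
X_filtered = V @ (H_hat * X_hat)           # element-wise product, inverse GFT

# sanity check: an all-pass filter recovers the original signal exactly
print(np.allclose(V @ (np.ones(6) * X_hat), X))  # True
```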
One drawback of this method is that it does not work well for dynamic graphs (structures that change over time), because the eigenvectors need to be recomputed every time! You can also see that this operation is quite costly; fortunately, it can be simplified using, for example, Chebyshev polynomials [4]. The Chebyshev approximation performs local convolutions (taking into account just adjacent nodes) instead of a global convolution, which considerably reduces the computation time.
Enough theory, now let's get our hands dirty. We will look into traffic data from the Bay Area and try to analyze it using graph processing.
The data we will be using was originally studied for traffic forecasting in [5]. It consists of several sensors in the Bay Area (California) that recorded the speed of cars (in miles/hour) during 6 months of 2017. It is publicly available and was collected by the California Transportation Agencies Performance Measurement System (PeMS).
There are two files: sensor_locations_bay_area.csv contains the locations of each sensor (which can be used to define the structure of the graph), while traffic_bay_area.csv contains the drivers' speeds gathered by each sensor (the node feature).
### imports
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
import IPython.display
import scipy
import scipy.stats
import networkx as nx
# reading data
df = pd.read_csv("data/gsp/traffic_bay_area.csv")
IPython.display.display(df)
data = df.to_numpy()[:, 1:].astype(np.float32).T
# select a specific date for further analysis
start = np.where(df["Date"] == "2017-04-03 00:00:00")[0][0]
end = np.where(df["Date"] == "2017-04-03 23:30:00")[0][0]
end_3 = np.where(df["Date"] == "2017-04-06 23:30:00")[0][0]
#other variables
sensor_ids = df.columns[1:].astype(str)
num_nodes = len(sensor_ids)
All the sensors are located in the Bay Area of San Francisco, as you can see on the following map.
# get sensor locations
locs_df = pd.read_csv("data/gsp/sensor_locations_bay_area.csv")
locs = locs_df.to_numpy()[:, 1:]
### Plot sensor locations on a map
fig = px.scatter_mapbox(locs_df, lat="lattitude", lon="longitude", hover_name="sensor_id",
color_discrete_sequence=["red"], zoom=9.5)
fig.update_layout(mapbox_style="stamen-terrain", margin={"r": 0,"t": 0,"l": 0,"b": 0}, height=400)
fig.show(renderer="iframe_connected", config={'showLink': False})
The most important step when constructing a graph is defining the relationships between nodes. Mathematically, we craft a positive metric bounded in $[0, 1]$ that should be high when the relation between two nodes is strong, and as linear as possible. We can then model the graph using the adjacency matrix, where each row/column index corresponds to one node ID.
Here we use the Euclidean distance between nodes as a metric (the distance matrix $D$). To simplify this post, we will assume a weighted undirected graph (which in practice may not be ideal, since cars travel in one direction).
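As an aside, a common alternative (used for instance in [5]) is a thresholded Gaussian kernel of the distances. Here is a minimal sketch; the `threshold` value and the choice of $\sigma$ as the standard deviation of the distances are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel_adjacency(D, threshold=0.1):
    """Build adjacency weights W_ij = exp(-d_ij^2 / sigma^2),
    zeroing edges whose weight falls below `threshold`.
    `D` is a matrix of (non-squared) pairwise distances."""
    sigma = D.std()
    W = np.exp(-np.square(D / sigma))
    W[W < threshold] = 0.0
    return W

# toy example: three nodes on a line at positions 0, 1 and 10
pos = np.array([0.0, 1.0, 10.0])
D = np.abs(pos[:, None] - pos[None, :])
W = gaussian_kernel_adjacency(D)
```

The Gaussian kernel decays smoothly with distance, so close nodes get weights near 1 while far nodes are pruned by the threshold.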
Let's first try to build the adjacency matrix using the sensor locations. We will compute the Euclidean distance between each pair of sensors to define the relationship between nodes.
We are working with geodesic positions, which live on a sphere, so it is not advised to use them directly to compute Euclidean distances. We first need to project them onto a plane; under the assumption that the measurements are close to each other, one can use the equirectangular projection.
# equirectangular projection from geodesic positions
locs_radian = np.radians(locs)
phi0 = min(locs_radian[:, 0]) + (max(locs_radian[:, 0]) - min(locs_radian[:, 0]))/2
r = 6371 #earth radius 6371 km
locs_xy = np.array([r*locs_radian[:, 1]*np.cos(phi0), r*locs_radian[:, 0]]).T
Now we can compute the distances.
# euclidean distance matrix
Xi, Xj = np.meshgrid(locs_xy[:, 0], locs_xy[:, 0])
Yi, Yj = np.meshgrid(locs_xy[:, 1], locs_xy[:, 1])
D = (Xi - Xj)**2 + (Yi - Yj)**2
max_distance = np.max(D)
print(f"Maximum squared distance is: {max_distance:.3f}km2")
We craft the adjacency matrix $A$ by inverting and normalizing the current distance matrix $D$, so the edge weights are high when the relation is strong (nodes are spatially close). The resulting matrix is symmetric, with the number of rows/columns equal to the number of sensors.
# create adjacency matrix
A = np.copy(D)
#inverting
A = np.max(A) - A
#normalizing
A = A / np.max(A)
### plotting original adjacency
def plot_adjacency(A, title="", xlabel="", vmin=0, vmax=1):
    '''Plot the adjacency matrix with histogram'''
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
    img = ax[0].imshow(A, cmap="hot", vmin=vmin, vmax=vmax)
    plt.colorbar(img, ax=ax[0])
    ax[0].set_xlabel("Node $i$")
    ax[0].set_title(title)
    ax[1].hist(A[A > 0].flatten())
    ax[1].set_xlabel(xlabel)

plot_adjacency(A, title="Adjacency matrix", xlabel="Squared distance (km2)")
Because we have a lot of nodes (325 sensors), the resulting graph is huge, with $325\times325 > 10^5$ edges! Using a graph that has this many edges is not practical in our case, as it would require more CPU power and high RAM usage, hence we need to downsample the graph.
The topic of downsampling or pre-processing spatiotemporal graphs is itself a really active field [6]; here we use a simple thresholded approach called the nearest neighbor graph (k-nn): keep the $k$ most important neighbors (for each node). This prevents orphan nodes compared to a basic thresholding approach.
An additional step is removing the so-called "self-loops" (i.e. making the diagonal zero), so nodes do not have a relation with themselves.
# thresholded adjacency matrix
k = 20
#removing self loops
np.fill_diagonal(A, 0)
#get nn thresholds from quantile
quantile_h = np.quantile(A, (num_nodes - k)/num_nodes, axis=0)
mask_not_neighbours = (A < quantile_h[:, np.newaxis])
A[mask_not_neighbours] = 0
A = (A + A.transpose())/2
A = A / np.max(A)
Looking at the adjacency matrix itself, it is now much cleaner. We see that the adjacency matrix is really sparse; this is because sensor indices do not have spatial meaning (sensors $i=0$ and $i=1$ can be far away).
###plotting
plot_adjacency(A, title="Pre-processed adjacency matrix")
But is it logical to just use the distance matrix as a metric? What happens if there are sensors that are spatially really close, but not on the same lane?
Take for example two sensors #400296 and #401014 (near the highway interchange California 237/Zanker road on the map) and let's look at their traffic measurements on Monday 3 Apr 2017.
### plot 2 uncorrelated sensors, but spatially close (#400296 and #401014)
def get_data_sensor(sensor_name, sensor_ids, start=0, end=None):
    '''Get the measurement from a sensor name'''
    idx = np.where(sensor_ids == sensor_name)[0][0]
    sensor_data = data[idx, start:end]
    return sensor_data, idx

def plot_sensor_data_comparison(sensor1_name, sensor2_name):
    '''Plot a view of two sensors, for comparison.'''
    #get sensor data
    sensor1_data, sensor1_idx = get_data_sensor(sensor_name=sensor1_name, sensor_ids=sensor_ids, start=start, end=end)
    sensor2_data, sensor2_idx = get_data_sensor(sensor_name=sensor2_name, sensor_ids=sensor_ids, start=start, end=end)
    #plot
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,4))
    ax[0].plot(sensor1_data)
    ax[0].set_title("Data for sensor {}".format(sensor1_name))
    ax[1].plot(sensor2_data)
    _ = ax[1].set_title("Data for sensor {}".format(sensor2_name))
    #correlation
    corr = np.corrcoef(sensor1_data, sensor2_data)[0, 1]
    print("Distance of {:.3f}km with correlation of {:.3f}".format(np.sqrt(D[sensor1_idx, sensor2_idx]), corr))
sensor1_name = "400296"
sensor2_name = "401014"
plot_sensor_data_comparison(sensor1_name, sensor2_name)
Those two are barely correlated, whereas the more distant sensors #400296 and #400873 are much more correlated (because they are on the same direction of traffic).
### plot 2 correlated sensors (#400296 and #400873)
sensor1_name = "400296"
sensor2_name = "400873"
plot_sensor_data_comparison(sensor1_name, sensor2_name)
To take this into account, we use the correlation matrix to filter out "bad" edges. We compute the correlation over a single day only (to mitigate seasonal effects), and every edge with a correlation below $0.7$ is discarded.
# correlation matrix and filtering
C = np.cov(data[:, start:end])
Cj, Ci = np.meshgrid(np.diag(C), np.diag(C))
C = np.abs(C/(np.sqrt(Ci*Cj)))
# filter out edges with correlation below 0.7
A = A * (C > 0.7)
### plot graph
#nx variables
pos = {i : (locs_xy[i, 0], locs_xy[i, 1]) for i in range(num_nodes)}
nx_graph = nx.from_numpy_array(A)  # from_numpy_matrix was removed in networkx 3.x
#rendering
fig, ax = plt.subplots(figsize=(8, 8))
nx.draw(nx_graph, pos, node_size=100, node_color='b', ax=ax)
_ = ax.set_title("Graph of traffic data from bay area")
This graph mainly follows the structure of the main roads in the Bay Area, because this is where the sensors are. Still, it also features some interconnections, mostly due to the inner road connections of the city.
Now that we have defined the graph structure, we can start the spectral decomposition. We will need it to perform the graph convolution, and there are also interesting visualizations that help to better understand the graph structure.
As a reminder, we apply the eq \ref{eq:normlaplacian} using the degree matrix and the adjacency matrix.
# Degree and laplacian matrix for distance and correlation graph
degree_matrix = np.diag(np.sum(A > 0, axis=0))
np.seterr(divide='ignore')
degree_matrix_normed = np.power(degree_matrix, -0.5)
degree_matrix_normed[np.isinf(degree_matrix_normed)] = 0
L = np.identity(num_nodes) - (degree_matrix_normed @ A @ degree_matrix_normed)
### plot
plot_adjacency(-L, title="Laplacian matrix", vmin=0, vmax=0.2)
Once the Laplacian is defined, we can eigendecompose the matrix. Because our matrix is real and symmetric, we can use np.linalg.eigh, which is faster than the more general np.linalg.eig.
# Eigen decomposition
l, v = np.linalg.eigh(L) #lambda:l eigen values; v: eigenvectors
An important thing to check is whether the eigenvectors carry a notion of frequency; for a graph, that is simply the number of times each eigenvector changes sign, or the number of zero-crossings [8].
# zero-crossing
derivative = np.diff(np.sign(v), axis=0)
zero_crossings = np.sum(np.abs(derivative) > 0, axis=0)
Noise aside, the tendency is that the larger the eigenvalue $\lambda_l$, the higher the frequency (more zero-crossings between nodes). Also, all the eigenvalues are real and non-negative, which would not be guaranteed if the Laplacian matrix were not symmetric (it is positive semi-definite).
### plot zero crossings
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(l, zero_crossings, marker="+")
ax.set_title("Number of zero crossing")
_ = ax.set_xlabel("$\lambda_l$")
We can also try to interpret what the eigenvectors are responsible for by projecting them on the graph (similar to what we would do to analyze deep learning filters). In the figure below, it is clear that an eigenvector with a higher eigenvalue leads to a higher frequency on the graph.
Some clear patterns can be distinguished; for example, eigenvector #1 seems to be responsible for the eastern sensors.
### Eigen vector plot and visualization
eig_to_plot = [1, 40, 120]
fig, ax = plt.subplots(nrows=2, ncols=len(eig_to_plot), figsize=(5*len(eig_to_plot), 10))
for ii, ii_eig in enumerate(eig_to_plot):
    #eigen plots
    img = ax[0, ii].plot(v[:, ii_eig])
    ax[0, ii].set_title("$\lambda_{{{}}}={{{:.3f}}}$"
                        "".format(ii_eig, l[ii_eig]))
    #eigen vectors on the graph
    node_colors = np.array([val for val in v[:, ii_eig]])
    img = nx.draw_networkx_nodes(nx_graph, pos,
                                 node_color=node_colors,
                                 cmap="jet", node_size=50, ax=ax[1, ii])
    plt.colorbar(img, ax=ax[1, ii])
    _ = ax[1, ii].set_title("$\mathbf{{v}}_{{{}}}$ on graph"
                            "".format(ii_eig))
It is relatively easy to apply clustering, even with unlabelled data. What you want to do is perform a simple unsupervised clustering (like k-means) on the lowest eigenvectors. Why on the low frequencies? Because they carry the power of the signal, while the highest frequencies carry the dynamics of the graph. Here we will use the first $5$ eigenvectors.
# kmean clustering
from sklearn.cluster import KMeans

sk_clustering = KMeans(n_clusters=2, random_state=0).fit(v[:, :5])
This results in the clustering of some western sensors; interestingly, it does not cluster all the sensors on the same road, but just those that are on the same lane/direction. This is because of the constraint that we put on the graph by taking into account the correlation of the data!
### plotting
fig, ax = plt.subplots(figsize=(5, 5))
nx.draw_networkx_nodes(nx_graph, pos, nodelist=list(np.where(sk_clustering.labels_==0)[0]),
node_size=50, node_color='b', ax=ax)
nx.draw_networkx_nodes(nx_graph, pos, nodelist=list(np.where(sk_clustering.labels_==1)[0]),
node_size=50, node_color='r', node_shape="+", ax=ax)
_ = ax.set_title("Clustering the low frequency nodes")
We saw in the mathematical background section that we can easily filter a signal in the spectral domain, let's implement this using eq \ref{eq:graphconv}.
# graph convolution
def graph_convolution(x, h, v):
    '''Graph convolution of a filter with data

    Parameters
    ----------
    x : `np.array` [float], required
        ND input signal in vertex domain [n_nodes x signal]
    h : `np.array` [float], required
        1D filter response in spectral domain
    v : `np.array` [float]
        2D eigen vector matrix from graph laplacian

    Returns
    -------
    `np.array` [float] : the output signal in vertex domain
    '''
    # graph fourier transform: input signal into spectral domain
    x_g = v.T @ x
    # convolution with filter, all in spectral domain
    x_conv_h = h * x_g
    # inverse graph fourier transform to get the result back in vertex domain
    out = v @ x_conv_h
    return out
In classical signal processing, a filter can easily be translated into the temporal domain. This is not possible in GSP (translating a filter into the graph domain); however, it is possible to check its response in our system by convolving the filter with a dirac. Let's apply a heat filter on our graph and see its response in the vertex domain.
# heat filter
def heat_filter(x):
    '''Heat filter response in spectral domain'''
    out = np.exp(-10*x/np.max(x))
    return (1/out[0])*out
### plot frequency response
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(l, heat_filter(l))
ax.set_xlabel("$\lambda_l$")
_ = ax.set_title("Heat filter frequency response")
Here is the result when applying the heat filter to $\delta_{0}$ (at the position of node #400001, located at the US 101/I-880 interchange).
# apply heat filter at the position of #400001
sensor_name = "400001"
_, sensor_idx = get_data_sensor(sensor_name=sensor_name, sensor_ids=sensor_ids)
dirac_from_sensor = np.zeros((num_nodes, 1))
dirac_from_sensor[sensor_idx] = 1
out = graph_convolution(x=dirac_from_sensor, h=heat_filter(l)[:, np.newaxis], v=v)
### plotting signal on graph domain
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
node_colors = np.array([val for val in out.flatten()])
img = nx.draw_networkx_nodes(nx_graph, pos,
node_color=node_colors,
cmap="jet", node_size=50, ax=ax[0])
plt.colorbar(img, ax=ax[0])
_ = ax[0].set_title("$h \otimes \delta_0$")
ax[1].plot(out.flatten())
_ = ax[1].set_xlabel("node $i$")
This can help us interpret how the energy diffuses over the system; for this specific example, it can be a tool for engineers to see where traffic congestion can happen. Good luck to the drivers in the center of the Bay Area!
Applying this filter to the sensor measurements acts as a (really strong) low-pass filter, as you can see below.
# apply heat filter to all data
sensor_name = "400001"
_, sensor_idx = get_data_sensor(sensor_name=sensor_name, sensor_ids=sensor_ids)
out = graph_convolution(x=data, h=heat_filter(l)[:, np.newaxis], v=v)
### plot before and after filtering
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,4))
ax[0].plot(data[sensor_idx, start:end_3])
ax[0].set_title("Sensor {} before filtering".format(sensor_name))
ax[0].set_ylim(10, 80)
ax[0].set_xlabel("Time")
ax[0].set_ylabel("Speed (mph)")
ax[1].plot(out[sensor_idx, start:end_3])
ax[1].set_title("Sensor {} after filtering".format(sensor_name))
_ = ax[1].set_ylim(10, 80)
plt.show()
Why is the resulting signal lower? I suppose this is because the offset component is partially canceled out by the filtering (in our case the minimum frequency is not zero but $\lambda_0 = 0.048$).
This post showed how the graph signal processing theory was built, and how to apply it concretely. We saw a specific example using traffic data, and although we could go much further in the analysis, I hope it answered some of your questions.
A first reference that helped me a lot with my understanding is the Stanford class CS224W: Machine Learning with Graphs. The Persagen Consulting web page, which specializes in molecular genetics, is also a really good resource.
The extension of GSP applied to deep learning is the hot topic of graph convolutional networks (GCN). If you are curious about them, definitely look at this wonderful distill interactive post. Also, I was lucky to work with Dr Zhang specifically on applying GCNs to neuroimaging data (fMRI), check her work!
Finally, the set of tutorials from pygsp are also a great way to understand each component of GSP.
1. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. (2013).
2. Nonato, L.G.: Graph Fourier Transform, (2017). https://sites.icmc.usp.br/gnonato/gs/slide3.pdf.
3. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. (2016).
4. Hammond, D.K., Vandergheynst, P., Gribonval, R.: Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis. 30, 129–150 (2011).
5. Li, Y., Yu, R., Shahabi, C., Liu, Y.: Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926. (2017).
6. Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 631–636 (2006).
7. Hagberg, A.A., Schult, D.A., Swart, P.J.: Exploring Network Structure, Dynamics, and Function using NetworkX. In: Varoquaux, G., Vaught, T., and Millman, J. (eds.) Proceedings of the 7th Python in Science Conference. pp. 11–15. , Pasadena, CA USA (2008).
8. Shuman, D.I., Narang, S.K., Frossard, P., Ortega, A., Vandergheynst, P.: The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine. 30, 83–98 (2013).
This post will guide you step by step on how to access your video files over LAN wifi without the official app, directly from your GoPro to your computer.
GoPro is a well known action camera brand founded by Woodman Labs. Like many big manufacturers, they like to build their own ecosystem to keep their clients inside it (so clients won't easily switch products, because they are used to the frontend). This is a pure marketing technique that I personally hate, and it makes simple things like copying files from one device to another a living hell (hi Apple!).
In this case, we have videos captured from a GoPro, and we want to transfer them to our computer. The easiest way is to download the GoPro software or app, and download them from there. If you are like me and don't like installing useless software on your computer (to keep it clean, not a garbage dump), follow this guide!
Before you start, make sure you have a stable wifi and enable it on your desktop. Since we will use wifi to connect to the GoPro device, if you are not using ethernet you will lose your internet connection. Optionally, if you want to automatically scrape all the content onto your desktop, you will need to install wget.
sudo apt install wget
If you have the app installed, it asks you to put your GoPro into application mode. What happens is that your GoPro actually creates a local web server, and the app accesses this server to let you download the files. Instead, we will bypass the app and access the GoPro server directly!
On the GoPro, enable remote connections under settings/connections/remote connections/yes, then put it into application mode via settings/connections/connect a peripheral/gopro application.
Warning
Plugging in your GoPro is important because when the device is put into application mode, it won't put itself to sleep. Also be careful that the camera (and especially the lens!) doesn't get too hot.
Now we want to access the server to be able to download files. Place the GoPro near your desktop, then:
Connect to the wifi network named GPXXXXXXXX (with 8 digits); the password can be found under settings/connections/camera info. When you are connected, you should be able to access your files locally at http://10.5.5.9:8080/videos/DCIM/100GOPRO/. Voila! You have access to your files, and you can play them in your browser or download them manually.
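If you prefer to script this, older GoPro models also expose a JSON media list over the same server (I believe at /gp/gpMediaList, but the endpoint and JSON layout may vary by model, so treat this as an assumption). A small sketch that builds download URLs from such a media list:

```python
import json
from urllib.request import urlopen

BASE = "http://10.5.5.9:8080"

def mp4_urls(media_list):
    """Extract the download URL of every *.MP4 file from a GoPro-style
    media-list dict: {"media": [{"d": <folder>, "fs": [<files>]}]}."""
    urls = []
    for folder in media_list.get("media", []):
        for f in folder.get("fs", []):
            if f["n"].upper().endswith(".MP4"):
                urls.append(f"{BASE}/videos/DCIM/{folder['d']}/{f['n']}")
    return urls

# against a live camera (endpoint assumed, may differ per model):
# media_list = json.load(urlopen(f"{BASE}/gp/gpMediaList"))
sample = {"media": [{"d": "100GOPRO",
                     "fs": [{"n": "GX010001.MP4"}, {"n": "GX010001.LRV"}]}]}
urls = mp4_urls(sample)
```

The function also conveniently skips the *.LRV preview files mentioned below.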
Note
The actual important video files are the *.MP4 ones; the other *.LRV files are a specific GoPro format used by their app for previews.
Of course, downloading the files one by one is a time consuming process. Ideally you would like to get all the videos directly inside a folder:
cd my/folder
wget -A .MP4 -r --no-parent http://10.5.5.9:8080/videos/DCIM/100GOPRO/
Note
Normally it should download at a relatively high speed (>5MB/s), but it depends on the speed of your LAN network.
There is a good chance that you will not be able to play the media in your browser. This is because GoPro uses the HEVC format (the successor of h264) by default, and most browsers do not support this format. If you need more information about the HEVC format, check this amazing post from mozilla; to summarize, it allows more efficient compression compared to its predecessor.
To disable HEVC, go to the GoPro settings under general/video compression/h.264 and HEVC.
If you already have files in HEVC, you can still transcode them to h.264 using ffmpeg.
As an example if you have already compiled ffmpeg with nvidia:
ffmpeg -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i GXxxxxxx.MP4 -map 0:a:0 -c:a copy -map 0:v:0 -c:v:0 h264_nvenc -preset slow -b:v 36M GXxxxxxx_h264.mp4
To guide you on which options to use for compression, check the ffmpeg wiki.
Because the GoPro acts as a server, there is nothing stopping you from accessing the media on a device other than your desktop. For example, you could view the videos on your TV with a kodi device, remotely and without buying the hdmi cable.
This post is about the random number generation of the MMO New World. It should help you understand what the requirements are to gather a rare resource.
New World™ is a massively multiplayer online role-playing game (MMORPG) developed by Amazon Games Orange County. In this game, you play as a colonizer in the mid-seventeenth century. Your goal is to survive on this island by crafting, trading, and killing opponents (be they monsters or other players), all of that in an atmosphere surrounded by mystery and magic (powered by the "azoth" resource).
To perform well in this universe, you need to constantly improve yourself by upgrading your armor, buying consumables, or leveling up... This is why trading and crafting are so critical in MMOs, and more specifically collecting resources (ores, herbs etc...) and their rare versions. One crucial component is the random number generator (RNG) running on the server: this is what determines your loot.
I will not elaborate too much on this since Mala Zedik already made a great post on it (by the way, good luck on his "pacifist" run!). To summarize: each time you gather/kill something, you roll a "virtual" dice. The number of faces depends on the thing you are trying to loot (monster, chest...), and 82% of them are based on a 100 000 dice roll (ROL). If you reach a certain threshold, then you are eligible to gain that rare loot; if not, too bad!
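A toy simulation of that mechanic (the 95 000 threshold here is made up for illustration; only the 100 000-sided roll comes from the description above):

```python
import random

N_FACES = 100_000
THRESHOLD = 95_000   # hypothetical: only rolls above this win the rare loot

def roll_for_rare(rng):
    """One gather/kill: roll the virtual 100 000-sided dice."""
    return rng.randint(1, N_FACES) > THRESHOLD

rng = random.Random(42)
trials = 200_000
wins = sum(roll_for_rare(rng) for _ in range(trials))
rate = wins / trials  # empirical drop rate, close to 5 000/100 000 = 0.05
```

With a 95 000 threshold, 5 000 of the 100 000 faces succeed, so the empirical rate converges to $p = 0.05$, which is the value we will plug into the Bernoulli math below.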
Given the probability of obtaining one (rare) resource, one can easily infer the minimum number of tries needed (i.e. how many monsters need to be killed / resources need to be gathered) using the Bernoulli distribution.
We define the probability $p$ as the probability to get the rare resource, and $q = 1-p$ as the probability to miss it. Let $X$ be a random variable following $n$ Bernoulli trials; given that dice rolls are independent, the probability of exactly $k$ successes (i.e. successful rare loots) is [1]:
\begin{equation} P(X = k) = \binom{n}{k} \cdot p^k q^{n - k} \end{equation}
In our context, we don't need exactly $k$ drops; we are interested in getting at least $k$ resources (if we get more, even better). To model this we need the cumulative distribution $P(X \ge k)$, which can be formulated as a complement probability:
\begin{equation} P(X \ge k) = 1 - P(X < k) \end{equation}
This is actually much more convenient to use; for example, to compute the probability to get at least $k=1$ rare resource you can simply write:
\begin{equation} P(X \ge 1) = 1 - P(X < 1) = 1 - P(X = 0) = 1 - q^n \end{equation}
instead of:
\begin{equation} P(X \ge 1) = P(X = 1) + P(X = 2) + P(X = 3) \dots \end{equation}
where you would need to compute the outcome for almost all events (i.e. the probability of 1 drop, or 2 drops, or 3 drops, etc...).
As you can see from the Bernoulli formula, it will never be possible to reach 100% certainty (if $p < 1$), which is why we define an "acceptable" error $\alpha$. For example, we can compute the minimum number of trials to get at least one resource ($k=1$) with 95% confidence ($\alpha=0.05$).
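For the special case $k=1$ there is even a closed form: we need $1 - q^n \ge 1 - \alpha$, i.e. $n \ge \ln(\alpha)/\ln(1-p)$. A quick sanity check in Python, using the $p=0.05$ and $\alpha=0.05$ values from above:

```python
import math

p = 0.05      # probability to drop the rare resource
alpha = 0.05  # acceptable error

# smallest n such that P(X >= 1) = 1 - (1-p)^n >= 1 - alpha
n_min = math.ceil(math.log(alpha) / math.log(1 - p))
print(n_min)  # 59 tries
```

So with a 5% drop rate you need 59 tries to be 95% sure of getting at least one drop, which matches the loop-based computation below.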
The following algorithm should give you an idea of how to concretely compute the Bernoulli distribution, for $p=0.05$:
### imports
import math
import scipy
import scipy.special
import matplotlib.pyplot as plt
ress_min = 1 #minimum number of resources to drop
prob_true = 0.05 #probability to drop one ressource
alpha = 0.05 #acceptance error
#reduce computation usage
ress_min = min(ress_min, 1e3)
max_iters = int(1e4)
for n in range(max_iters):
    p_bernoulli = 0.
    for k in range(ress_min):
        p_bernoulli = (scipy.special.comb(n, k) * math.pow(prob_true, k) * math.pow(1 - prob_true, n - k)) + p_bernoulli
    if (p_bernoulli < (alpha - 1e-6)):
        break
print("Minimal tries: ", n)
One can also check the impact of the minimal number of successes on the probability (the inverse cumulative distribution $P(X \ge k) = 1 - P(X < k)$).
bernoulli_list = []
ress_min_list = range(15)
for ress_min in ress_min_list:
    p_bernoulli = 0.
    for k in range(ress_min):
        p_bernoulli = (scipy.special.comb(n, k) * math.pow(prob_true, k) * math.pow(1 - prob_true, n - k)) + p_bernoulli
    bernoulli_list += [1 - p_bernoulli]
fix, axes = plt.subplots()
axes.plot(ress_min_list, bernoulli_list)
axes.set_xlabel("Minimum successful tries")
axes.set_ylabel("Cumulative probability")
axes.set_title("Inverse cumulative distribution")
plt.show()
Obviously, the probability decreases when increasing the minimum number of successful tries (i.e. it is more difficult to drop more resources)!
I implemented a tool that you can use to check your probability of dropping something.
Warning
At the time of writing, luck bonuses are based on version 1.0.2. The true loot luck is based on Mala Zedik's post from the closed Beta.
1. Dodge, Y.: The concise encyclopedia of statistics. Springer Science & Business Media (2008).
Neuroscience is a multidisciplinary science studying the human brain, and MRI is one of the main and most widely used tools for measuring it. f-MRI itself focuses on identifying patterns evoked by tasks performed during MRI scanning.
The goal of this post is to give you a simple and broad overview of how f-MRI studies are conducted.
Magnetic Resonance Imaging (MRI) uses radio wave frequency (RF) in the presence of a strong magnetic field (usually 3T) to produce detailed images from any part of the body.
The emitted RF signal causes the water molecules in our body to resonate, and the signal is altered at different frequencies so that different slices of the body resonate.
After this operation, it is the strength of the received RF signal from the different body slices that makes the MRI pixel intensity.
Check also that nice little clip if you want a graphical explanation.
MRI does not emit ionizing radiation like a CT-scan, but there are several associated risks, such as the use of contrast agents [1], the powerful magnetic field attracting magnetic objects, or the noise inside the machine [2].
In 2003, Paul Lauterbur and Peter Mansfield were awarded the Nobel Prize in Physiology or Medicine for their contributions to the development of Magnetic Resonance Imaging (MRI).
Anatomical or structural image is the most common neuroimaging data that uses MRI techniques. It is simply a 3D volume of the cerebrum that can be viewed either as multiple slices, or by its surface.

The tissues emit a more or less intense signal, and MRI is more sensitive than CT scans for distinguishing soft tissues. This property makes MRI a good candidate for brain imaging, to study its shape, volume, and developmental changes. Example studies include brain tumor segmentation [3] or even investigating COVID-19 patients' chests [4] (by the way, the [best detection method for COVID-19](/data-science/2020/08/01/false_positive.html)).
It is also possible to modify the contrast of the image (and hence highlight specific tissues of the brain) by gauging the relaxation time of the MRI signal.
Different relaxation times enable multiple pulse sequences: T1-weighted (the most basic, white matter is lighter), T2-weighted (cerebrospinal fluid appears bright) and FLAIR (similar to T2w but with the CSF signal suppressed, i.e. dark).
Functional MRI is an approximate measure of brain activity.
When activity occurs in the brain, neurons fire and consume energy through oxygen.
After this short-term consumption (called the initial dip), the brain slowly over-compensates the demand for oxygen to a much higher level than the consumption.
This is what is referred to as the Blood-oxygen-level-dependent (BOLD) signal, and it serves as the canonical response to model the activity of the brain (the haemodynamic response function, or HRF).

The relationship between electrical activity and the BOLD signal is complex and still remains a debate among researchers [5].
In practice, f-MRI is simply the acquisition of multiple low-resolution and fast MRI images (echo planar, or EPI) through time, where each voxel intensity represents the activity. Usually, a participant is asked to perform some tasks while being scanned, and the idea is to try to correlate these tasks with the f-MRI signal. Because BOLD signal acquisition is really slow (a few seconds), it is important to design appropriate tasks [6].
There are obviously a lot of other imaging techniques derived from MRI, but those are not the goal of this post.
Designing a task f-MRI study requires a lot of time and expertise. One of the first important steps when manipulating clinical data is to standardize its structure and make it usable. Only then can you create a statistical model and start analysing the results.
f-MRI data is collected by the scanner, and sequences are controlled via a computer in another room (what is called the console). It is then converted from a raw binary format to the Digital Imaging and Communications in Medicine (DICOM) format, the standard for medical imaging. Anonymization (removing all participant naming information, plus a de-facing procedure) is usually performed at this stage.
More recently, a consensus was proposed on the organization of data for neuroimaging experiments: the Brain Imaging Data Structure (BIDS) [7]. It is now widely used in the neuroscience community.
The f-MRI data content itself is still not usable as is for neuroscience studies. Even if it is in BIDS format (which is just a way of re-organizing files), the data needs to be prepared. Such preparation involves registration (linear motion and non-linear warp estimation), filtering (de-noising, confounds removal) and segmentation (brain parts, skull stripping).
Each pre-processing step is critical; for example, you need to align each EPI scan so they are in the same reference space (usually MNI152 [8]). There are a lot of pre-processing tools available to the community, but the most widely used is fMRIPrep [9].
After pre-processing, it is essential to perform quality control on your data. It usually involves an interactive software where the researchers can label the current data as "pass", "fail" or "maybe". QC is itself a whole area of research, but not the goal of this post.
The analysis of any preprocessed fMRI can be decomposed into multiple steps: task definition/design, modeling and statistical information extraction.
Modeling is usually performed using a general linear model (GLM), and the resulting coefficient can be incorporated into a standard statistical test procedure.
The GLM is just the general case of the multiple linear regression (MLR) model. Whereas MLR depends on one $n\text{-}D$ dependent variable $X$, the GLM is expressed with $i=0...p$ independent variables (or regressors) $X_{ij}$.
For one specific sample $j$ , the time signal $Y_j$:
$$ \begin{equation} Y_j = \beta_{0j} X_{0} + \beta_{1j} X_{1} + ... + \beta_{pj} X_{p} + \epsilon_j \end{equation} $$
Some of the regressors $X_{i}$ can be derived from the task design, and this is what we will see during the hands-on.
Basically, knowing the time of each event (for example, the participant moves their hand at $t = 2.3$ s), one can convolve these event onsets with the HRF to design the expected measurement.
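To make this concrete, here is a minimal numpy/scipy sketch of building an expected-signal regressor from event onsets. This is not the nilearn implementation; the double-gamma HRF parameters and the onsets are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gamma

# Illustrative double-gamma HRF (SPM-like shape parameters, not exact)
def hrf(t, peak=6.0, undershoot=16.0, ratio=1.0 / 6.0):
    """Double-gamma haemodynamic response function sampled at times t (s)."""
    return gamma.pdf(t, peak) - ratio * gamma.pdf(t, undershoot)

TR = 2.4                                   # repetition time (s)
n_frames = 128
frame_times = np.arange(n_frames) * TR

# event train: 1 at the frame where each event starts
# (e.g. the participant moves their hand at t = 2.3 s)
onsets_s = [2.3, 30.0, 75.5]
events_train = np.zeros(n_frames)
for onset in onsets_s:
    events_train[int(round(onset / TR))] = 1.0

# convolve with ~32 s of HRF support, keep the first n_frames samples
hrf_kernel = hrf(frame_times[:14])
regressor = np.convolve(events_train, hrf_kernel)[:n_frames]
```

Each event onset produces a delayed, smooth bump in `regressor`, which is exactly the shape of the task regressors we will see in the design matrix later.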
The other (nuisance) regressors are called confounds and are either included in the GLM or filtered out during pre-processing (there is no strong scientific consensus on which method to employ [10]).
What is of interest here are the so-called beta-maps $\beta_{ij}$ (estimated coefficients), which are simply the "amount" of activation for a voxel $j$ and one specific stimulus $i$.
A last note: the input fMRI data is a 4D tensor of shape $(x \times y \times z \times t)$. The "trick" to express $Y$ in a GLM is to mask the data so it has a shape of $(n \times t)$, where each row of the matrix is a specific voxel. To reduce the data burden and smooth out the data, it is also standard to use a common functional brain segmentation map (or parcellation) at different resolutions [11].
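The masking "trick" can be sketched with plain numpy (toy shapes and random data here, not the real acquisition):

```python
import numpy as np

# Toy example: flatten a 4D fMRI tensor (x, y, z, t) into a 2D matrix
# (n_voxels, t) using a boolean brain mask.
rng = np.random.default_rng(0)
vol = rng.normal(size=(10, 12, 8, 20))   # small stand-in for (53, 63, 46, 128)

mask = np.zeros((10, 12, 8), dtype=bool)
mask[2:8, 3:9, 2:6] = True               # fake "brain" region

# boolean indexing keeps the time axis: each row is one voxel's time series
Y = vol[mask]                            # shape (n_voxels, t) = (144, 20)
```

Each row of `Y` is then a candidate $Y_j$ for the GLM above; tools like nilearn handle this masking internally.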
There exists a second level of analysis called group analysis. We will not cover it here, but it basically involves using the beta-maps as measurements and extracting the voxel-wise relative importance of factors such as age, disease, etc.
Once the GLM is defined and beta-maps estimated, one can use statistical tests to explain the parameters. The standard null hypothesis is that there is no activation related to a given regressor.
One popular test is the F-test, which describes how well your betas (multiplied by the regressors) explain the input signal. Another, Student's t-test (or the z-test for larger samples), can be used to compare one task against another. For example, in the next section we will look for voxels in the brain where an audio task has more effect than a visual task. Such inequalities $\beta_1 > \beta_2$ can be expressed using a contrast.
Check this pdf for more information on the statistical tests.
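As a small illustration (using scipy, not part of the original analysis code), the z threshold corresponding to an uncorrected one-sided $p < 0.05$ is:

```python
from scipy.stats import norm

# one-sided z threshold for an uncorrected p < 0.05:
# reject the null (no activation) where a contrast z-score exceeds this
z_thr = norm.isf(0.05)   # ≈ 1.645
```

In practice, nilearn's thresholding helpers (used later in the hands-on) also offer multiple-comparison corrections on top of this.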
Finally the hands-on! Sorry for the long introduction but there were so many things to discuss before this...
This pythonic hands-on will focus on finding which brain regions are activated under specific stimuli (i.e. tasks that a participant performs while being scanned).
Hopefully, we will not need to build everything from scratch because there exists a Python package for this. It is called Nilearn, and it aims to help neuroscientists build their machine learning projects. The neuroscience community is indeed very active in providing open source tools, and some of them have even reached a wider audience, like ANTs or Datalad.
### imports
import warnings
warnings.filterwarnings("ignore")
import IPython.display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nilearn
import nilearn.datasets
import nilearn.glm
import nilearn.image
import nilearn.plotting
For this tutorial, we will use a small, open fMRI dataset: a subset of the Functional Localizer dataset. This simple protocol captures the cerebral bases of auditory and visual perception, motor actions, reading, language comprehension and mental calculation at an individual level. It is accessible online, but we will fetch it from Nilearn instead.
Note:
The fMRI data from the Nilearn fetcher is already pre-processed and ready for analysis.
This is usually not the case for raw data in BIDS. Using non-preprocessed data in a strict GLM study should not hurt too much, but it is not recommended!
# localizer dataset download
data = nilearn.datasets.fetch_localizer_first_level()
Looking at the fMRI data (in NIfTI format), it is a $4D$ numpy array.
There is just one participant, for whom $128$ EPI images of size $(53, 63, 46)$ were acquired.
# checking fMRI input data
func_image = data['epi_img']
func_image_data = nilearn.image.load_img(func_image)
func_image_data.shape
And here is what one voxel of the fMRI signal looks like:
### plot one voxel data
plt.figure()
plt.plot(func_image_data.get_fdata()[20, 20, 20, :])
plt.title("BOLD signal for (20, 20, 20)")
plt.xlabel("time (frame)")
plt.show()
As you can see, fMRI data is really noisy!
Note:
The raw time unit in neuroimaging is often frames (one EPI image is one frame), not seconds.
Here, we will focus on auditory and visual stimuli. Let's check some examples of what the participant is hearing or seeing (stimuli and protocol can be downloaded here)!
# "display" audio
IPython.display.Audio("data/fmri_intro/audio_example.wav")
"On raconte des histoires de fantômes." ("We tell ghost stories.") Really!?
### visual example
visual_example = plt.imread("data/fmri_intro/visual_example.png")
plt.figure()
plt.imshow(visual_example, cmap="gray")
plt.axis('off')
plt.show()
Imagine being scanned while looking at weird stuff like this for hours... Yes, this is common in neuroimaging studies!
The first step is to load the behavioral (or event) file.
It contains the condition/label (trial_type), start time (onset) and duration (duration) of each event, in seconds.
There is obviously one event file per fMRI run, and in our case there is just one.
# Load data labels
event_file = data['events']
events = pd.read_csv(event_file, sep='\t')
events
Note:
Sometimes the event file is not a TSV file, or it does not contain the necessary columns. If that is the case, there is no choice but to build a new one.
To design the GLM, we will need the repetition time $TR$ of the acquisition (the time between two $3D$ EPI scans). It should be accessible in the dataset metadata; for this dataset it is $2.4$ s.
The GLM will also include a cosine drift model to capture slow oscillating signals (like heart rate or breathing), with a cut-off frequency of $0.01$ Hz.
# GLM design and fit
TR = 2.4
fmri_glm = nilearn.glm.first_level.FirstLevelModel(t_r=TR, high_pass=.01)
fmri_glm = fmri_glm.fit(func_image, events=events)
Let's check what the GLM design matrix looks like!
### GLM plot
design_matrix = fmri_glm.design_matrices_[0]
nilearn.plotting.plot_design_matrix(design_matrix)
plt.show()
If we take a closer look at some regressors, we clearly see how the canonical HRF was convolved with the events.
### plot some regressors
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
axes[0].plot(fmri_glm.design_matrices_[0]['audio_computation'].to_numpy())
axes[0].set_xlabel("time (frame)")
axes[0].set_title("Expected audio signal (audio_computation)")
axes[1].plot(fmri_glm.design_matrices_[0]['visual_computation'].to_numpy())
axes[1].set_xlabel("time (frame)")
axes[1].set_title("Expected visual signal (visual_computation)")
plt.show()
Now that the GLM is fitted, we can estimate the brain activation. To do so, we will first define the appropriate contrast and then perform a z-test.
Because we want to check the activation of an audio task versus a visual task, the contrast can be defined as the subtraction of the two conditions.
### contrast definition
conditions_audio = [
    "audio_left_hand_button_press",
    "audio_right_hand_button_press",
    "audio_computation",
    "sentence_listening",
]
conditions_visual = [
    "horizontal_checkerboard",
    "vertical_checkerboard",
    "visual_left_hand_button_press",
    "visual_right_hand_button_press",
    "visual_computation",
    "sentence_reading",
]
# 1 for each column belonging to the condition group, 0 elsewhere
contrasts_audio = np.isin(design_matrix.columns, conditions_audio).astype(np.float32)
contrasts_visual = np.isin(design_matrix.columns, conditions_visual).astype(np.float32)
contrast_def = contrasts_audio - contrasts_visual
nilearn.plotting.plot_contrast_matrix(contrast_def, design_matrix=design_matrix)
plt.show()
The last step consists in performing a z-test using the previously defined contrast; we will use a p-value threshold of $0.05$.
# estimate contrast
z_map = fmri_glm.compute_contrast(contrast_def, output_type='z_score')
_, threshold = nilearn.glm.threshold_stats_img(z_map, alpha=.05)
# plot
nilearn.plotting.view_img(z_map, threshold=threshold, black_bg=True, title="audio vs visual (p<0.05)")
The results are pretty good! There is a high activation in the auditory cortex (Brodmann areas 41, 42, 22) and visual cortex (Brodmann area 17) as expected.
In case you want to see all the plots with just one command, you should definitely check nilearn.reporting.make_glm_report.
This post gave an overview of what a neuroimaging study is. As you saw, there are a lot (a lot!) of different steps involved, and each requires specific expertise (from hardware to software). A good exercise is to try a second-level study!
You may want to check the FSL tutorials. They explain all the steps involved in a neuroimaging study; they helped me a lot in understanding first- and second-level analysis, for example. Check also this really nice MIT fMRI Bootcamp from Rebecca Saxe.
If you want to apply machine learning to neuroscience, you should definitely look at this live tutorial. Also check one of my favourite papers in neuroscience, a benchmark of different ML predictive models [12].
As part of my professional work, I have been working in the cneuromod project team. This project aims to train ANNs using extensive experimental data on individual human brain activity and behaviour. Thanks to this project, I even got to be the first person playing Super Mario Bros. inside an MRI scanner!

This work was inspired from this initial tutorial.
1. Shellock, F.G., Spinazzi, A.: MRI safety update 2008: part 1, MRI contrast agents and nephrogenic systemic fibrosis. American Journal of Roentgenology. 191, 1129–1139 (2008).
2. Shellock, F.G., Spinazzi, A.: MRI safety update 2008: part 2, screening patients for MRI. American Journal of Roentgenology. 191, 1140–1149 (2008).
3. Wadhwa, A., Bhardwaj, A., Verma, V.S.: A review on brain tumor segmentation of MRI images. Magnetic resonance imaging. 61, 247–259 (2019).
4. Vasilev, Y.A., Sergunova, K., Bazhin, A., Masri, A., Vasileva, Y.N., Semenov, D., Kudryavtsev, N., Panina, O.Y., Khoruzhaya, A., Zinchenko, V., others: Chest MRI of patients with COVID-19. Magnetic resonance imaging. 79, 13–19 (2021).
5. Logothetis, N.K., Wandell, B.A.: Interpreting the BOLD signal. Annu. Rev. Physiol.. 66, 735–769 (2004).
6. Amaro Jr, E., Barker, G.J.: Study design in fMRI: basic principles. Brain and cognition. 60, 220–232 (2006).
7. Gorgolewski, K.J., Auer, T., Calhoun, V.D., Craddock, R.C., Das, S., Duff, E.P., Flandin, G., Ghosh, S.S., Glatard, T., Halchenko, Y.O., others: The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific data. 3, 1–9 (2016).
8. Fonov, V., Evans, A.C., Botteron, K., Almli, C.R., McKinstry, R.C., Collins, D.L., Group, B.D.C., others: Unbiased average age-appropriate atlases for pediatric studies. Neuroimage. 54, 313–327 (2011).
9. Esteban, O., Markiewicz, C.J., Blair, R.W., Moodie, C.A., Isik, A.I., Erramuzpe, A., Kent, J.D., Goncalves, M., DuPre, E., Snyder, M., others: fMRIPrep: a robust preprocessing pipeline for functional MRI. Nature methods. 16, 111–116 (2019).
10. Lindquist, M.A., Geuter, S., Wager, T.D., Caffo, B.S.: Modular preprocessing pipelines can reintroduce artifacts into fMRI data. Human brain mapping. 40, 2358–2376 (2019).
11. Bellec, P., Rosa-Neto, P., Lyttelton, O.C., Benali, H., Evans, A.C.: Multi-level bootstrap analysis of stable clusters in resting-state fMRI. Neuroimage. 51, 1126–1139 (2010).
12. Dadi, K., Rahim, M., Abraham, A., Chyzhyk, D., Milham, M., Thirion, B., Varoquaux, G., Initiative, A.D.N., others: Benchmarking functional connectome-based predictive models for resting-state fMRI. NeuroImage. 192, 115–134 (2019).