LearnPython.com
28th Aug 2023 9 minutes read

Python Exploratory Data Analysis Cheat Sheet

Luke Hande
  • python
  • data analysis

Get a quick overview of exploratory data analysis, a process used to summarize your dataset and get some quick insights. We’ll give you the tools and techniques you need in this cheat sheet.

Exploratory data analysis (EDA) is a term used to describe the process of starting to analyze your data in the early stages of a project. Its primary purpose is to understand the properties of the data, so you can use what you learn to refine your analysis and get the most out of the data you have. After performing an EDA, you’ll have a better idea of what your data looks like and what questions you can answer.

What Is Exploratory Data Analysis?

It’s important to do an EDA before you start the formal analysis, modelling, or hypothesis testing. Many analysis methods have assumptions about the data; if your data doesn’t conform to these assumptions, your results may be invalid. For example, some statistical tests assume the data is Gaussian (i.e. normally distributed); you need to explicitly check this by doing an EDA before applying the statistical test.

The EDA process can involve several steps: loading the data, cleaning the data, plotting each variable, grouping variables, and plotting groups of variables. In this article, we’ll provide you with an overview of these steps. In your next data analytics project, you can come back to this article and use it as a cheat sheet to inspire you on how to best inspect your data.

We’ll cover some advanced topics in this article, so it’ll be quite useful to have some experience in programming with Python and data analytics. If you want some relevant learning material, the Introduction to Python for Data Science course is aimed at beginner data scientists. For more in-depth material, the Python for Data Science track bundles together 5 of the best interactive courses relevant to data science.

Load and Clean Your Data

Processing your data properly is an important first step, as we discuss in What Is Data Processing in Python?. The details of this first step depend on the type of data you have. If you have data in CSV format, you can use Python’s csv module to read it in. The article A Guide to the Python csv Module has more information and examples to help you out here. If your data is in an Excel spreadsheet, you’ll need different libraries, which we discuss in Best Python Packages for Excel.

Or perhaps you have data in the JSON format. In this case, you can use the json module to read it in. We have some useful examples in How to Convert JSON to CSV in Python. In that article, we also show you how the pandas library can make your life much easier when reading in data. Many of these libraries appear in our Top 15 Python Libraries for Data Science.
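As a minimal sketch of what reading both formats looks like with pandas (the data here is inlined so the example is self-contained; in practice you’d pass a filename such as 'data.csv'):

```python
import io

import pandas as pd

# A small CSV sample, inlined for illustration;
# with a real file you would write pd.read_csv('data.csv')
csv_text = "sepal_length,species\n5.1,setosa\n4.9,setosa\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# pd.read_json handles JSON input in the same way
json_text = '[{"sepal_length": 5.1, "species": "setosa"}]'
df_json = pd.read_json(io.StringIO(json_text))
```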

Speaking of pandas, this library will come in handy for most parts of the EDA process. When it comes to cleaning data, there are some pandas functions that make your life easier.

1. Load Data and Remove Duplicates

Start by importing your data into a pandas DataFrame called df.

To remove duplicate entries, use the df.drop_duplicates() function. The subset argument allows you to provide one or more column names to consider when dropping duplicates. The keep argument allows you to specify if you want to keep the first duplicated entry, the last, or none of them.
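As a quick sketch of both arguments (the DataFrame below is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "petal_width": [0.2, 0.2, 2.0],
})

# Drop fully identical rows, keeping the first occurrence (the default)
deduped = df.drop_duplicates()

# Compare rows on the 'species' column only, and keep the last occurrence
by_species = df.drop_duplicates(subset=["species"], keep="last")
```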

2. Deal with Missing Data

When you’re working with real data, it’s common to have missing values. You can fill these with the df.fillna() method. The first argument specifies the value to use to fill in the missing values. With the second argument, you can choose to propagate the last or next valid observation forward or backwards.
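For example (note that in recent pandas versions, propagating the last valid observation forward is done with the dedicated ffill() method rather than a fillna() argument):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Replace every missing value with a constant
filled = s.fillna(0)

# Propagate the last valid observation forward instead
forward = s.ffill()
```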

See the article The Most Helpful Python Data Cleaning Modules for more information and tips for cleaning data. Since this step in the EDA process involves manipulating data, here are 12 Python Tips and Tricks That Every Data Scientist Should Know.

For the rest of this article, we’ll be working with the famous iris flower dataset. It, along with many other interesting datasets, can be imported from scikit-learn and then converted to a pandas DataFrame for convenience, as shown below:

>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
>>> df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
>>> df.head()

   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  species
0                5.1               3.5  ...               0.2   setosa
1                4.9               3.0  ...               0.2   setosa
2                4.7               3.2  ...               0.2   setosa
3                4.6               3.1  ...               0.2   setosa
4                5.0               3.6  ...               0.2   setosa

Here we can see the variables in this dataset include measurements of the physical properties of different species of flowers.

Summarize Your Data

Pandas DataFrames have some useful built-in methods to help get a quick overview of your data. You can use df.shape to print the shape of the DataFrame. The output is a tuple with the number of observations and the number of variables. For the summary statistics of your dataset, use the df.describe() method. The output looks like this:

>>> df.describe()

 sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000

This calculates the mean, standard deviation, minimum, and 25th, 50th (median), and 75th percentiles. It provides a nice quick overview of your variables.

Visualize Your Data

The next step in the EDA process is to start plotting your data to get an idea of the nature of the variables. The Python library Matplotlib – which is used to create static or interactive visualizations – is useful here.

The type of visualizations to consider plotting at this stage are histograms, box plots, bar plots, or density plots (amongst others). An example of using Matplotlib to plot a histogram of the sepal length is shown below:

>>> import matplotlib.pyplot as plt
>>> plt.hist(df['sepal length (cm)'], bins=20)
>>> plt.ylabel('Counts')
>>> plt.xlabel('Sepal length (cm)')
>>> plt.show()

Running this code produces the diagram below. You can see the sepal length for all species varies between about 4.3 and 7.9 cm. Note also that this distribution doesn’t look Gaussian; this could be formalized with a statistical test. However, it means that statistical quantities or tests that assume a normal distribution may not be valid here.

[Figure: histogram of sepal length]
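The normality check mentioned above could be sketched with SciPy’s Shapiro-Wilk test; this is one common choice of test, picked here for illustration:

```python
import pandas as pd
from scipy import stats
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Shapiro-Wilk test; the null hypothesis is that the sample is
# drawn from a normal distribution
statistic, p_value = stats.shapiro(df["sepal length (cm)"])

# A small p-value (e.g. below 0.05) is evidence against normality
print(f"W = {statistic:.3f}, p = {p_value:.4f}")
```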

Next, you could consider plotting pairs of variables to see if there are any relationships between quantities. For example, you can plot a scatter plot of petal length against petal width as follows:

>>> plt.scatter(df['petal length (cm)'], df['petal width (cm)'])
>>> plt.ylabel('Petal width (cm)')
>>> plt.xlabel('Petal length (cm)')
>>> plt.show()

The resulting diagram is shown below. You can see a nice linear relationship between the two variables: as petal length increases, so does petal width. Interestingly, there appear to be two clusters in the data. This is begging for an explanation!

[Figure: scatter plot of petal width against petal length]

At this point in the analysis, you shouldn’t worry too much about making visually appealing plots; just use the style defaults to save time and effort. Check out the Matplotlib gallery for detailed examples of the many different types of plots available. Also, our article How to Plot a Running Average in Python Using Matplotlib has a nice example of plotting time series data.

Group Your Data

From the above visualization, we have a clue that there could be naturally occurring groups in the data. A good next step is to look at how the data could be grouped together. The pandas method df.groupby() comes in handy here. It must be used in conjunction with an aggregation function that produces a summary statistic for each group, for example .mean(), .max(), or .min(). Let’s take a look at a concrete example:

>>> df.groupby('species').mean()
            sepal length (cm)  ...  petal width (cm)
species                        ...                  
setosa                  5.006  ...             0.246
versicolor              5.936  ...             1.326
virginica               6.588  ...             2.026

In the above example, we group the data by the species column and calculate the mean of each variable for that species. The results are printed to the console and show the species Virginica has the largest sepal length on average. Setosa, on the other hand, has quite clearly the smallest petal width compared to the other species. Perhaps this species could represent the lower left group from the above section?

From above we know the 50th percentile of the petal width is 1.3 cm. We can find out how many members of each species have petal widths greater than this value. Just run the following code:

>>> df[df['petal width (cm)']>1.3]['species'].value_counts()
virginica     50
versicolor    22
setosa         0

Here, we’re subsetting the DataFrame to return the species column for all observations with a petal width greater than 1.3 cm. Then, using the value_counts() method, we count how many observations of each species remain. This reveals there are no examples of Setosa with a petal width greater than 1.3 cm, which confirms the lower left cluster represents this species. But what about the upper right cluster?

We can plot the same scatter diagram but colored by species. We just need to pass an array-like data structure containing an integer identifying each species to the c argument of the scatter function. An Introduction to NumPy in Python has more information about working with arrays. To generate another scatter plot, use the following code:

>>> scatter = plt.scatter(df['petal length (cm)'], df['petal width (cm)'], c=iris.target)
>>> plt.legend(handles=scatter.legend_elements()[0], labels=df['species'].unique())
>>> plt.ylabel('Petal width (cm)')
>>> plt.xlabel('Petal length (cm)')
>>> plt.show()
[Figure: scatter plot of petal width against petal length, colored by species]

Grouping the data by species reveals that the upper right cluster contains two sub-clusters belonging to the Versicolor and Virginica species. And as we suspected, the lower left cluster belongs to the Setosa species.

How to Use This Cheat Sheet for Exploratory Data Analysis

In this article, we demonstrated a typical process for EDA. We started by reading in and cleaning the data, then moved on to getting a quick understanding of the variables. The next steps depend on the type of data you have, but they should involve visualizing each variable using histograms, box plots, or bar plots (amongst others). Density plots and scatter diagrams are a good way to plot two variables against each other and see if there are any relationships worth investigating further. Our findings motivated us to separate the data into groups and try to explain why these groupings occur. If you have tabular data, How to Pretty-Print Tables in Python has some examples of producing nice tables in Python.

You can use this process as a guide for your next EDA project. Other techniques (such as statistical hypothesis testing) could provide some new insights, such as confirming if the distribution of sepal length (shown above) is statistically different for each species. For some more advanced learning material, consider taking the Data Processing with Python track. It contains 5 interactive courses and is aimed at advanced users. Or you can start your own Python project. Here are some Python Data Science Project Ideas to get you motivated to practice what you have learnt here.
