What is Exploratory Data Analysis?
Exploratory data analysis (EDA) is a critical component of any data scientist's tool box. EDA involves learning about a data set to make sure you're using the right predictive tools to analyze it.
Despite their importance, exploratory data analysis principles are seldom taught. They are learned through the course of building many real data science projects.
This tutorial will teach you the basics of exploratory data analysis in Python.
Table of Contents
You can skip to a specific section of this exploratory data analysis tutorial using the table of contents below:
- What is Exploratory Data Analysis (EDA)?
- Why is EDA Important?
- A Simple EDA Exercise in Python
- Final Thoughts
What is Exploratory Data Analysis (EDA)?
Exploratory data analysis (EDA) is used to analyze a data set to assess its nature and characteristics. EDA primarily uses visualization techniques to do this.
The objectives of EDA are
- To understand the distribution of the dataset
- To identify potential patterns and trends
- To identify outliers or anomalies
- To produce a hypothesis
- To validate assumptions
- To generate a visually-summarized representation of the dataset
Why is EDA Important?
Although many tend to skip it, exploratory data analysis is an essential step in any data science project. It helps you understand whether the model you're applying to the data will be logical, well-defined, and scoped within the boundaries of its intended application.
EDA also helps to find anomalies, errors, noise, loss of data, and the overall validity of the data set with respect to the application.
Performing EDA helps you in modeling outcomes, determining the best-suited algorithm, finding new data patterns, and even defining which variables are dependent and independent.
Using proper exploratory data analysis techniques can save you from building invalid or erroneous models, building models on unsuitable data, selecting unsuitable variables, and even unoptimized development of the model. EDA can be helpful in preparing data for the model as well (this is typically called data cleaning).
EDA can also save you from making bad predictions by revealing that your data set is not adequate for the task at hand. This means you'll need to go back to the data collection stage.
To summarize, the main goal of EDA is to ensure that the dataset aligns with the model that you're building. The rest of this article will explain a few methods you can use to perform exploratory data analyses.
Methods to Perform EDA
There are many methods that can be used for exploratory data analysis.
Note that as with everything else, it's often best to use multiple EDA techniques. You can then compare the results of each analysis to see if they align with each other.
This method can be used to produce a statistical summary for each separate column of the unprocessed dataset.
This method is used to determine the connection between each independent variable and the intended dependent variable.
This method can be used to understand the relationship between different columns of the dataset.
In simple terms, dimensionality reduction can be used to reduce the number of dimensions in a dataset and transform to another dimensional space that is lower than the original while still sustaining the meaningful and resourceful characteristics of the dataset.
Principal component analysis is one of the most common techniques used for dimensionality reduction.
Dimensionality reduction methods can be used to develop questions to answer during your analysis or to develop a sense of how the results of this model should be interpreted. The usual procedure to perform exploratory data analysis in your code is as follows.
- Acquire a suitable dataset and import it
- Understand the nature of the dataset by observing data from different rows.
- Try to further understand the nature of the dataset by querying the data
- Identify if there are any missing values and determine how the handle the missing data
- Try to comprehend the features of the dataset from a data science standpoint
- Identify the difficulties of working with the dataset due to its missing and extreme values.
- Identify any potential patterns even if they cannot be explained right away
This process is often encompassed by a step called data profiling.
Data profiling is performed after summarizing the dataset using visualization and statistics to help you understand the dataset even better.
Based on the outcome of data profiling, the next steps can be decided regarding the dataset whether or not to rectify or reject depending on the suitability to the potential machine learning model.
We will now put everything we have learned to practice.
A Simple EDA Exercise in Python
We will now get started with getting ourselves familiar with the basics of exploratory data analysis by practicing on a dataset.
We will be using the Cortex cryptocurrency dataset in this exercise.
You'll also need the following libraries as a prerequisite:
Pandas is a highly optimized Python library written for manipulation and analysis of data. It is well-known for its highly useful pandas DataFrame data structure.
NumPy is one of the most versatile libraries intended towards mathematical and numerical computations on matrices of various dimensions. It's NumPy array data structure is similar to a Python list, but is more well-suited towards high-performance computing.
Matplotlib is a Python library that is written for the visualization of data.
Seaborn is a library that is based on Matplotlib and it helps to make visualizations of data easier.
We will now install the above libraries using the
pip package manager. Run the following statements on your command line to do this:
pip install pandas pip install numpy pip install matplotlib pip install seaborn
Additionally, we also require the
datapackage library to be installed. This is because we are directly importing data through a remote server.
pip install datapackage
Now that our imports are complete, you can import the data for this tutorial with the following code:
import pandas as pd import datapackage remote_url = 'https://datahub.io/cryptocurrency/cortex/datapackage.json' my_pkg = datapackage.Package(remote_url) my_rsc = my_pkg.resources for rsc in my_rsc: if rsc.tabular: df = pd.read_csv(rsc.descriptor['path']) df
The last line of this code block will output the data as follows.
If we look at the fields within this data set, we will see that there are 8 columns (in addition to the index column).
For this exercise, you don't need to know what each column means from the start. Just know that they are variables. You will get familiar with each specific column as you work through this tutorial.
If you want to learn the shape of the data set, you can easily run the following command to get the number of columns and rows.
Here's the output of this command:
This means the above dataset has 156 observations and 8 features.
In addition to that, by running the following
tail statements, you can print the first five rows and the last five rows from the dataset.
Similarly, here's how you can print the tail of the data set:
As we noticed in the first five observations, the data set contains null values. It's important to understand how prevalent these null values are in the rest of the dataset.
info method is an excellent tool for learning more about missing data in a pandas DataFrame.
The output of the
info method tells us that except for the
date field, all the other fields have float values.
Moreover, except for in the
date column, every other field has
We have a few options on how to deal with this missing data:
- go ahead with the existing table
- replace the
NaNvalues with another value
- discard all the observations that have
In this scenario, let’s choose the last option and drop all the observations with
NaN fields. We'll use the pandas
dropna method to do this:
df = df.dropna() df
Here's what the new DataFrame looks like:
To see how many rows have been removed from our data set, we can look at the DataFrame's
Here's the output of this code:
You can tell that our data set has dropped from
156 observations originally to
119 observations now. This shows that there were 37 observations with missing data in the original data set.
You'll also notice that the DataFrame's indices start from 37 and have lost their consistency. You can reset the indices of the DataFrame using the
reset_index method, like this:
df = df.reset_index(drop = True)
This will reset the indices of the DataFrame. Note that we've passed a
drop = True argument into the
reset_index method because we do not require the old indices to be included in the new DataFrame as another column.
We can now use the
pandas.describe() function to look at a summary of all the fields using statistics like means, percentiles, standard deviation, minimums, and maximums.
If you want to know the median of a particluar column, it's not explicitly labelled here. With that said, the median is just another name for the 50th percentile.
You can see from this output that the difference between the 50th percentile and the 75th percentile of
txCount field is much greater than the difference between the 25th percentile and the 50th percentile.
describe method also shows similar characteristics for some of the other fields. In many columns, the mean is significantly higher than its median.
Both these observations suggest that there are outliers in the dataset.
Provided that this dataset had lots of similar values or repeated values, we could have used the
unique() function to observe all the unique values from each field. In addition, the
value_counts() method shows the number of times each value in a column occurs.
Take your own time and try to observe more patterns in the dataset. There could be hundreds of different patterns that are just waiting to be found!
Now that we are a bit familiar with the nature of this dataset, we can produce a correlation matrix for this dataset using the
Correlation matrices are highly useful in summarizing massive amounts of data and to identify patterns among the different fields. They also allow us to perform diagnostics to find if certain variables are extremely correlated to each other. This indicates multicollinearity and would suggest that some types of modelling (such as linear regressions) would produce unreliable predictions.
Here's the code to generate a correlation matrix from the DataFrame:
Here's the output of this code:
You'll also want to use the
seaborn library we installed earlier to generate some visualizations. Let's start by importing the library into our script:
import seaborn as sns import matplotlib.pyplot as plt
One of my favorite
seaborn visualizations is the
heatmap. Here's the code to generate a basic
seaborn that presents our correlation matrix in a visually-appealing manner:
sns.heatmap(df.corr(), annot=True) plt.show()
This visualization reveals that
txCount has a very strong positive correlation with
Why is this?
If we consider what
txCount actually is, it is a field that denotes the number of blockchain transactions that a specific user has completed. The field
activeAddresses represents the number of individual addresses that have engaged in blockchain transactions with that specific user. With this in mind, it makes sense that these two fields have a high correlation.
There are many more such patterns in this data set. Try and identify more of the. It's a great way to practice your exploratory data analysis skills!
You can also look at how the price has changed over the available dates. Let's first see if there are any duplicate dates. As you already know, there are
119 observations. Let's produce the array of unique values in the date field and find its length:
Since then number of unique dates is equal to the number of total dates, this shows that all dates are unique.
It also shows that the dates are in chronological order even with the
NaN values dropped.
Let's produce a matplotlib scatterplot to see how the prices have changed over the months:
import matplotlib.pyplot as plt plt.scatter(df['date'],df['price(USD)']) plt.xlabel('Timeline') plt.ylabel('Price') plt.show()
We can also produce a scatterplot that shows how the number of active addresses has varied over the months.
import matplotlib.pyplot as plt plt.scatter(df['date'],df['activeAddresses']) plt.xlabel('Active Addresses') plt.ylabel('Price') plt.show()
From the description we produced earlier, we know that the mean for active addresses is approximately 204.
We also observed that on some dates, there are numbers as high as
4270 and numbers as low as
The charts we just created allow us to observe those outliers in a visual way.
There are many more patterns in this data set that are best observed visually. Feel free to produce more plots involving other fields that might be useful in inferring important information in this exploratory data analysis step.
Exploratory data analysis is used to summarize and analyze data using techniques like tabulation and visualization.
Using EDA in statistical modeling can be extremely useful in determining the best-suited model for the problem at hand. Moreover, it also helps in identifying unseen patterns in a dataset.
This tutorial provided a broad overview of exploratory data analysis techniques. Please feel free to use it as a reference when approaching new data science problems in the future.