Python is widely considered the best programming language for data science.
This is largely because of its rich ecosystem. Python has countless open-source libraries that make it quick and easy to integrate common functionality into your applications.
Pandas is one of these libaries. Known for its tabular data structure the Pandas DataFrame - which is similar to a spreadsheet that you can manipulate in Python - Pandas is the de-facto tool for working tabular data.
In this tutorial, we will get ourselves familiar with two of the widely used functions to select and filter data from Pandas dataframes: loc and iloc.
Table of Contents
You can skip to a specific section of this pandas loc vs. iloc tutorial using the table of contents below:
In case you do not have Pandas, you can install it by running the following statement from your command line:
pip install pandas
We'll need a dataset in order to get ourselves familiar with the functions we will be going over today. In this tutorial, we'll be using the World Happiness Report 2015 dataset from Kaggle. Fee free to use your own dataset if you'd like.
Let’s download a dataset of your choice and import to our script.
import pandas as pd
df = pd.read_csv('dataset.csv')
If we look at the shape of the dataset, we can see that it has 158 rows and 12 columns.
Let’s take a look at the first 5 rows of the dataset using pandas' head method:
Now we are ready to get started. We'll now look at the different ways of selecting and filtering data from the dataset.
Selecting Data from Dataframes: A Rudimentary Approach
Selecting Columns by Column Name(s)
Let’s retrieve the first five rows of the columns Country and Happiness Score. To do this, we'll first select the two columns using square bracket parsing and then call the head method on those columns:
This code generates:
In this code, we specify a list of the columns we want to retrieve. We can add more columns to the output DataFrame by appending the required column names to the list.
Selecting Columns by Data Type(s)
Let's now consider how we can retrieve columns by their datatype. To do this, we pass a list of the datatypes of the columns we require as the include parameter to the select_dtypes() function.
Let’s retrieve all string values from our columns and print their first 5 rows using the head method:
This code generates:
Instead of ‘str’, we are using the datatype ‘object’ here. We will now retrieve both string and float values.
Since the whole dataset only consists of the above two data types, it will output all of the data set's columns.
Selecting Columns by Filtering
We can also use the filter() function to retrieve data from columns that have a substring of our choice in its column name.
We will also receive multiple columns if the substring of choice is contained in any of the other column names.
Now that we have a fair idea about how to retrieve data from a dataframe, we will next look at two of the most versatile functions built into Pandas: iloc and loc.
Selecting Data from Dataframes: iloc
iloc in Pandas is used to make selections based on integer (denoted by i in iloc) positions or indices. That means we can retrieve data by using the position at which its rows and columns are present in the dataframe.
We can visualize that the rows and columns of a dataframe are numbered from 0. This is the logic used to retrieve data using iloc.
Let us go through an example first.
We will pick the first row of the dataset.
We can also pick multiple rows using the same method.
We can also pick the last row without knowing how many rows there really are. To do this, pass the index -1 into the pandas iloc function:
We can select columns from our DataFrame using similar logic.
If we look at the structure of iloc, it's designed so we specify the intended rows we plan to receive in the first list and the intended columns in the second list. Although this might look tricky and complex, selecting a column really is as simple as retrieving a row - just specific : for your list of rows!
Here's an example:
We can also retrieve the last column by passing -1 to the list as we did before.
We can combine both the row and columns lists we learned to retrieve data elements from various positions of the dataset. Let’s say we want to retrieve the first cell, which happens to be the data element in the first column and the first row.
We can retrieve the first cell using the following statement:
We can play around with this trying to retrieve various data elements. Let’s retrieve the Happiness Rank and Happiness Score for the 0th, 25th, and the last observations of the dataset. We can identify that the Happiness Rank and the Happiness columns are the 2nd and the 3rd columns of our dataframe.
We can retrieve these data with the following code:
If we are to select rows and columns between a range, we can treat the indexer the same way we treat arrays.
Let’s select Happiness Rank and Happiness Score data from the rows 30 to 34.
One of the important things to keep in mind is that when we specify the range, it always runs from the starting number to the number right before the interval ending number. Said differently, the first number is inclusive while the second number is exclusive. For instance, if we say [0, 5], it means that we are considering a range that runs from 0 to 4.
Now that we have familiarized ourselves with iloc, let us look at the next indexer in line: loc.
Selecting Data from Dataframes: loc
loc in Pandas can be used in two main ways.
We can use loc to retrieve data elements based on their indices or labels, and we can use loc to retrieve data using conditional statements.
We'll look at each of these separately next.
Index-Based Data Retrieval
If we recall what we learned in the last section, we understand that we used iloc to retrieve data based on their position or the location in the dataframe we are working with.
For instance, if we execute df.iloc[], it will output the first row from the dataframe. Even if the index of that row was a different number than 0, it would have returned the first row from the dataframe as we would perceive it.
However, when it comes to loc indexer, it works a little differently than the iloc. It will always call the data elements by their specified index. Let’s consider an example.
The result was the same result that was produced by iloc earlier in the previous section. This is because the DataFrame's index is equal to the integer position of each row in the DataFrame.
Let's change the index of the dataframe to a different field. More specifically, let's set it to Happiness Rank.
df.set_index('Happiness Rank', inplace=True)
We can see that the index has been changed to Happiness Rank and now it starts from 1 opposed to the original dataframe that started with 0.
Let us now run iloc and loc again.
As we can see, the result is the same as before except for the index, which is 1 now.
Let us run the same for loc.
KeyError: "None of [Int64Index(, dtype='int64', name='Happiness
Rank')] are in the [index]"
As we can see, running the same statement with the loc indexer gives us a KeyError exception.
If we go back to the current df version, we will see that this happens because there is no index with 0 in the updated dataframe since we changed the index field to Happiness Rank which starts from 1 instead of 0 as before.
We can run the following with the updated index instead to retrieve the same element as before.
Let's change the index to the Country field now. However, we have to make sure that we reset the index so that the Happiness Rank is normal column, as before.
Now we can see that the Happiness Rank is back as a field:
We can change the place of the column to be the same as before. However, we will not spend time on that as it is not a necessity here when we are using loc. Let’s set the index to the Country field now.
Let us run iloc to retrieve the first row.
If we are to retrieve the same data using loc, we will have to use the index value. In this case, it will be ‘Switzerland’.
With that, now we know how differently iloc and loc behaves when they are retrieving elements.
We will now look at a few use cases that we can perform on the updated dataframe with the index as the Country.
Let’s retrieve data for two countries, Denmark and Panama.
If we are writing functions for our custom programs to retrieve data based on the index, instead of querying explicitly, we can set the index as the field we want and retrieve the information using loc this way.
Let’s retrieve the Region and Happiness Score of Australia and Sweden.
Now that we have covered the basics and are familiar with the functionality of loc and iloc indexers, we will now take a look at their data types.
Most Python developers tend to think that whenever we receive a table as an output, it necessarily has to be a DataFrame. However, this is not actually true.
In reality, when we are retrieving tabular data, although they may look alike, their actual type can be different. We will now take a look at a few of the types of Pandas objects iloc and loc indexers produce.
Let’s consider the following.
Notice that we have used the value straightaway instead of passing a list with the value.
In this, we realize that instead of printing the dataframe as a horizontal table, we can also access it as a pandas Series and have a link to individual values based on indices.
Let us confirm this by executing,
Let us observe what happens if we do the same by passing a list with the index.
Although these may seem trivial, it is at the real-world implementations of these functions and their basics we learn that will be useful.
The Pandas library is one of the most important components of the data science ecosystem. Within pandas, loc and iloc are two of the most important functions.
This tutorial showed you how to use both functions in Python. Feel free to use this as a reference in your future data science projects.