Introduction to scikit-learn (and Course Configurations)

Hey - Nick here! This page is a free excerpt from my new eBook Pragmatic Machine Learning, which teaches you real-world machine learning techniques by guiding you through 9 projects.

In this brief tutorial, you will learn how to configure Python on your local computer so that you can build machine learning algorithms throughout the rest of this course.

You will also be introduced to scikit-learn, which is the Python library that we will be using to build machine learning models through the rest of this course.

How to Install and Use Jupyter Notebooks

A Jupyter Notebook is a file (and corresponding application) that provides a nice environment for you to write and execute Python code. The Jupyter Notebook is arguably the most popular environment used by machine learning engineers.

The easiest way to install the Jupyter Notebook application is by downloading the Anaconda distribution of Python. Please follow the instructions in the following tutorial to do so:

If you've never worked with a Jupyter notebook before, you'll need to learn how to operate in this environment. The following tutorial will be useful for you:

If you're an experienced Python developer, please note that you do not necessarily need to work in a Jupyter Notebook to be successful in this course. However, all of the screenshots, examples, and practice problems will assume that you're working from a Jupyter Notebook. Keep this in mind before proceeding using a different Python editor or programming environment.

How to Install scikit-learn

scikit-learn is the Python library that we will be using to build machine learning models in this course. Accordingly, you'll need to install the scikit-learn library on your computer before proceeding.

If you installed Python using the Anaconda distribution, then scikit-learn will already be installed. If not, you can install scikit-learn by running the following command from the command line:

pip install scikit-learn

If you're working with the Anaconda distribution but scikit-learn isn't installed for some reason, you can install it by running the following statement from the command line:

conda install scikit-learn
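Whichever command you used, it's worth checking that the installation worked. Here is a minimal sketch of one way to do that, assuming you run it in the same Python environment you installed into (the variable name installed_version is just for illustration):

```python
# Confirm that scikit-learn can be imported and report which version is installed.
# Note that the package is installed as "scikit-learn" but imported as "sklearn".
import sklearn

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If the import raises a ModuleNotFoundError, the library was most likely installed into a different Python environment than the one your notebook is using.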

Introduction to scikit-learn

To conclude this tutorial, I wanted to provide a brief introduction to the scikit-learn library in Python.

Earlier in this course, you learned that building machine learning models generally follows this recipe:

  1. Data acquisition
  2. Data cleaning
  3. Splitting the data set into training data, validation data, and test data
  4. Training the model on the training data
  5. Validating and tweaking the model using the validation data
  6. Testing the model's final performance using the test data

scikit-learn provides tools for each step of this process. We will explore each of these tools quickly in this section.

Before proceeding, please note that this tutorial is intended to be nothing but a quick introduction. Don't worry about understanding every concept introduced in this tutorial, because we'll be learning about each step in much more detail later.
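As a rough preview, here is one way the recipe above could look in code. This is only a sketch under some assumptions: the data is synthetic and noise-free (standing in for the acquisition and cleaning steps), the 60/20/20 split is an arbitrary choice, and names like sketch_model are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Steps 1-2: stand-in for data acquisition and cleaning (synthetic data).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 2))
labels = 3.0 * features[:, 0] - 2.0 * features[:, 1] + 1.0

# Step 3: hold out 20% as test data, then split the remaining 80%
# into training data (75% of it) and validation data (25% of it).
x_rest, x_test, y_rest, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

# Step 4: train the model on the training data.
sketch_model = LinearRegression()
sketch_model.fit(x_train, y_train)

# Step 5: check the model on the validation data (score returns R^2).
val_score = sketch_model.score(x_val, y_val)

# Step 6: measure final performance on the test data.
test_score = sketch_model.score(x_test, y_test)

print(len(x_train), len(x_val), len(x_test))
print(val_score, test_score)
```

Calling train_test_split twice is just one simple way to get three data sets. We'll look at each of these steps properly later in the course.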

First, let's discuss how we import models from scikit-learn. Every algorithm is exposed in scikit-learn using something called an estimator. In scikit-learn, an estimator is any object that learns from data. A scikit-learn estimator usually falls into one of three categories: classification, regression, or clustering.

The first step in using an estimator is importing the model. The generalized Python command for importing a model is:

from sklearn.family import Model

where:

  • family is the model family that the model you're importing is from
  • Model is the name of the specific model you're importing.

As an example, the LinearRegression model is part of the linear_model family. Here is the command you would use to import this model into your Python script:

from sklearn.linear_model import LinearRegression
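The same pattern applies to other model families. As an illustration, here is a short sketch that imports one model for each of the three categories mentioned earlier. The three models are simply examples I picked, and the loop at the end only prints the family each one comes from:

```python
from sklearn.linear_model import LogisticRegression  # a classification model
from sklearn.ensemble import RandomForestRegressor   # a regression model
from sklearn.cluster import KMeans                    # a clustering model

# Each class lives in a module named sklearn.<family>.<something>,
# so the second part of the module path is the family name.
families = []
for estimator_class in (LogisticRegression, RandomForestRegressor, KMeans):
    family = estimator_class.__module__.split(".")[1]
    families.append(family)
    print(family, estimator_class.__name__)
```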

Next, you need to create an instance of the estimator and pass in any parameters you want to set. You can use Shift + Tab in the Jupyter Notebook to display the arguments that a specific model accepts.

As an example, here are two of the arguments accepted by the LinearRegression model. Both are optional and are shown here with their default values:

LinearRegression(copy_X=True, fit_intercept=True)

Note that the normalize argument you may see in older scikit-learn code was deprecated in version 1.0 and removed in version 1.2, so passing it to a recent version raises an error.

Most commonly, we'll create an instance of the LinearRegression object and assign it to a variable named model:

model = LinearRegression(copy_X=True, fit_intercept=True)
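If you're not working in a Jupyter Notebook, the get_params() method is another way to see which parameters an estimator accepts and what they are currently set to. Here is a small sketch. The exact set of parameters it prints depends on which version of scikit-learn you have installed:

```python
from sklearn.linear_model import LinearRegression

# Create the estimator with two of its optional parameters set explicitly.
model = LinearRegression(copy_X=True, fit_intercept=True)

# get_params() returns a dictionary mapping parameter names to current values.
params = model.get_params()
for name, value in sorted(params.items()):
    print(name, "=", value)
```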

It is now time to fit this model on some training data! Remember that it is important to split our data set into both training data and test data. Let's see how to do this.

First, let's generate a fake data set:

from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape((5,2))
y = range(5)

After running this code, here is the value assigned to x:

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

Similarly, here is the value assigned to y:

range(0, 5)

If you're wondering what this data means, let's explain. The x variable holds the actual observations from the data set, which has 5 observations (the rows) and 2 features (the columns). The y variable contains the labels of the data set, which are the values our model will attempt to predict. There is one label per observation.

Now we'll split the data we generated into training data and test data using the train_test_split function contained in scikit-learn.

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.4)

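The test_size parameter of the train_test_split function is important. When given as a decimal, it ranges from 0.0 to 1.0 and represents the proportion of the data set to include in the test data. With our 5 observations, a test_size of 0.4 places 2 observations in the test data and the remaining 3 in the training data.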

After running this code, here's what's contained in x_training_data. Note that train_test_split shuffles the rows at random, so your output may contain different rows:

array([[0, 1],
       [2, 3],
       [8, 9]])

Similarly, here are the other three data sets:

x_test_data

array([[6, 7],
       [4, 5]])

y_training_data

[0, 1, 4]

y_test_data

[3, 2]
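If you want to get the same split every time you run your notebook, train_test_split accepts a random_state argument. Here is a minimal sketch. The value 42 is an arbitrary choice, and the shortened variable names are only for this example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape((5, 2))
y = list(range(5))

# Passing the same random_state produces the same shuffle on every call.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=42
)
x_train_again, x_test_again, y_train_again, y_test_again = train_test_split(
    x, y, test_size=0.4, random_state=42
)

# 3 observations land in the training data and 2 in the test data.
print(x_train.shape, x_test.shape)

# The two calls return identical splits.
print(np.array_equal(x_test, x_test_again))
```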

Let's move on to actually fitting our model to our training data. This is done by calling the model.fit() method and passing in the training observations followed by the training labels.

Here's the code to do this:

model.fit(x_training_data, y_training_data)

Our model has been trained and we can now use it to make predictions on our test data. We do this using the predict() method, passing in our x_test_data as the only argument.

model.predict(x_test_data)

Here's what this code returns:

array([3., 2.])

You can then compare these predicted values to the actual values in the data set to assess the performance of your model.
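One common way to make this comparison for a regression model is the mean squared error. The sketch below rebuilds the fake data set from this tutorial so that it runs on its own. Using mean_squared_error as the metric and 0 as the random_state are my choices for this example, not the only options:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Rebuild the fake data set and split it as before.
x = np.arange(10).reshape((5, 2))
y = list(range(5))
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(
    x, y, test_size=0.4, random_state=0
)

# Fit the model and predict labels for the test observations.
model = LinearRegression()
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

# Average of the squared differences between actual and predicted labels.
# Lower is better, and 0 means the predictions match exactly.
mse = mean_squared_error(y_test_data, predictions)
print("Mean squared error:", mse)
```

Because the labels in our fake data set are an exact linear function of the features, the error here should be essentially zero. Real data sets will not be this tidy.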

Final Thoughts

In this tutorial, we quickly discussed the tooling required for you to proceed through this course. You also had your first brief introduction to the Python library scikit-learn, which we will be using to build machine learning models through the rest of this course.

Here is a brief summary of what you learned in this lesson:

  • How to download and run Jupyter Notebooks
  • How to install scikit-learn (and why you don't need to if you installed Python using the Anaconda distribution)
  • A brief summary of how machine learning models are built using the scikit-learn package. Please note that you need not understand every detail from this overview, since we will be revisiting every step in more detail later.