Machine Learning

Supervised learning with Scikit-Learn

Posted on October 22, 2019

In supervised learning we have several data points described by predictor variables (features) and a target variable. The aim is to build a model that can predict the target variable. For this we need labelled data, that is, data with both inputs and known outputs. There are many ways to perform supervised learning; here we will use scikit-learn, one of the most popular and user-friendly machine learning libraries for Python. It also integrates very well with the SciPy stack, including NumPy. Our data is commonly represented in a table structure such as the one you see below.

In [1]:
from sklearn import datasets
import numpy as np
import pandas as pd

# load iris data from scikit-learn's datasets
iris = datasets.load_iris()

# convert the data to a pandas dataframe
X = pd.DataFrame(iris.data, columns=iris.feature_names)

X.head(7)
Out[1]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
5 5.4 3.9 1.7 0.4
6 4.6 3.4 1.4 0.3

Here we have data pertaining to iris flowers, in which the features consist of four measurements: sepal length, sepal width, petal length and petal width. The target variable encodes the species of flower; there are three possibilities, as we can see below.

In [2]:
y = iris.target

np.unique(y)
Out[2]:
array([0, 1, 2])

0 stands for setosa, 1 for versicolor and 2 for virginica.

We can use the shape attribute to return a tuple representing the dimensionality of the DataFrame.

In [3]:
X.shape
Out[3]:
(150, 4)

In order to perform some initial EDA (exploratory data analysis) we will use seaborn, a visualization library based on matplotlib.

Using seaborn.pairplot we can plot pairwise relationships in a dataset.

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

df = X.copy()
df['target'] = y
_ = sns.pairplot(df, hue='target', vars=df.columns[:-1])

The result is a matrix of figures where the diagonal axes show the univariate distribution of the variable in that column. The off-diagonal axes are scatter plots of the column feature versus the row feature, colored by the target variable.

There is a great deal of information in this scatter matrix. Petal length and width are highly correlated, and the flowers cluster according to species.

Classification

So, we have a set of labelled data and we want to build a classifier that takes new unlabelled data as input and outputs a label. How do we construct this classifier?

We first need to choose a type of classifier, and it needs to learn from the already labelled data. For this reason, we call the already labelled data the training data. We will initially choose a simple algorithm called k-nearest neighbors.

k-Nearest Neighbors

The basic idea of k-nearest neighbors, or k-NN, is to predict the label of any data point by looking at the $k$ closest labelled data points and getting them to vote on what label the unlabelled point should have.

What the k-NN algorithm essentially does is create a set of decision boundaries, and the label predicted for a new data point depends on which region of the feature space it falls into.
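
To make the voting idea concrete, here is a minimal, illustrative sketch in plain NumPy. It is not scikit-learn's implementation; the function name and the choice of Euclidean distance are just assumptions for the example, and the inputs are assumed to be NumPy arrays.

import numpy as np

def knn_predict(X_train, y_train, x_new, k=6):
    # X_train, y_train are assumed to be NumPy arrays
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]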

All machine learning models in scikit-learn are implemented as classes. These classes serve two purposes: they implement the algorithms for learning a model and for predicting with it, while also storing all the information that is learned from the data.

Training a model on the data is also called fitting the model to the data. In sklearn we use the fit method to do this. Similarly, the predict method is what we use to predict the label of an unlabelled data point.

The scikit-learn API requires, firstly, that we have the data as NumPy arrays or a pandas DataFrame. It also requires that the features take on continuous values and that there are no missing values in the data. The iris dataset already satisfies these prerequisites, so we can move on to fitting the data.

To fit our k-nearest neighbors classifier we need to import it from sklearn.neighbors. The n_neighbors parameter sets the number of neighbors. Notice we named the feature array X and the response variable y, in accordance with common scikit-learn practice.

In [5]:
from sklearn.neighbors import KNeighborsClassifier

# create a knn classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# fit it to the features X and target y
knn.fit(X, y)
Out[5]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                     weights='uniform')

Having fit a k-NN classifier, we can now use it to predict the label of a new data point. However, there is no unlabelled data available, since all of it was used to fit the model! We can still call the .predict() method on the X that was used to fit the model, but this is not a good indicator of the model's ability to generalize to new, unseen data.
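
As a quick illustration (a sketch re-using the knn object and the X and y defined above, so nothing new is learned here), predicting on the training features simply returns a label for every row, and score reports the training accuracy:

# predict labels for the very same data the model was trained on
train_pred = knn.predict(X)

# fraction of training points labelled correctly (training accuracy)
print(knn.score(X, y))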

For now, we can sample a few points from our data and treat them as new, unlabelled observations.

In [6]:
X_new = X.sample(n=5, random_state=22)

prediction = knn.predict(X_new)

print('Prediction {}'.format(prediction))
Prediction [0 2 1 2 1]

We now need to figure out how to measure its performance. In classification, accuracy is a commonly used metric. The accuracy of a classifier is defined as the number of correct predictions divided by the total number of data points.

In order to have data points to train on and separate data points to make predictions on, we can split our labelled data into a training set and a test set using train_test_split and then compute our accuracy.

In [7]:
from sklearn.model_selection import train_test_split

# split our data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, 
                                                    test_size=0.4, 
                                                    random_state=8,
                                                    stratify=y) 

# create our k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6) 

# fit on the training set
knn.fit(X_train, y_train) 

# predict the test set
y_pred = knn.predict(X_test) 
print("Test set predictions:\n {}".format(y_pred))

#compute the accuracy
acc = knn.score(X_test, y_test)

print('Accuracy: ', acc)
Test set predictions:
 [2 1 0 1 0 2 0 2 0 1 0 0 2 1 2 2 1 0 2 2 2 0 0 2 1 1 0 0 2 1 2 0 1 1 1 1 2
 0 1 0 2 0 0 2 0 1 1 1 0 2 1 1 0 1 0 2 2 2 1 2]
Accuracy:  0.9666666666666667

The accuracy is pretty good for an untuned model.
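
The score method above is just shorthand for computing the fraction of correct predictions; a quick sketch using the y_pred and y_test arrays from the cell above:

# accuracy = number of correct predictions / total number of predictions
manual_acc = np.mean(y_pred == y_test)
print(manual_acc)  # matches knn.score(X_test, y_test)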

We recently introduced the concept of a decision boundary. In the next plot we will visualize the decision boundaries for several increasing values of $k$ in a k-NN model.

In [8]:
# we only take two features for plotting the decision boundaries
bi_dim = X.copy().to_numpy()[:, [0, 2]]

h = .01  # step size in the mesh

x_min, x_max = bi_dim[:, 0].min() - 1, bi_dim[:, 0].max() + 1
y_min, y_max = bi_dim[:, 1].min() - 1, bi_dim[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

neighbors = [1, 6, 15]

fig, ax = plt.subplots(1, 3, figsize=(10,10))

for axis, n in zip(ax, neighbors):

    clf = KNeighborsClassifier(n_neighbors=n)

    clf.fit(bi_dim, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)

    axis.pcolormesh(xx, yy, Z, cmap='plasma', alpha=.8)

    # Plot also the data points, colored by species
    axis.scatter(bi_dim[:, 0], bi_dim[:, 1], c=y,
                 cmap='plasma', edgecolor='k', s=20)
    axis.title.set_text("k = {}".format(n))
    
plt.show()

Notice that as $k$ increases the decision boundary gets smoother and less curvy; we therefore consider it a less complex model than those with lower $k$.

Generally, complex models run the risk of being sensitive to noise in the specific data we have, rather than reflecting general trends in it. This is known as overfitting. If we increase $k$ even more and make the model even simpler, it will perform less well on both the test and the training set, a situation known as underfitting, as indicated in the plot below.

In [9]:
# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 30)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()

This is known as a model complexity curve. We can see that there is a sweet spot in the middle that gives us the best performance on the test set.

Classification metrics

Accuracy can be used to measure model performance, but it is not always an informative metric, for example when one of our target classes is much more frequent than the others. This situation, known as class imbalance, is very common in practice and requires other metrics to assess the performance of our model.

We can then use a confusion matrix, a matrix that summarizes predictive performance. Given any model, we can fill in the confusion matrix according to its predictions.

In [10]:
from sklearn.metrics import confusion_matrix 

# confusion matrix as pandas dataframe
cm = confusion_matrix(y_test, y_pred)
cm = pd.DataFrame(cm, index=['actual 0', 'actual 1', 'actual 2'], 
                  columns=['predicted 0', 'predicted 1', 'predicted 2'])
cm
Out[10]:
predicted 0 predicted 1 predicted 2
actual 0 20 0 0
actual 1 0 19 1
actual 2 0 1 19

There are several other important metrics we can easily calculate from the confusion matrix.

Precision is the number of true positives divided by the total number of true positives and false positives. It is also called the positive predictive value or PPV: $$\frac{tp}{tp + fp}$$

Recall is the number of true positives divided by the total number of true positives and false negatives. It is also called sensitivity, hit rate or true positive rate: $$\frac{tp}{tp + fn}$$

The F1 score is defined as two times the product of the precision and recall divided by their sum. In other words, it is the harmonic mean of precision and recall: $$2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$$
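
To connect these formulas to the confusion matrix, here is a sketch computing precision, recall and F1 by hand for class 1, reading the counts off the cm DataFrame built above; the values should agree with the per-class numbers in the classification report below.

# counts for class 1, read off the confusion matrix above
tp = cm.loc['actual 1', 'predicted 1']
fp = cm.loc['actual 0', 'predicted 1'] + cm.loc['actual 2', 'predicted 1']
fn = cm.loc['actual 1', 'predicted 0'] + cm.loc['actual 1', 'predicted 2']

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)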

To produce all these metrics at once, we pass the same arguments to classification_report, which outputs a string containing all the relevant metrics; alternatively, we can change the return type to a dictionary by setting the output_dict parameter to True.

In [11]:
from sklearn.metrics import classification_report

# classification report as dict
cr = classification_report(y_test, y_pred, output_dict=True)

# create a dataframe and transpose
cr = pd.DataFrame(cr).transpose()
cr
Out[11]:
f1-score precision recall support
0 1.000000 1.000000 1.000000 20.000000
1 0.950000 0.950000 0.950000 20.000000
2 0.950000 0.950000 0.950000 20.000000
accuracy 0.966667 0.966667 0.966667 0.966667
macro avg 0.966667 0.966667 0.966667 60.000000
weighted avg 0.966667 0.966667 0.966667 60.000000

Logistic Regression

Logistic regression is another model used to perform classification tasks, despite its name. We will not go into the mathematical details here, but will provide an intuition for how logistic regression works for binary classification.

Given one feature, logistic regression will output a probability $p$ with respect to the target variable. If $p$ is greater than 0.5, we label the data as 1; if $p$ is less than 0.5, we label it 0. Logistic regression produces a linear decision boundary.

Using LogisticRegression in sklearn follows the same pattern as the other models: instantiate, fit, predict.

In [12]:
from sklearn.linear_model import LogisticRegression

# load breast cancer dataset
X, y = datasets.load_breast_cancer(return_X_y=True)

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# instantiate a logistic regression model
logreg = LogisticRegression(solver='liblinear')

# fit the model
logreg.fit(X_train, y_train)

# predict on test set
y_pred = logreg.predict(X_test)

In defining logistic regression we have specified a threshold of 0.5 for the probability $p$, a threshold that defines our model. This is not particular to logistic regression: other models, for example k-NN, also use thresholds. If we lower the threshold to $p=0$, the model predicts 1 for all the data; when the threshold is $p=1$, the model predicts 0 for all the data. If we vary the threshold between these two extremes, we get a series of different false positive and true positive rates. The set of points we get when trying all possible thresholds is called an ROC (Receiver Operating Characteristic) curve.
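
Varying the threshold by hand is straightforward; a quick sketch re-using the fitted logreg from above (the 0.3 threshold is just an illustrative value, not something used elsewhere in this post):

# probability of the positive class for each test observation
proba = logreg.predict_proba(X_test)[:, 1]

# the default predict() corresponds to a 0.5 threshold; here we lower it
custom_threshold = 0.3
y_pred_low_threshold = (proba >= custom_threshold).astype(int)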

To plot the ROC curve, we need to import roc_curve from sklearn.metrics. The first argument is the actual labels, the second the predicted probabilities. The function returns the false positive rates, the true positive rates and the decreasing thresholds on the decision function used to compute them.

In [13]:
from sklearn.metrics import roc_curve

# predict the probabilities on our test set
y_proba = logreg.predict_proba(X_test)[:, 1]

# apply the roc_curve function
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# plot the curve
_ = plt.plot([0,1], [0,1], 'k--')
_ = plt.plot(fpr, tpr, label='Logistic Regression')
_ = plt.xlabel('False Positive Rate')
_ = plt.ylabel('True Positive Rate')
_ = plt.title('Logistic Regression ROC Curve')
_ = plt.legend()

We used the predicted probability of the model assigning a value of 1 to the observation in question. This is because, to compute the ROC curve, we do not merely want the predictions on the test set; we want the probability that our logistic regression model outputs before a threshold is applied to predict the label. To do this we call the predict_proba method on the model and pass it the test data. It returns an array with two columns: each column contains the probabilities for the respective target value. We chose the second column, the one with index 1, that is, the probability of the label being 1.

The larger the area under the ROC curve, the better our model is. The area under the ROC curve, commonly denoted as AUC is another popular metric for classification models. To compute the AUC we need to import roc_auc_score from sklearn.metrics.

In [14]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_proba)
Out[14]:
0.9977071732721914

When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds.

In [15]:
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

_ = plt.plot(recall, precision, label='Logistic Regression')
_ = plt.xlabel('Recall')
_ = plt.ylabel('Precision')
_ = plt.legend()

Next we will check out another type of supervised learning problem: regression.

Regression

In regression tasks, the target is a continuously varying variable. Our first regression task will use the Boston housing dataset, again available from sklearn's preloaded datasets.

In [16]:
# load the boston data 
boston = datasets.load_boston()

# assign our features and target
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

X.head()
Out[16]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

The documentation tells us what the feature names stand for, as we can see below.

In [17]:
datasets.load_boston()['DESCR']
Out[17]:
".. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:**  \n\n    :Number of Instances: 506 \n\n    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n    :Attribute Information (in order):\n        - CRIM     per capita crime rate by town\n        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\n        - INDUS    proportion of non-retail business acres per town\n        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n        - NOX      nitric oxides concentration (parts per 10 million)\n        - RM       average number of rooms per dwelling\n        - AGE      proportion of owner-occupied units built prior to 1940\n        - DIS      weighted distances to five Boston employment centres\n        - RAD      index of accessibility to radial highways\n        - TAX      full-value property-tax rate per $10,000\n        - PTRATIO  pupil-teacher ratio by town\n        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n        - LSTAT    % lower status of the population\n        - MEDV     Median value of owner-occupied homes in $1000's\n\n    :Missing Attribute Values: None\n\n    :Creator: Harrison, D. and Rubinfeld, D.L.\n\nThis is a copy of UCI ML housing dataset.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n\n\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\nprices and the demand for clean air', J. Environ. Economics & Management,\nvol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n...', Wiley, 1980.   N.B. Various transformations are used in the table on\npages 244-261 of the latter.\n\nThe Boston house-price data has been used in many machine learning papers that address regression\nproblems.   \n     \n.. topic:: References\n\n   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n"

The target variable "MEDV" is the median value of owner occupied homes in thousands of dollars.

As a first task we will try to predict the price from a single feature: the average number of rooms per dwelling.

In [18]:
# select the column at index 5 (RM), the average number of rooms
X_rooms = X.iloc[:, 5].to_numpy()

# plot on figure
_ = sns.scatterplot(x=X_rooms, y=y)
_ = plt.ylabel('Value of House /1000 ($)')
_ = plt.xlabel('Number of rooms')

We can immediately see that there is a correlation: as one might expect, more rooms lead to higher prices.

It's time to fit a regression model to our data. We are going to use a model called LinearRegression from sklearn.linear_model.

Linear regression

Linear regression is probably one of the most important and widely used regression techniques. It’s among the simplest regression methods. One of its main advantages is the ease of interpreting results.

In [19]:
from sklearn.linear_model import LinearRegression

# instantiate the regressor
reg = LinearRegression()

# fit the regressor to the data; reshape(-1, 1) turns the 1-D array into a single-column 2-D array, as sklearn expects
reg.fit(X_rooms.reshape(-1, 1), y)
Out[19]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [20]:
# create our x values
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1, 1) 

# plot the data with a regression line
_ = sns.scatterplot(x=X_rooms, y=y)
_ = plt.plot(prediction_space, reg.predict(prediction_space), color='red', linewidth=2)
_ = plt.xlabel('Number of Rooms')
_ = plt.ylabel('Value of house /1000 ($)')

The equation of the above line is:

$$y = ax + b$$

Where $b$ is the intercept and $a$ is the slope of the line. So basically, the linear regression algorithm finds the optimal values for the intercept and the slope (in two dimensions). The $y$ and $x$ variables remain the same, since they are the data and cannot be changed; the values we can control are the intercept $b$ and the slope $a$.

But how do we choose $a$ and $b$? A common method is to define an error function for any given line and then choose the line that minimizes it. Such an error function is also called a loss function or cost function. Here, we wish to minimize the vertical distance between the fit and the data, so for each data point we calculate the vertical distance between it and the line. This distance is called a residual.

We could try to minimize the sum of the residuals, but then a large positive residual would cancel out a large negative one. For this reason we minimize the sum of the squares of the residuals. This will be our loss function, and using it is commonly called ordinary least squares (OLS).
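
As a small sketch using the single-feature model fitted above, the residuals and the OLS loss (the residual sum of squares) can be computed directly:

# predictions of the single-feature model on the training data
y_hat = reg.predict(X_rooms.reshape(-1, 1))

# vertical distances between each data point and the fitted line
residuals = y - y_hat

# the quantity OLS minimizes: the sum of squared residuals
print(np.sum(residuals ** 2))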

This same concept can be extended to cases where there are more than two variables. This is called multiple linear regression. In this case, the dependent variable (target variable) depends on several independent variables. A regression model involving multiple variables can be represented as:

$$y = a_1x_1 + a_2x_2 + ... + a_nx_n + b$$

Fitting such a regression model means specifying a coefficient $a_i$ for each feature, as well as the intercept $b$.

Let's now see this again on the Boston Housing dataset, but this time we will work with all the features.

In [21]:
# split our data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, 
                                                    test_size = 0.3, 
                                                    random_state=23) 

# instantiate a regression model
reg_all = LinearRegression() 

#fit linear model on the training set
reg_all.fit(X_train, y_train) 

#predict on the test set using the linear model
y_pred = reg_all.predict(X_test)
print('coefficients: ', reg_all.coef_)
print('intercept: ', reg_all.intercept_)
coefficients:  [-9.34184911e-02  4.33034125e-02 -2.25219455e-02  3.34204681e+00
 -1.63301964e+01  4.27193035e+00  7.20755545e-04 -1.40323066e+00
  2.55428997e-01 -7.48230095e-03 -8.50445960e-01  1.46848799e-02
 -5.29853457e-01]
intercept:  27.7896067032509

We saw that in classification we can use accuracy as a metric of model performance. The default scoring method for LinearRegression is called $R^2$.

The $R^2$ metric quantifies the amount of variance in the target variable that is predicted from the feature variables. To compute $R^2$ we once again call the score method on the model and pass it two arguments: the test data and the test targets.

In [22]:
reg_all.score(X_test, y_test)
Out[22]:
0.6947991644651335
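
The same number can be obtained by hand from the definition of $R^2$: one minus the ratio of the residual sum of squares to the total sum of squares. A quick sketch using y_test and the y_pred computed above:

# residual sum of squares of the model's predictions
ss_res = np.sum((y_test - y_pred) ** 2)

# total sum of squares around the mean of the test targets
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)

print(1 - ss_res / ss_tot)  # matches reg_all.score(X_test, y_test)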

We will generally not use linear regression out of the box like this; instead we will want to use regularization, which we will see shortly and which places further constraints on the model coefficients.

Cross-validation

If we are computing $R^2$ on our test set, the $R^2$ returned is dependent on the way that we split up our data. The data points in the test set may have some peculiarities that mean the $R^2$ computed on it is not representative of the model's ability to generalize to unseen data. To combat this dependence on what is essentially an arbitrary split, we use a technique called cross-validation.

We begin by splitting the dataset into $k$ groups or folds. Then we hold out our first fold as the test set and fit our model on the remaining folds, predict on the test set and compute the metric of interest. Next we hold out our second fold as the test set, then train on the other folds and compute our metric of interest on the test set. We continue like this for all the other folds. As a result we get $k$ values of $R^2$ from which we can compute statistics of interest such as the mean, the median and confidence intervals.

There is, however, a tradeoff: using more folds is more computationally expensive, because we fit and predict more times. This method avoids the problem of our metric of choice being dependent on a single train/test split. To perform k-fold cross-validation in sklearn we first import cross_val_score from sklearn.model_selection.

In [23]:
from sklearn.model_selection import cross_val_score 

reg = LinearRegression() 

cv_results = cross_val_score(reg, X, y, cv=5) 
print(cv_results)
print('Mean R^2: ', np.mean(cv_results))
[ 0.63919994  0.71386698  0.58702344  0.07923081 -0.25294154]
Mean R^2:  0.3532759243958772

Regularized Regression

Linear regression minimizes a loss function by choosing a coefficient $a_i$ for each feature variable. If we allow these coefficients or parameters to be large, it can lead to overfitting. For this reason, it is common practice to alter the loss function so that it penalizes large coefficients. This is called regularization.

Ridge Regression

We will first look at ridge regression, a type of regularized regression in which our loss function is the standard OLS loss plus the squared value of each coefficient multiplied by a constant $\alpha$:

$$\text{OLS loss function}+\alpha\ast\sum_{i=1}^{n}a_i^2$$

Thus, when minimizing the loss function to fit our data, models are penalized for coefficients with a large magnitude, meaning large positive or large negative coefficients. $\alpha$ is a parameter we need to choose in order to fit and predict; essentially, we select the $\alpha$ for which our model performs best. Picking $\alpha$ for ridge regression is similar to picking $k$ in k-NN; this is called hyperparameter tuning. This $\alpha$, which we may also see called $\lambda$ in the wild, can be thought of as a parameter that controls model complexity. Notice that when $\alpha=0$ we get back OLS.
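
Since the best $\alpha$ is not known in advance, a common approach is to score a small grid of candidate values with cross-validation and keep the best one. A quick sketch on the Boston data; the grid values are illustrative, and normalize=True matches the scikit-learn version used in this post:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# candidate values for alpha (illustrative, not exhaustive)
for alpha in [0.01, 0.1, 1, 10, 100]:
    ridge = Ridge(alpha=alpha, normalize=True)
    # mean cross-validated R^2 for this alpha
    print(alpha, cross_val_score(ridge, X, y, cv=5).mean())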

In [24]:
from sklearn.linear_model import Ridge

# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# instantiate a ridge model
ridge = Ridge(alpha=0.1, normalize=True)

#fit on training set
ridge.fit(X_train, y_train)

# predict test set
ridge_pred = ridge.predict(X_test)

# score model
ridge.score(X_test, y_test)
Out[24]:
0.6996938275127313

Lasso Regression

There is another type of regularized regression called lasso regression in which our loss function is the standard OLS loss function plus the absolute value of each coefficient multiplied by a constant $\alpha$:

$$\text{OLS loss function}+\alpha\ast\sum_{i=1}^{n}|a_i|$$

The method of performing lasso regression mirrors ridge regression. One of the really cool aspects of lasso regression is that it can be used to select important features of a dataset, because it tends to shrink the coefficients of less important features to exactly zero. The features whose coefficients are not shrunk to zero are selected by the lasso algorithm.

In [25]:
from sklearn.linear_model import Lasso

# get the feature names
names = X.columns

# instantiate a lasso model
lasso = Lasso(alpha=0.1, normalize=True)

# get the features coefficients
lasso_coef = lasso.fit(X, y).coef_

# plot the coefficients for each feature
_ = plt.plot(range(len(names)), lasso_coef) 
_ = plt.xticks(range(len(names)), names, rotation=50) 
_ = plt.ylabel('Coefficients') 
_ = plt.xlabel('Features')

We can see that the most important predictor for our target variable is the number of rooms.

Hyperparameter tuning

In the models seen previously we had to set some parameters, for example alpha in ridge regression or n_neighbors in k-NN. Such parameters, which cannot be explicitly learned by fitting the model, are called hyperparameters, and we need to choose them so that our model fits the data best.

Choosing the correct hyperparameters is key to a successful model. The basic idea is to try a whole bunch of different values, fit them all separately, see how well each performs and choose the best one. This is called hyperparameter tuning. When trying different hyperparameter values it is essential to use cross-validation, as relying on a single train/test split alone would risk overfitting the hyperparameters to the test set.

To proceed with hyperparameter tuning, we first choose a grid of possible values for the hyperparameter or hyperparameters. We then perform k-fold cross-validation for each point in the grid, that is, for each choice of hyperparameter or combination of hyperparameters. We then choose for our model the hyperparameters that performed best. This is called a grid search, and in scikit-learn we implement it using the class GridSearchCV.

In [26]:
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_breast_cancer(return_X_y=True)

# create a dictionary with parameters and values
param_grid = {'C': np.logspace(-5, 8, 15)}

# instantiate a logistic regression classifier
logreg = LogisticRegression(solver='liblinear')

# instantiate the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# fit on data
logreg_cv.fit(X, y)

print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))
Tuned Logistic Regression Parameters: {'C': 100000000.0}
Best score is 0.9578207381370826

best_params_ returns the parameters that performed best, and best_score_ returns the mean cross-validation score across the folds. If we specify multiple parameters, all possible combinations will be tried, so GridSearchCV can be computationally expensive, especially when searching over a large hyperparameter space with multiple hyperparameters. A solution is RandomizedSearchCV, in which not all hyperparameter values are tried out; instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. This is applied to the decision tree below.

In [27]:
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5, iid=True)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 6}
Best score is 0.9314586994727593

After using k-fold cross-validation to tune the model's hyperparameters, we may want to report how well the model can be expected to perform on a dataset it has never seen before, given a scoring function of choice. For this reason it is important to split all of the data at the very beginning into a training set and a hold-out set, then perform cross-validation on the training set to tune the model's hyperparameters. After this we can evaluate the tuned model on the hold-out set, which has not been used at all, to estimate how well the model can be expected to perform on a new dataset. We can still use train_test_split for this, as sketched below.
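
A minimal sketch of this workflow on the breast cancer data loaded above; the variable names (X_tr, X_holdout and so on) and parameter values are illustrative:

# hold out a final evaluation set that the tuning never sees
X_tr, X_holdout, y_tr, y_holdout = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

# tune the hyperparameters with cross-validation on the training portion only
logreg_cv = GridSearchCV(LogisticRegression(solver='liblinear'),
                         {'C': np.logspace(-5, 8, 15)},
                         cv=5)
logreg_cv.fit(X_tr, y_tr)

# estimate of the tuned model's performance on unseen data
print(logreg_cv.score(X_holdout, y_holdout))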

Preprocessing Data

All the data used so far was relatively "nice" and in a format that allowed us to use scikit-learn models from the get-go. With real-world data this will rarely be the case; instead we will have to preprocess the data before we can build models.

Categorical Features

In the case of a categorical feature with non-numerical values, the sklearn API will not accept it, and we will have to preprocess such features into the correct format. Our goal is to convert these features so that they are numerical. The way to achieve this is to split the feature into a number of binary features called "dummy variables", one for each category: 0 means the observation was not that category, 1 means it was.

There are several ways to create dummy variables: in scikit-learn we can use OneHotEncoder, or we can use pandas' get_dummies function.

In [28]:
# create a categorical dataframe
before = pd.DataFrame(['France', 'Italy', 'United Kingdom', 
                       'Italy', 'United Kingdom', 'United Kingdom'],
                        dtype='category', columns=['Country'])
display(before)

# create dummy features
after = pd.get_dummies(before, drop_first=True)
display(after)
Country
0 France
1 Italy
2 United Kingdom
3 Italy
4 United Kingdom
5 United Kingdom
Country_Italy Country_United Kingdom
0 0 0
1 1 0
2 0 1
3 1 0
4 0 1
5 0 1

Notice the drop_first parameter set to True: if the country is neither Italy nor United Kingdom, it is implicitly France. Keeping the redundant France column would duplicate information, which can be an issue for some models.
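
For completeness, here is a sketch of the same encoding with scikit-learn's OneHotEncoder; the drop='first' and sparse=False options assume a scikit-learn version from this post's era (0.21 or later).

from sklearn.preprocessing import OneHotEncoder

# drop='first' mirrors drop_first=True in get_dummies
ohe = OneHotEncoder(drop='first', sparse=False)
encoded = ohe.fit_transform(before[['Country']])

print(ohe.categories_)
print(encoded)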

Missing values

We say that data is missing when there is no value for a given feature in a particular observation. This can occur for many reasons and needs to be dealt with.

The following diabetes dataset, for example, apparently has no null values, as we can see below.

In [29]:
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
names = ['preg', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age', 'diabetes']

# load DataFrame from url
df = pd.read_csv(url, names=names)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
preg         768 non-null int64
glucose      768 non-null int64
diastolic    768 non-null int64
triceps      768 non-null int64
insulin      768 non-null int64
bmi          768 non-null float64
dpf          768 non-null float64
age          768 non-null int64
diabetes     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

However missing values can be encoded in a number of different ways.

In [30]:
df.describe()
Out[30]:
preg glucose diastolic triceps insulin bmi dpf age diabetes
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

Checking out the data, it looks like there are observations equal to zero, which is not possible for insulin levels or triceps skinfold thickness, for example. Since we have no indication of the real values, this data is missing.

To handle missing data, one option is to drop all observations that contain missing values, although it is usually better to keep data rather than discard it. Another option is to impute the missing data, which means making an educated guess as to what the missing values could be. We could, for example, replace each missing value with the mean of that feature.

In [31]:
from sklearn.impute import SimpleImputer

# convert the missing values to np.nan
df.insulin.replace(0, np.nan, inplace=True)
df.triceps.replace(0, np.nan, inplace=True)
df.bmi.replace(0, np.nan, inplace=True)

# create an imputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit the imputer
imp.fit(df)

# create the imputed dataframe
X = pd.DataFrame(imp.transform(df), columns=df.columns)
X.head()
Out[31]:
preg glucose diastolic triceps insulin bmi dpf age diabetes
0 6.0 148.0 72.0 35.00000 155.548223 33.6 0.627 50.0 1.0
1 1.0 85.0 66.0 29.00000 155.548223 26.6 0.351 31.0 0.0
2 8.0 183.0 64.0 29.15342 155.548223 23.3 0.672 32.0 1.0
3 1.0 89.0 66.0 23.00000 94.000000 28.1 0.167 21.0 0.0
4 0.0 137.0 40.0 35.00000 168.000000 43.1 2.288 33.0 1.0

Because they can transform data in this way, imputers are known as transformers; in general, any sklearn estimator that can transform data through a transform method is called a transformer.

Centering and Scaling

If the ranges of our features vary widely, features on larger scales can unduly influence our model. For this reason we want the features to be on a similar scale. To do this, we do what is called normalizing, or scaling and centering, the data.

There are several ways to normalize our data. In standardization, for any given column we subtract the mean and divide by the standard deviation, so that all features are centered around zero and have unit variance. Alternatively, we can subtract the minimum and divide by the range of the data, so the normalized feature has a minimum of zero and a maximum of one. There are different approaches; see the scikit-learn preprocessing documentation for a more complete list of scalers.

In [32]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
scaled.head()
Out[32]:
preg glucose diastolic triceps insulin bmi dpf age diabetes
0 0.639947 0.848324 0.149641 6.655021e-01 -3.345079e-16 0.166292 0.468492 1.425995 1.365896
1 -0.844885 -1.123396 -0.160546 -1.746338e-02 -3.345079e-16 -0.852531 -0.365061 -0.190672 -0.732120
2 1.233880 1.943724 -0.263941 8.087936e-16 -3.345079e-16 -1.332833 0.604397 -0.105584 1.365896
3 -0.844885 -0.998208 -0.160546 -7.004289e-01 -7.243887e-01 -0.634212 -0.920763 -1.041549 -0.732120
4 -1.141852 0.504055 -1.504687 6.655021e-01 1.465506e-01 1.548980 5.484909 -0.020496 1.365896
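
The min-max approach mentioned above is available as MinMaxScaler and is used the same way; a quick sketch on the imputed diabetes data:

from sklearn.preprocessing import MinMaxScaler

# rescale every feature to the [0, 1] range
minmax = MinMaxScaler()
minmax_scaled = pd.DataFrame(minmax.fit_transform(X), columns=X.columns)

# every column now has minimum 0 and maximum 1
minmax_scaled.describe().loc[['min', 'max']]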

Pipeline

We can use the scikit-learn Pipeline object when we want to perform multiple tasks sequentially, that is, apply a list of transformers and a final estimator. Intermediate steps of the pipeline must be transformers: they must implement fit and transform methods. The final estimator only needs to implement fit.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. In the next example we first apply an imputer, then a scaler, and finally fit our k-NN model.

In [33]:
from sklearn.pipeline import Pipeline

y = df.diabetes
X = df.drop('diabetes', axis=1)

#construct a list of steps in the pipeline
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
         ('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier(n_neighbors=6))]

# instantiate a pipeline 
pipeline = Pipeline(steps)

# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=42)

# fit the pipeline
pipeline.fit(X_train, y_train) 

# predict on test set
y_pred = pipeline.predict(X_test) 

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       151
           1       1.00      0.96      0.98        80

    accuracy                           0.99       231
   macro avg       0.99      0.98      0.99       231
weighted avg       0.99      0.99      0.99       231
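
Because the pipeline behaves like a single estimator, it can also be tuned with GridSearchCV; the hyperparameters of a step are referenced as the step name followed by a double underscore and the parameter name. A sketch re-using the steps list and the X and y defined above (the n_neighbors grid is illustrative):

from sklearn.model_selection import GridSearchCV

# hyperparameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {'knn__n_neighbors': np.arange(1, 20)}

# cross-validate the whole pipeline (imputation, scaling, k-NN) for each value
pipeline_cv = GridSearchCV(Pipeline(steps), param_grid, cv=5)
pipeline_cv.fit(X, y)

print(pipeline_cv.best_params_)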

Conclusion

The scikit-learn library provides many different algorithms which can be imported into the code and then used to build models just like we would import any other Python library. This makes it easier to quickly build different models and compare these models to select the highest scoring one.

In this tutorial, we have only scratched the surface of what is possible with the scikit-learn library. To use this Machine Learning library to the fullest, there are many resources available with detailed documentation that you can dive into.

But to really appreciate the true power of the scikit-learn library, what you really need to do is start using it on different open data sets and building predictive models using these data sets. There are various sources that contain many interesting data sets on which one can practice building predictive models by using the algorithms provided by the scikit-learn library.