In supervised learning we have several data points, each described by predictor variables and a target variable. The aim is to build a model that can predict the target variable. For this we need labelled data, that is, data with both input and output values.
There are many ways to perform supervised learning; here we will use scikit-learn, one of the most popular and user-friendly machine learning libraries for Python. It also integrates very well with the scipy stack, including numpy.
Our data is commonly represented in a table structure such as the one you see below.
from sklearn import datasets
import numpy as np
import pandas as pd
# load iris data from scikit-learn's datasets
iris = datasets.load_iris()
# convert the data to a pandas dataframe
X = pd.DataFrame(iris.data, columns=iris.feature_names)
X.head(7)
Here we have data pertaining to iris flowers, in which the features consist of four measurements: sepal length, sepal width, petal length and petal width. The target variable encodes the species of flower; there are 3 possibilities, as we can see below.
y = iris.target
np.unique(y)
0 stands for setosa, 1 for versicolor and 2 for virginica.
We can use the shape attribute to return a tuple representing the dimensionality of the DataFrame.
X.shape
In order to perform some initial EDA (exploratory data analysis) we will use seaborn, a visualization library based on matplotlib.
Using seaborn.pairplot we can plot pairwise relationships in a dataset.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
df = X.copy()
df['target'] = y
_ = sns.pairplot(df, hue='target', vars=df.columns[:-1])
The result is a matrix of figures where the diagonal axes show the univariate distribution of the variable in that column. The off-diagonal axes are scatter plots of the column feature versus the row feature, colored by the target variable.
There is a great deal of information in this scatter matrix. Petal length and width are highly correlated. Also, flowers are clustered according to species.
Classification¶
So, we have a set of labelled data and we want to build a classifier that takes new unlabelled data as input and outputs a label. How do we construct this classifier?
We first need to choose a type of classifier, and it needs to learn from the already labelled data. For this reason, we call the already labelled data the training data. We will initially choose a simple algorithm called k-nearest neighbors.
k-Nearest Neighbors¶
The basic idea of k-nearest neighbors, or k-NN, is to predict the label of any data point by looking at the $k$ closest labeled data points and getting them to vote on what label the unlabeled point should have.
What the k-NN algorithm essentially does is create a set of decision boundaries; the predicted label of a new data point is determined by the region in which it falls.
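As a rough illustration of the voting idea (this is not how scikit-learn implements it, and the helper below is purely hypothetical), a few lines of NumPy are enough:
# Toy sketch of the k-NN vote: find the k closest labelled points by Euclidean
# distance and return the most common label among them.
def knn_vote(X_labeled, y_labeled, new_point, k=6):
    distances = np.linalg.norm(X_labeled - new_point, axis=1)
    nearest = np.argsort(distances)[:k]           # indices of the k closest points
    labels, counts = np.unique(y_labeled[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote
knn_vote(X.to_numpy(), y, X.to_numpy()[0])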
All machine learning models in scikit-learn are implemented as classes. These classes serve two purposes: they implement the algorithms for learning from data and for making predictions, while also storing all the information that is learned from the data.
Training a model on the data is also called fitting the model to the data. In sklearn we use the fit method to do this. Similarly, the predict method is what we use to predict the label of an unlabelled data point.
The scikit-learn API requires, first of all, that the data be stored as NumPy arrays or a pandas DataFrame. It also requires that the features take on continuous values and that there are no missing values in the data. The iris dataset already satisfies these prerequisites, so we can move on to fitting a model.
To fit our k-nearest neighbors classifier we need to import it from sklearn.neighbors. Using the n_neighbors parameter we can set the number of neighbors. Notice we named the feature array X and the response variable y, in accordance with common scikit-learn practice.
from sklearn.neighbors import KNeighborsClassifier
# create a knn classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)
# fit it to the features X and target y
knn.fit(X, y)
Having fit a k-NN classifier, we can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! We can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.
For now, we can sample a few points from X at random and treat them as new, unlabeled data.
X_new = X.sample(n=5, random_state=22)
prediction = knn.predict(X_new)
print('Prediction {}'.format(prediction))
We now need to figure out how to measure its performance. In classification, accuracy is a commonly used metric. The accuracy of a classifier is defined as the number of correct predictions divided by the total number of data points.
In order to have data points to train on and data points to make predictions on, we can split the data into a training set and a test set using train_test_split, and then compute our accuracy.
from sklearn.model_selection import train_test_split
# split our data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.4,
                                                    random_state=8,
                                                    stratify=y)
# create our k-NN with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)
# fit on the training set
knn.fit(X_train, y_train)
# predict the test set
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
#compute the accuracy
acc = knn.score(X_test, y_test)
print('Accuracy: ', acc)
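As a quick sanity check, the accuracy returned by score is just the fraction of predictions that match the true labels:
# accuracy computed by hand: fraction of correct test-set predictions
print(np.mean(y_pred == y_test))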
The accuracy is pretty good for an untuned model.
We have recently introduced the concept of a decision boundary. In the next plot we will visualize the decision boundaries for several increasing values of $k$ in a k-NN model.
# we only take two features for plotting the decision boundaries
bi_dim = X.copy().to_numpy()[:, [0, 2]]
h = .01 # step size in the mesh
x_min, x_max = bi_dim[:, 0].min() - 1, bi_dim[:, 0].max() + 1
y_min, y_max = bi_dim[:, 1].min() - 1, bi_dim[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
neighbors = [1, 6, 15]
fig, ax = plt.subplots(1, 3, figsize=(10,10))
for axis, n in zip(ax, neighbors):
    clf = KNeighborsClassifier(n_neighbors=n)
    clf.fit(bi_dim, y)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    axis.pcolormesh(xx, yy, Z, cmap='plasma', alpha=.8)
    # Plot also the data points
    axis.scatter(bi_dim[:, 0], bi_dim[:, 1], c=y,
                 cmap='plasma', edgecolor='k', s=20)
    axis.title.set_text("k = {}".format(n))
plt.show()
Notice that as $k$ increases the decision boundary gets smoother and less curvy; we therefore consider it a less complex model than those with lower $k$.
Generally, complex models run the risk of being sensitive to noise in the specific data we have, rather than reflecting general trends in it. This is known as overfitting. If we increase $k$ even more, making the model even simpler, then the model will perform less well on both the test and training sets, as indicated in the plot below; this is known as underfitting.
# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 30)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)
# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
This is known as a model complexity curve. We can see that there is a sweet spot in the middle that gives the best performance on the test set.
Classification metrics¶
Accuracy can be used to measure model performance, but it is not always a useful metric, for example when one of the target classes is much more frequent than the others. This situation, known as class imbalance, is very common in practice and requires other metrics to assess the performance of our model.
We can then use a confusion matrix, a matrix that summarizes predictive performance. Given any model, we can fill in the confusion matrix according to its predictions.
from sklearn.metrics import confusion_matrix
# confusion matrix as pandas dataframe
cm = confusion_matrix(y_test, y_pred)
cm = pd.DataFrame(cm, index=['actual 0', 'actual 1', 'actual 2'],
                  columns=['predicted 0', 'predicted 1', 'predicted 2'])
cm
There are several other important metrics we can easily calculate from the confusion matrix.
Precision, which is the number of true positives divided by the total number of true positives and false positives. It is also called the positive predictive value or PPV: $$\frac{tp}{tp + fp}$$
Recall, which is the number of true positives divided by the total number of true positives and false negatives. It is also called sensitivity, hit rate or true positive rate: $$\frac{tp}{tp + fn}$$
F1 score, defined as two times the product of the precision and recall divided by their sum; in other words, it is the harmonic mean of precision and recall: $$2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$$
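As a sanity check, here is a minimal sketch computing these metrics per class directly from the confusion matrix built above (cm is the DataFrame of counts, with actual classes on the rows and predicted classes on the columns):
# per-class precision, recall and F1 computed from the raw confusion matrix counts
counts = cm.to_numpy()
tp = np.diag(counts)                   # true positives for each class
precision = tp / counts.sum(axis=0)    # column sums = everything predicted as that class
recall = tp / counts.sum(axis=1)       # row sums = everything actually in that class
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)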
To produce these metrics all at once, we pass the same arguments (the true and predicted labels) to classification_report, which outputs a string containing all the relevant metrics; alternatively, we can change the returned type to a dictionary by setting the output_dict parameter to True.
from sklearn.metrics import classification_report
# classification report as dict
cr = classification_report(y_test, y_pred, output_dict=True)
# create a dataframe and transpose
cr = pd.DataFrame(cr).transpose()
cr
Logistic Regression¶
Despite its name, logistic regression is another model used to perform classification tasks. We will not go into the mathematical details here but will provide an intuition of how logistic regression works for binary classification.
Given one feature, logistic regression will output a probability $p$ with respect to the target variable. If $p$ is greater than 0.5, we label the data as 1; if $p$ is less than 0.5, we label it 0. Logistic regression produces a linear decision boundary.
Using LogisticRegression in scikit-learn follows the same pattern as the other models.
from sklearn.linear_model import LogisticRegression
# load breast cancer dataset
X, y = datasets.load_breast_cancer(return_X_y=True)
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# instantiate a logistic regression model
logreg = LogisticRegression(solver='liblinear')
# fit the model
logreg.fit(X_train, y_train)
# predict on test set
y_pred = logreg.predict(X_test)
In defining logistic regression we have specified a threshold of 0.5 for the probability $p$, a threshold that defines our model. This is not particular to logistic regression; it applies to other models as well, for example k-NN. If we lower the threshold to $p=0$, the model predicts 1 for all the data; when the threshold is $p=1$, the model predicts 0 for all the data. If we vary the threshold between these two extremes, we get a series of different false positive and true positive rates. The set of points we get when trying all possible thresholds is called an ROC (Receiver Operating Characteristic) curve.
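To make this concrete, here is a small sketch (the 0.3 threshold is an arbitrary choice for illustration) showing that thresholding the predicted probabilities at 0.5 should reproduce the default predictions, while a lower threshold changes some of them:
# thresholding the positive-class probabilities by hand
proba = logreg.predict_proba(X_test)[:, 1]
print(np.array_equal((proba > 0.5).astype(int), logreg.predict(X_test)))  # expected: True
print('predictions changed at threshold 0.3:',
      ((proba > 0.3).astype(int) != logreg.predict(X_test)).sum())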
To plot the ROC curve, we need to import roc_curve from sklearn.metrics. The first argument is the actual labels, the second the predicted probabilities. The function returns the false positive rates, the true positive rates, and the decreasing thresholds on the decision function used to compute them.
from sklearn.metrics import roc_curve
# predict the probabilities on our test set
y_proba = logreg.predict_proba(X_test)[:, 1]
# apply the roc_curve function
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# plot the curve
_ = plt.plot([0,1], [0,1], 'k--')
_ = plt.plot(fpr, tpr, label='Logistic Regression')
_ = plt.xlabel('False Positive Rate')
_ = plt.ylabel('True Positive Rate')
_ = plt.title('Logistic Regression ROC Curve')
_ = plt.legend()
We used the predicted probabilities of the model assigning a value of '1' to the observation in question. This is because to compute the ROC curve we do not merely want the predictions on the test set, but the probabilities that our logistic regression model outputs before applying a threshold to predict the label. To get them we applied the predict_proba method to the model and passed it the test data. It returns an array with two columns, each containing the probabilities for the respective target value. We chose the second column, the one with index 1: the probabilities of the label being 1.
The larger the area under the ROC curve, the better our model is. The area under the ROC curve, commonly denoted AUC, is another popular metric for classification models. To compute the AUC we need to import roc_auc_score from sklearn.metrics.
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_proba)
When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds.
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
_ = plt.plot(recall, precision, label='Logistic Regression')
_ = plt.xlabel('Recall')
_ = plt.ylabel('Precision')
_ = plt.legend()
Next we will be checking out another type of supervised learning problem: regression.
Regression¶
In regression tasks, the target value is a continuously varying variable. Our first regression task will use the Boston housing dataset, which is again available from sklearn's preloaded datasets.
# load the boston data
boston = datasets.load_boston()
# assign our features and target
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
X.head()
The documentation tells us what the feature names stand for, as we can see below.
datasets.load_boston()['DESCR']
The target variable "MEDV" is the median value of owner occupied homes in thousands of dollars.
As a first task we will try to predict the price from a single feature: the average number of rooms per dwelling.
# select the column with index 5, corresponding to the number of rooms (RM)
X_rooms = X.iloc[:, 5].to_numpy()
# plot on figure
_ = sns.scatterplot(x=X_rooms, y=y)
_ = plt.ylabel('Value of House /1000 ($)')
_ = plt.xlabel('Number of rooms')
We can immediately see that there is a correlation: as one might expect, more rooms generally mean higher prices.
It's time to fit a regression model to our data. We are going to use a model called LinearRegression from sklearn.linear_model.
Linear regression¶
Linear regression is probably one of the most important and widely used regression techniques. It’s among the simplest regression methods. One of its main advantages is the ease of interpreting results.
from sklearn.linear_model import LinearRegression
# instantiate the regressor
reg = LinearRegression()
# fit the regressor to the data; reshape(-1, 1) turns the 1D feature array into the 2D array scikit-learn expects
reg.fit(X_rooms.reshape(-1, 1), y)
# create our x values
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1, 1)
# plot the data with a regression line
_ = sns.scatterplot(x=X_rooms, y=y)
_ = plt.plot(prediction_space, reg.predict(prediction_space), color='red', linewidth=2)
_ = plt.xlabel('Number of Rooms')
_ = plt.ylabel('Value of house /1000 ($)')
The equation of the above line is:
$$y = ax + b$$
where $b$ is the intercept and $a$ is the slope of the line. So basically, the linear regression algorithm gives us the optimal values for the intercept and the slope (in two dimensions). The $y$ and $x$ variables remain the same, since they are the data and cannot be changed; the values that we can control are the intercept ($b$) and the slope ($a$).
But how do we choose $a$ and $b$? A common method is to define an error function for any given line and then choose the line that minimizes it. Such an error function is also called a loss function or cost function. Here, we wish to minimize the vertical distance between the fit and the data. So for each data point we calculate the vertical distance between it and the line; this distance is called a residual.
We could try to minimize the sum of the residuals but then a large positive residual would cancel out a large negative residual. For this reason we minimize the sum of the squares of the residuals. This will be our loss function and using this loss function is commonly called Ordinary least squares (OLS).
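For the single-feature fit above, the residuals and the OLS loss can be computed directly:
# residuals of the one-feature fit and the corresponding OLS loss
residuals = y - reg.predict(X_rooms.reshape(-1, 1))
print('Sum of squared residuals:', np.sum(residuals ** 2))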
This same concept can be extended to cases where there are more than two variables. This is called multiple linear regression. In this case, the dependent variable (target variable) depends on several independent variables. A regression model involving multiple variables can be represented as:
$$y = a_1x_1 + a_2x_2 + \dots + a_nx_n + b$$
Fitting such a regression model amounts to specifying a coefficient $a_i$ for each feature, as well as the intercept $b$.
Let's now see this again on the Boston Housing dataset, but this time we will work with all the features.
# split our data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=23)
# instantiate a regression model
reg_all = LinearRegression()
#fit linear model on the training set
reg_all.fit(X_train, y_train)
#predict on the test set using the linear model
y_pred = reg_all.predict(X_test)
print('coefficients: ', reg_all.coef_)
print('intercept: ', reg_all.intercept_)
We saw that in classification we can use accuracy as a metric of model performance. The default scoring method for LinearRegression is called $R^2$.
The $R^2$ metric quantifies the amount of variance in the target variable that is predicted from the feature variables. To compute the $R^2$ we once again apply the score method to the model and pass it two arguments: the test data and the test data target.
reg_all.score(X_test, y_test)
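The same number can be obtained from the definition of $R^2$: one minus the ratio of the residual sum of squares to the total sum of squares.
# R^2 computed by hand: 1 - SS_residual / SS_total
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print(1 - ss_res / ss_tot)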
We will generally not use linear regression out of the box like this; we will want to use regularization, which we will see shortly and which places further constraints on the model coefficients.
Cross-validation¶
If we are computing $R^2$ on our test set, the $R^2$ returned is dependent on the way that we split up our data. The data points in the test set may have some peculiarities that mean the $R^2$ computed on it is not representative of the model's ability to generalize to unseen data. To combat this dependence on what is essentially an arbitrary split, we use a technique called cross-validation.
We begin by splitting the dataset into $k$ groups or folds. Then we hold out our first fold as the test set and fit our model on the remaining folds, predict on the test set and compute the metric of interest. Next we hold out our second fold as the test set, then train on the other folds and compute our metric of interest on the test set. We continue like this for all the other folds. As a result we get $k$ values of $R^2$ from which we can compute statistics of interest such as the mean, the median and confidence intervals.
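The procedure can be sketched by hand with KFold (the cross_val_score call below automates exactly this; the shuffle and random_state choices here are purely illustrative):
# manual 5-fold cross-validation: hold out one fold at a time, fit on the rest,
# and collect the R^2 score on the held-out fold
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X.iloc[train_idx], y[train_idx])
    scores.append(model.score(X.iloc[test_idx], y[test_idx]))
print(scores)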
There is, however, a tradeoff, as using more folds is more computationally expensive: we are fitting and predicting more times. This method avoids the problem of our metric of choice being dependent on the train/test split. To perform k-fold cross-validation in sklearn we first need to import cross_val_score from sklearn.model_selection.
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=5)
print(cv_results)
print('Mean R^2: ', np.mean(cv_results))
Regularized Regression¶
Linear regression minimizes a loss function: it chooses a coefficient $a_i$ for each feature variable. If we allow these coefficients, or parameters, to be large, it can lead to overfitting. For this reason, it is common practice to alter the loss function so that it penalizes large coefficients. This is called regularization.
Ridge Regression¶
We will first have a look at ridge regression, a type of regularized regression in which our loss function is the standard OLS loss plus the squared value of each coefficient multiplied by a constant $\alpha$:
$$\text{OLS loss function}+\alpha\ast\sum_{i=1}^{n}a_i^2$$
Thus, when minimizing the loss function to fit our data, models are penalized for coefficients with a large magnitude, meaning large positive or large negative coefficients. $\alpha$ is a parameter we need to choose in order to fit and predict; essentially, we can select the $\alpha$ for which our model performs best. Picking $\alpha$ for ridge regression is similar to picking $k$ in k-NN; this is called hyperparameter tuning. This $\alpha$, which we may also see called $\lambda$ in the wild, can be thought of as a parameter that controls model complexity. Notice that when $\alpha=0$ we get back OLS.
from sklearn.linear_model import Ridge
# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# instantiate a ridge model
ridge = Ridge(alpha=0.1, normalize=True)
#fit on training set
ridge.fit(X_train, y_train)
# predict test set
ridge_pred = ridge.predict(X_test)
# score model
ridge.score(X_test, y_test)
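Since $\alpha$ is something we choose, a quick (and deliberately crude) way to compare a few candidate values is a simple loop; the values below are arbitrary and the setup mirrors the ridge model above:
# compare a handful of alpha values on the test set
for a in [0.01, 0.1, 1, 10]:
    r = Ridge(alpha=a, normalize=True)
    r.fit(X_train, y_train)
    print(a, r.score(X_test, y_test))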
Lasso Regression¶
There is another type of regularized regression called lasso regression, in which our loss function is the standard OLS loss function plus the absolute value of each coefficient multiplied by a constant $\alpha$:
$$\text{OLS loss function}+\alpha\ast\sum_{i=1}^{n}|a_i|$$
The method of performing lasso regression mirrors ridge regression. One of the really cool aspects of lasso regression is that it can be used to select important features of a dataset. This is because it tends to shrink the coefficients of less important features to exactly zero. The features whose coefficients are not shrunk to zero are selected by the lasso algorithm.
from sklearn.linear_model import Lasso
# get the feature names
names = X.columns
# instantiate a lasso model
lasso = Lasso(alpha=0.1, normalize=True)
# get the features coefficients
lasso_coef = lasso.fit(X, y).coef_
# plot the coefficients for each feature
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=50)
_ = plt.ylabel('Coefficients')
_ = plt.xlabel('Features')
We can see that the most important predictor for our target variable is the number of rooms.
Hyperparameter tuning¶
In the models seen previously we had to set some parameters, for example alpha in ridge regression or n_neighbors in k-NN. We need to choose these parameters so that our model fits the data best; in other words, these are parameters that cannot be explicitly learned by fitting the model.
Choosing the correct hyperparameters is key to a successful model. The basic idea is to try a whole bunch of different values, fit a model for each of them separately, see how well each performs and choose the best one. This is called hyperparameter tuning. When trying different values of hyperparameters it is essential to use cross-validation, as using a train/test split alone would risk overfitting the hyperparameters to the test set.
To proceed with hyperparameter tuning, we first choose a grid of possible values we want to try for the hyperparameter or hyperparameters. We then perform k-fold cross-validation for each point in the grid, that is, for each choice of hyperparameter or combination of hyperparameters. We then choose for our model the hyperparameters that performed best. This is called a grid search, and in scikit-learn we implement it using the class GridSearchCV.
from sklearn.model_selection import GridSearchCV
X, y = datasets.load_breast_cancer(return_X_y=True)
# create a dictionary with parameters and values
param_grid = {'C': np.logspace(-5, 8, 15)}
# instantiate a logistic regression classifier
logreg = LogisticRegression(solver='liblinear')
# instantiate the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
# fit on data
logreg_cv.fit(X, y)
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
best_params_ returns the parameters that perform best, and best_score_ returns the mean cross-validation score across the folds. If we specify multiple parameters, all possible combinations will be tried, so GridSearchCV can be computationally expensive, especially if we are searching over a large hyperparameter space with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. This is applied to the decision tree below.
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
"max_features": randint(1, 9),
"min_samples_leaf": randint(1, 9),
"criterion": ["gini", "entropy"]}
# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()
# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5, iid=True)
# Fit it to the data
tree_cv.fit(X, y)
# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
After using k-fold cross-validation to tune the model's hyperparameters, we may want to report how well the model can be expected to perform on a dataset it has never seen before, given a scoring function of choice. For this reason it is important to split all of the data at the very beginning into a training set and a hold-out set, then perform cross-validation on the training set to tune the model's hyperparameters. After this we can use the best hyperparameters on the hold-out set, which has not been used at all, to test how well the model can be expected to perform on new data. We can still use train_test_split for this.
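A minimal sketch of that workflow, reusing the logistic regression grid from above (the 0.2 hold-out fraction and random_state are arbitrary choices):
# tune on the training portion only, then score on the untouched hold-out set
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_)
print(search.score(X_hold, y_hold))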
Preprocessing Data¶
All the data used so far was relatively "nice" and in a format that allowed us to use scikit-learn models right away. With real-world data this will rarely be the case; instead we will have to preprocess the data before we can build models.
Categorical Features¶
The scikit-learn API will not accept categorical features with non-numerical values, so we will have to preprocess them into the correct format. Our goal is to convert these features so that they are numerical. The way to achieve this is by splitting the feature into a number of binary features called "dummy variables", one for each category: 0 means the observation was not that category, 1 means it was.
There are several ways to create dummy variables; in scikit-learn we can use OneHotEncoder, or we can use pandas' get_dummies function.
# create a categorical dataframe
before = pd.DataFrame(['France', 'Italy', 'United Kingdom',
                       'Italy', 'United Kingdom', 'United Kingdom'],
                      dtype='category', columns=['Country'])
display(before)
# create dummy features
after = pd.get_dummies(before, drop_first=True)
display(after)
Notice the drop_first parameter set to True: if the country is neither Italy nor the United Kingdom, it is implicitly France. Duplicating information, as keeping the France column would, can be an issue for some models.
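For completeness, the same encoding can be obtained with scikit-learn's OneHotEncoder, where drop='first' plays the role of drop_first=True (the encoder returns a sparse matrix by default, hence the toarray() call):
from sklearn.preprocessing import OneHotEncoder
# one-hot encode the categorical column, dropping the first (redundant) category
enc = OneHotEncoder(drop='first')
encoded = enc.fit_transform(before[['Country']]).toarray()
print(enc.categories_)
print(encoded)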
Missing values¶
We say that data is missing when there is no value for a given feature in a particular observation. This can occur for many reasons and needs to be dealt with.
The following diabetes data for example has apparently no null values as we can see below.
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
names = ['preg', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age', 'diabetes']
# load DataFrame from url
df = pd.read_csv(url, names=names)
df.info()
However missing values can be encoded in a number of different ways.
df.describe()
Looking at the summary statistics, there are observations that are zero, which is not possible for insulin levels or triceps skinfold thickness, for example. Since we have no indication of the real values, this data is effectively missing.
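We can count how many of these implausible zeros there are in the columns where zero is not a valid measurement:
# count zero entries in columns where a zero cannot be a real measurement
print((df[['insulin', 'triceps', 'bmi']] == 0).sum())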
To handle missing data, one option is to drop all the observations that contain missing values, although it is usually better to keep data rather than discard it. Another option is to impute the missing data: make an educated guess about what the missing values could be. We could, for example, replace them with the mean value for that feature.
from sklearn.impute import SimpleImputer
# convert the missing values to np.nan
df.insulin.replace(0, np.nan, inplace=True)
df.triceps.replace(0, np.nan, inplace=True)
df.bmi.replace(0, np.nan, inplace=True)
# create an imputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit the imputer
imp.fit(df)
# create the imputed dataframe
X = pd.DataFrame(imp.transform(df), columns=df.columns)
X.head()
Due to its ability to transform data like this, the imputer is known as a transformer; more generally, any scikit-learn object that transforms data through a transform method is called a transformer.
Centering and Scaling¶
If the ranges of our features vary widely, they can unduly influence our model. For this reason we actually want features to be on a similar scale. To achieve this, we do what is called normalizing, or scaling and centering, the data.
There are several ways to normalize our data. In standardization, for any given column we subtract the mean and divide by the standard deviation, so that all features are centered around zero and have unit variance. We can also subtract the minimum and divide by the range of the data, so that the normalized dataset has a minimum of zero and a maximum of one. There are other approaches as well; see the sklearn.preprocessing documentation for a more complete list of scalers.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
scaled.head()
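The second approach described above, subtracting the minimum and dividing by the range, is available as MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
# min-max scaling: every feature ends up between 0 and 1
minmax = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
minmax.describe().loc[['min', 'max']]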
Pipeline¶
We can use the scikit-learn pipeline object when we want to perform multiple tasks sequentially, meaning apply a list of transformers and a final estimator. Intermediate steps of the pipeline must be transformers, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. In the next example we will first apply an imputer, then a scaler, and then fit our k-NN model on the data.
from sklearn.pipeline import Pipeline
y = df.diabetes
X = df.drop('diabetes', axis=1)
#construct a list of steps in the pipeline
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
         ('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier(n_neighbors=6))]
# instantiate a pipeline
pipeline = Pipeline(steps)
# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# fit the pipeline
pipeline.fit(X_train, y_train)
# predict on test set
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
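Because the whole pipeline behaves like a single estimator, we can also cross-validate and tune a step's hyperparameter through it, addressing parameters with the '<step name>__<parameter name>' convention (the neighbor range below is an arbitrary choice):
# grid search over the k-NN step of the pipeline
param_grid = {'knn__n_neighbors': np.arange(1, 15)}
cv = GridSearchCV(pipeline, param_grid, cv=5)
cv.fit(X_train, y_train)
print(cv.best_params_)
print(cv.score(X_test, y_test))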
Conclusion¶
The scikit-learn library provides many different algorithms which can be imported into the code and then used to build models just like we would import any other Python library. This makes it easier to quickly build different models and compare these models to select the highest scoring one.
In this tutorial, we have only scratched the surface of what is possible with the scikit-learn library. To use this Machine Learning library to the fullest, there are many resources available with detailed documentation that you can dive into.
But to really appreciate the true power of the scikit-learn library, what you really need to do is start using it on different open data sets and building predictive models using these data sets. There are various sources that contain many interesting data sets on which one can practice building predictive models by using the algorithms provided by the scikit-learn library.