Machine Learning

Extreme Gradient Boosting with XGBoost

Posted on December 12, 2019

XGBoost was originally developed as a C++ command-line application. After it won a machine learning competition it started being widely used by the community, and bindings that tap into the core C++ code soon appeared in a variety of other languages, including Python, R, Scala, Julia and Java.

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. What makes XGBoost so popular is its speed and performance. Because the core XGBoost algorithm is parallelizable, it can harness all of the processing power of modern multi-core computers. Furthermore, it can be parallelized onto GPUs and across networks of machines, making it feasible to train models on very large datasets, on the order of hundreds of millions of training examples.

However, XGBoost's speed isn't the package's real draw. Ultimately, a fast but poorly performing machine learning algorithm is not going to have wide adoption within the community. What makes it so popular is that it consistently outperforms almost all other single-algorithm methods and has been shown to achieve state-of-the-art performance on a variety of benchmark machine learning datasets.

XGBoost has a scikit-learn compatible API, and we can see it in action in the example below.

In [1]:
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# import xgboost
import xgboost as xgb

# load breast cancer dataset
X, y = datasets.load_breast_cancer(return_X_y=True)

# split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=23)

# instantiate an XGBClassifier
xg_cl = xgb.XGBClassifier(objective='binary:logistic', 
                          n_estimators=20, 
                          seed=23)

# fit the model on training set
xg_cl.fit(X_train, y_train)

# predict test set
y_pred = xg_cl.predict(X_test)

pd.DataFrame(classification_report(y_test, y_pred, output_dict=True)).transpose()
Out[1]:
              f1-score  precision    recall     support
0             0.948718   0.948718  0.948718   39.000000
1             0.973333   0.973333  0.973333   75.000000
accuracy      0.964912   0.964912  0.964912    0.964912
macro avg     0.961026   0.961026  0.961026  114.000000
weighted avg  0.964912   0.964912  0.964912  114.000000

XGBoost is usually used with trees as base learners, and XGBoost itself is considered an ensemble learning method in that it uses the outputs of many models for the final prediction. Compared to other ensemble methods, XGBoost uses a slightly different kind of decision tree, called a classification and regression tree, or CART.

CART trees contain a real-valued score in each leaf, regardless of whether they are used for classification or regression. The real-valued scores can then be thresholded to convert them into categories for classification problems if necessary.
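As a quick illustration of this thresholding idea, here is a minimal sketch reusing the xg_cl model and X_test from the cell above: the classifier exposes the raw predicted probabilities through predict_proba(), and converting them to class labels is just a matter of applying a cutoff (0.5 is the usual default).

import numpy as np

# real-valued scores for the positive class
proba = xg_cl.predict_proba(X_test)[:, 1]

# threshold the scores ourselves; this should match what .predict() returns with the default cutoff
labels = (proba > 0.5).astype(int)
np.array_equal(labels, xg_cl.predict(X_test))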

What is Boosting?

At bottom, boosting isn't really a specific machine learning algorithm, but a concept that can be applied to a set of machine learning models (a meta-algorithm). Specifically, it's an ensemble meta-algorithm primarily used to reduce bias and variance and to convert many weak learners into an arbitrarily strong learner.

A weak learner is any machine learning algorithm that performs just slightly better than chance. So a decision tree that can predict some outcome slightly more accurately than pure random guessing would be considered a weak learner.

The principal insight that allows XGBoost to work is the fact that we can use boosting to convert a collection of weak learners into a strong learner, where a strong learner is any algorithm that can be tuned to achieve arbitrarily good performance for some supervised learning problem.

How is this accomplished?

This is achieved by iteratively learning a set of weak models on subsets of the data at hand and weighting each one's predictions according to its performance. We then combine all of the weak learners' predictions, multiplied by their weights, to obtain a single final weighted prediction that is much better than any of the individual predictions themselves.
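To make the weighting idea concrete, here is a toy sketch in the spirit of AdaBoost-style weighted voting (illustrative only, not XGBoost's actual update rule), reusing the breast cancer split from the first cell:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
weak_learners, weights = [], []

for _ in range(10):
    # each weak learner (a decision stump) sees a random subset of the training data
    idx = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
    stump = DecisionTreeClassifier(max_depth=1).fit(X_train[idx], y_train[idx])

    # weight each learner by how much better than chance it is on the full training set
    weak_learners.append(stump)
    weights.append(max(stump.score(X_train, y_train) - 0.5, 1e-6))

# weighted vote: map {0, 1} labels to {-1, +1}, sum the weighted votes, threshold at 0
votes = sum(w * (2 * clf.predict(X_test) - 1) for w, clf in zip(weights, weak_learners))
ensemble_pred = (votes > 0).astype(int)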

Cross-validation with XGBoost

Cross-validation is a robust method for estimating the expected performance of a machine learning model on unseen data: it generates many non-overlapping train/test splits on the training data and reports the average test set performance across all of them.

In the following example we will convert our dataset into DMatrix, an optimized data structure created by the XGBoost developers that gives the package much of its lauded performance and efficiency.

In [2]:
# create a DMatrix with feature array X and target y
churn_dmatrix = xgb.DMatrix(X, y)
churn_dmatrix
Out[2]:
<xgboost.core.DMatrix at 0x29fb8fa1d68>

In the previous example the input data were converted into a DMatrix on the fly, but when we use xgb.cv(), which is part of XGBoost's learning API, we first have to explicitly convert our data into a DMatrix. We also have to tell cv what kind of XGBoost model we are using, since it has no way of knowing that otherwise, by providing a dictionary of appropriate key-value pairs.

In [3]:
# create a dictionary of parameters
params = {'objective':'binary:logistic',
          'max_depth':4}

# use cross-validation on the DMatrix
cv_results = xgb.cv(dtrain=churn_dmatrix,
                    params=params,
                    nfold=10,
                    num_boost_round=80,
                    metrics='error',
                    as_pandas=True,   # to store output as pandas df
                    seed=23)


print('Accuracy {:.4f}'.format((1 - cv_results['test-error-mean']).iloc[-1]))
cv_results.tail()
Accuracy 0.9684
Out[3]:
    train-error-mean  train-error-std  test-error-mean  test-error-std
75               0.0              0.0         0.031642        0.034026
76               0.0              0.0         0.031642        0.034026
77               0.0              0.0         0.031642        0.034026
78               0.0              0.0         0.031642        0.034026
79               0.0              0.0         0.031642        0.034026

cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From cv_results, the final round 'test-error-mean' is extracted and converted into an accuracy value, where accuracy is 1-error.

Measuring AUC

Now that we have used cross-validation to compute average accuracy (after converting from an error), it's very easy to compute any other metric we might be interested in. All we have to do is pass it (or a list of metrics) as an argument to the metrics parameter of xgb.cv().

In [4]:
cv_results = xgb.cv(dtrain=churn_dmatrix, 
                    params=params, 
                    nfold=10, 
                    num_boost_round=80, 
                    metrics="auc", 
                    as_pandas=True, 
                    seed=23)

print('AUC score {:.4f}'.format((cv_results["test-auc-mean"]).iloc[-1]))
cv_results.tail()
AUC score 0.9911
Out[4]:
    train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
75             1.0            0.0       0.991406      0.013545
76             1.0            0.0       0.991406      0.013545
77             1.0            0.0       0.991274      0.013480
78             1.0            0.0       0.991131      0.013484
79             1.0            0.0       0.991131      0.013484

When to use XGBoost

We should use XGBoost for any supervised machine learning task that fits the following criteria:

  • A large number of training examples: more than 1,000 training samples and, ideally, fewer than 100 features (i.e. fewer features than samples)
  • A mixture of categorical and numeric features, or just numeric features

It should not be used in the case of:

  • Image recognition
  • Computer vision
  • Natural language processing and understanding problems
  • Problems where the number of training samples is significantly smaller than the number of features

as these kinds of problems can be tackled much better using deep learning approaches.

Regression

Now that we have seen how to use XGBoost for classification, we will see how to use it in regression tasks.

Regression problems involve predicting continuous, or real, values. Evaluating the quality of a regression model involves using a different set of metrics: in most cases we use the root mean squared error (RMSE) or the mean absolute error (MAE).

  • RMSE is computed by taking the differences between the actual and the predicted values, squaring those differences, computing their mean, and taking that value's square root. This allows us to treat negative and positive differences equally, but it tends to punish larger differences between predicted and actual values much more than smaller ones.

  • MAE, on the other hand, simply averages the absolute differences between predicted and actual values across all of the samples we build our model on. Although MAE isn't affected by large differences as much as RMSE, it lacks some nice mathematical properties, which makes it much less frequently used as an evaluation metric (see the short sketch after this list).
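As a quick worked example of the difference between the two metrics, here is a minimal sketch with made-up numbers:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # ~0.612: the single error of 1.0 dominates
mae = mean_absolute_error(y_true, y_pred)           # 0.5: every error counts the same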

Some common algorithms used for regression problems include linear regression and decision trees. It's worth noting that some algorithms, such as decision trees, can be used for both regression and classification tasks, which is one of the properties that makes them prime candidates to be building blocks for XGBoost models.

Objective functions

An objective or loss function quantifies how far off our prediction is from the actual result for a given data point. It maps the difference between the prediction and the target to a real number. When we construct a model we do so in the hopes that it minimizes the loss function across all of the data points we pass in. Our ultimate goal is the smallest possible loss.
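For instance, here is a minimal sketch of two common losses evaluated on a single data point: squared error for a regression prediction and log loss for a predicted class probability.

import numpy as np

# regression: squared error between the target and the prediction
y, y_hat = 3.0, 2.4
squared_error = (y - y_hat) ** 2                                      # 0.36

# binary classification: log loss between the true label and the predicted probability
label, prob = 1, 0.9
log_loss = -(label * np.log(prob) + (1 - label) * np.log(1 - prob))   # ~0.105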

Loss functions have specific naming conventions in XGBoost.

  • For regression models the most common loss function is reg:linear (renamed reg:squarederror in recent versions of XGBoost)
  • For binary classification models the most common loss functions are reg:logistic, when we want just the decision and not the probability, and binary:logistic, when we want the actual predicted probability of the positive class.

Base learners

XGBoost is an ensemble learning method composed of many base learners that are added together to generate a single prediction. The goal is to have base learners that are slightly better than random guessing on certain subsets of training examples, and uniformly bad at the remainder. When all of the predictions are combined, the uniformly bad predictions cancel out and those slightly better than chance combine into a single very good, non-linear prediction.

  • A linear base learner is simply a sum of linear terms, exactly as we would find in a linear or logistic regression model.

    When we combine many of these base models into an ensemble, we get a weighted sum of linear models, which is itself linear. Since we do not get any nonlinear combinations of features, this approach is rarely used, as we can get identical performance from a regularized linear model.

  • A tree base learner uses decision trees as base models. When the decision trees are all combined into an ensemble, the combined prediction is a sum of the individual trees, each of which is itself nonlinear, so the ensemble is nonlinear as well.

Decision trees as base learners

Here we will build an XGBoost model for a regression task. By default XGBoost uses trees as base learners, so this does not need to be specified. We will be using the Boston housing dataset from scikit-learn.

In [5]:
import numpy as np
from sklearn.metrics import mean_squared_error

# load Boston housing data
X, y = datasets.load_boston(return_X_y=True)

# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# instantiate an XGBRegressor
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', 
                          n_estimators=60,
                          seed=1)

# fit on training set
xg_reg.fit(X_train, y_train)

# predict test set
y_pred = xg_reg.predict(X_test)

# calculate and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE: %f'%(rmse))
RMSE: 2.933955

Linear base learners

After using trees as base models, let's use the other kind of base learner: a linear learner. This model, although not as commonly used, allows us to create a regularized linear regression using XGBoost's powerful learning API. However, because it's uncommon, we have to use XGBoost's own non-scikit-learn compatible functions to build the model, such as xgb.train().

In order to do this we must create the parameter dictionary that describes the kind of booster we want to use. The key-value pair that defines the booster type (base model) we need is "booster":"gblinear".

Once we've created the parameter dictionary, we train with xgb.train() and generate predictions with the returned booster's .predict() method.

In [6]:
# convert the training and testing sets into DMatrixes
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)

# create the parameter dictionary
params = {"booster":"gblinear", "objective":"reg:squarederror"}

# train the model
xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=60)

# predict the labels of the test set
preds = xg_reg.predict(DM_test)

# compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))
RMSE: 5.448172

We can also cross-validate the model and look at metrics such as RMSE (or MAE) to evaluate its quality.

In [7]:
# create the DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# create the parameter dictionary
params = {"objective":"reg:squarederror", "max_depth":4}

# perform cross-validation
cv_results = xgb.cv(dtrain=housing_dmatrix, 
                    params=params, nfold=4, 
                    num_boost_round=5, 
                    metrics='rmse', 
                    as_pandas=True, 
                    seed=1)

cv_results
Out[7]:
   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0        17.109061        0.167222       17.171979       0.616886
1        12.348774        0.121270       12.472192       0.475912
2         9.016547        0.084773        9.217072       0.356090
3         6.688067        0.055278        6.980938       0.264590
4         5.067312        0.055031        5.571385       0.285786

Regularization

Loss functions in XGBoost take into account not just how close a model's predictions are to the actual values but also the complexity of the model. This idea of penalizing models as they become more complex is called regularization. So loss functions are used to find models that are both accurate and as simple as they can possibly be.

There are several parameters that can be tweaked to limit model complexity by altering the loss function.

  • gamma is a parameter for tree-based learners that controls whether a given node will split, based on the expected reduction in the loss that the split would produce; higher values lead to fewer splits.

  • alpha, or l1 regularization, is a penalty on leaf weights rather than on feature weights, as would be the case in linear or logistic regression. Higher alpha values lead to stronger l1 regularization, which causes many leaf weights in the base learners to go to 0.

  • lambda, or l2 regularization, is a much smoother penalty than l1 and causes leaf weights to decrease gradually, instead of enforcing strong sparsity constraints on the leaf weights as l1 does.
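Putting these three penalties together, the regularized objective that XGBoost minimizes has roughly the following form (a schematic summary of the documentation, where T is the number of leaves of a tree f and the w_j are its leaf weights):

\text{Obj} = \sum_i \ell(y_i, \hat{y}_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \sum_j w_j^2 + \alpha \sum_j |w_j|

so gamma penalizes extra splits, lambda shrinks the leaf weights smoothly, and alpha pushes many of them to exactly zero.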

In [8]:
params = {'objective':'reg:squarederror', 'max_depth':4}
l1_params = [1, 10, 100]

# initialize an empty list to store the final root mean squared error
# for each of these l1 (alpha) values
rmses_l1 = []

# loop over alpha values
for reg in l1_params:
    
    # assign current value to alpha
    params['alpha'] = reg
    
    # perform cross-validation 
    cv_results = xgb.cv(dtrain=housing_dmatrix,
                        params=params,
                        nfold=4,
                        num_boost_round=10,
                        metrics='rmse',
                        as_pandas=True,
                        seed=1)
    
    # add final RMSE to our list
    rmses_l1.append(cv_results['test-rmse-mean'].tail(1).values[0])

print('Best rmse as a function of l1:')
rmse_df = pd.DataFrame(zip(l1_params, rmses_l1), columns=['l1', 'rmse'])
rmse_df
Best rmse as a function of l1:
Out[8]:
    l1      rmse
0    1  3.431487
1   10  3.633929
2  100  4.623489

Having seen an example of l1 regularization, we'll now vary the l2 regularization penalty - also known as lambda - and see its effect on overall model performance on the housing dataset.

In [9]:
reg_params = [1, 10, 100]

# create the initial parameter dictionary for varying l2 strength
params = {"objective":"reg:squarederror","max_depth":3}

# create an empty list for storing rmses as a function of l2 complexity
rmses_l2 = []

# iterate over reg_params
for reg in reg_params:

    # Update l2 strength
    params["lambda"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
    
    # Append best rmse (final round) to rmses_l2
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])

# Look at best rmse per l2 param
print("Best rmse as a function of l2:")
pd.DataFrame(zip(reg_params, rmses_l2), columns=["l2", "rmse"])
Best rmse as a function of l2:
Out[9]:
    l2       rmse
0    1   6.022222
1   10   7.201520
2  100  10.692149

Visualizing individual XGBoost trees

Now that we've used XGBoost to both build and evaluate regression as well as classification models, we should get a handle on how to visually explore our models.

XGBoost has a plot_tree() function that makes this type of visualization easy. Once we train a model using the XGBoost learning API, we can pass it to the plot_tree() function along with the index of the tree we want to plot via the num_trees argument.

In [11]:
import matplotlib.pyplot as plt

params = {"objective":"reg:squarederror", "max_depth":2}

# train the model
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# plot the first tree
_ = xgb.plot_tree(xg_reg, num_trees=0)
_.figure.set_size_inches(40,25)
plt.show()

# plot the fifth tree
_ = xgb.plot_tree(xg_reg, num_trees=4)
_.figure.set_size_inches(40,25)
plt.show()

# plot the last tree sideways
_ = xgb.plot_tree(xg_reg, num_trees=9, rankdir='LR')
_.figure.set_size_inches(40,25)
plt.show()

Visualizing feature importances

Another way to visualize our models is to examine the importance of each feature column in the original dataset within the model.

One simple way of doing this involves counting the number of times each feature is split on across all boosting rounds (trees) in the model, and then visualizing the result as a bar graph, with the features ordered according to how many times they appear. XGBoost has a plot_importance() function that allows us to do this.

In [12]:
import seaborn as sns
sns.set()

# create boston housing DMatrix 
housing_dmatrix = xgb.DMatrix(data=X, 
                              label=y, 
                              feature_names=datasets.load_boston().feature_names)

# create the parameter dictionary
params = {'objective':'reg:squarederror', 'max_depth':10}

# train the model
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()

Hyperparameter Tuning

Let's now have a look at how tuning can improve a model. We will analyze another housing dataset and see the difference between a simple model and a tuned model.

In [13]:
# load Ames housing data
housing_data = pd.read_csv('ames_housing_trimmed_processed.csv')

# create our feature and target variables
X, y = housing_data[housing_data.columns.tolist()[:-1]].values, housing_data[housing_data.columns.tolist()[-1]].values

# create a DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y, feature_names=housing_data.columns.tolist()[:-1])

# create an untuned parameter dictionary
untuned_params = {'objective':'reg:squarederror'}

# perform cv
untuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, 
                                 params=untuned_params,
                                 nfold=4,
                                 #num_boost_round=200,
                                 metrics='rmse',
                                 as_pandas=True,
                                 seed=1)

# compute and print RMSE
untuned_rmse_mean = untuned_cv_results_rmse['test-rmse-mean'].iloc[-1]
print('Untuned rmse: %f'%untuned_rmse_mean)
Untuned rmse: 32929.541015

and here is the tuned model.

In [14]:
# create a dictionary of tuned parameters
tuned_params = {'objective':'reg:squarederror',
                'colsample_bytree':0.4,
                'learning_rate':0.1,
                'max_depth':4}

# perform cv
tuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, 
                               params=tuned_params,
                               nfold=4,
                               num_boost_round=200,
                               metrics='rmse',
                               as_pandas=True,
                               seed=1)

# compute and print RMSE
tuned_rmse_mean = tuned_cv_results_rmse['test-rmse-mean'].iloc[-1]
print('Tuned rmse: %f'%tuned_rmse_mean)
Tuned rmse: 27608.438965
In [15]:
rmse_decr = ((tuned_rmse_mean - untuned_rmse_mean ) / untuned_rmse_mean * 100)
print('There was a decrease of {}% rmse with the tuned model'.format(round(-rmse_decr, 1)))
There was a decrease of 16.2% rmse with the tuned model

Boosting Rounds

Let's start with parameter tuning by seeing how the number of boosting rounds (number of trees we build) impacts the out-of-sample performance of our model. We'll use xgb.cv() inside a for loop and build one model per num_boost_round parameter.

Here, we'll continue working with the Ames housing dataset.

In [16]:
# create the parameter dictionary for each tree 
params = {"objective":"reg:squarederror", "max_depth":3}

# create list of number of boosting rounds
num_rounds = [5, 10, 15]

# empty list to store final round rmse per model
final_rmse_per_round = []

# iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:
    
    # perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=curr_num_rounds, metrics="rmse", as_pandas=True, seed=1)
    
    # append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])

# print the resultant DataFrame
num_rounds_rmses = zip(num_rounds, final_rmse_per_round)
pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"])
Out[16]:
   num_boosting_rounds          rmse
0                    5  52376.579427
1                   10  37997.385417
2                   15  35770.492188

Automated boosting round selection using early_stopping

Instead of attempting to pick the best possible number of boosting rounds by hand, we can let xgb.cv() select it automatically. This is done using a technique called early stopping.

Early stopping works by testing the model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds. Here we will use the early_stopping_rounds parameter in xgb.cv() together with a large maximum number of boosting rounds (300). It's important to note that if the hold-out metric continuously improves all the way up to num_boost_round, early stopping does not occur.

In [17]:
# perform cross-validation with early stopping
cv_results = xgb.cv(dtrain=housing_dmatrix, 
                    params=params,
                    early_stopping_rounds=10,
                    num_boost_round=300,
                    as_pandas=True,
                    seed=1)

# print the number of rounds
print(len(cv_results))
77

We can see that training stopped after 77 rounds instead of the full 300.

Tunable Parameters

The parameters that can be tuned are significantly different for each base learner.

- Tree base learner

  • learning_rate: affects how quickly the model fits the residual error using additional base learners. A low learning rate will require more boosting rounds to achieve the same reduction in residual error as a model with a higher learning rate

  • gamma: min loss reduction to create a new tree split

  • lambda: l2 regularization on leaf weights

  • alpha: l1 regularization on leaf weights

  • max_depth: a positive integer value that affects how deep each tree is allowed to grow during any given boosting round

  • subsample: a value between 0 and 1, the fraction of the total training set that can be used in any given boosting round. If the value is low, only a small fraction of our training data is used per boosting round and we could run into underfitting problems; a value that is very high can lead to overfitting instead

  • colsample_bytree: the fraction of features we can select from during any given boosting round; it must also be a value between 0 and 1. A large value means that almost all features can be used to build a tree during a given boosting round, whereas a small value means that only a small fraction of features can be selected from. In general, smaller colsample_bytree values can be thought of as providing additional regularization, while using all columns may lead to overfitting.

- Linear base learner

  • lambda: l2 regularization on leaf weights

  • alpha: l1 regularization on leaf weights

  • lambda_bias: l2 regularization term on bias

Finally, it is important to mention that the number of boosting rounds (that is, either the number of trees or the number of linear base learners we construct) is itself a tunable parameter.
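As a rough sketch (the values below are hypothetical placeholders for illustration, not tuned recommendations), parameter dictionaries for the two kinds of base learner might look like this:

# hypothetical parameter dictionaries illustrating the knobs listed above
tree_params = {'objective': 'reg:squarederror',
               'booster': 'gbtree',
               'eta': 0.1,                # learning_rate in the scikit-learn API
               'gamma': 1,                # min loss reduction to make a split
               'lambda': 1,               # l2 penalty on leaf weights
               'alpha': 0,                # l1 penalty on leaf weights
               'max_depth': 4,
               'subsample': 0.8,
               'colsample_bytree': 0.8}

linear_params = {'objective': 'reg:squarederror',
                 'booster': 'gblinear',
                 'lambda': 1,              # l2 penalty on weights
                 'alpha': 0,               # l1 penalty on weights
                 'lambda_bias': 0}         # l2 penalty on the bias term

# the number of boosting rounds is passed separately, e.g.
# xgb.train(params=tree_params, dtrain=housing_dmatrix, num_boost_round=50)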

Tuning eta

We will begin by tuning eta, also known as the learning rate. The learning rate is a parameter that can range between 0 and 1; it scales down the contribution of each newly added tree, so lower values of eta make the boosting process more conservative (and require more boosting rounds), while higher values fit the residual error more aggressively.

In [18]:
# create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}

# create list of eta values and empty list to store final round RMSE per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# systematically vary the eta 
for curr_val in eta_vals:

    params["eta"] = curr_val
    
    # perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, 
                        params=params, 
                        nfold=3, 
                        early_stopping_rounds=5, 
                        num_boost_round=100, 
                        metrics='rmse',
                        seed=1)
        
    
    # append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# print the resultant DataFrame
pd.DataFrame(zip(eta_vals, best_rmse), columns=["eta","best_rmse"])
Out[18]:
     eta      best_rmse
0  0.001  179997.286458
1  0.010   82890.429687
2  0.100   33476.307292

Tuning max_depth

max_depth is the parameter that dictates the maximum depth that each tree in a boosting round can grow to. Smaller values will lead to shallower trees, and larger values to deeper trees.

In [19]:
max_depths = [2, 5, 10, 20]
best_rmse = []

# systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val
    
    # perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, 
                        params=params, 
                        nfold=3, 
                        early_stopping_rounds=5, 
                        num_boost_round=100, 
                        metrics='rmse', 
                        seed=1)        
    
    # append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# print the resultant DataFrame
pd.DataFrame(zip(max_depths, best_rmse),columns=["max_depth","best_rmse"])
Out[19]:
   max_depth     best_rmse
0          2  33385.765625
1          5  35365.897787
2         10  36978.430989
3         20  37036.133464

Tuning colsample_bytree

In scikit-learn the analogous parameter is called max_features; colsample_bytree specifies the fraction of features XGBoost can choose from when building each tree, and it must be specified as a float between 0 and 1.

In [20]:
# create the parameter dictionary
params={"objective":"reg:squarederror", "max_depth":3}

# create list of hyperparameter values
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []

# systematically vary the hyperparameter value 
for curr_val in colsample_bytree_vals:

    params['colsample_bytree'] = curr_val
    
    # perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, 
                        params=params, 
                        nfold=2,
                        num_boost_round=100, 
                        early_stopping_rounds=5,
                        metrics="rmse", 
                        seed=1)
    
    # append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# print the resultant DataFrame
pd.DataFrame(zip(colsample_bytree_vals, best_rmse), columns=["colsample_bytree","best_rmse"])
Out[20]:
   colsample_bytree     best_rmse
0               0.1  31565.036133
1               0.5  31556.867187
2               0.8  30517.711914
3               1.0  30092.976562

To find the optimal values of several hyperparameters simultaneously, leading to the lowest possible loss when their values interact in non-obvious or nonlinear ways, the common strategies are grid search and random search.

Grid search is a method of exhaustively searching through a collection of possible parameter values: every parameter configuration is evaluated, and the configuration that gives the best value of the chosen metric is picked for the final model.

In [21]:
# import GridSearchCV from sklearn
from sklearn.model_selection import GridSearchCV

# create a parameter grid dictionary
gbm_param_grid = {'colsample_bytree': [0.3, 0.7],
                  'learning_rate':[0.01, 0.1, 0.5, 0.9],
                  'n_estimators':[200],
                  'max_depth': [2, 5],
                  'subsample':[0.3, 0.5, 0.9]}

# instantiate an XGBRegressor
gbm = xgb.XGBRegressor(objective='reg:squarederror')

# create GridSearchCV object using XGBRegressor and the parameter dictionary
grid_mse = GridSearchCV(estimator=gbm, 
                        param_grid=gbm_param_grid,
                        scoring='neg_mean_squared_error',
                        cv=2,
                        n_jobs=-3)

# fit on Ames housing dataset
grid_mse.fit(X, y)


print('Best parameters found :', grid_mse.best_params_)
print('Lowest RMSE found :',np.sqrt(np.abs(grid_mse.best_score_)))
Best parameters found : {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.9}
Lowest RMSE found : 29675.893146360446

Random search is significantly different from grid search in that the number of models we are required to iterate over does not grow as we expand the overall hyperparameter space. In random search we decide how many models, or iterations, we want to try out before stopping. It works by drawing a random combination of possible hyperparameter values from the allowable ranges a set number of times. Once the specified number of models has been evaluated, we simply pick the best one.

In [22]:
# import RandomizedSearchCV from sklearn 
from sklearn.model_selection import RandomizedSearchCV

# create a parameters dictionary
gbm_param_grid = {'learning_rate':np.arange(0.05, 1.05, 0.05),
                  'n_estimators':[200],
                  'subsample':np.arange(0.05, 1.05, 0.05),
                  'max_depth': range(2, 12)}

# instantiate an XGBRegressor
gbm = xgb.XGBRegressor(objective='reg:squarederror')

# create RandomizedSearchCV object using XGBRegressor and the parameter dictionary
randomized_mse = RandomizedSearchCV(estimator=gbm,
                                    param_distributions=gbm_param_grid,
                                    n_iter=60,
                                    scoring='neg_mean_squared_error',
                                    cv=2,
                                    n_jobs=-3)

# fit on Ames housing dataset
randomized_mse.fit(X, y)

print('Best parameters found :', randomized_mse.best_params_)
print('Lowest RMSE found :', np.sqrt(np.abs(randomized_mse.best_score_)))
Best parameters found : {'subsample': 0.3, 'n_estimators': 200, 'max_depth': 11, 'learning_rate': 0.05}
Lowest RMSE found : 30134.69835631754

Using XGBoost in Pipelines

Pipelines in sklearn are objects that take a list of named tuples as input. Each tuple must contain a string name as its first element and any scikit-learn compatible transformer or estimator object as its second element. Each named tuple in the pipeline is called a step, and the steps are executed in order once some data is passed through the pipeline, via the standard .fit() and .predict() methods. Finally, where pipelines are really useful is that they can themselves be used as estimator objects inside other sklearn objects, the most useful of which are cross_val_score (which allows for efficient cross-validation and out-of-sample metric calculation) and the grid search and random search approaches for tuning hyperparameters.

Pipeline on Boston data with RandomForestRegressor

In [23]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# create our feature matrix and target vector
X, y = datasets.load_boston(return_X_y=True)

# create a pipeline object using RandomForestRegressor
rf_pipeline = Pipeline(steps=[('sd_scaler', StandardScaler()),
                              ('rf_model', RandomForestRegressor(n_estimators=100))])

# create a pipeline object using XGBRegressor
xgb_pipeline = Pipeline(steps=[('sd_scaler', StandardScaler()),
                              ('xgb_model', xgb.XGBRegressor(objective='reg:squarederror'))])
In [24]:
%%timeit

scores = cross_val_score(rf_pipeline, X, y, scoring='neg_mean_squared_error', cv=10)

# get the final root mean squared error
final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))

print('Average RMSE', final_avg_rmse)
Average RMSE 4.13086608644044
Average RMSE 4.24034175572216
Average RMSE 4.19112461689934
2.14 s ± 15.5 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

Pipeline on Boston data with XGBoost

In [25]:
%%timeit 

scores = cross_val_score(xgb_pipeline, X, y,scoring='neg_mean_squared_error',cv=5)

# get the final root mean squared error
final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))

print('Average RMSE', final_avg_rmse)
Average RMSE 4.110181019209378
Average RMSE 4.110181019209378
Average RMSE 4.110181019209378
195 ms ± 3.81 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

As we can see, without any hyperparameter tuning the XGBoost model achieved a lower RMSE than the RandomForestRegressor and was also much faster.

Hyperparameter tuning in a pipeline

Lastly we will see how automated hyperparameter tuning for an XGBoost model works within a scikit-learn pipeline.

In [26]:
# create a parameter grid for the pipeline
gbm_param_grid = {
    'xgb_model__subsample':np.arange(0.05, 1, 0.05),
    'xgb_model__max_depth':np.arange(3, 20, 1),
    'xgb_model__colsample_bytree':np.arange(0.1, 1.05, 0.05)
                 }

# create a RandomizedSearchCV using the XGBoost pipeline we 
# instantiated in the previous exercise
randomized_neg_mse = RandomizedSearchCV(xgb_pipeline,
                                        param_distributions=gbm_param_grid,
                                        n_iter=20,
                                        scoring='neg_mean_squared_error',
                                        cv=4,
                                        iid=True,
                                        random_state=13)

# fit on Boston housing data
randomized_neg_mse.fit(X, y)

print('Best RMSE:', np.sqrt(np.abs(randomized_neg_mse.best_score_)))
print('Best Model', randomized_neg_mse.best_estimator_)
Best RMSE: 4.499229408398059
Best Model Pipeline(memory=None,
         steps=[('sd_scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('xgb_model',
                 XGBRegressor(base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=1.0000000000000004, gamma=0,
                              importance_type='gain', learning_rate=0.1,
                              max_delta_step=0, max_depth=6, min_child_weight=1,
                              missing=None, n_estimators=100, n_jobs=1,
                              nthread=None, objective='reg:squarederror',
                              random_state=0, reg_alpha=0, reg_lambda=1,
                              scale_pos_weight=1, seed=None, silent=None,
                              subsample=0.3, verbosity=1))],
         verbose=False)

Conclusion

We have seen how to use the XGBoost API and many of the features it offers. To deepen our knowledge there are other useful subjects, like how to use it for ranking/recommendation problems, how to apply more sophisticated hyperparameter tuning strategies (like Bayesian optimization), or how to use XGBoost as part of an ensemble of other models.

XGBoost, however, should not be treated as a silver bullet; the best results are obtained in conjunction with careful data exploration and feature engineering.