Extreme Gradient Boosting with XGBoost¶
XGBoost was developed originally as a C++ command-line application. After winning a ML competition it started being widely used from the community. As a result bindings or functions that tapped into the core C++ code started appearing in a variety of other languages, including Python, R, Scala, Julia and Java.
XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. What makes XGBoost so popular is its speed and performance. Because the core XGBoost algorithm is parallelizable, it can harness all of the processing power of modern multi-core computers. Furthermore it is parallelizable onto GPUs and across networks, making it feasible to train models on very large datasets on the order of hundreds of millions of training examples.
However XGBoost' s speed isn't the package's real draw. Ultimately a fast but poorly performing machine learning algorithm is not going to have wide adoption within the community. What makes it so popular is that it consistently outperforms almost all other single-algorithm methods and has been shown to achieve state-of-the-art performance on a variety of benchmark machine learing datasets.
XGBoost has a scikit-learn comptible API and we can that in the example below.
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# import xgboost
import xgboost as xgb
# load breast cancer dataset
X, y = datasets.load_breast_cancer(return_X_y=True)
# split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=23)
# instantiate an XGBClassifier
xg_cl = xgb.XGBClassifier(objective='binary:logistic',
n_estimators=20,
seed=23)
# fit the model on training set
xg_cl.fit(X_train, y_train)
# predict test set
y_pred = xg_cl.predict(X_test)
pd.DataFrame(classification_report(y_test, y_pred, output_dict=True)).transpose()
XGBoost is usually used with trees as base learners and XGBoost itself is considered an ensemble learning method in that uses the outputs of many models for final prediction. Compared to other ensemble methods XGBoost uses a slightly different kind of a decision tree, called a classification and regression tree, or CART.
CART trees contain a real-valued score in each leaf, regardless of whether they are used for classification or regression. The real-valued scores can then be thresholded to convert into categories for classification problems if necessary.
What is Boosting?¶
At bottom, boosting isn't really a specific machine learning algorithm, but a concept that can be applied to a set of machine learning models (meta-algorithm). Specifically it's an ensemble meta-algorithm primarily used to reduce any given single learner's variance and to convert many weak learners into an arbitrarily strong learner.
A weak learner is any meachine learning algorithm that is just slightly better than chance. So a decision tree that can predict some outcome slightly more frequently than pure randomness would be considered a weak learner.
The principal insight that allows XGBoost to work is the fact that we can use boosting to convert a collection of weak learners into a strong learner, where a strong learner is any algorithm that can be tuned to achieve arbitrarily good performance for some supervised learning problem.
How is this accomplished?¶
This is achieved by iteratively learning a set of weak models on subsets of the data we have at hand, and weighting each of their predictions according to each weak learner's performance. We then combine all of the weak learners' predictions multiplied by their weights to obtain a single final weighted prediction that is much better than any of the individual predictions themselves.
Cross-validation with XGBoost¶
Cross-validation is a robust method for estimating the expected performance of a machine learning model on unseen data by generating many non-overlapping train/test split on the training data and reporting the average test set performance across all data splits.
In this following example we will convert our dataset into an optimized data structure, that the
creators
of XGBoost made, that gives the package its lauded performance and efficiency gains called
DMatrix
# create a DMatrix with feature array X and target y
churn_dmatrix = xgb.DMatrix(X, y)
churn_dmatrix
In the previous excercise the input datasets were converted into DMatrix data on the fly, but when we
use
the XGBoost cv object, which is part of XGBoost learning api we have to first explicitly convert our
data
into a DMatrix
. This is necessary because the cv method has no idea what kind of XGBoost
model
we are using and expects us to provide that information as a dictionary of appropriate key-value
pairs.
# create a dictionary of parameters
params = {'objective':'binary:logistic',
'max_depth':4}
# use cross-validation on on DMatrix
cv_results = xgb.cv(dtrain=churn_dmatrix,
params=params,
nfold=10,
num_boost_round=80,
metrics='error',
as_pandas=True, # to store output as pandas df
seed=23)
print('Accuracy {:.4f}'.format((1 - cv_results['test-error-mean']).iloc[-1]))
cv_results.tail()
cv_results
stores the training and test mean and standard deviation of the error per
boosting
round (tree built) as a DataFrame. From cv_results
, the final round
'test-error-mean'
is extracted and converted into an accuracy value, where accuracy is
1-error
.
Measuring AUC¶
Now that we have used cross-validation to compute average accuracy (after converting from an error),
it's
very easy to compute any other metric we might be interested in. All we have to do is pass it (or a
list of
metrics) in as an argument to the metrics
parameter of xgb.cv()
cv_results = xgb.cv(dtrain=churn_dmatrix,
params=params,
nfold=10,
num_boost_round=80,
metrics="auc",
as_pandas=True,
seed=23)
print('AUC score {:.4f}'.format((cv_results["test-auc-mean"]).iloc[-1]))
cv_results.tail()
When to use XGBoost¶
We should use XGBoost for any supervised machine learning task that fits the following criteria:
- A large number of training examples (greater than 1k training samples and less than 100 features)
- It tends to perform well when we have a mixture of categorical and numeric features or just numeric features.
It should not be used in the case of:
- Image recognition
- Computer vision
- NLP and understanding problems
- Number of training smaples is significantly smaller than the number of features
as these kind of problems can be much better tackled using dep learning approaches.
Regression¶
Now that we have seen how to use XGBoost for classification, we will see how to use it in regression tasks.
Regression probles involve predictiong continuous, or real, values. Evaluating the quality of a regression model involves using a different set of metrics: in most cases we use the root mean squared error (RMSE) or the mean absolute error (MAE).
-
RMSE is computed by taking the difference between the actual and the predicted values for what we are trying to predict by squaring those differences, computing their mean, and taking that value's square root. This allows us to treat negative and positive differences equally, but it tends to punish larger differences between predicted and actual values much more than smaller ones.
-
MAE on the other hand simply sums the absolute differences between predicted and actual values across all of the samples we build our model on. Although MAE isn't affected by large differences as much as RMSE, it lacks some nice mathematical properties that make it much less frequently used as an evaluation metric.
Some common algorithms that are used for regression problems include linear regression and decision trees. It's important to briefly note here that some algorithms such as decision trees, can be used for both regression as well as classification tasks, which is one of their important properties that makes them prime candidates to be building blocks for XGBoost models.
Objective functions¶
An objective or loss function quantifies how far off our prediction is from the actual result for a given data point. It maps the difference between the prediction and the target to a real number. When we construct a model we do so in the hopes that it minimizes the loss function across all of the data points we pass in. Our ultimate goal is the smallest possible loss.
Loss functions have specific naming conventions in XGBoost.
- For regression models the most common is called
reg linear
- For binary classification models the most common loss functions used are
reg logistic
, when we want just decision, not probability,binary logistic
when we want the actual predicted probability of the positive class.
Base learners¶
XGBoost in an ensemble learning method composed of many base learners that are added together to generate a single prediction. The goal is to have base learners that are slightly better than random guessing on certain subsets of training examples, and uniformly bad at the remainder. When all of the predictions are combined they create a final prediction that is non-linear, uniformely bad predictions cancel out and those slightly better than chance combine into a single very good prediction.
-
Linear base learner is simply a sum of linear terms, exactly as we would find in a linear or logistic regression model.
When we combine many of these base models into an ensemble, we gat a weighted sum of linear models, which is itself linear since we do not get any nonlinear combinations of features in the model, this approach is rarely used, as we can get identical performance from a regularized linear model.
-
Tree base learner uses decision tree as base models. When the decision trees are all combined into an ensemble, their combination becomes a nonlinear function of each individual tree, which itself in nonlinear
Decision trees as base learners¶
Here we will build an XGBoost model for a regression task. By default XGBoost uses trees as base learners, so it does not need being specified. We will be using the boston housing dataset from scikit-learn.
import numpy as np
from sklearn.metrics import mean_squared_error
# load boston housinng data
X, y = datasets.load_boston(return_X_y=True)
# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# instantiate an XGBRegressor
xg_reg = xgb.XGBRegressor(objective='reg:squarederror',
n_estimators=60,
seed=1)
# fit on training set
xg_reg.fit(X_train, y_train)
# predict test set
y_pred = xg_reg.predict(X_test)
# calculate and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE: %f'%(rmse))
Linear base learners¶
After using trees as base models let's use the other kind of base model: a linear learner. This
model,
although not as commonly used, allows us to create a regularized linear regression using XGBoost's
powerful
learning API. However, because it's uncommon, we have to use XGBoost's own non-scikit-learn compatible
functions to build the model, such as xgb.train()
.
In order to do this we must create the parameter dictionary that describes the kind of booster we
want to
use. The key-value pair that defines the booster type (base model) we need is
"booster":"gblinear"
.
Once we've created the model, we can use the .train()
and .predict()
methods of
the model.
# convert the training and testing sets into DMatrixes
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)
# create the parameter dictionary
params = {"booster":"gblinear", "objective":"reg:squarederror"}
# train the model
xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=60)
# predict the labels of the test set
preds = xg_reg.predict(DM_test)
# compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))
We can compare the RMSE and MAE of a cross-validated model to evaluate the quality.
# create the DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# create the parameter dictionary
params = {"objective":"reg:squarederror", "max_depth":4}
# perform cross-validation
cv_results = xgb.cv(dtrain=housing_dmatrix,
params=params, nfold=4,
num_boost_round=5,
metrics='rmse',
as_pandas=True,
seed=1)
cv_results
Regularization¶
Loss functions in XGBoost do not just take into account how close a model' predictiona are to the actual value but also the complexity of the model. This idea of penalizing models as they become more complex is called regularization. So loss functions are used to find models that are both accurate and as simple as they can possibly be.
There are several parameters that can be tweaked to limit model complexity by altering the loss function.
-
gamma
is a parameter for tree based learners that controls wheater a given node on will split based on the expected reduction in the loss that would occur after preforming the split, so that higher values lead to fewer splits. -
alpha
, or l1 regularization, is a penalty on leaf weights rather that on feature weights, as is the case of linear or logistic regression. Higheralpha
values lead to stronger l1 regularization, which causes many leaf weights in the base learners to go to 0. -
lambda
, or l2 regularization, is a much smoother penalty that l1 and causes leaf weights to smoothly decrease, instead of enforcing strong sparsity constraints on the leaf weights as in l1.
params = {'objective':'reg:squarederror', 'max_depth':4}
l1_params = [1, 10, 100]
# initialize empty list to store our final root mean square error
# for each of this l1 or aplha values
rmses_l1 = []
# loop over alpha values
for reg in l1_params:
# assign current value to alpha
params['alpha'] = reg
# perform cross-validation
cv_results = xgb.cv(dtrain=housing_dmatrix,
params=params,
nfold=4,
num_boost_round=10,
metrics='rmse',
as_pandas=True,
seed=1)
# add final RMSE to our list
rmses_l1.append(cv_results['test-rmse-mean'].tail(1).values[0])
print('Best rmse as a function of l1:')
rmse_df = pd.DataFrame(zip(l1_params, rmses_l1), columns=['l1', 'rmse'])
rmse_df
Having seen an example of l1 regularization, we'll now vary the l2 regularization penalty - also
known as
lambda
- and see its effect on overall model performance on the housing dataset.
reg_params = [1, 10, 100]
# create the initial parameter dictionary for varying l2 strength
params = {"objective":"reg:squarederror","max_depth":3}
# create an empty list for storing rmses as a function of l2 complexity
rmses_l2 = []
# iterate over reg_params
for reg in reg_params:
# Update l2 strength
params["lambda"] = reg
# Pass this updated param dictionary into cv
cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
# Append best rmse (final round) to rmses_l2
rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])
# Look at best rmse per l2 param
print("Best rmse as a function of l2:")
pd.DataFrame(zip(reg_params, rmses_l2), columns=["l2", "rmse"])
Visualizing individual XGBoost trees¶
Now that we've used XGBoost to both build and evaluate regression as well as classification models, we should get a handle on how to visually explore our models.
XGBoost has a plot_tree()
function that makes this type of visualization easy. Once we
train a
model using the XGBoost learning API, we can pass it to the plot_tree()
function along
with the
number of trees we want to plot using the num_trees
argument.
import matplotlib.pyplot as plt
params = {"objective":"reg:squarederror", "max_depth":2}
# train the model
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)
# plot the first tree
_ = xgb.plot_tree(xg_reg, num_trees=0)
_.figure.set_size_inches(40,25)
plt.show()
# plot the fifth tree
_ = xgb.plot_tree(xg_reg, num_trees=4)
_.figure.set_size_inches(40,25)
plt.show()
# plot the last tree sideways
_ = xgb.plot_tree(xg_reg, num_trees=9, rankdir='LR')
_.figure.set_size_inches(40,25)
plt.show()
Visualizing feature importances¶
Another way to visualize our models is to examine the importance of each feature column in the original dataset within the model.
One simple way of doing this involves counting the number of times each feature is split on across
all
boosting rounds (trees) in the model, and then visualizing the result as a bar graph, with the
features
ordered according to how many times they appear. XGBoost has a plot_importance()
function
that
allows us to do this.
import seaborn as sns
sns.set()
# create boston housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,
label=y,
feature_names=datasets.load_boston().feature_names)
# create the parameter dictionary
params = {'objective':'reg:squarederror', 'max_depth':10}
# train the model
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)
# plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()
Hyperparameter Tuning¶
Let's now have a look at how tuning can improve a model. We will analyze another housing dataset and see the difference between a simple model and a tuned model.
# load Ames housing data
housing_data = pd.read_csv('ames_housing_trimmed_processed.csv')
# create our feature and target variables
X, y = housing_data[housing_data.columns.tolist()[:-1]].values, housing_data[housing_data.columns.tolist()[-1]].values
# create a DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y, feature_names=housing_data.columns.tolist()[:-1])
# create an untuned parameter dictionary
untuned_params = {'objective':'reg:squarederror'}
# perform cv
untuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix,
params=untuned_params,
nfold=4,
#num_boost_round=200,
metrics='rmse',
as_pandas=True,
seed=1)
# compute and print RMSE
untuned_rmse_mean = untuned_cv_results_rmse['test-rmse-mean'].iloc[-1]
print('Untuned rmse: %f'%untuned_rmse_mean)
and here is the tuned model.
# create a dictionary of tuned parameters
tuned_params = {'objective':'reg:squarederror',
'colsample_bytree':0.4,
'learning_rate':0.1,
'max_depth':4}
# perform cv
tuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix,
params=tuned_params,
nfold=4,
num_boost_round=200,
metrics='rmse',
as_pandas=True,
seed=1)
# compute and print RMSE
tuned_rmse_mean = tuned_cv_results_rmse['test-rmse-mean'].iloc[-1]
print('Tuned rmse: %f'%tuned_rmse_mean)
rmse_decr = ((tuned_rmse_mean - untuned_rmse_mean ) / untuned_rmse_mean * 100)
print('There was a decrease of {}% rmse with the tuned model'.format(round(-rmse_decr, 1)))
Boosting Rounds¶
Let's start with parameter tuning by seeing how the number of boosting rounds (number of trees we
build)
impacts the out-of-sample performance of our model. We'll use xgb.cv()
inside a for loop
and
build one model per num_boost_round
parameter.
Here, we'll continue working with the Ames housing dataset.
# create the parameter dictionary for each tree
params = {"objective":"reg:squarederror", "max_depth":3}
# create list of number of boosting rounds
num_rounds = [5, 10, 15]
# empty list to store final round rmse per model
final_rmse_per_round = []
# iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:
# perform cross-validation
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=curr_num_rounds, metrics="rmse", as_pandas=True, seed=1)
# append final round RMSE
final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])
# print the resultant DataFrame
num_rounds_rmses = zip(num_rounds, final_rmse_per_round)
pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"])
Automated boosting round selection using early_stopping¶
Instead of attempting to pick the best possible number of boosting rounds, we can very automatically
select
the number of boosting rounds within xgb.cv()
. This is done using a technique called
early
stopping.
Early stopping works by testing model after every boosting round against a hold-out dataset
and
stopping the creation of additional boosting rounds (thereby finishing training of the model early) if
the
hold-out metric ("rmse"
in our case) does not improve for a given number of rounds. Here
we
will use the early_stopping_rounds
parameter in xgb.cv()
with a large
possible
number of boosting rounds (50). It's important to notice that if the holdout metric continuously
improves up
through when num_boosting_rounds
is reached, then early stopping does not occur.
# perform cross-validation with early stopping
cv_results = xgb.cv(dtrain=housing_dmatrix,
params=params,
early_stopping_rounds=10,
num_boost_round=300,
as_pandas=True,
seed=1)
# print the number of rounds
print(len(cv_results))
We can see that the model stopped at the 77th round instead of 300th.
Tunable Parameters¶
The parameters that can be tuned are significantly different for each base learner.
- Tree base learner¶
-
learning_rate: affects how quickly the model fits the residual error using additional base learners. A low learning rate will require more boosting rounds to achieve the same reduction in residual error as a model with a higher learning rate
-
gamma: min loss reduction to create a new tree split
-
lambda: l2 regularization on leaf weights
-
alpha: l1 regulaization on leaf weights
-
max_depth: a positive integer value that affects how deep ech tree is allowed to grow during any given boosting round
-
subsample: value between 0 and 1 and is a fraction of the total training set that can be used for any given boosting round. If the value is low then the fraction of our training data used per boosting round would be low and we could run into underfitting problems, a value that is very high can lead to overfitting as well
-
colsample_bytree: a fraction of features we can select from during any given boosting round and must also ba a value between 0 and 1. A large value means that almost all features can be used to build a tree during a given boosting round, whereas a small value means that the fraction of features that can be selected from is very small. In general smaller colsample_bytree values can be thought of as providing additional regularization to the model, while using all columns might overfit a trained model.
- Linear base learner¶
-
lambda: l2 regularization on leaf weights
-
alpha: l1 regularization on leaf weights
-
lambda_bias: l2 regularization term on bias
Finally it is important to mention that the number of boosting rounds, that is either the number of trees we build or the number of linear base learners we construct, is itself a tunable parameter.
Tuning eta¶
We will begin by tuning the eta
, also known as the learning rate.
The learning rate is a parameter that can range between 0 and 1, with higher values of "eta"
penalizing
feature weights more strongly, causing much stronger regularization.
# create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}
# create list of eta values and empty list to store final round RMSE per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []
# systematically vary the eta
for curr_val in eta_vals:
params["eta"] = curr_val
# perform cross-validation
cv_results = xgb.cv(dtrain=housing_dmatrix,
params=params,
nfold=3,
early_stopping_rounds=5,
num_boost_round=100,
metrics='rmse',
seed=1)
# append the final round rmse to best_rmse
best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# print the resultant DataFrame
pd.DataFrame(zip(eta_vals, best_rmse), columns=["eta","best_rmse"])
Tuning max_depth¶
max_depth
is the parameter that dictates the maximum depth that each tree in a boosting
round
can grow to. Smaller values will lead to shallower trees, and larger values to deeper trees.
max_depths = [2, 5, 10, 20]
best_rmse = []
# systematically vary the max_depth
for curr_val in max_depths:
params["max_depth"] = curr_val
# perform cross-validation
cv_results = xgb.cv(dtrain=housing_dmatrix,
params=params,
nfold=3,
early_stopping_rounds=5,
num_boost_round=100,
metrics='rmse',
seed=1)
# append the final round rmse to best_rmse
best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# print the resultant DataFrame
pd.DataFrame(zip(max_depths, best_rmse),columns=["max_depth","best_rmse"])
Tuning colsample_bytree¶
In scikit-learn it is called max_features
, but in both this parameter simply specifies
the
fraction of features to choose from at every split in a given tree. colsample_bytree
must
be
specified as a float between 0 and 1.
# create the parameter dictionary
params={"objective":"reg:squarederror", "max_depth":3}
# create list of hyperparameter values
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []
# systematically vary the hyperparameter value
for curr_val in colsample_bytree_vals:
params['colsample_bytree'] = curr_val
# perform cross-validation
cv_results = xgb.cv(dtrain=housing_dmatrix,
params=params,
nfold=2,
num_boost_round=100,
early_stopping_rounds=5,
metrics="rmse",
seed=1)
# append the final round rmse to best_rmse
best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# print the resultant DataFrame
pd.DataFrame(zip(colsample_bytree_vals, best_rmse), columns=["colsample_bytree","best_rmse"])
To find the optimal values for several hyperparameters simultaneously, leading to the lowest loss possible, when their values interact in a non-obvious or nonlinear way the common strategy is to use Grid Search or Random Search.
Grid Search¶
Grid search is a method of exhaustively searching through a collection of possible parameters values, every parameter configuration is evaluated. The parameter configuration that gave the best value for the metric used is picked as the final model.
# import GridSearchCV from sklearn
from sklearn.model_selection import GridSearchCV
# create a parameter dictionay
gbm_param_grid = {'colsample_bytree': [0.3, 0.7],
'learning_rate':[0.01, 0.1, 0.5, 0.9],
'n_estimators':[200],
'max_depth': [2, 5],
'subsample':[0.3, 0.5, 0.9]}
# instantiate an XGBRegressor
gbm = xgb.XGBRegressor(objective='reg:squarederror')
# create GridSearchCV object using XGBRegressor and the parameter dictionary
grid_mse = GridSearchCV(estimator=gbm,
param_grid=gbm_param_grid,
scoring='neg_mean_squared_error',
cv=2,
n_jobs=-3)
# fit on Ames housing dataset
grid_mse.fit(X, y)
print('Best parameters found :', grid_mse.best_params_)
print('Lowest RMSE found :',np.sqrt(np.abs(grid_mse.best_score_)))
Random Search¶
Significantly different from grid search in that the number of models that we are required to iterate over does not grow as we expand the overall hyperparameter space. In random search we can decide how many models, or iterations, we want to try out before stopping. It simply works by drawing a random combination of possible hyperparameter values from the range of allowable hyperparameters a set number of times. We the models specified are evaluated we simply pick the best one.
# import RandomizedSearchCV from sklearn
from sklearn.model_selection import RandomizedSearchCV
# create a parameters dictionary
gbm_param_grid = {'learning_rate':np.arange(0.05, 1.05, 0.05),
'n_estimators':[200],
'subsample':np.arange(0.05, 1.05, 0.05),
'max_depth': range(2, 12)}
# instantiate an XGBRegressor
gbm = xgb.XGBRegressor(objective='reg:squarederror')
# create RandomizedSearchCV object using XGBRegressor and the parameter dictionary
randomized_mse = RandomizedSearchCV(estimator=gbm,
param_distributions=gbm_param_grid,
n_iter=60,
scoring='neg_mean_squared_error',
cv=2,
n_jobs=-3)
# fit on Ames housing dataset
randomized_mse.fit(X, y)
print('Best parameters found :', randomized_mse.best_params_)
print('Lowest RMSE found :', np.sqrt(np.abs(randomized_mse.best_score_)))
Using XGBoost in Pipelines¶
Pipelines in sklearn are objects that take a list of named tuples as input. The named tuples must
always
contain a string name as the first element in each tuple and any scikit-learn compatible transfromer
or
estimator object as the second element. Each named tuple in the pipeline is called a step, and the
list of
informations that are conatined in the list are executed in order once some data is passed through the
pipeline. This done by the standard .fit()
and .predict()
methods. Finally
where
pipelines are really useful is that they can be used as input estimator objects into other sklearn
objects
themselves, the most useful of which are the cross_val_score
(allows for efficient
cross-validation and out of sample metric calculation), and grid search and random search approaches
for
tuning hyperparameters.
Pipeline on boston data with RandomForestRegressor¶
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# create our feature matrix and target vector
X, y = datasets.load_boston(return_X_y=True)
# create a pipeline object using RandomForestRegressor
rf_pipeline = Pipeline(steps=[('sd_scaler', StandardScaler()),
('rf_model', RandomForestRegressor(n_estimators=100))])
# create a pipeline object using XGBRegressor
xgb_pipeline = Pipeline(steps=[('sd_scaler', StandardScaler()),
('xgb_model', xgb.XGBRegressor(objective='reg:squarederror'))])
%%timeit
scores = cross_val_score(rf_pipeline, X, y, scoring='neg_mean_squared_error', cv=10)
# get the final root mean squared error
final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))
print('Average RMSE', final_avg_rmse)
Pipeline on boston data with XGBoost¶
%%timeit
scores = cross_val_score(xgb_pipeline, X, y,scoring='neg_mean_squared_error',cv=5)
# get the final root mean squared error
final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))
print('Average RMSE', final_avg_rmse)
As we can see, without any hyperparameter tuning the XGBoost model had a lower RMSE compare to the
RandomForestRegressor
and was also much faster.
Hyperparameter tuning in a pipeline¶
Lastly we will see how automated hyperparameter tuning for an XGBoost model works within a scikit-learn pipeline.
# create a parameter grid for the pipeline
gbm_param_grid = {
'xgb_model__subsample':np.arange(0.05, 1, 0.05),
'xgb_model__max_depth':np.arange(3, 20, 1),
'xgb_model__colsample_bytree':np.arange(0.1, 1.05, 0.05)
}
# create a RandomizedSearchCV using the XGBoost pipeline we
# instantiated in the previous exercise
randomized_neg_mse = RandomizedSearchCV(xgb_pipeline,
param_distributions=gbm_param_grid,
n_iter=20,
scoring='neg_mean_squared_error',
cv=4,
iid=True,
random_state=13)
# fit on Ames housing data
randomized_neg_mse.fit(X, y)
print('Best RMSE:', np.sqrt(np.abs(randomized_neg_mse.best_score_)))
print('Best Model', randomized_neg_mse.best_estimator_)
Conclusion¶
We have seen how to use the XGBoost API with many of the features it offers. To deepen our knowledge there are other useful subjects, like how to use it for ranking/recommendation problems, using more sophisticated hyperparameter tuning strategies (like Bayesian Optimization) or how to use XGBoost as part of an ensemble of other models.
XGBoost however, should not be used as a silver bullet. Best results can be obtained in conjunction with great data exploration and feature engineering.