Feature engineering is the act of taking raw data and extracting features from it for machine learning. Most machine learning algorithms work with tabular data: when we talk about features we refer to the information stored in the columns of these tables.
Here, we will see how to incorporate feature engineering into our data science workflow.
Types of data¶
The data we use to build machine learning models can be of different types:
- Continuous: either integers or floats
- Categorical: one of a limited set of values
- Ordinal: ranked values, often with no detail of distance between them
- Boolean: binary values
- Datetime: dates and times

Dealing with these different types requires a well-thought-out approach. Feature engineering is often overlooked in machine learning discussions, but any real-world practitioner will confirm that data manipulation and feature engineering is in many cases the most important aspect of a project.
We will initially be working with a modified subset of the Stackoverflow survey response data. This data set records the details and preferences of hundreds of users of the Stackoverflow website. Let's start by exploring the data.
import numpy as np
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)
df = pd.read_csv('Combined_DS_v10.csv')
df.head()
df.info()
To get the data types of each column we can also use the dtypes attribute.
# count the number of features of each data type present in the data set
df.dtypes.value_counts()
Knowing the types of each column is essential if we are performing analysis based on a subset of specific data types. To do this we can use the select_dtypes() method and pass a list of the relevant data types to the include argument.
# select only continuous (integer and float) features
only_cont = df.select_dtypes(include=['int64', 'float64'])
only_cont.columns
Categorical features¶
Categorical variables are used to represent groups that are qualitative in nature. While these can be easily understood by a human, we will need to encode categorical features as numeric values to use them in our machine learning models.
For this approach we cannot arbitrarily allocate numbers to each category, as that would imply some sort of ordering in the categories. Instead, values can be encoded by creating additional binary features corresponding to whether each value was picked or not. In doing so the model can leverage the information of which category it was given without inferring any order between the different options.
There are 2 main approaches:
- One-hot encoding
- Dummy encoding
These are very similar and often confused; in fact, by default pandas performs one-hot encoding when we use the get_dummies() function. In one-hot encoding we convert $n$ categories to $n$ features. If drop_first=True is used they are converted to $n-1$ features, as the first category is omitted. In dummy encoding the base value is encoded by the absence of all other features.
# using get_dummies on the Country feature
pd.get_dummies(df['Country'], prefix='C', drop_first=True).head(3)
Notice how France is missing here: its value is represented by the intercept, that is, by the absence of all other columns.
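For comparison, here is a minimal sketch of plain one-hot encoding on the same column (no category is dropped, so all $n$ countries, including France, get their own column; the 'OH' prefix is just an arbitrary choice for this example):
# one-hot encode the Country feature without dropping the first category
pd.get_dummies(df['Country'], prefix='OH').head(3)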
Both of these methods have different advantages. One-hot encoding generally creates more explainable features, as each feature will have its own weight that can be observed after training. But we must be aware that it may create features that are entirely collinear due to the same information being represented multiple times.
When a column has many categories, both methods will result in a large number of columns being created. In these cases we may want to only create columns for the most common values. Let's have a look at the values in our Country column.
country_counts = df['Country'].value_counts()
country_counts
Some features can have many different categories but a very uneven distribution of their occurrences. In these cases, we may not want to create a feature for each value, but only the more common occurrences.
We can use our count of occurrences to limit which values we will include, by first creating a mask of the values that occur less than $n$ times.
# create a mask for only categories that occur less than 10 times
mask = df['Country'].isin(country_counts[country_counts < 10].index)
# label all other categories as Other
df.loc[mask, 'Country'] = 'Other'
print(df['Country'].value_counts())
Numeric Variables¶
As mentioned earlier, most machine learning models require our data to be in a numeric format. However, even if our data is all numeric, there is still a lot we can do to improve our features.
Numeric features can be used to represent a huge array of different characteristics and measurements. Depending on the use case, numeric features can be treated in several different ways.
One of the first questions we should ask when working with numeric features is whether the magnitude of the feature is its most important trait, or just its direction. For example, if we have a dataset containing the number of times a restaurant has had a violation, we might only care about the fact that it had one and not the number of times it did. A solution to this is to create a binary column. This can also be useful for our target variables in some cases.
Let's see how to do this with our Stackoverflow data set.
# create a binary column HasSalary
df['HasSalary'] = False
# set the value to True if the salary is greater than 0
df.loc[df.ConvertedSalary > 0, 'HasSalary'] = True
df[['HasSalary', 'ConvertedSalary']].head()
For many continuous values we will care less about the exact value of a numeric column and instead care about the general magnitude of the values. We can in this case group the values into different bins. We can do this using the pandas.cut function; to define the intervals we use the bins argument.
# specify the boundaries and labels of the bins
bins = [-np.inf, 10000, 20000, 50000, 100000, np.inf]
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']
# there must be one more boundary than there are labels
assert len(bins) == len(labels) + 1
# use pandas.cut to create a binned feature
df['SalaryBinned'] = pd.cut(df['ConvertedSalary'],
                            bins=bins,
                            labels=labels)
df[['SalaryBinned', 'ConvertedSalary']].head()
Missing Values¶
Most data sets contain missing values, often represented as NaN (Not a Number), for different reasons (data not collected properly, collection and management errors, data intentionally being omitted, etc.). It is extremely important to identify and deal with missing data.
Missing data may be indicative of a problem in our data pipeline. If data is consistently missing in a certain column we should investigate as to why this is the case. Missing data may provide information in itself.
To find where the missing values exist we can use the isna() method. isna() returns a boolean same-sized DataFrame or Series where NaN values get mapped to True. The inverse, i.e. the non-missing values, can be found using the notna() method.
# get the missing values for each column
df.isna().sum()
How do we deal with these missing values?
If we are confident that missing values in our dataset are occurring at random, meaning that the chance of data being missing is unrelated to any of the variables involved in our analysis, the most effective and statistically sound approach to dealing with them is called Complete Case Analysis or Listwise Deletion.
In complete case analysis we simply exclude those records in our dataset which have any data missing.
# get df with NaN entries dropped from it
no_missing_values_rows = df.dropna()
print('Dropped rows: ', df.shape[0] - no_missing_values_rows.shape[0])
If we want to delete rows with missing values in a specific column we use subset=['column_name']:
# get df with rows in Gender that contained NaN entries dropped
df_with_dropped_on_Gender = df.dropna(subset=['Gender'])
print('Dropped rows: ', df.shape[0] - df_with_dropped_on_Gender.shape[0])
Listwise Deletion does have its drawbacks:
- It deletes perfectly valid data points that share a row with the missing value
- If the missing values do not occur entirely at random it can negatively affect the model
- It can reduce the degrees of freedom of our model
- When building a predictive model, if we were to remove all cases that had missing values when training, we would run into problems when we receive missing values in our test set
The most common method is to fill in the missing values; we can do this using the fillna() method. We need to provide the value to replace the missing ones. In the case of categorical columns it is common to replace the values with a string (like 'other') or with the most commonly occurring value.
df['Gender'] = df['Gender'].fillna('Not Given')
df['Gender'].value_counts()
In situations where we believe that the absence or presence of data is more important than the values themselves we can create a new column that records the absence of data and then drop the original column.
df['SalaryGiven'] = df['ConvertedSalary'].notna()
df[['SalaryGiven']].head()
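If we no longer need the raw values, we could then drop the original column. A minimal sketch on a separate DataFrame (the name df_presence_only is hypothetical, and we keep ConvertedSalary in df itself because it is used again below):
# keep only the presence indicator and drop the raw salary column
df_presence_only = df.drop(columns=['ConvertedSalary'])
'ConvertedSalary' in df_presence_only.columns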
For numeric columns we might want to replace the missing values with a more suitable value. But what is a suitable value? In cases like the salary we often turn to the measures of central tendency, which describe the central or typical value of a distribution. The most commonly used values are the mean or median.
One caveat that we must keep in mind is that filling missing values this way can lead to biased estimates of the variances and covariances of the features. Similarly, the standard error and test statistics can be incorrectly estimated, so if these metrics are needed they should be calculated before the missing values have been filled.
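A minimal sketch of this effect, using mean imputation on a temporary copy purely for illustration (the variable name imputed is hypothetical; the column in df is left untouched here):
# standard deviation computed on the observed values only (NaNs are skipped)
print(df['ConvertedSalary'].std())
# after mean imputation the standard deviation shrinks, because every filled
# value sits exactly at the centre of the distribution
imputed = df['ConvertedSalary'].fillna(df['ConvertedSalary'].mean())
print(imputed.std())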
# fill ConvertedSalary with the median salary (calculated excluding null values by default)
df['ConvertedSalary'] = df['ConvertedSalary'].fillna(df['ConvertedSalary'].median())
# get rid of all the decimal values in a column
df['ConvertedSalary'] = df['ConvertedSalary'].astype('int64')
df[['ConvertedSalary']].head()
Of course, data issues are not limited to missing values. In some instances we will come across features that need to be updated in some other way (for example a column containing monetary values).
df['RawSalary'].dtype
Here our RawSalary column is of type object although, intuitively, we know that it should be numeric. We cannot cast this column directly to a numeric type because it contains non-numeric characters (like "$", "£" and "," in our case).
try:
    df['RawSalary'] = df['RawSalary'].astype('float64')
except ValueError as e:
    print(e)
# remove the special characters (regex=False so ',' and '$' are treated as literal characters)
for spec in [',', '£', '$']:
    df['RawSalary'] = df['RawSalary'].str.replace(spec, '', regex=False)
# cast RawSalary to float
df['NumericSalary'] = df['RawSalary'].astype('float64')
df['NumericSalary'].dtype
One approach to finding these stray characters is to force the column to the desired data type using pd.to_numeric(), coercing any values causing issues to NaN, and then filtering the DataFrame to just the rows containing those NaN values.
# attempt to convert the column to numeric values
numeric_vals = pd.to_numeric(df['RawSalary'], errors='coerce')
# find the indexes of values that could not be converted (excluding originally missing entries)
idx = numeric_vals.isna() & df['RawSalary'].notna()
df.loc[idx, 'RawSalary'][:5]
Data distributions¶
An important consideration before building a machine learning model is to understand what the distribution of our underlying data looks like. A lot of algorithms make assumptions about how our data is distributed or how different features interact with each other. For example, many models besides tree-based models require our features to be on a similar scale.
Feature engineering can be used to manipulate our data so that it fits these assumptions.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
numeric_df = df[['ConvertedSalary', 'Age', 'Years Experience']]
_ = numeric_df.hist(figsize=(10, 10))
_ = sns.boxplot(x='ConvertedSalary', data=numeric_df)
_ = sns.boxplot(x='Age', data=numeric_df)
_ = sns.boxplot(x='Years Experience', data=numeric_df)
_ = sns.pairplot(numeric_df)
Scaling and Transformations¶
We need to rescale data to ensure that it is on the same scale. There are many approaches; the most common are:
- MinMaxScaling: the data is scaled linearly between a minimum and a maximum value
- Standardization or normalization
MinMaxScaler¶
In normalization you linearly scale the entire column between 0 and 1, with 0 corresponding with the lowest value in the column, and 1 with the largest. When using scikit-learn (the most commonly used machine learning library in Python) you can use a MinMaxScaler to apply normalization. (It is called this as it scales your values between a minimum and maximum value.)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df[['Age']])
df['normalized_age'] = scaler.transform(df[['Age']])
df[['Age', 'normalized_age']].head()
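Under the hood this is the formula $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$. As a quick sanity check, a manual version of the same computation (reusing the Age and normalized_age columns from above) should match the scaler's output up to floating point error:
# manual min-max scaling of the Age column
manual = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
# maximum absolute difference from the MinMaxScaler result (should be ~0)
print((manual - df['normalized_age']).abs().max())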
Standardization¶
Standardization finds the mean of the data and centers the distribution around it, calculating the number of standard deviations each point is from the mean.
While normalization can be useful for scaling a column between two data points, it is hard to compare two scaled columns if even one of them is overly affected by outliers. One commonly used solution to this is called standardization, where instead of having a strict upper and lower bound, you center the data around its mean, and calculate the number of standard deviations away from mean each data point is.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df[['Age']])
df['standardized_age'] = scaler.transform(df[['Age']])
df[['Age', 'standardized_age']].head()
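The corresponding formula is $z = \frac{x - \mu}{\sigma}$. A quick manual check (note that StandardScaler uses the population standard deviation, i.e. ddof=0, whereas pandas defaults to ddof=1):
# manual standardization of the Age column using the population standard deviation
manual_z = (df['Age'] - df['Age'].mean()) / df['Age'].std(ddof=0)
# maximum absolute difference from the StandardScaler result (should be ~0)
print((manual_z - df['standardized_age']).abs().max())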
Log Transformation¶
In the previous exercises we scaled the data linearly, which does not affect the data's shape. This works great if your data is normally distributed (or close to normally distributed), an assumption that a lot of machine learning models make. Sometimes you will work with data that closely conforms to normality, e.g. the height or weight of a population. On the other hand, many variables in the real world do not follow this pattern, e.g. wages or the age of a population. In these cases a non-linear transformation, such as a log transform, can bring the distribution closer to normal.
from sklearn.preprocessing import PowerTransformer
log = PowerTransformer()
log.fit(df[['ConvertedSalary']])
df['log_ConvertedSalary'] = log.transform(df[['ConvertedSalary']])
# Plot the data before and after the transformation
df[['ConvertedSalary', 'log_ConvertedSalary']].hist()
plt.show()
df[['ConvertedSalary', 'log_ConvertedSalary']].head()
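Note that scikit-learn's PowerTransformer applies a Yeo-Johnson power transform by default rather than a literal logarithm, although the effect on right-skewed data is similar. If we want an actual log transform, a minimal alternative sketch (the column name log1p_ConvertedSalary is just for this example; log1p handles zero salaries safely):
# a direct log transform as an alternative to PowerTransformer
df['log1p_ConvertedSalary'] = np.log1p(df['ConvertedSalary'])
df[['ConvertedSalary', 'log1p_ConvertedSalary']].head()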
# Find the 95th quantile
quantile = df['ConvertedSalary'].quantile(0.95)
# Trim the outliers
trimmed_df = df[df['ConvertedSalary'] < quantile]
# Boxplot of the original data
_ = df[['ConvertedSalary']].boxplot()
plt.show()
plt.clf()
# Boxplot of the trimmed data
_ = trimmed_df[['ConvertedSalary']].boxplot()
Standard Deviation based¶
mean = df['ConvertedSalary'].mean()
std = df['ConvertedSalary'].std()
cut_off = std * 3
lower, upper = mean - cut_off, mean + cut_off
trimmed_df = df[(df['ConvertedSalary'] < upper) & (df['ConvertedSalary'] > lower)]
_ = trimmed_df[['ConvertedSalary']].boxplot()
Scaling and transforming new data¶
If we want to predict on new data with our model we need to make sure the new data goes through the same preprocessing steps.
For example, if we use a scaler on the training data it needs to be applied to the test data as well; notice that we transform the test data using the parameters fit on the training data.
Why do we only use training data?¶
We do this to avoid data leakage: statistics computed from the test set must not influence our preprocessing, otherwise our evaluation of the model will be overly optimistic.
#splitting data
train_pct_index = int(0.8 * len(df))
X_train, X_test = df[:train_pct_index], df[train_pct_index:]
train_std = X_train['ConvertedSalary'].std()
train_mean = X_train['ConvertedSalary'].mean()
cut_off = train_std * 3
train_lower, train_upper = train_mean - cut_off, train_mean + cut_off
# Trim the test DataFrame
trimmed_test_df = X_test[(X_test['ConvertedSalary'] < train_upper) \
& (X_test['ConvertedSalary'] > train_lower)]
_ = trimmed_test_df[['ConvertedSalary']].boxplot()
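To apply the same principle with a scaler, a minimal sketch that fits only on the training split and reuses that fit on the test split (reusing X_train and X_test from above):
from sklearn.preprocessing import StandardScaler
# fit the scaler on the training data only
scaler = StandardScaler()
scaler.fit(X_train[['ConvertedSalary']])
# apply the training fit to both splits
train_scaled = scaler.transform(X_train[['ConvertedSalary']])
test_scaled = scaler.transform(X_test[['ConvertedSalary']])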