Machine Learning

Feature Engineering

Posted on November 26, 2019

Feature engineering is the act of taking raw data and extracting features from it that are suitable for machine learning. Most machine learning algorithms work with tabular data: when we talk about features we refer to the information stored in the columns of these tables.

Here, we will be seeing how to incorporate feature engineering into our data science workflow.

Types of data

The data we use to build machine learning models can be of different types:

  • Continuous: either integers or floats
  • Categorical: one of a limited set of values
  • Ordinal: ranked values, often with no detail of distance between them
  • Boolean: binary values
  • Datetime: dates and times

Dealing with each of these requires a well thought out approach. Feature engineering is often overlooked in machine learning discussions, but any real-world practitioner will confirm that data manipulation and feature engineering is in many cases the most important aspect of a project.
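As a quick illustration of how these types typically map onto pandas dtypes, here is a toy frame (the column names and values are made up purely for illustration):

import pandas as pd

toy = pd.DataFrame({
    'salary': [55000.0, 72000.0],                                 # continuous -> float64
    'country': ['Sweden', 'UK'],                                  # categorical -> object
    'education_rank': pd.Categorical([1, 3], ordered=True),       # ordinal -> ordered category
    'hobby': [True, False],                                       # boolean -> bool
    'survey_date': pd.to_datetime(['2018-02-28', '2018-06-28']),  # datetime -> datetime64[ns]
})
print(toy.dtypes)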

We will initially be working with a modified subset of the Stackoverflow survey response data. This data set records the details and preferences of hundreds of users of the Stackoverflow website. Let's start by exploring the data.

In [1]:
import numpy as np
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

df = pd.read_csv('Combined_DS_v10.csv')
df.head()
Out[1]:
SurveyDate FormalEducation ConvertedSalary Hobby Country StackOverflowJobsRecommend VersionControl Age Years Experience Gender RawSalary
0 2/28/18 20:20 Bachelor's degree (BA. BS. B.Eng.. etc.) NaN Yes South Africa NaN Git 21 13 Male NaN
1 6/28/18 13:26 Bachelor's degree (BA. BS. B.Eng.. etc.) 70841.0 Yes Sweeden 7.0 Git;Subversion 38 9 Male 70,841.00
2 6/6/18 3:37 Bachelor's degree (BA. BS. B.Eng.. etc.) NaN No Sweeden 8.0 Git 45 11 NaN NaN
3 5/9/18 1:06 Some college/university study without earning ... 21426.0 Yes Sweeden NaN Zip file back-ups 46 12 Male 21,426.00
4 4/12/18 22:41 Bachelor's degree (BA. BS. B.Eng.. etc.) 41671.0 Yes UK 8.0 Git 39 7 Male £41,671.00
In [2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 11 columns):
SurveyDate                    999 non-null object
FormalEducation               999 non-null object
ConvertedSalary               665 non-null float64
Hobby                         999 non-null object
Country                       999 non-null object
StackOverflowJobsRecommend    487 non-null float64
VersionControl                999 non-null object
Age                           999 non-null int64
Years Experience              999 non-null int64
Gender                        693 non-null object
RawSalary                     665 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 85.9+ KB

To get the datatypes of each column we can also use the dtypes attribute.

In [3]:
# count the number of features of each datatype present in the data set
df.dtypes.value_counts()
Out[3]:
object     7
float64    2
int64      2
dtype: int64

Knowing the types of each column is essential if we are performing analysis based on a subset of specific datatypes. To do this we can use the select_dtypes() method and pass a list of the relevant data types to the include argument.

In [4]:
# select only the numeric (integer and float) features
only_cont = df.select_dtypes(include=['int64', 'float64'])
only_cont.columns
Out[4]:
Index(['ConvertedSalary', 'StackOverflowJobsRecommend', 'Age',
       'Years Experience'],
      dtype='object')

Categorical features

Categorical variables are used to represent groups that are qualitative in nature. While these can be easily understood by a human, we will need to encode categorical features as numeric values to use them in our machine learning models.

For this approach we cannot arbitrarily allocate numbers to each category, as that would imply some sort of ordering between the categories. Instead, values can be encoded by creating additional binary features corresponding to whether each value was picked or not. In doing so the model can use the information about which category was given without inferring any order between the different options.

There are 2 main approaches:

  • One-hot encoding
  • Dummy encoding

These are very similar and often confused. In fact, by default pandas performs one-hot encoding when we use the get_dummies() function: $n$ categories are converted to $n$ binary features. If drop_first=True is passed we get dummy encoding instead, with $n-1$ features, as the first category is omitted; in dummy encoding the base value is encoded by the absence of all the other features.
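As a quick sketch of the difference, on a toy column (the column and its values are made up purely for illustration, not part of the survey data):

import pandas as pd

fruit = pd.Series(['apple', 'banana', 'cherry'], name='fruit')

# one-hot encoding: 3 categories -> 3 binary columns
print(pd.get_dummies(fruit, prefix='fruit'))

# dummy encoding: 3 categories -> 2 columns; 'apple' is the base value,
# represented by zeros in both remaining columns
print(pd.get_dummies(fruit, prefix='fruit', drop_first=True))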

In [5]:
# using get_dummies on the Country feature
pd.get_dummies(df['Country'], prefix='C', drop_first=True).head(3)
Out[5]:
C_India C_Ireland C_Russia C_South Africa C_Spain C_Sweeden C_UK C_USA C_Ukraine
0 0 0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0
2 0 0 0 0 0 1 0 0 0

Notice how France is missing here: it is the base category, and its value is represented by the absence of all the other features (in a linear model its effect is absorbed by the intercept).

Both these methods have different advantages. One-hot encoding generally creates more explainable features, as each feature has its own weight that can be observed after training. But we must be aware that it may create features that are entirely collinear, because the same information is represented multiple times.
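A small sketch of the collinearity issue: with full one-hot encoding (no drop_first) the new columns of each row always sum to one, so any one column is perfectly predictable from the rest.

# full one-hot encoding of Country (no drop_first)
one_hot = pd.get_dummies(df['Country'], prefix='C')

# every row sums to 1 (Country has no missing values here),
# so the columns are perfectly collinear
print(one_hot.sum(axis=1).value_counts())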

Where there are many categories in a column, both methods will result in a large number of new columns. In these cases we may want to only create columns for the most common values. Let's have a look at the values in our Country column.

In [6]:
country_counts = df['Country'].value_counts()
country_counts
Out[6]:
South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Ukraine           9
Ireland           5
Name: Country, dtype: int64

Some features can have many different categories but a very uneven distribution of their occurrences. In these cases, we may not want to create a feature for each value, but only the more common occurrences.

We can use our count of occurrences to limit which values we include, by first creating a mask of the values that occur less than $n$ times.

In [7]:
# create a mask for only categories that occur less than 10 times
mask = df['Country'].isin(country_counts[country_counts < 10].index)

# relabel the rare categories as Other
df.loc[mask, 'Country'] = 'Other'
print(df['Country'].value_counts())
South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Other            14
Name: Country, dtype: int64

Numeric Variables

As mentioned earlier, most machine learning models require our data to be in a numeric format. However, even if our data is all numeric, there is still a lot we can do to improve our features.

Numeric features can be used to represent a huge array of different characteristics and measurements. Depending on the use case, numeric features can be treated in several different ways.

One of the first questions we should ask when working with numeric features is whether the magnitude of the feature is its most important trait, or just its direction. For example, if we have a dataset containing the number of times a restaurant has had a violation, we might just care about the fact that it had one and not about how many times. A solution to this is to create a binary column. This can also be useful for our target variable in some cases.

Let's see how to do this with our Stackoverflow data set.

In [8]:
# create a binary column HasSalary
df['HasSalary'] = False

# set the value to True if the salary is greater than 0
df.loc[df.ConvertedSalary > 0, 'HasSalary'] = True

df[['HasSalary', 'ConvertedSalary']].head()
Out[8]:
HasSalary ConvertedSalary
0 False NaN
1 True 70841.0
2 False NaN
3 True 21426.0
4 True 41671.0

For many continuous values we care less about the exact value of a numeric column and more about the general magnitude of the values. In this case we can group the values into bins. We can do this with the pandas.cut function, using the bins argument to define the interval boundaries.

In [9]:
# specify the boundaries and labels of the bins
bins = [-np.inf, 10000, 20000, 50000, 100000, np.inf]
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# pd.cut needs exactly one more boundary than there are labels
assert len(bins) == len(labels) + 1

# use pandas.cut to create a binned feature
df['SalaryBinned'] = pd.cut(df['ConvertedSalary'],
                            bins=bins,
                            labels=labels)

df[['SalaryBinned', 'ConvertedSalary']].head()
Out[9]:
SalaryBinned ConvertedSalary
0 NaN NaN
1 High 70841.0
2 NaN NaN
3 Medium 21426.0
4 Medium 41671.0

Missing Values

Most data sets contain missing values, often represented as NaN (Not a Number), for different reasons: data not collected properly, collection and management errors, data intentionally omitted, and so on. It is extremely important to identify and deal with missing data.

Missing data may be indicative of a problem in our data pipeline: if data is consistently missing in a certain column we should investigate why this is the case. Missing data may also provide information in itself.

To find where the missing values exist we can use the isna() method. isna() returns a boolean DataFrame or Series of the same size, where NaN values get mapped to True. The inverse, i.e. the non-missing values, can be found using the notna() method.

In [10]:
# get the missing values for each column
df.isna().sum()
Out[10]:
SurveyDate                      0
FormalEducation                 0
ConvertedSalary               334
Hobby                           0
Country                         0
StackOverflowJobsRecommend    512
VersionControl                  0
Age                             0
Years Experience                0
Gender                        306
RawSalary                     334
HasSalary                       0
SalaryBinned                  334
dtype: int64

How do we deal with these missing values?

If we are confident that missing values in our dataset are occurring at random, meaning that the chance of data being missing is unrelated to any of the variables involved in our analysis, the most effective and statistically sound approach to dealing with them is called Complete Case Analysis or Listwise Deletion.

In complete case analysis we simply exclude those records in our dataset which have any data missing.

In [11]:
# get df with NaN entries dropped from it
no_missing_values_rows = df.dropna()

print('Dropped rows: ', df.shape[0] - no_missing_values_rows.shape[0]) 
Dropped rows:  735

If we want to delete only the rows with missing values in a specific column, we can use the subset=['column_name'] argument.

In [12]:
# get df with rows in Gender that contained NaN entries dropped
df_with_dropped_on_Gender = df.dropna(subset=['Gender'])

print('Dropped rows: ', df.shape[0] - df_with_dropped_on_Gender.shape[0])
Dropped rows:  306

Listwise deletion does have its drawbacks:

  • It deletes perfectly valid data points that share a row with the missing value
  • If the missing values do not occur entirely at random it can negatively affect the model
  • It can reduce the degrees of freedom of our model
  • When building a predictive model, if we remove all cases with missing values from the training data, we will run into problems when we receive missing values in the test set

The most common method is to fill in the missing values, which we can do using the fillna() method; we need to provide the value to replace the missing ones with. For categorical columns it is common to replace missing values with a string (like 'Other' or 'Not Given') or with the most commonly occurring value.

In [13]:
df['Gender'] = df['Gender'].fillna('Not Given')
df['Gender'].value_counts()
Out[13]:
Male                                                                         632
Not Given                                                                    306
Female                                                                        53
Female;Male                                                                    2
Transgender                                                                    2
Male;Non-binary. genderqueer. or gender non-conforming                         1
Female;Transgender                                                             1
Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
Non-binary. genderqueer. or gender non-conforming                              1
Name: Gender, dtype: int64

In situations where we believe that the absence or presence of data is more important than the values themselves, we can create a new column that records whether the data is present and then drop the original column.

In [14]:
df['SalaryGiven'] = df['ConvertedSalary'].notna()
df[['SalaryGiven']].head()
Out[14]:
SalaryGiven
0 False
1 True
2 False
3 True
4 True

For numeric columns we might want to replace the missing values with a more suitable value. But what is a suitable value? In cases like salary we often turn to measures of central tendency, i.e. the central or typical value of the distribution. The most commonly used are the mean and the median.

One caveat we must keep in mind is that this kind of imputation can lead to biased estimates of the variances and covariances of the features. Similarly, the standard error and test statistics can be incorrectly estimated, so if these metrics are needed they should be calculated before the missing values have been filled.
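A quick sketch of this caveat: imputing with the mean leaves the mean unchanged but shrinks the estimated spread of the column (at this point ConvertedSalary still contains NaN, so the comparison is meaningful; the cell below fills with the median instead).

# spread computed on the observed values only
print(df['ConvertedSalary'].std())

# spread after mean imputation is smaller, because the imputed points
# contribute nothing to the sum of squared deviations
print(df['ConvertedSalary'].fillna(df['ConvertedSalary'].mean()).std())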

In [15]:
# fill missing ConvertedSalary values with the median salary (computed excluding nulls by default)
df['ConvertedSalary'] = df['ConvertedSalary'].fillna(df['ConvertedSalary'].median())

# cast the column to integers to drop the decimal part
df['ConvertedSalary'] = df['ConvertedSalary'].astype('int64')

df[['ConvertedSalary']].head()
Out[15]:
ConvertedSalary
0 55562
1 70841
2 55562
3 21426
4 41671

Of course, data issues are not limited to missing values. In some instances we will come across features that need to be cleaned up in other ways, for example a column containing monetary values stored as text.

In [16]:
df['RawSalary'].dtype
Out[16]:
dtype('O')

Here our RawSalary column is of type object although, intuitively, we know it should be numeric. We cannot cast it directly to a numeric column because it contains non-numeric characters ("$", "£" and "," in our case).

In [17]:
try:
    df['RawSalary'] = df['RawSalary'].astype('float64')
except ValueError as e:
    print(e)
could not convert string to float: '70,841.00'
In [18]:
# remove the special characters (regex=False so ',' and '$' are treated literally)
for spec in [',', '£', '$']:
    df['RawSalary'] = df['RawSalary'].str.replace(spec, '', regex=False)

# cast RawSalary to float
df['NumericSalary'] = df['RawSalary'].astype('float64')

df['NumericSalary'].dtype
Out[18]:
dtype('float64')

One approach to finding these stray characters is to force the column to the desired data type using pd.to_numeric(), coercing any values causing issues to NaN, and then filtering the DataFrame to just the rows containing those NaN values.

In [30]:
# attempt to convert the column to numeric values
numeric_vals = pd.to_numeric(df['RawSalary'], errors='coerce')

# find the indexes of the values that could not be converted
idx = numeric_vals.isna()

df.loc[idx, 'RawSalary'][:5]
Out[30]:
0     NaN
2     NaN
6     NaN
8     NaN
11    NaN
Name: RawSalary, dtype: object

Data distributions

An important consideration before building a machine learning model is to understand what the distribution of our underlying data looks like. A lot of algorithms make assumptions about how our data is distributed or about how different features interact with each other. For example, most models other than tree-based ones benefit from features being on the same scale.

Feature engineering can be used to manipulate our data so that it better fits these assumptions. Let's first visualize the distributions of our numeric features.

In [37]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

numeric_df = df[['ConvertedSalary', 'Age', 'Years Experience']]
_ = numeric_df.hist(figsize=(10, 10))
In [26]:
_ = sns.boxplot(x='ConvertedSalary', data=numeric_df)
In [27]:
_ = sns.boxplot(x='Age', data=numeric_df)
In [28]:
_ = sns.boxplot(x='Years Experience', data=numeric_df)
In [29]:
_ = sns.pairplot(numeric_df)

Scaling and Transformations

We often need to rescale our features to ensure they are all on the same scale. There are many approaches; the most common are:

  • Min-max scaling (normalization): the data is scaled linearly between a minimum and a maximum value
  • Standardization: the data is centered on the mean and expressed in units of standard deviation

MinMaxScaler

In normalization you linearly scale the entire column between 0 and 1, with 0 corresponding to the lowest value in the column and 1 to the largest. When using scikit-learn (the most commonly used machine learning library in Python) you can use a MinMaxScaler to apply normalization. (It is called this as it scales your values between a minimum and maximum value.)
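Concretely, each value $x$ in the column is mapped to

$x' = \dfrac{x - x_{min}}{x_{max} - x_{min}}$

so the smallest value becomes 0 and the largest becomes 1.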

In [30]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(df[['Age']])

df['normalized_age'] = scaler.transform(df[['Age']])
df[['Age', 'normalized_age']].head()
Out[30]:
Age normalized_age
0 21 0.046154
1 38 0.307692
2 45 0.415385
3 46 0.430769
4 39 0.323077

Standardization

Standardization finds the mean of the data and centers the distribution around it, expressing each value as the number of standard deviations it lies from the mean.

While normalization can be useful for scaling a column between two data points, it is hard to compare two scaled columns if even one of them is overly affected by outliers. One commonly used solution to this is standardization, where instead of having a strict upper and lower bound, you center the data around its mean and calculate the number of standard deviations away from the mean each data point is.
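Concretely, each value $x$ is mapped to

$z = \dfrac{x - \mu}{\sigma}$

where $\mu$ is the column mean and $\sigma$ its standard deviation.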

In [31]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(df[['Age']])

df['standardized_age'] = scaler.transform(df[['Age']])
df[['Age', 'standardized_age']].head()
Out[31]:
Age standardized_age
0 21 -1.132431
1 38 0.150734
2 45 0.679096
3 46 0.754576
4 39 0.226214

Log Transformation

In the previous exercises we scaled the data linearly, which does not affect the data's shape. This works well if the data is (close to) normally distributed, an assumption that a lot of machine learning models make. Sometimes you will work with data that closely conforms to normality, e.g. the height or weight of a population. Many real-world variables, however, do not follow this pattern, e.g. wages or the age of a population. In these skewed cases a log transformation, or a more general power transformation, can bring the distribution closer to normal. Below we use scikit-learn's PowerTransformer, which applies such a power transform (Yeo-Johnson by default).
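For comparison, a plain log transform can also be applied directly with NumPy. A minimal sketch, assuming the salaries are non-negative (which they are after the median fill above); log1p, i.e. log(1 + x), is used so that zero values do not cause problems:

# purely illustrative; the cell below uses PowerTransformer instead
log_salary = np.log1p(df['ConvertedSalary'])
print(log_salary.describe())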

In [32]:
from sklearn.preprocessing import PowerTransformer

log = PowerTransformer()

log.fit(df[['ConvertedSalary']])

df['log_ConvertedSalary'] = log.transform(df[['ConvertedSalary']])
# Plot the data before and after the transformation
df[['ConvertedSalary', 'log_ConvertedSalary']].hist()
plt.show()
df[['ConvertedSalary', 'log_ConvertedSalary']].head()
Out[32]:
ConvertedSalary log_ConvertedSalary
0 92565 0.409123
1 70841 0.120777
2 92565 0.409123
3 21426 -0.992503
4 41671 -0.406802

Removing Outliers

Outliers are extreme values that lie far outside the typical range of a feature. These extreme values can negatively impact a model, so it is often worth trimming them. Two common approaches are quantile-based and standard-deviation-based trimming.

Quantiles based

In [33]:
# Find the 95th percentile (the 0.95 quantile)
quantile = df['ConvertedSalary'].quantile(0.95)

# Trim the outliers
trimmed_df = df[df['ConvertedSalary'] < quantile]

# The original boxplot
_ = df[['ConvertedSalary']].boxplot()
plt.show()
plt.clf()

# The trimmed boxplot
_ = trimmed_df[['ConvertedSalary']].boxplot()

Standard Deviation based

In [34]:
mean = df['ConvertedSalary'].mean()

std = df['ConvertedSalary'].std()

cut_off = std * 3

lower, upper = mean - cut_off, mean + cut_off

trimmed_df = df[(df['ConvertedSalary'] < upper) & (df['ConvertedSalary'] > lower)]

_ = trimmed_df[['ConvertedSalary']].boxplot()

Scaling and transforming new data

If we want to use our model to predict on new data, we need to make sure that the new data goes through the same preprocessing steps.

For example, if we fit a scaler on the training data, it needs to be applied to the test data as well. Note that we transform the test data with the scaler fit on the training data, not one fit on the test data itself.

Why do we only use the training data to fit these transformations?

We do this to avoid data leakage: the test set should play no part in determining the transformation parameters (means, standard deviations, outlier cut-offs), otherwise our evaluation of the model will be overly optimistic.

In [35]:
# split the data into training and test sets (first 80% / last 20%)
train_pct_index = int(0.8 * len(df))
X_train, X_test = df[:train_pct_index], df[train_pct_index:]

train_std = X_train['ConvertedSalary'].std()
train_mean = X_train['ConvertedSalary'].mean()

cut_off = train_std * 3
train_lower, train_upper = train_mean - cut_off, train_mean + cut_off

# Trim the test DataFrame
trimmed_test_df = X_test[(X_test['ConvertedSalary'] < train_upper) \
                    & (X_test['ConvertedSalary'] > train_lower)]

_ = trimmed_test_df[['ConvertedSalary']].boxplot()
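The same rule applies when using a scaler: fit it on the training split only, then reuse the fitted object on the test split. A minimal sketch using the split above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# learn the mean and standard deviation from the training data only
scaler.fit(X_train[['ConvertedSalary']])

# apply the same transformation to both splits
train_scaled = scaler.transform(X_train[['ConvertedSalary']])
test_scaled = scaler.transform(X_test[['ConvertedSalary']])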