A Beginner's Guide to Machine Learning in Python. Part I: Feature Engineering

This is the first issue of the machine learning guide for Python. It covers feature engineering from top to bottom in detail, and a practical workbook is included at the end.


This guide covers feature engineering from top to bottom. This first part covers the basic concepts as well as some more advanced techniques, and the final section takes you through a practical example, alongside links to a free notebook and dataset for practicing feature engineering.

Introduction

Feature engineering is the process of identifying, transforming, and creating features, both from existing variables and from scratch, to help improve the performance of a machine learning model. All data scientists, experienced or not, will at some point need to transform raw data into valuable representations that modeling algorithms can process. These representations are features, and feature engineering is the process of creating them with custom logic so downstream ML algorithms can use them. This article introduces feature engineering as a set of techniques for preparing features from raw inputs. We'll cover its importance, examples, usefulness, and some potential pitfalls to avoid.

Why is Feature Engineering Important?

Machine learning algorithms are excellent at identifying patterns in data. They are, however, not good at extracting signal from data that lacks obvious structure. When an algorithm takes in a large amount of raw data without useful features, it can produce inaccurate predictions and perform poorly. Feature engineering helps you develop features that are genuinely useful to your algorithms, resulting in greater accuracy and better results.

What Does Feature Engineering Involve?

Feature engineering is the process of creating new features out of raw data. These features can be numbers, categorical values, images, or other variables within your data; for example, each column in a dataset is a feature of whatever the dataset is trying to explain. Feature engineering looks at the raw data and attempts to identify and create useful features from it. It is often the hardest part of data science, and it's something that most data science students will struggle with at some point.

The next part of this guide focuses on the techniques that will help identify new features, process existing ones, and create brand-new features.

Types Of Features

Generally, we can think of several different types of data:

Numerical Data, Categorical Data, Date & Time, Text Data, and Media (Images/Video/Audio). Most features in a dataset belong to one of these categories.

Categorical features are typically non-numeric and discrete.

For example, a box might have a particular shape and color, both of which are categorical features.

At the same time, the continuous (numerical) features of the same box might be its measurements: height, width, and depth.

Feature Engineering Techniques In Python

As discussed above, there are several broad types of features. Let's walk through common engineering techniques for each type, starting with numerical data.

Engineering Features for Numerical Data

Numerical data usually comes in the form of numbers that measure something, such as temperature, population, or expenses.

For example, 24 °C, 2,000,000 people, or $100,000. We'll cover some of the most commonly used feature engineering techniques for numerical data in this section.

Rescaling Numeric Features

Rescaling is a common preprocessing task in machine learning. There are several rescaling techniques, but one of the simplest is called min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale every value into a fixed range, typically [0, 1], via x' = (x - min) / (max - min).

Let’s look at an example:

Let’s start by creating an array of sales figures:

# Load libraries 
import numpy as np 
from sklearn import preprocessing

# Create Sales
sales = np.array([[-200],[-10],[50],[1000],[15],[20],[30],[50],[100],[200],[10000],[-12000],[150000],[160000]])

#View Sales
sales
Output:
[[  -200]
[   -10]
[    50]
[  1000]
[    15]
[    20]
[    30]
[    50]
[   100]
[   200]
[ 10000]
[-12000]
[150000]
[160000]]

Now let’s use the MinMaxScaler to scale this data:

# Create a scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))

#Scale feature
scaled_sales = minmax_scale.fit_transform(sales)

#Show feature
scaled_sales

Output:

array([[0.06860465],
[0.0697093 ],
[0.07005814],
[0.0755814 ],
[0.06985465],
[0.06988372],
[0.06994186],
[0.07005814],
[0.07034884],
[0.07093023],
[0.12790698],
[0.        ],
[0.94186047],
[1.        ]])

The scikit-learn 'MinMaxScaler' offers two ways to rescale a feature. One option is to use 'fit' to calculate the minimum and maximum values of the feature and then use 'transform' to rescale it. The second option is to use 'fit_transform()' to do both operations in one step. There is no mathematical difference between the two, but it is often useful to perform the steps separately, for example fitting the scaler on training data and then applying the same transformation to new data.
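
For instance, here is a minimal sketch of fitting the scaler once on our sales array and reusing it on new values (new_sales is just a hypothetical array of unseen observations):

# Fit the scaler on the training data only
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))
minmax_scale.fit(sales)

# Reuse the same min/max to rescale new, unseen values
new_sales = np.array([[500],[75000]])
minmax_scale.transform(new_sales)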

Standardizing Features

A common alternative to min-max scaling is to rescale features so that they are approximately standard normally distributed. To accomplish this, we standardize the data so that each feature has a mean of 0 and a standard deviation of 1: each value x becomes z = (x - mean) / standard deviation.

Let's do this by creating a standard scaler object and then running the data through it:

#Create a scaler
std_scaler = preprocessing.StandardScaler()
std_sales = std_scaler.fit_transform(sales)

# Show feature standardized
std_sales

Output:

array([[-0.40932764],
[-0.40583847],
[-0.40473663],
[-0.38729081],
[-0.40537937],
[-0.40528755],
[-0.40510391],
[-0.40473663],
[-0.40381843]])

The transformed feature shows how many standard deviations the original value is from the feature's mean (also called a z-score in statistics).
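
As a quick sanity check, we can reproduce the same z-scores by hand with NumPy (StandardScaler divides by the population standard deviation, which is also NumPy's default):

# Manual z-scores: subtract the mean, divide by the standard deviation
manual_z = (sales - sales.mean()) / sales.std()

# The first few values match the StandardScaler output above
manual_z[:3]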

In my experience, standardization is more frequently used than min-max scaling as the default scaling technique for machine learning preprocessing.

But the effects might vary based on the learning algorithm. For instance, standardization frequently improves the performance of principal component analysis, and min-max scaling is typically advised for neural networks.

Normalizing

Normalization is another method of feature scaling. We use it most often when the data is not skewed along either axis or when it does not follow a Gaussian distribution.

By converting data features with different scales to a common scale, normalization further simplifies the data for modeling, so that each feature (variable) tends to have a similar impact on the final model. Note that scikit-learn's 'Normalizer', used below, works on individual observations, rescaling each row rather than each column.

For this example, let's try to work with some different data.

# Load libraries 
import numpy as np 
from sklearn.preprocessing import Normalizer

# Create feature matrix
x = np.array([[2.5, 1.5],[2.1, 3.4], [1.5, 10.2], [4.63, 34.4], [10.9, 3.3], [17.5,0.8], [15.4, 0.7]])

# Create normalizer
normalizer = Normalizer(norm="l2")

# Transform the feature matrix
normalizer.transform(x)

Output:

array([[0.85749293, 0.51449576],
[0.52549288, 0.850798  ],
[0.14549399, 0.98935914],
[0.13339025, 0.99106359],
[0.95709822, 0.28976368],
[0.99895674, 0.04566659],
[0.99896854, 0.04540766]])

Here we can see that all values now lie between 0 and 1, and each row has been rescaled to unit length.
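
More precisely, Normalizer(norm="l2") rescales each row (each observation) so that its Euclidean length is 1, which we can verify by hand:

# Dividing each row by its own L2 norm gives the same result
x / np.linalg.norm(x, axis=1, keepdims=True)

# Every transformed row now has unit length
np.linalg.norm(normalizer.transform(x), axis=1)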

There are many more techniques and transformations for engineering numerical data that you can perform. We’ll take a look at a few of these in the last part of this guide.

Engineering Features for Categorical Data

Categorical data measures something qualitatively or classifies things into groups. Categorical data can be of two types:

Ordinal data, i.e., data that follows some natural order. For example, temperature can be cold, average, or hot.

Nominal data, which classifies something into groups or categories with no inherent order, for example, male or female.

In this section, we’ll see how to deal with both of these.

Encoding Ordinal Data

Encoding is the process of converting ordinal data into a numeric format so that a machine learning algorithm can make sense of it. To transform ordinal data into numeric data, we typically convert each class into a number; for example, cold, average, and hot are mapped to 1, 2, and 3 respectively. Let’s see how we can do this easily.

Let’s start by importing pandas and creating the dataset.

#Importing libraries
import pandas as pd

#Creating the data
data = pd.DataFrame({"Temperature":["Very Cold", "Cold", "Warm","Hot", "Very Hot"]})

print(data)

Now let's map the data to numerical values.

#Mapping to numerical data
scale_map = {"Very Cold": -3,
             "Cold": -1,
             "Warm": 0,
             "Hot" : 1,
             "Very Hot" : 3}

#Replacing with mapped values
data_mapped = data["Temperature"].replace(scale_map)
data["encoded_temp"] = data_mapped
data

Output:

  Temperature  encoded_temp
0   Very Cold            -3
1        Cold            -1
2        Warm             0
3         Hot             1
4    Very Hot             3

As you can see, I've mapped a numerical value to each of the observations. Notice that I've assigned -3 for Very Cold and -1 for Cold. Mapping this way makes the feature more effective, because the numerical values I've assigned reflect the feature's real characteristics.
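
If you would rather not write the mapping by hand, pandas can also encode an ordered categorical for you. Here is a minimal sketch using the same data frame; note that the resulting codes simply run 0 to 4 in the stated order rather than our custom -3 to 3 scale, and the temp_codes column name is just illustrative:

# Encode using an ordered pandas Categorical; .codes gives 0-4 in the stated order
order = ["Very Cold", "Cold", "Warm", "Hot", "Very Hot"]
data["temp_codes"] = pd.Categorical(data["Temperature"],
                                    categories=order,
                                    ordered=True).codes
data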

One Hot Encoding Nominal Data

In one-hot encoding, we convert each class of nominal data into its own feature and assign a binary value of 1 or 0 to indicate whether that feature is true or false. Let’s see how this can be done using both the LabelBinarizer in scikit-learn and pandas.

Importing Libraries and Creating the Data

#Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
# Create the dataset as a DataFrame
color_data = pd.DataFrame({"itemid": ["A1","B1","C2","D4","E9"],
                           "color": ["red","blue","green","yellow","pink"]})

Encoding this data with LabelBinarizer()

# Creating one-hot encoder
one_hot = LabelBinarizer() 

# One-hot encode the data and assign to a var
color_encoding = one_hot.fit_transform(color_data.color)

# Get the feature classes (one per color)
color_new = one_hot.classes_

#creating new Data Frame with encoded values 
encoded = pd.DataFrame(color_encoding)
encoded.columns = color_new

# Merging the encoded values with the original data
# (the original color column is kept so we can compare)
color_data_new = pd.concat([color_data, encoded], axis = 1)

#Viewing new data
print(color_data_new)

Output:

   itemid   color  blue  green  pink  red  yellow
0     A1     red     0      0     0    1       0
1     B1    blue     1      0     0    0       0
2     C2   green     0      1     0    0       0
3     D4  yellow     0      0     0    0       1
4     E9    pink     0      0     1    0       0

We can also do it with pandas, which is much quicker, but it is less flexible.

#Creating encoded df
encoded_pd  = pd.get_dummies(color_data.color)

# Merging the encoded values with the original data (again keeping the color column)
color_data_pd = pd.concat([color_data, encoded_pd], axis = 1)

#Viewing new data
print(color_data_pd)
Output:

   itemid   color  blue  green  pink  red  yellow
0     A1     red     0      0     0    1       0
1     B1    blue     1      0     0    0       0
2     C2   green     0      1     0    0       0
3     D4  yellow     0      0     0    0       1
4     E9    pink     0      0     1    0       0

It is good practice to drop one of the features after one-hot encoding to reduce linear dependence between the encoded columns.

#Dropping one of the encoded columns
color_data_pd.drop("yellow",axis =1, inplace = True)
color_data_pd
|ItemID| Color|Blue|Green|Pink|Red|
| -----| ---  | ---| --- |--- |---|
| A1   | red  | 0  |  0  | 0  | 1 |
| B1   | blue | 1  |  0  | 0  | 0 |
| C2   | green| 0  |  1  | 0  | 0 |
| D4   |yellow| 0  |  0  | 0  | 0 |
| E9   | pink | 0  |  0  | 1  | 0 |
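
With pandas, the same thing can be done in one step via the drop_first argument (scikit-learn's OneHotEncoder accepts drop="first" for the same purpose). Note that this drops the first category alphabetically, blue in this case, rather than a column of your choosing:

# One-hot encode and drop the first category level in one step
encoded_dropped = pd.get_dummies(color_data.color, drop_first=True)
print(encoded_dropped)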

Feature Engineering Workflow in Python

Now let’s test our skills with a practical example using this notebook.

Our task is to identify and convert the features. I'm assuming a basic understanding of Python and data science packages like pandas and NumPy.

Data and Problem Statement

Before continuing to the next part, take a look at the example case we are going to work on and download this dataset.

Our dataset contains information collected from a survey of 3,000 people: 3,000 rows and 18 columns.

The data is fake, and much of the information might not make sense, but I generated this dataset specifically for this guide, and it's good enough for practicing and implementing what we've learned so far.

Solution and My Approach:

Basic transformations:

Once I have imported the necessary libraries and the dataset, I do the following (a sketch of these steps in code follows the list):

  • I start off by checking for any missing values and perform the necessary imputations.
  • I summarize the data to get descriptive statistics, and I create some visualizations to identify outliers before starting any analysis.
  • I also like to identify the numerical and categorical columns and assign them to two separate variables; my workflow often consists of dealing with each type of data one at a time.
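
Here is a rough sketch of those first steps, assuming the survey file is named survey.csv (the file name and the median imputation are purely illustrative choices):

# Load libraries and the dataset
import pandas as pd
df = pd.read_csv("survey.csv")

# Check for missing values and impute numeric gaps with the column median
print(df.isnull().sum())
df = df.fillna(df.median(numeric_only=True))

# Descriptive statistics to spot outliers
print(df.describe())

# Separate the numerical and categorical columns
numeric_df = df.select_dtypes(include="number")
categorical_df = df.select_dtypes(exclude="number")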

So, the following are the features that I came up with:

Numerical Data

  • Looking at age, we can group people into age bands and encode them by creating a new "age group" column, or we can discretize age directly into bins (see the sketch after this list).
  • With weight and height, we can calculate BMI, which is weight in kg divided by height in metres squared, and create a new BMI feature.
  • We can then group people into categories based on BMI and encode those values.
  • We can also group people based on income.
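
Here is a rough sketch of what those transformations could look like, assuming the survey frame is called df and has age, weight (kg), and height (m) columns; the column names, bin edges, and labels are purely illustrative:

# Discretize age into labelled bins to create an "age group" feature
age_bins = [0, 18, 35, 50, 65, 120]
age_labels = ["child", "young adult", "adult", "middle aged", "senior"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels)

# BMI = weight (kg) / height (m) squared
df["bmi"] = df["weight"] / df["height"] ** 2

# Group BMI into categories and encode them
bmi_bins = [0, 18.5, 25, 30, 100]
bmi_labels = [0, 1, 2, 3]  # underweight, normal, overweight, obese
df["bmi_group"] = pd.cut(df["bmi"], bins=bmi_bins, labels=bmi_labels)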

Learning More

Feature engineering is an important and ongoing part of any machine-learning project. Even when you are choosing ready-made features, you need to manipulate them to create the best possible features for your algorithms. Typically, you will start feature engineering with a raw dataset and end up with features that are ready for use. You may have to recreate features for different algorithms.

You can check out the notebook for this guide alongside the dataset and the examples I used over at GitHub.

This guide is based on “Machine Learning with Python Cookbook” by Chris Albon. It is a great resource that I use from time to time as a reference for machine learning projects.

Here’s a list of free Resources to help you in your machine-learning journey.

👍
All the best on your learning journey!