
Categorical Encoding in Machine Learning

Understanding why machines require categorical data encoding

Swapnil Kangralkar
4 min read · Dec 22, 2020

If you are reading this, I am assuming you already know what encoding means. Nevertheless, I’ll give a brief intro for those who are new to data science.

Note: Throughout this article, the terms features, columns, and variables are used interchangeably.

Data is classified as below:

[Image created by the author: classification of data types]

The What and Why

1. What is Categorical Encoding and why do you need it?

Many machine learning algorithms cannot work with categorical data directly, so whenever you have categorical columns, you have to convert them to a numerical type that the algorithm can understand and process. This process is called categorical encoding.

There are two common approaches to categorical encoding:

I. Label Encoding, and II. One-Hot Encoding.

I. Label Encoding

Consider a dataset with a ‘Names’ column containing the values Akon, Bkon, Ckon, and Dkon. ‘Names’ is a categorical feature (column). More specifically, it is a nominal feature, since the names have no inherent order or rank.

Now, when you perform label encoding in Python with scikit-learn, the names are encoded in alphabetical order: Akon is encoded as 0, Bkon as 1, Ckon as 2, and Dkon as 3.
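Here is a minimal sketch using scikit-learn’s LabelEncoder; the small DataFrame is an assumed stand-in for the names table from the example:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the names table from the example
df = pd.DataFrame({'Names': ['Dkon', 'Akon', 'Ckon', 'Bkon']})

# LabelEncoder assigns integers in alphabetical order of the classes
le = LabelEncoder()
df['Names_encoded'] = le.fit_transform(df['Names'])

print(df)
#   Names  Names_encoded
# 0  Dkon              3
# 1  Akon              0
# 2  Ckon              2
# 3  Bkon              1
```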

The algorithm may then infer an ordinal relationship between the names, such as Akon < Bkon < Ckon < Dkon. You do not want this to happen. To avoid this issue, you can use One-Hot Encoding.

II. One-Hot Encoding

One-Hot Encoding is the process of creating dummy variables. In this encoding technique, each name (class) is represented as one feature.

That is, Akon is represented by column 0, Bkon by column 1, Ckon by column 2, and Dkon by column 3. Each row has a 1 in the column for its name and a 0 everywhere else.
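A minimal sketch with pandas’ get_dummies (scikit-learn’s OneHotEncoder would work just as well):

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Akon', 'Bkon', 'Ckon', 'Dkon']})

# Each unique name becomes its own 0/1 indicator column
dummies = pd.get_dummies(df['Names'], dtype=int)

print(dummies)
#    Akon  Bkon  Ckon  Dkon
# 0     1     0     0     0
# 1     0     1     0     0
# 2     0     0     1     0
# 3     0     0     0     1
```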

Note: One-Hot Encoding can result in the dummy variable trap, a scenario in which the dummy features are highly correlated with each other. The dummy variable trap leads to a problem known as multicollinearity, which occurs when one feature can be predicted from the others. Multicollinearity is a serious issue for machine learning algorithms like Linear Regression and Logistic Regression.

Therefore, in order to overcome the problem of multicollinearity, one of the dummy variables (features) needs to be dropped. This is safe because the dropped column can easily be predicted from the remaining ones: in this case, if the name is not Bkon, Ckon, or Dkon, it is definitely Akon (assuming the column has no missing values).

So you will be left with only 3 columns, and the model will represent Akon as 0–0–0. Just modify the code as below:
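With pandas this is just the drop_first flag (scikit-learn’s OneHotEncoder offers the equivalent drop='first' option):

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Akon', 'Bkon', 'Ckon', 'Dkon']})

# drop_first=True drops the first dummy column (Akon),
# which avoids the dummy variable trap
dummies = pd.get_dummies(df['Names'], dtype=int, drop_first=True)

print(dummies)
#    Bkon  Ckon  Dkon
# 0     0     0     0   <- Akon is represented by all zeros
# 1     1     0     0
# 2     0     1     0
# 3     0     0     1
```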

Now comes the very important question:

The When

2. When to use Label Encoding vs One-Hot Encoding?

Apply Label Encoding when:

  1. The categorical feature is ordinal (like t-shirt sizes: Small, Medium, Large). You can encode Small as 1, Medium as 2, and Large as 3, or vice-versa (see the sketch after this list).
  2. The number of unique classes in the categorical feature is quite large, since one-hot encoding would create too many columns. For example, if you have 1,000 different names and you apply one-hot encoding (dropping one dummy), you end up with 999 new columns.
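For the ordinal case in point 1, a minimal sketch that encodes the order explicitly (the size_order mapping here is a hypothetical example) rather than relying on LabelEncoder’s alphabetical order:

```python
import pandas as pd

df = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

# Define the order explicitly; alphabetical encoding would get it wrong
# (Large=0, Medium=1, Small=2)
size_order = {'Small': 1, 'Medium': 2, 'Large': 3}
df['Size_encoded'] = df['Size'].map(size_order)

print(df)
#      Size  Size_encoded
# 0   Small             1
# 1   Large             3
# 2  Medium             2
# 3   Small             1
```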

Quick tip: When you have a large number of unique classes in a single categorical feature, you can aggregate the value counts to find the 20–30 most frequent classes and label encode those, while the remaining sparse classes can simply be labeled as ‘others’, as sketched below. This again depends on your data and other project requirements; there is never one solution that fits all in data science.
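A minimal sketch of that tip, assuming a hypothetical high-cardinality ‘Names’ column (a top-2 cutoff here stands in for the 20–30 mentioned above):

```python
import pandas as pd

# Hypothetical high-cardinality feature
df = pd.DataFrame({'Names': ['Akon'] * 5 + ['Bkon'] * 3 + ['Ckon', 'Dkon']})

# Keep the most frequent classes and bucket the rest as 'others'
top_classes = df['Names'].value_counts().nlargest(2).index
df['Names'] = df['Names'].where(df['Names'].isin(top_classes), 'others')

print(df['Names'].value_counts())
# Akon      5
# Bkon      3
# others    2
```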

Apply One-Hot Encoding when:

  1. The number of unique classes in the categorical feature is small.
  2. The categorical feature is not ordinal.

That’s all folks! Thank you for reading. Any feedback will be highly appreciated. You can get in touch with me via LinkedIn.

Written by Swapnil Kangralkar

Technical Manager | Data Scientist | Professor. Visit https://swapnilklkar.github.io for more.
