ONE HOT ENCODING (Including Dummy Variable Trap) and ORDINAL ENCODING
WHAT ARE CATEGORICAL VARIABLES?
Categorical variables are qualitative data in which the values are assigned to a set of distinct groups or categories. These groups may consist of alphabetic (e.g., male, female) or numeric labels (e.g., male = 0, female = 1) that do not contain mathematical information beyond the frequency counts related to group membership.
SCALE OF MEASUREMENT
Nominal: A nominal scale is used for labelling variables into distinct classifications and involves no quantitative value or order.
Example:
- Sex (Male, Female)
- Marital Status (Married, Divorced, Unmarried, Widowed etc.)
Ordinal: An ordinal scale is a variable measurement scale used to depict only the order of the variables, not the differences between them. These scales generally capture non-mathematical ideas such as frequency, satisfaction, happiness, degree of pain, etc.
Example:
How satisfied are you with our services?
Very Unsatisfied — 1
Unsatisfied — 2
Neutral — 3
Satisfied — 4
Very Satisfied — 5
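An ordinal variable like the satisfaction scale above can be represented in pandas as an ordered categorical, which records the ranking of the labels. The response values below are illustrative, not taken from an actual survey:

```python
import pandas as pd

# Hypothetical survey responses (illustrative values)
responses = pd.Series(["Neutral", "Very Unsatisfied", "Satisfied", "Neutral"])

# An ordered Categorical preserves the ranking of the labels
scale = pd.CategoricalDtype(
    categories=["Very Unsatisfied", "Unsatisfied", "Neutral",
                "Satisfied", "Very Satisfied"],
    ordered=True,
)
ordered = responses.astype(scale)

print(ordered.cat.codes.tolist())  # integer codes follow the declared order
print(ordered.min())               # ordering enables comparisons like min/max
```

Because the dtype is ordered, comparisons such as `ordered >= "Neutral"` also work, which is not the case for plain string columns.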
WHAT IS THE PROBLEM WITH CATEGORICAL DATA?
Some algorithms can work with categorical data directly.
For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).
Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.
In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.
A categorical variable is a variable whose values are labels.
For example, the variable may be colour and may take on the values red, green, and blue.
Consider a linear regression of the form
y = b0 + b1*x1 + b2*x2
Sometimes one colour has more influence on the outcome than another; for the model to capture that influence as a weight in this equation, the labels must first be converted into numbers.
Also, Machine learning algorithms and deep learning neural networks require that input and output variables are numbers.
This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model.
ONE HOT ENCODING
For categorical variables where no ordinal relationship exists, an integer encoding is insufficient at best and misleading to the model at worst.
Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
In this case, a one-hot encoding can be applied instead. The integer-encoded variable is removed and one new binary variable is added for each unique category.
- One-Hot Encoding is the process of creating dummy variables.
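The colour example from earlier can be one-hot encoded with `pd.get_dummies`, which creates one binary column per unique category (the sample DataFrame below is illustrative):

```python
import pandas as pd

# The colour example from the text: red, green, blue have no natural order
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# get_dummies creates one binary (dummy) column per unique category
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```

Note that the dummy columns are created in sorted category order (`color_blue`, `color_green`, `color_red`), and exactly one of them is 1 in each row.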
DUMMY VARIABLE TRAP
D1 = 1 if the person is male, 0 otherwise
D2 = 1 if the person is female, 0 otherwise
Since every observation falls into exactly one category:
D1 + D2 = 1, so D2 = 1 - D1
Substituting into the regression equation:
y = b0 + b1*D1 + b2*D2
y = b0 + b1*D1 + b2*(1 - D1)
y = b0 + b1*D1 + b2 - b2*D1
y = (b0 + b2) + (b1 - b2)*D1
Since the value of one dummy variable can be calculated exactly from the other, the two features are perfectly correlated (multicollinear). The model cannot assign a unique weight to each of them, which makes the fitted coefficients unstable and adds error to the model's output. The fix is to drop one of the dummy columns:
pd.get_dummies(df, columns=["color"], drop_first=True)
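As a minimal runnable sketch (the DataFrame and column name are illustrative), `drop_first=True` removes the first dummy column; the dropped category becomes the baseline, so the remaining columns are no longer linearly dependent:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first=True drops the first dummy column ("color_blue" here);
# blue becomes the baseline encoded as all zeros, avoiding the trap
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(encoded.columns.tolist())  # one fewer column than unique categories
```

With k categories this produces k - 1 dummy columns, which is all a linear model needs: the baseline category is implied when every dummy is 0.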
ORDINAL ENCODING
When categorical features in the dataset contain variables with intrinsic natural order such as Low, Medium and High, these must be encoded differently than nominal variables.
import pandas as pd

df = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High", "Low", "Medium", "High", "Low"]})
print(df)
Score
0 Low
1 Low
2 Medium
3 Medium
4 High
5 Low
6 Medium
7 High
8 Low
scale_mapper = {"Low": 1, "Medium": 2, "High": 3}
df["Scale"] = df["Score"].replace(scale_mapper)
print(df)
Score Scale
0 Low 1
1 Low 1
2 Medium 2
3 Medium 2
4 High 3
5 Low 1
6 Medium 2
7 High 3
8 Low 1
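An alternative to the manual mapping above, assuming scikit-learn is available, is `OrdinalEncoder` with the category order passed explicitly. Note that it numbers categories from 0 rather than 1:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Score": ["Low", "Medium", "High", "Low"]})

# Pass categories explicitly so the integer codes follow the intended
# order Low < Medium < High (codes start at 0, not 1)
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["Scale"] = encoder.fit_transform(df[["Score"]])
print(df)
```

Without the `categories` argument, `OrdinalEncoder` orders categories alphabetically (High < Low < Medium), which would silently break the intended ranking.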