Learn Python – How to One-Hot Encode Sequence Data in Python – Basic and Advanced

In this tutorial, we will learn how to convert input or output sequence data to a one-hot encoding for use in sequence classification.

One-hot encoding is a useful technique in machine learning because many machine learning algorithms cannot work with categorical data directly. While working with datasets, we often come across columns whose values have no particular order of preference.

If we are working on a sequence classification problem, the categorical data must be converted to numbers. This approach is also used when we work with deep learning methods such as Long Short-Term Memory (LSTM) recurrent neural networks.

First, we will discuss categorical data.

What is Categorical Data?

Categorical data are variables that hold label values rather than numerical values. Such variables are often called nominal. Let’s see the following examples of categorical data.

A “car” variable with the values: “Maruti” and “Jaguar”.

A “food” variable with the values: “Veg”, and “Non-Veg”.

A “place” variable with the values: “first”, “second”, and “third”.

As we can see in the above examples, some categories may have a natural relationship to each other, such as a natural ordering. In the third example, the “place” variable has a natural ordering of values.

Problem with Categorical Data

Some machine learning algorithms can work with categorical data directly. Many algorithms, however, cannot operate on label data directly because they require all input variables and output variables to be numeric.

Therefore, we have to convert categorical data into numerical form. If the categorical variable is an output variable, you may also want to convert the model’s predictions back into categorical form in order to present them or use them in some application.
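As a minimal sketch of that last step, assume a hypothetical model that outputs one score per class. The predicted vector can then be mapped back to a label by taking the index of the largest score. The label list and the scores below are made up for illustration.

```python
# Hypothetical class labels, in the order used for encoding.
labels = ["first", "second", "third"]

# Hypothetical model output: one score per class (made-up numbers).
predicted_scores = [0.1, 0.7, 0.2]

# The index of the largest score identifies the predicted class.
predicted_index = max(range(len(predicted_scores)),
                      key=predicted_scores.__getitem__)
predicted_label = labels[predicted_index]
print(predicted_label)  # second
```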

How to Convert Categorical Data to Numeric Data

There are two techniques used to convert categorical data into numerical data.

Integer Encoding

One-Hot Encoding

In the next section, we will discuss one-hot encoding.

What is One Hot Encoding?

A one-hot encoding is used to convert categorical variables into numeric values. First, the categorical values are mapped to integer values. Then, each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1. Each column therefore contains a 0 or a 1, depending on which category the value belongs to.

Example of a One Hot Encoding

Let’s understand it using the following simple example.

Suppose we have a sequence of labels with the values ‘yellow’ and ‘red’. To convert them into numerical values, we assign ‘yellow’ the integer value 1 and ‘red’ the integer value 0. Whenever we encounter these labels, we assign the same integer value. This is called integer encoding.
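The integer encoding just described can be sketched in a few lines. The example sequence is an assumption chosen for illustration; the mapping follows the text (yellow as 1, red as 0).

```python
# Sequence of colour labels to encode (illustrative data).
colors = ["yellow", "red", "red", "yellow"]

# Fixed mapping from label to integer.
color_to_int = {"red": 0, "yellow": 1}

# Every occurrence of a label gets the same integer.
int_encoded = [color_to_int[c] for c in colors]
print(int_encoded)  # [1, 0, 0, 1]
```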

Let’s see another example. Suppose there is a category called animal that has four values: Cat, Dog, Cow, and Camel. Consider the following table, which lists the animals and their corresponding categorical values.

Input Table –

Animal    Categorical Value of Animal
Cat       5
Dog       10
Cow       15
Camel     11

The output after one-hot encoding is shown below.

Cat    Dog    Cow    Camel
1      0      0      0
0      1      0      0
0      0      1      0
0      0      0      1

If we represent the above output in vector form, it looks like this.

Cat -> [1, 0, 0, 0]

Dog -> [0, 1, 0, 0]

Cow -> [0, 0, 1, 0]

Camel -> [0, 0, 0, 1]
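The vectors above can be reproduced with a short pure-Python sketch. The helper name one_hot is hypothetical; it places a single 1 at the position of the category in the list.

```python
animals = ["Cat", "Dog", "Cow", "Camel"]

# The position of each category in the list serves as its index.
animal_to_index = {name: i for i, name in enumerate(animals)}

def one_hot(name):
    """Return a list of zeros with a single 1 at the category's index."""
    vector = [0] * len(animals)
    vector[animal_to_index[name]] = 1
    return vector

for name in animals:
    print(name, "->", one_hot(name))
```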

Why use a One Hot Encoding?

One of the main advantages of one-hot encoding is that it makes the representation of categorical data more expressive. As we discussed earlier, many machine learning algorithms cannot work with categorical data directly, so it needs to be converted to numbers.

We can use the integer values directly where appropriate. This can work for problems where there is a natural ordinal relationship between the categories. For example, we can assign integer values to a “weather” label with values such as ‘winter’, ‘summer’, and ‘monsoon’.

But there may be problems when no ordinal relationship exists. If we allow the representation to imply such a relationship, the model may rely on an ordering that is not really there, which can harm learning.
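One way to see the issue is to compare distances between encoded categories. The sketch below uses three hypothetical colour labels with no natural order. With integer codes, some pairs end up farther apart than others, while with one-hot vectors every pair is equally distant.

```python
import math

# Three hypothetical categories with no natural order.
int_codes = {"red": 0, "green": 1, "blue": 2}
one_hot_codes = {"red": [1, 0, 0], "green": [0, 1, 0], "blue": [0, 0, 1]}

def int_distance(a, b):
    """Absolute difference between the integer codes."""
    return abs(int_codes[a] - int_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors."""
    return math.dist(one_hot_codes[a], one_hot_codes[b])

# Integer codes suggest red is closer to green than to blue.
print(int_distance("red", "green"), int_distance("red", "blue"))  # 1 2

# One-hot vectors keep all pairs at the same distance.
print(one_hot_distance("red", "green") == one_hot_distance("red", "blue"))  # True
```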

Manual One Hot Encoding

In the following example, we take a string of alphabet characters and convert it first into integer values and then into a one-hot encoding.

hello python

Now, we will apply one-hot encoding to the string above. Let’s see the following example.

Example –

from numpy import argmax
# Define the input string
str_data = 'hello python'
print(str_data)
# Define the universe of possible input characters: lowercase English letters and space
eng_alphabet = 'abcdefghijklmnopqrstuvwxyz '
# Define mappings from characters to integers and back
char_to_int = dict((c, i) for i, c in enumerate(eng_alphabet))
int_to_char = dict((i, c) for i, c in enumerate(eng_alphabet))
# Integer encode the input data
int_encoded = [char_to_int[char] for char in str_data]
print(int_encoded)
# One-hot encode each integer
onehot_encoded = list()
for value in int_encoded:
    letter = [0 for _ in range(len(eng_alphabet))]
    letter[value] = 1
    onehot_encoded.append(letter)
print(onehot_encoded)
# Invert the encoding of the first character
inverted = int_to_char[argmax(onehot_encoded[0])]
print(inverted)

Output:

hello python

[7, 4, 11, 11, 14, 26, 15, 24, 19, 7, 14, 13]

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

h

Explanation:

In the above code, we have declared the input string and printed it. Next, we defined the universe of possible input values. Then, a mapping of all possible inputs is created from the character values to integer values. We used this mapping to encode the input string.

As we can see in the above output, the first letter h is encoded as 7. Then, this integer encoding is converted to a one-hot encoding, one integer-encoded character at a time.

Each character has a unique index value; we mark that index in the character’s vector as 1. The first character, h, is encoded as the integer 7, so it is represented by a binary vector of length 27 in which index 7 is set to 1. Finally, we invert the encoding of the first character with the NumPy argmax() function, which returns the index of the 1, and map that index back to h.
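The example inverts only the first character. As a sketch of a full round trip, the hypothetical helpers encode and decode below wrap the same steps in plain Python and recover the whole string. Characters outside the assumed 27-character alphabet would raise a KeyError.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz "
char_to_int = {c: i for i, c in enumerate(alphabet)}

def encode(text):
    """One-hot encode each character of text as a list of 0/1 values."""
    vectors = []
    for ch in text:
        vec = [0] * len(alphabet)
        vec[char_to_int[ch]] = 1
        vectors.append(vec)
    return vectors

def decode(vectors):
    """Recover the text by locating the 1 in each vector."""
    return "".join(alphabet[vec.index(1)] for vec in vectors)

encoded = encode("hello python")
print(decode(encoded))  # hello python
```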

Now, we will learn to implement one-hot encoding using the scikit-learn library.

One Hot Encode using Scikit-learn

In this example, let’s assume an output sequence made up of the following three labels.

"apple"  
"mango"  
"banana"  

An example sequence of 8 time steps may be:

apple, apple, mango, apple, banana, banana, mango, apple.  

We first encode the above labels with integer values, such as 1, 2, 3. In the one-hot encoding, each integer is then represented by a binary vector with three values, such as [1, 0, 0]. The sequence includes at least one example of every possible value, so the mapping can be learned from the data itself.

We will use the scikit-learn library: the LabelEncoder class to create an integer encoding of the labels, and the OneHotEncoder class to create a one-hot encoding of the integer-encoded values.

Let’s understand the following example.

Example –

from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# Define the example sequence
data_1 = ['apple', 'apple', 'mango', 'apple', 'banana', 'banana', 'mango', 'apple']
values_of_seq = array(data_1)
print(values_of_seq)
# First, apply integer encoding
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values_of_seq)
print(integer_encoded)
# Then, apply one-hot (binary) encoding
# Note: scikit-learn 1.2+ uses sparse_output=False; older versions use sparse=False
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

Output:

['apple' 'apple' 'mango' 'apple' 'banana' 'banana' 'mango' 'apple']
[0 0 2 0 1 1 2 0]
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]

Explanation –

In the above code, we first printed the sequence of labels. Then, we carried out the integer encoding and finally the one-hot encoding. By default, the OneHotEncoder class returns a more efficient sparse encoding. This may not be suitable for some applications, such as use with the Keras library, so in this example we disabled the sparse output when creating the encoder and received a dense array instead.
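In current scikit-learn releases, OneHotEncoder can also be fitted on string labels directly, and its inverse_transform() method maps the vectors back to labels. The sketch below is one possible variant, not the only way to do it. It keeps the default sparse output and converts it with toarray(), which avoids depending on the name of the sparse parameter.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array(['apple', 'apple', 'mango', 'apple',
                   'banana', 'banana', 'mango', 'apple'])

# OneHotEncoder expects a 2-D array: one row per sample, one column per feature.
column = labels.reshape(-1, 1)

encoder = OneHotEncoder()
# The default result is a SciPy sparse matrix; toarray() makes it dense.
onehot = encoder.fit_transform(column).toarray()

# categories_ holds the learned labels, in sorted order.
print(encoder.categories_[0])

# inverse_transform maps the one-hot vectors back to the original labels.
decoded = encoder.inverse_transform(onehot).ravel()
print(decoded)
```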

One Hot Encoding with Keras

Let’s suppose we have a sequence that is already integer encoded. We can work with the integers directly or one-hot encode them. Keras provides the to_categorical() function to one-hot encode integer data.

In this example, we have five integer values [0, 1, 2, 3, 4] and an input sequence of the following 15 numbers.

data_1 = [1, 4, 3, 3, 0, 3, 2, 2, 4, 0, 1, 2, 1, 4, 3]  

Let’s understand the following example.

Example –

from numpy import array  
from numpy import argmax  
from keras.utils import to_categorical  
# define example  
data_1 = [1, 4, 3, 3, 0, 3, 2, 2, 4, 0, 1, 2, 1, 4, 3]  
data = array(data_1)  
print(data)  
# one hot encoding using the to_categorical() method  
encoded = to_categorical(data)  
print(encoded)  
# invert encoding  
inverted = argmax(encoded[0])  
print(inverted)  

Output:

[1 4 3 3 0 3 2 2 4 0 1 2 1 4 3]
[[0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0.]]
1

Explanation –

In the above code, we one-hot encoded the integers as binary vectors and printed them. Then, we used the NumPy argmax() function to invert the encoding of the first value in the sequence.
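The example inverts only the first value. As a sketch in plain NumPy (an equivalent of the idea, not Keras’s own implementation), the same encoding can be built by indexing an identity matrix, and the whole sequence can be inverted with argmax() along each row. The variable num_classes is an assumption: the number of classes is taken to be known in advance.

```python
import numpy as np

data = np.array([1, 4, 3, 3, 0, 3, 2, 2, 4, 0, 1, 2, 1, 4, 3])
num_classes = 5  # assumed to be known in advance

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the data selects one row per time step.
encoded = np.eye(num_classes)[data]

# argmax along each row recovers the original integers.
decoded = encoded.argmax(axis=1)
print(decoded)
```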