Can we detect a human's emotion given their messages on social media?
A common natural language processing task is text classification, and its most familiar form is labeling a piece of text as positive or negative. In this project, we'll tackle a slightly harder problem: given a person's Facebook message, determine the individual's emotional state. This form of natural language processing is known as sentiment analysis. Note: You can find the accompanying code in the Colab Notebook link below. I highly encourage you to fork it, tweak the parameters, or try the model with your own dataset!
The Data
The data used for this project comes from Facebook AI's EmpatheticDialogues dataset. The training split contains 84,169 entries across 8 columns of non-null values, including a user's prompt (the initial message) and a second user's utterance (the response to that message), with the prescribed emotion stored in the context column. We'll only use the prompt and context columns for training our model. The link to the raw data can be found here.
Why use Sentiment Analysis?
Sentiment analysis tools help companies detect and understand how their customers feel, and those insights can be used to improve the customer experience and customer service. Some common use cases include chatbots, planning product improvements, and prioritizing customer service issues.
Related Work
Research reports related to this project were reviewed prior to the data preprocessing and model implementation stages. These reports were assessed to understand potential pitfalls in building a text classification model. Any code sourced from them is cited within the Colab notebook. Links to these reports are found below.
Setup files
We'll use a Google Colab notebook for this project. First, import the libraries and download the dataset files using the wget command.
# import libraries
import os
import random
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

Download the dataset from Facebook AI's public files.
!wget "https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz"

!tar -xvf empatheticdialogues.tar.gz

Load and clean data
Let's start with the training data.
# load csv file
splitname = "train"
path = '/content/empatheticdialogues'
# create variable for training dataset
train_df = open(os.path.join(path, f"{splitname}.csv")).readlines()
# print first two indices
print(train_df[0].strip().split(","))
print(train_df[1].strip().split(","))
print(type(train_df))

Next we'll parse the dataset and extract the values from the prompt and context columns. Based on the print statements above, those values are located at index 2 and index 3 of each split line. We'll append the values to their own lists.
raw_train_prompt_list = []
raw_train_context_list = []
for i in range(1, len(train_df)):
    data_line = train_df[i].strip().split(",")
    raw_train_context_list.append(data_line[2])
    raw_train_prompt_list.append(data_line[3].replace("_comma_", ","))

Now we'll create a pandas dataframe using the two lists.
train_dataframe = pd.DataFrame(columns=['prompt','context'])
train_dataframe['prompt'] = raw_train_prompt_list
train_dataframe['context'] = raw_train_context_list
print(f"There are {len(train_dataframe)} rows in the training dataset.")
train_dataframe.head()

total_duplicate_train = sum(train_dataframe[['prompt','context']].duplicated())
print(f"There are {total_duplicate_train} duplicate prompt (message) to context (emotion) rows in the training data.")

We'll drop the duplicates to avoid biased performance estimates when training the model.
train_dataframe = train_dataframe.drop_duplicates(ignore_index=True)
train_dataframe.shape
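As a quick sanity check (a small sketch, not part of the original notebook), the deduplicated row count plus the duplicate count should recover the original number of data rows:

# assumption: train_df (the raw csv lines) and total_duplicate_train are still in scope
# len(train_df) - 1 excludes the header line
assert len(train_dataframe) + total_duplicate_train == len(train_df) - 1
print(f"{len(train_dataframe)} unique rows remain after dropping {total_duplicate_train} duplicates.")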

EDA + Pre-Processing
Let's display a random message from the training data with the accompanying sentiment, using a random index from the dataset.

idx = np.random.randint(train_dataframe.shape[0])
print('Train DF row is: ', idx)
print('Message: ', train_dataframe['prompt'].iloc[idx])
print('Sentiment: ', train_dataframe['context'].iloc[idx])

print(train_dataframe['context'].value_counts())

train_dataframe.context.value_counts().sort_values().plot(kind = 'barh', linewidth=20, figsize = [8,8], title='Sentiment Frequency Training Data')

Next, we'll group the dataset's 32 fine-grained emotion labels into 8 broader sentiment groups with a helper function.

def emotion_grouping(df_column):
    excited_list = ['excited', 'surprised', 'joyful']
    afraid_list = ['afraid', 'terrified', 'anxious', 'apprehensive']
    disgusted_list = ['disgusted', 'embarrassed', 'guilty', 'ashamed']
    annoyed_list = ['angry', 'annoyed', 'jealous', 'furious']
    grateful_list = ['faithful', 'trusting', 'grateful', 'caring', 'hopeful']
    disappointed_list = ['sad', 'disappointed', 'devastated', 'lonely', 'nostalgic', 'sentimental']
    impressed_list = ['proud', 'impressed', 'content']
    prepared_list = ['anticipating', 'prepared', 'confident']
    new_emotion_group_list = []
    for index, value in df_column.items():
        if value in excited_list:
            new_emotion_group_list.append('excited')
        elif value in afraid_list:
            new_emotion_group_list.append('afraid')
        elif value in disgusted_list:
            new_emotion_group_list.append('disgusted')
        elif value in annoyed_list:
            new_emotion_group_list.append('annoyed')
        elif value in grateful_list:
            new_emotion_group_list.append('grateful')
        elif value in disappointed_list:
            new_emotion_group_list.append('disappointed')
        elif value in impressed_list:
            new_emotion_group_list.append('impressed')
        elif value in prepared_list:
            new_emotion_group_list.append('prepared')
    return new_emotion_group_list

Use the grouping function with the context column in the training data and add the grouped column to the dataframe.

train_grouped_emotions_col = emotion_grouping(train_dataframe['context'])
train_dataframe['grouped_emotions'] = train_grouped_emotions_col

Display a side-by-side comparison of the ungrouped and grouped emotion columns.
train_dataframe[['context','grouped_emotions']].head(10)

train_dataframe['grouped_emotions'].value_counts()

train_dataframe.grouped_emotions.value_counts().sort_values().plot(kind = 'barh', linewidth=20, figsize = [8,8], title='Sentiment Frequency Training Data')
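As a quick sanity check (a small sketch, not in the original notebook), we can confirm that every label in the context column was assigned to one of the eight groups; if any emotion had been missed, the grouped list would come back shorter than the dataframe:

# assumption: emotion_grouping and train_dataframe are defined as above
grouped = emotion_grouping(train_dataframe['context'])
assert len(grouped) == len(train_dataframe), "some context values were not assigned a group"
print(sorted(set(grouped)))  # should list exactly the 8 grouped emotions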

# pre-processing validation data
splitname = "valid"
path = '/content/empatheticdialogues'
valid_df = open(os.path.join(path, f"{splitname}.csv")).readlines()
print(valid_df[0].strip().split(","))
print(type(valid_df))
# load prompt and context in pandas dataframe
raw_valid_prompt_list = []
raw_valid_context_list = []
for i in range(1, len(valid_df)):
    data_line = valid_df[i].strip().split(",")
    raw_valid_context_list.append(data_line[2])
    raw_valid_prompt_list.append(data_line[3].replace("_comma_", ","))
valid_dataframe = pd.DataFrame(columns=['prompt','context'])
valid_dataframe['prompt'] = raw_valid_prompt_list
valid_dataframe['context'] = raw_valid_context_list
# drop duplicates
valid_dataframe = valid_dataframe.drop_duplicates(ignore_index=True)
valid_dataframe.shape
# display sentiment value counts
print(valid_dataframe['context'].value_counts())
valid_dataframe.context.value_counts().sort_values().plot(kind = 'barh', linewidth=20, figsize = [8,8], title='Sentiment Frequency Validation Data')
# use emotion_grouping function
valid_grouped_emotions_col = emotion_grouping(valid_dataframe['context'])
valid_dataframe['grouped_emotions'] = valid_grouped_emotions_col
valid_dataframe[['context','grouped_emotions']].head(10)
# display grouped_emotions value counts
valid_dataframe['grouped_emotions'].value_counts()
valid_dataframe.grouped_emotions.value_counts().sort_values().plot(kind = 'barh', linewidth=20, figsize = [8,8], title='Sentiment Frequency Validation Data')

Next Steps - Test data

For the test data, we perform the same preprocessing steps and group the emotions together. We don't need to drop duplicates for the test data because we'll only use it at the inference stage, where we test our model on raw text it hasn't seen. The code below is the same as the previous steps minus dropping duplicates.
# pre-processing test data
splitname = "test"
path = '/content/empatheticdialogues'
test_df = open(os.path.join(path, f"{splitname}.csv")).readlines()
print(test_df[0].strip().split(","))
print(type(test_df))
# load prompt and context in pandas dataframe
raw_test_prompt_list = []
raw_test_context_list = []
for i in range(1, len(test_df)):
    data_line = test_df[i].strip().split(",")
    raw_test_context_list.append(data_line[2])
    raw_test_prompt_list.append(data_line[3].replace("_comma_", ","))
test_dataframe = pd.DataFrame(columns=['prompt','context'])
test_dataframe['prompt'] = raw_test_prompt_list
test_dataframe['context'] = raw_test_context_list
# display sentiment value counts
print(test_dataframe['context'].value_counts())
test_dataframe.context.value_counts().sort_values().plot(kind = 'barh', linewidth=20, figsize = [8,8], title='Sentiment Frequency Test Data')
# use emotion_grouping function
test_grouped_emotions_col = emotion_grouping(test_dataframe['context'])
test_dataframe['grouped_emotions'] = test_grouped_emotions_col
test_dataframe[['context','grouped_emotions']].head(10)
# display grouped_emotions value counts
test_dataframe['grouped_emotions'].value_counts()
test_dataframe.grouped_emotions.value_counts().sort_values().plot(kind = 'barh', linewidth=20, figsize = [8,8], title='Sentiment Frequency Test Data')
Vectorizing Text
Deep learning models, being differentiable functions, can only process numeric tensors: they can't take raw text as inputs. Vectorizing text is the process of transforming text into numeric tensors. It involves a few key steps: standardizing the text, tokenizing it, and converting each token into a numerical vector. We'll start by vectorizing our target labels, the 8 grouped emotions we'll use for our sentiment analysis.

Processing output labels

We'll vectorize our target labels using the Keras StringLookup layer, which maps string features to integer indices. By switching the output mode to 'multi_hot', input strings are encoded into an array where each dimension corresponds to an element in the vocabulary. An 'OOV' (out of vocabulary) token is created to cover any labels in the validation or test data that are not in the training data. Each target label becomes a 9-dimensional multi-hot array with a 1 for the target label and 0s for the non-target labels.

sentiment = tf.ragged.constant(train_dataframe['grouped_emotions'].values)
lookup = tf.keras.layers.StringLookup(output_mode="multi_hot")
lookup.adapt(sentiment)
# the sentiment vocabulary added the [UNK] OOV token for unknown values
sentiment_vocab = lookup.get_vocabulary()
# [UNK] token created to cover unknown emotion sentiments
print("Sentiment Vocabulary:\n")
print(sentiment_vocab)

idx = np.random.randint(train_dataframe.shape[0])
sample_label = train_dataframe['grouped_emotions'].iloc[idx]
print(f"Original label: {sample_label}")
label_binarized = lookup([sample_label])
print(f"Label-binarized representation: {label_binarized}")

def invert_multi_hot(encoded_labels):
    """Reverse a single multi-hot encoded label to a string vocab term."""
    hot_indices = np.argwhere(encoded_labels == 1.0)[..., 0]
    return np.take(sentiment_vocab, hot_indices)

Processing input text

Each value in the prompt column of the training, validation, and test data is a string containing a user's message on Facebook. We need to create a dataset object that aligns the values from the prompt column with the sentiment values from the grouped emotions column. We'll use our StringLookup layer inside the function to create binary representations of our target labels. For the training dataset we'll shuffle the data to prevent bias when training our model.
batch_size = 128

# function takes a pandas dataframe and returns a tf.data.Dataset object
# with string inputs (messages) and multi-hot binary versions of the target labels
def make_dataset(dataframe, is_train=True):
    labels = tf.ragged.constant(dataframe["grouped_emotions"].values)
    labels_binarized = [lookup(x).numpy() for x in labels]
    dataset = tf.data.Dataset.from_tensor_slices((dataframe['prompt'].values, labels_binarized))
    dataset = dataset.shuffle(batch_size * 10) if is_train else dataset
    return dataset.batch(batch_size)

Implement the make_dataset function for each dataframe.
train_dataset = make_dataset(train_dataframe, is_train=True)
validation_dataset = make_dataset(valid_dataframe, is_train=False)
test_dataset = make_dataset(test_dataframe, is_train=False)

Let's see if our function was successful by taking a batch from our train_dataset and displaying the first 5 text inputs with their associated target labels. We'll use the invert_multi_hot function to convert each label back to its string form.
text_batch, label_batch = next(iter(train_dataset))
for i, text in enumerate(text_batch[:5]):
    label = label_batch[i].numpy()[None, ...]
    print(f"Message: {text}")
    print(f"Label: {invert_multi_hot(label[0])}")
    print(" ")

Text Vectorization layer: bag of words approach
This process is a basic form of feature engineering that aims to erase encoding differences you don't want your model to have to deal with. The simplest and most widespread standardization schemes include converting the text to lowercase, removing punctuation characters, and representing each tokenized text in numerical form. All of this can be done with the Keras TextVectorization layer. The output of this layer depends on how you choose to represent the text. There are two notable methods for text classification: the bag-of-words approach and the sequence model. To decide which one to use, look at the ratio between the number of samples in your training data and the mean number of words per sample. If the ratio is small (less than 1,500), a bag-of-words model will tend to perform better; if it's higher than 1,500, a sequence model works best. Let's get the percentile estimates for the sequence lengths of the prompts.

train_dataframe["prompt"].apply(lambda x: len(x.split(" "))).describe()

# take the number of samples in the training data and divide by the mean sequence length
19307/18
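Rather than hard-coding the two numbers, the same ratio can be computed directly from the dataframe (a small sketch using the deduplicated train_dataframe from above; the exact value will vary slightly):

# assumption: train_dataframe is the deduplicated training dataframe built earlier
num_samples = len(train_dataframe)
mean_words_per_sample = train_dataframe["prompt"].apply(lambda x: len(x.split(" "))).mean()
print(f"samples / mean words per sample = {num_samples / mean_words_per_sample:.0f}")
# a ratio well below 1500 points to the bag-of-words approach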

max_tokens = 20000
auto = tf.data.AUTOTUNE

# text vectorization layer for preparing bag of words processing approach
text_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens, ngrams=2, output_mode="tf_idf",
)

To build the vocabulary for our model, the TextVectorization layer needs to be adapted using the vocabulary from our training dataset.
# `TextVectorization` layer needs to be adapted as per the vocabulary from our
# training set
with tf.device("/CPU:0"):
    text_vectorizer.adapt(train_dataset.map(lambda text, label: text))
Our layer is now ready to vectorize the inputs for the training, validation, and test data.

# vectorize the text inputs in our datasets
train_dataset = train_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)
validation_dataset = validation_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)
test_dataset = test_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)

Inspect our dataset objects.
train_dataset

validation_dataset
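To see the effect of the vectorizer more concretely (a quick sketch using the objects defined above), we can pull one batch and check its shape; each message is now a TF-IDF vector of length max_tokens and each label a 9-dimensional multi-hot vector:

# assumption: train_dataset has already been mapped through text_vectorizer
vectorized_batch, vectorized_labels = next(iter(train_dataset))
print(vectorized_batch.shape)   # (batch_size, max_tokens), e.g. (128, 20000)
print(vectorized_labels.shape)  # (batch_size, 9)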

Build Model
Since we're using a simple bag-of-words approach, we'll keep the architecture simple to reduce the chances of our model overfitting the training data. The model consists of a regular densely connected layer with ReLU as the activation function and 90 output units. To reduce overfitting we'll follow it with a dropout layer with a rate of 0.5. Since our task is a single-label multiclass classification problem, we'll apply the softmax function in the output layer. The output will be a vector with the probability distribution over the 9 possible sentiments (remember that we added the [UNK] out-of-vocabulary token as a possible sentiment).

# create make_model function
def make_model(max_tokens=max_tokens):
    inputs = tf.keras.Input(shape=(max_tokens,))
    x = tf.keras.layers.Dense(90, activation='relu')(inputs)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(9, activation='softmax')(x)
    model = tf.keras.Model(inputs, outputs)
    return model

Call our make_model function and display the summary.
our_model = make_model()
our_model.summary()

# custom learning rate
opt_rmsprop = tf.keras.optimizers.RMSprop(learning_rate=1e-4)
epochs = 20

our_model.compile(optimizer=opt_rmsprop, loss='categorical_crossentropy', metrics=['categorical_accuracy'])
history = our_model.fit(train_dataset, validation_data=validation_dataset, epochs=epochs)

# visualize the training and validation loss
train_loss = history.history['loss']
val_loss = history.history['val_loss']
plt.plot(train_loss, label='training loss')
plt.plot(val_loss, label='validation loss')
plt.legend()

# visualize the accuracy during training
train_accuracy = history.history['categorical_accuracy']
val_accuracy = history.history['val_categorical_accuracy']
plt.plot(train_accuracy, label='training accuracy')
plt.plot(val_accuracy, label='validation accuracy')
plt.legend()

_, categorical_acc = our_model.evaluate(test_dataset)
print(f"Categorical accuracy on the test set: {round(categorical_acc * 100, 2)}%.")

Inference
An important feature of the preprocessing layers provided by Keras is that they can be included inside a tf.keras.Model. We'll export an inference model by placing the text_vectorizer layer on top of our_model, which lets the inference model operate directly on raw strings.

# inference stage: stack the text vectorizer on top of the trained model
# so it accepts raw strings (text_batch still holds the raw messages from earlier)
model_for_inference = tf.keras.Sequential([text_vectorizer, our_model])
predicted_probabilities = model_for_inference.predict(text_batch)

for i, text in enumerate(text_batch[:5]):
    label = label_batch[i].numpy()[None, ...]
    print(f"Message: {text}")
    print(f"Label(s): {invert_multi_hot(label[0])}")
    predicted_proba = [proba for proba in predicted_probabilities[i]]
    predicted_label = [
        x for _, x in sorted(
            zip(predicted_probabilities[i], lookup.get_vocabulary()),
            key=lambda pair: pair[0],
            reverse=True,
        )
    ][:1]
    print(f"Predicted Label(s): ({', '.join([label for label in predicted_label])})")
    print(" ")
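With the inference model in place, predicting the sentiment of an arbitrary new message takes just a few lines (a small sketch; the example sentence is made up):

# assumption: model_for_inference and lookup are defined as in the inference step above
sample_message = tf.constant(["I just got a new puppy and I can't stop smiling!"])
probs = model_for_inference.predict(sample_message)[0]
vocab = lookup.get_vocabulary()
print(f"Predicted sentiment: {vocab[int(np.argmax(probs))]}")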
