Artificial Intelligence (AI) is on its way to becoming the defining technology of our time. Plenty of applications and advancements back up that claim, but if we had to choose only one, it would have to be deep learning: a family of machine learning methods based on artificial neural networks that mimic how the human brain works.
As impressive as that sounds, deep learning is growing by leaps and bounds with every passing week, and it still has a lot of room to improve. Along the way, it's being applied to multiple tasks, from speech recognition to computer vision. Though all of those efforts are amazing in their own ways, one stands out: natural language processing (NLP).
Through it, software engineers are trying to come up with systems able to understand humans in everyday interaction. In other words, an NLP-based system could become so advanced that it could talk with a person as if it were human. That's mind-blowing, even before considering what it takes for a machine to rise to that level: understanding intent, comprehending context, and handling semantics and grammar, among many other things.
A perfect NLP system doesn't exist yet, but research is taking huge steps towards it. One of the latest developments? The use of graph convolutional networks to recognize emotions in text. Can you imagine a computer reading a paragraph and being able to tell whether the writer was being funny, ironic, or angry? That's precisely what this sophisticated method is trying to do. Let's take a look at how it works, at least on a theoretical level. The studies can be found through the Association for Computational Linguistics.
Some basic definitions
Before getting into the complex explanation waiting ahead of us, it's important for you to know a couple of essential terms that can help you grasp it all. So, here's a list of things you need to learn before moving on:
Emotion Recognition in Conversation (ERC): the task of identifying and classifying the emotion in a piece of text, whether it's anger, sadness, happiness, excitement, boredom, and so on. A machine capable of this task is essential to develop an AI platform advanced enough to talk to humans.
Artificial Neural Network (ANN): a computing system that replicates how biological neural networks (such as the human brain) work. These systems learn to do things by analyzing examples, so they don't have to be programmed with explicit rule sets.
Deep Neural Network (DNN): a type of Artificial Neural Network that uses multiple layers between the input data and the output data. Each layer applies mathematical operations to its input and passes the result on as the input for the next layer, and so on until the last layer of the system, which offers a result. The name "deep" refers to the numerous layers working together to achieve the desired output.
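The layer-by-layer idea can be sketched in a few lines of NumPy. This is an illustrative toy, not any particular architecture; the dimensions and random weights are made up for the example.

```python
import numpy as np

# A minimal two-layer feedforward sketch: each layer transforms its input
# and passes the result to the next layer, until the last layer produces
# a probability per class.
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes: 4 input features, 8 hidden units, 3 output classes.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

x = rng.normal(size=4)              # one input example
hidden = relu(x @ W1 + b1)          # first layer's output feeds the second
probs = softmax(hidden @ W2 + b2)   # final layer: one probability per class
print(probs)
```

Stacking more of these layers is what puts the "deep" in deep learning.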
Convolution: a mathematical operation in which two functions produce a third that results from the interaction between them. The new function expresses how the shape of one function is modified by the other.
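For a concrete feel of the operation, here is a discrete convolution of two short sequences using NumPy (the numbers are arbitrary, chosen only to show the third, reshaped sequence that comes out):

```python
import numpy as np

# Discrete convolution: two sequences interact to produce a third,
# new-shaped sequence.
f = np.array([1, 1, 1])
g = np.array([1, 2, 1])
h = np.convolve(f, g)   # -> [1, 3, 4, 3, 1]
print(h)
```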
Convolutional Neural Network (CNN): a class of Deep Neural Network that uses convolution in at least one of its layers, which defines how the input data is reshaped into the output.
Graph Convolutional Network (GCN): a type of Convolutional Neural Network that works with graphs to leverage the structural information represented in them.
Your main takeaway should be that we're going to look into a computational process that mimics the human brain, using convolution operations to interpret graphs. Now it's time to see how it all comes together when applied to ERC.
The current state of ERC
Emotion Recognition is like the Holy Grail for NLP enthusiasts. If a system were advanced enough to pinpoint the emotions in a piece of text, we'd be closer to an AI-based platform capable of talking to us, as if we lived in a sci-fi movie. Creating such a platform could launch the development of intelligent robots and systems that would revolutionize our education and healthcare as well as the way we sell things or how we work.
The methods used for ERC until now have been Recurrent Neural Networks (RNNs) and Attention Mechanisms. Without getting too deep into them, let's just say their results are far from perfect, even when the two methods are combined. That's because both struggle with context beyond the text itself (personality traits, topics, and intent).
Since context in communication is everything, systems that use RNNs and Attention Mechanisms aren't particularly effective. That's the main reason why researchers started to look into deep learning for a more sophisticated way of tackling ERC. That's where GCNs come in.
The importance of context
We said that context is everything and that's undebatable. Let's see an example to prove it. If you read "It's fine" you could understand it in several ways. You can take it at face value (the thing is fine), you can see it as a resigned expression ("it's fine..." as in "just leave it at that"), you can even see the irony in it (an "it's fine" with a mocking gesture). How could you know which interpretation is the right one? Through context.
In such a phrase, the context would be given by what was said before, how all the participants of the conversation are feeling, the past history between them, the atmosphere, and so on. We are able to discern all that in a text because we, as humans, have already integrated the "mechanisms" that allow us to tell whether one interpretation is correct or not. This stems from two types of context:
Sequential Context: in other words, what the phrase or sentence means when placed in a sequence. What was said before impacts the meaning you can get out of a phrase or sentence. Additionally, there are semantic and syntactic rules and relationships that rule out certain interpretations in favor of others. RNN and Attention mechanisms widely use this context to "comprehend" emotions. If you read "Ah, just don't do anything else. It's fine" you can limit the number of interpretations and understand where the person is coming from.
Speaker Level Context: the relationship between the speakers participating in a conversation, and the relationship each of them has with themselves, provide another kind of context that's more complex to analyze. People's personalities and personal histories influence how they talk. In the same sense, the other participants of a conversation, as well as what is being said, also influence how participants talk and can even change how they talk during the conversation. This is a context that RNNs and Attention Mechanisms can't grasp, and it's the main focus of GCNs.
Naturally, for a GCN to work, the data needs to be arranged in the form of a graph. That presents one of the model's many challenges: how on earth do you graph a conversation?
Creating graphical representations of conversations
To create a conversation graph, it's important to understand the different elements that play a part in the conversation. The first is the number of speakers involved. Each speaker "creates" a new piece of text and sends it to the other speaker or speakers; researchers call each of these pieces an "utterance."
After understanding that speakers create utterances, the next thing to grasp is that each utterance is connected to the rest in a contextual manner. Those connections are called Edges, which in turn can be labeled according to different needs; these labels are called Relations. Additionally, each Edge carries a different importance for the context, which is defined as the Weight of the Edge.
Now we have all the elements we need to draw a conversation, including the speakers, their utterances, the edges, the relations, and the weights. Here's how it all comes together:
Of all these, Edges are probably the most complicated notion to grasp. That's why it's important to note two things about them:
All edges are one-way (directed). That means there's an edge (connection) from utterance 1 to utterance 2 and a different edge representing the connection from utterance 2 to utterance 1.
Every utterance has an edge connecting it with itself. In other words, this captures how the utterance influences the speaker who creates it. In practical terms, any time you talk, you listen to yourself, and what you say somewhat shapes the communication itself.
Other important considerations for understanding the model include the following:
The weight of each edge (that is, the importance of a specific connection between two utterances) is constant and doesn't change during the process.
The relation of an edge depends on speaker dependency (who said what) and on temporal dependency (which utterance came first).
These considerations are very important for the graphs, as who said what and when it was said are very important aspects of any conversation. It's not the same if you speak before the other person or if the other person did it first. In the same sense, the conversation won't be the same if there's a third person in the mix.
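The rules above can be sketched in plain Python. The structure and field names below are illustrative, not from the paper: every ordered pair of utterances, including each utterance paired with itself, gets a directed edge whose relation encodes who spoke (speaker dependency) and which utterance came first (temporal dependency), with a constant weight.

```python
# Illustrative sketch of a conversation graph: nodes are utterances,
# and every ordered pair (including self-loops) gets a directed edge.
utterances = [
    {"id": 0, "speaker": "A", "text": "Ah, just don't do anything else."},
    {"id": 1, "speaker": "B", "text": "It's fine."},
    {"id": 2, "speaker": "A", "text": "Are you sure?"},
]

edges = []
for u in utterances:
    for v in utterances:
        # Relation combines speaker dependency (who -> whom) with
        # temporal dependency (did the source come before the target?).
        temporal = ("self" if u["id"] == v["id"]
                    else "past" if u["id"] < v["id"] else "future")
        edges.append({
            "from": u["id"],
            "to": v["id"],
            "relation": (u["speaker"], v["speaker"], temporal),
            "weight": 1.0,   # constant weight, never updated
        })

# 3 utterances -> 9 directed edges: each pair in both directions,
# plus one self-loop per utterance.
print(len(edges))
```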
The GCN model
Using the graph presented above, it's possible to understand how a GCN works. Here's the visual representation of the process.
It looks complex, but it isn't once you take a closer look. There are three distinct stages in the model: sequential context encoding, speaker-level context encoding, and classification. Let's see them in more detail.
In the sequential context encoding stage, each utterance is run through a series of Gated Recurrent Units (GRUs) by a sequential context encoder. This is where the data gains its sequential context, that is, an understanding of its own place in a particular sequence. The output of this process is used as the input for the second stage.
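To make the recurrence concrete, here is a minimal GRU cell in NumPy. It's a simplified single-direction sketch with made-up sizes and random weights (real systems use a deep learning library's GRU); the point is that each utterance vector updates a running hidden state, so every output carries information about what came before it.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid = 5, 4   # hypothetical feature and hidden sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate, acting on [input, previous hidden state].
Wz = rng.normal(scale=0.1, size=(d_in + d_hid, d_hid))  # update gate
Wr = rng.normal(scale=0.1, size=(d_in + d_hid, d_hid))  # reset gate
Wh = rng.normal(scale=0.1, size=(d_in + d_hid, d_hid))  # candidate state

def gru_step(x, h):
    z = sigmoid(np.concatenate([x, h]) @ Wz)
    r = sigmoid(np.concatenate([x, h]) @ Wr)
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)
    return (1 - z) * h + z * h_tilde   # blend old state with candidate

utter_feats = rng.normal(size=(3, d_in))  # three raw utterance vectors
h = np.zeros(d_hid)
context_encoded = []
for x in utter_feats:
    h = gru_step(x, h)            # hidden state accumulates sequential context
    context_encoded.append(h)

print(np.stack(context_encoded).shape)  # one context-aware vector per utterance
```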
In the speaker-level context encoding stage, the data with the sequential context is reexamined and enriched. Here, the edges are labeled with relations and the speaker-level context dimension is added. This is done through a two-step process:
The information from all the neighboring nodes is aggregated into each node to create a new feature vector.
Step 1 is repeated over the output of the previous step, aggregating neighbor information once more into a similar feature vector.
This process is one of the most important in the whole system, as this is where the classification gets refined and where most of the learning takes place.
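The two aggregation steps can be sketched with NumPy. Note this is a plain GCN-style aggregation over a tiny made-up graph, not the paper's relation-aware version: each step mixes every node's feature vector with its neighbors', so after two steps information has traveled two hops across the conversation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny adjacency with self-loops: utterance 0 <-> 1, utterance 1 <-> 2.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
A_norm = A / A.sum(axis=1, keepdims=True)  # row-normalize edge weights

X = rng.normal(size=(3, 4))   # initial (sequence-encoded) node features
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))

H1 = np.tanh(A_norm @ X @ W1)   # step 1: aggregate neighbors' features
H2 = np.tanh(A_norm @ H1 @ W2)  # step 2: aggregate again over step 1's output

print(H2.shape)   # one speaker-level context vector per utterance
```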
When that process is done, it's time to move on to the classification stage. Here, the outputs of the first and second stages are concatenated for classification. In other words, the data with the integrated sequential context is tied to the data with the speaker-level context to obtain a third data set, which will be the richest one and, therefore, the best input for classification. After this stage, the GCN produces its output as a probability distribution over the different emotions for each utterance.
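The final stage can be sketched as a concatenation followed by a softmax classifier. Sizes and weights are illustrative; the six classes loosely echo the IEMOCAP label set mentioned below.

```python
import numpy as np

rng = np.random.default_rng(3)

seq_ctx = rng.normal(size=(3, 4))   # stage-1 output: 3 utterances
spk_ctx = rng.normal(size=(3, 4))   # stage-2 output for the same utterances

# Concatenate both context vectors per utterance into one rich feature set.
features = np.concatenate([seq_ctx, spk_ctx], axis=1)   # shape (3, 8)

W = rng.normal(size=(8, 6))          # 6 hypothetical emotion classes
logits = features @ W
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(probs.argmax(axis=1))   # predicted emotion index per utterance
```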
Of course, that doesn't mean the system is capable of assigning the right emotion to every utterance right from the start. In fact, as with all machine learning methods, the output has to be assessed and fed back into the system with enough corrections for the GCN to run over it again and further its learning.
The training of the model is crucial to get better results. For that, researchers are using the following labeled multimodal datasets:
IEMOCAP: videos of ten people having two-way conversations, in which each utterance is labeled as happy, sad, neutral, angry, excited, or frustrated.
AVEC: interactions between people and AI-based agents, with utterances classified according to valence, arousal, expectancy, and power.
MELD: 1,400 dialogues with 13,000 utterances from the TV smash hit Friends, where utterances are labeled with anger, disgust, sadness, joy, surprise, fear, or neutral.
It's important to note that only the text part of these multimodal datasets is used. However, researchers believe there are instances where using the audio and the images for training could be beneficial (such as when assessing short utterances like "fine" or distinguishing similar emotions like "excited" and "happy").
What to expect now
As you've surely seen, GCNs are very complex systems that need clean datasets, constant training, high processing power, and time to develop and produce more accurate results. As they stand now, though, they are one of the best alternatives for ERC in AI-based solutions.
The most amazing thing is that graph neural networks may be the key to unlocking the potential of NLP research. Understanding the relationships in the data through neighboring nodes is a revolutionary concept that can push the field forward. It's now time to keep experimenting with and refining these systems and the underlying technologies to get significant improvements in what is now our surest path to truly intelligent AI solutions.
For a more comprehensive review of how GCNs work, read the fantastic introductory article from Kevin Shen in Towards Data Science.
And for more exciting open-source research on conversation understanding tasks such as ERC, there's an excellent repository at https://github.com/declare-lab/conv-emotion by the DeCLaRe lab of the Singapore University of Technology and Design.