“Did you know? You are just six handshakes away from meeting the President of the USA.”
I am not sure when I first heard this statement, but it sure is a captivating premise. I have heard it many times since, cited as a fun fact at family birthdays by semi-knowledgeable uncles or as a super-inspirational opening quote for blog posts. This phenomenon, more formally known as the six degrees of separation, finds its origins in the seminal Small World experiments by Stanley Milgram. While the first accounts of the systematic analysis of social networks (a field also dubbed network science) go much further back, Milgram’s study serves as a quintessential example of the field.
In this blog post, we will present a high-level overview of our paper on the automated extraction of social networks from novels: http://peerj.com/articles/cs-189. In a social network, persons are represented by nodes, and the connections or interactions between them are represented by edges. Although traditionally used in the social sciences, the analysis of social networks is now widely applied in many other fields as well.
Unfortunately, knowledge about one’s social network is not always readily available. While it is relatively simple to extract a social network from online platforms such as Facebook or LinkedIn, the vast majority of information remains hidden. Despite the unimpeded growth of (big) data storage and usage, an estimated 80% of all data is still unstructured. Until recently, these untapped sources of information, in the form of text, images, and video or audio streams, were mostly gathering dust in data silos.
With the advent of Machine Learning, leveraging such sources of data is becoming increasingly attainable. Advancements in Machine Learning are pushing the envelope in all of these areas, but we will focus on the subfield of Natural Language Processing (NLP). Among other things, NLP practitioners work on the automated understanding of written text. In the rest of this blog, we will explain how one could extract social networks from text by applying Natural Language Processing techniques to Game of Thrones. But why?
Because Game of Thrones is awesome.
In 2002, The Boston Globe presented the results of an extensive investigation that led to the prosecutions of five priests and brought the sexual abuse of minors by Catholic clergy into the international spotlight. Their work unveiled that one of the cardinals of the Catholic Archdiocese of Boston had systematically covered for the offending priests by moving those accused of child molestation from parish to parish. Because of the scope of this widespread wrongdoing, the research team spent months going through mountains of church documents, internal depositions, and extensive personnel files.
More recently, an unknown source leaked 11.5 million documents that would later be known as the Panama Papers. These documents describe the tax evasion, fraud and kleptocracy of a great number of celebrities, wealthy individuals and high-ranking political figures. The unparalleled scale of these terabytes of unstructured data (e.g. emails, PDFs) prompted a collaboration of journalists and computer scientists, and promoted the usage of technologies such as Neo4J and Linkurious to explore the interactions in the data.
Clearly, the above two use cases present a problem. The information we are looking for is locked away in free text. We might be interested in which priests interacted with the aforementioned cardinal, or which government official talked to which known money launderer. Manually extracting useful information from these heaps of unstructured data can prove to be a tedious and laborious process.
To explain how we approach this problem, I will use an example that is hopefully familiar to most of you. We will be using the first novel in the A Song of Ice and Fire book series by George R.R. Martin: A Game of Thrones. With 400+ named characters over the course of the novel, there is plenty of social structure to uncover.
To gain an understanding of who is interacting with whom, we first need to identify characters in this raw text. In NLP, characters are typically a subdivision of what are known as Named Entities. Named Entities can be anything ranging from corporations to cities, from characters to currencies, or as a linguist might say, proper nouns. To put it crudely, if it has a capital letter at the beginning of the word, we want to know about it (at least, in English). Extracting Named Entities is traditionally done by first performing Part-of-Speech (POS) tagging. Basically, this technique attempts to classify the word class of each word: noun, verb, preposition, etc. This captures the syntactic structure of the sentence, which in turn helps in figuring out whether a word is indeed a Named Entity. There are plenty of open-source, off-the-shelf tools that we can use to identify such words.
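To make the "capital letter" intuition concrete, here is a deliberately crude sketch in Python. It only flags capitalised words that do not start the sentence; real pipelines use trained POS taggers and NER models (e.g. the off-the-shelf tools mentioned above), so treat this as an illustration of the heuristic, not a usable recogniser:

```python
import re

def crude_named_entities(sentence):
    """Crude heuristic: flag capitalised words that do not start the
    sentence as candidate Named Entities. Trained NER systems do far
    better; this only illustrates the capitalisation intuition."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    # Skip the first token: sentence-initial capitalisation is uninformative.
    return [tok for tok in tokens[1:] if tok[0].isupper()]

print(crude_named_entities("That is the only time a man can be brave, Ned told him."))
# → ['Ned']
```

Note how brittle this is: it would also flag sentence-internal capitalised non-names, which is exactly why POS tagging and trained models are used in practice.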
For the sake of simplicity, let us assume that we have successfully identified all Named Entities in our text. Could we now start creating a social network of character interactions? Well, if we define two characters occurring in the same sentence as an ‘interaction’, we could. However, if we started creating our character interactions at this point, we would be left with a very sparse graph. In normal discourse, characters are not all that frequently mentioned by their names. In fact, about ¾ of character references in novels take the form of anaphoric pronouns such as he, him, his, she, her, and hers. This means that we would be missing out on ~75% of our interaction data!
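Under that same-sentence definition, counting interactions is straightforward. A minimal sketch, assuming an upstream step has already reduced each sentence to the list of character names it mentions (the sentence data below is invented for illustration):

```python
from collections import Counter
from itertools import combinations

def cooccurrences(sentences):
    """Count pairwise character co-occurrences.
    `sentences` is a list of character-name lists, one per sentence,
    assumed to come from an upstream entity-recognition step."""
    counts = Counter()
    for chars in sentences:
        # Sort and deduplicate so (a, b) and (b, a) count as the same pair.
        for a, b in combinations(sorted(set(chars)), 2):
            counts[(a, b)] += 1
    return counts

sents = [["Bran", "Ned"], ["Ned"], ["Bran", "Ned", "Robb"]]
print(cooccurrences(sents))
# → Counter({('Bran', 'Ned'): 2, ('Bran', 'Robb'): 1, ('Ned', 'Robb'): 1})
```

These pair counts later become the edge weights of the network.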
Luckily for us, this problem is not new, and there are several studies that show how to resolve such pronouns automatically. Some of the Machine Learning features used in such an approach include: 1) the part-of-speech of the antecedent, 2) whether the pronoun and antecedent appear in the same quotation scope, 3) whether the pronoun and antecedent agree in gender, and 4) the word distance between the pronoun and antecedent. With this information, we can train a model to help out with the following (personal favourite) excerpt:
1. Bran thought about it.
2. “Can a man still be brave if he’s afraid?”
3. “That is the only time a man can be brave,” Ned told him.
In the last sentence, we see Ned and him. By resolving the anaphoric pronoun him to Bran, we would know that Ned and Bran interact in this sentence. Now, for those of you still paying attention: in the second sentence, he cannot and should not be resolved to any character. It refers to a man, which makes he a generic pronoun. Resolving these properly is still quite difficult and is therefore left out of the scope of this experiment.
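To show how such features might combine, here is a toy rule-based antecedent scorer using the gender-agreement, quotation-scope, and distance features listed above. The weights and the example sentence are invented for illustration; the actual approach trains a model over these features rather than hand-picking scores:

```python
def resolve_pronoun(pronoun_gender, pronoun_index, in_quote, candidates):
    """Toy antecedent ranking with hand-picked weights.
    `candidates` holds (name, gender, word_index, in_quote) tuples for
    earlier character mentions; returns the best-scoring antecedent."""
    best, best_score = None, float("-inf")
    for name, gender, index, quote in candidates:
        if index >= pronoun_index or gender != pronoun_gender:
            continue                      # must precede the pronoun and agree in gender
        score = -(pronoun_index - index)  # prefer nearby mentions (word distance)
        if quote == in_quote:
            score += 5                    # same quotation scope is a strong signal
        if score > best_score:
            best, best_score = name, score
    return best

# Hypothetical sentence: "Jon watched as Arya drew her sword. She smiled."
mentions = [("Jon", "m", 0, False), ("Arya", "f", 3, False)]
print(resolve_pronoun("f", 7, False, mentions))
# → 'Arya'
```

Even this toy version shows why gender agreement matters: distance alone would not distinguish Jon from Arya reliably.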
Phew, nearly there. We have a way to identify persons and a way to identify pronouns referring to these persons. Are we done yet? Well, almost. While we have indeed established a way of identifying persons, we haven’t yet touched on the concept of a character. For example, while Tyrion, Tyrion Lannister, Lord Tyrion, The Hand of the King, and The Imp would all be correct identifications of a person, we wouldn’t have any way of knowing that these references all belong to the same character. So we will need to do some clustering before we can create a social network that does the novel justice. We can cluster some of these with relative ease, using permutations of the names, but the latter two nicknames look nothing like the character’s actual name. This problem is similar to that of the generic pronouns (a man) and is therefore not included in this scope.
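A minimal sketch of that permutation-style clustering: variants whose name tokens overlap (after stripping common titles) get merged, while nicknames that share no tokens, like The Imp, stay unresolved, exactly as described above. The title list is an illustrative assumption:

```python
def cluster_aliases(names):
    """Cluster name variants whose token sets overlap, e.g. 'Tyrion',
    'Tyrion Lannister' and 'Lord Tyrion' all share the token 'tyrion'.
    Nicknames sharing no tokens remain their own singleton cluster."""
    TITLES = {"lord", "lady", "ser", "king", "queen", "maester"}  # illustrative
    clusters = []
    for name in names:
        tokens = {t for t in name.lower().split() if t not in TITLES}
        for cluster in clusters:
            if cluster["tokens"] & tokens:      # any shared token → same character
                cluster["names"].append(name)
                cluster["tokens"] |= tokens
                break
        else:
            clusters.append({"names": [name], "tokens": tokens})
    return [c["names"] for c in clusters]

print(cluster_aliases(["Tyrion", "Tyrion Lannister", "Lord Tyrion", "The Imp"]))
# → [['Tyrion', 'Tyrion Lannister', 'Lord Tyrion'], ['The Imp']]
```

As the output shows, the epithets fall outside the token-overlap reach of this approach, which is why they are left out of scope.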
Now, why did we go through all of this trouble again? We are looking to create a social network of character interactions, so that we can get some information about the contents at a glance. We now know which characters occur in which sentences, so we can figure out when characters co-occur. We can represent characters by nodes and interactions by weighted edges between those nodes, and we would have a network. So, let’s just do that!
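The step from co-occurrence counts to a network is small. A sketch without any graph library (the counts below are invented; node "strength" here is just the sum of a character’s edge weights, a quick proxy for how connected they are):

```python
from collections import Counter

def build_network(cooccurrence_counts):
    """Turn pairwise co-occurrence counts into a weighted edge list
    plus a per-character strength (sum of incident edge weights).
    Nodes are characters; edge weights are co-occurrence counts."""
    edges = [(a, b, w) for (a, b), w in sorted(cooccurrence_counts.items())]
    strength = Counter()
    for a, b, w in edges:
        strength[a] += w
        strength[b] += w
    return edges, strength

counts = {("Bran", "Ned"): 2, ("Ned", "Robb"): 1}
edges, strength = build_network(counts)
print(edges)     # [('Bran', 'Ned', 2), ('Ned', 'Robb', 1)]
print(strength)  # Ned comes out as the most connected node
```

In practice one would hand this edge list to a graph library or visualisation tool, but the data structure itself is this simple.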
Well, maybe we jumped the gun on that one. After all, our primary goal was to have a visualisation that helps its user quickly obtain basic information that would otherwise take a lot of reading. Clearly, the type of visualisation we make depends heavily on the goal that we have. If we were just interested in the key players, a simple bar chart with character counts would have sufficed. However, in our case, we would like to know who is interacting with whom, which makes a network a logical choice. But, given that we humans can only take in so much information at a time, we can help guide our brains by reducing some of the noise. With a force-directed drawing algorithm, node and edge scaling, and some more visual sugar, we can quickly generate something that actually looks useful. In fact, I’ve included an interactive visualisation below that allows for even more exploratory flexibility; be sure to check it out in full size!
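For the curious, here is a bare-bones force-directed layout in the spirit of Fruchterman–Reingold: connected nodes attract, all nodes repel, so tightly connected clusters drift together while loosely connected characters drift to the edge. The constants are illustrative, not tuned, and real tools do this far better:

```python
import math
import random

def force_directed_layout(nodes, edges, iterations=200, seed=1):
    """Minimal force-directed (spring) layout sketch.
    `edges` holds (a, b, weight) tuples; returns {node: [x, y]}."""
    random.seed(seed)
    pos = {n: [random.random(), random.random()] for n in nodes}
    k = 0.2                                  # ideal spacing constant
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        for a in nodes:                      # repulsion between every pair
            for b in nodes:
                if a == b:
                    continue
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += dx / d * f
                disp[a][1] += dy / d * f
        for a, b, w in edges:                # attraction along weighted edges
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k * w
            disp[a][0] -= dx / d * f
            disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f
            disp[b][1] += dy / d * f
        for n in nodes:                      # damped, capped step per node
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            step = min(d, 0.05)
            pos[n][0] += dx / d * step
            pos[n][1] += dy / d * step
    return pos

nodes = ["Ned", "Bran", "Robb", "Dany"]
edges = [("Ned", "Bran", 2), ("Ned", "Robb", 1), ("Bran", "Robb", 1)]
layout = force_directed_layout(nodes, edges)
```

Add node sizes proportional to mention counts and edge widths proportional to co-occurrence counts, and the picture starts to look like the one below.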
[Interactive visualisation: Game of Thrones character network]
So what can we gather from this? The first thing that stands out is the large supercluster that contains basically all the key players. Only one other cluster lies far from this supercluster, centred around Dany, Jorah Mormont, and Khal Drogo. This reflects their role in the narrative: all of them are on an entirely different continent. Jon Snow is a major character in this narrative too, but he does not interact with a large part of the supercluster, which is why he is rightly positioned at the edge of the network. Lastly, note the node of Ned Stark. From this network, he truly seems like the main character of the story: smack in the middle and strongly connected to most major characters. His centrality underpins the shock of many fans when he met his untimely demise in only the first novel of the series.
Clearly, we are imposing prior knowledge onto this graph, but we hope you see the value in such a visualisation. Applying a technique like this to large corpora of text, such as in our introductory criminal examples, could help steer the investigatory team in the right direction.
In this blog post, we have explored how to get critical information on social structures from unstructured text. Our goal was to leverage Machine Learning techniques to automatically harvest character relationships, and visualise those in such a way that we can extract sensible, albeit general, information with a cursory glance.
For a more in-depth explanation, have a look at the accompanying paper: http://peerj.com/articles/cs-189.
If you are still up for more, have a look at these papers:
Automatic Extraction of Social Networks from Literary Text
A Bayesian Mixed Effects Model of Literary Character
Structure-based Clustering of Novels
Read what else Trifork is doing with Machine Learning on our website: