Extracting Social Networks from Short Stories

Slaps Lab
Apr 12, 2020

I came across the paper ‘Mining and Modeling Character Networks’ (see References; read it), which I thought provided an excellent blueprint for building out social networks from short stories. The basic idea is to extract the characters, find all interactions between them, and build a social network from those ‘found’ interactions.

In the paper, the authors defined an interaction as two distinct ‘characters’ appearing within 15 words of each other. They took steps to ensure a single interaction was only counted once, but for this write-up, I am not going to great lengths to protect against double counting. I will, however, keep that basic definition.

This write-up is meant to be simple and something to build upon in future posts. With that in mind, I am not going to use advanced methods such as co-reference resolution, iterative cleaning techniques, and/or a custom Named-Entity Recognition (NER) model, even though they could potentially yield more characters/interactions. I feel those topics would turn this post into a chaotic mess.

A link to the full Jupyter notebook can be found at the end of the post.

The Procedure

  1. Retrieve/Clean a short story — ‘The Gift of the Magi’
  2. Use Named-Entity Recognition (NER) to extract PERSON entities from the short story.
  3. Find character interactions by recording whenever two distinct PERSON entities are within 15 words from each other.
  4. Use the interactions to build a social network.

The Gift of the Magi

I found ‘The Gift of the Magi’ on a site called Project Gutenberg. The site provides a large number of free e-books for download and is a great place to help facilitate building out custom datasets; some random stumbling around yielded this particular story.

import requests

def get_content(url):
    response = requests.get(url)
    assert response.status_code == 200
    return response.text

url = 'http://www.gutenberg.org/cache/epub/7256/pg7256.txt'
content = get_content(url)
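The raw file from Project Gutenberg includes license boilerplate before and after the story itself, which is what the ‘Clean’ in step 1 refers to. A minimal sketch of that step, assuming the usual ‘*** START OF …’ / ‘*** END OF …’ marker lines are present (marker wording varies between e-books, so treat this as a heuristic rather than the post's exact cleaning code):

```python
def strip_gutenberg_boilerplate(text):
    # Project Gutenberg wraps the story in '*** START OF ...' and
    # '*** END OF ...' marker lines; keep only what sits between them.
    start = text.find('*** START')
    end = text.find('*** END')
    if start != -1:
        start = text.find('\n', start) + 1  # skip the marker line itself
    else:
        start = 0
    if end == -1:
        end = len(text)
    return text[start:end].strip()
```

The later steps could then run on strip_gutenberg_boilerplate(content) instead of the raw download.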

Extracting Characters and Interactions from the Short Story

spaCy is simple and easy to use, yet powerful. This makes it perfectly suited to extracting characters with its built-in NER pipeline.

import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp(content)

Once our document is created, we can extract our characters in just a couple of lines of code.

from collections import defaultdict

characters = defaultdict(int)
people = (ent for ent in doc.ents if ent.label_ == 'PERSON')
for ent in people:
    person = ent.text.strip()
    person_lower = person.lower()
    if 'mme' not in person_lower:
        characters[person] += 1

Below is the final print-out of characters, and you should be able to spot a problem: in most stories, characters go by different names, nicknames, aliases, etc., so a bit of human intervention is often needed. An example here is ‘Mrs. James Dillingham Young’ in the text. The data is screaming at us to stop being lazy. Don’t be like me.

defaultdict(int,
{'Della': 18,
'James Dillingham Young': 2,
'Jim': 26,
'Sheba': 1,
'Solomon': 1,
'Sofronie': 2,
'Madame': 3,
'Babe': 1})
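One low-effort fix is a hand-written alias map that folds variant names into a single canonical name before building the network. A minimal sketch, where the alias map itself is my own guess at reasonable merges rather than anything derived in this post:

```python
from collections import defaultdict

def merge_aliases(characters, aliases):
    # fold each variant name into its canonical form, summing the counts
    merged = defaultdict(int)
    for name, count in characters.items():
        merged[aliases.get(name, name)] += count
    return dict(merged)
```

For example, merge_aliases(characters, {'James Dillingham Young': 'Jim'}) would credit those two mentions to Jim.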

After the characters have been extracted, we can start to find their interactions. I am attaching a sentiment score (Afinn) to each interaction, but I am not going to use it here. Other papers have used sentiment analysis to color the edges in the network to show the type of social interaction between two nodes: more negative interactions could indicate characters who often conflict with each other, whereas more positive interactions could point to potential friendships in the story.

from afinn import Afinn

interactions = []
afinn = Afinn('en')
tokens = doc

for index, token in enumerate(tokens):
    if characters.get(token.text, 0) > 0:
        # look at a window of 15 tokens on either side
        start = max(index - 15, 0)
        end = index + 15

        tokens_close_to = tokens[start:end]
        for close in tokens_close_to:
            if close.text == token.text:
                continue

            if characters.get(close.text, 0) > 0:
                sentence = ' '.join([tk.text for tk in tokens_close_to])
                interactions.append(
                    (token.text, close.text, afinn.score(sentence))
                )
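The sentiment scores recorded above go unused in this post, but if you wanted to color edges by tone later, one way (my sketch, not the paper's method) is to average the score per character pair, sorting each pair so direction doesn't matter:

```python
from collections import defaultdict

def average_sentiment(interactions):
    # interactions are (character_a, character_b, afinn_score) tuples
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a, b, score in interactions:
        pair = tuple(sorted((a, b)))  # (A, B) and (B, A) share one key
        totals[pair] += score
        counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}
```

A pair's average could then be mapped to an edge color, say red for negative and green for positive.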

Social Network Analysis — NetworkX

We can now set up a social network from our interactions. The heavier the line, the more interactions the two characters had. Each interaction is counted twice (once from each character's window), so using a weight of 0.5 per occurrence works around the double counting.

import networkx as nx

G = nx.Graph()
for interaction in interactions:
    n1 = interaction[0]
    n2 = interaction[1]

    if G.has_edge(n1, n2):
        G[n1][n2]['weight'] += .5
    else:
        G.add_edge(n1, n2, weight=.5)

Drawing out the network yields the following.

Social Network — Interactions
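A plot like the one above can be produced with NetworkX's drawing helpers, scaling edge width by the accumulated weight. A minimal sketch; the layout seed, colors, and output filename are my own choices, not necessarily what produced the figure:

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
import networkx as nx

def draw_network(G, path='network.png'):
    pos = nx.spring_layout(G, seed=42)  # deterministic layout
    widths = [G[u][v]['weight'] for u, v in G.edges()]
    nx.draw(G, pos, with_labels=True, width=widths,
            node_color='lightblue', edge_color='gray')
    plt.savefig(path)
    plt.close()
```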

Centrality/Rank Metrics

import pandas as pd

data = {
    '#': dict(G.degree),
    'Degree': nx.degree_centrality(G),
    'Closeness': nx.closeness_centrality(G),
    'Betweenness': nx.betweenness_centrality(G),
    'Pagerank': nx.pagerank(G)
}
pd.DataFrame(data)
Centrality/Rank Metrics

Jupyter Notebook

References

  1. Bonato A., D’Angelo D.R., Elenberg E.R., Gleich D.F., Hou Y. (2016) Mining and Modeling Character Networks. In: Bonato A., Graham F., Prałat P. (eds) Algorithms and Models for the Web Graph. WAW 2016. Lecture Notes in Computer Science, vol 10088. Springer, Cham. https://arxiv.org/pdf/1608.00646.pdf


Slaps Lab

Focused on generating original, compelling, short stories through the use of Artificial Intelligence.