Visual of the Extracted r/Siacoin Social Networks per Month

Extracting Social Networks from the r/Siacoin Subreddit Community

3 min read · Apr 21, 2020

A few years back, I worked on a project to analyze the most influential authors (top=n) in a subreddit. I always wanted to circle back and extract social networks from other subreddit communities, but never had the time or energy. This post documents the extraction process, from start to finish, for the r/Siacoin subreddit.

** All Jupyter Notebook links can be found at the end of the post.

Building a Text Generator based on the most influential authors in the r/Siacoin…

Building an LSTM model using Keras based on the content produced by the most influential authors in the r/Siacoin…

medium.com

Procedure:

Retrieve post/comment data for r/Siacoin.
Build edgelists per month.
Build out a network.

Retrieving the Data:

I leveraged a previous post. The full notebook for this example can be found below.

How to Scrap Reddit using pushshift.io via Python

In early 2018, Reddit made some tweaks to their API that closed a previous method for pulling an entire Subreddit…

medium.com

## single record output
{
  'id': '4fz06l',
  'type': 'submission',
  'post_id': '4fz06l',
  'author': 'deleted',
  'text': '[deleted]',
  'created_at': 1461368505.0
}

Building Edgelists per month:

I decided to group the data by month and post. I wanted to build an edge between two authors who appeared together on the same ‘post_id’ within the same month. Each edge is unique per post, so an author who posted multiple times on the same post still produces only a single edge with each co-author. To accomplish this, I grouped all the records by ‘year’, ‘month’, and ‘post_id’.

from collections import defaultdict

# breakouts[year][month][post_id] -> list of records
breakouts = {}
for record in filtered_dataset:
    a_key = record['year']
    if a_key not in breakouts:
        breakouts[a_key] = {}

    b_key = record['month']
    if b_key not in breakouts[a_key]:
        breakouts[a_key][b_key] = defaultdict(list)

    c_key = record['post_id']
    breakouts[a_key][b_key][c_key].append(record)
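
The grouping above reads 'year' and 'month' from each record, but the sample record only carries 'created_at'. Here is a minimal sketch of how those two fields could be derived. It assumes 'created_at' is a Unix timestamp in seconds and that UTC is the intended timezone. The helper name add_year_month is hypothetical and is not taken from the original notebook.

```python
from datetime import datetime, timezone


def add_year_month(record):
    """Return a copy of the record with 'year' and 'month' derived from 'created_at'."""
    # Assumption: 'created_at' is a Unix timestamp in seconds, read as UTC.
    created = datetime.fromtimestamp(record['created_at'], tz=timezone.utc)
    return {**record, 'year': created.year, 'month': created.month}


sample = {
    'id': '4fz06l',
    'type': 'submission',
    'post_id': '4fz06l',
    'author': 'deleted',
    'text': '[deleted]',
    'created_at': 1461368505.0,
}
enriched = add_year_month(sample)
print(enriched['year'], enriched['month'])
```

Keeping the month as a plain integer (not zero-padded) matches keys such as '2017-9' used later in the post.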

This results in 49 monthly groups (as of this writing, April 2020). Posts with only one unique author were filtered out. All ‘deleted’ authors were also filtered out.
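
The two filters described here could look roughly like the sketch below. The function name filter_records and the set of placeholder author names are assumptions. The sketch also applies the single-author check per 'post_id' without regard to month, which may differ from how the original notebook did it.

```python
from collections import defaultdict

# Assumption: these are the placeholder author names that should be dropped.
DELETED_AUTHORS = {'deleted', '[deleted]'}


def filter_records(records):
    """Drop deleted authors, then drop posts left with fewer than two unique authors."""
    kept = [r for r in records if r['author'] not in DELETED_AUTHORS]

    # Collect the distinct authors seen on each post.
    authors_per_post = defaultdict(set)
    for r in kept:
        authors_per_post[r['post_id']].add(r['author'])

    # A post needs at least two distinct authors to produce an edge.
    return [r for r in kept if len(authors_per_post[r['post_id']]) >= 2]


records = [
    {'post_id': 'p1', 'author': 'alice'},
    {'post_id': 'p1', 'author': 'bob'},
    {'post_id': 'p1', 'author': 'deleted'},
    {'post_id': 'p2', 'author': 'carol'},
    {'post_id': 'p2', 'author': 'carol'},
]
filtered = filter_records(records)
print([(r['post_id'], r['author']) for r in filtered])
```

Removing deleted authors first matters: a post with one real author and one deleted author should not survive the single-author check.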

A unique edge was created by taking the unique authors for a given post and computing every pairwise combination of them.

from itertools import combinations

import numpy as np

edgelist_breakouts = {}
for year_key in breakouts.keys():
    for month_key in breakouts[year_key].keys():
        edgelist = []
        for post_key in breakouts[year_key][month_key].keys():
            posts = breakouts[year_key][month_key][post_key]
            authors = list(
                map(
                    lambda interaction: interaction['author'],
                    posts
                )
            )
            # one edge per unique pair of authors on this post
            for a, b in combinations(np.unique(authors), 2):
                # order each pair case-insensitively so the same two authors always yield the same tuple
                sort = sorted([a, b], key = lambda a: a.lower())
                edgelist.append(
                    (sort[0], sort[1])
                )

        edgelist_breakouts[f'{year_key}-{month_key}'] = edgelist
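
To make the pairing step concrete, here is a small worked example for a single post. It is a simplified sketch that swaps np.unique for a plain set so it runs without NumPy. The helper name unique_edges and the author names are made up for illustration.

```python
from itertools import combinations


def unique_edges(authors):
    """Return one edge per unordered pair of distinct authors on a post."""
    # De-duplicate, then order case-insensitively (ties broken by the raw name)
    # so the same two authors always yield the same tuple.
    unique_authors = sorted(set(authors), key=lambda name: (name.lower(), name))
    return list(combinations(unique_authors, 2))


# 'bob' commented twice but still contributes one edge per co-author.
edges = unique_edges(['bob', 'Alice', 'bob', 'carol'])
print(edges)
# [('Alice', 'bob'), ('Alice', 'carol'), ('bob', 'carol')]
```

A post with k unique authors yields k(k-1)/2 edges, so a few very busy posts can dominate a month's edgelist.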

Building out a Network:

Building a Centrality Metrics based author filter for the r/Siacoin Subreddit Community

Using centrality metrics to filter authors to just the most influential (top=n) to aid in noise reduction.

medium.com

import networkx as nx

G = nx.Graph()

# the weight counts how many posts in the month both authors appeared on
for interaction in edgelist_breakouts['2017-9']:
    n1 = interaction[0]
    n2 = interaction[1]

    if G.has_edge(n1, n2):
        G[n1][n2]['weight'] += 1
    else:
        G.add_edge(n1, n2, weight = 1)
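
The weight on each edge ends up being the number of posts in that month on which both authors appeared. The same tally can be sketched without networkx using collections.Counter, which is handy for checking the graph's weights. The helper name edge_weights and the sample edges are illustrative. The sketch relies on each pair already being stored in a consistent order, as the edgelist step does.

```python
from collections import Counter


def edge_weights(edgelist):
    """Count how many times each (author, author) pair occurs in a month's edgelist."""
    # Assumption: each pair is already stored in a consistent order,
    # so (a, b) and (b, a) never both occur.
    return Counter(edgelist)


# Hypothetical edgelist for one month: Alice and bob shared two posts.
month_edges = [('Alice', 'bob'), ('Alice', 'carol'), ('Alice', 'bob')]
weights = edge_weights(month_edges)
print(weights[('Alice', 'bob')], weights[('Alice', 'carol')])
```

If networkx is available, the tallies could be loaded in one call with G.add_weighted_edges_from((a, b, w) for (a, b), w in weights.items()).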

Jupyter Notebooks

Written by Slaps Lab