Ensemble Text Generator (LSTM) that uses content produced by the most influential authors in the r/Siacoin Subreddit Community

Slaps Lab
4 min read · May 5, 2020

This post builds upon previous ones, where we pulled data from Reddit, built multiple social networks (one per month), and computed their centrality metrics. We used those metrics to filter down author submissions and built a simple text generator from that filtered content. All of these actions were meant to create an “informed” text generator.

The purpose of this post is to prove out how this Ensemble method can work. A long-term project of mine is to build different text generation models based on sentiment and use a “choice” method in the Ensemble to weight the models based on the current “mood”. The models in this post share the same vocab, but in the future, it could be interesting to allow for different sets (jargon) and possibly use word embeddings to translate across vocabularies.

In this post, we will be using multiple LSTM models with varying input lengths to predict the next word in our text. We will use a weighted random choice to select the next word from the list of predicted words. This will allow us to specify which models should be held in higher regard, while also allowing us to produce slightly different paths when generating our text.

In total, we will be using 4 models (input lengths listed below; a short sketch of the skip rule follows the list). We will allow each model to attempt the prediction, but if the seeded input text is shorter than the model's input length, the model will skip that round. Once satisfied, it will start to contribute. We are doing this to see how short-, mid- and long-term views react together.

Small: 3
Small-Medium: 5
Medium: 7
Large: 10
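
A minimal sketch of the skip rule (the limits match the list above; the seed length of 6 is just an illustrative value):

limits = {'Small': 3, 'Small-Medium': 5, 'Medium': 7, 'Large': 10}
seed_len = 6

## only models whose input length fits inside the seed can contribute,
active = [name for name, limit in limits.items() if seed_len >= limit]
print(active) ## ['Small', 'Small-Medium'] -> Medium and Large skip this round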

** The links to the Jupyter notebooks are available at the end of the post.

LSTM Model Wrapper:

import numpy as np

class TextGeneratorModel(object):
    def __init__(self, vocab, model, limit, weight):
        self.vocab = vocab
        self.model = model
        self.limit = limit
        self.weight = weight

        self.vocab_size = len(self.vocab)
        self.word_indices = dict(
            (tk, i) for i, tk in enumerate(self.vocab)
        )
        self.indices_word = dict(
            (i, tk) for i, tk in enumerate(self.vocab)
        )

    def get_weight(self):
        return self.weight

    def get_line(self, sentence):
        ## translate words into this model's own indices,
        return np.array([
            self.word_indices[token]
            for token
            in sentence
        ])

    def predict_next(self, sentence):
        ## translate it,
        line = self.get_line(sentence)

        ## keep only the last `limit` tokens: [len - limit: len]
        start = len(sentence) - self.limit
        if start < 0:
            return -1, None

        X_new = np.array([line[start:]])
        predicted_class = self.model.predict_classes(X_new).tolist()

        index = predicted_class[0]
        return index, self.indices_word[index]
  • Wrapper around our LSTM Keras model.
  • If the given sentence is shorter than the model’s input length, (-1, None) is returned so the ensemble can skip it. The class handles all translations between words and indices, in case each model’s vocab is in a different order.
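
A quick usage example with a stand-in model object (the StubModel and vocab below are made up purely for illustration):

class StubModel(object):
    def predict_classes(self, X):
        ## always predicts the first vocab entry,
        return np.array([0])

vocab = ['sia', 'host', 'storage', 'coin']
wrapper = TextGeneratorModel(vocab, StubModel(), 3, 1)

print(wrapper.predict_next(['host', 'storage']))        ## (-1, None), seed too short
print(wrapper.predict_next(['sia', 'host', 'storage'])) ## (0, 'sia')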

Ensemble Method:

class TextGeneratorEnsemble(object):
    def __init__(self, generators):
        self.generators = generators

    def predict_next(self, sentence):
        weights = []
        results = []

        for generator in self.generators:
            predicted_class = generator.predict_next(sentence)
            if predicted_class[0] == -1:
                continue

            results.append(predicted_class)
            weights.append(generator.get_weight())

        if len(results) == 0:
            return -1, []

        ## normalize the weights into probabilities,
        s = np.sum(weights)
        index = np.random.choice(
            list(range(len(results))),
            1,
            p = np.array(weights) / s
        )[0]

        return results[index][1], results
  • We are going to keep this simple and only do a weighted random choice for selection. The method of choice can be changed to give the ensemble a different “personality”.
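
For example, one alternative “personality” (a hypothetical sketch, not part of the original setup) would always trust the highest-weighted contributing model instead of sampling:

class GreedyTextGeneratorEnsemble(TextGeneratorEnsemble):
    def predict_next(self, sentence):
        weights = []
        results = []

        for generator in self.generators:
            predicted_class = generator.predict_next(sentence)
            if predicted_class[0] == -1:
                continue

            results.append(predicted_class)
            weights.append(generator.get_weight())

        if len(results) == 0:
            return -1, []

        ## greedy: no sampling, just the strongest opinion,
        index = int(np.argmax(weights))
        return results[index][1], results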

Generator Method:

class TextGenerator(object):
    def __init__(self, ensemble, sentence):
        self.ensemble = ensemble
        self.sentence = sentence

    def get_next_word(self):
        next_token, results = self.ensemble.predict_next(self.sentence)
        ## the ensemble already returns the word, so just append it,
        self.sentence = np.append(self.sentence, next_token)
        return next_token, results
  • This class keeps track of the state of our text (a list of word tokens) as we retrieve more tokens.
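
A small illustration of the state handling, assuming an ensemble built as in the next section and a hypothetical three-word seed:

text_generator = TextGenerator(ensemble, ['what', 'are', 'my'])
word, _ = text_generator.get_next_word()

## the predicted word was appended, so the sentence grew by one,
print(len(text_generator.sentence)) ## 4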

Full Setup:

import json

from keras.models import load_model

models = [
    { 'type': '_sm', 'limit': 3, 'weight': 1 },
    { 'type': '_md_sm', 'limit': 5, 'weight': 1 },
    { 'type': '_md', 'limit': 7, 'weight': 3 },
    { 'type': '', 'limit': 10, 'weight': 1 },
]

text_generators = []
for model in models:
    key = model['type']

    vocab = []
    vocab_path = f'../data/reddit/models/siacoin_vocab{key}.json'
    with open(vocab_path, 'r') as vocab_input:
        vocab = json.loads(vocab_input.read())

    template = f'../data/reddit/models/siacoin_model{key}.h5'
    siacoin_model = load_model(template)

    siacoin_text_generator = TextGeneratorModel(
        vocab,
        siacoin_model,
        model['limit'],
        model['weight']
    )

    text_generators.append(siacoin_text_generator)

ensemble = TextGeneratorEnsemble(text_generators)
  • We will load up our 4 models, wrap them in our custom ‘TextGeneratorModel’ class and send them into our Ensemble method.
  • We are giving the “Medium” model more of a chance to be picked.
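
With those weights, when all four models contribute, the selection probabilities work out as follows:

weights = [1, 1, 3, 1]
p = np.array(weights) / np.sum(weights)
print(p) ## [0.1667 0.1667 0.5 0.1667] -> 'Medium' is picked half the time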
import pandas as pd

template = '../data/reddit/siacoin_words_dataset_lg.csv'
df = pd.read_csv(template).drop(columns=['target'])
limit = 10

## set random state,
np.random.seed(899)
seed_index = np.random.randint(
    0,
    len(df.index)
)

data_frame = df.iloc[seed_index]
line = data_frame.tolist()[1:limit+1]

## translate the stored indices back into words
## (indices_word is the index -> word mapping for the 'lg' vocab),
line = [ indices_word[tk] for tk in line ]
line ## output,
['what',
 'are',
 'my',
 'current',
 'values',
 'for',
 'host',
 'config',
 'besides',
 'looking']
  • We will start with a random line from the ‘lg’ dataset (input length = 10), translated from indices back into words. We are hoping that the models produce results that allow the text to “randomly walk” about the topic but still make some sense.

Running:

n = 25
text_generator = TextGenerator(ensemble, line)
for _ in range(n):
next_token, choices = text_generator.get_next_word()
print(
next_token, '-', [ tk for i, tk in choices ]
)
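
Afterwards, the full walk can be read back as one string (a small hypothetical convenience, not in the original notebook):

print(' '.join(text_generator.sentence))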

Output:

  • Running the text generator for multiple rounds, until the predicted words start to spread out, we can obtain text that sounds “reasonable”.
## ['Small', 'Small-Medium', 'Medium', 'Large']
into - ['into', 'and', 'into', 'into']
the - ['the', 'the', 'the', 'the']
json - ['json', 'developing', 'json', 'json']
on - ['on', 'on', 'on', 'on']
disk - ['disk', 'disk', 'disk', 'value']
which - ['which', 'which', 'which', 'week']
will - ['are', 'are', 'are', 'will']
i - ['be', 'facilitate', 'not', 'i']
just - ['never', 'be', 'just', 'do']
remove - ['be', 'adjustments', 'remove', 'a']
yet - ['my', 'another', 'this', 'yet']
from - ['that', 'to', 'from', 'is']
my - ['your', 'the', 'my', 'there']
role - ['time', 'computer', 'role', 'of']
and - ['streaming', 'in', 'and', 'what']
let - ['writing', 'still', 'let', 'the']
us - ['is', 'us', 'us', 'siacoin']
talk - ['know', 'talk', 'see', 'to']
for - ['about', 'down', 'to', 'for']
proofs - ['a', 'proofs', 'the', 'their']
of - ['with', 'of', 'of', 'through']
monetizing - ['blown', 'hundreds', 'monetizing', 'the']
account - ['have', 'a', 'coins', 'account']
pricing - ['out', 'and', 'pricing', 'over']
to - ['as', 'it', 'to', 'and']

Jupyter Notebook(s):

Photo by Chris Ried on Unsplash
