Using Markov Chains to Simulate Your Friends

By: Andy Lee

Markov Chains, for those who are unaware, are graphs which can be used to model behaviors based on probabilistic weights between the various states in the graph. Markov Chains are often used in mobile keyboards as part of the predictive text functionality and can produce incredibly accurate results. A simple Markov Chain (which admittedly produces less precise results) can be used to create fun simulations of conversations between friends.
The first step is to acquire some data with which you can generate the MC. Facebook allows users to download an archive of all their data. Included in this archive is a collection of html files containing all of that user's Facebook Messenger conversations. I used BeautifulSoup to parse these conversations in order to organize the data before sending them off to be processed into separate MCs for each participant in the conversation.
The actual implementation of the MC is where the greatest variance in complexity can be found. For my implementation, I opted for an edge list representation implemented with Python's dictionary hashmap. Each word is represented by a Word Node struct which contains useful metadata such as the number of times the word has terminated a sentence and the other weights of the connections to other Word Nodes. Additionally, my graph implementation maintains a distinct list of initial words for faster selection of a starting state. The next states are selected using numpy's random choice method run on each node's edge list. As the sentence grows longer, each word is given an increased likelihood of terminating the sentence, thus preventing any infinitely long sentences.
There are several possible improvements that can be made. Most significantly, but also most difficulty, word pairs could be used instead of single words in order to allow for more context driven generation. Currently, the increasing modifier for terminating probability is a constant value though this could be modified based on the average length of a given participant's sentences. This, however, could be negatively influenced by the extreme frequency of single word messages in any sort of text-based chat. It may be possible to improve the performance of the graph generation by switching to a different graph implementation however the more commonly used adjacency matrix would likely be implemented with a sparse matrix which I found had slower runtimes compared to my current edge list dictionary implementation.
The code for this project is available on my github:

Last Edited: 2018-03-04
View Count: 199