Markov Chains in Scala

Although Markov chains have uses in machine learning, a simpler application that pops up from time to time is text generation. Given a sufficiently large corpus, the generated text will usually be unique and comprehensible (at least from sentence to sentence).

The full code and a sample corpus used in this post can be found here.

To store the corpus information, we will use a Map.

import scala.collection.mutable

val MARKOV_MAP:mutable.Map[Seq[String], mutable.Map[String, Int]] = new mutable.HashMap()

This structure maps chains of words to “next” words, according to how frequently those words follow the chain. For example, the corpus “I like apples. I like apples. I like oranges. I hate apples.” could create this structure:

I like -> [(apples -> 2), (oranges -> 1)]
I hate -> [(apples -> 1)]
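
In terms of the MARKOV_MAP declared above, that toy example corresponds to something like this (written out by hand here for illustration; we will build the real map from a corpus shortly):

mutable.Map(
  Seq("I", "like") -> mutable.Map("apples" -> 2, "oranges" -> 1),
  Seq("I", "hate") -> mutable.Map("apples" -> 1)
)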

I say “could” because we can choose a chain size. A larger chain size will produce sentences more similar to those in the corpus, and a smaller chain size will result in more diversion from the corpus.
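
For instance, with a chain size of 1 the same toy corpus collapses to single-word keys (a sketch, in the same notation as above):

I -> [(like -> 3), (hate -> 1)]
like -> [(apples -> 2), (oranges -> 1)]
hate -> [(apples -> 1)]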

val CHAIN_SIZE = 2

Having established a chain size, the following function creates the chains from a sentence, and loads the data into the Markov map.

def adjustProbabilities(sentence:String):Unit = {
  // Wrap the sentence in start/terminal markers, then slide a window of CHAIN_SIZE + 1 words over it
  val segments = sentence.split(" ").+:("").:+("").sliding(CHAIN_SIZE + 1).toList
  for(segment <- segments) {
    // The first CHAIN_SIZE words are the key; the final word is the "next" word whose count we bump
    val key = segment.take(CHAIN_SIZE)
    val probs = MARKOV_MAP.getOrElse(key, mutable.Map[String, Int]())
    probs(segment.last) = probs.getOrElse(segment.last, 0) + 1
    MARKOV_MAP(key) = probs
  }
}

The line that builds segments looks a bit intimidating, but all we are doing is splitting the sentence into words, adding an empty string at the start and another at the end (we’ll see why shortly), and using sliding to process the sentence in chunks. For example, for the sentence “Shall I compare thee to a summer’s day” we get the list [["","Shall","I"],["Shall","I","compare"],["I","compare","thee"],["compare","thee","to"],["thee","to","a"],["to","a","summer's"],["a","summer's","day"],["summer's","day",""]].
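
If you want to see that windowing step on its own, a quick check (with CHAIN_SIZE = 2 as above) looks like this:

val windows = "Shall I compare thee to a summer's day"
  .split(" ").+:("").:+("")
  .sliding(CHAIN_SIZE + 1)
  .toList
// windows.head is Array("", "Shall", "I"); windows.last is Array("summer's", "day", "")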

In general, we don’t want to treat “Shall” and “shall” as separate words, or let commas and other punctuation stick to words, so I also created a method to normalize the corpus. You may need to make adjustments for your specific corpus.

def normalize(line: String): String = {
  // Lowercase and strip punctuation so that "Shall"/"shall" and "apples,"/"apples" collapse to one token
  line.stripLineEnd
    .toLowerCase
    .filterNot("\\.-,\";:&" contains _)
}
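
As a quick sanity check, here is what it does to one line from the sonnets (my own example, not from the original post):

normalize("Devouring Time, blunt thou the lion's paws,")
// => "devouring time blunt thou the lion's paws"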

Now we can read in a corpus and process it into the Markov map.

import scala.io.Source
val filePath = "/path/to/shakespeare_corpus.txt"

Source.fromFile(filePath).getLines()
  .map(normalize(_))
  .map(s => s.trim)
  .foreach(s => adjustProbabilities(s))

This assumes each line is a sentence. If your corpus has multiple sentences per line, you might use something like this instead:

Source.fromFile(filePath).getLines()
  .flatMap(_.split("\\."))
  .map(normalize(_))
  .map(s => s.trim)
  .foreach(s => adjustProbabilities(s))

Now that the map is built, we can work on generating text. We first need to isolate the chains that start sentences, which we can do by leveraging the empty string inserted earlier.

val startWords = MARKOV_MAP.keys.filter(_.head == "").toList

A primary feature of Markov chains is that they only care about the current state, which in this case is a chain of words. Given a chain of words, we want to randomly select the next word, according to the probabilities established earlier.

import scala.util.Random
val r = new Random()

def nextWord(seed:Seq[String]):String = {
  val possible = MARKOV_MAP.getOrElse(seed, List())
  // Expand each (word, count) pair into count copies of the word, shuffle, and take the first
  r.shuffle(possible.flatMap(pair => List.fill(pair._2)(pair._1))).head
}

This is admittedly a little ham-handed and likely not performant for large corpuses, but in my testing there was no noticeable delay. First we expand the list of possible words into a list with duplicates according to their probabilities. For example [("apple", 2), ("orange", 1), ("pear", 3)] expands to ["apple", "apple", "orange", "pear", "pear", "pear"]. Then we shuffle the list, and pull off the first word. This becomes the next word in the sentence.
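
If the expand-and-shuffle approach ever did become a bottleneck, one alternative is to walk the counts and pick by cumulative weight instead of materializing the duplicates. A rough sketch (nextWordWeighted is my own name, not part of the original code):

def nextWordWeighted(seed:Seq[String]):String = {
  val possible = MARKOV_MAP(seed).toList
  // Roll a number in [0, total count) and walk the list until the running count passes it
  var roll = r.nextInt(possible.map(_._2).sum)
  possible.find { case (_, count) =>
    roll -= count
    roll < 0
  }.get._1
}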

Now that we have a method to generate words, we can pick a random starting chain (from startWords) and build the sentence from there. The process knows to stop when it reaches a terminal word, i.e. the empty string.

import scala.collection.mutable.ArrayBuffer
def nextSentence():String = {
  // Pick a random starting chain (each begins with the start marker "")
  val seed = startWords(r.nextInt(startWords.size))
  val sentence:ArrayBuffer[String] = ArrayBuffer()
  sentence.appendAll(seed)
  // Keep appending words until we hit the terminal empty string
  while(sentence.last != "") {
    sentence.append(nextWord(sentence.takeRight(CHAIN_SIZE)))
  }
  // Drop the start and terminal markers before joining
  sentence.slice(1, sentence.size - 1).mkString(" ").capitalize + ","
}

Since my sample corpus was Shakespeare’s sonnets, I generated 14 lines:

(0 until 14).map(_ => nextSentence()).mkString("\n")

With a little formatting cleanup…

Oaths of thy beauty being mute,
Unthrifty loveliness why dost thou too and therein dignified,
Ah! If thou survive my wellcontented day,
Betwixt mine eye,
These poor rude lines of thy lusty days,
Devouring time blunt thou the master mistress of my faults thy sweet graces graced be,
Leaving thee living in posterity?
Compared with loss and loss with store,
Proving his beauty new,
Whilst I thy babe chase thee afar behind,
Coral is far more red than her lips red,
Admit impediments love is as a careful housewife runs to catch,
Lascivious grace in whom all ill well shows,
Savage extreme rude cruel not to be.

Feel free to use my corpus or choose your own! Project Gutenberg is a great source.
