Let's Talk Data

Announcing Java for the Real World

When I started my first Java job, I was immediately overwhelmed by my knowledge gaps of the Java ecosystem. I knew how to write decent code and had a good understanding of the Java language, but I had never used Hibernate or “deployed a war to Tomcat” …and what’s a “pom”? I soon found that answers to these questions were less than straightforward. Wading through pages of dense documentation and poorly written tutorials often left me just as confused as when I started.

That’s why I decided to write Java for the Real World. Having lived through the pain of learning the Java ecosystem, I wanted to create a resource for anyone else in their first Java job to quickly become aware of all the ancillary technologies that Java uses. I intentionally do not provide deep tutorials in the book. Not only would that be an insurmountable task (nearly all of these technologies have books of their own), but I have found that companies use the tools in such different ways that tutorials only go so far. Instead, I provide an overview of the most common tools you are likely to encounter, example code to see the technology in action, and suggested resources for more in-depth study.

I’m also trying something new with this book. Instead of waiting to finish it, I am making the in-progress version available today for 50% off the suggested price. If you purchase today, you will get immediate access to everything I have already written and updates as more chapters are completed!

Click here to save 50%!

By Phillip Johnson | April 25, 2017 | Other | No Comments |

Markov Chains in Scala

Although Markov chains have use in machine learning, a more trivial application that pops up from time-to-time is in text generation. Given a sufficiently large enough corpus, the generated text will usually be unique and comprehensible (at least from sentence to sentence).

The full code and a sample corpus used in this post can be found here.

To store the corpus information, we will use a Map.

import scala.collection.mutable

val MARKOV_MAP:mutable.Map[Seq[String], mutable.Map[String, Int]] = new mutable.HashMap()

This structure maps chains of words to “next” words, according to how frequently those words follow the chain. For example, the corpus “I like apples. I like apples. I like oranges. I hate apples.” could create the this structure:

I like -> [(apples -> 2), (oranges -> 1)]
I hate -> [(apples -> 1)]

I say “could” because we can choose a chain size. A larger chain size will produce sentences more similar to those in the corpus, and a smaller chain size will result in more diversion from the corpus.

val CHAIN_SIZE = 2

Having established a chain size, the following function creates the chains from a sentence, and loads the data into the Markov map.

def adjustProbabilities(sentence:String):Unit = {
  val segments = sentence.split(" ").+:("").:+("").sliding(CHAIN_SIZE + 1).toList
  for(segment <- segments) {
    val key = segment.take(CHAIN_SIZE)
    val probs = MARKOV_MAP.getOrElse(key, scala.collection.mutable.Map())
    probs(segment.last) = probs.getOrElse(segment.last, 0) + 1
    MARKOV_MAP(key) = probs
  }
}

Line 2 looks a bit intimidating, but all we are doing is splitting the sentence into words, adding a start empty string and terminal empty string (we’ll see why shortly), and using sliding to process the sentence in chunks. For example, the sentence “Shall I compare thee to a summer’s day” we get the list [["","Shall","I"],["Shall","I","compare"],["I","compare","thee"],["compare","thee","to"],["thee","to","a"],["to","a","summer's"],["a","summer's","day"],["summer's","day",""]].

In general, we don’t want to consider “Shall” and “shall” as separate words include commas, etc. so I also created a method to normalize the corpus. You may need to make adjustments for your specific corpus.

def normalize(line: String): String = {
  line.stripLineEnd
    .toLowerCase
    .filterNot("\\.-,\";:&" contains _)
}

Now we can read in a corpus and process it into the Markov map.

import scala.io.Source
val filePath = "/path/to/shakespeare_corpus.txt"

Source.fromFile(filePath).getLines()
  .map(normalize(_))
  .map(s => s.trim)
  .foreach(s => adjustProbabilities(s))

This assumes each line is a sentence. If your corpus has multiple sentences per line, you might use something like this instead:

Source.fromFile(filePath).getLines()
  .flatMap(normalize(_).split("\\."))
  .map(s => s.trim)
  .foreach(s => adjustProbabilities(s))

Now that the map is built, we can work on generating text. We first need to isolate words that start sentences, which we can do by leveraging the empty string inserted earlier.

val startWords = MARKOV_MAP.keys.filter(_.head == "").toList

A primary feature of Markov chains is that they only care about the current state, which in this case is a chain of words. Given a chain of words, we want to randomly select the next word, according to the probabilities established earlier.

import scala.util.Random
val r = new Random()

def nextWord(seed:Seq[String]):String = {
  val possible = MARKOV_MAP.getOrElse(seed, List())
  r.shuffle(possible.flatMap(pair => List.fill(pair._2)(pair._1))).head
}

This is admittedly a little ham-handed and likely not performant for large corpuses, but in my testing there was no noticeable delay. First we expand the list of possible words into a list with duplicates according to their probabilities. For example [("apple", 2), ("orange", 1), ("pear", 3)] expands to ["apple", "apple", "orange", "pear", "pear", "pear"]. Then we shuffle the list, and pull off the first word. This becomes the next word in the sentence.

Now that we have a method to generate words, we can start with a random starting word (startWords) and build the sentence from there. The process knows to stop when it reaches a terminal word, i.e. a word empty string.

import scala.collection.mutable.ArrayBuffer
def nextSentence():String = {
  val seed = startWords(r.nextInt(startWords.size))
  val sentence:ArrayBuffer[String] = ArrayBuffer()
  sentence.appendAll(seed)
  while(sentence.last != "") {
    sentence.append(nextWord(sentence.view(sentence.size - CHAIN_SIZE, sentence.size)))
  }
  sentence.view(1, sentence.size - 1).mkString(" ").capitalize + ","
}

Since my sample corpus was Shakespeare’s sonnets, I generated 14 lines:

(0 until 14).map(_ => nextSentence()).mkString("\n")

With a little formatting cleanup…

Oaths of thy beauty being mute,
Unthrifty loveliness why dost thou too and therein dignified,
Ah! If thou survive my wellcontented day,
Betwixt mine eye,
These poor rude lines of thy lusty days,
Devouring time blunt thou the master mistress of my faults thy sweet graces graced be,
Leaving thee living in posterity?
Compared with loss and loss with store,
Proving his beauty new,
Whilst I thy babe chase thee afar behind,
Coral is far more red than her lips red,
Admit impediments love is as a careful housewife runs to catch,
Lascivious grace in whom all ill well shows,
Savage extreme rude cruel not to be.

Feel free to use my corpus or choose your own! Project Gutenberg is a great source.

By Phillip Johnson | June 1, 2016 | Probability, Programming | No Comments |

How to combine Scala pattern matching with regex

Scala’s pattern matching is arguably one of its most powerful features and is straight-forward to use when matching on patterns like x::xs vs. x vs. Nil, but you can also use it to match regular expressions. This short tutorial will show you how to use pattern matching and regex to parse a simple DSL for filtering search results.

The domain of the tutorial is a library system where users can search by author, title, or year. They can also combine filters to make the search results more narrow. We’ll start by defining some objects to work with.

case class Book(title:String, author:String, year:Int)

val books = List(
  Book("Moby Dick", "Herman Melville", 1851),
  Book("A Tale of Two Cities", "Charles Dickens", 1859),
  Book("Oliver Twist", "Charles Dickens", 1837),
  Book("The Adventures of Tom Sawyer", "Mark Twain", 1876),
  Book("The Left Hand of Darkness", "Ursula Le Guin", 1969),
  Book("Never Let Me Go", "Kazuo Ishiguro", 2005)
)

To filter the books, we need to supply one or more predicates. A predicate is a function that accepts a Book and returns a Boolean. Our goal is to turn something like “author=Charles Dickens” into a predicate. For starters, we need to be able to parse out user-supplied value “Charles Dickens”. Scala’s regex compiler allows for groups to be surrounded by parentheses which can then be extracted as values. The example in the documentation is val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r. You can see there are three groups defined: one each for year, month, and day. Here are the patterns we’ll allow to constrain search results:

val authorEquals = """author=([\w\s]+)""".r
val authorLike   = """author~([\w\s]+)""".r
val titleEquals  = """title=([\w\s]+)""".r
val titleLike    = """title~([\w\s]+)""".r
val yearBefore   = """year<(\d+)""".r
val yearAfter    = """year>(\d+)""".r

Remember that the goal is to return a predicate for each filter. The syntax for an anonymous predicate is (b:Book) => [boolean]. Using our example, we could create a predicate (b:Book) => b.author == "Charles Dickens". To make the function generic, we need to be able to extract the supplied author value from the filter. Using the predefined regular expressions combined with pattern matching, we can do just that.

def parseFilter(filterString:String):Book => Boolean = filterString match {
  case authorEquals(value) => (b:Book) => b.author == value
}

The filterString is passed in and pattern matched against the pre-defined regular expression authorEquals. Since we declared one group in the expression, we can name that group (value) and then use that group as a variable. Here’s the complete function that includes all of the expressions.

def parseFilter(filterString:String):Book => Boolean = filterString match {
  case authorEquals(value) => (b:Book) => b.author == value
  case authorLike(value)   => (b:Book) => b.author.contains(value)
  case titleEquals(value)  => (b:Book) => b.title == value
  case titleLike(value)    => (b:Book) => b.title.contains(value)
  case yearBefore(value)   => (b:Book) => b.year < Integer.valueOf(value)
  case yearAfter(value)    => (b:Book) => b.year > Integer.valueOf(value)
  case _                   => (b:Book) => false
}

The last case catches any filter that doesn’t match a pattern and returns a predicate that does not match any book. The functional result being that an invalid filter returns no search results.

Finally, we need to be able to check a book against one or more filters. The forall method is true only if all of the filters match the given book.

def checkBook(b:Book, filterString:String) = {
  val filters = filterString.split(",").map(s => parseFilter(s))
  filters.forall(_(b))
}

We now have everything in place to filter the books according to our search string. Here are some examples:

books.filter(b => checkBook(b, "author=Charles Dickens"))
res0: List[Book] = List(
    Book(A Tale of Two Cities,Charles Dickens,1859),
    Book(Oliver Twist,Charles Dickens,1837))

books.filter(b => checkBook(b, "author=Charles Dickens,year>1840"))
res1: List[Book] = List(
    Book(A Tale of Two Cities,Charles Dickens,1859))

books.filter(b => checkBook(b, "title~of"))
res2: List[Book] = List(
    Book(A Tale of Two Cities,Charles Dickens,1859),
    Book(The Adventures of Tom Sawyer,Mark Twain,1876),
    Book(The Left Hand of Darkness,Ursula Le Guin,1969))

Try to add some more filters such as “starts with” or “year equals” to get practice working with regex matching.

By Phillip Johnson | January 19, 2016 | Programming | 1 Comment |

How to Add Integration Tests to a Play Framework Application Using Scala

If you are new to the Play framework and want to learn more about how Play tests are set up, or if you are new to the idea of HTTP integration testing, I encourage you to check out the tutorial I recently wrote for the Semaphore Community.

In the tutorial, I walk you through creating a simple JSON API using a library as the domain example. Then, I show you how to add tests to that API that make actual HTTP calls.

Here’s a preview of the tutorial contents

Introduction
Prerequisites
Setting Up the Application

Create a New Project
Add routes
Add Controllers
Adding Models and a Repository
Implementing the Controllers

Adding Integration Tests

Testing the Application Index
Books Controller Tests
Customer Controller Tests

Conclusion

Click here to read the full tutorial!

By Phillip Johnson | September 3, 2015 | Programming | No Comments |

Introducing Scalinear 0.1: A Simple Linear Algebra Library for Scala

Earlier this month I was dabbling with WebGL in Scala.js and found myself running into a wall with the matrix math. Since Scala provides no native support for matrices and many WebGL projects leverage Sylvester.js, I was stuck trying to find a comparable Scala library from which I could poach code. (Scala.js interprets Scala source code).

The best I found was Breeze, which does indeed provide a lot of good support for matrices and vectors. Unfortunately, it is a large library and does a lot more than I needed. Pulling down the necessary code was impractical. So I found myself building my own simple library to tide me over. The result was Scalinear: a simple, no-frills library for matrix and vector support in Scala.

Vectors

Vectors are created simply:

scala> val v = Vector(2,4,6,8)
v: com.letstalkdata.scalinear.Vector[Int] = [2, 4, 6, 8]

They also support arithmetic in an intuitive manner:

scala> v / 2
res1: com.letstalkdata.scalinear.Vector[Int] = [1, 2, 3, 4]

scala> val u = Vector(3,3,3,3)
u: com.letstalkdata.scalinear.Vector[Int] = [3, 3, 3, 3]

scala> v + u
res2: com.letstalkdata.scalinear.Vector[Int] = [5, 7, 9, 11]

Matrices

Matrices are syntactically similar to Vertices:

scala> val A = Matrix(Vector(1,2), Vector(3,4), Vector(5,6))
A: com.letstalkdata.scalinear.Matrix[Int] =
[1, 2]
[3, 4]
[5, 6]

scala> val B = Matrix(Vector(1,2,3), Vector(4,5,6))
B: com.letstalkdata.scalinear.Matrix[Int] =
[1, 2, 3]
[4, 5, 6]

scala> A * B
res0: com.letstalkdata.scalinear.Matrix[Int] =
[9, 12, 15]
[19, 26, 33]
[29, 40, 51]

By design, the library is very limited. It’s definitely not intended as a replacement for something like Breeze. Instead, it’s a simple solution when only basic linear algebra support is needed. Having said that, I’d love to know if there are missing functions that would make your life easier. Please take a look at the project and feel free to leave feedback below.

By Phillip Johnson | July 29, 2015 | Uncategorized | No Comments |