Hibernate: Legacy XML vs. JPA Annotations

This is an abbreviated chapter from my book Java for the Real World. Want more content like this? Click here to get the book! The code for this chapter is available here.

Object-relational mappers attempt to solve these problems by abstracting away all SQL from the developer. Theoretically, given enough information about the objects, it is possible for all SQL to be generated programmatically.

Hibernate was one of the first ORM tools for Java. It was not until a few years later that persistence was standardized with the Java Persistence API (JPA). The JPA provides no implementation; rather, it allows third-party tools to leverage a common API. And it just so happens that Hibernate is the most commonly used third-party implementation. Having said that, it is still possible to use Hibernate *without* using JPA. The sample code for this chapter includes two Hibernate projects: a more modern one that leverages the JPA annotations, and an older-style one that uses XML for its configuration. To further complicate matters, you can use either the JPA interfaces or Hibernate’s native classes to interact with the database. For convenience, the JPA code example uses the JPA interfaces while the XML code example uses Hibernate’s native objects, but it’s possible to mix and match. You should be familiar with all of these options as they are all still popular.

JPA Annotations

The JPA provides a set of annotations that are used to mark up domain classes to describe how they map to the underlying database. In general, classes are annotated @Entity and @Table(name = "myDatabaseTableName"), while properties are annotated @Column(name = "myColumnName"). Additional annotations are added to “special” columns such as ID columns and foreign key columns. Importantly, foreign keys are not represented as integers but rather as the objects themselves. For example, the OrderLineItem class does not have a purchase_id property; it has an Order property. Here are the complete annotated properties of that class.

@Id
@Column
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Integer id;

@ManyToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "purchase_id")
private Order order;

@ManyToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "ingredient_id")
private Ingredient ingredient;

@Column(name = "units")
private Integer units;

Focusing on the two foreign-keyed properties, notice that @JoinColumn is used to signify that the column is used in a SQL join, and that the relationship is described as @ManyToOne since there are many OrderLineItems for each Order. The fetch = FetchType.LAZY instructs Hibernate (or more accurately, any JPA provider you’re using) not to load the actual Order object until it is explicitly asked for. LAZY‘s opposite is FetchType.EAGER, which would load the Order as soon as the OrderLineItem is returned from the database.

It’s not required, but I chose to create a bi-directional relationship between an order and its line items. Here’s how Order is annotated.

@Id
@Column
@GeneratedValue(strategy = GenerationType.IDENTITY)
private int id;

@OneToMany(mappedBy = "order", cascade = CascadeType.ALL)
private List<OrderLineItem> orderLineItems;

@Column(name = "create_dttm")
private Timestamp created = Timestamp.valueOf(LocalDateTime.now());

@Column(name = "total_price")
private BigDecimal totalPrice;

The new annotation here is @OneToMany: there is one Order for many OrderLineItems. The mappedBy attribute tells Hibernate that the order property on OrderLineItem owns the relationship (i.e. it should use setOrder when returning OrderLineItems), and the cascade attribute describes the behavior of child objects when an operation occurs on the parent. Using CascadeType.ALL means updates, inserts, and deletes on an Order will cascade down to its child OrderLineItems.

As you can imagine, there are many annotations to fit the wide variety of relationships in databases. You might consider browsing the JPA JavaDoc to get a feeling for what is possible.

When using JPA, Hibernate is usually configured using a persistence.xml file or programmatically with Java. Neither of these options is proprietary, and therefore either could be used with another JPA provider. For more details, see “Bootstrapping” in the official documentation. Finally, if you use Spring Boot, the configuration is mostly taken care of automatically, with just a few values set in application.yml / application.properties.
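To make that concrete, here is a minimal sketch of a persistence.xml (placed in META-INF on the classpath). The persistence-unit name and the connection settings below are placeholders, not values from the sample project:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<persistence xmlns="http://xmlns.jcp.org/xml/ns/persistence" version="2.1">
    <persistence-unit name="iscream">
        <!-- Hibernate as the JPA provider -->
        <provider>org.hibernate.jpa.HibernatePersistenceProvider</provider>
        <properties>
            <!-- Standard JPA connection properties; any provider understands these -->
            <property name="javax.persistence.jdbc.driver" value="com.mysql.jdbc.Driver"/>
            <property name="javax.persistence.jdbc.url" value="jdbc:mysql://localhost:3306/iscream"/>
            <property name="javax.persistence.jdbc.user" value="store"/>
            <property name="javax.persistence.jdbc.password" value="secret"/>
        </properties>
    </persistence-unit>
</persistence>
```

Because only the provider element names Hibernate, swapping in a different JPA implementation is mostly a matter of changing that one line.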

XML Mappings

The older method of mapping objects for Hibernate was via–you guessed it–XML. You will have one MyClassName.hbm.xml file for each entity class in the application, and each file will need to be on the classpath. Each file lists the properties of the class (e.g. <property ...) and the relationships (e.g. <one-to-many ...). Here are the mappings for Order and OrderLineItem. If you compare them to the JPA-annotated classes, you should notice similar terminology and structure, although some of the vocabulary does not match.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE hibernate-mapping PUBLIC
        "-//Hibernate/Hibernate Mapping DTD//EN"
        "http://www.hibernate.org/dtd/hibernate-mapping-3.0.dtd">

<hibernate-mapping>
    <class name="com.letstalkdata.iscream.domain.Order" table="purchase">
        <id name="id" type="int" column="id">
            <generator class="identity"/>
        </id>
        <property name="created" column="create_dttm" type="timestamp"/>
        <property name="totalPrice" column="total_price" type="big_decimal"/>
        <bag name="orderLineItems"
             table="purchase_line_item"
             cascade="all"
             inverse="true">
            <key column="purchase_id" not-null="true" />
            <one-to-many class="com.letstalkdata.iscream.domain.OrderLineItem"/>
        </bag>
    </class>
</hibernate-mapping>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE hibernate-mapping PUBLIC
        "-//Hibernate/Hibernate Mapping DTD//EN"
        "http://www.hibernate.org/dtd/hibernate-mapping-3.0.dtd">

<hibernate-mapping>
    <class name="com.letstalkdata.iscream.domain.OrderLineItem"
           table="purchase_line_item">
        <id name="id" type="int" column="id">
            <generator class="identity"/>
        </id>
        <many-to-one name="order"
                     class="com.letstalkdata.iscream.domain.Order"
                     lazy="proxy"
                     fetch="join">
            <column name="purchase_id" not-null="true" />
        </many-to-one>
        <many-to-one name="ingredient"
                     class="com.letstalkdata.iscream.domain.Ingredient"
                     lazy="proxy"
                     fetch="join">
            <column name="ingredient_id" not-null="true" />
        </many-to-one>
        <property name="units" column="units" type="int"/>
    </class>
</hibernate-mapping>

When using the mappings, you have to register them with Hibernate via configuration. In most instances, Hibernate is configured programmatically with Java, a hibernate.properties file, or a hibernate.cfg.xml file. For more details on Hibernate configuration, see “Legacy Bootstrapping” in the official documentation.
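As a sketch, a hibernate.cfg.xml that registers the two mapping files above might look like the following. The connection values are placeholders; the mapping resource paths assume the .hbm.xml files sit at the root of the classpath:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE hibernate-configuration PUBLIC
        "-//Hibernate/Hibernate Configuration DTD 3.0//EN"
        "http://www.hibernate.org/dtd/hibernate-configuration-3.0.dtd">

<hibernate-configuration>
    <session-factory>
        <!-- Connection settings (placeholder values) -->
        <property name="hibernate.connection.driver_class">com.mysql.jdbc.Driver</property>
        <property name="hibernate.connection.url">jdbc:mysql://localhost:3306/iscream</property>
        <property name="hibernate.connection.username">store</property>
        <property name="hibernate.connection.password">secret</property>
        <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
        <!-- Register each entity's mapping file -->
        <mapping resource="Order.hbm.xml"/>
        <mapping resource="OrderLineItem.hbm.xml"/>
    </session-factory>
</hibernate-configuration>
```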

Writing Data

JPA

As you can imagine, the OrderService for JPA is going to look considerably different from what we have seen so far. The JPA object that does the heavy lifting is an EntityManager. One way to obtain an EntityManager is through the chain EntityManagerFactoryBuilder > EntityManagerFactory > EntityManager. The EntityManager has three methods that loosely map to “insert”, “update”, and “delete”: persist, merge, and remove. (The difference between persist and merge is actually non-trivial, and this StackOverflow answer does a good job discussing the differences.)

Here’s where the real power of ORMs is revealed: no SQL was written, but using the annotations, the JPA provider is able to create the proper inserts to save the Order and its child OrderLineItems! This means saving an object is usually just one line of code:

package com.letstalkdata.iscream.service;

import com.letstalkdata.iscream.domain.Order;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import javax.persistence.EntityManager;
import javax.transaction.Transactional;

@Service
public class OrderService {

    private EntityManager entityManager;

    @Autowired
    public OrderService(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    @Transactional
    public void save(Order order) {
        entityManager.persist(order);
    }

}
Native Hibernate

If you are not using JPA, the worker class is a Hibernate SessionFactory which can be used to create Sessions. One Hibernate Session is used per unit-of-work. This is deliberately vague–a unit of work is typically larger than a single database round-trip and may contain a few tightly related operations. For example, a unit of work might be one user request in a web application.

In addition to merge and persist, the Hibernate Session also has operations like save, update, and saveOrUpdate. Sadly, the differences between these are also non-trivial. For a brief discussion, see “Hibernate: save, persist, update, merge, saveOrUpdate” on Baeldung.

The code to save an object using native Hibernate is still relatively short, but remember to (auto) close the Session.

public void save(Order order) {
    try(Session session = sessionFactory.openSession()) {
        Transaction tx = session.beginTransaction();
        session.persist(order);
        tx.commit();
    }
}

Reading Data

Using an ORM, it’s possible to say “Save this object!” because the object has annotations, but saying “Get an object!” is not as straightforward. What filters should be applied? Do you want one object or many? Do you want child objects? Etc. Because of these questions, you need to construct a query.

With Hibernate, you have a few options: use a Criteria object, write Hibernate Query Language (HQL), or write native SQL. (In general, you should avoid native SQL, since that really defeats the purpose of an ORM framework. But there are definitely situations where it cannot be (easily) avoided.) Here’s how the three methods compare:

Criteria:

private List<Ingredient> getIngredients(Ingredient.Type type) {
    CriteriaBuilder cb = entityManager.getCriteriaBuilder();
    CriteriaQuery<Ingredient> criteriaQuery =
            cb.createQuery(Ingredient.class);
    Root<Ingredient> ingredient = criteriaQuery.from(Ingredient.class);
    criteriaQuery.where(cb.equal(ingredient.get("type"), type));
    TypedQuery<Ingredient> query = entityManager.createQuery(criteriaQuery);
    return query.getResultList();
}

HQL:

private List<Ingredient> getIngredients(Ingredient.Type type) {
    String hql = "select i from Ingredient i where i.type = :type";
    Query query = entityManager.createQuery(hql);
    query.setParameter("type", type);
    @SuppressWarnings("unchecked")
    List<Ingredient> ingredients =
            (List<Ingredient>) query.getResultList();
    return ingredients;
}

Native SQL:

private List<Ingredient> getIngredients(Ingredient.Type type) {
    String sql = "select * from ingredient where ingredient_type = ?";
    Query query = entityManager.createNativeQuery(sql, Ingredient.class);
    query.setParameter(1, type.name());
    @SuppressWarnings("unchecked")
    List<Ingredient> ingredients =
            (List<Ingredient>) query.getResultList();
    return ingredients;
}

If you are using native Hibernate instead of JPA, the code is largely the same, except that a Query is an org.hibernate.Query rather than a javax.persistence.Query, and it is created from Session.createQuery().

For a more in-depth comparison and other practical advice about the Java ecosystem, check out my book Java for the Real World.

Click here to get Java for the Real World!

Java Build Tools: Ant vs. Maven vs. Gradle


For anything but the most trivial applications, compiling Java from the command line is an exercise in masochism. The difficulty of including dependencies and making executable .jar files is why build tools were created.

For this example, we will be compiling this trivial application:

package com.example.iscream;

import com.example.iscream.service.DailySpecialService;
import java.util.List;

public class Application {
    public static void main(String[] args) {
        System.out.println("Starting store!\n\n==============\n");

        DailySpecialService dailySpecialService = new DailySpecialService();
        List<String> dailySpecials = dailySpecialService.getSpecials();

        System.out.println("Today's specials are:");
        dailySpecials.forEach(s -> System.out.println(" - " + s));
    }
}
package com.example.iscream.service;

import com.google.common.collect.Lists;
import java.util.List;

public class DailySpecialService {

    public List<String> getSpecials() {
        return Lists.newArrayList("Salty Caramel", "Coconut Chip", "Maui Mango");
    }
}

Ant

The program make has been used for over forty years to compile source code into applications. As such, it was the natural choice in Java’s early years. Unfortunately, a lot of the assumptions and conventions with C programs don’t translate well to the Java ecosystem. To make (har) building the Java Tomcat application easier, James Duncan Davidson wrote Ant. Soon, other open source projects started using Ant, and from there it quickly spread throughout the community.

Build files

Ant build files are written in XML and are called build.xml by convention. I know even the word “XML” makes some people shudder, but in small doses it isn’t too painful. I promise. Ant calls the different phases of the build process “targets”. Targets that are defined in the build file can then be invoked using the ant TARGET command where TARGET is the name of the target.

Here’s the complete build file with the defined targets:

<project>

    <path id="classpath">
        <fileset dir="lib" includes="**/*.jar"/>
    </path>

    <target name="clean">
        <delete dir="build"/>
    </target>

    <target name="compile">
        <mkdir dir="build/classes"/>
        <javac srcdir="src/main/java"
               destdir="build/classes"
               classpathref="classpath"/>
    </target>

    <target name="jar">
        <mkdir dir="build/jar"/>
        <jar destfile="build/jar/IScream.jar" basedir="build/classes"/>
    </target>

    <target name="run" depends="jar">
        <java fork="true" classname="com.example.iscream.Application">
            <classpath>
                <path refid="classpath"/>
                <path location="build/jar/IScream.jar"/>
            </classpath>
        </java>
    </target>

</project>

With these targets defined, you can run ant clean, ant compile, ant jar, and ant run to clean, compile, package, and run the application.

Of course, the build file you’re likely to encounter in a real project is going to be much more complex than this example. Ant has dozens of built-in tasks, and it’s possible to define custom tasks too. A typical build might move around files, assemble documentation, run tests, publish build artifacts, etc. If you are lucky and are working on a well-maintained project, the build file should “just work”. If not, you may have to make tweaks for your specific computer. Keep an eye out for .properties files referenced by the build file that may contain configurable filepaths, environments, etc.
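Such a properties file is just key-value pairs read by the build file (e.g. via Ant's <property file="build.properties"/> task). A hypothetical example, with made-up names and paths:

```properties
# build.properties (hypothetical) -- referenced from build.xml
src.dir=src/main/java
lib.dir=lib
build.dir=build
deploy.host=localhost
```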

Summary

While setting up a build script takes some time up front, hopefully you can see the benefit of using one over passing commands manually to Java. Of course, Ant isn’t without its own problems. First, there are few enforced standards in an Ant script. This provides flexibility, but at the cost of every build file being entirely different. In the same way that knowing Java doesn’t mean you can jump into any codebase, knowing Ant doesn’t mean you can jump into any Ant file: you need to take time to understand it. Second, the imperative nature of Ant means build scripts can get very, very long. One example I found is over 2000 lines long! Finally, Ant has no built-in capability for dependency management, although it can be supplemented with Ivy. These limitations, along with some other build script annoyances, led to the creation of Maven in the early 2000s.

Maven

Maven is really two tools in one: a dependency manager and a build tool. Like Ant, it is XML-based, but unlike Ant, it outlines fairly rigid standards. Furthermore, Maven is declarative: you define what your build should do, not how to do it. These advantages make Maven appealing; build files are much more standard across projects, and developers spend less time tailoring them. As such, Maven has become somewhat of a de facto standard in the Java world.

Maven Phases

The most common build phases are included in Maven and can be executed by running mvn PHASE (where PHASE is the phase name). The most common phase you will invoke is install because it will fully build and test the project, then create a build artifact.

Although it isn’t actually a phase, the command mvn clean deserves a mention. Running it will “clean” your local build directory (i.e. target), removing compiled classes, resources, packages, etc. In theory, you should be able to just run mvn install and your build directory will be updated automatically. However, enough developers (including myself) have been burned by this not working that we habitually run mvn clean install to force the project to build from scratch.

Project Object Model (POM) Files

Maven’s build files are called Project Object Model files, usually just abbreviated to POM, and are saved as pom.xml in the root directory of a project. In order for Maven to work out of the box, it’s important to follow this directory structure:

.
├── pom.xml
└── src
    ├── main
    │   ├── java
    │   │    <-- Your Java code goes here
    │   ├── resources
    │   │    <-- Non-code files that your app/library needs
    └── test
        ├── java
        │    <-- Java tests
        ├── resources
        │    <-- Non-code files that your tests need

As mentioned previously, Maven has dependency management built in. The easiest way to find the correct coordinates (groupId, artifactId, and version) for a dependency is from the project's website or the MVNRepository site. For our build, we also need one of Apache's official plugins: the Shade plugin, which is used to build fat .jar files.

Here's the complete POM file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0                       https://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>
   <groupId>com.example</groupId>
   <artifactId>iscream</artifactId>
   <version>0.0.1-SNAPSHOT</version>
   <packaging>jar</packaging>
   <properties>
      <maven.compiler.source>1.8</maven.compiler.source>
      <maven.compiler.target>1.8</maven.compiler.target>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   </properties>
   <dependencies>
      <dependency>
         <groupId>com.google.guava</groupId>
         <artifactId>guava</artifactId>
         <version>21.0</version>
      </dependency>
   </dependencies>
   <build>
      <plugins>
         <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <executions>
               <execution>
                  <phase>package</phase>
                  <goals>
                     <goal>shade</goal>
                  </goals>
                  <configuration>
                     <transformers>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                           <mainClass>com.example.iscream.Application</mainClass>
                        </transformer>
                     </transformers>
                  </configuration>
               </execution>
            </executions>
         </plugin>
      </plugins>
   </build>
</project>

At this point you can run mvn package and you will see the iscream-0.0.1-SNAPSHOT.jar file inside the target folder. Run java -jar target/iscream-0.0.1-SNAPSHOT.jar to launch the application.

Summary

Although Maven has made considerable strides in making builds easier, all Maven users have found themselves banging their head against the wall with a tricky Maven problem at one time or another. I've already mentioned some usability problems with plugins, but there's also the problem of "The Maven Way". Anytime a build deviates from what Maven expects, it can be difficult to put in a work-around. Many projects are "normal...except for that one weird thing we have to do". And the more "weird things" in the build, the harder it can be to bend Maven to your will. Wouldn't it be great if we could combine the flexibility of Ant with the features of Maven? That's exactly what Gradle is trying to do.

Gradle

The first thing you will notice about a Gradle build script is that it is not XML! In fact, Gradle uses a domain specific language (DSL) based on Groovy, which is another programming language that can run on the JVM.

The DSL defines both the core parts of the build file and specific build steps called "tasks". It is also extensible making it very easy to define your own tasks. And of course, Gradle also has a rich third-party plugin library. Let's dive in.

Build files

Gradle build files are appropriately named build.gradle and start out by configuring the build. For our project we need to take advantage of a fat jar plugin, so we will add the Shadow plugin to the build script configuration.

In order for Gradle to download the plugin, it has to look in a repository, which is an index for artifacts. Some repositories are known to Gradle and can be referred to simply as mavenCentral() or jcenter(). The Gradle team decided to not reinvent the wheel when it comes to repositories and instead relies on the existing Maven and Ivy dependency ecosystems.

Tasks

Finally, after Ant's obscure "target" and Maven's confusing "phase", Gradle gave a reasonable name to its build steps: "tasks". We use Gradle's apply to gain access to certain tasks. (The java plugin is built into Gradle, which is why we did not need to declare it in the build's dependencies.)

The java plugin will give you common tasks such as clean, compileJava, test, etc. The shadow plugin will give you the shadowJar task which builds a fat jar. To see a complete list of the available tasks, you can run gradle -q tasks.

Dependency Management

We've already discussed how a build script can rely on a plugin dependency; likewise, the build script can define the dependencies for your project. Here's the complete build file:

buildscript {
    repositories {
        jcenter()
    }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:1.2.4'
    }
}

apply plugin: 'java'
apply plugin: 'com.github.johnrengelman.shadow'

group = 'com.example'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = 1.8
targetCompatibility = 1.8

repositories {
    mavenCentral()
}

dependencies {
    compile group: 'com.google.guava', name: 'guava', version: '21.0'
}

shadowJar {
    baseName = 'iscream'
    manifest {
        attributes 'Main-Class': 'com.example.iscream.Application'
    }
}

Now that the build knows how to find the project's dependencies, we can run gradle shadowJar to create a fat jar that includes the Guava dependency. After it completes, you should see build/libs/iscream-0.0.1-SNAPSHOT-all.jar, which can be run in the usual way (java -jar ...).

Summary

Gradle brings a lot of flexibility and power to the Java build ecosystem. Of course, there is always some danger with highly customizable tools: suddenly you have to be aware of code quality in your build file. This is not necessarily bad, but it is worth considering when evaluating how your team will use the tool. Furthermore, much of Gradle's power comes from third-party plugins. And since it is relatively new, it still sometimes feels like you are using a bunch of plugins developed by SomeRandomPerson. You may find yourself comparing three plugins that ostensibly do the same thing, each with a few dozen GitHub stars and little documentation to boot. Despite these downsides, Gradle is gaining popularity and is particularly appealing to developers who like to have more control over their builds.


Announcing Java for the Real World


When I started my first Java job, I was immediately overwhelmed by my knowledge gaps of the Java ecosystem. I knew how to write decent code and had a good understanding of the Java language, but I had never used Hibernate or “deployed a war to Tomcat” …and what’s a “pom”? I soon found that answers to these questions were less than straightforward. Wading through pages of dense documentation and poorly written tutorials often left me just as confused as when I started.

That’s why I decided to write Java for the Real World. Having lived through the pain of learning the Java ecosystem, I wanted to create a resource for anyone else in their first Java job to quickly become aware of all the ancillary technologies that Java uses. I intentionally do not provide deep tutorials in the book. Not only would that be an insurmountable task (nearly all of these technologies have books of their own), but I have found that companies use the tools in such different ways that tutorials only go so far. Instead, I provide an overview of the most common tools you are likely to encounter, example code to see the technology in action, and suggested resources for more in-depth study.

I’m also trying something new with this book. Instead of waiting to finish it, I am making the in-progress version available today for 50% off the suggested price. If you purchase today, you will get immediate access to everything I have already written and updates as more chapters are completed!

Click here to save 50%!

Markov Chains in Scala

Although Markov chains have uses in machine learning, a more trivial application that pops up from time to time is text generation. Given a sufficiently large corpus, the generated text will usually be unique and comprehensible (at least from sentence to sentence).

The full code and a sample corpus used in this post can be found here.

To store the corpus information, we will use a Map.

import scala.collection.mutable

val MARKOV_MAP:mutable.Map[Seq[String], mutable.Map[String, Int]] = new mutable.HashMap()

This structure maps chains of words to “next” words, according to how frequently those words follow the chain. For example, the corpus “I like apples. I like apples. I like oranges. I hate apples.” could create this structure:

I like -> [(apples -> 2), (oranges -> 1)]
I hate -> [(apples -> 1)]

I say “could” because we can choose a chain size. A larger chain size will produce sentences more similar to those in the corpus, and a smaller chain size will result in more divergence from the corpus.

val CHAIN_SIZE = 2

Having established a chain size, the following function creates the chains from a sentence, and loads the data into the Markov map.

def adjustProbabilities(sentence:String):Unit = {
  val segments = sentence.split(" ").+:("").:+("").sliding(CHAIN_SIZE + 1).toList
  for(segment <- segments) {
    val key = segment.take(CHAIN_SIZE)
    val probs = MARKOV_MAP.getOrElse(key, scala.collection.mutable.Map())
    probs(segment.last) = probs.getOrElse(segment.last, 0) + 1
    MARKOV_MAP(key) = probs
  }
}

The segments line looks a bit intimidating, but all we are doing is splitting the sentence into words, adding an empty start string and an empty terminal string (we’ll see why shortly), and using sliding to process the sentence in chunks. For example, for the sentence “Shall I compare thee to a summer’s day” we get the list [["","Shall","I"],["Shall","I","compare"],["I","compare","thee"],["compare","thee","to"],["thee","to","a"],["to","a","summer's"],["a","summer's","day"],["summer's","day",""]].
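Here is that padding-and-sliding step in isolation, on a shorter sentence so the windows are easy to see:

```scala
// Pad the word list with an empty start and terminal marker,
// then slide a window of CHAIN_SIZE + 1 (here, 3) across it
val words = "" +: "Shall I compare thee".split(" ").toList :+ ""
val segments = words.sliding(3).toList
// segments: List(List("", "Shall", "I"), List("Shall", "I", "compare"),
//                List("I", "compare", "thee"), List("compare", "thee", ""))
```

Each window's first two words become a key in the map and the last word is the observed “next” word.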

In general, we don’t want to consider “Shall” and “shall” as separate words, nor do we want to include commas, etc., so I also created a method to normalize the corpus. You may need to make adjustments for your specific corpus.

def normalize(line: String): String = {
  line.stripLineEnd
    .toLowerCase
    .filterNot("\\.-,\";:&" contains _)
}
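For example, applying normalize (repeated here so the snippet stands alone) to a sample line lowercases it and strips the listed punctuation, while leaving apostrophes intact:

```scala
// Same normalize as above, repeated so this snippet is self-contained
def normalize(line: String): String =
  line.stripLineEnd
    .toLowerCase
    .filterNot("\\.-,\";:&" contains _)

val cleaned = normalize("Shall I compare thee, to a summer's day.")
// cleaned == "shall i compare thee to a summer's day"
```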

Now we can read in a corpus and process it into the Markov map.

import scala.io.Source
val filePath = "/path/to/shakespeare_corpus.txt"

Source.fromFile(filePath).getLines()
  .map(normalize(_))
  .map(s => s.trim)
  .foreach(s => adjustProbabilities(s))

This assumes each line is a sentence. If your corpus has multiple sentences per line, you might use something like this instead:

Source.fromFile(filePath).getLines()
  .flatMap(normalize(_).split("\\."))
  .map(s => s.trim)
  .foreach(s => adjustProbabilities(s))

Now that the map is built, we can work on generating text. We first need to isolate words that start sentences, which we can do by leveraging the empty string inserted earlier.

val startWords = MARKOV_MAP.keys.filter(_.head == "").toList

A primary feature of Markov chains is that they only care about the current state, which in this case is a chain of words. Given a chain of words, we want to randomly select the next word, according to the probabilities established earlier.

import scala.util.Random
val r = new Random()

def nextWord(seed:Seq[String]):String = {
  val possible = MARKOV_MAP.getOrElse(seed, List())
  r.shuffle(possible.flatMap(pair => List.fill(pair._2)(pair._1))).head
}

This is admittedly a little ham-handed and likely not performant for large corpuses, but in my testing there was no noticeable delay. First we expand the list of possible words into a list with duplicates according to their probabilities. For example, [("apple", 2), ("orange", 1), ("pear", 3)] expands to ["apple", "apple", "orange", "pear", "pear", "pear"]. Then we shuffle the list and pull off the first word. This becomes the next word in the sentence.
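The expand-then-shuffle step in isolation, using the example counts above:

```scala
import scala.util.Random

val counts = List(("apple", 2), ("orange", 1), ("pear", 3))
// Duplicate each word according to its count...
val expanded = counts.flatMap { case (word, n) => List.fill(n)(word) }
// expanded == List("apple", "apple", "orange", "pear", "pear", "pear")
// ...then shuffle and take the head: "pear" is picked half the time,
// "apple" a third of the time, "orange" a sixth
val pick = Random.shuffle(expanded).head
```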

Now that we have a method to generate words, we can start with a random starting chain (from startWords) and build the sentence from there. The process knows to stop when it reaches a terminal word, i.e. the empty string.

import scala.collection.mutable.ArrayBuffer
def nextSentence():String = {
  val seed = startWords(r.nextInt(startWords.size))
  val sentence:ArrayBuffer[String] = ArrayBuffer()
  sentence.appendAll(seed)
  while(sentence.last != "") {
    sentence.append(nextWord(sentence.view(sentence.size - CHAIN_SIZE, sentence.size)))
  }
  sentence.view(1, sentence.size - 1).mkString(" ").capitalize + ","
}

Since my sample corpus was Shakespeare’s sonnets, I generated 14 lines:

(0 until 14).map(_ => nextSentence()).mkString("\n")

With a little formatting cleanup…

Oaths of thy beauty being mute,
Unthrifty loveliness why dost thou too and therein dignified,
Ah! If thou survive my wellcontented day,
Betwixt mine eye,
These poor rude lines of thy lusty days,
Devouring time blunt thou the master mistress of my faults thy sweet graces graced be,
Leaving thee living in posterity?
Compared with loss and loss with store,
Proving his beauty new,
Whilst I thy babe chase thee afar behind,
Coral is far more red than her lips red,
Admit impediments love is as a careful housewife runs to catch,
Lascivious grace in whom all ill well shows,
Savage extreme rude cruel not to be.

Feel free to use my corpus or choose your own! Project Gutenberg is a great source.

How to combine Scala pattern matching with regex

Scala’s pattern matching is arguably one of its most powerful features. It is straightforward to use when matching on patterns like x::xs vs. x vs. Nil, but you can also use it to match regular expressions. This short tutorial will show you how to use pattern matching and regex to parse a simple DSL for filtering search results.

The domain of the tutorial is a library system where users can search by author, title, or year. They can also combine filters to narrow the search results. We’ll start by defining some objects to work with.

case class Book(title:String, author:String, year:Int)

val books = List(
  Book("Moby Dick", "Herman Melville", 1851),
  Book("A Tale of Two Cities", "Charles Dickens", 1859),
  Book("Oliver Twist", "Charles Dickens", 1837),
  Book("The Adventures of Tom Sawyer", "Mark Twain", 1876),
  Book("The Left Hand of Darkness", "Ursula Le Guin", 1969),
  Book("Never Let Me Go", "Kazuo Ishiguro", 2005)
)

To filter the books, we need to supply one or more predicates. A predicate is a function that accepts a Book and returns a Boolean. Our goal is to turn something like “author=Charles Dickens” into a predicate. For starters, we need to be able to parse out the user-supplied value “Charles Dickens”. Scala’s Regex supports capture groups, surrounded by parentheses, which can then be extracted as values. The example in the documentation is val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r. You can see there are three groups defined: one each for year, month, and day. Here are the patterns we’ll allow to constrain search results:

val authorEquals = """author=([\w\s]+)""".r
val authorLike   = """author~([\w\s]+)""".r
val titleEquals  = """title=([\w\s]+)""".r
val titleLike    = """title~([\w\s]+)""".r
val yearBefore   = """year<(\d+)""".r
val yearAfter    = """year>(\d+)""".r

Remember that the goal is to return a predicate for each filter. The syntax for an anonymous predicate is (b:Book) => [boolean]. Using our example, we could create a predicate (b:Book) => b.author == "Charles Dickens". To make the function generic, we need to be able to extract the supplied author value from the filter. Using the predefined regular expressions combined with pattern matching, we can do just that.

def parseFilter(filterString:String):Book => Boolean = filterString match {
  case authorEquals(value) => (b:Book) => b.author == value
}

The filterString is passed in and pattern matched against the pre-defined regular expression authorEquals. Since we declared one group in the expression, we can name that group (value) and then use that group as a variable. Here’s the complete function that includes all of the expressions.

def parseFilter(filterString:String):Book => Boolean = filterString match {
  case authorEquals(value) => (b:Book) => b.author == value
  case authorLike(value)   => (b:Book) => b.author.contains(value)
  case titleEquals(value)  => (b:Book) => b.title == value
  case titleLike(value)    => (b:Book) => b.title.contains(value)
  case yearBefore(value)   => (b:Book) => b.year < Integer.valueOf(value)
  case yearAfter(value)    => (b:Book) => b.year > Integer.valueOf(value)
  case _                   => (b:Book) => false
}

The last case catches any filter that doesn’t match a pattern and returns a predicate that rejects every book. The practical result is that an invalid filter returns no search results.
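As a quick sanity check, each parsed filter can be applied directly to a Book. Here is a condensed, self-contained sketch using just two of the cases above plus the catch-all:

```scala
case class Book(title: String, author: String, year: Int)

val authorEquals = """author=([\w\s]+)""".r
val yearAfter    = """year>(\d+)""".r

def parseFilter(filterString: String): Book => Boolean = filterString match {
  case authorEquals(value) => (b: Book) => b.author == value
  case yearAfter(value)    => (b: Book) => b.year > Integer.valueOf(value)
  case _                   => (b: Book) => false  // invalid filter rejects everything
}

val mobyDick = Book("Moby Dick", "Herman Melville", 1851)
parseFilter("author=Herman Melville")(mobyDick)  // true
parseFilter("year>1900")(mobyDick)               // false
parseFilter("gibberish")(mobyDick)               // false: catch-all predicate
```

Note that when a regex is used as a pattern against a String, the whole string must match, so “author=Herman Melville x” would not hit the authorEquals case.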

Finally, we need to be able to check a book against one or more filters. The forall method returns true only if every filter’s predicate matches the given book.

def checkBook(b:Book, filterString:String) = {
  val filters = filterString.split(",").map(s => parseFilter(s))
  filters.forall(_(b))
}

We now have everything in place to filter the books according to our search string. Here are some examples:

books.filter(b => checkBook(b, "author=Charles Dickens"))
res0: List[Book] = List(
    Book(A Tale of Two Cities,Charles Dickens,1859),
    Book(Oliver Twist,Charles Dickens,1837))

books.filter(b => checkBook(b, "author=Charles Dickens,year>1840"))
res1: List[Book] = List(
    Book(A Tale of Two Cities,Charles Dickens,1859))

books.filter(b => checkBook(b, "title~of"))
res2: List[Book] = List(
    Book(A Tale of Two Cities,Charles Dickens,1859),
    Book(The Adventures of Tom Sawyer,Mark Twain,1876),
    Book(The Left Hand of Darkness,Ursula Le Guin,1969))

Try adding some more filters, such as “starts with” or “year equals”, to get more practice with regex matching.
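For instance, a “year equals” filter and a “starts with” filter might look like this (a sketch: the ^ operator for “starts with” is invented syntax, not part of the tutorial’s DSL):

```scala
case class Book(title: String, author: String, year: Int)

val yearEquals      = """year=(\d+)""".r
// "^" as a starts-with operator is a made-up extension of the DSL.
val titleStartsWith = """title\^([\w\s]+)""".r

def parseExtraFilter(filterString: String): Book => Boolean = filterString match {
  case yearEquals(value)      => (b: Book) => b.year == value.toInt
  case titleStartsWith(value) => (b: Book) => b.title.startsWith(value)
  case _                      => (b: Book) => false
}

parseExtraFilter("year=1851")(Book("Moby Dick", "Herman Melville", 1851))  // true
parseExtraFilter("title^The Left")(Book("The Left Hand of Darkness", "Ursula Le Guin", 1969))  // true
```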