NaNoGenMo 2014 Dev Diary #1: Concordance with Neo4j

NaNoGenMo is an idea created by Darius Kazemi to parody NaNoWriWo. Instead of writing a novel, developers write programs to generate 50k+ word “novels”. This series of posts will document my participation throughout the month.

Generating an original novel with software is certainly a Hard Problem, but the rules of NaNoGenMo are lax enough that programmers of any level can participate. It also seems to be a perfect opportunity to explore new technologies. In my case, I wanted to experiment with modeling language in a graph database.

Conceptually, it is possible to model all possible sentences of a corpus in a single graph. Words follow one another in sentences which creates natural links between the words. Consider the corpus “I am hungry. You are hungry. Hungry people eat food.” We can model that corpus in the following manner:
All sentences begin at the node marked “12” and end at the node marked “11”. This shows, for example, that sentences can start with “hungry” or contain “hungry” in the middle. Additionally, “I am” is not a complete sentence in this corpus. This describes the concept of concordances—the ordering of words in a corpus.

With this idea in mind, I decided that I wanted to create a text generated from a concordance graph. I have already shown that Seinfeld transcripts make for an interesting and amusing corpus, so I will probably use that as my source again. To get my feet wet, I wanted to start a with an extremely limited corpus. And what’s better than the intentionally limited Green Eggs and Ham?

I honestly thought a good chunk of this post would be about installing Neo4j, but these two lines did it for me:
brew install neo4j neo4j start
I first populate the graph with three nodes: statement start, question start, and sentence end.
create (s:Start {type:'statement'}); create (s:Start {type:'question'}); create (e:End);
Next, I populate the graph one sentence at a time. The merge query acts as a “get or create” which is applied to each word. Sentences that end in a question mark start at the “question start” node and other sentences start at the “statement start” node. Each word in a sentence then has a concordance to the next, with the final word terminating at the “end” node.

Let’s see how this works for the first sentence, “I am Sam”.
//"get or create" I merge (w:Word {word:'I'}) return w; //The sentence does not end in a question mark so find the //"statement start" node, find the "I" node (which now must exist), //and link them with the "BEGINS" relationship match (s:Start {type:'statement'}) match (word:Word {word: 'I'}) create (s)-[:BEGINS]->(word); //"get or create" AM merge (w:Word {word:'AM'}) return w; //Find the "I" node and the "AM" node and link them with the "CONCORDANCE" relationship match (a:Word {word: 'I'}) match (b:Word {word: 'AM'}) create (a)-[:CONCORDANCE]->(b); //"get or create" SAM merge (w:Word {word:'SAM'}) return w; //Find the "AM" node and the "SAM" node and link them with the "CONCORDANCE" relationship match (a:Word {word: 'AM'}) match (b:Word {word: 'SAM'}) create (a)-[:CONCORDANCE]->(b); //The sentence ends. Find the "SAM" node and the "end" node and link //them with a "TERMINATES" relationship match (word:Word {word:'SAM'}) match (e:End) create (word)-[:TERMINATES]->(e);
After repeating this for all of the sentences, a complete graph of the book is available to query. For example, we can find all of the nodes that can start a question:
match (s:Start {type:"question"})-[:BEGINS]->(w) return s, w;
Notice that some of these words are themselves connected. Since these words appear more than once, we can also count the occurrences:
match (s:Start {type:"question"})-[:BEGINS]->(w) return w, count(*);

w	count(*)
IN	2
COULD	3
WOULD	9
DO	1
YOU	2

With this proof of concept in place, my next task is going to be parsing and loading the Seinfeld transcripts into Neo4j.

Tagged on: corpus, cypher, graph, language, Neo4j

Leave a Reply Cancel reply