NaNoGenMo 2014 Dev Diary #3: Results

NaNoGenMo is an idea created by Darius Kazemi to parody NaNoWriMo. Instead of writing a novel, developers write programs to generate 50k+ word “novels”. This series of posts will document my participation throughout the month.

With the scaffolding in place that I described in my last post, I was able to substitute real words for the sample words fairly easily. The most significant change I made was implementing a simple cache: there’s no reason to search the graph and calculate the probabilities for the same word more than once.
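As a rough illustration of the idea (not the actual project code), the cache can be as simple as a dictionary keyed by word. The function query_concordances below is a hypothetical stand-in for whatever hits Neo4j and tallies which words can follow a given word.

# Minimal sketch of the caching idea; query_concordances is a placeholder
# for the real Neo4j lookup.
def query_concordances(word):
    return {"GOATS": 3, "CAR": 1}  # placeholder data

_cache = {}

def next_word_probabilities(word):
    """Return {next_word: probability}, computed only once per word."""
    if word not in _cache:
        counts = query_concordances(word)
        total = float(sum(counts.values()))
        _cache[word] = {w: n / total for w, n in counts.items()}
    return _cache[word]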

As I expected, this method resulted in mostly gibberish. Using only word concordances guarantees grammatical correctness only between each pair of adjacent words; there is no guarantee that the sentence as a whole is grammatically correct. Nonetheless, I did end up with some humorous text. Here are just a few of my favorites.

GEORGE: OH THAT’S NEWMAN ALL RIGHT. THE CORNER.
ELAINE: I’M VERY VERY AWKWARD.

ELAINE: SO HE HAD ORGASMS?
GEORGE: WELL MAYBE. HOW? MOVE EDWARD SCISSORHANDS. I’M SPECIAL. NO PROBLEM FOR BREAKFAST. SO LENNY BRUCE USED.

JERRY: FIRST DATE WAS IT WAS.
GEORGE: YOU TELL ME? SHE SAID THANK YOU SLEEPY.

KRAMER: MINIPLEX MULTITHEATER.
JERRY: GIDDYUP.

GEORGE: OH THAT SOUP NAZI.

If I redid this, I would want to find a cleaner data source. My source was filled with typos, inconsistencies, and non-standard punctuation, which led to some difficulties seeding the database. I also think I would count “…” as its own distinct sentence terminator. Finally, a similar approach that would produce less gibberish is using Markov chains instead of simple concordances.
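For the curious, here is a rough sketch of what I mean by the Markov chain approach; this is not code from the project, just an illustration of conditioning each word on the previous two words instead of only the one before it.

import random
from collections import defaultdict

def build_chain(sentences, order=2):
    """Map each order-word prefix to the words observed to follow it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = ["<START>"] * order + sentence.split() + ["<END>"]
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, order=2):
    prefix = ("<START>",) * order
    out = []
    while True:
        word = random.choice(chain[prefix])
        if word == "<END>":
            return " ".join(out)
        out.append(word)
        prefix = prefix[1:] + (word,)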

All of my code, along with the final “Missing Season”, is available on GitHub.

NaNoGenMo 2014 Dev Diary #2: Setting up the template

NaNoGenMo is an idea created by Darius Kazemi to parody NaNoWriMo. Instead of writing a novel, developers write programs to generate 50k+ word “novels”. This series of posts will document my participation throughout the month.

Having loaded sample data into Neo4j, I was able to move on to loading the full Seinfeld scripts. Loading was a bit slow, taking about an hour, though that was with minimal optimization on my part. When finished, the graph contained over 16,000 nodes and almost 500,000 relationships.

Although I know my final product will be nonsense, I want it to have the feel of an actual transcript. To do this, I generated some statistics about the number of lines per script, words per sentence, etc. I discovered a handy function buried in scipy that accepts a list of outcomes and their probabilities and returns random numbers distributed accordingly. For example, here is some code that returns the number of sentences for a given line:

from scipy import stats

# Observed probability that a line contains N sentences, measured from the scripts.
line_lengths = {1: 0.37012475970532455, 2: 0.13680991884813376, 3: 0.05066508405475493, 4: 0.019468706859848535, 5: 0.00896670064121586, 6: 0.004683327097498355, 7: 0.002425524777767743, 8: 0.001535305577416816, 9: 0.0010579416583880582, 10: 0.0003741500986982157, 11: 0.0005289708291940291, 12: 0.0003096414609916268, 13: 0.0001419190029544956, 14: 7.74103652479067e-05, 15: 0.00012901727541317782, 16: 6.450863770658891e-05, 17: 5.160691016527113e-05, 18: 0.00010321382033054225, 19: 5.160691016527113e-05, 20: 3.870518262395335e-05, 21: 1.2901727541317782e-05, 22: 3.870518262395335e-05, 23: 2.5803455082635563e-05, 24: 1.2901727541317782e-05, 26: 1.2901727541317782e-05, 34: 1.2901727541317782e-05, 37: 1.2901727541317782e-05}

# rv_discrete expects its probabilities to sum to 1, so normalize them first.
total = sum(line_lengths.values())
probability = stats.rv_discrete(a=1,
    values=(list(line_lengths.keys()),
            [p / total for p in line_lengths.values()]))

# Draw one sample: the number of sentences for this line.
line_length = probability.rvs(size=1)[0]

Applying these probabilities to a nonsense sample sentence, I was able to generate something like this:

SEASON UNKNOWN: EPISODE 1 — THE PLACEHOLDER
=============================

KRAMER: The goats. The goats? The goats are going!
GEORGE: The goats are going! The goats are going? The.
ELAINE: The goats?
GEORGE: The.
JERRY: The?
GEORGE: The goats are?

The next step will be getting words out of the Neo4j graph instead of the sample sentence.

NaNoGenMo 2014 Dev Diary #1: Concordance with Neo4j

NaNoGenMo is an idea created by Darius Kazemi to parody NaNoWriMo. Instead of writing a novel, developers write programs to generate 50k+ word “novels”. This series of posts will document my participation throughout the month.

Generating an original novel with software is certainly a Hard Problem, but the rules of NaNoGenMo are lax enough that programmers of any level can participate. It also seems to be a perfect opportunity to explore new technologies. In my case, I wanted to experiment with modeling language in a graph database.

Conceptually, it is possible to model all possible sentences of a corpus in a single graph. Words follow one another in sentences, which creates natural links between the words. Consider the corpus “I am hungry. You are hungry. Hungry people eat food.” We can model that corpus in the following manner:
[Image: Concordance graph of the small corpus]
All sentences begin at the node marked “12” and end at the node marked “11”. This shows, for example, that sentences can start with “hungry” or contain “hungry” in the middle. Additionally, “I am” is not a complete sentence in this corpus. This describes the concept of concordances—the ordering of words in a corpus.
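Before reaching for a database, the same structure can be sketched in a few lines of plain Python. This isn’t part of the project, just a way to make the idea concrete:

from collections import defaultdict

corpus = "I am hungry. You are hungry. Hungry people eat food."

# follows[word] is the set of words observed immediately after that word;
# <START> and <END> stand in for the shared start and end nodes.
follows = defaultdict(set)
for sentence in corpus.split("."):
    words = sentence.upper().split()
    if not words:
        continue
    for a, b in zip(["<START>"] + words, words + ["<END>"]):
        follows[a].add(b)

print(follows["HUNGRY"])   # contains 'PEOPLE' and '<END>'
print(follows["<START>"])  # contains 'I', 'YOU', and 'HUNGRY'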

With this idea in mind, I decided that I wanted to create a text generated from a concordance graph. I have already shown that Seinfeld transcripts make for an interesting and amusing corpus, so I will probably use that as my source again. To get my feet wet, I wanted to start with an extremely limited corpus. And what’s better than the intentionally limited Green Eggs and Ham?

I honestly thought a good chunk of this post would be about installing Neo4j, but these two lines did it for me:

brew install neo4j
neo4j start

I first populate the graph with three nodes: statement start, question start, and sentence end.

create (s:Start {type:'statement'});
create (s:Start {type:'question'});
create (e:End);

Next, I populate the graph one sentence at a time. The merge query acts as a “get or create” and is applied to each word. Sentences that end in a question mark start at the “question start” node; all other sentences start at the “statement start” node. Each word in a sentence then has a concordance to the next, with the final word terminating at the “end” node.

Let’s see how this works for the first sentence, “I am Sam”.

//"get or create" I
merge (w:Word {word:'I'}) return w;
//The sentence does not end in a question mark so find the
//"statement start" node, find the "I" node (which now must exist),
//and link them with the "BEGINS" relationship
match (s:Start {type:'statement'}) match (word:Word {word: 'I'}) create (s)-[:BEGINS]->(word);
//"get or create" AM
merge (w:Word {word:'AM'}) return w;
//Find the "I" node and the "AM" node and link them with the "CONCORDANCE" relationship
match (a:Word {word: 'I'}) match (b:Word {word: 'AM'}) create (a)-[:CONCORDANCE]->(b);
//"get or create" SAM
merge (w:Word {word:'SAM'}) return w;
//Find the "AM" node and the "SAM" node and link them with the "CONCORDANCE" relationship
match (a:Word {word: 'AM'}) match (b:Word {word: 'SAM'}) create (a)-[:CONCORDANCE]->(b);
//The sentence ends. Find the "SAM" node and the "end" node and link
//them with a "TERMINATES" relationship
match (word:Word {word:'SAM'}) match (e:End) create (word)-[:TERMINATES]->(e);
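In practice these statements aren’t typed by hand; a small script emits them for each sentence. Here is a rough sketch of such a generator (my actual loader differs, and the quoting is naive, but it mirrors the statements above):

def cypher_for_sentence(sentence):
    # Yield the Cypher statements that load one sentence into the graph.
    is_question = sentence.strip().endswith("?")
    words = [w.upper() for w in sentence.strip().rstrip(".?!").split()]
    start_type = "question" if is_question else "statement"

    yield "merge (w:Word {{word:'{0}'}}) return w;".format(words[0])
    yield ("match (s:Start {{type:'{0}'}}) match (word:Word {{word: '{1}'}}) "
           "create (s)-[:BEGINS]->(word);").format(start_type, words[0])
    for a, b in zip(words, words[1:]):
        yield "merge (w:Word {{word:'{0}'}}) return w;".format(b)
        yield ("match (a:Word {{word: '{0}'}}) match (b:Word {{word: '{1}'}}) "
               "create (a)-[:CONCORDANCE]->(b);").format(a, b)
    yield ("match (word:Word {{word:'{0}'}}) match (e:End) "
           "create (word)-[:TERMINATES]->(e);").format(words[-1])

for statement in cypher_for_sentence("I am Sam."):
    print(statement)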

After repeating this for all of the sentences, a complete graph of the book is available to query. For example, we can find all of the nodes that can start a question:
match (s:Start {type:"question"})-[:BEGINS]->(w) return s, w;
[Image: Graph of question words]
Notice that some of these words are themselves connected. Since these words appear more than once, we can also count the occurrences:
match (s:Start {type:"question"})-[:BEGINS]->(w) return w, count(*);

w       count(*)
IN      2
COULD   3
WOULD   9
DO      1
YOU     2

With this proof of concept in place, my next task is going to be parsing and loading the Seinfeld transcripts into Neo4j.

Prevalence of #occupycentral in Hong Kong Instagrams

Many Hong Kong citizens are currently protesting for democratic reform in downtown Hong Kong. As expected, netizens have taken to social media to spread their message. Instagram in particular was in the press after reports of the Chinese government blocking access to the service. Nonetheless, many users in Hong Kong were still able to post Instagrams. Using the Instagram API, I gathered geocoded Instagrams in Hong Kong tagged “#occupycentral”.
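For those curious about the gathering step, it looked roughly like the sketch below. The endpoint, parameters, and response fields are my recollection of Instagram’s since-retired v1 API, so treat them as assumptions rather than a working recipe.

import requests

# Hypothetical sketch: page through recent media for a tag via the old v1 API.
# ACCESS_TOKEN is a placeholder for a real token.
ACCESS_TOKEN = "..."
url = "https://api.instagram.com/v1/tags/occupycentral/media/recent"
params = {"access_token": ACCESS_TOKEN}

posts = []
while url:
    response = requests.get(url, params=params).json()
    posts.extend(response.get("data", []))
    url = response.get("pagination", {}).get("next_url")  # follow pagination
    params = {}  # next_url already carries the query string

# Keep only the posts that include location data.
geocoded = [p for p in posts if p.get("location")]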

The tag had some rumblings late last week, but really exploded over the past few days.
[Image: Line graph showing the rise of the #occupycentral tag]
Although Instagram users around the world used this tag, I wanted to get a visual of where in Hong Kong the posts were coming from. By far, the majority are clustered around the downtown area, although there are some stragglers further away. Also note the few posts from Victoria Harbor.
[Image: Map of Instagrams tagged #occupycentral]
It is common for Instagram users to tag photos with multiple hashtags. This can increase visibility of posts since users often browse media by tag. Of the Instagrams that were tagged “#occupycentral”, these are the most common other tags. I was surprised to see the popularity of “#umbrellarevolution”, even surpassing the Chinese versions of “#occupycentral”.
[Image: Bar chart of tags used alongside #occupycentral]
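Counting the co-occurring tags is straightforward once the posts are in hand; something along these lines, assuming each post is a dict with a "tags" list as the API returned them:

from collections import Counter

def co_tags(posts, primary="occupycentral", top=10):
    # Count the other hashtags that appear alongside the primary tag.
    counts = Counter()
    for post in posts:
        for tag in post.get("tags", []):
            if tag != primary:
                counts[tag] += 1
    return counts.most_common(top)

# e.g. co_tags(posts) with the posts gathered above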
If you’d like to see some of the actual photos being uploaded, check out this live map from Geofeedia.

One Year of My Workout Data

Penn Jillette would say that there are two kinds of people in the world: skinny fucks and fat fucks. While he places himself in the latter category, I am definitely part of team skinny fuck. Around this time last year I started casually lifting weights. In typical LTD fashion, I also started tracking my weight and workouts.

This chart shows my body weight gain, approximately 10% over the year.
[Image: Body weight line graph]
As for my workouts, I tracked the exercise, amount of weight, and number of reps. I don’t know what the standard is for recording free weights, but I made my recordings “per limb”, so that a bench press of 30lbs means 30lbs per arm. Any days where I skipped a particular exercise were marked as 0lbs.
One thing this graph hides is the number of reps: for example, a transition from 10 reps of 20lbs to 5 reps of 25lbs shows up as a weight increase even though the total weight moved per set went down. This is the same graph, except the y-axis shows the weight multiplied by the number of reps.
I’m still tracking my data and next year I’ll be able to do an update with double the data!