NaNoGenMo 2014 Dev Diary #2: Setting up the template
NaNoGenMo is an idea created by Darius Kazemi to parody NaNoWriMo. Instead of writing a novel, developers write programs that generate 50k+ word “novels”. This series of posts documents my participation throughout the month.

Having loaded sample data into Neo4j, I was able to move on to loading the full Seinfeld scripts. The load was a bit slow, taking about an hour, although that was with minimal optimization on my part. When it finished, the graph contained over 16,000 nodes and almost 500,000 relationships.

Although I know my final product will be nonsense, I want it to have the feel of an actual transcript. To do this, I generated some statistics about the number of lines per script, the number of words per sentence, and so on. I then discovered a handy class buried in scipy, `stats.rv_discrete`, that accepts a list of values along with their probabilities and returns random numbers distributed according to those probabilities. For example, this code picks the number of sentences for a given line:
from scipy import stats
line_lengths = {
1: 0.37012475970532455,
2: 0.13680991884813376,
3: 0.05066508405475493,
4: 0.019468706859848535,
5: 0.00896670064121586,
6: 0.004683327097498355,
7: 0.002425524777767743,
8: 0.001535305577416816,
9: 0.0010579416583880582,
10: 0.0003741500986982157,
11: 0.0005289708291940291,
12: 0.0003096414609916268,
13: 0.0001419190029544956,
14: 7.74103652479067e-05,
15: 0.00012901727541317782,
16: 6.450863770658891e-05,
17: 5.160691016527113e-05,
18: 0.00010321382033054225,
19: 5.160691016527113e-05,
20: 3.870518262395335e-05,
21: 1.2901727541317782e-05,
22: 3.870518262395335e-05,
23: 2.5803455082635563e-05,
24: 1.2901727541317782e-05,
26: 1.2901727541317782e-05,
34: 1.2901727541317782e-05,
37: 1.2901727541317782e-05
}
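# rv_discrete requires probabilities that sum to 1, but the frequencies listed
# above sum to only about 0.6, so rescale them before building the distribution.
lengths = list(line_lengths.keys())
total = sum(line_lengths.values())
weights = [p / total for p in line_lengths.values()]
probability = stats.rv_discrete(values=(lengths, weights))
# Draw one sample: the number of sentences for a single line.
line_length = probability.rvs(size=1)[0]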
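The statistics step itself is not shown in this post, so here is a minimal sketch of how a frequency table like `line_lengths` could be derived. It assumes the dialogue is available as a plain list of strings; the `lines` sample, the naive sentence splitter, and the helper names are all hypothetical, not the actual data or code used for the Seinfeld scripts. It also swaps in the standard library's `random.choices` for sampling, which treats its weights as relative and so does not need them to sum to 1.

```python
import random
import re
from collections import Counter

# Hypothetical stand-in for dialogue lines pulled from the scripts.
lines = [
    "The goats are going!",
    "The goats? The goats are going.",
    "The goats.",
    "The goats are going. The goats? The goats!",
]


def count_sentences(line):
    # Naive splitter: a sentence is a run of text ending in ".", "!" or "?".
    return len(re.findall(r"[^.!?]+[.!?]", line))


def relative_frequencies(values):
    # Map each distinct value to its share of all observations.
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def sample(freqs, rng=random):
    # random.choices treats weights as relative, so no normalization is needed.
    return rng.choices(list(freqs), weights=list(freqs.values()), k=1)[0]


sentence_freqs = relative_frequencies(count_sentences(line) for line in lines)
sampled_length = sample(sentence_freqs)
```

The same `relative_frequencies` helper would work for words per sentence or lines per script, given a different counting function.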
Applying these probabilities to a nonsense sample sentence, I was able to produce something like this:

> SEASON UNKNOWN: EPISODE 1 – THE PLACEHOLDER
> =============================
>
> KRAMER: The goats. The goats? The goats are going!
>
> GEORGE: The goats are going! The goats are going? The.
>
> ELAINE: The goats?
>
> GEORGE: The.
>
> JERRY: The?
>
> GEORGE: The goats are?
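Output of this shape can be sketched roughly as follows. This is only an illustration: the weights are made up rather than measured from the scripts, the function names are hypothetical, and it uses the standard library's `random` module instead of scipy to stay self-contained. Each sentence is a prefix of the sample sentence, cut to a randomly chosen word count and given a randomly chosen punctuation mark.

```python
import random

SAMPLE_SENTENCE = "The goats are going".split()
SPEAKERS = ["JERRY", "GEORGE", "ELAINE", "KRAMER"]

# Hypothetical weights, not the statistics measured from the scripts.
SENTENCES_PER_LINE = {1: 0.6, 2: 0.25, 3: 0.15}
WORDS_PER_SENTENCE = {1: 0.2, 2: 0.3, 3: 0.2, 4: 0.3}
PUNCTUATION = {".": 0.5, "?": 0.3, "!": 0.2}


def pick(weights, rng):
    # Draw one key, weighted by its value.
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


def make_sentence(rng):
    # Truncate the sample sentence to a random length and punctuate it.
    n_words = pick(WORDS_PER_SENTENCE, rng)
    return " ".join(SAMPLE_SENTENCE[:n_words]) + pick(PUNCTUATION, rng)


def make_line(rng):
    # One line of dialogue: a speaker followed by one or more sentences.
    speaker = rng.choice(SPEAKERS)
    n_sentences = pick(SENTENCES_PER_LINE, rng)
    sentences = [make_sentence(rng) for _ in range(n_sentences)]
    return f"{speaker}: {' '.join(sentences)}"


def make_script(n_lines, seed=None):
    # A seeded generator makes the placeholder script reproducible.
    rng = random.Random(seed)
    return [make_line(rng) for _ in range(n_lines)]


script = make_script(6, seed=2014)
print("\n".join(script))
```

Keeping every random choice behind `pick` means the placeholder weights could later be replaced by the measured frequencies without touching the rest of the code.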
The next step will be getting words out of the Neo4j graph instead of the sample sentence.