NaNoGenMo is an idea created by Darius Kazemi to parody NaNoWriMo. Instead of writing a novel, developers write programs to generate 50k+ word “novels”. This series of posts will document my participation throughout the month.

Having loaded sample data into Neo4j, I was able to move on to loading the full Seinfeld scripts. It was a bit slow, about an hour, although that was with minimal optimization on my part. When finished, the graph contained over 16,000 nodes and almost 500,000 relationships.

Although I know my final product will be nonsense, I want it to have the feel of an actual transcript. To do this, I generated some statistics about the number of lines per script, words per sentence, etc. I discovered a handy function buried in scipy (stats.rv_discrete) that accepts a list of values along with their probabilities and returns random numbers distributed according to those probabilities. For example, here is some code that returns a number of sentences for a given line:

from scipy import stats

line_lengths = {
    1: 0.37012475970532455,
    2: 0.13680991884813376,
    3: 0.05066508405475493,
    4: 0.019468706859848535,
    5: 0.00896670064121586,
    6: 0.004683327097498355,
    7: 0.002425524777767743,
    8: 0.001535305577416816,
    9: 0.0010579416583880582,
    10: 0.0003741500986982157,
    11: 0.0005289708291940291,
    12: 0.0003096414609916268,
    13: 0.0001419190029544956,
    14: 7.74103652479067e-05,
    15: 0.00012901727541317782,
    16: 6.450863770658891e-05,
    17: 5.160691016527113e-05,
    18: 0.00010321382033054225,
    19: 5.160691016527113e-05,
    20: 3.870518262395335e-05,
    21: 1.2901727541317782e-05,
    22: 3.870518262395335e-05,
    23: 2.5803455082635563e-05,
    24: 1.2901727541317782e-05,
    26: 1.2901727541317782e-05,
    34: 1.2901727541317782e-05,
    37: 1.2901727541317782e-05
}

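# The weights above sum to roughly 0.6 rather than 1, and recent SciPy
# versions reject probabilities that do not sum to 1. Rescaling them here is
# my assumption; it keeps the relative odds of each length unchanged.
lengths = list(line_lengths.keys())
total = sum(line_lengths.values())
weights = [p / total for p in line_lengths.values()]

probability = stats.rv_discrete(a=1, values=(lengths, weights))
line_length = probability.rvs(size=1)[0]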
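
A table like line_lengths can be built by counting how often each value occurs and dividing by the total. The following is only a minimal sketch of that idea, not the code I actually ran: the sample lines, the naive sentence splitter, and the names count_sentences and relative_frequencies are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample lines; the real input would be the parsed scripts.
sample_lines = [
    "The goats. The goats? The goats are going!",
    "The goats are going!",
    "The goats?",
    "The.",
]


def count_sentences(line):
    """Naively count sentences by splitting on runs of '.', '!' and '?'."""
    parts = re.split(r"[.!?]+", line)
    return sum(1 for part in parts if part.strip())


def relative_frequencies(values):
    """Map each distinct value to its share of all observations."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in sorted(counts.items())}


sentence_counts = [count_sentences(line) for line in sample_lines]
line_length_table = relative_frequencies(sentence_counts)
print(line_length_table)
```

Because every observation lands in exactly one bucket, frequencies computed this way add up to 1, which is what rv_discrete expects.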

Applying these probabilities with a nonsense sample sentence, I was able to achieve something like this:

> SEASON UNKNOWN: EPISODE 1 – THE PLACEHOLDER
> =============================
>
> KRAMER: The goats. The goats? The goats are going!
>
> GEORGE: The goats are going! The goats are going? The.
>
> ELAINE: The goats?
>
> GEORGE: The.
>
> JERRY: The?
>
> GEORGE: The goats are?
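
Placeholder dialogue like this can be produced by drawing a sentence count per line, a word count per sentence, and a punctuation mark, then truncating the sample sentence to fit. Here is a rough sketch under several assumptions: the weights are invented rather than taken from the scripts, the helper names are hypothetical, and it uses the standard library's random.choices in place of rv_discrete so that the weights do not have to sum to 1.

```python
import random

SAMPLE_SENTENCE = ["The", "goats", "are", "going"]
CHARACTERS = ["JERRY", "GEORGE", "ELAINE", "KRAMER"]

# Hypothetical weights, NOT measured from the scripts. random.choices
# treats them as relative weights, so they need not sum to 1.
SENTENCES_PER_LINE = {1: 0.6, 2: 0.25, 3: 0.15}
WORDS_PER_SENTENCE = {1: 0.2, 2: 0.3, 3: 0.2, 4: 0.3}
PUNCTUATION = {".": 0.5, "?": 0.3, "!": 0.2}


def weighted_pick(rng, table):
    """Draw one key from table, using its values as relative weights."""
    return rng.choices(list(table.keys()), weights=list(table.values()), k=1)[0]


def make_sentence(rng):
    """Truncate the sample sentence to a random length and punctuate it."""
    word_count = weighted_pick(rng, WORDS_PER_SENTENCE)
    return " ".join(SAMPLE_SENTENCE[:word_count]) + weighted_pick(rng, PUNCTUATION)


def make_line(rng):
    """Build one line of dialogue: a speaker followed by random sentences."""
    speaker = rng.choice(CHARACTERS)
    sentence_count = weighted_pick(rng, SENTENCES_PER_LINE)
    sentences = [make_sentence(rng) for _ in range(sentence_count)]
    return f"{speaker}: {' '.join(sentences)}"


rng = random.Random(0)  # seeded so repeated runs give the same script
script = [make_line(rng) for _ in range(5)]
print("\n".join(script))
```

Swapping SAMPLE_SENTENCE for words pulled from a real source only requires changing make_sentence; the rest of the structure stays the same.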

The next step will be getting words out of the Neo4j graph instead of the sample sentence.