NaNoGenMo (National Novel Generation Month) is an idea created by Darius Kazemi as a parody of NaNoWriMo (National Novel Writing Month). Instead of writing a novel, developers write programs that generate “novels” of 50,000 or more words. This series of posts documents my participation throughout the month.
Having loaded sample data into Neo4j, I was able to move on to loading the full Seinfeld scripts. The load was a bit slow, taking about an hour, though I had done minimal optimization. When it finished, the graph contained over 16,000 nodes and almost 500,000 relationships.
Although I know my final product will be nonsense, I want it to have the feel of an actual transcript. To do this, I generated some statistics about the number of lines per script, words per sentence, and so on. I discovered a handy class buried in scipy, stats.rv_discrete, that accepts a set of values with their probabilities and returns random numbers distributed according to those probabilities. For example, here is some code that returns a number of sentences for a given line:
from scipy import stats

# Fraction of lines with each sentence count, taken from the script statistics.
line_lengths = {
    1: 0.37012475970532455,
    2: 0.13680991884813376,
    3: 0.05066508405475493,
    4: 0.019468706859848535,
    5: 0.00896670064121586,
    6: 0.004683327097498355,
    7: 0.002425524777767743,
    8: 0.001535305577416816,
    9: 0.0010579416583880582,
    10: 0.0003741500986982157,
    11: 0.0005289708291940291,
    12: 0.0003096414609916268,
    13: 0.0001419190029544956,
    14: 7.74103652479067e-05,
    15: 0.00012901727541317782,
    16: 6.450863770658891e-05,
    17: 5.160691016527113e-05,
    18: 0.00010321382033054225,
    19: 5.160691016527113e-05,
    20: 3.870518262395335e-05,
    21: 1.2901727541317782e-05,
    22: 3.870518262395335e-05,
    23: 2.5803455082635563e-05,
    24: 1.2901727541317782e-05,
    26: 1.2901727541317782e-05,
    34: 1.2901727541317782e-05,
    37: 1.2901727541317782e-05,
}

# The fractions listed above do not sum to 1, and recent scipy versions
# reject such input, so rescale them first. This assumes only the
# relative weights matter.
total = sum(line_lengths.values())
probabilities = [p / total for p in line_lengths.values()]

# Discrete distribution over sentence counts.
probability = stats.rv_discrete(a=1, values=(list(line_lengths.keys()), probabilities))

# Draw one sentence count for a single line of dialogue.
line_length = probability.rvs(size=1)[0]
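A table of fractions like the one above could be built from raw counts. The following is a minimal sketch of that step, not the code actually used for this project: the function name `empirical_distribution` and the sample list are hypothetical, and the real input would be the sentence counts extracted from the scripts.

```python
from collections import Counter


def empirical_distribution(observations):
    """Map each distinct observed value to its relative frequency."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {value: count / total for value, count in sorted(counts.items())}


# Hypothetical sentence counts for eight lines of dialogue (not real script data).
sentences_per_line = [1, 1, 2, 1, 3, 1, 2, 1]
distribution = empirical_distribution(sentences_per_line)
print(distribution)  # {1: 0.625, 2: 0.25, 3: 0.125}
```

Because every observation is counted, the resulting fractions sum to 1 and can be handed to a sampler directly.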
Applying these probabilities to a nonsense sample sentence, I was able to produce something like this:
SEASON UNKNOWN: EPISODE 1 — THE PLACEHOLDER
=============================
KRAMER: The goats. The goats? The goats are going!
GEORGE: The goats are going! The goats are going? The.
ELAINE: The goats?
GEORGE: The.
JERRY: The?
GEORGE: The goats are?
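One way output of this shape could be produced is sketched below. It is an illustration under stated assumptions rather than the generator itself: the weights in `SENTENCES_PER_LINE` and `WORDS_PER_SENTENCE` are made up, the helper names are hypothetical, and the standard library's `random.choices` stands in for the scipy sampler so the example has no dependencies.

```python
import random

CHARACTERS = ["JERRY", "GEORGE", "ELAINE", "KRAMER"]
SAMPLE_WORDS = "The goats are going".split()

# Made-up weights for illustration; the real ones would come from the script statistics.
SENTENCES_PER_LINE = {1: 0.6, 2: 0.3, 3: 0.1}
WORDS_PER_SENTENCE = {1: 0.2, 2: 0.3, 3: 0.2, 4: 0.3}


def sample(distribution, rng):
    """Draw one key from a {value: weight} mapping."""
    values = list(distribution)
    weights = list(distribution.values())
    return rng.choices(values, weights=weights, k=1)[0]


def make_sentence(rng):
    """Truncate the sample sentence to a sampled length and add random punctuation."""
    length = sample(WORDS_PER_SENTENCE, rng)
    return " ".join(SAMPLE_WORDS[:length]) + rng.choice(".?!")


def make_line(rng):
    """Build one line of dialogue: a random speaker and a sampled number of sentences."""
    speaker = rng.choice(CHARACTERS)
    count = sample(SENTENCES_PER_LINE, rng)
    return f"{speaker}: " + " ".join(make_sentence(rng) for _ in range(count))


def make_script(line_count, seed=None):
    """Return a list of dialogue lines; a fixed seed makes the result repeatable."""
    rng = random.Random(seed)
    return [make_line(rng) for _ in range(line_count)]


script = make_script(5, seed=42)
print("\n".join(script))
```

Passing a seed keeps runs reproducible while tuning the statistics. Replacing `SAMPLE_WORDS` with words drawn from the graph is the only part that would need to change later.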
The next step will be getting words out of the Neo4j graph instead of the sample sentence.