Web Scraping and Corpus Analysis with Python: A Seinfeld Case Study

A few years ago, I discovered the hilarity that is Curb Your Enthusiasm. Good or bad, I see a lot of myself in Larry's character. I kept seeing references to Seinfeld, and of course I know it culturally as a classic sitcom, but I've never seen more than an episode or two. I was probably too young during its first run to get the humor, but by now I feel I'm sarcastic and jaded enough to appreciate the show.

I decided to use Python to scrape scripts from the show and then extract the words for a basic corpus analysis. By comparing the dialogue of Seinfeld episodes to the dialogue of other English-language TV shows, I hoped to see which topics are more common in Seinfeld.

Here are some of the words that are most disproportionately frequent in the Seinfeld corpus compared to general TV dialogue:

  1. Twix
  2. Armoire
  3. Denim
  4. Biologist
  5. Pakistani
  6. Taps
  7. Yearn
  8. Calzone
  9. Wiz
  10. Holistic

If you are interested in doing something like this for yourself, these are the methods I followed. There may be better ways of coding this since my Python-fu isn’t great yet, but it worked for me and I hope others learning Python may be able to glean something from this.

Web Scraping with Python

First you'll need to install the BeautifulSoup library, which makes it easy to parse the HTML structure of a webpage. The website I used for this was Seinology. I chose this site because the URLs for every episode are in the same format, making it simple to loop through them. Here's the code to pull the data.

from bs4 import BeautifulSoup
import urllib2
import sys

sys.setrecursionlimit(2000)

f = open("data1.txt","w") #"w" clears out any previous run

def getStuff(script):
	try:
		#Load the content from the url into a BeautifulSoup object
		url = urllib2.urlopen("http://www.seinology.com/scripts/script-" + script + ".shtml")
		content = url.read()

		soup = BeautifulSoup(content, "html.parser")

		#Navigate DOM and find what I want
		p = soup.find_all("p")
		f.write(str(p[3]))
		print "Done with script " + script
	except (urllib2.URLError, IndexError):
		print "Unable to open script " + script

#Looping through each of the script numbers
lownums = ["01","02","03","04","05","06","07","08","09"]
for num in lownums:
	getStuff(num)

for i in range(10,177):
	getStuff(str(i))

f.close()

Let me explain the last two lines of getStuff in a little more detail. First I looked at the HTML source of the webpage and found that the script contents are inside of an unnamed <p> tag. Furthermore, that <p> is always the 4th <p> in the document. The soup.find_all("p") call loads every <p> element into a list, and f.write(str(p[3])) writes the 4th item of that list into my file. Your specific webpage will of course be different, but the BeautifulSoup documentation is very thorough.
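As an aside, the scraped output keeps the raw <p> tag wrapped around the dialogue. If you'd rather grab just the text, and don't want a BeautifulSoup dependency, the same "take the nth <p>" idea can be sketched in modern Python 3 with the standard library's html.parser. The HTML snippet here is made up and stands in for the real Seinology markup:

```python
from html.parser import HTMLParser

class NthParagraphText(HTMLParser):
    """Collect the text inside the nth <p> tag of a document."""
    def __init__(self, n):
        super().__init__()
        self.n = n           # 1-based index of the <p> we want
        self.seen = 0        # how many <p> tags we have opened so far
        self.inside = False  # are we currently inside the target <p>?
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.seen += 1
            self.inside = (self.seen == self.n)

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)

    def text(self):
        return "".join(self.parts)

# A made-up page with the same shape: the content we want is the 4th <p>
html = "<p>nav</p><p>ads</p><p>header</p><p>JERRY: Hello, Newman.</p>"
parser = NthParagraphText(4)
parser.feed(html)
print(parser.text())  # JERRY: Hello, Newman.
```

This is more work than soup.find_all("p")[3], but it shows what BeautifulSoup is doing for you under the hood.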

Building the Corpora

Now that I have the data, I can start to parse it. The basic idea of this code is to select only the dialogue lines, split the text into individual words, strip punctuation, and count the number of times each word occurs. This is a perfect use for a dictionary structure. (I used this code as a model for my code.)

f = open("data2.txt","r")

#Only look at dialogue lines which always have a colon.
#Not perfect, but good enough.
text = ""
for line in f:
	if ':' in line:
		text = text + line

f.close()

punctuationMarks = ['!',',','.',':','"','?','-',';','(',')','[',']','\\','/']

counts = {}
words = text.lower().split()

for word in words:

	for mark in punctuationMarks:
		if mark in word:
			word = word.replace(mark,"")

	#Skip tokens that were nothing but punctuation
	if word == "":
		continue

	if word in counts:
		counts[word] += 1
	else:
		counts[word] = 1

f = open("counts.txt","w")

for word, count in counts.items():
	f.write(word + "\t" + str(count) + "\n")

f.close()
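Incidentally, the counting step above can be collapsed quite a bit in modern Python 3 using the standard library: collections.Counter handles the tallying and str.translate strips all the punctuation in one pass. This is just a sketch of the same step; the sample sentence is made up.

```python
from collections import Counter

punctuationMarks = '!,.:"?-;()[]\\/'
# str.translate with this table deletes every punctuation character in one pass
table = str.maketrans("", "", punctuationMarks)

def countWords(text):
    """Lower-case the text, strip punctuation, and tally each word."""
    words = text.lower().translate(table).split()
    return Counter(words)

sample = "No soup for you! Come back, one year. No soup!"
counts = countWords(sample)
print(counts.most_common(2))  # 'no' and 'soup' appear twice each
```

Counter also gives you most_common() for free, which saves a sorting step later on.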

Finally I had some corpus data to work with! But what I wanted to see was which words are more common in Seinfeld episodes than in normal dialogue. Serendipitously, Wiktionary actually has an English frequency list based on TV scripts. I copied that list through word #22000 into another file and then wrote a third Python script to compare the two corpora. (I did a bit of cleanup first to remove blanks and some stray HTML, hence the different file names.)

def loadRanks(filename):
	#Split each line into the word and its rank
	ranks = {}
	f = open(filename,"r")
	for line in f:
		pair = line.split()
		ranks[pair[0]] = pair[1]
	f.close()
	return ranks

sranks = loadRanks("sranks.txt")
eranks = loadRanks("eranks.txt")

f = open("compare.txt","w")
f.write("Word\tSRank\tERank\n")
for word, count in sranks.items():
	if word in eranks:
		f.write(word + "\t" + str(count) + "\t" + str(eranks[word]) + "\n")

f.close()

This script ran surprisingly fast, finishing in about 4 seconds on an i5 3.20 GHz processor.
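One thing I glossed over is how to turn compare.txt into the top-ten list at the start of the post. I won't claim this is exactly the scoring I used, but a simple first pass is a rank ratio: divide a word's general-English rank by its Seinfeld rank and sort descending, so words far more prominent in Seinfeld than in ordinary TV dialogue float to the top. The ranks below are made up for illustration.

```python
def rankRatio(sranks, eranks):
    """Score each shared word by eRank / sRank. Big scores mean a word
    is much more prominent in the Seinfeld corpus than in general TV English."""
    scores = {}
    for word, srank in sranks.items():
        if word in eranks:
            scores[word] = float(eranks[word]) / float(srank)
    return sorted(scores, key=scores.get, reverse=True)

# Made-up ranks: position 1 = most frequent word in that corpus
sranks = {"the": 1, "calzone": 120, "phone": 90}
eranks = {"the": 1, "calzone": 18000, "phone": 300}

print(rankRatio(sranks, eranks))  # calzone scores highest: 18000/120 = 150
```

A word like "phone" ends up in the middle: common in Seinfeld, but common everywhere else too, which is exactly the problem discussed in the comments below.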


6 thoughts on "Web Scraping and Corpus Analysis with Python: A Seinfeld Case Study"

  1. nthitz

    > By comparing the dialogue of Seinfeld episodes to the dialogue of other English TV shows, I hoped to see what topics are more common in Seinfeld.

    awesome! your results definitely all are Seinfeld related. and while these topics might be more common in Seinfeld than in other shows, they are not very descriptive of Seinfeld as a whole. I dunno how possible this would be by just using absolute n-grams or similar from Seinfeld

    1. Phillip

      Yeah it’s tricky. For example “phone” and “coffee” are relatively high in the list of Seinfeld words, but those are pretty common words in general so they are also fairly high in the regular English distribution.

  2. Kris

    Just to help with your python-fu:
    punctuationMarks = ['!',',','.',':','"','?','-',';','(',')','[',']','\\','/']
    can be replaced with
    punctuationMarks = list('!,\'.:"?-;()[]\\/')
    This will help reduce the number of times you have to press the single quote sign.

    Serenity now.
