I recently made a post on Reddit showing the IMDB scores of Seinfeld episodes by season. It got a lot of traffic, but it turned out that the data I was using was incorrect, outdated, and malformed in places. I also got a lot of questions about how I made the graph. This post should help answer those questions!

The Data

I got the revised data from IMDB. You can access a clean CSV version here.
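If you download the CSV rather than copying from a spreadsheet, something like the following should work. This is only a sketch: the helper name read_ratings and the file name in the usage comment are placeholders I made up, and it assumes the file has Season and Rating columns like the data frame used in the plot below.

```r
# Hypothetical helper (not part of the original workflow):
# load the ratings CSV and check for the columns the plot needs.
read_ratings <- function(path) {
  sf <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
  needed <- c("Season", "Rating")  # assumed column names
  missing_cols <- setdiff(needed, names(sf))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  sf
}

# Usage (the file name is a placeholder for wherever you saved the CSV):
# sf <- read_ratings("seinfeld_ratings.csv")
```

Failing early on a missing column gives a clearer error than letting ggplot complain about an unknown variable later.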

Plotting in R with ggplot2

This is the plot we’re going to make:

[Figure: Episode ratings by season]

Here is the code used:

#Read the data into R
#I copied from Excel, you can use read.csv() too
sf<-read.table("clipboard",sep="\t",header=TRUE)

#Load ggplot2
#(use install.packages("ggplot2") if you don't have it yet)
library(ggplot2)

#OPTIONAL
#Use the Cairo library for anti-aliased images in Windows
library(Cairo)
CairoWin()

#Make the plot
#Use aesthetics to set the axes and season coloration
#Note: By setting color here, stat_smooth will plot a separate fit for each season
ggplot(sf,aes(x=seq_len(nrow(sf)),y=Rating,color=factor(Season)))+
    geom_point()+
    #Use any method you like, loess is default but I specified lm
    stat_smooth(method="lm")+
    #Clean up
    labs(x="Episode #",y="Average Rating",title="Seinfeld Episode IMDB Ratings by Season",color="Season")+
    ylim(c(7,9))
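
To make the note about color and stat_smooth concrete, here is a rough sketch that wraps the plot in a function with a switch between one fit per season and a single overall fit. The function name plot_ratings and the Episode column are my own additions for illustration, not part of the code above. It also uses coord_cartesian() instead of ylim(): ylim() drops any points outside the range before the lines are fitted, while coord_cartesian() only zooms the view.

```r
library(ggplot2)

# Hypothetical helper, assuming sf has Season and Rating columns
plot_ratings <- function(sf, per_season = TRUE) {
  # Running episode number across all seasons (illustrative column)
  sf$Episode <- seq_len(nrow(sf))
  if (per_season) {
    # Color mapped at the top level is inherited by stat_smooth,
    # so the data is grouped by season and each season gets its own fit
    p <- ggplot(sf, aes(x = Episode, y = Rating, color = factor(Season))) +
      geom_point()
  } else {
    # Color mapped only on the points: stat_smooth sees a single group
    # and draws one fit through all episodes
    p <- ggplot(sf, aes(x = Episode, y = Rating)) +
      geom_point(aes(color = factor(Season)))
  }
  p +
    stat_smooth(method = "lm") +
    labs(x = "Episode #", y = "Average Rating", color = "Season") +
    # Zoom the y axis without removing points from the fit
    coord_cartesian(ylim = c(7, 9))
}

# Usage:
# plot_ratings(sf)                      # one fit per season
# plot_ratings(sf, per_season = FALSE)  # one overall fit
```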