I recently made a post over at Reddit showing the IMDB scores of Seinfeld episodes by season. It got a lot of traffic, but it turned out that the data I was using was incorrect, outdated, or malformed. I also got a lot of questions about how I made the graph. This post should help to answer those questions!
The Data
I got the revised data from IMDB. You can access a clean CSV version here.
Plotting in R with ggplot2
This is the plot we’re going to make:
Here is the code used:
# Read the data into R
# I copied from Excel; you can use read.csv() too
sf <- read.table("clipboard", sep = "\t", header = TRUE)

# Load ggplot2
# (use install.packages("ggplot2") if you don't have it yet)
library(ggplot2)

# OPTIONAL
# Use the Cairo library for anti-aliased images in Windows
library(Cairo)
CairoWin()

# Make the plot
# Use aesthetics to set the axes and season coloration
# Note: By setting color here, stat_smooth will plot separate fits
ggplot(sf, aes(x = seq_len(nrow(sf)), y = Rating, color = factor(Season))) +
  geom_point() +
  # Use any method you like; loess is default but I specified lm
  stat_smooth(method = "lm") +
  # Clean up
  labs(x = "Episode #", y = "Average Rating",
       title = "Seinfeld Episode IMDB Ratings by Season", color = "Season") +
  ylim(c(7, 9))
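If you would rather not go through the clipboard (which only works on Windows), here is a rough sketch of the same plot reading from a CSV file instead. The function name plot_ratings and the file names in the usage lines are placeholders I made up for illustration. It assumes the file has Season and Rating columns with one row per episode in broadcast order, like the data used above.

```r
library(ggplot2)

# Hypothetical helper: read a ratings CSV and build the per-season plot.
# Assumes "Season" and "Rating" columns, one row per episode, in order.
plot_ratings <- function(path) {
  sf <- read.csv(path, header = TRUE)
  sf$Episode <- seq_len(nrow(sf))  # running episode index for the x axis
  ggplot(sf, aes(x = Episode, y = Rating, color = factor(Season))) +
    geom_point() +
    # One linear fit per season, because color is mapped to Season
    stat_smooth(method = "lm") +
    labs(x = "Episode #", y = "Average Rating",
         title = "Seinfeld Episode IMDB Ratings by Season", color = "Season") +
    ylim(c(7, 9))
}

# Usage (file names are placeholders):
# p <- plot_ratings("seinfeld.csv")
# ggsave("seinfeld_ratings.png", p, width = 8, height = 5)
```

Storing the episode index as a column, rather than computing it inside aes(), keeps the data frame self-describing, and ggsave() writes the image to disk without needing the Cairo window.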
The two links you provided for IMDB were 404’d for me. How did you obtain the data from IMDB? I tried looking around on the site and it said that you could access it via FTP; could you possibly show how you got it?
Thanks 🙂
Thanks, I’ll fix the link above. Here’s a link to the IMDB FTP locations.