Programming languages are often described in terms of paradigms or features. This naturally leads to families of closely related languages. By using a clustering algorithm, we can visualize how closely any two languages are related. This dendrogram places languages in a tree that is similar to a family tree. All of the languages are siblings or cousins of the same generation. Languages with closer common ancestors are more paradigmatically similar than those with more distant common ancestors. The size of each name represents the language’s relative popularity.

To produce this, I first gathered the feature information from Wikipedia and language popularity from the TIOBE index, then combined all the data into a single table of features (data available here). In R, I used clara from the cluster package to prepare the feature matrix, dist to compute the distance matrix, hclust to create the clusters, and ape to make the visualization.
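# Read the feature table copied to the clipboard (tab-separated, with a header row)
lang <- read.table("clipboard", sep = "\t", header = TRUE)

library(cluster)
cl_main <- clara(lang[, 3:25], 4)  # Exclude name and popularity
to_cluster <- cl_main$data
# Add some weight to certain paradigms
# (as written, columns 5:11 are overwritten with columns 6:12 multiplied by 5)
to_cluster[, 5:11] <- to_cluster[, 6:12] * 5

# Complete-linkage hierarchical clustering on Euclidean distances
hc <- hclust(dist(to_cluster), method = "complete")
hc$labels <- lang$Language

library(ape)
myph <- as.phylo(hc)
myph$tip.label <- as.character(myph$tip.label)

# Draw a fan-shaped tree, scaling each label by log popularity (minimum size 1)
png(filename = "programming_language_dendo.png", width = 1500, height = 1500,
    units = "px", pointsize = 12, type = "cairo")
plot(myph, type = "fan", cex = pmax(1.0, log(lang$Popularity + .01)),
     label.offset = 0.25, tip.col = "#333333", no.margin = TRUE)
dev.off()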
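To get a feel for what the weighting and clustering steps do, here is a minimal sketch on a made-up table. The four languages (A to D), the three 0/1 paradigm columns, and the choice of which column to weight are all assumptions for illustration, not the real data. It uses only base R, and it multiplies the weighted column in place.

```r
# Toy feature table: languages and 0/1 paradigm flags are invented for illustration.
toy <- data.frame(
  Language   = c("A", "B", "C", "D"),
  imperative = c(1, 1, 0, 0),
  object     = c(1, 1, 1, 0),
  functional = c(0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Drop the name column and keep the names as row labels.
features <- as.matrix(toy[, -1])
rownames(features) <- toy$Language

# Give one paradigm extra weight by scaling its column in place.
weighted <- features
weighted[, "functional"] <- weighted[, "functional"] * 5

# Euclidean distances between languages, then complete-linkage clustering.
d  <- dist(weighted)
hc <- hclust(d, method = "complete")

# Cophenetic distance: the height at which two languages first share an ancestor.
coph <- as.matrix(cophenetic(hc))
```

In this toy table, A and B have identical features and merge at height 0, while C and D differ in one unweighted feature and merge at height 1. The heavily weighted column keeps the two pairs far apart, so `coph["A", "C"]` is much larger than either. Looking up a pair in `coph` is one way to put a number on how close two languages sit in the tree.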
Because the comparison deliberately ignores syntax and common language uses, it is interesting to see families like Java, C++, and C# (which are commonly grouped together) end up separated. We can also see that some common “versus” pairs, like Python and Ruby, cluster very close to one another. What interesting relationships do you find in the tree?