Review: Coursera Functional Programming Principles in Scala

Functional programming is one of those generic “scary” programming phrases like “concurrency” or “cryptography.” We know it’s probably useful, but we can usually get by without it. While I vaguely knew a little about Haskell, I wasn’t really sure what functional programming was before I took Coursera’s Functional Programming Principles in Scala. But, I felt obligated as a programmer to set aside my fears and jump in.

I noticed two things about the class immediately:

  1. The instructor is Martin Odersky, the author of the Scala programming language.
  2. The class is only seven weeks long.

As it turned out, these ended up being two of my favorite things about the class.

Although Scala is used, the class isn’t really about Scala. Rather, Scala is a tool to learn about functional programming. Nonetheless, it was exciting to get Odersky’s perspective on functional programming and his approach to implementing it via Scala. Being such a big name in computer science, there is an additional certainty that the instructor knows what he is talking about–something that you can’t always be sure of in MOOCs.

The duration of the class was shorter than the other Coursera classes I’ve taken, but I found it refreshing. Sometimes, by the 8th or 9th week I’m feeling bored or annoyed with a class. Seven weeks went by very quickly but still managed to be long enough to get an intro to the main concepts of functional programming.

The only part of this class that I found occasionally frustrating was the weekly homework assignments. In general, they are well thought-out and guided, but I definitely got stuck a few times. It would have been nice to see “correct” implementations after receiving a grade. For example, there are many situations when a for-loop, fold, or map would all work correctly, but I was unsure of what implementation would be best. There’s also the issue of googling “how to X in Scala” only to get four or five different possibilities. Upon implementing one of them, there’s always the nagging feeling that Odersky has something else in mind.

Despite this frustration, I still highly recommend the class. If you are interested in functional programming (or even if you want to get a taste of Scala), this is a fun class that won’t take up too much of your time.

Parallel File Reading: Python vs Java

Given a set of files, I wanted to see how Python and Java would perform in both single- and multi- threaded environments. As a simple task, I chose to just count up the number of bytes in a given file by manually iterating over the bytes. Essentially–an intentionally non-optimal method of calculating the file size. Java is usually faster than Python, but I was surprised to see that for this task, Python significantly faster.Java vs Python file IO performanceMy test for this was to read approximately 185MB worth of data spread across 18 files on my 2012 MacBook Pro Intel i7 (2.9GHz). Both programs performed approximately 40% better when utilizing multiple threads and Python was overall about 70% faster.

In Java, we can use a SimpleFileVisitor to walk the directory tree, and ask an ExecutorService to execute a list of Callables.

import java.util.concurrent.*;
import java.util.List;
import java.util.ArrayList;
import java.io.IOException;
import java.io.File;
import java.io.FileReader;
import java.io.FileInputStream;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class FileProc {
	
	public static void main(String[] args) throws Exception {
		List<FileProcessor> todo = new ArrayList<>();
		Path rootDir = Paths.get("/Users/phillip/Dropbox/eBooks");
		
		Files.walkFileTree(rootDir, new SimpleFileVisitor<Path>() {
		   @Override
		   public FileVisitResult visitFile(Path path, BasicFileAttributes attr) throws IOException {
		      String location = path.toString();
			  todo.add(new FileProcessor(new File(location)));
		      return FileVisitResult.CONTINUE;
		   }
		});

		//Choose one:
		//ExecutorService executor = Executors.newSingleThreadExecutor();
		//ExecutorService executor = Executors.newFixedThreadPool(8);
		List<Future<FileDetails>> futures = executor.invokeAll(todo);
		for(Future<FileDetails> future : futures) {
			System.out.println(future.get());
		}
		executor.shutdown();
	}
}

class FileProcessor implements Callable<FileDetails> {
	private File file;

	public FileProcessor(File file){
		this.file = file;
	}

	public FileDetails call() throws IOException {
		String fileName = file.getName();
		Path path = Paths.get(file.getPath());
		int fileSize = getSizeManually();
		return new FileDetails(fileName, fileSize);
	}

	private int getSizeManually() throws IOException {
		int sum = 0;
		try(FileInputStream fis = new FileInputStream(file)) {
			while(fis.read() != -1){
				sum++;
			}
		}

		return sum;
	}
}

class FileDetails {
	private String fileName;
	private int fileSize;

	public FileDetails(String fileName, int fileSize) {
		this.fileName = fileName;
		this.fileSize = fileSize;
	}

	@Override
	public String toString(){
		return fileName + " .... " + fileSize + " bytes";
	}
}

In Python, the code is much more concise. We use os.walk to walk the directory tree and map a function to a (thread) Pool:

from multiprocessing import Pool
import os

def main():
    with Pool(processes=8) as pool:
        print(pool.map(process_file, get_files()))

def get_files():
    files = []
    root = "/Users/phillip/Dropbox/eBooks"
    for (dirpath, dirnames, filenames) in os.walk(root):
        for f in filenames:
            files.append(dirpath + "/" + f)

    return files

def process_file(file):
    with open(file,'rb') as to_process:
        sum = 0
        byte = to_process.read(1)
        while byte:
            byte = to_process.read(1)
            sum += 1

    return {"fileName":to_process.name,"fileSize":sum}

if __name__=="__main__":
    main()

I don’t know much about the Python internals, so let me know if you have any insight as to why it was so much faster than the Java code!

Distribution of U.S. Infant Birth Weights

I recently attended a talk by the co-author of this paper, Dr. James Collins. He shared a figure from this paper in his presentation that showed American black mothers give birth to lower weight babies than both American white mothers and foreign-born black mothers.

Since the data is from the early 1990s, I wanted to find some more recent data and create a similar plot. The CDC releases vital statistics information for all U.S. births and I used the most recent data set (2012). In 2012 there were over 3.5 million births, 3,637,278 of which I included in this plot.

Density plot of U.S. infant birth weightsIt may seem like these distributions are similar, but the differences between them are quite significant. Tukey-adjusted p-values are as follows:

Category
diff
lower
upper
adjusted p
Foreign Black-American Black95.6761214.08538177.26680.0165018
White-American Black223.57396221.66927225.47870.0000000
White-Foreign Black127.8978546.32161209.47410.0006971

It is unfortunate that the U.S. has so many more per-capita infant deaths than other comparable nations. While the problem is not isolated to one race, black women are significantly more likely to experience an infant death than white women. If you’re interested in learning more about current theories about the issue, this NPR article offers a good summary.

Review: Coursera Machine Learning

Although not as much of a buzzword as “Big Data”, “Machine Learning” is definitely en vogue. To be honest, the phrase sounds much more exotic than the actual technique! Nonetheless, the area of artificial intelligence interests me so I decided to take this class in order to learn more about the use cases of machine learning algorithms. Unfortunately, I finished the class feeling like I had not learned much at all.

Don’t take this the wrong way–I was certainly exposed to a lot of material. In fact, perhaps the best aspect of this class was the breadth of the course content. Many major algorithms are explored and several fundamentals of machine learning are covered. However, I found it very difficult to retain this information and was forgetting even simple concepts from week to week.

It can be difficult to engage students via an online video, but sadly this class felt like it wasn’t even trying. The segments of the videos where the professor is visible look and sound like they were recorded in a closet using a webcam from the early 2000s. The rest of the videos are essentially powerpoint presentations with some annotations. I should mention that this course is taught by the founder of Coursera, Andrew Ng. Therefore, I would have thought that Coursera would want to make the Machine Learning class a “flagship” class for the website. Instead it seems like this was likely one of the first Coursera classes and no one has thought to go back and revise the content.

The assignments for the class consisted almost entirely of algorithm implementation. While this is likely a philosophical decision of the course designers, I don’t really see the benefit. When I work with machine learning algorithms such as k-means clustering, I have no reason to think I can write a better implementation than professionals can. In almost all cases, it’s better to use a library for complex algorithms than it is to roll your own. I would have preferred the assignments to focus more on using the algorithms instead of implementing them.

By design, MOOCs require students to be more self-motivated and dedicated than traditional classrooms. I’m sure that if I had put in more effort into the quizzes and homework assignments, I would have retained more information. However, the quality of the instructional videos and the uninspired content made taking Machine Learning feel like a chore. While I like to see myself as a lifelong learner, this class gave me a bad case of senioritis: “How long until I graduate from lifelong learning?”

Programming language similarity

Programming languages are often described in terms of paradigms or features. This naturally leads to families of closely-related languages. By using a clustering algorithm, we can visualize how closely any two languages are related. This dendrogram places languages in a tree that is similar to a family tree. All of the languages are siblings or cousins of the same generation. Languages with closer common ancestors are more paradigmatically similar than those with more distant common ancestors. The size of the name represents the language’s relatively popularity.Fan dengrogram showing programming language paradigm similarityTo produce this, I first gathered the feature information from Wikipedia, language popularity from the Tiobe index, and modified all the data into a single table of features (data available here). In R, I used the clara library to create a distance matrix, hclust to create the clusters, and ape to make the visualization.

lang<-read.table("clipboard",sep="\t",header=T)
library(cluster)
cl_main <- clara(lang[,3:25],4) #Exclude name and popularity
to_cluster <- cl_main$data
to_cluster[,5:11] <- to_cluster[,6:12]*5 #Add some weight to certain paradigms
hc<-hclust(dist(to_cluster),method="complete")
hc$labels <- lang$Language
library(ape)
myph<-as.phylo(hc)
myph$tip.label <- as.character(myph$tip.label)
png(filename="programming_language_dendo.png",
    width=1500,
    height=1500,
    units="px",
    pointsize=12,
    type="cairo")
plot(myph, type="fan",
     cex = pmax(1.0,log(lang$Popularity+.01)),
     label.offset = 0.25,
     tip.col="#333333",
     no.margin=T)
dev.off()

Since the comparison explicitly does not look at syntax or common language uses, it is interesting to see language families like Java-C++-C# (which are commonly grouped together) separated. We can also see some common “versus” pairs like Python and Ruby cluster very closely to one another. What interesting relationships do you find in the tree?