Falling Cost of Memory: 1957 to Present

On a recent episode of 99% Invisible (highly recommended), they briefly mentioned that people fixate on particular design elements, noticing them far more than the average person does; for example, anthropomorphized teeth in dentists’ offices. I think my design fixation is college professor websites from the 1990s.

You can tell that the university created some obscure directory structure when it first got the internet, because the sites are usually at an address like mycollege.edu/~staff/prof/cs/esmith/home.html. The pages always have a white background with minimal HTML formatting: lists, links, and if you’re lucky, header tags. The pages are often in a weird limbo where everything looks outdated and yet you see “Winter 2014 Teaching Schedule”. You’ll find the typical pages such as “CV”, “Classes”, and “Publications”, but the best pages are the “personal” ones: scanned family photos taken in the 1980s, lists of hobbies, clubs started years ago, and of course, cats. I love these webpages and spend way too much time browsing when I find one.

One such site is John C. McCallum’s homepage, which has a slew of content only weird people like me appreciate (although he does have his own domain). On his site he has a list and graph of historical computer memory prices. In honor of weird professor sites, I recreated my own graph of his data (adjusted for inflation). Web designs may change, but good data is always in style.

Computer memory price over time scatterplot

UNESCO World Heritage Site Distribution

The United Nations has designated several hundred sites throughout the world as World Heritage Sites, places judged to have “special cultural or physical significance.” However, these sites are not evenly distributed across the globe. In fact, when we control for the area of a country, we see that Europe has disproportionately more heritage sites than other regions of the world:

A map showing the distribution of UNESCO sites.

This is further clarified if we plot the same data and display the UNESCO region of each point:

Dotplot of UNESCO Site Distribution

What are the causes of this Eurocentrism? An article in Geographical magazine suggests a few possibilities. European nations joined the U.N. earlier and thus have had more time to apply for World Heritage Sites. Or perhaps the idea of a “site” that shows heritage is itself Eurocentric: many cultures may consider traditions and art more culturally significant than buildings. Another possibility is simply that these countries have more money [PDF]. Whatever the reason, it may be wise to consider the UNESCO site list as merely a sample of the world’s cultures.

Implementing a Naive Hash Table in Java

Suppose you have a set of documents that you want to be able to retrieve easily on demand. A back-end searching algorithm could take in a string of text and then find the document whose text matches the input. But what if you had thousands or millions of documents? This process would be both resource intensive and, more importantly, slow. A common alternative is to use a “hash.” In computer science, a hash is basically just a “stand-in” for another piece of data. As an example, the contents of the U.S. Constitution hashes into 62c23ae3519bf455cc32a4d4be608c8e when using the MD5 algorithm. Now, instead of the computer program saying “Go find the document whose contents are ‘We the People of the United States, in Order to form a more perfect Union…’”, it says, “Go find the document whose hash is 62c23ae3519bf455cc32a4d4be608c8e.” This is the basic premise of a hash table: data is hashed into “keys” which are used to access the pre-hashed data (also referred to as “values”).
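
To make the idea concrete, here is a minimal sketch of computing an MD5 hash in Java with the standard MessageDigest API (the class name and example string are my own; you would need the full document text to reproduce the hash above):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Demo {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Any text works here; hashing the Constitution's full text would yield the value above
        String text = "We the People of the United States, in Order to form a more perfect Union";
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
        // Format the 16-byte digest as the familiar 32-character hex string
        System.out.println(String.format("%032x", new BigInteger(1, digest)));
    }
}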

In Java, the most common hash-based collection is HashSet (itself backed by a HashMap). It uses each object’s hashCode() method, which programmers typically override for their own classes, to create integer keys for the lookup table. I was curious how easy it would be to get my own hash table to perform similarly to HashSet.

First I created my simple data object, a “Person” with a first name, last name, and age. I used my IDE to generate the hashCode() method, but it’s a pretty standard one:

@Override
public int hashCode() {
    // Combine the fields using the conventional prime multiplier 31
    int result = firstName.hashCode();
    result = 31 * result + lastName.hashCode();
    result = 31 * result + age;
    return result;
}
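
The post doesn’t show the matching equals() method, but the hash table below relies on it: hashing only narrows the search to a slot, and equals() decides actual matches. A minimal version consistent with the hashCode() above would look something like this:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Person)) return false;
    Person other = (Person) o;
    // Compare the same fields used by hashCode(), per the equals/hashCode contract
    return age == other.age
            && firstName.equals(other.firstName)
            && lastName.equals(other.lastName);
}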

The back end of my hash table is a 2D array. The reason for two dimensions is that two objects can have the same hash code. A well-designed hash algorithm makes this uncommon, but it can’t make it impossible. When two objects “collide” with the same hash code, I store each of them in the second dimension of the array.
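
The field and constructor aren’t shown in the post; a minimal sketch, with a fixed bucket size and a class name of my own choosing, might look like this (the power-of-two capacity matters for the bit mask used below):

// Assumed names: the original post doesn't show this part of the class
private static final int BUCKET_SIZE = 4;  // max collisions stored per slot
private Object[][] hashTable;

public NaiveHashSet(int capacity) {
    // capacity must be a power of two for the bit-mask slot lookup below
    hashTable = new Object[capacity][BUCKET_SIZE];
}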

So finding the correct place in the table requires two steps. First we find the correct slot (the first dimension). This is done using the method suggested on Wikipedia: applying a bit mask to the hash code of the object. (This works because the table length is a power of two, so the mask keeps only the low-order bits of the hash code and always produces a valid index.)

private int findSlot(Object o, Object[][] home) {
    // Keep only the low-order bits; valid because home.length is a power of two
    return o.hashCode() & (home.length - 1);
}

After finding the correct slot, the object is simply stored in the next available place at that slot.
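
The insertion code isn’t shown in the post either; a hedged sketch of what add() might look like, given the structure above, is:

public boolean add(Object o) {
    if (o == null) {
        return false;
    }
    int slot = findSlot(o, hashTable);
    Object[] bucket = hashTable[slot];
    for (int i = 0; i < bucket.length; i++) {
        if (bucket[i] == null) {
            bucket[i] = o;  // the next available place at this slot
            return true;
        }
        if (o.equals(bucket[i])) {
            return false;   // already present; a set holds no duplicates
        }
    }
    return false;           // bucket full; a real implementation would resize here
}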

To retrieve an object, we essentially apply the same steps: find the correct slot for the object, then check all of the objects at that slot for equality.

public boolean contains(Object o) {
    if (o == null) {
        return false;
    }
    int slot = findSlot(o, hashTable);
    Object[] objects = hashTable[slot];
    // Scan the bucket; unused entries are null and never match
    for (Object object : objects) {
        if (o.equals(object)) {
            return true;
        }
    }
    return false;
}

To profile my hash table for performance, I compared it against two controls: 1) a HashSet and 2) an ArrayList (which is analogous to the slow method of checking every object for equality). In each scenario I created one million Person objects, stored them in the collection, then called the contains() method to search the collection for a specific Person. This lookup was repeated 10,000 times for each collection.
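
The exact harness isn’t shown in the post; a minimal sketch of the timing loop for the two controls (the Person constructor arguments and class name are assumptions, and the custom table would be timed the same way) could be:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LookupBenchmark {
    public static void main(String[] args) {
        Set<Person> hashSet = new HashSet<>();
        List<Person> arrayList = new ArrayList<>();
        // Assumed Person constructor: (firstName, lastName, age)
        for (int i = 0; i < 1_000_000; i++) {
            Person p = new Person("First" + i, "Last" + i, i % 100);
            hashSet.add(p);
            arrayList.add(p);
        }
        Person target = new Person("First500000", "Last500000", 500000 % 100);

        // Crude wall-clock timing; a real benchmark would use a harness
        // like JMH to guard against JIT and dead-code elimination effects
        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            hashSet.contains(target);
        }
        System.out.println("HashSet:   " + (System.nanoTime() - start) / 10_000 + " ns/lookup");

        start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            arrayList.contains(target);
        }
        System.out.println("ArrayList: " + (System.nanoTime() - start) / 10_000 + " ns/lookup");
    }
}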

As expected, both hashing collections blew ArrayList out of the water. On average, they were three orders of magnitude faster than ArrayList. Both had similar lookup time distributions:

Distribution plot of search times using two hashing collections

After bootstrapping the data sets to tease out mean lookup times, I was surprised to find my hash table was faster.

Hash Table Mean Distribution

My implementation could be faster for any number of reasons: it is incomplete, not thoroughly tested, not memory efficient, etc. These are all reasons why you should never actually use your own hash set implementation in a program! Java’s core libraries are designed by experts and have years of testing and performance tweaking behind them. However, it was interesting to find that implementing a hash table is relatively straightforward and comparable in performance to the “gold standard” HashSet.

The code for my hash set is available here.

Bourbon: Best Bang for your Buck

A friend and I went to a local whiskey bar recently and were a bit overwhelmed by the options. We’re both bourbon fans, though, and managed to pick a few at random to try. But what makes the difference between a $4.00 glass of bourbon and a $7.00 glass? Probably a combination of factors, but I wanted to see if ratings had any bearing on price.

This graph shows 187 bourbons from Proof66, comparing the price per bottle to the professional judges’ rating.

According to the website, the judges consider any rated liquor worth drinking. Still, it is interesting to see which bourbons are overpriced and which are bargains. Note that I had to use a log scale for price because a few bourbons are significantly more expensive than the rest.

The site also provides ratings from regular users, which unsurprisingly do not always align with the judges’ ratings. I normalized the two rating systems and then picked the ten most controversial bourbons in each direction. The bourbons at the top are preferred by judges; those at the bottom are preferred by users.

Dot plot of controversial bourbon ratings

Finally, I wanted to see how bourbons stacked up against other types of whiskey. This chart shows that they actually rank fairly highly; the only non-single-malt category that beats bourbon is Tennessee whiskey (which I did not know was a formal legal category!).

Boxplot of whiskeys by rating

Surprised by any ratings? What’s your liquor of choice? Let me know in the comments or make some charts about what you like to drink!

Review: Udacity Tales from the Genome (BIO110)

Until recently, Udacity specialized in technical courses. Late last year, however, they released Tales from the Genome, an introductory course on genetics. I was a bit skeptical of how the material would be presented, but I was interested enough in the subject to give it a shot.

The first thing I noticed about this course was the variety of video content. Although some of the lectures are in the traditional Udacity style of whiteboard instruction, this course includes “in-person” lectures with instructors Matt and Lauren, interviews with professionals in the field, and discussions with everyday people who have some sort of genetic story relevant to the course. I like this idea, but I found the quality of the content to vary widely. Most of the interviews with professionals were interesting, but many of the “everyday person” discussions were not engaging. For example, they spoke to one woman with a family history of sickle cell anemia who didn’t seem well informed about how the disease actually affected her, her family, or her community. In another case they spoke with an adopted man searching for his biological parents. While his story was intriguing, there was no conclusion; the closest he came to finding a parent was locating some people near his hometown with a last name that might belong to one of his parents. Not exactly riveting. While I appreciate that these are real people sharing their stories, I feel the content would have been more interesting if Udacity had spent additional time vetting the people before including them in the lecture content.

This additional video content makes the overall length of the course longer than the other Udacity courses I’ve taken. The video time runs about 13 hours, and the quizzes and exams bring the total time involvement to about 15-20 hours. The quizzes are mostly multiple choice questions, although some are short-response and/or include a “Why?” textbox. I like the idea behind this: no one grades the free responses, but they encourage students to think about their answers. However, because there is so little feedback, I occasionally found myself skipping those questions by entering bogus text. While the short-response questions weren’t as effective for me as intended, I do encourage Udacity to keep experimenting with alternate methods of assessment. Their code analysis/checker is phenomenal, and it would be great if they could do something similar for non-technical classes. The only gripe I have about the multiple choice questions is that sometimes I did not know the answer, and there’s not always a clear way to find it. I would love an “I don’t know” button that would mark the question as wrong but then give the answer with an explanation. The only other option now is to try all the combinations and then be confused by the correct answer.

Probably my favorite thing about this course was the wide variety of material. It was an entry level course with a lot of breadth and little depth, but I felt like it gave me an interesting look at many questions and problems in genetics today.

A final note: This course is sponsored by 23andMe, a genetic testing company. I was already a 23andMe customer before taking this course, but non-customers might feel like they are being advertised to. The inclusion of 23andMe marketing information is subtle and well-integrated with the course syllabus, but may turn off some people looking for a “pure” educational experience.

Overall I did enjoy this class, mostly due to the information that I learned. However, it still felt a little “beta” to me. I am excited to see Udacity branch out from technical courses, but the model used in Tales from the Genome isn’t quite polished enough for me to completely endorse it.