UNESCO World Heritage Site Distribution

UNESCO, a United Nations agency, has designated several hundred sites throughout the world as World Heritage Sites, places judged to have “special cultural or physical significance.” However, these sites are not evenly distributed around the globe. In fact, when we control for the area of a country, Europe has disproportionately more heritage sites than other regions of the world:

A map showing the distribution of UNESCO sites.

This is further clarified if we plot the same data and display the UNESCO region of each point:

Dot plot of UNESCO site distribution.

What are the causes of this Eurocentrism? An article in Geographical magazine suggests a few possibilities. European nations joined the U.N. earlier and thus have had more time to apply for World Heritage Sites. Or perhaps the idea of a “site” that shows heritage is itself Eurocentric: many cultures may consider traditions and art more culturally significant than buildings. Another possibility is simply that these countries have more money (PDF). Whatever the reason, it may be wise to consider the UNESCO site list as merely a sample of the world’s cultures.

Implementing a Naive Hash Table in Java

Suppose you have a set of documents that you want to be able to retrieve easily on demand. A back-end searching algorithm could take in a string of text and then find the document whose text matches the input. But what if you had thousands or millions of documents? This process would be both resource intensive and, more importantly, slow. A common alternative is to use a “hash.” In computer science, a hash is basically just a “stand-in” for another piece of data. As an example, the contents of the U.S. Constitution hashes into 62c23ae3519bf455cc32a4d4be608c8e when using the MD5 algorithm. Now, instead of the computer program saying “Go find the document whose contents are ‘We the People of the United States, in Order to form a more perfect Union…’”, it says, “Go find the document whose hash is 62c23ae3519bf455cc32a4d4be608c8e.” This is the basic premise of a hash table: data is hashed into “keys” which are used to access the pre-hashed data (also referred to as “values”).
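As a side note, computing an MD5 digest like that takes only a few lines in Java. The sketch below is purely illustrative and is not part of the hash table code itself:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hash a string with MD5 and format the 16-byte digest as 32 hex characters
static String md5Hex(String text) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
    return String.format("%032x", new BigInteger(1, digest));
}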

In Java, the most common hash-backed set implementation is HashSet. It uses each object's hashCode() method (which the programmer typically overrides for their own classes) to create integer keys for the lookup table. I was curious how easy it would be to get my own hash table to perform similarly to HashSet.

First I created my simple data object, a “Person” with a first name, last name, and age. I used my IDE to generate the hashCode() method, but it’s a pretty standard one:

@Override
public int hashCode() {
    int result = firstName.hashCode();
    result = 31 * result + lastName.hashCode();
    result = 31 * result + age;
    return result;
}
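The rest of the Person class is just the fields, a constructor, and an equals() method. The sketch below is approximate, but the key point is that equals() must be consistent with hashCode(), or lookups will fail:

public class Person {
    private final String firstName;
    private final String lastName;
    private final int age;

    public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person other = (Person) o;
        return age == other.age
                && firstName.equals(other.firstName)
                && lastName.equals(other.lastName);
    }

    // hashCode() as shown above
}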

The back-end of my hash table is a 2D array. The reason for two dimensions is that it is possible for two objects to have the same hash code. A well-designed hash algorithm makes this uncommon, but it can never make it impossible. When two objects “collide” with the same hash code, I store each of those objects in the second dimension of the array.

So finding the correct place in the table requires two steps. First we find the correct slot (the first dimension). This is done using the method suggested on Wikipedia: applying a bit mask to the object's hash code (which works because the table length is a power of two).

private int findSlot(Object o, Object[][] home){
    // Masking works because the table length is a power of two
    return o.hashCode() & (home.length - 1);
}

After finding the correct slot, the object is simply stored in the next available place at that slot.
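In code, adding an object looks roughly like the following. This is a simplified sketch (no duplicate check, and it uses java.util.Arrays.copyOf to grow a slot that fills up), but it shows both steps:

public void add(Object o) {
    int slot = findSlot(o, hashTable);
    Object[] objects = hashTable[slot];
    // Store the object in the next empty position at this slot
    for (int i = 0; i < objects.length; i++) {
        if (objects[i] == null) {
            objects[i] = o;
            return;
        }
    }
    // The slot is full: grow the second dimension and append at the end
    int oldLength = objects.length;
    objects = Arrays.copyOf(objects, Math.max(1, oldLength * 2));
    objects[oldLength] = o;
    hashTable[slot] = objects;
}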

To retrieve an object, we essentially apply the same steps: find the correct slot for the object, then check all of the objects at that slot for equality.

public boolean contains(Object o){
    if(o == null){
        return false;
    }
    int slot = findSlot(o,hashTable);
    Object[] objects = hashTable[slot];
    for(Object object : objects){
        if(o.equals(object)){
            return true;
        }
    }
    return false;
}

To profile my hash table for performance, I compared it against two controls: 1) a HashSet and 2) an ArrayList (which is analogous to the slow method of checking every object for equality). In each scenario I created one million Person objects, stored them in the collection, and then called the contains() method to search the collection for a specific Person. This lookup was repeated 10,000 times for each collection.
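The timing loop looked roughly like this sketch (shown here for the standard collections; my own table exposes an analogous contains() method). Naive timing like this is sensitive to JIT warm-up and garbage collection, so the absolute numbers matter less than the comparison:

import java.util.Collection;

// Time 10,000 contains() lookups against a pre-populated collection
static long[] timeLookups(Collection<Person> collection, Person target) {
    long[] timings = new long[10_000];
    for (int i = 0; i < timings.length; i++) {
        long start = System.nanoTime();
        collection.contains(target);
        timings[i] = System.nanoTime() - start;
    }
    return timings;
}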

As expected, both hashing collections blew ArrayList out of the water. On average, they were three orders of magnitude faster than ArrayList. Both had similar lookup time distributions:
Distribution plot of search times using the two hashing collections.

After bootstrapping the data sets to tease out mean lookup times, I was surprised to find my hash table was faster.

Hash table mean distribution.

My implementation could be faster for any number of reasons: it is incomplete, not thoroughly tested, not memory efficient, etc. These are all reasons why you should never actually use your own hash set implementation in a program! Java’s core libraries are designed by experts and have years of testing and performance tweaking behind them. However, it was interesting to find that implementing a hash table is relatively straightforward and comparable in performance to the “gold standard” HashSet.

The code for my hash set is available here.

Bourbon: Best Bang for your Buck

A friend and I went to a local whiskey bar recently and were a bit overwhelmed by the options. We’re both bourbon fans, though, and managed to pick a few at random to try. But what makes the difference between a $4.00 glass of bourbon and a $7.00 glass of bourbon? Probably a combination of factors, but I wanted to see if ratings had any bearing on price.

This graph shows 187 bourbons from Proof66 and compares the price per bottle to the professional judges’ rating. Mouseover for details!

According to the website, the judges consider any rated liquor worth drinking. However, it is interesting to see which bourbons are overpriced and which are bargains. Note that I had to use a log scale for the price because a few bourbons are significantly more expensive than the rest.

The site also provides ratings from regular users. These unsurprisingly do not always align with the judges’ ratings. I normalized the two rating systems and then picked the ten most controversial bourbons in each direction. The bourbons at the top are preferred by judges and the bourbons at the bottom are preferred by users.

Dot plot of controversial bourbon ratings.
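If you want to try something similar with your own data, here is a rough sketch of the normalization step using pandas; the column names are hypothetical:

import pandas as pd

def most_controversial(df, n=10):
	#df has hypothetical columns "bourbon", "judge_rating", and "user_rating"
	judge_z = (df["judge_rating"] - df["judge_rating"].mean()) / df["judge_rating"].std()
	user_z = (df["user_rating"] - df["user_rating"].mean()) / df["user_rating"].std()
	df = df.assign(gap=judge_z - user_z)
	#Positive gap: judges like it more; negative gap: users like it more
	return pd.concat([df.nlargest(n, "gap"), df.nsmallest(n, "gap")])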

Finally, I wanted to see how bourbons stacked up against other types of whiskey. This chart shows that they actually rank fairly highly; the only non-single-malt that beats them is Tennessee whiskey (which I did not know was a formal legal category!).

Boxplot of whiskeys by rating.

Surprised by any ratings? What’s your liquor of choice? Let me know in the comments or make some charts about what you like to drink!

Review: Udacity Tales from the Genome (BIO110)

Until recently, Udacity specialized in technical courses. Late last year, however, they released Tales from the Genome, an introductory course on genetics. I was a bit skeptical of how the material would be presented, but I was interested enough in the subject to give it a shot.

The first thing I noticed about this course was the variety of video types. Although some of the lecture content is in the traditional Udacity style of whiteboard instruction, this course includes “in-person” lectures with instructors Matt and Lauren, interviews with professionals in the field, and discussions with everyday people who have some sort of genetic story relevant to the course. I like this idea, but I found the quality of the content varied widely. Most of the interviews with professionals were interesting, but many of the “everyday person” discussions were not engaging. For example, they spoke to one woman with a family history of sickle cell anemia who didn’t seem well informed about how the disease actually affected her, her family, or her community. In another case they spoke with an adopted man who was searching for his biological parents. While his story was intriguing, there was no conclusion; the closest he came to finding a parent was locating some people near his hometown whose last name might match one of his parents’. Not exactly riveting. While I appreciate that these are real people sharing their stories, I feel the content would have been more interesting if Udacity had spent some additional time vetting the people before deciding to include them in the lecture content.

This additional video content makes the overall length of the course longer than the other Udacity courses I’ve taken. The video time runs about 13 hours, and the quizzes and exams bring the total time involvement to about 15-20 hours. The quizzes are mostly multiple-choice questions, although some are short-response and/or include a “Why?” textbox. I like the idea behind this: there is no one grading the free responses, but it encourages students to think about their answers. However, because there is so little feedback, I occasionally found myself skipping those questions by entering bogus text. While the short-response questions weren’t as effective for me as intended, I do encourage Udacity to keep experimenting with alternative methods of assessment. Their code analysis/checker is phenomenal, and it would be great if they could do something similar for non-technical classes. The only gripe I have about the multiple-choice questions is that sometimes I did not know the answer, and there is not always a clear way to find it. I would love an “I don't know” button that would mark the question as wrong but then give the answer with an explanation. As it stands, the only option is to try all the combinations and then be confused by the correct answer.

Probably my favorite thing about this course was the wide variety of material. It was an entry level course with a lot of breadth and little depth, but I felt like it gave me an interesting look at many questions and problems in genetics today.

A final note: This course is sponsored by 23andMe, a genetic testing company. I was already a 23andMe customer before taking this course, but non-customers might feel like they are being advertised to. The inclusion of 23andMe marketing information is subtle and well-integrated with the course syllabus, but may turn off some people looking for a “pure” educational experience.

Overall I did enjoy this class, mostly due to the information that I learned. However, it still felt a little “beta” to me. I am excited to see Udacity branch out from technical courses, but the model used in Tales from the Genome isn’t quite polished enough for me to completely endorse it.

How to use the Yelp API in Python

In last week’s post, I pulled data about local restaurants from Yelp to generate a dataset. I was happy to find that Yelp actually has a very friendly API. This guide will walk you through setting up some boilerplate code that you can then configure to your specific needs.

Step 1: Obtaining Access to the Yelp API

Before you can use the Yelp API, you need to submit a developer request. This can be done here. I’m not sure what the requirements are, but my guess is they approve almost everyone. After getting access, you will need to get your API keys from the Manage API access section on the site.

Step 2: Getting the rauth Library

Yelp’s API uses OAuth authentication for API calls. Unless you want to do a lot of work, I suggest that you use a third party library to handle the OAuth for you. For this tutorial I’m using rauth, but feel free to use any library of your choice.

You can use easy_install rauth or pip install rauth to download the library.

Step 3: Writing the Code to Query the Yelp API

You’ll first need to figure out what information you actually want to query. The API Documentation gives you all of the different parameters that you can specify and the correct syntax.

For this example, we’re going to be doing some location-based searching for restaurants. If you store each of the search parameters in a dictionary, you can save yourself some formatting. Here’s a method that accepts a latitude and longitude and returns the search parameter dictionary:

def get_search_parameters(lat,long):
	#See the Yelp API for more details
	params = {}
	params["term"] = "restaurant"
	params["ll"] = "{},{}".format(str(lat),str(long))
	params["radius_filter"] = "2000"
	params["limit"] = "10"

	return params

Next we need to build our actual API call. Using the codes from the Manage API access page, we’re going to create an OAuth session. After we have a session, we can make an actual API call using our search parameters. Finally, we take that data and put it into a Python dictionary.

import rauth

def get_results(params):

	#Obtain these from Yelp's manage access page
	consumer_key = "YOUR_KEY"
	consumer_secret = "YOUR_SECRET"
	token = "YOUR_TOKEN"
	token_secret = "YOUR_TOKEN_SECRET"
	
	session = rauth.OAuth1Session(
		consumer_key = consumer_key
		,consumer_secret = consumer_secret
		,access_token = token
		,access_token_secret = token_secret)
		
	request = session.get("http://api.yelp.com/v2/search",params=params)
	
	#Transforms the JSON API response into a Python dictionary
	data = request.json()
	session.close()
	
	return data

Now we can put it all together. Since Yelp will only return a maximum of 40 results at a time, you will likely want to make several API calls if you’re putting together any sort of sizable dataset. Currently, Yelp allows 10,000 API calls per day, which should be way more than enough for compiling a dataset! However, when I’m making repeated API calls, I always make sure to rate-limit myself.

Companies with APIs will almost always have mechanisms in place to prevent too many requests from being made at once. Often this is done by IP address. They may have some code in place to only handle X calls in Y time per IP, or X concurrent calls per IP, etc. If you rate-limit yourself, you increase your chances of always getting back a response.

import time

def main():
	locations = [(39.98,-82.98),(42.24,-83.61),(41.33,-89.13)]
	api_calls = []
	for lat,long in locations:
		params = get_search_parameters(lat,long)
		api_calls.append(get_results(params))
		#Be a good internet citizen and rate-limit yourself
		time.sleep(1.0)
		
	##Do other processing

At this point you have a list of dictionaries that represent each of the API calls you made. You can then do whatever additional processing you want to each of those dictionaries to extract the information you are interested in.
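For example, if you just want each business's name and rating, something like the sketch below works (assuming the response structure described in the API documentation, where results live under a "businesses" key):

def extract_names_and_ratings(api_calls):
	names_and_ratings = []
	for response in api_calls:
		#Each search response should contain a "businesses" list
		for business in response.get("businesses", []):
			names_and_ratings.append((business["name"], business["rating"]))
	return names_and_ratings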

When working with a new API, I sometimes find it useful to open an interactive Python session and actually play with the API responses in the console. This helps me understand the structure so I can code the logic to find what I’m looking for.

You can get this complete script here. Every API is different, but Yelp is a friendly introduction to the world of making API calls through Python. With this skill you can construct your own datasets from any of the companies with public APIs.