Suppose you have a set of documents that you want to be able to retrieve easily on demand. A back-end searching algorithm could take in a string of text and then find the document whose text matches the input. But what if you had thousands or millions of documents? This process would be both resource intensive and, more importantly, slow. A common alternative is to use a “hash.” In computer science, a hash is basically just a “stand-in” for another piece of data. As an example, the contents of the U.S. Constitution hashes into 62c23ae3519bf455cc32a4d4be608c8e when using the MD5 algorithm. Now, instead of the computer program saying “Go find the document whose contents are ‘We the People of the United States, in Order to form a more perfect Union…'”, it says, “Go find the document whose hash is 62c23ae3519bf455cc32a4d4be608c8e.” This is the basic premise of a hash table: data is hashed into “keys” which are used to access the pre-hashed data (also referred to as “values”).
In Java, the most common implementation of a hash table is HashSet. It uses the hashCode() algorithm (which is implemented for each object by the programmer) to create integer keys for the lookup table. I was curious how easy it would be to get my own hash table to perform similarly to HashSet.
First I created my simple data object, a “Person” with a first name, last name, and age. I used my IDE to generate the hash code algorithm, but it’s a pretty standard:
@Override public int hashCode() { int result = firstName.hashCode(); result = 31 * result + lastName.hashCode(); result = 31 * result + age; return result; }
The back-end of my hash table is a 2D array. The reason for two dimensions is that it is possible for two objects to have the same hashcode. A well-designed hash algorithm should make this uncommon, but not impossible. In situations where two objects “collide” with the same hash code, I store each of those objects in the second dimension of the array.
So finding the correct place in the table requires two steps. First we find the correct slot (the first dimension). This is done using the method suggested on Wikipedia which is by applying a bit mask to the hash code of the object.
private int findSlot(Object o, Object[][] home){ return o.hashCode() & (home.length - 1; }
After finding the correct slot, the object is simply stored in the next available place at that slot.
To retrieve an object, we essentially apply the same steps: find the correct slot for the object, then check all of the objects at that slot for equality.
public boolean contains(Object o){ if(o == null){ return false; } int slot = findSlot(o,hashTable); Object[] objects = hashTable[slot]; for(Object object : objects){ if(o.equals(object)){ return true; } } return false; }
To profile my hash table for performance, I compared it against two controls: 1) a HashSet and 2) an ArrayList (which is analogous to the slow method of checking every object for equality). In each scenario I created one million Person objects, stored them in the collection, then called the contains() method to search the collection for a specific Person. This lookup was repeated 10000 times for each collection.
As expected, both hashing collections blew ArrayList out of the water. On average, they were three orders of magnitude faster than ArrayList. Both had similar lookup time distributions:
After bootstrapping the data sets to tease out mean lookup times, I was surprised to find my hash table was faster.My guess as to why my implementation is faster could be for any number of reasons: it is incomplete, not thoroughly tested, not memory efficient, etc. These are all reasons why you should never actually use your own hash set implementation in a program! Java’s core libraries are designed by experts and have years of testing and performance tweaking behind the scenes. However, it was interesting for me to find implementing a hash table to be relatively straight-forward and comparable in performance to the “gold standard” HashSet.
The code for my hash set is available here.