Let's Talk Data

How to Hide Text in a BMP using Python

This picture of presidential cat Socks is not what it seems.In addition to being a cat, it also contains the text of the Declaration of the Rights of Man and of the Citizen.

By manipulating the image data and the text data at the bit level, we can tweak the image to include information that is imperceptible to the human eye.

Color data in BMP files is stored as RGB values where one byte is allocated for each of red, blue, and green. Each of these values can be changed by a small amount without changing the appearance of the image. To determine what amount to change the bytes by, we use the binary data of the text file.

As an example, let’s see how we would embed the letter “D” into a solid block of orange. The letter “D” has the ASCII hex value 44 and binary value 01000100. The shade of orange I chose has the hex value FF 7F 27 which has the binary value 11111111 01111111 00100111. I mentioned before that we can alter these RGB numbers slightly. The specific method I chose was to set the last bit of each RGB byte to 1 or 0 based on the corresponding bit in the text data. This chart shows how each bit of the “D” is stored across three orange pixels (for a total of 9 bytes–one of which is not used). The orange in the “result” column really is changed. I used the new hex values, but as you can see it looks exactly the same as the “template” column.

Here is some Python code that will actually do this for us. (Click here for the full script.) First we need to be able to break down the text data into bits:

def data_to_bits(data):
	bits = []
	for i in range(len(data)):
		#A byte can at max be 8 digits long, i.e. 0b11111111 = 255
		#We start at the left most bit (position 7) and work down to 0
		for j in range(7,-1,-1):
			#Create the logic array of bits for our data
			bits.append(nth_bit_present(data[i],j))

	return bits

Then we need a method that will alter an arbitrary byte by setting its last bit to 1 or 0, given a parameter.

def nth_bit_present(my_byte,n):
	#Bitwise check to see what the nth bit is
	#If we get anything other than 0, it is TRUE else FALSE
	return (my_byte & (1 << n)) != 0

def set_final_bit(my_byte,ends_in_one):
	new_byte = 0
	if ends_in_one:
		if(nth_bit_present(my_byte,0)):
			new_byte = my_byte #No modification needed, it already ends in one
		else:
			new_byte = my_byte + 1
	else:
		if(nth_bit_present(my_byte,0)):
			new_byte = my_byte - 1
		else:
			new_byte = my_byte #No modification needed, it already ends in zero
	return new_byte

To extract the “D” from the image, we simply reverse the process. First we extract the final bit from each byte in the image.

with open(sys.argv[1],'rb') as bmp_file:
		bmp = bmp_file.read()
        #The byte at position 10 tells us where the color data starts
	start_offset = bmp[10] 

	#Deconstruct each byte and get its final bit
	bits = []
	for i in range(start_offset,len(bmp)):
		bits.append(nth_bit_present(bmp[i],0))

And then we can convert a chunk of 8 bits into a byte using some bit shifting.

def bits_to_byte(bits):
	assert len(bits) == 8
	new_byte = 0
	for i in range(8):
		if bits[i]==True:
			#This bit==1 and the "position" we are at in the byte is 7-i
			#Bitwise OR will insert a 1 a this position
			new_byte |= 1 << 7 - i
		else:
			#This bit==0 and the "position" we are at in the byte is 7-i
			#Bitwise OR will insert a 0 a this position
			new_byte |= 0 << 7 - i
	return new_byte

By converting all of the “final bits” back into bytes we will have recreated the original text in binary. Using chr(byte) we can get the ASCII value back and print it to the console.

This is a very basic method of steganography that offers some simple obscurity, but not much security. However, if you were to combine steganography with well proven security methods such as encryption, it becomes a powerful tool for hiding data.

By Phillip Johnson | April 27, 2014 | Programming | No Comments |

Falling Cost of Memory: 1957 to Present

On a recent episode of 99% Invisible (highly recommended), they briefly mentioned that people fixate on different design elements and take more notice of them than the average person. For example, anthropomorphized teeth in dentist offices. I think my design fixation is college professor websites from the 1990s.

You can tell that the university created some obscure directory structure when they first got the internet because the sites are usually at an address like mycollege.edu/~staff/prof/cs/esmith/home.html. The pages are always a white background with minimal HTML formatting. Lists, links, and if you’re lucky, header tags. The pages are often in this weird limbo where everything looks outdated and yet you see “Winter 2014 Teaching Schedule”. You’ll find the typical pages such as “CV”, “Classes”, and “Publications”, but the best pages are the “personal” ones. Scanned family photos taken in the 1980s, lists of hobbies, clubs started years ago, and of course, cats. I love these webpages and spend way too much time browsing when I find one.

One such site is John C. McCallum’s homepage which has a slew of content only weird people like me appreciate (although he does have his own domain). On his site he has a list and graph of computer memory prices in history. In honor of weird professor sites, I recreated my own graph of his data (although I adjusted for inflation). Web designs may change, but good data is always in style.

By Phillip Johnson | April 21, 2014 | Other | No Comments |

UNESCO World Heritage Site Distribution

The United Nations has designated several hundred sites throughout the world as World Heritage Sites. These sites are determined to have “special cultural or physical significance.” However, these sites are not evenly distributed throughout the world. In fact, when we control for the area of a country, we see that Europe disproportionately has many more heritage sites than other regions of the world:This is further clarified if we plot the same data and display the UNESCO region of each point:What are the causes of this Eurocentrism? An article in Geographical magazine suggests a few possibilities. Eurpoean nations joined the U.N. earlier and thus have had more time to apply for World Heritage Sites. Or perhaps the idea of a “site” that shows heritage is itself Eurocentric–many cultures may consider traditions and art more culturally significant than buildings. Another possibility is simply that these countries have more money[PDF]. Whatever the reason, it may be wise to consider the UNESCO site list as merely a sample of the world’s cultures.

By Phillip Johnson | March 27, 2014 | Viz | No Comments |

Implementing a Naive Hash Table in Java

Suppose you have a set of documents that you want to be able to retrieve easily on demand. A back-end searching algorithm could take in a string of text and then find the document whose text matches the input. But what if you had thousands or millions of documents? This process would be both resource intensive and, more importantly, slow. A common alternative is to use a “hash.” In computer science, a hash is basically just a “stand-in” for another piece of data. As an example, the contents of the U.S. Constitution hashes into 62c23ae3519bf455cc32a4d4be608c8e when using the MD5 algorithm. Now, instead of the computer program saying “Go find the document whose contents are ‘We the People of the United States, in Order to form a more perfect Union…'”, it says, “Go find the document whose hash is 62c23ae3519bf455cc32a4d4be608c8e.” This is the basic premise of a hash table: data is hashed into “keys” which are used to access the pre-hashed data (also referred to as “values”).

In Java, the most common implementation of a hash table is HashSet. It uses the hashCode() algorithm (which is implemented for each object by the programmer) to create integer keys for the lookup table. I was curious how easy it would be to get my own hash table to perform similarly to HashSet.

First I created my simple data object, a “Person” with a first name, last name, and age. I used my IDE to generate the hash code algorithm, but it’s a pretty standard:

@Override
public int hashCode() {
    int result = firstName.hashCode();
    result = 31 * result + lastName.hashCode();
    result = 31 * result + age;
    return result;
}

The back-end of my hash table is a 2D array. The reason for two dimensions is that it is possible for two objects to have the same hashcode. A well-designed hash algorithm should make this uncommon, but not impossible. In situations where two objects “collide” with the same hash code, I store each of those objects in the second dimension of the array.

So finding the correct place in the table requires two steps. First we find the correct slot (the first dimension). This is done using the method suggested on Wikipedia which is by applying a bit mask to the hash code of the object.

private int findSlot(Object o, Object[][] home){
    return  o.hashCode() & (home.length - 1;
}

After finding the correct slot, the object is simply stored in the next available place at that slot.

To retrieve an object, we essentially apply the same steps: find the correct slot for the object, then check all of the objects at that slot for equality.

public boolean contains(Object o){
    if(o == null){
        return false;
    }
    int slot = findSlot(o,hashTable);
    Object[] objects = hashTable[slot];
    for(Object object : objects){
        if(o.equals(object)){
            return true;
        }
    }
    return false;
}

To profile my hash table for performance, I compared it against two controls: 1) a HashSet and 2) an ArrayList (which is analogous to the slow method of checking every object for equality). In each scenario I created one million Person objects, stored them in the collection, then called the contains() method to search the collection for a specific Person. This lookup was repeated 10000 times for each collection.

As expected, both hashing collections blew ArrayList out of the water. On average, they were three orders of magnitude faster than ArrayList. Both had similar lookup time distributions:
After bootstrapping the data sets to tease out mean lookup times, I was surprised to find my hash table was faster.My guess as to why my implementation is faster could be for any number of reasons: it is incomplete, not thoroughly tested, not memory efficient, etc. These are all reasons why you should never actually use your own hash set implementation in a program! Java’s core libraries are designed by experts and have years of testing and performance tweaking behind the scenes. However, it was interesting for me to find implementing a hash table to be relatively straight-forward and comparable in performance to the “gold standard” HashSet.

The code for my hash set is available here.

By Phillip Johnson | March 16, 2014 | Programming | No Comments |

Bourbon: Best Bang for your Buck

A friend and I went to a local whiskey bar recently, and were a bit overwhelmed at the options. However, we’re both bourbon fans and managed to pick a few at random to try. But what makes the difference between a $4.00 glass of bourbon and a $7.00 glass of bourbon? Probably a combination of factors, but I wanted to see if ratings had any bearing on price.

This graph shows 187 bourbons from Proof66 and compares the price per bottle to the professional judges’ rating. Mouseover for details!

According to the website, any rated liquor is worth drinking according to the judges. However, it is interesting to see which are overpriced and which are bargains. Note that I had to use a log scale for the price because of those few bourbons that are significantly more expensive.

The site also provides ratings of regular users. These unsurprisingly do not always align with the judges’ ratings. I normalized the two rating systems and then picked the top ten most controversial bourbons, in both directions. The bourbons at the top are preferred by users and the bourbons at the bottom are preferred by judges.

Finally, I wanted to see how bourbons stacked up compared to other types of whiskey. This chart shows that they actually rank fairly highly and the only non-single-malt that beats it is Tennessee whiskey (which I did not know was a formal legal category!).

Surprised by any ratings? What’s your liquor of choice? Let me know in the comments or make some charts about what you like to drink!

By Phillip Johnson | March 5, 2014 | Portfolio, Viz | 10 Comments |