Breaking Hash Codes - Part I
When we last left off, we discussed hash codes - a small unique identifier for an item based on its properties or features. There are many different ways to calculate unique identifiers for items. These different ways are called Hash Algorithms. Each serves a different purpose, but some popular ones right now are are MD5, SHA1, and SHA256. Why so many? The reason is security. Our last post finished with the question:
Using hash codes for simple content verification is one thing. But what if an attacker is trying to trick you with hash code collisions? For example, can you use that Visual Studio download if it’s from your worst enemy – but the hash code still matches? It depends on the algorithm!
The security industry first started touching on the importance of “secure” hash algorithms in the 1970s. There were a flurry of proposals in the 80s that were immediately dismissed with security problems, but one called “MD5” was introduced in the early 90s that seemed like it might last the test of time. In 2004, however, advances in security research found issues in MD5 – thus “breaking MD5”. While you can still use MD5 just fine when security is not a concern, you cannot trust a version of Visual Studio from your enemy just because the MD5 hash value matches. Here are the four features of an ideal cryptographic hash algorithm:
- it is easy (but not necessarily quick) to compute the hash value for any given message
- it is infeasible to generate a message that has a given hash
- it is infeasible to modify a message without changing the hash
- it is infeasible to find two different messages with the same hash
If you get #2, then for the most part you get #3 and #4 automatically. How does one “break” a hash algorithm? Let’s take a look a a few from the .NET Framework for an interesting exercise in reverse engineering. As a background, every object in the .NET Framework supports a method called “GetHashCode()”. It does a good job of returning a small random(ish) identifier for an object, so it is used primarily for putting objects into hash tables. This method returns a 32-bit integer, and is not designed for security purposes:
public virtual int GetHashCode()
The HashAlgorithm class is designed for security purposes, however, and should be used when security is a concern. We won’t use that class as an example, because if I can break that class in this blog, then it wouldn’t be in the .NET Framework!
Breaking Int32
Let’s see how this works on integers, and see how it would fare as a cryptographic hash algorithm:
PS > (1).GetHashCode() 1 PS > (2).GetHashCode() 2 PS > (1337).GetHashCode() 1337 PS > (-1337).GetHashCode() -1337
I think we see a pattern :) It looks like .NET is returning our input right back at us. It’s a sure way to prevent collisions, that’s for sure! If we open Int32.GetHashCode() in .NET Reflector (or ILSpy), we can see why:
public override int GetHashCode()
{
return this;
}
The score sheet?
- ✅ it is easy to compute the hash value for any given message
- ❌ it is infeasible to generate a message that has a given hash
- ✅ it is infeasible to modify a message without changing the hash
- ✅ it is infeasible to find two different messages with the same hash
Not bad, but you can’t do much with just integers.
Breaking Int64
That wasn’t too challenging to break. What about Int64? That doesn’t fit in 32 bits by definition, so let’s see if there is any pattern:
PS > ([Int64] 1).GetHashCode()
1
PS > ([Int64] 1337).GetHashCode()
1337
PS > ([Int64] 12345678987654321).GetHashCode()
1656367333
For “low” numbers (below 32 bits?) it seems to be the same as the input number. After that, not quite sure. We can’t break this one by intuition. Let’s see the source:
public override int GetHashCode()
{
return (int)this ^ (int)(this >> 32);
}
What the .NET Framework is doing here is pretty simple:
(int)this
Takes the lower 32-bits of the 64-bit number.
(int)(this >>32)
Shifts the number to the right by 32 places, and then takes the lower 32 bits of the number. That is the same thing as taking the upper 32 bits of the number.
(int)this ^(int)(this >>32)
Uses the binary XOR operator to combine the upper 32 bits and the lower 32 bits.
The first row explains why all of our “low” numbers seemed to just return themselves. The low numbers were the lower 32 bits, and there were no upper 32 bits. Let’s take a look at:
PS > ([Int64] 12345678987654321).GetHashCode()
1656367333
If we convert this to the upper and lower 32 bits, we get:
PS > [Convert]::ToString(12345678987654321, 2)
101011110111000101010001100010100100011111010010110001
PS > [Convert]::ToString(12345678987654321, 2).PadLeft(64, "0")
0000000000101011110111000101010001100010100100011111010010110001
Now, split them into the top and bottom 32 bits take the binary exclusive OR (either bit is 1, but not both):
Section | Bits |
---|---|
Upper 32 | 00000000001010111101110001010100 |
Lower 32 | 01100010100100011111010010110001 |
XOR | 01100010101110100010100011100101 |
Taking a look at the final result, it looks like we have the same as what .NET gave us:
PS > [Convert]::ToInt32("01100010101110100010100011100101", 2)
1656367333
PS > ([Int64] 12345678987654321).GetHashCode()
1656367333
Now that we understand the algorithm, let’s see how we might break the strongest guarantee of a hash algorithm – namely, “it is infeasible to generate a message that has a given hash.” Let’s pretend we want to generate a 64-bit number with the hash value of 1337. That has the following binary value:
PS > [Convert]::ToString(1337, 2).PadLeft(32, "0")
00000000000000000000010100111001
Since the 64-bit hash is generated by combining the upper and lower halves of the number, the simplest way is to simply ignore the upper 32 bits:
Section | Bits |
---|---|
Upper 32 | 00000000000000000000000000000000 |
Lower 32 | 00000000000000000000010100111001 |
XOR | 00000000000000000000010100111001 |
That was simple enough. What about breaking the next strongest feature: “it is infeasible to find two different messages with the same hash.” Given that we know how to find an Int64 for any given hash value, how can we find multiple Int64s that have the same hash value? To figure this out, we’ll use a trick from the XOR operation frequently used in XOR cyphers and XOR encryption:
If A XOR B = C
Then
A = B XOR C
And
B = A XOR C
If we pretend that A is the upper 32 bits of a number, B is the lower 32 bits of a number, and C is the hash code, then we have the following options:
Upper 32 bits = (Lower 32 bits) XOR (Hash code)
or
Lower 32 bits = (Upper 32 bits) XOR (Hash code)
Given that, here’s all we have to do to find a number with a specific hash code:
- Pick an arbitrary number for the lower (or upper) 32 bits
- XOR it with the hash code
- The result will be the other 32 bits you need
Let’s look at an example, using 10001011001010110101101011010111 as a randomly-chosen number:
Section | Bits |
---|---|
Upper 32 | 10001011001010110101101011010111 |
Hash Code | 00000000000000000000010100111001 |
XOR (= Lower 32) | ? |
PowerShell has the “-bxor” operator, so let’s make this easier on ourselves:
PS > $upper32 = [Convert]::ToInt32("10001011001010110101101011010111", 2)
PS > $hash = [Convert]::ToInt32("00000000000000000000010100111001", 2)
PS > $lower32 = $upper32 -bxor $hash
PS > [Convert]::ToString($lower32, 2)
10001011001010110101111111101110
So now we have a hash collision – let’s be sure:
PS > $number = [Convert]::ToInt64(
>> "10001011001010110101101011010111" +
>> "10001011001010110101111111101110", 2)
>>
PS > $number
-8418535196639666194
Let’s see:
PS > ([Int64] -8418535196639666194).GetHashCode()
1337
Bingo. We can now generate hash collisions for arbitrary hash codes. What does that mean?
- ✅ it is easy to compute the hash value for any given message
- ❌ it is infeasible to generate a message that has a given hash
- ❌ it is infeasible to modify a message without changing the hash
- ❌ it is infeasible to find two different messages with the same hash
Bonus question: how many hash collisions can you generate for a given hash code? And how might you do so systematically? Chew on that a bit – in our next post, we’ll take a crack at reverse-engineering String.GetHashCode().