Thursday, September 30, 2010

C# .NET - Equals() and GetHashCode() investigated

Just the other day I was ready to commit a rather large chunk of code to SVN repository and needed a fellow developer to approve the commit. While going through the changes made to the code my fellow developer Rasmus (you can see his blog here RasmusKL's Blog - Bit by bit...) spotted a weak implementation of Equals().

To illustrate what was wrong have a look at this class.

using System;

namespace Models
{
 public class EmailAddress
 {
  public string Address { get; set; }

  public override bool Equals(Object o)
  {
   if(o==null)
   {
    return false;
   }
   if (this.GetType() == o.GetType())
   {
    return o.GetHashCode()==this.GetHashCode();
   }
   return false;
  }

  public override int GetHashCode()
  {
   return this.Address.GetHashCode();
  }
 }
}

The issue above is that GetHashCode() returns an integer which only can take 2^32 different values, whereas Address is a string which can take an arbitrary large number of combinations depending on the string length, different encoding types ... so eventually we would run out of different int32 distinct values and this would end up in GetHashCode() returning an already used value for a string that did not match the first value.

This was not my code, but I voted for us to leave the code as it was to be changed in a following commit since I felt that the risk of potentially conflicting hashcodes caused by the following line


return o.GetHashCode()==this.GetHashCode();


was relatively low and that this should not necessarily be a big cause of concern.

I was wrong.

As the following will show there is a potentially large risk that 2 different strings generate the same hashcode with a rather small number of iterations.


The problem is that we are comparing

return o.GetHashCode()==this.GetHashCode();


rather than

var other = o as EmailAddress;
...
return other.Address==this.Address;


The MSDN documentation for Object.GetHashCode() states the following

A hash function must have the following properties:
  • If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.
  • The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.
  • For the best performance, a hash function must generate a random distribution for all input.
Out of pure interest I set out to investigate how often on average GetHashCode() will return the same hashcode for different strings. 


On my specific machine (Intel Windows 7 64 bit) this turned out to be on average every 82039'th call to GetHashCode().


The smallest number of calls to GetHashCode() to return the same hashcode for a different string was after 565 iterations and the highest number of iterations before getting a hashcode collision was 296390 iterations. This is the code I used to get the statistics, the code is using Guids for unique strings and pushes them into a Dictionary, since Address in the above class is just a string I have run this test as a string.GetHashCode scenario.

using System;
using NUnit.Framework;
using System.Collections.Generic;

namespace Equals
{
 [TestFixture]
 public class EmailAddressTest
 {
  [Test]
  public void GetHashCode_HowFastCanWeGenerateSameHashCodeWithDifferentValues_NotEqualsHasSameHash()
  {
   Guid guid2;
   int guid2HashCode = 0;
   string guid2AsString = string.Empty;
   long sum = 0;
   Dictionary<int, string> generatedHashesGuids = new Dictionary<int, string>();

   for (int x = 1; x < 10000; x++)
   {
    for (int i = 0; i < 1000000; i++)
    {
     guid2 = Guid.NewGuid();
     guid2AsString = guid2.ToString();
     guid2HashCode = guid2AsString.GetHashCode();

     if (generatedHashesGuids.ContainsKey(guid2HashCode) && generatedHashesGuids[guid2HashCode] != guid2AsString)
     {
      sum = sum + i;
      Console.WriteLine (string.Format("Average object.GetHashCode() iterations before collision (iterations before collision / number of tries): {0} / {1} = {2}", sum.ToString(), x.ToString(), (sum / x).ToString()));
      generatedHashesGuids.Clear();
      break;
     }
     else
     {
      generatedHashesGuids.Add(guid2HashCode, guid2.ToString());
     }
    }
   }
  }
 }
}

The image below show the average number of iterations before a collision occurs. The test ran through 10K collisions.

And here is the graph again after removing the initial 10 first averages that had a large spread.

So the conclusion is that if you think that GetHashCode() (in its default implementation for string) only will clash/collide a few times in Int32.MaxValue then you are mistaken. Personally I was wrong, I thought this happened maybe one time in a million or so, but as this test shows it can happen after 535 iterations and also quite likely after a single iteration.

Kudos to Rasmus for this one.

2 comments:

Jim Mischel said...

As a rule of thumb, your chances of a collision for a hash code of n bits is about 50% when you have 2^(n-1) items. So, as you discovered, for a 32-bit hash code, your chances of a collision are 50% when you have between 64K and 128K items. See http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=792 and http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table for more information.

Jim Mischel said...

*sigh* I meant 2^(n/2) items in the above comment. If there are 2^(n-1) items in a hash table, there is a 50% probability of the item you're adding being a duplicate.