equals and hashCode for dummies (again)

In java, writing equals and hashcode methods are perennial problems. Newbies and experts alike are confounded when things go haywire and troubleshooting problems can be extremely difficult. A common thing that a poor hashcode/equals implementation will cause are intermittent problems and/or mysterious out of memory errors. To help clear things up, I'll give a 10,000 overview of how the core java libraries use the hashcode/equals methods.

Essentially, the hashcode/equals methods are used when storing things in Hashes (HashMap, Hashtable, HashSet). Things often go wrong when folks start using custom objects as keys for these classes. For example, let's say we have a person class that looks something like this:

            public class Person {
                public Person(String orgName) {
                    this.name = orgName
                }
                private String name;
                public String getName() {
                    return name;

                }
                public void setName(String newName) {
                    this.name = newName;
                }
            }
        

This seems pretty straightforward, no surprises, can't get any simpler than this right?

Now let's say we add some people to a hashset that represents a club that people can belong to:

            HashSet<Person> myClub = new HashSet<Person>();
            Person jsmith = new Person("Joe Smith");
            Person sjones = new Person("Sarah Jones");
            myClub.add(jsmith);
            myClub.add(sjones);

        

The contract of a set says that it will guarantee uniqueness... to test this, we can try and add a duplicate Sarah Jones:

            Person sjones2 = new Person("Sarah Jones");
            assertTrue(myClub.add(sjones2));

        

This tells us that we can add two Person objects with the same name. In our strange use case, we don't want that, we want only 1 "Sarah Jones" person in the club. To accomplish this, we need to write an equals method in order to let the system know that our people are unique by name. So we override the equals method in our class by adding the following to the Person.

            @Override
            public equals(Object o) {
                if (o == null || o.getClass() != this.getClass()) return false;
                Person other = (Person)o;
                return other.getName().equals(this.getName());
            }
        

Now, if we try to add our second Sarah Jones, it should fail right? Well, it turns out it only "might" fail, but it also "might" work (sometimes) and this is where things get wonky (especially for newbies). Consider the following:

            HashSet<Person> myClub = new HashSet<Person>();
                    Person sjones = new Person("Sarah Jones");
                    Person sjones2 = new Person("Sarah Jones");
                    assertEquals(sjones, sjones2);


            // Now for the guts
                    myClub.add(sjones);
                    assertTrue(myClub.contains(sjones);

            //Random failures after here
                    assertFalse(myClub.add(sjones2));
                    assertTrue(myClub.contains(sjones2);


        

The above code will randomly fail the third and fourth assertions for no apparent reason. Why? It has to do with how the Hash* implementations work in java. For a really nice, more in-depth write-up, go here, but I'll just give a quick overview.

The Hash* java implementations store things in buckets typically with linked lists (or arrays) in each bucket. For example, a trivial implementation takes 10 buckets and when storing something, it takes the hashcode of the object, does a mod-10 operation on it and then adds it to the linked list of that bucket. The problem with our implementation is that we didn't override the hashCode function and the default implementation in java just uses the memory location for the result of the hashCode function. So, for example our above example would work if the following happens:

  • new hashSet is created
  • sjones gets created and stored at memory location 4
  • sjones2 gets created and stored at memory location 14
  • java compares sjones and sjones2, sees they are equal and returns true
  • java adds sjones to myClub by grabbing the mod-10 of sjones (which would be 4) and putting it in the list on bucket 4 of myClub
  • java verifies that sjones is in myClub by grabbing the hashCode, seeing sjones should be in bucket 4, then walking down the list calling the equals method on each object. There's only one and sjones.getName() in fact is equal to sjones.getName() so it can be found
  • java then tries to add sjones2 by doing a mod-10 of the sjones2 hashcode, seeing it should ALSO go in bucket 4, then walking the list to see if sjones is already there. It sees sjones is already there (because sjones.getName() is equal to sjones2.getName(). Due to the nature of the HashSet contract java will return false to indicate that sjones2 was NOT added because it was already there.
  • Next, java will again look up the hashcode of sjones2, mod-10, look in the bucket, and verify that sjones2 is, in fact, in the set... even though it previously said it was not added (which is totally correct and makes sense if you understand how things are supposed to work)

If the above results are suprising, you'll want to take a break, but for those of you who are following around, now it gets more interesting. Lets run through the above scenario, but instead of sjones2 getting created at memory location 14, let's say it gets created at memory location 17. This would mean that NOW, sjones2 will get added to the set and you'll end up with duplicates. Worse yet, if you call it enough, you could end up with 10 copies of Person objects that are all equal in the Set. How will this happen?

Well, we see that in order to determine which "bucket" to look for an Object, java will use the hashcode. If two Objects that are "equal" have different hashCodes, java will look in the wrong bucket and NOT find it. Because the default implementation uses memory locations as the hashCode, it's often the case that things will "sometimes maybe" work, but other times completely fail.

In short, if you're having wonky behaviour with complex types in Sets or the keys of Maps, carefully verify that all things that can be "equal" will "always" have the same hashCode. Note, things with the same hashCode DON'T need to be equal, but that's a completely different discussion.

Comments

Popular posts from this blog

Push versus pull deployment models

the myth of asynchronous JDBC

Installing virtualbox guest additions on Centos 7 minimal image