HashMap Behavior in JDK6

It is important to read the documentation. Lots of people don't seem to do this, and they get bitten by it.

The iterator for java.util.HashSet is documented as follows:

Returns an iterator over the elements in this set. The elements are returned in no particular order.

Of course, I've now seen code that ignores this, so I thought I would draw a few underlines for the kind of folks who miss it. I wish to state for the record that I did not write this code.

It turns out that HashSets (and HashMaps) don't just use the user-defined hashCode as the hash value for objects they are passed. They rehash the number that method returns with their own hash function, to defend against poor quality hash functions. This makes a lot of sense, as many people write bad hash functions.

Anyway, this means that when they change this function, objects will hash to different places. Then, when iterators visit these items, they will be visited in a different order.
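To make the change concrete, here is what the supplemental hash in Sun's JDK6 HashMap (and HashSet, which is built on top of it) looks like. This is a sketch based on the OpenJDK 6 sources; JDK5 used a different mixing function, which is why keys land in different buckets and iteration order changes:

```java
public class SupplementalHash {
    // JDK6-style supplemental hash: spreads the high-order bits downward
    // so keys that differ only in upper bits don't pile into one bucket.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        // Without rehashing, 16 and 32 would share bucket 0 in a
        // 16-entry table (16 & 15 == 32 & 15 == 0). After rehashing,
        // they land in different buckets.
        System.out.println(hash(16) & 15);
        System.out.println(hash(32) & 15);
    }
}
```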

I wouldn't be posting this unless they had changed it between JDK5 and JDK6, of course. For a quick illustration:

import java.util.HashSet;

public class Test {
  public static void main(String[] args) {
    HashSet<String> set = new HashSet<String>();
    set.add("axons");
    set.add("bandrils");
    set.add("chumblies");
    System.out.println(set);
  }
}

Run that in JDK5, and it will print "[chumblies, bandrils, axons]". Run it in JDK6, and it will print "[axons, bandrils, chumblies]".

If you are having trouble switching from JDK5 to JDK6, and this comes up in your code, smack your forehead with your palm, say, "duh!", and move on with your life.

Comments

Anonymous said…
The order can also change if the capacity is changed.

LinkedHashMap can be useful.
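For example, LinkedHashSet (the Set counterpart of LinkedHashMap) iterates in insertion order on any JDK, so code that cares about a stable order can use it instead:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class OrderDemo {
    public static void main(String[] args) {
        Set<String> set = new LinkedHashSet<String>();
        set.add("axons");
        set.add("bandrils");
        set.add("chumblies");
        // Iteration follows insertion order on every JDK,
        // unlike HashSet, whose order is unspecified.
        System.out.println(set);  // [axons, bandrils, chumblies]
    }
}
```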
mk said…
Just to clarify:

This discussion concerns an internal hash code and doesn't relate to the HashSet/HashMap's own publicly visible hashCode() method. In other words, if you had code like this:


HashSet<String> set = new HashSet<String>();
set.add("foo");
set.add("bar");
System.out.println("hash is: " + set.hashCode());


Then you should still see the same value in any version of the JDK.
Jeremy Manson said…
Mickey --

It's not a good idea to rely on behavior like that. The value returned by a hashCode invocation isn't documented to be the same across multiple implementations.

In other words, the value printed out by that program might coincidentally be the same in any two versions of the JDK, but they could change it tomorrow without violating any rules.

Having said that, it is true that the behavior you were describing and the behavior I was describing are two different things.
Good catch! This answered the very doubt I had at the moment, and Mr. Google helped me land here!

It's because of the uncertainty of the ordering of elements in a HashMap, to put it simply! Still, my question is a little different. If time permits, pay a visit here -> http://www.coderanch.com/t/431661/Java-General-intermediate/java/HashMap-JDK-Vs-JDK-strange

Cheers,
Raghavan alias Saravanan M.
Jeremy Manson said…
@Raghavan: To simplify it, the way HashMaps work, in effect, is that there are n linked lists, numbered from 0 to n-1. In Sun's JDK5, hashCode() % n is used to decide which linked list the element is added to. Thus, elements will be added to the same linked lists in the same order every time. In JDK6, they apply a function to the hashCode() to make it more evenly distributed, but the resulting list is still uniquely determined by the hashCode, so elements will still be added to the same linked lists in the same order every time.
Anurag Saxena said…
In Sun's JDK5, the hashCode() % n is used to decide what linked list the element was added.

Jeremy, it's not hashCode() % n. It's rather hashCode() & (n - 1). The two operations may give different results. n is the number of entry buckets in the HashMap.
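To spell this out: in Sun's implementation the table length n is always a power of two, so masking with (n - 1) keeps the low-order bits of the hash, which matches h % n for non-negative h (they differ for negative hash codes, which is one reason the mask is used). A sketch, mirroring the indexFor helper in the Sun/OpenJDK 6 sources:

```java
public class BucketIndex {
    // Mirrors HashMap.indexFor from the Sun/OpenJDK 6 sources:
    // table lengths are powers of two, so masking with (length - 1)
    // selects the low-order bits of the hash as the bucket number.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        // For non-negative h and a power-of-two length, this equals h % length.
        System.out.println(indexFor(42, 16));  // 10
        System.out.println(42 % 16);           // 10
        // For negative h the two disagree: Java's % can return a negative value.
        System.out.println(indexFor(-1, 16));  // 15
        System.out.println(-1 % 16);           // -1
    }
}
```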
mk said…
This now-ancient topic came up in a discussion the other day, so a long-belated correction to Jeremy's comment that "the value printed out might coincidentally be the same in any two versions of the JDK, but they could change it tomorrow without violating any rules."

In the specific case of a HashSet of Strings, it *is* guaranteed to return the same hashCode on any JDK because both the Set interface and the String class explicitly document how their hashCodes must be calculated. Every "correct" JDK must return exactly the same well-defined value. (Theoretically, either contract *could* be changed in a future version of Java, but practically speaking, such a change would be so massively disruptive that it's hard to imagine why they would ever do such a thing.)
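This can be checked directly: the Set interface defines hashCode() as the sum of the elements' hash codes, and String documents its own hash formula, so the result is fully determined no matter how the set stores its elements:

```java
import java.util.HashSet;
import java.util.Set;

public class SetHashCodeDemo {
    public static void main(String[] args) {
        Set<String> set = new HashSet<String>();
        set.add("foo");
        set.add("bar");
        // Per the Set interface contract, hashCode() is the sum of the
        // elements' hash codes; addition commutes, so iteration order
        // (and therefore the JDK version) is irrelevant.
        int expected = "foo".hashCode() + "bar".hashCode();
        System.out.println(set.hashCode() == expected);  // true
    }
}
```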

However, Jeremy's warning against relying on the values of hashCode implementations that do *not* advertise how they are calculated is certainly sound advice.
Jeremy Manson said…
Well defined hashCodes don't solve this problem. Java's HashMap/HashSet implementations don't necessarily treat hashCode in the same way from implementation to implementation. Hash-based collections in the JDK currently rehash your hash code so that it won't be as predictable:

http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/1d41042b9e4f/src/java.base/share/classes/java/util/HashMap.java#l336

At Google, we are looking at permanently rejiggering Hash-based collections so that iteration happens differently at each program invocation. That will make it much harder to make this mistake. This is probably worth a follow-up post at some point.
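A minimal sketch of that idea (hypothetical, not any shipped implementation): mix a random seed, chosen once per JVM invocation, into each hash code. The mapping from keys to buckets then stays consistent within a run but differs from run to run, so code that accidentally depends on iteration order fails fast:

```java
import java.util.Random;

public class SeededHashing {
    // Chosen once per JVM invocation; each run gets a different layout.
    private static final int SEED = new Random().nextInt();

    static int seededHash(Object key) {
        // Stable within a run (same key -> same value), but the key-to-bucket
        // mapping changes every time the program starts.
        return key.hashCode() ^ SEED;
    }

    public static void main(String[] args) {
        System.out.println(seededHash("axons") == seededHash("axons"));  // true
    }
}
```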
