Do not default to HashMap

Posted on 2021-05-19 by Oleg Grenrus

There are two widely used associative containers in Haskell:

  • Map from containers
  • HashMap from unordered-containers

Map is using Ord (total order), and most operations are \mathcal{O}(\log n) . HashMap is using Hashable (some hash), and most operations are \mathcal{O}(1) . HashMap is obviously better!

It depends.

Sometimes HashMap is the only choice, for example when the keys are Unique which has Hashable instance but does not have Ord one. 1 Let us discard such cases and only consider situations when key type has both Ord and Hashable instances.

The thing we often forget to consider is unordered in unordered-containers. toList :: HashMap k v -> [(k, v)]2 doesn't not respect the equality. The order in the resulting list is arbitrary, it depends on how HashMap is constructed.

We should be aware that we trade stability for performance, it is a trade-off, not a pure win.

And not only on that, hash functions provided by hashable are not stable. They depend on a library version, operating system, architecture. The reason is simple: we should be able to tweak hash functions to whatever is the best for performance related properties (speed and collision resistance is own tradeoff already).

Today released hashable- optionally could randomise the hash seed on startup of final executable. If random-init-seed flag is enabled, the initial seed is not a compile-time constant but a value initialized on the first use. I added the random-init-seed flag to help find issues when we (unintentionally) depend on (non-existent) stability of the hash function. I immediately found a bug in the lucid test suite. Even the test-suite took care of different orders already, there was a typo in test case.

I do not recommend enabling that flag in production, only in tests.

Why lucid uses HashMap (for HTML tag attributes) instead of Map. I actually don't know. These attribute dictionaries are usually of "human size", at which point the performance difference doesn't show up. (Maybe a Trie Text would be the best option. I'm not sure.) See my previous blog post on benchmarking discrimination package. List.nub is faster then Set or HashSet based ordNub and hashNub with small inputs, even the member check in List has linear complexity. Constant factors matter. If lucid used Map, its output would be stable. Now it's not, attributes can appear in arbitrary order, and that causes problems, e.g. for dhall-docs.3

Another example is aeson. It is using HashMap for the representation of JSON objects. I don't have any data whether using Map would be worse for performance in typical cases, I doubt. It might matter when JSON contains an actual dictionary mapping from keys to values, but "record objects" are small. Whether key-value mappings would be that big that it mattered, is also not unknown. We can argue that it would be better in worst case, as there is known DDoS attacks on HashMaps with deterministic (and weak, such hashable) hash functions. If we would use stronger hash function, the constant factor will go up, and then Map might be faster and safer by construction.

To conclude, I'm sorry that I broke all of your test-suites with hashable- release which changed initial seed of the default hash. On the other hand, I think it was good to realise how much we rely on something we shouldn't. Therefore hashable- introduces the random-init-seed flag which you can use in your tests to find those issues. (N.B. I don't consider that flag to be a security feature). That release also finally mentions in haddocks the "known but hidden" fact that hash in hashable is not stable. Don't rely on that.

I suggest you default to Map when you need an associative container, and only consider HashMap when you actually have a case for it Benchmark, don't guess, base your decision on data, rerun these benchmark periodically.

  1. And concurrent one as in unique package cannot have Ord instance.↩︎

  2. If you rely (directly or indirectly) on HashMap.toList or HashSet.toList where the order may leak, they are likely wrong data structures. It is ok to use toList if we remove order-dependency later for example by sorting the data at the end or folding with a commutative operation.↩︎

  3. I think dhall-docs should use e.g. tagsoup to reparse the output, (normalise), and compare it as "tag tree", not as text. That's not very convenient, but it is more correct — even if lucid had fully stable output.↩︎

Site proudly generated by Hakyll