I originally posted these as a Twitter thread. Its point is to illustrate that constant factors matter, not only whether the complexity is O(n), O(n log n), or O(n²) (though quadratic gets quite bad quite soon).
I have been playing with the discrimination package.
It offers linear-time grouping and sorting. The Data.List.nub vs Data.Discrimination.nub chart is fun to look at. (Look at the x-axis: the size of the input.)
Don't use Data.List.nub.
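For reference, a minimal usage sketch of the linear-time replacement, assuming the discrimination package; Data.Discrimination.nub wants a Grouping instance rather than Eq:

```haskell
-- Hypothetical usage sketch (not from the original post): swap
-- Data.List.nub for Data.Discrimination.nub, which requires a
-- Grouping instance and runs in expected linear time.
import           Data.Word           (Word64)
import qualified Data.Discrimination as D

dedupe :: [Word64] -> [Word64]
dedupe = D.nub   -- Word64 has a Grouping instance provided by the package
```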
There is ordNub (in many libraries, e.g. in Cabal). And indeed, it is a good alternative. Data.Discrimination.nub is still faster when n is large enough (around a hundred thousand elements for the Word64 values I use in these benchmarks).
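For completeness, here is a minimal sketch of the usual ordNub, tracking already-seen elements in a Set (an assumed shape; the variants in the wild differ in details):

```haskell
-- A typical ordNub: keep a Set of elements seen so far and emit each
-- element only the first time it appears.  O(n log n) overall.
import qualified Data.Set as Set

ordNub :: Ord a => [a] -> [a]
ordNub = go Set.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `Set.member` seen = go seen xs
      | otherwise           = x : go (Set.insert x seen) xs
```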
hashNub :: (Eq a, Hashable a) => [a] -> [a] performs well too:
All four variants on the same graph. Even for small n, nub is bad. ordNub and Discrimination.nub are about the same. hashNub is fastest.
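A hashNub with that signature can be sketched along the same lines as ordNub, using a HashSet from unordered-containers (this is an assumed implementation, not necessarily the one benchmarked):

```haskell
-- Same idea as ordNub, but membership is checked in a HashSet,
-- so lookups hash the element instead of comparing it.
import           Data.Hashable (Hashable)
import qualified Data.HashSet  as HashSet

hashNub :: (Eq a, Hashable a) => [a] -> [a]
hashNub = go HashSet.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `HashSet.member` seen = go seen xs
      | otherwise               = x : go (HashSet.insert x seen) xs
```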
(I have to find something better to play with than Word64)
UUID is a slightly more interesting type:
```haskell
data UUID = UUID !Word64 !Word64
  deriving (Eq, Ord, Generic, Show, NFData, Hashable, Grouping, Sorting)
```

The same pattern holds: hashNub is the fastest, and ordNub becomes slower than Discrimination.nub once there are enough elements.
I was asked about zooming in for small n.
Because it is hard to see, we can try a log-log plot. Everything looks linear there, but we can see the crossover points better.
But what about sorting, you may ask.
It turns out that Data.List.sort is quite good, at least if your lists are shorter than a million elements. Comparison with Data.Discrimination.sort (I sort UUIDs, for more fun):
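A hedged sketch of how such a comparison can be set up with criterion; this is not the post's actual harness, the input generator and size are assumptions, and I use Word64 here to keep the sketch self-contained:

```haskell
-- Assumed benchmark harness, not the one used for the charts above.
import           Criterion.Main      (bench, bgroup, defaultMain, env, nf)
import           Data.Word           (Word64)
import qualified Data.Discrimination as D
import qualified Data.List           as L

-- Deterministic pseudo-random input via a simple LCG step.
inputs :: Int -> [Word64]
inputs n = take n (iterate step 42)
  where
    step x = x * 6364136223846793005 + 1442695040888963407

main :: IO ()
main = defaultMain
  [ env (pure (inputs 1000000)) $ \xs ->
      bgroup "sort"
        [ bench "Data.List.sort"           (nf L.sort xs)
        , bench "Data.Discrimination.sort" (nf D.sort xs)
        ]
  ]
```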
Making a vector from the list, sorting it (using vector-algorithms), and converting back to a list seems to be a good option too (we only copy pointers, and copying a million pointers is ... cheap).
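A minimal sketch of that round trip, assuming the vector and vector-algorithms packages (the name sortViaVector is mine, not from the post):

```haskell
-- List -> Vector -> in-place introsort -> list.  The elements themselves
-- are not copied, only the pointers to them.
import qualified Data.Vector                  as V
import qualified Data.Vector.Algorithms.Intro as Intro

sortViaVector :: Ord a => [a] -> [a]
sortViaVector = V.toList . V.modify Intro.sort . V.fromList
```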
Something weird happens on GHC-9.0 though: discrimination has the same performance, yet the vector-based variant degrades.
Yet, @ekmett thinks there is still plenty of opportunity to make discrimination faster. In the meantime, I'll add (a variant of) these benchmarks to the repository.