I originally posted these as a Twitter thread. Its point is to illustrate that constant factors matter, not only whether an algorithm is $\mathcal{O}(n)$, $\mathcal{O}(n \log n)$ or $\mathcal{O}(n^2)$ (though quadratic is quite bad quite soon).
I have been playing with the `discrimination` package. It offers linear-time grouping and sorting. The `Data.List.nub` vs `Data.Discrimination.nub` chart is fun to look at (look at the x-axis: the size of the input).

Don't use `Data.List.nub`.
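For reference, with only an `Eq` constraint `nub` cannot do better than comparing each element against everything it has kept so far, so it behaves roughly like this sketch:

```haskell
-- Roughly what Data.List.nub does: every element is compared against
-- all elements kept so far, which is O(n^2) overall.
nub :: Eq a => [a] -> [a]
nub = go []
  where
    go _    []     = []
    go seen (x:xs)
      | x `elem` seen = go seen xs
      | otherwise     = x : go (x : seen) xs
```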
There is `ordNub` (in many libraries, e.g. in `Cabal`). And indeed, it is a good alternative. `Data.Discrimination.nub` is still faster when $n$ is large enough (around a hundred thousand for the `Word64` values I use in these benchmarks).
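A typical `ordNub` (a common formulation, not necessarily the exact code in `Cabal`) keeps a `Set` of already-seen elements, which makes it $\mathcal{O}(n \log n)$:

```haskell
import qualified Data.Set as Set

-- An O(n log n) nub: keeps the first occurrence of each element,
-- using a Set as the "seen" accumulator.
ordNub :: Ord a => [a] -> [a]
ordNub = go Set.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `Set.member` seen = go seen xs
      | otherwise           = x : go (Set.insert x seen) xs
```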
`hashNub :: (Eq a, Hashable a) => [a] -> [a]` performs well too:
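It can be written the same way as `ordNub`, just with a `HashSet` accumulator (a sketch assuming `unordered-containers`; the version I benchmark may differ in details):

```haskell
import Data.Hashable (Hashable)
import qualified Data.HashSet as HashSet

-- Like ordNub, but with a HashSet accumulator: roughly linear expected time.
hashNub :: (Eq a, Hashable a) => [a] -> [a]
hashNub = go HashSet.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `HashSet.member` seen = go seen xs
      | otherwise               = x : go (HashSet.insert x seen) xs
```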
All four variants on the same graph. Even for small $n$, `nub` is bad; `ordNub` and `Discrimination.nub` are about the same; `hashNub` is the fastest.
(I have to find something better to play with than `Word64`.)
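The benchmarks are roughly of this shape (a minimal `criterion` sketch with a made-up input size and seed; the actual benchmark code differs in details such as input generation):

```haskell
import Criterion.Main (bench, bgroup, defaultMain, env, nf)
import Data.Word      (Word64)
import System.Random  (mkStdGen, randoms)

import qualified Data.Discrimination as Disc
import qualified Data.List           as List

main :: IO ()
main = defaultMain
    [ env (pure inputs) $ \xs -> bgroup "nub"
        [ bench "Data.List.nub"           $ nf List.nub xs
        , bench "Data.Discrimination.nub" $ nf Disc.nub xs
        -- ordNub and hashNub (as sketched above) are benchmarked the same way
        ]
    ]
  where
    -- pseudo-random Word64 values as input
    inputs :: [Word64]
    inputs = take 100000 (randoms (mkStdGen 42))
```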
`UUID` is a slightly more interesting type:
```haskell
data UUID = UUID !Word64 !Word64
  deriving stock    (Eq, Ord, Generic, Show)
  deriving anyclass (NFData, Hashable, Grouping, Sorting)
```
Same pattern: `hashNub` is the fastest, and `ordNub` becomes slower than `Discrimination.nub` when there are enough elements.
I was asked about zooming in for small $n$. Because it is hard to see there, we can try a log-log plot. Everything looks linear on it, but we can see the crossover points better.
But what about sorting, you may ask.
It turns out that `Data.List.sort` is quite good, at least if your lists are shorter than a million elements. Comparison with `Data.Discrimination.sort` (I sort `UUID`s, for more fun):
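Both have the same list-to-list shape; `Data.Discrimination.sort` just wants a `Sorting` instance instead of `Ord`, which is why `UUID` above derives `Sorting` (the helper names here are mine):

```haskell
import qualified Data.Discrimination as Disc
import qualified Data.List           as List

-- Both produce the same sorted output; only the algorithm
-- (and the required constraint) differs.
sortUUIDsList, sortUUIDsDisc :: [UUID] -> [UUID]
sortUUIDsList = List.sort   -- Ord-based comparison sort
sortUUIDsDisc = Disc.sort   -- Sorting-based, linear-time discrimination
```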
Making a vector from a list, sorting it (using `vector-algorithms`), and converting back to a list seems to be a good option too (we only copy pointers, and copying a million pointers is cheap).
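Concretely, the round trip looks like this (a sketch; `vectorSort` is my name for it, using boxed vectors and introsort):

```haskell
import qualified Data.Vector                  as V
import qualified Data.Vector.Algorithms.Intro as Intro

-- Sort a list by converting to a boxed Vector, sorting it in place
-- with introsort from vector-algorithms, and converting back.
-- The conversions only copy pointers.
vectorSort :: Ord a => [a] -> [a]
vectorSort = V.toList . V.modify Intro.sort . V.fromList
```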
Something weird happens on GHC-9.0 though: `discrimination` has the same performance, yet the `vector`-based variant degrades.
Yet, @ekmett thinks there is still plenty of opportunity to make `discrimination` faster. In the meantime, I'll add (a variant of) these benchmarks to the repository.