How to Hash a Set

Hash tables are widely used. They rely on good quality hash functions. Popular data structure libraries either provide no hash functions or weak hash functions for sets or maps, making it impossible or impractical to use them as keys in other tables. This article presents three algorithms for hashing a set, two of which are simple to implement, practically fast, and can be combined. The quality evaluations follow the method of [1, chapter 2]. The insight that we are looking for commutative semigroups suggests that even better methods than symmetric polynomials may be found.


Introduction
Many programming languages these days have some sort of library offering a range of container data structures. Hash tables, whether for sets or dictionaries or both, are a popular offering. You can expect to find good support for strings as keys; you may hope to find good support for numbers and time-stamps and even for general sequences. What you are not likely to find is good support for sets themselves as elements of hashed sets or keys of hashed dictionaries.
The author has encountered this problem several times: a clause of propositional logic may be represented as a pair of disjoint sets of variables, an itemset in data mining [2] is precisely a set of items, and sets of variables are encountered in data flow analysis [3]. Counting or associating properties with such things requires hashing sets. This paper describes three techniques for computing hash codes for unordered collections such as sets, bags, and dictionaries. They are (a) using or imposing a canonical order, (b) partitioning, and (c) using a symmetric polynomial to combine element hashes.

Current Practice
Major textbooks such as Cormen et al. [4], Knuth [5], and Sedgewick [6] have much to say about the theory of hash tables, about hash table data structures, and about techniques for computing hash codes of numbers and strings. Advice about how to construct hash functions for other data structures is rare, and for hashing sets nonexistent. ANSI Common Lisp [7] simply doesn't offer a set data type to start with. Equality does not look inside hash tables, so the built-in sxhash function considers only the identity of a hash table.
Ada 2005 [8,9] provides sets and dictionaries, but does not provide any hash functions for them.

If you want one, you must write it all yourself.
C# version 4 [10,11] has an ISet interface which has nothing to say about the hash code for a set, while the use of exclusive or in all the examples on the GetHashCode page is not encouraging. Andrés Valloud's wonderful survey of hashing algorithms [1, section 6.2.5] despairingly remarks "because no particular order of enumeration is guaranteed ... we will not be able to do much better than to join hash values via a[n exclusive or] funnel". Java, Python, GNU Smalltalk [12,13], Squeak Smalltalk [14,15], VisualWorks Smalltalk 7.10 and later [16], and Pharo Smalltalk [17,18] try harder.
All of the algorithms in this paper use the following identifiers:
• s: the set to be hashed;
• h: the hash code to be computed from s;
• x: an element of s;
• a: an accumulator or array of accumulators;
• c: computes a commencing value from the cardinality of s;
• e: a hash function for the elements of s, which is often a scrambled version of the hash function that is used to locate elements in s, but could be the same function or a completely unrelated one;
• u: the update function;
• w: a wrap-up function to compute h from a.
When a function identifier appears in a cost formula, it stands for the cost of a call to that function, assumed to be independent of its arguments.
The languages above use variants of the following scheme:

a := c(|s|); for each x ∈ s do a := u(a, e(x)); h := w(a)

where u must be commutative and associative. That is, the domain of a (typically 0 .. 2^m − 1 for some m) and the function u form a commutative semigroup. There are two popular choices for u: bitwise exclusive or (used by Python and some Smalltalk systems) and sum (used by Java).
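A minimal sketch of this scheme in Python may make the roles of c, e, u, and w concrete. The 32-bit mask and the identity element hash are illustrative assumptions, not prescribed by any of the libraries above:

```python
MASK = 0xFFFFFFFF  # keep accumulators in the domain [0, 2**32)

def set_hash(s, c, e, u, w):
    """Generic scheme: a := c(|s|); a := u(a, e(x)) for each x in s; h := w(a).

    Because u is commutative and associative, the result does not depend
    on the order in which the elements are enumerated.
    """
    a = c(len(s))
    for x in s:
        a = u(a, e(x)) & MASK
    return w(a)

def java_like(s):
    # Java-style: start from 0, sum the element hashes.
    return set_hash(s, c=lambda n: 0, e=lambda x: x,
                    u=lambda a, y: a + y, w=lambda a: a)

def xor_like(s):
    # Python/Smalltalk-style: combine with exclusive or.
    return set_hash(s, c=lambda n: 0, e=lambda x: x,
                    u=lambda a, y: a ^ y, w=lambda a: a)
```

Any commutative, associative u can be dropped in without changing the rest of the scheme, which is the point the paper develops below.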
These hashes have the potentially useful property that the hash value (more precisely, a rather than h) can be incrementally maintained as elements are added to and removed from s.
Incremental addition places no extra constraints on u; incremental removal needs an inverse u⁻¹ satisfying u⁻¹(u(a, y), y) = a. They also have the quality-related property that if x and y are uniformly distributed over [0, 2^n) then so are x ⊕ y and x + y.
As the Spartans replied to Philip II: "if". The main problem with this approach is that the distribution of element hash values is so often not uniform. If c(n) = 0, as it is in Java and some of the Smalltalks, and if e(i) = i when i is a small integer, as it is in all the listed systems except GNU Smalltalk and Python, we find that {1, 2} and {3} have the same hash value whether u is + or ⊕. This can be improved by using a non-trivial c such as α + βn, but even then {1, 6} and {2, 5} have the same hash value. This is why Python and GNU Smalltalk include a "scrambling" stage in their e calculations.
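These collisions are easy to reproduce. The sketch below uses plain summation and exclusive or with c(n) = 0 and e(i) = i, mirroring the systems just described:

```python
def sum_hash(s):
    # Java-style combination: c(n) = 0, e(i) = i, u = +
    return sum(s) & 0xFFFFFFFF

def xor_hash(s):
    # Smalltalk-style combination: c(n) = 0, e(i) = i, u = xor
    h = 0
    for x in s:
        h ^= x
    return h

# {1, 2} and {3} collide under both update functions...
assert sum_hash({1, 2}) == sum_hash({3})
assert xor_hash({1, 2}) == xor_hash({3})

# ...and a cardinality-dependent commencing value c(n) = alpha + beta*n
# still cannot separate equal-sized sets such as {1, 6} and {2, 5}:
assert sum_hash({1, 6}) == sum_hash({2, 5})
assert xor_hash({1, 6}) == xor_hash({2, 5})
```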
That helps a great deal, but is not a complete solution, hence this paper.

Do we need something better?
A colleague who read a draft of this paper asked the obvious question: if major programming languages hash sets so badly, and there is no outcry, is this problem really of practical interest?
First, there is a vicious circle. If major languages and libraries hash sets badly or not at all, programmers quickly learn not to do that, and that is then taken as evidence that the status quo is fine.
Second, it is not just sets that are hashed badly. Consider four classes in Java:
• Point2D has the property that (x, y) and (−x, −y) always have the same hash code.
• String computes h = Σ s[i]·31^(n−1−i) mod 2^32. This is very weak. Consider all identifiers of the form [a-zA-Z][a-zA-Z0-9_]* of length two; there are 3276 such identifiers. There are 480 triples of such identifiers with the same hash. That is, about 44% of the identifiers are in triples with the same hash. There are a further 790 pairs (1580 identifiers) that have the same hash. Only about 12% of the identifiers get distinct hash codes. Robert Jenkins' hash [19] gets no collisions for this data set.
• ArrayList uses the same method as String. As a test case that should be easy for a hash function to discriminate, construct all triples (a, b, c) with 0 ≤ a, b, c < 64 as ArrayLists and tally their hash codes. The collision chains come in three sizes. The constant 31 was chosen for speed, not for quality; a bigger constant would do better.
• HashMap.Node computes the same hash value for the maplets x → y and y → x, for no apparent reason. This means that if you represent the edges of a graph by a HashMap<Node,Node>, a graph and its converse will have the same hash value.
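The String and ArrayList collisions can be checked without a JVM by recomputing Java's well-known polynomial recurrence (h := 31·h + item) in Python. The specific colliding inputs below are examples found by hand for illustration, not taken from the paper's data set:

```python
def java_string_hash(s):
    # String.hashCode: accumulator starts at 0, folds h := 31*h + char.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def java_list_hash(xs):
    # List.hashCode: accumulator starts at 1, same fold over element hashes.
    h = 1
    for x in xs:
        h = (31 * h + x) & 0xFFFFFFFF
    return h

# Two-character identifiers collide whenever the first characters differ
# by 1 and the second characters differ by 31 the other way:
assert java_string_hash("aO") == java_string_hash("b0")
# Triples exist too, since ord('n') - ord('O') == ord('O') - ord('0') == 31:
assert java_string_hash("c0") == java_string_hash("bO") == java_string_hash("an")

# ArrayList triples (a, b, c) with 0 <= a, b, c < 64 collide because one
# step in a (weight 31*31 = 961) equals 31 steps in b (weight 31 each):
assert java_list_hash([1, 0, 0]) == java_list_hash([0, 31, 0])
```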
As Valloud [1, section 5.1.14] puts it: in Java "it is expected that hash functions will be of bad quality, and that instead of developers addressing this situation, what will happen is that applications will spend computing resources calculating bad quality hash values regardless. Then, an attempt will be made to fix this issue by mixing the bits of the hash values ... No amount of mixing in any amount of dimensions will solve this fundamental collision problem after the collisions are allowed to occur!" If programmers are willing to tolerate low quality hash functions for simple data structures, it is not surprising that they have needlessly tolerated low quality hash functions for sets and maps.

Canonical order
Some kinds of sets, such as Java's BitSet, EnumSet, and TreeSet, store their elements in a canonical order, so that it is possible to hash them as if they were sequences. This is current practice when applicable: Java 1.8 does this for BitSet but not for TreeSet. Any comparison-based data structure, such as a search tree, jump list, or skip list, which allows the elements to be traversed in a canonical order in linear time, can be hashed using sequence hashing techniques with low overhead.
Hash tables do not make such a traversal easy, nor is there any universal total order we could use for generic sets. But we can convert the hashes of the elements to a canonical sequence by sorting them. Since the element hash values are bounded integers, we can do this using a radix sort [20,21], which has linear worst-case time.
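A sketch of this approach: hash each element, sort the hashes into a canonical sequence, then apply any ordinary sequence hash. The FNV-1a-style fold below is purely an illustrative choice of sequence hash, and Python's built-in `sorted` stands in for the radix sort, which changes the asymptotic cost but not the result:

```python
def sorted_set_hash(s, e=lambda x: x & 0xFFFFFFFF):
    """Hash an unordered collection by sorting the element hashes
    into a canonical sequence and hashing that sequence."""
    hashes = sorted(e(x) for x in s)   # a radix sort would make this O(|s|)
    h = 2166136261                     # FNV-1a offset basis (illustrative)
    for v in hashes:
        h = ((h ^ v) * 16777619) & 0xFFFFFFFF
    return h

# Independent of enumeration order, because sorting canonicalises it:
assert sorted_set_hash([3, 1, 2]) == sorted_set_hash([2, 3, 1])
# And it separates the sets that defeat plain sum and xor:
assert sorted_set_hash([1, 6]) != sorted_set_hash([2, 5])
```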

Space: O(|s|)
Typically, the cost of w will be linear in the size of a, so the overheads are linear in |s|. It is unusual for a hashing function to require this much workspace, and the constant factor of the sorting algorithm is not small. So we may take this as a benchmark for quality, and look for a related but more efficient approach.

Partitioning
Bucket sort works by partitioning the input into buckets and then sorting the buckets. If the number of buckets is small, and the buckets are not recursively sorted, we get the following algorithm: a few low-order bits of each element hash select one of B accumulators a_i, and each accumulator a_i is updated with its own function u_i, where each u_i must be commutative and associative. The wrap-up function w can be any function of B + 1 integers (the B accumulators together with |s|). This method supports incremental revision (of the elements of a) if and only if each u_i does.
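A sketch of the partitioned scheme with B = 4 buckets, in the style of the Sum(4) variant evaluated later: the bottom two bits of each element hash select an accumulator, the remaining bits update it by summation, and the wrap-up mixes the four accumulators with the cardinality. The particular wrap-up constant below is an illustrative assumption, not the paper's:

```python
MASK = 0xFFFFFFFF

def partitioned_hash(s, e=lambda x: x & 0xFFFFFFFF):
    """Partition element hashes into B = 4 buckets by their bottom
    two bits; each bucket uses its own commutative, associative
    update (here: 32-bit sum)."""
    a = [0, 0, 0, 0]
    for x in s:
        h = e(x)
        a[h & 3] = (a[h & 3] + (h >> 2)) & MASK
    # w: any function of the B accumulators and |s|; an illustrative mix:
    h = len(s)
    for acc in a:
        h = (h * 0x9E3779B1 + acc) & MASK
    return h

# Order-independent, and it separates the sets that defeat plain sum:
assert partitioned_hash([3, 1, 2]) == partitioned_hash([2, 3, 1])
assert partitioned_hash([1, 6]) != partitioned_hash([2, 5])
```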

Symmetric Polynomial
We can generalise the Java/Smalltalk approach another way, by looking for another u.
The symmetric functions of two Boolean variables are 0, 1, x ∧ y, ¬x ∧ ¬y, x ∨ y, ¬x ∨ ¬y, x ⊕ y, and x ≡ y. Of these, only ⊕ and its near-equivalent ≡ are plausible. To go beyond this requires bit-oriented operations that mix up entire words, such as rotations.
Generalising + is more promising. Instead of looking at just x + y, we can look for symmetric polynomials in two variables with integer coefficients satisfying u(x, u(y, z)) = u(u(x, y), z), and use that for u in the Java method.
The simplest family that works is u(x, y) = p + q(x + y) + rxy. This is symmetric by construction, and a little algebra shows that it is associative if and only if pr = q(q − 1). Higher degree polynomials do not work; this was confirmed using SageMath [22]. Revising a when an element is added is obvious. To revise a when an element x is removed, let y = e(x). We need to solve a = u(a′, y) = (p + qy) + (q + ry)a′, giving a′ = (q + ry)⁻¹(a − p − qy). This has a solution if and only if q + ry has an inverse modulo 2^n, which it does provided q + ry is odd. The inverse can then be found using the Extended Euclidean Algorithm. But q + ry is odd for all y if and only if q is odd and r is even.
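This update and its inverse can be sketched directly, here modulo 2^32 with the parameter set (p, q, r) = (3860031, 2779, 2) that the evaluation section uses. Python 3.8+'s `pow(x, -1, m)` computes the modular inverse (via the Extended Euclidean Algorithm) for us:

```python
M = 1 << 32
P, Q, R = 3860031, 2779, 2     # P*R == Q*(Q - 1), Q odd, R even

def u(x, y):
    """Symmetric, associative update u(x, y) = p + q(x + y) + rxy mod 2^n."""
    return (P + Q * (x + y) + R * x * y) % M

def add(a, y):
    return u(a, y)

def remove(a, y):
    """Invert u: solve a = (p + q*y) + (q + r*y) * a_old for a_old.
    Q odd and R even make q + r*y odd, hence invertible mod 2^n."""
    return (pow(Q + R * y, -1, M) * (a - P - Q * y)) % M

# u is commutative and associative, so the fold is order-independent:
assert u(1, u(6, 11)) == u(u(1, 6), 11) == u(6, u(11, 1))

# Incremental maintenance: removing an element undoes adding it,
# regardless of the order the elements went in.
a = 0
for y in (17, 4242, 9):
    a = add(a, y)
assert remove(a, 4242) == add(add(0, 17), 9)
```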
This method can be combined with partitioning; each of the partitions may use different values for (p, q, r).

Speed
Five methods were written in C and compiled at optimisation level 2 using clang (Apple LLVM version 9.0.0, clang-900.0.38), and run on a mid-2015 MacBook Pro with a 2.2 GHz Intel Core i7 processor under the macOS Sierra 10.12.6 operating system. An array of 1,000,000 random 32-bit integers was hashed 10,000 times using each method and the times reported. In each case y is the hash code of the new element, which is just that element itself. The times are averages per iteration of the inner loop. To put this in context, the time to hash 146,522 words from a Scrabble dictionary using Robert J. Jenkins Jr.'s 32-bit "newhash" function [19] was also measured.

Quality
Six different set hash functions were implemented:
• Sum(1): the Java method where the element hashes are simply summed.
• Sum(4): the radix-sort-inspired method where the bottom 2 bits of each element hash select an accumulator, the remaining bits are summed into that accumulator, and at the end the accumulators and the set size are combined.
• Xor(1): the Smalltalk method where the element hashes are combined using exclusive or.
• Xor(4): the radix-sort-inspired method where the bottom 2 bits of each element hash select an accumulator, the remaining bits are xor-ed into that accumulator, and at the end the accumulators and the set size are combined.
• Sort: the element hashes are sorted and then combined as if they were the element hashes of a sequence.
• Fold: the symmetric polynomial method with (p, q, r) = (3860031, 2779, 2). Apart from pr = q(q − 1), p odd, q odd, r even, and gcd(p, r) = 1, there was nothing special about them. Perhaps there are additional criteria that could be used to select better parameters. Several different parameter sets were tried, all giving similar results.

Table 1 considers all 16,384 subsets of the integers {1, 2, . . ., 14}. #hash values is the number of distinct hash values. Collision rate is the number of items divided by the number of distinct hash values.
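The Sum(1) row of such an experiment is easy to reason about: with c(n) = 0 and e(i) = i, the hash of a subset of {1, ..., 14} is just its element sum, which can only take the values 0 through 105 (0 for the empty set up to 1 + 2 + ... + 14), so at most 106 distinct hash values are available for all 16,384 subsets. A quick check:

```python
from itertools import combinations

universe = range(1, 15)
# The Sum(1) hash of each of the 2**14 subsets is simply its element sum.
hashes = {sum(sub)
          for k in range(15)
          for sub in combinations(universe, k)}

assert len(hashes) == 106        # every total from 0 to 105 occurs
print(16384 / len(hashes))       # collision rate, roughly 154.6
```

This is only the arithmetic bound for Sum(1); the measured counts for the other five functions are what the paper's table reports.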