Hamming distances
The Hamming distance hamming(a,b) between two strings a and b of equal
length is the number of positions where the strings differ. For example,
hamming(01101,11001) = 2.
Consider the following problem: Given a list of n bit strings, each of length k,
calculate the minimum Hamming distance between two strings in the list. For
example, the answer for [00111,01101,11110] is 2, because
- hamming(00111,01101) = 2,
- hamming(00111,11110) = 3, and
- hamming(01101,11110) = 3.
A straightforward way to solve the problem is to go through all pairs of strings
and calculate their Hamming distances, which yields an O(n^2 k) time
algorithm.
The following function can be used to calculate distances:
int hamming(string a, string b) {
    // k is the common length of the strings (assumed to be defined globally)
    int d = 0;
    for (int i = 0; i < k; i++) {
        if (a[i] != b[i]) d++;
    }
    return d;
}
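As a minimal sketch of the straightforward approach, the following code goes through all pairs and keeps the smallest distance. The name min_hamming is a hypothetical helper introduced here for illustration, and this version of hamming reads the length from the strings themselves instead of relying on a global k.

```cpp
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// Hamming distance of two strings of equal length.
int hamming(const string& a, const string& b) {
    assert(a.size() == b.size());
    int d = 0;
    for (size_t i = 0; i < a.size(); i++) {
        if (a[i] != b[i]) d++;
    }
    return d;
}

// Hypothetical helper: minimum Hamming distance over all pairs, O(n^2 k) time.
// Returns -1 if the list has fewer than two strings.
int min_hamming(const vector<string>& v) {
    int best = -1;
    for (size_t i = 0; i < v.size(); i++) {
        for (size_t j = i + 1; j < v.size(); j++) {
            int d = hamming(v[i], v[j]);
            if (best == -1 || d < best) best = d;
        }
    }
    return best;
}
```

For the list [00111,01101,11110] from the example, min_hamming returns 2.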
However, if k is small, we can optimize the code by storing the bit strings
as integers and calculating the Hamming distances using bit operations. In
particular, if k ≤ 32, we can just store the strings as int values and use the
following function to calculate distances:
int hamming(int a, int b) {
    return __builtin_popcount(a^b);
}
In the above function, the xor operation constructs a bit string that has one bits
in positions where a and b differ. Then, the number of one bits is counted using
the __builtin_popcount function.
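The sketch below shows one way the bit strings could be converted to integers before the pairwise comparison. The helpers to_bits and min_hamming are hypothetical names, and unsigned values are used instead of int to avoid sign issues when k = 32; this is a choice made here, not something the text prescribes. Note that __builtin_popcount is a GCC/Clang builtin, so other compilers may need a different function.

```cpp
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// Hypothetical helper: parse a string of '0'/'1' characters (length <= 32)
// into an unsigned integer, first character as the most significant bit.
unsigned to_bits(const string& s) {
    assert(s.size() <= 32);
    unsigned x = 0;
    for (char c : s) x = (x << 1) | (c == '1');
    return x;
}

// Number of bit positions where a and b differ.
int hamming(unsigned a, unsigned b) {
    return __builtin_popcount(a ^ b);
}

// Hypothetical helper: minimum distance over all pairs, O(n^2) word operations.
// Returns -1 if the list has fewer than two strings.
int min_hamming(const vector<string>& strs) {
    vector<unsigned> v;
    for (const string& s : strs) v.push_back(to_bits(s));
    int best = -1;
    for (size_t i = 0; i < v.size(); i++) {
        for (size_t j = i + 1; j < v.size(); j++) {
            int d = hamming(v[i], v[j]);
            if (best == -1 || d < best) best = d;
        }
    }
    return best;
}
```

For example, 01101 and 11001 become 13 and 25, their xor is 20 (10100 in binary), and the two one bits give the distance 2.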
To compare the implementations, we generated a list of 10000 random bit
strings of length 30. Using the first approach, the search took 13.5 seconds, and
after the bit optimization, it only took 0.5 seconds. Thus, the bit-optimized code
was almost 30 times faster than the original code.
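A comparison of this kind could be set up roughly as follows. This is only a sketch under assumptions: the function names, the fixed random seed and the use of std::chrono are choices made here, not the setup used for the measurements above, and the timings will depend on the machine and compiler flags.

```cpp
#include <cassert>
#include <chrono>
#include <random>
#include <string>
#include <vector>
using namespace std;

// Generates n random bit strings of length k (fixed seed for repeatability).
vector<string> random_bit_strings(int n, int k, unsigned seed = 1) {
    mt19937 rng(seed);
    vector<string> v(n, string(k, '0'));
    for (string& s : v)
        for (char& c : s)
            if (rng() & 1) c = '1';
    return v;
}

// First approach: compare the strings character by character.
int min_hamming_strings(const vector<string>& v) {
    int best = -1;
    for (size_t i = 0; i < v.size(); i++)
        for (size_t j = i + 1; j < v.size(); j++) {
            int d = 0;
            for (size_t p = 0; p < v[i].size(); p++)
                if (v[i][p] != v[j][p]) d++;
            if (best == -1 || d < best) best = d;
        }
    return best;
}

// Second approach: store each string as an integer, then use xor and popcount.
int min_hamming_ints(const vector<string>& v) {
    vector<unsigned> x;
    for (const string& s : v) {
        assert(s.size() <= 32);
        unsigned b = 0;
        for (char c : s) b = (b << 1) | (c == '1');
        x.push_back(b);
    }
    int best = -1;
    for (size_t i = 0; i < x.size(); i++)
        for (size_t j = i + 1; j < x.size(); j++) {
            int d = __builtin_popcount(x[i] ^ x[j]);
            if (best == -1 || d < best) best = d;
        }
    return best;
}

// Runs f once and returns the elapsed wall-clock time in seconds.
template <class F>
double seconds(F f) {
    auto start = chrono::steady_clock::now();
    f();
    return chrono::duration<double>(chrono::steady_clock::now() - start).count();
}
```

With v = random_bit_strings(10000, 30), a call such as seconds([&]{ r = min_hamming_strings(v); }) times the first approach, and the same call with min_hamming_ints times the second. Both functions should return the same answer for the same input.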