Design and Analysis of Algorithms: Hash Tables *

Dictionaries

Dictionary ADT.

Operations associated with this data type allow:

  • Insert(key, value): add a key-value pair
  • Search(key): return the value stored under the key, if any
  • Delete(key): remove the pair with the given key

Typical uses:

  • Symbol tables in compilers and interpreters
  • Sets (membership testing) and caches
  • Indexing records by a key field

(Ordered Dictionary ADT next time.)

Direct addressing and hashing are two ways of implementing a dictionary. Are there others?

Direct addressing: store the element with key k directly in slot k of an array that has one slot per possible key. Every operation is O(1), but the table needs |U| slots, so it is practical only when the universe U is small.
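A minimal sketch of a direct-address table, assuming small non-negative integer keys (the class and method names are illustrative):

    class DirectAddressTable:
        def __init__(self, universe_size):
            self.slots = [None] * universe_size  # one slot per key in U

        def insert(self, key, value):
            self.slots[key] = value              # O(1): the key IS the index

        def search(self, key):
            return self.slots[key]               # O(1)

        def delete(self, key):
            self.slots[key] = None               # O(1)
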
Hashing

Basic Hashing:
  • O(1) average case time for lookup.
  • Universe of keys U mapped into slots of a hash table of size m by hash function h.
  • Because |U| > m, collisions are always possible.
    Imagine we hash by word length: 'mark' and 'beam' both hash to 4. (Stupid hash function, but it illustrates the idea.) We must resolve this collision somehow.
  • Resolve collisions by chaining:
    Each slot holds a linked list of the elements that hash to it. (See the sketch after this list.)
  • Cryptographic hashing
    Uses large hash values: SHA-1 produces 160-bit digests; the SHA-2 family produces digests of up to 512 bits.
  • Perceptual hashing
    Designed so that similar inputs (e.g., images) produce similar hash values.
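
Here is a minimal sketch of chaining, assuming Python's built-in hash; the class name and default slot count are illustrative:

    class ChainedHashTable:
        def __init__(self, m=8):
            self.m = m
            self.table = [[] for _ in range(m)]    # each slot holds a chain of (key, value) pairs

        def _slot(self, key):
            return hash(key) % self.m              # map the key into one of m slots

        def insert(self, key, value):
            chain = self.table[self._slot(key)]
            for i, (k, _) in enumerate(chain):
                if k == key:                       # key already present: update in place
                    chain[i] = (key, value)
                    return
            chain.append((key, value))             # new key (or a collision): extend the chain

        def search(self, key):
            for k, v in self.table[self._slot(key)]:
                if k == key:
                    return v
            return None

With the toy length-based hash, 'mark' and 'beam' would land in the same chain, and the search loop would walk past one to find the other.
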
Introducing probability into an algorithm.

What happens to the usual assumptions?
Correctness: always, or only most of the time?
Termination: always, or only almost always?
What does "performance" mean if the running time, the answer, or even termination can change from one run to the next?

Probability Basics

Reviewed in this document.

Simple uniform hashing

The analysis assumes chaining, together with the simple uniform hashing assumption: any given key is equally likely to hash into any of the m slots, independently of where other keys have hashed.

  • Hash table T with m slots storing n elements.
  • Load factor: α = n / m
    α is the average number of elements stored in a chain.
  • Our analysis is in terms of α, which can be less than, equal to, or greater than one.
  • Worst case is very bad:
    All n keys hash to the same slot.
    Worst case for searching becomes Θ(n) plus time to compute hash function.
    We could have just used a linked list directly!
  • Average case:
    Under simple uniform hashing, both successful and unsuccessful searches take Θ(1 + α) time on average.
    See the proofs in our textbook. (A quick experiment after this list illustrates the load factor.)
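
As a quick (hypothetical) experiment, throw n random keys into m slots and compare the chain lengths with α; the constants here are illustrative:

    import random

    m, n = 101, 404                    # table size and element count; alpha = n/m = 4
    chains = [0] * m                   # we only track chain lengths
    for _ in range(n):
        k = random.getrandbits(32)     # a random key
        chains[k % m] += 1             # division-method slot

    print(f"alpha = {n / m:.2f}")
    print(f"average chain length = {sum(chains) / m:.2f}")   # equals alpha by definition
    print(f"longest chain = {max(chains)}")                  # the spread around alpha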

Hash functions
  • First, convert key to an integer.
    E.g., we can interpret characters in a string by their ASCII values.
    Then treat each value as a digit in a radix-128 integer.
  • Keys could be many other things besides strings.
    E.g., genomes: strings over the alphabet {A, C, G, T} can be read as radix-4 integers.
  • Division method:
    h(k) = k mod P, where P is a suitably-chosen prime (ideally not close to an exact power of 2).
  • Multiplication method:
    h(k) = ⌊m (kA mod 1)⌋, where 0 < A < 1 and (kA mod 1) is the fractional part of kA.
    Knuth suggests A ≈ (√5 − 1)/2 ≈ 0.618.
    (Both methods are sketched below.)
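
A hedged sketch of both methods, first reading a string as a radix-128 integer; the constants P = 701 and m = 1024 are illustrative:

    import math

    def string_to_int(s):
        """Read the characters of s as digits of a radix-128 integer."""
        k = 0
        for ch in s:
            k = k * 128 + ord(ch)      # each ASCII value is one base-128 digit
        return k

    def h_division(k, P=701):
        """Division method: h(k) = k mod P, P prime."""
        return k % P

    def h_multiplication(k, m=1024, A=(math.sqrt(5) - 1) / 2):
        """Multiplication method: h(k) = floor(m * frac(k * A))."""
        return int(m * ((k * A) % 1))  # (k * A) % 1 is the fractional part
                                       # (floating point limits this toy version for huge k)

    k = string_to_int("mark")
    print(h_division(k), h_multiplication(k))
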
Universal hashing
  • Establish a family of hash functions.
  • Choose the family so that, for any distinct keys x ≠ y, Prob[h(x) = h(y)] <= 1/m over the random choice of h, where m is the size of our hash table.
    In other words, no pair of keys is more likely to collide than if their two slots were chosen independently at random.
  • Choose one at random each execution.
    Tricky: what if we store hash values? They become invalid once a different function is chosen.
  • Good average-case behavior on every input.
    If a "bad" function (for that particular data) is drawn on one run, a "good" one will likely be drawn on another.
    So a set of programming variable names that hashes badly on one run will hash well on the next.
    (A sketch of one standard universal family follows this list.)
Open addressing
  • All elements are stored directly in the table; no chaining.
  • Linear probing
    Easy: just move along array indices! (Sketched after this list.)
    Prone to primary clustering: runs of occupied slots tend to grow.
  • Quadratic probing
    The probe sequence steps by a quadratic function of the probe number, which breaks up primary clustering.
  • Double hashing
    Uses a second hash function to set the probe step, so different keys follow different probe sequences.
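
A minimal sketch of open addressing with linear probing; deletion (which needs tombstones) is omitted:

    class LinearProbingTable:
        def __init__(self, m=16):
            self.m = m
            self.keys = [None] * m
            self.values = [None] * m

        def insert(self, key, value):
            i = hash(key) % self.m
            for _ in range(self.m):                # probe i, i+1, i+2, ... (mod m)
                if self.keys[i] is None or self.keys[i] == key:
                    self.keys[i], self.values[i] = key, value
                    return
                i = (i + 1) % self.m               # just move along array indices
            raise RuntimeError("table full")       # open addressing needs alpha <= 1

        def search(self, key):
            i = hash(key) % self.m
            for _ in range(self.m):
                if self.keys[i] is None:           # empty slot: key cannot be further along
                    return None
                if self.keys[i] == key:
                    return self.values[i]
                i = (i + 1) % self.m
            return None
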
Perfect hashing
  • For a static set of keys, achieves O(1) worst-case search time.
  • Two-level (FKS) scheme: hash the n keys into n buckets with a universal function, then give each bucket of n_i keys its own collision-free secondary table of size n_i².
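
A sketch of the two-level idea under those assumptions (distinct integer keys; the first-level redraw that bounds total space to O(n) is omitted):

    import random

    P = 2_147_483_647                             # a prime larger than any key below

    def make_hash(m):
        """Draw h(k) = ((a*k + b) mod P) mod m from a universal family."""
        a = random.randrange(1, P)
        b = random.randrange(0, P)
        return lambda k: ((a * k + b) % P) % m

    class PerfectHashTable:
        """Two-level (FKS-style) perfect hashing for a static set of distinct integer keys."""
        def __init__(self, keys):
            n = len(keys)
            self.h = make_hash(n)
            buckets = [[] for _ in range(n)]
            for k in keys:
                buckets[self.h(k)].append(k)      # level 1: universal hash into n buckets
            self.tables = []
            for bucket in buckets:
                m2 = len(bucket) ** 2             # level 2: a table of size n_i^2 per bucket
                g, slots = None, []
                while bucket:                     # redraw until collision-free (O(1) expected tries)
                    g = make_hash(m2)
                    slots = [None] * m2
                    if self._place(g, slots, bucket):
                        break
                self.tables.append((g, slots))

        @staticmethod
        def _place(g, slots, bucket):
            for k in bucket:
                j = g(k)
                if slots[j] is not None:          # collision: the caller redraws g
                    return False
                slots[j] = k
            return True

        def contains(self, key):
            g, slots = self.tables[self.h(key)]
            return g is not None and slots[g(key)] == key

    t = PerfectHashTable([3, 17, 42, 99])
    print(t.contains(17), t.contains(5))          # True False
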
Source Code

Java
Ruby
Go
Javascript
C++
Python
Clojure

Homework

* Based on Prof. Boris Aronov's lecture notes.