# SMILE — Hash Functions User Guide

The `smile.hash` package provides a collection of high-performance, non-cryptographic
hash functions and data structures built on top of them. It currently contains five
classes:

| Class | Purpose |
|---|---|
| `MurmurHash2` | Fast 32-bit and 64-bit non-cryptographic hash for raw byte data |
| `MurmurHash3` | Improved 32-bit and 128-bit non-cryptographic hash for raw byte data |
| `PerfectHash` | Compile-time-style perfect hash mapping a fixed string set to `[0, n)` |
| `PerfectMap` | Immutable `String → T` map backed by `PerfectHash` |
| `SimHash` | Locality-sensitive fingerprinting for near-duplicate detection |

---

## 1. MurmurHash2

MurmurHash2 is an extremely fast, non-cryptographic hash designed by Austin Appleby.
The name comes from the two basic operations in its inner loop: **multiply** (MU) and
**rotate** (R).  It is well-suited to hash-table construction, checksumming, and any
task that needs a fast, well-distributed digest but does **not** require cryptographic
strength.

SMILE's implementation is adapted from Apache Cassandra and is provided as a static
interface (`MurmurHash2`) with two overloads:

```
MurmurHash2.hash32(ByteBuffer data, int offset, int length, int seed)  → int
MurmurHash2.hash64(ByteBuffer data, int offset, int length, long seed) → long
```

### 1.1 32-bit hash

```java
import smile.hash.MurmurHash2;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
ByteBuffer buf = ByteBuffer.wrap(data);

int h32 = MurmurHash2.hash32(buf, 0, data.length, 42 /*seed*/);
System.out.printf("MurmurHash2-32 = 0x%08X%n", h32);
```

### 1.2 64-bit hash

```java
long h64 = MurmurHash2.hash64(buf, 0, data.length, 0L /*seed*/);
System.out.printf("MurmurHash2-64 = 0x%016X%n", h64);
```

### 1.3 Properties

| Property | Value |
|---|---|
| Output width | 32 or 64 bits |
| Endianness | Little-endian |
| Seed support | Yes — different seeds produce independent hash families |
| Cryptographic | No |
| Suitable for | Hash tables, bloom filters, checksums, sharding |

---

## 2. MurmurHash3

MurmurHash3 supersedes MurmurHash2 with better avalanche behavior and wider output
options.  SMILE provides two variants, both adapted from Apache Cassandra:

```
MurmurHash3.hash32(String text, int seed)                                → int
MurmurHash3.hash32(byte[] data, int offset, int length, int seed)        → int
MurmurHash3.hash128(ByteBuffer data, int offset, int length,
                    long seed, long[] result)                             → void
```

The 128-bit variant implements the **x64 flavour**: two independent 64-bit lanes
that are mixed together in the finalization step.

> **Note:** The x86 and x64 128-bit variants produce *different* outputs for the same
> input.  SMILE implements only the x64 variant.

### 2.1 32-bit hash from a String

```java
import smile.hash.MurmurHash3;

int h = MurmurHash3.hash32("the quick brown fox", 0);
```

### 2.2 32-bit hash from a byte array

```java
byte[] data = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
int h = MurmurHash3.hash32(data, 0, data.length, 0);
```

### 2.3 128-bit hash (x64)

The result is written into a caller-supplied `long[2]` array: `result[0]` = lower
64 bits, `result[1]` = upper 64 bits.

```java
byte[] data = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
ByteBuffer buf = ByteBuffer.wrap(data);
long[] result = new long[2];
MurmurHash3.hash128(buf, 0, data.length, 0L, result);
System.out.printf("h1=0x%016X  h2=0x%016X%n", result[0], result[1]);
```

### 2.4 Choosing between MurmurHash2 and MurmurHash3

| Criterion | MurmurHash2 | MurmurHash3 |
|---|---|---|
| Output widths | 32, 64 | 32, 128 |
| Speed (64-bit JVM) | Slightly faster 64-bit path | Slightly faster 32-bit path |
| Avalanche quality | Good | Excellent |
| Recommendation | Legacy code / Cassandra compatibility | New code |

---

## 3. PerfectHash

`PerfectHash` builds a **minimal perfect hash** over a fixed set of strings known
at construction time.  Every key maps to a unique integer in `[0, n)` in O(length)
time — there are no collisions and no chaining.

### 3.1 How it works

The hash function is:

```
h(w) = (len(w) + Σ kvals[w[i] − min]) mod tsize
```

where `kvals` is an integer array indexed by `(character − min_char)` and `tsize`
is the table size (≥ n).  Construction fills `kvals` with random integers and
checks whether all n keys map to distinct slots.  Failed attempts draw a new
random `kvals`; with a load factor of ≈ 2/3 this succeeds in O(1) expected trials.
The final lookup table maps each hash slot to the index of the keyword stored there,
or −1 for empty slots; correctness is verified by comparing the stored keyword
against the query string.

### 3.2 Basic usage

```java
import smile.hash.PerfectHash;

PerfectHash hash = new PerfectHash("abstract", "boolean", "break",
                                   "class", "continue", "default");

int idx = hash.get("break");     // returns 2 (index in the array above)
int miss = hash.get("integer");  // returns -1 — not in the set
```

### 3.3 Selecting character positions

For performance with long strings you can tell the hash to use only specific
character positions.  This is analogous to the `%` keyword option in `gperf`.

```java
// Only look at characters at positions 0 and 1.
PerfectHash hash = new PerfectHash(new int[]{0, 1}, keywords);
```

Choosing positions that maximise distinctiveness (e.g., the first and last
characters of a keyword) typically reduces construction time.

### 3.4 Constraints

* The keyword array must be non-empty.
* Duplicate keywords throw `IllegalArgumentException`.
* The set is **frozen at construction time** — there is no way to add or remove
  keywords after construction.
* `PerfectHash` is `Serializable` and thread-safe for concurrent reads.

---

## 4. PerfectMap

`PerfectMap<T>` is an **immutable, perfect-hash-backed `String → T` map**.  Lookup
is O(key length) with zero allocation on the hot path.  It is ideal for small,
fixed dictionaries that are read very frequently (e.g., HTTP header tables,
language keyword→token mappings, configuration key→handler tables).

### 4.1 Building with the fluent Builder

```java
import smile.hash.PerfectMap;

PerfectMap<Integer> tokenTypes = new PerfectMap.Builder<Integer>()
    .add("abstract",   1)
    .add("boolean",    2)
    .add("break",      3)
    .add("class",      4)
    .add("continue",   5)
    .add("default",    6)
    .build();

int type = tokenTypes.get("break");   // 3
Object miss = tokenTypes.get("goto"); // null
```

### 4.2 Building from an existing Map

```java
Map<String, String> source = Map.of(
    "Content-Type",     "application/json",
    "Accept",           "application/json",
    "Authorization",    "Bearer …"
);

PerfectMap<String> headers = new PerfectMap.Builder<>(source).build();
String ct = headers.get("Content-Type"); // "application/json"
```

### 4.3 Null semantics

`get(key)` returns `null` for any key that was not in the original key set.
It never throws for well-formed (non-null) query strings.

---

## 5. SimHash

`SimHash` is a **locality-sensitive fingerprinting** algorithm.  Unlike ordinary
hash functions, two similar inputs produce fingerprints with a small Hamming
distance, while dissimilar inputs produce fingerprints with a large Hamming
distance.  This makes SimHash useful for near-duplicate detection in large
document collections (it is used by the Google crawler for exactly this purpose).

Each fingerprint is a 64-bit `long`.  The Hamming distance between two fingerprints
is `Long.bitCount(h1 ^ h2)`.  Documents are typically considered near-duplicates
when their Hamming distance is ≤ 3.

### 5.1 Text similarity (tokenised documents)

Use `SimHash.text()` to get a `SimHash<String[]>` that treats each element of the
array as a token.  Internally, every token is hashed with MurmurHash2-64 and its
hash bits vote (+1 or −1) on 64 counters; the sign of each counter becomes one bit
of the fingerprint.

```java
import smile.hash.SimHash;

SimHash<String[]> sh = SimHash.text();

String[] doc1 = {"the", "quick", "brown", "fox", "jumps"};
String[] doc2 = {"the", "quick", "brown", "fox", "leaps"}; // one word different

long h1 = sh.hash(doc1);
long h2 = sh.hash(doc2);

int hamming = Long.bitCount(h1 ^ h2);
System.out.println("Hamming distance = " + hamming); // small (≤ 5 typically)
```

### 5.2 Weighted feature vectors

Use `SimHash.of(byte[][] features)` when each document is represented as a sparse
vector of pre-defined feature weights.  Each feature is identified by its byte
representation; the weight array passed to `hash(int[] weights)` must be the same
length as the `features` array.

```java
byte[][] features = {
    "title".getBytes(),
    "body".getBytes(),
    "tags".getBytes()
};
SimHash<int[]> sh = SimHash.of(features);

// A document with title weight 5, body weight 3, tags weight 1
long fp = sh.hash(new int[]{5, 3, 1});
```

### 5.3 Near-duplicate detection pipeline

A typical pipeline computes fingerprints for all documents and then uses a
**hash table** keyed on the fingerprint to find exact duplicates, or a
**Hamming-distance index** (e.g., sorted permutations) to find near-duplicates:

```java
SimHash<String[]> sh = SimHash.text();
Map<Long, String> seen = new HashMap<>();

for (var entry : corpus.entrySet()) {
    String id  = entry.getKey();
    String[] tokens = entry.getValue();
    long fp = sh.hash(tokens);

    // Exact fingerprint match → likely duplicate
    String prev = seen.putIfAbsent(fp, id);
    if (prev != null) {
        System.out.println(id + " is a near-duplicate of " + prev);
    }
}
```

For large corpora, partition the 64 bits into k bands of b bits each
(standard LSH technique) and bucket documents by each band independently to
find pairs within Hamming distance ≤ b·k efficiently.

### 5.4 Properties

| Property | Value |
|---|---|
| Output width | 64 bits |
| Similarity metric | Hamming distance |
| Internal hash | MurmurHash2-64 |
| Near-duplicate threshold | Hamming ≤ 3 (configurable by application) |
| Thread safety | Stateless — the `SimHash` instance is thread-safe |

---

## 6. Quick-reference table

| Use case | Recommended class / method |
|---|---|
| Hash a byte buffer into a 32-bit int | `MurmurHash2.hash32` |
| Hash a byte buffer into a 64-bit long | `MurmurHash2.hash64` |
| Hash a string into a 32-bit int | `MurmurHash3.hash32(String, seed)` |
| Hash a byte array into a 128-bit value | `MurmurHash3.hash128` |
| O(1) lookup in a fixed string set | `PerfectHash` |
| Immutable `String → T` dictionary | `PerfectMap` |
| Near-duplicate document detection | `SimHash.text()` |
| Weighted feature-vector fingerprinting | `SimHash.of(byte[][])` |


---

*SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.*