CS210 - Week 13 - Comprehensive Reference
Hash tables provide constant-time average case performance for search, insert, and delete operations by using a hash function to compute an array index directly from the key.
A map is a data structure that models a searchable collection of key-value pairs, where:
get(key), put(key, value), remove(key)Save items in a key-indexed table where the index is computed as a function of the key.
| Scenario | Strategy | Issue |
|---|---|---|
| No space limitation | Trivial hash function with key as index | Wastes enormous amounts of memory |
| No time limitation | Sequential search through all items | Very slow for large datasets |
| Real world | Hashing with collision resolution | Balance space and time efficiently |
Scramble the keys uniformly to produce a table index that is:
public int hashCode() {
return value;
}
Simply return the integer value itself.
public int hashCode() {
if (value) return 1231;
else return 1237;
}
Use two different prime numbers for true and false.
public int hashCode() {
long bits = doubleToLongBits(value);
return (int) (bits ^ (bits >>> 32));
}
Convert to 64-bit representation and XOR the two halves.
public int hashCode() {
int hash = 0;
for (int i = 0; i < length(); i++)
hash = s[i] + (31 * hash);
return hash;
}
hash = 99·31³ + 97·31² + 108·31 + 108
= 108 + 31Ā·(108 + 31Ā·(97 + 31Ā·99))
= 3,045,982
Convert hash code (any 32-bit integer) to table index (0 to M-1):
private int hash(Key key) {
return key.hashCode() % M;
}
Can return negative values!
private int hash(Key key) {
return (key.hashCode() & 0x7fffffff) % M;
}
Masks sign bit to ensure positive result
Key Assumption: Each key is equally likely to hash to an integer between 0 and M-1.
This assumption is crucial for performance analysis but is difficult to achieve in practice.
Use an array of M < N linked lists. Each array position points to a chain of items that hash to that index.
Array Index Linked List Chain
0 ā [A] ā [B] ā [E] ā null
1 ā null
2 ā [X] ā [S] ā null
3 ā null
4 ā [L] ā [P] ā null
5 ā [M] ā [H] ā [C] ā [R] ā null
public class SeparateChainingHashST<Key, Value> {
private int M = 97; // number of chains
private Node[] st = new Node[M]; // array of chains
private static class Node {
private Object key;
private Object val;
private Node next;
}
private int hash(Key key) {
return (key.hashCode() & 0x7fffffff) % M;
}
public Value get(Key key) {
int i = hash(key);
for (Node x = st[i]; x != null; x = x.next)
if (key.equals(x.key))
return (Value) x.val;
return null;
}
public void put(Key key, Value val) {
int i = hash(key);
for (Node x = st[i]; x != null; x = x.next)
if (keys[i].equals(key)) {
x.val = val;
return;
}
st[i] = new Node(key, val, st[i]);
}
}
Under uniform hashing assumption, the probability that the number of keys in a list is within a constant factor of N/M is extremely close to 1.
| Operation | Average Case | Worst Case |
|---|---|---|
| Search Hit | ~3-5 probes | N probes |
| Search Miss | ~3-5 probes | N probes |
| Insert | ~3-5 probes | N probes |
Goal: Keep average chain length N/M ā constant
When a collision occurs, probe the next array position (i+1, i+2, ...) until an empty slot is found.
Critical: Array size M must be greater than number of key-value pairs N.
Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[P] [M] [ ] [ ] [A] [C] [S] [H] [L] [ ] [E] [ ] [ ] [ ] [R] [X]
Insert K: hash(K) = 5
Position 5 occupied ā try 6 ā occupied ā try 7 ā occupied
ā try 8 ā occupied ā try 9 ā EMPTY! Insert at 9
public class LinearProbingHashST<Key, Value> {
private int M = 30001;
private Value[] vals = (Value[]) new Object[M];
private Key[] keys = (Key[]) new Object[M];
private int hash(Key key) {
return (key.hashCode() & 0x7fffffff) % M;
}
public Value get(Key key) {
for (int i = hash(key); keys[i] != null; i = (i+1) % M)
if (key.equals(keys[i]))
return vals[i];
return null;
}
public void put(Key key, Value val) {
int i;
for (i = hash(key); keys[i] != null; i = (i+1) % M)
if (keys[i].equals(key))
break;
keys[i] = key;
vals[i] = val;
}
}
Analogy: Cars arrive at one-way street with M parking spaces. Each wants random space i; if taken, try i+1, i+2, etc.
Search Hit: ~½(1 + 1/(1-α))
Search Miss/Insert: ~½(1 + 1/(1-α)²)
where α = N/M (load factor)
| Load Factor α | Search Hit | Search Miss |
|---|---|---|
| ½ | ~1.5 | ~2.5 |
| ā | ~2.0 | ~5.0 |
| ¾ | ~3.0 | ~8.5 |
Keep α = N/M ā ½ for constant-time operations
Cannot simply delete an entry and set it to null! This would break search for items that had collided with the deleted item.
Solution: After deletion, must rehash all keys in the same cluster that comes after the deleted key.
| Implementation | Search | Insert | Delete | Ordered? | Key Interface |
|---|---|---|---|---|---|
| Sequential Search | N | N | N | No | equals() |
| Binary Search | lg N | N | N | Yes | compareTo() |
| BST | 1.39 lg N | 1.39 lg N | āN | Yes | compareTo() |
| Red-Black BST | 1.0 lg N | 1.0 lg N | 1.0 lg N | Yes | compareTo() |
| Separate Chaining | 3-5* | 3-5* | 3-5* | No | equals(), hashCode() |
| Linear Probing | 3-5* | 3-5* | 3-5* | No | equals(), hashCode() |
* Under uniform hashing assumption
Hash to two positions and insert in the shorter chain.
Result: Reduces expected length of longest chain to log log N
Use linear probing but skip a variable amount (not just 1) each time.
h(x) = h1(x) + jĀ·h2(x) where j = 0, 1, 2, ...
Advantages: Effectively eliminates clustering, table can be nearly full
Hash key to two positions. If both occupied, displace one key and reinsert it.
Result: Constant worst-case time for search!
A malicious adversary who knows your hash function can craft inputs that all hash to the same bucket, causing performance to degrade to O(N).
Examples:
Due to Java's base-31 hash function, it's possible to generate 2N strings of length 2N that all hash to the same value!
"Aa" and "BB" both hash to 2112
"AaAa", "AaBB", "BBAa", "BBBB" all hash to -540425984
Special hash functions where it's computationally infeasible to:
Examples: SHA-256, SHA-3, BLAKE3
Uses: Password storage, digital signatures, blockchain
Note: Too expensive for general symbol table use
hashCode() that uses all significant fieldsequals() and hashCode() are consistent| Use Case | Best Choice | Reason |
|---|---|---|
| Need ordered operations | Red-Black BST | Hash tables don't maintain order |
| Simple, fast lookups | Hash Table (either variant) | Constant-time average case |
| Unpredictable load | Separate Chaining | Degrades gracefully |
| Memory constrained | Linear Probing | No pointer overhead |
| Cache-sensitive code | Linear Probing | Better locality of reference |
// Template for user-defined types
public int hashCode() {
int hash = 17; // nonzero constant
hash = 31 * hash + field1.hashCode(); // reference type
hash = 31 * hash + ((Integer) field2).hashCode(); // primitive
hash = 31 * hash + field3.hashCode(); // reference type
return hash;
}
You now have a comprehensive understanding of hash tables. Practice implementing both separate chaining and linear probing to solidify these concepts.
Study Guide Created from CS210 Week 13 Materials
Algorithms by Robert Sedgewick & Kevin Wayne
Q: What are the two idealistic goals of a hash function, and why is achieving both simultaneously difficult?
A: The two goals are (1) the function must be efficiently computable and (2) it must make each table index equally likely for each key (uniform distribution). They are hard to achieve simultaneously because truly uniform hashing requires knowledge of the key distribution in advance, which is rarely available. Real hash functions use heuristics (e.g., polynomial rolling with base 31 for strings) that work well in practice but can be exploited by adversarial inputs. The uniform hashing assumption is therefore an idealization used for analysis, not a guaranteed property.
Q: Why does return key.hashCode() % M produce a bug, and how is it fixed?
A: Java's hashCode() can return any 32-bit integer, including negative values. The % operator in Java preserves the sign of the dividend, so a negative hash code produces a negative index, which would cause an ArrayIndexOutOfBoundsException. The fix is to mask the sign bit first: return (key.hashCode() & 0x7fffffff) % M. The bitwise AND with 0x7fffffff clears the most significant (sign) bit, ensuring the result is always a non-negative integer before the modulo operation.
Q: Compare separate chaining and linear probing on three dimensions: memory usage, sensitivity to hash quality, and ease of deletion.
A:
- Memory: Separate chaining uses extra memory for linked-list node pointers even when chains are short. Linear probing stores everything in the array itself (no pointers), so it uses less memory per element and has better cache locality.
- Sensitivity to hash quality: Separate chaining degrades more gracefully with a poor hash function ā worst case is one very long chain, but other chains are unaffected. Linear probing suffers from clustering: once a cluster forms, insertions anywhere near it make it grow faster, severely degrading performance.
- Deletion: Separate chaining allows straightforward deletion (remove node from linked list). Linear probing cannot simply set a slot to null ā doing so breaks search for items that probed past that slot. The fix requires rehashing all keys in the same cluster after the deleted position.
Q: For linear probing with load factor α = N/M, give the expected number of probes for a search hit and a search miss when α = ½ and α = ¾. What does this tell you about why keeping α low matters?
A: Using Knuth's formula, expected probes ā ½(1 + 1/(1-α)) for a hit and ½(1 + 1/(1-α)²) for a miss.
- At α = ½: hit ā 1.5 probes, miss ā 2.5 probes ā excellent performance.
- At α = ¾: hit ā 2.5 probes, miss ā 8.5 probes ā misses are 3Ć more expensive.
As α approaches 1, the denominator (1-α) approaches 0 and the number of probes diverges to infinity. This is why linear probing tables resize (double) when α reaches ½, trading memory for guaranteed constant-time operations.
Q: A colleague wants to use a hash table to find all keys in the range [lo, hi] efficiently. Why is this a bad idea, and what data structure should they use instead?
A: Hash tables destroy the natural ordering of keys ā a key hashes to an essentially random bucket position bearing no relationship to its value. There is no way to enumerate keys in sorted order, find a predecessor/successor, or answer range queries without scanning all N entries (O(N)). For ordered operations, a balanced BST such as a Red-Black tree should be used. It supports search, insert, and delete in O(log N), but also supports ordered operations like floor, ceiling, rank, and range queries in O(log N) time.
Q: What is an algorithmic complexity attack on a hash table, how does Java's string hash function make it possible, and what are two defenses?
A: An attacker who knows your hash function can craft a large set of keys that all hash to the same bucket, turning O(1) expected operations into O(N) worst-case operations. Java's base-31 string hash makes this possible because it is deterministic and publicly known ā it is demonstrably possible to generate 2^N strings of length 2N that all share the same hash value.
Defense 1: Use a randomized hash function (e.g., add a secret random salt at startup so the attacker cannot predict hash values).
Defense 2: Use a stronger collision-resolution strategy like cuckoo hashing, which guarantees O(1) worst-case lookups regardless of key distribution.
Q: When a hash table resizes (doubles its capacity), why must ALL existing keys be rehashed rather than simply moved to the same relative position?
A: The bucket index for a key is computed as (hashCode & 0x7fffffff) % M. When M changes (e.g., from 97 to 194), the modulo changes and every key's bucket index changes. A key that was in bucket 5 of a size-97 table would compute a completely different index in a size-194 table. Simply copying the slots to new positions would place keys in wrong buckets, breaking lookup. Therefore, every key must be run through the hash function again with the new M and inserted into its new correct bucket ā this is called rehashing. Its amortized cost is O(1) per operation when using doubling.
Q: Why is it recommended to use a prime number for the hash table size M, particularly when using modular hashing?
A: When M is prime, the modulo operation h % M distributes keys more uniformly across all M buckets. If M has small factors (e.g., M = 100), then keys whose hash codes differ only in the lower bits will cluster together. For example, if all hash codes are even, they all map to even buckets and half the table is wasted. A prime M has no such factors, ensuring no systematic bias in which buckets receive keys. In practice, common choices are prime numbers slightly less than powers of 2, such as 97, 997, or 30011, to balance resizing convenience with uniform distribution.