📚 Hash Tables Study Guide

CS210 - Week 13 - Comprehensive Reference

📖 Table of Contents

1. Introduction to Hash Tables
2. Hashing: Basic Plan
3. Hash Functions
4. Separate Chaining
5. Linear Probing
6. Performance Comparison
7. Advanced Topics
8. Summary & Best Practices

1. Introduction to Hash Tables

🎯 Main Concept

Hash tables provide constant-time average case performance for search, insert, and delete operations by using a hash function to compute an array index directly from the key.

What is a Map?

A map is a data structure that models a searchable collection of key-value pairs, where:

Keys are unique - no duplicate keys allowed
Each key maps to exactly one value
Main operations: get(key), put(key, value), remove(key)

Real-World Examples

Address Book: Name (key) → Phone Number (value)
Student Records: Student ID (key) → Student Data (value)
Dictionary: Word (key) → Definition (value)

2. Hashing: Basic Plan

Core Idea

Save items in a key-indexed table where the index is computed as a function of the key.

🔑 Key Components

Hash Function: Method for computing array index from key
Equality Test: Method for checking if two keys are equal
Collision Resolution: Strategy for handling keys that hash to the same index

The Space-Time Tradeoff

Scenario	Strategy	Issue
No space limitation	Trivial hash function with key as index	Wastes enormous amounts of memory
No time limitation	Sequential search through all items	Very slow for large datasets
Real world	Hashing with collision resolution	Balance space and time efficiently

3. Hash Functions

⚡ Idealistic Goal

Scramble the keys uniformly to produce a table index that is:

Efficiently computable
Makes each table index equally likely for each key

Computing Hash Codes

For Integers

public int hashCode() { 
    return value; 
}

Simply return the integer value itself.

For Booleans

public int hashCode() {
    if (value) return 1231;
    else return 1237;
}

Use two different prime numbers for true and false.

For Doubles

public int hashCode() {
    long bits = doubleToLongBits(value);
    return (int) (bits ^ (bits >>> 32));
}

Convert to 64-bit representation and XOR the two halves.

For Strings (Horner's Method)

public int hashCode() {
    int hash = 0;
    for (int i = 0; i < length(); i++)
        hash = s[i] + (31 * hash);
    return hash;
}

h = s[0]·31^L-1 + s[1]·31^L-2 + ... + s[L-2]·31 + s[L-1]

String Hash Example: "call"

hash = 99·31³ + 97·31² + 108·31 + 108
     = 108 + 31·(108 + 31·(97 + 31·99))
     = 3,045,982

Modular Hashing

Convert hash code (any 32-bit integer) to table index (0 to M-1):

❌ Bug Version

private int hash(Key key) {
    return key.hashCode() % M;
}

Can return negative values!

✅ Correct Version

private int hash(Key key) {
    return (key.hashCode() & 0x7fffffff) % M;
}

Masks sign bit to ensure positive result

Uniform Hashing Assumption

Key Assumption: Each key is equally likely to hash to an integer between 0 and M-1.

This assumption is crucial for performance analysis but is difficult to achieve in practice.

Birthday Problem & Load Balancing

Birthday Problem: Expect two balls in same bin after ~√(πM/2) tosses
Coupon Collector: Expect every bin has ≥1 ball after ~M ln M tosses
Load Balancing: After M tosses, most loaded bin has Θ(log M / log log M) balls

4. Separate Chaining

💡 Core Idea

Use an array of M < N linked lists. Each array position points to a chain of items that hash to that index.

How It Works

Hash: Map key to integer i between 0 and M-1
Insert: Put at front of i^th chain (if not already there)
Search: Search only the i^th chain

Visual Representation

Array Index          Linked List Chain
    0         →     [A] → [B] → [E] → null
    1         →     null
    2         →     [X] → [S] → null
    3         →     null
    4         →     [L] → [P] → null
    5         →     [M] → [H] → [C] → [R] → null

Java Implementation

public class SeparateChainingHashST<Key, Value> {
    private int M = 97;  // number of chains
    private Node[] st = new Node[M];  // array of chains
    
    private static class Node {
        private Object key;
        private Object val;
        private Node next;
    }
    
    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }
    
    public Value get(Key key) {
        int i = hash(key);
        for (Node x = st[i]; x != null; x = x.next)
            if (key.equals(x.key))
                return (Value) x.val;
        return null;
    }
    
    public void put(Key key, Value val) {
        int i = hash(key);
        for (Node x = st[i]; x != null; x = x.next)
            if (keys[i].equals(key)) { 
                x.val = val; 
                return; 
            }
        st[i] = new Node(key, val, st[i]);
    }
}

Performance Analysis

✅ Proposition

Under uniform hashing assumption, the probability that the number of keys in a list is within a constant factor of N/M is extremely close to 1.

Average # of probes = N/M

Operation	Average Case	Worst Case
Search Hit	~3-5 probes	N probes
Search Miss	~3-5 probes	N probes
Insert	~3-5 probes	N probes

Resizing Strategy

Goal: Keep average chain length N/M ≈ constant

Double array size M when N/M ≥ 8
Halve array size M when N/M ≤ 2
Must rehash all keys when resizing (hash value changes!)

5. Linear Probing

💡 Core Idea (Open Addressing)

When a collision occurs, probe the next array position (i+1, i+2, ...) until an empty slot is found.

How It Works

Hash: Map key to integer i between 0 and M-1
Insert: Put at table index i if free; if not, try i+1, i+2, etc.
Search: Search table index i; if occupied but no match, try i+1, i+2, etc.

Critical: Array size M must be greater than number of key-value pairs N.

Visual Example

Index:  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
       [P] [M] [ ] [ ] [A] [C] [S] [H] [L] [ ] [E] [ ] [ ] [ ] [R] [X]

Insert K: hash(K) = 5
          Position 5 occupied → try 6 → occupied → try 7 → occupied
          → try 8 → occupied → try 9 → EMPTY! Insert at 9

Java Implementation

public class LinearProbingHashST<Key, Value> {
    private int M = 30001;
    private Value[] vals = (Value[]) new Object[M];
    private Key[] keys = (Key[]) new Object[M];
    
    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }
    
    public Value get(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i+1) % M)
            if (key.equals(keys[i]))
                return vals[i];
        return null;
    }
    
    public void put(Key key, Value val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i+1) % M)
            if (keys[i].equals(key))
                break;
        keys[i] = key;
        vals[i] = val;
    }
}

Knuth's Parking Problem

Analogy: Cars arrive at one-way street with M parking spaces. Each wants random space i; if taken, try i+1, i+2, etc.

Half-full (M/2 cars): Mean displacement ≈ 3/2
Full (M cars): Mean displacement ≈ √(πM/8)

Performance Analysis

Search Hit: ~½(1 + 1/(1-α))

Search Miss/Insert: ~½(1 + 1/(1-α)²)

where α = N/M (load factor)

Load Factor α	Search Hit	Search Miss
½	~1.5	~2.5
⅔	~2.0	~5.0
¾	~3.0	~8.5

⚙️ Typical Choice

Keep α = N/M ≈ ½ for constant-time operations

Double size when N/M ≥ ½
Halve size when N/M ≤ ⅛

Deletion in Linear Probing

⚠️ Critical Issue

Cannot simply delete an entry and set it to null! This would break search for items that had collided with the deleted item.

Solution: After deletion, must rehash all keys in the same cluster that comes after the deleted key.

6. Performance Comparison

Separate Chaining vs Linear Probing

✅ Separate Chaining

Performance degrades gracefully
Less sensitive to poorly-designed hash functions
Easy deletion
Can exceed load factor of 1

✅ Linear Probing

Less wasted space (no pointers)
Better cache performance
Simpler implementation
Faster when load factor is low

Complete Symbol Table Implementations Comparison

Implementation	Search	Insert	Delete	Ordered?	Key Interface
Sequential Search	N	N	N	No	equals()
Binary Search	lg N	N	N	Yes	compareTo()
BST	1.39 lg N	1.39 lg N	√N	Yes	compareTo()
Red-Black BST	1.0 lg N	1.0 lg N	1.0 lg N	Yes	compareTo()
Separate Chaining	3-5*	3-5*	3-5*	No	equals(), hashCode()
Linear Probing	3-5*	3-5*	3-5*	No	equals(), hashCode()

* Under uniform hashing assumption

7. Advanced Topics

Hash Function Variants

Two-Probe Hashing

Hash to two positions and insert in the shorter chain.

Result: Reduces expected length of longest chain to log log N

Double Hashing

Use linear probing but skip a variable amount (not just 1) each time.

h(x) = h1(x) + j·h2(x)  where j = 0, 1, 2, ...

Advantages: Effectively eliminates clustering, table can be nearly full

Cuckoo Hashing

Hash key to two positions. If both occupied, displace one key and reinsert it.

Result: Constant worst-case time for search!

Algorithmic Complexity Attacks

🚨 Security Concern

A malicious adversary who knows your hash function can craft inputs that all hash to the same bucket, causing performance to degrade to O(N).

Examples:

Denial-of-service attacks on web servers
Bro server exploits
Perl 5.8.0 associative array attacks

Java String Hash Collision

Due to Java's base-31 hash function, it's possible to generate 2^N strings of length 2N that all hash to the same value!

"Aa" and "BB" both hash to 2112
"AaAa", "AaBB", "BBAa", "BBBB" all hash to -540425984

One-Way Hash Functions

Cryptographic Hash Functions

Special hash functions where it's computationally infeasible to:

Find a key that hashes to a specific value
Find two keys that hash to the same value

Examples: SHA-256, SHA-3, BLAKE3

Uses: Password storage, digital signatures, blockchain

Note: Too expensive for general symbol table use

8. Summary & Best Practices

🎯 Key Takeaways

Hash tables provide constant-time average performance for search/insert/delete
Require a good hash function and collision resolution strategy
Two main approaches: Separate Chaining and Linear Probing
Performance depends on load factor α = N/M
Uniform hashing assumption is critical but hard to achieve

Best Practices

✅ DO:

Use prime numbers or powers of 2 for table size M
Keep load factor α between 0.5 and 0.75 for linear probing
Keep load factor α around 1-4 for separate chaining
Implement proper hashCode() that uses all significant fields
Ensure equals() and hashCode() are consistent
Resize dynamically to maintain good performance

❌ DON'T:

Use predictable hash functions in security-critical applications
Forget to handle the sign bit when converting hash codes to indices
Directly delete entries in linear probing (must rehash cluster)
Use hash tables if you need ordered operations (use BST instead)
Assume uniform hashing without testing your hash function

When to Use What

Use Case	Best Choice	Reason
Need ordered operations	Red-Black BST	Hash tables don't maintain order
Simple, fast lookups	Hash Table (either variant)	Constant-time average case
Unpredictable load	Separate Chaining	Degrades gracefully
Memory constrained	Linear Probing	No pointer overhead
Cache-sensitive code	Linear Probing	Better locality of reference

Common Pitfalls

Hash Code Overflow: Forgetting to mask sign bit can cause negative indices
Poor Hash Functions: Using only part of the key leads to clustering
Inconsistent equals/hashCode: If a.equals(b) but a.hashCode() ≠ b.hashCode(), table breaks
Ignoring Load Factor: Letting table get too full degrades performance
Security: Using predictable hash functions in adversarial environments

Quick Reference: Hash Code Implementation

// Template for user-defined types
public int hashCode() {
    int hash = 17;  // nonzero constant
    hash = 31 * hash + field1.hashCode();  // reference type
    hash = 31 * hash + ((Integer) field2).hashCode();  // primitive
    hash = 31 * hash + field3.hashCode();  // reference type
    return hash;
}

🎓 You're Ready!

You now have a comprehensive understanding of hash tables. Practice implementing both separate chaining and linear probing to solidify these concepts.

Study Guide Created from CS210 Week 13 Materials

Algorithms by Robert Sedgewick & Kevin Wayne

🎓 Practice Questions

Question 1: Hash Function Requirements

Q: What are the two idealistic goals of a hash function, and why is achieving both simultaneously difficult?

A: The two goals are (1) the function must be efficiently computable and (2) it must make each table index equally likely for each key (uniform distribution). They are hard to achieve simultaneously because truly uniform hashing requires knowledge of the key distribution in advance, which is rarely available. Real hash functions use heuristics (e.g., polynomial rolling with base 31 for strings) that work well in practice but can be exploited by adversarial inputs. The uniform hashing assumption is therefore an idealization used for analysis, not a guaranteed property.

Question 2: The Sign-Bit Bug

Q: Why does return key.hashCode() % M produce a bug, and how is it fixed?

A: Java's hashCode() can return any 32-bit integer, including negative values. The % operator in Java preserves the sign of the dividend, so a negative hash code produces a negative index, which would cause an ArrayIndexOutOfBoundsException. The fix is to mask the sign bit first: return (key.hashCode() & 0x7fffffff) % M. The bitwise AND with 0x7fffffff clears the most significant (sign) bit, ensuring the result is always a non-negative integer before the modulo operation.

Question 3: Separate Chaining vs. Linear Probing

Q: Compare separate chaining and linear probing on three dimensions: memory usage, sensitivity to hash quality, and ease of deletion.

A: - Memory: Separate chaining uses extra memory for linked-list node pointers even when chains are short. Linear probing stores everything in the array itself (no pointers), so it uses less memory per element and has better cache locality.
- Sensitivity to hash quality: Separate chaining degrades more gracefully with a poor hash function — worst case is one very long chain, but other chains are unaffected. Linear probing suffers from clustering: once a cluster forms, insertions anywhere near it make it grow faster, severely degrading performance.
- Deletion: Separate chaining allows straightforward deletion (remove node from linked list). Linear probing cannot simply set a slot to null — doing so breaks search for items that probed past that slot. The fix requires rehashing all keys in the same cluster after the deleted position.

Question 4: Load Factor and Performance

Q: For linear probing with load factor α = N/M, give the expected number of probes for a search hit and a search miss when α = ½ and α = ¾. What does this tell you about why keeping α low matters?

A: Using Knuth's formula, expected probes ≈ ½(1 + 1/(1-α)) for a hit and ½(1 + 1/(1-α)²) for a miss.
- At α = ½: hit ≈ 1.5 probes, miss ≈ 2.5 probes — excellent performance.
- At α = ¾: hit ≈ 2.5 probes, miss ≈ 8.5 probes — misses are 3× more expensive.
As α approaches 1, the denominator (1-α) approaches 0 and the number of probes diverges to infinity. This is why linear probing tables resize (double) when α reaches ½, trading memory for guaranteed constant-time operations.

Question 5: Why Hash Tables Cannot Support Ordered Operations

Q: A colleague wants to use a hash table to find all keys in the range [lo, hi] efficiently. Why is this a bad idea, and what data structure should they use instead?

A: Hash tables destroy the natural ordering of keys — a key hashes to an essentially random bucket position bearing no relationship to its value. There is no way to enumerate keys in sorted order, find a predecessor/successor, or answer range queries without scanning all N entries (O(N)). For ordered operations, a balanced BST such as a Red-Black tree should be used. It supports search, insert, and delete in O(log N), but also supports ordered operations like floor, ceiling, rank, and range queries in O(log N) time.

Question 6: Algorithmic Complexity Attacks

Q: What is an algorithmic complexity attack on a hash table, how does Java's string hash function make it possible, and what are two defenses?

A: An attacker who knows your hash function can craft a large set of keys that all hash to the same bucket, turning O(1) expected operations into O(N) worst-case operations. Java's base-31 string hash makes this possible because it is deterministic and publicly known — it is demonstrably possible to generate 2^N strings of length 2N that all share the same hash value.
Defense 1: Use a randomized hash function (e.g., add a secret random salt at startup so the attacker cannot predict hash values).
Defense 2: Use a stronger collision-resolution strategy like cuckoo hashing, which guarantees O(1) worst-case lookups regardless of key distribution.

Question 7: Resizing and Rehashing

Q: When a hash table resizes (doubles its capacity), why must ALL existing keys be rehashed rather than simply moved to the same relative position?

A: The bucket index for a key is computed as (hashCode & 0x7fffffff) % M. When M changes (e.g., from 97 to 194), the modulo changes and every key's bucket index changes. A key that was in bucket 5 of a size-97 table would compute a completely different index in a size-194 table. Simply copying the slots to new positions would place keys in wrong buckets, breaking lookup. Therefore, every key must be run through the hash function again with the new M and inserted into its new correct bucket — this is called rehashing. Its amortized cost is O(1) per operation when using doubling.

Question 8: Choosing the Right Table Size

Q: Why is it recommended to use a prime number for the hash table size M, particularly when using modular hashing?

A: When M is prime, the modulo operation h % M distributes keys more uniformly across all M buckets. If M has small factors (e.g., M = 100), then keys whose hash codes differ only in the lower bits will cluster together. For example, if all hash codes are even, they all map to even buckets and half the table is wasted. A prime M has no such factors, ensuring no systematic bias in which buckets receive keys. In practice, common choices are prime numbers slightly less than powers of 2, such as 97, 997, or 30011, to balance resizing convenience with uniform distribution.