📚 Hash Tables Study Guide

CS210 - Week 13 - Comprehensive Reference

📖 Table of Contents

1. Introduction to Hash Tables
2. Hashing: Basic Plan
3. Hash Functions
4. Separate Chaining
5. Linear Probing
6. Performance Comparison
7. Advanced Topics
8. Summary & Best Practices

1. Introduction to Hash Tables

🎯 Main Concept

Hash tables provide constant-time average case performance for search, insert, and delete operations by using a hash function to compute an array index directly from the key.

What is a Map?

A map is a data structure that models a searchable collection of key-value pairs, where:

Keys are unique - no duplicate keys allowed
Each key maps to exactly one value
Main operations: get(key), put(key, value), remove(key)
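
As a quick illustration of these three operations, here is a minimal sketch that uses the standard java.util.HashMap. The class name MapOperationsDemo and the sample names and numbers are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

class MapOperationsDemo {
    // Builds a small address book and exercises put, get, and remove.
    static Map<String, String> buildAddressBook() {
        Map<String, String> book = new HashMap<>();
        book.put("Alice", "555-0100");
        book.put("Bob", "555-0101");
        book.put("Alice", "555-0199"); // keys are unique: the old value is replaced
        book.remove("Bob");
        return book;
    }

    public static void main(String[] args) {
        Map<String, String> book = buildAddressBook();
        System.out.println(book.get("Alice")); // 555-0199
        System.out.println(book.get("Bob"));   // null, because Bob was removed
        System.out.println(book.size());       // 1
    }
}
```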

Real-World Examples

Address Book: Name (key) → Phone Number (value)
Student Records: Student ID (key) → Student Data (value)
Dictionary: Word (key) → Definition (value)

2. Hashing: Basic Plan

Core Idea

Save items in a key-indexed table where the index is computed as a function of the key.

🔑 Key Components

Hash Function: Method for computing array index from key
Equality Test: Method for checking if two keys are equal
Collision Resolution: Strategy for handling keys that hash to the same index

The Space-Time Tradeoff

Scenario	Strategy	Issue
No space limitation	Trivial hash function with key as index	Wastes enormous amounts of memory
No time limitation	Sequential search through all items	Very slow for large datasets
Real world	Hashing with collision resolution	Balance space and time efficiently

3. Hash Functions

⚡ Idealistic Goal

Scramble the keys uniformly to produce a table index. The hash function should be:

Efficiently computable
Equally likely to map each key to any table index

Computing Hash Codes

For Integers

public int hashCode() { 
    return value; 
}

Simply return the integer value itself.

For Booleans

public int hashCode() {
    if (value) return 1231;
    else return 1237;
}

Use two different prime numbers for true and false.

For Doubles

public int hashCode() {
    long bits = Double.doubleToLongBits(value);
    return (int) (bits ^ (bits >>> 32));
}

Convert to 64-bit representation and XOR the two halves.

For Strings (Horner's Method)

public int hashCode() {
    int hash = 0;
    for (int i = 0; i < length(); i++)
        hash = charAt(i) + (31 * hash);
    return hash;
}

h = s[0]·31^(L-1) + s[1]·31^(L-2) + ... + s[L-2]·31 + s[L-1]

String Hash Example: "call"

hash = 99·31³ + 97·31² + 108·31 + 108
     = 108 + 31·(108 + 31·(97 + 31·99))
     = 3,045,982
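
The arithmetic above can be checked with a few lines of Java. This is a minimal sketch; the class and method names (HornerHashDemo, hornerHash) are chosen for this example only.

```java
class HornerHashDemo {
    // Horner's method with base 31: processes one character per step.
    static int hornerHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++)
            hash = s.charAt(i) + (31 * hash);
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(hornerHash("call"));  // 3045982
        System.out.println("call".hashCode());   // 3045982, String.hashCode() uses the same formula
    }
}
```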

Modular Hashing

Convert hash code (any 32-bit integer) to table index (0 to M-1):

❌ Bug Version

private int hash(Key key) {
    return key.hashCode() % M;
}

Can return a negative index: in Java, % keeps the sign of the dividend, and hashCode() may be negative.

✅ Correct Version

private int hash(Key key) {
    return (key.hashCode() & 0x7fffffff) % M;
}

Masks the sign bit so the result is always a non-negative index between 0 and M-1
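
To see the difference concretely, the sketch below compares the two versions on keys with negative hash codes. The helper names (buggyIndex, absIndex, safeIndex) are hypothetical. The Math.abs variant is included only to show that it also fails for one input, Integer.MIN_VALUE.

```java
class ModularHashDemo {
    // Bug version: the sign of hashCode() carries through %.
    static int buggyIndex(Object key, int m) {
        return key.hashCode() % m;
    }

    // Tempting fix that still fails: Math.abs(Integer.MIN_VALUE) is negative.
    static int absIndex(Object key, int m) {
        return Math.abs(key.hashCode()) % m;
    }

    // Correct version: clear the sign bit before taking the remainder.
    static int safeIndex(Object key, int m) {
        return (key.hashCode() & 0x7fffffff) % m;
    }

    public static void main(String[] args) {
        Integer key = -7;                 // Integer.hashCode() is the value itself
        System.out.println(buggyIndex(key, 5)); // -2
        System.out.println(safeIndex(key, 5));  // 1

        Integer worst = Integer.MIN_VALUE;
        System.out.println(absIndex(worst, 5) < 0); // true
        System.out.println(safeIndex(worst, 5));    // 0
    }
}
```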

Uniform Hashing Assumption

Key Assumption: Each key is equally likely to hash to an integer between 0 and M-1.

This assumption is crucial for performance analysis but is difficult to achieve in practice.

Birthday Problem & Load Balancing

Birthday Problem: Expect two balls in same bin after ~√(πM/2) tosses
Coupon Collector: Expect every bin has ≥1 ball after ~M ln M tosses
Load Balancing: After M tosses, most loaded bin has Θ(log M / log log M) balls
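
The birthday estimate can be checked empirically. The following sketch simulates tossing balls into M bins and averages the number of tosses until the first collision. The class name, trial count, and seed are arbitrary choices for this example, and a simulated average only approximates the ~√(πM/2) estimate.

```java
import java.util.Random;

class BirthdaySimulation {
    // Tosses balls uniformly into m bins; returns the toss on which a bin is hit twice.
    static int tossesUntilCollision(int m, Random rng) {
        boolean[] occupied = new boolean[m];
        int tosses = 0;
        while (true) {
            int bin = rng.nextInt(m);
            tosses++;
            if (occupied[bin]) return tosses;
            occupied[bin] = true;
        }
    }

    // Averages the first-collision toss count over many independent trials.
    static double averageTosses(int m, int trials, long seed) {
        Random rng = new Random(seed);
        long total = 0;
        for (int t = 0; t < trials; t++)
            total += tossesUntilCollision(m, rng);
        return (double) total / trials;
    }

    public static void main(String[] args) {
        int m = 365;
        System.out.printf("estimate  sqrt(pi*M/2) = %.2f%n", Math.sqrt(Math.PI * m / 2));
        System.out.printf("simulated average      = %.2f%n", averageTosses(m, 20000, 42L));
    }
}
```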

4. Separate Chaining

💡 Core Idea

Use an array of M < N linked lists. Each array position points to a chain of items that hash to that index.

How It Works

Hash: Map key to integer i between 0 and M-1
Insert: Put at front of i-th chain (if not already there)
Search: Search only the i-th chain

Visual Representation

Array Index          Linked List Chain
    0         →     [A] → [B] → [E] → null
    1         →     null
    2         →     [X] → [S] → null
    3         →     null
    4         →     [L] → [P] → null
    5         →     [M] → [H] → [C] → [R] → null

Java Implementation

public class SeparateChainingHashST<Key, Value> {
    private int M = 97;  // number of chains
    private Node[] st = new Node[M];  // array of chains
    
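    private static class Node {
        private Object key;
        private Object val;
        private Node next;

        // Constructor used by put() to link a new node at the front of a chain
        Node(Object key, Object val, Node next) {
            this.key = key;
            this.val = val;
            this.next = next;
        }
    }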
    
    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }
    
    public Value get(Key key) {
        int i = hash(key);
        for (Node x = st[i]; x != null; x = x.next)
            if (key.equals(x.key))
                return (Value) x.val;
        return null;
    }
    
    public void put(Key key, Value val) {
        int i = hash(key);
        for (Node x = st[i]; x != null; x = x.next)
            if (key.equals(x.key)) { 
                x.val = val; 
                return; 
            }
        st[i] = new Node(key, val, st[i]);
    }
}

Performance Analysis

✅ Proposition

Under uniform hashing assumption, the probability that the number of keys in a list is within a constant factor of N/M is extremely close to 1.

Number of probes for search/insert is proportional to N/M (the average chain length)

Operation	Average Case	Worst Case
Search Hit	~3-5 probes	N probes
Search Miss	~3-5 probes	N probes
Insert	~3-5 probes	N probes

Resizing Strategy

Goal: Keep average chain length N/M ≈ constant

Double array size M when N/M ≥ 8
Halve array size M when N/M ≤ 2
Must rehash all keys when resizing (the table index depends on M, so it changes even though hashCode() does not)
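
The doubling rule can be sketched as follows. This is an illustrative variant, not the textbook class: the name ResizingChainingST, the starting size of 4 chains, and the chains() accessor are assumptions for this example, and halving on delete is left out to keep it short.

```java
class ResizingChainingST<Key, Value> {
    private static class Node {
        final Object key;
        Object val;
        Node next;
        Node(Object key, Object val, Node next) {
            this.key = key; this.val = val; this.next = next;
        }
    }

    private int m = 4;                 // number of chains (arbitrary starting size)
    private int n = 0;                 // number of key-value pairs
    private Node[] st = new Node[m];

    // The index depends on the table size, so it must be recomputed after a resize.
    private int hash(Object key, int size) {
        return (key.hashCode() & 0x7fffffff) % size;
    }

    @SuppressWarnings("unchecked")
    public Value get(Key key) {
        for (Node x = st[hash(key, m)]; x != null; x = x.next)
            if (key.equals(x.key)) return (Value) x.val;
        return null;
    }

    public void put(Key key, Value val) {
        if (n >= 8 * m) resize(2 * m);        // double when N/M >= 8
        int i = hash(key, m);
        for (Node x = st[i]; x != null; x = x.next)
            if (key.equals(x.key)) { x.val = val; return; }
        st[i] = new Node(key, val, st[i]);
        n++;
    }

    // Rehashes every key into a new array of newM chains.
    private void resize(int newM) {
        Node[] fresh = new Node[newM];
        for (Node head : st)
            for (Node x = head; x != null; x = x.next) {
                int i = hash(x.key, newM);
                fresh[i] = new Node(x.key, x.val, fresh[i]);
            }
        st = fresh;
        m = newM;
    }

    public int size()   { return n; }
    public int chains() { return m; }

    public static void main(String[] args) {
        ResizingChainingST<Integer, String> st = new ResizingChainingST<>();
        for (int k = 0; k < 100; k++) st.put(k, "v" + k);
        System.out.println(st.size() + " keys in " + st.chains() + " chains"); // 100 keys in 16 chains
    }
}
```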

5. Linear Probing

💡 Core Idea (Open Addressing)

When a collision occurs, probe the next array position (i+1, i+2, ...) until an empty slot is found.

How It Works

Hash: Map key to integer i between 0 and M-1
Insert: Put at table index i if free; if not, try i+1, i+2, etc.
Search: Search table index i; if occupied but no match, try i+1, i+2, etc.

Critical: Array size M must be greater than number of key-value pairs N.

Visual Example

Index:  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
       [P] [M] [ ] [ ] [A] [C] [S] [H] [L] [ ] [E] [ ] [ ] [ ] [R] [X]

Insert K: hash(K) = 5
          Position 5 occupied → try 6 → occupied → try 7 → occupied
          → try 8 → occupied → try 9 → EMPTY! Insert at 9
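
The probe sequence in this example can be traced in code. The sketch below rebuilds the 16-slot table shown above and looks for the slot where a key with hash 5 would land. The names LinearProbingTrace, exampleTable, and findSlot are hypothetical, and the loop assumes the table has at least one empty slot.

```java
class LinearProbingTrace {
    // Rebuilds the example table; a space in the layout string marks an empty slot.
    static Character[] exampleTable() {
        String layout = "PM  ACSHL E   RX";
        Character[] keys = new Character[layout.length()];
        for (int i = 0; i < layout.length(); i++)
            if (layout.charAt(i) != ' ') keys[i] = layout.charAt(i);
        return keys;
    }

    // Probes hash, hash+1, ... (wrapping around) until an empty slot is found.
    static int findSlot(Character[] keys, int hash) {
        int i = hash;
        while (keys[i] != null)
            i = (i + 1) % keys.length;
        return i;
    }

    public static void main(String[] args) {
        Character[] keys = exampleTable();
        System.out.println(findSlot(keys, 5));  // 9: slots 5, 6, 7, 8 are occupied
        System.out.println(findSlot(keys, 14)); // 2: wraps past 15, 0, 1
    }
}
```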

Java Implementation

public class LinearProbingHashST<Key, Value> {
    private int M = 30001;
    private Value[] vals = (Value[]) new Object[M];
    private Key[] keys = (Key[]) new Object[M];
    
    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }
    
    public Value get(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i+1) % M)
            if (key.equals(keys[i]))
                return vals[i];
        return null;
    }
    
    public void put(Key key, Value val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i+1) % M)
            if (keys[i].equals(key))
                break;
        keys[i] = key;
        vals[i] = val;
    }
}

Knuth's Parking Problem

Analogy: Cars arrive at one-way street with M parking spaces. Each wants random space i; if taken, try i+1, i+2, etc.

Half-full (M/2 cars): Mean displacement ≈ 3/2
Full (M cars): Mean displacement ≈ √(πM/8)

Performance Analysis

Search Hit: ~½(1 + 1/(1-α))

Search Miss/Insert: ~½(1 + 1/(1-α)²)

where α = N/M (load factor)

Load Factor α	Search Hit	Search Miss
½	~1.5	~2.5
⅔	~2.0	~5.0
¾	~2.5	~8.5
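
The two formulas are easy to evaluate directly. This sketch computes the expected probe counts for the three load factors listed; the class and method names are made up for this example.

```java
class LinearProbingCost {
    // Expected probes for a search hit: 1/2 * (1 + 1/(1 - alpha))
    static double searchHit(double alpha) {
        return 0.5 * (1 + 1 / (1 - alpha));
    }

    // Expected probes for a search miss or insert: 1/2 * (1 + 1/(1 - alpha)^2)
    static double searchMiss(double alpha) {
        return 0.5 * (1 + 1 / ((1 - alpha) * (1 - alpha)));
    }

    public static void main(String[] args) {
        double[] alphas = { 1.0 / 2, 2.0 / 3, 3.0 / 4 };
        for (double a : alphas)
            System.out.printf("alpha=%.3f  hit=%.1f  miss=%.1f%n", a, searchHit(a), searchMiss(a));
        // alpha=0.500  hit=1.5  miss=2.5
        // alpha=0.667  hit=2.0  miss=5.0
        // alpha=0.750  hit=2.5  miss=8.5
    }
}
```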

⚙️ Typical Choice

Keep α = N/M ≈ ½ for constant-time operations

Double size when N/M ≥ ½
Halve size when N/M ≤ ⅛

Deletion in Linear Probing

⚠️ Critical Issue

Cannot simply delete an entry and set it to null! This would break search for items that had collided with the deleted item.

Solution: After deletion, reinsert (rehash) every key in the same cluster that comes after the deleted key's slot.
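
A minimal sketch of this deletion strategy follows. To stay short it assumes String keys, Integer values, and a fixed capacity of 16 with no resizing; the class name LinearProbingDeleteST is hypothetical.

```java
class LinearProbingDeleteST {
    private static final int M = 16;          // fixed capacity; resizing is omitted
    private final String[] keys = new String[M];
    private final Integer[] vals = new Integer[M];
    private int n = 0;

    private int hash(String key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public Integer get(String key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M)
            if (keys[i].equals(key)) return vals[i];
        return null;
    }

    public void put(String key, Integer val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i + 1) % M)
            if (keys[i].equals(key)) { vals[i] = val; return; }
        // Always keep one slot empty so every probe loop terminates.
        if (n == M - 1) throw new IllegalStateException("table full");
        keys[i] = key;
        vals[i] = val;
        n++;
    }

    public void delete(String key) {
        int i = hash(key);
        while (keys[i] != null && !keys[i].equals(key))
            i = (i + 1) % M;
        if (keys[i] == null) return;          // key is not in the table
        keys[i] = null;
        vals[i] = null;
        n--;
        // Remove and reinsert each key in the rest of the cluster.
        for (i = (i + 1) % M; keys[i] != null; i = (i + 1) % M) {
            String k = keys[i];
            Integer v = vals[i];
            keys[i] = null;
            vals[i] = null;
            n--;
            put(k, v);
        }
    }

    public int size() { return n; }
}
```

With this table size, "Aa", "BB", and "C#" all hash to index 0, so they form one cluster. Deleting "Aa" and then calling get("BB") or get("C#") still succeeds because the rest of the cluster was reinserted.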

6. Performance Comparison

Separate Chaining vs Linear Probing

✅ Separate Chaining

Performance degrades gracefully
Less sensitive to poorly-designed hash functions
Easy deletion
Can exceed load factor of 1

✅ Linear Probing

Less wasted space (no pointers)
Better cache performance
Simpler implementation
Faster when load factor is low

Complete Symbol Table Implementations Comparison

Implementation	Search	Insert	Delete	Ordered?	Key Interface
Sequential Search	N	N	N	No	equals()
Binary Search	lg N	N	N	Yes	compareTo()
BST	1.39 lg N	1.39 lg N	√N	Yes	compareTo()
Red-Black BST	1.0 lg N	1.0 lg N	1.0 lg N	Yes	compareTo()
Separate Chaining	3-5*	3-5*	3-5*	No	equals(), hashCode()
Linear Probing	3-5*	3-5*	3-5*	No	equals(), hashCode()

* Under uniform hashing assumption

7. Advanced Topics

Hash Function Variants

Two-Probe Hashing

Hash to two positions and insert in the shorter chain.

Result: Reduces expected length of longest chain to log log N

Double Hashing

Use linear probing but skip a variable amount (not just 1) each time.

h(x, j) = (h1(x) + j·h2(x)) mod M  where j = 0, 1, 2, ...

Advantages: Effectively eliminates clustering, table can be nearly full
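
As a sketch of how the probe sequence behaves, the code below lists the slots visited for a given hash code. The choice h2(x) = 1 + (x mod (M-1)) and a prime table size M are common textbook assumptions, not something specified in this guide; the class and method names are hypothetical.

```java
import java.util.Arrays;

class DoubleHashingProbe {
    // Returns the m slots probed, in order, for a key with the given hash code.
    // Assumes m is prime and m >= 2, so the step size is coprime to m.
    static int[] probeSequence(int hashCode, int m) {
        int h = hashCode & 0x7fffffff;
        int h1 = h % m;                 // starting slot
        int h2 = 1 + (h % (m - 1));     // step size, never zero
        int[] seq = new int[m];
        for (int j = 0; j < m; j++)
            seq[j] = (int) ((h1 + (long) j * h2) % m);
        return seq;
    }

    public static void main(String[] args) {
        // Both keys start at slot 3 but then follow different paths.
        System.out.println(Arrays.toString(probeSequence(10, 7))); // [3, 1, 6, 4, 2, 0, 5]
        System.out.println(Arrays.toString(probeSequence(17, 7))); // [3, 2, 1, 0, 6, 5, 4]
    }
}
```

Because the two keys diverge after the first probe, a collision at slot 3 does not grow a shared cluster the way a fixed step of 1 does.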

Cuckoo Hashing

Hash key to two positions. If both occupied, displace one key and reinsert it.

Result: Constant worst-case time for search!
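
The idea can be sketched for int keys with two small tables. Everything here is illustrative: the class name CuckooSet, the table size 11, the displacement limit, and the two toy hash functions are assumptions, and a full implementation would rebuild the tables with new hash functions instead of throwing.

```java
class CuckooSet {
    private static final int M = 11;          // slots per table (arbitrary)
    private static final int MAX_KICKS = 32;  // displacement limit before giving up
    private final Integer[][] table = new Integer[2][M];

    // Two toy hash functions, one per table.
    private int h(int which, int key) {
        int k = key & 0x7fffffff;
        return which == 0 ? k % M : (k / M) % M;
    }

    // A key can only be in one of its two slots, so search checks at most two cells.
    public boolean contains(int key) {
        for (int t = 0; t < 2; t++) {
            Integer cur = table[t][h(t, key)];
            if (cur != null && cur == key) return true;
        }
        return false;
    }

    public void insert(int key) {
        if (contains(key)) return;
        // Use a free slot if either candidate position is empty.
        for (int t = 0; t < 2; t++) {
            int slot = h(t, key);
            if (table[t][slot] == null) { table[t][slot] = key; return; }
        }
        // Both occupied: take the slot in table 0 and move the displaced key
        // to its slot in the other table, repeating while keys keep colliding.
        int cur = key;
        int t = 0;
        for (int kick = 0; kick < MAX_KICKS; kick++) {
            int slot = h(t, cur);
            Integer evicted = table[t][slot];
            table[t][slot] = cur;
            if (evicted == null) return;
            cur = evicted;
            t = 1 - t;
        }
        // The key held in cur is not stored at this point; rebuilding is omitted.
        throw new IllegalStateException("too many displacements; rebuild with new hash functions");
    }
}
```

For example, after inserting 0 and 11, both candidate slots for 132 are taken, so inserting 132 displaces 0 into the second table; all three keys remain findable in at most two probes.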

Algorithmic Complexity Attacks

🚨 Security Concern

A malicious adversary who knows your hash function can craft inputs that all hash to the same bucket, causing performance to degrade to O(N).

Examples:

Denial-of-service attacks on web servers
Bro intrusion-detection server exploits (carefully chosen packets)
Perl 5.8.0 associative array attacks

Java String Hash Collision

Due to Java's base-31 hash function, it's possible to generate 2^N strings of length 2N that all hash to the same value!

"Aa" and "BB" both hash to 2112
"AaAa", "AaBB", "BBAa", "BBBB" all hash to 2031744

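The construction can be reproduced with a short sketch that concatenates the two colliding blocks in every possible order and prints the resulting hash codes. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class StringCollisions {
    // Builds all 2^n strings of length 2n made of n blocks, each "Aa" or "BB".
    static List<String> collidingStrings(int n) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int round = 0; round < n; round++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // Every line prints the same hash code.
        for (String s : collidingStrings(2))
            System.out.println(s + " -> " + s.hashCode());
    }
}
```
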
One-Way Hash Functions

Cryptographic Hash Functions

Special hash functions where it's computationally infeasible to:

Find a key that hashes to a specific value
Find two keys that hash to the same value

Examples: SHA-256, SHA-3, BLAKE3

Uses: Password storage, digital signatures, blockchain

Note: Too expensive for general symbol table use
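
For comparison with hashCode(), here is a minimal sketch that computes a SHA-256 digest with the standard java.security.MessageDigest class. The class name Sha256Demo and the helper sha256Hex are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha256Demo {
    // Returns the SHA-256 digest of the input as 64 lowercase hex characters.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest)
                hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("abc"));
        // ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
    }
}
```

The digest is 256 bits wide, compared with the 32 bits of hashCode(), and computing it takes far more work per key, which is why it is not used for ordinary table lookups.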

8. Summary & Best Practices

🎯 Key Takeaways

Hash tables provide constant-time average performance for search/insert/delete
Require a good hash function and collision resolution strategy
Two main approaches: Separate Chaining and Linear Probing
Performance depends on load factor α = N/M
Uniform hashing assumption is critical but hard to achieve

Best Practices

✅ DO:

Use prime numbers or powers of 2 for table size M
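Keep load factor α at or below about 0.5 for linear probing (probe counts climb steeply beyond 0.75)
Keep load factor α a small constant for separate chaining (the resizing policy above keeps N/M between 2 and 8)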
Implement proper hashCode() that uses all significant fields
Ensure equals() and hashCode() are consistent
Resize dynamically to maintain good performance

❌ DON'T:

Use predictable hash functions in security-critical applications
Forget to handle the sign bit when converting hash codes to indices
Directly delete entries in linear probing (must rehash cluster)
Use hash tables if you need ordered operations (use BST instead)
Assume uniform hashing without testing your hash function

When to Use What

Use Case	Best Choice	Reason
Need ordered operations	Red-Black BST	Hash tables don't maintain order
Simple, fast lookups	Hash Table (either variant)	Constant-time average case
Unpredictable load	Separate Chaining	Degrades gracefully
Memory constrained	Linear Probing	No pointer overhead
Cache-sensitive code	Linear Probing	Better locality of reference

Common Pitfalls

Negative Hash Codes: hashCode() can be negative, so forgetting to mask the sign bit can produce a negative array index
Poor Hash Functions: Using only part of the key leads to clustering
Inconsistent equals/hashCode: If a.equals(b) but a.hashCode() ≠ b.hashCode(), table breaks
Ignoring Load Factor: Letting table get too full degrades performance
Security: Using predictable hash functions in adversarial environments

Quick Reference: Hash Code Implementation

// Template for user-defined types
public int hashCode() {
    int hash = 17;  // nonzero constant
    hash = 31 * hash + field1.hashCode();  // reference type
    hash = 31 * hash + ((Integer) field2).hashCode();  // primitive
    hash = 31 * hash + field3.hashCode();  // reference type
    return hash;
}
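
Applied to a concrete type, the template might look like the sketch below. The class StudentId and its fields campus and number are hypothetical; the point is that equals() and hashCode() use the same fields, so equal objects always land in the same bucket.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class StudentId {
    private final String campus;   // reference type
    private final int number;      // primitive

    StudentId(String campus, int number) {
        this.campus = Objects.requireNonNull(campus);
        this.number = number;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StudentId)) return false;
        StudentId other = (StudentId) o;
        return number == other.number && campus.equals(other.campus);
    }

    // Combines the same fields that equals() compares.
    @Override
    public int hashCode() {
        int hash = 17;
        hash = 31 * hash + campus.hashCode();
        hash = 31 * hash + Integer.hashCode(number);
        return hash;
    }

    public static void main(String[] args) {
        Map<StudentId, String> records = new HashMap<>();
        records.put(new StudentId("north", 42), "Ada");
        // A separately constructed but equal key finds the same entry.
        System.out.println(records.get(new StudentId("north", 42))); // Ada
    }
}
```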

🎓 You're Ready!

You now have a comprehensive understanding of hash tables. Practice implementing both separate chaining and linear probing to solidify these concepts.

Study Guide Created from CS210 Week 13 Materials

Algorithms by Robert Sedgewick & Kevin Wayne