šŸ“š Hash Tables Study Guide

CS210 - Week 13 - Comprehensive Reference

šŸ“– Table of Contents

1. Introduction to Hash Tables

šŸŽÆ Main Concept

Hash tables provide constant-time average case performance for search, insert, and delete operations by using a hash function to compute an array index directly from the key.

What is a Map?

A map is a data structure that models a searchable collection of key-value pairs, where:

Real-World Examples

2. Hashing: Basic Plan

Core Idea

Save items in a key-indexed table where the index is computed as a function of the key.

šŸ”‘ Key Components

  1. Hash Function: Method for computing array index from key
  2. Equality Test: Method for checking if two keys are equal
  3. Collision Resolution: Strategy for handling keys that hash to the same index

The Space-Time Tradeoff

Scenario Strategy Issue
No space limitation Trivial hash function with key as index Wastes enormous amounts of memory
No time limitation Sequential search through all items Very slow for large datasets
Real world Hashing with collision resolution Balance space and time efficiently

3. Hash Functions

⚔ Idealistic Goal

Scramble the keys uniformly to produce a table index that is:

Computing Hash Codes

For Integers

public int hashCode() { 
    return value; 
}

Simply return the integer value itself.

For Booleans

public int hashCode() {
    if (value) return 1231;
    else return 1237;
}

Use two different prime numbers for true and false.

For Doubles

public int hashCode() {
    long bits = doubleToLongBits(value);
    return (int) (bits ^ (bits >>> 32));
}

Convert to 64-bit representation and XOR the two halves.

For Strings (Horner's Method)

public int hashCode() {
    int hash = 0;
    for (int i = 0; i < length(); i++)
        hash = s[i] + (31 * hash);
    return hash;
}
h = s[0]Ā·31L-1 + s[1]Ā·31L-2 + ... + s[L-2]Ā·31 + s[L-1]

String Hash Example: "call"

hash = 99·31³ + 97·31² + 108·31 + 108
     = 108 + 31Ā·(108 + 31Ā·(97 + 31Ā·99))
     = 3,045,982

Modular Hashing

Convert hash code (any 32-bit integer) to table index (0 to M-1):

āŒ Bug Version

private int hash(Key key) {
    return key.hashCode() % M;
}

Can return negative values!

āœ… Correct Version

private int hash(Key key) {
    return (key.hashCode() & 0x7fffffff) % M;
}

Masks sign bit to ensure positive result

Uniform Hashing Assumption

Key Assumption: Each key is equally likely to hash to an integer between 0 and M-1.

This assumption is crucial for performance analysis but is difficult to achieve in practice.

Birthday Problem & Load Balancing

4. Separate Chaining

šŸ’” Core Idea

Use an array of M < N linked lists. Each array position points to a chain of items that hash to that index.

How It Works

  1. Hash: Map key to integer i between 0 and M-1
  2. Insert: Put at front of ith chain (if not already there)
  3. Search: Search only the ith chain

Visual Representation

Array Index          Linked List Chain
    0         →     [A] → [B] → [E] → null
    1         →     null
    2         →     [X] → [S] → null
    3         →     null
    4         →     [L] → [P] → null
    5         →     [M] → [H] → [C] → [R] → null
            

Java Implementation

public class SeparateChainingHashST<Key, Value> {
    private int M = 97;  // number of chains
    private Node[] st = new Node[M];  // array of chains
    
    private static class Node {
        private Object key;
        private Object val;
        private Node next;
    }
    
    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }
    
    public Value get(Key key) {
        int i = hash(key);
        for (Node x = st[i]; x != null; x = x.next)
            if (key.equals(x.key))
                return (Value) x.val;
        return null;
    }
    
    public void put(Key key, Value val) {
        int i = hash(key);
        for (Node x = st[i]; x != null; x = x.next)
            if (keys[i].equals(key)) { 
                x.val = val; 
                return; 
            }
        st[i] = new Node(key, val, st[i]);
    }
}

Performance Analysis

āœ… Proposition

Under uniform hashing assumption, the probability that the number of keys in a list is within a constant factor of N/M is extremely close to 1.

Average # of probes = N/M
Operation Average Case Worst Case
Search Hit ~3-5 probes N probes
Search Miss ~3-5 probes N probes
Insert ~3-5 probes N probes

Resizing Strategy

Goal: Keep average chain length N/M ā‰ˆ constant

5. Linear Probing

šŸ’” Core Idea (Open Addressing)

When a collision occurs, probe the next array position (i+1, i+2, ...) until an empty slot is found.

How It Works

  1. Hash: Map key to integer i between 0 and M-1
  2. Insert: Put at table index i if free; if not, try i+1, i+2, etc.
  3. Search: Search table index i; if occupied but no match, try i+1, i+2, etc.

Critical: Array size M must be greater than number of key-value pairs N.

Visual Example

Index:  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
       [P] [M] [ ] [ ] [A] [C] [S] [H] [L] [ ] [E] [ ] [ ] [ ] [R] [X]

Insert K: hash(K) = 5
          Position 5 occupied → try 6 → occupied → try 7 → occupied
          → try 8 → occupied → try 9 → EMPTY! Insert at 9
            

Java Implementation

public class LinearProbingHashST<Key, Value> {
    private int M = 30001;
    private Value[] vals = (Value[]) new Object[M];
    private Key[] keys = (Key[]) new Object[M];
    
    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }
    
    public Value get(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i+1) % M)
            if (key.equals(keys[i]))
                return vals[i];
        return null;
    }
    
    public void put(Key key, Value val) {
        int i;
        for (i = hash(key); keys[i] != null; i = (i+1) % M)
            if (keys[i].equals(key))
                break;
        keys[i] = key;
        vals[i] = val;
    }
}

Knuth's Parking Problem

Analogy: Cars arrive at one-way street with M parking spaces. Each wants random space i; if taken, try i+1, i+2, etc.

Performance Analysis

Search Hit: ~½(1 + 1/(1-α))

Search Miss/Insert: ~½(1 + 1/(1-α)²)

where α = N/M (load factor)

Load Factor α Search Hit Search Miss
½ ~1.5 ~2.5
ā…” ~2.0 ~5.0
¾ ~3.0 ~8.5

āš™ļø Typical Choice

Keep α = N/M ā‰ˆ ½ for constant-time operations

Deletion in Linear Probing

āš ļø Critical Issue

Cannot simply delete an entry and set it to null! This would break search for items that had collided with the deleted item.

Solution: After deletion, must rehash all keys in the same cluster that comes after the deleted key.

6. Performance Comparison

Separate Chaining vs Linear Probing

āœ… Separate Chaining

  • Performance degrades gracefully
  • Less sensitive to poorly-designed hash functions
  • Easy deletion
  • Can exceed load factor of 1

āœ… Linear Probing

  • Less wasted space (no pointers)
  • Better cache performance
  • Simpler implementation
  • Faster when load factor is low

Complete Symbol Table Implementations Comparison

Implementation Search Insert Delete Ordered? Key Interface
Sequential Search N N N No equals()
Binary Search lg N N N Yes compareTo()
BST 1.39 lg N 1.39 lg N √N Yes compareTo()
Red-Black BST 1.0 lg N 1.0 lg N 1.0 lg N Yes compareTo()
Separate Chaining 3-5* 3-5* 3-5* No equals(), hashCode()
Linear Probing 3-5* 3-5* 3-5* No equals(), hashCode()

* Under uniform hashing assumption

7. Advanced Topics

Hash Function Variants

Two-Probe Hashing

Hash to two positions and insert in the shorter chain.

Result: Reduces expected length of longest chain to log log N

Double Hashing

Use linear probing but skip a variable amount (not just 1) each time.

h(x) = h1(x) + jĀ·h2(x)  where j = 0, 1, 2, ...

Advantages: Effectively eliminates clustering, table can be nearly full

Cuckoo Hashing

Hash key to two positions. If both occupied, displace one key and reinsert it.

Result: Constant worst-case time for search!

Algorithmic Complexity Attacks

🚨 Security Concern

A malicious adversary who knows your hash function can craft inputs that all hash to the same bucket, causing performance to degrade to O(N).

Examples:

Java String Hash Collision

Due to Java's base-31 hash function, it's possible to generate 2N strings of length 2N that all hash to the same value!

"Aa" and "BB" both hash to 2112
"AaAa", "AaBB", "BBAa", "BBBB" all hash to -540425984
            

One-Way Hash Functions

Cryptographic Hash Functions

Special hash functions where it's computationally infeasible to:

Examples: SHA-256, SHA-3, BLAKE3

Uses: Password storage, digital signatures, blockchain

Note: Too expensive for general symbol table use

8. Summary & Best Practices

šŸŽÆ Key Takeaways

Best Practices

āœ… DO:

āŒ DON'T:

When to Use What

Use Case Best Choice Reason
Need ordered operations Red-Black BST Hash tables don't maintain order
Simple, fast lookups Hash Table (either variant) Constant-time average case
Unpredictable load Separate Chaining Degrades gracefully
Memory constrained Linear Probing No pointer overhead
Cache-sensitive code Linear Probing Better locality of reference

Common Pitfalls

  1. Hash Code Overflow: Forgetting to mask sign bit can cause negative indices
  2. Poor Hash Functions: Using only part of the key leads to clustering
  3. Inconsistent equals/hashCode: If a.equals(b) but a.hashCode() ≠ b.hashCode(), table breaks
  4. Ignoring Load Factor: Letting table get too full degrades performance
  5. Security: Using predictable hash functions in adversarial environments

Quick Reference: Hash Code Implementation

// Template for user-defined types
public int hashCode() {
    int hash = 17;  // nonzero constant
    hash = 31 * hash + field1.hashCode();  // reference type
    hash = 31 * hash + ((Integer) field2).hashCode();  // primitive
    hash = 31 * hash + field3.hashCode();  // reference type
    return hash;
}

šŸŽ“ You're Ready!

You now have a comprehensive understanding of hash tables. Practice implementing both separate chaining and linear probing to solidify these concepts.

Study Guide Created from CS210 Week 13 Materials

Algorithms by Robert Sedgewick & Kevin Wayne

šŸŽ“ Practice Questions

Question 1: Hash Function Requirements

Q: What are the two idealistic goals of a hash function, and why is achieving both simultaneously difficult?

A: The two goals are (1) the function must be efficiently computable and (2) it must make each table index equally likely for each key (uniform distribution). They are hard to achieve simultaneously because truly uniform hashing requires knowledge of the key distribution in advance, which is rarely available. Real hash functions use heuristics (e.g., polynomial rolling with base 31 for strings) that work well in practice but can be exploited by adversarial inputs. The uniform hashing assumption is therefore an idealization used for analysis, not a guaranteed property.

Question 2: The Sign-Bit Bug

Q: Why does return key.hashCode() % M produce a bug, and how is it fixed?

A: Java's hashCode() can return any 32-bit integer, including negative values. The % operator in Java preserves the sign of the dividend, so a negative hash code produces a negative index, which would cause an ArrayIndexOutOfBoundsException. The fix is to mask the sign bit first: return (key.hashCode() & 0x7fffffff) % M. The bitwise AND with 0x7fffffff clears the most significant (sign) bit, ensuring the result is always a non-negative integer before the modulo operation.

Question 3: Separate Chaining vs. Linear Probing

Q: Compare separate chaining and linear probing on three dimensions: memory usage, sensitivity to hash quality, and ease of deletion.

A: - Memory: Separate chaining uses extra memory for linked-list node pointers even when chains are short. Linear probing stores everything in the array itself (no pointers), so it uses less memory per element and has better cache locality.
- Sensitivity to hash quality: Separate chaining degrades more gracefully with a poor hash function — worst case is one very long chain, but other chains are unaffected. Linear probing suffers from clustering: once a cluster forms, insertions anywhere near it make it grow faster, severely degrading performance.
- Deletion: Separate chaining allows straightforward deletion (remove node from linked list). Linear probing cannot simply set a slot to null — doing so breaks search for items that probed past that slot. The fix requires rehashing all keys in the same cluster after the deleted position.

Question 4: Load Factor and Performance

Q: For linear probing with load factor α = N/M, give the expected number of probes for a search hit and a search miss when α = ½ and α = ¾. What does this tell you about why keeping α low matters?

A: Using Knuth's formula, expected probes ā‰ˆ ½(1 + 1/(1-α)) for a hit and ½(1 + 1/(1-α)²) for a miss.
- At α = ½: hit ā‰ˆ 1.5 probes, miss ā‰ˆ 2.5 probes — excellent performance.
- At α = ¾: hit ā‰ˆ 2.5 probes, miss ā‰ˆ 8.5 probes — misses are 3Ɨ more expensive.
As α approaches 1, the denominator (1-α) approaches 0 and the number of probes diverges to infinity. This is why linear probing tables resize (double) when α reaches ½, trading memory for guaranteed constant-time operations.

Question 5: Why Hash Tables Cannot Support Ordered Operations

Q: A colleague wants to use a hash table to find all keys in the range [lo, hi] efficiently. Why is this a bad idea, and what data structure should they use instead?

A: Hash tables destroy the natural ordering of keys — a key hashes to an essentially random bucket position bearing no relationship to its value. There is no way to enumerate keys in sorted order, find a predecessor/successor, or answer range queries without scanning all N entries (O(N)). For ordered operations, a balanced BST such as a Red-Black tree should be used. It supports search, insert, and delete in O(log N), but also supports ordered operations like floor, ceiling, rank, and range queries in O(log N) time.

Question 6: Algorithmic Complexity Attacks

Q: What is an algorithmic complexity attack on a hash table, how does Java's string hash function make it possible, and what are two defenses?

A: An attacker who knows your hash function can craft a large set of keys that all hash to the same bucket, turning O(1) expected operations into O(N) worst-case operations. Java's base-31 string hash makes this possible because it is deterministic and publicly known — it is demonstrably possible to generate 2^N strings of length 2N that all share the same hash value.
Defense 1: Use a randomized hash function (e.g., add a secret random salt at startup so the attacker cannot predict hash values).
Defense 2: Use a stronger collision-resolution strategy like cuckoo hashing, which guarantees O(1) worst-case lookups regardless of key distribution.

Question 7: Resizing and Rehashing

Q: When a hash table resizes (doubles its capacity), why must ALL existing keys be rehashed rather than simply moved to the same relative position?

A: The bucket index for a key is computed as (hashCode & 0x7fffffff) % M. When M changes (e.g., from 97 to 194), the modulo changes and every key's bucket index changes. A key that was in bucket 5 of a size-97 table would compute a completely different index in a size-194 table. Simply copying the slots to new positions would place keys in wrong buckets, breaking lookup. Therefore, every key must be run through the hash function again with the new M and inserted into its new correct bucket — this is called rehashing. Its amortized cost is O(1) per operation when using doubling.

Question 8: Choosing the Right Table Size

Q: Why is it recommended to use a prime number for the hash table size M, particularly when using modular hashing?

A: When M is prime, the modulo operation h % M distributes keys more uniformly across all M buckets. If M has small factors (e.g., M = 100), then keys whose hash codes differ only in the lower bits will cluster together. For example, if all hash codes are even, they all map to even buckets and half the table is wasted. A prime M has no such factors, ensuring no systematic bias in which buckets receive keys. In practice, common choices are prime numbers slightly less than powers of 2, such as 97, 997, or 30011, to balance resizing convenience with uniform distribution.