Hashing
Hashing is a technique used for storing and retrieving information as quickly as possible. Balanced BST provide insert, delete, and search in O(log n) time. Hashing provides the same operations in O(1) asymptotically on average.
What is Hashing?
Hashing is the process of converting input data (of arbitrary size) into a fixed-size value, typically called a hash code or hash value, using a hash function. The output is usually a sequence of bits, numbers, or characters that uniquely represent the input data.
Hash Functions
A hash function is a mathematical algorithm that maps data of arbitrary size to a fixed-size output. Good hash functions have the following properties:
- Deterministic: The same input always produces the same output.
- Fast computation: The function should be quick to compute.
- Uniform distribution: Outputs should be evenly distributed to minimize collisions.
- Irreversibility: It should be hard to reconstruct the input from the output (important for cryptographic hashes).
Common hash functions:
- Division method:
hash = key % table_size - Multiplication method: Uses multiplication and fractional parts.
- Cryptographic hashes: SHA-256, SHA-3, MD5 (not recommended for security).
Collision Resolution Techniques
Collision: Collision is the condition when two keys are hashed to the same slot.
There are effective techniques for resolving conflicts due to collisions.
Chaining
In Chaining each slot in the hash table points to a linked list (or another data structure like a dynamic array) of entries that map to the same index. When a collision occurs, the new key-value pair is simply added to the list at that slot.
Advantages:
- Simple to implement.
- The hash table can store more elements than its size.
- Performance degrades gracefully as the number of elements increases.
Disadvantages:
- Requires extra memory for pointers.
- Linked lists can become long if many collisions occur, degrading performance.
Open Addressing
In Open Addressing if a collision occurs, the hash table seeks the next available slot according to a probing sequence. All elements are stored within the table itself (no linked lists).
Common probing methods:
- Linear Probing: Check the next slot sequentially:
(hash + i) % table_size - Quadratic Probing: Use a quadratic function:
(hash + i^2) % table_size - Double Hashing: Use a second hash function to determine the step size.
hash2 + i * hash1
Advantages:
- No extra memory for pointers or lists.
- Better cache performance due to the locality of reference.
Disadvantages:
- Table can become full, requiring resizing.
- Clustering can occur, especially with linear probing.
Best Practices
- Choose a good hash function to minimize collisions.
- Handle collisions efficiently.
- Resize hash tables when the load factor is high.
Real-World Examples
- Git: Uses SHA-1 hashes to identify commits and files.
- Blockchain: Uses cryptographic hashes to link blocks securely.
- Databases: Use hashing for indexing and partitioning.
- Data Integrity: File downloads often provide a hash SHA-256 to verify the file integrity