Bloom Filters
Introduction
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can quickly tell you if an element is definitely not in the set or possibly in the set, with a certain probability of false positives but zero false negatives.
Suppose you need to check membership in a very large set (e.g., checking if a URL is in a blacklist, or if a password has been leaked). Storing all elements in a hash set or array can be memory-intensive. Bloom filters provide a way to perform these checks using much less memory, at the cost of allowing some false positives.
It is used in Bigtable, HBase, LevelDB, RocksDB and many other databases. CDNS — checking cached content.
How Bloom Filters Work
A Bloom filter consists of:
- A bit array of size
m, initialized to all 0s. kindependent hash functions, each of which maps an element to one of themarray positions uniformly at random.
Insertion
To add an element, compute its k hash values and set the corresponding bits in the array to 1.
Query
To check if an element is in the set, compute its k hash values and check the corresponding bits:
- If any of these bits is 0, the element is definitely not in the set.
- If all are 1, the element is possibly in the set (could be a false positive).
Properties
- Space Efficient: Uses much less memory than storing the actual elements.
- No False Negatives: If the filter says an element is not present, it is definitely not present.
- False Positives Possible: The filter may say an element is present when it is not.
- No Deletion: Standard Bloom filters do not support removing elements (see Counting Bloom filters for deletions).
False Positive Probability
The probability of a false positive depends on:
- The size of the bit array (
m) - The number of hash functions (
k) - The number of inserted elements (
n)
The false positive rate can be approximated as:
Applications
- Web cache filtering
- Databases and distributed systems (e.g., Bigtable, Cassandra)
- Network security (e.g., blacklist/whitelist checking)
- Spell checkers
- Blockchain and cryptocurrency (e.g., Bitcoin SPV nodes)
Variants
- Counting Bloom Filter: Allows deletions by using counters instead of bits.
- Scalable Bloom Filter: Grows dynamically as more elements are added.
- Compressed Bloom Filter: Reduces the size for transmission.
Limitations
- Cannot remove elements (unless using a variant)
- False positives increase as more elements are added
- Choice of hash functions is critical