Bloom Filters

Introduction

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can quickly tell you if an element is definitely not in the set or possibly in the set, with a certain probability of false positives but zero false negatives.

Suppose you need to check membership in a very large set (e.g., checking if a URL is in a blacklist, or if a password has been leaked). Storing all elements in a hash set or array can be memory-intensive. Bloom filters provide a way to perform these checks using much less memory, at the cost of allowing some false positives.

It is used in Bigtable, HBase, LevelDB, RocksDB and many other databases. CDNS — checking cached content.

How Bloom Filters Work

A Bloom filter consists of:

A bit array of size m, initialized to all 0s.
k independent hash functions, each of which maps an element to one of the m array positions uniformly at random.

Insertion

To add an element, compute its k hash values and set the corresponding bits in the array to 1.

Query

To check if an element is in the set, compute its k hash values and check the corresponding bits:

If any of these bits is 0, the element is definitely not in the set.
If all are 1, the element is possibly in the set (could be a false positive).

Properties

Space Efficient: Uses much less memory than storing the actual elements.
No False Negatives: If the filter says an element is not present, it is definitely not present.
False Positives Possible: The filter may say an element is present when it is not.
No Deletion: Standard Bloom filters do not support removing elements (see Counting Bloom filters for deletions).

False Positive Probability

The probability of a false positive depends on:

The size of the bit array (m)
The number of hash functions (k)
The number of inserted elements (n)

The false positive rate can be approximated as:

\[ \left(1 - e^{-kn/m}\right)^k \]

Applications

Web cache filtering
Databases and distributed systems (e.g., Bigtable, Cassandra)
Network security (e.g., blacklist/whitelist checking)
Spell checkers
Blockchain and cryptocurrency (e.g., Bitcoin SPV nodes)

Variants

Counting Bloom Filter: Allows deletions by using counters instead of bits.
Scalable Bloom Filter: Grows dynamically as more elements are added.
Compressed Bloom Filter: Reduces the size for transmission.

Limitations

Cannot remove elements (unless using a variant)
False positives increase as more elements are added
Choice of hash functions is critical