Skip to content

Bloom Filters

Introduction

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can quickly tell you if an element is definitely not in the set or possibly in the set, with a certain probability of false positives but zero false negatives.

Suppose you need to check membership in a very large set (e.g., checking if a URL is in a blacklist, or if a password has been leaked). Storing all elements in a hash set or array can be memory-intensive. Bloom filters provide a way to perform these checks using much less memory, at the cost of allowing some false positives.

It is used in Bigtable, HBase, LevelDB, RocksDB and many other databases. CDNS — checking cached content.

How Bloom Filters Work

A Bloom filter consists of:

  • A bit array of size m, initialized to all 0s.
  • k independent hash functions, each of which maps an element to one of the m array positions uniformly at random.

Insertion

To add an element, compute its k hash values and set the corresponding bits in the array to 1.

Query

To check if an element is in the set, compute its k hash values and check the corresponding bits:

  • If any of these bits is 0, the element is definitely not in the set.
  • If all are 1, the element is possibly in the set (could be a false positive).

Properties

  • Space Efficient: Uses much less memory than storing the actual elements.
  • No False Negatives: If the filter says an element is not present, it is definitely not present.
  • False Positives Possible: The filter may say an element is present when it is not.
  • No Deletion: Standard Bloom filters do not support removing elements (see Counting Bloom filters for deletions).

False Positive Probability

The probability of a false positive depends on:

  • The size of the bit array (m)
  • The number of hash functions (k)
  • The number of inserted elements (n)

The false positive rate can be approximated as:

\[ \left(1 - e^{-kn/m}\right)^k \]

Applications

  • Web cache filtering
  • Databases and distributed systems (e.g., Bigtable, Cassandra)
  • Network security (e.g., blacklist/whitelist checking)
  • Spell checkers
  • Blockchain and cryptocurrency (e.g., Bitcoin SPV nodes)

Variants

  • Counting Bloom Filter: Allows deletions by using counters instead of bits.
  • Scalable Bloom Filter: Grows dynamically as more elements are added.
  • Compressed Bloom Filter: Reduces the size for transmission.

Limitations

  • Cannot remove elements (unless using a variant)
  • False positives increase as more elements are added
  • Choice of hash functions is critical