Bloom Filters

One of my favorite data structures to use for efficient search operations is the Bloom filter, I just came across this cool demo, check it out.

Named after its creator, Burton Howard Bloom, a Bloom filter is a space-efficient probabilistic data structure designed to answer a simple question: “Is this item in the set?”. Unlike other data structures, a Bloom filter trades off accuracy for speed and space efficiency, meaning it might sometimes return false positives but never false negatives.

The Bloom filter works by using a bit vector of size ‘m’ and ‘k’ hash functions. When an element is added to the filter, it is passed through all ‘k’ hash functions, which return ‘k’ indexes to set to ‘1’ in the bit vector. To check whether an element is in the filter, the same ‘k’ hash functions are applied. If all ‘k’ indexes in the bit vector are ‘1’, the filter returns ‘yes’; otherwise, it returns ‘no’. A ‘yes’ answer can mean either the element is definitely not in the set (true negative) or it might be in the set (false positive), but a ‘no’ answer always means the element is not in the set (true negative). Note that when implementing, you should think carefully about how many hash functions to use and what the acceptable rate of false positives is.

Bloom filters are widely used in software applications where the cost of false positives is less critical than the benefits of speed and space efficiency. Some of these applications include spell checkers, network routers, databases, and caches. For instance, Google’s Bigtable uses Bloom filters to reduce the disk lookups for non-existent rows or columns, significantly improving its performance. Medium uses Bloom filters to avoid recommending articles a user has already read. I have a password generator application and use a Bloom Filter to check if a password has already been used.