Bloom Filter

Feb 02, 2022dsadata-structures

A bloom filter is a simple space-efficient probabilistic data structure designed to test whether an element is present in a set.

It is blazingly fast and use minimal memory at the cost of potential false positives. False positive matches are possible, but false negatives are not.

Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if conventional error-free hashing techniques were applied.

Features

The base of a bloom filter is a bit-array associated with following variables:

  • mm — size of bit-array, proportional to kk and nn.
  • kk — number of different hash functions, typically a constant, smaller than mm.
  • nn — number of elementes that have been inserted.
  • pp — probablity of a false positives.

The Bloom filter takes an array of mm bits, initially all set to 0, and for up to nn different elements, either test or set kk bits using positions chosen using hash functions. If all bits are set, the element probably already exists, with a false positive rate of pp; if any of the bits are not set, the element certainly does not exist.

These variables, kk, mm, and nn, should be picked based on how acceptable false positives are. If the values are picked and the resulting probability is too high, the values should be tweaked and the probability re-calculated.

False Positives

It’s a nice property of Bloom filters that you can modify the false positive rate of your filter. A larger filter will have less false positives, and a smaller one more.

The formula to calculate probablity of a false positives is:

p(1eknm)kp \approx (1 - e ^ \frac{-kn}{m}) ^ k

Hash Functions

The hash functions should be fast, independent, and uniformly distributed. The more hash functions you have, the slower your Bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.

Putting this constraint aside, for a given mm and nn, the value of kk that minimizes the false positive probability is:

kmnln2k \approx \frac{m}{n}\ln{2}

Cryptographic hash functions provide stability and guarantee but are expensive in calculation. With increase in kk, bloom filter become slow. Non-cryptographic hash functions do not provide guarantee but provide major performance improvement.

Operations

Bloom filter only supports the insertion and search operations. You can’t iterate through the items in the set or delete items.

Bloom filters take up O(1)O(1) space, regardless of the number of items inserted. Both insertion and search operations have time-complexity of O(1)O(1).

Insertion

During insertion, kk hash functions, are used to create hashes of the input. These hash functions output indexes. At every index received, we simply change the value in our bloom filter to 1.

Search

Bloom filters can only definitively identify true negatives. They cannot identify true positives. When you search an item in a bloom filter, the possible answers are:

  • It’s definitely not in the set. This is a true negative.
  • It might be in the set. This could be a false positive, or it could be a true positive.

During a search, the same hash functions are called and used to hash the input. We then check if the indexes received all have a value of 1 inside our bloom filter. If they all have a value of 1, we know that the bloom filter may have had the value previously inserted.

However, it’s not certain, because it’s possible that other values previously inserted flipped the values to 1. The values aren’t necessarily 1 due to the item currently being searched for. Absolute certainty is impossible unless only a single item has previously been inserted.

While checking the bloom filter for the indexes returned by our hash functions, if even one of them has a value of 0, we definitively know that the item was not previously inserted.

Variants

The Bloom filter has many variants, some deal with deletion and others don’t, some use the memory efficiently but increase the pp where others pay the trade-off in the reversed way.

  • Counting Bloom filters allow deletions as well as insertions of items.
  • Compressed Bloom filters are optimized to minimize space when transmitted.
  • Retouched Bloom filters trade off false positives and false negatives.
  • Bloomier filters keep function values associated with set elements (thereby offering more than set membership).
  • Count-Min sketches and multistage filters track counts associated with items, and approximate concurrent state machines track the dynamically changing state of a changing set of items.

Applications

Typical use cases of Bloom filter are content summaries and sets that would usually grow too large in fields such as networking, distributed systems, databases and analytics.

  • Used to recommend articles in a blog or news website.
  • Used to identify malicious URLs in a web browser.
  • Used to to reduce the disk lookups for non-existent content in a database.
  • Used to to detect previously stored data in a storage system.
  • Used to speed up asymmetric joins in a analaytics framework.

The Bloom filter’s simplicity, ease of use, and excellent performance make it a standard data structure that is and will continue to be of great use in many applications.