2934: Bloom Filter

Explain xkcd: It's 'cause you're dumb.
Revision as of 02:59, 21 June 2024 by PDesbeginner (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search
Bloom Filter
Sometimes, you can tell Bloom filters are the wrong tool for the job, but when they're the right one you can never be sure.
Title text: Sometimes, you can tell Bloom filters are the wrong tool for the job, but when they're the right one you can never be sure.

Explanation[edit]

The comic is about a data structure called a Bloom filter. Software engineers use Bloom filters to check if something is probably in a set or to estimate how many things are in that set, using limited memory.

  • One example: the Chrome web browser used to store a Bloom filter of URLs that were known to be malicious[1], based on a database that was too large to store locally. Chrome used that Bloom filter to confirm that it didn’t need to warn the user that they were visiting a malicious page. Only in the rare cases that the Bloom filter said the URL might be malicious, Chrome would send the URL to an external service to confirm whether it was known to be malicious. The developers didn’t want the browser to send every URL to the external service because that would leak the user’s entire browsing history to the service and would add an unnecessary network delay whenever a web page was loaded.

Here's how a Bloom filter works:

  1. Adding items: When you add an item, it gets hashed (a way of transforming it into numbers) by several hash functions. These hash functions mark certain spots in a big array of bits (think of it as a row of lights that can be on or off).
  2. Checking items: To check whether an item might be in the set, you hash it with the same functions and see if all the corresponding spots are lit up. If they are, the item might be in the set, but there's a chance of a false positive (the Bloom filter could mistakenly say the item is there when it’s not). If any spot is not lit up, the item is not in the set.
  3. False positives: The larger the array compared to the number of items, the lower the chance of false positives. For example, 10 bits per item gives each tested item a 0.1% chance of matching each item actually in the set.
  4. Counting items: By analyzing the activated bits, with appropriate calculations, you can derive an estimate of how many individual items are 'stored' for confirmation within the array. This estimate's accuracy will depend upon several factors, but more array bits (making themselves potentially available to 'remember' each item) will be one of the most important ones when it comes to narrowing down the likelihood.

In the comic, Cueball is holding a piece of paper or tablet computer with a large "1" digit on it. This is labeled as a 1-bit Bloom filter, which is almost useless. When empty, it correctly returns a negative for any item tested, but as soon as one item is added the bit is set to 1, and now it unhelpfully says that any item tested might be in the set. Its size estimate also becomes "between 1 and infinity," which isn’t helpful.

Having multiple hash functions is pointless for a 1-bit filter since they all end up pointing to the same single bit, which would return the exact same answer as a result.

The title text carries the characteristics of the Bloom filter into the decision-making process for choosing a Bloom filter over other candidate data structures. Analogously (according to the text), you can be sure when they are not the best approach but only conclude that they are with a limited degree of probability.

Transcript[edit]

[Ponytail holds out her hand to Cueball, who is holding a flat object with a 1 on it.]
Ponytail: Does your set contai–
Cueball: Yeah, probably.
[Caption below the panel:]
One-Bit Bloom Filter

Trivia[edit]

  • The Bloom Filter was invented by Burton "Buzz" Bloom, who laughed when shown this comic and said "It's been a long time since I've done filtering... about twenty years."
  • Bloom Filters were also referenced in 2739 as a very lossy form of data compression.

References[edit]

  1. Chromium Issue 10896048: Transition safe browsing from bloom filter to prefix set. (Closed) – https://chromiumcodereview.appspot.com/10896048/


comment.png add a comment! ⋅ comment.png add a topic (use sparingly)! ⋅ Icons-mini-action refresh blue.gif refresh comments!

Discussion

It certaintly does contain a thing. 172.68.23.74 00:10, 18 May 2024 (UTC)

The title text deals with inaccuracies in determining whether you have chosen the right programming tool for your membership query (or some different task), not just inaccuracies in the Bloom filter as one of these tools. This analogy remains unexplained. Transgalactic (talk) 11:24, 18 May 2024 (UTC)

The title text makes a self-description joke, where it depicts using a bloom filter to determine whether bloom filters are appropriate, as if bloom filters were the only tool available for human decisions. 172.68.1.132 21:52, 18 May 2024 (UTC)

A perfectly functional GetHashCode() override in .net is "return 1;". 172.70.34.82 23:37, 18 May 2024 (UTC)

It could be used to test whether the set is empty. 172.70.39.96 09:01, 20 May 2024 (UTC)

The most likely thing in there is yes. Psychoticpotato (talk) 23:10, 20 May 2024 (UTC)