Difference between revisions of "2934: Bloom Filter"

Explain xkcd: It's 'cause you're dumb.
Jump to: navigation, search
(Explanation: explained the title text)
(Explanation: Rewriting to make it more accessible to the layperson.)
(4 intermediate revisions by 4 users not shown)
Line 11: Line 11:
 
==Explanation==
 
==Explanation==
 
{{incomplete|PROBABLY CREATED - Please change this comment when editing this page. Do NOT delete this tag too soon.}}
 
{{incomplete|PROBABLY CREATED - Please change this comment when editing this page. Do NOT delete this tag too soon.}}
The comic is referring to a {{w|Bloom Filter}}, a data structure that is used for approximate membership queries and cardinality estimation using a bounded amount of memory.  That is, after a series of objects are added to the bloom filter, given another object, the bloom filter can be queried to see if that object has already been added to it, with a chance of a false positive answer that depends on the size of the bloom filter.  Or, the bloom filter can be queried for an approximate count of the objects that have been added to the bloom filter already.
 
  
A bloom filter uses a large bit array, and a number of hashing functions that produce indexes into this array. When a value is added to the set, it's hashed with each function, and the corresponding bits in the array are set to 1. To test if a value is in the set you hash it with all the functions, and check if all the bits are 1. If they are, the value may be in the set, but there can also be false positives because each hash collides with some other value in the set (assuming reasonable hash functions, a different element for each hash). But if any of the bits is 0, you know for sure the value is not in the set. The higher the ratio between the size of the bit array and the number of elements in the set, the smaller the false positive rate is (10 bits/element has about 1% false positives.
+
The comic is about a data structure called a {{w|Bloom Filter}}. Software engineers use Bloom Filters to check if something is in a set or estimate how many things are in that set, using limited memory. One example is a web browser checking to see if a URL is malicious without storing a large database locally.
  
The joke in the comic is that [[Cueball]] has a 1-bit Bloom filter. When the set is empty, it accurately reports that any value is not in the set. But as soon as anything is added to the set, it has a very large false positive rate, since that single bit will be set and everything will hash to that index.  Similarly the cardinality estimation is (correctly) 0 initially, but after the first addition the estimate will be "somewhere between 1 and infinity" which is not a terribly useful estimate.
+
Here's how it works:
  
There's also no point in having multiple hash functions for a 1-bit filter, since there's only one possible hash value.
+
# '''Adding Items:''' When you add an item, it gets hashed (a way of transforming it into numbers) by several hash functions. These hash functions mark certain spots in a big array of bits (think of it as a row of lights that can be on or off).
 +
# '''Checking Items:''' To check if an item is in the set, you hash it with the same functions and see if all the corresponding spots are lit up. If they are, the item might be in the set, but there's a chance of a false positive (the Bloom Filter could mistakenly say the item is there when it’s not). If any spot is not lit up, the item is definitely not in the set.
 +
# '''False Positives:''' The larger the array compared to the number of items, the lower the chance of false positives. For example, 10 bits per item gives about a 1% false positive rate.
  
The title text references how bloom filters are always accurate in saying that an element is not in the list(bloom filters are not correct), but you can never be sure if an element is actually in the list(bloom filters are actually correct), because of false positives.
+
In the comic, [[Cueball]] has a 1-bit Bloom Filter, which is almost useless. When empty, it correctly says nothing is in the set. But as soon as one item is added, the bit is set to 1, and now it falsely says every possible item is in the set. Its size estimate also becomes "between 1 and infinity," which isn’t helpful.
 +
 
 +
Having multiple hash functions is pointless for a 1-bit filter since they all end up pointing to the same single bit.
 +
 
 +
The title text carries the characteristics of the bloom filter into the decision making process for choosing a bloom filter over other candidate data structures. In an analogous way (according to the text), you can be sure when they are ''not'' the best approach, but only conclude that they ''are'' with a limited degree of probability.
  
 
==Transcript==
 
==Transcript==
Line 35: Line 40:
 
[[Category:Comics featuring Cueball]]
 
[[Category:Comics featuring Cueball]]
 
[[Category:Statistics]]
 
[[Category:Statistics]]
 +
[[Category:Programming]]

Revision as of 18:37, 18 May 2024

Bloom Filter
Sometimes, you can tell Bloom filters are the wrong tool for the job, but when they're the right one you can never be sure.
Title text: Sometimes, you can tell Bloom filters are the wrong tool for the job, but when they're the right one you can never be sure.

Explanation

Ambox notice.png This explanation may be incomplete or incorrect: PROBABLY CREATED - Please change this comment when editing this page. Do NOT delete this tag too soon.
If you can address this issue, please edit the page! Thanks.

The comic is about a data structure called a Bloom Filter. Software engineers use Bloom Filters to check if something is in a set or estimate how many things are in that set, using limited memory. One example is a web browser checking to see if a URL is malicious without storing a large database locally.

Here's how it works:

  1. Adding Items: When you add an item, it gets hashed (a way of transforming it into numbers) by several hash functions. These hash functions mark certain spots in a big array of bits (think of it as a row of lights that can be on or off).
  2. Checking Items: To check if an item is in the set, you hash it with the same functions and see if all the corresponding spots are lit up. If they are, the item might be in the set, but there's a chance of a false positive (the Bloom Filter could mistakenly say the item is there when it’s not). If any spot is not lit up, the item is definitely not in the set.
  3. False Positives: The larger the array compared to the number of items, the lower the chance of false positives. For example, 10 bits per item gives about a 1% false positive rate.

In the comic, Cueball has a 1-bit Bloom Filter, which is almost useless. When empty, it correctly says nothing is in the set. But as soon as one item is added, the bit is set to 1, and now it falsely says every possible item is in the set. Its size estimate also becomes "between 1 and infinity," which isn’t helpful.

Having multiple hash functions is pointless for a 1-bit filter since they all end up pointing to the same single bit.

The title text carries the characteristics of the bloom filter into the decision making process for choosing a bloom filter over other candidate data structures. In an analogous way (according to the text), you can be sure when they are not the best approach, but only conclude that they are with a limited degree of probability.

Transcript

Ambox notice.png This transcript is incomplete. Please help editing it! Thanks.
[Ponytail holds out her hand to Cueball, who is holding a paper with a 1 on it.]
Ponytail: Does your set contai-
Cueball: Yeah, probably.
[Caption below the panel:]
One-Bit Bloom Filter


comment.png add a comment! ⋅ comment.png add a topic (use sparingly)! ⋅ Icons-mini-action refresh blue.gif refresh comments!

Discussion

It certaintly does contain a thing. 172.68.23.74 00:10, 18 May 2024 (UTC)

The title text deals with inaccuracies in determining whether you have chosen the right programming tool for your membership query (or some different task), not just inaccuracies in the Bloom filter as one of these tools. This analogy remains unexplained. Transgalactic (talk) 11:24, 18 May 2024 (UTC)

The title text makes a self-description joke, where it depicts using a bloom filter to determine whether bloom filters are appropriate, as if bloom filters were the only tool available for human decisions. 172.68.1.132 21:52, 18 May 2024 (UTC)

A perfectly functional GetHashCode() override in .net is "return 1;". 172.70.34.82 23:37, 18 May 2024 (UTC)

It could be used to test whether the set is empty. 172.70.39.96 09:01, 20 May 2024 (UTC)

The most likely thing in there is yes. Psychoticpotato (talk) 23:10, 20 May 2024 (UTC)