Comment on xkcd #2934: Bloom Filter
jaybone@lemmy.world 8 months agoSo you’re just putting a bunch of values in memory that you can access quickly, like similar to a hash set contains(), maybe hoping for O(1) time. But other than that there’s no trick to it?
hydroptic@sopuli.xyz 8 months ago
Well, yes and no. With a straight-up hash set, you’re keeping
set_size * bits_per_element
bits plus whatever the overhead of the hash table is in memory, which might not be tenable for very large sets, but with a Bloom filter that has eg. ~1% false positive rate and an ideal k parameter (number of hash functions, see eg. the Bloom filter wiki article) you’re only keeping ~10 bits per element completely regardless of element size because they don’t store the elements themselves or even their full hashes – they only tell you whether some element is probably in the set or not, but you can’t eg. enumerate the elements in the set. As an example of memory usage, a Bloom filter that has a false positive rate of ~1% for 500 million elements would need 571 MiB (noting that the false positive rate goes up the further you go beyond 500M elements.)Lookup time complexity for a Bloom filter is O(k) where k is the parameter I mentioned and a constant – ie. effectively O(1).
Probabilistic set membership queries are mainly useful when you’re dealing with ginormous sets of elements that you can’t shove into a regular in-memory hash set. A good example in the wiki article is CDN cache filtering: