Comment

So you’re just putting a bunch of values in memory that you can access quickly, like similar to a hash set contains(), maybe hoping for O(1) time. But other than that there’s no trick to it?

source

Sort:hotnew top

hydroptic@sopuli.xyz ⁨8⁩ ⁨months⁩ ago
Well, yes and no. With a straight-up hash set, you’re keeping set_size * bits_per_element bits plus whatever the overhead of the hash table is in memory, which might not be tenable for very large sets, but with a Bloom filter that has eg. ~1% false positive rate and an ideal k parameter (number of hash functions, see eg. the Bloom filter wiki article) you’re only keeping ~10 bits per element completely regardless of element size because they don’t store the elements themselves or even their full hashes – they only tell you whether some element is probably in the set or not, but you can’t eg. enumerate the elements in the set. As an example of memory usage, a Bloom filter that has a false positive rate of ~1% for 500 million elements would need 571 MiB (noting that the false positive rate goes up the further you go beyond 500M elements.)

Lookup time complexity for a Bloom filter is O(k) where k is the parameter I mentioned and a constant – ie. effectively O(1).

Probabilistic set membership queries are mainly useful when you’re dealing with ginormous sets of elements that you can’t shove into a regular in-memory hash set. A good example in the wiki article is CDN cache filtering:

Nearly three-quarters of the URLs accessed from a typical web cache are “one-hit-wonders” that are accessed by users only once and never again. It is clearly wasteful of disk resources to store one-hit-wonders in a web cache, since they will never be accessed again. To prevent caching one-hit-wonders, a Bloom filter is used to keep track of all URLs that are accessed by users. A web object is cached only when it has been accessed at least once before, i.e., the object is cached on its second request.

source