Comment on xkcd #2934: Bloom Filter
jaybone@lemmy.world 8 months agoIn this example, what would you use to prepopulate the filter?
Comment on xkcd #2934: Bloom Filter
jaybone@lemmy.world 8 months agoIn this example, what would you use to prepopulate the filter?
hydroptic@sopuli.xyz 8 months ago
Which example do you mean?
If you meant my user ID example, you’d prepopulate the bloom filter with existing user IDs on eg. service startup or whatever, and then update the filter every time a new user ID is added – keeping in mind that the false positive rate will grow as more are added, and that at some point you may need to create a new filter with a bigger backing bit array
jaybone@lemmy.world 8 months ago
So you’re just putting a bunch of values in memory that you can access quickly, like similar to a hash set contains(), maybe hoping for O(1) time. But other than that there’s no trick to it?
hydroptic@sopuli.xyz 8 months ago
Well, yes and no. With a straight-up hash set, you’re keeping
set_size * bits_per_element
bits plus whatever the overhead of the hash table is in memory, which might not be tenable for very large sets, but with a Bloom filter that has eg. ~1% false positive rate and an ideal k parameter (number of hash functions, see eg. the Bloom filter wiki article) you’re only keeping ~10 bits per element completely regardless of element size because they don’t store the elements themselves or even their full hashes – they only tell you whether some element is probably in the set or not, but you can’t eg. enumerate the elements in the set. As an example of memory usage, a Bloom filter that has a false positive rate of ~1% for 500 million elements would need 571 MiB (noting that the false positive rate goes up the further you go beyond 500M elements.)Lookup time complexity for a Bloom filter is O(k) where k is the parameter I mentioned and a constant – ie. effectively O(1).
Probabilistic set membership queries are mainly useful when you’re dealing with ginormous sets of elements that you can’t shove into a regular in-memory hash set. A good example in the wiki article is CDN cache filtering: