Comment on xkcd #2934: Bloom Filter
rockSlayer@lemmy.world 8 months agoA bloom filter is a data structure that is most useful in creating a spell check algorithm
Comment on xkcd #2934: Bloom Filter
rockSlayer@lemmy.world 8 months agoA bloom filter is a data structure that is most useful in creating a spell check algorithm
hydroptic@sopuli.xyz 8 months ago
That’s definitely not what they’re most useful for. I mean, you probably can use a bloom filter for implementing spell check, but saying that’s where they’re most useful severely misses the point of probabilistic set membership queries.
Bloom filters and their relatives are great when you have a huge set of values – eg. 100s of millions of user IDs in some database – and you want to have a very fast way of checking whether some value might be in that set, without having to query the database. Naturally this assumes that you’ve prepopulated a bloom filter with whatever values you need to be checking.
If the result of the bloom filter query is “nope”, you know that the value’s definitely not in the set, but if the result is “maybe” then you can go ahead and double-check by querying the database. This means that the vast majority of checks don’t have to hit that slow DB at all, and even though you’ll get some false positives this’ll still be much much much faster than having to go through that DB every time.
jaybone@lemmy.world 8 months ago
In this example, what would you use to prepopulate the filter?
hydroptic@sopuli.xyz 8 months ago
Which example do you mean?
If you meant my user ID example, you’d prepopulate the bloom filter with existing user IDs on eg. service startup or whatever, and then update the filter every time a new user ID is added – keeping in mind that the false positive rate will grow as more are added, and that at some point you may need to create a new filter with a bigger backing bit array
jaybone@lemmy.world 8 months ago
So you’re just putting a bunch of values in memory that you can access quickly, like similar to a hash set contains(), maybe hoping for O(1) time. But other than that there’s no trick to it?