Comment

Comment on Bloom filters: real-world applications

I know they are used in google’s BigTable. All data there is stored in seperate SSTables and you can specify that a locality group should have bloom filters generated for its SSTables. Apparently cassandra has them too.

Both are the same general application though and you already mentioned databases.

I did think about using them at some point for authentication purposes in a webservice. The idea being to check for double uses of a refresh token. This way the user database would need to store only a small amount of extra storage to check for the reuse of a refresh token and if you set the parameters accordingly, the false positives are kind of a benefit in that users cannot infinitely refresh and they actually have to reauthenticate sometimes.

source

Sort:hotnew top

Reader9@programming.dev ⁨1⁩ ⁨year⁩ ago
Collage sounds really interesting , will check it out. Another variation on bloom filter I recently learned about is count-min-sketch. It allows for storing/incrementing a count along with each key, and can answer “probably in set with count greater than _”, “definitely not in set”.

Thanks for adding more detail on the DB use-cases!

source
- noli@programming.dev ⁨1⁩ ⁨year⁩ ago
  Interesting. Do I understand it correctly if I say it’s a bloom filter where instead of setting a bit to 1 for each of the hashes, you increment a counter for that hash?
  
  How do you infer the count then, take the minimum of all matching hashes? Because intuitively it seems to me like you would need a lot more space to avoid counts being too high
  
  source
  - Reader9@programming.dev ⁨1⁩ ⁨year⁩ ago
    This data structure uses a 2-dimensional array to store data, documented in this scala implementation: github.com/twitter/…/CountMinSketch.scala. I’m still trying to understand it as well.
    
    Similar to your idea, I had thought that by using k bloom filters, each with their own hash function and bit array, one could store an approximate count up to k for each key, which also might be wasteful or a naïve solution.
    
    source