Bloom Filters

A Bloom filter is a probabilistic data-structure. This can be used to store a set of data in a space-efficient manner. For eg; a distributed cache called Cache Digests shared as summaries between the nodes to have a global image. 

The data-structure can be used to provide membership queries ie. checkIfDataPresentInStore() If it is to check an element is already inserted in the filter then it will return true, there are no false negatives. But there can be chance if the element not inserted may return true. But the check for that element can be done in the original store ie. the overhead is associated with the rate of false positives. This is different from dictionary in which the hit/miss is deterministic.

For a set of n elements, a bloom filter can be a vector of size m.Initially, all bits are set to 0. For each element e, k hash functions will set k bits in the bit vector to 1. When a query for membership executed, it will check for the bit positions for the set value. If matches all, the queried element is possibly present in the store else, it is sure not present.Each hash function returns the index to set. This means we have to store these m bits per key. So a total of m * N bits of space required. The use of different hash functions results less collision.


Uses
  • Design a spell checker
  • Database join implementation (Oracle)  
  • Peer to peer (P2P) communication and routing  
  • In HBase, the Bloom filter is stored as meta block in the HFile. When a HFile is opened, the bloom filter is loaded into memory and used to determine if a given key is in that store file. This can avoid the scanning region for the key. 
  • and more

I found a java implementation here
Cassandra's java implementation here

Reference

http://en.wikipedia.org/wiki/Bloom_filter
https://issues.apache.org/jira/browse/HBASE-1200
http://wiki.squid-cache.org/SquidFaq/CacheDigests
http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf

2 comments:

  1. I see you have a bloom filter code for developing a spell checker, but I don't understand what is meant by this false positives. Does it mean the checking of the wrong word from the user, and finding the wrong word from the database then suggesting the right words from the other column from a database?. If not, how do I implement this in java?

    ReplyDelete