A Data Mesh from Scratch in Rust — Part 6 — BloomFilter

--

In this blog, I will talk about implementing a BloomFilter for our storage engine. It is not essential to the engine, but an optimization. When searching for an entry, instead of always going through the MemTable and all the SSTables, we can first consult the BloomFilter. The great thing about a BloomFilter is that it can return false positives but never false negatives: if it says that an ID does not exist in the system, then it does not. This saves us the time and trouble of searching through the actual data structures whenever the ID is not in the database.
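To make that concrete, here is roughly what the read path looks like with the filter in front. This is a sketch, not the actual code: the Engine, MemTable and SSTable names are stand-ins for the components from the earlier parts of this series.

// Hypothetical read path: consult the filter before the real structures.
struct Engine {
    filter: BloomFilter,    // built later in this post
    memtable: MemTable,     // in-memory table from the earlier parts
    sstables: Vec<SSTable>, // on-disk tables, newest first
}

impl Engine {
    fn get(&self, id: &[u8]) -> Option<Vec<u8>> {
        // "Definitely absent" is exact, so we can stop immediately.
        if !self.filter.find(id) {
            return None;
        }
        // "Maybe present" (false positives happen): fall through to
        // the MemTable and then the SSTables.
        self.memtable
            .get(id)
            .or_else(|| self.sstables.iter().find_map(|t| t.get(id)))
    }
}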

I looked at existing Rust crates and did not find anything fresh enough (I suspect the skiplist crate we use is headed the same way, and I will ultimately have to build my own). Plus, having recently built a BloomFilter elsewhere, I was predisposed to build one.

However, the BloomFilter does have some drawbacks:

  1. It requires a lot of space. For example, if you expect 10M entries and want the false positive probability to be one in ten million, you are already looking at roughly 40MB of memory. Operations on the filter will also be somewhat slow, since you need 23 hashes for everything. You can trade memory for speed to some extent, but only so far: cutting the number of hashes down to 2 pushes the memory requirement past 7GB. The other option is, of course, to relax the false positive probability; at 1 in 100 we can get away with roughly 12MB and 7 hash functions. Since we only rely on the "does not exist" answer, which is guaranteed, my intuition is that might be okay. But that remains to be seen! (The sizing formulas are sketched after this list; see also https://hur.st/bloomfilter/?n=10000000&p=1.0E-7&m=&k=)
  2. The other problem with the BloomFilter is resizing. You can't! The indices depend on the size of the filter, so you cannot grow it without re-inserting every existing entry, and the filter does not even store the entries you would need to re-insert.
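For reference, the numbers above fall straight out of the standard sizing formulas, which is all the linked calculator computes. A minimal sketch that reproduces them (my own helper, not part of the engine):

// Standard Bloom filter sizing:
//   m = -n * ln(p) / (ln 2)^2   (optimal number of bits)
//   k = (m / n) * ln 2          (optimal number of hash functions)
fn bloom_params(n: f64, p: f64) -> (u64, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = -n * p.ln() / (ln2 * ln2);
    let k = (m / n * ln2).round() as u32;
    (m.ceil() as u64, k)
}

fn main() {
    // n = 10M, p = 1e-7: prints roughly 335M bits (~40MiB) and k = 23.
    let (m, k) = bloom_params(10_000_000.0, 1e-7);
    println!("{} bits (~{:.0} MiB), {} hashes", m, m as f64 / 8.0 / (1024.0 * 1024.0), k);
}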

Anyhow, while I decide whether the BloomFilter really makes sense, let’s talk about implementation. The overall idea is simple.

  1. You figure out how many elements you might want to store.
  2. You create a bit vector of the corresponding filter size. The bitvec crate is a gift for this; it saved me a huge amount of time that would otherwise have gone into getting the shift operations right.
  3. You figure out the number and implementation of the k hash functions. For our purposes, we only need two real hash functions: any number of hash functions can be generated by combining two. In my case, I use hash1 + i^2 * hash2 while iterating over i (see the hashing sketch further down).
  4. When a new entry is to be added, get the k indices and flip them in the bit vector from 0 to 1 (false to true). The membership test checks whether all those indices are 1. Deletion flips them back to 0, though note that a plain BloomFilter cannot safely support deletion: a bit may be shared by several entries, so clearing it can introduce false negatives (a counting Bloom filter is the usual remedy).

So we start by defining our constants, which we obtained by saying we want to store 10M elements with a 1 in 10M false positive probability, using 10 hash functions.

const BF_SIZE: usize = 56_166_771; // filter size (from the calculator, for n = 10M, p = 1e-7, k = 10)
const NEG_PROB: f64 = 0.0000001; // target false positive probability: 1 in 10M
const NUM_HASHES: usize = 10; // number of hash functions (k)

We then define an inner structure:

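The original code embed does not survive here, so below is a minimal sketch of what the inner struct amounts to, assuming the bitvec crate as the backing store and reading BF_SIZE as the bit count (the field name is my invention):

use bitvec::prelude::*;

// Inner filter state: just the backing bit vector, sized by BF_SIZE.
struct BFInner {
    bits: BitVec,
}

impl BFInner {
    fn new() -> Self {
        BFInner { bits: bitvec![0; BF_SIZE] }
    }
}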

This has the hashing defined on it. We use the xx and t1ha hashes from the fasthash crate, which are purported to be fast. Make sure you use the wrapping versions of the arithmetic operations (wrapping_add, wrapping_mul); the plain operators would panic on overflow in debug builds.

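Again a sketch rather than the original embed: two base hashes from fasthash, combined as the hash1 + i^2 * hash2 scheme from the list above. The exact fasthash module paths may differ between crate versions:

use fasthash::{t1ha, xx};

impl BFInner {
    // The k indices for a key: h1 + i^2 * h2 (mod filter size), with
    // wrapping ops so the deliberate overflow never panics.
    fn indices(&self, key: &[u8]) -> Vec<usize> {
        let h1 = xx::hash64(key);   // base hash 1 (xxHash)
        let h2 = t1ha::hash64(key); // base hash 2 (t1ha)
        (0..NUM_HASHES as u64)
            .map(|i| {
                let mixed = h1.wrapping_add(i.wrapping_mul(i).wrapping_mul(h2));
                (mixed % BF_SIZE as u64) as usize
            })
            .collect()
    }
}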

Getting the indices for an entry is then simply a matter of running this hash NUM_HASHES times. The add, delete and find operations run the hashes and update or check the bit vector. You can check out the code for details.
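Sketched out, those three operations could look like this (with delete carrying the caveat from the overview list):

impl BFInner {
    // Set all k index bits for the key.
    fn add(&mut self, key: &[u8]) {
        for idx in self.indices(key) {
            self.bits.set(idx, true);
        }
    }

    // Present only if every index bit is set; a single zero bit means
    // the key was definitely never added.
    fn find(&self, key: &[u8]) -> bool {
        self.indices(key).iter().all(|&idx| self.bits[idx])
    }

    // Clears the key's bits; risky in the probabilistic sense, since a
    // bit shared with another key gets cleared too (see the list above).
    fn delete(&mut self, key: &[u8]) {
        for idx in self.indices(key) {
            self.bits.set(idx, false);
        }
    }
}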

Finally, we create the BloomFilter by enclosing the BFInner struct in an Rc<RefCell>. This is mostly future proofing: we expect multiple threads or async contexts to access the filter eventually, and keeping the shared handle in one place makes it easy to swap in Arc<RwLock> later (Rc<RefCell> itself is single-threaded only). The functions on the BloomFilter simply delegate to the inner ones.
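A sketch of that wrapper:

use std::cell::RefCell;
use std::rc::Rc;

// Public handle: cheap to clone, and every clone shares the same BFInner.
// Swapping Rc<RefCell<...>> for Arc<RwLock<...>> later stays localized here.
#[derive(Clone)]
pub struct BloomFilter {
    inner: Rc<RefCell<BFInner>>,
}

impl BloomFilter {
    pub fn new() -> Self {
        BloomFilter { inner: Rc::new(RefCell::new(BFInner::new())) }
    }

    pub fn add(&self, key: &[u8]) {
        self.inner.borrow_mut().add(key);
    }

    pub fn find(&self, key: &[u8]) -> bool {
        self.inner.borrow().find(key)
    }

    pub fn delete(&self, key: &[u8]) {
        self.inner.borrow_mut().delete(key);
    }
}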

With this, I believe, we are now done with the storage engine. Next time, I will start talking about the interface between the engine and the server that will make up the database. While implementing those components, I realized a few things that needed to be changed about the engine (nothing huge). I thought about going back and making those changes, but have decided against it. So I will either list those changes in the interface article or call them out as we use them.

Update: The next article turned out to be about error handling in the library. Otherwise, you can go directly to the server design article.
