A Data Mesh from Scratch in Rust — Part 1 — An Idea

Published in Towards Dev, Oct 19, 2022

I have recently embarked on an ambitious project: building a distributed (event) database that publishes its change data capture (CDC) as a message broker does. Hopefully I will be able to see this through, with my interest not waning and my ability keeping up. As I work through it, I hope to write blog posts explaining the steps. I am doing this for educational purposes, so there will be mistakes and dubious design choices. Let's discuss them.

This post covers the general idea; the implementation starts in the next one. I landed on this idea of a generic data system while reading Designing Data-Intensive Applications by Martin Kleppmann. There the author makes a strong argument that event sourcing systems and database systems are two sides of the same coin.

Anyway, the idea is, once again, to build a distributed database, an event store really, that publishes its CDC. But how do I go about that?

Frankly, I have no idea. We need a storage engine. I started by thinking about a red-black tree (RBT) for the MemTable (the table held in memory). After spending a few hours trying to implement the RBT, I went with a skip list instead. Beyond the storage engine, we need the database management system. And finally, a message broker.
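To make the MemTable idea a little more concrete, here is a minimal sketch of the interface such a table might expose. It is not the design used later in this series: the `MemTable` struct, its method names, and the use of `None` as a delete marker (tombstone) are all assumptions for illustration, and the standard library's `BTreeMap` stands in for the skip list because both keep keys in sorted order.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory table. A BTreeMap is used here as a stand-in
/// for the skip list; both keep keys sorted.
struct MemTable {
    /// `None` marks a deleted key (a tombstone) so that the deletion can
    /// later be carried over to on-disk tables.
    entries: BTreeMap<Vec<u8>, Option<Vec<u8>>>,
}

impl MemTable {
    fn new() -> Self {
        MemTable { entries: BTreeMap::new() }
    }

    /// Insert or overwrite a key.
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.entries.insert(key.to_vec(), Some(value.to_vec()));
    }

    /// Record a deletion as a tombstone instead of removing the entry.
    fn delete(&mut self, key: &[u8]) {
        self.entries.insert(key.to_vec(), None);
    }

    /// Return the live value for a key, or `None` if absent or deleted.
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.entries.get(key).and_then(|value| value.as_deref())
    }

    /// Keys in sorted order, tombstones included, as a flush to disk
    /// would see them.
    fn sorted_keys(&self) -> Vec<&[u8]> {
        self.entries.keys().map(|key| key.as_slice()).collect()
    }
}
```

The point of the sorted order is that the whole table can later be written out sequentially; any ordered structure (red-black tree, skip list, B-tree) would satisfy this interface.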

Design-wise, I want a system that is distributed in nature, but I am still conflicted. In my heart I want to build a leaderless system, but that requires strong conflict resolution. Which brings me to time: how do I handle it? Local clock time is not monotonic, and across machines it's a nightmare. I guess I could earmark one leader as the provider of timestamps (with backups, maybe?). It seems like a simple-ish solution, though I'd have to build a sub-cluster for that. It makes me wonder why so few systems do it. I think it's an interesting question. However, I do believe that using consistent hashing to make different cluster nodes the leaders for different writes is the easier way to go. Even that approach implicitly assumes that keys are uniformly distributed, an assumption I have seen broken often enough (though, surprisingly, message brokers tend to swear by it. A conundrum!).
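As a rough illustration of the consistent hashing option, the sketch below places each node at several positions on a hash ring and picks, for a given key, the first node at or after the key's position. The `Ring` type, the `leader_for` name, and the virtual-node count are assumptions made for this example. It uses the standard library's `DefaultHasher`, whose algorithm is not guaranteed to stay the same across Rust releases, so a real cluster would need a hash function that every node agrees on.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hash any hashable value to a position on the ring.
fn ring_position<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical consistent-hash ring: ring position -> owning node.
struct Ring {
    positions: BTreeMap<u64, String>,
    /// How many positions each node occupies; more positions smooth out
    /// the share of keys each node receives.
    virtual_nodes: u32,
}

impl Ring {
    fn new(virtual_nodes: u32) -> Self {
        Ring { positions: BTreeMap::new(), virtual_nodes }
    }

    /// Place a node on the ring at `virtual_nodes` positions.
    fn add_node(&mut self, node: &str) {
        for replica in 0..self.virtual_nodes {
            let position = ring_position(&(node, replica));
            self.positions.insert(position, node.to_string());
        }
    }

    /// Remove every position owned by a node.
    fn remove_node(&mut self, node: &str) {
        self.positions.retain(|_, owner| owner.as_str() != node);
    }

    /// The node that leads writes for `key`: the first node at or after
    /// the key's position, wrapping around to the start of the ring.
    fn leader_for(&self, key: &str) -> Option<&str> {
        let position = ring_position(key);
        self.positions
            .range(position..)
            .next()
            .or_else(|| self.positions.iter().next())
            .map(|(_, node)| node.as_str())
    }
}
```

With nodes `node-a`, `node-b`, and `node-c` added, `ring.leader_for("user:42")` returns the same node every time, and removing a different node does not move that key. Note that this only balances load if keys hash to evenly spread positions; a single very hot key still lands on one node.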

There are several other questions like that. Do I use the write-ahead log to publish updates, or should there be a separation of concerns through a separation of data structures? Separating the structures would let me implement different retention and other policies for each of them.
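To show what separating the structures buys, here is a small sketch in which every write goes to two independent logs, each with its own retention limit. Everything in it is hypothetical: the `Record`, `Log`, and `Store` types are made up for this example, and retention is modelled as a simple record count, whereas a real write-ahead log would more likely be trimmed once its data has been flushed to disk.

```rust
use std::collections::VecDeque;

/// One change to the store (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Record {
    sequence: u64,
    key: String,
    value: Vec<u8>,
}

/// Append-only log that keeps at most `max_records` entries.
struct Log {
    records: VecDeque<Record>,
    max_records: usize,
}

impl Log {
    fn new(max_records: usize) -> Self {
        Log { records: VecDeque::new(), max_records }
    }

    /// Append a record, dropping the oldest ones beyond the limit.
    fn append(&mut self, record: Record) {
        self.records.push_back(record);
        while self.records.len() > self.max_records {
            self.records.pop_front();
        }
    }

    fn len(&self) -> usize {
        self.records.len()
    }

    fn oldest_sequence(&self) -> Option<u64> {
        self.records.front().map(|record| record.sequence)
    }
}

/// Store that writes each change to a WAL (for recovery) and to a
/// separate change feed (for subscribers), each with its own retention.
struct Store {
    next_sequence: u64,
    wal: Log,
    change_feed: Log,
}

impl Store {
    fn new(wal_retention: usize, feed_retention: usize) -> Self {
        Store {
            next_sequence: 0,
            wal: Log::new(wal_retention),
            change_feed: Log::new(feed_retention),
        }
    }

    /// Record a write in both logs and return its sequence number.
    fn put(&mut self, key: &str, value: &[u8]) -> u64 {
        let sequence = self.next_sequence;
        self.next_sequence += 1;
        let record = Record { sequence, key: key.to_string(), value: value.to_vec() };
        self.wal.append(record.clone());
        self.change_feed.append(record);
        sequence
    }
}
```

The cost of this separation is that every write is stored twice; publishing straight from the write-ahead log avoids the duplication but ties the subscribers' retention to the recovery log's.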

I will keep this article updated as a list pointing to all the subsequent posts.

[*] Variable Payload

[2] Events and Actions

[3] MemTable

[4] SSTable

[5] Write Ahead Log

[6] BloomFilter

[7] Error Handling in the Library

[8] Server Design Concepts

[9] Library Interface

[10] Server Implementation

[*] Distributed Linearizability without Consensus