FoundationDB: A Distributed Database That Can’t Be Killed
If you’ve seen the 1975 movie (and cult classic) “Monty Python and the Holy Grail,” you might remember the Black Knight, a knight so determined to guard his river passage that even the loss of all his limbs doesn’t deter him from his post.
In many ways, FoundationDB is like the Black Knight of databases, able to keep running even when multiple attempts are made to interrupt its operations, noted Peter Boros, founding engineer for Tigris Data, an object storage service provider, speaking at the Linux Foundation‘s Open Source Summit North America.
For FoundationDB, process “failure is just a scratch,” he said. On its own, FoundationDB is resilient. With the help of Kubernetes, it becomes pretty much bulletproof.
And for his talk, he demonstrated how difficult it was to knock down an instance of the distributed transactional key-value store.
First, he showed what would happen if you knocked one node offline, then what would happen when random nodes were knocked off every minute, and then every 10 seconds. Finally, he showed what would happen if a log server were to go down.
In all cases, the database quickly restored itself. Occasionally, the service slowed in terms of serving up queries per second (QPS). And in severe cases, it might go offline briefly, but it always returned to an operational state.
As a site reliability engineer (SRE) for Tigris, Boros’ day job is to keep thousands of databases (mostly of the MySQL variety) running. Tigris runs S3-compatible object stores that are available globally.
Currently, Tigris uses FoundationDB, running on Kubernetes, to store the metadata for the petabyte-scale object storage service, and so it serves billions of requests daily.
FoundationDB is “a little-known open source database, but very much undeservedly so,” Boros said. It competes with the likes of CockroachDB, TiDB and YugabyteDB.
A Database of Microservices

Peter Boros.
Architecturally, FoundationDB is built from microservices, meaning different parts of the system are managed by separate components.
Like any good microservices-based app, many of the components (i.e., the commit proxy, the resolver, the ratekeeper, the sequencer) are stateless, meaning they don’t hold data. So when they fail, they can be rapidly replaced with identical units.
The architecture offers a number of advantages, though they may seem like limitations at first.
The chief feature/limitation is that it has a five-second timeout on all transactions. This is a limitation to embrace, Boros said.
This limit is not configurable. “After five seconds, your transaction is over,” he said.
This limit allows FoundationDB to recover from a bad transaction quickly.
“Bad queries are impossible, because after five seconds, your transaction will be killed. And that’s it,” he said. No more will your database go down because a dev wrote some bad SQL.
Likewise, keys are limited to 10KB or less, and values can be no larger than 100KB. The upside? No bad queries, long transactions or messy undo issues, to name a few.
Nor will an admin experience something like MongoDB‘s “jumbo chunk” issue, in which chunks of data grow so large that they can no longer be split or migrated between shards.
“These limitations actually help you at scale,” he said.
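To make the programming model concrete, here is a minimal sketch using FoundationDB’s official Python bindings; the keys and the counter logic are illustrative, not taken from Boros’ talk:

```python
import fdb

fdb.api_version(730)  # request the 7.3 client API; must match the installed client
db = fdb.open()       # connects using the default cluster file

# Single reads and writes go through implicit, automatically retried transactions.
db[b"hello"] = b"world"
print(db[b"hello"])

# Multi-key work is wrapped in an explicit transaction. If it runs longer than
# five seconds, the cluster aborts it and the decorator retries it from the
# beginning; keys over 10KB or values over 100KB are rejected outright, so a
# runaway query can never monopolize the database.
@fdb.transactional
def bump_counter(tr, key):
    current = tr[key]
    count = int(current) if current.present() else 0
    tr[key] = str(count + 1).encode()
    return count + 1

print(bump_counter(db, b"request-count"))
```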
Each FoundationDB process runs as a single thread, though performance can be scaled through parallelism by running many processes across one or more servers. FoundationDB offers strict serializability, meaning transactions behave as if they were executed one at a time, in an order consistent with when they committed.
Tigris uses the simplest forms of data replication FoundationDB offers: double (a second copy of the data is kept) or, for global databases, triple.
FoundationDB has a “very nice” Kubernetes operator that Tigris uses. A Kubernetes operator is a set of instructions that tells Kubernetes how to deploy (or redeploy) an application.
The FoundationDB operator was built by Apple, which also runs FoundationDB on K8s.
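The operator manages clusters through a FoundationDBCluster custom resource. As a rough illustration of how a cluster gets declared, the sketch below creates one with the official Kubernetes Python client; the cluster name, process counts and CRD field names are assumptions based on the operator’s public examples, not Tigris’ actual configuration:

```python
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig

# A minimal FoundationDBCluster definition. The operator watches these
# objects and creates (or repairs) the underlying pods to match the spec.
cluster = {
    "apiVersion": "apps.foundationdb.org/v1beta2",
    "kind": "FoundationDBCluster",
    "metadata": {"name": "demo-cluster", "namespace": "default"},
    "spec": {
        "version": "7.3.63",
        # redundancy mode corresponds to double/triple replication
        "databaseConfiguration": {"redundancy_mode": "triple"},
        "processCounts": {"storage": 12, "log": 3, "stateless": 4},
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="apps.foundationdb.org",
    version="v1beta2",
    namespace="default",
    plural="foundationdbclusters",
    body=cluster,
)
```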
A Mere Flesh Wound
Running FoundationDB on Kubernetes offers two levels of resilience. Both the operator and the database itself are self-healing, Boros explained. This means they can execute failovers automatically.
Boros provided an example of a FoundationDB instance that stores three copies of each bit of data across multiple pods (12 in the example Boros provided):

Pods holding key ranges, each color specifying a different range. Each range has three copies spread out over 12 pods.
In the above example, when a pod goes down, copies of its data are still held on two other nodes. The failure prompts the operator to start another node (or, in FoundationDB terms, another process), with all the lost data recovered from the other nodes.
This recovery process rebuilds the lost data quickly, because it is being pulled not from one node, but from two, in parallel.
Each storage process has an individual logging capability, which captures all the changes in the database. If a log process fails, its data is recovered from the other log processes.
Bite Your Kneecaps Off
For this presentation, Boros ran a number of resilience tests of progressive severity against a smallish Kubernetes test cluster: eight servers, each with 96 cores, more than 1TB of memory and NVMe storage. The tests were completed before the presentation, with Boros summarizing the results for the audience.
He used FoundationDB v7.3.63 (with the Redwood storage engine), which ran across nine servers, for a total of 72 storage processes.

Client-side specs for testing.
The first case was a single pod failure. Using kubectl, Boros simply killed a running pod. The operator quickly repaired the pod and brought it back online without needing to copy any data.
Easy-peasy.
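Boros didn’t share his exact commands, but killing a random FoundationDB pod boils down to something like the following sketch (the namespace and label selector are placeholder assumptions):

```python
import random
import subprocess

NAMESPACE = "fdb"  # placeholder namespace
SELECTOR = "foundationdb.org/fdb-cluster-name=demo-cluster"  # placeholder label

# List the cluster's pods and pick one victim at random.
pods = subprocess.run(
    ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", SELECTOR,
     "-o", "jsonpath={.items[*].metadata.name}"],
    check=True, capture_output=True, text=True,
).stdout.split()
victim = random.choice(pods)

# Delete it immediately; the operator notices and recreates it.
subprocess.run(
    ["kubectl", "delete", "pod", victim, "-n", NAMESPACE,
     "--grace-period=0", "--force"],
    check=True,
)
```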
So Boros looped the routine to kill a pod every minute; the results are shown in the graph below, where the yellow line shows the downtime:

Visualizations were created by Tigris’ fdb-exporter.
Again, no disruption in service, as the pod was quickly restored.
How about a loop that takes out a pod every 10 seconds? Again, the FoundationDB operator recovered quickly from the repeated outages, though here, by the time one pod was restored, another one had failed. There was no downtime, and no data needed to be copied, though the database was in an “unhealthy” state for brief periods (meaning it did not have the second copy of the data).

Even as the pod outages came more frequently, the QPS remained almost consistent.
How did these outages affect performance? The test system was designed to offer about 50,000 QPS. As the testing got more aggressive, the QPS dipped a bit, but the database did not go offline.
And, as soon as the testing was finished, the QPS returned to its usual state.
“This kind of recovery time is not normal for other databases,” Boros said.
In a third test, Boros killed all the storage nodes.
This test finally rendered the database unavailable — but only for a brief period of time.

The text is too small to read, but the center graph shows FoundationDB briefly copying data before stopping.
Within a minute, the database was up and running, and normal QPS was restored as the operator quickly restarted the pods and they rejoined the cluster.
The operator started copying data for the still-missing pods, but stopped once those pods came back up.
Log Failures? We’ve Seen Worse
When a database log goes offline, the results are usually catastrophic. Logs are, after all, what the database depends on to track changes and recover from failures.
Killing one log in FoundationDB resulted in a five-second stall in responses.
The same was true for multiple log failures at one per minute, though in this case throughput temporarily flattened out at 0 QPS:

The FoundationDB operator does an excellent job at keeping the database alive, but FoundationDB also has that ability built in.
The database’s “self-healing works with existing processes. The operator can add/replace processes to the cluster, so it is not in the critical path to restore availability, but it helps to restore the original number of processes,” Boros explained in a follow-up email.
FoundationDB ensures that different copies of the data live in different availability zones. A zone is whatever the admin chooses it to be: a rack, a separate server, a set of host names, or an actual geographical zone.
So, even if you kill all the nodes in a single zone, FoundationDB can recover.
Boros also ran through storage failures without an operator present, mimicking cases where pods can’t come back up, such as rack failures.
Here, a single downed node triggers a data copy, so recovery takes slightly longer and QPS dips correspondingly.
In Boros’ test, an entire zone was reconstituted within three minutes, which is the time it took to recreate each pod and repopulate it with the data from the other pods, using speedy parallel transfers.
Overall, as many as six nodes can fail “if they fail slowly enough,” he said.
As more zones fail, more data will need to be copied, and so the recovery time increases. A node failure can take a few hours to fully recover through this process.
“If you want to have awesome recovery times, your cluster needs to be big enough,” he said.
Log failures without an operator have similar recovery times to those with operators.
Boros also covered FoundationDB’s backup and disaster recovery modes in his talk. But the takeaway is clear: Want to guard a bridge? Hire the Black Knight. Want a transactional database that never goes down? Consider FoundationDB.
Try Tigris’ object storage service here, and enjoy Boros’ entire presentation below:
The Linux Foundation sponsored this reporter’s travel to the Open Source Summit.