High Availability

ArcadeDB is designed for continuous operation. Its high availability (HA) architecture ensures that your database remains accessible even when individual servers fail, providing both fault tolerance and horizontal read scalability.

This page explains the concepts behind ArcadeDB’s HA architecture. For step-by-step setup instructions, see HA Configuration.

Starting with v26.4.1, ArcadeDB’s HA is powered by Apache Ratis, a battle-tested implementation of the Raft consensus protocol also used by Apache Ozone, Apache IoTDB, and Alluxio. The change is transparent to clients: the HTTP API, drivers, and query languages are unchanged.

Leader-Replica Replication Model

ArcadeDB uses a leader-replica replication model. At any point in time, a cluster has exactly one leader server and zero or more replica servers.

  • The leader coordinates all write operations and distributes changes to replicas through the Raft log.

  • Replicas serve read requests (queries) independently, allowing the cluster to scale read throughput horizontally.

Clients can connect to any server in the cluster. If a client sends a write request to a replica, the replica transparently forwards it to the leader over HTTP. Applications do not need to distinguish between leader and replica servers — the cluster handles routing internally.
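
For example, a client can POST a write to any node over the HTTP API and let the cluster route it to the leader. The sketch below uses Java's built-in HttpClient; the host, database name (mydb), and credentials are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class WriteToAnyNode {
  public static void main(String[] args) throws Exception {
    // Placeholder host and credentials; any cluster member works, because
    // a replica transparently forwards the write to the leader over HTTP.
    String auth = Base64.getEncoder().encodeToString("root:playwithdata".getBytes());
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://replica-host:2480/api/v1/command/mydb"))
        .header("Authorization", "Basic " + auth)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(
            "{\"language\": \"sql\", \"command\": \"INSERT INTO Account SET name = 'alice'\"}"))
        .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```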

Server Roles and Election

Every server in the cluster operates in one of two roles:

LEADER

The single server responsible for accepting and coordinating all write operations. Only one leader exists at any time.

REPLICA

A server that maintains a copy of the data and serves read requests. Replicas receive changes from the leader and can be promoted to leader if the current leader fails.

When a server starts, it joins the cluster and participates in a Raft election. Raft guarantees that the cluster agrees on exactly one leader at all times. Apache Ratis implements the full Raft protocol, including pre-vote (which prevents partitioned nodes from triggering disruptive elections), leader lease, and parallel voting for fast convergence.

If the cluster already has a healthy leader, the new server simply joins as a replica and catches up via the Raft log or, if it has fallen too far behind, a full snapshot download. If no leader exists — for example, during initial cluster formation or after a leader failure — an election determines which server becomes the new leader.

Replication and the Raft Log

The leader replicates changes to replicas through the Raft log, which combines a write-ahead log with distributed consensus. Each server maintains its own local copy of the Raft log.

When the leader processes a write operation, ArcadeDB follows a replicate-first, commit-after protocol with three phases:

  1. Phase 1 — the leader captures the write-ahead log (WAL) pages produced by the transaction.

  2. Phase 2 — the captured payload is appended to the Raft log and replicated to followers. The commit is acknowledged only after a quorum of nodes has persisted the entry.

  3. Phase 3 — the leader applies the pages locally and returns success to the client.

If replication fails at Phase 2, no local writes are applied: the transaction is rolled back and no divergence can occur between the leader and the followers.

Multiple concurrent transactions are automatically batched into a single Raft round-trip via group commit, dramatically increasing throughput under concurrent load.
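
The flow above can be condensed into pseudocode. This is an illustrative sketch only; the type and method names (Transaction, RaftLog, and so on) are hypothetical and do not correspond to ArcadeDB's internal classes.

```java
// Illustrative sketch of the replicate-first, commit-after flow.
// Every name below is hypothetical, not ArcadeDB's real API.
public class ReplicateFirstSketch {
  interface Transaction {
    byte[] captureWalPages();        // Phase 1: WAL pages produced by the tx
    void applyLocally(byte[] pages); // Phase 3: apply pages to local storage
    void rollback();
  }

  interface RaftLog {
    // Appends the payload and replicates it; returns true only once a
    // quorum of nodes has persisted the entry.
    boolean appendAndReplicate(byte[] payload);
  }

  static boolean commitWrite(Transaction tx, RaftLog raftLog) {
    byte[] pages = tx.captureWalPages();        // Phase 1
    if (!raftLog.appendAndReplicate(pages)) {   // Phase 2: quorum or fail
      tx.rollback();                            // nothing was applied locally, so
      return false;                             // leader and followers never diverge
    }
    tx.applyLocally(pages);                     // Phase 3: apply, then ack the client
    return true;
  }
}
```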

Write Quorum

ArcadeDB uses a configurable write quorum to control the trade-off between durability and availability. The quorum defines how many servers must acknowledge a write before it is considered committed.

The default quorum is majority — more than half of the servers in the cluster must confirm the write. This is the standard Raft quorum and provides a strong balance between durability and availability: data is stored on a majority of nodes while tolerating the failure of any minority.

Two quorum modes are supported:

majority (default)

A standard Raft quorum. Writes are committed as soon as a majority of peers acknowledge, so the cluster keeps serving writes as long as a majority is alive.

all

Every configured peer must acknowledge the write. Strongest durability, but a single unavailable peer stops writes.

In v26.4.1, the legacy values none, one, two, three are no longer supported and will cause a startup error. They were part of the old custom HA protocol; with Raft only majority and all are meaningful. If you were using a granular quorum value, update your configuration before upgrading.

If the quorum cannot be met — for example, because too many peers are unavailable — the transaction is rolled back and an error is returned to the client. This prevents partially committed writes.
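
As a sketch, the quorum mode can be selected through the server configuration. This assumes the arcadedb.ha.quorum setting name; in practice the value is usually passed as a -D JVM property rather than set programmatically.

```java
public class QuorumConfigSketch {
  public static void main(String[] args) {
    // Typically supplied on the command line as -Darcadedb.ha.quorum=majority;
    // shown programmatically here as a sketch for an embedded server.
    System.setProperty("arcadedb.ha.enabled", "true");
    System.setProperty("arcadedb.ha.quorum", "majority"); // default; "all" is the alternative
    // Legacy granular values fail fast at startup in v26.4.1+:
    // System.setProperty("arcadedb.ha.quorum", "two");   // -> startup error
  }
}
```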

For more on how transactions interact with replication, see Transactions. For quorum and other cluster settings, see Server Settings.

Read Consistency (from v26.4.1)

When a client sends a read request to a replica, ArcadeDB supports three consistency levels that trade latency for freshness:

eventual

The replica reads locally without waiting. Fastest, but may return slightly stale data.

read_your_writes (default)

The replica waits until the Raft log index corresponding to the client’s most recent write has been applied locally before serving the read. This ensures the client always observes its own writes, even when those writes went to the leader and the read goes to a replica.

linearizable

The replica issues a Ratis ReadIndex request to the leader and waits until its local applied index reaches the leader’s committed index at the time of the request. This is the strongest guarantee and survives leader changes without serving stale data, at the cost of higher latency.

The default is set via arcadedb.ha.readConsistency. Clients can also override it per request with the X-ArcadeDB-Read-Consistency HTTP header. For read-your-writes, the client includes the commit index returned by its last write via the X-ArcadeDB-Commit-Index header so the replica waits only for that index.
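
A minimal read-your-writes round-trip might look like the following Java sketch. The host, database name, and credentials are placeholders, and how the commit index is extracted from the write response is elided; only the two headers come from the description above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ReadYourWritesSketch {
  public static void main(String[] args) throws Exception {
    HttpClient http = HttpClient.newHttpClient();
    String auth = Base64.getEncoder().encodeToString("root:playwithdata".getBytes());

    // Assume the client has already written through any node and extracted
    // the Raft commit index from the write response (value is hypothetical).
    long commitIndex = 42;

    // Read from a replica, overriding the consistency level per request and
    // passing the bookmark so the replica waits only for that index.
    HttpRequest read = HttpRequest.newBuilder()
        .uri(URI.create("http://replica-host:2480/api/v1/query/mydb"))
        .header("Authorization", "Basic " + auth)
        .header("Content-Type", "application/json")
        .header("X-ArcadeDB-Read-Consistency", "read_your_writes")
        .header("X-ArcadeDB-Commit-Index", Long.toString(commitIndex))
        .POST(HttpRequest.BodyPublishers.ofString(
            "{\"language\": \"sql\", \"command\": \"SELECT FROM Account\"}"))
        .build();
    HttpResponse<String> response = http.send(read, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```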

Automatic Failover

ArcadeDB handles leader failure automatically. When replicas stop receiving heartbeats from the leader, they initiate a new election using the Raft protocol. A replica with an up-to-date log is elected as the new leader, and the cluster resumes normal operation.

This failover process is transparent to clients: after a brief interruption during the election, clients reconnect (or are proxied) to the new leader and continue operating without manual intervention.

Common causes of leader unavailability include process termination, server shutdown or reboot, and network partitions that isolate the leader from the rest of the cluster.

Split-Brain Protection

A split-brain occurs when a network partition divides a cluster into two or more groups, each believing it should operate independently. This can lead to conflicting writes and data divergence.

ArcadeDB prevents split-brain scenarios through the Raft majority quorum requirement. Because a leader must receive acknowledgment from a majority of servers to commit writes, at most one partition can ever hold a majority. Any minority partition cannot elect a leader or commit writes, so it effectively becomes read-only until the network heals. Pre-vote further prevents a partitioned node from forcing a disruptive election when it regains connectivity.
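
The arithmetic behind this guarantee is simple: a majority of n servers is floor(n/2) + 1, and two disjoint groups cannot both reach that threshold. A tiny illustration:

```java
public class QuorumMath {
  // Majority quorum for n voting members: floor(n/2) + 1.
  static int majority(int n) { return n / 2 + 1; }

  public static void main(String[] args) {
    // A 5-node cluster split 3/2: only the 3-node side holds a majority
    // (3 >= majority(5) == 3), so at most one partition can commit writes.
    System.out.println(majority(5)); // prints 3
  }
}
```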

This design guarantees that the cluster never produces conflicting committed data, even during network failures.

Snapshots and Catch-Up

As the Raft log grows, ArcadeDB periodically takes a snapshot of the database state. The snapshot allows old log segments to be compacted, preventing unbounded disk growth.

When a replica falls too far behind the leader (for example, after a long network partition), ArcadeDB automatically transfers a fresh snapshot: the follower downloads a ZIP of the database from the leader over HTTP and installs it atomically. Once the snapshot is installed, the follower resumes normal log-based replication from the snapshot index.

Dynamic Membership (from v26.4.1)

Cluster membership can be changed at runtime without restarting the cluster. Operators can add peers, remove peers, transfer leadership, or have a node gracefully leave the cluster through the cluster REST API. On Kubernetes, scaling a StatefulSet up or down automatically adds and removes peers.
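
The concrete endpoints are covered in HA Configuration; purely to illustrate the shape of such an operation, here is a hypothetical sketch of adding a peer over HTTP. The /api/v1/ha/peers path and the payload are invented for illustration and are not the real API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AddPeerSketch {
  public static void main(String[] args) throws Exception {
    // HYPOTHETICAL endpoint and payload: see HA Configuration for the
    // real cluster REST API; this only shows the shape of the call.
    String auth = Base64.getEncoder().encodeToString("root:playwithdata".getBytes());
    HttpRequest addPeer = HttpRequest.newBuilder()
        .uri(URI.create("http://leader-host:2480/api/v1/ha/peers"))   // hypothetical path
        .header("Authorization", "Basic " + auth)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(
            "{\"peer\": \"newnode@host4:2434:2480\"}"))               // hypothetical payload
        .build();
    HttpResponse<String> r = HttpClient.newHttpClient()
        .send(addPeer, HttpResponse.BodyHandlers.ofString());
    System.out.println(r.statusCode());
  }
}
```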

Cluster Discovery and Configuration

Servers discover each other through a configured list of peers (arcadedb.ha.serverList). Each entry has the form [name@]host:raftPort:httpPort, where the Raft gRPC port (default 2434) is used for consensus and the HTTP port is used for command forwarding. The optional name@ prefix (since v26.5.1) lets operators give each peer a stable, human-readable name (for example, frankfurt, london, nyc) shown in logs and Studio.

You can run multiple independent clusters on the same network by assigning each cluster a unique name via arcadedb.ha.clusterName. Inter-node communication is authenticated with a shared cluster token automatically derived from the cluster name and the root password.
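
Putting the two settings together, a three-node cluster might be configured as in this sketch. The hostnames and peer names are illustrative, and the programmatic System.setProperty form stands in for the usual -Darcadedb.* JVM properties.

```java
public class ClusterConfigSketch {
  public static void main(String[] args) {
    // Equivalent to -Darcadedb.ha.serverList=... on the java command line.
    System.setProperty("arcadedb.ha.enabled", "true");
    System.setProperty("arcadedb.ha.clusterName", "eu-cluster");
    System.setProperty("arcadedb.ha.serverList",
        "frankfurt@host1:2434:2480,london@host2:2434:2480,nyc@host3:2434:2480");
  }
}
```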

For detailed configuration options and deployment instructions, see HA Configuration.

Continuous Correctness Testing with Jepsen

Distributed systems are easy to claim correct and hard to prove correct. To know how ArcadeDB actually behaves under partitions, process kills, clock skew, and simulated power loss — rather than how we hope it behaves — the team maintains a public Jepsen test suite at ArcadeData/arcadedb-jepsen.

Jepsen is the de facto standard framework for testing distributed systems under faults. It pairs aggressive failure injection with two model checkers — Knossos for linearizability and Elle for transactional isolation — that build a history of every operation the cluster acknowledged and then look for any execution that no correct system could have produced. Both checkers are model-driven, so a system can’t pass them by being merely fast or quiet under load.

ArcadeDB is not officially Jepsen-certified — that requires an independent engagement with Jepsen LLC. The tests in this section are a self-hosted suite using the open-source Jepsen framework. We publish them in full so the community can review, reproduce, and extend them.

What gets tested

The suite combines six workloads with eight fault models (nemeses), so the cluster is exercised the way users actually break it in production.

The six workloads and what each verifies:

bank

ACID conservation across concurrent transfers between accounts; the total balance must stay constant under any failure schedule.

set

No acknowledged write is ever lost; every successfully inserted element survives partitions, kills, and replays.

elle

Multi-key transaction isolation. The Elle checker flags G0, G1a, G1b, G2, and lost-update anomalies via dependency-graph cycle detection.

register

Single-key linearizability when reading from the leader.

register-follower

Linearizability for follower reads via Raft ReadIndex.

register-bookmark

Read-your-writes consistency through commit-index bookmarks (the same mechanism documented in Read Consistency).

The eight nemeses and what each injects:

partition

Network partitions that isolate one or more nodes from the rest of the cluster.

kill

Hard process termination of randomly chosen servers.

pause

SIGSTOP / SIGCONT pairs that freeze a server’s process for seconds at a time.

clock

Bidirectional clock skew on individual nodes.

lazyfs

LazyFS-backed simulated power loss — drops un-fsynced writes to test recovery from crashes that lose the OS page cache.

all

Random combinations of partition + kill + pause.

all+clock

all plus simultaneous clock skew.

all+lazyfs

all plus simulated power loss.

The full matrix runs in Docker against a five-node cluster (n1–n5) and is reproducible end-to-end from the repository’s run-all-tests.sh.

Current status

At the time of writing, the suite reports all tests passing — leader linearizability, follower linearizability via ReadIndex, ACID conservation, replication completeness, and read-your-writes — across every nemesis combination, including the LazyFS power-loss workloads added most recently. See the launch announcement for the methodology and original results, and the repository for the up-to-date matrix.

Concretely, this means that under every failure pattern we exercise:

  • every write the cluster acknowledged remains durable after recovery,

  • concurrent transactions remain ACID — Elle finds no isolation anomaly,

  • single-key operations are linearizable on the leader and on followers using ReadIndex,

  • clients reading their own writes never observe regressed state, even when the read lands on a different server than the one that accepted the write.

What’s not (yet) tested

Honesty matters more than green checkmarks. The suite does not yet cover:

  • multi-hour or multi-day soak runs that surface slow leaks and pathological GC interactions,

  • Byzantine-fsync scenarios (a filesystem that lies about durability beyond what LazyFS simulates),

  • geo-distributed clusters with sustained tens-of-milliseconds latency between nodes,

  • every imaginable compounded worst case — for example, an expired leader lease plus clock skew plus an active partition all happening at once.

These are explicitly tracked as future work. Pull requests for additional workloads, tighter checkers, or longer-duration runs are very welcome.

Mission

The reason the project exists is straightforward: we want to know what ArcadeDB actually does under failure before we ask anyone else to trust it. Jepsen is the highest-leverage way the industry has found to answer that question, and we run it in public so the answer is verifiable rather than merely asserted. If you find a workload that breaks, please open an issue at arcadedb-jepsen/issues — that is exactly what the suite is for.