An architect’s breakdown: quorum DR, split-brain, leases — and why “wait for the standby” isn’t the same as “survive minority failures.”
What Is Paxos Saying, and Why Does It Matter?
If you’ve never bumped into distributed systems theory, Paxos might sound like an inside joke from a computer-science department. A useful mental model is a roomful of people trying to agree on one outcome — except some people arrive late, drop off the call, or contradict themselves. The system still has to produce a single decision that won’t unravel later. That is the problem distributed consensus exists to solve. The Paxos family of protocols, introduced by Turing Award winner Leslie Lamport, is a classic answer: no outcome is final unless a majority of participants has accepted it.
Why a majority? Because the math is clean: if every decision requires a majority, then any two majorities must share at least one member. You cannot end up with “half the cluster believes A” and “the other half believes B” forever — a situation that, in databases, is the nightmare called split-brain, where two nodes both think they are the writable leader and happily accept conflicting writes.
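If you want to see the pigeonhole argument rather than take it on faith, a brute-force check in Python over a five-replica cluster takes only a few lines:

```python
# Sanity-check the quorum-intersection argument: with n replicas,
# any two majorities (more than n // 2 members) must share a node.
from itertools import combinations

n = 5
majority = n // 2 + 1  # 3 of 5

for q1 in combinations(range(n), majority):
    for q2 in combinations(range(n), majority):
        assert set(q1) & set(q2), "found two disjoint majorities"

print(f"every pair of {majority}-member quorums out of {n} replicas overlaps")
```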
Takeaway: Paxos isn’t magic optimism. It’s a rule for turning messy partial failures into one durable story about what happened.
OceanBase applies this idea to durability. Data is replicated across multiple copies called replicas. Every database change first becomes redo / commit log entries. Those entries must be durably recorded on a majority of replicas — including the leader — before the transaction can return “commit success” to the client. So if a single machine loses a disk — or even the machine hosting the leader dies — as long as a majority of replicas is still reachable, your change has already been voted in and persisted on more than one machine. That is the engineering reason OceanBase can target RPO = 0 in many disaster-tolerance setups: reliability is not “this one box is special,” but “the quorum is the source of truth.”
RPO (Recovery Point Objective): the maximum acceptable amount of data loss after a failure — here, the design aims for zero committed data loss for the covered failure modes.
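To make the commit rule concrete, here is a toy model in Python (not OceanBase's actual code; replica names invented): success is reported only once the redo entry is durable on a majority that includes the leader.

```python
# Toy model of the commit rule (illustrative only): a transaction reports
# success only once its redo entry is durable on a majority of replicas,
# counting the leader itself.

def try_commit(acked_replicas: set[str], all_replicas: set[str], leader: str) -> bool:
    """acked_replicas: replicas that have fsynced the redo entry."""
    majority = len(all_replicas) // 2 + 1
    return leader in acked_replicas and len(acked_replicas) >= majority

replicas = {"z1", "z2", "z3"}
# One follower (z3) is down, but leader z1 plus z2 form a majority:
print(try_commit({"z1", "z2"}, replicas, leader="z1"))  # True -> commit succeeds
# Only the leader persisted so far -> cannot commit yet:
print(try_commit({"z1"}, replicas, leader="z1"))        # False -> keep waiting
```

Because any future leader must also be chosen by a majority, and any two majorities overlap, the surviving quorum always holds at least one copy of every committed entry; that overlap is what makes RPO = 0 more than a slogan.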
One clarification matters: replicas are not voting on “every tiny row change” as an isolated Paxos instance. In OceanBase, consensus operates at the log stream layer. A log stream is an internal abstraction that merges ordered log records for multiple partitions (shards of data). Batching many partition updates into one ordered stream lets a single Multi-Paxos interaction synchronize multiple partitions at once — less chatter on the network, better end-to-end efficiency.
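A sketch of why batching pays off, with invented names and a deliberately minimal structure: many partition updates ride one ordered entry, so a single quorum round covers all of them.

```python
# Hypothetical sketch: updates for many partitions are multiplexed into
# one ordered log stream, so one Multi-Paxos round replicates them all.
from dataclasses import dataclass, field

@dataclass
class LogStream:
    entries: list = field(default_factory=list)  # one global order

    def append_batch(self, updates: list[tuple[str, str]]) -> int:
        """Append updates for many partitions as one entry; returns its index."""
        self.entries.append(updates)
        return len(self.entries) - 1

stream = LogStream()
# Three partitions updated, one consensus round instead of three:
idx = stream.append_batch([("p1", "UPDATE ..."), ("p7", "INSERT ..."), ("p9", "DELETE ...")])
print(f"entry {idx} carries 3 partition updates in a single quorum round")
```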
Architecturally, OceanBase is commonly deployed across multiple Zones (isolation domains analogous to availability zones in public clouds — separate failure domains within a region). Each log stream has one leader replica and several follower replicas. The leader executes writes and drives replication. Under the hood, replication uses a Multi-Paxos-style protocol: once a stable leader is established, the steady state avoids the classic two-round RPC pattern of naive Paxos for every log entry — often one round trip is enough to achieve quorum persistence — so you keep correctness and keep latency under control.
Multi-Paxos vs Strong-Sync vs Raft
In a traditional primary/secondary topology, a common anti-loss tactic is strong synchronous replication: the primary waits until the standby has also persisted the log before acknowledging the client. That does give you a full log on the standby if the primary dies — but the price is blunt: if primary, standby, or the network between them hiccups, the primary can stall — or appear unavailable. You are often forced into an ugly trade: pick “data safety” or “write availability,” not both.
Multi-Paxos with three or five replicas is different by design: decisions are quorum-based. If more than half the replicas are alive and can talk, the system can usually continue accepting writes and converging on one history. **When a minority fails (for example, one replica out of three), you can still have both “no loss of committed work” and “service keeps moving”** — something strict two-node strong sync struggles to deliver cleanly.
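The difference between the two acknowledgment rules fits in a few lines. A toy contrast (names invented): strong sync waits for one designated standby, while a quorum accepts any majority.

```python
# Contrast of the two acknowledgment rules (toy model, invented names):
# strong_sync needs THE designated standby; quorum needs ANY majority.

def strong_sync_ok(acks: set[str], standby: str) -> bool:
    return standby in acks  # a single slow or partitioned standby blocks commits

def quorum_ok(acks: set[str], cluster_size: int) -> bool:
    return len(acks) >= cluster_size // 2 + 1

# The leader persisted locally; one specific follower is unreachable:
acks = {"leader", "follower_b"}
print(strong_sync_ok(acks, standby="follower_a"))  # False -> writes stall
print(quorum_ok(acks, cluster_size=3))             # True  -> writes continue
```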
Raft and Multi-Paxos share a goal — majority agreement — but not the same engineering ergonomics. Raft emphasizes strict log continuity: entries at the same index, in the same term, must line up, and commits advance in a tidy sequence. That makes leader election and replication easier to reason about — great for teaching and for many implementations. Multi-Paxos, as OceanBase uses it, can allow more out-of-order confirmation patterns at the protocol level; individual log entries can be advanced and learned with flexibility when nodes recover, and a new leader may need an extra round of reconciliation for uncommitted tails. The upside is adaptability under messy real-world failures and topologies.
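A deliberately simplified model of the two commit-advance rules (not either codebase; in real Raft a follower would not even acknowledge slot 1 without slot 0, the point here is the commit rule): imagine slot 1 has reached quorum while slot 0 has not.

```python
# Toy contrast of commit-advance rules. Per-slot acks below: slot 1 has
# a quorum of the 3 replicas, slot 0 does not (yet).
acks = {0: {"a"}, 1: {"a", "b", "c"}}
quorum = 2  # majority of 3

# Raft-style: the commit index only advances over a contiguous quorum prefix.
raft_commit = -1
while len(acks.get(raft_commit + 1, set())) >= quorum:
    raft_commit += 1
print(f"Raft-style commit index: {raft_commit}")  # -1: stuck behind slot 0

# Multi-Paxos-style: each slot can be chosen independently; slot 1 is already
# decided and learnable, while slot 0 is reconciled when nodes recover.
chosen = {slot for slot, a in acks.items() if len(a) >= quorum}
print(f"Multi-Paxos chosen slots: {chosen}")      # {1}
```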
OceanBase’s Multi-Paxos tuning also enables choosing a latency-friendly quorum when geography allows — for example, two facilities in the Seattle metro plus one in a distant region might prefer acknowledging the two “near” copies first to cut round-trip time while still satisfying majority. Compared with a more rigid Raft-shaped story, this flexibility tends to matter when you are optimizing multi-site, multi-Zone deployments where “low latency” and “survive a datacenter loss” are both non-negotiable.
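The latency win is easy to put numbers on. In a back-of-envelope model, commit latency is roughly the majority-th fastest acknowledgment, so nearby copies hide a distant one (the RTTs and topology below are invented):

```python
# Back-of-envelope commit latency: the leader needs the majority-th
# fastest ack, so nearby replicas keep a distant one off the critical path.
def commit_latency_ms(rtts: list[float]) -> float:
    majority = len(rtts) // 2 + 1
    return sorted(rtts)[majority - 1]  # wait for the k-th fastest ack

# Leader in Seattle: local fsync ~1 ms, metro peer ~3 ms, remote region ~60 ms
print(commit_latency_ms([1, 3, 60]))   # 3  -> far replica hidden by the near quorum
# If two of the three copies were remote instead:
print(commit_latency_ms([1, 60, 65]))  # 60 -> geography dominates every commit
```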
Plain-language contrast: Raft is often “cleaner on the whiteboard.” OceanBase’s Multi-Paxos flavor is “more willing to negotiate with reality on the WAN.”
Where Does Write Performance Come From?
Consensus must be correct — and also fast enough. OceanBase applies several engineering optimizations on top of textbook Paxos.
First: in the steady state, log replication often collapses to one RPC round. Classic Paxos per entry can look like prepare + accept — two trips. With Multi-Paxos, after a stable leader is in place, followers can accept new entries in a streamlined path: the leader ships the log; once a majority has persisted, the entry is committed. Day-to-day latency is dominated by one network round trip + quorum fsync, not a committee meeting for every line item.
Second: entries can be acknowledged and committed out of strict global order. The system does not always need “line 42 before line 43, or nothing counts.” In production, micro-bursts of packet loss or jitter are normal. If one replica falls behind briefly but a quorum is still durable, commits can proceed. The cluster doesn’t let one slow link throttle the entire write path — a practical resilience feature on noisy networks.
Third: “arbitration replicas” trim the cost of high availability. OceanBase can use arbitration replicas — lightweight members that participate in leader election and voting without storing the full dataset. They reduce cross-site bandwidth pressure and make three-location layouts more economical, which matters when cross-region links are expensive or capacity-constrained.
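A rough model of the idea, simplified from the docs: the arbiter counts as a voter, but only the full replicas carry the dataset and redo log across sites.

```python
# Simplified model of the arbitration idea (see OceanBase docs for the
# real rules): the arbiter votes, so the group keeps 3 voters, but only
# the 2 full replicas ship and store the bulky data across sites.
members = [
    {"name": "full_1",  "votes": True, "stores_data": True},
    {"name": "full_2",  "votes": True, "stores_data": True},
    {"name": "arbiter", "votes": True, "stores_data": False},
]
voters = sum(m["votes"] for m in members)
data_copies = sum(m["stores_data"] for m in members)
print(f"{voters} voters (quorum {voters // 2 + 1}), {data_copies} full data copies")
# -> 3 voters (quorum 2), 2 full data copies: cheaper cross-region bandwidth
```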
Beyond the Log: Leases, Failover, and Routing
Replication is the spine — but not the whole skeleton.
Automatic failover and split-brain resistance (leases)
The leader is not a lifetime appointment. OceanBase elects leaders through an election protocol and uses a lease: at any moment, only one node should believe it may act as leader for a term. If the leader fails or a partition isolates it, the survivors wait for the lease to expire before electing anew — reducing the classic “zombie primary” failure mode. After a failure, leader switchover completes and service is restored within seconds (OceanBase targets RTO < 8 seconds).
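A minimal lease sketch (timing values invented): a node serves as leader only while its lease is valid, and survivors hold an election only after the old lease has run out, so two leaders never overlap in time.

```python
# Minimal lease sketch (illustrative numbers, not OceanBase's election code).
import time

LEASE_MS = 4000

class Lease:
    def __init__(self) -> None:
        self.expires_at = 0.0

    def renew(self) -> None:       # the leader renews via heartbeats
        self.expires_at = time.monotonic() + LEASE_MS / 1000

    def is_leader(self) -> bool:   # checked before serving any write
        return time.monotonic() < self.expires_at

    def safe_to_elect(self) -> bool:  # survivors wait out the old lease
        return time.monotonic() >= self.expires_at

lease = Lease()
lease.renew()
print(lease.is_leader(), lease.safe_to_elect())  # True False: one leader at a time
```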
RTO (Recovery Time Objective): how long service can be interrupted after a failure before you violate business requirements.
Local failures shouldn’t crater the whole fleet: fine-grained leader movement
Unlike monolithic “one primary for the entire database” stories, OceanBase ties replication and failover to log streams. If a physical host dies, only the log streams for which that host was leader need fast re-election; other partitions keep serving. That parallel recovery is a big reason OceanBase can talk about second-scale RTO at real scale.
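The bookkeeping behind that claim is simple. A sketch with a hypothetical layout: only the streams led by the dead host need a new election.

```python
# Stream-granular failover (hypothetical layout): only the log streams
# led by the dead host re-elect; everything else keeps serving.
stream_leaders = {"ls1": "host_a", "ls2": "host_b", "ls3": "host_a", "ls4": "host_c"}

def streams_needing_election(dead_host: str) -> list[str]:
    return [ls for ls, leader in stream_leaders.items() if leader == dead_host]

print(streams_needing_election("host_a"))  # ['ls1', 'ls3'] re-elect in parallel
# ls2 and ls4 never notice; recovery cost scales with the failed host, not the cluster.
```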
Application-transparent routing: the smart proxy layer
Clients typically don’t pin connections to raw database nodes (OBServer — OceanBase’s data/compute process). They connect through OBProxy, the database proxy / load balancer that routes by partition topology and leader location. When leadership moves, OBProxy learns the new map (via feedback and periodic refresh), so applications usually don’t rewrite configs or bounce processes just because a leader moved.
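A toy model of leader-aware routing (invented API, not OBProxy's actual interfaces): the proxy caches a leader map and repairs it from server feedback when leadership moves.

```python
# Toy leader-aware router: cache a partition -> leader map, repair it
# when a server answers "not the leader anymore". Names are invented.
class Proxy:
    def __init__(self, leader_map: dict[str, str]) -> None:
        self.leader_map = dict(leader_map)  # refreshed periodically too

    def route(self, partition: str, send) -> str:
        node = self.leader_map[partition]
        ok, hint = send(node, partition)
        if not ok:                           # feedback: leadership moved
            self.leader_map[partition] = hint
            send(hint, partition)            # retry against the new leader
        return self.leader_map[partition]

# Simulated cluster where p1's leader just moved from n1 to n2:
def send(node: str, partition: str) -> tuple[bool, str]:
    return (node == "n2", "n2")

proxy = Proxy({"p1": "n1"})
print(proxy.route("p1", send))  # 'n2' -- the application never saw the failover
```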
Automatic repair: evict the bad node, refill the quorum
Root Service is OceanBase’s cluster management control plane — and it is itself replicated for HA. Nodes report liveness via heartbeats. If a member goes dark long enough, it can be removed from the Paxos group and replaced on healthy machines so the quorum stays intact through machine-level and site-level incidents.
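A sketch of that repair loop, with an invented threshold (the real one is configurable):

```python
# Control-plane repair loop sketch: members that miss heartbeats long
# enough are evicted and replaced, restoring the Paxos group to strength.
import time

PERMANENT_OFFLINE_S = 3600  # illustrative threshold, not OceanBase's default

def repair(last_heartbeat: dict[str, float], spare_nodes: list[str]) -> dict[str, float]:
    now = time.time()
    for node, seen in list(last_heartbeat.items()):
        if now - seen > PERMANENT_OFFLINE_S and spare_nodes:
            del last_heartbeat[node]                 # evict from the Paxos group
            last_heartbeat[spare_nodes.pop()] = now  # add a replacement member
    return last_heartbeat

now = time.time()
members = {"n1": now, "n2": now, "n3": now - 7200}   # n3 dark for two hours
print(sorted(repair(members, ["n4"])))               # ['n1', 'n2', 'n4']
```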
Summary
OceanBase’s HA story centers on a simple contract encoded in the protocol: majority persistence is what makes a commit real. That is how you get RPO = 0 for committed work under minority failures without the “primary/secondary/network trinity” deadlock of naive strong sync. Versus a textbook Raft-shaped implementation, Multi-Paxos-style flexibility helps when you are optimizing multi-Zone, multi-site latency and recovery paths. Add leases, stream-granular failover, OBProxy routing, a replicated Root Service, read replicas, and deployment patterns for DR — and the slide from theory to operations becomes believable for teams that cannot choose between data durability and service continuity.
References
Paxos protocol: https://en.oceanbase.com/docs/common-oceanbase-database-10000000001031451
High availability: https://en.oceanbase.com/docs/community-odp-en-10000000001007334
OceanBase overview: https://en.oceanbase.com/docs/common-oceanbase-database-10000000003678727
Cross-cloud active-active architecture: https://en.oceanbase.com/docs/common-oceanbase-cloud-10000000001781970
OceanBase GitHub: https://github.com/oceanbase/oceanbase/blob/develop/README.md
Designing HA systems? What trade-offs are you making between RPO and availability? Drop your lessons below.


