Boosting Resilience in Bare-Metal Active-Active Clusters: 4-Node and 5-Node Control Plane Architecture (OpenShift 4.17 and Later)

Organizations running active-active deployments across two locations—especially those hosting stateful workloads like OpenShift Virtualization VMs that run only a single instance—depend heavily on the underlying infrastructure to guarantee availability.
While traditional virtualization platforms handle this natively, running these workloads on OpenShift bare metal introduces new architectural considerations.
The Challenge: What Happens When the Primary Site Fails? ⚠️
In typical stretched OpenShift clusters, the control plane is often deployed in a 2+1 or 1+1+1 topology.
But if the data center hosting the majority of control-plane nodes goes down:
- The surviving control-plane node becomes the only source of truth for the cluster.
- That single node no longer has quorum, so it must be manually switched back to read-write mode as the only remaining etcd copy.
- If that node also fails, the cluster state is lost and recovery becomes catastrophic, especially when running stateful VMs.
This risk becomes even more critical in environments leveraging OpenShift Virtualization for production workloads.
The Solution: 4-Node and 5-Node Control Plane for Stretched Clusters 🚀
To increase resiliency during data-center-level failures, OpenShift can leverage 4-node or 5-node control-plane deployments, such as:
- 2+2
- 3+2
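With these designs, even if an entire site is lost, the remaining location still retains at least two copies of etcd (read-only whenever the survivors fall short of quorum). Having two copies instead of one significantly boosts cluster recoverability and reduces the risk of losing the last remaining copy of cluster state.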
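The quorum arithmetic behind these layouts can be sketched in a few lines. This is an illustrative calculation only, assuming the standard majority rule for etcd (floor(n/2) + 1 members); the function names are hypothetical and nothing here calls an OpenShift or etcd API.

```python
def quorum(members: int) -> int:
    """Members needed for an etcd majority: floor(n/2) + 1."""
    return members // 2 + 1


def survivors_after_site_loss(topology: tuple[int, ...]) -> dict:
    """Worst case: the site holding the most control-plane nodes is lost."""
    total = sum(topology)
    remaining = total - max(topology)
    return {
        "members": total,
        "quorum": quorum(total),
        "remaining": remaining,
        "has_quorum": remaining >= quorum(total),
    }


# Two-site layouts discussed in this article: 2+1, 2+2 and 3+2.
for topology in [(2, 1), (2, 2), (3, 2)]:
    print(topology, survivors_after_site_loss(topology))
```

In the worst case, none of the three layouts keeps quorum after losing the larger site, so a manual recovery is still required. The difference is that 2+2 and 3+2 leave two etcd copies at the surviving site, while 2+1 leaves only one.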
Today, the cluster-etcd-operator already supports up to five etcd members and scales them automatically in environments that use MachineSets.
In bare-metal or agent-based installations, however, MachineSets are not available. The operator therefore does not scale the control plane automatically, but it does adjust the etcd peers when control-plane nodes are added manually.
This is exactly the workflow we aim to validate and officially support.
🔧 Note: This capability is specifically targeted at bare-metal clusters, with a strong focus on OpenShift Virtualization use cases.
Goals 🎯
Validate and support 4-node and 5-node control-plane architectures for bare-metal stretched clusters, under the following constraints:
- Bare-metal control-plane nodes
- Installed via Assisted Installer or Agent-based Installer
- Shared Layer 3 network across locations
- Latency < 10 ms between all control-plane nodes
- Minimum 10 Gbps bandwidth
- etcd stored on SSD or NVMe
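As a rough illustration, the node-level constraints above could be checked against an inventory before installation. The sketch below is hypothetical: the `ControlPlaneNode` record and its fields are assumptions made for this example, not an OpenShift API object, and the measurements would have to come from your own tooling.

```python
from dataclasses import dataclass

MAX_LATENCY_MS = 10.0       # constraint: latency < 10 ms between control-plane nodes
MIN_BANDWIDTH_GBPS = 10.0   # constraint: minimum 10 Gbps bandwidth
ALLOWED_DISKS = {"ssd", "nvme"}


@dataclass
class ControlPlaneNode:  # hypothetical inventory record
    name: str
    bare_metal: bool
    etcd_disk: str
    bandwidth_gbps: float
    worst_peer_latency_ms: float  # worst measured latency to any other control-plane node


def violations(node: ControlPlaneNode) -> list[str]:
    """Return the constraints this node fails; an empty list means it qualifies."""
    problems = []
    if not node.bare_metal:
        problems.append("not bare metal")
    if node.etcd_disk.lower() not in ALLOWED_DISKS:
        problems.append(f"etcd on {node.etcd_disk}, expected SSD or NVMe")
    if node.bandwidth_gbps < MIN_BANDWIDTH_GBPS:
        problems.append(f"bandwidth {node.bandwidth_gbps} Gbps is below {MIN_BANDWIDTH_GBPS}")
    if node.worst_peer_latency_ms >= MAX_LATENCY_MS:
        problems.append(f"latency {node.worst_peer_latency_ms} ms is not below {MAX_LATENCY_MS}")
    return problems


# Made-up sample data for illustration only.
nodes = [
    ControlPlaneNode("cp-a1", True, "nvme", 25.0, 2.5),
    ControlPlaneNode("cp-b1", True, "hdd", 10.0, 12.0),
]
for node in nodes:
    print(node.name, violations(node) or "ok")
```

The installer type and the shared Layer 3 network are cluster-level properties, so they are left out of this per-node check.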
Acceptance Criteria ✔️
📌 Performance
Control plane performance and scalability must show less than 10% degradation when compared to standard HA clusters.
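One way to express the 10% threshold is as a relative change against a baseline measured on a standard HA cluster. The sketch below assumes that interpretation; the source does not name specific metrics, so the sample throughput and latency figures are invented for illustration.

```python
def degradation_pct(baseline: float, candidate: float, higher_is_better: bool = True) -> float:
    """Relative degradation of candidate versus baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # For throughput-like metrics a drop is a degradation;
    # for latency-like metrics an increase is a degradation.
    change = (baseline - candidate) if higher_is_better else (candidate - baseline)
    return 100.0 * change / baseline


def meets_criterion(baseline: float, candidate: float,
                    higher_is_better: bool = True, limit_pct: float = 10.0) -> bool:
    """True when degradation stays strictly below the limit (10% by default)."""
    return degradation_pct(baseline, candidate, higher_is_better) < limit_pct


# Hypothetical measurements: requests/s on a 3-node baseline vs. a 5-node cluster.
print(meets_criterion(1000.0, 930.0))                       # 7% lower throughput
# Hypothetical measurements: p99 latency in ms, where lower is better.
print(meets_criterion(40.0, 46.0, higher_is_better=False))  # 15% higher latency
```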
📌 Recovery Procedures
Documentation must be validated and updated for manual control-plane recovery in cases of quorum loss.

