Boosting Resilience in Bare-Metal Active-Active Clusters: 4- and 5-Node Control Plane Architecture (Version 4.17)
Organizations running active-active deployments across two locations, especially those hosting stateful workloads like OpenShift Virtualization VMs that run as a single instance, depend heavily on the underlying infrastructure to guarantee availability.
While traditional virtualization platforms handle this natively, running these workloads on OpenShift bare metal introduces new architectural considerations.
The Challenge: What Happens When the Primary Site Fails? ⚠️
In typical stretched OpenShift clusters, the control plane is often deployed in a 2+1 or 1+1+1 topology.
But if the data center hosting the majority of control-plane nodes goes down:
- The surviving control-plane node becomes the only source of truth for the cluster.
- That single node must switch to read-write mode and act as the exclusive etcd copy.
- If that node fails… recovery becomes catastrophic, especially when running stateful VMs.
This risk becomes even more critical in environments leveraging OpenShift Virtualization for production workloads.
The Solution: 4-Node and 5-Node Control Planes for Stretched Clusters 🚀
To increase resiliency during data-center-level failures, OpenShift can leverage 4-node or 5-node control-plane deployments, such as:
- 2+2
- 3+2
With these designs, even if an entire site is lost, the remaining location still retains two read-only copies of etcd, significantly boosting cluster recoverability and reducing the risk of losing quorum.
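A quick way to see why the member count matters: etcd (Raft) requires floor(n/2)+1 votes for quorum, so a 4-member cluster still tolerates only one member failure, while a 5-member cluster tolerates two. The value of a 2+2 layout after a full site loss is therefore recoverability (two surviving copies of the data) rather than extra quorum tolerance. A minimal sketch of the arithmetic, with no OpenShift dependencies:

```python
# Quorum and failure tolerance for a Raft-based etcd cluster of n members.

def quorum(members: int) -> int:
    """Votes required for an etcd cluster of `members` to commit writes."""
    return members // 2 + 1

def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster still holds quorum."""
    return members - quorum(members)

for n in (3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Note that an even member count (4) gives the same fault tolerance as the next odd count down (3), which is why the gain here is measured in surviving data copies, not in quorum headroom.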
Today, the cluster-etcd-operator already supports up to five etcd members, automatically scaling in environments using MachineSets.
But in bare-metal or agent-based installations, MachineSets are not available, meaning the operator won't scale automatically but will adjust etcd peers when control-plane nodes are added manually.
This is exactly the workflow we aim to validate and officially support.
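For illustration, a stretched 2+2 deployment of this kind might be expressed in the installer's `install-config.yaml` along these lines. This is a hypothetical fragment, not a validated configuration: the cluster name and site split are invented, and a `controlPlane.replicas` value of 4 is precisely the workflow described above as still under validation:

```yaml
apiVersion: v1
metadata:
  name: stretched-cluster        # hypothetical name
controlPlane:
  name: master
  replicas: 4                    # 2+2 across two sites (workflow under validation)
compute:
  - name: worker
    replicas: 0
platform:
  baremetal: {}                  # site-specific networking details omitted
```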
🔧 Note: This capability is specifically targeted at bare-metal clusters, with a strong focus on OpenShift Virtualization use cases.
Goals 🎯
Validate and support 4-node and 5-node control-plane architectures for bare-metal stretched clusters, under the following constraints:
- Bare-metal control-plane nodes
- Installed via the Assisted Installer or the Agent-based Installer
- Shared Layer 3 network across locations
- Latency < 10 ms between all control-plane nodes
- Minimum 10 Gbps bandwidth
- etcd stored on SSD or NVMe
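The network constraints above are easy to pre-check before installation. The sketch below assumes you have already collected round-trip latency and bandwidth measurements between control-plane node pairs with your own tooling (e.g. ping and iperf3); the node names and numbers are hypothetical:

```python
# Validate measured inter-node links against the stretched-cluster
# constraints: RTT < 10 ms and bandwidth >= 10 Gbps.

MAX_LATENCY_MS = 10.0
MIN_BANDWIDTH_GBPS = 10.0

def link_ok(latency_ms: float, bandwidth_gbps: float) -> bool:
    """True if one node-to-node link satisfies both constraints."""
    return latency_ms < MAX_LATENCY_MS and bandwidth_gbps >= MIN_BANDWIDTH_GBPS

# Hypothetical measurements between site-A and site-B control-plane nodes.
links = {
    ("cp-a1", "cp-b1"): (6.2, 20.0),
    ("cp-a1", "cp-b2"): (11.5, 20.0),  # fails the latency constraint
}
failing = sorted(pair for pair, (lat, bw) in links.items() if not link_ok(lat, bw))
print("links violating constraints:", failing)
```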
Acceptance Criteria ⚙️
📊 Performance
Control plane performance and scalability must show less than 10% degradation when compared to standard HA clusters.
📘 Recovery Procedures
Documentation must be validated and updated for manual control-plane recovery in cases of quorum loss.