Introduction
Data integrity is crucial in a cluster environment to ensure that information remains consistent and reliable across multiple nodes. In clusters, nodes work together to process and store data, often sharing resources. If data becomes corrupted or inconsistent due to node failures, network issues, or concurrency conflicts, it can lead to incorrect outcomes, application crashes, or system downtime. A particular risk is the split-brain scenario, in which nodes that have lost contact with one another each try to control the same shared resources and corrupt the data.
The Red Hat Linux Cluster uses a technique called fencing to preserve data integrity.
What is Fencing in Red Hat Linux Cluster?
Fencing is a critical technique used in cluster environments to ensure data integrity and availability by isolating failed or unresponsive nodes. This process involves forcibly removing a problematic node from the cluster to prevent it from affecting the operation of other nodes. By isolating failed nodes, fencing helps to avoid issues like data corruption or split-brain scenarios, where two nodes might attempt to control the same resources independently. Implementing effective fencing strategies ensures that clusters maintain consistent and reliable performance, even when encountering hardware or software failures.
The key points of how fencing works are as follows:
- Fencing is the disconnection of a node from the cluster's shared storage.
Fencing is a protective measure used in clusters to isolate a malfunctioning or failed node from shared resources. This prevents potential data corruption or other issues that could arise from the failed node's continued participation in the cluster. By disconnecting the node, the cluster ensures that only healthy nodes can access and modify shared storage.
- Fencing cuts off I/O from shared storage, thus ensuring data integrity.
When a node is fenced, its input/output (I/O) operations are halted to prevent it from accessing shared storage. This is crucial because a failed node may have incomplete or erroneous operations that could compromise data integrity. By stopping these operations, the cluster maintains the consistency and accuracy of the data stored on shared resources.
- The cluster infrastructure performs fencing through the fence daemon, fenced.
In Red Hat clusters, the fencing process is managed by a component called the fence daemon, named "fenced." This daemon is responsible for carrying out the fencing action, such as cutting off a node's access to shared storage or powering the node down. It does so through the fencing devices configured for that node, so that the problematic node is isolated swiftly and reliably before it can cause further issues.
- When CMAN determines that a node has failed, it communicates to other cluster-infrastructure components that the node has failed.
The Cluster Manager (CMAN) monitors the health and status of nodes within the cluster. Upon detecting a failure, CMAN communicates this information to other cluster components, signalling that a particular node is no longer functioning correctly. This communication is essential for initiating fencing and ensuring that the cluster can take appropriate action to maintain stability.
- Other cluster-infrastructure components determine what actions to take — that is, they perform any recovery that needs to be done.
Once a node failure is detected and communicated, fenced fences the failed node, and the other cluster components decide what recovery they need to perform. For example, the distributed lock manager (DLM) and GFS suspend activity until fencing of the failed node has completed. DLM then releases the locks held by the failed node, and GFS recovers the failed node's journal. Workloads can then be redistributed to healthy nodes. The goal is to restore normal cluster operations as quickly and efficiently as possible, with minimal disruption to services.
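The ordering described above (detect the failure, fence the node, then recover) can be illustrated with a small Python model. This is only a sketch: the class ToyCluster and its methods are hypothetical names invented for illustration and are not part of CMAN, fenced, or any Red Hat API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model of the failure-handling order; not the real CMAN or fenced API."""
    members: set
    storage_access: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def __post_init__(self):
        # Copy the input and give every member access to shared storage at start.
        self.members = set(self.members)
        self.storage_access = set(self.members)

    def node_failed(self, node):
        # 1. The membership layer (CMAN's role) marks the node as failed
        #    and tells the other components.
        self.members.discard(node)
        self.log.append(f"membership: {node} failed")
        # 2. The fence daemon's role: cut the node off from shared storage.
        self.fence(node)
        # 3. Recovery runs only once the node is confirmed fenced.
        if node in self.storage_access:
            raise RuntimeError(f"{node} is not fenced; recovery must wait")
        self.recover(node)

    def fence(self, node):
        self.storage_access.discard(node)
        self.log.append(f"fence: {node} cut off from shared storage")

    def recover(self, node):
        self.log.append(f"recovery: cleaning up after {node}")


cluster = ToyCluster(members={"node-a", "node-b", "node-c"})
cluster.node_failed("node-c")
for entry in cluster.log:
    print(entry)
```

The log shows the membership, fence, and recovery entries in that order. The point of the sketch is that recovery is gated on the fence step, so a node that might still write to shared storage is never recovered from.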
Types of Fencing in Red Hat Linux Cluster
Fencing mechanisms can include hardware-based solutions like power switches or software-based methods such as virtual fencing. Here are the types of fencing used in the Red Hat Linux Cluster:
Power Fencing:
Power fencing isolates a node in a cluster by cutting off its power supply. A node without power cannot access shared resources, so it cannot corrupt data. Power fencing is implemented with devices whose outlets can be switched remotely, such as network-managed Power Distribution Units (PDUs) or Uninterruptible Power Supplies (UPS). Because these devices control the power state of each node over the network, the cluster can power off a problematic node swiftly and without manual intervention, preserving the integrity and consistency of the cluster environment.
Figure 1 Power Fencing in Red Hat Linux Cluster
Storage Fencing:
Storage fencing isolates a failed node by cutting its path to shared storage rather than its power. Two mechanisms are common. Fibre Channel switch fencing disables the switch port that connects the failed node to the shared storage: the cluster software communicates with the Fibre Channel switch and blocks that specific port. SCSI-3 persistent reservation fencing (sometimes called Persistent Group Reservation, or PGR) works at the storage device instead: the failed node's registration is removed from the shared devices, so its writes are rejected. Both methods are employed in environments with shared disk subsystems, and both prevent a failed node from making unauthorized or erroneous writes, which maintains data integrity.
Figure 2 Storage Fencing in Red Hat Linux Cluster
Other Fencing
A node with redundant power supplies or redundant storage connections needs a fencing configuration that covers every one of them. Two common cases are:
Fencing a Node with Dual Power Supplies:
When a node has dual power supplies, fencing must account for both to ensure complete isolation. This is typically achieved by configuring a fencing device for each power supply. For example, each power supply might be connected to a separate PDU or UPS, and the fencing method cuts power on both. If only one supply is cut, the other keeps the node running and the node is not isolated. When the node is to be rebooted rather than left off, both supplies should be switched off before either is switched back on.
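The dual power supply case can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: PowerSwitch and fence_dual_power are hypothetical names, and the in-memory outlet table stands in for a real remotely controlled power device. The sketch switches every feed off before switching any back on, so the node is never left running on one supply.

```python
class PowerSwitch:
    """Hypothetical stand-in for a remotely controlled power switch (PDU)."""

    def __init__(self, name):
        self.name = name
        self.outlets = {}  # outlet number -> True if powered

    def set_outlet(self, outlet, powered):
        self.outlets[outlet] = powered


def fence_dual_power(feeds, power_back_on=True):
    """Fence a node fed by several (switch, outlet) pairs.

    All feeds are switched off first; only then may any be switched on.
    Returns the ordered list of actions taken.
    """
    actions = []
    for switch, outlet in feeds:
        switch.set_outlet(outlet, False)
        actions.append((switch.name, outlet, "off"))
    # The node counts as fenced only if no feed still supplies power.
    if any(switch.outlets[outlet] for switch, outlet in feeds):
        raise RuntimeError("a feed is still powered; node is not fenced")
    if power_back_on:
        for switch, outlet in feeds:
            switch.set_outlet(outlet, True)
            actions.append((switch.name, outlet, "on"))
    return actions


pdu_a, pdu_b = PowerSwitch("pdu-a"), PowerSwitch("pdu-b")
actions = fence_dual_power([(pdu_a, 1), (pdu_b, 1)])
print(actions)
```

The returned action list records both "off" actions before any "on" action, which is the property that matters for a node with two supplies.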
Fencing a Node with Dual Fibre Channel Connections:
For nodes with dual Fibre Channel connections, fencing must disable access on both connections to ensure the node is fully isolated from the shared storage. This is accomplished by configuring the fencing mechanism to block both of the node's ports on the Fibre Channel switch, or on each switch if the connections go to different switches. Fencing is complete only when every path is blocked, because a single open port still lets the node reach shared data. SCSI-3 persistent reservations are an alternative that enforces the block at the storage device rather than at the switch. This type of fencing is crucial in high-availability systems where data consistency and integrity are paramount.
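A short Python sketch shows why every path has to be blocked. The port names and the functions fence_storage_paths and node_can_reach_storage are hypothetical and exist only for illustration; a dictionary stands in for the enabled or disabled state of ports on real switches.

```python
def fence_storage_paths(port_enabled, node_ports):
    """Disable every switch port the node uses.

    port_enabled maps a port name to True (enabled) or False (disabled).
    Returns True only when none of the node's ports remains enabled.
    """
    for port in node_ports:
        port_enabled[port] = False
    return not any(port_enabled[port] for port in node_ports)


def node_can_reach_storage(port_enabled, node_ports):
    # One enabled port is enough for the node to keep writing.
    return any(port_enabled.get(port, False) for port in node_ports)


ports = {"switch1/port4": True, "switch2/port4": True, "switch1/port5": True}
node_c_ports = ["switch1/port4", "switch2/port4"]

# Blocking a single path leaves the node connected.
ports["switch1/port4"] = False
print(node_can_reach_storage(ports, node_c_ports))

# Blocking every path isolates it; ports of other nodes are untouched.
print(fence_storage_paths(ports, node_c_ports))
print(node_can_reach_storage(ports, node_c_ports))
```

After only one port is blocked the node can still reach storage. It is isolated once both of its ports are disabled, while the port belonging to another node ("switch1/port5" in this sketch) stays enabled.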
Conclusion
Fencing preserves data integrity by isolating failed nodes before recovery begins. This helps cluster environments achieve high availability, reliability, and smooth failover, which in turn supports uninterrupted and dependable services in an enterprise environment.