The paper addresses the problem of maintaining consistency in distributed control systems that employ redundant controllers. Redundancy is commonly used to mitigate the risk of unplanned downtime due to hardware failures: an active primary controller manages the process while a passive backup stands ready to take over if the primary fails.
The key points are:
Redundancy communication can be carried out over a dedicated, point-to-point connection or a redundant network backbone. Failure of the redundancy link can partition the controller pair, disrupting synchronization and causing their internal states to diverge, potentially resulting in inconsistent outputs.
The Network Reference Point Failure Detection (NRP FD) algorithm is proposed to prioritize consistency over availability in redundant controller systems. It uses an external Network Reference Point (NRP) as a tiebreaker for primary role determination, aiding the backup controller in differentiating between primary and network failures.
The paper models and formally verifies the NRP FD algorithm using Timed Rebeca, an actor-based modeling language. The verification identifies potential issues where the algorithm may result in a dual primary situation, compromising consistency.
To address the identified issues, the paper proposes an enhanced version called Leasing NRP FD, where the primary role is "leased" from the NRP. This ensures a singular primary controller in all failure scenarios, preserving consistency.
The paper discusses the rationale for choosing Timed Rebeca as the modeling language, highlighting its faithfulness to the problem domain and usability for the modeler.
The paper also explores various failure scenarios, including transient errors, and provides a comprehensive analysis of the proposed algorithms.