TY - GEN
T1 - Scalable, adaptive, time-bounded node failure detection
AU - Gillen, Matthew
AU - Rohloff, Kurt
AU - Manghwani, Prakash
AU - Schantz, Richard
PY - 2007
Y1 - 2007
N2 - This paper presents a scalable, adaptive and time-bounded general approach to assure reliable, real-time Node-Failure Detection (NFD) for large-scale, high-load networks comprised of Commercial Off-The-Shelf (COTS) hardware and software. Nodes in the network are independent processors which may unpredictably fail either temporarily or permanently. We present a generalizable, multi-layer, dynamically adaptive monitoring approach to NFD in which a small, designated subset of the nodes is informed of node failures. This subset of nodes is notified of node failures in the network within an interval of time after the failures. Except under conditions of massive system failure, the NFD system has, by design, a zero false-negative rate (failures are always detected within a finite amount of time after failure). The NFD system continually adjusts to decrease the false-alarm rate as false alarms are detected. The NFD design utilizes nodes that transmit, within a given locality, "heartbeat" messages to indicate that they are still alive. We intend for the NFD system to be deployed on nodes using commodity (i.e., not hard-real-time) operating systems that do not provide strict guarantees on the scheduling of the NFD processes. We show through experimental deployments of the design that variations in the scheduling of heartbeat messages can cause large variations in the false-positive notification behavior of the NFD subsystem. We present a per-node adaptive enhancement of the NFD subsystem that dynamically adapts to provide run-time assurance of low false-alarm rates with respect to past observations of heartbeat scheduling variations while providing finite node-failure detection delays. We show through experimentation that this NFD subsystem is highly scalable and incurs low resource overhead.
AB - This paper presents a scalable, adaptive and time-bounded general approach to assure reliable, real-time Node-Failure Detection (NFD) for large-scale, high-load networks comprised of Commercial Off-The-Shelf (COTS) hardware and software. Nodes in the network are independent processors which may unpredictably fail either temporarily or permanently. We present a generalizable, multi-layer, dynamically adaptive monitoring approach to NFD in which a small, designated subset of the nodes is informed of node failures. This subset of nodes is notified of node failures in the network within an interval of time after the failures. Except under conditions of massive system failure, the NFD system has, by design, a zero false-negative rate (failures are always detected within a finite amount of time after failure). The NFD system continually adjusts to decrease the false-alarm rate as false alarms are detected. The NFD design utilizes nodes that transmit, within a given locality, "heartbeat" messages to indicate that they are still alive. We intend for the NFD system to be deployed on nodes using commodity (i.e., not hard-real-time) operating systems that do not provide strict guarantees on the scheduling of the NFD processes. We show through experimental deployments of the design that variations in the scheduling of heartbeat messages can cause large variations in the false-positive notification behavior of the NFD subsystem. We present a per-node adaptive enhancement of the NFD subsystem that dynamically adapts to provide run-time assurance of low false-alarm rates with respect to past observations of heartbeat scheduling variations while providing finite node-failure detection delays. We show through experimentation that this NFD subsystem is highly scalable and incurs low resource overhead.
UR - http://www.scopus.com/inward/record.url?scp=48349148135&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=48349148135&partnerID=8YFLogxK
U2 - 10.1109/HASE.2007.66
DO - 10.1109/HASE.2007.66
M3 - Conference contribution
AN - SCOPUS:48349148135
SN - 0769530435
SN - 9780769530437
T3 - Proceedings of IEEE International Symposium on High Assurance Systems Engineering
SP - 179
EP - 186
BT - Proceedings - 10th IEEE International Symposium on High Assurance Systems Engineering, HASE 2007
T2 - 10th IEEE International Symposium on High Assurance Systems Engineering, HASE 2007
Y2 - 14 November 2007 through 16 November 2007
ER -