Last modified on 05/29/07 15:18:17
Downtime
Downtime is a system for detecting failures within a cluster, tracking their resolution, and correlating them with service outages. It integrates with the Fault Tolerant Backplane to track managed or averted failures, and with Cobalt to calculate the effect of failed systems.
Open concerns are listed in ImplementationPath; some work has also been done on a more formal SystemDesign.
collaborations
questions
- How many times has a given system failed?
- How much compute time was lost?
- How many reservations were interrupted?
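The three questions above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Failure` and `Reservation` records, their field names, and the helper functions are hypothetical and do not reflect the actual Downtime, Fault Tolerant Backplane, or Cobalt interfaces. "Compute time lost" is taken here to mean the time a failed system spent down while it was part of a reservation; other definitions are possible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Failure:
    system: str       # hypothetical identifier, e.g. a node name
    start: datetime   # when the failure was detected
    end: datetime     # when the failure was resolved


@dataclass(frozen=True)
class Reservation:
    name: str
    systems: frozenset  # systems the reservation was using
    start: datetime
    end: datetime


def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals (zero if they do not overlap)."""
    return max(min(a_end, b_end) - max(a_start, b_start), timedelta(0))


def failure_count(failures, system):
    """How many times has a given system failed?"""
    return sum(1 for f in failures if f.system == system)


def compute_time_lost(failures, reservations):
    """How much reserved time was lost to failures (assumed definition)?"""
    lost = timedelta(0)
    for f in failures:
        for r in reservations:
            if f.system in r.systems:
                lost += overlap(f.start, f.end, r.start, r.end)
    return lost


def interrupted_reservations(failures, reservations):
    """Which reservations had at least one of their systems fail while active?"""
    return {
        r.name
        for r in reservations
        for f in failures
        if f.system in r.systems
        and overlap(f.start, f.end, r.start, r.end) > timedelta(0)
    }
```

The number of interrupted reservations is then `len(interrupted_reservations(failures, reservations))`. A real correlation engine would read failures from the issue tracker and reservations from the scheduler rather than from in-memory lists.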
scope
- node availability
  - compute nodes
  - management nodes
- shared filesystem
  - PVFS
  - NFS
- network
  - internal network (inter-node)
  - external network
components
- issue tracker
- reporting scripts
- correlation engine
- report generation