Downtime is a system for detecting failures within a cluster, tracking their resolution, and correlating them with service outages. It integrates with the Fault Tolerant Backplane to track managed/averted failures, and with Cobalt to calculate the effect of failed systems.
It is meant to answer questions such as:
- How many times has a given system failed?
- How much compute time was lost?
- How many reservations were interrupted?
- node availability
- compute nodes
- management nodes
- shared filesystem
- internal network (inter-node)
- external network
- issue tracker
- reporting scripts
- correlation engine
- report generation
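
The three questions near the top (failure counts, lost compute time, interrupted reservations) suggest a correlation step that matches failure intervals against reservation intervals. The following is a minimal sketch of that idea, not Downtime's actual implementation. The `Failure` and `Reservation` record types, their fields, and the `summarize` function are hypothetical names invented for illustration. In practice the reservation data would presumably come from Cobalt and the failure data from the issue tracker or the Fault Tolerant Backplane, neither of which is modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


# Hypothetical record types; the real Downtime data model is not shown
# in this document.
@dataclass(frozen=True)
class Failure:
    system: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class Reservation:
    name: str
    system: str
    start: datetime
    end: datetime


def overlap(a_start, a_end, b_start, b_end):
    """Return the length of time two intervals share (zero if disjoint)."""
    latest_start = max(a_start, b_start)
    earliest_end = min(a_end, b_end)
    return max(earliest_end - latest_start, timedelta(0))


def summarize(system, failures, reservations):
    """Answer the three questions for one system.

    Assumes failures on the same system do not overlap each other;
    otherwise lost time would be counted more than once.
    """
    relevant = [f for f in failures if f.system == system]
    lost = timedelta(0)
    interrupted = set()
    for failure in relevant:
        for res in reservations:
            if res.system != system:
                continue
            shared = overlap(failure.start, failure.end, res.start, res.end)
            if shared > timedelta(0):
                lost += shared
                interrupted.add(res.name)
    return {
        "failures": len(relevant),
        "lost_time": lost,
        "interrupted_reservations": len(interrupted),
    }
```

Here "lost compute time" is taken to mean reserved time that coincided with a failure. Counting all downtime regardless of reservations would be an equally plausible definition, and only a small change to the loop.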