Last modified on 05/22/07 10:17:35

Downtime is a combination of four main components:

  • detection
  • tracking
  • correlation
  • reporting

This implementation is built around an issue tracking system with API hooks. When a fault is detected, whether by a person or by an automated check, a script is called to open a new ticket in the tracker.
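The ticket-insertion step might look roughly like the sketch below. The page does not name the tracker or its API, so `InMemoryTracker`, `create_ticket`, `report_fault`, and the ticket fields are all hypothetical stand-ins; a real deployment would replace the in-memory class with calls to the tracker's own API hooks.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count
from typing import Dict, Optional


@dataclass
class Ticket:
    """One fault record. The field names are assumptions, not a real schema."""
    id: int
    element: str                     # the system element that failed
    summary: str
    reporter: str                    # a person's name or a monitoring script
    opened: datetime
    closed: Optional[datetime] = None


class InMemoryTracker:
    """Hypothetical stand-in for the issue tracker's API hooks."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.tickets: Dict[int, Ticket] = {}

    def create_ticket(self, element: str, summary: str, reporter: str) -> int:
        ticket = Ticket(next(self._ids), element, summary, reporter,
                        datetime.now(timezone.utc))
        self.tickets[ticket.id] = ticket
        return ticket.id


def report_fault(tracker: InMemoryTracker, element: str, summary: str,
                 reporter: str = "monitor") -> int:
    """Entry point shared by people and automated checks; returns the ticket id."""
    return tracker.create_ticket(element, summary, reporter)
```

Routing human and automated reports through the same function keeps every fault in one place, which is what the later correlation and reporting steps rely on.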

A correlation engine, which knows the relationships between the elements of the system, "blames" one fault on another. If the system can detect that a fault has been corrected, it can record the resolution as well, and then re-test the other problems that had been "blamed" on that fault to see whether they have also been fixed.
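As an illustration of the blaming idea, here is a minimal sketch that assumes the relationships are stored as a dependency map (element to the elements it relies on) and that a fault is blamed on the nearest upstream element that is also faulty. The `CorrelationEngine` class, its method names, and the nearest-upstream rule are assumptions made for this example, not a description of the actual engine.

```python
from collections import deque
from typing import Dict, List, Optional, Set


class CorrelationEngine:
    """Hypothetical engine: blames a fault on the nearest faulty dependency."""

    def __init__(self, depends_on: Dict[str, List[str]]) -> None:
        self.depends_on = depends_on        # element -> elements it relies on
        self.open_faults: Set[str] = set()

    def record_fault(self, element: str) -> None:
        self.open_faults.add(element)

    def blame(self, element: str) -> Optional[str]:
        """Breadth-first walk up the dependencies; return the first faulty one."""
        queue = deque(self.depends_on.get(element, []))
        seen: Set[str] = set()
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            if node in self.open_faults:
                return node
            queue.extend(self.depends_on.get(node, []))
        return None                         # nothing upstream is faulty

    def resolve(self, element: str) -> List[str]:
        """Close a fault and return the open faults that were blamed on it,
        so the caller can re-test whether they have been fixed as well."""
        suspects = sorted(f for f in self.open_faults
                          if f != element and self.blame(f) == element)
        self.open_faults.discard(element)
        return suspects


engine = CorrelationEngine({"web": ["app"], "app": ["db"], "db": []})
for name in ("db", "app", "web"):
    engine.record_fault(name)
to_retest = engine.resolve("db")    # faults that had been blamed on "db"
```

The list returned by `resolve` is the set of faults worth re-testing; faults blamed on something else are left alone.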

Reports can be generated on request from the data held by the correlation engine and the tracking system, answering questions such as:

  • How many times did blame shift?
  • How many faults was a given system blamed for?
  • How many times was a given system brought down?
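The three questions above could be answered from simple event logs, as in the sketch below. It assumes the correlation engine keeps a chronological log of `(fault_id, blamed_element)` pairs and the tracker keeps a list of outage events by element name; both log formats and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

BlameLog = List[Tuple[int, str]]    # (fault_id, element blamed), oldest first


def blame_shifts(blame_log: BlameLog) -> int:
    """Count how often a fault's blame moved from one element to another."""
    last: Dict[int, str] = {}
    shifts = 0
    for fault_id, blamed in blame_log:
        if fault_id in last and last[fault_id] != blamed:
            shifts += 1
        last[fault_id] = blamed
    return shifts


def faults_blamed_on(blame_log: BlameLog) -> Counter:
    """Count distinct faults that were ever blamed on each element."""
    pairs = set(blame_log)          # drop repeated (fault, element) entries
    return Counter(blamed for _, blamed in pairs)


def times_down(outage_log: Iterable[str]) -> Counter:
    """Count outage events per element."""
    return Counter(outage_log)


# Made-up sample data, for illustration only.
log = [(1, "db"), (2, "db"), (1, "switch"), (3, "switch"), (1, "switch")]
shift_count = blame_shifts(log)
blamed = faults_blamed_on(log)
down = times_down(["db", "db", "switch"])
```

Whether a fault that was blamed on an element and later re-blamed elsewhere should still count against the first element is a policy choice; this sketch counts it.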