July 05, 2008

Fault Tree Analysis in 6 Steps

Often misunderstood, FTA requires nothing more complex than paper, pencil, and an understanding of the service at hand, writes ITSM Watch columnist Hank Marquis of itSM Solutions.

If you are IT Infrastructure Library (ITIL) certified you have no doubt heard of fault tree analysis, or FTA, as a means to improve availability. FTA requires no special software and can help you discover the root causes of a failure or detect where potential failures may occur.
It sounds complicated, but it’s actually fairly easy to use. Reactively, FTA starts with a top-level event, like a service outage, which you evaluate to determine the root cause. Proactively, FTA begins with an event you do not want to occur, like a server failure, and then helps you understand how to prevent the event from occurring.

In either case, you use the fault tree diagram to identify countermeasures that eliminate the causes of the failure. FTA does this through structured analysis of the contributing faults, and their causes, that lead (or have led) to the top-level event.

FTA requires nothing more complex than paper, pencil, and an understanding of the service at hand. To get the most value from FTA, you will need accurate contextual information about the configuration items (CIs) involved. The following six simple steps can help you resolve tough design issues or problems quickly and easily.

1. Select a top-level event for analysis. Try to be specific, for example, “Email server down for more than 4 hours.” Sources of top-level events include Problem/Known Error Records, service outage analysis, potential failures from brainstorming, and “what-if” scenarios based on service level agreements.

2. Identify faults that could lead to the top-level event. Continuing the above example, one possible fault leading to an outage lasting more than four hours might be “loss of power”; another might be “hardware failure.” List all the faults in boxes under the top-level event, and draw lines connecting each fault box to the top-level event box.

3. For each fault, list as many causes as possible in boxes below the related fault. Continuing the example above, in the case of “loss of power,” some causes might be “electrical outage,” “power supply failure,” and so on. Connect the boxes to the appropriate fault box.

4. Draw a diagram of the “fault tree.” Two logic operators, “and” and “or,” also known as logic gates, show how faults and causes combine to produce the event above them.

For example, “Email server down for more than 4 hours” could be caused by “loss of power” or “hardware failure.” Another example, this time joined by “and,” might be “loss of building power” and “battery backup exhausted,” where both must happen before the fault above them occurs.

Update the faults and causes by grouping logically related items with an “and” or an “or” gate, both between the top-level event and its faults and between each fault and its causes. Redraw the lines so that they run from the top-level event to logic gates to faults to logic gates to causes.

5. Continue identifying causes for each fault until you reach a root cause (reactive FTA), or one that you can do something about (proactive FTA). For example, the root cause of “power supply failure” might be “filter clogged”; the root cause of “battery backup exhausted” might be “battery backup too small.”

6. Consider countermeasures. A root cause is one you can do something about, so now you need to think of the countermeasures you might apply to each root cause. List the countermeasures for each root cause in a box under that root cause. For example, for “filter clogged” a countermeasure might be “clean filter monthly.” Link the countermeasure to the root cause by drawing a line.
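For readers who would rather check a tree in code than on paper, the six steps can be sketched in a few lines of Python. This is only an illustrative sketch, not part of FTA itself: the `Node` class, its `occurs` and `leaves` methods, and the intermediate box “building outage outlasts backup” are hypothetical names introduced here, and the arrangement of boxes is just one way to combine the examples used in the steps.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """One box in the fault tree: the top-level event, a fault, or a cause."""
    name: str
    gate: str = "or"                       # how the child boxes combine: "and" or "or"
    children: List["Node"] = field(default_factory=list)
    countermeasure: Optional[str] = None   # step 6; None means not decided yet

    def occurs(self, active: Set[str]) -> bool:
        """Return True if this box's event occurs given the active bottom-level causes."""
        if not self.children:
            return self.name in active
        results = [child.occurs(active) for child in self.children]
        return all(results) if self.gate == "and" else any(results)

    def leaves(self) -> List["Node"]:
        """Return the bottom boxes: root causes, or faults not yet analyzed further."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


top = Node("Email server down for more than 4 hours", "or", [
    Node("loss of power", "or", [
        Node("power supply failure", "or", [
            Node("filter clogged", countermeasure="clean filter monthly"),
        ]),
        # Hypothetical intermediate box, added so the "and" gate has a name.
        Node("building outage outlasts backup", "and", [
            Node("loss of building power"),
            Node("battery backup exhausted", "or", [
                Node("battery backup too small"),
            ]),
        ]),
    ]),
    Node("hardware failure"),  # left undeveloped in this sketch
])

print(top.occurs({"loss of building power"}))
print(top.occurs({"loss of building power", "battery backup too small"}))
for leaf in top.leaves():
    print(f"{leaf.name}: {leaf.countermeasure or 'no countermeasure listed yet'}")
```

The first call prints `False`, because the “and” gate needs both of its inputs; the second prints `True`. The final loop lists each bottom box with its countermeasure, which makes it easy to spot root causes that still need one.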

FTA is useful alone on paper and as a group exercise on a media board. When your diagram is complete, you have a fault tree. Fault trees show how an event can occur and what you can do about it from a design or change perspective. For Problems, you also have a possible root cause and a solution.

FTA is actually pretty simple, but don’t let its simplicity fool you. If you want to get fancy, you can use probability statistics to try to get even more precise by determining the “chance” that a fault or cause could occur. Applying probabilities derived from CI metrics like mean time to repair and mean time between failures can produce some very precise calculations.
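As a rough sketch of what such a calculation might look like, the Python below combines probabilities through “and” and “or” gates. It rests on assumptions the article does not make: that the inputs to each gate are statistically independent, and that the chance of finding a CI down can be approximated by the common steady-state estimate MTTR / (MTBF + MTTR). The function names and every figure are invented for illustration.

```python
import math


def unavailability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a CI is down: MTTR / (MTBF + MTTR)."""
    return mttr_hours / (mtbf_hours + mttr_hours)


def and_gate(probabilities):
    """All inputs must occur; assumes the inputs are independent."""
    return math.prod(probabilities)


def or_gate(probabilities):
    """At least one input occurs; assumes the inputs are independent."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Illustrative figures only, not measurements from any real CI.
p_building_power = unavailability(mtbf_hours=990.0, mttr_hours=10.0)
p_battery_backup = unavailability(mtbf_hours=95.0, mttr_hours=5.0)
p_hardware = unavailability(mtbf_hours=4990.0, mttr_hours=10.0)

# "loss of power" needs both inputs; the top-level event needs either fault.
p_loss_of_power = and_gate([p_building_power, p_battery_backup])
p_top_event = or_gate([p_loss_of_power, p_hardware])

print(f"loss of power: {p_loss_of_power:.6f}")
print(f"top-level event: {p_top_event:.6f}")
```

If the inputs to a gate share a common cause, the independence assumption no longer holds and these simple products will understate or overstate the risk, so treat the result as a starting estimate.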

Even if you don’t get fancy with FTA you will have taken a powerful step toward preventing problems in the first place, or resolving tough problems. Often just the act of creating a fault tree as a group generates excellent ideas and possible solutions where before there were none.

FTA can be used by technical observation post (TOP) teams, problem managers, availability managers, and even IT service continuity management teams with a minimum of training. The graphical nature of FTA makes it easy to understand and easy to maintain in the face of changes.
