Sunday, January 10, 2016

Probability (P)

It is necessary to look at the cause of a failure mode and the likelihood of its occurrence. This can be done by analysis, calculation / FEM, or by looking at similar items or processes and the failure modes that have been documented for them in the past. A failure cause is regarded as a design weakness. All potential causes for a failure mode should be identified and documented in technical terms. Examples of causes are: human errors in handling, manufacturing-induced faults, fatigue, creep, abrasive wear, erroneous algorithms, excessive voltage, or improper operating conditions or use (depending on the ground rules used). Each failure mode is given a Probability ranking.
Rating   Meaning
A        Extremely Unlikely (virtually impossible, or no known occurrences on similar products or processes with many running hours)
B        Remote (relatively few failures)
C        Occasional (occasional failures)
D        Reasonably Possible (repeated failures)
E        Frequent (failure is almost inevitable)
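
As an illustration of how such a ranking might be assigned in practice, the sketch below maps an estimated occurrence rate to a rating letter. The numeric thresholds are hypothetical; an actual programme would define its own cut-offs in the FMEA ground rules.

    def probability_rating(failures_per_million_hours: float) -> str:
        """Map an estimated occurrence rate to a Probability rating letter.

        The cut-off values are illustrative only; real programmes define
        their own thresholds in the FMEA ground rules.
        """
        if failures_per_million_hours < 0.01:
            return "A"  # Extremely Unlikely
        if failures_per_million_hours < 1:
            return "B"  # Remote
        if failures_per_million_hours < 10:
            return "C"  # Occasional
        if failures_per_million_hours < 100:
            return "D"  # Reasonably Possible
        return "E"      # Frequent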

Severity (S)

Determine the Severity of the worst-case adverse end effect (state). It is convenient to write these effects down in terms of what the user might see or experience as functional failures. Examples of such end effects are: full loss of function x, degraded performance, functioning in reverse, delayed functioning, erratic functioning, etc. Each end effect is given a Severity number (S) from, say, I (no effect) to VI (catastrophic), based on cost and/or loss of life or quality of life. These numbers prioritize the failure modes (together with probability and detectability). A typical classification is given below; other classifications are possible. See also hazard analysis.
Rating   Meaning
I        No relevant effect on reliability or safety
II       Very minor; no damage, no injuries; only results in a maintenance action (noticed only by discriminating customers)
III      Minor; low damage, light injuries (affects very little of the system; noticed by the average customer)
IV       Moderate; moderate damage, injuries possible (most customers are annoyed; mostly financial damage)
V        Critical (causes a loss of primary function; loss of all safety margins; one failure away from a catastrophe; severe damage, severe injuries, at most one possible death)
VI       Catastrophic (product becomes inoperative; the failure may result in completely unsafe operation and possibly multiple deaths)

Detection (D)

Detection covers the means or method by which a failure is detected and isolated by the operator and/or maintainer, and the time this may take. This is important for maintainability control (availability of the system) and is especially important for multiple-failure scenarios. These may involve dormant failure modes (e.g. no direct system effect while a redundant system or item automatically takes over, or a failure that is only problematic during specific mission or system states) or latent failures (e.g. deterioration failure mechanisms, such as a crack growing in metal but not yet at critical length). It should be made clear how the failure mode or cause can be discovered by an operator under normal system operation, or whether it can be discovered by the maintenance crew through some diagnostic action or an automatic built-in system test. A dormancy and/or latency period may be entered.
Rating   Meaning
1        Certain - fault will be caught on test
2        Almost certain
3        High
4        Moderate
5        Low
6        Fault is undetected by operators or maintainers
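
To show how the Probability, Severity and Detection rankings might sit together on a single worksheet line, here is a minimal sketch of a failure-mode record in Python. The field names and the example entry are invented for illustration and do not come from any real analysis.

    from dataclasses import dataclass

    @dataclass
    class FailureModeEntry:
        """One row of a simplified, hypothetical FMEA worksheet."""
        item: str
        failure_mode: str
        cause: str
        end_effect: str
        probability: str    # A..E, see the Probability table above
        severity: str       # I..VI, see the Severity table above
        detection: int      # 1..6, see the Detection table above
        dormancy: str = ""  # optional dormancy / latency period, if known

    # Illustrative entry only.
    entry = FailureModeEntry(
        item="Hydraulic pump",
        failure_mode="No output pressure",
        cause="Fatigue failure of the drive shaft",
        end_effect="Degraded braking performance",
        probability="B",
        severity="V",
        detection=3,
        dormancy="8 hours, detected by turn-around inspection",
    )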

Dormancy or Latency Period

The average time that a failure mode may remain undetected may be entered, if known. For example:
  • Seconds, auto detected by maintenance computer
  • 8 hours, detected by turn-around inspection
  • 2 months, detected by scheduled maintenance block X
  • 2 years, detected by overhaul task x

Indication

If the undetected failure allows the system to remain in a safe / working state, a second failure situation should be explored to determine whether or not an indication will be evident to all operators and what corrective action they may or should take.
Indications to the operator should be described as follows:
  • Normal. An indication that is evident to an operator when the system or equipment is operating normally.
  • Abnormal. An indication that is evident to an operator when the system has malfunctioned or failed.
  • Incorrect. An erroneous indication to an operator due to the malfunction or failure of an indicator (e.g., instruments, sensing devices, visual or audible warning devices).

This type of analysis is useful for determining how effective various test processes are at detecting latent and dormant faults. The method involves examining the applicable failure modes to determine whether or not their effects are detected, and determining the percentage of the failure rate attributable to the failure modes that are detected. The possibility that the detection means may itself fail latently should be accounted for in the coverage analysis as a limiting factor (i.e., coverage cannot be more reliable than the availability of the detection means). Including detection coverage in the FMEA can mean that each individual failure that would have fallen into one effect category is now split into separate effect categories, depending on the detection coverage possibilities. Another way to include detection coverage is for the FTA to conservatively assume that no holes in coverage, due to latent failure of the detection method, affect detection of all failures assigned to the failure effect category of concern. The FMEA can be revised if necessary for those cases where this conservative assumption does not allow the top-event probability requirements to be met.
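
As a small numerical sketch of the coverage idea described above (all failure rates and the detection-means availability are made-up values): the fraction of the total failure rate belonging to detected failure modes is computed first, and the result is then limited by the availability of the detection means, since coverage cannot be more reliable than the detection means itself.

    # Hypothetical failure modes: (failure rate per hour, detected by built-in test?)
    failure_modes = [
        (2.0e-6, True),
        (5.0e-7, True),
        (1.0e-6, False),  # dormant: not detected in normal operation
    ]

    total_rate = sum(rate for rate, _ in failure_modes)
    detected_rate = sum(rate for rate, detected in failure_modes if detected)
    raw_coverage = detected_rate / total_rate  # fraction of failure rate detected

    # The detection means may itself fail latent; its availability limits
    # how much credit can be taken for coverage.
    detection_availability = 0.98  # assumed value
    effective_coverage = raw_coverage * detection_availability

    print(f"raw coverage: {raw_coverage:.1%}, effective coverage: {effective_coverage:.1%}")
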
After these three basic steps, the Risk level may be determined.

Risk level (P*S) and (D)

Risk is the combination of end-effect probability and severity, where probability and severity include the effect of non-detectability (dormancy time). Non-detectability may influence the end-effect probability of failure or the worst-case effect severity. The exact calculation may not be easy in all cases, such as those where multiple scenarios (with multiple events) are possible and detectability / dormancy plays a crucial role (as for redundant systems). In such cases, Fault Tree Analysis and/or Event Trees may be needed to determine exact probability and risk levels.
Preliminary Risk levels can be selected from a Risk Matrix like the one shown below, based on MIL-STD-882.[24] The higher the Risk level, the more justification and mitigation are needed to provide evidence and lower the risk to an acceptable level. High risk should be indicated to higher-level management, who are responsible for final decision-making.
Probability \ Severity   I             II            III           IV            V             VI
A                        Low           Low           Low           Low           Moderate      High
B                        Low           Low           Low           Moderate      High          Unacceptable
C                        Low           Low           Moderate      Moderate      High          Unacceptable
D                        Low           Moderate      Moderate      High          Unacceptable  Unacceptable
E                        Moderate      Moderate      High          Unacceptable  Unacceptable  Unacceptable
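
A minimal sketch of how the matrix above could be encoded for lookup; the dictionary simply transcribes the table, and the function name is invented for illustration.

    RISK_MATRIX = {
        "A": ["Low", "Low", "Low", "Low", "Moderate", "High"],
        "B": ["Low", "Low", "Low", "Moderate", "High", "Unacceptable"],
        "C": ["Low", "Low", "Moderate", "Moderate", "High", "Unacceptable"],
        "D": ["Low", "Moderate", "Moderate", "High", "Unacceptable", "Unacceptable"],
        "E": ["Moderate", "Moderate", "High", "Unacceptable", "Unacceptable", "Unacceptable"],
    }
    SEVERITIES = ["I", "II", "III", "IV", "V", "VI"]

    def risk_level(probability: str, severity: str) -> str:
        """Look up the preliminary risk level for a probability/severity pair."""
        return RISK_MATRIX[probability][SEVERITIES.index(severity)]

    print(risk_level("B", "V"))  # a Remote (B), Critical (V) failure -> "High"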

Timing

The FMEA should be updated whenever:
  • A new cycle begins (new product/process)
  • Changes are made to the operating conditions
  • A change is made in the design
  • New regulations are instituted
  • Customer feedback indicates a problem

Uses
  • Development of system requirements that minimize the likelihood of failures
  • Development of designs and test systems to ensure that failures have been eliminated or the risk has been reduced to an acceptable level
  • Development and evaluation of diagnostic systems
  • Support for design choices (trade-off analysis)

Advantages
  • Improve the quality, reliability and safety of a product/process
  • Improve company image and competitiveness
  • Increase user satisfaction
  • Reduce system development time and cost
  • Collect information to reduce future failures and capture engineering knowledge
  • Reduce the potential for warranty concerns
  • Identify and eliminate potential failure modes early
  • Emphasize problem prevention
  • Minimize late changes and their associated cost
  • Act as a catalyst for teamwork and idea exchange between functions
  • Reduce the possibility of the same kind of failure recurring in the future
  • Reduce the impact on company profit margin
  • Improve production yield
  • Maximize profit

