A Proposal For An Alternative To MTBF and MTTF

DfR Solutions: reliability designed, reliability delivered

Does the DoD Use the Wrong Reliability Metric? A Proposal for an Alternative to MTBF/MTTF
James McLeish
ASQ Reliability Division Webinar, July 10, 2014
jmcleish@dfrsolutions.com
DfR Solutions, 9000 Virginia Manor Rd., Ste 290, Beltsville MD 20705 | 301-474-0607 | www.dfrsolutions.com

Abstract
o Accurate measurements are essential for understanding, controlling and improving processes and performance.
o For many decades the primary metrics for measuring reliability in the defense and aerospace industries have been Mean Time Between Failures and Mean Time To Failure (MTBF/MTTF, i.e. MTxF).
  > They are used despite being widely misunderstood and misinterpreted, because they give an incomplete view of actual system or equipment dependability across the entire product lifecycle.
  > This often results in a misleading, overly optimistic reliability assessment that can hide potential failure and safety issues.
  > Yet the use of the MTBF/MTTF metrics continues because they have been codified into numerous military standards, specs and handbooks.
o This webinar reviews the origins of the MTBF/MTTF metrics, discusses what they do and do not represent from a Physics of Failure point of view, and provides recommendations for better reliability metrics.
o Note: This webinar evolved from an article posted at http://nomtbf.com/
  > Replacing MTBF/MTTF with Bx/Lx Reliability Metrics
  > http://nomtbf.com/201 /replacing-mtbf-bx/

What is MTBF / MTTF
o Mean Time Between Failures (MTBF, also known as Theta (θ)):
  > The statistical average of the time between failures across a population or fleet of systems or components that are repairable or replaceable, calculated by dividing the total accumulated population operating or field time by the number of failures.
o Mean Time To Failure (MTTF):
  > The statistical average of the time to failure of a population or fleet of systems or components that are not repairable or replaceable, calculated by dividing the total accumulated population operating or field time by the number of failures.

  $\mathrm{MTxF}\ (\theta) = \dfrac{\sum (\text{units} \times \text{each unit's operating or field time})}{\text{number of failures}}$

o The inverse of MTxF is known as the Failure Rate (λ):

  $\lambda = \dfrac{1}{\mathrm{MTxF}}$

  (Note: for vehicles, Mean Miles Between/To Failure is sometimes used instead of time.)

MTBF / MTTF Conceptions vs Reality
o MTxF is used in various industries (especially defense & aerospace) to represent reliability, but is widely misquoted, misunderstood & sometimes abused.
o The most common misconception is that MTxF refers to the expected service life or failure free operating period between failures OF A SINGLE DEVICE.
  > In reality MTxF is the inverse of the average failure rate of a fleet of devices.
o Example: an MTBF of 1,000,000 fleet hours in a fleet of 10,000 units
  > Results in one failure for every 100 hours of operation per unit (1,000,000 fleet hours / 10,000 units).
  > If the 10,000 unit fleet operates 24 hrs/day, 240,000 fleet hours are accumulated daily.
  > Then on average a failure can be expected somewhere in the fleet about every 4.2 days (1,000,000 / 240,000).
o While this metric has value to logisticians working to provide the spare parts needed to support the deployment of a military division, it does not represent actual reliability or service life.
  > Many misconceptions could be avoided if the term "Fleet or Population Hours, Time or Miles" were used instead of simply "Hours, Time or Miles".
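The fleet arithmetic in the example above can be checked with a few lines of Python. This is only a sketch: the function names `mtxf` and `days_between_fleet_failures` are illustrative assumptions, and the three-unit population at the end is hypothetical data that does not come from the slides.

```python
def mtxf(unit_hours, failures):
    """MTxF = total accumulated fleet operating time / number of failures."""
    if failures <= 0:
        raise ValueError("MTxF is undefined without at least one failure")
    return sum(unit_hours) / failures


def days_between_fleet_failures(mtxf_fleet_hours, units, hours_per_day=24.0):
    """Average calendar days between failures somewhere in the fleet."""
    fleet_hours_per_day = units * hours_per_day
    return mtxf_fleet_hours / fleet_hours_per_day


# Slide example: MTBF of 1,000,000 fleet hours, 10,000 units running 24 hrs/day.
MTBF_FLEET_HOURS = 1_000_000
UNITS = 10_000

fleet_hours_per_day = UNITS * 24
failure_rate = 1 / MTBF_FLEET_HOURS  # lambda, failures per fleet hour
interval_days = days_between_fleet_failures(MTBF_FLEET_HOURS, UNITS)

print(f"fleet hours accumulated per day: {fleet_hours_per_day:,}")
print(f"failure rate (lambda): {failure_rate:.1e} per fleet hour")
print(f"average days between fleet failures: {interval_days:.2f}")

# Hypothetical three-unit population (not from the slides): 3,000 hrs, 3 failures.
print(f"MTxF of toy population: {mtxf([500.0, 1500.0, 1000.0], 3):.0f} hrs")
```

The point of the sketch is the one the slide makes: the 1,000,000 hour figure describes the fleet as a whole, and says nothing about how long any single unit will last.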
MTBF / MTTF & Reliability
o "IF" a constant Failure Rate or MTxF applies, the reliability at a point in time can be calculated by the equation:

  $R(t) = e^{-\lambda t} = e^{-t/\mathrm{MTxF}}$

  > Example (MTxF = 1,000,000 hrs, fleet of 10,000 units):
    > $R(2{,}400\ \text{hrs}) = e^{-2{,}400/1{,}000{,}000} = e^{-0.0024} \approx 0.9976$, so 10,000 units x (1 - R) ≈ 23.97 failures (2,400 hrs operating at 24 hrs/day = 100 days)
    > $R(8{,}760\ \text{hrs}) = e^{-8{,}760/1{,}000{,}000} = e^{-0.00876} \approx 0.99128$, so 10,000 units x (1 - R) ≈ 87.2 failures (8,760 hrs operating at 24 hrs/day = 365 days = 1 year)
o Unfortunately the constant failure rate (or Random Failure) portion of the hypothetical bathtub curve is not realistic, and
o If a constant failure rate period did exist, it could not last forever; wear-out failures of the device would eventually limit its life much earlier than its MTBF.
  > Therefore, there is no direct correlation between the service life of a device and its failure rate or MTxF.

The Traditional View of Quality, Reliability & Durability (QRD)
o The Product Life Cycle Failure Rate "Bath Tub" Curve focuses on 3 separate & individual life cycle phases, each with separate control & improvement strategies.
o It produced the misguided belief that reliability efforts should focus only on random failure issues.
o But "true" root causes can be disguised by actuarial assumptions that make QRD data analysis easy to perform & administer. This is an inaccurate & misleading point of view.
[Figure: The Bath Tub Curve (sum of 3 independent phenomena). Y axis: Problem or Failure Rate; X axis: Time (Years). Phases: Quality = Infant Mortality; Reliability = Random or Chance Problems (constant, unavoidable); Durability = Wear Out (End of Useful Life). Marked point: End of Useful Life, typical replacement decision point.]
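The constant-failure-rate calculation from the "MTBF / MTTF & Reliability" slide, R(t) = e^(-t/MTxF), can be reproduced as a short sketch. The helper names `reliability` and `expected_failures` are assumptions made for illustration; the inputs (MTxF of 1,000,000 hrs, 10,000 units, 2,400 and 8,760 hrs) are the slide's own.

```python
import math


def reliability(t_hours, mtxf_hours):
    """R(t) = exp(-t / MTxF); valid only IF the failure rate is constant."""
    return math.exp(-t_hours / mtxf_hours)


def expected_failures(units, t_hours, mtxf_hours):
    """Expected number of failed units in a fleet after t_hours each."""
    return units * (1.0 - reliability(t_hours, mtxf_hours))


MTXF_HOURS = 1_000_000
UNITS = 10_000

for t in (2_400, 8_760):  # 100 days and 1 year at 24 hrs/day
    r = reliability(t, MTXF_HOURS)
    n = expected_failures(UNITS, t, MTXF_HOURS)
    print(f"t = {t:>5} hrs  R = {r:.5f}  expected failures = {n:.2f}")
```

Even with a 1,000,000 hour MTxF, the model predicts dozens of failures in the first year of a 10,000 unit fleet, which is why the MTxF value should not be read as a service life.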
A "PoF Failure Mechanism" Based "Realistic" View Reveals the True Interactive Relationships Between Q, R & D
o "Cause & effect" root causes can be disguised by actuarial statistics.
o Once problems are accurately categorized you have a realistic picture of the "true root causes".
o TRUE random problems are rare once [...] correlated to [...]
[Figure: Problem or Failure Rate over time; annotation: weak designs that start to wear out prematurely.]

A Constant MTxF/Failure Rate is a Statistical Aberration that Doesn't Occur in Reality
o Just because a distribution can be averaged does not mean that the average accurately represents the entire distribution.
  > The constant Failure Rate / MTxF concept is the result of statistical manipulations by 1960s-era actuarial principles used to simplify data crunching and reporting.
o Physics of Failure research has produced a different point of view & classifications:
  > Infant Mortality failures (the observed decreasing failure rate) are actually due to manufacturing errors, excessive manufacturing variation or design errors that produce defects or weaknesses. These can cause either initial failures or latent failures throughout life.
  > In weak designs, wear-out failure mechanisms (increasing failure rate) can start prematurely. This also correlates to the Safety Engineering definition of random failures.
  > "True" random failures are due to chance encounters with "overstress" conditions that exceed the capabilities/strength of the device (pot holes, EOS, vehicle impacts).
  > In well designed products random failures are rare and attributable to "Acts of God or War".

MTBF is a Non-Intuitive Reliability "Buzz Word"
o MTxF metrics are often used without an understanding of what they represent.
  > Basic and necessary assumptions on what constitutes a failure are not stated.
o MTxF does not characterize the expected failure free period or the useful life.
  > A single discrete number does not give any insight into the characteristics of the actual distribution beyond the arithmetic mean.
o While MTxF may be one aspect of reliability, it is insufficient to accurately represent all attributes of reliability.
  > This results in uncertainty about the actual reliability of a device.

MTBF/MTTF is Insufficient for Representing the Classical Definition of Reliability:
o "The probability of an item to perform required functions, under stated conditions, for a stated period of time"
  > MTxF is a coarse metric with limited value for designing reliability into products.
o An arithmetic mean is a poor metric for representing complex relationships.
  > It is insufficient to represent the influence of outliers, spread and distribution on the endurance / reliability of a population.
  > Time to first failure, failures over time, usage/durability life & total service lifetime failures are more useful reliability metrics.
  > The misconception of a Constant/Random Failure Period can result in qualification using short reliability demonstration tests instead of longer durability testing.
[Figure: 3 failure distributions with the same mean but vastly different times to 1st failure.]
o Identifying equipment durability and times to 1st failure for various failure mechanisms & operating conditions is more vital than an MTxF or mean failure rate.

If MTxF is so Misleading, Then Why is it So Widely Used?
o MTxF is cited in numerous military standards, specs & handbooks.
  > This caused the practice to spread throughout the defense, aerospace & electronics industries as contractors were required to provide MTxF data.
  > Further documented in numerous reliability textbooks and college courses.
  > Migrated to other engineering disciplines, especially FAA & Safety Engineering.
o The MTxF metric is desired in some industries:
  > Simplicity of a single number metric.
  > Easy alternative to implementing more comprehensive activities that better address all reliability issues.
o Since MTxF DOES NOT represent the actual expected service life of a device or its expected failure free period, this confusion may sometimes be desired as a marketing scheme to produce a perception of high reliability that avoids addressing real reliability issues.

The Origins of MTBF/MTTF
o 1957 AGREE Commission Report, Task Group 1 (Advisory Group on Reliability of Electronic Equipment)
o A reliability metric tailored to the leading electronic technology of the 1950s:
  > The vacuum tube & vacuum tube assemblies
  > Early discrete transistors & diodes
o Developed for use with the computational technologies of the 1950s:
  > Mechanical adding machines & the slide rule
[Figure: excerpt from the AGREE report.]

The Need for a Review and Updating of Reliability Tools and Paradigms for Effectiveness and Best Practices
o Continuous Improvement, ISO-9000, 6-Sigma Quality Methods
  > Due to lack of effort to update and maintain standards or spec templates
  > Due to a lack of cross pollination of best practices from other industries

Current Situation as Defined by the U.S.
Defense Science Board Task Force on Developmental Test and Evaluation (DT&E)
o Non-optimized reliability metrics may be a reason why "In recent years, there has been a dramatic increase in the number (~2/3) of (military) systems not meeting suitability requirements".
o "RAM deficiencies comprise the pr..."
o The results have been:
  > "Costly redesign & schedule delays."
  > "High maintenance burden & costs as field personnel must replace or repair unreliable systems and components that were deployed without achieving reliability objectives."
[Figure: Demonstrated Reliability vs Requirements for All Operational Tests; axis: Requirement MTBF.]

Other Views on the Need for Updating the Reliability Profession
o "Reliability engineering historically has been focused on statistical & probabilistic models which often do not have valid traceability to physical failure mechanisms"
  > Kirk Gray, Accelerated Reliability Solutions & Hobbs Engineering instructor, "What, Why, When and How to Apply HALT & HASS"
o "Unfortunately, the development of reliability engineering has been afflicted with more nonsense than any other branch of engineering."
  > Patrick O'Connor, consultant & author of "Practical Reliability Engineering"
o "In Reliability and Quality Engineering, Physics Always Trumps Mathematics"
  > Dr. Andre Kleyner, Global Reliability Engineering Leader, Delphi Electronics; "Notable Quotes", ASQ Quality Progress, Nov. 2013
o "What started as a simple observation has developed into a personal mission to stop the widespread misuse, misunderstanding and misinformation circling around MTBF. The acronym, MTBF, stands for Mean Time Between Failure. It is very likely the worst four letter acronym in the reliability engineering profession."
  > Fred Schenkelberg at http://nomtbf.com/, former ASQ Reliability Division Chairman
Recommendation for an Improved Reliability Metric: A Blast from the Past
o Bx/Lx: the Life Point (hrs., days, yrs. or cycles) when no more than x% of failures have occurred.
  > A single metric that includes a performance AND a durability element
  > Max. allowable % failures (i.e. 1-R%) AND the durability life point
o For example, B10: the life point where no more than 10% of failures (R > 90%) occur in a population.
  > A time to "early failure" focus
o Failure values other than 10% can be used (i.e. 5%, 2%, 1%, 0.5%, 0.1%...)
  > Predates MTBF/MTTF
  > Evolved from the B10 bearing life metric (also used in the machinery & auto industries)
  > Promotes Weibull analysis
[Figure: probability plot, Generalized Gamma.]
o Bx/Lx: a valid, widely used, comprehensive metric that the AGREE Commission failed to adapt to electrical equipment, due to the desire for a metric that related more to logistics than sustainability.

Benefits of the Bx/Lx Reliability Metric
o A more comprehensive reliability metric requires:
  > Reliability values correlated to a point in usage or field time, under application-appropriate usage and environmental stress conditions
o The Bx/Lx life point can be defined in hrs., days, yrs., miles, cycles ... as appropriate to the durability characteristic of interest in an application.
o A time to "early or first failure" focus
  > Failure values other than 10% can be used (i.e. 5%, 2%, 1%, 0.5%, 0.1%...)
o Improvement over the traditional (MTBF/MTTF) reliability metric
  > Mean Time Between Failure / Mean Time To Failure
  > Represents the average failure time of a diverse population during only the useful life phase (assumes wearout does not occur). Under a constant failure rate, about 63% of units (1 - 1/e), not 50%, have failed by the time the MTxF is reached.
  > Arithmetic mean is a poor metric since it is greatly influenced by outliers and the spread/distribution of the population.
o Can be used in conjunction with MTxF
  > Since many organizations are familiar/comfortable with MTxF and use it for logistics, there would be resistance to eliminating MTxF.
  > It would be easier to add Bx/Lx metrics alongside MTxF.

Physics of Failure Durability Simulation Modeling: Failure Risk Life Curves for Each Failure Mechanism Tallied to Produce a Combined Life Curve
[Figure: example of a Physics of Failure failure-risk-over-time plot from the Sherlock ADA durability simulation CAE app (www.dfrsolutions.com). Curves include the overall combined life curve, a thermal cycling wear-out curve, PTH thermal cycling fatigue wear-out, and cumulative failures from a generic constant mean failure rate (MIL-HDBK-217); a Bx point is marked.]
o Detailed design & application specific PoF life curves are far more useful than a simple single point MTxF value.

What Can You Do If Your Industry Uses MTBF?
(From Fred Schenkelberg: http://nomtbf.com/2014/06/industry-mtbf-use/#more-1374)
o First, stop using MTBF yourself.
  > Take the life data you already have and, instead of calculating the MTBF, calculate the appropriate reliability function. Fit it to a Weibull, lognormal or whatever distribution is appropriate.
o Second, show others the information produced by directly using reliability data rather than using MTBF.
  > Show the real life data to your customers, vendors, suppliers & engineering teams.
  > Show the data to marketing, finance, sales & especially decision makers.
  > Show that using an accurate reflection of reliability data permits better decisions.
  > It will save you time, money, resources, and frustration.
o Be amazed at how quickly others understand the value of real reliability data.
  > Even managers will get it.
o Third, if required, translate your work back to MTBF.
  > Provide the MTBF value with the duration over which it is appropriate.
  > Show the impact of assuming a constant failure rate when it isn't true.
  > Focus on the value of making good decisions and the cost of making poor decisions.

Random Failure Definition Differences Between Safety & Reliability Professionals
o Emerging Functional Safety Standards
  > IEC 61508 E/E Equipment
  > ISO 26262 Automotive E/E Systems
o Risk-based safety standards, where the risks of hazardous operational situations are qualitatively assessed and safety measures are defined to:
  > Avoid or control Systematic Failures
  > Detect, control or mitigate the effects of Random Hardware Failures
o They require that commonly recognized industry sources be used to determine the hardware part failure rates and the failure mode distributions:
  > IEC/TR 62380, IEC 61709, MIL HDBK 217 F notice 2, RIAC HDBK 217 Plus, UTE C80-811, NPRD 95, EN 50129:2003 Annex C, IEC 62061:2005 Annex D, RIAC FMD97 and MIL HDBK 338.
o Preparation for self-driving robotic vehicles
  > Examples: Google Car, autonomous drones
o Revealing a fundamental difference in the definition of MTxF / Random Failure Rate between the Safety Engineering & Reliability Engineering professions

Failure Definition Differences Between Safety & Reliability Professionals
o In Safety Engineering, faults which lead to failures are classified as either Random or Systematic:
  > Random Faults are due to physical causes (such as corrosion, thermal stressing and wear-out, etc.)
  > To safety professionals, "Random Failures" are not assumed to have a constant failure rate.
  > However, they do reference averaged failure probability & risk derived from statistical analysis of testing and historical data.
  > Systematic Faults are produced by human error during system development & operation.
  > They can be created in any stage of the system's life (i.e. specification, design, manufacture, operation, maintenance, decommissioning).
  > Since it is difficult to predict the occurrence of systematic faults and their effect on safety, best practices to prevent errors and defects are employed.

[Excerpt reproduced on the slide: "Systematic and Random Failure"]
This entry describes the differences between systematic and random failures. It goes on to explain the relevance of these types of failure to hardware and software.
Faults, which lead to failures within a system, can be classified as one of two types:
  > Random Faults
  > Systematic Faults
Random faults: [...] simple hardware components within a system. This type of fault is caused by effects such as corrosion, thermal stressing and wear-out. Due to their random nature, statistical information can be produced from testing and historical data about this type of fault. Thus, the average probability, and hence the risk, associated with the occurrence of a random fault can be calculated.
Systematic faults: [...] development and operation. They can be created in any stage of the system's life including specification, design, manufacture, operation, maintenance and decommissioning. After a systematic fault has been created, it will always appear when the circumstances are exactly the same, until it is removed. However it [...] effect on the safety of a system. This is because of the difficulty of predicting when the same "circumstances" will arise.
Failures of simple hardware [...] are primarily random in nature rather than systematic. While it is possible that hardware can be [...] means that predominately failures are random in nature.
However, this is changing with the growing complexity of processors and the use of Application-Specific Integrated Circuits (ASICs). The use of hardware within safety critical systems significantly predates the use of software within safety critical systems. Historically, this led to the assessment of safety critical systems using quantified risk assessment based upon statistical calculations of failure rates.
All software faults are systematic, thus demonstrating the safety of software relies upon assessing the likelihood of this type of fault. Software within safety critical systems is increasing in size and extent of use, and this is leading to the risks associated with software systematic faults becoming more prevalent. The level of authority given to and complexity associated with software within safety critical applications means that it is extremely important to be able to assess and argue about the effects of software with respect to system safety. It is not possible to statistically predict the probability of systematic faults, thus for software it is not possible to quantify the associated risks. Instead, most current approaches for arguing the acceptability of software are based upon appeal to the suitability of the development processes followed, as recommended by standards, and the development of a software safety argument.

Definition of Random Failure Differences Between Safety & Reliability Professionals
o The Safety Profession used a common sense definition.
o Unfortunately, classical reliability professionals & the "Recognized Industry Sources" use a different definition, developed by actuaries & defined in the 1957 US DoD AGREE Commission Report (Advisory Group on Reliability of Electronic Equipment).

Failure Definition Differences Between Safety & Reliability Professionals
o In Classical Reliability Engineering, Random Failures are denoted by the "Flat" (i.e.
Constant Failure Rate portion) of the hazard function (bathtub curve) between:
  > A presumed short Infant Mortality phase (denoted by a "decreasing" failure rate)
  > A presumed distant Wear-out phase (denoted by an "increasing" failure rate)
o Therefore the "recognized industry sources" in theory do not account for the Infant Mortality or Durability Wearout issues that safety professionals are expecting to be quantified.
o Random means constant, i.e. equally likely to occur at any time in the usage life.
[Figure: The Bathtub Curve, hypothetical failure rate versus time. Regions: Infant Mortality (decreasing failure rate), Random Failures (low "constant" failure rate), End of Life Wear-Out (increasing failure rate).]
o The inverse of the random constant failure rate is known as the Mean Time Between Failures: $1/\lambda = \mathrm{MTBF}$

Summary: Mean Time Between or To Failures (MTBF / MTTF)
o The MTxF reliability metric is widely cited & often criticized.
  > It is the average usage time between repairable or permanent failures of a fleet of items.
  > The inverse of the average failure rate ($\lambda_{avg}$), i.e. $\mathrm{MTBF} = 1/\lambda_{avg}$
  > Characterizes a system, often for logistic maintenance spare parts purposes
o Often misinterpreted as a failure free life period of a single system
  > A single "number" that attempts to describe a complex lifetime
  > By assuming that failures occur at a constant rate because:
    > Quality related infant mortality failures are insignificant due to screening
    > End of life wearout failures occur outside of the useful service life
  > Results in an oversimplification that misrepresents reality.
o Better reliability metrics correlate reliability or failure to a point in operating or in-service durability time, or are plots across a timeline. Examples:
  > Bx/Lx Reliability Metrics
  > Physics of Failure Durability Simulation plots
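To make the contrast between a mean life and a Bx/Lx life point concrete, the following sketch computes B10 for three life distributions that share the same mean. It assumes a two-parameter Weibull model, which is a modelling choice made here for illustration (the slides only say that Bx/Lx "promotes Weibull analysis"); the shape values 0.5, 1.0 and 3.0, the 1,000 hour mean, and the function names are all hypothetical.

```python
import math


def weibull_scale_for_mean(mean_life, beta):
    """Weibull scale (eta) that yields the requested mean for shape beta."""
    return mean_life / math.gamma(1.0 + 1.0 / beta)


def bx_life(x_percent, eta, beta):
    """Life point by which x% of the population has failed (Weibull)."""
    return eta * (-math.log(1.0 - x_percent / 100.0)) ** (1.0 / beta)


def fraction_failed(t, eta, beta):
    """Weibull cumulative fraction failed at time t."""
    return 1.0 - math.exp(-((t / eta) ** beta))


MEAN_LIFE = 1_000.0  # hypothetical mean life in hours, identical for all cases

b10_by_shape = {}
for beta in (0.5, 1.0, 3.0):  # infant mortality, constant rate, wear-out
    eta = weibull_scale_for_mean(MEAN_LIFE, beta)
    b10_by_shape[beta] = bx_life(10.0, eta, beta)
    print(f"beta = {beta:<3}  mean = {MEAN_LIFE:.0f} hrs  "
          f"B10 = {b10_by_shape[beta]:.1f} hrs")

# With a constant failure rate (beta = 1), the share failed by the mean life:
print(f"failed by MTxF at beta = 1: {fraction_failed(MEAN_LIFE, MEAN_LIFE, 1.0):.1%}")
```

All three cases would report the same MTxF, yet their B10 life points differ by roughly two orders of magnitude, which is the information a single mean value hides.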
In Conclusion:
o Reliability metrics that are better than MTxF exist.
o Is it now time for all segments of the reliability profession to update from 50-60 year old AGREE Commission principles?
  > Especially with the introduction of Functional Safety requirements
o If not, preventable QRD issues may continue into the far future and galaxies far, far away!

Want to Know More: Suggested Reading
[Figure: covers of suggested books; legible titles include "Mechanical Reliability and Design".]

Questions & Discussion
Thank you for your attention.
For more information or a copy of the presentation slides contact: jmcleish@dfrsolutions.com

Today's Speaker: James McLeish
Bio: James McLeish is a senior technical staff consultant and manager of the Michigan office of DfR (Design for Reliability) Solutions, a failure analysis, laboratory services and reliability physics engineering consulting firm headquartered in Beltsville, Maryland. Mr. McLeish is a senior member of the ASQ Reliability Division and a core member of the SAE's Reliability Standard Committee, with over 32 years of automotive and military E/E experience in design, development, validation testing, production quality and field reliability. He has held numerous technical expert and management positions in automotive electronics product design, development, vehicle electrical system integration, product assurance, validation testing and warranty problem solving as an E/E Reliability Manager and E/E Quality/Reliability/Durability (QRD) technical specialist at General Motors.
[Logos: ASQ; Fastest Growing Companies in the Electronics [...]; Printed Circuit Design; 2012 Global Technology Award Winner.]
