Reliability Modeling

The RIAC Guide to Reliability Prediction, Assessment and Estimation

$$L\left[(T_{a_1},T_{b_1}),\ldots,(T_{a_L},T_{b_L}) \mid \theta\right] \;\propto\; \prod_{i=1}^{L}\left[F(T_{b_i} \mid \theta) - F(T_{a_i} \mid \theta)\right]$$

$$\lambda = \lambda_b \, e^{-E_a/KT} \, S^n$$

$$1 - CL \;=\; \sum_{k=0}^{r} \frac{(\lambda t)^k}{k!}\, e^{-\lambda t} \;=\; e^{-\lambda t}\left[1 + \lambda t + \cdots + \frac{(\lambda t)^{r-1}}{(r-1)!} + \frac{(\lambda t)^{r}}{r!}\right]$$
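As a quick numerical illustration of the confidence-bound relationship above (a minimal Python sketch added for illustration; the failure count and test time used below are hypothetical, not taken from this book):

from math import exp, factorial

def prob_r_or_fewer(lam, t, r):
    # Probability of observing r or fewer failures in time t under a
    # constant failure rate lam (Poisson model). When lam is the upper
    # confidence bound at confidence level CL, this sum equals 1 - CL.
    m = lam * t
    return exp(-m) * sum(m**k / factorial(k) for k in range(r + 1))

# Hypothetical example: r = 2 failures in t = 10,000 hours. The 90% upper
# bound on lam is chi-square(0.90, 2r+2)/(2t) = 10.645/20,000, about
# 5.32e-4 failures/hour; substituting it back recovers 1 - CL of ~0.10.
print(prob_r_or_fewer(5.32e-4, 10_000, 2))  # prints ~0.10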

RIAC is a DoD Information Analysis Center sponsored by the Defense Technical Information Center. RIAC is operated by a
team of Wyle Laboratories, Quanterion Solutions, the University of Maryland, the Penn State University Applied Research
Laboratory and the State University of New York Institute of Technology.
Ordering No.: RPAE

Reliability Modeling –
The RIAC Guide to Reliability Prediction, Assessment and Estimation
Prepared by:

Reliability Information Analysis Center


6000 Flanagan Rd.
Suite 3
Utica, NY 13502-1348

Under Contract to:

Defense Technical Information Center


DTIC-AI
8725 John J. Kingman Rd.
Suite 0944
Fort Belvoir, VA 22060

RIAC is a DoD Information Analysis Center sponsored by the Defense Technical
Information Center. RIAC is operated by a team of Wyle Laboratories, Quanterion
Solutions Inc., the University of Maryland, the Penn State University Applied Research
Laboratory and the State University of New York Institute of Technology.
The information and data contained herein have been compiled from
government and nongovernment technical reports and from material
supplied by various manufacturers and are intended to be used for reference
purposes. Neither the United States Government nor the Wyle Laboratories
contract team warrants the accuracy of this information and data. The user is
further cautioned that the data contained herein may not be used in lieu of
other contractually cited references and specifications.

Publication of this information is not an expression of the opinion of The United States
Government or of the Wyle Laboratories contract team as to the quality or durability of
any product mentioned herein, and any use for advertising or promotional purposes of
this information in conjunction with the name of The United States Government or the
Wyle Laboratories contract team without written permission is expressly prohibited.

ISBN-10: 1-933904-17-8 (Hardcopy)
ISBN-13: 978-1-933904-17-7 (Hardcopy)
ISBN-10: 1-933904-18-6 (PDF Download)
ISBN-13: 978-1-933904-18-4 (PDF Download)
REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)

Public reporting burden for this collection is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.

1. REPORT DATE: 31 May 2010
2. REPORT TYPE: Technical
3. DATES COVERED (From - To): N/A
4. TITLE AND SUBTITLE: Reliability Modeling – The RIAC Guide to Reliability Prediction, Assessment and Estimation
5a. CONTRACT NUMBER: HC1047-05-D-4005
5b. GRANT NUMBER: N/A
5c. PROGRAM ELEMENT NUMBER: N/A
5d. PROJECT NUMBER: N/A
5e. TASK NUMBER: N/A
5f. WORK UNIT NUMBER: N/A
6. AUTHORS: William Denson
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Reliability Information Analysis Center, 100 Sherman Rd., Suite C101, Utica, NY 13502-1348
8. PERFORMING ORGANIZATION REPORT NUMBER: RPAE
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): Defense Technical Information Center, DTIC-AI, 8725 John J. Kingman Rd., STE 0944, Ft. Belvoir, VA 22060; Air Force Research Lab/RISE, 525 Brooks Rd., Rome, NY 13440
10. SPONSORING/MONITOR'S ACRONYM(S): DTIC-AI and AFRL/RISE
11. SPONSORING/MONITOR'S REPORT NUMBER(S): N/A
12. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release, distribution unlimited.
13. SUPPLEMENTARY NOTES: Hardcopies available from the Reliability Information Analysis Center, 100 Sherman Rd., Suite C101, Utica, NY 13502-1348 (Price: $85 US/$95 Non-US). PDF download available from http://theRIAC.org (Price: $70).
14. ABSTRACT: The intent of this book is to provide guidance on modeling techniques that can be used to quantify the reliability of a product or system. In this context, reliability modeling is the process of constructing a mathematical model that is used to estimate the reliability characteristics of a product. There are many ways in which this can be accomplished, depending on the product or system and the type of information that is available to, or practical to obtain by, the analyst. This book will review possible approaches, summarize their advantages and disadvantages, and provide guidance on selecting a methodology based on the specific goals and constraints of the analyst. While this book will not discuss the use of specific published methodologies, in cases where examples are provided, tools and methodologies with which the author has personal experience in their development are used, such as life modeling, NPRD, MIL-HDBK-217 and 217Plus.
15. SUBJECT TERMS: Reliability Modeling, Reliability Prediction, Reliability Assessment, Reliability Estimation, NPRD, MIL-HDBK-217, 217Plus
16. SECURITY CLASSIFICATION OF: a. REPORT: UNCLASSIFIED; b. ABSTRACT: UNCLASSIFIED; c. THIS PAGE: UNCLASSIFIED
17. LIMITATION OF ABSTRACT: UNLIMITED
18. NUMBER OF PAGES: 410
19a. NAME OF RESPONSIBLE PERSON: David Nicholls
19b. TELEPHONE NUMBER (include area code): 315.351.4202

Standard Form 298 (Rev. 8/98), Prescribed by ANSI Std. Z39.18
The Reliability Information Analysis Center (RIAC), formerly the Reliability Analysis Center (RAC),
is a Department of Defense Information Analysis Center sponsored by the Defense Technical
Information Center, managed by the Air Force Research Laboratory (formerly Rome Laboratory), and
operated by a team of Wyle Laboratories, Quanterion Solutions, the University of Maryland, the Penn
State University Applied Research Laboratory and the State University of New York Institute of
Technology. RIAC is chartered to collect, analyze and disseminate reliability, maintainability,
quality, supportability and interoperability (RMQSI) information pertaining to systems and products,
as well as the components used in them. The RIAC addresses both military and commercial
perspectives.

The data contained in the RIAC databases is collected on a continuous basis from a broad range of
sources, including testing laboratories, device and equipment manufacturers, government laboratories
and equipment users (government and industry). Automatic distribution lists, voluntary data
submittals and field failure reporting systems supplement an intensive data solicitation program.
Users of RIAC are encouraged to submit their RMQSI data to enhance these data collection efforts.

RIAC publishes documents for its users in a variety of formats and subject areas. While most are
intended to meet the needs of RMQSI practitioners, many are also targeted to managers and designers.
RIAC also offers RMQSI consulting, training and responses to technical and bibliographic inquiries.

REQUESTS FOR TECHNICAL ASSISTANCE AND INFORMATION ON AVAILABLE
RIAC SERVICES AND PUBLICATIONS MAY BE DIRECTED TO:

Reliability Information Analysis Center
100 Sherman Rd.
Suite C101
Utica, NY 13502-1348
General Information: (877) 363-RIAC / (877) 363-7422
Technical Inquiries: (315) 351-4200
Fax: (315) 351-4209
E-Mail: inquiry@theRIAC.org
Internet: http://theRIAC.org

ALL OTHER RIAC REQUESTS SHOULD BE DIRECTED TO:

Air Force Research Laboratory
AFRL – Systems and Information Interoperability Branch
Attn: R. Hyle
525 Brooks Road
Rome, NY 13441-4505
Telephone: (315) 330-4857
DSN: 587-4857
Fax: (315) 330-7647
E-Mail: Richard.Hyle@rl.af.mil

Copyright © 2010 by Quanterion Solutions Incorporated. This handbook was developed by Quanterion
Solutions Incorporated, in support of the prime contractor (Wyle Laboratories) in the operation of the Department
of Defense Reliability Information Analysis Center (RIAC) under Contract HC1047-05-D-4005. The Government
has a fully paid up perpetual license for free use of and access to this publication and its contents among all the
DOD IACs in both hardcopy and electronic versions, without limitation on the number of users or servers. Subject
to the rights of the Government, this document (hardcopy and electronic versions) and the content contained within
it are protected by U.S. Copyright Law and may not be copied, automated, re-sold, or redistributed to multiple
users without express written permission. The copyrighted work may not be made available on a server for use
by more than one person simultaneously without express written permission. If automation of the technical
content for other than personal use, or for multiple simultaneous user access to a copyrighted work is desired,
please contact 877.363.RIAC (toll free) or 315.351.4202 for licensing information.

Table of Contents
Page
1.  INTRODUCTION  1 
1.1.  Scope  2 
1.2.  Book Organization  5 
1.3.  Reliability Program Elements  7 
1.4.  The History of Reliability Prediction  11 
1.5.  Acronyms  17 
1.6.  References  18 
 
 
2.  GENERAL ASSESSMENT APPROACH  19 
 
2.1.  Define System  20 
2.2.  Identify the Purpose of the Model  22 
2.3.  Determine the Appropriate Level at Which to Perform the Modeling  25 
2.3.1.  Level vs. Data Needed  26 
2.3.2.  Using an FMEA as the basis for a reliability model  28 
2.3.3.  Model Form vs. Level  34 
2.4.  Assess Data Available  36 
2.5.  Determine and Execute Appropriate Approach  38 
2.5.1.  Empirical  44 
2.5.1.1.  Test 44 
2.5.1.2.  Field Data 77 
2.5.2.  Physics  106 
2.5.2.1.  Stress/Strength Modeling 106 
2.5.2.2.  First Principles 111 
2.6.  Combine Data  114 
2.6.1.  Bayesian Inference  121 
2.7.  Develop System Model  123 
2.7.1.  Monte Carlo Analysis  127 
2.8.  References  133 
 
 
3.  FUNDAMENTAL CONCEPTS  135 
 
3.1.  Reliability Theory Concepts  135 
3.2.  Probability concepts  142 
3.2.1.  Covariance  142 
3.2.2.  Correlation Coefficient  142 
3.2.3.  Permutations and Combinations  143 
3.2.4.  Mutual Exclusivity  144 

3.2.5.  Independent Events  144 
3.2.6.  Non‐independent (Dependent) Events  145 
3.2.7.  Non‐independent (Dependent) Events: Bayes Theorem  146 
3.2.8.  System Models  146 
3.2.9.  K‐out‐of‐N Configurations  151 
3.3.  Distributions  153 
3.3.1.  Exponential  159 
3.3.2.  Weibull  160 
3.3.3.  Lognormal  166 
3.4.  References  169 
 
 
4.  DOE­BASED APPROACHES TO RELIABILITY MODELING  171 
 
4.1.  Determine the Feature to be Assessed  172 
4.2.  Determine Factors  172 
4.3.  Determine the Factor Levels  172 
4.4.  Design the Tests  174 
4.5.  Perform Tests and Measurements  180 
4.6.  Analyze the Data  181 
4.7.  Develop the Life Model  183 
4.8.  References  183 
 
 
5.  LIFE DATA MODELING  185 
 
5.1.  Selecting a Distribution  185 
5.2.  Parameter Estimation Overview  186 
5.2.1.  Closed Form Parameter Approximations  189 
5.2.2.  Least Squares Regression  190 
5.2.3.  Parameter Estimation Using MLE  192 
5.2.3.1.  Brief Historical Remarks 193 
5.2.3.2.  Likelihood Function 193 
5.2.3.3.  Maximum Likelihood Estimator (MLE) 195 
5.2.4.  Confidence Bounds and Uncertainty  198 
5.2.4.1.  Confidence Bounds with MLE 198 
5.2.4.2.  Confidence Bounds Approximations 199 
5.3.  Acceleration Models  206 
5.3.1.  Fundamental Acceleration Models  207 
5.3.1.1.  Examples 208 
5.3.2.  Combined Models  210 
5.3.3.  Cumulative Damage Model  214 
5.4.  MLE Equations  216 
5.4.1.  Likelihood Functions  217 
5.5.  References  221 
 
 
6.  INTERPRETATION OF RELIABILITY ESTIMATES  223 
 
6.1.  Bathtub Curve  223 
6.2.  Common Cause vs. Special Cause  225 
6.3.  Confidence Bounds  238 
6.3.1.  Traditional Techniques for Confidence Bounds  238 
6.3.2.  Uncertainty in Reliability Prediction Estimates  240 
6.4.  Failure Rate vs pdf  243 
6.5.  Practical Aspects of Reliability Assessments  245 
6.6.  Weibayes  245 
6.7.  Weibull Closure Property  246 
6.8.  Estimating Event‐Related Reliability  247 
6.9.  Combining Different Types of Assessments at Different Levels  248 
6.10.  Estimating the Number of Failures  250 
6.11.  Calculation of Equivalent Failure Rates  251 
6.12.  Failure Rate Units  252 
6.13.  Factors to be Considered When Developing Models  253 
6.13.1.  Causes of Electronic System Failure  253 
6.13.2.  Selection of Factors  255 
6.13.3.  Reliability Growth of Components  257 
6.13.4.  Relative vs. Absolute Humidity  259 
6.14.  Addressing Data with No Failures  259 
6.15.  Reliability of Components Used Outside of Their Rating  261 
6.16.  References  262 
 
 
7.  EXAMPLES  263 
 
7.1.  MIL‐HDBK‐217 Model Development Methodology  264 
7.1.1.  Identify Possible Variables  266 
7.1.2.  Develop Theoretical Model  266 
7.1.3.  Collect and QC Data  267 
7.1.4.  Correlation Coefficient Analysis  268 

7.1.5.  Stepwise Multiple Regression Analysis  270 
7.1.6.  Goodness‐of‐Fit Analysis  271 
7.1.7.  Extreme Case Analysis  272 
7.1.8.  Model Validation  272 
7.2.  217Plus Reliability Prediction Models  273 
7.2.1.  Background  273 
7.2.2.  System Reliability Prediction Model  274 
7.2.2.1.  217Plus Background 274 
7.2.2.2.  Methodology Overview 277 
7.2.2.3.  System Reliability Model 278 
7.2.2.4.  Initial Failure Rate Estimate 279 
7.2.2.5.  Process Grading Factors 280 
7.2.2.6.  Basis Data for the Model 281 
7.2.2.7.  Uncertainty in Traditional Approach Estimates 281 
7.2.2.8.  System Failure Causes 282 
7.2.2.9.  Environmental Factor 287 
7.2.2.10.  Reliability Growth 291 
7.2.2.11.  Infant Mortality 292 
7.2.2.12.  Combining Predicted Failure Rate with Empirical Data 292 
7.2.3.  Development of Component Reliability Models  292 
7.2.3.1.  Model Form 292 
7.2.3.2.  Acceleration Factors 294 
7.2.3.3.  Time Basis of Models 294 
7.2.3.4.  Failure Mode to Failure Cause Mapping 295 
7.2.3.5.  Derivation of Base Failure Rates 296 
7.2.3.6.  Combining the Predicted Failure Rate with Empirical Data 296 
7.2.3.7.  Estimating Confidence Levels 298 
7.2.3.8.  Using the 217Plus Model in a Top-Down Analysis 298 
7.2.3.9.  Capacitor Model Example 299 
7.2.3.10.  Default Values 301 
7.2.4.  Photonic Model Development Example  303 
7.2.4.1.  Introduction 303 
7.2.4.2.  Model development methodology and results 306 
7.2.4.3.  Uncertainty Analysis 322 
7.2.4.4.  Comments on Part Quality Levels 325 
7.2.4.5.  Explanation of Failure Rate Units 325 
7.2.5.  System‐Level Model  326 
7.2.5.1.  Model Presentation 326 
7.2.5.2.  217Plus Process Grading Criteria 328 
7.2.5.3.  Design Process Grade Factor Questions 330 
7.2.5.4.  Manufacturing Process Grade Factor Questions 336 
7.2.5.5.  Part Quality Process Grade Factor Questions 340 
7.2.5.6.  System Management Process Grade Factor Questions 342 
7.2.5.7.  Can Not Duplicate (CND) Process Grade Factor Questions 346 
7.2.5.8.  Induced Process Grade Factor Questions 347 
7.2.5.9.  Wearout Process Grade Factor Questions 348 
7.2.5.10.  Growth Process Grade Factor Questions 349 
7.3.  Life Modeling Example  350 
7.3.1.  Introduction  350 
7.3.2.  Approach  350 
7.3.3.  Reliability Test Plan  350 
7.3.4.  Results  352 
7.3.4.1.  Times to Failure Summary 352 
7.3.4.2.  Life Models 354 
7.4.  NPRD Description  357 
7.4.1.  Data Collection  358 
7.4.2.  Data Interpretation  361 
7.4.3.  Document Overview  366 
7.4.3.1.  "Part Summaries" Overview 366 
7.4.3.2.  "Part Details" Overview 373 
7.4.3.3.  Section 4 "Data Sources" Overview 374 
7.4.3.4.  Section 5 "Part Number/MIL Number" Index 374 
7.4.3.5.  Section 6 “National Stock Number Index with Federal Stock Class” 375 
7.4.3.6.  Section 7 "National Stock Number Index without Federal Stock Class
Prefix" 375 
7.5.  References  375 
 
 
8.  THE USE OF FMEA IN RELIABILITY MODELING  377 
 
8.1.  Introduction  377 
8.2.  Definitions  381 
8.3.  FMEA Logistics  383 
8.3.1.  When initiated  383 
8.3.2.  FMEA Team  383 
8.3.3.  FMEA Facilitation  384 

8.3.4.  Implementation  385 
8.4.  How to Perform an FMEA  385 
8.5.  Identify System Hierarchy  387 
8.6.  Function Analysis  388 
8.7.  IPOUND Analysis  388 
8.8.  Identify the Severity  390 
8.9.  Identify the Possible Effect(s) that Result from Occurrence of Each Failure Mode  392 
8.10.  Identify Potential Causes of Each Failure Mode  392 
8.11.  Identify Factors for Each Failure Cause  398 
8.11.1.  Accelerating Stress(es) or Potential Tests  398 
8.11.2.  Occurrence  398 
8.11.2.1.  Occurrence Rankings 398 
8.11.3.  Preventions  401 
8.11.4.  Detections  401 
8.11.5.  Detectability  401 
8.12.  Calculate the RPN  404 
8.13.  Determine Appropriate Corrective Action  405 
8.14.  Update the RPN  408 
8.15.  Using Quality Function Deployment to Feed the FMEA  408 
8.16.  References  410 
 
 
9.  CONCLUDING REMARKS  411 


List of Figures
Page
FIGURE 1.1‐1:  PHASES OF A RELIABILITY PROGRAM ..................................................................................... 2 
FIGURE 1.1‐2:  RELATIVE COST OF FAILURES VS. PHASE ................................................................................ 3 
FIGURE 1.1‐3:  RELIABILITY PREDICTION, ASSESSMENT AND ESTIMATION.................................................... 4 
FIGURE 1.1‐4:  PERCENT OF COMPANIES USING RELIABILITY ENGINEERING TOOLS ..................................... 5 
FIGURE 1.3‐1:  EXAMPLE RELIABILITY PROGRAM APPROACH ........................................................................ 7 
FIGURE 2.0‐1:  GENERAL MODELING APPROACH ......................................................................................... 20 
FIGURE 2.1‐1:  FAULT TREE REPRESENTATION OF SYSTEM MODEL ............................................................. 21 
FIGURE 2.1‐2:  FAULT TREE REPRESENTATION TO THE FAILURE CAUSE LEVEL ............................................ 21 
FIGURE 2.2‐1:  BREAKDOWN OF POTENTIAL RELIABILITY MODELING PURPOSES ....................................... 23 
FIGURE 2.3‐1:  TYPICAL DATA REQUIREMENTS VS. LEVEL OF HIERARCHY ................................................... 27 
FIGURE 2.3‐2: THE BASIC FMEA APPROACH ................................................................................................. 28 
FIGURE 2.3‐3: HIERARCHICAL RELATIONSHIP BETWEEN CAUSE, MODE AND EFFECT ................................. 29 
FIGURE 2.3‐4: APPROACH TO IDENTIFYING CAUSES .................................................................................... 29 
FIGURE 2.3‐5:  FAULT TREE OF PRODUCT OR SYSTEM ................................................................................. 32 
FIGURE 2.3‐6:  FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE AS THE LOWEST LEVEL ....................... 32 
FIGURE 2.3‐7:  FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE ABOVE THE LOWEST LEVEL ................ 33 
FIGURE 2.3‐8:  FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE TWO LEVELS ABOVE THE LOWEST 
LEVEL ................................................................................................................................................... 33 
FIGURE 2.5‐1: BREAKDOWN OF RELIABILITY ASSESSMENT OPTIONS .......................................................... 38 
FIGURE 2.5‐2: QUALIFICATION CONCEPTS AND TERMINOLOGY .................................................................. 46 
FIGURE 2.5‐3: EVT, DVT AND PVT RELATIONSHIPS....................................................................................... 48 
FIGURE 2.5‐4:  ACCELERATION LEVELS ......................................................................................................... 51 
FIGURE 2.5‐5:  UNCERTAINTY IN EXTRAPOLATION ...................................................................................... 52 
FIGURE 2.5‐6:  ACCELERATION LEVELS ......................................................................................................... 53 
FIGURE 2.5‐7: ACCELERATION ALTERNATIVES ............................................................................................. 53 
FIGURE 2.5‐8:  RELATIVE LIFETIME VS. STRESS ............................................................................................. 54 
FIGURE 2.5‐9:  RELIABILITY REQUIREMENT VS. SMALL POPULATION RELIABILITY INFERENCE ................... 60 
FIGURE 2.5‐10:  LIFE MODELING METHODOLOGY ....................................................................................... 62 
FIGURE 2.5‐11: IDENTIFICATION OF TEST STRESSES BASED ON THE FMEA ................................................. 64 
FIGURE 2.5‐12:  USING THE DESTRUCT LIMIT TO DEFINE THE LIFE TEST MAX STRESS ................................ 66 
FIGURE 2.5‐13:  POSSIBLE STRESS PROFILES ................................................................................................ 67 
FIGURE 2.5‐14: MEASUREMENT POINTS FOR AN INFANT MORTALITY FAILURE CAUSE .............................. 69 
FIGURE 2.5‐15: MEASUREMENT POINTS FOR A WEAROUT FAILURE CAUSE ............................................... 69 
FIGURE 2.5‐16: ACCELERATION WHEN THE DISTRIBUTIONS FOR AT LEAST TWO STRESSES ARE AVAILABLE
 ............................................................................................................................................................ 71 
FIGURE 2.5‐17: ACCELERATION WHEN THE DISTRIBUTIONS FOR LOW STRESSES ARE NOT AVAILABLE ..... 71 
FIGURE 2.5‐18: LIFE MODEL SEQUENCE ....................................................................................................... 72 
FIGURE 2.5‐19 DEGRADATION MODELING APPROACH ................................................................................ 75 
FIGURE 2.5‐20: DEGRADATION DATA EXAMPLE .......................................................................................... 76 
FIGURE 2.5‐21: DEGRADATION DATA CONVERSION TO TIMES TO FAILURE ................................................ 77 
FIGURE 2.5‐22: RELIABILITY ESTIMATES FROM FIELD DATA ........................................................................ 78 

FIGURE 2.5‐23: FMEA AS A TOOL FOR ASSESSING SIMILARITY ..................................................................... 81 
FIGURE 2.5‐24: MIL‐HDBK‐217 PART COUNT EXAMPLE ............................................................................... 85 
FIGURE 2.5‐25: MIL‐HDBK‐217 PART STRESS EXAMPLE ............................................................................... 86 
FIGURE 2.5‐26: TELCORDIA SR‐332 (BELLCORE) ........................................................................................... 87 
FIGURE 2.5‐27: RAC PRISM REPLACED BY RIAC 217PLUS ............................................................................. 88 
FIGURE 2.5‐28: CNET/RDF 2000 ................................................................................................................... 89 
FIGURE 2.5‐29: CNET/RDF 2000 MODEL EXAMPLE ...................................................................................... 90 
FIGURE 2.5‐30: FIDES .................................................................................................................................... 91 
FIGURE 2.5‐31: USES OF PROGRAM DATA ELEMENTS ................................................................................. 93 
FIGURE 2.5‐32: PROGRAM DATABASE STRUCTURE ..................................................................................... 93 
FIGURE 2.5‐33: DATABASE INFORMATION FLOW ........................................................................................ 95 
FIGURE 2.5‐34: HIERARCHY OF MAINTENANCE ACTIONS ............................................................................ 97 
FIGURE 2.5‐35: CALCULATION OF PART LIFE UNIT ..................................................................................... 100 
FIGURE 2.5‐36: FAILURE TIMES BASED ON OPERATING TIME .................................................................... 101 
FIGURE 2.5‐37: FAILURE TIMES BASED ON CALENDAR TIME ..................................................................... 102 
FIGURE 2.5‐38: FAILURE RATE SIMULATION WITH WEIBULL BETA = 20 .................................................... 103 
FIGURE 2.5‐39: FAILURE RATE SIMULATION WITH WEIBULL BETA = 5.0 ................................................... 103 
FIGURE 2.5‐40: FAILURE RATE SIMULATION WITH WEIBULL BETA = 2.0 ................................................... 104 
FIGURE 2.5‐41: FAILURE RATE SIMULATION WITH WEIBULL BETA = 1.0 ................................................... 104 
FIGURE 2.5‐42: FAILURE RATE SIMULATION WITH WEIBULL BETA = 0.5 ................................................... 105 
FIGURE 2.5‐44: STRESS/STRENGTH INTERFERENCE ................................................................................... 108 
FIGURE 2.5‐45: STRESS/STRENGTH INTERFERENCE VS. TIME .................................................................... 109 
FIGURE 2.6‐1: 217PLUS APPROACH TO FAILURE RATE ESTIMATION ......................................................... 114 
FIGURE 2.6‐3.  BAYESIAN INFERENCE OUTLINE .......................................................................................... 122 
FIGURE 2.7‐1: COMBINING SEVEN FAILURE CAUSE DISTRIBUTIONS .......................................................... 125 
FIGURE 2.7‐2: POSSIBLE FAULT TREE REPRESENTATION OF A SERIES RELIABILITY BLOCK DIAGRAM ........ 126 
FIGURE 2.7‐3: PDF OF NORMAL DISTRIBUTION WITH MEAN OF 10 AND STANDARD DEVIATION OF 3. ... 128 
FIGURE 2.7‐4: CUMULATIVE NORMAL DISTRIBUTION WITH MEAN OF 10 AND STANDARD DEVIATION OF 3
 .......................................................................................................................................................... 128 
FIGURE 2.7‐5: VALUE SELECTION FROM A DISTRIBUTION ......................................................................... 129 
FIGURE 2.7‐6: VALUE SELECTION FROM A WEIBULL DISTRIBUTION .......................................................... 130 
FIGURE 2.7‐7: RELIABILITY BLOCK DIAGRAM OF REDUNDANT EXAMPLE .................................................. 131 
FIGURE 2.7‐8: SYSTEM MONTE CARLO EXAMPLE....................................................................................... 131 
FIGURE 2.7‐9: MONTE CARLO SIMULATION OF EXAMPLE SYSTEM ........................................................... 132 
FIGURE 3.1‐1:  DISCRETE PROBABILITY DISTRIBUTION ............................................................................... 135 
FIGURE 3.1‐2:  CONTINUOUS PROBABILITY DISTRIBUTION ........................................................................ 136 
FIGURE 3.2‐1: EXAMPLES OF CORRELATION COEFFICIENTS ....................................................................... 142 
FIGURE 3.2‐2: VENN DIAGRAM OF MUTUALLY EXCLUSIVE EVENTS ........................................................... 144 
FIGURE 3.2‐3: INDEPENDENT EVENTS ........................................................................................................ 145 
FIGURE 3.2‐4: FAULT TREE OR GATE .......................................................................................................... 147 
FIGURE 3.2‐5: RELIABILITY BLOCK DIAGRAM FOR AN OR GATE ................................................................. 147 
FIGURE 3.2‐6: FAULT TREE AND GATE ........................................................................................................ 148 
FIGURE 3.2‐7: RELIABILITY BLOCK DIAGRAM FOR AN AND GATE .............................................................. 149 
FIGURE 3.2‐8: FAULT TREE OF AN AND/OR COMBINATION ....................................................................... 150 
FIGURE 3.2‐9: RBD OF AND/OR COMBINATION ......................................................................................... 150 
FIGURE 3.3‐1: SHAPES OF FAILURE DENSITY AND RELIABILITY FUNCTIONS OF COMMONLY USED DISCRETE 
DISTRIBUTIONS (FROM MIL‐HDBK‐338B) ......................................................................................... 157 
FIGURE 3.3‐2: SHAPES OF FAILURE DENSITY, RELIABILITY AND HAZARD RATE FUNCTIONS FOR COMMONLY 
USED CONTINUOUS DISTRIBUTIONS (FROM MIL‐HDBK‐338B) ........................................................ 158 
FIGURE 3.3‐3:  EXAMPLE PDF PLOTS FOR THE WEIBULL DISTRIBUTION .................................................... 164 
FIGURE 3.3‐4:  EXAMPLE HAZARD RATE PLOTS FOR THE WEIBULL DISTRIBUTION .................................... 164 
FIGURE 3.3‐5:  EXAMPLE PROBABILITY PLOTS FOR WEIBULL DISTRIBUTION ............................................. 165 
FIGURE 3.3‐6: EXAMPLE PDF PLOTS FOR THE LOGNORMAL DISTRIBUTION .............................................. 167 
FIGURE 3.3‐7: EXAMPLE HAZARD RATE PLOTS FOR THE LOGNORMAL DISTRIBUTION .............................. 168 
FIGURE 3.3‐8: EXAMPLE PROBABILITY PLOTS FOR THE LOGNORMAL DISTRIBUTION ............................... 168 
FIGURE 4.0‐1: THE DOE CONCEPT .............................................................................................................. 171 
FIGURE 4.3‐1: POSSIBLE RESPONSE‐FACTOR LEVEL RELATIONSHIP ........................................................... 173 
FIGURE 4.4‐1: DOE TERMINOLOGY ............................................................................................................ 174 
FIGURE 4.4‐2: ONE‐FACTOR‐AT‐A‐TIME EXPERIMENTS ............................................................................. 176 
FIGURE 4.4‐3: STANDARD DOE NOMENCLATURE ...................................................................................... 177 
FIGURE 4.4‐4: POTENTIAL INTERACTIONS .................................................................................................. 178 
FIGURE 4.6‐1: ANALYSIS OF MEANS ........................................................................................................... 182 
FIGURE 4.6‐2: LINEARIZATION OF THE ARRHENIUS RELATIONSHIP ........................................................... 182 
FIGURE 4.6‐3: OPTIMAL FACTOR SETTINGS................................................................................................ 183 
FIGURE 5.4‐1: LIKELIHOOD CONTOUR EXAMPLE........................................................................................ 220 
FIGURE 6.1‐1: BATHTUB CURVE ................................................................................................................. 223 
FIGURE 6.2‐1: EXAMPLE OF NON‐MONOMODAL DISTRIBUTION .............................................................. 228 
FIGURE 6.2‐2: MULTIMODAL DISTRIBUTION EXAMPLE 1 ........................................................................... 229 
FIGURE 6.2‐3: MULTIMODAL DISTRIBUTION EXAMPLE 2 ........................................................................... 230 
FIGURE 6.2‐4: MULTIMODAL DISTRIBUTION EXAMPLE 3 ........................................................................... 231 
FIGURE 6.2‐5: MULTIMODAL DISTRIBUTION EXAMPLE 4 ........................................................................... 232 
FIGURE 6.2‐6: MULTIMODAL DISTRIBUTION EXAMPLE 5 ........................................................................... 233 
FIGURE 6.2‐7: MULTIMODAL DISTRIBUTION EXAMPLE OF POOLED DATA SET ......................................... 234 
FIGURE 6.2‐8: AGE AT DEATH DATA ........................................................................................................... 235 
FIGURE 6.2‐9: PDF OF MULTIMODE DISTRIBUTION OF AGES .................................................................... 236 
FIGURE 6.2‐10: FAILURE RATE OF AGE DATA ............................................................................................. 236 
FIGURE 6.2‐11: PROBABILITY PLOT OF AGE DATA ...................................................................................... 237 
FIGURE 6.2‐12: SINGLE MODE WEIBULL FIT TO THE AGE DATA ................................................................. 238 
FIGURE 6.3‐1: SOURCES OF ERROR IN EMPIRICAL MODELS ....................................................................... 241 
FIGURE 6.3‐2: CONFIDENCE LEVEL THROUGH PREDICTION, ASSESSMENT AND ESTIMATION .................. 243 
FIGURE 6.6‐1: WEIBAYES EXAMPLE ............................................................................................................ 246 
FIGURE 6.13‐1: NOMINAL FAILURE CAUSE DISTRIBUTION OF ELECTRONIC SYSTEMS ............................... 254 
 

FIGURE 6.13‐2: IPO MODEL ........................................................................................................................ 256 
FIGURE 6.13‐3: RELATIONSHIP BETWEEN ABSOLUTE AND RELATIVE HUMIDITY....................................... 259 
FIGURE 6.14‐1: ESTIMATED UPPER BOUND FAILURE RATES VS OPERATING TIME AT 60 AND 90% 
CONFIDENCE ..................................................................................................................................... 260 
FIGURE 7.1‐1: MIL‐HDBK‐217 MODEL DEVELOPMENT METHODOLOGY ................................................... 265 
FIGURE 7.2‐1: FAILURE CAUSE DISTRIBUTION OF ELECTRONIC SYSTEMS .................................................. 275 
FIGURE 7.2‐2: OPTICAL AMPLIFIER FAILURE CAUSE DISTRIBUTION ........................................................... 277 
FIGURE 7.2‐3:  ΠG VS. TIME AND GROWTH RATES ..................................................................................... 291 
FIGURE 7.2‐4: MODEL DEVELOPMENT METHODOLOGY FLOWCHART ...................................................... 306 
FIGURE 7.2‐5: DISTRIBUTION OF LOG10 PREDICTED/OBSERVED FAILURE RATE RATIO FOR ALL DATA .... 323 
FIGURE 7.2‐6: DISTRIBUTION OF LOG10 PREDICTED/OBSERVED RATIO FOR FIELD DATA ONLY ............... 324 
FIGURE 7.2‐7: DISTRIBUTIONS OF THE PREDICTED/OBSERVED FAILURE RATE RATIO FOR ALL DATA AND 
FOR FIELD DATA ONLY ...................................................................................................................... 324 
FIGURE 7.3‐1: TIMES TO FAILURE DISTRIBUTIONS ..................................................................................... 354 
FIGURE 7.3‐2: PROBABILITY OF FAILURE VS. TEMPERATURE AND RELATIVE HUMIDITY AT 50,000 HOURS
 .......................................................................................................................................................... 357 
FIGURE 7.4‐1: APPARENT FAILURE RATE FOR REPLACEMENT UPON FAILURE........................................... 362 
FIGURE 7.4‐3:  EXAMPLE OF PART DETAIL ENTRIES ................................................................................... 374 
FIGURE 8.1‐1: TWO BASIC TYPES OF FMEA ................................................................................................ 378 
FIGURE 8.4‐1: FMEA PROCESS FLOW ......................................................................................................... 386 
FIGURE 8.7‐1: FAILURE CAUSE‐MODE EFFECT RELATIONSHIP ................................................................... 390 
FIGURE 8.10‐1: FAILURE CAUSE, MODE AND EFFECT HIERARCHY ............................................................. 393 
FIGURE 8.10‐2: FAILURE CAUSES ................................................................................................................ 395 
FIGURE 8.11‐1: OCCURRENCE DEFINITIONS ............................................................................................... 399 
FIGURE 8.11‐2: OCCURRENCE GUIDELINES ................................................................................................ 400 
FIGURE 8.11‐3: DETECTABILITY DEFINITIONS ............................................................................................. 402 
FIGURE 8.11‐4: LIFE CYCLE VS DETECTABILITY DIMENSION ....................................................................... 403 
FIGURE 8.13‐1: POTENTIAL CORRECTIVE ACTIONS .................................................................................... 407 
FIGURE 8.15‐1: QFD‐TO‐FMEA LINKS ......................................................................................................... 408 
FIGURE 8.15‐2: QFD‐FMEA ......................................................................................................................... 410 


List of Tables
Page
TABLE 1.3‐1:  RANGES OF POTENTIAL CUSTOMER REACTIONS...................................................................... 8 
TABLE 2.2‐1:  RELIABILITY ASSESSMENT PURPOSES ..................................................................................... 24 
TABLE 2.2‐2:  PROGRAM PHASE VS. RELIABILITY ASSESSMENT PURPOSE ................................................... 25 
TABLE 2.3‐1:  EXAMPLES OF INITIAL CONDITIONS, STRESSES AND MECHANISMS ...................................... 30 
TABLE 2.3‐2:  RELATIONSHIP BETWEEN CAUSE, MODE AND EFFECT. .......................................................... 31 
TABLE 2.5‐1:  SUMMARY OF RELIABILITY ASSESSMENT OPTIONS ............................................................... 39 
TABLE 2.5‐1:  SUMMARY OF ASSESSMENT OPTIONS (CONTINUED) ............................................................ 40 
TABLE 2.5‐2: RELEVANCY OF APPROACH TO PREDICTION, ASSESSMENT AND ESTIMATION....................... 41 
TABLE 2.5‐3:  IDENTIFICATION OF APPROPRIATE APPROACHES BASED ON THE PURPOSE ......................... 43 
TABLE 2.5‐4:  RANKING THE ATTRIBUTES OF EMPIRICAL DATA ................................................................... 44 
TABLE 2.5‐5:  EVT, DVT AND PVT PURPOSE AND APPROACH ....................................................................... 47 
TABLE 2.5‐6:  RELIABILITY DEMONSTRATION EXAMPLE ............................................................................... 50 
TABLE 2.5‐7:  EXAMPLE OF A QUALIFICATION PLAN FOR AN ASSEMBLY ..................................................... 57 
TABLE 2.5‐8:  QUALIFICATION EXAMPLE FOR A LASER DIODE ..................................................................... 58 
TABLE 2.5‐9:  STRESS PROFILE OPTION ADVANTAGES AND DISADVANTAGES ............................................. 68 
TABLE 2.5‐10: SIMILARITY ANALYSIS ............................................................................................................ 80 
TABLE 2.5‐11: DIGITAL CIRCUIT BOARD FAILURE RATES (IN FAILURES PER MILLION PART HOURS) ........... 83 
TABLE 2.5‐12: TEST CONDITIONS ............................................................................................................... 111 
TABLE 2.5‐13: DATA TO ESTIMATE DIFFUSION RATE ................................................................................. 112 
TABLE 2.5‐14: PREDICTED LIFETIMES VS. OBSERVED ................................................................................. 113 
TABLE 3.1‐1:  PROBABILITY DISTRIBUTION NOTATION & MATHEMATICAL REPRESENTATIONS ............... 141 
TABLE 3.2‐1: COMBINATIONS EXAMPLE .................................................................................................... 143 
TABLE 3.2‐2: COMBINATIONS OF AN OR CONFIGURATION ....................................................................... 147 
TABLE 3.2‐3: COMBINATIONS OF AN AND CONFIGURATION ..................................................................... 149 
TABLE 3.2‐4: EXAMPLE OF “K‐OUT‐OF‐N” PROBABILITY CALCULATIONS................................................... 151 
TABLE 3.2‐5: EXAMPLE OF “2‐OUT‐OF‐3” REQUIRED FOR SUCCESS .......................................................... 152 
TABLE 3.3‐1:  PROBABILITY DISTRIBUTIONS APPLICABLE TO RELIABILITY ENGINEERING .......................... 154 
TABLE 3.3‐2:  EXPONENTIAL DISTRIBUTION PARAMETERS ........................................................................ 160 
TABLE 3.3‐3:  CONFUSING TERMINOLOGY OF THE WEIBULL DISTRIBUTION ............................................. 162 
TABLE 3.3‐4:  WEIBULL DISTRIBUTION PARAMETERS ................................................................................ 163 
TABLE 4.3‐1: POSSIBLE CONCLUSIONS FOR A NON‐LINEAR RESPONSE‐FACTOR RELATIONSHIP ............... 173 
TABLE 4.4‐1: FULL‐FACTORIAL EXAMPLE .................................................................................................... 175 
TABLE 4.4‐2: FULL AND HALF FACTORIAL EXAMPLE FOR CORROSION ...................................................... 179 
TABLE 5.2‐1:  TERMINOLOGY USED IN PARAMETER ESTIMATION ............................................................. 187 
TABLE 5.2‐2:  TECHNIQUES FOR PARAMETER ESTIMATION ....................................................................... 188 
TABLE 5.2‐3:  PARAMETERS TYPICALLY ESTIMATED FROM STATISTICAL DISTRIBUTIONS ......................... 189 
TABLE 5.2‐4:  CONFIDENCE BOUNDS FOR THE POISSON DISTRIBUTION ................................................... 200 
TABLE 5.2‐5:  CONFIDENCE BOUNDS FOR THE BINOMIAL DISTRIBUTION ................................................. 201 
TABLE 5.2‐6:  CONFIDENCE BOUNDS FOR THE EXPONENTIAL DISTRIBUTION ........................................... 202 
TABLE 5.2‐8:  CONFIDENCE BOUNDS FOR THE NORMAL DISTRIBUTION ................................................... 203 
TABLE 5.3‐10:  CONFIDENCE BOUNDS FOR THE WEIBULL DISTRIBUTION ................................................. 205 

TABLE 6.1‐1: CATEGORIES OF FAILURE EFFECTS ........................................................................................ 227 
TABLE 6.2‐2: BIMODAL POPULATION EXAMPLE 1 ...................................................................................... 229 
TABLE 6.2‐3: BIMODAL POPULATION EXAMPLE 2 ...................................................................................... 230 
TABLE 6.1‐4: BIMODAL POPULATION EXAMPLE 3 ...................................................................................... 231 
TABLE 6.1‐5: BIMODAL POPULATION EXAMPLE 4 ...................................................................................... 232 
TABLE 6.1‐6: BIMODAL POPULATION EXAMPLE 5 ...................................................................................... 233 
TABLE 6.1‐7: FOUR MODE WEIBULL DISTRIBUTION PARAMETERS ............................................................ 235 
TABLE 6.3‐1: FAILURE RATE UNCERTAINTY LEVEL MULTIPLIERS ................................................................ 242 
TABLE 6.9‐1: EXAMPLE OF COMBINING DIFFERENT TYPES OF MODELS........................................................ 248 
TABLE 6.13‐1: FACTORS TO BE CONSIDERED IN A RELIABILITY MODEL ..................................................... 256 
TABLE 6.13‐2:  FAILURE RATE DATA SUMMARY ......................................................................................... 258 
TABLE 7.1‐1: DATA COLLECTED FOR MODEL DEVELOPMENT .................................................................... 269 
TABLE 7.1‐2: DATA TRANSFORMS .............................................................................................................. 270 
TABLE 7.1‐3: REGRESSION DATA INCLUDING CATEGORICAL VARIABLES ................................................... 271 
TABLE 7.2‐1: UNCERTAINTY LEVEL MULTIPLIER ......................................................................................... 282 
TABLE 7.2‐2: PERCENTAGE OF FAILURES ATTRIBUTABLE TO EACH FAILURE CAUSE .................................. 283 
TABLE 7.2‐3: WEIBULL PARAMETERS FOR FAILURE CAUSE PERCENTAGES ................................................ 283 
TABLE 7.2‐4:  MULTIPLIERS AS A FUNCTION OF PROCESS GRADE ............................................................. 284 
TABLE 7.2‐5: EXAMPLE OF FAILURE MODE‐TO‐FAILURE CAUSE CATEGORY MAPPING ............................. 295 
TABLE 7.2‐6: CAPACITOR PARAMETERS ..................................................................................................... 301 
TABLE 7.2‐7: DEFAULT ENVIRONMENTAL STRESS VALUES ........................................................................ 302 
TABLE 7.2‐8: DEFAULT OPERATING PROFILE VALUES................................................................................. 303 
TABLE 7.2‐9: FAILURE CAUSE SUMMARY FOR CONNECTORS .................................................................... 308 
TABLE 7.2‐10:  FAILURE MODE TO FAILURE CAUSE CATEGORY FOR CONNECTORS (SC AND FC) .............. 309 
TABLE 7.2‐11: FAILURE CAUSE PERCENTAGES FOR CONNECTORS ............................................................. 311 
TABLE 7.2‐12: DATA COLLECTED FOR CONNECTORS.................................................................................. 312 
TABLE 7.2‐13: CATEGORIES OF ACCELERATION MODEL PARAMETERS ...................................................... 315 
TABLE 7.2‐14: ACCELERATION MODEL PARAMETERS ................................................................................ 315 
TABLE 7.2‐15: DEFAULT MODEL PARAMETERS .......................................................................................... 316 
TABLE 7.2‐16: SUMMARY OF PI‐FACTOR CALCULATIONS .......................................................................... 317 
TABLE 7.2‐17: APPLICABILITY OF TEST DATA .............................................................................................. 318 
TABLE 7.2‐18: BASE FAILURE RATES (FAILURES PER MILLION CALENDAR HOURS) .................................... 319 
TABLE 7.2‐19:  PART QUALITY PROCESS GRADE FACTOR QUESTIONS FOR PHOTONIC DEVICE MODELS .. 320 
TABLE 7.2‐20: SUMMARY OF UNCERTAINTY METRICS ............................................................................... 323 
TABLE 7.2‐21: PARAMETERS FOR THE PROCESS GRADE FACTORS ............................................................. 327 
TABLE 7.2‐22.  INDEX OF PROCESS GRADE TYPE QUESTIONS .................................................................... 328 
TABLE 7.2‐23:  DESIGN PROCESS GRADE FACTOR QUESTIONS .................................................................. 330 
TABLE 7.2‐24:  MANUFACTURING PROCESS GRADE FACTOR QUESTIONS ................................................. 336 
TABLE 7.2‐25:  PART QUALITY PROCESS GRADE FACTOR QUESTIONS ....................................................... 340 
TABLE 7.2‐26:  SYSTEM MANAGEMENT PROCESS GRADE FACTOR QUESTIONS ........................................ 342 
TABLE 7.2‐27:  CAN NOT DUPLICATE (CND) PROCESS GRADE FACTOR QUESTIONS .................................. 346 
TABLE 7.2‐28:  INDUCED PROCESS GRADE FACTOR QUESTIONS ............................................................... 347 
TABLE 7.2‐29:  WEAROUT PROCESS GRADE FACTOR QUESTIONS ............................................................. 348 
TABLE 7.2‐30:  GROWTH PROCESS GRADE FACTOR QUESTIONS ............................................................... 349 
TABLE 7.3‐1: PARAMETER LEVELS .............................................................................................................. 350 
TABLE 7.3‐2: TEST PLAN SUMMARY ........................................................................................................... 351 
TABLE 7.3‐3: LIFE TEST RESULTS ................................................................................................................. 352 
TABLE 7.3‐4: TIMES TO FAILURE DISTRIBUTION PARAMETERS .................................................................. 353 
TABLE 7.3‐5: ESTIMATED PARAMETER 80% 2‐SIDED CONFIDENCE BOUNDS ............................................ 356 
TABLE 7.4‐1:  DATA SUMMARIZATION PROCESS ........................................................................................ 359 
TABLE 7.4‐2: TIME AT WHICH ASYMPTOTIC VALUE IS REACHED ............................................................... 363 
TABLE 7.4‐3 α/MTTF RATIO AS A FUNCTION OF β ..................................................................................... 363 
TABLE 7.4‐4: PERCENT FAILURE FOR WEIBULL DISTRIBUTION ................................................................... 364 
TABLE 7.4‐5: FIELD DESCRIPTIONS ............................................................................................................. 367 
TABLE 7.4‐6:  APPLICATION ENVIRONMENTS DEFINED IN NPRD ............................................................... 368 
TABLE 8.7‐1: FAILURE MODE RELATIONSHIP TO TAGUCHI LOSS FUNCTION ............................................. 389 
TABLE 8.8‐1: DIMENSIONS OF FUNCTIONAL SEVERITY .............................................................................. 391 
TABLE 8.8‐2: DIMENSIONS OF SEVERITY .................................................................................................... 392 
TABLE 8.11‐1: CATEGORIES OF FAILURE EFFECTS ...................................................................................... 401 
TABLE 8.11‐2: RECOMMENDED DETECTABILITY RATING CRITERIA ............................................................ 404 


1. Introduction
Few engineering techniques have caused as much controversy in the last several decades
as the topic of reliability prediction. One of the primary reasons for this is the stochastic
nature of reliability. Whereas many engineering disciplines are governed by
deterministic processes, reliability is governed by a complex interaction of stochastic
processes. As a result, the metrics of interest in other engineering disciplines are
generally much more quantifiable by their very nature. While there is always a stochastic
element in any engineering model, the topic of reliability quantification must address its
extreme stochastic nature.

Many highly respected reliability engineering texts treat the topic of reliability modeling
thoroughly and in great detail. Included in these texts are detailed ways to model system
reliability using techniques like Failure Modes and Effects Analysis (FMEA), Fault Tree
Analysis (FTA), Markov models, fault tolerant design techniques, etc. These texts,
however, often gloss over a fundamental requirement for effectively utilizing such
techniques: the ability to quantify the reliability of the constituent components and
subsystems comprising the system.

The intent of this book is to provide guidance on reliability modeling techniques that can
be used to quantify the reliability of a product or system. In this context, reliability
modeling is the process of constructing a mathematical model that is used to estimate the
reliability characteristics of an item. There are many ways in which this can be
accomplished, depending on the item and the type of information that is available to, or
practical to obtain by, the analyst. This book will review possible approaches, summarize
their advantages and disadvantages, and provide guidance on selecting a methodology
based on specific goals and constraints. While this book will not discuss the use of
specific published methodologies, in cases where examples are provided, tools and
methodologies with which the author has personal experience in their development are
used, such as life modeling, NPRD, MIL-HDBK-217 and 217Plus.

The Reliability Information Analysis Center (RIAC) has, in the past, prepared many
documents relating to different reliability engineering techniques, such as FMEA, FTA,
Worst Case Analysis (WCA), etc. However, one noteworthy omission from this list is
reliability modeling. This, coupled with (1) the RIAC’s history of providing reliability
modeling data and solutions, and (2) the need to objectively address some of the
confusion and misconceptions related to this topic, formed the inspiration for this book.


In years past, DoD contracts would require that specific reliability prediction
methodologies, usually MIL-HDBK-217, be used. This resulted in system developers having very little
flexibility in applying different reliability prediction practices. Since the DoD has not,
until very recently, supported updates to MIL-HDBK-217, companies were encouraged
to use best practices in quantifying product reliability. The difficult question to be
addressed is “what are the best practices that should be used?” This book attempts to
provide guidance on selecting an appropriate methodology based on the specific
conditions and constraints of the company and its products or systems.

It is hoped that the author’s experience gained by attempting many different reliability
assessment approaches, including physics-based and empirical approaches, can be used
to the advantage of the reader in a practical way.

1.1. Scope
The intent of a reliability program is to identify and mitigate failure modes/mechanisms,
verify their removal through reliability testing, implement corrective actions for
“discovered” failures, and maintain reliability levels after reliability has been designed in.
These correspond to the goals of designing-in reliability, reliability growth, and ensuring
ongoing reliability, respectively, as illustrated in Figure 1.1-1.

Figure 1.1-1: Phases of a Reliability Program



The cost to an organization increases exponentially as a function of when failure causes
are discovered, as illustrated in Figure 1.1-2. It is most efficient to discover failure
modes and mechanisms as early as possible, when they can be effectively mitigated. If
failure modes and mechanisms are discovered late in development or, worse, in the field,
organizations can be faced with staggering costs associated with corrective actions.

Figure 1.1-2: Relative Cost of Failures vs. Phase

The use of reliability engineering techniques early in the development cycle of a system
is critical to achieving high reliability. An important part of these efforts is the modeling
of reliability before the product or system is fielded.

The term “Reliability Prediction” has had a relatively narrow connotation, primarily
associated with “handbook” approaches. This document attempts to take a broader view
of this topic by investigating the various approaches for quantifying reliability, and their
effectiveness when used to achieve specific objectives. For this reason, the book is
entitled “Reliability Modeling – the RIAC Guide to Reliability Prediction, Assessment
and Estimation”. The definitions of these are:

Prediction - something that is predicted, forecasted

Assessment - to determine the importance, size, or value of

Estimation - a tentative evaluation or rough calculation, as of worth, quantity, or size


Predictions are performed very early, before there is any empirical data on the item under
analysis. Reliability assessments are made to determine the effects of certain factors on
reliability and to identify failure causes. Reliability estimates are made based on
empirical data. This book covers all three areas, as illustrated in Figure 1.1-3.

Figure 1.1-3: Reliability Prediction, Assessment and Estimation
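To make the distinction concrete, consider a simple illustrative estimate (the numbers here are hypothetical): under an assumed constant failure rate, a reliability estimate from field data with r observed failures over T cumulative operating hours is simply λ̂ = r/T. For example, 10 failures observed over 2,000,000 part-hours gives λ̂ = 10/2,000,000 = 5 failures per million hours.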

Figure 1.1-4 summarizes the results of a benchmarking study of best commercial
reliability practices (Reference 9). In this study, reliability predictions were identified by
more than 90% of the participants as being an appropriate reliability task during the
product/system development life cycle. However, only approximately 70% of the survey
respondents felt that reliability predictions were effective, supporting the proposition
that, while generally perceived as beneficial, there are problems associated with their
use. This information highlights the importance that organizations often place on
assessing and predicting reliability.


Figure 1.1-4: Percent of Companies Using Reliability Engineering Tools

1.2. Book Organization


Chapter 1 of this book presents background information on reliability modeling. The
next section of this chapter includes a description of a typical reliability program, the
intent of which is to present the elements that should be considered when developing a
program, and to highlight how reliability modeling fits into such a program. Also
included is a section on the history of reliability prediction, to provide a historical
perspective of its evolution.

Chapter 2 covers the primary topic of this book, and includes information on the various
ways in which a product can be modeled and guidance on selecting an approach. It
presents a generic approach, and describes the elements of this approach.

Chapter 3 presents fundamental concepts of reliability theory, probability and statistics. In many books, these topics are presented first. However, in this book, they are presented after the primary topic (Chapter 2) because they are not the primary focus. Rather, they are presented to provide the fundamental foundation for the concepts used in reliability modeling. They are also the foundation for the Design of Experiments (DOE) and Life Modeling techniques, which are further detailed in Chapters 4 and 5.


Approaches like using a “Multi-cell”-based designed experiment to generate data from which a life model is developed are presented in Chapter 2. Here, a generic approach to
this topic is presented. Since the topic of life modeling is central to reliability modeling,
important elements of it are presented in more detail in Chapters 4 and 5. One of the
critical aspects of life modeling is reliability testing.

Design of Experiments is a technique to maximize the usefulness of the data resulting from DOE tests, and is the topic of Chapter 4.

Chapter 5 presents information relative to development of the mathematical models that form the basis of the reliability model, and includes information pertaining to parameter
estimation.

Chapter 6 presents a variety of topics pertaining to the interpretation of reliability models. This is provided to allow the reader to gain a better appreciation for what can, and cannot,
be concluded from a model.

Chapter 7 is a compilation of examples of reliability models. Presented here are the following examples:

1. A typical MIL-HDBK-217 model development process
2. Information on the development of the RIAC’s 217Plus methodology
3. A life modeling example
4. A description of RIAC’s Nonelectronic Parts Reliability Data (NPRD), provided
as an example of the use of field data in reliability modeling

These examples are provided to give the reader a better appreciation for the tools,
techniques and limitations of various approaches to reliability modeling.

A discussion of FMEA is presented in Chapter 8. Although FMEA is secondary to the primary intent of this book, it can form the basis for many elements of a reliability
program, including reliability modeling. Therefore, Chapter 8 is intended to present
FMEA concepts in this context, as well as provide practical information on performing
FMEAs that this author has found to be useful.


1.3. Reliability Program Elements


To provide perspective on how reliability modeling fits into a reliability program, this section presents a generic reliability program, with a description of its various elements.

There are many possible approaches to “designing in” reliability. The specific approach
used will depend on the needs of the specific organization. Figure 1.3-1 presents one
possible approach, and includes the elements that should be included in all approaches.
The premise of this approach is to identify the critical parts and materials which warrant detailed attention. Since it is impractical to perform some reliability modeling approaches on all system parts, it is imperative to identify the critical parts that pose the highest risk. Since one of the most effective ways to verify the robustness of parts or
materials is from experience, an effective reliability program must leverage knowledge
gained in the development and deployment of previous systems. It will be shown that
reliability assessments impact many of the elements of this approach.

Figure 1.3-1: Example Reliability Program Approach


Elements of the reliability program are summarized as follows:

1. Design requirements: The first step in any product development process is the
identification of requirements. These requirements include items pertaining to
Performance, Reliability (failure rate, life), Maintainability, Diagnostics, and Use
Environment and Operational stresses (i.e., mission profiles). Typically, the medium for
communicating these requirements is the product specification. While the specification
usually contains details regarding the required performance of the product or system, it is
often lacking relative to quantifying the reliability attributes required. The following
questions should be answered to determine these reliability requirements:

• What is the required failure rate of the item in its useful life?
• What is the service life required?
• What criteria will be used to determine when the requirements are not met?
• Whose responsibility will it be to take corrective action if these requirements are
not met?
• What are the operating and environmental profiles expected in field deployed
conditions?

A valuable tool to assist in understanding the requirements is Quality Function Deployment (QFD).

The reliability that is considered acceptable will, of course, be specific to the industry,
criticality of failure, etc. The specific value may be specified, or it may not be,
depending on the industry and the maturity of the product. The range of potential customer reactions to various scenarios is summarized in Table 1.3-1.

Table 1.3-1: Ranges of Potential Customer Reactions

Outcome | Field Reliability | Likely Customer Reaction
Best    | No failures | Pleased
        | Failures occur at an acceptable rate | Tolerant
        | Recurring failures, but on a relatively small percent of items | Annoyed
        | Recurring failures on a high percent of items | Angry
Worst   | An unexpected failure mechanism is discovered that will affect the entire population, or critical safety-related failures | Legal action, loss of business


If the requirement is not specified, an estimate of the requirement must be made so that
there is a goal that can be used in the development process.

2. Initial Design: After the product requirements are understood, the design team
generally derives an initial, or preliminary, design for the product or system. Inputs to
this initial design should be in the form of design rules and a standard parts list. Design rules are the culmination of lessons learned from previous development activities, from both empirical data (field and test) and from analysis. These design rules should be a living
document which is continuously updated based on current information. Effective use of
design rules also saves much effort since reliability attributes which have a reliability
history or which have been previously studied do not need to be addressed in detail, thus freeing resources to be applied to the study of critical parts.

3. Similarity analysis: Once an initial design is available, a similarity analysis can be performed to identify attributes which are similar to those for which a reliability history is
available, and those for which it is not. A FMEA can be a valuable technique for this
analysis, and will be discussed later. In this analysis, each reliability attribute identified
in the FMEA is reviewed to determine if a reliability history exists or not.

4. Identify attributes that are similar: Similar attributes are those that have a reliability
history.

5. Assess robustness of attribute: If the part or attribute does have a history, previous test
data or field experience data can be used to assess the robustness of the part or attribute.

6. Identify attributes that are not similar: Attributes that are not similar do not have a
reliability history.

7. Perform design analysis: Although any attribute that is potentially different in the new
design relative to the previous design must be analyzed, particular attention is given to
the attributes that are not similar. Design techniques that are used for this purpose are
FMEA, tolerance or worst case analysis, thermal analysis, stress analysis, and reliability
predictions.

8. Implement corrective action: From the results of the design analysis, corrective action
should be taken to improve the robustness of the design.

9. Identify critical parts/materials: Based on the results of the analysis, critical parts or
materials are identified.


10. Model critical parts/materials: Once critical parts are identified, action must be taken
to ensure that the parts or materials are robust enough to meet the reliability and
durability requirements. More details of the approach used for this purpose will be
presented later in the book.

11. Identify effective tests for non-similar attributes: Based on the identification of
critical parts and the design analysis that was performed, specific tests that will assess the
reliability and durability of the attribute can be determined. Part of the FMEA should
include identification of stresses that will accelerate the attribute under analysis and
therefore, this analysis is important for identifying the appropriate stress tests.

12. Develop a test plan and execute tests: Based on the design analysis performed and
the identification of tests for non-similar attributes, a test plan can be determined. In the
context of this approach, the goal of these tests is to assess the robustness of the product
by subjecting the product to test stresses that are intended to accelerate the critical parts
and non-similar attributes to failure. In addition to these tests, other test requirements
should be incorporated into this test plan. These additional test requirements include any
tests required by the customer, such as qualification or reliability demonstration tests.

13. Document the test results: Once the tests have been performed and the data analyzed,
the results should be fully documented, since they subsequently will be used for a variety
of purposes.

14. Monitor field reliability: Once the product is deployed, field reliability experience
data should be carefully gathered, since it will be used for a variety of purposes.
Elements of the data to be gathered include:

1. Product or system deployment history by serial number, including when deployed and when fielded
2. Failure information, including failure date, root failure cause, results of failure
analysis
3. Product or system re-deployment information

15. Update reliability database: A database is required to manage the reliability data, and
should include both test data and field data. This data can be used to generate a
company-specific reliability prediction methodology.


16. Update Design Rules: Data acquired from tests and field surveillance should be used
to update the design rules. Field data is probably the most valuable type of data for this
purpose since it represents the actual product or system in the intended use environment.
The process of maintaining design rules and ensuring that they are used in new designs is
the cornerstone of the means by which reliability is improved in a reliability growth
process.

Critical parts are those which may result in a significant risk to the project. This risk can
be related to reliability, lifetime, availability, or maintainability. Some of the factors that
constitute critical parts are:

• New, unproven technology


• New, unproven manufacturing processes
• Performance limitations: stringent environmental conditions or non-robust design
practices
• Reliability limitations: components/materials with life limitations
• Vendors with a past history of delivery, cost performance or reliability problems
• Old technology with availability problems

These critical parts or items warrant additional attention in assessing their reliability, as
they generally will represent the greatest reliability risk.

1.4. The History of Reliability Prediction


The term “reliability prediction” has historically been used to denote the process of
applying mathematical models and data for the purposes of estimating field reliability of
a product or system before empirical data is available on that product or system. This
section will review some of the developments in the area of reliability prediction from the
1950’s to the present. While there are several techniques available to reliability
practitioners to perform reliability predictions, the discussion inevitably centers around
MIL-HDBK-217 due to its historical prominence as a reliability prediction tool.

During World War II, electronic tubes were by far the most unreliable component used in
DoD electronic systems. This observation led to various studies and ad hoc groups
whose purpose was to identify ways that their reliability, and the reliability of the systems
in which they operated, could be improved. One group in the early 1950’s concluded
that:

1. There needs to be better reliability data collected from the field


2. Better components need to be developed

3. Quantitative reliability requirements need to be established


4. Reliability needs to be verified by test before full scale production
5. A permanent committee needs to be established to guide the reliability discipline

Item 5, above, was implemented in the form of the Advisory Group on Reliability of
Electronic Equipment (AGREE), whose charter was to identify actions that could be
taken to provide more reliable electronic equipment. This time period was the advent of
the reliability engineering discipline. It soon became clear that the emerging discipline
was using several different methods to achieve its goal of higher reliability. One was the
identification of root causes of field failure and determination of mitigating actions.
Another was the specification of quantitative reliability requirements. The specification
of requirements in turn led to the desire to have a means of estimating reliability before
equipment is built and tested, so that the probability of achieving its reliability goal could be estimated. This, of course, was the beginning of reliability prediction. The 1950’s also saw much pioneering work in the reliability discipline, including:

• A variety of efforts to improve device reliability through data collection and design
• The establishment of reliability programs
• Symposiums devoted to quality and reliability engineering
• Statistical techniques development such as the Weibull distribution
• Military handbooks that provided guidance on the reliable application of
electronic components

In addition to these accomplishments, the 50’s also included pioneering work in the area
of quantitative reliability prediction. In 1956, RCA released TR-1100, “Reliability Stress
Analysis for Electronic Equipment”, which presented mathematical models for the
estimation of component failure rates. This report turned out to be the predecessor of
MIL-HDBK-217.

Several additional early works in the area of reliability prediction were produced in the
early 1960’s, including D.R. Earles’ report (Reference 2) and the Earles and Eddins paper (Reference 3). In 1962, the first version of MIL-HDBK-217 was published by the Navy. Once issued, MIL-HDBK-217 quickly became the standard by which reliability
predictions were performed, and other sources of failure rates gradually disappeared.
Part of the reason for the demise of other sources was the fact that MIL-HDBK-217 was
often a contractually cited document and defense contractors did not have the option of
using other sources of data.


These early sources of failure rates also often included design guidance on the reliable
application of electronic components. However, subsequent versions of the documents,
primarily MIL-HDBK-217, would delete the application information because it was
treated in more detail elsewhere.

By now, the reliability discipline was working under the tenet that reliability was a
quantitative discipline that needed quantitative data sources to support its many
statistically based techniques, such as allocations and redundancy modeling. However,
another branch of the reliability discipline focused on the physical processes by which
components were failing. The first symposium devoted to this topic was the “Physics of
Failure In Electronics” Symposium sponsored by the Rome Air Development Center
(RADC) and IIT Research Institute (IITRI) in 1962 [1]. This symposium later became
known as the International Reliability Physics Symposium (IRPS). In this period of time,
the two branches of reliability engineering seemed to be diverging, with the “systems”
engineers devoted to the tasks of specifying, allocating, predicting and demonstrating
reliability, while the physics-of-failure (PoF) engineers and scientists were devoting their
efforts to identifying and modeling the physical causes of failure. Both branches were
integral parts of the reliability discipline, and both were hosted at RADC (later to become
Rome Laboratory). The physics-based information was necessary to develop part
qualification, screening and application requirements, and the “systems” tasks of
specifying, allocating, predicting and demonstrating reliability were necessary to ensure
that reliability requirements were met. The component research efforts of the 1950’s and
1960’s culminated with the implementation of the “ER” and “TX” families of
specifications. This complicated the issue of predicting their reliability because there
were now many different combinations of quality levels and environments that needed to
be addressed in MIL-HDBK-217.

In the early 1970’s, the responsibility for preparing MIL-HDBK-217 was transferred to
RADC, who published revision B in 1974. However, other than the transition to RADC,
the 1970’s maintained the status quo in the area of reliability prediction. MIL-HDBK-
217 was updated to reflect the technology at that time, but there were few other efforts
that changed the manner in which predictions were performed. One exception, however,
was that there was a shift in the complexity of the models being developed for MIL-
HDBK-217. There were several efforts to develop new and innovative models for
reliability prediction. The results of these efforts were extremely complex models that
may have been technically sound, but were criticized by the user community as being too complex, too costly, and unrealistic given the low level of detailed design information available at the point in time when the models were needed. RCA, under contract to RADC, had developed PoF-based models which were rejected as unusable, since the detailed design and construction data for microcircuits were simply unavailable to typical model users. These models were never incorporated into MIL-HDBK-217.

[1] IITRI was the original contractor of the Reliability Analysis Center (RAC). In 2005, the RAC contract was awarded as RIAC to the current team of Wyle Labs (prime), Quanterion Solutions Incorporated, the University of Maryland Center for Risk and Reliability, the Pennsylvania State Applied Research Laboratory (ARL), and the State University of New York Institute of Technology (SUNYIT).

While MIL-HDBK-217 was updated again several times in the 1980’s, there were
agencies that were developing reliability prediction models unique to their industries. As
an example, the automotive industry, under the auspices of the Society of Automotive
Engineers (SAE) Reliability Standards Committee, developed a series of models specific
to automotive electronics. The SAE committee felt that there was no existing prediction
methodologies that were applicable to the specific quality levels and environments of
automotive applications. The Bellcore reliability prediction standard is another example
of a specific industry developing methodologies for their unique conditions and
equipment. It originally was developed by modifying MIL-HDBK-217 to better reflect
the conditions of interest of the telecommunications industry. It has since taken on its
own identity with models derived from telecommunications equipment and is now used
widely within that industry.

The 1980’s also saw explosive growth in integrated circuit technology. Very dense
circuits were being fabricated using feature sizes as small as 0.5 microns. This presented
unique challenges to reliability modelers. The VHSIC (Very High Speed Integrated
Circuit) program was the government’s attempt to leverage from the technological
advancements of the commercial industry and, at the same time, produce circuits capable
of meeting the unique requirements of military applications. From the VHSIC program
came the Qualified Manufacturers List (QML) - a qualification methodology that
qualified an integrated circuit manufacturing line, unlike the traditional qualification of
specific parts. The government realized that it needed a QML-like process if it were to
leverage from the advancements in commercial technologies and, at the same time, have
a timely and effective qualification scheme for military parts. A reliability prediction
model was also developed for VHSIC devices in 1989 (Reference 8) in support of a MIL-
HDBK-217 update. An interesting observation was made during that study that deviated
from the premise on which most of the MIL-HDBK-217 models were based. The
traditional approach to developing models was to collect as much field failure rate data as
possible, statistically analyze it, and quantify model factors based on the results of the
statistical analysis. For integrated circuits, one of the factors that was quantified was
inevitably device complexity. This complexity was measured by the number of gates or
transistors and was the primary factor on which the models were based. The correlation
between failure rate and complexity was strong and could be quantified because the failure rate of circuits was much higher than it is today and the defect rate was
directly proportional to the complexity. As technology has advanced, the gate or
transistor count became so high that it could no longer effectively be used as the measure
of complexity in a reliability model. Furthermore, transistor or gate count data was often
difficult or impossible to obtain. Therefore, the model developed for VHSIC
microcircuits needed another measure of complexity on which to base the model. The
best measures, and the ones most highly correlated to reliability are defect density and
silicon area. It can be shown that the failure rate (for small cumulative percent failure) is
directly proportional to the product of the area and defect density. However, another
factor that is highly correlated to defect density and area is the yield of the die, or the
percent of die that are functional upon manufacture. Ideally, a reliability model would
use either yield or defect density/area as the primary factor(s) on which to base the
model. The problem in using these factors in a model is that they are considered highly
proprietary parameters from a market competition viewpoint and, therefore, are rarely
released by the manufacturers. Therefore, the single most important driver of reliability
cannot be obtained by the user of the device, which is unfortunate because the accuracy
of the model suffers. The conflict between the usability of a model and its accuracy has
always been a difficult tradeoff to address for model developers.
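
As a sketch of these relationships, assuming the common Poisson yield model (an illustrative assumption here, not necessarily the formulation used in the VHSIC model development), failure rate λ, die area A, defect density D0 and yield Y tie together as follows:

```latex
% Sketch, assuming the Poisson yield model (illustrative assumption):
\lambda \propto A\,D_{0}   % failure rate vs. area x defect density
                           % (small cumulative percent failure)
Y = e^{-A D_{0}}           % die yield under the Poisson defect model
A D_{0} = -\ln Y
\quad\Rightarrow\quad
\lambda \propto -\ln Y     % yield alone would suffice as a model input
```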

Much of the literature in the 1990’s on the topic of reliability prediction centered
around the debate as to whether the reliability discipline should focus on PoF-based or
empirically-based models (such as MIL-HDBK-217) for the quantification of reliability.

In the author’s opinion, many of the primary criticisms of MIL-HDBK-217 stem from the
fact that it was often used for purposes for which it was not intended. For example, it
was often used as a means by which the reliability of a product was demonstrated. Since
its use was contractually required, contractors would try to demonstrate compliance to the
specified reliability requirements by “adjusting” factors in the model to make it appear
that the reliability would meet requirements. Sometimes these adjustments had a
technical basis, and sometimes they did not. Les Gubbins, one of the government’s first
project managers for the handbook, once made the analogy that engaging in the use of
these adjustment factors is like pushing the needle on your car’s speedometer up, and
convincing yourself you’re going faster. This, of course, is not good engineering
practice, but rather was done for nontechnical reasons.

Another key development in the area of reliability predictions was related to the
implications of acquisition reform. In 1994, Military Specifications and Standards
Reform (MSSR) was initiated which decreed the adoption of performance-based
specifications as a means of acquiring and modifying weapon systems. It also overhauled the military standardization process which, in turn, led to a list of standardization documents that required priority action because they were identified as barriers to
commercial processes, as well as major cost drivers in defense acquisitions. The list
included only one handbook, MIL-HDBK-217. Over the years, critics of MIL-HDBK-
217 have complained about its utility as an effective method for assessing reliability.
While the claim is made that it is inaccurate and costly, to date there is no viable
replacement in the public domain. As the DoD Lead Standardization Activity for
reliability and maintainability (R&M), Rome Laboratory (RL) was responsible for
implementing the R&M segment of MSSR. Within this context, RL initiated a project to
develop a new reliability assessment technique to supplement MIL-HDBK-217, and to
overcome some of its perceived problems.

Utilizing standardization reform funding, RL awarded a contract to the Reliability Analysis Center and Performance Technology, Inc. The objective of the work was to
develop new and innovative reliability assessment methods that are flexible enough to
suit the needs of system reliability analysts regardless of their preferred (or required)
initial prediction methods. The intent was to use the final model to supplement or
possibly replace MIL-HDBK-217. The premise of traditional methods, such as MIL-
HDBK-217, is that the failure rate is primarily determined by components comprising the
system. This was a good premise in the 1960’s and 1970’s when components exhibited
higher failure rates and systems were less complex than they are today. Increased system
complexity and component quality have resulted in a shift of system failure causes away
from components to more “system level” factors including manufacturing, design, system
requirements, interface, and software problems. Historically, these factors have not been
explicitly addressed in prediction methods. The intent of this study was to develop a
structure for an electronic system reliability assessment methodology. The term “system”
was used because the methodology accounted for all predominant causes of system
failure. The new model adopted a broader definition of reliability. An integral part of the
methodology was the assessment of processes used in the design and manufacture of the
system, including factors contributing to the following failure causes: parts, design,
manufacturing, system management, induced, wearout, no defect found and software.
The results of this study became the basis for the current RIAC 217Plus methodology.

The 2000’s was a time in which there was progress on development of new standards,
some of which will be summarized in this book. Also, the DoD has initiated efforts to
resurrect MIL-HDBK-217 by updating it with models reflecting state-of-the-art
technologies.


1.5. Acronyms
Acronyms and abbreviations that are used in this book are defined as follows:
AL Accelerated Life
ALM Accelerated Life Model
ALT Accelerated Life Testing
CA Constant acceleration
CDF Cumulative Distribution Function
CRR Center for Risk and Reliability
D Detectability
DoD Department of Defense
DPA Destructive Physical Analysis
DVT Design Verification Test
ED Electrical distributions
ELFR Early life failure rate
EPRD Electronic Parts Reliability Data
ESD Electrostatic discharge
EV External visual
EVT Engineering Verification Test
FMEA Failure Mode and Effect Analysis
FMECA Failure Mode and Effect Criticality Analysis
FRU Field Replaceable Unit
GFL Gross/fine leak
HALT Highly Accelerated Life Test (simultaneous temperature cycling and vibration)
HASS Highly Accelerated Stress Screening
HAST Highly Accelerated Stress Testing
HTB High temperature bake
HTOL High temperature operating life
HTRB High temp. reverse bias
IOL Intermittent operational life
IPL Inverse Power Law
IWV Internal water vapor
KPSI Pounds per square inch, in thousands
LI Lead integrity
MCMC Markov Chain Monte Carlo
MLE Maximum Likelihood Estimator
MS Mechanical shock
MTTF Mean Time to Failure
NPRD Non-Electronic Parts Reliability Data
O Occurrence
PD Physical dimensions
PDF Probability Density Function
PVT Process Verification Test
RBD Reliability Block Diagram
RPN Risk Priority Number
RSH Resistance to solder heat
S Severity
SD Solderability
TBD To Be Defined
TC Temperature cycling
TR Thermal resistance
TST Pre and post electrical test
TTF Time to Failure
VVF Vibration - variable freq.


1.6. References
1. Coppola, A., "Reliability Engineering of Electronic Equipment, A Historical Perspective," IEEE Transactions on Reliability, Vol. R-33, No. 1, April 1984.
2. Earles, D.R., "Reliability Application and Analysis Guide," The Martin Company, July 1961.
3. Earles, D.R. and M.F. Eddins, "Failure Rates," AVCO Corp., April 1962.
4. Knight, C.R., "Four Decades of Reliability Progress," 1991 Proceedings, Annual Reliability and Maintainability Symposium.
5. "Reliability Prediction Methodologies for Electronic Equipment," AIR 5286, SAE G-11 Committee, Electronic Reliability Prediction Committee, 31 Jan. 1998.
6. "Reliable Application of Plastic Encapsulated Microcircuits," Reliability Analysis Center Publication PEM2.
7. Morris, S.F. and J.F. Reilly (Rome Laboratory), "MIL-HDBK-217 - A Favorite Target."
8. Denson, W. and P. Brusius, "VHSIC and VHSIC-Like Reliability Modeling," RADC-TR-89-177.
9. Reliability Analysis Center, "Benchmarking Commercial Reliability Practices."


2. General Assessment Approach


Prior to developing a reliability model for a product or system, the analyst should
consider the following questions:

• What is the goal of the model, and what decisions will be made based on it?
• What data is currently available on the product?
• Is field data available? If so, is it from the product or system operating in the
same manner and environment as the one under analysis?
• Is test data available? If so, what types of tests (i.e., accelerated life tests, non-
accelerated life tests, qualification tests, etc.)?
• Is data, either field or test, available on a predecessor (i.e., earlier version) of the
product?
• Have models been developed for specific failure modes, mechanisms and/or
causes of the product?
o Life models?
o Stress-strength models?
o Models from first principles?
• Have critical failure causes of the product been identified?
• How much support can be expected from suppliers regarding identification and
quantification of the failure causes of their product?

A suggested approach to modeling the reliability of a product is shown in Figure 2.0-1.


[Figure 2.0-1 is a flowchart with the following steps: define the system; identify the purpose of the model; determine the appropriate level at which to perform the assessment (system, assembly, part, failure cause); assess the data available; assess the feasibility of performing reliability tests; determine the appropriate approach and execute it; combine data; and develop the system model.]

Figure 2.0-1: General Modeling Approach

Each of the elements of this approach is discussed below.

2.1. Define System


The first step in assessing the reliability of a product or system is to clearly define the
scope of the assessment. A model is then generated that describes the breakdown of the
product or system. This breakdown can be in accordance with a physical hardware
hierarchy of the system, or a functional breakdown. Either way, the goal is to define the
“items” for which a reliability estimate is required.

If handbook reliability prediction methodologies such as 217Plus or MIL-HDBK-217 are used, the definition of the items to address in the prediction is generally accomplished
with a hardware-based hierarchical breakdown, since those prediction methodologies are
based on the physical components comprising the system. In other approaches, such as
life modeling from accelerated test data, the product or system breakdown can be based
on functionality or hardware, with the exception that the breakdown continues down to

the root failure mode cause or mechanism level. Tools for this “system model” include
FMEA and FTA.

A fault tree representation of a system breakdown in which reliability estimates are made at the component level, with components represented by circles (basic events), is illustrated in Figure 2.1-1. This figure represents a reliability prediction performed using MIL-HDBK-217 or 217Plus.
[Figure 2.1-1 is a fault tree in which the System decomposes into Assemblies 1 and 2, each Assembly into Subassemblies, and each Subassembly into its Components, which are the basic events.]

Figure 2.1-1: Fault Tree Representation of System Model
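
Under this representation, the roll-up to a system-level prediction can be sketched as a simple series-system sum of component failure rates; the hierarchy and failure rate values below are hypothetical, and a real prediction would take the component rates from the chosen handbook method.

```python
# Series roll-up sketch for a hierarchy like Figure 2.1-1: the failure rate at
# each level is the sum of the failure rates beneath it. All values are
# hypothetical, in failures per 10^6 hours.
system = {
    "Assembly 1": {
        "Subassembly 1a": [0.12, 0.05, 0.30],   # component failure rates
        "Subassembly 1b": [0.08, 0.22],
    },
    "Assembly 2": {
        "Subassembly 2b": [0.15, 0.04, 0.11],
        "Subassembly 2c": [0.09, 0.25],
    },
}

system_fr = sum(sum(parts) for subs in system.values() for parts in subs.values())
print(f"system failure rate: {system_fr:.2f} failures per 10^6 hours")
```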

A fault tree representation of a system breakdown in which reliability estimates are made at the level of the components' failure mechanisms, represented by circles (basic events), is illustrated in Figure 2.1-2. This would be the representation of a
reliability prediction performed using a physics approach in which the intent is to
estimate the reliability of specific root-cause failure mechanisms.
[Figure 2.1-2 extends the fault tree of Figure 2.1-1 one level lower: each Component decomposes into its failure mechanisms (FM1, FM2, ...), which are the basic events.]

Figure 2.1-2: Fault Tree Representation to the Failure Cause Level


Approaches such as this, in which the reliability of each failure mechanism is estimated,
are practical if:

1. The product or system under analysis has a manageable number of failure mechanisms that can be estimated
2. The approach can be practically applied for all failure mechanisms over the entire
supply chain. In other words, each organization responsible for their component
or assembly has the ability to estimate the reliability of all failure mechanisms
within their component or assembly.

This same representation is relevant to performing FMEAs. In this case, the lowest level
events in the fault tree are the constituent failure modes of the component. If a failure
mechanism modeling approach is to be used, it needs to be applied to all failure
mechanisms in order for the assessment to quantify the reliability of the entire system.

2.2. Identify the Purpose of the Model


Perhaps the single most important factor contributing to a successful reliability
assessment is an unambiguous definition of the specific purpose to be accomplished in
the assessment. Only by knowing the purpose of an assessment can an appropriate
methodology be selected. If the purpose is not made clear, there is little chance that the
assessment will be successful. In the author’s opinion, this unclear definition of purpose
is the root cause of many of the controversies found in the reliability discipline over the
last twenty years as to selecting and using the appropriate approach.

All of the approaches described in this book have merit. All have their strengths and
weaknesses. A successful assessment will leverage the strengths of specific
methodologies toward the specific goals of the assessment. Toward this end, the intent of
this section (and the following sections) is to provide guidance on the applicable
approaches for specific assessment purposes.

A breakdown of the possible purposes for developing a reliability model is shown in Figure 2.2-1.


[Figure 2.2-1 is a tree that breaks the purpose of the model into four branches: risk assessment (anticipated failure; observed failure; input to FMEA/FTA for identification of failure cause priority), design aid (compare competing designs; model reliability growth; determine feasibility of meeting the reliability requirement; determine the impact of factors on reliability; determine screening requirements; determine fault tolerance/redundancy and testability requirements), reliability demonstration (determine if minimum robustness is achieved; determine if the reliability requirement is achieved) and maintainability (warranty cost predictions; PM schedules; spares allocation; allocate maintenance personnel).]

Figure 2.2-1: Breakdown of Potential Reliability Modeling Purposes

Each of these purposes is described in Table 2.2-1.


Table 2.2-1: Reliability Assessment Purposes

Risk Assessment - Anticipated failure: Risk assessments are performed to quantify the reliability of critical- or safety-related failure modes before the product is fielded. This is often done to meet industry or customer requirements.

Risk Assessment - Observed failure: Risk assessments are performed on fielded products that experience failures. Factors that usually need to be quantified are (a) determination of the root cause, (b) lifetime, (c) percent failure at a given time, (d) the percent of the population at risk (i.e., whether the root cause is special cause or common cause), (e) whether the defect is lot- or batch-related, (f) whether the defective portion can be contained, and (g) what the reliability will be as a function of the level of corrective actions (for example: 1 - if nothing is done; 2 - if a complete recall is done; and 3 - an approach in between).

Risk Assessment - Input to FMEA/FTA for ID of failure cause priority: Techniques such as FMEA and FTA are used to assess and prioritize failure causes. Part of this prioritization includes the identification of the probability of occurrence, either qualitatively or quantitatively.

Design Aid - Compare competing designs: For this purpose, reliability modeling is performed to quantify the relative reliabilities of several competing designs. This analysis is then used as one criterion from which the final design is chosen. In this case, reliability is only one of the factors to be accounted for in this comparison, and needs to be traded off against all of the other factors.

Design Aid - Model reliability growth: A natural part of the development process is to grow the reliability to a point that it meets its reliability requirement. For this purpose, the reliability metric of choice is quantified as a function of time. This provides Program Management with the information to assess the reliability status of the project and to estimate the date at which the requirements will be met.

Design Aid - Determine feasibility of meeting reliability rqmt.: In many cases, reliability requirements are levied upon suppliers and contractors. For this purpose, the reliability assessment is performed to determine if there is a reasonable probability of achieving the reliability requirements. If it is highly likely that requirements cannot be met, then management must make decisions regarding the future of the program.

Design Aid - Determine impact of factors on reliability (derating): For this purpose, the effects of specific factors are assessed. For example, the effects of temperature may be assessed to determine how much cooling is required.

Design Aid - Determine screening rqmt.: This purpose relates to quantifying reliability as a function of possible screening options, so that it can be determined which screening options will result in the reliability requirements being met.

Reliability Demo - Determine if minimum robustness is achieved: This purpose is to provide quantitative data that proves, within acceptable confidence limits, that predefined robustness levels are achieved. These robustness levels usually correspond to a "qualification" requirement, and may not be highly correlated to field reliability.

Reliability Demo - Determine if reliability rqmt. is achieved: This purpose is to provide quantitative data that proves, within acceptable confidence limits, that the reliability requirements are met.

Maintainability - Warranty cost predictions: For this purpose, the assessment is performed so that the costs associated with warranty repairs or replacements can be estimated.

Maintainability - Preventive Maintenance (PM) schedules: The assessment is performed so that effective preventive maintenance schedules can be derived.

Maintainability - Spares allocation: For repairable systems, since the replacement of failed items requires the availability of spare items, the question of how many spares to keep on hand inevitably arises. The reliability characteristics of the item is one piece of information required. Others are repair rates, a reliability block diagram, etc.

Maintainability - Allocate maintenance personnel: For repairable systems, organizations need to determine the personnel required to keep up with maintenance demands. One input to this is the frequency of various types of failures.

Note 1: only for the specific failure causes modeled.

Specific reliability modeling purposes are generally suited to specific program phases, as
summarized in Table 2.2-2.

Table 2.2-2: Program Phase vs. Reliability Assessment Purpose
(Program stages: Concept, Development, Early Production, Production, Deployment)

Risk Assessment - Anticipated failure: x
Risk Assessment - Observed failure: x x x x
Input to FMEA/FTA for ID of failure cause priority: x x
Design Aid - Compare competing designs: x
Design Aid - Model reliability growth: x
Design Aid - Determine feasibility of meeting reliability rqmt.: x x
Design Aid - Determine impact of factors on reliability (derating): x
Design Aid - Determine screening rqmt.: x
Reliability Demo - Determine if minimum robustness is achieved: x x
Reliability Demo - Determine if reliability req. is achieved: x x
Maintainability - Warranty cost predictions: x x x x
Maintainability - PM schedules: x x x x
Maintainability - Spares allocation: x x x x
Maintainability - Allocate maintenance personnel: x x x x

2.3. Determine the Appropriate Level at Which to Perform the Modeling

The first thing to determine is the hierarchical level at which the assessment will be performed. A generic hierarchy is shown below:

System
Subsystem
Assembly
Component
Failure Modes (Root)
Failure Causes/Mechanisms (Root)

2.3.1. Level vs. Data Needed


Traditional handbook approaches for reliability predictions will generally be applied at
the component level. In this case, a failure rate is estimated for each component, based
on the factors accounted for in the specific model used. In some cases, this predicted
failure rate will be apportioned amongst the component’s failure modes in a FMEA (if
the MIL-STD-1629 method is used, in which the criticality is determined by the modal
failure rate, i.e., the component failure rate multiplied by the failure mode percentage of
occurrence). This approach can be used based on readily accessible data, such as that
found in the handbooks. This approach also allows for the estimation of a failure rate
associated with each “failure severity”. This is accomplished by adding the failure rates
for the failure modes that result in a specific “severity” call of failure.
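
A minimal sketch of this apportionment is shown below; the component failure rate, mode fractions and severity assignments are hypothetical values, not data from any handbook or FMEA.

```python
# Sketch: apportion a predicted component failure rate among its failure modes,
# then sum the modal failure rates by severity. All values are hypothetical.
component_fr = 2.0                      # failures per 10^6 hours
modes = {                               # mode: (fraction of occurrence, severity)
    "open":  (0.60, "critical"),
    "short": (0.30, "critical"),
    "drift": (0.10, "minor"),
}

by_severity = {}
for mode, (fraction, severity) in modes.items():
    modal_fr = component_fr * fraction  # modal (mode-specific) failure rate
    by_severity[severity] = by_severity.get(severity, 0.0) + modal_fr

print(by_severity)                      # {'critical': 1.8, 'minor': 0.2}
```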

If the level to be analyzed is failure causes, then additional detailed data and information
is required. Therefore, the practicality of obtaining the required data must be a
consideration when choosing an appropriate approach. The degree of difficulty of
obtaining required data generally increases as you go lower in the hierarchy. This
concept is illustrated in Figure 2.3-1.


[Figure 2.3-1 maps the data required for a reliability assessment to the level of the system hierarchy: parts lists, environmental conditions and part stresses for assessments at the assembly and component levels; failure mode distributions at the failure mode level; and yield, defect density, and internal part stresses and their distributions at the failure cause/mechanism level.]

Figure 2.3-1: Typical Data Requirements vs. Level of Hierarchy

As shown in Figure 2.3-1, the data required for the assessment of specific failure causes
can be factors like yield, defect density, internal part stresses and distributions. Because
these factors are often difficult for outside organizations to obtain, the best approach is
generally to have the manufacturer assess the reliability of the causes in the event that the
selected approach requires this sort of data.

The appropriate approaches for a reliability assessment will, therefore, generally depend
on the location of a company’s product in the hierarchy of the product or system.


2.3.2. Using an FMEA as the Basis for a Reliability Model


A FMEA can be an effective tool in identifying specific root failure causes that need to
be quantified in a reliability model. A generic FMEA approach is shown in Figure 2.3-2.

[Figure 2.3-2 is a flowchart of the basic FMEA process: starting from the system hierarchy, identify functions and how they can fail (failure modes and failure effects); identify failure causes; rate occurrence and detectability to compute a Risk Priority Number (RPN); and use the results to improve the design.]

Figure 2.3-2: The Basic FMEA Approach

The hierarchical relationship between cause, mode and effect is shown in Figure 2.3-3.
For example, a failure mode can have any number of potential effects, and also can have
any number of potential causes.


[Figure 2.3-3 shows a single failure mode linked upward to multiple potential failure effects (#1 through #3) and downward to multiple potential causes (#1 through #4), each cause in turn having potential sub-causes (#1a-#1c, #2a-#2b, #3a-#3b, #4a-#4b) and sub-sub-causes (#2a1, #2a2, #3b1, #3b2).]

Figure 2.3-3: Hierarchical Relationship Between Cause, Mode and Effect

If the reliability assessment is to be performed at the failure cause level, then all possible
causes need to be identified. One of the FMEA objectives is to identify all conceivable
failure causes. One way to accomplish this is to identify all combinations of initial
conditions, stresses and mechanisms, as illustrated in Figure 2.3-4 and Table 2.3-1.

[Figure 2.3-4 depicts failure causes as combinations of initial conditions (defect free, or defects that are intrinsic or extrinsic), stresses (operational or environmental) and mechanisms (mechanical, electrical or chemical).]

Figure 2.3-4: Approach to Identifying Causes


Table 2.3-1: Examples of Initial Conditions, Stresses and Mechanisms

Initial Conditions:
• Defect free
• Defects, intrinsic: voids, material property variation, geometry variation, contamination, ionic contamination, crystal defects, stress concentrations
• Defects, extrinsic: organic contamination, nonconductive particles, conductive particles, contamination, ionic contamination

Stresses:
• Operational: thermal, electrical, chemical, optical
• Environmental: chemical exposure, salt fog, mechanical shock, UV exposure, drop, vibration, temperature (high and low), temperature cycling, humidity, pressure (low and high), radiation (EMI, cosmic), sand and dust

Mechanisms:
• Electrical: electromigration, dielectric breakdown, dendritic growth, tin whiskers, electro-thermo-migration, second breakdown
• Mechanical: metal fatigue, stress corrosion cracking, melting, creep, warping, brinelling, fracture, fretting fatigue, pitting corrosion, spalling, crazing, abrasive wear, adhesive wear, surface fatigue, erosive wear, cavitation pitting, elastic deformation, material migration, cracking, plastic deformation, brittle fracture, expansion, contraction, elastic modulus (Emod) change, outgassing
• Chemical: corrosion, chemical attack, fretting corrosion, oxidation, crystallization


One of the keys to a successful FMEA is to understand the relationship between cause,
mode and effect. In general, there is a natural tiering effect that occurs in an FMEA as a
function of the product or system level, as illustrated in Table 2.3-2. For example, at the
most basic level, the part manufacturing process, the cause of failure may be a process
step that is out of control. The ultimate effect of that cause becomes the failure mode at
the part level, the failure effect of the part becomes the failure mode at the next level of
assembly, and so forth. It is very important that the cause, mode and effect are not
confounded in the analysis.

Table 2.3-2: Relationship Between Cause, Mode and Effect

System | Assembly | Part   | Part Manufacturing Process
Effect |          |        |
Mode   | Effect   |        |
Cause  | Mode     | Effect |
       | Cause    | Mode   | Effect
       |          | Cause  | Mode
       |          |        | Cause

More detail regarding an FMEA approach is provided in Chapter 8.

Figures 2.3-5 through 2.3-8 illustrate, with fault trees, how the relationship between
cause, mode and effect scales up or down the product or system hierarchy, depending on
the hierarchical level at which the analysis is to take place.

In this example, failure “cause” is considered to be at the lowest level at which a modeling effort will occur. If the cause corresponds to a fundamental mechanism of
failure (i.e., the mechanism represents the fundamental physical failure of the item), then
the term “cause” is considered synonymous with the term “mechanism”.


[Figure 2.3-5 is a generic fault tree of a product or system: basic events feed OR, AND and VT gates that combine up to a top event.]

Figure 2.3-5: Fault Tree of Product or System

[Figure 2.3-6 is the same fault tree with one basic event labeled as the Cause, the gate above it labeled as the Mode, and the gate above that labeled as the Effect.]

Figure 2.3-6: Fault Tree of Product or System with Cause as the Lowest Level


[Figure 2.3-7 is the same fault tree with the Cause one level above the basic events, and the Mode and Effect at correspondingly higher levels.]

Figure 2.3-7: Fault Tree of Product or System with Cause Above the Lowest Level

[Figure 2.3-8 is the same fault tree with the Cause two levels above the basic events, so that the Mode sits just below the top event and the Effect is the top event.]

Figure 2.3-8: Fault Tree of Product or System with Cause Two Levels Above the Lowest Level

Therefore, if the FTA view of the product or system is to be consistent with the reliability
assessment, then the lowest level in the tree must be the level at which reliability
estimates are made.

The section above describes the hierarchical level at which a reliability model will be
developed, whether it be a failure cause, failure mode, a component or an assembly.
Once this physical level is determined, there are several model forms possible to
construct a model to describe its reliability. This form will, of course, depend on the
specific approach and data used to develop the reliability model. Some of these forms are
described below. More detail on each of these is provided in subsequent sections.

2.3.3. Model Form vs. Level


The form of the model to be developed will depend on the level and the approach. For
example, if empirical data is used directly without a model developed from it, assuming
constant failure rate, the best estimate of the failure rate is simply:

λ = (number of failures) / (total operating time)
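
A minimal sketch of this estimate, together with the standard chi-square upper confidence bound for a time-terminated test, follows; the failure count and operating hours are purely hypothetical.

```python
# Sketch: constant-failure-rate point estimate and a chi-square upper
# confidence bound for a time-terminated test. Values are hypothetical.
from scipy.stats import chi2

failures = 4                  # observed failures
hours = 2.5e6                 # cumulative operating time (hours)

lam_hat = failures / hours                                 # point estimate
lam_90 = chi2.ppf(0.90, 2 * failures + 2) / (2 * hours)    # 90% upper bound

print(f"point estimate: {lam_hat:.2e}/hr, 90% upper bound: {lam_90:.2e}/hr")
```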

If a life model is developed from life tests performed at various stress levels, the result
will be a time-to-failure (TTF) distribution (described by the Weibull, lognormal or other
statistical distributions) that is a function of stress levels. If a Weibull distribution is
used, the general model will be:
R(t) = e^(-(t/α)^β)
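
A short sketch of exercising such a model is shown below; the inverse power law used for the scale parameter α, and all parameter values, are illustrative assumptions rather than results from any particular test.

```python
import math

def weibull_reliability(t, alpha, beta):
    """R(t) = exp(-(t/alpha)**beta) for a Weibull time-to-failure distribution."""
    return math.exp(-((t / alpha) ** beta))

def alpha_from_stress(stress, c=5.0e7, n=2.5):
    """Illustrative inverse-power-law life-stress relationship: alpha = c / stress**n.
    c and n are hypothetical fitted constants."""
    return c / (stress ** n)

beta = 2.0                       # hypothetical shape parameter
for stress in (10.0, 20.0):      # hypothetical stress levels
    a = alpha_from_stress(stress)
    r = weibull_reliability(5000.0, a, beta)
    print(f"stress={stress}: alpha={a:.0f} hr, R(5000 hr)={r:.4f}")
```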

If models are to be derived from the analysis of field data, there are several possible
model forms. Traditional methods of reliability prediction model development have
included the statistical analysis of empirical failure rate data. When using multiple linear
regression techniques with highly variable data (which is often the case with empirical
field failure rate data), a requirement of the model form is that it be multiplicative (i.e. the
predicted failure rate is the product of a base failure rate and several factors that account
for the stresses and component variables that influence reliability). An example of a
multiplicative model is as follows:

λp = λb πe πq πs

where:

λp = Predicted failure rate
λb = Base failure rate
πe = Environmental factor
πq = Quality factor
πs = Stress factor
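
A minimal numeric sketch of evaluating a multiplicative model of this form follows; the base failure rate and π factor values are invented for illustration and are not taken from any handbook.

```python
# Multiplicative model sketch: the predicted failure rate is the product of a
# base failure rate and dimensionless pi factors. All values are hypothetical.
lambda_b = 0.020                      # base failure rate, failures per 10^6 hours
pi_e, pi_q, pi_s = 4.0, 2.5, 1.3      # environment, quality and stress factors

lambda_p = lambda_b * pi_e * pi_q * pi_s
print(f"predicted failure rate: {lambda_p:.3f} failures per 10^6 hours")
```

Note that because every factor multiplies the same term, extreme factor values compound multiplicatively, which is the extreme-value behavior discussed next.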

However, a primary disadvantage of the multiplicative model form is that the predicted
failure rate value can become unrealistically large or small under extreme value
conditions (i.e., when all factors are at their lowest or highest values). This is an inherent
limitation of multiplicative models, primarily due to the fact that individual failure
mechanisms, or classes of failure mechanisms, are not explicitly accounted for.

Another possible approach to model reliability is to segment the failure rate for each
group of failure causes that are accelerated by stresses incurred during specific portions
of a mission. Each of these failure rate terms is then accelerated by the appropriate stress or component characteristic. This is the model form used in the RIAC 217Plus methodology. This model form is as follows:

λp = λoπo + λeπe + λcπc + λi + λsjπsj

where:

λp = predicted failure rate
λo = failure rate from operational stresses
πo = product of failure rate multipliers for operational stresses
λe = failure rate from environmental stresses
πe = product of failure rate multipliers for environmental stresses
λc = failure rate from power or temperature cycling stresses
πc = product of failure rate multipliers for cycling stresses
λi = failure rate from induced stresses, including electrical overstress and ESD
λsj = failure rate from solder joints
πsj = product of failure rate multipliers for solder joint stresses
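
The additive form can be sketched the same way; every value below is hypothetical, and in the actual 217Plus methodology the π factors are functions of the application variables rather than constants.

```python
# Additive (217Plus-style) model sketch: each failure-cause class has its own
# base failure rate, modified only by its own pi factors. Values are
# hypothetical, in failures per 10^6 hours.
lam_o, pi_o = 0.010, 1.5      # operational stresses
lam_e, pi_e = 0.004, 2.0      # environmental stresses
lam_c, pi_c = 0.003, 1.2      # power/temperature cycling stresses
lam_i = 0.002                 # induced stresses (EOS/ESD)
lam_sj, pi_sj = 0.001, 1.8    # solder joints

lambda_p = lam_o * pi_o + lam_e * pi_e + lam_c * pi_c + lam_i + lam_sj * pi_sj
print(f"predicted failure rate: {lambda_p:.4f} failures per 10^6 hours")
```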

The concept of this approach is that the occurrence of each group of failure causes is
mutually exclusive, and their failure rates can be modeled separately and summed. By
modeling the failure rate in this manner, factors that account for the application and
component-specific variables that affect reliability (π factors) can be applied to the
appropriate additive failure rate term. Additional advantages to this approach are that
they:

o Address Operating-, Non-Operating- and Cycling-related failure rates in an
additive model. These individual failure rates are weighted in accordance with
the operational profile (duty cycle and cycling rate). The π factors modify only
the applicable failure rate term, thereby eliminating many of the extreme value
problems that plague multiplicative models.
o Are based on observed failure mode distributions, so that observed component
root failure causes are empirically modeled
o Can be tailored with test data (if available) by applying it in a Bayesian fashion to
the appropriate failure rate term. As examples, temperature cycling data can be
combined with the failure rate from power or temperature cycling stresses (λc), or
high temperature operating life can be combined with the failure rate from
operational stresses term (λo).

2.4. Assess Data Available


A predominant factor that will dictate the options that an analyst has in modeling the
reliability of a product is the availability of data. The analyst should consider the
following questions when assessing the availability of test data:

• Is field data available on the specific product or system?


• Is data on a similar product or system available? If so, is it field data or test data?
• If data is available, is it:
o Relevant?
o Of sufficient quantity?
o Of sufficient quality?
• If physics-based models are to be employed, is the required detailed data and
information available, such as:
o Defect rates
o Material properties (e.g., functional characteristics)
o Defect (flaw) distributions
o Material variation quantification (e.g., purity, yields, dimensions)
o Etc.

Perhaps the most important element of a reliability program is the reliability testing of the
product. Reliability test data is, in turn, a critical element for assessing reliability. In this
context, a reliability test consists of two primary elements: measurement and exposure.
The measurement is the means of assessing the performance of the product or system
relative to its requirements. It usually consists of quantifying parameters that are
specifiable attributes. It may include either continuous variables (e.g., gain, power
output) or attribute data (i.e., a binomial representation of whether a product possesses an
attribute or not). Exposure is the application of a stress or stresses. These stresses may
consist of operational stresses or environmental stresses. Operational stresses are defined
as those stresses to which the product will be exposed by the act of operating the product.
For example, a transistor is designed to have a voltage applied, and pass a given amount
of current. As such, these are operational stresses. It will also be exposed to externally
applied environmental stresses such as temperature, temperature cycling, vibration, etc.

Reliability tests can be performed either by sequentially performing repeated cycles of a
measurement, exposure, measurement, etc., or by continuously measuring performance
parameters in-situ during exposure. It is usually desirable to perform in-situ
measurement so that times to failure can be accurately determined. In practical cases,
however, it is not always feasible due to the complexities of setting up such measurement
capabilities. If repeated cycles of a measurement, exposure, and measurement are used,
the measurement intervals should be frequent enough so that sufficient resolution in the
times-to-failure data is available.

Practical considerations for assessing the feasibility of testing products are:

• Are samples available? If so, are they available in sufficient quantity?


• Are measurement systems available for continuous, in-situ, measurements during
exposure? If not, repeated cycles of a measurement and exposure may be
required.
• Are laboratory facilities available to perform the exposure?
• Are the measurement and exposure facilities available to support a multi-cell test
at various stress levels (i.e., application of various combinations of stresses)?

Additional considerations for testing products and systems are provided in Chapter 5.


2.5. Determine and Execute Appropriate Approach


This section discusses the various options that an analyst has to predict, assess, and
estimate the reliability of a product. Figure 2.5-1 illustrates the breakdown of various
approaches.

Figure 2.5-1: Breakdown of Reliability Assessment Options

Table 2.5-1 describes each approach, its strengths and its weaknesses. This information is
presented in the context of the intent of this book, which is to present options for
quantifying the reliability of a product as it is used by customers in actual use conditions.
Effective techniques also include using a combination of the approaches in this section.
The manner in which these approaches can be combined will be addressed in Section 2.6.


Table 2.5-1: Summary of Reliability Assessment Options

1. Highly Accelerated Life Test (HALT)
Description: Exposure to severe levels of thermal cycling and vibration
Strengths: Can quickly identify failure causes that are accelerated by thermal cycling and vibration; accounts for the interaction of the two stresses; can be used as a screening basis; reflects the actual reliability; test data can be collected and applied before the system is fielded
Weaknesses: Only accelerates specific failure causes accelerated by the test; large extrapolations to use conditions are required; can excite non-relevant failure modes (i.e., those that are not representative under field environmental conditions); cannot quantify special cause failure modes

2. Qual
Description: Exposure to industry standard “trade and commerce” tests
Strengths: Can demonstrate a degree of robustness to the specific qualification tests; reflects the actual reliability; test data can be collected and applied before the system is fielded
Weaknesses: Correlation to field use conditions is difficult; can excite non-relevant failure modes (i.e., those that are not representative under field environmental conditions); cannot quantify special cause failure modes

3. DOE multicell
Description: Life tests under a variety of stress levels
Strengths: Can accurately model lifetime due to common cause mechanisms; can quantify acceleration factors; can estimate reliability at use conditions; reflects the actual reliability; test data can be collected and applied before the system is fielded
Weaknesses: Can be expensive to execute; difficult to quantify special cause failure modes due to the large sample sizes sometimes required

4. Reliability demo (accelerated)
Description: Demonstration of reliability via life tests at accelerated conditions
Strengths: Can demonstrate required reliability in a statistically significant way; reflects the actual reliability; test data can be collected and applied before the system is fielded
Weaknesses: Correlation to field use conditions is difficult

5. Reliability demo (non-accelerated)
Description: Demonstration of reliability via life tests at non-accelerated conditions
Strengths: Can demonstrate required reliability in a statistically significant way; reflects the actual reliability; test data can be collected and applied before the system is fielded
Weaknesses: Large sample sizes usually required

6. Field data – same product
Description: Use of field experience data on the product or system under analysis
Strengths: The most representative data; can quantify failure causes that exhibit low percent failures
Weaknesses: Usually, the data is not available in time for use in product or system development; collecting field data is prone to errors

7. Models
Description: Models developed from field experience data on similar products
Strengths: Can be reasonably sensitive to various stresses; represents field use; can be a good indicator of field reliability performance; based on easily obtainable data; easy to use; can quantify failure causes that exhibit low percent failures
Weaknesses: Difficult to keep updated; actual failures are impacted by factors not considered by the model; models become outdated by new technology; misapplication of models by the analyst; no uncertainty estimates available; difficult to collect good quality field data; difficult to distinguish correlated variables (i.e., quality and environment)

8. Raw data (EPRD, NPRD)
Description: The direct use of field experience data on similar products
Strengths: Represents field use; easy to use; can quantify failure causes that exhibit low percent failures
Weaknesses: Extrapolations to specific use conditions required; not feasible to collect data representing all conceivable situations

9. Stress/Strength modeling
Description: Calculation of failure probabilities based on the strength distribution and the stress distribution
Strengths: Good approach for fundamental material behavior; can model fatigue behavior; models specific failure mechanisms; valuable for predicting end-of-life for known failure mechanisms
Weaknesses: May require information that is difficult to obtain; difficult to use for estimating field reliability; can be complex and costly to apply; difficult to use for modeling defect-driven failure mechanisms; difficult to account for material defects; not practical to use for the assessment of an entire system; can only be applied in rare cases

10. First principles
Description: Calculation of failure probabilities based on a fundamental understanding of the physics of the failure cause
Strengths: Scientifically robust; good approach for fundamental material behavior; can model fatigue behavior; models specific failure mechanisms; valuable for predicting end-of-life for known failure mechanisms
Weaknesses: In practice, difficult to derive fundamental equations; empirical data is usually required to validate the model, or to estimate model constants; difficult to account for material defects; may require information that is difficult to obtain; can be complex and costly to apply; difficult to use for modeling defect-driven failure mechanisms; not practical to use for the assessment of an entire system


Selecting a methodology

The various approaches summarized here are suited to various program phases,
corresponding to prediction, assessment and estimation. This is shown in Table 2.5-2
(note that the shaded area indicates where the approach can be applied). For example,
MIL-HDBK-217 should only be used for prediction, meaning that its usefulness is
limited for assessment and estimation. Conversely, 217Plus was designed to provide a
framework for all three reliability modeling phases.

Table 2.5-2: Relevancy of Approach to Prediction, Assessment and Estimation
(columns: Prediction, Assessment, Estimation; the shading of the original table indicates where each approach applies)

Test:
• HALT
• Qualification
• Accelerated: DOE multicell; Reliability demo
• Non-accelerated: Reliability demo
Empirical field data:
• Same product
• Similar product – models: 217Plus, MIL-HDBK-217, Bellcore
• Raw data (EPRD, NPRD)
Physics:
• Stress/Strength modeling
• First principles

The appropriate approach(es) to modeling reliability will depend on several factors,
including:

• The severity of product failure. In this context, severity can mean that there are
significant financial ramifications of failure, that there are safety-related risks, or
that the system is not maintainable. All of the reasons that high reliability may
be required in the first place are the same reasons that the reliability model must
be acceptably accurate. Since reliability is a stochastic process, reliance on any
one of the methodologies discussed in this book is susceptible to uncertainties.
Sometimes these uncertainties can be very large. This is true for any of the
methods. If, however, several methodologies can be employed, and their results
are consistent with each other, then this adds much more credibility to the
modeled reliability of the product. This is especially true if a physics approach is
coupled with an empirical approach.
• The amount and level of detailed information available to the analyst. Often, this
will dictate the available choices for the analysis.
• Complexity of the product. If the product or system is very complex, has many
levels of indenture, and there is a complex supply chain involving many suppliers,
then the available suitable choices for analysis at the top of the supply chain will
be limited. For example, as discussed previously, it is very difficult to obtain the
data required to utilize one of the physics approaches by organizations higher in
the supply chain. If, however, the entire supply chain utilizes the PoF approach
for the product or system, it can be a viable approach.

Table 2.5-3 provides general guidance on the identification of appropriate approaches
based on the purpose of the assessment.

If empirical data is to be used as a basis for one or more of the approaches, there are
various factors that will influence the uncertainty in assessments made with this empirical
data. These include the following data attributes:

Relevancy – how close the product or system on which the data was collected is, in
architecture and complexity, to the item under analysis

Quantity – this pertains to the statistical uncertainty of reliability estimates based
on the quantity of data. For example, if the TTF distribution is exponential, this
uncertainty is usually modeled with the Chi-squared distribution.
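
For example, the chi-squared relationship gives an upper confidence bound on a constant failure rate from a time-terminated test with "r" failures in "T" cumulative hours. A minimal sketch of that calculation (using SciPy; the inputs are illustrative only):

    from scipy.stats import chi2

    def failure_rate_upper_bound(r, total_hours, conf=0.90):
        # Upper confidence bound for a constant failure rate,
        # time-terminated test: lambda_UB = chi2(conf, 2r+2) / (2T)
        return chi2.ppf(conf, 2 * r + 2) / (2.0 * total_hours)

    # Example: 2 failures in 1,000,000 unit-hours, 90% confidence
    print(failure_rate_upper_bound(2, 1_000_000))   # ~5.3e-6 failures/hour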

Quality – this pertains to the accuracy inherent in the data itself. For test data, the
accuracy is generally much better than with field data, since test data is usually
much better controlled with known sample sizes, failure times, etc. Field data, on
the other hand, is usually fraught with many problems and sources of uncertainty.
This will be discussed in Section 5.2.1.2.


Table 2.5-3: Identification of Appropriate Approaches Based on the Purpose

(The columns of the original table are the approaches of Table 2.5-1 – HALT, Qualification, DOE multicell, Reliability demo (accelerated and non-accelerated), field data on the same product, models, raw data (EPRD, NPRD), Stress/Strength modeling and First principles – with "x" marking the approaches applicable to each purpose.)

Purposes (rows):
• Risk assessment: anticipated failure; observed failure
• Input to FMEA/FTA for identification of failure cause priority
• Compare competing designs
• Model reliability growth
• Design aid: determine fault tolerance/redundancy; determine feasibility of meeting reliability requirements; determine testability requirements
• Determine impact of factors on reliability (e.g., derating)
• Determine screening requirements
• Reliability demo: determine if minimum robustness is achieved; determine if reliability requirement is achieved
• Maintainability: warranty cost predictions; PM schedules; spares allocation; allocate maintenance personnel

1 – for those failure causes addressed by the approach


Relevancy is a function of the type of data that is available, and the product or system on
which that data is available. To further address the relevancy issue for assessments made
with empirical data, consider the information in Table 2.5-4, which summarizes the
various attributes of empirical data. This notion is valid, regardless of the level of
assembly, ranging from root failure causes to the system level.

Table 2.5-4: Ranking the Attributes of Empirical Data

Type of data (columns, most to least relevant): field data in the same environment; field data in a different environment; test data at the same stress; test data at a different stress.

Product or system on which data is available (rows, most to least relevant): same product with the same mfg/process; same product with a different mfg/process; similar product with the same mfg/process; similar product with a different mfg/process.

The upper-left combination (field data in the same environment, on the same product with the same mfg/process) is ranked "Best"; the lower-right combination (test data at a different stress, on a similar product with a different mfg/process) is ranked "Worst".

There has been much information published in the literature comparing and contrasting
empirical and physics-based models. However, they are not mutually exclusive
methodologies. For example, empirical models generally utilize PoF principles in their
derivation, and PoF models utilize empirical data in their derivation and parameter
estimation.

The majority of component field failures are a result of special causes. These causes may
be an anomaly in the manufacturing process, an application anomaly, or a host of other
assignable causes. They are rarely the result of a common cause failure mechanism,
which can generally be modeled by life modeling techniques.

Guidelines and examples are provided in the following sections for each of the
approaches.

2.5.1. Empirical

2.5.1.1. Test
Testing product or system reliability is performed for many reasons, including:


• Quantifying reliability (infant mortality, wear-out)


• Demonstrating reliability
• Growing reliability
• Lot acceptance
• Developing screens
• Performing screens
• Determining the limits of the technology
• Determining stress bounds for subsequent tests
• Determining predominant accelerating stresses
• Identifying “weak points” in the design
• Identifying failure causes
• Demonstrating compliance with industry standard qualification tests

An important consideration for all of the tests described above is the definition of
“failure”, i.e., the "failure criteria" that will be used to determine if a product passes or
fails. Industry guidelines, specifications, or an understanding of end-use application
tolerances are often used to set pass/fail criteria.

A common form of empirical testing is the performance of qualification tests.


“Qualification” is usually defined as demonstrating that a product will meet performance
requirements in its intended application, as used by customers, over the expected lifetime
of the product. There are two primary elements to performance qualification: Validation
and Verification (Reference 1), as follows:

Validation – Confirmation by examination and provision of objective evidence that the
particular requirements for a specific intended use are fulfilled.

Verification – Confirmation by examination and provision of objective evidence that
specified requirements have been fulfilled.

Therefore, for a product or system to be considered fit for use for a specific application, it
must conform to the requirements of its specification over its intended life (verification)
and the specification must adequately capture the requirements of the end user
(validation). The various elements of qualification are illustrated in Figure 2.5-2.


Figure 2.5-2: Qualification Concepts and Terminology

Verification ensures that the product or system meets the specified requirements both
initially (specification compliance) and over its intended lifetime (reliability testing).
Specification compliance ensures that, at the beginning of its lifetime, the item meets
specified performance requirements and that the distribution of performance parameters
over the population of items is within acceptable limits. Reliability testing ensures that
the product is robust and that it meets the specified performance requirements over its
intended lifetime. Reliability testing consists of several test phases, each of which has its
own purposes and approaches.

The testing sequence can be grouped into three categories: Engineering Verification Tests
(EVT), Design Verification Tests (DVT), and Production Verification Tests (PVT).
These are further explained below, along with their relationship to the establishment of a
life model, the prediction of product reliability and the various elements of each test
approach. This is provided specifically to highlight how reliability testing can be used in
a reliability program.


Engineering Verification Tests (EVT) are intended to identify and assess “high risk”
critical items so that corrective action can be taken, if necessary. The intent is to uncover
weaknesses or to identify product capability, not to pass a set of predefined tests, as is the
case with traditional qualification testing. The purpose and approach of these tests are
described in Table 2.5-5. These tests are also used to identify the maximum stress
capability of a product, which is a prerequisite for developing a complete test plan (in
DVT) to assess lifetime. Step stress tests are often used for this purpose, and the results
can support establishment of an upper bound on subsequent test stresses.

One of the primary purposes of DVT testing is to provide the data required to develop life
models. Often, there are multiple accelerating stresses, in which case life tests must be
conducted for various stress combinations. Design of Experiments (DOE) is used to
develop an effective and cost-efficient test plan. DOE concepts, as they pertain to
reliability testing, are covered in Chapter 4.

PVT tests demonstrate that the robustness of production units is equivalent to that of the
EVT/DVT samples. Whereas EVT and DVT demonstrate the intrinsic robustness, PVT
demonstrates the “as-built” robustness.

Table 2.5-5: EVT, DVT and PVT Purpose and Approach

EVT
Purpose: Determine limits of the technology; determine stress bounds for subsequent tests; determine predominant accelerating stresses; identify “weak points” in the design; identify failure causes
Approach: Test to failure; step stress (to determine limits); test a broad range of stressors to determine the stresses that accelerate predominant failure causes
Relative sample sizes required: Low

DVT
Purpose: Quantify elements of the bathtub curve (infant mortality, wear-out) so that effective screens can be developed; provide data to assess product lifetime under various combinations of stresses
Approach: Development of a life model that estimates time to failure as a function of pertinent accelerants; use DOE to design statistically valid life tests, and perform long-term life tests using stresses that will be experienced in the intended application
Relative sample sizes required: High

PVT
Purpose: Verify that the robustness of production units is as good as that of the EVT/DVT samples
Approach: Test relatively small samples of parts using a broad range of stressors. These are traditional “qualification” tests
Relative sample sizes required: Low


Some reliability practitioners choose to separate qualification tests from reliability tests.
In this case, reliability tests are those that have a purpose similar to the DVT tests. The
reason for the separation is that the reliability tests are engineering tests that are
not dictated by industry standards. As such, the results may or may not be shared with
customers. Likewise, qualification tests are required and thus shared with customers to
demonstrate compliance.

Root cause analysis and corrective action


A critical part of any reliability program is the ability to learn from failures and improve
the product or system. Failure analysis is performed to ensure that the root cause is
identified and understood, and corrective actions are implemented and verified. This is
done throughout product development, including EVT or DVT and PVT.

With a product that is composed of a number of subassemblies, there is a time offset
between the EVT, DVT or PVT tests performed on components of the product or system and
those tests performed on the end item. This is illustrated in Figure 2.5-3.

Figure 2.5-3: EVT, DVT and PVT Relationships


2.5.1.1.1. Non-Accelerated
Nonaccelerated reliability tests are those in which samples are tested in a manner that
recreates the use conditions the product will experience in its intended use environment
as used by customers. These tests may be performed for several reasons:

1. To uncover any unexpected failure causes


2. To demonstrate that the product meets its reliability requirement

Generally, if the purpose is #1, a more effective way of achieving this is with accelerated
testing, discussed in Section 2.5.1.1.3. If the purpose is #2, then the concepts of
reliability demonstration can be used, as discussed in the next section.
2.5.1.1.2. Reliability Demonstration
The fundamental concept of reliability demonstration is the following:

1 − CL = R

This is essentially a hypothesis test in which the hypothesis is that the true product
reliability is “R” or greater. For example, consider a case in which the reliability
requirement is 0.95 at 5000 hours, and the desired confidence level is 0.80 (80%). In this
case, the implied failure rate is 0.0000103 failures per hour.

If the hypothesis is true and the test is run such that there is less than a 20% probability of
experiencing the observed number of failures (or fewer), then the analyst can be 80%
certain that the reliability requirements have been met.

Table 2.5-6 summarizes the probability as a function of the number of failures and
cumulative operating time. The values in the cells are the Poisson probability that there
will be “F” or fewer failures, under the hypothesis that the true failure rate is 0.0000103
(failures per hour). In this example, if the test can be run until 200,000 hours are
accumulated, with no failures, then the test is passed and the hypothesis is verified. This
is the first opportunity to pass the test, as this is the shortest time at which the Poisson
probability falls below 0.20 (i.e., 0.13). In this example, 0.20 is the risk of concluding
that the failure rate is less than 0.0000103 when it is not.

The test is run until the combination of failures and operating time falls either above or
below the shaded red area. If it falls above the red area, the hypothesis that the failure
rate is greater than required is accepted (the test is failed). If it falls below the red
area, the hypothesis that the failure rate requirement has been met is accepted (the test
is passed). If the combination of hours and failures remains in the red area, the
hypothesis can be neither confirmed nor denied, and further testing is required.

The probability values are generally calculated from the binomial or Poisson
distributions, depending on whether the probability is time-based (Poisson) or attribute-
based (binomial). Poisson is used in the case of constant failure rates.

Table 2.5-6: Reliability Demonstration Example
(cell values are the Poisson probability of "F" or fewer failures, given a true failure rate of 0.0000103 failures per hour)

Cumulative operating time (in thousands of hours):
 F    50  100  150  200  250  300  350  400  450  500  550  600  650  700  750  800
10  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.98 0.97 0.95 0.92 0.89 0.85 0.79
 9  1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.98 0.96 0.94 0.90 0.86 0.81 0.75 0.69
 8  1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.98 0.95 0.92 0.88 0.83 0.77 0.71 0.64 0.56
 7  1.00 1.00 1.00 1.00 1.00 0.99 0.97 0.94 0.90 0.85 0.79 0.72 0.65 0.57 0.50 0.42
 6  1.00 1.00 1.00 0.99 0.98 0.96 0.93 0.88 0.82 0.74 0.66 0.58 0.50 0.42 0.35 0.29
 5  1.00 1.00 0.99 0.98 0.95 0.91 0.85 0.77 0.68 0.59 0.50 0.42 0.35 0.28 0.22 0.17
 4  1.00 1.00 0.98 0.94 0.88 0.80 0.71 0.61 0.51 0.42 0.34 0.26 0.21 0.16 0.12 0.09
 3  1.00 0.98 0.93 0.85 0.74 0.63 0.52 0.41 0.32 0.25 0.19 0.14 0.10 0.07 0.05 0.04
 2  0.98 0.91 0.80 0.66 0.53 0.41 0.30 0.22 0.16 0.11 0.08 0.06 0.04 0.03 0.02 0.01
 1  0.91 0.73 0.54 0.39 0.27 0.19 0.13 0.08 0.06 0.04 0.02 0.02 0.01 0.01 0.00 0.00
 0  0.60 0.36 0.21 0.13 0.08 0.05 0.03 0.02 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00

("F" = number of failures)

The Microsoft EXCEL® functions for these are:

=1-BINOMDIST(x, y, z, TRUE), where the argument order is BINOMDIST(number_s, trials, probability_s, cumulative)
=POISSON(x, y, TRUE), where the argument order is POISSON(x, mean, cumulative)
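
An equivalent calculation outside of Excel, reproducing the numbers of this example, might look like the following Python sketch (using SciPy):

    from math import log
    from scipy.stats import poisson

    # Implied constant failure rate for R = 0.95 at 5,000 hours
    lam = -log(0.95) / 5_000          # ~0.0000103 failures/hour

    def prob_f_or_fewer(failures, hours):
        # Poisson probability of "failures" or fewer events in "hours"
        # of cumulative operating time, at failure rate lam
        return poisson.cdf(failures, lam * hours)

    # First opportunity to pass: 200,000 hours with zero failures
    print(prob_f_or_fewer(0, 200_000))   # ~0.13, below the 0.20 risk level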

2.5.1.1.3. Accelerated Testing


Accelerated testing is a major part of a reliability program. It is used for many
purposes, including:

• Identification of failure causes


• Qualification
• Life characterization
• Reliability demonstration

One of the critical aspects of accelerated testing is the degree to which acceleration takes
place. Consider the situation depicted in Figure 2.5-4. The reliability requirement, in
terms of lifetime in this example, will be specified at a specific stress condition. If tests
are performed at the accelerated conditions of Test 1, there will be some extrapolation to
lifetimes at use conditions (if the purpose is to quantify life). If tests are performed at the
accelerated conditions of Test 2, there will be additional extrapolation to lifetimes at use
conditions. Life modeling is the means of performing this extrapolation, and will be
covered in Section 2.5.1.1.2.3 and Chapter 5.

Figure 2.5-4: Acceleration Levels

The larger the extrapolation distance, the larger the uncertainty in the reliability estimate
at use conditions. This is illustrated in Figure 2.5-5.


Figure 2.5-5: Uncertainty in Extrapolation
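
To make the extrapolation concrete, test results are commonly scaled to use conditions with an acceleration factor. A minimal sketch of a thermal acceleration factor of the Arrhenius form follows; the 0.7 eV activation energy and the temperatures are illustrative only:

    import math

    BOLTZMANN_EV = 8.617e-5   # Boltzmann constant, eV/K

    def arrhenius_af(t_use_c, t_test_c, ea_ev):
        # AF = exp[(Ea/k) * (1/T_use - 1/T_test)], temperatures in kelvins
        t_use, t_test = t_use_c + 273.15, t_test_c + 273.15
        return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_test))

    # Illustrative: 125C test vs. 55C use for a 0.7 eV mechanism
    print(arrhenius_af(55, 125, 0.7))   # ~78: each test hour ~ 78 use hours

The further the test stress sits from the use stress, the larger this factor becomes, and the more the reliability estimate depends on the assumed model and its parameters.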

The relevancy of failure causes must be considered when using accelerated test data to
model product or system reliability in field deployed conditions. For example, if failures
occur in an accelerated test, the questions to be addressed are:

1. Can the failure cause occur under field conditions? Or has it been induced by the
test?
2. If the failure cause is relevant, can its reliability characteristics be scaled to field
use conditions with an acceleration model?

For example, consider several scenarios illustrated in Figure 2.5-6. Case 1 illustrates the
situation in which the failure cause observed in accelerated testing is relevant, and its
probability of occurrence can be extrapolated to use conditions with an acceleration
model. Case 2 illustrates the situation in which the failure cause observed in accelerated
testing is not relevant, and its probability of occurrence cannot be extrapolated to use
conditions with an acceleration model. Case 2 is representative of a situation in which
there is a “threshold” stress, above which the failure cause has been induced by the test.
The higher the acceleration, the higher the risk is that Case 2 will occur. For this reason,
for the purposes of quantifying reliability under field use conditions, highly accelerated
tests (like HALT) must be used with caution.


Figure 2.5-6: Acceleration levels

Alternatives are also available that will cover more of the “life-stress” space, as shown in
Figure 2.5-7. This approach is desirable because there is minimal extrapolation to field
use conditions, and validity of the acceleration models over a broader stress range can be
ascertained.

Figure 2.5-7: Acceleration Alternatives



Another factor to consider in accelerated testing, when used to quantify reliability at use
conditions, is the relative probability of occurrence of various failure causes as a function
of stress level. Each failure cause will have unique acceleration characteristics as a
function of stress, depicted as the slope of the life-stress line. They will also have unique
probabilities of occurrence, as depicted as the vertical position of the life-stress line.
Taken together, these factors indicate that each failure cause requires its own model.
This is illustrated in Figure 2.5-8. In this life-stress plot, the slope represents
the dependency of life as a function of stress, and the position of the line represents the
absolute life. As can be seen, the relative probabilities of the causes will depend on the
stress level.

Figure 2.5-8: Relative Lifetime vs. Stress

2.5.1.1.4. Highly Accelerated Life Test (HALT)


Highly Accelerated Life Test (HALT) is a popular technique in reliability testing. It is
useful to achieve very large acceleration factors. HALT is a test methodology that
simultaneously subjects an item to highly accelerated levels of thermal cycling and
vibration. It can be a useful tool in identifying mechanical design weaknesses. It is a
particularly valuable technique in the identification of the weakest area(s) of a new
design in the shortest possible time. Therefore, it is often used as a tool to grow the
product reliability through a test, analyze, and fix sequence.


Such tests can:

• Provide a means of sampling inspection for incoming component lots


• Be used for burn-in screening tests. This is called Highly Accelerated Stress
Screening (HASS). For HASS, care must be taken to ensure that high levels of
the accelerating stress will not damage or remove excessive life from units that
are to be put into service.
• Be used for pilot tests to get information needed for planning a more extensive
accelerated life test (ALT) at lower levels of the accelerating variable
• Be used to assess the relevance of specific failure modes
• Be used to obtain shorter test times to allow design engineering to remain focused
on the product or system (resulting in a highly intensive and uninterrupted
engineering effort)
• Serve as a cooperative workshop that involves both suppliers and the customer(s)
• Support collaboration between design and test engineers to address design
weaknesses

HALT requires a different mindset than "conventional" accelerated testing. One is not
trying to predict or demonstrate life, but rather to induce failures of the weakest links in
the design, strengthen those links, and thereby greatly extend the life of the design. Root
cause failure analyses are conducted and repairs and redesign are carried out, as feasible
and cost-effective. Output results from HALT may include a Pareto chart showing the
weak links in the design, and design guidance that can be used to create a more robust
design.

Testing a new design and comparing it against a proven previous generation design using
the same accelerated test provides an efficient benchmarking test. Based on HALT
results, a determination of "optimum" design characteristics can be made using statistical
design of experiments (DOE).

A generic HALT process starts with a temperature survey:

1. Start at room temperature


2. Step down temperature to -100°C in 20° increments, with each dwell time long
enough to stabilize the product's internal temperature (the thermal rate of change
between each temperature transition step should be ~100°C/minute)
3. Step up temperature from -100°C to +40°C at 100°C/min
4. Step up temperature from +40°C in 20°C increments to 100°C or the maximum
temperature for the materials involved, with each dwell time long enough to
stabilize the product's internal temperature (the thermal rate of change between
each temperature transition step should be ~100°C/minute)

Next, a vibration survey is performed:

1. Begin vibration testing at room temperature


2. Start six-axis random vibration at 5 Grms from 2Hz to 12kHz
3. Step up the vibration level in 5 Grms increments, to a maximum of about 50 Grms
4. Dwell for 10 minutes at each level

The vibration stress is provided by mechanically impacting the table with “hammers”.
As such, the frequency spectrum is not truly random, but rather is “pseudorandom”. The
purpose of the vibration survey is to detect weakness in the design as a function of the
stresses created by the increased vibration levels.

A combined environment HALT may also be performed:

1. Superimpose simultaneous temperature cycling from -100°C to +100°C at


~100°C/min of circulating air temperature. Dwell at each temperature only long
enough to “semi-stabilize” the internal temperature of the part
2. During temperature dwells, subject the test unit to vibration at 5 Grms
3. During subsequent thermal cycles, step the vibration level up in 5 Grms increments

In this example, the vibration is applied during temperature dwells, but if failure causes
are possible that are accelerated by vibration stresses during temperature transitions, the
stress profile can be modified to apply vibration continuously throughout the temperature
cycle.

This is a typical stress profile, and will be varied (and should be tailored) based on the
limits of the product or system being tested. The purpose of the step-stress temperature
test is to detect sensitivity of design functionality to temperature and temperature change
rates.

The purpose of the combined environment test is to highlight weaknesses that result
from the interaction effects of simultaneous exposure to temperature and vibration.

Quantifying reliability is generally not the objective of HALT; the objective is to
improve the inherent reliability and robustness of the product or system design.
However, in some cases HALT results can be used as an indicator of field reliability
performance. The fundamental
question to address is this: Does the HALT test excite failure causes that the item may
experience in the field? The answer to this question will depend entirely on the
characteristics of the item under test, and the stresses to which it will be exposed in field
use. For example, if the product or system critical failure causes are accelerated by
thermal cycling and random vibration, and the item will experience these stresses in the
field, then HALT test results may be indicative of field reliability. Likewise, if the
product or system critical failure causes are not accelerated by thermal cycling and
vibration, and/or the product will operate in a benign environment, then the HALT
results will provide very little information regarding field reliability.
2.5.1.1.5. Qualification Testing
Qualification testing is a term used to describe a series of tests that a product or system
must be exposed to, and pass, for it to be considered “qualified” by the industry or
standards body governing the qualification requirements. Several examples of
qualification requirements are provided in Tables 2.5-7 and 2.5-8, for an assembly, and
for a laser diode component, respectively.

Table 2.5-7: Example of a Qualification Plan for an Assembly

Group 1 Test Set
• Impact (packaged, w/ mates removed for test): Cat A: 30" drop (nominal, based on weight, see table 4-7), 10 orientations as specified – SS/Failures: 3/0
• Impact (not packaged): 4" drop (nominal, based on weight, see table 4-9), 5 orientations as specified – SS/Failures: 3/0
• Temperature Cycling: -40°C to 85°C, 100 cycles – SS/Failures: 3/0
• Vibration: 10-55 Hz, 1.52 mm (max = 10 G), 1 min/cycle, 120 cycles, 3 axes – SS/Failures: 3/0

Group 2 Test Set
• Electro-Magnetic Interference: Compliance with MIL-STD – SS/Failures: 1/0
• Electro-Static Discharge: Compliance with MIL-STD-883

Group 3 Test Set
• Damp Heat: 75°C/90% RH; 500 hrs qual, 1000 hrs info only – SS/Failures: 3/0

Group 4 Test Set
• Endurance: Toperating max, Pnominal; full qualification = 2000 hrs, information = 5000 hrs – SS/Failures: 3/0


Table 2.5-8: Qualification Example for a Laser Diode (GR-468, Hermetic Laser Module, active)

• High Temperature Aging at Ambient Condition: 70°C; Q = 2000 hrs, I = 5000 hrs
• Low Temperature Aging at Ambient Condition: min. storage temp.; Q = 2000 hrs
• Damp Heat Aging: 85°C/85% RH; Q = 1000 hrs
• Thermal Cycling: -40 to 70°C; Q = 100 cycles, I = 500 cycles
• Thermal Shock: ΔT = 100°C; 20 cycles
• Vibration: 20 G, 20-2000 Hz, 4 min/cycle, 4 cycles/axis
• Shock: 500 G, 0.5 ms, 5 times/axis
• Electrostatic Discharge: MIL-STD-883, Method 3015

There are many qualification standards in existence, governed by standards bodies within
specific industries. Some noteworthy standards organizations are the IEC (International
Electrotechnical Commission), the U.S. Military (via MIL-specs), ISO, and Telcordia
(for telecommunication components and equipment).

There are several factors which will impact the usefulness of qualification data as an
indicator of field reliability. These are:

• The degree to which the stress is accelerated, and the acceleration factor between
the test and field environments
• The degree to which the stress accelerates critical failure causes that the product
or system will experience in the field
• The sample sizes used, which impacts the statistical significance of the data

The first two bullets are treated in detail elsewhere in this book. The last bullet is
discussed next.


A common way in which sample size requirements are identified in standards is with a
Lot Tolerance Percent Defective (LTPD) methodology. This concept is identical to the
reliability demonstration idea presented previously. In this case, two parameters are
specified:

1. The percent of allowable defects


2. The confidence level

From before:

1 − CL = R

In this case, the value of “R” is the reliability of the entire sample size. So, if the test
plan is established to allow no failures (this will require the minimum sample size), the
equation becomes:

1 − CL = R^n

where “n” is the sample size. For example, if the allowable percentage of defects is 20%,
and the desired confidence level is 0.90 (i.e. 90%), then n = 11 is the minimum sample
size required, as shown:

1 − 0.9 = 0.8^11

So, if the test is performed on 11 samples with no failures, then there is a 90% confidence
that the true reliability is greater than 0.8 (i.e., the probability of failure is less than 0.2).
Other plans are also available that allow a certain number of failures. These require
larger sample sizes, and are determined with binomial statistics.
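
Both calculations can be sketched in a few lines of Python (a minimal illustration consistent with the equations above; SciPy supplies the binomial distribution):

    import math
    from scipy.stats import binom

    def zero_failure_sample_size(R, CL):
        # Minimum n such that R**n <= 1 - CL (no failures allowed)
        return math.ceil(math.log(1.0 - CL) / math.log(R))

    def sample_size_allowing_failures(R, CL, max_failures):
        # Smallest n with P(<= max_failures | n, p = 1 - R) <= 1 - CL
        n = max_failures + 1
        while binom.cdf(max_failures, n, 1.0 - R) > 1.0 - CL:
            n += 1
        return n

    print(zero_failure_sample_size(0.8, 0.9))          # 11, as in the text
    print(sample_size_allowing_failures(0.8, 0.9, 1))  # 18 if one failure is allowed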

Since the LTPD is generally less than the required reliability, qualification data is usually
not sufficient, in and of itself, to demonstrate reliability requirements. It can, however,
be valuable data when used in combination with other data sources.

As an example, consider a case in which a reliability requirement is that a product or
system must have less than 3% cumulative failures after 1000 hours of operation. This is
shown as the star in Figure 2.5-9. Now, let’s say that the item is represented by a
multimode Weibull distribution (notice the three distinct portions of the curve
representing the bathtub curve), characterized by the probability line called “Case 1” in
Figure 2.5-9. If 11 parts were tested, and zero failures occurred after 300 hours of
operation, the only statistical statement that can be made is that there is a 90% confidence
that the true unreliability is less than 0.2 at 300 hours, shown as the solid star and arrow.
Here, the data is not sufficient to determine if the actual distribution is “Case 1,” or that
the reliability requirement is met. However, testing 11 samples may be sufficient to
determine if we have a wearout mode occurring at a time less than 300 hours, as
illustrated in “Case 2.”
Figure 2.5-9: Reliability Requirement vs. Small Population Reliability Inference
(Weibull probability plot of unreliability F(t) vs. time, showing Case 1 and Case 2)

If the goal of the test is to demonstrate that the infant mortality percent-fail value from
the first of the distribution modes is, for example, less than 1%, it can be seen that
testing 11 samples will not come close to demonstrating this requirement.

This example is shown to illustrate the fact that the demonstration of reliability due to
wearout related failure causes can be done with relatively small populations, whereas low
percent fail values typical of infant mortality cannot.


In any case, however, the goal of a reliability program is to ensure that the actual
probability line is to the right of the reliability requirement point.
2.5.1.1.6. DOE-Based Multicell
The methodology of a DOE (Design of Experiments)-based multicell test involves subjecting
a sample of products to a combination of factors, or accelerants. These factors can be
stresses or categorical variables. The intent of these tests is to generate the data that is
required to develop a life model that is capable of predicting reliability under a variety of
use conditions. Life modeling is usually performed for specific failure causes. A goal of
a reliability program is to identify those causes that warrant the work required to develop
a life model. Characteristics of these “critical failure causes” often include:

• Failures experienced in EVT tests


• New, unproven technology
• New, unproven manufacturing processes
• Items exposed to stringent/severe environmental conditions
• Items exposed to stringent/severe operating stresses
• Items designed or manufactured with non-robust practices
• Items with known life limitations
• Items from suppliers with a history of delivery, cost, performance or reliability
problems
• Old technology with availability problems (obsolescence and/or diminishing
manufacturing sources)

After the identification of critical failure causes of a product or system that require life
modeling, action must be taken to ensure that those items are sufficiently robust to meet
product/system reliability and durability requirements. Life modeling is used for this
purpose, and involves the characterization and quantification of specific failure causes,
making it a critical element of a reliability program.

A generic life modeling methodology is shown in Figure 2.5-10.

Figure 2.5-10: Life Modeling Methodology
(Identify Factors → Reliability Tests → Develop Life Model → Predict Reliability under Use Conditions → Model of System Reliability)

Each of the elements in Figure 2.5-10 is further examined below. Additionally, the
topics of Design of Experiments (DOE) and life modeling are treated in more detail in
Chapters 4 and 5, due to their relatively complex nature and their importance to life
modeling. A detailed example of a life model developed is also provided in Chapter 7.

Identify Factors
Factors are the independent variables that can influence the product reliability, and the
response variable is the dependent variable. DOE is a common technique used to study
the relationships amongst many types of factors. In the context of this book, the response
variables specifically refer to the reliability metric of interest.

Critical failure causes and the factors that potentially affect their probability of
occurrence need to be identified. This can be done through testing, through analysis, or
both. EVT testing that is performed as part of the overall product/system reliability
program can be used for the identification of these factors, as previously described.


FMEA is also a popular analytical technique for this and will be used in the upcoming
example.

Factors fall into one of several categories:

• Stresses
o Environmental
o Operational
• Product/System Attributes
o Design factors
o Manufacturing processes

Each of these factors can be a continuous or a categorical variable:

• A continuous variable is one that can assume any value within a given range
• Categorical variables are those that assume a discrete number of possibilities

Some factors can be modeled as either. For example, environmental stress can be
modeled with continuous variables of the specific environmental stresses (i.e.,
temperature, vibration, humidity, etc.), or it can be modeled as a categorical variable.
The latter case is the approach that has historically been used in MIL-HDBK-217, which
uses environmental categories like “Ground, Benign,” “Airborne, Inhabited,” etc. The
217Plus methodology treats them as continuous variables, but default values are provided
for the categorical values of environment.

There are several ways in which these factors can be identified. One method that has
proven to be an efficient means of accomplishing this is to utilize the FMEA. This
involves modifying the FMEA to include several additional columns that correspond to
the above listed factors. At the analyst's discretion, from one to four additional columns
can be included. This will depend on the type of product or system under analysis and
the level of rigor desired. In this approach, the FMEA team (or at least someone
knowledgeable with the item design and process attributes) identifies the specific stresses
or attributes that will affect the probability of occurrence of the specific failure cause that
was identified in the FMEA. Since each failure cause will generally have an associated
risk priority number (RPN), the cumulative RPN can be calculated for all failure causes
affected by the specific stress or product/system attribute.

For example, consider the case in which an FMEA was accomplished in this manner, and
the results in Figure 2.5-11 were obtained. Here, only the environmental stresses are
shown, but the same methodology would apply to whichever additional factors are
included in the FMEA.

A more detailed discussion of the FMEA methodology is provided in Chapter 8.

Figure 2.5-11: Identification of Test Stresses Based on the FMEA

In this case, the sum of the RPN values for all failure causes accelerated by mechanical
shock is about 500. This cumulative RPN value is a relative number only, but can
provide valuable insight into the most important stresses to be addressed in the reliability
test plan.

In this example, the test stresses shown pertain to all of the failure causes addressed in the
FMEA. In performing life tests on specific failure causes, the information identified in
the FMEA should be used to identify the test stresses to be considered in the DOE plan.


Reliability Tests
If critical item failure mechanisms are time dependent, then time-based life tests are
required. Life tests are conducted by subjecting test samples to a defined stress level and
measuring the times when failure occurs. The process is repeated for various
combinations of factor levels. Considerations for the reliability tests are described below.

Test Plan
If there are multiple accelerating stresses, then life tests must be conducted at various
combinations of stress magnitudes. A plan should be developed using an effective tool
such as Design of Experiments. The plan should consider all aspects of testing so that the
test program generates data in a cost effective way. It is easy to lapse into the mentality
of testing “one factor at a time”, in which tests are conducted to assess specific factors,
but this approach is generally not time- or cost-effective.

Factors to consider in establishing an appropriate DOE include (1) the sample size per
test cell, (2) stress levels, (3) the number of stress levels for each stress, (4) stress
interactions, (5) stress durations, (6) failure criteria, and (7) measurement methodology
(i.e., in-situ or periodic). The principles of DOE are treated in more detail in Chapter 4.

Maximum Test Stress


A prerequisite for developing a complete test plan to assess the lifetime of a product or
system attribute is knowledge of the maximum stress magnitude that can be tolerated by
the item prior to catastrophic failure. This knowledge supports establishment of an upper
bound on subsequent test stresses that may be a part of step-stress testing. These tests are
generally performed as part of the EVT tests.

In many cases, it is desirable to establish the upper bound of the test stress for each
specific stressor. An efficient way to determine this stress level, often called the
“destruct limit”, is to perform a step stress test. Here, a sample of units is exposed to a
stress level well below the suspected destruct limit. Then, the stress is increased until the
product is overstressed. This step-stress test can include a linearly ramped stress, or a
stepped-stress in which the samples are exposed to a constant stress for a given dwell
time, after which the stress is increased, dwelled, and so on until failure. An example of
the identification of these maximum stresses was mentioned previously in the HALT
discussion.

The destruct limit can be used as the upper limit of all subsequent life tests. Usually, the
actual life tests will be performed at a maximum stress that is a certain percentage level

below the destruct limit. This percentage is dictated primarily by the sensitivity of the
TTFs to the stress. For example, consider the two cases illustrated in Figure 2.5-12.
Case 1 is a situation in which the lifetime, and subsequent reliability, is moderately
sensitive to the stress level. Case 2 is a situation in which the lifetime has an extreme
sensitivity to the stress level.

Figure 2.5-12: Using the Destruct Limit to Define the Life Test Max Stress

For example, if a power law acceleration model is used, the life – stress relationship is:

Life = A / S^n

where “A” is a life constant and “S” is the stress.

A typical value of “n” for Case 1 would be 1 to 3, whereas a typical value of “n” for Case
2 would be greater than 20.

In case 1, the maximum stress for the life tests may be 10-20% below the destruct limit.
For Case 2, however, the maximum stress should be only a few percentage points below
the destruct limit. Otherwise, the risk is taken that the product or system will not fail
within a reasonable time period, which is required for reliability model development.
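
The sensitivity argument can be made concrete with a short sketch (the values of "n" are illustrative of the two cases):

    def life_multiplier(stress_ratio, n):
        # Power law Life = A / S**n: relative change in life when the
        # stress is scaled by stress_ratio (e.g., 0.9 = 10% reduction)
        return stress_ratio ** (-n)

    # Backing the test stress off 10% from the destruct limit:
    print(life_multiplier(0.9, 2))    # ~1.2x longer life (Case 1)
    print(life_multiplier(0.9, 25))   # ~14x longer life (Case 2)

The n = 25 case shows why the stress back-off must stay small for Case 2: a 10% reduction already stretches the expected test time by an order of magnitude.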


Stress Profile
The two main types of stress profiles are steady-state and time varying. Steady state tests
are those in which a sample set is exposed to constant stress levels, and the response
(performance parameter(s)) is measured. Several examples are shown in Figure 2.5-13.

Figure 2.5-13: Possible Stress Profiles

Any of the profiles in Figure 2.5-13 can be used to develop life models. If the time-
varying stress profiles are used, a cumulative damage model is usually appropriate. In
this case, the stress function is integrated to obtain the cumulative damage. This will be
explained in more detail later.
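
A minimal sketch of such a cumulative damage calculation, assuming a power-law life model and a hypothetical stepped-stress profile (the constants are placeholders):

    def cumulative_damage(profile, A, n):
        # Miner's-rule-style accumulation: damage accrues at rate 1/Life(S),
        # with Life(S) = A / S**n; failure is predicted when damage reaches 1.0
        return sum(dt / (A / s ** n) for dt, s in profile)

    # Hypothetical profile: list of (duration in hours, stress level) steps
    profile = [(100, 1.0), (100, 1.5), (50, 2.0)]
    print(cumulative_damage(profile, A=10_000, n=3))   # ~0.084 of life consumed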

Some of the advantages and disadvantages of the two generic approaches are listed in
Table 2.5-9.


Table 2.5-9: Stress Profile Option Advantages and Disadvantages

Steady State Stress
• Advantages: Results can be easily interpreted; facilitates the de-convolution of time and stress effects
• Disadvantages: Longer test times required; requires knowledge of destruct limits

Stepped (or Linearly Ramped) Stress
• Advantages: Short test times possible; a good approach when the time-to-failure characteristics as a function of stress are unknown; does not require knowledge of destruct limits
• Disadvantages: Can be difficult to model parameters; software required for modeling

Optimum Measurement Intervals


When testing is performed on products or systems whose performance cannot be monitored in-situ, the test needs to be run such that performance measurements are made at periodic intervals. These intervals need to be frequent enough to bracket the TTFs tightly, so that the life model parameters can be estimated with acceptable accuracy.

The objective of the measurement intervals is to obtain as much resolution as possible in the regions of time that exhibit high failure rates. As a rule of thumb, the measurement intervals should be an order of magnitude shorter than the failure times.

There are several approaches to determining the appropriate measurement intervals:

1. Use constant intervals. While this approach may not be optimal, it can be appropriate in cases where the failure characteristics are completely unknown.
2. If the rate of occurrence of failure (ROCOF) is expected to decrease over time, the measurement intervals can start out very frequent, and decrease in frequency as the failure rate decreases. This is shown in Figure 2.5-14.



Figure 2.5-14: Measurement Points for an Infant Mortality Failure Cause

If the ROCOF is expected to increase over time, the measurement intervals can start out very infrequent, and increase in frequency as the failure rate increases. This is shown in Figure 2.5-15.


Figure 2.5-15: Measurement Points for a Wearout Failure Cause

This case is generally much more difficult to implement because the failure
characteristics need to be known before the tests. Therefore, one of the first two
approaches is usually desirable.
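One simple way to lay out such schedules, shown in the sketch below, is geometric spacing, dense where the ROCOF is expected to be high; the spacing rule and the numbers are assumptions for illustration, not a prescription.

```python
import numpy as np

# Sketch: candidate measurement-point schedules when in-situ monitoring is
# not possible.  Geometric spacing is an assumed convention: intervals are
# short where the ROCOF is expected to be high, long where it is low.

def geometric_schedule(t_first, t_end, n_points):
    """Measurement times whose spacing grows by a constant ratio."""
    return np.geomspace(t_first, t_end, n_points)

# Infant mortality (decreasing ROCOF): dense early, sparse late.
infant = geometric_schedule(1.0, 1000.0, 10)

# Wearout (increasing ROCOF): mirror the schedule so it is dense late.
wearout = 1000.0 - geometric_schedule(1.0, 1000.0, 10)[::-1] + 1.0

print("Infant-mortality schedule:", np.round(infant, 1))
print("Wearout schedule:        ", np.round(wearout, 1))
```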

Sample Size Requirements


The determination of adequate sample sizes will depend on several factors, the most
important being whether the failure cause is special cause or common cause. If it is
special cause, the sample size needed will depend entirely on the percent of the
population affected by the failure cause. For example, if the failure cause manifests itself
in 0.1% of the population, then at least 1000 items would be required in order to expect a
single failure. Since multiple failures are required for true quantification, an order of
magnitude more items, or about 10,000, would be required. The specific number can be
calculated by using the principles of reliability demonstration, as explained elsewhere in
this book.
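As a rough illustration of that calculation, the sketch below finds the smallest sample size for which at least r failures are expected at a given confidence level, using the Poisson approximation to the binomial; the 0.1% incidence is the example from the text, while the confidence levels are assumptions.

```python
from math import exp, factorial

# Sketch: sample size needed to observe at least r failures from a special-
# cause defect affecting a fraction p of the population, at confidence CL.
# Uses the Poisson approximation to the binomial (valid for small p).

def min_sample_size(p, r, CL):
    """Smallest n such that P(at least r failures) >= CL when each of
    n items fails with probability p."""
    n = 1
    while True:
        mu = n * p
        p_fewer = sum(exp(-mu) * mu**k / factorial(k) for k in range(r))
        if 1.0 - p_fewer >= CL:
            return n
        n += 1

p = 0.001  # 0.1% of the population affected (example from the text)
print("n for >= 1 failure at 90% confidence:", min_sample_size(p, 1, 0.90))
print("n for >= 3 failures at 90% confidence:", min_sample_size(p, 3, 0.90))
```

With p = 0.001, roughly 2,300 items are needed to expect one failure at 90% confidence, and over 5,000 to expect three, consistent with the order-of-magnitude guidance above.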

If the failure cause is a common cause mechanism, meaning that the entire population is
at risk, then many fewer items would be required. In this case, test data on enough
samples is required such that differences in reliability as a function of the factors (i.e.,
stresses, indicator variables) can be determined in a statistically significant manner. This
will be a function of how much inherent variability there is in the population, and how
sensitive the reliability is as a function of the factors under analysis. Essentially, if these
variabilities are known, then statistical techniques, like the Fisher F-test, could be used.
However, in practice, these variabilities are rarely known a priori. Therefore, sample
sizes as large as possible are preferred. In practice, the sample sizes are usually dictated
by programmatic constraints, in which case it is the reliability practitioner’s responsibility
to lobby program managers for the required samples.

Test Time
The question as to how long tests should be run before stopping them inevitably needs to
be addressed. This is especially true in cases where the stress levels are low and the
resulting lifetimes are long. While it is usually difficult to determine an appropriate test
duration before the test is run, a general rule of thumb is that tests should be run for
durations sufficient to cause at least 50% of the items to fail. This facilitates
quantification of the median life. Keep in mind that tests are used to characterize the
statistical distribution at a specific stress level, and therefore enough failures need to be
experienced to quantify the distribution.

Consider the illustration in Figure 2.5-16. In this case, tests were performed at two stress
levels, and the resulting TTF distributions were obtainable for each level. The
acceleration in this case can be quantified, along with confidence bounds around the
acceleration model parameters.


Figure 2.5-16: Acceleration When the Distributions for at Least Two Stresses are
Available

Now, consider the case in which the lower-stress samples are not tested long enough for a sufficient number of failures to occur. This is shown in Figure 2.5-17. In this case, the distribution cannot be quantified. All that is possible is the estimation of a lower bound on life, via techniques like Weibayes analysis (shown as the star).

Figure 2.5-17: Acceleration When the Distributions for Low Stresses are Not
Available


This 50% objective can sometimes be offset if enough data is available in at least two
other, more stressful conditions, to compensate for the lack of data in the low stress
condition.
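When the low-stress cell ends with no failures, the Weibayes-style lower bound mentioned above can still be extracted. The sketch below assumes the Weibull shape parameter is known from prior experience; the run times, the shape value and the zero-failure Poisson argument are illustrative assumptions rather than the book's prescribed procedure.

```python
from math import log

# Sketch: Weibayes-style lower bound on the Weibull characteristic life
# (eta) when a low-stress test ends with zero failures.  Assumes the shape
# parameter (beta) is known from prior experience; values are illustrative.

def weibayes_eta_lower(times, beta, CL=0.632):
    """One-sided lower bound on eta given zero failures, from the Poisson
    argument  P(0 failures) = exp(-sum(t_i^beta) / eta^beta) = 1 - CL."""
    total = sum(t**beta for t in times)
    return (total / -log(1.0 - CL)) ** (1.0 / beta)

run_times = [800.0, 800.0, 650.0, 500.0, 500.0]  # unfailed unit hours
beta = 2.0                                       # assumed from prior tests
print(f"eta lower bound (63.2%): {weibayes_eta_lower(run_times, beta):.0f} h")
print(f"eta lower bound (90%):   "
      f"{weibayes_eta_lower(run_times, beta, 0.90):.0f} h")
```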

Develop Life Model


After the life data is generated from implementing the DOE plan, a reliability model can
be constructed. Factors that must be quantified include:

• Time-to-failure (TTF) distribution
• Acceleration factors for the primary stress variables
• Characterization of the impact of specific design attributes on reliability

A generic sequence of events for model development is shown in Figure 2.5-18.

(Figure flow: collect data (TTFs; acceleration variables: stress(es), indicator variables) → select TTF distribution → select acceleration model(s) → estimate model parameters → analyze goodness of fit and parameter significance.)

Figure 2.5-18: Life Model Sequence


The TTF distribution can typically be modeled using the Weibull, exponential or
lognormal distributions. For sample "subpopulations" that exhibit different reliability
behavior than the main population, TTF distributions may manifest themselves as
bimodal. It is important that bimodal distributions be characterized. If one of the two
"modes" in the distribution appears to be the result of early failures from workmanship,
materials or process defects, then this information should be used to develop an
appropriate reliability screen. This topic is discussed in detail later in this book.
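A minimal sketch of the distribution-fitting step is shown below, using maximum likelihood on complete (uncensored) data; the simulated data, the SciPy routines and the K-S goodness-of-fit check are illustrative choices, not the only ones available.

```python
import numpy as np
from scipy import stats

# Sketch: fitting a Weibull TTF distribution to complete failure data by
# maximum likelihood.  The data are simulated for illustration; real TTFs
# from the DOE would be used, and censoring would need extra handling.

rng = np.random.default_rng(1)
ttf = stats.weibull_min.rvs(c=2.5, scale=1000.0, size=50, random_state=rng)

shape, loc, scale = stats.weibull_min.fit(ttf, floc=0)  # 2-parameter fit
print(f"Weibull shape (beta) = {shape:.2f}, characteristic life = {scale:.0f}")

# Quick goodness-of-fit check (Kolmogorov-Smirnov):
D, p_value = stats.kstest(ttf, "weibull_min", args=(shape, loc, scale))
print(f"K-S statistic = {D:.3f}, p-value = {p_value:.2f}")
```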

Characterize Operating Stresses


In order to estimate the field reliability of the product, in addition to the life model
(which will predict the life characteristics as a function of the chosen factors),
information regarding the stresses to which the product or system will be exposed in the
field is also necessary.

There are a variety of sources that can be used to estimate the stresses to which an item
will be exposed. First, customers will usually specify nominal and worst case
environmental requirements in the product or system specification. However, the data in
specifications are often very generic and lack sufficient detail for reliability analysis.

Another source of information is direct measurement, either by directly measuring stresses in the item use environment, or by equipping the item with sensors and data logging features.

Field maintenance personnel can also often provide qualitative information pertaining to
stresses, especially when those stresses have resulted in failures.

There is a wealth of information available in both commercial and military handbooks and standards. Many industries also have their own source material from the products or systems used in their industry.

A summary of sources includes:

• Customer specifications
• Customer usage information
• Measurement of conditions:
• Stresses
• Duty cycle
• Extreme event statistics


• Using a sample of fielded products fitted with sensors and data-recording electronics
• Discussions with field maintenance personnel
• Handbooks and standards
• MIL-STD-210, “Climatic Information to Determine Design and Test
Requirements for Military Systems and Equipment”

Predict Reliability Under Use Conditions


Once life models have been developed for all pertinent failure causes, the specific
combinations of design attributes and stresses that result in reliability requirements being
met can be identified. These attributes/stresses define the item "safe operating region,"
which should then be added to the system/product design rules so that reliability
requirements for future designs can be met without having to repeat the reliability
modeling process for that item.

Model of System Reliability


Once life models have been developed for all pertinent failure causes, they need to be
combined such that a reliability estimate of the entire product can be made. Section 2.7
describes this process and the appropriate tools in more detail.

Degradation Modeling
In many cases, the reliability response variable will not be a TTF, but rather it will be the
behavior of a critical parameter as a function of time. In these cases, there are several
choices:

1. Develop a model that predicts the parameter as a function of all factors that need
to be quantified.
2. Derive a simple model (linear, logarithmic, exponential or power law) that describes the parameter as a function of time, and then use this model to estimate a time to failure (i.e., the time at which the parameter is predicted to degrade to some predefined failure threshold).

In many cases, Option 2 is a good choice. Option 1 is a good choice in the following
cases:

1. When the failure mechanism can reach an asymptotic value of degradation. This condition is difficult to model using the conventional life modeling techniques.
2. When the goal of the analysis is to feed other analytical techniques, like worst case analysis (WCA).

A general approach for degradation modeling is shown in Figure 2.5-19.

(Figure flow: data of a performance parameter vs. time is fed to regression or nonlinear model parameter estimation, yielding a model of the performance parameter vs. time, from which predictions of the parameter delta and the percent failing are made; life modeling of the derived failure times then yields a life model and a prediction of the life distribution.)

Figure 2.5-19: Degradation Modeling Approach

This approach starts with data pertaining to the value of a critical parameter as a function
of an independent parameter. This independent parameter is usually time, but can be
other parameters, such as cycles. An example of such data is shown in Figure 2.5-20, in
which five samples were put on test and the critical parameter was measured in situ.


Figure 2.5-20: Degradation Data Example

Next, models of performance vs. time are developed. This can be accomplished using standard model forms (linear, exponential, logarithmic, polynomial) or a more sophisticated nonlinear model form. The standard model forms can be quantified by applying a linear transform to the data and then applying regression techniques; such fits are easily performed in MS EXCEL with its trend line functions. Nonlinear model forms can be quantified using numerical methods; the "Solver" utility in MS EXCEL is, again, an example of this solution type.
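A Python equivalent of the trend-line approach is sketched below for an assumed exponential degradation form; the data points and failure threshold are illustrative only.

```python
import numpy as np

# Sketch: fitting an exponential degradation model  y = a * exp(b t)  via
# the linear transform  ln(y) = ln(a) + b t , then extrapolating to an
# assumed failure threshold.  Data values are illustrative.

t = np.array([0, 100, 200, 300, 400, 500.0])        # hours
y = np.array([10.0, 10.9, 12.1, 13.1, 14.6, 16.1])  # measured parameter

b, ln_a = np.polyfit(t, np.log(y), 1)   # slope and intercept of transform
a = np.exp(ln_a)
print(f"Fitted model: y = {a:.2f} * exp({b:.5f} t)")

threshold = 20.0                        # assumed failure criterion
t_fail = (np.log(threshold) - ln_a) / b
print(f"Extrapolated time to reach the threshold: {t_fail:.0f} hours")
```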

Once these degradation models are available, predictions can be made regarding the degradation value, or the percent of the population failing in accordance with a predefined failure criterion (i.e., percent degradation). Another option is to convert the degradation data to failure times, as shown in Figure 2.5-21. The estimated TTFs are then used to generate a life model using the techniques covered elsewhere in this book.


Figure 2.5-21: Degradation Data Conversion to Times-to-Failure

Note that the resulting TTF distribution can sometimes be counterintuitive. For example,
when dealing with what is believed to be a wearout-related phenomenon, the conversion
of degradation into TTFs can reveal a TTF distribution that is not usually considered to
be a wearout characteristic (for example a Weibull distribution with a shape parameter
that is less than one).
2.5.1.1.7. Reliability Demonstration
An accelerated reliability demonstration is conceptually the same as a non-accelerated
test, but the level of acceleration needs to be quantified. For this, life modeling
approaches are used.

2.5.1.2. Field Data


Reliability data obtained from the field experience of products or systems is an invaluable
source of data. When using empirical field reliability data from a similar item as the
basis of the reliability estimate, there are two fundamental approaches, as illustrated
below.


The first approach is to utilize the field data directly; the second is to utilize the data indirectly, via an interim model developed from it. This is shown in Figure 2.5-22.


Figure 2.5-22: Reliability Estimates from Field Data

This data has been the primary source of data used to develop most of the empirical
prediction methodologies such as MIL-HDBK-217, 217Plus, etc. Due to the author’s
experience with these prediction methodologies, they will be used as examples in Chapter
7 to illustrate the concepts discussed in this section.
2.5.1.2.1. Same Product
Field data on the exact item under analysis is the best information on which to estimate
the reliability of the product or system. Unfortunately, it is usually available too late to
do any good. Reliability predictions and estimates are required long before product or
system deployment. This type of data is a lagging indicator of reliability, whereas the
other techniques discussed in this book are leading indicators. In other words, we need
leading indicators to estimate the reliability that will ultimately be observed with the field
data. This data, however, which should always be collected on products, is valuable in
the reliability assessment of future products.
2.5.1.2.2. Similar product
When using data on a similar product or system to assess a new product or system, the
degree of similarity needs to be accounted for to estimate the new item reliability based
on the empirical data available on the similar product. There are several ways in which
similarity can be assessed. The first approach is to utilize a reliability prediction
technique. This technique can be any of those covered in this document. The
technique’s ability to assess similarity is dependent on the ability of the specific
methodology to:


1. Address the factors that drive the reliability for the two products under analysis
2. Be reasonably sensitive to these drivers.

Regarding #2, for example, if a system is being developed that represents an evolutionary
change to the system for which a reliability estimate is available, estimating the reliability
of the new system based on the data from the old system requires that the prediction
methodology be sensitive to the design differences between the old and new systems. If
these differences consist of the addition of new components, an increase in the operating
temperature, and the addition of software, then the methodology used to assess the
“delta” in reliability between the new and old system must be capable of assessing these
elements, and the reliability prediction approach must be reasonably sensitive to these
factors. The methodology of 217Plus was designed to accommodate this type of
situation, and is further detailed in Section 2.6.

Additionally, it is not necessary that a single methodology be used to assess this “delta”.
Different techniques can be used to assess each of the elements of the design, and the
cumulative effect can be pooled together to form a complete system model. The
techniques used to assess each of the design elements will generally fall into the
categories described in this document.

Another, more qualitative, technique is to simply list the general attributes of the design, as shown in Table 2.5-10. The relative expected reliability of each of these elements for the new and old designs is then listed. This is a qualitative method, but can be useful in some cases.


Table 2.5-10: Similarity Analysis


(For each design and process element below, the reliability ratio between the old and new designs is listed.)

General design:
• Size
• Weight

Design elements:
• Number of components of type "A"
• Number of components of type "B"
• Number of components of type "C"
• Number of optical components
• Thermal dissipation
• Number of connections

Process elements:
• Manufacturing site
• Equipment
• Screening
• Component attachment
• Screening tests
• QC tests

This approach needs to be developed for each product or system, since the reliability
attributes will be unique to that particular item type.

Another approach that can be used to assess similarity is to utilize the FMEA, if
available. This is illustrated in Figure 2.5-23. Here, the FMEA is performed on both the
new and the predecessor system. The failure causes identified represent a cumulative
listing of all failure causes, whether they are applicable to either or both items. Then, the
Occurrence rating is determined for each failure cause for both items. If a specific failure
cause is not applicable to one of the items, then it gets a rating of zero. The sum of the Occurrence ratings is then calculated for each of the products or systems. The ratio of these sums is an indicator of the relative reliability levels of the two items, and is a good measure of the degree to which the items are similar.
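A minimal sketch of this Occurrence-sum comparison follows; the failure causes and ratings are entirely hypothetical.

```python
# Sketch: the FMEA-based similarity indicator described above.  Occurrence
# ratings (0 = not applicable) are hypothetical; each tuple is
# (failure cause, O for the old system, O for the new system).

causes = [
    ("solder joint fatigue",   4, 3),
    ("connector corrosion",    3, 3),
    ("capacitor wearout",      2, 2),
    ("new ASIC infant defect", 0, 5),  # applies only to the new design
]

sum_old = sum(o_old for _, o_old, _ in causes)
sum_new = sum(o_new for _, _, o_new in causes)
print(f"Sum of O (old) = {sum_old}, sum of O (new) = {sum_new}")
print(f"Relative reliability indicator (old/new) = {sum_old / sum_new:.2f}")
```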


(Figure: an FMEA worksheet listing component, function, failure mode, failure effects, causes, severity, occurrence and detectability ratings, and recommended actions for the cumulative set of failure causes of both the new and old systems. Two added columns identify whether each failure cause is applicable to the old system, the new system, or both, and the Occurrence values applicable to each system are summed separately.)

Figure 2.5-23: FMEA as a Tool for Assessing Similarity

2.5.1.2.3. Raw Empirical Field data - Similar Product or System


Raw field reliability data has been a very popular source of data on which to base
reliability estimates. This “similar data” can be based on a specific company’s own field
experience on previous products or systems, or it can be a pooled set of data based on a
variety of companies and organizations. As an example of the latter, one of the RIAC’s
most popular documents has been the “Nonelectronic Parts Reliability Data, (NPRD)”
publication. NPRD is a compilation of observed field reliability data on a wide variety of
components. A summary of NPRD is provided in Section 7.4, to provide the reader with
a guide to the interpretation of this type of data.

For the most part, methodologies such as EPRD (Electronic Parts Reliability Data),
NPRD, MIL-HDBK-217, and 217Plus rely on field data from similar products or systems
in order to make reliability estimates. The manner in which they do this differs, but they
all share the same fundamental type of data as their basis.


2.5.1.2.4. Models
The use of models derived from empirical data to estimate the reliability of a product or
system is just one option for estimating reliability. Empirical models can be developed
and used by the analyst, or he/she can use empirical models developed by others. Models
developed by others include the industry standards or methodologies that many reliability
analysts are familiar with.

This section of the book deals with such models that are derived from the analysis of
empirical field data. Modeling is the means by which mathematical equations are
developed for the purpose of estimating the reliability of a specific item used and applied
in a specific manner. There are many ways in which models can be derived, and there is
no single “correct” way to develop these models. There are many such models in
existence. These models are generally easy to use, in that they are of a closed form and
simply require the analyst to identify the appropriate values of the input variables. The
developers of each of these models had their own perspective in terms of the user
community to be served, the variables that were to be modeled, the data that was
available, etc. It is not the intent of this book to review the specifics of these models, or
to compare them in detail. It is the intent, however, to discuss the rationale and options
for development of the models, and to provide some examples.

The analyst must first decide what variables are to be modeled. Factors that should be
considered as indicators of reliability include:

• Environmental stresses
• Operational stresses
• Reliability growth
• Time dependency
o Infant mortality
o Wearout
• Engineering practices
• Technology
o Feature sizes
o Materials
• Defect rates
• Yields


Which ones are actually included depends on whether data is available to support the quantification of a factor, whether a valid theoretical basis exists for its inclusion, and whether the factor can be empirically shown to be an indicator of reliability.

There are always many more potential factors influencing reliability than can realistically be included in a model. The analyst must choose which ones are considered to be the
predominant reliability drivers, and include them in the model. The next step of model
development is to theorize a model form. This is generally accomplished by attempting
to establish a model consistent with the fundamental physics of reliability. Examples of
the development of several empirically-based models are provided in Chapter 7.

To compare various empirical methodologies, Table 2.5-11 contains the predicted failure
rate of various empirical methodologies for a digital circuit board. The failure rates in
this table were calculated for each combination of environment, temperature and stress.
As can be seen from the data, there can be significant differences between the predicted
failure rate values, depending on the method used. Differences are expected because
each methodology is based on unique assumptions and data. The RIAC data in the last
row of the table is based on observed component failure rates in a ground benign
application.

Table 2.5-11: Digital Circuit Board Failure Rates (in Failures per Million Part Hours)
Environment Ground Benign Ground Fixed
Temperature 10 Deg. C 70 Deg. C 10 Deg. C 70 Deg. C
Stress 10% 50% 10% 50% 10% 50% 10% 50%
ALCATEL 6.59 10.18 13.30 19.89 22.08 29.79 32.51 47.27
Bellcore Issue 4 5.72 7.09 31.64 35.43 8.56 10.63 47.46 53.14
Bellcore Issue 5 8.47 9.25 134.45 137.85 16.94 18.49 268.90 275.70
British Telecom HDR4 6.72 6.72 6.72 6.72 9.84 9.84 9.84 9.84
British Telecom HDR5 2.59 2.59 2.59 2.59 2.59 2.59 2.59 2.59
MIL-HDBK-217 E Notice 1 10.92 20.20 94.37 111.36 36.38 56.04 128.98 165.91
MIL-HDBK-217 F Notice 1 9.32 18.38 20.15 35.40 28.31 48.78 45.44 79.46
MIL-HDBK-217 F Notice 2 6.41 9.83 18.31 26.76 24.74 40.15 73.63 119.21
217Plus Version 2.0 0.28 4.89 0.51 6.04
RIAC data 3.3

For electronic systems, generic handbook models such as MIL-HDBK-217 or Telcordia SR-332 can be separated into two basic approaches, Parts Count and Parts Stress. When the models for these handbooks were developed, researchers performed statistical analyses on collected test and field data to determine major influencing factors for the class of components being considered. For example, for almost all electronic components, the predicted failure rate is found to be a function of operating temperature and applied electrical stress. In general, the lower the operating temperature and applied electrical stress, the lower the predicted failure rate will be. Therefore, the parts stress method
includes model factors for these specific stresses. However, if specific stress values
cannot be determined, it is still possible to perform a prediction using the more general
parts count methodology. For the parts count method, model stress levels have been set
to typical default levels to allow a failure rate estimate simply by knowing the generic
type of component (such as chip resistor) and its intended use environment (such as
ground mobile). It should be noted that these reliability prediction handbook approaches
are, by necessity, generic in nature. Actual test or field data from other similar items is
always more desirable, given sufficient similarity, as was discussed previously.

MIL-HDBK-217
MIL-HDBK-217, “Reliability Prediction of Electronic Equipment”, has historically been
the most widely used of all of the empirically-based reliability prediction methodologies.
The basic premise of the handbook is the use of historical piece part test and field failure
rate data as the basis for predicting future system reliability. The handbook includes
failure rate models for most electronic part types. The latest released version of MIL-
HDBK-217 is "F, Notice 2", dated 28 February 1995². The handbook was almost a
casualty of the DoD Acquisition Reform initiative, but it survived primarily because of its
widespread use, the dependency on it throughout the military-industrial complex, and the
lack of a suitable replacement.

Figure 2.5-24 presents a brief example of the MIL-HDBK-217 parts count method, where
the product or system failure rate is the sum of the failure rates of the generic electrical
and electromechanical components of which it is comprised³. Each piece-part failure rate
is derived by assigning “typical” defaults to the generic component category stress
models. The only factors considered in these parts count component models are (1) the
generic base failure rate for that part type (represented by λg) that is based on an assumed
application environment and default temperature, (2) a generic quality factor (πq) that is
used to modify this part type base failure rate, and (3) the quantity of that part type used
in the equipment. In the example shown here, the λg for a bipolar microcircuit comprised
of between 1 and 100 gates in a 16-pin dual-in-line package used in a ground, fixed (i.e.,
GF) environment and operating at an assumed junction temperature of 60 degrees C is

² As of the publication date of this book, a Draft version of MIL-HDBK-217G is in development and is expected to be released some time in 2010.
³ The current version of MIL-HDBK-217 does not predict the reliability of mechanical components or non-hardware reliability elements, such as software, human reliability, and processes. Field failures of mechanical components and non-hardware items should not be scored against MIL-HDBK-217 or any other electronics-based empirical methodologies.

0.012 failures per million hours. The quality factor for a parts count prediction is also
determined from a table (not shown in this example).

The parts count prediction approach is intended for use early in the design phase of the
equipment life cycle, prior to the start of detailed design, when there is little known about
the specific characteristics of the parts being used, or how they will be applied (such as
individual operating and environmental stresses).
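The arithmetic of a parts count prediction reduces to a weighted sum over generic part types, as sketched below. Only the 0.012 failures-per-million-hours microcircuit value comes from the example above; the other part types, quantities and quality factors are hypothetical placeholders, not values taken from MIL-HDBK-217.

```python
# Sketch: the parts count calculation, in which the equipment failure rate
# is the sum over generic part types of  N * lambda_g * pi_Q .

parts = [
    # (part type,                   quantity, lambda_g (f/1e6 h), pi_Q)
    ("bipolar digital, 1-100 gates",     12,  0.012,              1.0),
    ("chip resistor (hypothetical)",     40,  0.0005,             1.0),
    ("ceramic capacitor (hypothetical)", 25,  0.0026,             1.0),
]

lam = sum(n * lg * piq for _, n, lg, piq in parts)
print(f"Parts count equipment failure rate: {lam:.3f} failures per 1e6 hours")
print(f"Corresponding MTBF estimate: {1e6 / lam:,.0f} hours")
```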

Figure 2.5-24: MIL-HDBK-217 Part Count Example

The parts stress approach of MIL-HDBK-217 is applied later in the design development
phase of the equipment life cycle, when more details of the design are becoming
available and specific part applications are being identified. The use of this approach
requires detailed knowledge of the applied stresses and physical characteristics of the
device, including ambient and/or operating junction temperatures, electrical stress levels
(such as voltage, power, or current) vs. rated parameters, device complexity (such as gate
counts or transistor counts for semiconductors), etc. An example of a MIL-HDBK-217
parts stress model is shown in Figure 2.5-25 for gate/logic arrays and microprocessors.


Figure 2.5-25: MIL-HDBK-217 Part Stress Example

The form of the model separately addresses failure rate contributions from the
microcircuit die and the microcircuit package. The quantitative values for C1 (shown
here) and C2 (not shown) are device- and package technology-dependent. The values for
the temperature and environmental factors, πT and πE respectively, are also technology
dependent and are seen to independently impact the die and package failure rate
contributions. The quality factor, πQ, represents the amount of pre-conditioning screening or testing that the part might get, and the learning factor (πL) reflects the level of maturity associated with the manufacture of the device. Component maturity has been shown to be a predominant reliability driver, since maturity is inversely proportional to the defect density, which in turn is proportional to the failure rate.

Telcordia SR-332 (Bellcore)


The Telcordia SR-332, shown in Figure 2.5-26, was formerly known under the name of
Bellcore. The models are similar in concept and purpose to MIL-HDBK-217, including
the fact that they are empirically based.


Figure 2.5-26: Telcordia SR-332 (Bellcore)

The primary discriminator between the two approaches is that the Telcordia models have
been tailored specifically for the telecommunications industry, meaning that the breadth
of available environmental factors in the Telcordia models is much narrower than in
MIL-HDBK-217. There are three basic reliability prediction methods in the Telcordia
methodology:

• Method I represents a parts count reliability prediction approach, and is applicable to new technology where no field data exists. Since it does reflect new technology, the model includes a "first year multiplier" factor to account for infant mortality failure rates.
• Method II incorporates the characteristics of Method I, but expands its scope to
include the impact of lab test data.
• Method III is based on field tracking of failure rates.

The latter two methods use Bayesian techniques to combine data from various sources, in a manner similar to the RAC PRISM/RIAC 217Plus methodology.


PRISM/217Plus
The original RAC PRISM®⁴ system reliability assessment tool was developed and
released by the RAC in January 2000 as a potential replacement for MIL-HDBK-217.
With the subsequent transition to RIAC in June 2005, the RIAC 217Plus methodology
replaced the RAC PRISM tool and added additional component models. As a result, the
217Plus methodology currently addresses all the major component types found in MIL-
HDBK-217. Figure 2.5-27 symbolizes the replacement of the RAC PRISM tool by
217Plus.

Figure 2.5-27: RAC PRISM Replaced by RIAC 217Plus

With no DoD sponsorship and funding, the models contained within MIL-HDBK-217F,
Notice 2, and the data upon which they were based, were becoming increasingly
outdated and thereby subject to increasing criticism. The part failure rate models
incorporated within PRISM, and ultimately within 217Plus, are based on a much larger
and more recent dataset, reflecting the improvements made in semiconductor device and
packaging technologies, resulting in more “accurate” part failure rate predictions. The
PRISM/217Plus software tool incorporates many of the ideas contained in the “New
System Reliability Assessment Study” performed by the RIAC for what was then Rome
Laboratory (Reference 7). These ideas include the ability to update an analytical
reliability prediction using in-house test data or field experience through Bayesian
techniques, and the ability to factor in system-level reliability impacts resulting from the
robustness (or lack, thereof) of the system development process. This methodology is
discussed in more detail in Chapter 7.

⁴ PRISM is a registered trademark of Alion Science and Technology.

CNET/RDF 2000
The CNET/RDF 2000 Reliability Prediction Standard, shown in Figure 2.5-28, covers
most of the same component categories as MIL-HDBK-217.

Figure 2.5-28: CNET/RDF 2000

An example of an integrated circuit model is shown in Figure 2.5-29.

This model has many similarities to the PRISM/217Plus models, in that it addresses the
year of manufacture, dormancy failure rates, thermal cycling characteristics and electrical
overstress failure rates. The form of the integrated circuits model is shown here, and
bears a resemblance to the format of the MIL-HDBK-217F, Notice 2 microcircuit failure
rate model, in that it partitions the predicted device failure rate into die- and package-
related contributions. As can be seen from Figure 2.5-29, there is quite a bit of
information that the analyst must have access to in order to use the model, but this is
typical for virtually all parts stress reliability prediction models.


Figure 2.5-29: CNET/RDF 2000 Model Example

FIDES
The FIDES methodology, illustrated in Figure 2.5-30, was created by a French
consortium of reliability experts from various companies. It has similarities to the
CNET, RAC PRISM and RIAC 217Plus methods.


Figure 2.5-30: FIDES

Other Methodologies
Other methodologies include:

UTE C 80-810, RELIABILITY DATA HANDBOOK: RDF 2000 – A universal model for reliability prediction of electronic components, PCBs and equipment.

IEC 62380 TR Ed.1 (2003), Reliability Data Handbook – A universal model for reliability prediction of electronics components, PCBs and equipment.

IEC 1709, Electronic components – Reliability – Reference conditions for failure rates and constraints influence models for conversion.

The VITA51 committee is also an organization that is addressing reliability prediction. Its approach has been to adapt and modify existing methodologies, like MIL-HDBK-217, by tailoring various factors of the MIL-HDBK-217F, Notice 2 models so that there is a closer correlation between predicted values and field experience.

Reliability prediction models such as those summarized above can easily become
outdated as technology advances. As previously mentioned, maintaining the currency
and accuracy of a reliability prediction model can be a prohibitively costly and labor-intensive effort. Failure to invest in this activity, however, will doom a reliability
prediction methodology to eventual irrelevancy and obsolescence.

2.5.1.2.5. Collecting Field Data


Since field data is critical to the reliability assessment process, it is explored in this
section. The nuances of collecting and interpreting it are discussed. Some of the issues
encountered in collecting field data are discussed in the NPRD discussion included in
Chapter 7. The intent of this section is to present guidelines on how to approach field
data collection.

Good data collection is the key to an effective process for utilizing data obtained from a
reliability tracking system. This information includes:

• Failure statistics (i.e., TTF, MTBF)
• Application information (i.e. stress, environment, etc.)
• Failure modes
• Failure causes

The intent of this section is to outline a reliability data collection and analysis system that can provide the data required. Although the reliability tracking system outlined herein has similarities to a FRACAS program, there are distinct differences. While a FRACAS program is intended to identify the causes of failures so that corrective action can take place, the program outlined herein is more comprehensive: it assists its user in more than the implementation of corrective actions, as it also provides the data required to quantify reliability in accordance with the methodologies outlined in this book. This concept is illustrated in Figure 2.5-31.


(Figure: reliability program data supports TTF analysis, failure verification, MTBF analysis and root cause identification, which in turn support vendor selection, warranty claims, RCM implementation and the implementation of design improvements.)

Figure 2.5-31: Uses of Program Data Elements

A data system consists of several basic elements: a database, software analysis tools, and
an interface to the data system users. The database is the core of the system that captures
the raw maintenance data that is necessary to perform the required data analysis. A
typical structure of a database is provided in Figure 2.5-32.

(Figure: System Information and Parts Breakdown records feed Maintenance Data records, which in turn link to Root Failure Cause/Analysis Data records.)

Figure 2.5-32: Program Database Structure

The blocks in the above figure correspond to records in a relational database structure.
The data elements associated with each record are defined below. The System Information record consists of population statistics and needs to be updated whenever
product or system status changes. Such a change occurs when new or modified items are
fielded.

The parts breakdown data element consists of a hierarchical description of the system.
This description is necessary to avoid confusion as to which FRUs (Field Replaceable
Units) belong to which assemblies and the number of FRUs in the assembly, as well as in
the entire system.

The maintenance data element consists of a record of the maintenance action taken to
maintain or repair the system. It also consists of a description of the anomaly, the failure
mode, and the failure mechanism of the failed unit as determined by the maintenance
technician. One record corresponds to a single maintenance action, and there can be any
number of them for each FRU in the system (i.e., a FRU in the system can be replaced
any number of times over the life cycle of the system).

The root failure cause/analysis data element consists of information on the results of the
detailed failure analysis that may be performed on the failed unit. It is a separate record
because not all maintenance actions will result in the failure analysis of a removed unit.

There are two primary interfaces required of the system. The first is the maintenance
technician interface. This interface is the means by which maintenance data is entered
into the database. Ideally, this interface would consist of computers located within the
maintenance facility for direct data entry. The second interface is the one utilized by
individuals that need the results of the data analysis. The flow of the interface to the
system from the perspective of the system user is given in Figure 2.5-33.


(Figure: when system maintenance is required, the maintenance technician identifies the part requiring maintenance, enters the part data into the central database, performs the required maintenance, and enters the maintenance data into the database. The system user enters the parts breakdown, maintains the system usage status, and runs the appropriate analysis against the central database to obtain the necessary reliability metric(s).)

Figure 2.5-33: Database Information Flow

Important elements of the data system that should be considered for inclusion are
summarized below:

• System information
o Number of systems fielded
o Dates of fielding for each system
o Location of operation (optional)
o System numbers (unique identifier for each system)

Critical elements of a data collection system are discussed below.

Parts Breakdown
A description of every level of assembly must be available, down to the lowest level of
repair. For the purposes of this example, this assembly will be called a FRU (Field Replaceable Unit). This product or system description is critical to the unique identification of parts so that the data that is reported at various levels is not confounded.
It is also critical if maintenance actions are not consistently performed at the same level.
At the lowest level of indenture, the following FRU information is required.

• Part number
• Serial number
• Part identification code (unique descriptor of part in hierarchical breakdown of
system; sometimes referred to as a Reference Designator)
• Number of parts in the product or system
• Applicable Life Unit (i.e. hours, miles, cycles, operations, etc.)
• Identification as to whether there is an individual elapsed time meter (or miles, cycles, operations) on the specific part, or whether system life units must be used
• Manufacturer name

Maintenance Information
A critical element to an effective reliability data collection and analysis system is the
accurate quantification of the failure cause. Not all perceived failures are real failures
and, therefore, it is important to identify whether part removals are indeed true failures.
Figure 2.5-34 illustrates the hierarchy of maintenance actions.


(Figure: maintenance actions are either scheduled or unscheduled. Scheduled actions consist of routine maintenance or remove/replace actions. Unscheduled actions stem from either a real failure or a false alarm, each of which may be correctly or incorrectly diagnosed, leading to a necessary repair, an unnecessary repair, a faulty unit being put back into the field, or a cannot-duplicate event. For real failures, failure analysis to identify the root cause is either performed or not performed.)

Figure 2.5-34: Hierarchy of Maintenance Actions

The following is a list of required data elements in the capture of maintenance information:

• Job number (unique identifier)


• Calendar date and time that system is taken out of operation
• Calendar date of maintenance action
• System serial or configuration control number
• Number of total life units (i.e. hours, miles, cycles, operations) on the FRU at the
start of the maintenance action (if life unit meter is on FRU)
• Number of total life units (i.e. hours, miles, cycles, operations) on the product or
system at the start of the maintenance action (if life unit meter is not on part)
• Number of total life units (FRU or product/system, depending on which of the
above two items are applicable) on the part at the start of the maintenance action.
This is a calculated field generated by the database software.
• Initial description of the anomaly
• Initiating event (only one is chosen):
o Failure of system to perform (unscheduled maintenance)
o Condition monitoring-based event
o Scheduled maintenance
• When discovered
• Action taken (only one is chosen):
o Remove/replace
o Maintain
o Remove, re-test OK, and replace
• FRU on which action is taken (description and serial/configuration control
number)
• Maintenance technician (name)
• Man-hours required for maintenance action
• Calendar date and time that the system is put back into service
• Cause of failure identified by the maintenance technician
• Failure mode description
• Failure mechanism description. There could be a standardized listing of the
possible failure mechanisms from which the technician could scan and identify
the appropriate mechanism.

Failure Analysis Information


The failure analysis record is used when there is a detailed failure analysis performed on
a removed FRU. The data contained in this record generically consists of the following:

• Summary of the analysis performed
• Results of the analysis
• Failure cause (should be the root failure cause, not a failure symptom cause)

Analysis
From the data collected and captured in the database, several fundamental reliability
parameters, including those listed below, can be calculated.

• Operating hours (or life units) of each FRU
• Cumulative operating hours of the population
• Cumulative system calendar hours of the population
• Cumulative FRU calendar hours of the population
• Individual calendar times for each product or system
• For scheduled removals:
o Number of scheduled removals
o Total number of man-hours associated with scheduled removals
o Individual operating times for scheduled removals
o Individual calendar times for scheduled removals
o Number of man-hours for each scheduled removal
• For unscheduled removals:
o Number of unscheduled removals
o Total number of man-hours associated with unscheduled removals
o Individual operating times for unscheduled removals
o Individual calendar times for unscheduled removals
o Number of man-hours for each unscheduled removal
• Number of total removals
• Total number of man-hours
• Individual number of man hours
• Individual operating times of all removals
• Individual calendar times of all removals
• Number of removals for each failure cause
• Individual operating times of removals for each failure cause
• Individual calendar times of removals for each failure cause
• Total time that each individual product or system is unavailable

For many of these parameters, it is necessary to calculate the number of life units to
which each part has been exposed. This is done by calculating the number of life units on
the part since the last time that the part was replaced. This calculation procedure is
illustrated in Figure 2.5-35.


(Figure: if there is a life unit meter on the part, its reading, i.e., part hours/miles/cycles, is recorded directly. If not, the system life unit meter is used: if the part has been previously removed, i.e., a maintenance record for it exists in the database, the system life units recorded at the last maintenance action are subtracted from the current system life units; otherwise, the current system life units are recorded.)
Figure 2.5-35: Calculation of Part Life Unit

Outputs
A list of typical output parameters is given below:

• Mean Operating Hours Between Scheduled Removals
• Mean Calendar Hours Between Scheduled Removals
• Mean Operating Hours Between Unscheduled Removals

• Mean Calendar Hours Between Unscheduled Removals


• Mean Man Hours per Maintenance Action (MMH/MA)
• Distribution of maintenance man hours per maintenance action
• Weibull parameters of individual operating times for unscheduled maintenance
actions
• Weibull parameters for failures of a specific cause
• Pareto ranking of part failure rates (or of any of the above listed parameters)
• Failure cause distribution
• Pareto ranking of failure causes
• Mean system availability for each system
• Distribution of system availability

Drenick’s Theorem
An important aspect of interpreting field reliability data is distinguishing between
calendar time and operating time. Consider a situation in which five items are fielded at
the same time, as illustrated in Figure 2.5-36. They will each have a failure time (or other
appropriate life unit) that is described by the TTF distribution as a function of operating
time.


Figure 2.5-36: Failure Times Based on Operating Time


Now, consider the same five items that were placed in the field at different calendar
times, as illustrated in Figure 2.5-37. They will have the same failure times relative to
their operating time, but the apparent failure times relative to calendar time will be quite
different.


Figure 2.5-37: Failure Times Based on Calendar Time

Furthermore, if the product or system is repairable (in which case the failed items are
replaced upon failure with a new item), an interesting effect occurs in which the apparent
failure rate will reach an asymptotic value that appears to represent a constant failure rate.
This occurs as the “time zero” values become randomized as items fail and are replaced
with new items.

To illustrate the relationship between the beta value (Weibull shape) and the
instantaneous failure rate as a function of calendar time when parts are replaced upon
failure, a simulation was performed. In this simulation example, the failure rate of 1100
items as a function of calendar time was calculated.

Figures 2.5-38 through 2.5-42 illustrate the results. These figures correspond to Weibull-
distributed TTFs with shape parameters of 20, 5, 2, 1 and 0.5, respectively. The time axis
is calendar time, normalized to a time unit of one characteristic life.
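A minimal version of such a renewal simulation is sketched below; it is not the original code, and the characteristic life, bin width and time horizon are assumptions. Items are replaced upon failure, and for beta > 1 the observed rate per calendar-time bin flattens toward the reciprocal of the characteristic life.

```python
import numpy as np

# Sketch: renewal simulation of 1100 items with Weibull-distributed TTFs,
# each replaced upon failure.  The failure rate observed in calendar-time
# bins approaches 1/eta once the "time zeros" become randomized.

rng = np.random.default_rng(0)
n_items, eta, beta = 1100, 100.0, 5.0
horizon, bin_width = 500.0, 5.0
counts = np.zeros(int(horizon / bin_width))

for _ in range(n_items):
    t = 0.0
    while True:
        t += eta * rng.weibull(beta)   # draw next TTF; replace on failure
        if t >= horizon:
            break
        counts[int(t // bin_width)] += 1

rate = counts / (n_items * bin_width)  # failures per item per time unit
print("Rates near one characteristic life:", np.round(rate[18:22], 4))
print("Late rates (approach 1/eta = 0.01):", np.round(rate[-4:], 4))
```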


Figure 2.5-38: Failure Rate Simulation with Weibull Beta = 20

Figure 2.5-39: Failure Rate Simulation with Weibull Beta = 5.0


Figure 2.5-40: Failure Rate Simulation with Weibull Beta = 2.0

Figure 2.5-41: Failure Rate Simulation with Weibull Beta = 1.0


Figure 2.5-42: Failure Rate Simulation with Weibull Beta = 0.5

Consider the case where the Weibull beta = 20 (Figure 2.5-38). When the populations
start operating at the same time at t = 0, the failures occur at a rate described by the
Weibull distribution with a beta value of 20. The peak of the failure rate occurs at
approximately the characteristic life value of time. As units fail and are replaced, the
“time zeros” start to become randomized. As enough time passes, the “time zeros” will
eventually become completely randomized. At this point, the asymptotic value of failure
rate is reached, which is the reciprocal of the characteristic life (in this case, 100). Figure
2.5-39, depicting the simulation results for a beta value of 5.0, indicates a similar effect.
The asymptotic failure rate, however, is reached sooner. This happens because the
variance in failure time is greater for a beta of 5.0 relative to a beta of 20, which, in turn,
means that the population “time zeros” become randomized sooner. The plot illustrating
a beta of 2.0 (Figure 2.5-40) is similar, with a corresponding asymptotic value reached
sooner. The plot corresponding to a beta of 1.0 (Figure 2.5-41) indicates that the random failure rate occurs at t = 0, which intuitively makes sense, since the exponential case has, by definition, a randomly occurring failure rate.

However, when the beta is less than 1.0 (Figure 2.5-42), the asymptotic failure rate value
is zero. This occurs because, when enough time has passed, the failed items have been
replaced with items that have a higher probability of living longer. The lower the beta
value, the shorter the time period required to achieve a zero failure rate.

Because this is an important factor in interpreting field reliability data, a methodology was derived for the NPRD data to estimate the characteristic life based on field data with
varying “time zero” values. This methodology is discussed in Chapter 7, Section 4.

2.5.2. Physics
The generic approaches covered here are stress/strength interference models and models from first principles. Each is described below.

2.5.2.1. Stress/Strength Modeling


Stress/strength interference theory is a technique used to quantify the probability that the strength of an item is less than the stress to which it is subjected. For example, if the distribution of the strength of an item can be quantified, and the distribution of the stress it is under can be quantified, the area of intersection of the two distributions represents the probability that the strength is less than the stress.

This technique is general in nature, and applies equally to any situation in which the two distributions can be quantified, as long as the X-axis represents the same variable for both distributions. The variable can be electrical, such as voltage or current, or it can be mechanical strength, for example, in units of KPSI.

The goal of any design for robustness effort is to minimize the variance of both
distributions, and maximize the separation of the distribution means. In this manner, the
probability of distribution intersection, or failure, is minimized.

An example of this approach is illustrated in Figure 2.5-43.


(Figure: design dimensions and extrinsic stresses, together with material properties such as modulus and CTE, feed an FEA or closed-form stress estimate; strength and fatigue data define the strength distribution; the interference of stress and time-dependent strength yields the probability of failure vs. time.)

Figure 2.5-43: Stress Strength Methodology

In this example, a mechanical item has certain physical properties, for example its modulus and its coefficient of thermal expansion (CTE). These material properties are used, in addition to the design variables (i.e., dimensions, extrinsic stresses), to estimate the
stresses to which the item is exposed. This stress can be modeled in several ways. One is
the use of handbooks that contain closed-form equations that estimate the stress to which
a material is exposed as a function of dimensions, force, deflections, etc. This is usually
only viable for simple structures. For more complex mechanical structures, finite
element models and analysis (FEA) may be required to simulate stresses.

For the strength portion of the model, two factors need to be considered:

• The inherent strength distribution of the material
• The strength properties as a function of time


An example of strength as a function of time is the fatigue behavior of the material; the fatigue properties govern the strength degradation over time.

At time = 0, the probability of failure is the intersection of the stress and the strength
distributions, as illustrated in Figure 2.5-44.

Figure 2.5-44: Stress/Strength Interference

The calculation for Normally-distributed stress and strength distributions is:

$$Z = \frac{\mu_x - \mu_y}{\sqrt{\sigma_x^2 + \sigma_y^2}}$$

where:

Z = Standard Normal variate (i.e., the number of standard deviations from the standardized Normal distribution). The failure probability corresponding to "Z" can be obtained from:
1. Tables of the Standard Normal distribution
2. The MS EXCEL formula =NORMSDIST(-Z)
μx = the mean of the strength
μy = the mean of the stress
σx = the standard deviation of the strength
σy = the standard deviation of the stress
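A short sketch of this interference calculation follows; the means and standard deviations are illustrative values only.

```python
from math import sqrt
from scipy.stats import norm

# Sketch: interference probability for Normally distributed stress and
# strength, per the Z expression above.  Values are illustrative.

mu_strength, sd_strength = 50.0, 5.0   # e.g., KPSI
mu_stress,   sd_stress   = 35.0, 4.0

z = (mu_strength - mu_stress) / sqrt(sd_strength**2 + sd_stress**2)
p_fail = norm.sf(z)   # P(strength < stress) = 1 - Phi(Z)
print(f"Z = {z:.2f}, probability of failure = {p_fail:.2e}")
```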


In many real situations, distributions other than the Normal are used, requiring alternate
methods of calculating the interference probability. Readily available software tools can
be used for this purpose (Reference 3).

As stated previously, in addition to the probability of failure at t=0, it is also critically important to understand how this interference between stress and strength behaves as a
function of time. Items will sometimes age (due to mechanisms such as fatigue), which
essentially means that the strength distribution changes such that its mean is lowered.
Assuming that the stress to which the item is exposed remains constant, the result is that
there is more interference, and the failure probability increases with time. To properly
account for this aging phenomenon, the characteristics of this strength distribution and
the interference must be quantified as a function of time. This concept is illustrated in
Figure 2.5-45.

Figure 2.5-45: Stress/Strength Interference vs. Time


An example of a model that has been successfully used for brittle materials is the
following:

$$P = 1 - \exp\left[-\left(\frac{V}{V_0}\right)\left(\frac{\sigma}{S_0}\right)^m\left(\frac{t}{t_0}\right)^{m/n}\right]$$
where:

P = probability of failure
m = Weibull slope of the initial strength distribution
S0 = characteristic strength
n = fatigue constant
V and V0 = volume parameters that account for the effects of size (i.e., the larger
the volume or surface area, the more likely it is to contain a strength-limiting flaw)
σ = stress

Now, if a screen is applied to the material to eliminate defects having strength values
below the applied screen stress threshold (Sth), the probability of failure becomes:
$$P = 1 - \exp\left[-\left(\frac{V}{V_0}\right)\left(\frac{\sigma - S_{th}\left(\frac{t_0}{t}\right)^{1/n}}{S_0}\right)^m\left(\frac{t}{t_0}\right)^{m/n}\right]$$

This is only one example of a stress strength model. Many others can be found in the
literature.
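As a numerical illustration, the unscreened model above can be evaluated directly. The sketch below (Python, not from the source) uses hypothetical parameter values for m, n, S0, t0 and the applied stress:

    import math

    def p_fail_brittle(sigma, t, m, n, s0, t0, v_ratio=1.0):
        # P = 1 - exp[-(V/V0) * (sigma/S0)^m * (t/t0)^(m/n)]
        return 1.0 - math.exp(-v_ratio * (sigma / s0) ** m * (t / t0) ** (m / n))

    # Hypothetical parameters: Weibull slope m=10, fatigue constant n=20,
    # characteristic strength S0=5.5e9, reference time t0=1 second
    for hours in (1, 1000, 100000):
        print(hours, "h:", p_fail_brittle(sigma=2.0e9, t=hours * 3600.0,
                                          m=10, n=20, s0=5.5e9, t0=1.0))

As expected from the (t/t0)^(m/n) term, the computed failure probability grows as operating time accumulates.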

Models such as these can be invaluable in understanding the sensitivity of reliability as a
function of the factors accounted for in the model. However, as is the case with any
physics-based model, it is important to validate the model based on empirical evidence.
This is critical because there is ample opportunity to introduce large errors into the
analysis, based on extreme sensitivity to assumptions, sample variability, etc.
Additionally, while the approach may be grounded in physics, the model parameters
usually need empirical data for their quantification.


2.5.2.2. First Principles


The premise of First Principles modeling is that the fundamental physics that govern a failure
mechanism can be characterized, and that the reliability of the mechanism can be
accurately predicted from the governing equations. This is best illustrated with an example from
References 4 and 5. In this example, the reliability of a Fused Biconic Splitter was
modeled. This is a passive optical component used to split optical signals in fiber optic
telecommunication systems. The observed failure mode was a degradation of the
coupling ratio over time.

The original test plan included Accelerated Aging Tests on Fused Splitters for 3
conditions, as shown in Table 2.5-12.

Table 2.5-12: Test Conditions


Test Condition     Temperature (°C)   Relative Humidity (RH)   Absolute Humidity (AH)
85°C / 85% RH             X                    X
85°C / 16% RH             X                                            X
45°C / 85% RH                                  X                       X

The "X" values in the cells indicate which test conditions share a constant value for the
stress indicated in each column. The conditions were chosen to assess whether relative
humidity or absolute humidity was the predominant driver of the failure mode. In
this case, two of the three conditions have equivalent relative humidity, and two of the three
have equivalent absolute humidity.

The results of the accelerated tests did not agree with a previously hypothesized failure
mechanism that proposed epoxy creep as the coupling ratio drift mechanism. Therefore,
in an effort to obtain a model that was consistent with empirical evidence, the
fundamental physics were investigated. This process is described below:

From optical component physics, it can be shown that the coupling between two fibers is:

$$c = \frac{3\pi\lambda}{32 n_2 a} \cdot \frac{1}{\left(1 + 1/V\right)^2}$$

where:

$$V \equiv ak\left(n_2^2 - n_3^2\right)^{1/2}$$


Additionally, the diffusion of water vapor into silica can be represented as:

$$C(r,t) = C_0\left(1 - \sum_{n=1}^{\infty} B_n J_0(j_n r / b)\, \exp\left\{-j_n^2\left[D_{H_2O}(T)\, t / b^2\right]\right\}\right)$$

where:

$$B_n \equiv 2/\left[j_n J_1(j_n)\right]$$
The hypothesis of the physical mechanism is that water diffuses into the outer surface of
the fused region very slowly and slightly decreases the index of refraction of this outer
surface. This increases the coupling coefficient, thereby increasing the coupling ratio.
As time goes by, more and more water diffuses in, and the coupling ratio increases until
the device goes out of spec. The amount of water in the silica is governed by the number of
water molecules hitting the surface of the silica per unit time (directly proportional to the
absolute humidity) and by the diffusion rate at that temperature. Therefore, if the time to
failure at a specific condition is known, the time to failure at a new condition is the
known TTF multiplied by the ratio of the absolute humidity levels times the ratio of the
diffusion rates.
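This scaling rule reduces to a one-line calculation. The sketch below (Python, not part of the source) applies it to the 85°C/85%RH reference point using the absolute humidity and diffusion constant values from Table 2.5-13 (shown next); it reproduces the roughly 79-year MTBF listed there for the 45°C/85%RH condition:

    def scale_ttf(ttf_ref, ah_ref, d_ref, ah_new, d_new):
        # TTF at the new condition = known TTF x (AH ratio) x (diffusion rate ratio)
        return ttf_ref * (ah_ref / ah_new) * (d_ref / d_new)

    # Reference: 85C/85%RH (MTBF 0.579 years); target: 45C/85%RH (Table 2.5-13)
    mtbf = scale_ttf(ttf_ref=0.579, ah_ref=297.1, d_ref=6.63e-18,
                     ah_new=55.4, d_new=2.60e-19)
    print("Predicted MTBF at 45C/85%RH: about %.0f years" % mtbf)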

The data obtained in the tests were used to estimate the diffusion rate and the temperature
dependence of this diffusion rate, as shown in Table 2.5-13.

Table 2.5-13: Data to Estimate Diffusion Rate


SERVICE CONDITION                   TEMP (°C)   RH (%)   ABS HUM (grams H2O per m³)   DIFFUSION CONSTANT (cm²/sec)   RATIO   MTBF (Years)
High Temp/High Humidity Chamber        85         85            297.1                      6.63x10-18                    1       0.579
Med Temp/High Humidity Chamber         45         85             55.4                      2.60x10-19                  137      79
High Temp/Medium Humidity Chamber      85         16             56.0                      6.63x10-18                    5       3
Underground                            25         85             19.6                      3.73x10-20                 2691    1559
Footway Box                            15         93             11.9                      1.73x10-20                 9552    5535

The predictions from the model were then obtained. The predicted and observed
lifetimes are shown in Table 2.5-14.


Table 2.5-14: Predicted Lifetimes vs. Observed


ENVIRONMENTAL CONDITION             TEMP (°C)   RH (%)   MTBF in Hours (Predicted)   MTBF in Hours (Measured)   % Difference
High Temp/High Humidity Chamber        85         85            5072                       5072                      0
High Temp/Medium Humidity Chamber      85         16          26,909                     27,500                      2

As can be seen above, the model is extremely accurate in predicting the failure
mechanism behavior.

Models developed from first principles, like the one shown in this example, can be very
accurate and, thus, beneficial to a reliability program. However, several pieces of
information were required in order to make this approach a viable alternative:

• Detailed component information, including:
  o Index of refraction of the core and cladding of the fiber used in the component
  o Fiber dimensions, and model constants in the above equations
• The ability to generate a closed-form equation that describes:
  o Water diffusion rates into silica
  o Optical coupling ratio as a function of component design parameters

While it would be desirable to model the reliability of every conceivable failure
mechanism in this manner, the practical constraints facing most reliability practitioners make this
difficult to apply to complex systems. The primary reason for this is that information like
that summarized above is not practical to obtain in many cases. Additionally, with
complex systems, there can be thousands of possible failure causes which would need to
be modeled in order to obtain a system reliability estimate.

The primary difference between this approach and the DOE-based life modeling
approach previously described is the manner in which the model form is determined. In
the DOE approach, the model forms are assumed and are based on standard forms like
the power law or the Arrhenius law. In the physics approach, the model forms are
determined from first principles of physics. In both cases, however, certain model
parameters are generally estimated from empirical data.


2.6. Combine Data


Once the data for each item has been analyzed and reliability estimates have been made
using any of the methods described previously, the information needs to be combined to
form the best estimate of product or system reliability. The methodology of 217Plus was
developed for this specific purpose and can be used as a framework from which to
perform this combination.

Figure 2.6-1 summarizes the 217Plus methodology for estimating the failure rate of a
product or system. In this example, only constant failure rates are addressed. If specific
items are described by non-constant failure rates, the mathematics become more difficult,
but the basic approach remains the same.

Figure 2.6-1: 217Plus Approach to Failure Rate Estimation

The specific approach that can be used depends on several factors, including:

• Whether information exists on a predecessor product or system


• The amount of empirical reliability data available on that product or system


• Whether the analyst chooses to evaluate and assess the processes used in the
development of the product or system

The types of data that may be available can be any of the types summarized previously in
this section of the book.

If the product or system under analysis is an evolution of a predecessor item, the field
experience of the predecessor product can be leveraged and modified to account for the
differences between the new product and the predecessor product. A predecessor is
defined as a product or system that is based on similar technology and uses
design/manufacturing processes similar to the new item under development for which a
reliability prediction is desired. In this case, the new product or system is an evolution of
its predecessor. In this analysis, a prediction is performed on both the predecessor item
and the new item under development. These two predictions form the basis of a ratio that
is used to modify the observed failure rate of the predecessor, and account for the degree
of similarity between the new and predecessor products or systems. The result of the
predecessor analysis is expressed as λ1, as presented in Figure 2.6-1.

If enough empirical data (field, test, or both) is available on the new product or system
under development, it can be combined with the reliability prediction on the new item to
form the best failure rate estimate possible. A Bayesian approach is used for this
combination, which merges the reliability prediction with the available data. As the
quantity of empirical data increases, the failure rate using the Bayesian combination will
be increasingly dominated by the empirical data. The result of the Bayesian combination
is defined as λ2, as presented in Figure 2.6-1.

The minimum amount of analysis required to obtain a predicted failure rate for a product
or system is the summation of the component estimated failure rates. The component
failure rates are determined from the component models, along with other data that may
be available to the analyst. The result of this component-based prediction is λIA,new. This
value can be further modified by incorporating the optional data, resulting in λpredicted,new,
as shown in Figure 2.6-1. All methods of analysis require that a prediction be performed
on the new product or system under development in accordance with the component
prediction methodology. Predictions based solely on the component analysis should
be used only when there is no field or test reliability history for the new item and no
suitable predecessor item with a field reliability history. In this case, the reliability
model is purely predictive in nature. After a product or system has been fielded, and
there has been a significant amount of operating time, the best data on which to base a
failure rate estimate is field observed data, or a combination of prediction and observed

failure data. In this case, the reliability model yields an estimate of reliability, because
the reliability is estimated from empirical data.

Each element of the 217Plus methodology is further described in the following sections.

λIA,predecessor
λIA,predecessor is the initial reliability assessment of the predecessor product or system. It is
the sum of the predicted component failure rates, and uses any of the methods described
in this book.

λobserved, predecessor
λobserved, predecessor is the observed failure rate of the predecessor product or system. It is
the point estimate of the failure rate, which is equal to the number of observed failures
divided by the cumulative number of operating hours5.

5. Note that "operating hours" can be replaced by any other life unit, such as calendar hours, miles, cycles, etc. The 217Plus methodology predicts failure rates in terms of calendar hours. The important point is that all life units used in the assessment must be consistent.

Optional data
Optional data is used to enhance the predicted failure rate by adding more detailed data
pertaining to environmental stresses, operating profile factors, and process grades (the
concept of process grades is explained in detail in Chapter 7). The 217Plus models
contain default values for the environmental stresses and operational profile but, in the
event that actual values of these parameters are known, either through analysis or
measurement, they should be used. The application of the process grades is also
optional, in that the user has the option of evaluating any or all of the specific processes used in the
design, development, manufacturing and sustainment of a product or system; if process
grades are not used in a 217Plus analysis, default values are applied for each process
(failure cause).

λpredicted, predecessor
λpredicted, predecessor is the predicted failure rate of the predecessor product or system after
combining the initial assessment with any optional data, if appropriate.

λIA,new
λIA,new is the initial reliability assessment of the new product or system. This is the sum
of the predicted component failure rates, and uses the 217Plus component failure rate
models or other methods (such as data from NPRD or other data sources). A reliability


prediction performed in accordance with this method is the minimum level of analysis
that will result in a predicted reliability value. Applying any optional data can further
enhance this value.

λpredicted, new
λpredicted, new is the predicted failure rate of the new system after combining the initial
reliability assessment with any optional data, if used. If optional data is not used, then
λpredicted, new is equal to λIA,new.

λ1
λ1 is the failure rate estimate of the new system after the predicted failure rate of the new
system is combined with the information on the predecessor product (predicted and
observed data). The equation that translates the failure rate from the old product or
system to the new one is:

$$\lambda_1 = \lambda_{predicted,new} \times \frac{\lambda_{observed,predecessor}}{\lambda_{predicted,predecessor}}$$

The values for λpredicted,new and λpredicted,predecessor are obtained using the component
reliability prediction procedures. The ratio of λobserved,predecessor /λpredicted,predecessor inherently
accounts for the differences in the predicted and observed failure rates of the predecessor
system, i.e., it inherently accounts for the differences in the products or systems analyzed
in the component reliability prediction methodology.

This methodology can be used when the new product or system is an evolutionary
extension of predecessor designs. If similar processes are used to design and
manufacture a new item, and the same reliability prediction processes and data are used,
then there is every reason to believe that the observed/predicted ratio of the new system
will be similar to that observed on the predecessor system. This methodology implicitly
assumes that there is enough operating time and failures on which to base a value of
λobserved,predecessor. For this purpose, the observance of failures is critical to derive a point
estimate of the failure rate (i.e., failures divided by hours). A single-sided confidence
level estimate of the failure rate should not be used.

ai
ai is the number of failures for the ith set of data on the new product or system.


bi
bi is the cumulative number of operating hours for the ith set of data on the new product or
system.

AFi
AFi is the acceleration factor (AF) between the conditions of the test or field data on the
new product or system and the conditions under which the predicted failure rate is
desired. If the data is from a field application in the same environment for which the
prediction is being performed, then the AF value will be 1.0. If the data is from
accelerated test data or from field data in a different environment, then the AF value
needs to be determined. If the applied stresses are higher than the anticipated field use
environment of the new system, AF will have a value greater than 1.0. The AF can be
determined by performing a reliability prediction at both the test and use conditions. The
AF can only be determined in this manner, however, if the reliability prediction model is
capable of discerning the effects of the accelerating stress(es) of the test. As an example,
consider a life test in which the product was exposed to a temperature higher than what it
would be exposed to in field-deployed conditions. In this case, the AF can be calculated
as follows:

$$AF = \frac{\lambda_{T1}}{\lambda_{T2}}$$
where:
λT1 = the predicted failure rate at the test conditions obtained by performing a
reliability prediction of the system at temperature 1
λT2 = the predicted failure rate at the use conditions obtained by performing a
prediction at temperature 2

bi'
bi’ is the effective cumulative number of hours of the test or field data used. If the tests
were performed at accelerated conditions, the equivalent number of hours needs to be
converted to the conditions of interest, as follows:

$$b_i' = b_i \times AF_i$$

ao
ao is the effective number of failures associated with the predicted failure rate. If this
value is unknown, then use a default value of 0.5. In the event that predicted and

observed data is available on enough predecessor products or systems, this value can be
tailored. See the next section for the appropriate tailoring methodology.

λ2
λ2 is the best estimate of the new system failure rate after using all available data and
information. As much empirical data as possible should be used in the reliability
assessment. This is done by mathematically combining λ1 with empirical data. Bayesian
techniques are used for this purpose. The technique accounts for the quantity of data by
weighting large amounts of data more heavily than small quantities. λ1 forms the “prior”
distribution, comprised of a0 and ao/λ1. If empirical data (i.e., test or field data) is
available on the system under analysis, it is combined with λ1 using the following
equation:

$$\lambda_2 = \frac{a_0 + \sum_{i=1}^{n} a_i}{\frac{a_0}{\lambda_1} + \sum_{i=1}^{n} b_i'}$$

where λ2 is the best estimate of the failure rate, and ao is the “equivalent” number of
failures of the prior distribution corresponding to the reliability prediction. For these
calculations, 0.5 should be used unless a tailored value can be derived. An example of
this tailoring is provided in the next section.

ao/λ1 is the equivalent number of hours associated with λ1.

a1 through an are the number of failures experienced in each source of empirical data.

There may be “n” different sources of data available (for example, each of the n sources
corresponds to individual tests or field data from the total population of products or
systems).

b1’ through bn’ are the equivalent number of cumulative operating hours experienced for
each individual data source. These values must be converted to equivalent hours by
accounting for any accelerating effects between the use conditions.
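Putting the pieces together, the combination is straightforward to compute. The sketch below (Python, not from the source) uses hypothetical values for the predicted failure rate and two data sources, one of them from an accelerated test:

    def combine_estimates(lambda_1, a0, failures, equiv_hours):
        # lambda_2 = (a0 + sum(a_i)) / (a0/lambda_1 + sum(b_i'))
        return (a0 + sum(failures)) / (a0 / lambda_1 + sum(equiv_hours))

    lambda_1 = 1.0e-5                    # predicted failure rate, failures/hour
    a_i = [2, 1]                         # failures observed in each data source
    b_i_prime = [5000 * 20, 80000 * 1]   # b_i' = b_i x AF_i (equivalent hours)
    lambda_2 = combine_estimates(lambda_1, a0=0.5, failures=a_i, equiv_hours=b_i_prime)
    print("lambda_2 = %.2e failures/hour" % lambda_2)

As the empirical hours grow, the a0/λ1 term becomes negligible and λ2 approaches the purely empirical point estimate, reflecting the weighting behavior described above.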


Tailoring the Bayesian Constant, ao


This section discusses tailoring of the ao value used in the Bayesian equations. The value
of ao is proportional to the degree of weighting given to the predicted value (λ1). The
value of the constant, a0, is chosen such that the uncertainty in the failure rate estimate, as
calculated with the chi-square distribution, equates to the observed uncertainty. The
default value of 0.5 to be used in the equation is based on the observed/predicted ratio
derived from a wide variety of systems, applications, industries, etc. As such, there are
many "noise factors" contributing to the variability in this ratio. However, if the user of
the 217Plus model has enough data on which to derive a tailored value of a0, it should be
derived and used. While the default value of 0.5 represents the large degree of
uncertainty inherent when a diverse data set is used, a specific 217Plus user will
generally be analyzing products with a much narrower focus, in terms of product
type, environment, operating profile, etc. As such, with enough data, the value of a0 can
be increased. As an example of calculating a value for a specific application, consider
an example for a product used in a telecommunications system.

To estimate the value of ao that should be used, a distribution of the following metric is
calculated for all products for which both predicted and observed data is available:

$$\frac{\lambda_{observed,predecessor}}{\lambda_{predicted,predecessor}}$$

The lognormal distribution will generally fit this metric well, but others (for example,
Weibull) can also be used. The cumulative value of this distribution is then plotted.
Next, failure rate multipliers (as calculated by a chi square distribution) are calculated
and plotted. This chi-square distribution should be calculated and plotted for various
numbers of failures, to ensure that the distribution of observed/predicted failure rate
ratios falls between the chi-square values. In most cases, one, two and three failures
should be sufficient. Next, the plots are compared to determine which chi-square
distribution most closely matches the observed uncertainty values. The number of
failures associated with that distribution then becomes the value of a0. Figure 2.6-2
illustrates an example for which this analysis was performed.


Figure 2.6-2: Comparison of Observed Uncertainty with the Uncertainty Calculated With
the Chi-square Distribution

As can be seen from Figure 2.6-2, the observed uncertainty does not precisely match the
Chi-square calculated uncertainty for any of the one, two or three failures used in this
analysis. This is likely due to the fact that the population of products on which this
analysis is based is not homogeneous, as assumed by the chi-square calculation.
However, the confidence levels of interest are generally in the range 60 to 90 percent. In
this range, the chi-square calculated uncertainty with 2 failures most closely
approximates the observed uncertainty. Therefore, in this example, an a0 value of 2 was
used. This value is also consistent with the Telcordia GR-332 reliability prediction
methodology (Reference 6).
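The chi-square multiplier curves themselves are easy to generate. The sketch below (Python, assuming scipy is available) uses the failure-truncated form χ²(CL, 2r)/(2r) as one plausible reading of the procedure; the exact form used in the original analysis is not stated here, so treat this as an assumption:

    from scipy.stats import chi2

    def multipliers(r, levels=(0.60, 0.70, 0.80, 0.90)):
        # Failure-rate multiplier vs. confidence level for r failures
        # (assumed form: chi-square percentile with 2r degrees of freedom over 2r)
        return {cl: chi2.ppf(cl, 2 * r) / (2 * r) for cl in levels}

    for r in (1, 2, 3):
        print(r, "failure(s):", {cl: round(m, 2) for cl, m in multipliers(r).items()})

The value of r whose curve best tracks the observed cumulative distribution of observed/predicted ratios in the 60 to 90 percent range becomes the tailored a0.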

The uncertainties represented by the distribution of observed/predicted failure rates are
typical of what can be expected when historical data on predecessor products or systems
are collected and analyzed to improve the reliability prediction process. Using this
example, one can be 80% certain that the actual failure rate for a product or system will
be less than 2.2 times the predicted value.

2.6.1. Bayesian Inference


Figure 2.6-3 depicts the outline of the Bayesian inference approach. The available
information about the model parameter vector, θ, in the form of the prior distribution, f0(θ),
is transformed to a new state of knowledge, represented by the posterior distribution, f(θ).
The likelihood function represents data in the Bayesian framework, and determines how
much the data may influence the prior knowledge.

[Figure content: failure data and a model for the failure data form the likelihood, L(Failure Data | θ); Bayesian inference combines the prior, f0(θ), with the likelihood to yield the posterior, f(θ).]

Figure 2.6-3: Bayesian Inference Outline

The mathematical description of the Bayesian transformation is defined by the equation
below. The normalization factor appearing in the denominator is inevitable when dealing
with conditional probability calculations.

$$f(\theta) = f(\theta \mid DATA) = \frac{f_0(\theta) \times L(DATA \mid \theta)}{\int_{\theta} f_0(\theta) \times L(DATA \mid \theta)\, d\theta}$$

where,
θ = the vector of model parameters, (θ1, θ2, …, θn)
f(θ) = the posterior joint distribution of parameters
f0(θ) = the prior joint distribution of parameters
L(data|θ) = the likelihood of data given the model parameters

In practice, the features of this distribution include the updated marginal and conditional
distribution of each parameter given the provided information. The marginal distribution
of a single parameter is defined by the next equation. The marginal distribution is
estimated by integrating the posterior joint distribution, f(θ), over the range of other
parameters, as shown. The other important outcome of the posterior joint distribution is

the conditional distribution of each parameter, when other elements of vector θ are given.
The conditional distribution is constructed by substituting the known parameters in the
joint distribution, f(θ). Here again, the function needs to be scaled by a normalization
factor, as demonstrated in the equation below, in order to be consistent with the basic
characteristics of the distribution functions.

$$f_j(\theta_j) = \int_{\theta_{-j}} f(\theta_1, \theta_2, \ldots, \theta_j, \ldots, \theta_n)\, d\theta_{-j}$$

$$g_j\left(\theta_j \mid \hat{\theta}_{-j}\right) = \frac{f\left(\theta_j, \hat{\theta}_{-j}\right)}{\int_{\theta_j} f\left(\theta_j, \hat{\theta}_{-j}\right) d\theta_j}$$

where:

$\theta_{-j} = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_n)$
$\hat{\theta}_i$ = the given value for $\theta_i$

The integrals necessary for Bayesian computation usually require analytic or numerical
approximations. While the computations for non-constant failure rate distributions can
get quite involved, they are relatively straightforward for the exponential distribution.
The method explained in the previous section details this situation.
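For intuition, a simple grid approximation illustrates the update for a constant failure rate. The sketch below (Python with numpy, not part of the source) uses a gamma-shaped prior consistent with a0 = 0.5 and hypothetical data:

    import numpy as np

    lam = np.linspace(1e-7, 1e-4, 2000)            # grid of candidate failure rates
    prior = lam ** (0.5 - 1) * np.exp(-lam * 5e4)  # gamma-shaped prior (a0 = 0.5)
    failures, hours = 3, 2.3e5                     # hypothetical empirical data
    likelihood = lam ** failures * np.exp(-lam * hours)
    posterior = prior * likelihood
    posterior /= np.trapz(posterior, lam)          # the normalization denominator
    print("posterior mean:", np.trapz(lam * posterior, lam))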

2.7. Develop System Model


There are several options that the analyst has for merging the reliability models of all of
the failure causes, components, etc. In decreasing order of rigor, they are:

1. Perform a Monte Carlo analysis, where the TTF distribution of each element is
preserved, and the operating time and number of failures is modeled
2. If all of the failure causes are independent, the options are:
a. Calculate the reliability of each cause at a specific time of interest, and
then calculate the reliability as:

$$R(t) = \prod_{i=1}^{n} R_i(t)$$

where there are n items, and Ri is the reliability of each item


b. Convert the reliability estimate of each element to a constant failure rate,
and calculate the total failure rate as:

$$\lambda = \sum_{i=1}^{n} \lambda_i$$

if the following conditions are satisfied:

1. The analysis is performed only to the component level, without modeling the
specific failure causes
2. A constant failure rate distribution is used
3. All components are required for the product or system to meet its requirements
(i.e., failure probability values are independent)

Then, the product reliability is simply the product of the reliabilities of the individual
components, or likewise the failure rate is the sum of the failure rates of the individual
constituent components. This has been the traditional approach when using the
“handbook” types of methodologies.

If all of the above listed conditions are not present, then more sophisticated techniques
are required. For example, consider the situation in which Condition 1 and 2 are not
satisfied, but Condition 3 is. In this example, let’s say that there are seven failure causes
for which the life modeling has resulted in an estimate of the TTF distribution under field
use conditions. These distributions can be any arbitrary shape, dependent entirely on the
characteristics of the specific failure causes. This situation is depicted in Figure 2.7-1,
where the reliability block diagram is shown as a series configuration. Each failure cause
is represented by Events 1 through 7, each of which has its own probability density
function.


[Figure content: seven probability density functions, one for each of Events 1 through 7 in the series reliability block diagram; the time to first failure across the seven events defines the system time to failure (TTF) distribution, shown as a combined pdf.]

Figure 2.7-1: Combining Seven Failure Cause Distributions

For repairable systems, in which repairs are made as failures occur, the system reliability
would be simulated over a given time period, such as the mission duration or the
warranty period. In this case, failure times are simulated from time = 0 to the specified
time period. In this simulation, multiple systems are simulated, for which the failure
times of the constituent components are also simulated. As failures occur, new
replacement components are installed which have a new component time zero (the
system operating time will not be zero, but will be the cumulative operating time). This
continues until the duration is exceeded for each of the simulated systems. The resulting
failure times for the system can then be analyzed, and the distribution parameters defined.
The resultant distribution will generally not be a mono-modal distribution; rather, it will
be a distribution of an arbitrary shape that is usually represented by a multi-modal
distribution.
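A stripped-down version of this simulation is sketched below (Python, not the source's tool): a series system whose components have Weibull TTF distributions, with each component renewed at the instant it fails (repair times are ignored for brevity), simulated over a fixed duration; the component parameters are hypothetical:

    import random

    def expected_failures(components, duration, n_runs=10000):
        # components: list of (alpha, beta) Weibull scale/shape pairs, in series
        total = 0
        for _ in range(n_runs):
            next_fail = [random.weibullvariate(a, b) for a, b in components]
            while min(next_fail) <= duration:
                i = next_fail.index(min(next_fail))  # first component to fail
                total += 1                           # system failure; replace component
                next_fail[i] += random.weibullvariate(*components[i])
        return total / n_runs

    print(expected_failures([(5000.0, 1.5), (8000.0, 0.9)], duration=10000.0))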

It is noteworthy that the above model is valid for any situation in which all items are
critical, i.e., the failure of any one item results in product or system failure. For example,
the Fault Tree for this situation may look like Figure 2.7-2. In this case, all gates are OR
gates, which means that all failures of items represented by Events 1 through 7 constitute
critical failures. This is shown to illustrate the fact that the analysis does not necessarily

need to be performed at the same level of hierarchy. The most important thing is that all
of the critical failure causes are accounted for.

[Figure content: fault tree with a TOP event fed by OR gates combining Events 1 through 7.]

Figure 2.7-2: Possible Fault Tree Representation of a Series Reliability Block Diagram

For non-repairable systems, in which the first failure causes system failure and all items
represented by each failure cause are required for the product or system to function, this
becomes a competing risk situation in which the first failure cause to occur defines
the item's TTF distribution. The Type 1 extreme value distribution, also known as the
Gumbel distribution, is sometimes used to model this situation when components have
the same reliability distribution. This competing risk situation, modeled with times to
first failure (TTFF), will not yield the same results as taking either the product of the
reliability values or the sum of the failure rates (in the case of constant failure rates)
because, in the latter cases, there is a probability that multiple failures will occur in the
time period analyzed, which is not the case for the competing risk situation.

For all but the simplest of situations, closed-form solutions cannot be obtained. These
situations require numerical simulation, such as Monte Carlo analysis, which is
described in the next section.


2.7.1. Monte Carlo Analysis


Monte Carlo analysis is a powerful analytical technique that allows for the estimation of
parameters or factors in cases where closed-form statistical derivations are not possible.
This occurs in many reliability engineering analyses, making it an invaluable tool.

Monte Carlo analysis can be used for several purposes:

1. To determine the time to first failure, as in the previous example


2. To determine the probability of failure from a stress/strength interference model

For #2, there are handbooks available which provide estimates of interference probability
based on the individual stress and strength distributions. Alternatively, a statistical simulation
can be performed to estimate the degree of interference via numerical techniques. The
latter is generally a more efficient and effective approach, given the software
tools that are readily available.

The basic principle behind Monte Carlo analysis, as applied to stress/strength interference
analysis, is as follows:

1. First, the stress and strength distributions are determined


2. A randomly selected value from each distribution is obtained
3. The randomly selected values are compared, and if the selection from the strength
distribution is less than the selection from the stress distribution, a failure is
considered to have occurred. If it is not, then success is considered to have
occurred.
4. This process is repeated many times, and the number of trials and the number of
failures are counted. The number of trials needs to be large enough to result in a
good estimate of the failure probability. The failure probability is equal to the
total number of failures divided by the total number of trials.

$$F = \frac{N_{strength<stress}}{N}$$

where:

F = the failure probability
Nstrength<stress = the number of trials in which the strength sample is less than the stress sample
N = the total number of trials
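These four steps translate directly into a short simulation. The sketch below (Python, not from the source) assumes normally distributed stress and strength with hypothetical parameters:

    import random

    def mc_interference(mu_strength, sd_strength, mu_stress, sd_stress, trials=1000000):
        failures = 0
        for _ in range(trials):
            strength = random.gauss(mu_strength, sd_strength)  # step 2
            stress = random.gauss(mu_stress, sd_stress)
            if strength < stress:                              # step 3
                failures += 1
        return failures / trials                               # step 4

    print(mc_interference(50.0, 5.0, 35.0, 4.0))  # compare with the closed-form Z result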

More detail regarding the process is described below.



The first step is to randomly select a value from each of the stress and strength
distributions. As an example, consider a normally distributed strength with a mean of 10
and standard deviation of 3, the pdf of which is shown in Figure 2.7-3.


Figure 2.7-3: pdf of Normal Distribution with Mean of 10 and Standard Deviation of 3

Next, the cumulative function of this distribution is calculated, as shown in Figure 2.7-4.


Figure 2.7-4: Cumulative Normal Distribution with Mean of 10 and Standard Deviation of 3


Next, the randomly selected value from the distribution is obtained by:

• Selecting a random number between 0 and 1. This number is located on the y-axis.
• Then, the value on the x-axis corresponding to this y-value is determined, as shown
in Figure 2.7-5.

Figure 2.7-5: Value Selection From a Distribution

Distributions typically used in stress/strength analysis include the Normal distribution
and the Weibull distribution. The Normal cumulative distribution does not have a closed-
form solution, and requires the solution of an integral for its computation. However,
software programs have simplified this calculation. For example, the MS EXCEL
function for this calculation is:

NORMINV(RAND(), mean, standard deviation)

where:

RAND() returns a random number between 0 and 1
The mean and standard deviation are the values of the sampled distribution


The Weibull distribution is simpler to use than the Normal distribution, since an integral
of the pdf is not required to derive the CDF. The closed-form pdf of the Weibull
distribution is:

$$f(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta-1} e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$

The reliability function (1 minus the cumulative function (CDF)) is:

$$R(t) = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$

The Weibull distribution is one of the most widely used distributions in reliability
engineering due to its versatility. It also has the advantage of having a closed-form
solution for its cumulative function.

To select a random value from this distribution, a random number between 0 and 1 is
selected, this value is substituted for R(t) and the corresponding TTF is determined from
the equation. In this example, time(t) is shown as the independent variable, but the
specific parameter could be any parameter whose distribution is used in a Monte Carlo
analysis. The inverse cumulative function is shown in Figure 2.7-6, along with the
selection of the random value.

Figure 2.7-6: Value Selection From a Weibull Distribution
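The inverse-transform step for the Weibull distribution is sketched below (Python, not from the source); solving R(t) = exp[-(t/α)^β] for t, with R set to a uniform random draw, gives t = α(-ln R)^(1/β):

    import math
    import random

    def weibull_ttf(alpha, beta):
        r = 1.0 - random.random()  # uniform draw in (0, 1], avoiding log(0)
        return alpha * (-math.log(r)) ** (1.0 / beta)

    # Hypothetical parameters: alpha = 1000 hours, beta = 2
    print([round(weibull_ttf(1000.0, 2.0), 1) for _ in range(5)])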


Now, let’s consider another application of Monte Carlo simulation. In this example, a
simple relationship between items for a repairable system is shown in Figure 2.7-7. Here,
the items can be failure causes, components or assemblies, in accordance with the level to
which the analysis is performed.

Figure 2.7-7: Reliability Block Diagram of Redundant Example

For this example:

• A and (B or C) need to be operational for the system to function
• TTFi and TTRi are the times to failure (TTF) and times to repair (TTR), taken
from the governing distributions of each

The behavior for each item, along with the resultant system behavior, is shown in Figure
2.7-8.

Figure 2.7-8: System Monte Carlo Example

For example, item A operates until it fails at TTFA1. At that point in time, it takes TTRA1
to repair it. Items B and C fail and get repaired at rates determined by the simulated
times for each, governed by the specific distribution of each. The resultant system
availability (Asystem) is shown on the bottom.

A simulation was performed on this hypothetical system using a software tool, the results
of which are shown in Figure 2.7-9. In this case, the following metrics were calculated
from the Monte Carlo analysis:

• Ao: Availability (% of total time the system is available)


• MTBDE: Mean Time Between Downing Events
• MDT: Mean Down Time
• MTBM: Mean Time Between Maintenance
• MRT: Mean Repair Time
• % green time: Percent of time that all units are operational
• % yellow time: Percent of time that at least one unit is not operational, but
the system still operates
• % red time: Percent of time that at least one critical item is not
operational
• Number of failures: The number of simulated failures per run

Figure 2.7-9: Monte Carlo Simulation of Example System


Simulations of product reliability, as described above, are generally the best way to
combine life estimates of constituent parts in a system. If a system is comprised of
redundant elements, closed-form equations are available that calculate the effective
failure rate of the redundant elements. However, care must be taken when using these
equations. For example, the manner in which they are generally derived is to calculate
the failure characteristics as time approaches infinity. Only in this manner are closed-
form solutions possible. The results are “effective” failure rate estimates that often
underestimate the benefits of redundancy. This is especially true when mission times are
relatively short. As a result, calculating reliability based on the failure probability
examples described above is generally a more sound approach. Additionally, the
availability of software tools has made it much easier to perform these calculations.

2.8. References
1. "Production Part Approval Process (PPAP)", Third Edition, Daimler-Chrysler,
Ford, General Motors, 1999
2. Modarres, M., “Accelerated Testing”, ENRI 641, Univ. of Maryland, May 2005
3. Weibull++, Reliasoft Corp.
4. Colm V. Cryan, James R. Curley, Frederick J. Gillham, David R. Maack, Bruce
Porter, and David W. Stowe, “Long Term Splitting Ratio Drifts in Singlemode
Fused Fiber Optic Splitters”, NFOEC 95
5. David R. Maack, David W. Stowe and Frederick J. Gillham, “Confirmation of a
Water Diffusion Model For Splitter Coupling Ratio Drift Using Long Term
Reliability Data”, NFOEC 96
6. Telcordia GR-332, “Reliability Prediction Methodology”
7. Denson, W.K. and S. Keene, “A New System Reliability Assessment
Methodology – Final Report”, Available from the Reliability Information
Analysis Center, 1998


3. Fundamental Concepts
The intent of this book is not to cover the basics of probability or reliability theory. The
understanding of some of these fundamental concepts, however, is critical to the
interpretation of reliability estimates. The definition of reliability is a probability, the
value of which is estimated by the techniques covered in this book. Therefore, the basics
of reliability terminology, and the basis for various theoretical concepts are covered in
this section.

3.1. Reliability Theory Concepts


There are two basic types of variables: Discrete and Continuous. A discrete variable is
one that is limited to integer values (i.e., 0, 1, 2, 3,…). The probability distribution
describing this type of variable is called a discrete distribution. For example, the
distribution of the number of defects remaining in software programs after 6 months of
development would be a discrete distribution, since a partial defect cannot exist. Figure
3.1-1 illustrates a discrete probability distribution.

[Figure content: bar chart of probability p(xi) versus the number of remaining defects (x), with bars at x1 through x9 peaking at x5.]

Figure 3.1-1: Discrete Probability Distribution

The probability that a random variable “x” takes on a specific value “xi” is expressed as:

P{x = xi } = p(xi )


A continuous variable is one that is measured on a continuous scale, and its probability
distribution is defined as a continuous distribution. For example, the distribution of the
TTF would be a continuous distribution, since an infinite number of positive time values
can be represented in the distribution. Figure 3.1-2 illustrates a continuous distribution.

Figure 3.1-2: Continuous Probability Distribution

The probability that a random variable "x" lies in the interval from "a" to "b" is
expressed as:

$$P\{a \le x \le b\} = \int_a^b f(x)\, dx$$

A probability distribution is characterized by a probability density function (pdf), f(t).


The pdf is essentially a histogram of the random variable, often the TTF. For a discrete
random variable, the pdf at a given value of the random variable is the probability that the
realization of the random variable will take on that value. For a continuous random
variable, the area under the pdf for a given interval is the probability that a realization of
the random variable will fall within that interval (Figure 3.1-2). The probability density
functions are non-negative for all values and the sum of the probabilities over all values
for discrete random variables, or the total area under the pdf for continuous random
variables, always equals 1.0.

The cumulative distribution function F(t) is defined as the probability in a random trial
that the random variable is not greater than t:

$$F(t) = \int_{-\infty}^{t} f(t)\, dt$$

If the random variable is discrete, the integral is replaced by a summation.

The Cumulative Distribution Function (CDF) is the probability that the value of a
corresponding random variable will not be exceeded. Cumulative distribution functions
are non-negative and non-decreasing. Given a random variable that cannot be negative,
the value of the CDF at the origin is zero. The upper limit of a CDF is always 1.0, as
illustrated in Figure 3.1-3. The CDF is the integral of the pdf, and is illustrated in Figure
3.1-3 for discrete and continuous distributions, respectively.

Figure 3.1-3: The Cumulative Distribution Function (CDF)

The reliability function, R(t), is the probability of a device surviving (not failing) prior to
time “t”, and is given by:



$$R(t) = 1 - F(t) = \int_t^{\infty} f(t)\, dt$$
Note that for the reliability, the integral of the pdf is from “t” to infinity for the
probability of success, as opposed to minus infinity to “t” as in the case of the failure
probability. The sum of the probability of success and the probability of failure needs to
be 1.0, consistent with the definition of a pdf.

By differentiating the above equation:

$$\frac{-dR(t)}{dt} = f(t)$$

The probability of failure in a given time interval between t1 and t2 can be expressed by
the reliability function:

$$\int_{t_1}^{\infty} f(t)\, dt - \int_{t_2}^{\infty} f(t)\, dt = R(t_1) - R(t_2)$$

The rate at which failures occur in the interval t1 to t2, the failure rate "λ(t)", is defined as
the ratio of the probability that a failure occurs within the interval, given that it has not
occurred prior to t1 (the start of the interval), divided by the total interval length. Thus:

$$\lambda(t) = \frac{R(t_1) - R(t_2)}{(t_2 - t_1)\, R(t_1)} = \frac{R(t) - R(t + \Delta t)}{(\Delta t)\, R(t)}$$

where t = t1 and t2 = t + Δt. The hazard rate, h(t), or instantaneous failure rate, is defined
as the limit of the failure rate as the interval length approaches zero, or:

$$h(t) = \lim_{\Delta t \to 0}\left[\frac{R(t) - R(t + \Delta t)}{(\Delta t)\, R(t)}\right] = \frac{1}{R(t)}\left[\frac{-dR(t)}{dt}\right]$$

Since it was already shown that:

$$\frac{-dR(t)}{dt} = f(t)$$


Then,

$$h(t) = \frac{f(t)}{R(t)}$$

In an attempt at providing an interpretation of the hazard rate function, consider the
following:

• The hazard rate, h(t), is the rate at which failures occur, provided the item has not
failed before the time at which h(t) is evaluated
• f(t) is the normalized percentage of the population failing in a given time interval
(Δt), such that the population size times the value of f(t) is equal to the number of
failures in the interval of time
• The denominator, R(t), is the probability of survival at t, which is equivalent to
the percentage of the population surviving at time t

Multiplying R(t) by the population size yields the total number of units surviving until
“t”. This is the binomial probability, or expected value of the number of survivors at “t”.
Since this population will have accrued an operating time of “RN*Δt”, the denominator is
equivalent to the cumulative operating time on the population in the time interval.
Therefore,

$$h(t) = \frac{f(t)}{R(t)} = \frac{f(t) \times N}{R(t) \times N} = \frac{\#\,\text{failures in } \Delta t}{\#\,\text{units surviving} \times \Delta t} = \frac{\text{failures}}{\text{item hours}} = \text{failure rate}$$

Integrating both sides of the h(t) equation:

$$h(t) = \frac{1}{R(t)}\left[\frac{-dR(t)}{dt}\right]$$

results in:

$$R(t) = e^{-\int_0^t h(t)\, dt}$$

This is the general expression for the reliability function. If h(t) can be considered a
constant failure rate (λ), which is often the case, the equation becomes:


$$R(t) = e^{-\lambda t}$$

The mean time to failure (MTTF) is the expected value of the time to failure, and is:


$$MTTF = \int_0^{\infty} R(t)\, dt$$
If the reliability function can be easily integrated, this is a convenient way to calculate the
mean time to failure (MTTF). If not, then numerical techniques can be used.

If all parts in a population are operated until failure, the mean life is:

$$\theta = \frac{\sum_{i=1}^{n} t_i}{n}$$
where:

ti = the time to failure of the ith item in the population


n= total number of items in the population

The mean time between failure (MTBF) is:

$$MTBF = \frac{T(t)}{r}$$

where:

T(t) = total operating time


r= number of failures

Failure rate and MTBF are applicable only to the situation in which the failure rate is
constant, i.e., the exponential TTF distribution. Per the definitions above, it can be seen
that the failure rate and MTBF are reciprocals of each other:

$$\lambda = \frac{1}{MTBF}$$


The failure rate is the number of failures divided by the cumulative operating time of the
entire population (failure/part hours), whereas the MTBF is the cumulative operating time
of the entire population divided by the number of failures (part hours per failure).

Table 3.1-1 provides an overview of the basic notation and mathematical representations
that are common among the various types of probability distributions.

Table 3.1-1: Probability Distribution Notation & Mathematical Representations


X — Random Variable

x — Realization of a Random Variable

Pr(X ∈ S) — Probability that the random variable X is in the set S

f(x) — Probability Density Function (PDF):
$$Pr(X \in S) = \begin{cases} \sum_{x \in S} f(x), & \text{Discrete Distribution} \\ \int_S f(x)\, dx, & \text{Continuous Distribution} \end{cases}$$

F(x) — Cumulative Distribution Function (CDF):
$$F(x) = \begin{cases} \sum_{w=0}^{x} f(w), & \text{Discrete Distribution} \\ \int_0^x f(w)\, dw, & \text{Continuous Distribution} \end{cases}$$

h(x) — Hazard Rate:
$$h(x) = \frac{f(x)}{1 - F(x)} = \frac{f(x)}{R(x)} = \frac{1}{R(x)} \frac{dF(x)}{dx}$$

R(x) — Reliability:
$$R(x) = 1 - F(x) = \int_x^{\infty} f(t)\, dt = e^{-\int_0^x h(t)\, dt}$$

E[u(X)] — Expected Value:
$$E[u(X)] = \begin{cases} \sum_{w=0}^{\infty} u(w) f(w), & \text{Discrete Distribution} \\ \int_0^{\infty} u(w) f(w)\, dw, & \text{Continuous Distribution} \end{cases}$$

μ — Mean: $\mu = E(X)$

σ — Standard Deviation: $\sigma = \sqrt{E\left[(X - \mu)^2\right]}$

Note: These definitions are based on the assumption that all realizations of a random variable
must be non-negative.


3.2. Probability Concepts


This section discusses some of the basic probability concepts that are important in
reliability modeling.

3.2.1. Covariance
Covariance is a measure of the extent to which one variable is related to another, and is
expressed as:

$$Cov(X, Y) = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

3.2.2. Correlation Coefficient


The correlation coefficient is defined as the standardized covariance:

$$r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

Examples of various correlation coefficients are shown in Figure 3.2-1.

Figure 3.2-1: Examples of Correlation Coefficients


3.2.3. Permutations and Combinations


A permutation is defined as the number of ways of ordering "n" items taken "x" at a time,
and is mathematically expressed as:

$${}_{n}P_{x} = \frac{n!}{(n - x)!}$$

A combination is defined as the number of distinct combinations of "n" items taken "x"
at a time, when ordering is not relevant, and is mathematically expressed as:

$${}_{n}C_{x} = \frac{n!}{x!(n - x)!}$$

As an example of permutations and combinations, define n=4 and x=2. The number of
combinations is:

$${}_{4}C_{2} = \frac{4!}{2!(4 - 2)!} = 6$$

Consider these combinations, as illustrated in Table 3.2-1. Here, there are 4 items (n=4),
each of which can have two possible values (blank or “x”).

Table 3.2-1: Combinations Example

n:    1   2   3   4
      x   x
      x       x
      x           x
          x   x
          x       x
              x   x


The corresponding number of permutations is:

$${}_{4}P_{2} = \frac{4!}{(4 - 2)!} = 12$$
Each set of 2 can be reversed, thus the number of permutations is double the number of
combinations for n=4.
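For reference, the Python standard library implements both formulas directly, and reproduces the n=4, x=2 example:

    import math

    print(math.comb(4, 2))  # 6 combinations: order is irrelevant
    print(math.perm(4, 2))  # 12 permutations: each pair is counted in both orders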

3.2.4. Mutual Exclusivity


Items are mutually exclusive when the occurrence of one event precludes the other. In
other words, if one event occurs, the other cannot. This is the only case in which
probabilities can be added. Mutual exclusivity is defined as:
P(a or b ) = P(a ) + P(b )
where:

P(a or b) = probability of either event a or event b occurring


P(a) = probability of event a occurring
P(b) = probability of event b occurring

Mutually exclusive sets are those with no common members, as shown in the Venn diagram
in Figure 3.2-2.

Figure 3.2-2: Venn Diagram of Mutually Exclusive Events

For mutually exclusive events, the intersection "A∩B" is the "empty" or "null" set.

3.2.5. Independent Events


Independent events are those for which the probability of one event has no effect on the
probability of the other; independence is expressed as follows:

P(a and b ) = P(a )P(b )


where:


P(a and b) = probability of both event a and event b occurring


P(a) = probability of event a occurring
P(b) = probability of event b occurring

The probability of either event a or b occurring is:


P(a or b ) = P(a ) + P(b ) − P(a )P(b )

This is illustrated in Figure 3.2-3.

Figure 3.2-3: Independent Events

3.2.6. Non-independent (Dependent) Events


Non-independent (or dependent) events indicate that the probability of one event is
dependent on the other, as shown:

$$P(a \text{ and } b) = P(a)\, P(b \mid a)$$

or

$$P(a \text{ and } b) = P(b)\, P(a \mid b)$$
where:

P(a and b) = probability of both event a and event b occurring


P(a) = probability of event a occurring
P(b) = probability of event b occurring
P(b|a) = probability of event b occurring, given that event a has occurred
P(a|b) = probability of event a occurring, given that event b has occurred

3.2.7. Non-independent (Dependent) Events: Bayes Theorem


For non-independent (dependent) events, one event may have several different outcomes,
each affecting the other event differently. This situation is mathematically described as:

$$P(a_1 \mid b) = \frac{P(b \mid a_1) \cdot P(a_1)}{\sum_i P(b \mid a_i) \cdot P(a_i)}$$

where:

P(b|a1) = probability of event b occurring, given that event a1 has occurred
Σ P(b|ai)*P(ai) = the total probability of event b occurring

The events ai are mutually exclusive; therefore, their probabilities can be added.

3.2.8. System Models


For independent failure causes, the reliability of a system is the product of the reliability
values for the constituent failure causes, as shown:
$$R = R_1 R_2 R_3 \cdots R_n$$

If the failure rate is constant, the probability of survival for a specific cause is:

$$R = e^{-\lambda t}$$

The system reliability is:

$$e^{-\lambda_{total} t} = e^{-\lambda_1 t}\, e^{-\lambda_2 t}\, e^{-\lambda_3 t} \cdots e^{-\lambda_n t}$$

Taking the natural log of both sides yields:

$$\lambda_{total} = \lambda_1 + \lambda_2 + \lambda_3 + \cdots + \lambda_n$$

The above equations are relevant to a series configuration of items, each with a constant
failure rate. The fault tree representation of this configuration is shown in Figure 3.2-4.
Here, the system reliability is represented by a logical OR gate, since the failure of A or
B or C will cause system failure.


[Figure content: an OR gate with inputs A, B, and C.]

Figure 3.2-4: Fault Tree OR Gate

The corresponding reliability block diagram representation for this scenario is shown in
Figure 3.2-5.

[Figure content: blocks A, B, and C in series.]

Figure 3.2-5: Reliability Block Diagram for an OR Gate

All possible outcomes for this example are shown in Table 3.2-2.

Table 3.2-2: Combinations of an OR Configuration


A      B      C      Output of OR Gate
Fail   Fail   Fail   Fail
Fail   Fail   Pass   Fail
Fail   Pass   Fail   Fail
Fail   Pass   Pass   Fail
Pass   Fail   Fail   Fail
Pass   Fail   Pass   Fail
Pass   Pass   Fail   Fail
Pass   Pass   Pass   Pass


Note that the eight possible outcomes in the table are mutually exclusive, in that there is
only one possible way in which each of the eight can occur.

As an example, if events A, B and C have the following reliability values:

RA = 0.95
RB = 0.92
RC = 0.99

The reliability of the series configuration (i.e., the probability of exactly zero failures) of
the three items is:
R = RA RB RC

R = 0.95 × 0.92 × 0.99 ≈ 0.87

Now, suppose that several items must fail in order for the system to fail. This scenario is
represented by an AND gate in a fault tree representation, as is shown in Figure 3.2-6.


Figure 3.2-6: Fault Tree AND Gate

The corresponding Reliability Block Diagram (RBD) representation is shown in Figure
3.2-7. Note the parallel nature of this configuration.



Figure 3.2-7: Reliability Block Diagram for an AND Gate

All possible outcomes for this example are shown in Table 3.2-3.

Table 3.2-3: Combinations of an AND Configuration


A B C Output of AND Gate

Fail Fail Fail Fail


Fail Fail Pass Pass
Fail Pass Fail Pass
Fail Pass Pass Pass
Pass Fail Fail Pass
Pass Fail Pass Pass
Pass Pass Fail Pass
Pass Pass Pass Pass

The reliability of this parallel configuration of three items is:

RA = 0.95
RB = 0.92
RC = 0.99

R = 1 − (1 − RA)(1 − RB)(1 − RC)


R = 1 − (1 − 0.95)(1 − 0.92)(1 − 0.99) = 0.99996

As an example of a slightly more complex situation, consider the fault tree representation
of a system in Figure 3.2-8.


Figure 3.2-8: Fault Tree of an AND/OR Combination

The RBD is shown in Figure 3.2-9.


Figure 3.2-9: RBD of AND/OR combination



Combining the series and parallel events yields the following reliability expression for
this configuration:

R = \left[1 - (1 - R_1)(1 - R_2)\right] R_3 \left[1 - (1 - R_4)(1 - R_5)\right] R_6 R_7
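The reliability algebra of these series, parallel, and combined configurations is easy to script. The following Python sketch is a minimal illustration; the event reliability values assigned to R1 through R7 are assumptions for demonstration, not values from the text:

```python
# Minimal sketch of the series/parallel reliability algebra described above.
# The individual event reliabilities are illustrative assumptions.

def series(*rs):
    """Series (OR-gate) configuration: all items must survive."""
    r = 1.0
    for x in rs:
        r *= x
    return r

def parallel(*rs):
    """Parallel (AND-gate) configuration: at least one item must survive."""
    q = 1.0
    for x in rs:
        q *= (1.0 - x)
    return 1.0 - q

print(series(0.95, 0.92, 0.99))    # series example above, ~0.865
print(parallel(0.95, 0.92, 0.99))  # parallel example above, ~0.99996

# Combined AND/OR configuration of Figure 3.2-9 (assumed event reliabilities)
R1 = R2 = R4 = R5 = 0.9
R3 = R6 = R7 = 0.95
R = parallel(R1, R2) * R3 * parallel(R4, R5) * series(R6, R7)
print(R)
```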

3.2.9. K-out-of-N Configurations


A system consisting of “n” components or subsystems, of which only “k” need to be
functioning for system success, is called a “k-out-of-n” configuration. For such a system,
the integer value of “k” is always less than the integer value of “n”.

Define the following as:

R = reliability of one unit for a specified time period
Q = unreliability of one unit for a specified time period
R + Q = 1

As an example, let us assume that there are three units operating in parallel, two of which
are required for the system to perform adequately. If R=0.9 and Q=0.1, then the
probabilities associated with each possible combination of outcomes are summarized in
Table 3.2-4.

Table 3.2-4: Example of “k-out-of-n” Probability Calculations

Outcome | A | B | C | Probability | Prob of Pass or Fail of A | Prob of Pass or Fail of B | Prob of Pass or Fail of C | Total Probability | System Probability
1 | Fail | Fail | Fail | QAQBQC | 0.1 | 0.1 | 0.1 | 0.1*0.1*0.1 | 0.001
2 | Fail | Fail | Pass | QAQBRC | 0.1 | 0.1 | 0.9 | 0.1*0.1*0.9 | 0.009
3 | Fail | Pass | Fail | QARBQC | 0.1 | 0.9 | 0.1 | 0.1*0.9*0.1 | 0.009
4 | Fail | Pass | Pass | QARBRC | 0.1 | 0.9 | 0.9 | 0.1*0.9*0.9 | 0.081
5 | Pass | Fail | Fail | RAQBQC | 0.9 | 0.1 | 0.1 | 0.9*0.1*0.1 | 0.009
6 | Pass | Fail | Pass | RAQBRC | 0.9 | 0.1 | 0.9 | 0.9*0.1*0.9 | 0.081
7 | Pass | Pass | Fail | RARBQC | 0.9 | 0.9 | 0.1 | 0.9*0.9*0.1 | 0.081
8 | Pass | Pass | Pass | RARBRC | 0.9 | 0.9 | 0.9 | 0.9*0.9*0.9 | 0.729


In this example, the probability of each combination of possible outcomes (in this case,
eight) is calculated. Note that the sum of the probabilities for all possible outcomes is
1.0, since each of the eight possibilities is mutually exclusive and their probabilities can,
therefore, be added. This approach of calculating the probability of every possible
outcome is always valid, regardless of whether the reliability values of each of the
elements are the same or not. For example, if two of the three units are required for the
system to perform adequately, the system will “pass” if there are either no failures or if
there is one failure, as shown below. This is summarized in Table 3.2-5.

Table 3.2-5: Example of “2-out-of-3” Required for Success

Outcome | A | B | C | Probability | Total System Probability | Pass or Fail
1 | Fail | Fail | Fail | QAQBQC | 0.001 | Fail
2 | Fail | Fail | Pass | QAQBRC | 0.009 | Fail
3 | Fail | Pass | Fail | QARBQC | 0.009 | Fail
4 | Fail | Pass | Pass | QARBRC | 0.081 | Pass
5 | Pass | Fail | Fail | RAQBQC | 0.009 | Fail
6 | Pass | Fail | Pass | RAQBRC | 0.081 | Pass
7 | Pass | Pass | Fail | RARBQC | 0.081 | Pass
8 | Pass | Pass | Pass | RARBRC | 0.729 | Pass

It can be seen that the system will pass with outcomes 4, 6, 7 and 8. Outcomes 4, 6 and 7
correspond to exactly one failure (i.e., there are three ways in which one failure can
occur), and outcome 8 corresponds to exactly zero failures (there is only one way in
which this can occur).

If the probability of failure of all of the units is the same and they are independent, then
the binomial or Poisson distributions can be used:

• If the metric used in the reliability analysis is the probability of failure, use the
binomial distribution
• If the metric is a failure rate, use the Poisson distribution

Since this example pertains to items with defined probabilities, the binomial distribution
applies. As defined previously:


F(x; r) = \sum_{x=0}^{r} \binom{n}{x} p^x q^{n-x} = \sum_{x=0}^{r} \frac{n!}{(n-x)!\,x!}\, p^x q^{n-x}

where:

n = total number of items (3)
x = good items (2 or 3)
r = failed items (0 or 1)

The probability of exactly no failures (i.e., the first term in the above summation) is:

F(3,0) = \frac{n!}{(n-x)!\,x!}\, p^x q^{n-x} = \frac{3!}{(3-3)!\,3!}\, (0.9)^3 (0.1)^0 = 1 \times 0.729 = 0.729
The probability of exactly one failure (i.e. the second term in the above summation) is:

F(2,1) = \frac{n!}{(n-x)!\,x!}\, p^x q^{n-x} = \frac{3!}{(3-2)!\,2!}\, (0.9)^2 (0.1)^1 = 3 \times 0.81 \times 0.1 = 0.243
Therefore, the cumulative binomial expression for 0 or 1 failures (r = 0 or 1) is:

F(x; r) = \sum_{x=0}^{r} \frac{n!}{(n-x)!\,x!}\, p^x q^{n-x} = 0.729 + 0.243 = 0.972

Because the first term in the binomial probability expression (the binomial coefficient) is
the number of combinations in which a specific number of failures (or survivals) can
occur, it effectively adds the probabilities associated with the mutually exclusive events.
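The 2-out-of-3 example can be reproduced both by enumerating the mutually exclusive outcomes of Table 3.2-5 and by summing the binomial terms directly, as in the following Python sketch:

```python
from itertools import product
from math import comb

R, Q = 0.9, 0.1   # unit reliability and unreliability, R + Q = 1

# Enumerate all 2**3 mutually exclusive outcomes (Table 3.2-5)
p_system = 0.0
for outcome in product([True, False], repeat=3):   # True = pass, False = fail
    if sum(outcome) >= 2:                          # 2-out-of-3 required for success
        p = 1.0
        for ok in outcome:
            p *= R if ok else Q
        p_system += p
print(p_system)                                    # 0.972

# Cumulative binomial: sum of C(n,x) p^x q^(n-x) terms for 2 or 3 good items
p_binomial = sum(comb(3, x) * R**x * Q**(3 - x) for x in (2, 3))
print(p_binomial)                                  # 0.972
```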

3.3. Distributions
Reliability distributions are at the heart of a reliability model. They represent the
fundamental relationship between the reliability metric of interest (probability of failure,
failure rate, etc.) and the independent variable (TTF, cycles to failure, etc.). This
independent variable is called the “life unit”. Table 3.3-1 summarizes probability
distributions often used in reliability modeling, along with a description of their primary
uses.


Table 3.3-1: Probability Distributions Applicable to Reliability Engineering


Probability Distribution | Type | Primary Uses
Binomial Discrete Used to find the probability of “x” events occurring in a total of “n”
trials, e.g., the number of failures in a sequence of a specified number
of equal-length time intervals

Poisson Discrete Used to model the probability of a specified number of events


occurring in a specified time interval

Exponential Continuous Used to describe the distribution of the time to failure when the
failure rate is constant

Gamma Continuous Used to determine the distribution of the time by which a specified
number of failures will occur when the failure rate is constant

Normal Continuous Used to describe the statistical mean of a sample taken from any
population with a finite mean and variance. Often used to model
parameter distributions. Rarely used for time to failure distributions.

Standard Normal Continuous The Standard Normal distribution (Z) is derived from the Normal for
ease of analysis and interpretation (mean = 0; standard deviation =
1).

Lognormal Continuous Used to model many wear out failure causes

Weibull Continuous Used to describe the distribution of failures representing constant


(i.e., exponential), increasing, or decreasing failure rates, depending
on the value of the slope parameter (β). Increasingly popular due to
its versatility. Applicable only when no repair is performed
following failure.

Student t Continuous Used to test for statistical significance of the difference between the
means of two samples

F Distribution Continuous Used to test for statistical significance of differences between the
variances of two samples

Chi-Square Continuous A special case of the Gamma distribution, used to estimate


confidence intervals around reliability test data, and to test to see
whether measured data reflects a constant failure rate.

The following section discusses several of the distributions used in reliability assessment.
While the intent of this book is not to cover the statistical aspects of distributions, some

fundamental concepts are critical to the understanding of the basis for certain techniques
pertaining to reliability assessment, namely confidence level calculations and
demonstrating reliability levels. In particular, the binomial and Poisson distributions are
critical for these purposes.

The binomial distribution is used when there are only two outcomes, such as success or
failure, and the probability remains the same for all trials. The probability density
function (pdf) of the binomial distribution is:

f(x) = \binom{n}{x} p^x q^{(n-x)}

where:
\binom{n}{x} = \frac{n!}{(n-x)!\,x!}
and q = 1 – p.

The function “f(x)” is the probability of obtaining exactly “x” good items and “(n-x)” bad
items in a sample of “n” items, where “p” is the probability of obtaining a good item
(success) and “q” (or 1-p) is the probability of obtaining a bad item (failure).

The CDF, i.e., the probability of obtaining “r” or fewer successes in “n” trials, is given
by:

F(x; r) = \sum_{x=0}^{r} \binom{n}{x} p^x q^{n-x} = \sum_{x=0}^{r} \frac{n!}{(n-x)!\,x!}\, p^x q^{n-x}

The Poisson distribution is the limiting form of the binomial distribution as “n” becomes
infinite. In practice, it is used to approximate the binomial distribution when n ≥ 20 and p ≤ 0.05.

If events are Poisson-distributed, they occur at a constant average rate and the number of
events occurring in any given time interval is independent of the number of events
occurring in any other time interval. Since the TTF distribution for this situation is the
exponential (i.e., constant failure rate), the Poisson distribution will predict the number of
failures for specific values of time and failure rates. The number of failures in a given
time would be given by:


f(x) = \frac{a^x e^{-a}}{x!}

where “x” is the actual number of failures and “a” is the expected number of failures.
Since the expected number of failures (i.e., the expected value) for the exponential
distribution is “λt”, the Poisson expression becomes:

f(x) = \frac{(\lambda t)^x e^{-\lambda t}}{x!}

where:

λ = failure rate
t = length of time being considered
x = number of failures

The reliability function, R(t), or the probability of zero failures in time “t” is given by:

R(t) = \frac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t}

This is the reliability for the exponential distribution.

There are many cases where the probability of experiencing a given number of failures (r)
or fewer is required. Examples are reliability demonstration, test planning, etc. For these
cases, the CDF is used:

P(x \le r) = \sum_{x=0}^{r} \frac{(\lambda t)^x e^{-\lambda t}}{x!}
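As an illustration, this cumulative Poisson probability is a one-line summation in code. In the Python sketch below, the failure rate and time values are assumptions chosen only for demonstration:

```python
from math import exp, factorial

def poisson_cdf(r, lam, t):
    """Probability of r or fewer failures in time t at constant failure rate lam."""
    a = lam * t   # expected number of failures
    return sum(a**x * exp(-a) / factorial(x) for x in range(r + 1))

# Assumed example: lambda = 0.001 failures/hour, t = 1000 hours
print(poisson_cdf(0, 0.001, 1000))   # P(0 failures) = e^-1 ~ 0.368
print(poisson_cdf(2, 0.001, 1000))   # P(2 or fewer failures) ~ 0.920
```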

A summary of the distributions most commonly used in reliability engineering is
presented in Figures 3.3-1 and 3.3-2, for discrete and continuous distributions,
respectively.


Figure 3.3-1: Shapes of Failure Density and Reliability Functions of Commonly Used Discrete Distributions (from MIL-HDBK-338B)

Figure 3.3-2: Shapes of Failure Density, Reliability and Hazard Rate Functions for Commonly Used Continuous Distributions (from MIL-HDBK-338B)


Continuous distributions are used when analyzing time-to-failure data, since time to
failure is a continuous variable. The most common distributions used in reliability
modeling to describe time-to-failure characteristics are the exponential, Weibull and
lognormal distributions. These are described in more detail in the following sections.

3.3.1. Exponential
The exponential distribution is most commonly applied in reliability to describe the times
to failure for repairable items. For non-repairable items, the Weibull distribution is
popular due to its flexibility. In general, the exponential distribution has numerous
applications in statistics, especially in reliability and queuing theory.

The exponential distribution describes products whose failure rates are the same
(constant) at each point in time (i.e., the “flat” portion of the reliability bathtub curve,
where failures occur randomly, by “chance”). This is also called a Poisson process. This
means that if an item has survived for "t" hours, the chance of it failing during the next
hour is the same as if it had just been placed in service. It is sometimes referred to as the
distribution with no memory. It is an appropriate distribution for complex systems that
are comprised of different electronic and electromechanical component types, the
individual failure rates of which may not follow an exponential distribution.

Since the exponential distribution is relatively easy to fit to data, it can be misapplied to
data sets that would be better described using a more complex distribution.

Table 3.3-2 lists the parameters for the exponential distribution: the probability density
function (pdf), the cumulative distribution function (CDF), the mean, the variance, and
the standard deviation. Another useful parameter of continuous distributions is the 100-
pth percentile of a population, i.e., the age by which a portion of the population has failed.
The 50% point is the median life. The mean of the exponential distribution is equal to the
63rd percentile. Thus, if an item with a 1000 hour MTBF had to operate continuously for
1000 hours, there would only be a 0.37 probability of success.

As an example, consider a software system with a failure rate (λ) of 0.0025 failures per
processor hour. Its corresponding mean time between failure (MTBF) is calculated as:

MTBF = \theta = \frac{1}{\lambda} = \frac{1}{0.0025} = 400 \text{ processor hours}


Table 3.3-2: Exponential Distribution Parameters

Parameter | Based on Failure Rate | Based on MTBF
Probability Density Function | f(t) = λe^(−λt), t > 0 | f(t) = (1/θ)e^(−t/θ), t > 0
Cumulative Distribution Function | F(t) = 1 − e^(−λt), t > 0 | F(t) = 1 − e^(−t/θ), t > 0
Failure Rate | λ | 1/θ
Mean | μ = 1/λ | μ = θ
Variance | σ² = 1/λ² | σ² = θ²
Standard Deviation | σ = 1/λ | σ = θ
100 pth Percentile | y_P = −(1/λ) ln(1 − P) | y_P = −θ ln(1 − P)
Reliability Function | R(t) = e^(−λt) | R(t) = e^(−t/θ)

The reliability function (i.e., the probability, or population fraction that survives beyond
age “t”) at 100 and 1000 processor hours is:

R(100) = e^{-(0.0025)(100)} = 0.7788 = 77.88%

R(1000) = e^{-(0.0025)(1000)} = 0.0821 = 8.21%

Note that these values equal R(t) = 1 − F(t).
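The example calculations above can be verified with a few lines of Python:

```python
from math import exp

lam = 0.0025                 # failures per processor hour (from the example)
mtbf = 1 / lam               # MTBF for the exponential distribution

def reliability(t, lam):
    """Exponential reliability function R(t) = e^(-lambda * t)."""
    return exp(-lam * t)

print(mtbf)                   # 400.0 processor hours
print(reliability(100, lam))  # ~0.7788
print(reliability(1000, lam)) # ~0.0821
```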

3.3.2. Weibull
The Weibull distribution is important in reliability modeling since it represents a general
distribution which can model a wide range of life characteristics. It can accommodate
increasing, decreasing and constant failure rates. Weibull analysis assumes that there has
been no repair of failed items and is often used to model single failure causes. The basic
features of the Weibull are:

• The shape parameter, β, which describes the shape of the pdf


• The scale (or characteristic life) parameter, α, is the value at which the 63rd
percentile of the distribution occurs
• The location parameter, γ (or gamma), is used only in the three-parameter version
of the Weibull distribution, and is the value that represents the failure-free period
for the item. If an item does not have a period during which the probability of
failure is zero, then γ = 0 and the Weibull distribution becomes a two-parameter
distribution. This third parameter is used when there are threshold effects.
• The parameters β, α, and γ can easily be estimated using Weibull probability
paper or by using available Weibull software programs
• A multi-mode version of the Weibull distribution can be used to determine the
points on the bathtub curve where the failure rate changes from decreasing, to
constant, to increasing

There are two general versions of the Weibull distribution, the first being the two-
parameter Weibull and the second being the three-parameter Weibull. The two-
parameter Weibull uses a shape parameter that reflects the tendency of the failure rate
(increasing, decreasing, or constant) and a scale parameter that reflects the characteristic
life of items being measured ( ≅ 63.2% of the population will have failed). The three-
parameter Weibull adds a location parameter used to represent the minimum life of the
population (e.g., a failure mode that does not immediately cause system failure at time
zero, such as a software algorithm whose degrading calculation accuracy does not cause
system failure until four calls to the algorithm have been made). Note that in most cases,
the location parameter is set to zero (failures assumed to start at time zero) and the
Weibull distribution reverts to the two-parameter case. The three-parameter Weibull
distribution is also commonly used to characterize strength distributions (i.e., when using
a stress/strength model), where the γ-value represents a screen value, or proof test, in
which case this value of stress is applied to the item as a screen. It is also used to model
failure causes that are not initiated until a time equal to the gamma value has passed.

As with the gamma distribution, the definition of Weibull parameters is inconsistent


throughout the literature. Table 3.3-3 illustrates how some sources define these
parameters.


Table 3.3-3: Confusing Terminology of the Weibull Distribution

Reference | Weibull Form | Random Variable | Shape Parameter | Scale Parameter | Location Parameter
Montgomery, D.C., “Introduction to Statistical Quality Control – 2nd Edition”, John Wiley & Sons, 1991 | 3-P | X | β | δ | γ
Musa, J.D.; Iannino, A.; and Okumoto, K.; “Software Reliability: Measurement, Prediction, Application”, McGraw-Hill, May 1987 | 2-P | T | α | β | –
Nelson, W., “Applied Life Data Analysis”, John Wiley & Sons, 1982 | 2-P | Y | β | α | –
MIL-HDBK-338, Section 5.3.6 | 3-P | T | β | η | γ
This book | 2-P | X | β | α | –

For much life data, the Weibull distribution is more suitable than the exponential, normal
and extreme value distributions, so it should be the distribution of first resort. The
characteristics of various shape parameters are summarized below:

• For shape parameter < 1.0, the Weibull pdf takes the form of the gamma
distribution (see Section 3.7.1.4) with a decreasing failure rate (i.e., infant
mortality)
• For shape parameter = 1.0, the failure rate is constant so that the Weibull pdf
takes the form of the simple exponential distribution with failure rate parameter
“λ” (the flat part of the reliability bathtub)
• For shape parameter = 2.0, the Weibull pdf takes the form of the Rayleigh
distribution, with a failure rate that increases linearly with time (i.e., wearout).
This form is often used to model software reliability.
• For 3 < shape parameter < 4, the Weibull pdf approximately takes the form of the
Normal distribution
• For shape parameter > 10, the Weibull distribution is close to the shape of the
smallest extreme value distribution

The basic parameters of the 2-parameter Weibull distribution are presented in Table
3.3-4. To have the mathematical expressions reflect a 3-parameter Weibull, replace all
values of “x” with “(x-x0)”, where x0 represents the γ value as described above.

Table 3.3-4: Weibull Distribution Parameters

Parameter | Mathematical Expression
Probability Density Function | f(x) = (β/α)(x/α)^(β−1) e^(−(x/α)^β), x > 0
Cumulative Distribution Function | F(x) = 1 − e^(−(x/α)^β)
Shape Parameter | β
Scale Parameter | α
Failure Rate | λ(x) = (β/α)(x/α)^(β−1)
Mean | μ = α Γ(1 + 1/β)
Variance | σ² = α² [Γ(1 + 2/β) − Γ²(1 + 1/β)]
Standard Deviation | σ = α [Γ(1 + 2/β) − Γ²(1 + 1/β)]^0.5
100 Pth Percentile | y_P = α [−ln(1 − P)]^(1/β)
Reliability | R(x) = e^(−(x/α)^β)

Figure 3.3-3 provides a graphical example of the Weibull distribution pdf with a
characteristic life of 1000 hours for a variety of shape parameters (β). Figures 3.3-4 and
3.3-5 illustrate the hazard rate and probability plot, respectively, for the same values of
the shape parameter.


Figure 3.3-3: Example pdf Plots for the Weibull Distribution

Figure 3.3-4: Example Hazard Rate Plots for the Weibull Distribution


Figure 3.3-5: Example Probability Plots for Weibull Distribution

As an example, consider that very early in the system integration phase of a large
software development effort, there have been numerous failures due to software that have
caused the system to crash (the predominant system failure cause). Plotting the failure
times of this specific failure mode (other failure modes are ignored for now) on Weibull
probability paper resulted in a shape parameter value of 0.77 and a scale parameter value
of approximately 32 hours. Based on these parameters, the calculated reliability and
failure rate of the software at 10 system hours is expected to be:

\lambda(10) = \frac{0.77}{32}\left(\frac{10}{32}\right)^{0.77 - 1} = 0.0314 \text{ failures per hour}

R(10) = e^{-(10/32)^{0.77}} = 0.6647
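A short Python sketch verifies these two values:

```python
from math import exp

beta, alpha = 0.77, 32.0     # shape and scale parameters from the Weibull plot

def weibull_hazard(t, beta, alpha):
    """Weibull failure (hazard) rate, lambda(t) = (beta/alpha)(t/alpha)^(beta-1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)

def weibull_reliability(t, beta, alpha):
    """Weibull reliability, R(t) = exp(-(t/alpha)^beta)."""
    return exp(-((t / alpha) ** beta))

print(weibull_hazard(10, beta, alpha))       # ~0.0314 failures per hour
print(weibull_reliability(10, beta, alpha))  # ~0.6647
```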


3.3.3. Lognormal
The lognormal distribution is the distribution of a random variable whose natural
logarithm is distributed normally; in other words, it is the normal distribution with “ln t”
as the independent variable. The probability density function is

f(t) = \frac{1}{\sigma t \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln(t) - \mu}{\sigma}\right)^2}

The mean is:


e^{\mu + \frac{\sigma^2}{2}}

And the standard deviation is:

\left(e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}\right)^{1/2}

where μ and σ are the mean and standard deviation (SD), respectively, of ln (t).

The lognormal distribution is used in the reliability analysis of semiconductors and the
fatigue life of certain types of mechanical components. This distribution is also
commonly used in maintainability analysis.

The CDF for the lognormal distribution is:

F(t) = \int_0^t \frac{1}{\tau \sigma \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln(\tau) - \mu}{\sigma}\right)^2\right] d\tau

This can be related to the Standard Normal variate Z by:

F(t) = P\left(Z \le \frac{\ln(t) - \mu}{\sigma}\right)

The reliability function is 1-F(t) or:


R(t) = P\left(Z > \frac{\ln(t) - \mu}{\sigma}\right)

The hazard function, h(t), is given as follows:

h(t) = \frac{f(t)}{R(t)} = \frac{\phi\left(\frac{\ln(t) - \mu}{\sigma}\right)}{t\,\sigma\,R(t)}

where φ is the standard Normal probability density function, and μ and σ are the mean and
standard deviation of the natural logarithm of the random variable, t.
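Because F(t) reduces to a standard Normal probability, the lognormal CDF and hazard rate can be computed with nothing more than the error function from the standard library. In the Python sketch below, the values chosen for μ and σ (the mean and standard deviation of ln(t)) are assumptions for illustration:

```python
from math import erf, exp, log, pi, sqrt

mu, sigma = 6.9, 1.0   # assumed parameters of ln(t)

def lognormal_cdf(t, mu, sigma):
    """F(t) = P(Z <= (ln t - mu)/sigma), via the error function."""
    z = (log(t) - mu) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def lognormal_hazard(t, mu, sigma):
    """h(t) = f(t)/R(t), using the lognormal pdf given above."""
    z = (log(t) - mu) / sigma
    pdf = exp(-0.5 * z * z) / (t * sigma * sqrt(2.0 * pi))
    return pdf / (1.0 - lognormal_cdf(t, mu, sigma))

print(lognormal_cdf(1000.0, mu, sigma))     # unreliability F(t) at t = 1000
print(lognormal_hazard(1000.0, mu, sigma))  # hazard rate at t = 1000
```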

Figures 3.3-6 through 3.3-8 illustrate the lognormal distribution for a mean value of 1000
and standard deviations of 0.1, 1 and 3. Shown are the pdf, the hazard rate, and the
cumulative unreliability function, F(t), respectively.

Figure 3.3-6: Example pdf Plots for the Lognormal Distribution


Figure 3.3-7: Example Hazard Rate Plots for the Lognormal Distribution

Figure 3.3-8: Example Probability Plots for the Lognormal Distribution



3.4. References
1. Lyu, M.R. (Editor), “Handbook of Software Reliability Engineering”, McGraw-
Hill, April 1996, ISBN 0070394008
2. Musa, J.D., “Software Reliability Engineering: More Reliable Software, Faster
Development and Testing”, McGraw-Hill, July 1998, ISBN 0079132715
3. Nelson, W., “Applied Life Data Analysis”, John Wiley & Sons, 1982, ISBN
0471094587
4. Musa, J.D.; Iannino, A.; and Okumoto, K.; “Software Reliability: Measurement,
Prediction, Application”, McGraw-Hill, May 1987, ISBN 007044093X
5. Montgomery, D.C., “Introduction to Statistical Quality Control – 2nd Edition”,
John Wiley & Sons, 1991, ISBN 047151988X
6. Shooman, M., "Probabilistic Reliability, An Engineering Approach," McGraw-
Hill, 1968.
7. Abernethy, Dr. R.B., "The New Weibull Handbook," Gulf Publishing Co., 1994.


4. DOE-Based Approaches to Reliability Modeling


The use of Design of Experiments (DOE) principles is critical to reliability modeling,
particularly as it pertains to designing reliability tests from which life models will be
derived. As such, it is treated as a separate topic in this book.

The basic tenet of DOE is that one or more of a product’s or system’s responses are observed
as a function of pertinent factors that may affect that response, as illustrated in Figure 4.0-1.

Figure 4.0-1: The DOE Concept

At the heart of this technique is the product/system or process under analysis. This is the
feature whose behavior we want to quantify. The independent variables are called the
factors. These represent the inputs to the product/system or process and are the things
that can potentially change how the product behaves. The output of the DOE activity is
the response, which is a measure of how well the product/system or process behaves.

The levels for each factor are varied, tests are performed, and the resulting response is
measured. The resultant data is analyzed to quantify the item or process response as a
function of the factor levels. The generic steps in applying DOE to generate life models
are:

1. Determine the product/system or process feature to be assessed


2. Determine the factors


3. Determine the factor levels
4. Design the tests
5. Perform tests and measurements
6. Analyze the data
7. Develop the life model

Each of these steps is described below.

4.1. Determine the Feature to be Assessed


The product/system or process feature to be assessed can be any characteristic of the
entity that is important to the end user or the producer. It can be related to the
performance of the entity, or it can be related to its reliability or durability. In the context
of this book, the primary features of interest are reliability and durability. The basic
premise of the DOE approach dictates that the feature to be assessed must be
quantifiable.

4.2. Determine Factors


A factor is any variable that can potentially influence the feature being analyzed. It can
be a design attribute, manufacturing attribute, process attribute, environmental stress,
operational stress, or any other influencing factor. The output of this determination is a
list of factors that will be varied in the DOE tests to be performed. A variety of tools can
be used to assist in determining the factors that are to be included in the experiments.
Some of these tools are:

• Quality Function Deployment (QFD)


• Brainstorming sessions
• Ishikawa diagram
• Design FMEA
• Process FMEA

The FMEA is treated in more detail in Chapter 8.

4.3. Determine the Factor Levels


After the factors are identified, the next step is to determine the levels of each factor that
will be used in the subsequent tests. The simplest and most common approach is the use
of two levels, one at the high end of the operating space (defined below) and the other at
the low end. However, there are risks associated with using only two levels. The main


drawback is that it cannot detect non-linearity in the relationship between the factor and
the response. For example, consider the relationship in Figure 4.3-1.

Figure 4.3-1: Possible Response-Factor Level Relationship

In this example, the levels “a” and “d” represent the operating space of the product. The
conclusions will be very different, depending on the levels chosen within this operating
space. For example, if levels “a” and “b” are chosen, the conclusion will be that there is
a strong positive relationship; if levels “b” and “d” are chosen, the conclusion will be that
the factor has no effect on the response; and if “a” and “d” are chosen, which is a typical
approach, the conclusion will be that there is a moderate relationship. These results are
summarized in Table 4.3-1.

Table 4.3-1: Possible Conclusions for a Non-Linear Response-Factor Relationship

Levels Conclusion
a-b High positive relationship
c-d High negative relationship
b-d No relationship
a-d Moderate positive relationship

The number of levels for each factor should be chosen, in part, based on knowledge of
the physics of the manner in which the factor affects the response. Otherwise, there can
be large uncertainty in using the resulting model to interpolate or extrapolate the response
behavior as a function of the factor. For example, if the response under analysis is

corrosion, and the relationship between the factor, temperature, and the corrosion rate is
expected to be governed by the Arrhenius relationship over the entire operating space,
then a two-level temperature test may be appropriate. If, however, it is hypothesized that
there is a temperature threshold within the operating space, then more than two levels
may be required.

4.4. Design the Tests


The next step is to design the experiment itself. The tests must be designed to determine
the specific factor level combinations to be tested, and the order in which they will be
tested. There are many things that will influence the design of the experiment, including
sample availability, the cost of running the tests, the time allotted for the tests, and test
equipment availability.

As an example of a simple experimental design, consider Figure 4.4-1.

Figure 4.4-1: DOE Terminology

In this example, there are three factors to be assessed, A, B and C, represented by the
three right-hand columns. Each factor has two levels, a “+” indicating the high level and
a “–“ indicating the low level. This experiment has four runs, each one representing a
treatment. A treatment refers to the combination of levels used in the tests.


Repetition and replication are techniques used to increase the number of runs. The
advantage of increasing the number of runs is that obtaining multiple responses with
exactly the same factor levels is valuable in quantifying the amount of variability and
error in the measurements obtained. Repetition is the practice of repeating the same run
sequentially. Replication is the practice of repeating a set of runs sequentially. Both
practices will result in multiple responses for a given set of factor levels, but the
advantage of replication over repetition is that it is better able to quantify measurement
error in the event that there is a gradually changing parameter in the test or
measurement system.

The full-factorial approach will be used as an example for illustrating the concepts of data
analysis, followed by a discussion of other approaches.

A full-factorial design, an example of which is shown in Table 4.4-1, is the most
comprehensive experimental design. It includes runs which represent all possible
combinations of factor levels. The primary drawback to the full-factorial approach is that
it requires many runs. In some cases, this may be practical, but in many cases, the cost
and time required to carry out the experiments are prohibitive.

Table 4.4-1: Full-Factorial Example

Run A B C R (response)
1 + + + R1
2 + + - R2
3 + - + R3
4 + - - R4
5 - + + R5
6 - + - R6
7 - - + R7
8 - - - R8

The number of required runs is calculated as y^x, where “y” is the number of levels per
factor (2, for this example), and “x” is the number of factors (3). In Table 4.4-1, then,
the number of runs is 2^3 = 8.
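A full-factorial design matrix can be generated mechanically. The Python sketch below reproduces the eight treatments of Table 4.4-1 using only the standard library:

```python
from itertools import product

factors = ["A", "B", "C"]    # three factors, two levels each
levels = ["+", "-"]          # high and low levels

# Every possible combination of factor levels: 2**3 = 8 treatments
runs = list(product(levels, repeat=len(factors)))
for run_no, treatment in enumerate(runs, start=1):
    print(run_no, dict(zip(factors, treatment)))
```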


There are many alternatives to the full factorial approach. “One-Factor-at-a-Time”
experiments, illustrated in Figure 4.4-2, refer to experiments in which each run varies the
level of one factor. In this manner, the effects of each factor can be assessed by
comparing the response between the two successive runs in which the factor was varied.
This is generally a brute force way to perform experiments, and is usually very
inefficient.

Figure 4.4-2: One-Factor-at-a-Time Experiments

Fractional Factorial Orthogonal Array Experiments can be used when it is impractical to
perform a full factorial experiment. Characteristics of orthogonal experiments are as
follows:

• They use a fraction of the number of full-factorial combinations


• The treatments are chosen to provide enough information to analyze the effects of
a factor using analysis of means
• “Orthogonal” means that the combinations of factors are balanced such that all
factors carry equal weight
• “Orthogonal” also means that the effects of the factors can be assessed
independently of one another

A full-factorial array can be scaled such that the resultant array has the characteristics of
orthogonality. These are referred to as fractional factorial arrays, since only a fraction of
the full-factorial runs are required, yet are still orthogonal. The naming convention for
these arrays is determined from:


L_a(y^x)

where:

a = the number of experimental runs
y = the number of levels
x = the number of factors

As before, “y” is the number of levels and “x” is the number of factors; in the standard
DOE nomenclature, “La” refers to the number of runs. For example, a seven-factor,
two-level experiment for which there will
be eight runs is shown in Figure 4.4-3.

Figure 4.4-3: Standard DOE Nomenclature

Another critical element that must be considered when defining reliability tests is the
potential interactions between factors. Everything discussed thus far in this section has
assumed that the effects of each of the factors are independent of each other. In practice,
there are often interactions between factors that must be accounted for. Graphical
representations of potential interactions are shown in Figure 4.4-4. Referring to the
Figure, if the responses for the two levels of the “B-factor” plotted against the two levels
of the “A-factor” are parallel, then this is an indication that there is no interaction
between the two factors. This is shown on the top left. In other words, the relative
magnitudes of the B-response are independent of the level of “A”. If, however, the plots
of the same factors result in the pattern shown on the top right, then this is an indication
that there is a strong interaction between factors A and B. In this example, the levels of “A”
change the entire relationship between the B-levels and the response. The plot on the
bottom indicates that there is a mild interaction between the two factors.


Figure 4.4-4: Potential Interactions

If the potential interactions are not accounted for in the reliability test plan, the risk is that
the effects of the factors cannot be deconvolved (separated) from the interactions between
the factors. There are many DOE test plans and tools that assist in identifying the
capability of various plans to identify main effects and interactions.

A detailed treatment of DOE principles is beyond the scope of this book, as this has been
done extensively in the literature, but it is important to understand the impact of some of
the principles as they pertain to reliability testing.

Resolution is a term that describes the degree to which the main effects of factors are
aliased, or confounded, with the interactions amongst factors. In general, the resolution
number of a design is one more than the smallest order interaction with which some main
effects are aliased. For example, if some main effects are confounded with some two-factor
interactions, the resolution number of the DOE is 3. Since full-factorial designs test the
response of every possible combination of factors, there is no confounding and, therefore,
they have infinite resolution. As stated previously, since the implementation of a full-
factorial test is often not practical, weaker tests are often necessary. The key is to select
the aliasing structure of the test such that the actual critical interactions can be
deconvolved from the main effects.


To illustrate this, consider an example of a corrosion failure mechanism that is
accelerated by temperature, humidity and the level of ionic contamination. A full
factorial, 2-level per factor, plan would be as shown in Table 4.4-2. The “-1” and “1”
designations represent the low and high levels of the factors, respectively. For this full-
factorial, 2-level plan, eight runs are sufficient to test all possible combinations.

Table 4.4-2: Full and Half Factorial Example for Corrosion
(T, H and I are the main effects; T*H, T*I and H*I are the interactions)

Plan | Temperature (T) | Humidity (H) | Ionic Contamination (I) | T*H | T*I | H*I
Full-Factorial | 1 | -1 | 1 | -1 | 1 | -1
Full-Factorial | -1 | 1 | -1 | -1 | 1 | -1
Full-Factorial | 1 | -1 | -1 | -1 | -1 | 1
Full-Factorial | -1 | -1 | 1 | 1 | -1 | -1
Full-Factorial | -1 | 1 | 1 | -1 | -1 | 1
Full-Factorial | 1 | 1 | 1 | 1 | 1 | 1
Full-Factorial | -1 | -1 | -1 | 1 | 1 | 1
Full-Factorial | 1 | 1 | -1 | 1 | -1 | -1
Half-Factorial (Resolution = 3) | 1 | 1 | 1 | 1 | 1 | 1
Half-Factorial (Resolution = 3) | 1 | -1 | -1 | -1 | -1 | 1
Half-Factorial (Resolution = 3) | -1 | -1 | 1 | 1 | -1 | -1
Half-Factorial (Resolution = 3) | -1 | 1 | -1 | -1 | 1 | -1

Another possible plan would be a half factorial, also shown in Table 4.4-2. Notice that,
for the half-factorial design, the temperature-humidity (T*H) interaction (i.e., the product
of the two) is the same as for ionic contamination (I). Also, the T*I interaction is the
same as H, and the H*I interaction is the same as T. Therefore, this Resolution 3 plan is
incapable of deconvolving the main effects of T, H or I with the interactions of the other
two.

From physics, we know that both humidity and ionic contamination are required for
corrosion. Therefore, the fact that H*I is the same as T (i.e., they are confounded) is
unacceptable, since we would not be able to determine if the lifetime is governed by
temperature, or the combination of humidity and ionic contamination. Therefore, we
need a better DOE test plan. The full-factorial plan would be the best, if it could be
executed, since none of this confounding exists. For the full-factorial plan, notice that
none of the interaction terms are the same as the main effects.


If we were to actually model this failure cause based on the tests defined in these plans,
the general form of the reliability model may be based on the two-parameter Weibull
distribution, which is:

R = e^{-(t/\alpha)^{\beta}}
where:

R = the reliability, or probability of survival, at time “t”
α = the characteristic life (i.e., time to 63% failure)
β = the Weibull shape parameter

The characteristic life is then developed as a function of the applicable variables. The
model in this case is:
\alpha = e^{\alpha_0}\, e^{\alpha_1 / T}\, H^{\alpha_2}\, I^{\alpha_3}\, (HI)^{\alpha_4}

where:

α0 through α4 = parameter coefficients estimated in the life modeling process
T = the temperature in degrees K (degrees C + 273)
H = the relative humidity
I = the ionic contamination
HI = the product of humidity and ionic contamination

All model parameters, α0 through α4, could be adequately quantified with the full-
factorial design, but not with the half-factorial.

There are many other potential test plans that would be adequate, providing that the
required model variables can be quantified and are not confounded with one another
(Reference 1).

4.5. Perform Tests and Measurements


The next step in the process is to perform the tests. The test for each run is performed,
and the response is measured. All variables that are not factors being addressed in the


experiment must be kept as constant as possible. Make sure that all results are fully
documented. This also must include any anomalies or potential sources of error that may
have occurred. The order of the runs must be kept intact, per the experimental plan. If
repetition is used, the same run or treatment is repeated sequentially. If replication is
used, then the set of runs to be repeated have been identified in the experimental design.

For in-situ measurements, careful time stamping of the data is required. Life models to
be developed from the collected data often represent parameter degradation data and not
actual TTF data. As a result, a model of degradation rate as a function of time may be
used as the response to predict failure times. All test samples should be carefully stored,
as root-cause failure analysis may be required at some future time.

4.6. Analyze the Data


The data that is generated from the tests is then analyzed to identify the impact that each
factor has on the response, and the interactions between the factors. The simplest way to
analyze the data and the effects of each factor is to perform an analysis of means. This
can be done only if the experimental design is orthogonal. In this case, the average value
of the response is calculated for each level of each factor. From the previous example, if
the effects of A are to be determined, then the average of the responses when A is “+”
and when A is “-” are calculated. Likewise, the mean of each level of each factor is
calculated in the same manner.

The means can be pictorially represented, as shown in Figure 4.6-1. This is a convenient
way to illustrate the sensitivity of the response to each factor. Data analysis techniques
more sophisticated than the analysis of means shown here are also often used, and there
are many good software tools available to aid in this analysis. However, if a balanced,
orthogonal design is used, analysis of means can be very straightforward and effective.
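The Python sketch below performs this analysis of means for a small orthogonal two-level design; the design matrix and response values are assumptions invented for illustration:

```python
# Analysis of means for an orthogonal two-level design.
# The design matrix and responses below are illustrative assumptions.
runs = [                     # factor levels for each run
    {"A": "+", "B": "+", "C": "+"},
    {"A": "+", "B": "-", "C": "-"},
    {"A": "-", "B": "+", "C": "-"},
    {"A": "-", "B": "-", "C": "+"},
]
responses = [42.0, 38.0, 30.0, 26.0]

# Average the responses over all runs at each level of each factor
for factor in ("A", "B", "C"):
    for level in ("+", "-"):
        vals = [r for run, r in zip(runs, responses) if run[factor] == level]
        print(factor, level, sum(vals) / len(vals))
```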


Figure 4.6-1: Analysis of Means

In the event that it is known that the response does not behave linearly with the factor
level, the response can sometimes be linearized by making the appropriate data
transformation. For example, if the response under analysis is corrosion that is governed
by the Arrhenius relationship over the entire operating space, then the response, life in
this case, would be exponential with temperature. However, if the transformation shown
in Figure 4.6-2 is applied, the response will be linear. This is especially useful when a
goal of the analysis is to determine the activation energy.

Figure 4.6-2: Linearization of the Arrhenius Relationship
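In symbols, assuming an Arrhenius life-stress relationship with life L, pre-exponential constant A, activation energy Ea, Boltzmann's constant k, and absolute temperature T, the transformation is:

```latex
L = A\, e^{E_a / (kT)}
\quad\Longrightarrow\quad
\ln L = \ln A + \frac{E_a}{k} \cdot \frac{1}{T}
```

so that ln(L) is linear in 1/T, and the activation energy can be recovered from the slope of the fitted line.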


After the data has been analyzed, the optimal combination of factor levels can be
determined. The goal of this approach is to determine the factor levels that will result in
minimal variability of the product response and maximum probability of the product
meeting its requirements. This is the payoff in this approach, since it results in a more
robust design.

In this example, if the desirable response is high, then a high value of A and B with a low
value of C provides the best response, as shown in Figure 4.6-3.

Figure 4.6-3: Optimal Factor Settings

4.7. Develop the Life Model


The reliability data is then analyzed and the life model is developed. This process is
discussed in detail in Chapter 5.

4.8. References
1. Fowlkes, W.Y. and Creveling, C.M., “Engineering Methods for Robust Product
Design: Using Taguchi Methods in Technology and Product Development”,
Addison-Wesley, 1995.


5. Life Data Modeling


This section addresses the topic of life modeling, after the life data has been generated.
Life data modeling is treated as a separate topic in this book since its principles pertain to
many of the types of data previously discussed. The purpose of modeling the reliability
of critical components or failure causes was previously described, and includes a variety
of objectives. Life modeling is simply a means of constructing a mathematical model
that predicts, assesses, or estimates the reliability of a product or system. A methodology
was previously presented for developing life models from tests performed at multiple
combinations of stress (DOE Multicell). From this data, a reliability model can be
constructed.

If all samples are tested to failure, or have been tested in exactly the same manner, then
traditional statistical analysis techniques (like regression, F-tests, T-tests, ANOVA, etc.)
will generally suffice for reliability modeling purposes. However, most real world cases
include censored data, unbalanced datasets, uncertain failure times, etc. It is these cases
where life modeling techniques are most effectively used.

Life modeling requires simultaneous characterization of:

1. TTF distributions
2. Acceleration factors (which provide a relative value of the reliability parameter as
a function of the stress level)

Each of these two major elements is discussed in the following sections, which present
more detailed information regarding development of the models after the life data has
been obtained.

5.1. Selecting a Distribution


While no single distribution type is prescribed for a given situation, there are some rules
of thumb that are helpful when selecting an appropriate distribution.

If the failure mechanism of interest is a manifestation of a positive feedback situation,
then the lognormal distribution is often applicable. These positive feedback situations are
recursive cases in which a flaw starts, the presence of the flaw results in an increased
stress level, the flaw propagates resulting in further increased stress, and so on, until
catastrophic failure occurs.


If the failure process is governed by a distribution of defects present in the product or
system at time zero, then the Weibull distribution is usually appropriate.

In cases where the failure mechanism is random in nature, the exponential is applicable.

5.2. Parameter Estimation Overview


Life modeling, using statistical concepts, involves drawing inferences from observations
of random variables, such as observed failure times. Typical inferences consist of point
and interval estimates of distribution parameters and decisions in statistical hypothesis
testing.

Parameter estimation provides a means for the effective use of data to aid in life
modeling and the estimation of constants appearing in those models. The constants that
appear in distribution functions (e.g., “p” in the binomial distribution; “λ” in the Poisson
distribution; “μ” and “σ” in the Normal distribution; “λ” or “θ” in the exponential
distribution; and “α” and “β” in the Weibull distribution) are called parameters. The true
value of the parameters from a given distribution may not be known or measurable, so it
becomes more practical to obtain approximate or estimated values of these parameters
from a sample of data. In the larger context, parameter estimation is typically applied to
scenarios such as the following.

Point estimation is frequently used in reliability analysis to quantify parameters like the
failure rate in the exponential distribution.

Formally, a statistic, Y, is a function of random variables that does not depend on any
unknown parameter:

Y = u(X_1, \ldots, X_n)

Let “θ” denote the parameter to be estimated. Consider functions w(Y) of the statistic,
which might serve as point estimates of the parameter. Since w(Y) is a random variable,
it has a probability distribution. Statisticians have defined certain properties for assessing
the quality of estimators. These properties are defined in terms of this probability
distribution.

A loss function, L[θ,w(Y)], assigns a number to the deviation between a parameter and an
estimator. A typical loss function is the square of the difference, and is the value used in
least squares regression:


L[\theta, w(Y)] = [\theta - w(Y)]^2

The risk function is the expected value of the loss function:

R(\theta, w) = E\{L[\theta, w(Y)]\}

An unbiased estimator that minimizes the risk function for the above loss function is
referred to as a minimum variance unbiased estimator. An estimator that minimizes this
risk function uniformly in θ is called a minimum mean squared estimator. Table 5.2-1
summarizes the terms most commonly used in parameter estimation.

Table 5.2-1: Terminology Used In Parameter Estimation


Term Definition
Confidence Level The theoretical percentage (or probability) of an interval estimate containing
the parameter, and in which the endpoints of the interval are constructed from
sample data
Consistent Estimator The estimate converges to the true value of the parameter as the sample size
increases to infinity
Estimator A function of a statistic used to estimate a parameter in a probability model
Interval Estimator Estimates of the endpoints of an interval around a parameter
Likelihood The probability weight for given values of parameters at observed data points
Loss Function A function that provides a measure of the distance between a parameter value
and its estimator
Maximum Likelihood An estimate that maximizes the probability that given parameter values will
Estimate occur at observed data points
Minimum Mean Squared An estimator that uniformly minimizes the expected value of the square of the
Estimate difference between a parameter and an estimator
Minimum Variance Of all unbiased estimators, none has a smaller variance. Sometimes called a
Unbiased Estimator “best” estimator
Risk Function The mathematical expectation of the loss function
Sample Size The number of random variables from which a statistic is calculated
Unbiased Estimator An estimator with a mathematical expectation equal to the parameter being
estimated

Table 5.2-2 includes a brief discussion of common parameter estimation techniques.


Table 5.2-2: Techniques for Parameter Estimation


Technique | Discussion | Process
Maximum In all practical cases, MLE’s converge • Express the joint probability density function of the
Likelihood stochastically to the population value. random variables of interest as a function of the
Estimation If a MLE exists uniquely and a unknown parameters (i.e., the likelihood function)
(MLE) sufficient statistic for the parameter • Where appropriate, take the natural logarithm of the
exists, the MLE is a function of the likelihood function
sufficient statistic. Sometimes the MLE • Differentiate the likelihood (or log likelihood)
is impossible to find in closed form, and function with respect to each parameter
numerical methods must be used • Set all derivatives equal to zero and solve for the
(typical of time-domain software parameters as functions of realizations of the
reliability models). MLE’s are the best random variables
estimators for large sample sizes. • Check second-order conditions

Least Squares Least square estimators may be better • Express the sum of the squared distance between
when small or medium sample sizes are actual and predicted values as a function of
involved, since they may have smaller parameter estimates
bias, or approach normality faster. • Determine the parameter estimators that minimize
Least squares estimation minimizes the the sum of this squared distance (typically using
variance around the estimated differential calculus)
parameter. The technique is familiar to
those comfortable with linear regression
modeling.

Method of This technique works by equating • Determine the distribution whose parameters are to
Moments statistical sample moments calculated be estimated (suppose there are “n” parameters to
from a data set to actual population be estimated)
moments. Population moments are • Find the first “n” moments of the distribution, either
determined by the parameters to be around zero, or around the mean for moments
estimated. As many moments are higher than the first
equated as there are parameters to be • Equate these moments to sample moments
estimated. In most cases of practical • Solve for the parameters as a function of the
interest, these can be found in closed realizations of the random variables in the sample.
form, but their theoretical justification
is not as rigorous as for other parameter
estimation methods.

Bayesian Provides an efficient method for • Assign a non-informative or subjective distribution


incorporating various subjective and to the parameters of the model (the “priors”). The
objective data sources into parameter priors express the uncertainties in the parameter
estimation. It is a much less practical values.
method than MLE, as the analysis is • Combine actual data with the “priors” to obtain new
much more complex and the parameter distributions (the “posteriors”). The
computation is much more complicated. posteriors provide estimates and Bayesian
The validity of the approach is confidence limits for the parameters, producing
dependent on validity of the model and more precise estimates.
prior distributions.


5.2.1. Closed Form Parameter Approximations


Simple equations that approximate parameters have been developed and are summarized
in Table 5.2-3, which provides an overview of the parameter estimates for commonly
used distributions.

Table 5.2-3: Parameters Typically Estimated from Statistical Distributions

Distribution | True Parameter | Estimated Parameter
Poisson | Occurrence Rate, λ | Sample occurrence rate: λ̂ = n/t, where n = number of observed failures and t = period (time, length, volume) over which failures are observed
Binomial | Proportion, p | Sample proportion: p̂ = x/n, where x = number of “successful” trials and n = number of statistically independent sample units
Exponential | Mean, θ | Sample mean: θ̂ = x̄ = (Σ xi)/n, where xi = individual times to failure for each of the observations of sample size “n” and n = number of statistically independent sample observations
Normal | Mean, μ | Sample mean: x̄ = (Σ xi)/n, where xi = individual measurements for each of the observations of sample size “n” and n = number of statistically independent sample observations
Normal | Variance, σ² | Sample variance: s² = Σ(xi − x̄)²/(n − 1); the sample standard deviation, s, equals (s²)^0.5

Reliability Information Analysis Center


189
Chapter 5: Life Data Modeling

Table 5.2-3: Parameters Typically Estimated from Statistical Distributions (continued)


Distribution True Parameter Estimated Parameter
The estimate of the Weibull shape parameter is:
1.283
β̂ =
s
where,
0.5
⎛ n ⎞
⎜ ∑ (x i − x )
2

s = ⎜ i =1 ⎟
⎜ n −1 ⎟
Shape Parameter, β ⎜ ⎟
⎝ ⎠
n
∑xi
Weibull i =1
x=
n
s = sample standard deviation
xi = individual times to failure for each observation of sample size
“n”
n = number of statistically independent sample observations
The estimate of the Weibull scale parameter is:
αˆ = exp ( x + ( 0.5772 )( 0.7797 ) s )
Scale Parameter, α s= sample standard deviation
xi = individual measurements for each observation of sample size
“n”
n = number of statistically independent sample observations

The parameter estimates shown in Table 5.2-3 are simple and easy to use, and often provide adequate estimates. More rigorous techniques are available that estimate the parameters more accurately, but their complexity generally requires the use of software tools, as illustrated below.
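
To make Table 5.2-3 concrete, the Python sketch below implements the Poisson and Weibull entries. It is illustrative only: the failure times are invented, and, per the note in the table, the Weibull approximations are computed from the natural logarithms of the times to failure.

```python
import math

def poisson_rate(n_failures, period):
    """Sample occurrence rate (Poisson row of Table 5.2-3): lambda-hat = n / t."""
    return n_failures / period

def weibull_closed_form(ttf):
    """Closed-form Weibull estimates (Table 5.2-3), computed on ln(time)."""
    n = len(ttf)
    logs = [math.log(t) for t in ttf]
    xbar = sum(logs) / n                                          # sample mean
    s = math.sqrt(sum((x - xbar) ** 2 for x in logs) / (n - 1))   # sample std dev
    beta_hat = 1.283 / s                                          # shape estimate
    alpha_hat = math.exp(xbar + 0.5772 * 0.7797 * s)              # scale estimate
    return beta_hat, alpha_hat

# Hypothetical complete (uncensored) times to failure, in hours
ttf = [105, 180, 240, 320, 410, 500, 610, 750, 930, 1200]
print(poisson_rate(len(ttf), sum(ttf)))   # failures per unit time
print(weibull_closed_form(ttf))           # (beta-hat, alpha-hat)
```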

The most popular techniques used in reliability modeling are least squares regression and
maximum likelihood. These are described in the next sections.

5.2.2. Least Squares Regression


Least squares regression is often used to estimate model parameters in cases when a
function can be linearized. The following steps are required for this approach:

1. Select the distribution type


2. Linearize the distribution
3. Determine the plotting positions of each data point

4. Determine the parameters using a least squares technique

For example, if a two parameter Weibull distribution is used (Step 1), the linear transform
is performed as follows (Step 2):

$$R = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$

Taking the natural log (base e) of both sides, twice, yields:

$$\ln\left(-\ln(R)\right) = \beta\ln(t) - \beta\ln(\alpha)$$

This is now a linear model with ln(t) as the independent variable, ln(−ln(R)) as the dependent variable, β as the slope, and −β ln(α) as the intercept.

Step 3 calculates the plotting position (i.e., the estimated percentage of the population that has failed at each observed failure time) for each data point. A common way to accomplish this is by using Bernard's formula:

$$F = \frac{i - 0.3}{N + 0.4}$$

where:

i= the cumulative number of failures


N= the total sample size

For example, if there are ten items, the value of F after the second failure is:

$$F = \frac{i - 0.3}{N + 0.4} = \frac{2 - 0.3}{10 + 0.4} = 0.163$$

The value of F is calculated for each failure. These pairs of x-y points are the values to
which a linear model will be fit.

The values of the slope and intercept are then:


$$\hat{\beta} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$$

$$\hat{b} = \bar{y} - \hat{\beta}\,\bar{x} \quad\Rightarrow\quad \ln(\hat{\alpha}) = -\frac{\hat{b}}{\hat{\beta}} = \bar{x} - \frac{\bar{y}}{\hat{\beta}}$$

In this case, y = ln(−ln(R)) = ln(−ln(1−F)) and x = ln(t), so the fitted slope is β and the fitted intercept, $\hat{b}$, recovers α.
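
As an illustrative sketch of Steps 1 through 4, the Python below fits a two-parameter Weibull by this plotting-position method: Bernard's formula supplies F for each ordered failure, and an ordinary least-squares line through the points (ln t, ln(−ln(1−F))) supplies β and α. The failure times are hypothetical.

```python
import math

def weibull_lsq(times):
    """Least-squares (median-rank regression) fit of a two-parameter Weibull
    to complete failure data."""
    times = sorted(times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        f = (i - 0.3) / (n + 0.4)                  # Bernard's plotting position
        xs.append(math.log(t))                     # x = ln(t)
        ys.append(math.log(-math.log(1.0 - f)))    # y = ln(-ln(1 - F))
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))    # slope estimates beta
    intercept = ybar - beta * xbar                 # equals -beta * ln(alpha)
    alpha = math.exp(-intercept / beta)
    return beta, alpha

# Hypothetical complete failure times, in hours
print(weibull_lsq([105, 180, 240, 320, 410, 500, 610, 750, 930, 1200]))
```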

5.2.3. Parameter Estimation Using MLE


This section addresses the use of Maximum Likelihood Estimation (MLE) techniques for
estimating TTF distribution parameters, such as the parameter “λ” of the exponential pdf,
or “μ” and “σ” of the Normal and lognormal pdf. The objective is to find a point
estimate, as well as a confidence interval, for the parameters of these distributions based
on the data available from test or field observation. Quantification of confidence
intervals is very important in the estimation process because there is almost always a
limited amount of data (e.g., on TTFs) and, thus, we cannot state our point estimate with certainty. Therefore, the confidence interval is a statement about the range within
which the actual (“true”) value of the parameter resides. This interval is greatly
influenced by the amount of data available. Of course, other factors such as diversity and
accuracy of the data sources and adequacy of the selected model can also influence the
state of our uncertainty regarding the estimated parameters. When discussing goodness-
of-fit tests, we are trying to address the uncertainty due to the choice of the probability
model form by using the concept of levels of significance. However, uncertainty due to
diversity and accuracy of the data sources is a more difficult issue to deal with.

Times-to-failure data are seldom complete. A complete sample is one in which all items
observed have failed during a given observation period, and all the failure times are
known. When “n” items are placed on test or observed in the field, whether with
replacement or not, it is sometimes necessary (due to the long life of certain components)
to terminate the test and perform the reliability analysis based on the observed data up to
the time of termination.

There are two basic types of possible life observation termination. The first type is time
terminated (which results in Type I right censored data), and the second is failure
terminated (resulting in Type II right-censored data). In the time-terminated life
observation, “n” units are monitored and the observation is terminated after a
predetermined time has elapsed. The number of items that failed during the observation
time, and the corresponding TTF of each component, are recorded. In the failure-
terminated life observations, “n” units are monitored and the observation is terminated
when a predetermined number of component failures have occurred. The time to failure of each failed item, including the time at which the last failure occurred, is recorded.

The MLE method is one of the most widely used methods for estimating reliability model
parameters. In the first part of this section, a brief historical review of the MLE method
is presented. The likelihood function concept for different types of failure data, as well
as the mathematical approach to solve likelihood equations, is presented next. The last
part of this section reviews the basic equations of the MLE approach for specific case
studies, including exponential, Weibull and lognormal distribution likelihood functions.

5.2.3.1. Brief Historical Remarks


Regression techniques have significant shortcomings when applied to reliability modeling. In particular, they are weak at analyzing interval or censored data.
The Maximum Likelihood Estimation method was originally introduced by Fisher
(Reference 5).

Fisher used the conditional probability of occurrence for each failure event as a measure
for his mathematical curve fitting. He argued that, using a subjective assumption about
the TTF model, one can characterize the probability of each failure event, conditioned on the model parameters. He then derived the posterior probability of failure events in a
Bayesian framework using a uniform distribution as a prior for the model parameters. He
later calculated the best estimate for model parameters by maximizing the posterior.
Note that, in a Bayesian framework, a uniform distribution cancels out from the equation
since it is a constant. The normalizing factor in the denominator is also a constant, which
has no impact when one is interested in the extremes of the function. Therefore, this
method was eventually called the maximum likelihood estimator, because it is basically
the likelihood function that is maximized in this process.

5.2.3.2. Likelihood Function


Fisher (Reference 5) based his maximum likelihood measure on an implied Bayesian uniform prior for the parameters, and described the method as leading to "the most probable set of values" for the parameters (Reference 9). He suggested that the ratio of
the likelihood function and its maximum may be used to find confidence intervals for the
model parameters, and derived it for the case of Normal sampling curves.

Let “[f (t) × dt]” be the chance of a failure observation falling within the range “dt”.
Fisher introduced the method of maximum likelihood by arguing that the factor "dt" is independent of the theoretical curve, so that the probability is proportional to "f(t)".


Therefore, the likelihood of "N" independent TTF observations will be proportional to the product of the probability density function evaluated at the TTFs, as shown below:

$$L_F(t_1, t_2, \ldots, t_N \mid \theta) \propto \prod_{i=1}^{N} f(t_i \mid \theta)$$

$$L_R(T_1, T_2, \ldots, T_M \mid \theta) \propto \prod_{i=1}^{M} R(T_i \mid \theta)$$

$$L_L(T_1, T_2, \ldots, T_K \mid \theta) \propto \prod_{i=1}^{K} F(T_i \mid \theta)$$

$$L_I[(T_{a1}, T_{b1}), \ldots, (T_{aL}, T_{bL}) \mid \theta] \propto \prod_{i=1}^{L} \left[F(T_{bi} \mid \theta) - F(T_{ai} \mid \theta)\right]$$

where:

θ = the vector of model parameters, (θ1, θ2, …, θn)
N = the number of complete failure observations
M = the number of right-censored observations
K, L = the number of left-censored and interval data observations
Ti = censored observations
ti = complete failure observations
f(t) = probability density function
R(t) = reliability function
F(t) = cumulative distribution function
Tai = the lower bound of the time interval
Tbi = the upper bound of the time interval
LF = the likelihood function for complete failure data
LR = the likelihood function for right-censored data
LL = the likelihood function for left-censored data
LI = the likelihood function for interval data

Using the notion of the conditional probability density function, f(t|θ), helps to integrate
many different types of failure data into the likelihood function. For example, the
likelihood of the right-censored observations will be the reliability function, because this
is the probability that the component remains reliable up to the censored time. Therefore,
the likelihood of “M” independent right-censored observations will be the product of the
reliability functions as illustrated in the second equation above. For left-censored times
(that is, the time before which a failure has occurred) the likelihood is also the definition
of probability of failure at that time. In the case of many left-censored times, the total
likelihood will be the product of the likelihood values of the individual components, using the independence assumption, as shown in the third equation above. For interval times, the likelihood is the probability of having one failure in that interval (which is basically the integral of the probability density function between the upper and lower bounds of the interval). This is simply the difference between the cumulative distribution function evaluated at the upper and lower bounds, respectively, as shown in the final equation above. Assuming the independence of failure and censored time events, these likelihood functions can be multiplied with the likelihood of the complete failure data in order to build the likelihood function for the entire population.
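
The sketch below shows how these contributions combine for a Weibull TTF model, using scipy's logpdf, logsf (the log of the reliability function) and cdf. It is a minimal illustration under the independence assumption; the (β, α) parameterization and all data arrays are hypothetical.

```python
import numpy as np
from scipy.stats import weibull_min

def total_log_likelihood(beta, alpha, complete, right_censored, intervals):
    """Sum of the log-likelihood terms for complete, right-censored and
    interval observations (left-censored data would use logcdf similarly)."""
    ll = np.sum(weibull_min.logpdf(complete, beta, scale=alpha))        # ln f(t_i)
    ll += np.sum(weibull_min.logsf(right_censored, beta, scale=alpha))  # ln R(T_i)
    lo, hi = np.transpose(intervals)
    ll += np.sum(np.log(weibull_min.cdf(hi, beta, scale=alpha)
                        - weibull_min.cdf(lo, beta, scale=alpha)))      # ln[F(Tb)-F(Ta)]
    return ll

print(total_log_likelihood(1.5, 1000.0,
                           complete=np.array([200.0, 450.0, 800.0]),
                           right_censored=np.array([1000.0, 1000.0]),
                           intervals=np.array([[300.0, 600.0]])))
```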

5.2.3.3. Maximum Likelihood Estimator (MLE)


The likelihood function is used differently in the Bayesian and MLE frameworks. In the
Bayesian method, the prior knowledge that is available for the model parameters is
updated using this function as the conditional likelihood of data. In the MLE approach,
the most probable set of values of the parameter vector, θ, is estimated by maximizing this likelihood as a standalone function.

The practical way to find the modes of the likelihood function is differentiation. A multivariable function attains its maximum at a point where the first-order partial derivative with respect to each variable becomes zero, as shown below:

$$\Lambda = \ln(L) \;\Rightarrow\; \frac{\partial\Lambda}{\partial\theta_1} = 0,\;\; \frac{\partial\Lambda}{\partial\theta_2} = 0,\;\; \ldots,\;\; \frac{\partial\Lambda}{\partial\theta_n} = 0 \;\Rightarrow\; \hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_n)$$
where:

Λ= the log likelihood function


θˆ = the best estimate parameter vector

Note that the likelihood function, as explained before, is a product of many terms. This makes direct differentiation very cumbersome. The likelihood is always positive, so
one may take the natural logarithm of this function to convert these multiplication
operators to summation. This will significantly reduce the mathematical derivation
complexity, while still providing the same best estimates for the mode of the likelihood
function. Constructing the likelihood function, L, as explained before, one may set up the
equations that need to be solved for the modes of this function.

In the following sections, three examples of the likelihood function, representing the
exponential, Weibull and lognormal distributions, are presented for further clarification.

5.2.3.3.1. Exponential Distribution


If failures are expected to randomly occur at a constant rate in time, the TTF distribution
follows an exponential distribution. The exponential distribution assumes a constant
hazard rate for the item. This constant hazard rate is the only parameter of the
exponential distribution. The likelihood of complete failure and right-censored data, as
explained in previous sections, can be represented based on the probability density and
the cumulative distribution functions of the exponential distribution. The equation below
shows the log-likelihood function in the case of "F" complete (i.e., failed) and "S" right-censored (i.e., survived or suspended) observations.

$$L = \sum_{i=1}^{F} N_i \ln\left(\lambda e^{-\lambda t_i}\right) - \sum_{j=1}^{S} N_j\,\lambda T_j$$

The only variable in this equation is λ. In the MLE method, the best estimate of λ is found by maximizing the likelihood (or log-likelihood) function; the next equation shows the resulting condition. The uncertainty of the estimate can be expressed as confidence bounds on λ, calculated using the corresponding local Fisher information matrix. This step is explained in detail later.

$$\frac{\partial L}{\partial \lambda} = \sum_{i=1}^{F} N_i\left(\frac{1}{\lambda} - t_i\right) - \sum_{j=1}^{S} N_j T_j = 0
\quad\Rightarrow\quad
\hat{\lambda} = \frac{\sum_{i=1}^{F} N_i}{\sum_{i=1}^{F} N_i t_i + \sum_{j=1}^{S} N_j T_j}$$

That is, the exponential MLE is simply the total number of failures divided by the total accumulated operating time.
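
For the complete-plus-suspended case above, this closed form can be evaluated directly; in the sketch below the unit counts Ni and Nj are all 1 and the times are hypothetical:

```python
# Five failures and three units suspended (still running) at 500 hours
failures = [105.0, 180.0, 240.0, 320.0, 410.0]
suspensions = [500.0, 500.0, 500.0]
lam_hat = len(failures) / (sum(failures) + sum(suspensions))
print(f"lambda-hat = {lam_hat:.5f} per hour, MTBF = {1.0 / lam_hat:.0f} hours")
```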

5.2.3.3.2. Weibull Distribution


The Weibull distribution can be used for non-repairable hardware units exhibiting
increasing, decreasing, or constant hazard rate functions. Similar to the lognormal
distribution, it is a two-parameter distribution and its estimation, even in the case of
complete (uncensored) data, is not a trivial problem. It can be shown that, with "F" complete failure observations and "S" right-censored (suspended) observations, the Weibull log-likelihood is given by the equation below:

$$L = \sum_{i=1}^{F} N_i\ln\left[\frac{\beta}{\alpha}\left(\frac{t_i}{\alpha}\right)^{\beta-1} e^{-\left(\frac{t_i}{\alpha}\right)^{\beta}}\right] - \sum_{j=1}^{S} N_j\left(\frac{T_j}{\alpha}\right)^{\beta}$$

The best estimates of the parameters "α" and "β" are obtained from the first derivatives of the log-likelihood function, as shown in the equations below. The best estimates are the unique solution of this set of two equations in two unknowns:

$$\frac{\partial L}{\partial \beta} = \frac{1}{\beta}\sum_{i=1}^{F} N_i + \sum_{i=1}^{F} N_i\ln\left(\frac{t_i}{\alpha}\right) - \sum_{i=1}^{F} N_i\left(\frac{t_i}{\alpha}\right)^{\beta}\ln\left(\frac{t_i}{\alpha}\right) - \sum_{j=1}^{S} N_j\left(\frac{T_j}{\alpha}\right)^{\beta}\ln\left(\frac{T_j}{\alpha}\right) = 0$$

$$\frac{\partial L}{\partial \alpha} = -\frac{\beta}{\alpha}\sum_{i=1}^{F} N_i + \frac{\beta}{\alpha}\sum_{i=1}^{F} N_i\left(\frac{t_i}{\alpha}\right)^{\beta} + \frac{\beta}{\alpha}\sum_{j=1}^{S} N_j\left(\frac{T_j}{\alpha}\right)^{\beta} = 0$$

Note that, despite the complexity of the mathematical representations of the likelihood and log-likelihood functions and their derivatives, the basic concept is fairly simple. In advanced numerical approaches, the entire derivation and solution is carried out on a computer using predefined toolboxes and library functions, as sketched below.
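
As a concrete sketch of such a numerical approach, the Python below maximizes the Weibull log-likelihood for a mix of complete and right-censored observations with a general-purpose optimizer. The data, the starting values and the log-parameterization (used to keep β and α positive) are illustrative choices, not part of the derivation above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

failures = np.array([105.0, 180.0, 240.0, 320.0, 410.0])   # complete data
suspensions = np.array([500.0, 500.0, 500.0])               # right-censored

def neg_log_lik(log_theta):
    """Negative Weibull log-likelihood; parameters enter as logs."""
    beta, alpha = np.exp(log_theta)
    return -(np.sum(weibull_min.logpdf(failures, beta, scale=alpha))
             + np.sum(weibull_min.logsf(suspensions, beta, scale=alpha)))

res = minimize(neg_log_lik, x0=np.log([1.5, 400.0]), method="Nelder-Mead")
beta_hat, alpha_hat = np.exp(res.x)
print(f"beta-hat = {beta_hat:.2f}, alpha-hat = {alpha_hat:.0f} hours")
```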
5.2.3.3.3. Lognormal Distribution
In the case of estimating the parameters of the lognormal distribution, the only difference
is in the construction of the likelihood function for which the pdf and CDF of the
distribution are used for complete and suspended failure data, respectively. The equation
below shows the log-likelihood function for a combination of complete failure and
suspended (right-censored) data.

$$L = \sum_{i=1}^{F} N_i\ln\left[\frac{1}{\sigma t_i}\,\varphi\!\left(\frac{\ln(t_i)-\mu}{\sigma}\right)\right] + \sum_{j=1}^{S} N_j\ln\left[1 - \Phi\!\left(\frac{\ln(T_j)-\mu}{\sigma}\right)\right]$$

Having the log-likelihood function of the failure data, the MLE approach can be executed using the first-derivative approach, as explained in previous sections. The first derivatives of the log-likelihood function with respect to the mean and standard deviation are illustrated in the following two equations:

$$\frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{F} N_i\left(\ln(t_i) - \mu\right) + \frac{1}{\sigma}\sum_{j=1}^{S} N_j\,\frac{\varphi\!\left(\frac{\ln(T_j)-\mu}{\sigma}\right)}{1 - \Phi\!\left(\frac{\ln(T_j)-\mu}{\sigma}\right)} = 0$$

$$\frac{\partial L}{\partial \sigma} = \sum_{i=1}^{F} N_i\left(\frac{\left(\ln(t_i)-\mu\right)^2}{\sigma^3} - \frac{1}{\sigma}\right) + \frac{1}{\sigma}\sum_{j=1}^{S} N_j\,\frac{\left(\frac{\ln(T_j)-\mu}{\sigma}\right)\varphi\!\left(\frac{\ln(T_j)-\mu}{\sigma}\right)}{1 - \Phi\!\left(\frac{\ln(T_j)-\mu}{\sigma}\right)} = 0$$
where:
$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2} \qquad\qquad \Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-\frac{1}{2}u^2}\,du$$

The capital Φ in the above equations is the cumulative standard Normal distribution, defined as the integral of the standard Normal pdf, φ. The derivative of the CDF is always the pdf, since differentiation undoes the integration.

5.2.4. Confidence Bounds and Uncertainty


Since point estimates are constructed from data that exhibits random variation, these
estimates will not be exactly equal to the unknown population parameters. Confidence
bounds provide a convention for making statements about the random variation in the
estimates of parameters.

5.2.4.1. Confidence Bounds with MLE


The variance and covariance of the parameters, calculated using MLE equations, can be
found using the local Fisher information matrix. The derivation assumes an approximately Normal distribution for the parameter estimates. Using the following local information
matrix, one can relate the likelihood function to the variance and covariance of the model
parameters. The next equation represents these uncertainties for a general case for which
there are “n” parameters in the likelihood function.

$$\begin{bmatrix} \mathrm{Var}(\hat{\theta}_1) & \mathrm{Cov}(\hat{\theta}_1,\hat{\theta}_2) & \cdots & \mathrm{Cov}(\hat{\theta}_1,\hat{\theta}_n) \\ \mathrm{Cov}(\hat{\theta}_2,\hat{\theta}_1) & \mathrm{Var}(\hat{\theta}_2) & \cdots & \mathrm{Cov}(\hat{\theta}_2,\hat{\theta}_n) \\ \vdots & \vdots & \ddots & \vdots \\ \cdots & \cdots & \mathrm{Cov}(\hat{\theta}_n,\hat{\theta}_{n-1}) & \mathrm{Var}(\hat{\theta}_n) \end{bmatrix} = [F]^{-1}$$

where:

Var = variance of the parameter of interest


Cov = covariance of the two parameters
Λ= the log likelihood function
F= the local Fisher information matrix as defined below:

$$F = \begin{bmatrix} -\dfrac{\partial^2\Lambda}{\partial\theta_1^2} & -\dfrac{\partial^2\Lambda}{\partial\theta_1\,\partial\theta_2} & \cdots & -\dfrac{\partial^2\Lambda}{\partial\theta_1\,\partial\theta_n} \\ -\dfrac{\partial^2\Lambda}{\partial\theta_2\,\partial\theta_1} & -\dfrac{\partial^2\Lambda}{\partial\theta_2^2} & \cdots & -\dfrac{\partial^2\Lambda}{\partial\theta_2\,\partial\theta_n} \\ \vdots & \vdots & \ddots & \vdots \\ \cdots & \cdots & -\dfrac{\partial^2\Lambda}{\partial\theta_n\,\partial\theta_{n-1}} & -\dfrac{\partial^2\Lambda}{\partial\theta_n^2} \end{bmatrix}$$

Having the variance and the best estimate of each parameter, one may compute the uncertainty bounds for any given confidence level. Note that the important underlying assumptions here are independence of the observations and approximate Normality of all of the parameter estimates.
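
In practice, the local Fisher information matrix is often built numerically from finite differences of the log-likelihood at the MLE and then inverted to obtain the variance-covariance matrix. The sketch below assumes a neg_log_lik function such as the one in the earlier Weibull example (so that F is the Hessian of the negative log-likelihood); the step size and the 90% z-value are illustrative choices.

```python
import numpy as np

def hessian(f, x, eps=1e-4):
    """Central-difference Hessian of a scalar function f at the point x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x1 = x.copy(); x1[i] += eps; x1[j] += eps
            x2 = x.copy(); x2[i] += eps; x2[j] -= eps
            x3 = x.copy(); x3[i] -= eps; x3[j] += eps
            x4 = x.copy(); x4[i] -= eps; x4[j] -= eps
            H[i, j] = (f(x1) - f(x2) - f(x3) + f(x4)) / (4.0 * eps ** 2)
    return H

# F = Hessian of the negative log-likelihood at the MLE theta_hat;
# its inverse approximates the variance-covariance matrix:
#   cov = np.linalg.inv(hessian(neg_log_lik, theta_hat))
# Two-sided ~90% normal-approximation bounds on parameter k:
#   theta_hat[k] +/- 1.645 * np.sqrt(cov[k, k])
```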

5.2.4.2. Confidence Bounds Approximations


Tables 5.2-4 through 5.2-11 present a summary of equations for calculating the
confidence bounds around the parameters for various distributions.


Table 5.2-4: Confidence Bounds for the Poisson Distribution

True Occurrence Rate, λ

Given: the estimate of the true occurrence rate, λ, is the sample occurrence rate $\hat{\lambda} = n/t$, where n = number of observed failures and t = period (time, length, volume) over which failures are observed.

Poisson limits (approximate only; exact confidence levels cannot be conveniently obtained for discrete distributions):
One-sided: $\lambda_L = 0.5\,\chi^2[1-\gamma;\;2n]\,/\,t$ and $\lambda_U = 0.5\,\chi^2[\gamma;\;2n+2]\,/\,t$
Two-sided: $\lambda_L = 0.5\,\chi^2[(1-\gamma)/2;\;2n]\,/\,t$ and $\lambda_U = 0.5\,\chi^2[(1+\gamma)/2;\;2n+2]\,/\,t$

Normal approximation (when "n" is large, say > 10):
One-sided: $\lambda_{L,U} \cong \hat{\lambda} \mp z_{\gamma}\,(\hat{\lambda}/t)^{0.5}$
Two-sided: $\lambda_{L,U} \cong \hat{\lambda} \mp z_{(1+\gamma)/2}\,(\hat{\lambda}/t)^{0.5}$

Future Occurrence Rate, y

Given: with the observed rate of occurrence above, the prediction for the future number of occurrences is $\hat{y} = \hat{\lambda}s = (n/t)s$, where s = period (time, length, volume) over which the future observation is predicted.

Poisson limits (approximate only): take the closest integer solutions for $y_L$ and $y_U$ from
One-sided: $\dfrac{(n+1)}{y_U}\,F[\gamma;\;2n+2;\;2y_U] = \dfrac{s}{t}$ and $\dfrac{s}{t} = \dfrac{(y_L+1)}{n}\,F[\gamma;\;2y_L+2;\;2n]$
Two-sided: the same equations with $F[(1+\gamma)/2;\;\cdot\;;\;\cdot\;]$ in place of $F[\gamma;\;\cdot\;;\;\cdot\;]$

Normal approximation (when "n" and "y" are large, e.g., each > 10):
One-sided: $y_{L,U} \cong \hat{y} \mp z_{\gamma}\left(\hat{\lambda}\,s\,(t+s)/t\right)^{0.5}$
Two-sided: $y_{L,U} \cong \hat{y} \mp z_{(1+\gamma)/2}\left(\hat{\lambda}\,s\,(t+s)/t\right)^{0.5}$


Table 5.2-5: Confidence Bounds for the Binomial Distribution

True Proportion, p

Given: the estimate of the true population proportion, p, is the sample proportion $\hat{p} = x/n$, where x = number of "successful" trials and n = number of statistically independent sample units.

Binomial limits (approximate only; exact confidence levels cannot be conveniently obtained for discrete distributions):
One-sided: $p_L = \dfrac{1}{1 + \dfrac{(n-x+1)}{x}\,F[\gamma;\;2n-2x+2;\;2x]}$ and $p_U = \dfrac{1}{1 + \dfrac{(n-x)}{(x+1)\,F[\gamma;\;2x+2;\;2n-2x]}}$
Two-sided: the same expressions with $F[(1+\gamma)/2;\;\cdot\;;\;\cdot\;]$

Normal approximation (when "x" and "n−x" are large, e.g., each > 10):
One-sided: $p_{L,U} \cong \hat{p} \mp z_{\gamma}\,(\hat{p}(1-\hat{p})/n)^{0.5}$; Two-sided: use $z_{(1+\gamma)/2}$

Poisson approximation (when "n" is large and "x" is small, e.g., x < n/10):
One-sided: $p_L \cong 0.5\,\chi^2[1-\gamma;\;2x]\,/\,n$ and $p_U \cong 0.5\,\chi^2[\gamma;\;2x+2]\,/\,n$
Two-sided: $p_L \cong 0.5\,\chi^2[(1-\gamma)/2;\;2x]\,/\,n$ and $p_U \cong 0.5\,\chi^2[(1+\gamma)/2;\;2x+2]\,/\,n$

Prediction of Future Number of "Successes", y

Given: with the observed proportion above, the prediction for the number of future "success" category units is $\hat{y} = m\hat{p} = m(x/n)$, where m = future sample size.

Normal approximation (when "x", "n−x", "y" and "m−y" are all large, say > 10):
One-sided: $y_{L,U} \cong \hat{y} \mp z_{\gamma}\left[m\,\hat{p}(1-\hat{p})(m+n)/n\right]^{0.5}$; Two-sided: use $z_{(1+\gamma)/2}$

Poisson approximation (when "n" is large and "x" is small, e.g., x < n/10): take the closest integer solutions for $y_L$ and $y_U$ from
One-sided: $\dfrac{(x+1)}{y_U}\,F[\gamma;\;2x+2;\;2y_U] = \dfrac{m}{n}$ and $\dfrac{m}{n} = \dfrac{(y_L+1)}{x}\,F[\gamma;\;2y_L+2;\;2x]$
Two-sided: the same equations with $F[(1+\gamma)/2;\;\cdot\;;\;\cdot\;]$


Table 5.2-6: Confidence Bounds for the Exponential Distribution

True Value of the Mean, θ

Given: the estimate of the true population mean, θ, is the sample mean $\hat{\theta} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$, where xi = individual times to failure for each of the observations of sample size "n", and n = number of statistically independent sample observations.

Exponential limits (exact) for failure-truncated tests:
One-sided: $\theta_L = \dfrac{2n\bar{x}}{\chi^2[\gamma;\;2n]}$ and $\theta_U = \dfrac{2n\bar{x}}{\chi^2[1-\gamma;\;2n]}$
Two-sided: $\theta_L = \dfrac{2n\bar{x}}{\chi^2[(1+\gamma)/2;\;2n]}$ and $\theta_U = \dfrac{2n\bar{x}}{\chi^2[(1-\gamma)/2;\;2n]}$

Exponential limits (exact) for time-truncated tests:
One-sided: $\theta_L = \dfrac{2n\bar{x}}{\chi^2[\gamma;\;2(n+1)]}$ and $\theta_U = \dfrac{2n\bar{x}}{\chi^2[1-\gamma;\;2(n+1)]}$
Two-sided: $\theta_L = \dfrac{2n\bar{x}}{\chi^2[(1+\gamma)/2;\;2(n+1)]}$ and $\theta_U = \dfrac{2n\bar{x}}{\chi^2[(1-\gamma)/2;\;2(n+1)]}$

Normal approximation for failure-truncated tests (when "n" is large, say > 15):
One-sided: $\theta_L \cong \bar{x}\,/\exp\!\left(z_{\gamma}/\sqrt{n}\right)$ and $\theta_U \cong \bar{x}\cdot\exp\!\left(z_{\gamma}/\sqrt{n}\right)$
Two-sided: use $z_{(1+\gamma)/2}$ in place of $z_{\gamma}$

True Value of the Failure Rate, λ

Given: the estimate of the true population failure rate, λ, is the sample failure rate $\hat{\lambda} = 1/\hat{\theta} = n/\sum_{i=1}^{n}x_i$.

Exponential limits (exact) for failure-truncated tests:
One-sided: $\lambda_L = \dfrac{1}{\theta_U} = \dfrac{\chi^2[1-\gamma;\;2n]}{2n\bar{x}}$ and $\lambda_U = \dfrac{1}{\theta_L} = \dfrac{\chi^2[\gamma;\;2n]}{2n\bar{x}}$
Two-sided: $\lambda_L = \dfrac{\chi^2[(1-\gamma)/2;\;2n]}{2n\bar{x}}$ and $\lambda_U = \dfrac{\chi^2[(1+\gamma)/2;\;2n]}{2n\bar{x}}$


Table 5.2-7: Confidence Bounds for the Exponential Distribution (continued)

True Value of the 100·pth Percentile, yp

Given: the usual estimate of the 100·pth percentile is $y_p = -\bar{x}\,\ln(1-p)$, where p = probability at the 100·pth percentile.

One-sided: $y_{p,L} = -\theta_L\ln(1-p) = \dfrac{-2n\bar{x}\,\ln(1-p)}{\chi^2[\gamma;\;2n]}$ and $y_{p,U} = -\theta_U\ln(1-p) = \dfrac{-2n\bar{x}\,\ln(1-p)}{\chi^2[1-\gamma;\;2n]}$
Two-sided: use the $(1+\gamma)/2$ and $(1-\gamma)/2$ percentiles of $\chi^2$, respectively.

True Value of Reliability at End of Period, R(t)

Given: the usual estimate of the reliability at any age t is $R^*(t) = e^{-(t/\bar{x})}$, where R = reliability as a function of time, distance, etc., and t = period at which reliability is assessed (time, distance, etc.).

One-sided: $R_L(t) = e^{-(t/\theta_L)} = \exp\!\left(-t\,\chi^2[\gamma;\;2n]\,/\,2n\bar{x}\right)$ and $R_U(t) = e^{-(t/\theta_U)} = \exp\!\left(-t\,\chi^2[1-\gamma;\;2n]\,/\,2n\bar{x}\right)$
Two-sided: use the $(1+\gamma)/2$ and $(1-\gamma)/2$ percentiles of $\chi^2$, respectively.

Table 5.2-8: Confidence Bounds for the Normal Distribution

True Value of the Mean, μ

Given: the estimate of the true population mean, μ, is the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$, where xi = individual measurements for each of the observations of sample size "n", and n = number of statistically independent sample observations.

Normal limits (exact; they also serve as approximate intervals for the mean of a distribution that is not Normal):
One-sided: $\mu_L = \bar{x} - t[\gamma;\;n-1]\,(s/\sqrt{n})$ and $\mu_U = \bar{x} + t[\gamma;\;n-1]\,(s/\sqrt{n})$
Two-sided: $\mu_{L,U} = \bar{x} \mp t[(1+\gamma)/2;\;n-1]\,(s/\sqrt{n})$


Table 5.2-9: Confidence Bounds for the Normal Distribution (continued)

True Value of the Variance, σ2

Given: the estimate of the true population variance, σ2, is the sample variance $s^2 = \dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$ (the sample standard deviation, s, equals $(s^2)^{0.5}$), where xi = individual measurements for each of the observations of sample size "n", and n = number of statistically independent sample observations.

Normal limits (exact):
One-sided: $\sigma_L = s\left\{\dfrac{n-1}{\chi^2[\gamma;\;n-1]}\right\}^{0.5}$ and $\sigma_U = s\left\{\dfrac{n-1}{\chi^2[1-\gamma;\;n-1]}\right\}^{0.5}$
Two-sided: $\sigma_L = s\left\{\dfrac{n-1}{\chi^2[(1+\gamma)/2;\;n-1]}\right\}^{0.5}$ and $\sigma_U = s\left\{\dfrac{n-1}{\chi^2[(1-\gamma)/2;\;n-1]}\right\}^{0.5}$

True Value of Reliability at End of Period, R(t)

Given: the estimate of the reliability at any age t is $R^*(t) = 1 - \Phi(z)$, where $z = (t-\bar{x})/s$, t = period at which reliability is assessed, and $\Phi(z)$ estimates the fraction of the population failing by age t.

$R_L(t) = 1 - F_U(t) = 1 - \Phi(z_U)$ and $R_U(t) = 1 - F_L(t) = 1 - \Phi(z_L)$, where

$$z_{U,L} \cong z \pm \frac{z_{\gamma}}{\sqrt{n}}\left(1 + \frac{z^2\,(n/2)}{n-1}\right)^{0.5}$$

(one-sided; for two-sided limits, use $z_{(1+\gamma)/2}$ in place of $z_{\gamma}$)


Table 5.2-10: Confidence Bounds for the Weibull Distribution

True Value of the Weibull Shape Parameter, β

Given: the estimate of the Weibull shape parameter is $\hat{\beta} = 1.283/s$, where s and $\bar{x}$ are the sample standard deviation and sample mean of the natural logarithms of the times to failure (see Table 5.2-3), and n = number of statistically independent sample observations.

Weibull limits (approximate; the limits are crude unless "n" is quite large, say n > 100):
One-sided: $\beta_L \cong \dfrac{1}{0.7797\,s\,\exp\!\left(1.049\,z_{\gamma}/\sqrt{n}\right)}$ and $\beta_U \cong \dfrac{\exp\!\left(1.049\,z_{\gamma}/\sqrt{n}\right)}{0.7797\,s}$
Two-sided: use $z_{(1+\gamma)/2}$ in place of $z_{\gamma}$
(Note that $1/0.7797 \approx 1.283$, so these bounds bracket $\hat{\beta}$.)

True Value of the Weibull Scale Parameter, α

Given: the estimate of the Weibull scale parameter is $\hat{\alpha} = \exp\left(\bar{x} + (0.5772)(0.7797)\,s\right)$; since $(0.5772)(0.7797) \approx 0.45$, the quantity $(\bar{x} + 0.45s)$ appears below.

Weibull limits (approximate; crude unless "n" is quite large, say n > 100):
One-sided: $\alpha_{L,U} \cong \exp\left((\bar{x}+0.45s) \mp z_{\gamma}\,\dfrac{(1.081)(0.7797)\,s}{\sqrt{n}}\right)$
Two-sided: use $z_{(1+\gamma)/2}$ in place of $z_{\gamma}$


Table 5.2-11: Confidence Bounds for the Weibull Distribution (continued)

True Value of Reliability at End of Period, R(t)

Given: the estimate of the reliability at any age t is $R^*(t) = e^{-(t/\alpha)^{\beta}}$, where R = reliability as a function of time, distance, etc., t = period at which reliability is assessed, α = Weibull scale parameter, and β = Weibull shape parameter.

The limits are crude unless "n" is quite large (say n > 100). Let $w = \dfrac{t - (\bar{x}+0.45s)}{0.7797\,s}$. The one-sided approximate Weibull limits are:

$$R_L(t) = \exp\left[-\exp\left(w + z_{\gamma}\left\{\frac{1.168 + (1.1)\,w - (0.1913)\,w^2}{n}\right\}^{0.5}\right)\right]$$

$$R_U(t) = \exp\left[-\exp\left(w - z_{\gamma}\left\{\frac{1.168 + (1.1)\,w - (0.1913)\,w^2}{n}\right\}^{0.5}\right)\right]$$

For the two-sided approximate Weibull limits, use $z_{(1+\gamma)/2}$ in place of $z_{\gamma}$.

5.3. Acceleration Models


Acceleration models are needed to determine how the TTF distribution behaves as a
function of the accelerant. The accelerant can be a stress (such as temperature, voltage,
pressure, etc.), or it can be an “indicator” variable (such as a product feature or design
attribute). These are also sometimes called “categorical” variables. One of the most
common ways to quantify acceleration factors is to perform tests at various stress levels
(and, in the case of indicator variables, for various product features or design attributes).

Accelerated testing is often used for this purpose, in which case tests are performed at
stress levels higher than the item will experience in use, to speed up failure processes.
Acceleration models consist of two generic types:

Physical Acceleration Models: For well-understood failure mechanisms, one may have a model based on physical/chemical theory that describes the failure-causing process over the range of the data and provides extrapolation to use conditions.

Empirical Acceleration Models: Empirical acceleration models are used when there is little understanding of the chemical or physical processes leading to failure, and a model can be empirically determined to describe the observed data.

5.3.1. Fundamental Acceleration Models


In practice, the acceleration models used are a combination of physical and empirical, in
that theory may be used to determine the appropriate form of the acceleration model, but
the specific model constants are almost always determined empirically.

There are four basic forms of accelerated life models. Combinations of these are also
possible:

The linear model is:

$$y = ax + b$$

The exponential model is:

$$y = be^{ax}$$

The power law model is:

$$y = bx^{a}$$

The logarithmic model is:

$$y = a\ln(x) + b$$

In all of these equations, “y” is the dependent variable, usually either lifetime (as
measured by characteristic life or mean life, depending on the TTF distribution used), or
failure rate. Since the failure rate is the reciprocal of the mean life (in the case of the
exponential distribution), the constant “a” will generally be positive in one case and
negative in the other.

The most commonly used reliability models are the power law and exponential models.
Several points regarding acceleration models are:

• There is no “correct” acceleration factor to use for a specific application


• Several different acceleration factor model forms are often equally applicable
• The “best” acceleration factor model is often the one that best fits the empirical
data

5.3.1.1. Examples
Several commonly used acceleration models are summarized in this section.

Arrhenius
The Arrhenius relationship is a widely used model describing the effect that temperature
has on the rate of a simple chemical reaction:

$$L \propto A\,e^{\left[\frac{E_a}{KT}\right]}$$

where:

L = the lifetime
A = a life constant
Ea = the activation energy in eV
K = Boltzmann's constant = 8.617 x 10-5 eV/°K
T = the absolute temperature in Kelvin

It can be seen that this is the exponential model, with the reciprocal of temperature used
as the stress. The Arrhenius model is the most widely used for evaluating the effect of
temperature on reliability. It is applicable to situations in which the failure mechanism is
a function of the steady state temperature, such as corrosion, diffusion, etc. Notable
observations about the Arrhenius acceleration model are that:

• It was derived over a century ago to model chemical reaction rates
• Over the last few decades, it has been applied to electronics reliability modeling, since it often empirically fit the data reasonably well

100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC


208
Chapter 5: Life Data Modeling

• In the formative years of the electronics industry, many failure mechanisms were
related to corrosion and contamination, which are inherently chemical reaction
rates for which the Arrhenius factor applies reasonably well
• It has since been applied to many other failure mechanisms, with an assumed
applicability
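
A minimal sketch of the Arrhenius acceleration factor between a use and a test condition follows; the activation energy and temperatures are illustrative values, and temperatures must be converted to absolute units.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant, eV/K

def arrhenius_af(ea_ev, t_use_c, t_test_c):
    """AF = exp[(Ea/K) * (1/T_use - 1/T_test)], temperatures in Kelvin."""
    t_use = t_use_c + 273.15
    t_test = t_test_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_test))

# e.g., an assumed Ea of 0.7 eV, 55 C use versus 125 C test
print(f"AF = {arrhenius_af(0.7, 55.0, 125.0):.0f}")
```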

Eyring
The Eyring model is:
$$L \propto \frac{1}{T}\,e^{-\left[A - \frac{B}{T}\right]}$$
Coffin-Manson
A form of fatigue life strain models is the Coffin-Manson “life vs. plastic strain”, which
is often used for solder joint reliability modeling:

$$AF = \left(\frac{\Delta T_S}{\Delta T_U}\right)^{\beta}$$

where:

AF = acceleration factor
ΔTU = temperature cycle range in service use, °K
ΔTS = temperature cycle range under stress conditions, °K
β = constant for a specific failure mechanism

The number of cycles to failure is expressed as:

$$N_f = A\left(\frac{1}{\Delta e_p}\right)^{\beta}$$

where:

Nf = number of cycles to failure


A= a material constant
Δep = plastic strain range
β= a material constant
Reliability Information Analysis Center
209
Chapter 5: Life Data Modeling

Since ΔT ∝ Δep, a simplified acceleration factor for temperature cycling fatigue testing
is:

$$AF = \frac{N_{use}}{N_{test}} = \left(\frac{\Delta T_{test}}{\Delta T_{use}}\right)^{\beta}$$

The Coffin-Manson model is also sometimes used to model the acceleration due to
vibration stresses. Random vibration input and response curves are typically plotted on
log-log paper, with the power spectral density (PSD) expressed in squared acceleration
units per hertz (G2/Hz), plotted along the vertical axis, and the frequency (Hz) plotted
along the horizontal axis.

$$P = \lim_{\Delta f \to 0}\frac{G^2}{\Delta f}$$

In the above equation, “G” is the root mean square (RMS) of the acceleration, expressed
in gravity units, and “Δf” is the bandwidth of the frequency range expressed in hertz.
Since “G” is the agent of failure that causes fatigue, the following inverse power model
applies:

$$L(G) \propto \frac{1}{G^{\beta}} \;\Rightarrow\; \mathrm{Life} = \frac{1}{K\,G^{\beta}}$$

The acceleration factor for vibration based on Grms for similar product responses is
represented by:

$$AF = \frac{N_{use}}{N_{test}} = \left(\frac{G_{test}}{G_{use}}\right)^{\beta}$$
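
Both fatigue acceleration factors above reduce to one-line power laws, as in the sketch below; the ΔT, Grms and β values are illustrative assumptions (β is material- and mechanism-specific), not recommendations.

```python
def thermal_cycling_af(dt_test, dt_use, beta):
    """Coffin-Manson style AF = (dT_test / dT_use) ** beta."""
    return (dt_test / dt_use) ** beta

def vibration_af(g_test, g_use, beta):
    """Inverse power law vibration AF = (G_test / G_use) ** beta."""
    return (g_test / g_use) ** beta

print(thermal_cycling_af(100.0, 40.0, 2.5))  # assumed 100 C vs. 40 C swings
print(vibration_af(12.0, 6.0, 4.0))          # assumed 12 vs. 6 Grms
```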

5.3.2. Combined Models


Acceleration models with more than one accelerating variable might be suggested when it
is known that two or more potential accelerating variables contribute to degradation and
failure. Several examples follow.


Temperature and Non-Thermal Stress


When temperature and a second non-thermal stress (e.g., voltage) are the accelerated
stresses of a test, then the Arrhenius and the Inverse Power Law models can be combined
to yield the Temperature-Non-Thermal (T-NT) model (Reference 10):

$$L(U,V) = \frac{C\,e^{B/V}}{U^{n}}$$

where:

U= non-thermal stress (i.e., voltage, vibration, etc.)


V= temperature (in °K)
B, C, n = parameters to be determined

The T-NT relationship can be linearized and plotted on a Life vs. Stress plot by taking the
natural logarithm of both sides:

$$\ln[L(U,V)] = \ln(C) - n\ln(U) + \frac{B}{V}$$

Here, the log of the life is a linear function of the transformed stresses: the intercept is ln(C), the coefficient of ln(U) is "−n", and the coefficient of 1/V is "B".

The acceleration factor for the T-NT relationship is given by:

$$AF = \frac{L_{Use}}{L_{Accelerated}} = \frac{\dfrac{C\,e^{B/V_u}}{U_u^{\,n}}}{\dfrac{C\,e^{B/V_A}}{U_A^{\,n}}} = \left(\frac{U_A}{U_u}\right)^{n} e^{B\left(\frac{1}{V_u} - \frac{1}{V_A}\right)}$$

where:

LUse = the life at use stress level


LAccelerated = the life at the accelerated stress level
Vu = the use temperature level
VA = the accelerated temperature level
Uu = the use non-thermal level
UA = the accelerated non-thermal level
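
A small sketch of the T-NT acceleration factor follows; the stress levels and the parameters B and n are invented for illustration, and the temperatures are absolute (Kelvin).

```python
import math

def tnt_af(n, b, u_use, u_acc, v_use_k, v_acc_k):
    """T-NT AF = (U_A / U_u) ** n * exp(B * (1/V_u - 1/V_A))."""
    return (u_acc / u_use) ** n * math.exp(b * (1.0 / v_use_k - 1.0 / v_acc_k))

# e.g., 5 V use vs. 7 V test and 328 K use vs. 398 K test,
# with assumed n = 2 and B = 8000 K
print(f"AF = {tnt_af(2.0, 8000.0, 5.0, 7.0, 328.0, 398.0):.0f}")
```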

Temperature-Humidity Models
A variation of the Eyring relationship is the Temperature-Humidity (TH) relationship.
This combination model is expressed as:

$$L(V,U) = A\,e^{\left(\frac{\varphi}{V} + \frac{b}{U}\right)}$$

where, “φ” and “b” are parameters to be determined (the parameter “b” is also known as
the activation energy for humidity), “A” is a constant, “U” is the relative humidity
(decimal or percentage), and “V” is the temperature (in absolute units, °K). Note that the
relative humidity can be expressed in either a decimal format or as a percentage, as long
as it is consistent throughout the analysis. The relationship is linearized by taking the
natural logarithm of both sides of the equation:

$$\ln[L(V,U)] = \ln(A) + \frac{\varphi}{V} + \frac{b}{U}$$

The acceleration factor for the TH relationship is:

$$AF = \frac{L_{Use}}{L_{Accelerated}} = \frac{A\,e^{\left(\frac{\varphi}{V_u} + \frac{b}{U_u}\right)}}{A\,e^{\left(\frac{\varphi}{V_A} + \frac{b}{U_A}\right)}} = e^{\varphi\left(\frac{1}{V_u} - \frac{1}{V_A}\right) + b\left(\frac{1}{U_u} - \frac{1}{U_A}\right)}$$

where:

LUse = the life at use stress level


LAccelerated = the life at the accelerated stress level
Vu = the use temperature level
VA = the accelerated temperature level
Uu = the use humidity level
UA = the accelerated humidity level


Peck Model
The Peck model (Reference 6) is:
$$L \propto (RH)^{-n}\,e^{\left[\frac{E_a}{KT}\right]}$$

where:

RH = Relative Humidity
T= temperature
n= constant
Ea = activation energy
K = Boltzmann's constant = 8.617 x 10-5 eV/°K

Note that this is a multiplicative model consisting of a power law for humidity and the
Arrhenius model for temperature.

The British Telecom Model


The British Telecom model, also used in the Telcordia standards (Reference 7) is:

$$L \propto e^{\left[\frac{E_a}{KT}\right] + n(RH)^2}$$

This model includes the effects of both temperature and relative humidity.

Harris Model
Wearout data published by the Harris Corporation shows a good fit to Peck’s model
(Reference 8) in representing aluminum corrosion. This model is:
$$AF = e^{\left[\frac{E_a}{k}\left(\frac{1}{T_U} - \frac{1}{T_S}\right)\right]}\left(\frac{RH_S}{RH_U}\right)^{a}\left(\frac{V_S}{V_U}\right)^{b}$$

where:

AF = acceleration factor
Ea = activation energy
k = Boltzmann's constant = 8.617 x 10-5 eV/°K
TU = product temperature in service use, °K
TS = product temperature in stress conditions, °K


RHU = relative humidity in service use
RHS = relative humidity in stress conditions
VU = voltage in service use
VS = voltage in stress conditions
a= 2.66 based on Peck
b= 1.4 (from Harris data)

Fatigue and S-N curves


With metals and alloys, the fatigue process starts with dislocations, or crystallographic
irregularities, that ultimately result in crack formation. It is a probabilistic phenomenon
with a significant variation in lifetime. The S-N curves quantify the relationship between
stress and the number of stress cycles to failure. It is essentially a life-stress relationship
for metals. The curves are generally obtained by testing samples of the metal or alloy,
and have been published in various handbooks.

In estimating fatigue life for materials, the model is used as the analytical representation
of the so-called “S-N” curves, where “S” is stress amplitude and “N” is life (in cycles to
failure), such that $N = kS^{-b}$, where "b" and "k" are material parameters either estimated
from test data or published in handbooks.

Miner’s Rule
Miner's rule states that the amount of damage sustained by a metal is proportional to the
number of cycles it experiences, as follows:

$$\sum_{i=1}^{k}\frac{n_i}{N_i} = C$$

There are "k" stress levels; ni is the number of cycles accumulated at stress level i, Ni is the number of cycles to failure at a constant level of that stress, and "C" is usually assumed to be 1.0. The rule essentially estimates the fraction of life consumed by the stress reversals at each specific magnitude.
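
A direct implementation of Miner's rule is sketched below; the (applied cycles, cycles-to-failure) pairs are hypothetical.

```python
def miners_damage(blocks):
    """Cumulative damage D = sum(n_i / N_i); failure is predicted when D
    reaches C, usually taken as 1.0."""
    return sum(n_applied / n_to_failure for n_applied, n_to_failure in blocks)

# Three stress levels: (cycles experienced, cycles to failure at that level)
blocks = [(2.0e4, 1.0e5), (5.0e3, 2.0e4), (1.0e3, 5.0e3)]
d = miners_damage(blocks)
print(f"damage = {d:.2f}")  # 0.65 of the available life consumed
```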

5.3.3. Cumulative Damage Model


Many situations arise in which there is cumulative damage inflicted on an item when
subjected to a stress. For those situations where a Weibull distribution is appropriate, the
reliability function is expressed as:


$$R(t) = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$
where:

R(t) = reliability – the probability of survival at time t


β= Weibull shape parameter (in time space)
α= Characteristic life as a function of the stressor

If it is assumed that the acceleration can be described by a power law, then:


$$\alpha = \left(\frac{a}{S}\right)^{n}$$
where:

S= stressor
a= life constant
n= fatigue exponent in time space

Combining the two equations yields:


$$R(t) = \exp\left[-\left(\frac{t}{(a/S)^{n}}\right)^{\beta}\right]$$

The modeling process estimates β, “a” and “n”.

The premise of the cumulative damage model is that the amount of life used per cycle is
proportional to the stressor raised to the “n” power:

$$t_e = t_1\left(\frac{S_1}{S_0}\right)^{n}$$


where:

te = equivalent time at stressor S1, expressed relative to S0
S0 = normalization (reference) stressor

This cumulative damage model is particularly useful when the stresses are time varying,
since an equivalent amount of damage can be estimated per unit time, regardless of the
behavior of stress as a function of time. This model is also consistent with fatigue, which
is essentially a cumulative damage scenario.
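
The sketch below applies this equivalent-time idea to a piecewise-constant stress history: each segment contributes its duration scaled by (S/S0)^n, so the whole history collapses onto the reference stress S0. All values are illustrative.

```python
def equivalent_time(segments, s0, n):
    """Equivalent time at reference stress s0 for (duration, stress) segments,
    using the power-law exponent n."""
    return sum(duration * (stress / s0) ** n for duration, stress in segments)

# 100 h at stress 2.0 plus 500 h at stress 1.0, referenced to S0 = 1.0, n = 3
print(equivalent_time([(100.0, 2.0), (500.0, 1.0)], s0=1.0, n=3.0))  # 1300.0
```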

5.4. MLE Equations


The previous sections summarized information relative to the selection of a specific
distribution and acceleration factors. Once these have been determined for a particular
situation, and tests have been performed at various levels of acceleration, the next step is
to quantify model parameters. Previously, in the discussion on distributions and
parameter estimation with MLE, only the distribution parameters were considered, not
the acceleration model parameters. A life model needs parameter estimates for both the
distribution parameters and the acceleration model parameters. In this section, the
likelihood equations for various combinations of distributions and acceleration models
are presented.

The form of a life model is the distribution equation, with the mean or characteristic life
(depending on the distribution) replaced with the acceleration model. For example, if a
Weibull distribution is used, the reliability function is:

$$R(t) = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$
where:

R(t) = reliability – the probability of survival at time t


β= Weibull shape parameter (in time space)
α= Characteristic life as a function of stressor:

And, if the acceleration model is the power law:


$$\alpha = \left(\frac{a}{S}\right)^{n}$$


where:

S= stressor
a= life constant
n= fatigue exponent in time space

Then, combining the two equations yields:


$$R(t) = \exp\left[-\left(\frac{t}{(a/S)^{n}}\right)^{\beta}\right]$$

The modeling process estimates β, “a” and “n”. Once these parameters are estimated, the
life distribution for any stress level can be obtained.

5.4.1. Likelihood Functions


The likelihood functions for the six combinations of distribution (exponential, Weibull,
lognormal) and acceleration model (Arrhenius, Inverse Power Law) are provided below.

Exponential-Arrhenius Reaction Rate Model:

$$L = \sum_{i=1}^{N} N_i\left[-\frac{B}{V_i} - \ln(C) - \frac{t_i}{C}\,e^{-\frac{B}{V_i}}\right] - \sum_{i=1}^{M} N_i\,\frac{T_{Ri}}{C}\,e^{-\frac{B}{V_i}} + \sum_{i=1}^{K} N_i\ln\left[1 - e^{-\frac{T_{Li}}{C}e^{-\frac{B}{V_i}}}\right] + \sum_{i=1}^{L} N_i\ln\left[e^{-\frac{T_{ai}}{C}e^{-\frac{B}{V_i}}} - e^{-\frac{T_{bi}}{C}e^{-\frac{B}{V_i}}}\right]$$

Exponential-Inverse Power Law (IPL):

$$L = \sum_{i=1}^{N} N_i\left[\ln K + n\ln(S_i) - K S_i^{\,n} t_i\right] - \sum_{i=1}^{M} N_i\,K S_i^{\,n} T_{Ri} + \sum_{i=1}^{K} N_i\ln\left(1 - e^{-K S_i^{\,n} T_{Li}}\right) + \sum_{i=1}^{L} N_i\ln\left(e^{-K S_i^{\,n} T_{ai}} - e^{-K S_i^{\,n} T_{bi}}\right)$$


Weibull-Arrhenius:

$$L = \sum_{i=1}^{N} N_i\ln\left[\frac{\beta}{C e^{B/V_i}}\left(\frac{t_i}{C e^{B/V_i}}\right)^{\beta-1} e^{-\left(\frac{t_i}{C e^{B/V_i}}\right)^{\beta}}\right] - \sum_{i=1}^{M} N_i\left(\frac{T_{Ri}}{C e^{B/V_i}}\right)^{\beta} + \sum_{i=1}^{K} N_i\ln\left[1 - e^{-\left(\frac{T_{Li}}{C e^{B/V_i}}\right)^{\beta}}\right] + \sum_{i=1}^{L} N_i\ln\left[e^{-\left(\frac{T_{ai}}{C e^{B/V_i}}\right)^{\beta}} - e^{-\left(\frac{T_{bi}}{C e^{B/V_i}}\right)^{\beta}}\right]$$

Weibull-IPL:

$$L = \sum_{i=1}^{N} N_i\ln\left[\beta K S_i^{\,n}\left(K S_i^{\,n} t_i\right)^{\beta-1} e^{-\left(K S_i^{\,n} t_i\right)^{\beta}}\right] - \sum_{i=1}^{M} N_i\left(K S_i^{\,n} T_{Ri}\right)^{\beta} + \sum_{i=1}^{K} N_i\ln\left[1 - e^{-\left(K S_i^{\,n} T_{Li}\right)^{\beta}}\right] + \sum_{i=1}^{L} N_i\ln\left[e^{-\left(K S_i^{\,n} T_{ai}\right)^{\beta}} - e^{-\left(K S_i^{\,n} T_{bi}\right)^{\beta}}\right]$$

Lognormal-Arrhenius:

$$L = \sum_{i=1}^{N} N_i\ln\left[\frac{1}{\sigma t_i}\,\varphi\!\left(\frac{\ln(t_i) - \ln(C) - B/V_i}{\sigma}\right)\right] + \sum_{i=1}^{M} N_i\ln\left[1 - \Phi\!\left(\frac{\ln(T_{Ri}) - \ln(C) - B/V_i}{\sigma}\right)\right] + \sum_{i=1}^{K} N_i\ln\left[\Phi\!\left(\frac{\ln(T_{Li}) - \ln(C) - B/V_i}{\sigma}\right)\right] + \sum_{i=1}^{L} N_i\ln\left[\Phi\!\left(\frac{\ln(T_{bi}) - \ln(C) - B/V_i}{\sigma}\right) - \Phi\!\left(\frac{\ln(T_{ai}) - \ln(C) - B/V_i}{\sigma}\right)\right]$$


Lognormal-IPL:

$$L = \sum_{i=1}^{N} N_i\ln\left[\frac{1}{\sigma t_i}\,\varphi\!\left(\frac{\ln(t_i) + \ln(K) + n\ln(S_i)}{\sigma}\right)\right] + \sum_{i=1}^{M} N_i\ln\left[1 - \Phi\!\left(\frac{\ln(T_{Ri}) + \ln(K) + n\ln(S_i)}{\sigma}\right)\right] + \sum_{i=1}^{K} N_i\ln\left[\Phi\!\left(\frac{\ln(T_{Li}) + \ln(K) + n\ln(S_i)}{\sigma}\right)\right] + \sum_{i=1}^{L} N_i\ln\left[\Phi\!\left(\frac{\ln(T_{bi}) + \ln(K) + n\ln(S_i)}{\sigma}\right) - \Phi\!\left(\frac{\ln(T_{ai}) + \ln(K) + n\ln(S_i)}{\sigma}\right)\right]$$

Solutions for the parameters’ stress-life log-likelihood functions can be obtained by


setting their first-order partial derivatives equal to zero and applying iterative methods. The second-order partial derivatives required for each of the six combinations are listed below. The advantage of using the second-order partials is their dual use: they feed the Fisher information matrix and the iterative methods used to obtain the parameter solutions.

Exponential Arrhenius:

$$\frac{\partial^2 L}{\partial B^2},\;\; \frac{\partial^2 L}{\partial C^2},\;\; \frac{\partial^2 L}{\partial B\,\partial C}$$

Exponential IPL:

$$\frac{\partial^2 L}{\partial K^2},\;\; \frac{\partial^2 L}{\partial n^2},\;\; \frac{\partial^2 L}{\partial K\,\partial n}$$

Weibull Arrhenius:

$$\frac{\partial^2 L}{\partial \beta^2},\;\; \frac{\partial^2 L}{\partial B^2},\;\; \frac{\partial^2 L}{\partial C^2},\;\; \frac{\partial^2 L}{\partial \beta\,\partial B},\;\; \frac{\partial^2 L}{\partial \beta\,\partial C},\;\; \frac{\partial^2 L}{\partial B\,\partial C}$$

Weibull IPL:

$$\frac{\partial^2 L}{\partial \beta^2},\;\; \frac{\partial^2 L}{\partial K^2},\;\; \frac{\partial^2 L}{\partial n^2},\;\; \frac{\partial^2 L}{\partial \beta\,\partial K},\;\; \frac{\partial^2 L}{\partial \beta\,\partial n},\;\; \frac{\partial^2 L}{\partial K\,\partial n}$$

Lognormal Arrhenius:

$$\frac{\partial^2 L}{\partial B^2},\;\; \frac{\partial^2 L}{\partial C^2},\;\; \frac{\partial^2 L}{\partial \sigma^2},\;\; \frac{\partial^2 L}{\partial B\,\partial C},\;\; \frac{\partial^2 L}{\partial B\,\partial \sigma},\;\; \frac{\partial^2 L}{\partial C\,\partial \sigma}$$

Lognormal IPL:

$$\frac{\partial^2 L}{\partial K^2},\;\; \frac{\partial^2 L}{\partial n^2},\;\; \frac{\partial^2 L}{\partial \sigma^2},\;\; \frac{\partial^2 L}{\partial K\,\partial n},\;\; \frac{\partial^2 L}{\partial K\,\partial \sigma},\;\; \frac{\partial^2 L}{\partial n\,\partial \sigma}$$

The likelihood function yields a value for every possible combination of parameter values. A useful tool in data analysis is a plot of the likelihood value. As an example, Figure 5.4-1 illustrates a contour plot of the likelihood value for a Weibull-IPL model.

Figure 5.4-1: Likelihood Contour Example



In this example, the plot lines represent values of equal likelihood as a function of the
two parameters of interest (i.e., the Weibull slope and the exponent in the power law
acceleration model). The center position represents the combination of beta and “n” at
which the maximum value of likelihood occurs. The height of the likelihood value
increases as the center of the contour lines is approached. The spread in the contour lines of equal likelihood is proportional to the uncertainty in the parameter estimates and is, in fact, one way to estimate confidence bounds on the model parameters. Also, the
dispersion of the likelihood values on the “n” axis can be thought of as the spread of the
TTFs in the stress dimension, and the dispersion of the likelihood values on the “beta”
axis can be thought of as the spread of the TTFs in the time dimension.
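
A contour plot like Figure 5.4-1 can be generated by evaluating the log-likelihood over a grid of parameter values, as in the sketch below for a Weibull-IPL model with two hypothetical stress levels. To keep the surface two-dimensional, the IPL constant K is held fixed, which is a simplification of the full three-parameter problem.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import weibull_min

# Hypothetical complete failure times observed at two stress levels
data = [(1.0, [900.0, 1400.0, 2100.0]), (2.0, [150.0, 230.0, 400.0])]

def log_lik(beta, n, k=1e-3):
    """Weibull-IPL log-likelihood with alpha_i = 1 / (K * S_i ** n)."""
    ll = 0.0
    for s, times in data:
        alpha = 1.0 / (k * s ** n)
        ll += np.sum(weibull_min.logpdf(times, beta, scale=alpha))
    return ll

betas = np.linspace(0.5, 4.0, 80)
ns = np.linspace(0.5, 5.0, 80)
Z = np.array([[log_lik(b, n) for b in betas] for n in ns])
plt.contour(betas, ns, Z, levels=30)
plt.xlabel("beta (Weibull slope)")
plt.ylabel("n (IPL exponent)")
plt.show()
```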

5.5. References
1. Lyu, M.R. (Editor), “Handbook of Software Reliability Engineering”, McGraw-Hill,
April 1996, ISBN 0070394008
2. Musa, J.D.; Iannino, A.; and Okumoto, K.; “Software Reliability: Measurement,
Prediction, Application”, McGraw-Hill, May 1987, ISBN 007044093X
3. Musa, J.D., “Software Reliability Engineering: More Reliable Software, Faster
Development and Testing”, McGraw-Hill, July 1998, ISBN 0079132715
4. Nelson, W., “Applied Life Data Analysis”, John Wiley & Sons, 1982,
ISBN 0471094587
5. Fisher, R. A., 1912, “On an Absolute Criterion for Fitting Frequency Curves”,
Messenger of Mathematics, Vol. 41, pp. 155-160. [Reprinted in Statistical
Science, Vol. 12, (1997) pp. 39-41.]
6. Peck, S., IRPS tutorial, 1990
7. Telcordia GR1221
8. Peck and Hallberg, “Quality and Reliability Engineering International”, 1991
9. Hald, A., 1999, “On the Maximum Likelihood in Relation to Inverse Probability
and Least Squares.” Statistical Science, Vol. 14, No. 2, pp. 214-222.
10. Accelerated Life Testing Analysis (ALTA), Reliasoft Corp.

6. Interpretation of Reliability Estimates


This chapter presents topics related to the interpretation of various aspects of reliability
models. It is hoped that this material will give the reader a better intuitive understanding of reliability predictions, assessments and estimations.

6.1. Bathtub Curve


The “bathtub curve” is a general reliability model of failure rate as a function of time
that, for hardware, has three distinct periods. It is often misunderstood and
misinterpreted. It should be thought of as a concept rather than an actual failure rate
function. A generic bathtub curve is shown in Figure 6.1-1.

Figure 6.1-1: Bathtub Curve

The three regions are:

Infant Mortality. In this first portion of the bathtub curve, the failure rate is relatively
high because a portion of the population may contain parts with defects. These parts
generally fail earlier than those in the main population. The shape of the failure rate
curve is decreasing, with its rate of decrease dependent on the maturity of the design and
manufacturing processes, as well as the applied stresses.

Useful Life. The second portion of the bathtub curve is known as the “useful life” and is
characterized by a relatively constant failure rate caused by randomly occurring failures.
It should be noted that the failure rate is only related to the height of the curve, not to the
length of the curve, which is a representation of product or system life. If items are
exhibiting randomly occurring failures, then they fail according to the exponential
distribution, in accordance with a Poisson process. Since the exponential distribution
exhibits a constant hazard rate, we can simply add the failure rates for all items making
up an item to estimate the overall failure rate of that item during its useful life.

Wearout. The last part of the curve is the wearout portion. This is where items start to
deteriorate to such a degree that they are approaching, or have reached, the end of their
useful life. This is often relevant to mechanical parts, but can also apply to any failure
cause that exhibits wearout behavior.

It is important to understand the difference between the MTBF of an item and the useful
life of that same item. Items that experience wearout failure modes/mechanisms will
have some period of useful life before they fail as a result of wearout. This useful life is
not the same as the item MTBF. During useful life, an item may also experience
randomly occurring “freak” failures caused by weak components or faulty workmanship,
especially if the item is subjected to high stress conditions. The occurrence of these
random failures during an item’s useful life results in higher failure rates, or lower
MTBF, for that item.

Mechanical items are usually most prone to wearout and, therefore, we are usually most
concerned with the useful life, or MTTF, associated with these items. Electronic items usually become obsolete long before any significant wearout takes place (although, with the progressively decreasing feature sizes of current state-of-the-art microelectronic devices, the issues associated with wearout and useful life are becoming of greater concern). Therefore, the infant mortality and constant failure rate portions of the bathtub curve are of the most interest for these items.

The bathtub curve conceptually offers a good view of the three primary types of failure
categories. It is essentially a composite failure rate curve comprised of three generic
types of failure causes. In practice, however, the well defined curve of Figure 6.1-1 is
rare. The actual curve for a product or system will depend on many factors. A specific
failure cause will generally exhibit characteristics of only one segment of the bathtub
curve, but when the characteristics of all of the other failure causes for that product or
system are considered, and a composite model is generated, the curve will have a shape
that deviates from the classic bathtub curve, even though it will often contain elements of
each of the three portions. Usually, the composite curve will be dominated by the
characteristics of those failure causes that dominate the overall reliability of the item.

It is also important to note that defects do not always manifest themselves as infant
mortality failures. They can appear to be infant mortality, random or wearout, depending
on the specific characteristics of the failure mechanism and factors, such as defect
severity distributions.

6.2. Common Cause vs. Special Cause


The fact that a failure rate can be predicted for a given part under a specific set of
conditions does not imply that a failure rate is an inherent quality of the part. The
probability of failure is a complex interaction between the inherent defect density, defect
severity, and stresses incurred in operation. Failure rates predicted using empirical
models are, therefore, typical failure rates only and represent typical defect rates, design
characteristics and use conditions. The accuracy of these prediction models is dependent
on:

• The model developers’ ability to identify the variables (component- or use-related) that most heavily influence reliability
• The level of detailed data to which the model user has access
• The quantity and quality of the data on which the models are based

The accuracy of a reliability model is a strong function of the manner in which defects
are accounted for. Therefore, there is a trade-off between the usability of the model and
the level of detailed data that it requires. This highlights the fact that the purpose of a
reliability prediction must be clearly understood before a methodology is chosen.

Practical considerations for choosing an approach will inevitably include the types and
level of detail of information available to the analyst. Given the practical time and cost
constraints that most reliability practitioners face, it is usually important that the chosen
reliability prediction methodology be based on data and information accessible to them.

Model developers have long known that many of the factors which had a major influence
on the reliability of the end product were not included in traditional methods like MIL-
HDBK-217, but under the “constraints” of handbook users, these factors could not be

included in the models. For example, it was known that manufacturing processes had a
major impact on end item reliability, but those are the factors which corporations hold
most proprietary. As an example of this, a physics-of-failure-like model was developed
several years ago for small-scale CMOS technology. This model required many input
variables, such as metallization cross-sectional area, silicon area, oxide field strength,
oxide defect density, metallization defect density etc. While the model has the potential
to be much more accurate than the other MIL-HDBK-217 models, it is essentially
unusable by anyone other than the component manufacturers who have access to such
information. The model is useful, however, for these manufacturers to improve the
reliability of their component designs.

The two primary purposes for performing a quantitative reliability assessment of systems
are (1) to assess the capability of the parts and design to operate reliably in a given
application (robustness), and (2) to estimate the number of field failures or the probability
of mission success. The first does not require statistically-based data or models, but
rather sound part and materials selection/qualification and robust design techniques. It is
for this purpose that physics approaches have merit. The second, however, requires
empirical data and models derived from that data. This is due to the fact that field
component failures are predominantly caused by component and manufacturing defects
which can only be quantified through the statistical analysis of empirical data. This can
be seen by observing the TTF characteristics of components and systems, which are
almost always decreasing, indicating the predominance of defect-driven failure
mechanisms. The “handbook” models described in this book provide the data to quantify
average failure rates which are a function of those defects.

It has been shown that system reliability failure causes are not driven by deterministic
processes, but rather by stochastic processes that must be treated as such in a successful
model. There is a similarity between reliability prediction and chaotic processes. This
likeness stems from the fact that the reliability of a complex system is entirely dependent
upon initial conditions (e.g., manufacturing variation) and use variables (i.e., field
application). Both the initial conditions and the use application variables are often
unknowable to any degree of certainty. For example, the likelihood of a specific system
containing a defect is often unknown, depending on the defect type, because the
propensity for defects is a function of many variables and deterministically modeling
them all is virtually impossible. However, the reliability can be predicted within bounds
by using empirically based stochastic models.

A critical factor that must be considered when choosing a reliability assessment method
is whether the failure mechanism under analysis is a special cause or a common cause


mechanism. In other words, a special cause mechanism means that there is an assignable
cause to the failure and that only a subpopulation of the item is susceptible to this failure
mechanism. Common cause mechanisms are those affecting the entire population.

Table 6.1-1 summarizes the characteristics of various categories of failure causes, and
identifies whether they are typically common cause or special cause. The categories of
failure types encompass the ways a failure cause can manifest itself. These are also
categories that can be used in a FMEA.

Table 6.1-1: Categories of Failure Effects

                      Category of Failure Type
Failure cause    Design    Process   Screen Fallout/   Infant      Random    Wearout
type             Not       Not       Out-of-the-Box    Mortality   Failure
                 Capable   Capable   Failure
Always (Common      x         x                                       x         x
Cause)
Sometimes           x         x            x               x          x         x
(Special Cause)

If it is erroneously assumed that special cause mechanisms will affect the entire
population, gross errors in the reliability estimates of the population will result. This
error results from the assumption of a mono-modal TTF distribution when, in fact, the
actual distribution is multimodal.

If the distribution is truly mono-modal, only the parameters applicable to a single mode
distribution need to be estimated. However, if there are really several sub-populations
within the entire population, the parameters of each of the distributions need to be
estimated, along with the percentage of the entire population represented by each
distribution.

This is especially critical when dealing with defects. In this case, it is critical to
understand the percentage of the population that is at risk of failure. To illustrate this,
consider the probability plot in Figure 6.2-1. As can be seen in this plot, there is an
apparent “knee” in the plot at about 400 hours, an indication of several subpopulations.
If a mono-modal distribution is assumed (i.e., the straight line), errors in the cumulative
percent fail at a given time will occur. Conversely, if a multimodal distribution is assumed,
a much more accurate representation of the situation results (the line through the data
points).

[Figure: ReliaSoft Weibull++ probability plot of a three-mode mixed-Weibull fit (MLE, F=98/S=139), showing data points, suspension points, and the fitted probability line with an apparent knee near 400 hours. Fitted parameters: β[1]=1.3341, η[1]=307.1460, ρ[1]=0.0646; β[2]=0.7505, η[2]=2.1367E+4, ρ[2]=0.4240; β[3]=4.2735, η[3]=1.1624E+5, ρ[3]=0.5114]

Figure 6.2-1: Example of a Non-Mono-Modal Distribution

The quantification of subpopulations usually requires data on many more samples relative to the mono-modal situations.

If accelerated tests are used to model life, the risk in assuming mono-modality must be
considered. For this reason, techniques like stress/strength and first principles are often
difficult to use to quantify multimodality.

Examples of multimodal distributions


The plots presented in Figures 6.2-2 through 6.2-6 illustrate the characteristics of several
different types of multimodal distributions. Before each plot, the information on each of
the two distributions comprising the multimodal distribution is presented in a table
(Tables 6.2-2 through 6.2-6, respectively). Included in this description are the beta value,
the eta value (characteristic life) and the portion of the population represented by the
distribution.


Table 6.2-2: Bimodal Population Example 1


Population 1 2
Beta 0.60 0.59
Eta 61.1 918.7
Portion 0.42 0.58

[Figure: Weibull probability plot, unreliability F(t) versus time, of the mixed population defined in Table 6.2-2]

Figure 6.2-2: Multimodal Distribution Example 1


Table 6.2-3: Bimodal Population Example 2


Population 1 2
Beta 0.86 1.4
Eta 341.4 863.25
Portion 0.63 0.37

[Figure: Weibull probability plot, unreliability F(t) versus time, of the mixed population defined in Table 6.2-3]

Figure 6.2-3: Multimodal Distribution Example 2


Table 6.2-4: Bimodal Population Example 3


Population 1 2
Beta 1.81 1.23
Eta 98.44 679.4
Portion 0.19 0.81

[Figure: Weibull probability plot, unreliability F(t) versus time, of the mixed population defined in Table 6.2-4]

Figure 6.2-4: Multimodal Distribution Example 3


Table 6.2-5: Bimodal Population Example 4


Population 1 2
Beta 1.18 4.69
Eta 206.2 497.6
Portion 0.19 0.81

[Figure: Weibull probability plot, unreliability F(t) versus time, of the mixed population defined in Table 6.2-5]

Figure 6.2-5: Multimodal Distribution Example 4


Table 6.2-6: Bimodal Population Example 5


Population 1 2
Beta 5.71 4.29
Eta 44.7 483.7
Portion 0.10 0.90

[Figure: Weibull probability plot, unreliability F(t) versus time, of the mixed population defined in Table 6.2-6]

Figure 6.2-6: Multimodal Distribution Example 5

A distribution was then obtained by pooling all of the individual distributions described
previously. This is shown in Figure 6.2-7. Pooling the distributions from many failure
causes has the effect of randomizing the apparent failure characteristics of the resultant
pooled population. This is one of the reasons that a constant failure rate distribution
(i.e., the exponential) is usually a reasonably good representation of a complex system’s
failure rate characteristics.
[Figure: ReliaSoft Weibull++ mixed-Weibull probability plot of the pooled data set (F=498/S=0). Fitted parameters: β[1]=0.7208, η[1]=432.4614, ρ[1]=0.6694; β[2]=4.4232, η[2]=491.0120, ρ[2]=0.3306]

Figure 6.2-7: Multimodal Distribution Example of Pooled Data Set

To illustrate the reliability theory concepts discussed above, consider an example in
which the lifetimes of people are analyzed. The data on which this analysis was based is
from http://www.mortality.org (Reference 1) and considers the lifetimes of individuals
that died in 2006.

The raw data is contained in Figure 6.2-8, which presents the number of deaths occurring
at each age. This is the discrete version of the pdf.


[Figure: histogram of the number of deaths (0 to 4,000) versus age at death (0 to 120 years)]

Figure 6.2-8: Age at Death Data

From this graphic, it can be seen that there are several distinct distributions present. First
is the infant mortality period, which is represented by Mode 1. The second mode, Mode
2, represents deaths in the late teens and early twenties. Then, the third and fourth modes
represent deaths from old age.

Next, a multimode Weibull distribution was fit to the data using ReliaSoft’s Weibull++
software tool, which allows fitting failure data to multimode distributions. The results
are summarized in Table 6.1-7.

Table 6.1-7: Four Mode Weibull Distribution Parameters

Parameter   Mode 1   Mode 2   Mode 3   Mode 4
Beta        0.184    4.25     4.74     9.61
Eta         0.1030   24.81    67.84    87.67
Portion     0.0090   0.012    0.194    0.784

The composite pdf is shown in Figure 6.2-9.

[Figure: composite probability density function, f(t) versus time, over ages 0 to 110]

Figure 6.2-9: pdf of multimode distribution of ages
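As a sketch of how such a composite model is evaluated, the mixed-Weibull cumulative probability is simply the portion-weighted sum of the individual Weibull CDFs. The short Python example below, using the Table 6.1-7 parameters (ages in years), is illustrative only:

    import math

    # (beta, eta, portion) for each mode, taken from Table 6.1-7
    modes = [(0.184, 0.1030, 0.0090),
             (4.25, 24.81, 0.012),
             (4.74, 67.84, 0.194),
             (9.61, 87.67, 0.784)]

    def mixed_cdf(t):
        """Portion-weighted sum of the individual Weibull CDFs."""
        return sum(p * (1.0 - math.exp(-((t / eta) ** beta)))
                   for beta, eta, p in modes)

    print(mixed_cdf(1.0))   # probability of death by age 1 (~0.007)
    print(mixed_cdf(80.0))  # probability of death by age 80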

The failure rate is shown in Figure 6.2-10.


[Figure: failure rate, f(t)/R(t), versus time, over ages 0 to 110]

Figure 6.2-10: Failure Rate of Age Data


The Weibull probability plot is shown in Figure 6.2-11. Note that, in this graph, the plot
is shown using Weibull scales, i.e., the log of time on the x-axis and the double log of
unreliability on the y-axis. If this plot were close to a straight line, it would indicate that
the distribution could be described adequately with a mono-modal Weibull distribution.
Clearly, this is not the case.
[Figure: Weibull probability plot of the age data, unreliability F(t) versus time]

Figure 6.2-11: Probability Plot of Age Data

Figure 6.2-12 illustrates a single mode Weibull fit (straight line) to the data. As can be
seen, if the single mode fit is used to estimate the probability of death at a specific age,
significant errors would result. For example, it would imply that about 20% of the
population would live to 110 years, and that there is less than a 0.001% probability of
death in the first year.

This example illustrates the fact that, if there is a “sub-population” of samples with
different reliability behavior than the main population, then the TTF distributions may
manifest themselves as bimodal or multimodal. It is important that these multimodal
distributions be characterized. If one of the two “modes” in the distribution appears as
early failures resulting from defects, this information is required to develop an
appropriate reliability screen.

[Figure: Weibull probability plot of the age data with a single mode Weibull fit shown as a straight line]

Figure 6.2-12: Single Mode Weibull Fit to the Age Data

6.3. Confidence Bounds


The topic of confidence bounds has always been important in reliability engineering,
because the uncertainty associated with a reliability estimate must be understood when
making decisions based on that estimate; the risk associated with being wrong must be
assessed. It is a topic that has received a tremendous amount of attention from reliability
practitioners and academicians alike.

6.3.1. Traditional Techniques for Confidence Bounds


The traditional manner in which confidence levels are calculated around failure rates is
the use of the chi-square distribution, as follows:

λ = χ²(1 − CL, 2r + 2) / (2t)

where the numerator is a value taken from a chi-square table at confidence level CL with
2r + 2 degrees of freedom (r being the number of observed failures), and “t” is the number
of device hours. A question sometimes arises as to how the confidence bounds calculated in
this manner compare to those calculated with the use of the Poisson distribution.
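Before turning to that comparison, the calculation itself is straightforward to automate. The following minimal Python sketch assumes SciPy is available; the function name and example values are illustrative only:

    from scipy.stats import chi2

    def failure_rate_upper_bound(failures, device_hours, confidence):
        """Single-sided upper bound on a constant failure rate (failures/hour)."""
        degrees_of_freedom = 2 * failures + 2
        return chi2.ppf(confidence, degrees_of_freedom) / (2.0 * device_hours)

    # Example: 2 failures observed in 1e6 device-hours, 90% confidence
    print(failure_rate_upper_bound(2, 1e6, 0.90))  # ~5.32e-6 failures/hour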

From the binomial and Poisson distributions, Farachi (Reference 2) has shown that:

100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC


238
Chapter 6: Interpretation of Reliability Estimates

1 − CL = ∑(k=0 to r) [n! / (k!(n − k)!)] (1 − q)^(n−k) q^k

Using the Poisson approximation of the binomial:

[n! / (k!(n − k)!)] (1 − q)^(n−k) q^k ≈ [(nq)^k / k!] e^(−nq)

Combining the above two equations yields:

1 − CL = ∑(k=0 to r) [(nq)^k / k!] e^(−nq) = e^(−nq) [1 + nq + ⋅⋅⋅ + (nq)^(r−1)/(r − 1)! + (nq)^r/r!]
Since:

nq = λ t

Then:

1 − CL = ∑(k=0 to r) [(λt)^k / k!] e^(−λt) = e^(−λt) [1 + λt + ⋅⋅⋅ + (λt)^(r−1)/(r − 1)! + (λt)^r/r!]
The chi-square value is the exact solution to the above equation. The chi-square values
are for “λt”, not “λ” alone. Therefore, for a given confidence level and number of
failures, the chi-square tables provide the value for “λt”. Therefore, the chi-square values
are entirely consistent with the Binomial and Poisson distributions.
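A quick numerical check of this consistency, again sketched under the assumption that SciPy is available, confirms that the chi-square quantile satisfies the Poisson sum above:

    from math import exp, factorial
    from scipy.stats import chi2

    r, CL = 3, 0.90
    # "lambda*t" implied by the chi-square table for r failures at confidence CL
    lam_t = chi2.ppf(CL, 2 * r + 2) / 2.0
    # Poisson sum from the derivation; should equal 1 - CL
    tail = sum((lam_t ** k) * exp(-lam_t) / factorial(k) for k in range(r + 1))
    print(lam_t, tail)  # tail ~ 0.10 = 1 - CL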

It is important to note that the confidence bounds based on the chi-square distribution
summarized above pertain to the uncertainty from statistical considerations alone. They
do not account for variations in failure rate due to other noise factors, such as:

• Uncertainty in the number of hours or failures
• Whether the failure causes are truly relevant
• Time dependencies of the failure rate


Additional information on confidence bounds is included in the section on life modeling.

6.3.2. Uncertainty in Reliability Prediction Estimates


One of the limitations of reliability predictions that are based on handbook models is that
they can only provide “point estimates” of failure rates. These failure rates are based on
whatever data was available to make up the model, and the model development approach.
There are no statistical confidence limits or intervals that can be associated with
handbook model data. Traditional methods are not applicable because there are many
more factors contributing to the uncertainty than the statistical-only considerations of
traditional techniques.

For example, consider the following summary of the model development and use
approach, along with the potential sources of error, as shown in Figure 6.3-1. The
sources of error are highlighted in the gray boxes. From this, it can be seen that there are
many sources of noise. The model output results reflect the cumulative effects of the
uncertainties in all of the noise sources shown.

Although a theoretical basis for the calculation of the confidence bounds around
reliability predictions is extremely difficult to derive, it is possible to empirically observe
the degree of uncertainty. Reliability predictions performed using empirical models
developed from field data result in a failure rate estimate with relatively wide confidence
bounds. Table 6.3-1 presents the multipliers of the failure rate point estimate as a
function of confidence level. This data was obtained by analyzing data on systems for
which both predicted and observed data was available. For example, using traditional
approaches, one could be 90% certain that the true failure rate was less than 7.57 times
the predicted value.


[Figure: flow diagram of empirical model development and use. Raw input data (item information: manufacturer, manufacturing date, quality, defect rate; data: operating hours, time to failure, number of failures, failure relevancy, degradation vs. catastrophic) feeds model development, which is subject to censored data, biased estimators, and assumptions made in modeling. The resulting model is exercised with user-supplied item information (manufacturing date, quality, defect rate), modeled factors (environmental: temperature, humidity, delta T, radiation, contaminants; operational profile: duty cycle, cycling rate, operating stress, electrical stress, mechanical stress, extreme events), and unmodeled noise factors, to produce the model output. The sources of error are highlighted at each stage.]

Figure 6.3-1: Sources of Error in Empirical Models



Table 6.3-1: Failure Rate Uncertainty Level Multipliers


Percentile Multiplier
0.1 0.13
0.2 0.26
0.3 0.44
0.4 0.67
0.5 1.00
0.6 1.49
0.7 2.29
0.8 3.78
0.9 7.57

An interesting effect occurs when combining the distributions that describe the
uncertainties of the individual components comprising a system. The uncertainties are
wider at the piece-part level than at the system level. If one were to take the distributions
of failure rate from the regression analysis used to derive the component model (i.e.,
standard error estimate), and statistically combine them with a Monte Carlo summation,
the resultant distribution describing the system prediction uncertainty will have a
variance much smaller than that of the individual components comprising the system.
The reason for this is the effect of the Central Limit Theorem which quantifies the
variance of summed distributions. For example, the variance around the component
failure rate estimate is higher than the variance suggested by the above table. However,
the variance in the above table is observed to be much larger than that theoretically
derived by summing the component failure rate distributions. This implies that there are
system-level effects that contribute to the uncertainty that are not accounted for in the
component-based estimate.

Bayesian techniques, such as those used in the 217Plus system reliability assessment
methodology, allow the refinement of analytical predictions over time to reflect the
experienced reliability of an item as it progresses through in-house testing, initial field
deployment and subsequent use by the customer. In-house testing can be comprised of
accelerated tests at the component or equipment level, reliability growth tests, and
reliability screens or accelerated screening techniques.

We will not discuss Bayesian methods in detail here. The primary benefit of using
Bayesian techniques can be implied from Figure 6.3-2, however. As more and more test
and experience data is factored into the initial analytical reliability prediction, the


statistical confidence levels represented by the outside (red) lines on the graph continue
to converge on the “True MTBF” of the subject item. Using Bayesian techniques, as
time approaches infinity the predicted inherent MTBF and the true MTBF of the device,
product or system population become one and the same. This, of course, assumes that
MTBF is the appropriate metric, but the same situation conceptually applies to other
metrics such as failure rate and reliability (R).

[Figure: conceptual plot of MTBF versus time, spanning prediction (paper analysis), assessment (in-house testing), and estimation (field data). Upper and lower confidence level curves converge on the “True MTBF” as test and field experience accumulate.]

Figure 6.3-2: Confidence Level Through Prediction, Assessment and Estimation

6.4. Failure Rate vs pdf


The biggest distinction to be made when assessing reliability is whether the time period
of interest for the item under analysis is in the “meat” of the TTF distribution, or whether
it is in the extreme left tail of the distribution. For example, consider a system that has a
five year design life. If an item has a mean life of three years, clearly precautions would

be required, such as preventive maintenance. The reliability of these types of items is
usually easier to predict because they can be tested to failure in relatively short times, and
small sample sizes will usually suffice.

On the other hand, consider a component that has a failure rate of 2 FITs7, typical for
many modern electronic components. In the five year design life, assuming continuous
operation, the reliability would be:

R = e^(−λt) = 0.999912

Or, a probability of failure of 0.000088.

Therefore, if there were 10,000 of these components operating in a system, the expected
number of failures in the five year period would be less than one.

Predicting the reliability behavior of a failure cause based on the extreme left tail of the
TTF distribution of the main population is dangerous, since the accuracy of the
distribution breaks down in its extreme tails. As an example, consider a state-of-the-art
integrated circuit. One failure mechanism is electromigration of the metal lines.
Manufacturers will typically perform life tests of the metal line structures to assess their
lifetime. These tests are done in a manner similar to the practices detailed in this book.
They are accelerated tests performed under a variety of temperature and current density
conditions. Failure times are collected and models are developed to predict lifetimes
under deployment conditions. A goal of a good manufacturer is to design the metal lines
such that the probability of failure is acceptably low when the part is used under specified
conditions. While, as stated, the models developed can be used to estimate the reliability
under deployment conditions, rarely will the prediction be reasonably close to the
observed failure data. The reasons for this are:

• The distribution is usually not mono-modal
• Manufacturing variability is difficult to account for in the model
• Extreme events, such as defects in the metal lines, will only manifest themselves after very large sample sizes are tested or fielded

A multimode distribution can be used to model this situation, the first mode being
applicable to the defects, and the second being applicable to the main population.

7 Two FITs is defined as 2.0 failures per billion hours. This corresponds to 0.002 failures per million hours.

However, in many cases, it is only the first mode that will impact the field reliability
within the useful life of the component.

Some researchers have attempted to use extreme value statistics for such cases, but they
also have limited usefulness because the data on low failure rate items, like electronic
components, is generally not consistent with these distributions. As a result, low failure
rate items are usually modeled with a constant failure rate (exponential distribution), or a
Weibull distribution. The Weibull is usually used in this case to model the effects of
infant mortality.

6.5. Practical Aspects of Reliability Assessments


There are very often serious constraints put on practicing reliability engineers. Due to
limitations of time, cost, test resources, availability of data, limitation of modeling
capabilities, and lack of understanding of failure physics, analysts often need to do the
best they can with what they have to work with. This is usually compounded in small-
and medium-sized companies, which often lack the resources needed to execute many of
the analysis techniques described in this book.

Companies engaged in highly competitive industries face extreme time pressures, which
is in stark contrast to the tenets of good reliability engineering practices. The goal of the
reliability engineer should be to select an optimal approach that achieves the desired
purpose of the analysis, while conforming to the practical constraints to which he or she is
subjected.

6.6. Weibayes
There are many cases in reliability modeling in which there are few or no failures. For
these, a Weibayes technique can be used. This approach is practical when there are few
or no failures and a reasonable shape parameter can be estimated. This approach
essentially fixes a plotting position using:

1. One failure assumed at the end of the test duration
2. A line drawn through the median rank point with an assumed beta

The result of this analysis is a lower single-sided bound of the life distribution. As an
example, consider the following case:

1. 50 samples are tested for 1000 hours, with no failures
2. Data from other testing indicates a beta of 3 is appropriate


3. The median rank at 1000 hours is 1.39%. A line is drawn through this point with
a beta slope of 3.

This is shown in Figure 6.6-1.

Figure 6.6-1: Weibayes Example
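A closely related closed-form calculation, the standard Weibayes formula (one failure assumed imminent at the end of the accumulated test time), can be sketched as below; it yields a comparable, slightly more conservative bound than the median-rank plotting construction above. The function name is illustrative:

    def weibayes_eta(unit_hours, beta, assumed_failures=1.0):
        """Weibayes lower-bound characteristic life from zero-failure data."""
        return (sum(t ** beta for t in unit_hours) / assumed_failures) ** (1.0 / beta)

    # 50 samples tested for 1000 hours each with no failures, assumed beta = 3
    print(weibayes_eta([1000.0] * 50, beta=3.0))  # ~3684 hours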

6.7. Weibull Closure Property


In cases where it is desired to estimate the time-to first-failure (TTFF) of a product or
system comprised of multiple items, the Weibull closure property can be used. Here, the
characteristic life of the Weibull distribution of time to first failure is:
αs = [∑(i=1 to n) 1/αi^β]^(−1/β)


where αi and β represent the Weibull distribution parameters for individual items. This is
applicable when β is the same, but αi can be different for each item.
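A short sketch of this calculation follows; the characteristic lives and shape parameter are illustrative values only:

    def ttff_characteristic_life(alphas, beta):
        """Characteristic life of the system time-to-first-failure distribution."""
        return sum(a ** -beta for a in alphas) ** (-1.0 / beta)

    # Three items sharing beta = 2.0 with different characteristic lives
    print(ttff_characteristic_life([500.0, 800.0, 1200.0], beta=2.0))  # ~400 hours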

6.8. Estimating Event-Related Reliability


Many cases arise in estimating reliability where a failure cause of a device under analysis
is event-related. For example, if a hand-held device is susceptible to failure when it is
dropped, the failure rate (or hazard rate, if a time-varying failure rate distribution is used)
is a function of the:

• rate at which drops occur
• distribution of the drop height
• relationship between drop height and G-shock level
• probability of failure as a function of G-level

The failure rate is expressed as:

λ(t) = λd(t) ⋅ P[hd ⋅ (G/h) > Gth]
where:

λ(t) = the failure rate of the device due to shock-related failure causes
λd(t) = the rate at which the drops occur
hd = the drop height distribution
G/h = the relationship between the G-level and the drop height
Gth = the failure threshold distribution

Since hd and Gth are random variables described by distributions, λ(t) can generally be
estimated with a Monte Carlo analysis, as described earlier in this book.

In this case, the conditional probability of failure if the device is dropped is:

P[hd ⋅ (G/h) > Gth]

This is essentially a stress/strength interference model.
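A Monte Carlo sketch of this conditional probability term is shown below. Every distribution, parameter value, and the assumed linear drop-height-to-G-level relationship are hypothetical placeholders for illustration, not values from the text:

    import random

    G_PER_METER = 400.0  # assumed linear drop-height-to-G-level relationship

    def p_fail_given_drop(trials=100_000):
        failures = 0
        for _ in range(trials):
            height = random.lognormvariate(-0.5, 0.6)      # drop height hd, meters
            g_level = G_PER_METER * height                 # induced shock, G
            threshold = random.normalvariate(500.0, 80.0)  # strength Gth, G
            if g_level > threshold:
                failures += 1
        return failures / trials

    # Multiply by the drop occurrence rate to obtain the shock-related failure rate
    print(p_fail_given_drop())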


6.9. Combining Different Types of Assessments at Different Levels
Practicing reliability engineers are usually faced with the challenge of making reliability
estimates of a product or system based on imperfect, noisy data and information. The
engineer must utilize the data that is available, and additional data that is feasible to
obtain, and combine this information to estimate the product or system reliability.

The 217Plus methodology summarized previously in this book presents one possible
approach for using the initial estimated reliability based on the predictions made from
empirical models, and combining it with empirical data on the same product or system.
This combination is done using Bayesian principles. This is a general approach that can
be extended to include the combination of estimates from different methods that are made
at different levels. For example, consider the case summarized in Table 6.9-1. It may be
possible to characterize specific failure causes with one of the physics-based techniques
summarized herein, but it also may be unlikely that all failure causes can be modeled in
this manner.

Table 6.9-1: Example of Combining Different Types of Models

Item                 Available Reliability Estimate
Assembly             Life Test Data
  Component A        Life Test Data
    Failure Cause 1  Physics Model
    Failure Cause 2  Field Data on Similar Item
    Failure Cause 3  Physics Model
  Component B        Field Data
    Failure Cause 1  Life Test Data
    Failure Cause 2  Field Data
    Failure Cause 3  Physics Model

In this example, the objective is to estimate the reliability of the assembly, which is
comprised of two components. Component A has physics-based models available for
two of the three primary failure causes.

An estimate of the failure rate of component A is:

λA −preliminary = λ1 + λ2 + λ3

where λ1, λ2 and λ3 are the failure rates obtained from the model or data available on each
failure cause. Of course, these values should represent the failure rate under the use

conditions for which the assessment is to be made. In this example, λ is used, which
indicates a constant failure rate. However, if the failure rates are time-dependent, the
corresponding time-dependent failure rates or hazard rates can be used. Also, the
methodology to be illustrated in this example is similar to the data combination
methodology described in the 217Plus section, the main difference being that this
example deals with the situation in which there are different types of data at different
hierarchical levels of the product or system, whereas the 217Plus methodology deals with
different types of data within the same configuration item.

Now, since Component A has life test data available from tests performed on the
component, λA-preliminary is the failure rate estimate before accounting for the life test data
on the entire component. This life data will account for any failure causes not included in
the three failure causes considered, and it will also provide additional data on the three
failure causes considered. A better estimate of reliability can be obtained by combining
λA-preliminary with the life test data, using Bayesian techniques. This technique accounts
for the quantity of data by weighting large amounts of data more heavily than small
amounts. λA-preliminary forms the “prior” distribution, comprised of a0 and a0/λA-preliminary.
The empirical data (i.e., test data in this case) is combined with λA-preliminary using the
following equation:
λA = [a0 + ∑(i=1 to n) ai] / [a0/λA-preliminary + ∑(i=1 to n) bi′]

λA is the best estimate of the Component A failure rate, while a0 is the “equivalent”
number of failures of the prior distribution corresponding to λA-preliminary. For these
calculations, a value of 0.5 should be used unless a tailored value can be derived. An example
of this tailoring is provided in Section 2.6 of this book. The equivalent number of hours
associated with λA-preliminary is represented by a0/λA-preliminary. The number of failures
experienced in each source of empirical data is a1 through an. There may be “n” different
sources of data available (for example, each of the “n” sources corresponds to individual
tests or field data from the population of products). The equivalent number of cumulative
operating hours experienced for each individual data source is b1’ through bn’. These
values must be converted to equivalent hours by accounting for any accelerating effects
between the use conditions.

The same methodology is applied to Component B, and λB is obtained.


The same methodology is, in turn, applied at the parent level assembly, in which case, the
preliminary estimate is:

λAssembly-preliminary = λA + λB

and the parent assembly failure rate becomes:


λAssembly = [a0 + ∑(i=1 to n) ai] / [a0/λAssembly-preliminary + ∑(i=1 to n) bi′]

where a0 is the “equivalent” number of failures of the prior distribution corresponding to
λAssembly-preliminary, and the values for ai and bi correspond to the Assembly life test data.
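The combination rule at either level reduces to a few lines of code. The sketch below uses illustrative numbers, and the equivalent hours are assumed to have already been adjusted for any accelerating effects:

    def bayes_combine(lam_prior, a0, failures, equiv_hours):
        """Combine a prior failure-rate estimate with empirical data sources."""
        return (a0 + sum(failures)) / (a0 / lam_prior + sum(equiv_hours))

    # Component A: prior of 2 failures per 10^6 hours (a0 = 0.5), plus one
    # life test with 1 failure in 3e5 equivalent hours
    lam_A = bayes_combine(2e-6, 0.5, [1], [3e5])
    print(lam_A * 1e6)  # ~2.73 failures per 10^6 hours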

6.10. Estimating the Number of Failures


There are many cases in which the desired outcome of a reliability analysis is the
expected number of failures. This is appropriate, for example, when calculating spares
requirements or warranty returns. The techniques described in this book are useful for
estimating either failure rates or probability of failure.

If the outcome of the analysis is a failure rate, then the expected number of failures is:

Nf = λt
where:

Nf = the number of expected failures
λ = the failure rate
t = the cumulative operating time

This can be seen by reviewing the units in this relationship:

Nf = λt = [Failures / (# parts × operating time per part)] × (# parts × operating time per part) = Failures

100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC


250
Chapter 6: Interpretation of Reliability Estimates

This equation is usually used for repairable systems.

If the output of the analysis is a life model that describes the distribution of TTFs for a
specific set of conditions, the number of failures is:

Nf = N[F(t2) − F(t1)]

where:

Nf = the number of expected failures
N = the total number of parts in the population
F(t1) = the cumulative probability function at time t1
F(t2) = the cumulative probability function at time t2
t1 and t2 are the times between which the failure probability is to be evaluated

In this case, since “F” is a (unitless) probability value, the total population is scaled by
the probability of failure in the time interval of interest. This is identical to the expected
value of the binomial distribution of the number of failures.
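Both calculations are sketched below; the failure rate, Weibull parameters, and times are illustrative inputs only:

    import math

    def failures_constant_rate(lam, cumulative_hours):
        """Nf = lambda * t for a repairable population."""
        return lam * cumulative_hours

    def failures_weibull(n_parts, beta, eta, t1, t2):
        """Nf = N * [F(t2) - F(t1)] for a Weibull life model."""
        F = lambda t: 1.0 - math.exp(-((t / eta) ** beta))
        return n_parts * (F(t2) - F(t1))

    print(failures_constant_rate(2e-6, 5e6))                     # 10.0 failures
    print(failures_weibull(10_000, 1.5, 50_000.0, 0.0, 8760.0))  # ~707 failures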

6.11. Calculation of Equivalent Failure Rates


In many cases, it is advantageous to calculate an “equivalent” failure rate from the results
of a reliability model that yield a non-constant failure rate as its output. For example, if a
reliability model estimates that a certain percent fail will occur at a given time (based on
the non-constant failure rate model), the equivalent constant failure rate can be calculated
as follows:

The reliability function for a constant failure rate is:

R = e^(−λt)
The equivalent failure rate can be obtained by solving the above equation for the failure
rate:

λ = −ln(R) / t

Reliability Information Analysis Center


251
Chapter 6: Interpretation of Reliability Estimates

The resulting failure rate value is equal to a failure rate that will result in the same
cumulative percent fail as predicted by the non-constant model at the specific time that
the reliability is calculated. If a different time is chosen, a different value will be
obtained.

This technique can be used when the reliability of some parts of a system is calculated
with non-constant failure rate models and others are calculated with a constant failure
rate. It can also be used when modeling “one-shot” devices, which will simply have a
probability of failure instead of a failure rate.
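A two-line sketch of the conversion, with illustrative inputs:

    import math

    def equivalent_constant_rate(reliability, t):
        """Constant failure rate matching the cumulative percent fail at time t."""
        return -math.log(reliability) / t

    # e.g., a wearout model predicting R = 0.98 at 10,000 hours
    print(equivalent_constant_rate(0.98, 10_000.0))  # ~2.02e-6 failures/hour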

6.12. Failure Rate Units


The output of a reliability model can include a host of potential metrics, including:

• Mean life
• Median life
• MTBF
• Failure rate
• Time to X% fail
• B10 life
• Distribution parameters:
o Weibull characteristic life and shape parameter
o Lognormal mean and standard deviation

If a constant failure rate distribution is used, there are various units of failure rate
possible. Some of these are:

• Failures per hour
• Failures per million hours
• Failures per billion hours
• Percent failure per thousand hours

“Failures per hour” is the fundamental unit. All of these failure rate units can be
translated to each other with a constant multiplication factor. For example, “Failures per
million hours” times 1000 equals “Failures per billion hours”, and “Percent failure per
thousand hours” is equivalent to “Failures per hundred thousand hours”.

In the above cases, the “life unit” shown is in hours (i.e., time), but it does not necessarily
need to be. Other possible life units are cycles, miles, missions, operations, etc.
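Because all of these units differ only by constant factors, conversions are mechanical, as in the sketch below (the constants are expressed in the fundamental unit of failures per hour):

    FPMH = 1e-6         # one failure per million hours
    FIT = 1e-9          # one failure per billion hours (1 FIT)
    PCT_PER_KHR = 1e-5  # one percent failure per thousand hours

    lam = 2 * FIT                    # 2 FITs, in failures per hour
    print(lam / FPMH)                # 0.002 failures per million hours
    print((5 * FPMH) / PCT_PER_KHR)  # 5 F/10^6 hours = 0.5 %/1000 hours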


Additionally, if the life unit in the above listed metrics is time (hours), it can refer to the
number of operating hours, calendar hours, flight hours, etc. Reliability prediction
methods like MIL-HDBK-217 use operating hours as the life unit, whereas 217Plus uses
calendar hours as the life unit. Calculation of the operating failure rate using MIL-
HDBK-217 makes the implicit assumption that the failure rate during non-operating
periods is zero, unless the non-operating failure rate is otherwise accounted for.

However, in all cases the life unit refers to the cumulative value of the population. For
example, if the failure rate unit of “Failures per million hours” is used, the million hours
refers to the cumulative time of the entire population, i.e. the sum of each component’s
number of hours.

6.13. Factors to be Considered When Developing Models


This section discusses a few of the factors that should be considered in the development
of a reliability model. It is by no means an exhaustive list, but it is included here to give
the reader ideas on the types of factors that should be considered.

6.13.1. Causes of Electronic System Failure


An assumption often made when using traditional reliability prediction methodologies is
that the failure rate of a product or system is primarily determined by the components
comprising the system. A significant number of failures also stem from non-component
causes such as defects in design and manufacturing. Historically, these factors have not
been explicitly addressed in prediction methods. The data in Figure 6.13-1 contains the
nominal percentage of failures attributable to each of eight identified predominant failure
causes based on failure mode data collected by the RIAC on electronic systems.


Figure 6.13-1: Nominal Failure Cause Distribution of Electronic Systems

The definitions of failure causes are:


• Parts (22%): Failures resulting from a part (i.e., microcircuit, transistor, resistor,
connector, etc.) failing to perform its intended function. Examples
include part failures due to poor quality; manufacturer or lot
variability; or any process deficiency that causes a part to fail before
its expected wearout limit is reached.
• Design (9%): Failures resulting from an inadequate design. Examples include
tolerance stack-up, unanticipated logic conditions (e.g., sneak paths), a
non-robust design for given environmental stresses, etc.
• Manufacturing (15%): Failures resulting from anomalies in the manufacturing
process that are not related to the inherent reliability of a part, i.e.,
faulty solder joints, inadequate wire routing resulting in chafing, bent
connector pins, etc.
• System Management (4%): Failures traceable to faulty interpretation of system
requirements, imposition of “bad” requirements (missing, inadequate,
ambiguous or contradictory), or failure to provide the resources
(funding and/or personnel) required to design and build a reliable
product or system.


• Wearout (9%): Failures resulting from wearout-related failure mechanisms due
to basic device physics. Examples of electronic components
exhibiting wearout-related failure mechanisms are electrolytic
capacitors, solder joints, microwave tubes (such as TWTs), and switch
and relay contacts.
• No defect (20%): Perceived failures that cannot be reproduced upon further
testing. These may or may not be an actual failure; however, they are
removals and, therefore, are typically counted toward the logistic
failure rate (or MTBF). Examples include the inability of the
maintenance environment to recreate the operational environmental
stresses under which the original failure occurred, or “looser”
tolerances on the test equipment than on the platform or system from
which the defective unit was taken.
• Induced (12%): Failures resulting from an externally applied stress. Examples
are electrical overstress and maintenance-induced failures (i.e.,
dropping, bending pins, etc.).
• Software (9%): Failures of a system to perform its intended function due to the
manifestation of a software fault.

While there are reliability assessment methods for specific causes listed above, (i.e.,
components, software, etc.) there are few methodologies that attempt to take a holistic
view of system reliability and integrate them into a single methodology. One example of
a methodology that attempts to do this is 217Plus, which is described in Chapter 7.

6.13.2. Selection of Factors


The process of reliability assessment can be viewed as an IPO model, which has input
parameters (I), the process or models used to assess the reliability as a function of those
input parameters (P), and an output (O). This is illustrated in Figure 6.13-2.


[Figure: IPO block diagram; the inputs (initial conditions and stresses) feed the assessment process, which produces reliability metrics as the output]

Figure 6.13-2: IPO Model

Examples of the IPO variables, as applied to reliability modeling, are shown in Table
6.13-1.

Table 6.13-1: Factors to be Considered in a Reliability Model


Input
  Initial Conditions
    Defect-Free
    Defects
      Intrinsic
        • Voids
        • Material Property Variation
        • Geometry Variation
        • Contamination
        • Ionic Contamination
        • Crystal Defects
        • Stress Concentrations
      Extrinsic
        • Organic Contamination
        • Nonconductive Particles
        • Conductive Particles
        • Contamination
        • Ionic Contamination
  Stresses
    Operational
      • Thermal
      • Electrical
      • Chemical
      • Optical
    Environmental
      • Chemical Exposure
      • Salt Fog
      • Mechanical Shock
      • UV Exposure
      • Drop
      • Vibration
      • Temperature – High and Low
      • Temperature Cycling
      • Humidity
      • Atmospheric Pressure – Low and High
      • Radiation – EMI, Cosmic
      • Sand and Dust


Table 6.13-1: Factors to be Considered in a Reliability Model (continued)


Process
  This is the reliability assessment process, using the various techniques described in this book.
Output
  • Mean Life
  • Median Life
  • MTBF
  • Failure Rate
  • Time to X% Fail
  • B10 Life
  • Distribution Parameters
    o Weibull Characteristic Life and Shape Parameter
    o Lognormal Mean and Standard Deviation

Additional information will be discussed in Chapter 8.

Given the stochastic nature of reliability prediction for many failure causes, it is
impossible, in all but the simplest of cases, to develop a model that is adequately sensitive
to all conceivable factors. That is why model developers need to select what are believed
to be the most relevant factors, and then model accordingly.

This highlights the fact that reliability assessment falls into two distinct categories: the
modeling of intrinsic and extrinsic failure causes. Intrinsic failure causes are generally
those whose root cause is from a known failure mechanism that affects the entire
population of product. These can often be predicted within acceptable bounds by
understanding the stresses, the material properties, etc.

Extrinsic failure causes are those resulting from unpredictable causes, often a complex
sequence of events that ultimately results in the failure cause. Unfortunately, many real
world situations fall into this category. It’s unfortunate because these are the ones whose
likelihood is most difficult to predict. Generally, components that have very low failure
rates are governed by these mechanisms. It is often those unexpected, unpredictable
things that happen somewhere upstream in the process, or in the supplier’s process. This
is the premise behind the 217Plus system assessment methodology. While it is difficult
to predict the likelihood of these “extreme events”, or even identify the failure cause a
priori, it is possible to assess controllable factors that have a relationship to the likelihood
of experiencing the failure cause.

6.13.3. Reliability Growth of Components


Another issue facing reliability model developers is the manner in which reliability
growth is accounted for. A good model reflects state-of-the-art technology. However,
empirical models are usually developed from the analysis of field data, which takes time

to collect. The faster the growth, the more difficult it is to derive an accurate (i.e.,
“current”) model.

As an example of this reliability growth effect, Table 6.13-2 contains, for each generic
component electronic type, the growth rate that has been observed from data collected by
the RIAC. These reliability growth factors are included in the 217Plus component
models. The growth rate model used for each component for this purpose is:

λ ∝ e^(−β(t1 − t2))

where:

λ = the estimated failure rate as a function of year of manufacture
β = the growth rate
t1 = the year of part manufacture for which a failure rate is estimated
t2 = the year of manufacture of parts on which the data was collected
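A sketch of applying the growth model follows; the observed rate and years are illustrative, while the growth rate is the bipolar transistor value from Table 6.13-2:

    import math

    def grown_failure_rate(lam_observed, beta, t1, t2):
        """Scale an observed failure rate to a different year of manufacture."""
        return lam_observed * math.exp(-beta * (t1 - t2))

    # e.g., data collected on year-2000 parts, estimate for year-2005 parts
    print(grown_failure_rate(0.05, 0.281, 2005, 2000))  # ~0.0123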

Table 6.13-2: Failure Rate Data Summary

Component Type                              Growth Rate (β)
Capacitor, Ceramic 0.0082
Capacitor, Electrolytic 0.229
Capacitor, Tantalum 0.229
Connectors 0.23
Diode, General Purpose 0.223
Diode, Schottky 0.297
Diode, Zener 0.150
IC, Digital, Nonhermetic 0.473
IC, Hermetic (All Types) 0.33
IC, Linear, Nonhermetic 0.293
IC, Memory/Microprocessor, Nonhermetic 0.479
Inductors 0.0
LED 0.34
Optoelectronic Devices 0.087
Relays 0.0
Resistors, All Types 0.00089
Switches 0.0
Thyristors 0.20
Transformers 0.0
Transistor, Bipolar 0.281
Transistor, FET, N-Channel 0.397
Transistor, Microwave 0.269


6.13.4. Relative vs. Absolute Humidity


There are many failure mechanisms that are accelerated by the combination of
temperature and humidity. When modeling failure causes that are a function of humidity,
a question arises as to whether the model should be a function of relative humidity or
absolute humidity. The appropriate metric to use will depend on whether the failure
cause is a function of the absolute amount of water at the surface of the item under
analysis. If this is the case, absolute humidity is probably the appropriate measure. The
relationship between absolute and relative humidity is illustrated in Figure 6.13-3.

Figure 6.13-3: Relationship Between Absolute and Relative Humidity
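When absolute humidity is the appropriate measure, it can be derived from relative humidity and temperature. The sketch below uses the Magnus approximation for saturation vapor pressure; this particular formula is a common engineering approximation, not one given in the text:

    import math

    def absolute_humidity(rh_percent, temp_c):
        """Absolute humidity (g/m^3) from relative humidity and temperature."""
        # Magnus approximation for saturation vapor pressure, in hPa
        svp = 6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))
        vapor_pressure = svp * rh_percent / 100.0
        return 216.7 * vapor_pressure / (temp_c + 273.15)

    print(absolute_humidity(50.0, 25.0))  # ~11.5 g/m^3 at 25 C and 50% RH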

6.14. Addressing Data with No Failures


In many cases, reliability estimates are made with data containing few or no failures. The
analyst must be careful when using this data to estimate reliability. The true failure rate
of a component is only available after prolonged operation, but reliability estimates are
usually required before this data becomes available. In other words, the analyst needs a
leading indicator of reliability, not a lagging indicator. Therefore, before the component
has experienced enough operating time to estimate the true failure rate, there may be
some data available. This data is often a certain number of operating hours with no


observed failures. A common way of utilizing this data is to estimate a single-sided
confidence level of the failure rate, based on the observed number of operating hours.

As an example, consider a situation in which a component’s true failure rate is 0.1
failures per million operating hours. Figure 6.14-1 illustrates the 60% and 90% upper
bound estimates as a function of the number of operating hours. For example, if there are
1 million observed operating hours, then the upper bound of the failure rate, at a 60%
confidence level, is 0.916 failures per 10^6 hours. In other words, there is 60% confidence
that the true failure rate is less than 0.916 failures per 10^6 hours. Only after there have
been a total of 6 to 8 million operating hours is the 60% upper bound a reasonable estimate.

Figure 6.14-1: Estimated Upper Bound Failure Rates vs. Operating Time at 60% and 90% Confidence
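With zero failures, the chi-square bound reduces to the closed form −ln(1 − CL)/t, which reproduces the values above; a minimal sketch:

    from math import log

    def zero_failure_upper_bound(device_hours, confidence):
        """Single-sided upper bound on the failure rate with no failures observed."""
        return -log(1.0 - confidence) / device_hours

    print(zero_failure_upper_bound(1e6, 0.60) * 1e6)  # ~0.916 failures per 10^6 hours
    print(zero_failure_upper_bound(1e6, 0.90) * 1e6)  # ~2.30 failures per 10^6 hours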

Using a single-sided failure rate bound for reliability estimates can be dangerous, because
they can be very pessimistic. Exactly how pessimistic is determined by the number of
operating hours relative to the true failure rate. Moreover, if the upper bound is used on
multiple components in an assembly, then the pessimism in the assembly failure rate
estimate is compounded.


The Bayesian techniques described previously are a way to address the issue of few or no
failures. This is, in fact, the premise of the 217Plus methodology. This approach, while
it requires a prior estimate, can alleviate the pessimistic nature of reliability estimates
made only from an observed number of hours with no failures.

Another related approach is to pool “like” data together for the purpose of estimating a
failure rate. For example, if a component has no failures, but there is also data available
on other components within the “family” of components, the data can be combined. An
example of this approach is described in the section on NPRD (Section 7.4). In that case,
the pooling occurs as a function of part type, quality and environment. The algorithm
used in that case was similar to a Bayesian approach, but was tailored to the specific
constraints of the data.

6.15. Reliability of Components Used Outside of Their Rating


A significant issue with the application of commercial microcircuits used in severe
environments is the temperature rating of the part. The associated temperature range over
which a manufacturer will guarantee performance is limited to that of a commercial part,
i.e. typically 0 to 70 degrees C. Military and aerospace applications often require
guaranteed performance over wider temperature ranges, i.e. -55 to 125 degrees C. While
this is not a reliability prediction issue per se, it does confound the definition of failure
criteria. For example, although a part may not perform beyond its rated temperature, it
usually does not catastrophically fail and, therefore, is not considered a reliability failure.
However, many practitioners do consider this a reliability issue and, as such, turn to
reliability models for the quantification of the microcircuit reliability in their specific
extended range application. There are no reliability models currently available that
can quantify the reliability of parts when used beyond their rating. All existing
models make the implicit assumption that parts are used within their rating. A separate,
but critical, requirement for the reliable application of components is the qualification of
parts and manufacturers to ensure that specific parts will function reliably in the intended
application.

The application of a component beyond its rated value of stress can result in one or more
undesired effects. First, there can be reliability ramifications, which can manifest
themselves in a variety of ways: either as a sudden, catastrophic failure or as a latent
failure. The detectability of the first is much better, since it can be observed with product
or system testing. Latent failures are much more difficult to detect, and require more
testing and modeling using the techniques described in this book. The second type of
undesired effect is related to component performance. Performance characteristics can
either be permanently degraded or they may be subject to a “reversible” process in which
the performance recovers after the overstress condition is taken away. In any event, these
possible undesired effects should be studied and understood before applying components
beyond their rated stress values.

6.16. References
1. http://www.mortality.org
2. Farachi, V., "Electronic Component Failure Rate Prediction Analysis," RIAC
Journal, November 2006.


7. Examples
This chapter presents several examples of reliability models, intended to provide a
cross section of several different methodologies. The focus of the examples is on
methodologies that the author has personally developed and, thus, can provide insight
into the logic and rationale for their development. Several examples were previously
presented in Chapter 2, but not in detail. This section presents more detail regarding
model factors, development methods, etc.

The following examples are provided:

1. MIL-HDBK-217 Model Development Methodology – The generic modeling
methodology for many of the models contained in MIL-HDBK-217 is presented
in this section. Not all of the models in the handbook have been developed using
this methodology, but the majority have been. This is presented so that the reader
can gain an understanding of the approach and methodology used, and to provide
insight into the decisions faced by the model developer.
2. 217Plus Reliability Models – 217Plus is the methodology developed by the
RIAC to fill the void left after MIL-HDBK-217 was no longer scheduled to be
updated. The approach taken in the development of this methodology was quite
different than the methodology for MIL-HDBK-217. It was intended to be a
holistic approach in which all primary causes of electronic system failure were
accounted for. Therefore, factors addressing non-component reliability were
considered. It was also intended to be holistic in terms of its ability to leverage
experience from predecessor systems, and utilize information from empirical
testing. The general approach for this methodology was previously presented in
Chapter 2 in the “Combining Data” section. The additional information presented
in this section presents the details on the remaining portions of the methodology.
Additionally, the development of models for several different components is
presented. First is the development of the original twelve electronic part types.
For these models, sufficient field reliability data was available. The second
component models presented are for photonic component types. For these, very
little field data was available, and, therefore, the original 217Plus approach
needed to be tailored.
3. Life Model Example – The intent of the life modeling example that will be
presented is to illustrate an application of the life modeling methodologies
previously discussed. This is a hypothetical example, but provides information
pertaining to the various elements of life modeling.


4. NPRD – This section, covering the RIAC “Nonelectronic Parts Reliability Data
(NPRD)” publication, is presented to illustrate the nuances of field reliability data,
the manner in which data is merged, and the manner in which it is used in
reliability modeling. Some of this information was previously presented in
Chapter 2 in the section on the use of field data, but more detail will be presented
here. This will hopefully provide the user with an appreciation for both the uses
and limitations of this type of data.

The examples presented in this section were selected to provide a cross-section of various
methodologies, including prediction, assessment and estimation, and they complement the
information previously provided in Chapter 2.

7.1. MIL-HDBK-217 Model Development Methodology


MIL-HDBK-217 is probably the most widely used of the empirically-based reliability
prediction methodologies. The basic premise of the handbook is the use of historical
piece-part test and field failure rate data as the basis for predicting future product or
system reliability. The handbook includes failure rate models for most electronic part
types, and many electromechanical part types. The latest version of MIL-HDBK-217 is
“F, Notice 2”, dated 28 February 1995 (see Footnote 8). The handbook was almost a casualty of Perry’s
DoD Acquisition Reform initiative, but it survived primarily on the wide use of, and
dependency on, the methodology throughout the military-industrial complex and the lack
of a suitable replacement.

The models that are currently contained in MIL-HDBK-217 have been developed by
various organizations, using a variety of techniques for their development. However,
Reference 1 will be used to illustrate a typical model development methodology. The
study documented in this report developed the models for discrete semiconductor
devices. Excerpts from this report are summarized within this section. The model
development methodology is shown in Figure 7.1-1. Each of the elements in this
methodology is further examined below.

Footnote 8: As noted previously, as of the publication date of this book, a draft of MIL-HDBK-217G is currently in the works, with an
anticipated release in 2010.

Figure 7.1-1: MIL-HDBK-217 Model Development Methodology


7.1.1. Identify Possible Variables


The first step in the modeling methodology is to identify possible model factors. In this
example, the possible factors were:

• Device Style
• Power Rating
• Package Type
• Semiconductor Material
• Structure (NPN, PNP)
• Electrical Stress
• Circuit Application
• Quality Level
• Duty Cycle
• Operating Frequency
• Junction Temperature
• Application Environment
• Complexity
• Power Cycling

7.1.2. Develop Theoretical Model


A series of theoretical failure rate prediction models is hypothesized to provide the
resultant models with a sound theoretical/engineering backing. Basically, theoretical
model development involves evaluation of the effects of the parameters identified in the
previous phase. In addition, the optimal model form (i.e., additive, multiplicative, or a
combination) is determined and the time dependency of the discrete semiconductor
failure rates is studied.

The development of the theoretical device failure rate prediction models is an integral
part of the overall model development process. Information collected through literature
searches and discrete semiconductor user and vendor surveys is reviewed and evaluated
to aid in the development of theoretical models for each discrete semiconductor device
type group. The theoretical models serve the following functions:

1. Assure that the prediction models conform to physical and chemical principles
2. Select variables when it is not possible to determine them using purely statistical techniques

In general terms, the theoretical models were of the following form.


λ = λb × πT × πE × πQ × ∏(i=1 to n) πi

where:

λ = theoretical failure rate prediction
λb = base failure rate, dependent on device style
πT = temperature factor (based on the Arrhenius relationship)
πE = environment factor
πQ = quality factor based upon device screening level and hermeticity
∏πi = the product of Pi-factors based upon variables, from the potential list of input
variables, found to have a significant effect on the discrete semiconductor failure rate

7.1.3. Collect and QC Data


The collection of empirical reliability data is integral to the approach used in model
development. Four specific data collection tasks were defined.

The first task was a system/equipment identification process. A survey of numerous
military equipments was conducted to identify system/equipments meeting predetermined
criteria established to ensure plentiful and accurate data.

The second task was an extensive survey of discrete semiconductor manufacturers and
users.

The third task was in-person visits to organizations where data could not be accessed by
other means.

The final data collection task was the compilation of data referenced in the literature and
documented technical studies. Also, as part of this task, additional contact was made
between the authors and/or study sponsors to determine whether more data was available.

The results of the four specific data collection tasks are described in the following
sections.

Five minimum criteria were established to define an acceptable data source. Each
potential equipment selection was evaluated with these criteria before proceeding with
data summarization. These five criteria were:

1. Data available to the part level
2. Primary failures could be separated from total maintenance actions
3. Sufficient detail, including stress levels, could be identified for the components
4. Part hours could be precisely determined
5. Sufficient equipment hours existed to expect discrete semiconductor failures

In addition to these criteria, the following factors were considered:

1. Number of different discrete semiconductor part types
2. Existence of low-population and state-of-the-art parts
3. Application environment
4. Age of data

Data summarization consisted of the extraction and compilation of the desired data
elements from the source reports and/or supporting documentation, and coding the data
for computer entry. Data summarization consisted of the following five tasks for sources
of field data:

1. Identification of discrete semiconductor part types within the chosen equipment
2. Determination of part characterization information
3. Identification of relevant part failures
4. Determination of applicable electrical and environmental stress levels
5. Determination of equipment operating histories

The data collected for this effort is summarized in Table 7.1-1, which includes, for each
part type, the number of observed failures and operating hours. In addition to this data,
other information was captured, such as quality level, environment, etc.

7.1.4. Correlation Coefficient Analysis


Using the multiple linear regression technique makes the implicit assumption that the
variables under analysis are independent, and not correlated. In practice, however,
factors are often highly correlated, thus making it difficult to deconvolve the effects that
each factor has.


Table 7.1-1: Data Collected for Model Development


Part Class                         Failures   Part Hours (Millions)

Switching Diode                    86         916.91
Rectifier Diode 471 7745.48
Voltage Regulator Diode 228 1154.84
Voltage Reference Diode 282 2951.22
Current Regulator Diode 2 13.54
Transient Suppressor Diode 7 6.58
PNP Transistor, <5W 2330 24706.61
NPN Transistor, <5W 246 1845.35
PNP Transistor, > 5W 52 75.10
NPN Transistor, > 5W 89 112.24
Dual Transistor 1 7.05
Darlington Transistor 57 76.58
JFET 878 5177.81
MOSFET 209 431.77
Unijunction Device 19 68.23
Thyristor 245 1013.18
Schottky Microwave Diode 18 129.39
Tunnel Diode 72 234.45
Varactor 30 173.2
PIN Diode 1857 13413.37
Microwave Power Transistor 2612 1138.70
LED 22 4827.08
Infrared Emitting Diode (IRED) 0 39.1
Alphanumeric Display (Segment) 144 636689.67
Alphanumeric Display (Display) 4 646.09
Photodetector 7 47.0
Opto-isolator 170 595.96

An example of this is the correlation between quality and environment. This correlation
exists because higher quality parts are often used in the more severe environments. As
such, the analyst’s options are to:

1. Keep the factors as derived, with the caveat that they may be in error
2. Treat the factors as a combined, “pooled” factor representing the correlated
variables
3. Use alternate approaches to quantifying the effects of either or all correlated
variables


7.1.5. Stepwise Multiple Regression Analysis


This step in the analysis consists of the following:

1. Each factor is linearized in accordance with the desired acceleration model
2. The regression is performed and coefficients are estimated

For example, consider the following model in which the factors to be included are the
base failure rate, a temperature factor and a stress factor:

λ = λb × πT × πS

or:

λ = λb × e^(−Ea/KT) × S^n

Taking the log of both sides yields:

ln λ = ln λb + ln e^(−Ea/KT) + ln S^n

ln λ = ln λb − Ea/KT + n ln S

or:

λ = e^(ln λb − Ea/KT + n ln S)

The transforms are shown in Table 7.1-2.

Table 7.1-2: Data Transforms


Variable Transform
Observed failure rate ln λ
Temperature -1/T
Stress ln S

When the regression is performed, the intercept is “ln λb”, and the temperature factor and
stress coefficients are “−Ea/K” and “n”, respectively. In MS Excel, the LINEST function
is used to determine the model coefficients. These are the values used in the original
equation:
λ = λb × e^(−Ea/KT) × S^n
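
As an illustrative sketch (not part of the original study), the same linearized regression can be performed in Python with numpy in place of the LINEST function; all data values below are hypothetical:

    import numpy as np

    # Hypothetical observations: failure rate, temperature (K), stress ratio
    lam = np.array([0.5, 1.2, 3.4, 8.9, 20.0])
    T = np.array([298.0, 323.0, 348.0, 373.0, 398.0])
    S = np.array([0.30, 0.40, 0.50, 0.60, 0.70])

    # Transforms from Table 7.1-2: regress ln(lambda) on -1/T and ln(S)
    y = np.log(lam)
    X = np.column_stack([np.ones_like(T), -1.0 / T, np.log(S)])

    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    ln_lam_b, Ea_over_K, n = coef  # intercept ln(lambda_b); slopes Ea/K and n
    print(np.exp(ln_lam_b), Ea_over_K, n)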

If categorical variables are to be modeled, they can be modeled with regression analysis
by assigning a “1” or a “0” to the variable, and performing the regression as described
above. As an example, consider the case in which the product or system to be modeled
has temperature, stress, environment and quality as the four variables affecting the
reliability. This is shown in Table 7.1-3.

Table 7.1-3: Regression Data Including Categorical Variables


                                              Environment       Quality
Dependent variable   Temperature   Stress   GB   AI   GM   Commercial   Industrial   Military
ln(λ1)               1/T1          lnS1      0    1    0        1            0           0
ln(λ2)               1/T2          lnS2      1    0    0        0            0           1
ln(λ3)               1/T3          lnS3      0    0    1        1            0           0
ln(λ4)               1/T4          lnS4      0    1    0        0            1           0
ln(λ5)               1/T5          lnS5      1    0    0        1            0           0

The equation above, expanded with the inclusion of the categorical variables, becomes:

λ = e^(ln λb − Ea/KT + n ln S + A1·GB + A2·AI + A3·GM + A4·Comm + A5·Ind + A6·Mil)

where the Ai are the coefficients of the categorical variables determined from the regression
analysis.
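
A minimal continuation of the regression sketch above, showing how the 0/1 indicator columns of Table 7.1-3 enter the design matrix. The data and coding are hypothetical; note that with an intercept in the model, one level per categorical group is typically dropped (here GB and Commercial) to avoid collinearity:

    import numpy as np

    y = np.log(np.array([0.5, 1.2, 3.4, 8.9, 20.0]))   # ln(lambda), hypothetical
    T = np.array([298.0, 323.0, 348.0, 373.0, 398.0])
    S = np.array([0.30, 0.40, 0.50, 0.60, 0.70])

    # 0/1 indicator columns as in Table 7.1-3 (GB used as the reference level)
    env_AI = np.array([1, 0, 0, 1, 0])
    env_GM = np.array([0, 0, 1, 0, 0])

    X = np.column_stack([np.ones_like(T), -1.0 / T, np.log(S), env_AI, env_GM])
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)  # intercept, Ea/K, n, and the categorical coefficients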

7.1.6. Goodness-of-Fit Analysis


There are several ways to analyze how well a model fits the data. The standard error
provides an indication of the significance of the specific factor under analysis; it is the
standard deviation of the coefficient estimate. Therefore, if the standard error is small
relative to the coefficient estimate, this is an indication that the factor is statistically
significant; likewise, the opposite is also true.


Residual plots are also useful in assessing how well the model predicts reliability: the
smaller the residuals, the better the model.

Another useful plot, similar to a residual plot, is obtained by plotting the log10 of the
observed-to-predicted failure rate ratio. If this metric is relatively tightly clustered and
centered around zero, this is an indication of a good model.
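
A brief sketch (hypothetical numbers) of the log10 observed-to-predicted metric described above:

    import numpy as np

    observed = np.array([1.1, 2.3, 0.9, 4.8])    # hypothetical observed failure rates
    predicted = np.array([1.0, 2.0, 1.2, 4.0])   # corresponding model predictions

    metric = np.log10(observed / predicted)
    # Tightly clustered around zero indicates a good model
    print(metric, metric.mean(), metric.std())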

7.1.7. Extreme Case Analysis


One of the potential problems in using a multiplicative model form is that extreme value
problems can arise. For example, when all input factors are simultaneously at their high
or low values, the resultant predicted failure rate can be unrealistically high or low. This
situation can be addressed with the use of different model forms, such as in the case of
the RIAC 217Plus models, in which a combination additive and multiplicative model
form is used.

7.1.8. Model Validation


The last step in the process is to validate the model. This is accomplished by ensuring
that the resulting models fit the observed data to a reasonable degree. Additionally, the
models can be checked against observed data not used in the model development.
Valuable data for this purpose is data at levels above the component level. In many
cases, high quality data can be obtained on systems or assemblies, but not at the part
level. This occurs due to the level at which maintenance is performed and data is
captured. Therefore, while the data cannot be used for model development, it can be used
for model validation.

Another thing that must be accounted for in the model validation effort is the scaling of
base failure rates to account for data in which there were no observed failures. The
methodology presented in this section is based on the premise that there exists a point
estimate of the dependent variable, in this case the failure rate. In cases where there are
no failures, a point estimate is not possible; only a single-sided confidence bound on the
failure rate is. This confidence bound value cannot be used to represent the data, since
the resultant model will be pessimistic (i.e., the failure rate will be artificially increased).
Using only the data points for which there are failures is also not appropriate, because it
too will artificially bias the model pessimistically. Potential solutions to this situation
include:

• Scaling the base failure rates to reflect the zero-failure data. One possible
alternative to accomplish this is to scale the base failure rates with the boundary
condition that the predicted number of failures in the entire dataset equals the
observed number (see the sketch following this list).
• Use of maximum likelihood (MLE) parameter estimation techniques. These MLE
techniques are especially suited to censored data such as zero failures.
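
A minimal sketch (hypothetical data) of the first option, scaling the predicted failure rates so that the total predicted failures over the dataset, including the zero-failure records, equal the total observed failures:

    import numpy as np

    hours = np.array([2.0e6, 5.0e5, 1.3e6, 8.0e5])      # part hours per data record
    failures = np.array([12, 0, 7, 0])                  # observed failures (two zero-failure records)
    predicted = np.array([5.0, 1.0, 6.0, 2.5]) / 1.0e6  # unscaled predicted failure rates

    k = failures.sum() / (predicted * hours).sum()      # boundary condition: totals match
    scaled = k * predicted                              # scaled base failure rates
    print(k, scaled * 1.0e6)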

7.2. 217Plus Reliability Prediction Models


7.2.1. Background
In 1994, Military Specifications and Standards Reform (MSSR) decreed the adoption of
performance-based specifications as a means of acquiring and modifying weapons
systems. This led to the cancellation of many military specifications and standards. This,
coupled with the fact that the Air Force had re-directed the mission of Rome Laboratory
(now the Air Force Research Laboratory, the preparing activity for MIL-HDBK-217)
away from reliability, resulted in MIL-HDBK-217 becoming obsolete, with no
government plans to update it. The RIAC believed that there was a need for a reliability
assessment technique that could be used to estimate the reliability of systems in the field.
A viable assessment methodology needed:

1. Updated component reliability prediction models, since MIL-HDBK-217 was not
to be updated
2. A methodology for quantifying the effect that non-component variables have on
system reliability
3. To be useable by reliability engineers with data that is typically available during
the system development process

The RIAC is chartered with the collection, analysis and dissemination of reliability data
and information. To this end, it publishes quantitative reliability data such as failure rate
and failure mode/mechanism compendiums, as well as failure rate models. It is not
required to provide these services, but does so because there is a need for this data in the
reliability engineering community. It will continue to engage in such activities as long as
there appears to be this need by reliability practitioners. For this reason, the 217Plus
models and methodology were developed.

There are two primary elements to 217Plus, component reliability prediction models and
system-level models. A system failure rate estimate is first made by using the component
models to estimate the failure rate of each component. These failure rates are then
summed to estimate the system failure rate. This is the traditional methodology used in
many reliability predictions, and represents the reliability prediction, i.e., a reliability
estimate that is made before empirical data or detailed assessments are available. This


prediction is then modified in accordance with system level factors, which account for
non-component, or system level, effects. This is an example of a reliability
“assessment”, in which the process and design factors are assessed. Finally, the
prediction and assessment are combined with empirical data to form the reliability
“estimate” of the product, which is the best estimate of reliability based on all analysis
and data available to the analyst.

The goal of component reliability models is to estimate the “rate of occurrence of
failure”, or ROCOF, and the accelerants of a component’s primary failure mechanisms
within an acceptable degree of accuracy. Toward this end, the models should be
adequately sensitive to operating scenarios and stresses, so that they allow the user the
ability to perform tradeoff analysis amongst these variables. For example, the basic
premise of the 217Plus models is that they have predicted failure rates for operating
periods, non-operating periods and cycling. As a result, the user can perform tradeoff
analysis amongst duty cycle, cycling rate, and other variables. As an example, a question
that frequently arises is whether a system will have a higher failure rate if it is
continuously powered on, or whether it is powered off during periods of non-use. The
models in 217Plus are structured to facilitate the tradeoff analysis required to answer this
question.

A flow diagram of the entire approach was presented in Chapter 2, which guides the user
in the application of the component models and the system level models. The basis for
the 217Plus methodology is the component reliability models, which estimate a system’s
reliability by summing the predicted failure rates of the constituent components in the
system. This estimate of the system reliability is further modified by the application of
“System-Level” factors, called Process Grade Factors (PGF). Development of the
component models is presented in Sections 7.2.3 through 7.2.5.

The primary intent of this section is to detail the development of the 217Plus
methodology. It is provided to familiarize the reader with the issues faced by model
developers in order to allow a better understanding of 217Plus and similar models. It
provides details related to certain aspects of model development.

7.2.2. System Reliability Prediction Model

7.2.2.1. 217Plus Background


The premise of traditional methods of reliability predictions, such as MIL-HDBK-217, is
that the failure rate of a product or system is primarily determined by the components
comprising it. Historically, a significant number of failures also stem from non-
component causes such as design deficiencies, manufacturing defects, inadequate
requirements, induced failures, etc., that have not been explicitly addressed in prediction
methods.

The data in Figure 7.2-1, presented previously, contains the nominal percentage of
failures attributable to each of eight identified predominant failure causes based on data
collected by the RIAC. The data in this figure represents nominal percentages. The
actual percentages can vary significantly around these nominal values.

[Pie chart: Parts 22%, No Defect 20%, Manufacturing 15%, Induced 12%, Design 9%,
Software 9%, Wearout 9%, System Management 4%]

Figure 7.2-1: Failure Cause Distribution of Electronic Systems

The definitions of failure causes, as presented in an earlier chapter, are:


• Parts (22%): Failures resulting from a part (i.e., microcircuit, transistor, resistor,
connector, etc.) failing to perform its intended function. Examples
include part failures due to poor quality; manufacturer or lot
variability; or any process deficiency that causes a part to fail before
its expected wearout limit is reached.


• Design (9%): Failures resulting from an inadequate design. Examples include
tolerance stack-up, unanticipated logic conditions (e.g., sneak paths), a
non-robust design for given environmental stresses, etc.
• Manufacturing (15%): Failures resulting from anomalies in the manufacturing
process that are not related to the inherent reliability of a part, i.e.,
faulty solder joints, inadequate wire routing resulting in chafing, bent
connector pins, etc.
• System Management (4%): Failures traceable to faulty interpretation of system
requirements, imposition of “bad” requirements (missing, inadequate,
ambiguous or contradictory), or failure to provide the resources
(funding and/or personnel) required to design and build a reliable
product or system.
• Wearout (9%): Failures resulting from wearout-related failure mechanisms due
to basic device physics. Examples of electronic components
exhibiting wearout-related failure mechanisms are electrolytic
capacitors, solder joints, microwave tubes (such as TWTs), and switch
and relay contacts.
• No defect (20%): Perceived failures that cannot be reproduced upon further
testing. These may or may not be an actual failure; however they are
removals and, therefore, are typically counted toward the logistic
failure rate (or MTBF). Examples include the inability of the
maintenance environment to recreate the operational environmental
stresses under which the original failure occurred, or “looser”
tolerances on the test equipment than on the platform or system from
which the defective unit was taken.
• Induced (12%): Failures resulting from an externally applied stress. Examples
are electrical overstress and maintenance-induced failures (i.e.,
dropping, bending pins, etc.).
• Software (9%): Failures of a system to perform its intended function due to the
manifestation of a software fault

Another example that this author has experience with is shown in Figure 7.2-2, which
represents the distribution observed for Erbium Doped Fiber Amplifiers (EDFAs) used in
long haul telecommunications systems. The distribution is different from that of the above chart,
which is a pooled result from various system types and manufacturers. This example is


provided to illustrate the notion that the system type and manufacturing practices will
dictate the specific distribution obtained.

[Pie chart: Component - Pumps and Other Optical Components 63%, Manufacturing
Defect 21%, No Fault Found 8%, Component - Electrical 7%, Component - Mechanical 1%]

Figure 7.2-2: Optical Amplifier Failure Cause Distribution

7.2.2.2. Methodology Overview


The 217Plus methodology is structured to allow the user the ability to estimate the
reliability of a product or system in the initial design stages when little is known about it.
For example, a reliability prediction early in the development phase of a system can be
made based on a generic parts list, using default values for operational profiles and
stresses. As additional information becomes available, the model allows the incremental
addition of empirical test and field data to supplement the initial prediction.

The purpose of 217Plus is to provide an engineering tool to assess the reliability of
electronic systems. It is not intended to be the "standard" prediction methodology, and it
can be misused if applied carelessly, just as any empirical or physics-based model can.
Also, it is a tool to allow the user the ability to estimate the failure rate of parts,
assemblies and systems. It does not consider the effect of redundancy or perform
FMEAs. The intent of 217Plus is to provide the data necessary as an input to these
analyses. The methodology allows for the modification of a base reliability estimate with
Process Grading Factors for the failure causes listed in Section 7.2.2.1.

These process grades correspond to the degree to which actions have been taken to
mitigate the occurrence of product or system failure due to these failure categories. Once
the base estimate is modified with the process grades, the reliability estimate is further
modified by empirical data taken throughout item development and testing. This
modification is accomplished using Bayesian techniques that apply the appropriate
weights for the different data elements.

Advantages of the 217Plus methodology are that it uses all available information to form
the best estimate of field reliability, it is tailorable, it has quantifiable confidence bounds,
and it has sensitivity to the predominant product or system reliability drivers. The
methodology represents a holistic approach to predicting, assessing and estimating
product or system reliability by accounting for all primary factors that influence the
inability of an item to perform its intended function. It factors in all available reliability
data as it becomes available on the program. It thus integrates test and analysis data,
which provides a better prediction foundation and a means for estimating variances from
different reliability measures.

7.2.2.3. System Reliability Model


The fundamental 217Plus failure rate model for a system is as follows:

λP = λIA × (ΠP + ΠD + ΠM + ΠS + ΠI + ΠN + ΠW) + λSW

The sum of the Pi-factors in the parenthesis represents the cumulative multiplier that
accounts for all of the processes used in system development and sustainment. The sum
of these values is normalized to unity for processes that are considered to be the mean of
industry practices. The individual model factors are:

λP = predicted failure rate of the product or system (in failures per million calendar hours)
λIA = initial assessment of the failure rate based on component failure rate estimates
ΠP = parts process multiplier
ΠD = design process multiplier
ΠM = manufacturing process multiplier
ΠS = system management process multiplier
ΠI = induced process multiplier
ΠN = no-defect process multiplier
ΠW = wearout process multiplier
λSW = software failure rate prediction

Additional factors included in the model account for the effects of infant mortality,
environment, and reliability growth. Since each of these factors does not influence all of
the factors in the above equation, they are applied selectively to the applicable factors.
For example, environmental stresses will generally accelerate part defects and
manufacturing defects to failure. These additional factors are normalized to unity under
average conditions, so that the value inside the parenthesis is one under nominal
conditions and for nominal processes.

λP = λIA × (ΠP·ΠIM·ΠE + ΠD·ΠG + ΠM·ΠIM·ΠE·ΠG + ΠS·ΠG + ΠI + ΠN + ΠW) + λSW

where,

ΠIM = infant mortality factor
ΠE = environmental factor
ΠG = reliability growth factor

The initial assessment of the failure rate, λIA, is the seed failure rate value, which is
obtained by using the 217Plus component reliability prediction models, along with other
available data. This failure rate is then modified by the Pi-factors that account for
specific processes used in the design and manufacture of the product or system, along
with the environment, reliability growth and infant mortality characteristics of the item.

The above failure rate expression represents the total failure rate of the system, which
includes "induced" and "no defect found" failure causes. If the inherent failure rate is
desired, then the "induced" and "no defect found" Pi-factors should be set to zero, since
they represent operational and non-inherent failure causes.
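
A minimal sketch of this roll-up (the factor values below are placeholders, not recommendations; in practice they come from the process grades, environment, growth and infant mortality calculations described in the following sections):

    def system_failure_rate(lam_ia, pi, pi_im=1.0, pi_e=1.0, pi_g=1.0, lam_sw=0.0):
        """lam_ia in failures per million calendar hours; pi holds the seven
        failure-cause multipliers keyed P, D, M, S, I, N, W."""
        return lam_ia * (pi["P"] * pi_im * pi_e
                         + pi["D"] * pi_g
                         + pi["M"] * pi_im * pi_e * pi_g
                         + pi["S"] * pi_g
                         + pi["I"] + pi["N"] + pi["W"]) + lam_sw

    # Average-grade multipliers (the grade 0.50 row of Table 7.2-4):
    avg = {"P": 0.225, "D": 0.087, "M": 0.132, "S": 0.033,
           "I": 0.131, "N": 0.219, "W": 0.098}
    print(system_failure_rate(10.0, avg))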

7.2.2.4. Initial Failure Rate Estimate


An initial estimate of a system failure rate is based on a combination of the component
failure rate models, the empirical field failure rate data contained in the RIAC databases,
or user-defined failure rates from other sources that are entered directly by the user. This
initial failure rate is then used as a seed value that represents a typical failure rate for the
product or system. It is then adjusted in accordance with the PGFs, infant mortality
characteristics, reliability growth characteristics, and environmental stresses. In addition,
software is modeled as a separate failure rate.


All variables in the model default to average values, not worst-case values. As a result,
the user has the option of applying any or all factors, depending on the level of
knowledge of the product or system and the amount of time or resources available for the
assessment. If a traditional reliability prediction is desired, the user can perform it using
the component models and the RIAC database failure rates contained in 217Plus (see Footnote 9). As
additional data and information becomes available, the analysis can be expanded to
include these system-level factors.

7.2.2.5. Process Grading Factors


An objective of the 217Plus system model is to explicitly account for the factors
contributing to the variability in traditional reliability prediction approaches. This is
accomplished by grading the process for each of the failure cause categories. The
resulting grade for each cause corresponds to the level to which an organization has taken
the action necessary to mitigate the occurrence of failures of that cause. This grading is
accomplished by assessing the processes in a self-audit fashion. Any or all failure causes
can be assessed and graded. If the user chooses not to address a specific failure cause,
the model simply reverts to the default "average" value. If the user chooses to apply the
PGF methodology for any failure cause, there are a minimum number of questions that
should be assessed and graded. Beyond this minimum, the user can selectively assess
and grade additional criteria. If answers to the grading questions are not known, the
model simply ignores those criteria. Process grading is used to quantify the following
factors:

• ΠP (parts process multiplier)


• ΠD (design process multiplier)
• ΠM (manufacturing process multiplier)
• ΠS (system management process multiplier)
• ΠI (induced process multiplier)
• ΠN (no-defect process multiplier)
• ΠW (wearout process multiplier)

The sum of the Π factors within the parentheses in the failure rate model is equal to
unity for the average grade. Each factor will increase if "less than average" processes are
in use, and decrease if “better than average” processes are in use.

Footnote 9: The RIAC 217Plus software contains databases that hold the RIAC’s NPRD and EPRD failure rate data, converted to failures per
million calendar hours. The RIAC “Handbook of 217Plus Reliability Prediction Models” does not contain this supplementary data.
The RIAC NPRD and EPRD databooks are available for separate purchase from the RIAC, and are in units of failures per million
operating hours.

Features of this PGF methodology are that it:

• Explicitly recognizes and accounts for special (assignable) cause problems
• Models reliability from the user (or total system-level) perspective
• Promotes cross-organizational commitment to Reliability, Availability and
Maintainability (RAM)
• Quantitatively grades developers' efforts to affect improved reliability
• Maintains continuing organizational focus on RAM throughout the development
cycle

Reference 2 presents the results of the study in which the process grades were
determined.

7.2.2.6. Basis Data for the Model

7.2.2.7. Uncertainty in Traditional Approach Estimates


A goal of 217Plus is to model predominant system reliability drivers. The premise of
traditional methods such as MIL-HDBK-217 is that the failure rate is primarily
determined by the technology and application stress of the components comprising the
product or system. This was a good premise many years ago, when components
exhibited higher failure rates and systems were not as complex as they are today.
Increased item complexity and component quality have resulted in a shift of system
failure causes away from components to more system-level factors, including system
requirements, interface problems and software problems. A significant number of
failures also stem from non-component causes such as defects in design and
manufacturing. Historically, these factors have not been explicitly addressed in
prediction methods. The approach used to develop the 217Plus model was to (1) quantify
the uncertainty in predictions using "component-based" traditional approaches and (2)
explicitly model the factors contributing to that uncertainty.

Data was collected by the RIAC on systems for which both predicted and observed
MTBF data was available. This was done for the purpose of quantifying the uncertainty
in traditional component-based predictions. Table 7.2-1 presents the multipliers of a
failure rate point estimate as a function of confidence level that was derived from analysis
of this data. For example, using traditional approaches, one could be 90% certain that the
true failure rate was less than 7.575 times the predicted value.


Table 7.2-1: Uncertainty Level Multiplier


Percentile Multiplier
0.10 0.132
0.20 0.265
0.30 0.437
0.40 0.670
0.50 1.000
0.60 1.492
0.70 2.290
0.80 3.780
0.90 7.575

7.2.2.8. System Failure Causes


The premise of the 217Plus model developed in the RIAC study was that the failure rate
attributable to the predominant system-level failure causes could be quantified. In
addition to the intrinsic variability associated with the failure rate prediction, there is
additional variability associated with the variance in the distribution of failure causes.
This requires that there be baseline data that quantifies the failure rate of each cause. The
data in Table 7.2-2 was used for this purpose. This table contains, for each source of
data, the percentage of failures attributable to each of the eight identified predominant
failure causes. It should be noted here that the reported percentages of failure due to
some failure causes might be underestimated. For example, system management and
software may be under-reported because failures are usually not attributed to those
categories, even when they are the root cause of failure. This also means that the
percentages from the other causes may be overestimated. Although the authors recognize
that this is likely, the values in the model reflect the reported values. However, if a user
of the model has failure cause distribution information from which the model factors can
be tailored, this data should be used instead of the nominal values.


Table 7.2-2: Percentage of Failures Attributable to Each Failure Cause


Survey Respondent   Part Defect   Mfg. Defect   Design   System Mgt.   Wearout   No Defect   Induced   Software
1 5 38 0 0 0 42 8 8
2 34 28 0 0 39 0 0 0
3 13 5 5 0 3 30 43 0
4 9 31 38 0 6 0 16 0
5 46 10 19 0 12 0 14 0
6 46 25 2 0 12 0 14 0
7 19 39 10 0 10 0 22 0
8 28 28 28 0 0 0 17 0
9 42 42 16 0 0 0 0 0
10 64 0 0 0 17 0 20 0
11 24 28 0 0 6 34 8 0
12 15 13 4 12 6 17 32 1
13 32 1 5 11 27 16 7 0
14 13 10 10 1 13 0 34 20
15 19 3 5 0 5 40 7 20
16 61 5 5 1 15 10 3 0
17 38 15 17 0 12 0 18 0
18 30 19 10 1 11 11 15 3

An analysis was then performed on the Table 7.2-2 data to quantify the distributions of
percentages for each failure cause. This was accomplished by performing a Weibull
analysis of each column. The resulting distributions are summarized in Table 7.2-3.

Table 7.2-3: Weibull Parameters for Failure Cause Percentages


Failure Cause            Characteristic Percentage   Weibull Shape Parameter (beta)
Parts 33.9 1.62
Manufacturing 23.2 0.96
Design 13.9 1.29
System Management 7.1 0.64
Wearout 14.7 1.68
Induced 19.8 1.58
No Defect 31.9 1.92
Software 15.0 0.70
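
As a hedged sketch of this analysis, a two-parameter Weibull can be fitted to each column with scipy. Maximum likelihood is used here, so the results may differ somewhat from the study's estimates (whose fitting method is not specified); the data are the "Part Defect" percentages from Table 7.2-2:

    import numpy as np
    from scipy.stats import weibull_min

    part_defect = np.array([5, 34, 13, 9, 46, 46, 19, 28, 42, 64,
                            24, 15, 32, 13, 19, 61, 38, 30], dtype=float)

    # Location fixed at zero for a two-parameter fit
    beta, _, eta = weibull_min.fit(part_defect, floc=0)
    print(beta, eta)  # compare with shape 1.62, characteristic 33.9 in Table 7.2-3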


Table 7.2-4 summarizes the failure rate multiplier values for each of the eight failure
causes as a function of the grade for each of the eight. The generic formula for the
multiplier is given as:

Πi = α × (−ln Ri)^(1/β)

In this calculation, the characteristic percentages listed in Table 7.2-3 are scaled by a
factor of 1.11 to ensure that the sum of the multipliers is equal to one when each grade is
equal to 0.50. In this case, a grade of 0.50 represents an "average" process, and since the
model is normalized to an average process, the total multiplier of the initial assessment
failure rate is equal to one under these conditions.

Table 7.2-4: Multipliers as a Function of Process Grade


Cumulative Percentage (Grade)   Parts   Manufacturing   Design   System Management   Wearout   Induced   No Defect

0.01 0.725 0.948 0.378 0.643 0.304 0.433 0.588
0.02 0.655 0.800 0.333 0.498 0.276 0.391 0.540
0.03 0.612 0.714 0.306 0.420 0.258 0.365 0.511
0.04 0.581 0.653 0.286 0.367 0.245 0.346 0.488
0.05 0.556 0.606 0.271 0.328 0.235 0.330 0.470
0.06 0.535 0.567 0.258 0.298 0.227 0.317 0.455
0.07 0.516 0.535 0.247 0.273 0.219 0.306 0.442
0.08 0.500 0.507 0.237 0.251 0.212 0.296 0.430
0.09 0.486 0.482 0.229 0.233 0.207 0.288 0.420
0.10 0.472 0.461 0.221 0.218 0.201 0.279 0.410
0.11 0.460 0.441 0.214 0.204 0.196 0.272 0.401
0.12 0.449 0.423 0.207 0.191 0.191 0.265 0.393
0.13 0.438 0.406 0.201 0.180 0.187 0.259 0.385
0.14 0.428 0.391 0.195 0.170 0.183 0.253 0.378
0.15 0.419 0.376 0.190 0.161 0.179 0.247 0.371
0.16 0.410 0.363 0.185 0.152 0.176 0.242 0.364
0.17 0.402 0.351 0.180 0.145 0.172 0.237 0.358
0.18 0.394 0.339 0.176 0.137 0.169 0.232 0.352
0.19 0.386 0.328 0.171 0.131 0.166 0.227 0.346
0.20 0.379 0.317 0.167 0.124 0.162 0.223 0.340
0.21 0.372 0.307 0.163 0.119 0.160 0.219 0.335

0.22 0.365 0.298 0.160 0.113 0.157 0.214 0.330
0.23 0.358 0.288 0.156 0.108 0.154 0.210 0.325
0.24 0.352 0.280 0.152 0.103 0.151 0.206 0.320
0.25 0.345 0.271 0.149 0.098 0.149 0.203 0.315
0.26 0.339 0.263 0.146 0.094 0.146 0.199 0.310
0.27 0.333 0.256 0.143 0.090 0.144 0.196 0.306
0.28 0.328 0.248 0.140 0.086 0.141 0.192 0.301
0.29 0.322 0.241 0.137 0.083 0.139 0.189 0.297
0.30 0.317 0.234 0.134 0.079 0.137 0.185 0.293
0.31 0.311 0.228 0.131 0.076 0.134 0.182 0.288
0.32 0.306 0.221 0.128 0.072 0.132 0.179 0.284
0.33 0.301 0.215 0.125 0.069 0.130 0.176 0.280
0.34 0.296 0.209 0.123 0.067 0.128 0.173 0.276
0.35 0.291 0.203 0.120 0.064 0.126 0.170 0.272
0.36 0.286 0.198 0.118 0.061 0.124 0.167 0.269
0.37 0.281 0.192 0.115 0.059 0.122 0.164 0.265
0.38 0.277 0.187 0.113 0.056 0.120 0.161 0.261
0.39 0.272 0.181 0.110 0.054 0.118 0.159 0.257
0.40 0.267 0.176 0.108 0.052 0.116 0.156 0.254
0.41 0.263 0.171 0.106 0.049 0.114 0.153 0.250
0.42 0.259 0.167 0.104 0.047 0.112 0.151 0.247
0.43 0.254 0.162 0.101 0.045 0.111 0.148 0.243
0.44 0.250 0.157 0.099 0.043 0.109 0.146 0.240
0.45 0.246 0.153 0.097 0.042 0.107 0.143 0.236
0.46 0.241 0.148 0.095 0.040 0.105 0.140 0.233
0.47 0.237 0.144 0.093 0.038 0.104 0.138 0.229
0.48 0.233 0.140 0.091 0.036 0.102 0.136 0.226
0.49 0.229 0.136 0.089 0.035 0.100 0.133 0.223
0.50 0.225 0.132 0.087 0.033 0.098 0.131 0.219
0.51 0.221 0.128 0.085 0.032 0.097 0.128 0.216
0.52 0.217 0.124 0.083 0.030 0.095 0.126 0.213
0.53 0.213 0.120 0.081 0.029 0.093 0.124 0.210
0.54 0.209 0.117 0.080 0.028 0.092 0.121 0.206
0.55 0.205 0.113 0.078 0.026 0.090 0.119 0.203
0.56 0.202 0.109 0.076 0.025 0.088 0.117 0.200
0.57 0.198 0.106 0.074 0.024 0.087 0.114 0.197

0.58 0.194 0.103 0.072 0.023 0.085 0.112 0.194
0.59 0.190 0.099 0.071 0.022 0.084 0.110 0.190
0.60 0.186 0.096 0.069 0.021 0.082 0.108 0.187
0.61 0.183 0.093 0.067 0.020 0.080 0.106 0.184
0.62 0.179 0.090 0.065 0.019 0.079 0.103 0.181
0.63 0.175 0.086 0.064 0.018 0.077 0.101 0.178
0.64 0.172 0.083 0.062 0.017 0.076 0.099 0.174
0.65 0.168 0.080 0.060 0.016 0.074 0.097 0.171
0.66 0.164 0.077 0.059 0.015 0.073 0.095 0.168
0.67 0.160 0.074 0.057 0.014 0.071 0.092 0.165
0.68 0.157 0.072 0.055 0.013 0.069 0.090 0.162
0.69 0.153 0.069 0.054 0.013 0.068 0.088 0.158
0.70 0.149 0.066 0.052 0.012 0.066 0.086 0.155
0.71 0.146 0.063 0.050 0.011 0.065 0.084 0.152
0.72 0.142 0.061 0.049 0.010 0.063 0.081 0.149
0.73 0.138 0.058 0.047 0.010 0.062 0.079 0.145
0.74 0.135 0.055 0.046 0.009 0.060 0.077 0.142
0.75 0.131 0.053 0.044 0.008 0.058 0.075 0.139
0.76 0.127 0.050 0.042 0.008 0.057 0.073 0.135
0.77 0.123 0.048 0.041 0.007 0.055 0.071 0.132
0.78 0.119 0.045 0.039 0.007 0.053 0.068 0.129
0.79 0.116 0.043 0.038 0.006 0.052 0.066 0.125
0.80 0.112 0.040 0.036 0.006 0.050 0.064 0.122
0.81 0.108 0.038 0.035 0.005 0.048 0.062 0.118
0.82 0.104 0.036 0.033 0.005 0.047 0.059 0.114
0.83 0.100 0.034 0.031 0.004 0.045 0.057 0.111
0.84 0.096 0.031 0.030 0.004 0.043 0.055 0.107
0.85 0.092 0.029 0.028 0.003 0.042 0.052 0.103
0.86 0.088 0.027 0.027 0.003 0.040 0.050 0.099
0.87 0.084 0.025 0.025 0.003 0.038 0.047 0.095
0.88 0.079 0.023 0.023 0.002 0.036 0.045 0.091
0.89 0.075 0.021 0.022 0.002 0.034 0.042 0.087
0.90 0.070 0.019 0.020 0.002 0.032 0.040 0.082
0.91 0.066 0.017 0.019 0.001 0.030 0.037 0.078
0.92 0.061 0.015 0.017 0.001 0.028 0.034 0.073
0.93 0.056 0.013 0.015 0.001 0.026 0.031 0.068

0.94 0.051 0.011 0.013 0.001 0.023 0.028 0.062
0.95 0.045 0.009 0.012 0.001 0.021 0.025 0.057
0.96 0.039 0.007 0.010 0.000 0.018 0.022 0.050
0.97 0.033 0.005 0.008 0.000 0.015 0.018 0.043
0.98 0.025 0.003 0.006 0.000 0.012 0.014 0.035
0.99 0.016 0.002 0.003 0.000 0.008 0.009 0.024
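
The table can be reproduced approximately from the Table 7.2-3 parameters; the sketch below assumes, consistent with the normalization described in the text, that the raw Weibull quantiles for all eight causes are divided by their sum at the average grade of 0.50:

    import math

    weibull = {  # cause: (characteristic percentage / 100, shape beta) from Table 7.2-3
        "Parts": (0.339, 1.62), "Manufacturing": (0.232, 0.96),
        "Design": (0.139, 1.29), "System Management": (0.071, 0.64),
        "Wearout": (0.147, 1.68), "Induced": (0.198, 1.58),
        "No Defect": (0.319, 1.92), "Software": (0.150, 0.70),
    }

    def raw(alpha, beta, grade):
        # Weibull quantile: alpha * (-ln(grade))^(1/beta)
        return alpha * (-math.log(grade)) ** (1.0 / beta)

    NORM = sum(raw(a, b, 0.5) for a, b in weibull.values())  # approximately 1.2

    def multiplier(cause, grade):
        a, b = weibull[cause]
        return raw(a, b, grade) / NORM

    print(round(multiplier("Parts", 0.50), 3))  # ~0.225, the Table 7.2-4 value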

7.2.2.9. Environmental Factor


MIL-HDBK-344 (Reference 6) defines the stress screening strength (SS) to be “the
probability that a specific screen will precipitate a latent defect to failure and detect it by
test, given that a latent defect susceptible to the screen is present. It is the product of the
precipitation efficiency (PE) and detection efficiency (DE).” It is equivalent to the
percentage of defects that are removed from the prescreened population:
SS = Dremoved / Din

where:
Dremoved = Din − Dremaining

The failure rate is, therefore:


λ = Dfield(t) / t

where:

t = the period, in hours, over which the MTBF is to be measured
Dfield = the number of field failures due to latent defects occurring during the interval “t”

Since SS is the percentage of defects removed from the population, it follows that:

Dfield = Dremaining × SSfield

The SSfield is the effective screening strength of the stresses that the product or system
will encounter in the field, and SSESS is the screening strength that the system is exposed
to during environmental stress screening (ESS). It also follows that Dfield is equal to the
cumulative (integral of) field failure rate:

Dfield = ∫ λ(t) dt

Dfield = ∫ λpostscreened(t) dt

Dfield = SS × λprescreened(t)

λpostscreened = SS × λprescreened

This indicates that, in addition to estimating the effect that ESS has on system reliability,
the screening strength calculated from field stresses (SSfield) can be effectively used as a
failure rate multiplier that accounts for the environmental stresses:

SSfield(t) = (1 − e^(−kt)) / t

where,

SSfield(t) = equivalent screening strength of the field environment
k = field precipitation rate

The total screening strength, SStotal , after accounting for both the temperature cycling and
vibration-related portions, is:

SStotal = PTC * SS(TC) + PRV * SS(RV)

where:

PTC = the percentage of failures resulting from temperature cycling stresses
PRV = the percentage of failures resulting from random vibration stresses
SS(TC) = the screening strength applicable to temperature cycling
SS(RV) = the screening strength applicable to random vibration

Algorithms for calculating screening strength are given in a subsequent section. If the
actual values of PTC and PRV are unknown, the default values that should be used are:

PTC = 0.80
PRV = 0.20

Since the component failure rates described above are relative to a ground benign
environment, the failure rate multiplier is the ratio of the SS value in the use environment
to the SS value in a ground benign environment:

ΠE = [PTC × SS(TCuse) + PRV × SS(RVuse)] / [PTC × SS(TCGb) + PRV × SS(RVGb)]

where:

PTC = percentage of failures resulting from temperature cycling stresses
PRV = percentage of failures resulting from random vibration stresses
SS = screening strength applicable to the application environment values

As previously indicated, the SS value is the screening strength and has been derived from
MIL-HDBK-344. It is an estimate of the probability of both precipitating a defect to
failure and detecting it once it is precipitated by the test.

SSTC = 1 − e^(−kTC·t)
SSRV = 1 − e^(−kRV·t)
kTC = 0.0017 × (ΔT + 0.6)^0.6 × [ln(RATE + 2.718)]^3
kRV = 0.0046 × G^1.71

where:
ΔT = Tmax − Tmin (in degrees C)
RATE = temperature ramp rate, in degrees C per minute
t = number of cycles

The parameter “G” is the magnitude of vibration stress, in units of Grms. Whenever
possible, the actual values of delta T (ΔT) and vibration (Grms) should be used for the use
application environment when calculating SS values. If the actual values are not known,
then the default values of ΔT (summarized in the component model descriptions later)
can be used. A discussion of the values of “k” follows.

For RV screens it is necessary to include an axis sensitivity factor. The RV applied in the
axis perpendicular to the plane of the board will have the greatest effect. When selecting
and modeling RV stress, the precipitation efficiency is, thus, given by:

[1 − e^(−kt)] × (Axis Sensitivity Factor)


where the “axis sensitivity factor” is the defect density in the sensitive axis divided by the
total defect density. Transmissibility and resonance effects must be considered, and the
frequency spectrum may need to be suitably notched to avert overstress or wearout
effects. Similarly, thermal mass and conductivities must be considered when determining
temperature cycle (TC) transition rates and required dwell times. The stress levels for all
of these equations pertain to the product or system being screened and not the test
chamber conditions.

It should also be noted that the expressions and tables for precipitation efficiency are only
approximate and, as in the estimation of initial defects, should be refined based upon
actual user data according to the techniques of Procedure D of MIL-HDBK-344.

Under the average temperature cycling and random vibration conditions that represent
the data used in development of the models, the denominator is 0.205. This value is a
normalization constant such that the environment factor is equal to 1.0 when a product or
system is subjected to the average stress levels.

The values assumed for the rate and duration are 2 degrees C per minute and 10 hours,
respectively. Therefore, the environment factor is:

ΠE = 0.855 × [0.8 × (1 − e^(−0.065 × (ΔT + 0.6)^0.6)) + 0.2 × (1 − e^(−0.046 × G^1.71))] / 0.205
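
A minimal sketch of this factor, with the default assumptions stated above (2 degrees C per minute ramp, 10 cycles, and the 0.80/0.20 thermal/vibration split) baked into the constants:

    import math

    def pi_e(delta_t, g_rms):
        """Environment factor from field delta T (deg C) and vibration (Grms)."""
        ss_tc = 1.0 - math.exp(-0.065 * (delta_t + 0.6) ** 0.6)  # thermal portion
        ss_rv = 1.0 - math.exp(-0.046 * g_rms ** 1.71)           # vibration portion
        return 0.855 * (0.8 * ss_tc + 0.2 * ss_rv) / 0.205

    print(pi_e(delta_t=30.0, g_rms=2.0))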


7.2.2.10. Reliability Growth


The 217Plus model includes a factor for assessing the reliability growth characteristics of
a product or system (see Footnote 10). The premise of this factor is that the processes that contribute to
system reliability growth in the field may or may not exist. The degree to which growth
exists is estimated by a grading factor that assesses the processes contributing to growth.
The growth factor calculation is given by the formula:

ΠG = 1.12 × (t + 2)^(−α) / 2^(−α)

The denominator in the above expression is necessary to ensure that the value of the
factor is 1.12 at the time of field deployment, regardless of the growth rate (α). Figure
7.2-3 illustrates the growth Pi-factor multiplier for various values of growth rates as a
function of time.
[Plot: Pi (Growth), from 0 to 1.2 on the y-axis, vs. time in years, from 0 to 2 on the
x-axis, with one curve per growth rate value]
Figure 7.2-3: ΠG vs. Time and Growth Rates

The value of “α” is estimated by determining the degree to which the potential for growth
exists. This estimation is accomplished in a manner similar to the process grading
methodology, by assessing and grading the processes that can contribute to reliability
growth.

Footnote 10: The system reliability growth factor is different from, and in addition to, the reliability growth factors used in the 217Plus
component models to reflect component technology improvements from their respective baseline years.
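
A minimal sketch of the ΠG expression given above, confirming that it equals 1.12 at deployment (t = 0) for any growth rate:

    def pi_g(t_years, alpha):
        """Reliability growth factor as a function of field time and growth rate."""
        return 1.12 * (t_years + 2.0) ** (-alpha) / (2.0 ** (-alpha))

    print(pi_g(0.0, 0.5))   # 1.12 regardless of alpha
    print(pi_g(2.0, 0.5))   # declines with time for positive growth rates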

7.2.2.11. Infant Mortality


Infant mortality is accounted for in the model with a time-variant factor that is a function
of the level to which ESS has been applied. The infant mortality correction factor, ΠIM,
is calculated as:
ΠIM = (1 − SSESS) × t^(−0.62) / 1.77
where:

t = time in years
SSESS = the screening strength of the screen(s) applied, if any.

The value of SS can be determined by using the stress screening strength equations as
presented in Section 7.2.2.9.

The above expression represents the instantaneous failure rate. If the average failure rate
for a given time period is desired, this expression must be integrated and divided by the
time period.
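
A minimal sketch of the instantaneous factor and of the averaging by integration mentioned above (the closed form follows from integrating the t^(−0.62) power law):

    def pi_im(t_years, ss_ess=0.0):
        """Instantaneous infant mortality factor."""
        return (1.0 - ss_ess) * t_years ** (-0.62) / 1.77

    def pi_im_average(t_years, ss_ess=0.0):
        """Average factor over (0, t]: the integral of s^-0.62 is s^0.38 / 0.38."""
        return (1.0 - ss_ess) * t_years ** (-0.62) / (1.77 * 0.38)

    print(pi_im(1.0), pi_im_average(1.0))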

7.2.2.12. Combining Predicted Failure Rate with Empirical Data


The user of this model is encouraged to collect as much empirical data as possible and
use it in the 217Plus reliability assessment. This was summarized in Section 2.6, and is
done by mathematically combining the initial assessment (based on the component failure
rate prediction and the process grades) with empirical data. This step combines the best
"pre-build" failure rate estimate obtained from the initial assessment (plus the influence
of the PGFs) with the metrics obtained from the empirical data. Bayesian techniques are
used for this purpose. This technique accounts for the quantity of data by weighting large
amounts of data more heavily than small amounts. The failure rate estimate obtained
above forms the "prior" distribution, comprised of a0 and b0.
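
A minimal sketch, assuming the standard gamma-Poisson conjugate form for this kind of weighting; the prior parameters a0 and b0 play the role of the "pseudo-failures" and "pseudo-hours" implied by the pre-build estimate:

    def combine(a0, b0, failures, hours):
        """Posterior point estimate of the failure rate from a gamma prior
        (a0, b0) updated with observed failures over observed hours."""
        return (a0 + failures) / (b0 + hours)

    # Large amounts of empirical data dominate the prior; small amounts barely move it:
    print(combine(a0=2.0, b0=2.0e5, failures=1, hours=5.0e4))    # little data
    print(combine(a0=2.0, b0=2.0e5, failures=40, hours=4.0e6))   # much data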

7.2.3. Development of Component Reliability Models

7.2.3.1. Model Form


Traditional methods of reliability prediction model development, as discussed earlier in
the section on MIL-HDBK-217, have included the statistical analysis of empirical failure
rate data. Statistical methods have included ANOVA, multiple linear regression,
sensitivity analysis, etc. When using multiple linear regression techniques with highly
variable data (which is often the case with empirical failure rate data), a requirement of
the model form is that it be multiplicative (i.e., the predicted failure rate is the product of
a base failure rate and several factors that account for the stresses and component
variables that influence reliability). An example of a multiplicative model is as follows:

λp = λb × πe × πq × πs
where:

λp = predicted failure rate


λb = base failure rate
Πe = environmental factor
Πq = quality factor
Πs = stress factor

However, a primary disadvantage of the multiplicative model form is that the predicted
failure rate value can become unrealistically large or small under extreme value
conditions (i.e., when all factors are at their lowest or highest values). This is an inherent
limitation of multiplicative models, primarily due to the fact that individual failure
mechanisms, or classes of failure mechanisms, are not explicitly accounted for. A better
approach is an additive model which predicts a separate failure rate for each generic class
of failure mechanisms. Each of these failure rate terms is then accelerated by the
appropriate stress or component characteristic. This model form is as follows:

λp = λo × πo + λe × πe + λc × πc + λi + λsj × πsj
where:

λp = predicted failure rate


λo = failure rate from operational stresses
πo = product of failure rate multipliers for operational stresses
λe = failure rate from environmental stresses
πe= product of failure rate multipliers for environmental stresses
λc = failure rate from power or temperature cycling stresses
πc = product of failure rate multipliers for cycling stresses
λi = failure rate from induced stresses, including electrical overstress and ESD
λsj = failure rate from solder joints
πsj = product of failure rate multipliers for solder joint stresses
By modeling the failure rate in this manner, factors that account for the application and
component specific variables that affect reliability (π factors) can be applied to the
appropriate additive failure rate term. Additional advantages to this approach are that
they:

• Address operating, non-operating and cycling-related failure rates in an additive
model, weighted in accordance with the operational profile (duty cycle and cycling
rate). The Pi-factors modify only the applicable failure rate term, thereby
eliminating many of the extreme value problems that plague multiplicative models
• Are based on observed failure mode distributions, so that observed component
failure causes are empirically modeled
• Are based on quantitative stresses (and not on qualitative environmental
categories), but default to average stress conditions as a function of environment
• Are industry-independent and predict the average failure rates of best commercial
practices
• Can be tailored with test data, if available, by applying the test data to the
appropriate additive term via the Bayesian method (a sketch of the additive form
follows this list)
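
As a minimal sketch of the additive form (illustrative Python; the inputs would come from the component models and their Pi-factors):

    def additive_failure_rate(lam_o, pi_o, lam_e, pi_e, lam_c, pi_c,
                              lam_i, lam_sj, pi_sj):
        # each failure-cause term carries only its own Pi-factor product,
        # so an extreme stress inflates one term rather than the whole rate
        return (lam_o * pi_o + lam_e * pi_e + lam_c * pi_c
                + lam_i + lam_sj * pi_sj)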

7.2.3.2. Acceleration Factors


Acceleration factors (also called Pi-factors) are used in the 217Plus models to estimate
the effect on failure rate of various stress and component variables. Since the traditional
technique of multiple linear regression was not used in the derivation of the failure rate
models, the Pi-factors were derived by utilizing either industry accepted values, values
determined separately from data available to the RIAC, or values from previous modeling
efforts. For example, the models typically include both an operating and non-operating
temperature factor based on the Arrhenius relationship, which require an activation
energy for operating and non-operating conditions. To estimate these values for the
models, previous modeling studies (along with existing prediction methodologies) were
used. Similarly, some factors were based on test data. For example, the exponent used in
the delta T Pi-factor for the 217Plus integrated circuit model is based on fallout rate data
from temperature cycling tests that were performed at various levels of delta T.

7.2.3.3. Time Basis of Models


Traditional reliability prediction models have been based on the operating time of the
part, and the units were typically failures per million (or billion) operating hours
(F/106H). The RIAC 217Plus models (and the empirical data contained in the RIAC
databases included with the RIAC 217Plus software) predict the failure rate in units of
failures per million calendar hours (F/106CH). This is necessary (and appropriate)
because it is the common basis for all failure rate contribution terms used in the model
(operating, non-operating, cycling, and induced). If an equivalent operating failure rate is
desired (in units of failures per million operating hours), the failure rate (in F/106CH) can
be divided by the duty cycle to yields a failure rate in F/106operating hours.
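
For example, under an assumed duty cycle of 0.25, a predicted failure rate of 2 failures
per million calendar hours is equivalent to 2/0.25 = 8 failures per million operating
hours.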

7.2.3.4. Failure Mode to Failure Cause Mapping


There are two primary types of data on which the RIAC 217Plus component models are
based: failure rate and failure mode. The model development process required that the
failure rate data be apportioned into four failure cause categories. Since the failure mode
data contained in the RIAC databases was typically not defined by these categories, it
was necessary to transform the RIAC failure mode data into a failure cause distribution.
This was accomplished by assessing the stresses that accelerate the specific class of
failure categories, and estimating the percentage of failures that could be attributed to
those stresses. The primary stresses that potentially accelerate operational failure modes
are operating temperature, vibration, current and voltage. The stresses that accelerate
environmental failure causes are non-operating (i.e., dormant) ambient temperature,
corrosive stresses (contaminants/heat/humidity), ageing stresses (time), and humidity. As
an example, Table 7.2-5 summarizes this process for a resistor. Each of the six failure
modes included in the analysis is listed across the top of the table (i.e., EOS,
contamination, etc.), along with its associated observed relative percentage of
occurrence. This data was collected by the RIAC and was based primarily on the root
cause failure analysis results of parts that had failed in the field.

Table 7.2-5: Example of Failure Mode-to-Failure Cause Category Mapping

Failure modes (observed relative percentage of occurrence): EOS (41.20%), Contamination (23.50%), Cracked (17.60%), Chip-out (7.10%), Leakage (5.90%), TNI (4.70%). In the source table, each stress/mode combination is marked "p" (primary accelerant) or "s" (secondary accelerant).

Failure Category | Accelerating Stresses/Causes | % | Total %
Operational Stresses | Operating temperature | 0.00 | 0.05
Operational Stresses | Vibration | 0.04 |
Operational Stresses | Current | 0.00 |
Operational Stresses | Voltage | 0.00 |
Environmental Stresses | Ambient (dormant) temperature | 0.08 | 0.31
Environmental Stresses | Corrosion | 0.09 |
Environmental Stresses | Ageing | 0.05 |
Environmental Stresses | Humidity | 0.09 |
Power Cycling | Power cycling | 0.22 | 0.22
Induced/EOS | Induced/EOS | 0.42 | 0.42

7.2.3.5. Derivation of Base Failure Rates


Once the Pi-factors were defined for each component type that was modeled, and once
the failure rate was apportioned amongst the failure causes, the base failure rate could be
determined. This was accomplished by (1) gathering all failure rate data, (2) estimating
the model input variables (temperatures, stresses, etc.) for each source of data, (3)
calculating the associated Pi-factor for each failure rate, and (4) deriving a base failure
rate for each of the failure cause categories. For example, the failure rate associated with
operational stresses is equated to the product of the base failure rate and the operational
Pi-factors:

$P_{FC} \times \lambda_{obs} = \lambda_b \pi_o$


where:

PFC = percentage of failure rate attributable to operational failure causes


λobs = observed failure rate
λb = base failure rate to be derived
πo = product of model Pi-factors

Solving for λb, and adding a factor to account for data points which have had no observed
failures, yields:

$\lambda_b = \frac{P_{FC} \times \lambda_{obs}}{\pi_o} \times P_F$

The PF parameter is the percentage of total observed calendar hours associated with
components that have had observed failures. This factor is necessary to pro-rate the base
failure rate which was calculated from those data records containing failures. Once this
value of λb was calculated for each data record, the geometric mean was used as the best
estimate of the base failure rate.
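
A sketch of this derivation in Python (the record structure is hypothetical), assuming each data record supplies an observed failure rate and the product of its operational Pi-factors:

    import math

    def operational_base_rate(records, p_fc, p_f):
        # records: (lambda_obs, pi_o) pairs, one per data record
        # p_fc: fraction of failure rate attributed to operational causes
        # p_f: fraction of total calendar hours on components with failures
        rates = [p_fc * lam / pi_o * p_f for lam, pi_o in records]
        # geometric mean as the best estimate of the base failure rate
        return math.exp(sum(math.log(r) for r in rates) / len(rates))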

7.2.3.6. Combining the Predicted Failure Rate with Empirical Data


The user of the 217Plus model is encouraged to collect as much empirical data as
possible and use it in the assessment. This is done by mathematically combining the
prediction made (based on the initial assessment and the process grades) with empirical
data, resulting in a reliability estimate. This step will combine the best “pre-build” failure
rate estimate obtained from the initial assessment (with process grading) with the metrics
obtained from the empirical data. Bayesian techniques are used for this purpose. This
technique accounts for the quantity of data by weighting large amounts of data more
heavily than small quantities. The failure rate estimate obtained above forms the “prior”
distribution, comprised of a0 and b0.

If empirical data (i.e., test or field data) is available on the system under analysis, it can
be combined with the best pre-build failure rate estimate using the following equation:

$\lambda = \frac{a_0 + a_1 + \dots + a_n}{b_0 + b_1 + \dots + b_n}$

where:

λ = the best estimate of the predicted failure rate


ao = the equivalent number of failures of the prior distribution corresponding to
the reliability prediction (after process grading has been accounted for).
The default value is:

a0 = 0.5

bo = the equivalent number of hours associated with the reliability prediction
(after process grading). After a0 is determined, the value of b0 is
calculated by:

$b_0 = \frac{a_0}{\lambda_p}$

a1 through an = the number of failures experienced in the empirical data. There


may be “n” different types of data available
b1 through bn = the equivalent number of cumulative operating hours (in
millions) experienced in the empirical data. These values must
be converted to equivalent hours by accounting for the
accelerating effects between the test and use conditions.
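
A sketch of this combination in Python (hypothetical function name, following the equations above):

    def combined_failure_rate(lambda_pred, data, a0=0.5):
        # data: (failures, equivalent_hours_in_millions) pairs per source
        b0 = a0 / lambda_pred             # equivalent hours of the prior
        total_a = a0 + sum(f for f, _ in data)
        total_b = b0 + sum(h for _, h in data)
        return total_a / total_b

With no empirical data the estimate reduces to a0/b0 = λp; each added data source then pulls the estimate toward its observed failure rate in proportion to its equivalent hours.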

If test data is available that was taken at accelerated conditions, it needs to be converted
to the conditions of interest. A traditional reliability prediction can be performed at both
the test and use conditions, and the equivalent number of hours (bi) can be accelerated by
the failure rate ratio between the test and use temperatures, as follows:

$H_{Eq} = \frac{\lambda_{T1}}{\lambda_{T2}} \times H_T$
where:

H Eq = the equivalent number of test hours


λ T1 = the predicted failure rate at the test conditions, obtained by performing a
reliability prediction of the product or system at the test conditions
λT2 = the predicted failure rate at the use conditions, obtained by performing a
reliability prediction of the product or system at the use conditions
HT = the actual number of test hours
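
As an illustration with assumed values, if the prediction yields λT1 = 5 failures/106
hours at the test conditions and λT2 = 1 failure/106 hours at the use conditions, then
1,000 test hours contribute HEq = (5/1) × 1,000 = 5,000 equivalent use-condition hours.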

The benefits of including empirical data in the failure rate estimate are that it:

• Integrates all reliability data that is available at the point in time when the
estimate is performed (analogous to the statistical process called “meta-analysis”)
• Provides flexibility for the user to customize the reliability model with actual
historical experience data

7.2.3.7. Estimating Confidence Levels


The 217Plus methodology also estimates confidence levels around the failure rate.
Before empirical data is available on a system, the levels are assessed based on a
distribution that was derived by analyzing data on a variety of systems for which both
reliability predictions and field data were available. After test or field data becomes
available and failures are accrued, traditional Chi-square techniques can be used to
estimate the uncertainty in the reliability prediction.

7.2.3.8. Using the 217Plus Model in a Top-Down Analysis


If empirical data exists on a predecessor system, the equation that translates the failure
rate from the old system to the new system is as follows:

$\lambda_{predicted} = \lambda_{predecessor} \times \frac{\lambda_{predicted,new}}{\lambda_{predicted,predecessor}}$

The (predicted, new)/(predicted, predecessor) failure rate ratio accounts for the
differences in application environment, complexity, stresses, date, etc. The predicted
failure rates for the predecessor and the new system are determined using the complete
detailed 217Plus methodology previously described. The observed predecessor failure
rate is used as the baseline against which the new system failure rate is estimated.
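
As an illustration with assumed values, a predecessor system observed at 10 F/106CH,
with predicted failure rates of 12 F/106CH for the predecessor and 8 F/106CH for the new
design, yields λpredicted = 10 × (8/12) ≈ 6.7 F/106CH.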

7.2.3.9. Capacitor Model Example


This section presents an example of the 217Plus component model for capacitors. The
failure rate equation for capacitors is:

$\lambda_P = \pi_G\pi_C\left(\lambda_{OB}\pi_{DCO}\pi_{TO}\pi_S + \lambda_{EB}\pi_{DCN}\pi_{TE} + \lambda_{TCB}\pi_{CR}\pi_{DT}\right) + \lambda_{SJB}\pi_{SJDT} + \lambda_{EOS}$

where:

λP = predicted failure rate, failures per million calendar hours


πG = reliability growth failure rate multiplier:

$\pi_G = e^{-\beta(Y - 1993)}$
β= growth constant. A function of capacitor type (see Table 7.2-6)
πC = capacitance failure rate multiplier:

$\pi_C = \left(\frac{C}{C_1}\right)^{CE}$
C= capacitance, in microfarads
C1 = constant. A function of capacitor type (see Table 7.2-6)
CE = constant. A function of capacitor type (see Table 7.2-6)
λOB = base failure rate, operating
πDCO = failure rate multiplier for duty cycle, operating:

$\pi_{DCO} = \frac{DC}{DC1_{op}}$

πTO = Failure rate multiplier for temperature, operating:

$\pi_{TO} = e^{\frac{-Ea_{op}}{0.00008617}\left(\frac{1}{T_{AO} + 273} - \frac{1}{298}\right)}$

Eaop = activation energy, operating. A function of capacitor type (see Table 7.2-6)
πS = failure rate multiplier for electrical stress:

$\pi_S = \left(\frac{S_A}{S_1}\right)^n$
SA = stress ratio, the applied voltage stress divided by the rated voltage
S1 = constant. A function of capacitor type (see Table 7.2-6)
n= constant. A function of capacitor type (see Table 7.2-6)
λEB = base failure rate, environmental (see Table 7.2-6)
πDCN = failure rate multiplier, duty cycle – nonoperating:

$\pi_{DCN} = \frac{1 - DC}{DC1_{nonop}}$

πTE = failure rate multiplier, temperature-environment :

$\pi_{TE} = e^{\frac{-Ea_{nonop}}{0.00008617}\left(\frac{1}{T_{AE} + 273} - \frac{1}{298}\right)}$

Eanonop = activation energy, nonoperating. A function of capacitor type


(see Table 7.2-6)
λTCB = base failure rate, temperature cycling (see Table 7.2-6)
πCR = failure rate multiplier, cycling rate:

$\pi_{CR} = \frac{CR}{CR_1}$

πDT = Failure rate multiplier, delta temperature:


$\pi_{DT} = \left(\frac{T_{AO} - T_{AE}}{DT_1}\right)^2$

λSJB = base failure rate, solder joint (see Table 7.2-6)


πSJDT = failure rate multiplier, solder joint delta temperature:

$\pi_{SJDT} = \left(\frac{T_{AO} - T_{AE}}{44}\right)^{2.26}$

λEOS = failure rate, electrical overstress (see Table 7.2-6)

Table 7.2-6: Capacitor Parameters

Part Type | λOB | λEB | λTCB | λIND | λSJB | β | DC1op | Eaop | TRdefault | DC1nonop | Eanonop | CR1 | DT1 | n | C1 | S1 | CE
Aluminum | 0.000465 | 0.00022 | 0.000214 | 0.000768 | 0.00095 | 0.229 | 0.17 | 0.5 | 0 | 0.83 | 0.4 | 1140.35 | 21 | 5 | 7.6 | 0.6 | 0.23
Ceramic | 0.001292 | 0.000645 | 0.000096 | 0.00014 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0 | 0.83 | 0.3 | 1140.35 | 21 | 3 | 0.1 | 0.6 | 0.09
General | 0.000634 | 0.000351 | 0.000083 | 0.000259 | 0.00095 | 0.033 | 0.17 | 0.3 | 0 | 0.83 | 0.3 | 1140.35 | 21 | 7 | 0.1 | 0.6 | 0.09
Mica/Glass | 0.000826 | 0.000997 | 0.000888 | 0.000764 | 0.00095 | 0.0082 | 0.17 | 0.4 | 0 | 0.83 | 0.4 | 1140.35 | 21 | 10 | 0.1 | 0.6 | 0.09
Paper | 0.000663 | 0.000075 | 0.000882 | 0.000042 | 0.00095 | 0.0082 | 0.17 | 0.2 | 0 | 0.83 | 0.2 | 1140.35 | 21 | 5 | 0.1 | 0.6 | 0.09
Plastic | 0.000994 | 0.001462 | 0.001657 | 0.002531 | 0.00095 | 0.0082 | 0.17 | 0.2 | 0 | 0.83 | 0.2 | 1140.35 | 21 | 6 | 0.1 | 0.6 | 0.09
Tantalum | 0.000175 | 0.000049 | 0.000032 | 0.000816 | 0.00095 | 0.229 | 0.17 | 0.2 | 0 | 0.83 | 0.2 | 1140.35 | 21 | 17 | 7.6 | 0.6 | 0.23
Tantalum | 0.000175 | 0.000049 | 0.000032 | 0.000816 | 0.00095 | 0.229 | 0.17 | 0.2 | 0 | 0.83 | 0.2 | 1140.35 | 21 | 17 | 7.6 | 0.6 | 0.23
Variable, Air | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0 | 0.83 | 0.4 | 1140.35 | 21 | 6 | 0.35 | 0.5 | 0.09
Variable, Ceramic | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0 | 0.83 | 0.1 | 1140.35 | 21 | 3 | 0.35 | 0.5 | 0.09
Variable, FEP | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0 | 0.83 | 0.2 | 1140.35 | 21 | 6 | 0.35 | 0.5 | 0.09
Variable, General | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0 | 0.83 | 0.2 | 1140.35 | 21 | 6 | 0.35 | 0.5 | 0.09
Variable, Glass | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0 | 0.83 | 0.2 | 1140.35 | 21 | 3 | 0.35 | 0.5 | 0.09
Variable, Mica | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0 | 0.83 | 0.2 | 1140.35 | 21 | 10 | 0.35 | 0.5 | 0.09
Variable, Plastic | 0.002683 | 0.005193 | 0.002066 | 0.000566 | 0.00095 | 0.0082 | 0.17 | 0.3 | 0 | 0.83 | 0.2 | 1140.35 | 21 | 6 | 0.35 | 0.5 | 0.09
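
The following Python sketch strings the capacitor Pi-factors together into the failure rate equation above. The parameter dictionary stands in for one row of Table 7.2-6; the mapping of table columns to the names used here is an assumption made for illustration only.

    import math

    K = 0.00008617  # Boltzmann's constant in eV/K, as used in the text

    def capacitor_failure_rate(p, C, dc, t_ao, t_ae, s_a, cr, year):
        # p: dict of capacitor-type parameters (assumed keys shown below)
        # C: capacitance (uF); dc: duty cycle; t_ao/t_ae: operating and
        # environmental ambient temperatures (C); s_a: voltage stress
        # ratio; cr: cycles per year; year: year of interest
        pi_g    = math.exp(-p["beta"] * (year - 1993))
        pi_c    = (C / p["C1"]) ** p["CE"]
        pi_dco  = dc / p["DC1op"]
        pi_to   = math.exp(-p["Ea_op"] / K * (1/(t_ao + 273) - 1/298))
        pi_s    = (s_a / p["S1"]) ** p["n"]
        pi_dcn  = (1 - dc) / p["DC1nonop"]
        pi_te   = math.exp(-p["Ea_nonop"] / K * (1/(t_ae + 273) - 1/298))
        pi_cr   = cr / p["CR1"]
        pi_dt   = ((t_ao - t_ae) / p["DT1"]) ** 2
        pi_sjdt = ((t_ao - t_ae) / 44) ** 2.26
        # lam_EOS is the induced/electrical overstress term of the model
        return (pi_g * pi_c * (p["lam_OB"] * pi_dco * pi_to * pi_s
                               + p["lam_EB"] * pi_dcn * pi_te
                               + p["lam_TCB"] * pi_cr * pi_dt)
                + p["lam_SJB"] * pi_sjdt + p["lam_EOS"])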

7.2.3.10. Default Values


The default values for the environmental and operating profile factors are summarized in
Tables 7.2-7 and 7.2-8.

Table 7.2-7: Default Environmental Stress Values


Environment TAO TAE Humidity Vibration (GRMS)
Airborne 55 14 40 9
Airborne, Fixed Wing 55 14 40 9
Airborne, Fixed Wing, Inhabited 55 14 40 9
Airborne, Fixed Wing, Uninhabited 71 14 50 9
Airborne, Missile 55 14 40 10
Airborne, Missile, Flight 55 14 40 1.3
Airborne, Missile, Launch 55 14 40 16
Airborne, Rotary Wing 55 14 40 3.3
Airborne, Rotary Wing, Inhabited 55 14 40 3.3
Airborne, Rotary Wing, Uninhabited 71 14 50 3.3
Airborne, Space 55 14 40 0
Ground 35 17 40 0
Ground, Man Pack 55 14 40 1
Ground, Mobile 55 14 40 10
Ground, Mobile, Heavy Wheeled 55 14 40 10
Ground, Mobile, Heavy Wheeled, Chassis Mounted 55 14 40 10
Ground, Mobile, Heavy Wheeled, Engine Compartment 55 14 40 10
Ground, Mobile, Heavy Wheeled, Engine Mounted 55 14 40 10
Ground, Mobile, Heavy Wheeled, Instrument Panel Closed 55 14 40 10
Ground, Mobile, Heavy Wheeled, Instrument Panel Open 55 14 40 10
Ground, Mobile, Heavy Wheeled, Trunk 55 14 40 10
Ground, Mobile, Light Wheeled 55 14 40 4
Ground, Mobile, Light Wheeled, Chassis Mounted 34 14 40 4
Ground, Mobile, Light Wheeled, Engine Compartment 40 14 40 4
Ground, Mobile, Light Wheeled, Engine Mounted 58 14 40 4
Ground, Mobile, Light Wheeled, Instrument Panel Closed 31 14 40 4
Ground, Mobile, Light Wheeled, Instrument Panel Open 24 14 40 4
Ground, Mobile, Light Wheeled, Trunk 17 14 40 4
Ground, Mobile, Tracked 55 14 40 2
Ground, Stationary 35 19 40 0
Ground, Stationary, Indoors 30 23 40 0
Ground, Stationary, Outdoors 40 14 50 0
Naval 55 14 80 0.7
Naval, Shipboard 55 14 80 0.7
Naval, Shipboard, Sheltered 40 20 70 0.7
Naval, Shipboard, Unsheltered 60 14 90 0.7
Naval, Submarine 55 23 50 1

Table 7.2-8: Default Operating Profile Values

Equipment Type | DC (%) | CR (C/yr)
Automotive | 5 | 1000
Commercial Aircraft | 25 | 2982
Computer | 80 | 1491
Consumer | 30 | 368
Emergency Power | 10 | 50
Industrial | 80 | 184
Military Aircraft | 25 | 1008
Military Ground | 45 | 263
Naval | 80 | 50
Telecommunications | 80 | 368

7.2.4. Photonic Model Development Example

7.2.4.1. Introduction

7.2.4.1.1. Component Reliability Models Form


This section summarizes the manner in which photonic device models were derived
(Reference 3). It is included to demonstrate the development of models when little field
data is available.

The photonic component model form is:

$\lambda_P = \pi_Q\left(\lambda_{OB}\pi_{DCO}\pi_{TO}\pi_V + \lambda_{EB}\pi_{DCN}\pi_{TE}\pi_{RH} + \lambda_{TCB}\pi_{CR}\pi_{DT} + \lambda_{ind}\right)$


where:

λp = predicted failure rate


πQ = multiplier for photonic device quality
λOB = base failure rate from operational stresses
πDCO = failure rate multiplier for duty cycle:

$\pi_{DCO} = \frac{DC}{DC1_{op}}$

πTO = factor for operating temperature:


$\pi_{TO} = e^{\frac{-Ea_{op}}{0.00008617}\left(\frac{1}{T_{AO} + T_R + 273} - \frac{1}{298}\right)}$

πV = vibration factor:
$\pi_V = \left(\frac{V_a + 1}{V_c}\right)^{n_{vib}}$

λEB = base failure rate from environmental stresses


πDCN = failure rate multiplier for nonoperating duty cycle:

$\pi_{DCN} = \frac{1 - DC}{1 - DC1_{op}}$

πTE = nonoperating temperature factor:

$\pi_{TE} = e^{\frac{-Ea_{nonop}}{0.00008617}\left(\frac{1}{T_{AE} + 273} - \frac{1}{298}\right)}$

πRH = humidity factor:

$\pi_{RH} = \left(\frac{RH_a + 1}{RH_c}\right)^{n_{RH}}$

λTCB = base failure rate from power or temperature cycling stresses

πcr = cycling rate factor:

$\pi_{CR} = \frac{CR}{CR_1}$

πDT = delta Temperature factor:

$\pi_{DT} = \left(\frac{T_{AO} + T_R - T_{AE}}{14}\right)^{n_{PC}}$

λi = failure rate from induced stresses

The model parameters are defined as follows:

λP = predicted failure rate, failures per million calendar hours


πQ = failure rate multiplier for quality
λOB = base failure rate, operating
πDCO = failure rate multiplier for duty cycle, operating
DC = duty cycle (fraction of calendar time in operation)
DC1op = 0.25
πTO = failure rate multiplier, temperature – operating
Eaop = activation energy - operating
TAO = ambient operating temperature
TR = temperature rise above TAO
πV = failure rate multiplier, vibration level
VA = max vibration level applied (Grms)
VC = 1.0
nvib = vibration exponent
λEB = base failure rate, environment
πDCN = failure rate multiplier, duty cycle – nonoperating
πTE = failure rate multiplier, Temperature – environment
Eanonop = activation energy, nonoperating
TAE = ambient environmental temperature
πRH = failure rate multiplier, relative humidity
RHa = relative Humidity (%)
RHc = 50%
nRH = relative humidity exponent
λTCB = base failure rate, temperature cycling
πCR = failure rate multiplier, cycling rate
CR = cycling rate (cycles per year)
CR1 = 1000
πDT = failure rate multiplier, delta temperature
nPC = temperature cycling exponent

7.2.4.1.2. Model Development Methodology

The modeling methodology that was used in the photonics device modeling study is
summarized in Figure 7.2-4. This methodology is similar to the 217Plus model
development methodology, but was tailored for the specific needs of photonic
components. Each element of this methodology is explained in the following sections.

[Figure 7.2-4 is a flowchart of the model development methodology. Its steps: collect failure mode data; map observed failure modes into the failure cause categories; identify the base percentage of failure rate attributable to each cause; collect reliability data and populate the spreadsheet; estimate the stresses to which the parts were exposed; estimate acceleration model constants; calculate a normalization value for each accelerating stress; estimate acceleration factors (Pi-factors) for each part from each data source; calculate the base failure rates for each cause such that observed = predicted failure rates.]

Figure 7.2-4: Model Development Methodology Flowchart

7.2.4.2. Model development methodology and results


This section details the model development methodology and also presents the results of
each task in this methodology. Each task in Figure 7.2-4 is described in the following
sections.
7.2.4.2.1. Collect Failure Mode Data


There are two primary types of data upon which the component models are based, failure
rate and failure mode. The model development process required that the failure rate data
be apportioned into the following four defined failure cause categories:

• Failures from operational stresses


• Failures from environmental stresses
• Failures from power or temperature cycling stresses
• Failures from induced stresses

Since failure mode data is typically not classified according to these categories, it is
necessary to transform the failure mode distribution data into the failure cause
distribution. This failure mode distribution data was obtained from several sources:

• Data collected during the photonic device study


• Data obtained from the literature
• Analysis similar to a Failure Mode and Effects Analysis (FMEA), in which failure
causes are hypothesized.

An example of this is summarized in Table 7.2-9, in which the failure causes for a
connector are hypothesized (2nd column), and then an occurrence rating is given for each
cause. This rating is in the 3rd column, and is scored as a 1, 3 or 9. This weighting
scheme is often used in FMEA analysis. The result is a fractional value for each failure
cause that is proportional to the weighting. The sum of all of these values for each
component type equals 1.0.

The methodology used in the photonics device models to derive the fraction of
occurrence differs from the methodology presented previously for the 217Plus
components, in that failure mode distributions were not available during the photonics
model development effort. For the 217Plus models, the components were more mature
and therefore, there was considerable history of both failure mode and failure rate data to
draw upon.

Table 7.2-9: Failure Cause Summary for Connectors

Component Type: Connector (SC and FC)

Failure Cause | Occurrence | Fraction of Occurrence
Spring failure | 3 | 0.073
Wear of the connector resulting in misalignment | 3 | 0.073
Wear of the end face | 1 | 0.024
Contamination of facet (sand, dust, grease) | 9 | 0.220
Contamination on outside that wicks in | 1 | 0.024
Eccentric wear on the ferrule causes misalignment | 1 | 0.024
Crimping too tight causes pinching | 3 | 0.073
Crimping too loose causes it to fall apart | 1 | 0.024
O-ring failure | 1 | 0.024
Contraction of the outer jacket causes fiber pistoning | 3 | 0.073
Fracture of the end face | 1 | 0.024
Misalignment of cable end due to sleeve wear | 1 | 0.024
Misalignment of cable end due to buckling from tolerance stack-up | 3 | 0.073
Misalignment of cable end due to separation from tolerance stack-up | 3 | 0.073
Insufficient cure of epoxy | 3 | 0.073
Corrosion, pitting of facets | 3 | 0.073
Embrittlement of organic materials due to UV exposure | 1 | 0.024

7.2.4.2.2. Map Observed Failure Modes into the Failure Cause Categories
To transform the failure mode distribution data into the failure cause distribution, the
following process was used:

• Identify failure modes and their relative percentages (summarized above)


• Identify the accelerating factors applicable to each failure cause
• Identify the accelerating stresses applicable to each failure cause category (for
example, accelerating stresses from device operation applicable to many photonic
components will be optical power, temperature, etc.)
• Map the accelerating stress to the appropriate failure modes (identify them as
being a primary, secondary or no accelerant driver)

The last item is accomplished by assessing whether each stress is a primary accelerant of
the failure mode, a secondary accelerant, or is not an accelerant. A 3:1 weighting
between primary and secondary accelerant was then used in estimating the percentage of
failures that could be attributed to those stresses.
The primary stresses that potentially accelerate operational failure modes are operating
temperature, vibration, current/voltage and optical power. The stresses that accelerate
environmental failure causes are nonoperating ambient temperature, corrosive stresses
(contaminants/heat/humidity), and aging stresses (time). As an example, Table 7.2-10
summarizes this process for our connector example.

Table 7.2-10: Failure Mode to Failure Cause Category for Connectors (SC and FC)

[The columns of the full table are the seventeen connector failure modes of Table 7.2-9, with relative percentages of 21.95%, 7.32%, 7.32%, 2.44%, 2.44%, 2.44%, 7.32%, 2.44%, 2.44%, 7.32%, 2.44%, 2.44%, 7.32%, 7.32%, 7.32%, 7.32% and 2.44% (totaling 100%). Each stress/mode combination is marked "p" (primary accelerant), "s" (secondary accelerant) or left blank. The resulting percentages attributable to each accelerating stress are:]

Failure Cause Category | Accelerating Stress or Cause | % | Category Total
Operational Stresses | Operating temperature | 0.00 | 0.11
Operational Stresses | Vibration | 0.10 |
Operational Stresses | Current/voltage | 0.00 |
Operational Stresses | Optical power | 0.01 |
Environmental | Ambient temperature | 0.07 | 0.30
Environmental | Corrosion | 0.04 |
Environmental | Ageing | 0.09 |
Environmental | Humidity | 0.10 |
Power Cycling | Power Cycling | 0.23 | 0.23
Induced/handling | Induced/handling | 0.36 | 0.36
TOTAL | | 1.00 | 1.00

Each of the failure modes is listed across the top of the table, and each of the accelerating
stresses/causes is listed down the left side. Each combination is identified with a “blank”
(no acceleration from the factor), a "p" (primary) or an "s" (secondary). The associated
relative percentage of failures attributable to the accelerating stress/cause is listed down
the right columns.

The % column (second from the right) is calculated as follows:

$\% = \sum_{FM=1}^{n} FM\%\left(\frac{w_i}{\sum_{AC} w_i}\right)$
where:

FM% = the percentage associated with the ith failure mode


wi = the weight of the specific combination of failure mode and accelerating
stress or cause (0 for none, 1 for secondary, and 3 for primary)

For example, the % value for ambient temperature (as part of the environmental failure
cause category) is:

$7.32\%\left(\tfrac{1}{4}\right) + 7.32\%\left(\tfrac{1}{4}\right) + 7.32\%\left(\tfrac{1}{11}\right) + 2.44\%\left(\tfrac{1}{1}\right) = 0.07$

Therefore, an estimate of the percentage of failure causes accelerated by ambient
temperature is 7%.
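
A sketch of this weighting scheme in Python (hypothetical data layout), in which each failure mode's percentage is split across its accelerating stresses in proportion to the 3:1 primary/secondary weights:

    WEIGHT = {"p": 3, "s": 1}

    def cause_percentages(mode_pct, marks):
        # mode_pct: {failure_mode: fraction of occurrence}
        # marks: {stress: {failure_mode: "p" or "s"}}
        # total accelerant weight carried by each failure mode
        mode_total = {m: sum(WEIGHT[mk[m]] for mk in marks.values()
                             if m in mk)
                      for m in mode_pct}
        return {stress: sum(mode_pct[m] * WEIGHT[mk[m]] / mode_total[m]
                            for m in mk)
                for stress, mk in marks.items()}

For the ambient temperature row above, a mode marked "s" whose total accelerant weight is 4 contributes its percentage times 1/4, reproducing the terms of the worked example.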
7.2.4.2.3. Identify the Base Percentage of Failure Rate Attributable to Each Cause
The base percentages of failure rate are calculated by summing the accelerating
stress/cause percentages associated with each failure cause. For our connector example,
the four percentages associated with the “operating” accelerating stresses/causes sum to 11%,
or 0.11. These percentages are an estimate of the percent of failures that can be expected
for each cause under nominal stress conditions. In this case, nominal stresses are the
average stresses to which the models are normalized. Table 7.2-11 summarizes the
failure cause percentages (in fractional form).

Table 7.2-11: Failure Cause Percentages for Connectors


Failure Rate Percentage
Term (Fraction)
Operational 0.11
Environmental 0.30
Power Cycling 0.23
Induced 0.36

7.2.4.2.4. Collect Reliability Data and Populate Spreadsheet


As previously summarized, the approach that was taken in photonics device model
development methodology relied on the collection of quantitative failure mode and
failure rate data. Literature searches were performed toward the goal of collecting the
quantitative data required for model development. Sources searched for applicable data
included:

• Optical Society of America (OSA)


• SPIE
• RIAC databases
• Total Electronic Migration System (TEMS) (a database of government-related
research from IACs and other sources)
• Government-Industry Data Exchange Program (GIDEP)
• Manufacturers data
• Data mined from the Web

The results of this data collection effort, for connectors, are summarized in Table 7.2-12.

Table 7.2-12: Data Collected for Connectors

Part Type | Data Type | TAO | TAE | TR | Delta T | VA | DC | CR | RHa | Hours | Failures | Observed Lambda
Connector | Field | 30 | 23 | 5 | 12 | 0 | 0.8 | 368 | 40 | 33333333.33 | 1 | 30
Connector | Damp heat | 85 | 85 | 0 | 0 | 0 | 0 | 0 | 85 | 20000 | 0 | –
Connector | Damp heat | 60 | 60 | 0 | 0 | 0 | 0 | 0 | 95 | 20160 | 12 | –
Connector | Damp heat | 85 | 85 | 0 | 0 | 0 | 0 | 0 | 85 | 1056 | 0 | –
Connector | Damp heat | 85 | 85 | 0 | 0 | 0 | 0 | 0 | 85 | 22000 | 0 | –
Connector | Damp heat | 85 | 85 | 0 | 0 | 0 | 0 | 0 | 85 | 22000 | 0 | –
Connector | High temperature storage | 85 | 85 | 0 | 0 | 0 | 0 | 0 | 2 | 20160 | 8 | –
Connector | High temperature storage | 85 | 85 | 0 | 0 | 0 | 0 | 0 | – | 1056 | 0 | –
Connector | Low temperature storage | -40 | -40 | 0 | 0 | 0 | 0 | 0 | – | 1056 | 0 | –
Connector | Thermal Cycling | 85 | -40 | 0 | 125 | 0 | 1 | 1752 | – | 5000 | 0 | –
Connector | Thermal Cycling | 70 | -40 | 0 | 110 | 0 | 1 | 1752 | – | 12600 | 16 | –
Connector | Thermal Cycling | 85 | -40 | 0 | 125 | 0 | 1 | 1752 | – | 50 | 0 | –
Connector | Thermal Cycling | 85 | -40 | 0 | 125 | 0 | 1 | 1752 | – | 5500 | 0 | –
Connector | Vibration | 25 | 25 | 0 | 0 | 20 | 1 | 0 | – | 33 | 0 | –

(“–” marks entries left blank in the source.)

The first column is the part type; the second is the data type. Data types used in the
photonics device study included:

• Field data
• Test data
   o Thermal cycling
   o Vibration
   o Damp heat
   o High temperature storage
   o Low temperature storage
   o Operating life test

The remaining columns are the estimates of the actual stresses to which the part
was exposed in the field or during the test. These stresses are defined as follows:

TAO = ambient operating temperature


TAE = ambient environmental temperature

TR = temperature rise above TAO


VA = maximum vibration level applied (Grms)
DC = duty cycle (fraction of calendar time in operation)
CR = cycling rate (cycles per year)
RHa = relative humidity (%)

7.2.4.2.5. Estimate Stresses to Which the Parts were Exposed


For each source of data that was collected, an estimate of the stresses and operating
profiles to which the component was exposed was required so that the failure rates could
be normalized to the actual stresses. These stresses were summarized in the previous
section.

For test data, these values were generally readily available. For data collected from
fielded systems, the actual stress values were not available. Therefore, they had to be
estimated. The default values of the environmental and operating profile factors were
summarized in Tables 7.2-7 and 7.2-8. Only field data from telecommunication
applications used in a ground, stationary, indoors environment was available to the
photonics device modeling study, so only the values pertaining to those conditions were
estimated in this manner.
7.2.4.2.6. Estimate Acceleration Model Constants for Each Part
Acceleration factors (or Pi-factors) were used in the component models to estimate the
effects of various stress and component variables on the failure rate. The two
predominant forms of acceleration factors are the Arrhenius and the power law models.

The Arrhenius model is generally used for modeling temperature effects and is:

$AF_T = e^{\frac{E_a}{KT}}$

where “AFT” is the temperature acceleration factor, “Ea” is the activation energy, “K” is
Boltzmann’s constant, and “T” is the temperature (in degrees K).

The power law model is:


$AF = S^n$

where “S” is the stress and “n” is a constant.

The specific forms of these acceleration factors that were used in the models are
summarized below.

πTO = factor for operating temperature:


$\pi_{TO} = e^{\frac{-Ea_{op}}{0.00008617}\left(\frac{1}{T_{AO} + T_R + 273} - \frac{1}{298}\right)}$

πV = vibration factor:
$\pi_V = \left(\frac{V_a + 1}{V_c}\right)^{n_{vib}}$

πTE = nonoperating temperature factor:

$\pi_{TE} = e^{\frac{-Ea_{nonop}}{0.00008617}\left(\frac{1}{T_{AE} + 273} - \frac{1}{298}\right)}$

πRH = humidity factor:

$\pi_{RH} = \left(\frac{RH_a + 1}{RH_c}\right)^{n_{RH}}$

πDT = delta temperature factor:

$\pi_{DT} = \left(\frac{T_{AO} + T_R - T_{AE}}{14}\right)^{n_{PC}}$

The temperature factors based on the Arrhenius relationship were normalized to 25
degrees C. The acceleration factors for vibration and relative humidity that are based on
the power law were normalized to a specific value, i.e., the denominator, and include a
value of 1.0 in the numerator to ensure that the factor does not go to zero at a stress
level of zero.
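
A sketch of the two normalized factor forms in Python (parameter values would be the per-component constants of Table 7.2-14):

    import math

    K = 0.00008617  # Boltzmann's constant, eV/K

    def pi_arrhenius(temp_c, ea):
        # Arrhenius factor normalized to 25 degrees C (298 K)
        return math.exp(-ea / K * (1.0 / (temp_c + 273) - 1.0 / 298))

    def pi_power(stress, ref, n):
        # power-law factor normalized to the reference value in the
        # denominator; the +1 keeps the factor nonzero at zero stress
        return ((stress + 1.0) / ref) ** n

For example, pi_power(20, 1.0, 5) gives 21^5 = 4,084,101, the πV value that appears later in Table 7.2-16 for a 20 GRMS vibration test with VC = 1.0 and nvib = 5.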

Each model has a single factor that needs to be estimated, i.e., “Ea” for the Arrhenius and
“n” for the power law. These were estimated in one of the following ways:

1. Values generated from information that was available in the literature


2. Engineering judgment based on the known behavior of similar failure
mechanisms

For #2, the accelerations were categorized from “no acceleration” to “very high
acceleration” for each specific accelerating stress. Table 7.2-13 summarizes the values of
the applicable parameters as a function of the relationship.

Table 7.2-13: Categories of Acceleration Model Parameters


Dependency n (PC) Ea (op) Ea (nonop) n (RH) n (Vibration)
Very High 10 1 1 10 10
High 5 0.7 0.7 5 5
Medium 2 0.5 0.5 2 2
Low 1 0.1 0.1 1 1
None 0 0 0 0 0

Table 7.2-14 summarizes the specific parameter values used in the connector models.

Table 7.2-14: Acceleration Model Parameters


Component Type n (PC) Ea (op) Ea (nonop) n (RH) n (Vibration)
Connector 2 0.1 0.1 10 5

7.2.4.2.7. Calculate a Normalization Value for Each Accelerating Stress


The Pi factors needed to be normalized to a fixed set of conditions. This approach makes
it convenient to derive default Pi-factors. By normalizing the factors in this manner, the
Pi-factor is equal to 1.0 when the stress is equal to the default stress. Therefore, if an
analyst chooses to ignore the effects of a particular stress, the failure rate will be
representative of the default stress levels.

The default values for the applicable photonics device model Pi-factors are summarized
in Table 7.2-15.

Table 7.2-15: Default Model Parameters

Defaults common to all model categories: DC = 0.25, CR = 1000, Vibration = 1, RH = 50, DT = 20

Model Category | Default TR
Connector | 0
Passive Micro-Optic Component | 10
Passive Fiber-Based Component | 0
Isolator | 5
VOA | 20
Fiber | 0
Splice | 0
Cable | 0
Laser Diode Module | 15
Photodiode | 5
Transmitter | 15
Receiver | 15
Transceiver | 15

7.2.4.2.8. Estimate the Acceleration Factors (Pi-factors) for Each Part from Each Data Source
The acceleration factors used in the models are Pi-factors, which are the acceleration
factors normalized to a given stress level. These factors were calculated for each part
from each data source. To derive these factors, two pieces of information were required:

1. The estimate of the stress for each data point (in this case, a data point is a single
observation of reliability (failures and hours) at a known set of stress conditions).
The manner in which these were quantified was previously explained.
2. The default stress level of the data for each stress parameter in the model

The Pi-factor was then the acceleration model normalized to the default stress level. An
example of this calculation is shown in Table 7.2-16. Every data point available from
field or test data had its associated Pi-factor values calculated. Note that some of the Pi-
factors were zero. This occurs because test data was not applicable to all failure causes.
This concept will be further explained in the next section.

Table 7.2-16: Summary of Pi-factor Calculations

Part | Data Type | πDCO | πTO | πV | πDCN | πTE | πRH | πCR | πDT
Cable | Field | 3.200 | 1.000 | 32.000 | 0.267 | 1.000 | 0.137 | 0.368 | 1.063
Cable | Thermal Cycling | 4.000 | 1.000 | 1.000 | 0.000 | 1.000 | 0.000 | 2.037 | 1.180
Cable | Thermal Cycling | 4.000 | 1.000 | 1.000 | 0.000 | 1.000 | 0.000 | 4.037 | 1.149
Cable | Thermal Cycling | 4.000 | 1.000 | 1.000 | 0.000 | 1.000 | 0.000 | 4.037 | 1.162
Cable | Vibration | 4.000 | 1.000 | 4084101 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000
Cable | Vibration | 4.000 | 1.000 | 496874 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000
Cable | Vibration | 4.000 | 1.000 | 785027 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000
Cable | Vibration | 4.000 | 1.000 | 4084101 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000
Connector | Field | 3.200 | 1.135 | 1.000 | 0.267 | 0.974 | 0.137 | 0.368 | 0.360
Connector | Damp heat | 0.000 | 1.921 | 1.000 | 1.333 | 1.921 | 227 | 0.000 | 0.000
Connector | Damp heat | 0.000 | 1.506 | 1.000 | 1.333 | 1.506 | 681 | 0.000 | 0.000
Connector | Damp heat | 0.000 | 1.921 | 1.000 | 1.333 | 1.921 | 227 | 0.000 | 0.000
Connector | Damp heat | 0.000 | 1.921 | 1.000 | 1.333 | 1.921 | 227 | 0.000 | 0.000
Connector | Damp heat | 0.000 | 1.921 | 1.000 | 1.333 | 1.921 | 227 | 0.000 | 0.000
Connector | High temperature storage | 0.000 | 1.921 | 1.000 | 1.333 | 1.921 | 0.000 | 0.000 | 0.000
Connector | High temperature storage | 0.000 | 1.921 | 1.000 | 1.333 | 1.921 | 0.000 | 0.000 | 0.000
Connector | Low temperature storage | 0.000 | 0.337 | 1.000 | 1.333 | 0.337 | 0.000 | 0.000 | 0.000

7.2.4.2.9. Calculate the Base Failure Rates for Each Cause Such That the Observed Failure Rates = the Predicted Failure Rates

Following the approach of the 217Plus models, which were based solely on field data, the base failure rates for the photonic device models were obtained as follows for each failure cause category:

$\lambda_{Bi} = \frac{\sum_{1}^{m}\left(F_{obs} \times \%_i\right)_{field}}{\sum_{1}^{m} H_{obs\,field} \times \prod_{1}^{k}\pi}$

where:

λBi = the base failure rate for the ith failure rate term
Fobs = the number of observed field failures
Hobs = the number of observed field hours
Ππ= the product of the applicable Pi-factors to the applicable field environment
i= the number of failure causes
m= the number of field data sources
k= the number of correction factors
%i = the percentage of failure rate attributable to the specific failure causes
The product of the Pi-factors converts the actual hours to an equivalent “effective”
number of hours normalized to the default stress values.

However, in the case of the photonic models developed for the study, it was necessary to
utilize a significant amount of test data since there was not enough field data available.
This is due to the fact that there are few field data sources for photonic components.
Therefore, the modeling methodology needed to be tailored to accommodate the specific
data available on the parts addressed in the photonics device study. This was
accomplished by using a Bayesian technique in which the field data becomes the prior
distribution, and the summation of the failure and hours from all data sources forms the
basis of the posterior distribution. The failure rate parameter of the exponential
distribution was, therefore:

$\lambda_{Bi} = \frac{\sum_{1}^{m}\left(F_{obs} \times \%_i\right)_{field} + \sum_{1}^{j} F_{obs\,test}}{\sum_{1}^{m} H_{obs\,field} \times \prod_{1}^{k}\pi + \sum_{1}^{j} H_{obs\,test} \times \prod_{1}^{k}\pi}$

where there were “j” test data sources.
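
In code form, the posterior base failure rate for one failure cause might be computed as follows (a sketch with a hypothetical data layout; field failures are apportioned by %i, while test data applies in full to its single applicable cause):

    def base_rate_for_cause(field, tests, pct_i):
        # field, tests: (failures, hours, pi_product) tuples
        # pct_i: fraction of field failures attributed to this cause
        num = (sum(f * pct_i for f, _, _ in field)
               + sum(f for f, _, _ in tests))
        den = (sum(h * pi for _, h, pi in field)
               + sum(h * pi for _, h, pi in tests))
        return num / den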

Each specific type of test data that was collected for the study was applicable to only one
of the four specific failure causes, as summarized in Table 7.2-17. Field data, however,
encompassed all four failure causes.

Table 7.2-17: Applicability of Test Data


Failure Cause Category
Data Type
Operating Environmental Cycling Induced
Field X X X X
Operating Life Test X
High Temperature Storage X
Low Temperature Storage X
Damp Heat X
Vibration X
Thermal Cycling X

One of the advantages to the model structure was this ability to modify the base failure
rates of specific failure causes with test data applicable to only that failure cause.

The connector base failure rates resulting from this analysis are listed in Table 7.2-18.

Table 7.2-18: Base Failure Rates (Failures per Million Calendar Hours)

Component | Operating | Environmental | Cycling | Induced
Connector | 0.0002 | 0.3053 | 2.7952 | 0.0110

7.2.4.2.10. Adjust the Base Failure Rates


The last step in the process was to adjust the base failure rates to ensure that the predicted
number of failures was equal to the observed number. The manner in which this was
accomplished was to scale the base failure rates to ensure that the cumulative predicted
number of failures of the entire population of observed data points was equal to the
observed number of failures. This was accomplished by using the MS Excel “goal seek”
function, which finds the value of a “correction factor” that satisfies this boundary
condition. This approach is conceptually similar to a maximum likelihood method.
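
The same boundary condition can be reproduced outside of a spreadsheet. The sketch below (hypothetical data layout) bisects on a scale factor k until the cumulative predicted failure count matches the observed count:

    def correction_factor(base_rates, records, observed_failures):
        # records: (hours, [pi_product for each failure cause]) tuples
        def predicted(k):
            return sum(h * sum(k * lb * pi
                               for lb, pi in zip(base_rates, pis))
                       for h, pis in records)
        lo, hi = 1e-9, 1e9    # assumes the solution lies in this bracket
        for _ in range(200):
            mid = (lo * hi) ** 0.5   # geometric midpoint for a wide range
            if predicted(mid) < observed_failures:
                lo = mid
            else:
                hi = mid
        return (lo * hi) ** 0.5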
7.2.4.2.11. Treatment of Quality and Environmental Stresses
There were several options for modeling the effects of environmental stresses. Early in
the study, it was decided that the effects of quality and environment would be treated
such that the photonic component models would be “stand-alone”. This approach
differed from the form of the 217Plus methodology, in that quality and environment were
treated as “system” level effects. This concept was based on the premise that quality and
environmental effects were manifested more at the assembly or system level than they
were at the component level. The photonic component models include the effects of their
pertinent environmental stresses in the component models, instead of applying the
environment factor in the assembly or system model, as was the case with 217Plus. The
primary environmental stresses included in the photonic component models are
temperature, humidity and vibration.

The quality factor (πQ) is calculated in a manner similar to the 217Plus methodology, but
tailored to the unique concerns of photonic components. This factor is calculated as
follows:

$\pi_q = \alpha_i\left(-\ln(R_i)\right)^{\frac{1}{\beta_i}}$

Where αi and βi are Weibull parameters representing the distribution of the percentage of
failures attributable to components (parts). The quality factor is scaled within this

distribution based on how good the parts control program is. The parameter “Ri” is the
rating of the parts control program and is calculated from:

$R_i = \frac{\sum_{j=1}^{n_i} G_{ij}W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$

where,

Ri = rating of the process for the ith failure cause, from 0.0 to 1.0
Gij = the grade for the jth item of the ith failure cause. This grade is the rating
between 0.0 and 1.0 (worst to best).
Wij = the weight of the jth item of the ith failure cause
ni = the number of grading criteria associated with the ith failure cause

The 217Plus grading criteria, as applied to the photonics device models, are provided in
Table 7.2-19. These were tailored specifically for photonic components.
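
A sketch of the grading arithmetic in Python (illustrative; the grades and weights would come from Table 7.2-19, and αi and βi from the photonic model constants):

    import math

    def rating(grades, weights):
        # R_i: weighted average of the 0-to-1 grades G_ij
        return sum(g * w for g, w in zip(grades, weights)) / sum(weights)

    def pi_q(r_i, alpha_i, beta_i):
        # quality factor scaled within the Weibull distribution of the
        # percentage of failures attributable to parts
        return alpha_i * (-math.log(r_i)) ** (1.0 / beta_i)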

Table 7.2-19: Part Quality Process Grade Factor Questions for Photonic Device Models
Parts Contribution to Reliability | Rating | Input Range | User Input | Highest Possible Score | Actual Score
Is there a documented part selection and part management process? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Are part evaluation and qualification processes established to add parts to the PPL? | yes = 3; no = 0 | Y,N | N | 3 | 0.0
Does a cross functional development team (CFDT) review and approve new candidate parts for addition to the PPL? | yes = 3; no = 0 | Y,N | N | 3 | 0.0
Is this a commercial off-the-shelf (COTS) purchased assembly with a good history of operational reliability? | yes = 6; no = 0 | Y,N | N | 6 | 0.0
Will new parts be added to the PPL to design this FRU? | yes = 4; no = 0 | Y,N | N | 4 | 0.0
Are procedures in place to detect part problems in both manufacturing and the field? | yes = 10; no = 0 | Y,N | N | 10 | 0.0
Are quality and reliability data tracked on parts and fed back to suppliers so they know their performance on this product? | yes = 10; no = 0 | Y,N | N | 10 | 0.0
Is there a design compliance checklist to ensure that all parts are properly applied, operating at sufficient margin with respect to environmental and operational stresses, and taking into account lessons learned? | yes = 10; no = 0 | Y,N | N | 10 | 0.0
Are teaming relationships established with all critical component suppliers? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Will all suppliers provide timely failure reporting and corrective action support (FRACAS) for both critical and custom parts? (Timely reporting implies a 2-week turnaround, with faster response on priority demand.) | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Have suppliers identified the likely failure modes on critical and custom parts, and does the design take these failure modes into account? | yes = 10; no = 0 | Y,N | N | 10 | 0.0
Are operational failure rate and failure mode data provided by the suppliers of critical and custom parts being used? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Is there a device specification for all critical and custom parts? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Has the supplier reviewed the part application for all critical and custom parts? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Will critical suppliers provide timely notice of impending part changes to allow the developer to assess the impact? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Is a change history log maintained to provide traceability of engineering change actions and their associated rationale for critical and custom parts? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Will part identification (revision numbers) be shown on the part to identify the particular part configuration, including the level of the part’s firmware? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Is there a first article inspection and acceptance test planned? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Have key suppliers identified their part failure mechanisms? | yes = 10; no = 0 | Y,N | N | 10 | 0.0
Have the sources and the extent of part variation been identified? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Have mitigations been identified to handle the effects of part variations? | yes = 8; no = 0 | Y,N | N | 8 | 0.0
Will a design-of-experiments part evaluation, considering part variations as well as manufacturing variations, be conducted? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Will the developer’s quality organization audit the supplier’s processes and facility capabilities? | yes = 6; no = 0 | Y,N | N | 6 | 0.0
Is an optical path adhesive (OPA) used in the component? | A. No OPA = 10; B. yes, MFD <2um = 0; C. yes, MFD = 2-5 um = 4; D. yes, MFD = 5-10 um = 6; E. yes, MFD >10 um = 8 | A,B,C,D,E | B | 10 | 0.0
Are there thin films (AR coatings, filter elements) in the light path? | A. No thin film = 0; B. yes, and surface is prepared by sputtering = 2; C. yes, and surface is not prepared by sputtering = 3 | A,B,C | A | 3 | 0.0
Does the component contain fused fiber? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Does the component contain fiber? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Was the package thermally designed to safely dissipate heat by understanding and modeling the thermal characteristics? | yes = 3; no = 0 | Y,N | N | 3 | 0.0
Has the manufacturer characterized the power handling capability of the component? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Have acceleration factors for power and temperature been quantified, and are they used to determine the derating requirements? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Does the component contain absorbers at wavelengths to which the component will be exposed (i.e., garnet, shutter, etc.)? | yes = 4; no = 0 | Y,N | N | 4 | 0.0
How is dissipated power intended to be dumped? | A. with a heat sink = 4; B. dissipation not actively managed = 0 | A,B | B | 4 | 0.0
Does the component rely on alignment of free space components attached with organics? | yes = 3; no = 0 | Y,N | N | 3 | 0.0
Cleanliness precautions | A. stringent cleaning procedures = 3; B. some cleaning procedures = 2; C. no cleaning procedures = 0 | A,B,C | C | 3 | 0.0
For components that have a fiber/epoxy interface, is the fiber tip inspected to ensure it is free of defects and contamination? | yes = 3; no = 0 | Y,N | N | 3 | 0.0

7.2.4.3. Uncertainty Analysis


An analysis was performed to quantify the degree of uncertainty in the predicted failure
rates. This was accomplished by calculating the predicted failure rate and comparing it to
the observed failure rate. The metric that was used for this analysis was the log10 of the
value: predicted failure rate/observed failure rate. The value of this metric should cluster
around zero if the prediction models are approximating the observed data. Calculation of
the standard deviation of this metric also provides a quantification of the uncertainty
levels present in the predictions made with these models. Table 7.2-20 summarizes the
mean and standard deviation of this metric for all of the data and for only the field data.
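
The metric is straightforward to compute; a sketch with hypothetical data pairs:

    import math
    import statistics

    def uncertainty_metric(pairs):
        # pairs: (predicted, observed) failure rates, observed > 0
        logs = [math.log10(pred / obs) for pred, obs in pairs]
        return statistics.mean(logs), statistics.stdev(logs)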

Table 7.2-20: Summary of Uncertainty Metrics


All Data Field Data
Mean -0.68 0.20
Standard deviation 1.07 0.44

Figures 7.2-5 and 7.2-6 illustrate the distribution of this metric for all data and for just
field data. For this analysis, only data for which failures occurred were included, since
data with no observed failures only have a single-sided bound on the failure rate and,
therefore, cannot be compared to the predicted value. The result of not including zero
failure data is that the metric is biased. As can be seen in these figures, the distribution of
all failures is significantly wider than the distribution of just the field failure rates. This
is due to the fact that the non-field data, i.e. test data, is typically at extreme conditions.
Therefore, the uncertainty in these extreme cases is typically larger than for nominal
conditions.

[Figure: histogram of frequency vs. log10(predicted/observed) failure rate ratio, all data]

Figure 7.2-5: Distribution of Log10 Predicted/Observed Failure Rate Ratio for All
Data

[Figure: histogram of frequency vs. log10(predicted/observed) failure rate ratio, field data only]

Figure 7.2-6: Distribution of Log10 Predicted/Observed Ratio for Field Data Only

The distributions of the predicted/observed failure rate ratio are illustrated in Figure
7.2-7. With this metric, the value should be centered about one, since the log of this
ratio has not been taken.
[Figure: lognormal probability plot of the predicted/observed failure rate ratio (ReliaSoft Weibull++ 7). Fitted lognormal parameters: all data: μ = −1.5585, σ = 2.4925, ρ = 0.9880; field data only: μ = 0.4556, σ = 0.9547, ρ = 0.8813]

Figure 7.2-7: Distributions of the Predicted/Observed Failure Rate Ratio for All Data
and For Field Data Only
7.2.4.4. Comments on Part Quality Levels


Part quality level has traditionally been used as one of the primary variables affecting the
predicted failure rate of a component. The quality level categories were usually those
defined by the applicable military specification.

One of the problems that developers had when developing MIL-HDBK-217 models was
de-convolving the effects of quality and environment. For example, multiple linear
regression analysis of field failure rate data was usually used to quantify model variables
as a function of independent variables such as quality and environment. A basic
assumption of such techniques is that the independent variables are statistically
independent of each other. However, in reality they are not, since the “higher” quality
components are generally used in the severe environments and the commercial quality
components are used in the more benign environments. This correlation makes it
difficult to discern the effects of each of the variables individually. Additionally, there
are several attributes pooled into the quality factor, including qualification, process
certification, screening and quality systems.

The approach used in the 217Plus model to quantify the effects of part quality is to treat it
as one of the failure causes for which a process grade is determined. In this manner,
issues related to qualification, process certification, screening and quality systems were
individually addressed.

7.2.4.5. Explanation of Failure Rate Units


The 217Plus models predict the failure rate in units of failures per million calendar hours.
This is necessary because the 217Plus methodology accounts for all failure rate
contribution terms (i.e., operating, nonoperating, cycling and induced), and the
appropriate manner in which they can be combined is to use a common time basis for the
failure rate, which is calendar hours.

If an equivalent operating failure rate is desired in units of failures per million operating
hours, the 217Plus reliability prediction should be performed with the actual duty cycle to
which the unit will be subjected, and the resulting failure rate (in f/106 calendar
hours) divided by the duty cycle to yield a failure rate in terms of f/106 operating
hours. The resulting “operating” failure rate will be artificially increased to account for
the nonoperating and cycling failures that would not otherwise be accounted for. The
incorrect way to predict a 217Plus failure rate in units of failures per million operating
hours is to set the duty cycle equal to 1.0. The resulting failure rate in this case would be
valid only if the actual duty cycle is 100%. If the actual duty cycle is not 100%, then the
failures during non-operating periods will not be accounted for.

7.2.5. System-Level Model

7.2.5.1. Model Presentation


As a reminder, the total 217Plus system model is:

$\lambda_P = \lambda_{IA}\left(\Pi_P\Pi_{IM}\Pi_E + \Pi_D\Pi_G + \Pi_M\Pi_{IM}\Pi_E\Pi_G + \Pi_S\Pi_G + \Pi_I + \Pi_N + \Pi_W\right) + \lambda_{SW}$

where:

λp = predicted failure rate of the system


λIA = initial assessment of the failure rate. This failure rate is based on new
component failure rate models derived by the RIAC presented in Section
2.2, whose derivations are discussed in the next section

Each of the following model factors represents a failure cause:

ΠP = parts process factor


ΠD = design process factor
ΠM = manufacturing process factor
ΠS = system management process factor
ΠI = induced process factor
ΠN = no-defect process factor
ΠW = wearout process factor

Each of these factors is calculated as follows:


$\pi_i = \alpha_i\left(-\ln(R_i)\right)^{\frac{1}{\beta_i}}$

where αi and βi are constants for each failure cause category, as given in Table 7.2-21.
The parameter Ri is calculated as:
$R_i = \frac{\sum_{j=1}^{n_i} G_{ij}W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$

where:

Ri = rating of the process for the ith failure cause, from 0.0 to 1.0.
Gij = the grade for the jth item of the ith failure cause. This grade is the rating
between 0.0 and 1.0 (worst to best).
Wij = the weight of the jth item of the ith failure cause
ni = the number of grading criteria associated with the ith failure cause

Table 7.2-21: Parameters for the Process Grade Factors

Symbol (Πi) | Name | α | β | Default value for factor if Ri is unknown
ΠD | Design process factor | 0.12 | 1.29 | 0.094
ΠM | Manufacturing process factor | 0.21 | 0.96 | 0.142
ΠP | Parts Quality process factor | 0.30 | 1.62 | 0.243
ΠS | Systems Management process factor | 0.06 | 0.64 | 0.036
ΠN | CND process factor | 0.29 | 1.92 | 0.237
ΠI | Induced process factor | 0.18 | 1.58 | 0.141
ΠW | Wearout process factor | 0.13 | 1.68 | 0.106
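
As a worked illustration, assume a hypothetical Design process rating of RD = 0.5. Using
the constants from Table 7.2-21:

$$\Pi_D = 0.12\left(-\ln 0.5\right)^{1/1.29} = 0.12\,(0.693)^{0.775} \approx 0.090$$

Note that this is close to the table's default value of 0.094, consistent with an unknown
rating being treated as roughly the midpoint, Ri = 0.5.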

ΠIM = infant mortality factor


$$\Pi_{IM} = \left(1 - SS_{ESS}\right)\frac{t^{-0.62}}{1.77}$$

where:

t= time in years. This is the instantaneous time at which the failure rate is
to be evaluated. If the average failure rate for a given time period is
desired, this expression must be integrated and divided by the time
period.
SSESS = the screening strength of the screen(s) applied, if any
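
For example, carrying out that integration over the interval (0, T) on the expression for
ΠIM shown above yields the average infant mortality factor for the first T years:

$$\bar{\Pi}_{IM} = \frac{1}{T}\int_0^T \left(1 - SS_{ESS}\right)\frac{t^{-0.62}}{1.77}\,dt = \left(1 - SS_{ESS}\right)\frac{T^{-0.62}}{(0.38)(1.77)}$$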
ΠE = environmental factor

$$\Pi_E = \frac{0.855\left[0.8\left(1 - e^{-0.065(\Delta T + 0.6)^{0.6}}\right) + 0.2\left(1 - e^{-0.046\,G^{1.71}}\right)\right]}{0.205}$$
where:

ΔT = the change in temperature between operating and non-operating periods
(TAO − TAE)


G = the magnitude of random vibration while the system is operating, in GRMS
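
As a numerical illustration of the environmental factor, using the formula as reconstructed
above with assumed (not handbook) conditions of ΔT = 20°C and G = 1 GRMS:

$$\Pi_E = \frac{0.855\left[0.8\left(1 - e^{-0.065(20.6)^{0.6}}\right) + 0.2\left(1 - e^{-0.046(1)^{1.71}}\right)\right]}{0.205} \approx 1.14$$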


ΠG = reliability growth factor, given by the formula:

$$\Pi_G = \frac{1.12\,(t+2)^{-\alpha}}{2^{-\alpha}}$$

where:

α= the growth constant, which is equal to Ri for reliability growth processes


Ri = the rating of the growth process using the criteria in Table 7.2-30, and is
given as:
$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$
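
To make the bookkeeping concrete, the following sketch (in Python; the function and
variable names are illustrative and not part of 217Plus itself) strings the pieces
together: it converts each process rating Ri into its Π factor using the α and β
constants of Table 7.2-21, evaluates ΠIM, ΠE and ΠG as given above, and combines
everything per the system model. All input values shown are hypothetical, and the
ΠIM and ΠE expressions follow the reconstructions printed above.

import math

# Alpha/beta constants from Table 7.2-21, keyed by factor symbol
GRADE_CONSTANTS = {
    "D": (0.12, 1.29), "M": (0.21, 0.96), "P": (0.30, 1.62),
    "S": (0.06, 0.64), "N": (0.29, 1.92), "I": (0.18, 1.58),
    "W": (0.13, 1.68),
}

def pi_factor(rating, alpha, beta):
    # Process grade factor: Pi_i = alpha_i * (-ln(R_i)) ** (1 / beta_i)
    return alpha * (-math.log(rating)) ** (1.0 / beta)

def pi_im(t_years, ss_ess):
    # Infant mortality factor at instantaneous time t (years), reduced
    # by the screening strength SS_ESS of any applied screens
    return (1.0 - ss_ess) * t_years ** -0.62 / 1.77

def pi_e(delta_t, g_rms):
    # Environmental factor from the operating/nonoperating temperature
    # delta (deg C) and the operating random vibration level (Grms)
    num = 0.855 * (0.8 * (1.0 - math.exp(-0.065 * (delta_t + 0.6) ** 0.6))
                   + 0.2 * (1.0 - math.exp(-0.046 * g_rms ** 1.71)))
    return num / 0.205

def pi_g(t_years, growth_rating):
    # Reliability growth factor; growth_rating is R_i for the growth process
    return 1.12 * (t_years + 2.0) ** -growth_rating / 2.0 ** -growth_rating

def system_failure_rate(lambda_ia, ratings, t_years, ss_ess,
                        delta_t, g_rms, growth_rating, lambda_sw=0.0):
    # 217Plus system model; result is in failures per million calendar hours
    p = {k: pi_factor(r, *GRADE_CONSTANTS[k]) for k, r in ratings.items()}
    im = pi_im(t_years, ss_ess)
    e = pi_e(delta_t, g_rms)
    g = pi_g(t_years, growth_rating)
    return lambda_ia * (p["P"] * im * e + p["D"] * g + p["M"] * im * e * g
                        + p["S"] * g + p["I"] + p["N"] + p["W"]) + lambda_sw

# Hypothetical inputs: all ratings 0.5, evaluated at 1 year, no screening,
# 20 deg C operating/nonoperating delta, 1 Grms vibration
ratings = {k: 0.5 for k in GRADE_CONSTANTS}
print(system_failure_rate(10.0, ratings, 1.0, 0.0, 20.0, 1.0, 0.5))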

7.2.5.2. 217Plus Process Grading Criteria


This section contains a listing of all of the criteria that comprise the definition and
scoring for the individual 217Plus Process Grades. An index of the tables included
within this section is listed in Table 7.2-22.

Table 7.2-22. Index of Process Grade Type Questions


Table Number Process Grade Type
7.2-23 Design
7.2-24 Manufacturing
7.2-25 Part Quality
7.2-26 System Management
7.2-27 CND
7.2-28 Induced
7.2-29 Wearout
7.2-30 Growth

The rating for each process grade type, Ri, is given as:

$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$


where:

Ri = rating of the process for the ith failure cause, from 0.0 to 1.0.
Gij = the grade for the jth item of the ith failure cause. This grade is the rating
between 0.0 and 1.0 (worst to best).
Wij = the weight of the jth item of the ith failure cause
ni = number of grading criteria associated with the ith failure cause

These tables are organized as follows. Column 1 contains the questions (criteria)
associated with the specific Process Grade Type. Column 2 contains the grading criteria
(Gij). Most of the questions are designated with a Y/N in this column; in these cases, a
"Yes" answer equals "1" and a "No" answer equals "0", so the question receives its full
weighted score for a "Yes" answer and zero for a "No" answer. In some cases, the grading
criteria are not binary, but rather can take one of three or four possible values; the
grading criteria for these are noted in this column. Column 3 identifies the scoring
weight (Wij) associated with the specific question.

In the event that a model user does not wish to answer all of the questions, he/she can
choose a subset of the most important questions by using only those with weight values
of seven or higher. Questions that are not scored should not be counted in the number of
grading criteria (ni) associated with the ith failure cause, as illustrated in the sketch
below.
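
As a minimal sketch of this scoring arithmetic (Python; the function name and the example
answers are hypothetical, not from the handbook), the rating is simply the grade-weighted
average over whichever questions are actually scored:

def process_rating(graded_items, min_weight=None):
    # Weighted-average rating R_i for one failure cause.
    # graded_items: (grade, weight) pairs with grade in [0.0, 1.0];
    # unanswered questions are simply omitted from the list.
    # min_weight: if set (e.g., 7), only questions at or above this
    # weight are scored, per the shortened-questionnaire option above.
    scored = [(g, w) for g, w in graded_items
              if min_weight is None or w >= min_weight]
    total_weight = sum(w for _, w in scored)
    return sum(g * w for g, w in scored) / total_weight

# Three hypothetical answers as (grade, weight) pairs
answers = [(1.0, 10), (0.5, 8), (0.0, 5)]
print(process_rating(answers))                 # (10 + 4 + 0) / 23 ~= 0.61
print(process_rating(answers, min_weight=7))   # (10 + 4) / 18 ~= 0.78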


7.2.5.3. Design Process Grade Factor Questions

Table 7.2-23: Design Process Grade Factor Questions


Question | Gij | Wij
What is the % of lead design engineering people with cross training experience in manufacturing or field operations (thresholds at 10, 20%)? | <10 = 0; 10-20 = .5; >20 = 1 | 5
What is the % of team members having relevant product experience (thresholds at 25, 50%)? | <25 = 0; 25-50 = .5; >50 = 1 | 5
What is the % of team members having relevant process experience, i.e., they have previously developed a product under the current development process (thresholds at 20, 40%)? | <20 = 0; 20-40 = .5; >40 = 1 | 4
What is the % of the development team that have 4-year technical degrees (thresholds at 20, 40%)? | <20 = 0; 20-40 = .5; >40 = 1 | 3
What is the % of the engineering team having advanced technical degrees (thresholds at 10, 20%)? | <10 = 0; 10-20 = .5; >20 = 1 | 3
What is the % of engineering team members who were involved in professional activities in the past year, hold patents, authored/presented papers, are registered professional engineers, or hold professional society offices at the national level (thresholds at 10, 20%)? | <10 = 0; 10-20 = .5; >20 = 1 | 2
What is the % of engineering team members who have taken engineering courses in the past year (thresholds at 10, 20%)? | <10 = 0; 10-20 = .5; >20 = 1 | 2
Are resource people identified for program technology support across key technology and specialty areas such as optoelectronics, servo control, Application Specific Integrated Circuit (ASIC) design, etc., to provide program guidance and support as needed? | Yes = 1; No = 0 | 7
Are resource people identified, for program tools support, to provide guidance and assistance with Computer Aided Design (CAD), simulation, etc.? | Yes = 1; No = 0 | 6
How many (0, 1, 2, 3) of the program objectives of cost, schedule and reliability did the manager successfully meet for the last program for which he/she was responsible? | 3 = 1; 2 = .5; 1 = .25; 0 = 0 | 10
Is this development program organized as "Cross Functional Development Teams" (CFDT) involving design, manufacturing, test, procurement, etc.? | Yes = 1; No = 0 | 8
Does this Field Replaceable Unit (FRU) depend more on mature technology than state of the art technology? | Yes = 1; No = 0 | 3
Is design of experiments (DOE) used to ensure robustness of the FRU in the product under all operational and environmental variations? | Yes = 1; No = 0 | 5
Are critical components identified along with plans to mitigate their risks? | Yes = 1; No = 0 | 5
Have designs been reviewed and plans made for part obsolescence during the product's life cycle? | Yes = 1; No = 0 | 6
Are considerations made to accommodate part form factor evolution? This applies particularly to those parts deemed likely to change during the production life of the fielded system. | Yes = 1; No = 0 | 5
Are predominantly standard tools required for maintenance (limited-to-no use of special tools)? | Yes = 1; No = 0 | 2
Will the design application be modeled by variational analysis to ensure design centering? | Yes = 1; No = 0 | 5
Will timing analysis be performed on digital circuits? | Yes = 1; No = 0 | 5
Will a network modal analysis be performed on analog circuits? | Yes = 1; No = 0 | 5
Will electrical stress analysis be performed on electronic circuits? | Yes = 1; No = 0 | 6
Will mechanical stress analysis be performed on relevant components, materials and structures? | Yes = 1; No = 0 | 3
Will a prototype be developed in time to have user feedback impact the design? | Yes = 1; No = 0 | 10
Will customer feedback on the prototype be sought? | Yes = 1; No = 0 | 10
Will design personnel participate in a Failure Modes and Effects Analysis (FMEA), Failure Modes Effects and Criticality Analysis (FMECA), or Fault Tree Analysis (FTA) that is performed concurrently with the design effort? | Yes = 1; No = 0 | 6
Will the design engineer also design the diagnostic code for this FRU? | Yes = 1; No = 0 | 4
Will a worst-case analysis be performed? | Yes = 1; No = 0 | 4
Will the product support tasks be ergonomically evaluated (human factors) from an Operations & Maintenance standpoint? | Yes = 1; No = 0 | 4
Will the product be analyzed using a human factor task analysis to ensure the Operations and Maintenance tasks are tailored to human capabilities? | Yes = 1; No = 0 | 4
Will the chassis that this FRU is mounted in be thermally measured and analyzed, and operating temperatures assured to be at a safe margin below device limits? | Yes = 1; No = 0 | 6
Is electrical/mechanical power by electronic logic or physical action (switches)? | Yes = 1; No = 0 | 4
Do control procedures ensure that the system and its software are put in a safe state during power shut down? | Yes = 1; No = 0 | 4
What is the maximum silicon junction temperature on this FRU in degrees C (thresholds at 90 and 125 degrees C)? If not applicable select "0". | <90 = 1; 90-125 = .5; >125 = 0 | 8
Will environmental analyses and profiling (thermal, dynamic) be performed on the product to ensure it is used within its design strength capabilities? | Yes = 1; No = 0 | 5
Will the product be analyzed/tested for electromagnetic compatibility (EMC) and radiated/conducted susceptibility and emissions? | Yes = 1; No = 0 | 6
Will the product be EMC-certified, per the European CE (Conformity European) regulatory compliance criteria for equipment used in Europe, or under a similarly rigorous standard such as DO-160 (commercial aircraft)? | Yes = 1; No = 0 | 4
Are the sizes of equipment orifices (cover openings) less than 1/10th of the wavelength of the signal frequencies that the equipment will generate within its enclosure or be exposed to in its environment? | Yes = 1; No = 0 | 4
Do traces on a Printed Wiring Board (PWB) run over a ground plane or an impedance control layer (e.g., power planes) and never over reference plane or power plane voids? | Yes = 1; No = 0 | 4
Do traces on alternate PWB layers run orthogonal to one another, when a reference plane or power plane is not interposed between them? | Yes = 1; No = 0 | 4
Are adjacent traces separated by at least twice their width, except for minor adjacencies that run less than a half inch? | Yes = 1; No = 0 | 4
Is the power source filtered over the range of 1 KHz to 100 MHz for military power or 150 KHz to 30 MHz for commercial power, and does it utilize surge suppression devices where appropriate? | Yes = 1; No = 0 | 4
Are all interconnect cables emerging from a shielded cabinet grounded to the chassis for operating frequencies greater than 1 MHz, or capacitively decoupled to the chassis for frequencies less than 1 MHz? | Yes = 1; No = 0 | 4
Are traces set back at least 2 widths from the edge of the reference or ground plane? | Yes = 1; No = 0 | 4
Is there a shared product development vision that includes Design for Manufacturability (DFM) goals? | Yes = 1; No = 0 | 8
Are part types standardized via a Preferred Parts List (PPL)? | Yes = 1; No = 0 | 5
Is there continuing focus to keep the PPL up to date and to minimize the number of parts on the PPL, by increasing part standardization, encouraging designers to use the PPL and requiring analysis to justify adding a new part to the PPL? | Yes = 1; No = 0 | 6
Is this product to be built on an existing manufacturing platform that makes use of existing process capabilities? | Yes = 1; No = 0 | 6
Do plans for follow-on products and product retirement exist? | Yes = 1; No = 0 | 2
Are new, critical parts qualified by test and analysis prior to their inclusion in the system? | Yes = 1; No = 0 | 6
How does the part count on this project compare with predecessor products or competitive products (thresholds at 75 and 100%)? | <75% = 1; 75%-100% = .5; >100% = 0 | 4
Are there DFM guidelines provided that the program must adhere to (e.g., a good DFM design is fabricated on a uni-axis assembly orientation, preferably built from the bottom)? | Yes = 1; No = 0 | 6
What % of inter-connections are there compared to the predecessor version of this FRU (thresholds at 60 and 100%)? | <60 = 1; 60-100 = .5; >100 = 0 | 4
Are PWB traces at least 5 mils in width? | Yes = 1; No = 0 | 5
Is the development process documented? | Yes = 1; No = 0 | 5
Is the process documentation on-line, with the recognition that the on-line version is the only standard (all printed copies are for reference only)? | Yes = 1; No = 0 | 6
Does each process activity have clear entry and exit criteria? | Yes = 1; No = 0 | 2
Is the system configuration documented on-line, with changes since the last baseline highlighted to keep the entire team current with the design? | Yes = 1; No = 0 | 6
Are there functional block diagrams of the system, subsystems, etc., down to the FRU level? | Yes = 1; No = 0 | 3
Are examples of good development products (e.g., specs, plans, documentation) provided to the engineering team, typifying the desired work products for each stage of development? | Yes = 1; No = 0 | 6
Are examples of past problems provided to the engineering team that typify those found at each stage of development? | Yes = 1; No = 0 | 4
Is there a closed-loop problem database to track development problems to closure? | Yes = 1; No = 0 | 5
Does development activities planning include the identification of critical path tasks? | Yes = 1; No = 0 | 5
Are critical path tasks planned to minimize cycle time impacts and improve schedule robustness? | Yes = 1; No = 0 | 4
Are individual developers encouraged to make contact with their customer counterpart? | Yes = 1; No = 0 | 6
Will Cross-Functional Development Team (CFDT) phase reviews/sign-offs follow each product development phase: requirements, preliminary design, final design, and test? | Yes = 1; No = 0 | 9
Are formal reviews documented and defect data analyzed and tracked, along with any action items, to completion? | Yes = 1; No = 0 | 5
Do design reviewers share responsibility for the performance of the design once they have reviewed it? | Yes = 1; No = 0 | 3
Are developers rated on the success of the overall product in the field? | Yes = 1; No = 0 | 3
Is there a technical review board in place to minimize design changes and maintain cost, schedule and reliability goals? | Yes = 1; No = 0 | 5
Are engineering change (EC) costs budgeted, measured and tracked against their associated design driver? | Yes = 1; No = 0 | 6
Is reliability and/or quality a significant goal or the number one goal placed on the entire development organization? This occurs on safety critical applications such as air traffic control, nuclear or critical medical applications. | Yes = 1; No = 0 | 10
What is the % of FRU reuse across the system (thresholds at 10 and 20%)? | <10 = 0; 10-20 = .5; >20 = 1 | 4
Are individual developers empowered by having input and control over resources to accomplish their job, such as having a travel budget (if travel is required)? | Yes = 1; No = 0 | 2
Are engineering team members dedicated full time to the project? | Yes = 1; No = 0 | 4
Are process owners identified across the development team for each configuration item (CI) and its components? | Yes = 1; No = 0 | 3
Is there tracking of open problems, action items and cross system dependencies? | Yes = 1; No = 0 | 4
Is there a change review board/process? | Yes = 1; No = 0 | 4
Is development creativity fostered through planned creativity exercises to spawn breakthrough thinking with respect to design simplicity, cost, schedule and reliability? | Yes = 1; No = 0 | 4
Are failures traced to their root cause and managed to resolution? | Yes = 1; No = 0 | 4
Is this development process ISO rated? | Yes = 1; No = 0 | 4
How many (0, 1, 2, 3) of the cost, schedule and reliability goals did the last product developed by this organization meet? | 3 = 1; 2 = .5; 1 = .25; 0 = 0 | 10
Do you know the reliability performance of your current products in the field versus their predicted reliability? | Yes = 1; No = 0 | 5
If so, are previous reliability estimates greater than 15% of the predicted reliability? When not applicable select "No". | Yes = 1; No = 0 | 7
Is there a 15% staffing buffer on the program, i.e., will the program be staffed to 115% of the needed baseline to allow for contingencies? | Yes = 1; No = 0 | 9
Are in-process metrics maintained to track actual vs. planned defect rates, schedule and resource targets? | Yes = 1; No = 0 | 4
Can continuous measurable improvement (CMI) be demonstrated for the development processes? | Yes = 1; No = 0 | 4
Are development processes maintained on-line with all printed paper copies designated "for reference use only"? | Yes = 1; No = 0 | 6
Are there procedures to ensure that documentation stays current with the design? | Yes = 1; No = 0 | 6
Is there a requirements document for this program? | Yes = 1; No = 0 | 4
Is there a Functional Specifications document? | Yes = 1; No = 0 | 4
Are there document owners or points-of-contact identified for these documents so the development team knows who it can go to for a specific need? | Yes = 1; No = 0 | 3
Do the team members contribute to the creation and/or review and approval of these documents? | Yes = 1; No = 0 | 4
Are documentation standards promoted with examples to demonstrate what is considered adequate documentation? | Yes = 1; No = 0 | 4
Is product documentation field-tested prior to product delivery or general availability? | Yes = 1; No = 0 | 5
Is there a procedure for field feedback on product operations and maintenance documentation? | Yes = 1; No = 0 | 4
Does product documentation maximize pictures and minimize words (fallibility of natural language)? | Yes = 1; No = 0 | 4
Is product documentation kept at reading grade level 10 or less? | Yes = 1; No = 0 | 4
Is there an operational concept document developed prior to high level design that is maintained throughout development? | Yes = 1; No = 0 | 8
Is there a set of hardware and process design guidelines that provide general and component-specific design guidance practices? | Yes = 1; No = 0 | 6
Is there an assumptions/dependencies database that is maintained and reviewed prior to each development stage exit? | Yes = 1; No = 0 | 8
Is a distributed architecture used? | Yes = 1; No = 0 | 5
Does the design exclude electro-optical devices? | Yes = 1; No = 0 | 5
Does the design exclude electro-mechanical devices? | Yes = 1; No = 0 | 5
Is chemical processing excluded from this design? | Yes = 1; No = 0 | 5
Is hot fusing (toner) excluded from this design? | Yes = 1; No = 0 | 5
Are all voltages used in this design less than 110 VAC? | Yes = 1; No = 0 | 2
Are all operating frequencies less than 50 MHz? | Yes = 1; No = 0 | 5
What is the number of developers on this project (thresholds of 20 and 100)? | <20 = 1; 20-100 = .5; >100 = 0 | 5
What is the development schedule in months (thresholds at 18, 36 and 48 months)? | <18 = 1; 18-36 = .75; 36-48 = .5; >48 = 0 | 8
Is there a 24-hour/day availability requirement? | Yes = 1; No = 0 | 6
Does the operational concept call for a remote operations and maintenance (O&M) operator to be able to diagnose system problems as part of the system concept? | Yes = 1; No = 0 | 6
Is this PWB of standard size or dimension? | Yes = 1; No = 0 | 5
Does this FRU have a 25% reduction in parts count over its predecessor or competitor? | Yes = 1; No = 0 | 5
Are stuck faults required to be isolated down to a single failing FRU 90% of the time? | Yes = 1; No = 0 | 4
Does this FRU report its status via a "Management Information Data Bit" (MIB) capability for fault determination and isolation? | Yes = 1; No = 0 | 2
Is there over-voltage/under-voltage detection and reporting? | Yes = 1; No = 0 | 3
Will FRUs be "hot-pluggable"? | Yes = 1; No = 0 | 2
Is there an independent test team? | Yes = 1; No = 0 | 5
Is the customer directly involved in defining the product's operational profile and in reviewing the test plans? | Yes = 1; No = 0 | 5
Does test planning take into account the "lessons learned" database? | Yes = 1; No = 0 | 4
Does a problem tracking database exist and is it being used on this program? | Yes = 1; No = 0 | 5
Will accelerated testing be performed during development that combines temperature and vibration? | Yes = 1; No = 0 | 4
Will alpha tests be conducted, whereby the final product is robustly tested against probable extensions of its operational environment? | Yes = 1; No = 0 | 6
Will beta tests be conducted, whereby customers can use and test pre-release versions of the product, feeding back their results to the developer? | Yes = 1; No = 0 | 6
Are test procedures, set-up conditions, results, etc., documented so that measurements can be verified, failures reproduced, test conditions recreated, and corrective actions confirmed? | Yes = 1; No = 0 | 5
Will a gold standard (tested product) be preserved for comparative regression analysis? | Yes = 1; No = 0 | 6
Will product changes be regression tested? | Yes = 1; No = 0 | 5
Will the FRU be reliability or endurance tested (at any assembly level)? | Yes = 1; No = 0 | 4
Can parts (ASIC, EPROM) be reprogrammed in the circuit? | Yes = 1; No = 0 | 4
Can active elements be backwardly driven for more complete coverage? | Yes = 1; No = 0 | 4
What % of nodes (interconnection of traces) can be backward driven (thresholds at 80 and 95%)? | <80 = 0; 80-95 = .5; >95 = 1 | 4
Has test fixture complexity been analyzed for fixtures with over 50 pins per square inch? | Yes = 1; No = 0 | 4
What is the test point contact size in mils (thresholds at 40, 32, and 25 mils)? | >40 = 1; 32-40 = .75; 25-32 = .25; <25 = 0 | 4
Is a one-sided or two-sided test fixture used, if this FRU is a PWB? | 1 = 1; 2 = 0 | 2
Has mechanical loading of the test fixture on the device under test been analyzed? | Yes = 1; No = 0 | 2
Are the buses or signal lines actively driven (vs. passively driven)? | Yes = 1; No = 0 | 4
Are the test item configurations representative of both the design (development tests) and production (validation tests) products? | Yes = 1; No = 0 | 4
Will the product be environmentally stress tested (0 to 3) for: 1. Design, 2. Qualification, 3. Product Acceptance? | 3 = 1; 2 = .8; 1 = .5; 0 = 0 | 6
What % of nodes can be tested on this FRU (thresholds at 95, 80 and 50%)? | >95 = 1; 80-95 = .5; 50-80 = .25; <50 = 0 | 4
Are Engineering Design Analysis (EDA) tools available that will be used to support the design task? | Yes = 1; No = 0 | 7
Are EDA tools stable? | Yes = 1; No = 0 | 3
Does the team have a core competency with tool experience? | Yes = 1; No = 0 | 4
Are there dedicated tool support personnel available to the development team? | Yes = 1; No = 0 | 3
Do the development team members have domain expertise with the operating platform, i.e., HP, Unix, networking, and OS? | Yes = 1; No = 0 | 3
Are the tools self-documenting? | Yes = 1; No = 0 | 3

7.2.5.4. Manufacturing Process Grade Factor Questions

Table 7.2-24: Manufacturing Process Grade Factor Questions


Question | Gij | Wij
How many product orientations (axes and directions) are required to assemble this Field Replaceable Unit (FRU) (thresholds at 1, 2, 3 axes out of 6 possible)? | 1 = 1; 2 = .75; 3 = .5; >3 = 0 | 6
How many cuts and traces are allowed if this is a Printed Wiring Board (PWB) (preliminary thresholds at 10, 20)? | <10 = 1; 10-20 = .5; >20 = 0 | 3
Is a Computer Aided Design/Computer Aided Manufacturing (CAD/CAM) process used to support manufacturing? | Yes = 1; No = 0 | 5
Does the CAD/CAM system allow manufacturing personnel to have access to design information and documentation? | Yes = 1; No = 0 | 3
Are there no adjustments associated with this FRU? | Yes = 1; No = 0 | 4
Are parts/assemblies designed to simplify and facilitate automatic feeding and insertion? | Yes = 1; No = 0 | 5
Do hand-inserted parts have visual guides to aid in building the assembly? | Yes = 1; No = 0 | 3
Is there a focused effort to minimize the number of cables and connectors on this FRU? | Yes = 1; No = 0 | 3
Does the system support the FRUs being "hot-pluggable"? | Yes = 1; No = 0 | 4
Has the number of different manufacturing processes been minimized in building this FRU? | Yes = 1; No = 0 | 3
Is a flexible manufacturing process used, such that this new product will be fabricated on an existing, proven line? | Yes = 1; No = 0 | 3
Is a cellular manufacturing process used, where the autonomous manufacturing station has all materials and parts brought to it and it produces a finished product? | Yes = 1; No = 0 | 3
If there are symmetrical, polarized components used on this FRU, is the mounting process made "fool-proof", so that they cannot be inserted backwards? | Yes = 1; No = 0 | 3
Has the total number of threaded fasteners associated with this assembly been minimized? | Yes = 1; No = 0 | 3
What is the number of different fastener types associated with this FRU (thresholds at 0, 1, 2, >2)? | 0 = 1; 1 = .5; 2 = .25; >2 = 0 | 3
Is it easy to visually distinguish between fasteners (e.g., no minor differences in length) prior to installation? | Yes = 1; No = 0 | 3
Is there only one type of fastener drive (torx, Phillips, etc.) needed in the assembly, installation and maintenance of this FRU? | Yes = 1; No = 0 | 3
Are mounting guides or registration pins provided for aligning and securing electro-mechanical or electro-optical parts? | Yes = 1; No = 0 | 3
Are development personnel, including manufacturing, all co-located? | Yes = 1; No = 0 | 6
Does this project have a built-in 15% staffing buffer, i.e., staffing is at least 115% of base requirements? | Yes = 1; No = 0 | 6
Is the project organized around self-directed work teams? | Yes = 1; No = 0 | 3
Are workers rated on both total output and quality? | Yes = 1; No = 0 | 4
Are there process improvement teams with continuous measurable improvement (CMI) goals? | Yes = 1; No = 0 | 5
Are employees rated on field performance of the product? | Yes = 1; No = 0 | 4
Is there an advanced manufacturing engineering (AME) support department to help bridge between engineering and production? | Yes = 1; No = 0 | 3
Has Cross Functional Development Team (CFDT) been implemented such that the manufacturing manager is able to explain the design concept? | Yes = 1; No = 0 | 1
Are manufacturing people encouraged to ask questions of development people (identified points of contact) when questions arise? | Yes = 1; No = 0 | 5
Are enterprise points-of-contact (POCs) identified (development, manufacturing, test, field, marketing) to help answer questions and address issues across the organization? | Yes = 1; No = 0 | 5
Can any of the line or quality personnel "stop the line" if that person believes a serious problem exists? | Yes = 1; No = 0 | 5
Has the majority of the manufacturing leadership had direct field or customer contact in the past year? | Yes = 1; No = 0 | 3
Do manufacturing people have measurable goals to improve production metrics, including quality and cycle time? | Yes = 1; No = 0 | 5
If answer to 3.2.13 is yes, do direct manufacturing people participate in developing the goals? | Yes = 1; No = 0 | 4
Do manufacturing personnel have goals for continuous quality improvement? | Yes = 1; No = 0 | 3
Are there quality circles that meet regularly? | Yes = 1; No = 0 | 3
Are teams rewarded or recognized for improving quality? | Yes = 1; No = 0 | 3
Are key metrics for quality and cost monitored and tracked? | Yes = 1; No = 0 | 5
Is the cost of defect prevention measures tracked (proactive quality)? | Yes = 1; No = 0 | 3
Is the cost of problem corrections tracked (corrective quality)? | Yes = 1; No = 0 | 3
Do the process operators collect and interpret their own statistical process control (SPC) operational data? | Yes = 1; No = 0 | 5
Is machine-level configuration control practiced? | Yes = 1; No = 0 | 5
Is the cost of engineering changes (ECs) tracked and allocated back to the responsible development entity that caused the EC? | Yes = 1; No = 0 | 4
Are root cause failure analyses performed on Pareto-significant manufacturing line problems? | Yes = 1; No = 0 | 6
Are root cause failure analyses performed on Pareto-significant field problems? | Yes = 1; No = 0 | 7
Is there a continuing focus on eliminating test escapes so as to find problems when they are created rather than when the customer receives the system? | Yes = 1; No = 0 | 6
Is a lessons-learned database maintained based upon problem post mortem analysis? | Yes = 1; No = 0 | 6
Are lessons learned fed back to development personnel at the corresponding development phase where particular, significant fault types have been found to occur? | Yes = 1; No = 0 | 4
Will this FRU have an (expected) yield of over 90%? | Yes = 1; No = 0 | 3
Are examples of field manufacturing defects displayed for production personnel? | Yes = 1; No = 0 | 3
Do manufacturing people have current awareness of the field performance of their products, in terms of problem types and problem rates? | Yes = 1; No = 0 | 4
Are the manufacturing processes based upon sensitivity analyses, process Failure Modes and Effects Analysis (FMEA) or Design of Experiments (DOE)? | Yes = 1; No = 0 | 6
Has a declared manufacturing vision that incorporates reliability and quality been established, documented and communicated to personnel? | Yes = 1; No = 0 | 4
Is leadership rotated among manufacturing personnel participating in a quality circle? | Yes = 1; No = 0 | 2
Does management promote quality circles with continuous measurable improvement (CMI) targets? | Yes = 1; No = 0 | 3
Do employees' personal development/assessment plans emphasize product and process quality? | Yes = 1; No = 0 | 5
Are team-building exercises promoted as lead-ins to the production phases? | Yes = 1; No = 0 | 3
Do manufacturing personnel get 40 hours of training a year? | Yes = 1; No = 0 | 2
Do you visit suppliers, review their processes, and make suggestions for process improvement? | Yes = 1; No = 0 | 3
Do you invite suppliers or customers to review your company's processes and allow them to suggest ways the company can do things better? | Yes = 1; No = 0 | 3
Does manufacturing participate in design reviews? | Yes = 1; No = 0 | 5
Is management aware and involved in day to day manufacturing operations on a regular basis? | Yes = 1; No = 0 | 4
Is management located in proximity to line people and accessible to them? | Yes = 1; No = 0 | 4
Do part suppliers manage their stock at your production facility? | Yes = 1; No = 0 | 3
Will this product be built on an existing manufacturing line vs. a new manufacturing process that will have to be developed to support the manufacture of this product? | Yes = 1; No = 0 | 6
Has the manufacturing process been mistake-proofed? | Yes = 1; No = 0 | 6
Is there an EC budget for this FRU, and will results be measured against this budget? | Yes = 1; No = 0 | 3
Do manufacturing personnel know the projected average cost of an EC once the product is in the field? | Yes = 1; No = 0 | 3
Has a demand-based, pull system been established for manufacturing processing stations? | Yes = 1; No = 0 | 4
Are Printed Wiring Boards (PWBs) conformal coated? | Yes = 1; No = 0 | 3
Are there tighter tolerances than 0.020" on unaided hand assembly operations associated with manufacturing this FRU, or integrating it into the next higher level assembly? | Yes = 1; No = 0 | 3
Are there tighter tolerance requirements than 0.005" for fixtured assembly operations with measurement capability? | Yes = 1; No = 0 | 3
Are there tighter tolerances than 0.0005" for automated assembly operations? | Yes = 1; No = 0 | 3
Is the manufacturing process documented? | Yes = 1; No = 0 | 5
Has manufacturing provided a product design checklist of their concerns to the development team at the start of development? | Yes = 1; No = 0 | 5
Is the checklist identified above reviewed for compliance at each development milestone review? | Yes = 1; No = 0 | 5
Is 90% test coverage achieved on the components in this FRU? | Yes = 1; No = 0 | 5
Is a shipping test performed on samples of the packaged product? | Yes = 1; No = 0 | 4
Are FRUs burned in for at least 24 hours? | Yes = 1; No = 0 | 5
Is a "gold standard" of the qualified item maintained for regression test purposes? | Yes = 1; No = 0 | 4
Is Design of Experiments (DOE) used in setting up and controlling testing? | Yes = 1; No = 0 | 7
Is there an Operational Reliability Test conducted to simulate the customer application? | Yes = 1; No = 0 | 5
How many elements (0 to 4) of environmental stress screening (ESS) are run: 1. temperature bake, 2. temperature cycle, 3. temperature shock, 4. vibration? | 0 = 0; 1 = .25; 2 = .5; 3 = .75; 4 = 1 | 5
Are production test stress screens conducted? | Yes = 1; No = 0 | 5
Does this PWB have fewer than 6 layers? | Yes = 1; No = 0 | 3
Is this PWB small enough so that it cannot bow or "oil can" in handling and usage? | Yes = 1; No = 0 | 3
Is it a one-sided or two-sided board? | 1 = 1; 2 = 0 | 3
If this is a PWB, are the majority of components attached via methods other than surface mount technology (SMT)? | Yes = 1; No = 0 | 3
Does this FRU have at least 25% fewer solder joints than its predecessor or competitor? | Yes = 1; No = 0 | 3
What is the solder joint spacing in mils (thresholds at 30, 50, 100)? | <30 = 0; 30-50 = .5; 50-100 = .75; >100 = 1 | 5
Is ball grid array (BGA) technology excluded from this design? | Yes = 1; No = 0 | 3
Has your organization previously implemented BGA technology into a design? | Yes = 1; No = 0 | 6
Have card insertion guides been used in the design? | Yes = 1; No = 0 | 3
Have board stiffeners been used in the design? | Yes = 1; No = 0 | 3

7.2.5.5. Part Quality Process Grade Factor Questions

Table 7.2-25: Part Quality Process Grade Factor Questions


Question | Gij | Wij
Is there a documented part selection and part management process? | Yes = 1; No = 0 | 5
Is there a Preferred Parts List (PPL)? | Yes = 1; No = 0 | 5
Are part evaluation and qualification processes established to add parts to the PPL? | Yes = 1; No = 0 | 5
Does a cross-functional development team (CFDT) review and approve new candidate parts for addition to the PPL? | Yes = 1; No = 0 | 5
Is this a commercial off-the-shelf (COTS) purchased assembly with a good history of operational reliability? If the assembly is not COTS, select "Yes". | Yes = 1; No = 0 | 6
Will new parts be excluded from being added to the PPL to design this FRU? | Yes = 1; No = 0 | 4
Are procedures in place to detect part problems in both manufacturing and the field? | Yes = 1; No = 0 | 5
Are quality and reliability data tracked on parts and fed back to suppliers so they know their performance on this product? | Yes = 1; No = 0 | 5
Is there a design compliance checklist to ensure that all parts are properly applied, operating at sufficient margin with respect to environmental and operational stresses, and take into account lessons learned? | Yes = 1; No = 0 | 6
Are there processes in place that specifically address precautions and handling of parts/components susceptible to electrostatic discharge (ESD)? | Yes = 1; No = 0 | 5
Do part specifications reflect environmental and regulatory compliance requirements for the specific intended application? | Yes = 1; No = 0 | 5
Has mechanical interfacing of critical parts been facilitated by providing mating parts/assemblies to the part supplier? | Yes = 1; No = 0 | 4
Is there an end of life plan to recycle or dispose of this part? | Yes = 1; No = 0 | 4
Are teaming relationships established with all critical component suppliers? | Yes = 1; No = 0 | 6
Are critical parts ISO 9000 certified? | Yes = 1; No = 0 | 4
Are critical parts QS 9000 (automobile manufacturer certification) certified? | Yes = 1; No = 0 | 6
In the case of commercial off-the-shelf (COTS) equipment, is the purchased assembly certified and marked to sell in Europe (CE marked)? If the assembly is not COTS, select "Yes". | Yes = 1; No = 0 | 6
Is the FRU under configuration management control by the time it enters system test? | Yes = 1; No = 0 | 5
Are critical parts burned in for at least 24 hours? | Yes = 1; No = 0 | 5
Will the supplier manage the developer's inventory, in the case of high volume production? | Yes = 1; No = 0 | 4
Will all suppliers provide timely failure reporting and corrective action support (FRACAS) for both critical and custom parts (timely reporting implies a 2 week turnaround with faster response on priority demand)? | Yes = 1; No = 0 | 4
Have vendor dependencies been identified for critical and custom components? | Yes = 1; No = 0 | 4
Have suppliers identified the likely failure modes on critical and custom parts, and does the design take these failure modes into account? | Yes = 1; No = 0 | 4
Are operational failure rate and failure mode data provided by the suppliers of critical and custom parts being used? | Yes = 1; No = 0 | 4
Is there a part control drawing for critical and custom parts? | Yes = 1; No = 0 | 5
Is there a device specification for all critical and custom parts? | Yes = 1; No = 0 | 5
Has the supplier reviewed the part application for all critical and custom parts? | Yes = 1; No = 0 | 4
Has the developer met with suppliers to discuss the application of all critical and custom parts? | Yes = 1; No = 0 | 4
Has a supplier's technical point of contact (POC) been identified for addressing reliability concerns? | Yes = 1; No = 0 | 3
Will critical suppliers provide timely notice of impending part changes to allow the developer to assess the impact? | Yes = 1; No = 0 | 4
Is a change history log maintained to provide traceability of engineering change actions and their associated rationale for critical and custom parts? | Yes = 1; No = 0 | 5
Will part identification (revision numbers) be shown on the part to identify the particular part configuration, including the level of the part's firmware? | Yes = 1; No = 0 | 4
Will suppliers routinely update firmware on parts returned for repair? | Yes = 1; No = 0 | 4
If suppliers update firmware, will the part identification reflect this change? | Yes = 1; No = 0 | 4
Will suppliers' part support timing horizon meet program development, manufacture, and field support component requirements? | Yes = 1; No = 0 | 4
Will the vendor provide timely notice of production/support cessation and provide an "end of life" buy opportunity? | Yes = 1; No = 0 | 4
Will future releases of this part be compatible with respect to form, fit and function? | Yes = 1; No = 0 | 4
Is there a first article inspection and acceptance test planned? | Yes = 1; No = 0 | 4
Do critical and custom parts on this FRU all have at least a 12-month warranty? | Yes = 1; No = 0 | 4
Have likely part developments, evolution, and extensions of critical/custom parts been identified by the supplier? | Yes = 1; No = 0 | 6
Are there 32 Kbytes or more of firmware embedded in this FRU? | Yes = 1; No = 0 | 4
Have development personnel met with the supplier's technical personnel? | Yes = 1; No = 0 | 4
Has a functional block diagram been developed for COTS or purchased complex part assemblies? | Yes = 1; No = 0 | 4
Has a failure history been collected for critical parts, complex assemblies, or COTS items? | Yes = 1; No = 0 | 4
Have key suppliers identified their part failure mechanisms? | Yes = 1; No = 0 | 4
Have suppliers, in the case of complex part assemblies, supported the developer in performing a Failure Modes and Effects Analysis (FMEA) on those assemblies? | Yes = 1; No = 0 | 6
Have the sources and the extent of part variation been identified? | Yes = 1; No = 0 | 5
Have mitigations been identified to handle the effects of part variations? | Yes = 1; No = 0 | 5
Do you know the supplier's dependencies and needs? | Yes = 1; No = 0 | 4
Will a design of experiments part evaluation, considering part variations as well as manufacturing variations, be conducted? | Yes = 1; No = 0 | 6
Have mechanical interfacing components been provided to the key vendors to assure proper mechanical mating? | Yes = 1; No = 0 | 5
Will the developer's quality organization audit suppliers' processes and facility capabilities? | Yes = 1; No = 0 | 5
Will the developer receive notice of pending part changes? | Yes = 1; No = 0 | 5
Will the developer have approval rights of part changes? | Yes = 1; No = 0 | 4
Are procedures and processes in place for the identification and handling of critical reliability components (derating, screening, failure response, etc.)? | Yes = 1; No = 0 | 8

7.2.5.6. System Management Process Grade Factor Questions

Table 7.2-26: System Management Process Grade Factor Questions


Question | Gij | Wij
Does the customer participate with the developer in developing/validating a requirements statement? | Yes = 1; No = 0 | 5
Is Quality Function Deployment (QFD) used to help develop requirements and requirements traceability? | Yes = 1; No = 0 | 7
If QFD is not used, is there another systematic way used, such as a Pugh chart, to identify and document customer needs and preferences? | Yes = 1; No = 0 | 4
Is there a system specification? | Yes = 1; No = 0 | 5
Does an "operations concept" document exist? | Yes = 1; No = 0 | 6
Has a comprehensive literature study been done of relevant design and reliability technology advancements? | Yes = 1; No = 0 | 5
Have previous or similar products been reviewed for their advantages and pitfalls? | Yes = 1; No = 0 | 4
Has a "lessons learned" database been studied to ensure the product will not repeat past problems? | Yes = 1; No = 0 | 5
Have aggressive requirements (particularly reliability, availability, and/or safety) been explicitly specified? | Yes = 1; No = 0 | 10
Have regulatory agency compliance requirements been included? | Yes = 1; No = 0 | 5
Does the requirements definition also account for what the product is supposed to "not do" (for example, air bags should not deploy except on impact)? | Yes = 1; No = 0 | 5
Is there a plan as to how to retire or recycle this new system at the end of its life? | Yes = 1; No = 0 | 2
Does a requirements database exist to capture opportunistic requirements for future consideration? | Yes = 1; No = 0 | 4
Have future expansion requirements been identified (such as loading growth) and can the system handle the projected growth in demand? | Yes = 1; No = 0 | 4
Are requirements deemed achievable within program budget and schedule restraints, with a 90% confidence level? | Yes = 1; No = 0 | 5
Are product requirements allocated to a useful level of indenture (considering complexity, level of design flexibility, and safety concerns)? | Yes = 1; No = 0 | 5
Has a project level Failure Modes and Effects Analysis (FMEA) been done in conjunction with designers and system engineers at the planning stage? | Yes = 1; No = 0 | 5
Will the FMEA be refined down to the Field Replaceable Unit (FRU) level during design? | Yes = 1; No = 0 | 5
Does this product have to meet CE (European) standards? | Yes = 1; No = 0 | 6
Have likely product extensions been identified in the planning stage? | Yes = 1; No = 0 | 3
Are creativity and team building exercises being conducted during the planning stage? | Yes = 1; No = 0 | 2
Are future product releases planned in order to systematically integrate new requirements and features? | Yes = 1; No = 0 | 4
Are trade studies shared with the customer to broaden the base of inputs and support for design decisions? | Yes = 1; No = 0 | 5
Does a vision statement that speaks to reliability exist for the product? | Yes = 1; No = 0 | 5
Does a functional block diagram exist for this system? | Yes = 1; No = 0 | 4
Do sketches, drawings, or models exist for the delivered product? | Yes = 1; No = 0 | 4
Is the development team provided guidelines for acceptable deliverables at kick-off meetings for each development stage? | Yes = 1; No = 0 | 6
Are prototypes planned for early design? | Yes = 1; No = 0 | 5
Is this design an incremental improvement over an existing design? | Yes = 1; No = 0 | 6
Will state diagrams be developed before detail design to depict control flows? | Yes = 1; No = 0 | 4
Will data flow diagrams be developed before detail design begins? | Yes = 1; No = 0 | 4
Are entity-relationship diagrams developed prior to detail design? | Yes = 1; No = 0 | 4
Will a list identifying the capabilities and advantages that this product provides the customer be developed and maintained? | Yes = 1; No = 0 | 2
Is there a system transition plan to replace the current system with the new system, in a smooth, non-disruptive manner? When not applicable select "Yes". | Yes = 1; No = 0 | 8
Are requirements allocated to a useful level of indenture (considering complexity, level of design flexibility, and design autonomy)? | Yes = 1; No = 0 | 5
Are requirements verification activities planned for the appropriate stages of product development? | Yes = 1; No = 0 | 5
Are entrance and exit criteria established for each development stage? | Yes = 1; No = 0 | 5
Is requirements traceability verified and maintained throughout development? | Yes = 1; No = 0 | 5
Is requirements compliance verified prior to the exit of each phase and prior to shipment? | Yes = 1; No = 0 | 5
Are test cases developed concurrently with the design and reviewed by the designers? | Yes = 1; No = 0 | 5
Is there a log of key product decisions and accompanying rationale for traceability? | Yes = 1; No = 0 | 6
Does the specified reliability represent an improvement of 10% or greater over its predecessor or competitive products? | Yes = 1; No = 0 | 5
Is there definition and agreement as to what constitutes successful product reliability performance by the customer? | Yes = 1; No = 0 | 5
Can this product be built using existing manufacturing processes (line)? | Yes = 1; No = 0 | 8
Are development and reliability requirements developed by a cross-functional development team (CFDT)? | Yes = 1; No = 0 | 8
Are system issues routinely documented as action items? | Yes = 1; No = 0 | 5
Do design reviews have technical representation from all interfacing areas? | Yes = 1; No = 0 | 5
Is prototype interconnection hardware routinely provided to interfacing subsystems and suppliers to guide their packaging? | Yes = 1; No = 0 | 3
Is there a requirement to detect and isolate faults to a single FRU 90% of the time? | Yes = 1; No = 0 | 6
Is there a system failure modes and effects analysis (FMEA) done during the planning stage, and is it updated throughout the program? | Yes = 1; No = 0 | 6
Customer and process Q1: Have I identified who are my internal customers and my external customers? | Yes = 1; No = 0 | 3
Customer and process Q2: Have I identified what deliverables my customers need (plans, prototypes, documentation, …)? | Yes = 1; No = 0 | 3
Customer and process Q3: Do I know when my customers require my deliverables? | Yes = 1; No = 0 | 2
Customer and process Q4: Is there a customer centered quality initiative that will be incorporated to differentiate your deliverables? | Yes = 1; No = 0 | 3
Customer and process Q5: Is there an identified tool or process improvement that the reliability section or the development organization will gain from this effort? | Yes = 1; No = 0 | 3
Have the customers been notified of, and do they concur with, items Q1-Q3 above? | Yes = 1; No = 0 | 4
Is there a database that documents cross-functional dependencies that is managed to closure? | Yes = 1; No = 0 | 6
Is a database on cross-functional dependencies maintained? | Yes = 1; No = 0 | 5
Do the developers, reviewers, testers, QA, manufacturing, and customer program office all share in the accountability for getting a successful program to the field? | Yes = 1; No = 0 | 5
Are developers and the entire product team rated or rewarded based upon the field performance of the product? | Yes = 1; No = 0 | 6
Are there designated points of contact in each product development area? | Yes = 1; No = 0 | 3
Are there designated facilitators to manage cross-system issues? | Yes = 1; No = 0 | 5
Are there checklists covering reliability concerns for each program phase? | Yes = 1; No = 0 | 4
Is the technical staff encouraged to talk directly to its customer counterparts? | Yes = 1; No = 0 | 6
Are periodic informal activities, such as brown bag lunches, promoted to encourage team member technical exchange in an informal atmosphere? | Yes = 1; No = 0 | 2
Can a technical employee call for a technical review board of peers when it is felt appropriate to address a broad-impact technical concern? | Yes = 1; No = 0 | 5
Does this equipment not require an interface with other vendors' equipment or government furnished equipment (GFE)? | Yes = 1; No = 0 | 5
Is the % of product reuse from previous products 25% or more of the lines of code for software? | Yes = 1; No = 0 | 6
Is the % of product reuse from previous products 50% or more of the FRU count or cost for hardware? | Yes = 1; No = 0 | 6
Do program planning sessions have cross-functional representation? | Yes = 1; No = 0 | 5
Do technical reviews have cross-functional representation? | Yes = 1; No = 0 | 5
Are there program development plans that show timing of activities and deliverables (this should be done during the requirements phase and maintained throughout the program)? | Yes = 1; No = 0 | 4
Is there a non-management person designated to work full-time as the program technical lead, who works as a cross-team facilitator? | Yes = 1; No = 0 | 8
Is there a team-building effort and project brain storming at each program phase? | Yes = 1; No = 0 | 3
Are documentation products maintained on-line and accessible to all program personnel? | Yes = 1; No = 0 | 5
Is there a program database of "action items" that is maintained and managed to closure? | Yes = 1; No = 0 | 6
Is there a formal documented change process? | Yes = 1; No = 0 | 5
Are self-audits periodically performed on the change process? | Yes = 1; No = 0 | 4
Are business cases always run to evaluate the benefits and impacts of making a change (e.g., Reinertsen's model)? | Yes = 1; No = 0 | 5
Are total cost estimates made for ECs, including scrap, rework, tooling, and the potential slippage of schedule? | Yes = 1; No = 0 | 5
Are there two or fewer ECs planned during the first year of shipping? | Yes = 1; No = 0 | 5
Are ECs blocked into sections and scheduled ahead on periodic intervals to promote timely integration of changes? | Yes = 1; No = 0 | 6
Are ECs at or below the plan to date? | Yes = 1; No = 0 | 3
Do change review meetings have cross-functional representation? | Yes = 1; No = 0 | 5
Are there any ECs that are modifying previous ECs on the FRU? | Yes = 1; No = 0 | 5
Is there an EC meeting log maintained that includes the change rationale, the analysis provided, and meeting participants? | Yes = 1; No = 0 | 5
Are EC management metrics collected with a focus on continual, measurable process improvement? | Yes = 1; No = 0 | 5
Are the program development, integration, and test activities charted, showing tasks, their timing, operational dependencies and identification of critical path activities? | Yes = 1; No = 0 | 5
Are critical path elements identified (e.g., long-lead items)? | Yes = 1; No = 0 | 5
Is there a focus to get items off the critical path? | Yes = 1; No = 0 | 5
Are there risk assessment and contingency plans to minimize critical path risk? | Yes = 1; No = 0 | 4
Has the program met its targeted dates so far? | Yes = 1; No = 0 | 5
Is this product architecture based upon a distributed architecture? | Yes = 1; No = 0 | 8
Are there no future product inventions required on this program? | Yes = 1; No = 0 | 8
Are the R&M design goals sufficiently defined and allocated to ensure that customer needs are met? | Yes = 1; No = 0 | 5
Has development committed to support the required tasks for meeting the customer's R&M needs? | Yes = 1; No = 0 | 5
Does the design approach emphasize R&M as a major goal? | Yes = 1; No = 0 | 5
Has an agreed-to process been defined to assess progress towards meeting R&M goals and requirements? | Yes = 1; No = 0 | 5
Have adequate means been agreed upon to ensure that the R&M objectives of the product will have been achieved? | Yes = 1; No = 0 | 5
Have processes been defined and implemented to ensure that the designed-in (inherent) reliability does not degrade during manufacturing and operational use? | Yes = 1; No = 0 | 5

7.2.5.7. Can Not Duplicate (CND) Process Grade Factor Questions

Table 7.2-27: Can Not Duplicate (CND) Process Grade Factor Questions
Question | Gij | Wij
Is the system required to isolate to a single Field Replaceable Unit (FRU) on 90% of failures? | Yes = 1; No = 0 | 6
Is there a specified time limit to isolate a fault, effect a repair and restore the system? | Yes = 1; No = 0 | 5
Is there a requirement for 90% or greater test coverage within the FRU being analyzed? | Yes = 1; No = 0 | 6
Does the system promote remote serviceability with failure status communicated via Ethernet, serial port, parallel port, serial bus, etc., to a central maintenance station? | Yes = 1; No = 0 | 4
Is there any remote failure protection for this FRU residing on a separate FRU (e.g., an arc suppression circuit that is located on a different FRU than the relay FRU)? | Yes = 1; No = 0 | 5
Is this FRU designed to be hot-pluggable? | Yes = 1; No = 0 | 4
Does the FRU designer also design the fault isolation software that supports fault diagnosis? | Yes = 1; No = 0 | 6
Are multiple occurrences of "Can Not Duplicate" (CND) incidents analyzed for root cause of the problem? | Yes = 1; No = 0 | 6
Are test, warranty, early-life, and high fallout FRUs subjected to double fault verification (this procedure re-inserts the faulted FRU to ensure the problems track the replaced FRU)? | Yes = 1; No = 0 | 10
Do your current products experience 40% or less Can Not Duplicate (CND) failures (note that CNDs are synonymous with No Defects Found (NDF) and No Trouble Found (NTF))? | Yes = 1; No = 0 | 8
Is a failure mode and effect analysis (FMEA) performed down to the FRU level or the Circuit Card Assembly (CCA) level, whichever is lower? | Yes = 1; No = 0 | 5
Do design personnel participate directly in performing the FMEA? | Yes = 1; No = 0 | 5
Are maintenance analysis procedures (MAPs) developed to map failure symptoms to the failing FRU? | Yes = 1; No = 0 | 5
Are the MAPs verified by inserting faults in a maintainability test? | Yes = 1; No = 0 | 4
Are the MAPs updated with actual test and field data? | Yes = 1; No = 0 | 5
Has your company established the cost impact of a field failure? | Yes = 1; No = 0 | 3
Does the system contain error logging and reporting capability? | Yes = 1; No = 0 | 5
Does the system promote ongoing analysis of soft error conditions that might predict when a likely failure will occur? | Yes = 1; No = 0 | 4
Will the contractor developing this equipment also be responsible for maintaining it? | Yes = 1; No = 0 | 5
Does the repair facility have the ability to recreate the conditions under which a true false alarm occurred (sequence of events, operator error, sneak circuit, etc.) and are these techniques used to try to recreate the failure? | Yes = 1; No = 0 | 5
Does the repair facility have the ability to recreate the conditions under which a real failure occurred (high/low temperature, thermal cycling/shock, vibration/mechanical shock, etc.) and are these techniques used to try to recreate the failure? | Yes = 1; No = 0 | 5
Will the maintainer be motivated to provide timely and complete documentation of the diagnosis and repair action? | Yes = 1; No = 0 | 5
Do the system maintenance personnel receive feedback on their repair reports and the actions taken to mitigate the failure reoccurrence? | Yes = 1; No = 0 | 5
Are the performance specification limits of the test equipment used to troubleshoot/repair the system, FRU, etc., equal to or more stringent than the performance specification limits of the system, FRU, etc., in its actual application? | Yes = 1; No = 0 | 5
Are CND failures included in the Failure Reporting and Corrective Action System (FRACAS) and closed out through corrective action verification? | Yes = 1; No = 0 | 5

7.2.5.8. Induced Process Grade Factor Questions

Table 7.2-28: Induced Process Grade Factor Questions

Question | Gij | Wij
Are parts/materials selected, as appropriate to meet design performance requirements, that minimize the risk of induced failure through electrostatic discharge? | Yes = 1, No = 0 | 6
If parts/materials are susceptible, are procedures used to protect them during handling, test, assembly, packaging, storage, transportation and use (i.e., wrist straps, non-conductive work areas, ionized air, warning labels, maintenance manuals, etc.)? | Yes = 1, No = 0 | 4
Are electronic circuits designed and analyzed to minimize secondary failures attributable to electrical overstress resulting from another primary failure? | Yes = 1, No = 0 | 4
Are electronic circuits designed and analyzed to minimize secondary failures attributable to electrical transients generated within the system/FRU, or received from outside the system/FRU (via cable/wiring harnesses)? | Yes = 1, No = 0 | 6
Are maintenance manuals/procedures written such that risk of Electrostatic Discharge/Electrical Overstress (ESD/EOS) during troubleshooting and repair activity is identified (warning labels, etc.)? | Yes = 1, No = 0 | 4


Table 7.2-28: Induced Process Grade Factor Questions (continued)

Question | Gij | Wij
Has the operating environment that the part/FRU/system is to be used in been evaluated to determine the potential for mishandling of the equipment that could result in induced mechanical failure (weather; personnel capabilities; training needs)? | Yes = 1, No = 0 | 6
Are parts/materials selected, as appropriate to meet design performance requirements, that minimize the risk of induced (mechanical) secondary failure resulting from the primary failure of another part/assembly? | Yes = 1, No = 0 | 4
If parts/materials are susceptible to induced mechanical damage, are procedures in place to protect them during handling, test, assembly, packaging, storage, transportation and use? | Yes = 1, No = 0 | 4
Is the part, FRU, and/or system designed such that it can be handled and transported in a manner that minimizes the risk of induced mechanical failure (proper location/use of handles; orientation labels such as "This Side Up"; etc.)? | Yes = 1, No = 0 | 4
Are shipping tests run to ensure adequacy of packaging and shipping procedures to protect the product during transportation? | Yes = 1, No = 0 | 4
Are maintenance manuals/procedures written such that the risk of induced mechanical damage during troubleshooting and repair activity is identified (warning labels, etc.)? | Yes = 1, No = 0 | 4
Do maintenance manuals include detailed instructions for removing and replacing parts/components/assemblies from sockets and/or soldered PCB and multilayer boards, etc.? | Yes = 1, No = 0 | 4
Do maintenance manuals include detailed instructions for disconnecting and reconnecting wires, harnesses, cables, hoses, etc.? | Yes = 1, No = 0 | 4
Is the FRU/system ergonomically designed such that it can be used by the customer in normal operation without unnecessary risk of induced mechanical damage? | Yes = 1, No = 0 | 4
Is the FRU designed to withstand normal handling and expected mishaps (e.g., a drop off a 36-inch high table top) without induced mechanical damage? | Yes = 1, No = 0 | 4
Are wires color coded, and connectors keyed or of differing configuration, such that FRUs cannot be misplugged? | Yes = 1, No = 0 | 4

7.2.5.9. Wearout Process Grade Factor Questions

Table 7.2-29: Wearout Process Grade Factor Questions

Question | Gij | Wij
Have all parts and materials been selected for use in the design that extend the wearout life of the part/Field Replaceable Unit (FRU)/system to meet/exceed its required useful life? | Yes = 1, No = 0 | 6
Has the expected reliability of parts subjected to significant mechanical loading been modeled to ensure the capability to endure the mission, e.g., using Miner's life expectation rule for components subjected to cyclical loads? | Yes = 1, No = 0 | 4
Have wearout failure modes and mechanisms at the part, FRU and system level been identified and mitigated during the Failure Modes and Effects Analysis (FMEA) process? | Yes = 1, No = 0 | 6
Do the relevant failure modes/mechanisms include fatigue (solder joints for electronic components/assemblies; welds for bonded materials; fractures in mechanical parts/assemblies/materials; etc.)? | Yes = 1, No = 0 | 4
Do the relevant failure modes/mechanisms include leaks (electrolyte loss in electrolytic capacitors; worn seals in hydraulic systems; etc.)? | Yes = 1, No = 0 | 4
Do the relevant failure modes/mechanisms include chafing (wires in electrical harnesses; wear in hydraulic lines and hoses; etc.)? | Yes = 1, No = 0 | 4
Do the relevant failure modes/mechanisms include cold flow of insulation (wires wrapped around sharp edges or subjected to pressure points; etc.)? | Yes = 1, No = 0 | 4
Do the relevant failure modes/mechanisms include wearout resulting from cyclic operations (activation of electronic switch/relay contacts; mating/unmating of electronic or mechanical connectors; etc.)? | Yes = 1, No = 0 | 4
Do the relevant failure modes/mechanisms include wearout resulting from breakdown of insulation in wires, or dielectric materials in semiconductors? | Yes = 1, No = 0 | 4
Do the relevant failure modes/mechanisms include wearout resulting from moving parts (bearings, gears, belts, springs, seals, etc.)? | Yes = 1, No = 0 | 4
Has the system/FRU/part design been modified based on the wearout modes/mechanisms identified in the FMEA to reduce or minimize their occurrence to the maximum extent feasible? | Yes = 1, No = 0 | 4


Table 7.2-29: Wearout Process Grade Factor Questions (continued)

Question | Gij | Wij
Are process FMEAs performed to determine the failure modes/mechanisms of critical processes during manufacturing? | Yes = 1, No = 0 | 4
Are data collected and analyses performed to determine the process capability of manufacturing processes? | Yes = 1, No = 0 | 4
Is statistical process control (SPC) applied to manufacturing processes to control the process mean and variability? | Yes = 1, No = 0 | 4
Is the measured mean of each manufacturing process parameter equal to, or better than, the parameter value used to calculate the wearout failure rates of the system/FRU parts/components? | Yes = 1, No = 0 | 4
If required, has this product been hardened to withstand adverse environmental stresses such as corrosion, radiation, humidity, etc.? | Yes = 1, No = 0 | 4
Are procedures defined/implemented to ensure that assembly/test steps during manufacturing do not contribute to early wearout of susceptible items (i.e., minimize connector matings/unmatings; stress relief/tie-downs to minimize chafing during test; etc.)? | Yes = 1, No = 0 | 6
Do maintenance manuals/procedures instruct repair personnel to check that wire harnesses are properly secured, seals are properly reinstalled, connectors are properly mated, etc., following troubleshooting/repair? | Yes = 1, No = 0 | 4
Is preventive maintenance planned to replace wearout-susceptible parts/materials at or before their L10 life (where no more than 10% of the units should experience wearout)? | Yes = 1, No = 0 | 6
Are wearout-susceptible parts/materials inspected during each corrective maintenance action to find and replace items exhibiting premature wearout? | Yes = 1, No = 0 | 4
Are wearout-susceptible parts/materials inspected during each preventive maintenance action to find and replace items exhibiting premature wearout? | Yes = 1, No = 0 | 4
Are wearout failures (both valid and premature) included in the Failure Reporting and Corrective Action System (FRACAS) and closed out through corrective action, which could include life-extension opportunities? | Yes = 1, No = 0 | 4
Is field data tracked and analyzed to detect FRUs displaying increasing failure rate tendencies, i.e., wearout? | Yes = 1, No = 0 | 4
No = 0

7.2.5.10. Growth Process Grade Factor Questions

Table 7.2-30: Growth Process Grade Factor Questions

Question | Gij | Wij
Is there an effective Failure Reporting and Corrective Action System (FRACAS) in place for the fielded system? | Yes = 1, No = 0 | 8
What is the percentage of field failures for which the root cause is determined? | G = percentage/100 | 8
Is analysis performed to determine if the failure is recurring? | Yes = 1, No = 0 | 6
Are design, manufacturing, or system management related potential corrective actions identified? | Yes = 1, No = 0 | 6
Are the original designers or manufacturing personnel consulted regarding the potential corrective action? | Yes = 1, No = 0 | 4
Is there a field support infrastructure in place that can effect the necessary changes? | Yes = 1, No = 0 | 10
Are systems adequately tested to ensure that the changes were made properly without inducing other defects or damage? | Yes = 1, No = 0 | 5
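
For readers who want to experiment with these gradings, the short Python sketch below shows one plausible way to score such a table. It assumes, for illustration only, that a process grade is the weight-normalized sum of the question grades (the governing 217Plus equations appear earlier in this book); the function name and data layout are ours, not RIAC's.

def process_grade(answers):
    """Weight-normalized process grade from (grade, weight) pairs.

    Each grade Gij is 1 for "Yes", 0 for "No", or a fraction such as
    percentage/100 where a table row calls for one; Wij is the weight.
    Returns a value between 0 and 1 (assumed normalization).
    """
    total_weight = sum(w for _, w in answers)
    return sum(g * w for g, w in answers) / total_weight

# Example: answers to the first four questions of Table 7.2-30
# (FRACAS in place = Yes; 85% of root causes determined;
# recurrence analysis = Yes; corrective actions identified = No)
growth = [(1, 8), (0.85, 8), (1, 6), (0, 6)]
print(round(process_grade(growth), 3))  # 0.743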


7.3. Life Modeling Example


7.3.1. Introduction
This section presents the results of an analysis in which the intent was to quantify the
reliability of a seal used in an assembly. The approach taken in the analysis was to
perform life tests under a variety of conditions, and to develop life models from this data
so that lifetimes could be predicted as a function of the appropriate stress and product
variables. In this manner, estimates of reliability under a wide range of use conditions
could be made.

This is an example of an assessment methodology, the results of which would be more
accurate than a prediction method applied to the seal. If the analyst is able to develop a
model like the one presented here for a specific component or failure cause, the resulting
model should be weighted more heavily than a prediction for that specific component.

7.3.2. Approach
All samples were tested under a variety of temperature and relative humidity conditions.
In addition, the samples included two factors that were varied in the life tests: Process
Force and Hardness. These stresses and product/process variables were expected to be
the ones that most heavily influenced product reliability.

7.3.3. Reliability Test Plan


The Reliability Test Plan required that the lifetime be measured at various magnitudes of
these variables, such that life model parameters (including acceleration factors) could be
quantified. Table 7.3-1 summarizes, for each variable, the number of levels, and the level
values.

Table 7.3-1: Parameter Levels


Variable Number of Levels Levels
Temperature 2 85, 130 C
Humidity 2 85, 100%
Process force 2 2, 20 N
Hardness 3 25, 50, 100 V

Table 7.3-2 summarizes the tests performed.


Table 7.3-2: Test Plan Summary


Sample Size  Temperature  Humidity  Hardness  Process Force
7 85 85 25 2
7 85 85 50 2
7 85 85 100 2
7 85 85 25 20
7 85 85 50 20
7 85 85 100 20
7 130 85 25 2
7 130 85 50 2
7 130 85 100 2
7 130 85 25 20
7 130 85 50 20
7 130 85 100 20
7 130 100 25 2
7 130 100 50 2
7 130 100 100 2
7 130 100 25 20
7 130 100 50 20
7 130 100 100 20

The tests were performed by first inspecting the samples, then exposing them to the
specific combination of variables as previously summarized, and, finally, re-inspecting
them at various intervals. The exposure times and inspection intervals were structured
such that short lifetimes could be observed in the event that acceleration factors were
higher than anticipated. Therefore, more frequent inspections were performed early in
the test, followed by less frequent inspections for the surviving samples. Failed samples
were removed from the test.

Data was then summarized in a format suitable for life modeling. The required data
elements included stress and product/process variables, plus life variables, as follows:

• Variables:
o Temperature
o Humidity
o Process force
o Hardness


• Life variables
o Last known good time
o First known bad time
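
A minimal Python sketch of how such interval-censored records might be organized for the modeling step; the field names are illustrative assumptions, not taken from the original analysis.

from dataclasses import dataclass
from typing import Optional

@dataclass
class LifeRecord:
    temperature_c: float        # stress variable
    humidity_pct: float         # stress variable
    hardness: float             # product variable
    process_force_n: float      # process variable
    last_good: float            # last inspection time with no failure seen
    first_bad: Optional[float]  # first inspection at which the failure was
                                # seen; None for samples surviving the test

# A failure bracketed between two inspections, and a survivor removed
# from test at 1159 hours:
records = [
    LifeRecord(130, 85, 25, 2, last_good=130.0, first_bad=158.0),
    LifeRecord(85, 85, 25, 2, last_good=1159.0, first_bad=None),
]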

7.3.4. Results

7.3.4.1. Times to Failure Summary


The test results for the seal samples are presented in Table 7.3-3. Included in this table is
the sample number, the temperature (in degrees C), the relative humidity, the Hardness,
the process force, whether the sample failed (F) or survived (S), and the time at which it
failed or survived.

Table 7.3-3: Life Test Results


T RH Hardness Force F or S Time to F/S    T RH Hardness Force F or S Time to F/S
85 85 25 2 S 1159 85 85 100 20 S 1159
85 85 25 2 S 1159 85 85 100 20 S 1159
85 85 25 2 S 1159 85 85 100 20 S 1159
85 85 25 2 F 1159 85 85 100 20 S 1159
85 85 25 2 S 1159 85 85 100 20 S 1159
85 85 25 2 S 1159 85 85 100 20 S 1159
85 85 25 2 S 1159 130 85 25 2 F 278
85 85 25 20 S 1159 130 85 25 2 F 158
85 85 25 20 S 1159 130 85 25 2 F 130
85 85 25 20 S 1159 130 85 25 2 F 237.5
85 85 25 20 S 1159 130 85 25 2 F 158
85 85 25 20 S 1159 130 85 25 2 F 196.5
85 85 25 20 S 1159 130 85 25 2 F 130
85 85 25 20 S 1159 130 85 25 20 F 158
85 85 50 2 S 1159 130 85 25 20 F 196.5
85 85 50 2 S 1159 130 85 25 20 F 237.5
85 85 50 2 S 1159 130 85 25 20 F 428
85 85 50 2 S 1159 130 85 25 20 F 237.5
85 85 50 2 S 1159 130 85 25 20 F 130
85 85 50 2 S 1159 130 85 25 20 F 158
85 85 50 2 S 1159 130 85 50 2 F 237.5
85 85 50 20 S 1159 130 85 50 2 F 196.5
85 85 50 20 S 1159 130 85 50 2 F 278
85 85 50 20 S 1159 130 85 50 2 F 196.5
85 85 50 20 F 778 130 85 50 2 F 278
85 85 50 20 S 1159 130 85 50 2 F 158
85 85 50 20 S 1159 130 85 50 2 F 158
85 85 50 20 S 1159 130 85 50 20 F 196.5
85 85 100 2 S 1159 130 85 50 20 F 158
85 85 100 2 S 1159 130 85 50 20 F 196.5
85 85 100 2 S 1159 130 85 50 20 F 428
85 85 100 2 S 1159 130 85 50 20 F 158
85 85 100 2 S 1159 130 85 50 20 F 237.5
85 85 100 2 S 1159 130 85 50 20 F 158
85 85 100 2 S 1159 130 85 100 2 F 278
85 85 100 20 S 1159 130 85 100 2 F 158


T RH Hardness Force F or S Time to F/S    T RH Hardness Force F or S Time to F/S


130 85 100 2 F 278 130 100 50 2 F 58
130 85 100 2 F 220 130 100 50 2 F 58
130 85 100 2 F 371 130 100 50 2 S 70
130 85 100 2 F 278 130 100 50 20 F 58
130 85 100 2 F 325 130 100 50 20 S 70
130 85 100 2 F 428 130 100 50 20 F 34
130 85 100 20 F 58 130 100 50 20 F 34
130 85 100 20 F 325 130 100 50 20 F 58
130 85 100 20 F 428 130 100 50 20 F 34
130 85 100 20 F 325 130 100 50 20 F 58
130 85 100 20 F 428 130 100 100 2 S 70
130 85 100 20 F 278 130 100 100 2 F 58
130 85 100 20 F 196.5 130 100 100 2 F 58
130 100 25 2 F 58 130 100 100 2 F 58
130 100 25 2 F 59 130 100 100 2 F 58
130 100 25 2 F 34 130 100 100 2 F 58
130 100 25 2 F 58 130 100 100 2 F 34
130 100 25 2 F 34 130 100 100 20 S 70
130 100 25 2 F 34 130 100 100 20 S 70
130 100 25 2 F 58 130 100 100 20 F 58
130 100 25 20 F 58 130 100 100 20 F 58
130 100 25 20 S 70 130 100 100 20 F 58
130 100 25 20 F 34 130 100 100 20 F 58
130 100 25 20 F 34 130 100 100 20 S 70
130 100 25 20 F 58
130 100 25 20 F 1.5
130 100 25 20 F 58
130 100 50 2 F 58
130 100 50 2 S 70
130 100 50 2 F 58
130 100 50 2 S 70

The 2-parameter Weibull distribution parameters for the TTF distributions for the
samples are shown in Table 7.3-4.

Table 7.3-4: Times to Failure Distribution Parameters


Test Condition | Characteristic Life | Shape Parameter
85C/85%RH | 2109 | 5.1
130C/85%RH | 268 | 2.71
130C/100%RH | 62.1 | 3.2

The TTF distributions for each of the three test conditions are illustrated in Figure 7.3-1.

[Figure: ReliaSoft Weibull++ probability plot of unreliability F(t) versus time for the three test conditions, fitted by MLE:
SL-130, 100: β = 3.2183, η = 62.1195 (F = 33, S = 9)
SL-130, 85: β = 2.7221, η = 268.2479 (F = 43, S = 0)
SL-85, 85: β = 5.0505, η = 2109.0635 (F = 2, S = 40)]
Figure 7.3-1: Times To Failure Distributions

7.3.4.2. Life Models


Life models were generated from the data summarized above. These life models estimate
the TTF distribution as a function of the variables used in the experiments.

A general form of the Weibull reliability function used is:

$R = e^{-\left(t/\alpha\right)^{\beta}}$
where:

R= the reliability, or probability of survival, at time “t”


α= the Weibull characteristic life (i.e., the time to 63% failure)

β= the Weibull shape parameter

The characteristic life is then developed as a function of the applicable variables. The
model form is:
$\alpha = e^{\alpha_0} \, e^{\alpha_1 / T} \, RH^{\alpha_2} \, H^{\alpha_3} \, F^{\alpha_4}$

Where:

α0 through α4 = parameter coefficients estimated in the life modeling process


T = temperature in degrees K (C + 273)
RH = relative humidity
H = hardness
F = the process force
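
Taken together, these two relationships are straightforward to evaluate in code. A minimal Python sketch follows; the function names are ours, and the coefficients are passed in so that the fitted estimates of Table 7.3-5 can be substituted.

import math

def char_life(T_kelvin, RH, H, F, a0, a1, a2, a3, a4):
    """Characteristic life: alpha = e^a0 * e^(a1/T) * RH^a2 * H^a3 * F^a4."""
    return math.exp(a0) * math.exp(a1 / T_kelvin) * RH**a2 * H**a3 * F**a4

def weibull_reliability(t, alpha, beta):
    """Weibull reliability R(t) = exp(-(t/alpha)^beta)."""
    return math.exp(-((t / alpha) ** beta))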

Maximum likelihood analysis was performed to determine the values of β, α0, α1, α2, α3,
and α4 that maximize the likelihood function below. These parameter estimates then
become the coefficients in the life model. The likelihood function is:

$L = \prod_i f(t_i, \beta, \alpha_0, \ldots, \alpha_4) \cdot \prod_j \left[ 1 - F(t_j, \beta, \alpha_0, \ldots, \alpha_4) \right] \cdot \prod_k \left[ F(t_k, \beta, \alpha_0, \ldots, \alpha_4) - F(t_{k-1}, \beta, \alpha_0, \ldots, \alpha_4) \right]$

where:

f = Weibull pdf (probability density function)
F = cumulative Weibull distribution function (probability of failure)
ti = failure times
tj = survival (censoring) times
tk and tk-1 = times that bracket the failure interval

The first of the three product terms represents failures at known times, the second
represents survivals, and the third represents failures that occurred within inspection
intervals for which the precise failure times are not known.

Once the model parameters are estimated in this fashion, the reliability at any time, and
for any combination of variables, can be estimated.
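
A sketch of the corresponding negative log-likelihood, using SciPy's Weibull implementation. The data layout and the optimizer call are illustrative assumptions; the original analysis used a commercial package.

import numpy as np
from scipy import stats

def neg_log_likelihood(params, exact, censored, intervals):
    """Negative log-likelihood over exact failures, right-censored
    survivals, and interval-censored failures.

    exact/censored: iterables of (t, T, RH, H, F);
    intervals: iterable of (t_lo, t_hi, T, RH, H, F).
    """
    beta, a0, a1, a2, a3, a4 = params
    if beta <= 0:
        return np.inf

    def alpha(T, RH, H, F):
        return np.exp(a0) * np.exp(a1 / T) * RH**a2 * H**a3 * F**a4

    ll = 0.0
    for t, T, RH, H, F in exact:            # failures at known times
        ll += stats.weibull_min.logpdf(t, beta, scale=alpha(T, RH, H, F))
    for t, T, RH, H, F in censored:         # survivors (right-censored)
        ll += stats.weibull_min.logsf(t, beta, scale=alpha(T, RH, H, F))
    for lo, hi, T, RH, H, F in intervals:   # failures found between inspections
        a = alpha(T, RH, H, F)
        ll += np.log(stats.weibull_min.cdf(hi, beta, scale=a)
                     - stats.weibull_min.cdf(lo, beta, scale=a))
    return -ll

# The maximum likelihood estimates can then be found with, e.g.:
#   scipy.optimize.minimize(neg_log_likelihood, x0,
#                           args=(exact, censored, intervals),
#                           method="Nelder-Mead")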


The estimated parameters are summarized in Table 7.3-5. In this table, the best estimate
is provided along with the 80% two-sided confidence bounds around the estimate. A small
variation between the lower and upper confidence bounds is indicative of a significant
variable.

Table 7.3-5: Estimated Parameter 80% 2-Sided Confidence Bounds


Parameter Lower 80% CL Best Estimate Upper 80% CL
β 2.737 3.073 3.450
α0 19.68 23.98 28.28
α1 6957.2 8015.7 9074.3
α2 -9.45 -8.83 -8.21
α3 0.131 0.215 0.299
α4 -0.0031 0.0388 0.0807

The resulting equation for the characteristic life is then:

$\alpha = e^{23.98} \, e^{8015.7/T} \, RH^{-8.83} \, H^{0.2150} \, F^{0.0388}$

Once the model parameters are estimated, a variety of output formats are possible. For
example, Figure 7.3-2 illustrates the probability of failure as a function of temperature
and relative humidity at a time of 50,000 hours.
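
As an illustration, the fitted model can be evaluated directly. The sketch below uses the best-estimate coefficients from Table 7.3-5; the specific use conditions chosen are arbitrary examples, not values from the original study.

import math

beta = 3.073                                   # Table 7.3-5 best estimates
a0, a1, a2, a3, a4 = 23.98, 8015.7, -8.83, 0.215, 0.0388

def unreliability(t, temp_c, rh_pct, hardness, force):
    T = temp_c + 273.0                         # degrees C to kelvin
    alpha = (math.exp(a0) * math.exp(a1 / T)
             * rh_pct**a2 * hardness**a3 * force**a4)
    return 1.0 - math.exp(-((t / alpha) ** beta))

# Probability of failure by 50,000 hours at 40C/50%RH, hardness 50,
# force 2 N (illustrative conditions only):
print(unreliability(50_000, 40.0, 50.0, 50.0, 2.0))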


[Surface plot: unreliability as a function of temperature and relative humidity]

Figure 7.3-2: Probability of Failure vs. Temperature and Relative Humidity at 50,000
Hours

7.4. NPRD Description


Information from the RIAC document “Nonelectronic Parts Reliability Data (NPRD)”
(Reference 4) is presented here to provide the reader with:

1. An understanding of the issues involved in the collection and interpretation of field reliability data
2. A summary of alternatives that can be used to combine data from various sources

The purpose of NPRD is to present failure rate data on a wide variety of
electromechanical and mechanical parts and assemblies (including many types of
electronic assemblies). While there are reliability prediction methodologies for standard
electronic components such as MIL-HDBK-217 and the RIAC 217Plus methodology,
there are few sources of failure rate data for other component types. All part types and
assemblies for which RIAC has data are included in NPRD, with the exception of
standard electronic component types. Although the data contained in NPRD were
collected from a wide variety of sources, RIAC has screened the data such that only high
quality data is added to the database and presented in this document. In addition, only
field failure rate data is included. The intent of this section is to provide the user with
information to adequately interpret and use data to supplement standard reliability
prediction methodologies.

It is not feasible for documents like MIL-HDBK-217 or other prediction methodologies
to contain failure rate models on every conceivable type of component and assembly.
Traditionally, reliability prediction models have been primarily applicable only for
generic electronic components. Therefore, NPRD serves a variety of needs:

• To provide failure rates on assemblies in cases where piece-part level analyses are not feasible or required
• To complement other prediction methodologies by providing data on part
types not addressed by its models

7.4.1. Data Collection


The failure rate data contained in the newest version of NPRD (NPRD-2010) represents
a cumulative compilation of data collected from the early 1970s through
December 2008. RIAC is continuously soliciting new field data in an effort to keep the
databases current. The goals of these data collection efforts are as follows:

1. To obtain data on relatively new part types and assemblies.


2. To collect as much data as possible from as many different data sources,
application environments, and quality levels.
3. To identify as many characteristic details as possible, including both part and
application parameters.

The following generic sources of data were used for this publication:

1. Published reports and papers


2. Data collected from government-sponsored studies
3. Data collected from military maintenance data collection systems
4. Data collected from commercial warranty repair systems
5. Data from commercial/industrial maintenance databases


6. Data submitted directly from military or commercial organizations that maintain failure databases

An example of the process by which RIAC identifies candidate systems and extracts
reliability data on military systems is summarized in Table 7.4-1.

Table 7.4-1: Data Summarization Process

(1) Identify System Based On:
• Environments/Quality
• Age
• Component Types
• Availability of Quality Data

(2) Build Parts List:
• Obtain Illustrated Parts Breakdown (IPB)
• Ensure Correct Version of System Consistent with Maintenance Data
• Identify Characteristics of Components (Part Numbers, Federal Stock Number, Vendor Catalogs, etc.)
• Enter Part Characteristics into Database

(3) Obtain Failure Data:
• Reliability Improvement Warranty, DO56, Warranty Records, etc.
• Match Failures to IPB
• Ensure Part Replacements were Component Failures
• Add Failure Data to Database

(4) Obtain Operating Data:
• Verify Equipment Inventory
• Equipment Hours/Miles, Part Hours/Miles
• Application Environment

(5) Transform Data to Common RIAC Database Template

Perhaps the most important aspect of this data collection process is identifying viable
sources of high quality data. Large automated maintenance databases, such as the Air
Force REMIS system or the Navy's 3M and Avionics 3M systems, typically will not
provide accurate data on piece parts. They can, however, provide acceptable data on
assemblies or LRUs, if used judiciously. Additionally, there are specific instances in
which they can be used to obtain piece-part data. Piece-part data from these maintenance
systems is used in the RIAC's data collection efforts only when it can be verified that
they accurately report data at this level. Reliability Improvement Warranty (RIW) data
are another high-quality data source that has been used.

Completeness of data, consistency of data, equipment population tracking, failure
verification, availability of parts breakdown structures, and characterization of
operational histories are all used to determine the adequacy of the data. In many cases,
data submitted to the RIAC is discarded since an acceptable level of credibility does not
exist.

Inherent limitations in data collection efforts can result in errors and inaccuracies in
summary data. Care must be taken to ensure that the following factors are considered
when using a data source. Some of the sources of error are:

1. There are many more factors affecting reliability than can be identified
2. There is a degree of uncertainty in any failure rate data collection effort. This
uncertainty is due to the following factors:
a. Uncertainty as to whether the failure was inherent (common cause) or
event-related (special cause)
b. Difficulty in separating primary and secondary failures
c. Much of the collected data is generic and not manufacturer specific,
indicating that variations in the manufacturing process are not accounted
for
d. It is very difficult to distinguish between the effects of highly correlated
variables. For example, the fact that higher quality components are
typically used in more severe environments makes it impossible to
distinguish the effect that each has, independently, on reliability.
e. Operating hours can be reported inaccurately
f. Maintenance logs can be incomplete

Actual component stresses are rarely known. Even if nominal stresses are known, actual
stresses which significantly impact reliability can vary significantly about this nominal
value. The impacts of complex environmental stresses on reliability during field
operation of a product or system are also extremely difficult, if not impossible, to discern.

When collecting field failure data, a very important variable is the criteria used to define,
detect and classify failures. Much of the failure data presented in NPRD-2010 were
identified by maintenance technicians performing a repair action, indicating that the
criterion for failure is that a part in a particular application has failed in a manner that
makes it apparent to the technician. In some data sources, the criterion for failure was
that the component replacement must have remedied the failure symptom.


7.4.2. Data Interpretation


Data contained in NPRD-2010 reflects industry average failure rates, especially the
summary failure rates which were derived by combining several failure rates on similar
parts/assemblies from various sources. In certain instances, reliability differences can be
distinguished between manufacturers or between detailed part characteristics. Although
the summary section of NPRD cannot be used to identify these differences (since it
presents summaries only by generic type, quality, environment, and data source), the
listings in the detailed section of NPRD contain all of the specific information that was
known for each part and, therefore, can sometimes be used to identify such differences.

Data in the summary section of NPRD represent an "estimate" of the expected failure
rate. The "true" value will lie within some confidence interval about that estimate. The
traditional method of identifying confidence limits for components with exponentially
distributed lifetimes has been the use of the Chi-Square distribution. This distribution
relies on the observance of failures from a homogeneous population and, therefore, has
limited applicability to merged data points from a variety of sources.

To give users of NPRD a better understanding of the confidence they can place in the
presented failure rates, an analysis of RIAC data in the past concluded that, for a given
generic part type, the natural logarithm of the observed failure rate is normally distributed
with a standard deviation of 1.5. This means that 68 percent of the actual experienced
failure rates will be between 0.22 and 4.5 times the mean value. Similarly, 90% of actual
failure rates will be between 0.08 and 11.9 times the presented mean value. As a general
rule-of-thumb, this type of precision is typical of probabilistic reliability prediction
models and point-estimate failure rates such as those contained within NPRD. It should
be noted that this precision is applicable to predicted failure rates at the component level,
and that confidence will increase as the statistical distributions of components are
combined when analyzing modules or systems.
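
These multiplicative bounds follow directly from the stated lognormal spread, as the short computation below verifies (the z-values are the standard normal quantiles for two-sided 68% and 90% intervals):

import math

sigma = 1.5  # standard deviation of ln(failure rate), per the RIAC analysis

for label, z in (("68%", 1.0), ("90%", 1.645)):
    print(f"{label}: {math.exp(-z * sigma):.2f}x to {math.exp(z * sigma):.1f}x")
# 68%: 0.22x to 4.5x; 90%: 0.08x to 11.8x (the text rounds to 11.9)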

In virtually all of the field failure data collected for NPRD, TTF was not available. Few
current DoD or commercial data tracking systems report elapsed time indicator (ETI)
meter readings that would allow TTF compilations. Those that do lose accuracy
following removal and replacement of failed items. To accurately monitor these times,
each replaceable item would require its own individual time recording device. Data
collection efforts typically track only the total number of item failures, part populations,
and the number of system operating hours. This means that the assumed underlying TTF
distribution for all failure rates presented in NPRD is the exponential distribution.
Unfortunately, many part types for which data are presented typically do not follow the
exponential failure law, but rather exhibit wearout characteristics, or an increasing failure
rate in time. While the actual TTF distribution may be Weibull or lognormal, it may
appear to be exponentially distributed if a long enough time has elapsed. This
assumption is accurate only under the condition that components are replaced upon
failure, which is true for the vast majority of data contained in NPRD. To illustrate this,
refer to Figure 7.4-1, which depicts the apparent failure rate for a population of
components that are replaced upon failure, each of which follow the Weibull TTF
distribution. This illustrates Drenick’s theorem that was discussed earlier in this book.

MTTF = Mean-Time-to-Failure, α = Weibull Characteristic Life

Figure 7.4-1: Apparent Failure Rate for Replacement Upon Failure

At t = 0, the population of parts has not experienced operation. As operating time
increases, parts in the original population are replaced and the failure rate increases. The
failure rate then decreases as the majority of parts have been replaced with new parts.
The population of replaced parts undergoes the same process, with the exception that the
deviation of the second distribution is greater due to the fact that the "time zeros" of the
replaced parts, themselves, are spread over time. This process continues until the "time
zeros" of the parts have become sufficiently randomized to result in an apparent
exponentially distributed population. The approximate time at which this asymptotic
value is reached as a function of beta is given in Table 7.4-2. The asymptotic value of
failure rate is 1/alpha, regardless of beta.


Table 7.4-2: Time at Which Asymptotic Value is Reached


β  Time to Reach Asymptote
2 1.0
4 2.4
6 4.2
8 7.0
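
This renewal behavior is easy to reproduce by simulation. The following Monte Carlo sketch (the parameter values are arbitrary assumptions) tracks the apparent failure rate of a replace-on-failure population; the late bins settle near the asymptote discussed above, strictly 1/MTTF, which Table 7.4-3 shows is within roughly 15% of 1/α for typical shape parameters.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 1000.0, 4.0            # assumed Weibull parameters
n_parts, horizon = 10_000, 6000.0    # population size, observation window

edges = np.arange(0.0, horizon + 50.0, 50.0)   # 50-hour bins
failures = np.zeros(len(edges) - 1)

for _ in range(n_parts):
    t = 0.0
    while True:
        t += alpha * rng.weibull(beta)         # draw next time-to-failure
        if t >= horizon:
            break
        failures[np.searchsorted(edges, t, side="right") - 1] += 1

# Apparent failure rate per part per hour; early bins rise and ring,
# late bins settle near 1/MTTF ~ 1.1e-3 (the text's 1/alpha = 1.0e-3).
rate = failures / (n_parts * np.diff(edges))
print(rate[:3], rate[-3:])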

Additionally, since MTTF is often used instead of characteristic life, their relationship
should be understood. The ratio of alpha/MTTF is a function of beta and is given in
Table 7.4-3.

Table 7.4-3: α/MTTF Ratio as a Function of β

β  α/MTTF
1.0 1.00
2.0 1.15
2.5 1.12
3.0 1.10
4.0 1.06
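
These ratios follow from the Weibull mean, MTTF = αΓ(1 + 1/β), so that α/MTTF = 1/Γ(1 + 1/β). The quick computation below reproduces the tabulated values to within a few percent (the table appears to be rounded or approximate):

from math import gamma

# MTTF = alpha * Gamma(1 + 1/beta)  =>  alpha/MTTF = 1/Gamma(1 + 1/beta)
for beta in (1.0, 2.0, 2.5, 3.0, 4.0):
    print(f"beta = {beta}: alpha/MTTF = {1.0 / gamma(1.0 + 1.0 / beta):.2f}")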

Based on the previous discussion, it is apparent that the time period over which data is
collected is very important. For example, if the data is collected from “time zero” to a
time which is a fraction of alpha, the failure rate will be increasing over that period and
the average failure rate will be much less than the asymptotic value. If, however, the data
is collected during a time period after which the failure rate has reached its asymptote, the
apparent failure rate will be constant and will have the value 1/alpha. The detailed data
section in NPRD presents part populations which provide the user the ability to further
analyze the time logged to an individual part or assembly, and to estimate the
characteristic life. For example, the detailed section presents the population and the total
number of operating hours for each data record. Dividing the part operating hours by the
population yields the average number of operating hours for the system/equipment in
which the part/assembly was operating. An entry for a commercial quality mercury
battery in a ground, fixed (GF) environment indicates that a population of 328 batteries
had experienced a total of 0.8528 million part hours of operation. This indicates that
each battery had experienced an average of 0.0026 million hours of operation in the time
period over which the data was collected. If a shape parameter, beta, of the Weibull
distribution is known for a particular part/assembly, the user can use this data to
extrapolate the average failure rate presented in NPRD to a Weibull characteristic life
(alpha). If the percentage of the population that has failed is relatively low, the
methodology is of limited value.
If a significant percent of the population has failed, the methodology will yield results for
which the user should have a higher degree of confidence. The methodology presented is
useful only in cases where TTF characteristics are needed. In many instances, knowledge
of the part characteristic life is of limited value if the logistics demand is the concern.
This data can, however, be used to estimate characteristic life in support of preventive
maintenance efforts. The assumptions in the use of this methodology are:

1. Data were collected from "time zero" of the part/assembly field usage
2. The Weibull distribution is valid and β is known

Table 7.4-4 contains cumulative percent failure as a function of the Weibull beta shape
parameter and the time/characteristic life ratio (t/α). The percent failure from the NPRD
detailed data section can be converted to a (t/alpha) ratio using the data in Table 7.4-4.
Once this ratio is determined, a characteristic life can be determined by dividing the
average operating hours per part (part hours/population) by the (t/alpha) ratio. It should
be noted here that the percentage failures in the table can be greater than 100, since parts
are replaced upon failure and there can be an unlimited number of replacements for any
given part.

Table 7.4-4: Percent Failure for Weibull Distribution


As an example, consider the NPRD detailed data for “Electrical Motors, Sensor”;
Military Quality Grade; Airborne, Uninhabited (AU) environment; and a Population Size
of 960 units. Assume for this data entry that there were 359 failures in 0.7890 million
part-operating hours. The data may be converted to a characteristic life in the following
manner:

1. Determine the Percent Failure:


$\%\,\text{Failure} = \frac{359}{960} = 37.4\%$

2. Determine a typical Weibull shape parameter (β). For motors, a typical beta
value is 3.0 (Reference 5).
3. Convert the Percent Failure to a t/alpha ratio using Table 7.4-4 (for % fail = 37.4
and β = 3)

$\frac{t}{\alpha} \cong 0.65$ (interpolating between 31 and 42)

4. Calculate average operating hours per part:

$\frac{\text{Part Hours}}{\text{Population Count}} = \frac{0.7890}{960} = 0.00082 \text{ million hours}$

5. Calculate α:

$\alpha = \frac{\left(\text{Part Hours}/\text{Population Count}\right)}{\left(t/\alpha\right)} = \frac{0.00082}{0.65} = 0.00126 \text{ million hours}$

Based on this data, an approximate Weibull characteristic life is 1260 hours. The user of
this methodology is cautioned that this is a very approximate method for determining the
characteristic life of an item when TTF data is not available. It should also be noted that
for small values of time (i.e., t < 0.1α), random failures can predominate, effectively
masking wearout characteristics and rendering the methodology inaccurate.


Additionally, for small operating times relative to α, the results are dependent on the
extreme tail of the distribution, thus significantly decreasing the confidence in the derived
alpha value.
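
The five-step procedure above is mechanical enough to automate. A minimal sketch follows; the function and variable names are ours, and the t/α value must still be looked up in Table 7.4-4 for the assumed shape parameter β.

def characteristic_life_hours(failures, population, part_hours_e6, t_over_alpha):
    """Steps 1 through 5 of the procedure in the text.

    t_over_alpha is read from Table 7.4-4 for the observed percent
    failure and an assumed Weibull shape parameter beta.
    """
    pct_failure = 100.0 * failures / population        # step 1
    avg_hours = part_hours_e6 * 1e6 / population       # step 4
    return pct_failure, avg_hours / t_over_alpha       # step 5

# Worked example from the text: 359 failures among 960 motors over
# 0.7890E6 part hours, with t/alpha = 0.65 (beta = 3, Table 7.4-4):
pct, alpha = characteristic_life_hours(359, 960, 0.7890, 0.65)
print(f"{pct:.1f}% failed, alpha = {alpha:.0f} hours")
# 37.4% failed, alpha ~ 1264 hours (the text rounds to 1260)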

For part types exhibiting wearout characteristics, the failure rate presented represents an
average failure rate over the time period in which the data was collected. It should also
be noted that for complex nonelectronic devices or assemblies, the exponential
distribution is a reasonable assumption. The user of this data should also be aware of
how data on cyclic devices such as circuit breakers is presented in NPRD. Ideally, these
devices should have failure rates presented in terms of failures per operating cycles.
Unfortunately, from the field data collected, the number of actuations is rarely known
and, therefore, the listed failure rates are presented in terms of failures per operating hour
for the equipment in which the part is used.

7.4.3. Document Overview


The RIAC NPRD databook is organized into the following sections:

Section 1: Introduction
Section 2: Part Summaries
Section 3: Part Details
Section 4: Data Sources
Section 5: Part Number/Mil Number Index
Section 6: National Stock Number Index with Federal Stock Class Prefix
Section 7: National Stock Number Index without Federal Stock Class Prefix
Section 8: Part Description Index

Sections 2 through 8 are described in detail in the following sections.

7.4.3.1. "Part Summaries" Overview


The summary section of NPRD contains combined failure rate data, presented in order of
Part Description, Quality Level, Application Environment, and Data Source. The Part
Description itself is presented in a hierarchical classification. The known technical
characteristics, in addition to the classification, are contained in Section 3 of the book,
“Part Details”. All data records were combined by totaling the failures and operating
hours from each unique data source. In some cases, only failure rates were reported to
RIAC. These data points do not include specific operating hours and failures, and have
dashes in the Total Failed and Operating Hours/Miles fields. Table 7.4-5 describes each
field presented in the summary section.


Table 7.4-5: Field Descriptions


Field Name: Field Description

Part Description: Description of the part, including the major family of parts and specific part-type breakdown within the part family. The RIAC does not distinguish parts from assemblies within NPRD. Information is presented on parts/assemblies at the indenture level at which it was available. The description of each item for which data exists is made as clear as possible so that the user can choose a failure rate on the most similar part or assembly. The parts/assemblies for which data is presented can be comprised of several part types, or they can be a constituent part of a larger assembly. In general, however, data on the part type listed first in the data table is representative of the part type listed and not of the higher level of assembly. For example, a listing for "Stator, Motor" represents failure experience on the stator portion of the motor and not the entire motor assembly. Added descriptors to the right, separated by commas, provide further details on the part type listed first. Additional detailed part/assembly characteristics can be found, if available, in the Part Details section of NPRD.

Quality Level: The Quality Level of the part, as indicated by: Commercial (commercial quality parts); Military (parts procured in accordance with MIL specifications); Unknown (data resulting from a device of unknown quality level).

App. Env.: The Application Environment describes the conditions of field operation. See Table 7.4-6 for a detailed list of the application environments and their descriptions. These environments are consistent with MIL-HDBK-217. In some cases, environments more generic than those used in MIL-HDBK-217 are used. For example, "A" indicates the part was used in an Airborne environment, but the precise location and aircraft type were not known. Additionally, some environments are more specific than the current version of MIL-HDBK-217, since the current version has merged many of the environment categories and the NPRD data was originally categorized into the more specific environment. Environments preceded by the term "NO" are indicative of components used in a non-operating product or system in the specified environment.

Data Source: Source of data comprising the NPRD data entry. The source number may be used as a reference to Section 4 of NPRD to review the specific data source description.

Failure Rate, Fails/(E6): The failure rate presented for each unique part type, environment, quality, and source combination. It is the total number of failures divided by the total number of life units. No letter suffix indicates that the failure rate is in failures per million operating hours; an "M" suffix indicates the unit is failures per million miles. For roll-up data entries (i.e., those without sources listed), the failure rate is derived using the data merge algorithm described in this section. A failure rate preceded by a "<" is representative of entries with no failures; the failure rate listed was calculated by using a single failure divided by the given number of operating hours. The resulting number is a "worst case" failure rate, and the real failure rate is less than this value. All failure rates are presented in NPRD in a fixed format of four places after the decimal point. The user is cautioned that the presented data has inherently high variability and that four decimal places do not imply any level of precision or accuracy.

Total Failed: The total number of failures observed in the merged data records.

Op. Hours/Miles (E6): The total number of operating life units (in millions) observed in merged data records. Absence of a suffix indicates operating hours is the life unit; "M" indicates that miles is the life unit.

Detail Page: The page number containing the detail data source description which comprises the summary record.


Table 7.4-6: Application Environments Defined in NPRD


Env Description

A Airborne - The most generalized aircraft operation and testing conditions.

AI Airborne Inhabited - General conditions in inhabited areas without environmental extremes.

AIA Airborne Inhabited Attack - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on high performance aircraft
such as used for ground support.

AIB Airborne Inhabited Bomber - Typical conditions in bomber compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on long mission bomber
aircraft.

AIC Airborne Inhabited Cargo - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on long mission transport
aircraft.

AIF Airborne Inhabited Fighter - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on high performance aircraft
such as fighters and interceptors.

AIT Airborne Inhabited Transport - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on high performance aircraft
such as trainer aircraft.

ARW Airborne Rotary Wing - Equipment installed on helicopters; includes laser designators and fire control systems.

AU Airborne Uninhabited - General conditions of such areas as cargo storage areas, wing and tail installations
where extreme pressure, temperature, and vibration cycling exist.

AUA Airborne Uninhabited Attack - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on high performance aircraft such as used for ground support.

AUB Airborne Uninhabited Bomber - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on long mission bomber aircraft.

AUF Airborne Uninhabited Fighter - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on high performance aircraft such as fighters and interceptors.

AUT Airborne Uninhabited Transport - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on high performance aircraft such as used for trainer aircraft.


Table 7.4-6: Application Environments Defined in NPRD (continued)


Env Description

DOR Dormant - Component or equipment is connected to a system in the normal operational configuration and
experiences non-operational and/or periodic operational stresses and environmental stresses. The system may
be in a dormant state for prolonged periods before being used in a mission.

G Ground - The most generalized ground operation and test conditions.

GB & GBC Ground Benign - Non-mobile, laboratory environment readily accessible to maintenance; includes laboratory
instruments and test equipment, medical electronic equipment, business and scientific computer complexes.
GBC refers to a commercial application of a commercial part.

GF Ground Fixed - Conditions less than ideal such as installation in permanent racks with adequate cooling air and
possible installation in unheated buildings; includes permanent installation of air traffic control, radar and
communications facilities.

GM Ground Mobile - Equipment installed on wheeled or tracked vehicles; includes tactical missile ground support
equipment, mobile communication equipment, tactical fire direction systems.

ML Missile Launch - Severe conditions related to missile launch (air and ground), and space vehicle boost into
orbit, vehicle re-entry and landing by parachute. Conditions may also apply to rocket propulsion powered
flight.

MP Manpack - Portable electronic equipment being manually transported while in operation; includes portable field
communications equipment and laser designations and rangefinders.

N Naval - The most generalized normal fleet operation aboard a surface vessel.

NH Naval Hydrofoil - Equipment installed in a hydrofoil vessel.

NS Naval Sheltered - Sheltered or below deck conditions, protected from weather; include surface ships
communication, computer, and sonar equipment.

NSB Naval Submarine - Equipment installed in submarines; includes navigation and launch control systems.

NU Naval Unsheltered - Nonprotected surface shipborne equipment exposed to weather conditions; includes most
mounted equipment and missile/projectile fire control equipment.

N/R Not Reported - Data source did not report application environment.

SF Spaceflight - Earth orbital. Approaches benign ground conditions. Vehicle neither under powered flight nor in
atmosphere re-entry; includes satellites and shuttles.

Data records are also merged and presented at each level of part description (categorized
from most generic to most specific). The data entries with no source listed represent
these merged records. Merging data becomes a particular problem due to the wide
dispersion in failure rates, and because many data points consist of only survival data in
which no failures occurred, thus making it impossible to derive a failure rate. Several
approaches were considered in defining an optimum data merge routine. These options
are summarized as follows:

1. Summing all failures and dividing by the sum of all hours. The advantages of
this methodology are its simplicity and the fact that all observed operating
hours are accounted for. The primary disadvantage is that it does not weigh
outlier data points less than those clustering about a mean value. This can
cause a single failure rate to dominate the resulting value.

2. Using statistical methods to identify and exclude outliers prior to summing
hours and failures. This methodology would be very advantageous in the
event there are enough failure rate data points to properly apply the statistical
methods. The data being combined in NPRD often consists of a very limited
number of data points, thus negating the validity of this method.

3. Deriving the arithmetic mean of all observed failure rates which are from data
records with failures, and modifying the resulting value in accordance with the
percentage of operating hours associated with the zero failure records.
Advantages of this method are that modifying the mean in accordance with
the percentage of operating hours from survival data will ensure that all
observed part hours are accounted for, regardless of whether they have
experienced failures. Disadvantages are that the arithmetic mean does not
apply less weight to those data points substantially beyond the mean and,
therefore, a single data point could dominate the calculated failure rate.

4. Using a mean failure rate by taking the lower 60% confidence level (Chi-
square) for zero failure data records and combining them with failure rates
from failure records. The disadvantages of this methodology are that the 60%
lower confidence limit can be a pessimistic approximation of the failure rate,
especially in the case where there are few observed part hours of operation;
and an arithmetic mean failure rate of these values (combined with the failure
rates from failure records) could yield a failure rate which is dominated by a
single failure rate, which itself may be based on a zero failure data point. The
use of a geometric mean would alleviate some of this effect. The problem
with the pessimistic nature of using the confidence level, however, would remain.

5. Deriving the geometric mean of all the failure rates associated with records
having failures and multiplying the derived failure rates by the proportion:


[observed hours with failures / total observed hours]. For example, if 70
percent of the total part hours correspond to records with failures, the
geometric mean of failure rates from the data records with failures would be
multiplied by 0.7. This option is appealing, since the geometric mean will
inherently apply less weight to failure rates that are significantly greater than
the others for the same part type. The merged failure rate should be
representative of the population of parts since it takes into consideration all
observed operating hours, regardless of whether or not there were observed
failures.

Option 5 was selected for NPRD, since it is the only one that (1) accounts for all
operating hours and (2) applies less weighting to the outliers. The resulting algorithm
used to merge data within NPRD is:

$\lambda_{merged} = \left( \prod_{i=1}^{n'} \lambda_i \right)^{1/n'} \cdot \left( \frac{\sum_{i=1}^{n'} h_i'}{\sum_{i=1}^{n} h_i} \right)$

where:

$\prod_{i=1}^{n'} \lambda_i$ = the product of failure rates from NPRD Section 2 records with failures*
$\sum_{i=1}^{n'} h_i'$ = the sum of hours from NPRD Section 2 records with failures*
$\sum_{i=1}^{n} h_i$ = the sum of hours from all NPRD Section 2 records
n = the total number of NPRD Section 2 data records
n' = the total number of NPRD Section 2 data records with failures*
h = the number of hours associated with all NPRD Section 2 data records
h' = the number of hours associated with NPRD Section 2 data records with failures*

* Note: Or having a second source failure rate.


In NPRD Section 2, part descriptions with "(Summary)" following the part name
comprise a merge of all data related to the generic part listed. An example of the NPRD
summary section is given in Figure 7.4-2.

Figure 7.4-2: Example of Part Summary Entries

To illustrate how the data was rolled up, consider the entries for linear mechanical
actuators. The failure rate of 41.7293 listed for "Actuator, Mechanical, Linear" is a roll-
up of three individual data entries for which there are sources listed (two for commercial
quality, AUC environment and one for unknown quality in an Airborne environment).
The listing of 5.5413 for "Actuator, Mechanical" is a roll-up of four individual data
entries (two for Mil/AIF, one for Unk/AUT, and one for Unk/GM). Using the algorithm
described previously, the roll-up was calculated as follows:

$\lambda_{summary} = \left[(5.110)(33.6241)\right]^{1/2} \left[ \frac{0.1957 + 0.0595}{0.1957 + 0.0595 + 0.0830 + 0.2655} \right] = 5.5413$
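
A sketch of this merge algorithm in Python, checked against the roll-up just computed (the function name is ours; math.prod requires Python 3.8+):

import math

def merge_failure_rates(rates_with_failures, hours_with_failures, total_hours):
    """NPRD merge (Option 5): geometric mean of the failure rates from
    records with failures, scaled by the fraction of all observed hours
    that those records represent."""
    n = len(rates_with_failures)
    geo_mean = math.prod(rates_with_failures) ** (1.0 / n)
    return geo_mean * sum(hours_with_failures) / total_hours

# Roll-up for "Actuator, Mechanical" as computed above:
rates = [5.110, 33.6241]
hours_fail = [0.1957, 0.0595]
total = 0.1957 + 0.0595 + 0.0830 + 0.2655
print(round(merge_failure_rates(rates, hours_fail, total), 4))
# ~5.541, matching the text's 5.5413 to rounding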

Now consider the entry for "Actuator, Mechanical (Summary)". This listing is a roll-up
of all "Actuator, Mechanical" data (in this case Actuator, Mechanical and Actuator,
Mechanical, Linear) using the algorithm described previously. In other words, the failure
rate of 25.8092 is a summary of failure data from seven individual data sources. For
these "(Summary)" data entries, sources are not listed since they represent a merge of one
or more data sources which are presented below the summary level. Roll-up values are
presented for each specific quality level and application environment for all components
having multiple part type entries at the same indenture level. If there is no summary
record indicated for a particular part type, the listed part description represents the lowest
level of indenture available. For example, the listing for "Actuator, Mechanical,"
although being identical to the generic level for which the summary data is presented,
was the most detailed description available for the particular data entry. More detailed
part level information may be available in NPRD Section 3. Each failure rate record
listed in the NPRD summary section is a merge of all detailed data from Section 3 for a
specific part type, quality, environment and unique data source. Each of these failure rate
records refers to a Section 3 page which contains all detailed records, including part
details, when they were known. Roll-ups are performed at every combination of part
description (down to 4 levels), quality level, and application environment. The data
points being merged in the NPRD summary section include only those records for which
a data source is listed. These individual data points were already combined by summing
part hours and failures (associated with the detailed records) for each unique data source.
Roll-ups performed on only zero-failure data records are accomplished simply by
summing the total operating hours, calculating a failure rate by assuming one failure, and
denoting the resulting worst case failure rate with a "<" (“less than”) sign.

The roll-ups were performed in this manner to give the NPRD user maximum flexibility
in choosing data on the most specific part type possible. For example, if the user needs
data on a part type which is not specified in detail or for conditions for which data does
not exist in this document, the user can choose data on a more generic part type or
summary condition for which there is data.

7.4.3.2. "Part Details" Overview


The detailed part data in NPRD Section 3 can be used to:

1. Determine if there is data on a specific part number, manufacturer or device
with similar physical characteristics to the one of interest.
2. View the detailed data that was used to generate the summarized data section,
so that a qualitative assessment of the data can be made.

The user is cautioned that individual data points from the detailed section may be of
limited value relative to the merged summary data in NPRD Section 2, which combines
records from several sources and typically results in many more part hours. Under no
circumstance should the NPRD detailed data or summary data be used to blindly
cherry-pick the most favorable or “optimistic” failure rate for a particular part or
assembly type.


NPRD Section 3 contains a listing of all field experience records contained in the RIAC
part databases. The detailed data section presents individual data records that are
representative of the specific part types used in a particular application from a single data
source. For example, if 20 relays of the same type were used in a specific military
system, for which there were 300 systems in service, each with 1300 hours of operation
over the time in which the data was collected, the part population is 20x300 = 6000, and
the total part operating hours are 6000x1300 = 7,800,000 hours. If the same part is used
in another system, or if the system is used in different operating environments, or if the
information came from a different source, then separate NPRD data records were
generated. If known, the population size is given for each data record as the last element
in the “Part Characteristics” field. An example of NPRD Section 3 is shown in Figure
7.4-3.

Figure 7.4-3: Example of Part Detail Entries
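
The population and part-hour arithmetic in the relay example above can be expressed directly; the short sketch below (Python) reproduces it, with the observed failure count being purely illustrative.

    parts_per_system = 20     # relays of the same type in each system
    systems_fielded = 300     # systems in service
    hours_per_system = 1300   # operating hours over the collection period

    population = parts_per_system * systems_fielded   # 6,000 parts
    part_hours = population * hours_per_system        # 7,800,000 part hours

    observed_failures = 4  # illustrative value only
    failure_rate = observed_failures / part_hours     # failures per hour
    print(population, part_hours, failure_rate)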

7.4.3.3. Section 4 "Data Sources" Overview


This section of NPRD describes each of the data sources from which data were extracted
for the databook. The title, author(s), publication dates, report numbers, and a brief
abstract are presented. In a number of cases, information regarding the source of the data
had to be kept proprietary. In these cases, "Source Proprietary" is indicated.

7.4.3.4. Section 5 "Part Number/MIL Number" Index


This NPRD section provides an index, ordered by generic part type, of those Section 3
data entries that contain a generic commercial part number or a MIL-Spec number. The
Section 3 page which contains the specific entry for the part or MIL number of interest is
given. Note that not all data entries contain a part or MIL number, since these numbers
either were not applicable or were not known for all entries.

100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC


374
Chapter 7: Examples

7.4.3.5. Section 6 “National Stock Number Index with Federal Stock Class”
This NPRD section provides an index of those Section 3 data entries that contain a
National Stock Number (NSN), including the four-digit Federal Stock Class (FSC) prefix.
This index contains all parts for which the NSN is known.

7.4.3.6. Section 7 "National Stock Number Index without Federal Stock Class Prefix"
This NPRD section provides an index similar to the Section 6 index, with the exception
that the four-digit FSC is omitted.

7.5. References
1. RADC-TR-88-97, “Reliability Prediction Models for Discrete Semiconductor
   Devices”, Final Technical Report, 1988
2. Denson, W.K. and S. Keene, “A New System Reliability Assessment
   Methodology”, Final Report, 1998
3. “Photonic Component and Subsystem Reliability Process Final Report”,
   Subcontract 0044-SC-20100-0203, Prepared for Penn State University Electro-
   Optics Center, September 25, 2008
4. “Nonelectronic Parts Reliability Data (NPRD)”, Reliability Information Analysis
   Center
5. RADC-TR-77-408, “Electric Motor Reliability Model”
6. MIL-HDBK-344A, “Environmental Stress Screening of Electronic Equipment”,
   August 1993

8. The Use of FMEA in Reliability Modeling


Although analytical techniques like FMEA are not the primary focus of this book, they
are important in the development of a reliability model. For example, a comprehensive
product or system reliability model requires that the highest-risk root failure causes be
identified so that they can be addressed in the model, and the FMEA is a popular
technique for this purpose. The intent of this chapter is not to present a detailed
procedural guide to FMEA, as this has been done extensively in the literature. Rather, it
is to present practical FMEA guidelines based on the experience of the author,
specifically toward the goal of developing a reliability model.

8.1. Introduction
In order to “build” reliability into a product or system, it is necessary to anticipate failure
causes, and ensure that they are eliminated or, at least, that their probability of occurring
is made acceptably low. This “anticipation” can be accomplished empirically through
test, or analytically through analysis and modeling. Failure Mode and Effects Analysis
(FMEA) is a structured way of identifying root cause failure modes, and is the backbone
of an effective reliability program, particularly as it relates to reliability growth during the
design and development phase.

A successful product or system depends on the requirements being fully understood, the
design being robust, and the manufacturing process also being robust. A Design FMEA
(DFMEA) assesses the first two of these, and a Process FMEA (PFMEA) assesses the
third. This is illustrated in Figure 8.1-1.

[Figure 8.1-1 contrasts what can go wrong for each ingredient of a successful product: wrong or bad requirements and bad design ("build the wrong product"), addressed by the DFMEA, and bad manufacturing ("build the product wrong"), addressed by the PFMEA.]

Figure 8.1-1: Two Basic Types of FMEA

Generally, the best manner in which to perform the FMEA is to separate the design and
process attributes and perform separate process and design FMEAs. However, in some
cases, these can essentially be combined into a single design FMEA by incorporating the
manufacturing process-related failure modes into the DFMEA “failure cause/mechanism”
column. This is generally appropriate when the item under analysis is not complex from
either a design or a manufacturing perspective. This
book primarily addresses the DFMEA, since a reliability model is generally driven more
by the design than the process. However, process variables are often used as factors in
the reliability model.

An FMEA is the cornerstone of a reliability program and has many uses. The primary
purpose of the FMEA is to acquire an understanding of the reliability characteristics of a
product or system, such that corrective action can be taken to make the item more reliable
(reliability growth). The results of an FMEA are also used to support other reliability
engineering tasks, such as test plan development, evaluation of engineering changes,
assessment of detectability, development of troubleshooting manuals, and development of
reliability models.

The logical, bottom-up analysis technique of the FMEA facilitates the understanding of
the reliability characteristics of a product or system. This understanding is a core
requirement for the attainment of the reliability objectives, and, as such, it will help
reduce the total program cost. While reliability engineering tasks are sometimes
considered to be costly to a program, the reality is that they will save significant amounts
of money, if properly implemented. Costs incurred when reliability problems are
identified in the field will be orders of magnitude higher than the upfront cost of the
reliability engineering tasks that solve them during design and development. Since the
success of a reliability program depends largely on the effectiveness of FMEA,
implementation of the FMEA is a critical element of the cost avoidance of field failures11.

Benefits of performing an FMEA include:

• The assurance that all conceivable root failure causes and their effects have been
considered in the early stages of the product or system design and development
process, and that corrective actions are taken to mitigate the risk associated with
critical failure modes.

• If elements such as accelerating stresses are included in the FMEA analysis, it can
be used to develop reliability growth, demonstration and screening test plans, as
well as environmental qualification test plans. In this case, the importance of
each potential accelerating stress can be quantified and prioritized in accordance
with the severity, criticality or failure rate of the individual failure modes
accelerated by the specific stress. For example, if temperature is determined to
accelerate the majority of critical failure modes, then it should be used as a stress
in reliability and qualification testing.

• If and when reliability problems occur after a product or system is delivered to the
customer, the FMEA can be used as a basis for determining the root cause of
failure. Based on failure symptoms, the possible causes can be identified based
on the FMEA analysis that was performed.

• It can be used as a basis for the reliability model, in which the reliability of each
high risk failure cause is quantified.

11. It should be noted that an FMEA is only technically effective if it has an impact on the design of the product or system. An FMEA
that does an excellent job of identifying root failure causes, but is performed “after-the-fact” so as to have no impact on the actual
design, is a waste of reliability program resources. An FMEA is only cost effective if it impacts the design of the product or system
before the design is finalized and “bending metal” has started. A poorly timed FMEA that results in extensive and costly redesign
efforts to eliminate or mitigate root failure causes is also counterproductive.

Another benefit of the FMEA is that it can be used as a basis for evaluating the risk
associated with engineering changes. If a design change is proposed, the FMEA can be
consulted to determine if the change will result in new failure modes or an increase in the
probability of failure of identified modes. Based on this information, the change can be
accepted, or additional reliability characterization can be performed to further assess the
reliability impact of the proposed change.

Detectability can also be assessed by the FMEA. This is particularly useful in instances
where failures that are undetectable are of special importance to the project. An example
of this is when alarms are used as a means to detect failures. Some failure modes may
not result in an alarm, and, therefore, the criticality associated with the failure mode can
be high. In this case, the FMEA can be used to assess these failure modes.

Troubleshooting manuals are essentially an FMEA that is presented in reverse order. The
FMEA is generally presented in the order of functional elements, or components. If the
FMEA is sorted by the effect of failure (or symptom), it essentially becomes a
troubleshooting aid, since the analyst can review the specific failure modes that will
result in the observed symptom. Additionally, if the probability of failure is included in
the FMEA analysis, the possible failure modes or causes can be ranked in accordance
with their probability. This can aid in the troubleshooting process.

Typical problems with the implementation of an FMEA include:

• Confounding of failure modes, effects and causes
• The tiering of failure cause, mode and effect as a function of the level of
  assembly, which makes it difficult to keep the cause-mode-effect relationships
  straight
• In the determination of occurrence, severity and detectability, there are several
  dimensions of each that need to be accounted for; definitions should therefore be
  tailored for each product or system in accordance with these dimensions
• Lack of follow-up action tracking
• The tendency to normalize effects and detectability to “in process” rather than
  “in field”, which is the purpose of the FMEA

The methodology outlined in this document is intended to provide guidance to overcome
these limitations.

8.2. Definitions
FMEA refers to a generic analysis methodology and, while there are industry standards
that define the specifics of the analysis, there are many different ways in which the
analysis can be accomplished. The following list of terms and definitions summarizes
the data elements typically included as columns in the FMEA worksheet template,
presented in the order in which they usually appear.

• Item: The name of the "Item" being analyzed.
• Item Function: The function of the "Item" under analysis. If the "Item" has more
  than one function, with different potential modes of failure, list all functions
  separately.
• Potential Failure Mode: The manner in which an item can fail, relative to the
  particular "Item" and "Item Function". These should include all failure modes
  that could occur, but may not necessarily occur.
• Failure Effect on Item: The local effect that the failure mode will have on the
  item under analysis. This effect should reflect how an item can fail to meet its
  functional requirements.
• Failure Effect: Defined as the effect(s) of the "Failure Mode" on the end-item
  (module) function, as perceived by the customer. This should be described in
  terms of what the customer/end-user might notice or experience.
• Severity (S): The rank associated with the most serious effect for a given failure
  mode. A numerical value of one (1) to ten (10), proportional to this severity, is
  assigned. High numbers are applicable to effects for which the consequences are
  severe. For example, if the effect of a particular failure mode is that a critical
  module will fail catastrophically (i.e., no output), then the assigned severity value
  will be close to ten. Guidelines for assigning this value are provided in a
  subsequent table in this book.
• Potential Cause/Mechanism: Defined as an indication of a design weakness, the
  consequence of which is the failure mode. Every potential cause and/or failure
  mechanism should be listed for each failure mode. Causes can be any underlying
  reason that the failure mode occurs, including manufacturing process anomalies,
  human error, defect types, or product attributes that can contribute to a failure
  mode. Failure mechanisms are generally the physical processes which result in
  the failure mode. Examples are corrosion, crack propagation, electromigration,
  spalling, etc.
• Accelerating Stress(es): The stresses that will accelerate the cause/mechanism of
  failure. These can be operational or environmental stresses. This information is
  useful in developing reliability growth and characterization test plans.
• Occurrence (O): The likelihood that the specific failure cause/mechanism will
  occur during the design life. A numerical value of one (1) to ten (10),
  proportional to this likelihood, is assigned. High numbers are applicable to
  causes/mechanisms that are likely to occur. For example, if the specific
  cause/mechanism has been observed to exhibit a relatively high failure rate, then
  the assigned value will be close to ten.
• Current Design Control Preventions: Indicate what has been done to prevent the
  cause/mechanism of failure, or the failure mode, from occurring, or to reduce its
  rate of occurrence.
• Current Design Control Detections: Indicate what has been done to detect the
  cause/mechanism of failure. This can be done via test or analysis.
• Detectability (D): The rank associated with the best detection control listed in
  the design control columns, for the specific failure cause/mechanism under
  analysis. A numerical value of one (1) to ten (10), inversely proportional to the
  level of detectability, is assigned. High numbers are applicable to
  causes/mechanisms that are virtually undetectable. For example, if it is known
  that a specific failure cause/mechanism will be difficult to detect if it occurs, then
  the assigned value will be close to ten.
• RPN: The Risk Priority Number, which is defined as the product of O, S and D.
• Recommendations: The recommendations of the FMEA team members or
  stakeholders regarding corrective actions that should be taken to address the
  specific failure cause.
• Responsibility: The assigned individual or team that will lead the implementation
  of the corrective action recommendation(s).
• Target Date: The date by which the corrective action recommendations are to be
  implemented. Note that this date can also reflect the date by which the corrective
  action recommendations, once implemented, have been verified as being
  effective.

8.3. FMEA Logistics


This section provides specific guidance on FMEA implementation.

8.3.1. When Initiated


The FMEA should be initiated when there is a preliminary design. In a multi-stage
development process, this generally occurs in one of the first few stages. Most reliability
experts will recommend initiating the FMEA as early as possible in the development of a
product. However, for products that are very early in the research phase, and whose
design is changing rapidly based on research learnings, there is a risk that the FMEA
analysis will be premature. When the design fundamentals start becoming finalized, the
FMEA results are used, along with other learnings from the development process, to
improve the design.

8.3.2. FMEA Team


To be effective, the FMEA team, led by an FMEA facilitator, must be cross-functional,
and must have participation by all relevant departments, functions and/or disciplines (i.e.,
all relevant stakeholders). Specifically, the following functions should be represented:

• Design, especially the Design Lead


• Reliability
• Quality
• Project Management
• Application Engineers
• Manufacturing (Participation by manufacturing is especially critical when
  performing PFMEAs, since it is the manufacturing processes that are being
  analyzed)

Additional functions may also be required, depending on the specific organization and
nature of the product. These additional functions can include component engineering,
procurement, measurements, and marketing. There are also instances where an FMEA
might include the direct involvement of the customer, particularly for critical or highly
complex products or systems.

Not all of the above-listed functions are required for every part of the FMEA. For
instance, the initial parts of the FMEA can be performed efficiently by only the
Reliability engineering and the Applications engineering (or the Project Manager)
functions. After this, engagement by the entire team is critical, especially to gain “buy-
in” on corrective actions, which can be the responsibility of any of the disciplines.

The ideal team size is 5 to 8 people. Any larger, and the efficiency of the analysis is
compromised. It is also more efficient to break up the analysis into distinct functional
elements of the design (e.g., mechanical, electrical, optical, software/firmware),
although it is also imperative to account for failure causes that are due to interactions of
these functional elements. The FMEA facilitator needs to ensure that these interactions
are accounted for, since the individuals cognizant of their functional element will often
overlook these interactions.

8.3.3. FMEA Facilitation


As with any cross-functional team, it is important to have a “lead” who facilitates the
FMEA. Specific responsibilities of this facilitator are:

• Document the results of the analysis (it is also beneficial to have a separate “scribe”
  who documents the results, allowing the facilitator to concentrate on the
  additional items listed below)
• Keep the group focused on the task at hand
• Ensure that all components or processes are accounted for
• Prompt the group for participation, as required
• Spark the discussion by suggesting failure modes
• Ensure that the analysis is kept moving
• Ensure that the inputs of all participants are heard and captured. This includes
making sure that certain people are not allowed to dominate the analysis, and that
the ideas of quiet people are brought out.
• Manage conflicts – Professional people take a great deal of pride in their work.
Since the FMEA goal is to find fault with the product or system, FMEA sessions
can sometimes get contentious. The facilitator must manage this by keeping the
session constructive and not allowing emotions to dictate the course of the analysis.

The facilitator is often from the reliability group, but does not have to be. It is more
important that the facilitator be skilled in the responsibilities listed above.

8.3.4. Implementation
Some suggestions for implementing an effective FMEA are listed here:

• Determine the appropriate FMEA methodology that will be used. Factors to
  consider are:
a. Standards used in the specific industry that the product is intended for
b. Customer needs and expectations
c. Previous experience pertaining to the effectiveness of specific FMEA
methodologies
• Use a facilitator who is experienced with FMEA techniques. This individual does
not necessarily need to be a member of the technical team.
• Have a small team, or an individual, prepare the background information before
the larger FMEA team meets. The information should include system
documentation such as schematics, drawings, Bills of Materials (BOMs), theory
of operation, and “potential failure mode” lists.
• Re-use as much information from previous FMEAs as possible, as this will save
time
• Segment the FMEA team sessions into logical groupings, if appropriate. As an
example, electronic design and mechanical design can sometimes be separated.
However, if they are separated, areas of interaction must be adequately covered.
Thermal properties are a typical example of an area of potential interaction.

8.4. How to Perform an FMEA


The concept of FMEA is very straightforward. First, each component or element of the
product or system is studied to see how it could fail. These are called failure modes: the
observable effects of failure mechanisms. For example, a resistor can have “open” or
“short” failure modes. These failure modes are directly observable and external to the
part. Possible causes of each failure mode are then determined. Examples may include
metal migration, corrosion, etc.

Next, the analysis determines what happens if the failure mode were to occur. These are
the effects, and they are determined at the various levels of the assembly architecture
(progressing from low to high), such as the surrounding components, the sub-assembly,
the assembly and the entire product or system. The specific levels used in the analysis
are very item-specific: the more complex the product or system, the more levels may be
required for the analysis. If the product or system is relatively simple, only one level
may be required.

Reliability Information Analysis Center


385
Chapter 8: The Use of FMEA in Reliability Modeling

The next step is the determination of the possible corrective actions. This will be
discussed later in this chapter.

There are many ways in which an FMEA can be performed. This section outlines one
approach that has been successful based on the experience of the author. The process
flow is illustrated in Figure 8.4-1.

Figure 8.4-1: FMEA Process Flow

In this approach, the following steps are followed:

• Make a hierarchical listing of the product (Section 8.5)


• List functional requirements of each item in the hierarchy (Section 8.6)
• Use “IPOUND” analysis to generate a set of potential failure modes of the
system. These are the effects. (Section 8.7)

• For each effect, identify the severity (Section 8.8)


• Use “IPOUND” analysis to identify potential failure modes of parts
• Identify the possible effect(s) that could result from occurrence of each failure
mode (Section 8.9)
• Identify potential causes of each failure mode (Section 8.10)
• For each cause, identify (Section 8.11):
o Accelerating stress(es) or applicable tests
o Occurrence
o Preventions
o Detections
o Detectability
• Calculate the RPN (Section 8.12)
• Determine the appropriate corrective actions to be taken (Section 8.13)
• Update the RPN (Section 8.14)

Each of these steps is summarized below, along with guidance and tips on performing the
step.

8.5. Identify System Hierarchy


A hierarchical description of the product or system is first generated. The highest level is
the system level, and levels below the system level are equipments, major assemblies,
then subassemblies, etc. This breakdown continues until the lowest level that will be
analyzed is reached.

The complexity of the system will dictate the number of hierarchical levels. Items
comprised of a single component will have only a single level. More complex products
or systems can have from four to eight levels.

It may be necessary to treat the “system” as the customer’s system, since the effects of
failure will be manifested at that level. Whether this is necessary also depends on
whether the FMEA is being driven by customer requests.

The lowest level to be analyzed also needs to be determined. A general guideline is that
the appropriate “lowest” level should be one level lower than that for
which design control exists. For example, if an electrical circuit is being designed, and a
constituent component is a commercial off-the-shelf (COTS) capacitor, the FMEA should
go down to the capacitor level. In this case, it is not necessary to determine the specific
failure causes of the capacitor, but it is necessary to understand the failure modes of the
capacitor so that design actions can be taken with respect to the circuit design to mitigate
these potential modes. Ideally, the capacitor manufacturer will have performed a FMEA
on their product, in which case specific failure causes have already been identified and,
hopefully, mitigated.

8.6. Function Analysis


The next step in the FMEA is to list, for each item, its functional requirements. This
function analysis is performed on each item at each level in the hierarchy. It can include
both functions and the attributes of those functions. A function is the purpose of the item,
whereas an attribute is a characteristic of the function. Attributes are generally what is
detailed in a product specification.

8.7. IPOUND Analysis


An IPOUND analysis is a means to identify all possible ways in which a function or
attribute can fail. The failure modes of the system will become the failure effects, and the
failure modes of the parts will become the failure modes that will be further analyzed by
identifying their causes.

The IPOUND categories are:

I: Intermittent
P: Partial
O: Over
U: Unintended
N: Negative
D: Degraded

These are defined as follows:

• I (Intermittent): The function is performed sometimes. This is common for
  electrical connections, where continuity is intermittent.
• P (Partial): Too little of the function or attribute is initially achieved. This
does not refer to the situation in which a function is initially
fine, but degrades over time. That situation is described by the
Degraded category.
• O (Over): Too much of the function or attribute is achieved. This is not
  applicable to attributes where “more is better”.
• U (Unintended): This refers to failure modes that are not directly attributable to
the function under analysis, but rather a different attribute is
affected.
• N (Negative): None, complete loss of the function


• D (Degraded): Degraded function, when a function or attribute is initially fine,
but degrades over time

The loss function used in the Taguchi methodology can be used to determine some of the
applicable failure modes. Here, functions or attributes are categorized as “larger the
better”, “nominal the best” and “smaller the better”. For “larger the better”
function/attributes, a failure can occur when there is too little of the function/attribute, but
cannot fail when there is too much of it. This is illustrated in Table 8.7-1. It relates only
to the “over” function and the “partial” function IPOUND categories. The other
IPOUND categories are used when appropriate, and will generally be independent of the
Taguchi categories.

Table 8.7-1: Failure Mode Relationship to Taguchi Loss Function

Function/Attribute Type   Too Much (Over Function)   Too Little (Partial Function)
Larger the better                                    X
Nominal the best          X                          X
Smaller the better        X

The IPOUND categories are intended to represent a complete and mutually exclusive set
of the ways in which a function or attribute can fail. When identifying failure modes in
this manner, it is helpful to set up a matrix of functions/attributes and the IPOUND
categories, and proceed to fill it in. When filling in this matrix, it is not necessary to
identify failure modes for all categories of IPOUND. Likewise, for any single category,
multiple failure modes are possible. The IPOUND methodology is simply a way to get
the team to think about all possible ways in which a function/attribute can fail.
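
A minimal sketch of such a matrix follows (Python); the functions and failure-mode entries shown are hypothetical, and an empty cell simply means that no failure mode was identified for that category.

    IPOUND = ["Intermittent", "Partial", "Over", "Unintended", "Negative", "Degraded"]

    # Hypothetical functions/attributes of the item under analysis
    functions = ["Regulate output voltage", "Provide mechanical seal"]

    # One cell per (function, IPOUND category); a cell may hold zero,
    # one or several failure modes
    matrix = {(f, c): [] for f in functions for c in IPOUND}

    matrix[("Regulate output voltage", "Partial")].append("Output below specification")
    matrix[("Regulate output voltage", "Intermittent")].append("Output drops out sporadically")
    matrix[("Provide mechanical seal", "Degraded")].append("Seal leaks after ageing")

    for (func, cat), modes in matrix.items():
        if modes:
            print(func, "|", cat, "|", "; ".join(modes))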

The flow diagram in Figure 8.7-1 depicts a simple system consisting of a two-level
hierarchy. In practice, there can be any number of levels in the system hierarchy. The
failure cause-mode-effect relationship shifts in the FMEA as a function of the system
level, as illustrated in Figure 8.7-1. For example, at the most basic level, the part
manufacturing process, the cause of failure may be a process step that is out of control.
The ultimate effect of that cause becomes the failure mode at the part level. The failure
effect of the part becomes the failure mode at the next level of assembly, and so forth. It
is very important that the failure cause, mode and effect are not confounded in the analysis.

[Figure 8.7-1 traces the cause-mode-effect chain across the hierarchy, from the part manufacturing process through the part and assembly levels to the system level: the effect at each level becomes the failure mode at the next higher level.]

Figure 8.7-1: Failure Cause-Mode Effect Relationship

The failure modes of the system functions/attributes are the effects of failure modes at the
subordinate hierarchical level. This tiering continues as the system is broken down to the
lowest level at which the analysis will take place.

For simple, single-level products (for example, a component made from a monolithic
material) there is only a single level and, therefore, this is not an issue. Also, for
relatively simple products with two levels, a “local effects” column can be added to
capture the effects of the failure mode on the subassembly function. In this case, the
effects are relative to the functional requirements of the subassembly.

When identifying failure modes, the assumption is made that the failure could occur but
may not necessarily occur.

8.8. Identify the Severity


The system failure modes identified by the IPOUND analysis at the system level become
the effects used in the FMEA. For each of these effects, a severity rating is required.
Table 8.8-1 summarizes the factors that should be accounted for in establishing the
severity value of each effect, and provides a summary of the range of magnitudes for
each dimension, from least to most severe.

Table 8.8-1: Dimensions of Functional Severity

Dimension of Severity               Magnitude                    Example
Degree to Which Function is Lost    No Degradation
                                    Slight Degradation
                                    Severe Degradation
                                    Intermittent
Importance of Function/Attribute    Not Critical
                                    Critical
When Occurs                         In Research & Development    Design Not Capable
                                    (R&D)                        Process Not Capable
                                    In Process                   Screen Fallout
                                                                 Customer Inspections
                                    In Deployment                Infant Mortality
                                                                 Random
                                                                 Wearout

In the identification of severity, effects of failure modes that are potentially safety-related
are usually considered to be the most severe. For these, a severity value of 9 or 10 is
used, regardless of the above listed factors.

The “when occurs” dimension of severity pertains to the life cycle phase in which the
failure mode and its effect occurs. If the failure mode occurs in the R&D phase, it is
either because the design or process is not capable, or because intrinsic or extrinsic
failure causes occur. In either case, this is the best phase to identify these, since they can
be corrected in the most cost-effective manner possible.

If the failure mode occurs “in process”, the effect is essentially a yield reduction.
Failures occurring during inspections or quality checks by the customer are similar, but
they occur at the customer’s site and are, therefore, more severe than when the defects are
caught in-house.

Failures occurring in deployment represent the most severe type of failure effect (with the
possible exception of safety-related failures). These failures can be represented by the
three types of failures in the bathtub curve: infant mortality, random and wearout.

Usually, the severity of an effect is treated as one factor in the FMEA. However,
separating the severity into three factors and subsequent columns can be beneficial. For
example, if the “when occurs” dimension of severity is separated, the failure modes that
can be caught “in process” are identified, and this, in turn, can be used to establish
in-process checks and screening protocols.

If they are separated, any convenient numeric scale can be used, including 1-to-10, 1-to-
3, or others. If 1-to-10 is used, the RPN of a failure cause will range from 1 to 1,000.

If these dimensions are not separated (which will usually be the case), each of the three
should be represented in the criteria used to define the severity levels. One way in which
this can be accomplished is to use the guidelines in Table 8.8-2, in which each dimension
is assumed to have a value between 1 and 3, directly proportional to its severity. The
total severity is then the sum of each of the three values.

Table 8.8-2: Dimensions of Severity

Dimension of Severity               Magnitude                Value
Degree to Which Function is Lost    No Degradation           1
                                    Slight Degradation
                                    Severe Degradation
                                    Intermittent             3
Importance of Function/Attribute    Not Critical             1
                                    Critical                 3
When Occurs                         In R&D                   1
                                    In Process
                                    Customer Inspections
                                    In Deployment            3
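
Under this convention the total severity is simply the sum of the three dimension values, giving a range of 3 to 9; a minimal sketch (Python, with hypothetical ratings) follows.

    # Each dimension rated from 1 (least severe) to 3 (most severe), per Table 8.8-2
    degree_function_lost = 3   # e.g., intermittent loss of the function
    importance = 3             # critical function/attribute
    when_occurs = 2            # e.g., caught at a customer inspection

    severity = degree_function_lost + importance + when_occurs
    print("Severity =", severity)  # ranges from 3 to 9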

8.9. Identify the Possible Effect(s) that Result from Occurrence of Each Failure Mode
At this point in the analysis, the part failure modes have been identified, and the effects
(and their severity) have also been identified. This task is to identify which effects will
result if each failure mode occurs; any number of effects can result from the occurrence
of a single mode.

8.10. Identify Potential Causes of Each Failure Mode


Up to this point in the analysis, the FMEA has been a relatively straightforward
systematic approach to identify failure modes and their effects. For this reason, these
previous tasks can be accomplished by a small group of people, and the entire team is not
required. It is only required that someone knowledgeable in the system and part
functional and attribute requirements be involved.

The task of identifying failure causes is a much more unstructured, brainstorming-like
activity. For this, it is important to get the entire team involved. The intent of this task is
to identify all possible causes that could result in the failure mode. Causes are often more
complex than the identification of a single failure mechanism and, therefore, describing
them in a few sentences in the FMEA table can be problematic. A failure cause will
often be the result of sub-causes, and can be broken down further and further until the
physical failure phenomenon is identified. For this reason, an alternative is to perform a
fault tree analysis (FTA) on each failure mode. This allows for the breaking down of
failure causes into any level of detail.

There should be one severity rating for each failure effect, since severity has a direct 1:1
relationship to the effect. The maximum of the severities associated with a failure mode
is the severity used in the RPN calculation, since the RPN is applicable to the cause. A
failure mode can result in several effects, but can also be initiated by several causes.
Therefore, a single cause can result in several effects, the worst of which should be used
in the RPN calculation. The relationship between failure cause, mode and effect is
illustrated in Figure 8.10-1.

Figure 8.10-1: Failure Cause, Mode and Effect Hierarchy

Here are some examples of design-related failure causes:

• “Failures resulting from operational stress” refers to failures resulting from the
inability of a product or system to tolerate the applied stresses to which the
component, item or material within the item is exposed.

• “Failures resulting from environmental stress” refers to failures resulting from
  the inability of a product or system to tolerate the applied environmental
  stresses to which the item is exposed.

• “Tolerance stack-up” refers to the initial tolerances at time zero, and the failure
  of a product or system to tolerate the cumulative effect of those tolerances.

• “Wear and component or material ageing” refers to the inability of a product
  or system to tolerate the changes of its constituent components or materials
  due to wear and ageing.

• The “combination of component ageing and tolerance stack-up” is the
  cumulative effect of wear, ageing, and tolerance stack-up. As components
  and materials within a product or system age, the susceptibility of the item to
  the cumulative effects of component/material tolerance will increase.

Failures can also be a result of short-term exposure to extreme stresses. While the
product or system is not designed to tolerate these stresses under steady-state conditions,
it should be able to tolerate short-term extreme stress exposure. There is a limit to the
stress level(s) that the product or system should be able to tolerate. However, design
actions can be taken to minimize the probability of failure due to these stresses.

The information presented here is generic in nature and applies equally to mechanical and
electronics failures. The specific failure mechanisms will vary, but the concepts are the
same.

Failure causes are often the result of a combination of conditions and events. Therefore,
when identifying causes, the analyst needs to consider these combinations. The factors
whose combinations can cause failure generically include:

• Design not capable


• Process not capable
• Screen fallout/out-of-the-box failure


• Infant mortality
• Random failure
• Wearout
• Design
• Manufacturing
• Environmental exposure
• Stress exposure

When hypothesizing failure causes, it is useful to think about them in terms of their initial
conditions, stresses, and failure mechanisms, as illustrated in Figure 8.10-2.

Figure 8.10-2: Failure Causes

A list of typical initial conditions, stresses, and failure mechanisms is provided below.

• Initial conditions:
  o Defect free (the item is made “as designed”)
  o Defects:
    - Intrinsic:
      Voids
      Material property variation
      Geometry variation
      Contamination
      Ionic contamination
      Crystal defects
      Stress concentrations
    - Extrinsic:
      Organic contamination
      Nonconductive particles
      Conductive particles
      Contamination
      Ionic contamination
• Stresses:
  o Operation - steady state
  o Operation - cycling
  o Chemical exposure
  o Salt fog
  o Mechanical shock
  o UV exposure
  o Drop
  o Vibration
  o Temperature - high
  o Temperature - low
  o Temperature cycling
  o Damp heat
  o Pressure - low
  o Pressure - high
  o Radiation - EMI
  o Radiation - cosmic
  o Sand and dust
• Failure mechanisms (physical processes):
  o Electromigration
  o Dielectric breakdown
  o Corrosion
  o Dendritic growth
  o Tin whiskers
  o Metal fatigue
  o Stress corrosion cracking
  o Melting
  o Creep
  o Warping
  o Brinelling
  o Fracture
  o Fretting fatigue
  o Galvanic corrosion
  o Pitting corrosion
  o Chemical attack
  o Fretting corrosion
  o Spalling
  o Crazing
  o Abrasive wear
  o Adhesive wear
  o Surface fatigue
  o Erosive wear
  o Cavitation pitting
  o Elastic deformation
  o Material migration
  o Oxidation
  o Cracking
  o Plastic deformation
  o Brittle fracture
  o Expansion
  o Contraction
  o Emod (elastic modulus) change
  o Outgassing
  o Index of refraction changes
  o Photodarkening
  o Condensation
  o Crystallization

Each failure cause can be characterized with a specific combination of initial condition,
stress and degradation process. For example, a cause could be represented as:

Defect1-temperature–corrosion
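
A failure cause characterized in this way is naturally represented as a triple of initial condition, stress and mechanism; the sketch below (Python, with hypothetical entries drawn from the lists above) shows one such representation.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FailureCause:
        initial_condition: str  # e.g., a specific defect type
        stress: str             # applied or accelerating stress
        mechanism: str          # physical degradation process

    # The cause from the text: Defect1-temperature-corrosion
    cause = FailureCause("Defect1 (hypothetical defect)", "Temperature - high", "Corrosion")
    print(cause)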

After identifying all of the FMEA elements in accordance with the guidelines presented
herein, it is very useful to check the completeness of the analysis by hypothesizing what
would happen if the product or system:

• Is exposed to various environmental stresses


• Is exposed to high operating stresses (i.e., voltage, current, optical power, flow
rates, etc.)
• Has manufacturing defects. Manufacturing defects are typically analyzed in a
PFMEA. However, they can be included in the design FMEA, with the defect
type being listed in the “Cause” column. In fact, if a PFMEA is not planned for a

product or system, then manufacturing process-related failure causes should be
included in the “Cause” column.

In these cases, you are filling in the FMEA backwards by essentially hypothesizing the
cause “a priori” and then determining what the resulting failure mode would be. For
example, the cause identified in this manner will result in a failure mode, which in turn
will have an effect at the next higher level in the system.

8.11. Identify Factors for Each Failure Cause


8.11.1. Accelerating Stress(es) or Potential Tests
Accelerating stresses or tests, which can be used to define reliability test plans, may
include:

1. Operation - steady state


2. Operation - cycling
3. Chemical exposure
4. Salt fog
5. Mechanical shock
6. UV exposure
7. Drop
8. Vibration
9. Temperature - high
10. Temperature - low
11. Temperature cycling
12. Damp heat
13. Pressure - low
14. Pressure - high
15. Radiation - EMI
16. Radiation - cosmic
17. Sand and dust

8.11.2. Occurrence

8.11.2.1. Occurrence Rankings


The occurrence rating should be a function of two factors: (1) an estimate of the
likelihood of occurrence based on the analyst's experience, and (2) the degree to which
the failure cause/mechanism has been observed. For example, Figure 8.11-1 presents the
occurrence ratings as defined by Reference 1.

Figure 8.11-1: Occurrence Definitions

Ideally, a reliability model would be available from which to determine the occurrence,
but this is usually impractical because the FMEA is generally performed
before the reliability modeling activities commence.

The occurrence should be based on engineering judgment and on empirical data. The
resulting occurrence value is based on both, as illustrated in Figure 8.11-2. For example,
if empirical information exists on a specific cause, it should be used as part of the
assessment of the Occurrence level. In this case, heavier weighting should be given to
field data over manufacturing and test data. If no empirical data exists, engineering
judgment should be used, and should be based on the collective experience of the FMEA
team.

The occurrence should be based on the likelihood that the cause will occur and that the
resulting mode will occur. Some FMEA methodologies, like the cancelled MIL-STD-
1629, include a separate factor that accounts for the probability that the effect will occur
if the mode is to occur (the same concept can be used for the cause-mode relationship).

[Figure 8.11-2 plots the occurrence rating from 1 (low) to 10 (high) against two inputs: the failure rate estimate based on experience (vertical axis) and how often the failure cause/mechanism has been observed in the past, from "not at all" to "frequently" (horizontal axis; heavier weighting should be given to field data over manufacturing and test data).]

Figure 8.11-2: Occurrence Guidelines

The frequencies of occurrence should be rated relative to the required reliability for a
specific failure cause. For example, if a reliability allocation is performed to allocate the
product or system failure rate (or unreliability) to its constituent components, then the
occurrence value should be relative to this allocated value.
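
One way to make the rating relative to an allocation is sketched below (Python); the breakpoints mapping the ratio of estimated to allocated failure rate onto the 1-to-10 scale are illustrative assumptions only, not values prescribed by this book.

    def occurrence_rating(estimated_rate, allocated_rate):
        """Map the ratio of estimated to allocated failure rate onto a
        1-to-10 occurrence scale. The breakpoints are illustrative only."""
        ratio = estimated_rate / allocated_rate
        # (upper bound on ratio, occurrence value), lowest first
        breakpoints = [(0.01, 1), (0.1, 2), (0.3, 4), (1.0, 6), (3.0, 8)]
        for bound, rating in breakpoints:
            if ratio <= bound:
                return rating
        return 10  # far above the allocated failure rate

    print(occurrence_rating(estimated_rate=2e-6, allocated_rate=1e-5))  # -> 4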

Common cause vs special cause


Categories of failure effects are shown in Table 8.11-1. These illustrate the differences
between common cause and special cause failure effects.

Table 8.11-1: Categories of Failure Effects

                            Design Not   Process Not   Screen Fallout/   Infant      Random    Wearout
                            Capable      Capable       Out-of-the-Box    Mortality   Failure
                                                       Failure
Always (Common Cause)       X            X             X                                        X
Sometimes (Special Cause)   X            X             X                 X           X          X

8.11.3. Preventions
Preventions are the actions taken to prevent the cause/mechanism of failure or the failure
mode from occurring, or to reduce their rate of occurrence. These will generally be
design-related actions. Examples include: “Ensured proper derating for all components”
or “Use of a conformal coating”.

8.11.4. Detections
Detections are actions taken to detect the cause/mechanism of failure. This can be via
either test or analysis.

8.11.5. Detectability
Detectability is a value between 1 and 10 that is inversely proportional to the degree to
which the failure cause can be detected, i.e., the less likely the detection of the failure
cause, the higher the detectability value. The traditional detectability definitions are
listed in Figure 8.11-3.

Figure 8.11-3: Detectability Definitions

There are four aspects of detection that should be captured in the FMEA:

Current design control detections:


Indicate what has been done to detect the cause/mechanism of failure or the failure mode,
either by analytical or physical methods, before the item is released into production.
These are generally the application of tests or analytical techniques whose goals are to

ascertain the probability of occurrence of the failure cause/mechanism. Therefore, this
aspect of detection relates to detecting the probability of occurrence.

Probability of detection if the failure cause/mechanism occurs


This is the probability that the failure cause/mechanism will be detected if it occurs.
Some failure modes are inherently undetectable when they occur. An example of this is
cracks that are initiated within a structure.

Screening
This aspect of detectability addresses the question: “What will be done in the
manufacturing process to detect and eliminate the items prone to the failure
cause/mechanism?” Reliability screening is a common technique for accomplishing this.
If screening is to be employed, the screening effectiveness must be determined. This
screening effectiveness is directly related to the “Probability of detection if the failure
cause/mechanism occurs”.

Degree of Warning
The fourth aspect of detectability relates to how detectable a failure cause/mechanism is
before it results in the worst case effect identified.

The life cycle phases to which each of these four dimensions is applicable are illustrated
in Figure 8.11-4.

Figure 8.11-4: Life Cycle vs Detectability Dimension


The combinations of each of the four dimensions (H = High, L = Low, x = Doesn’t
Matter) and the recommended detectability are summarized in Table 8.11-2.

Table 8.11-2: Recommended Detectability Rating Criteria

Current Design        Probability of Detection if the    Screening   Degree of   Detectability
Control Detections    Failure Cause/Mechanism Occurs                 Warning
x                     L                                  x           L           10
L                     H                                  L           H           8
x                     L                                  x           H           7
H                     H                                  L           L           5
L                     H                                  H           L           5
H                     H                                  L           H           2
L                     H                                  H           H           2
H                     H                                  H           H           1
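
Table 8.11-2 can be applied mechanically as a rule lookup, treating "x" as a wildcard; a sketch follows (Python), with the rule order mirroring the table.

    # Rules from Table 8.11-2: (design control detections, probability of
    # detection if the cause occurs, screening, degree of warning) -> rating
    # 'H' = High, 'L' = Low, 'x' = doesn't matter
    RULES = [
        (("x", "L", "x", "L"), 10),
        (("L", "H", "L", "H"), 8),
        (("x", "L", "x", "H"), 7),
        (("H", "H", "L", "L"), 5),
        (("L", "H", "H", "L"), 5),
        (("H", "H", "L", "H"), 2),
        (("L", "H", "H", "H"), 2),
        (("H", "H", "H", "H"), 1),
    ]

    def detectability(controls, prob_detect, screening, warning):
        observed = (controls, prob_detect, screening, warning)
        for pattern, rating in RULES:
            if all(p in ("x", o) for p, o in zip(pattern, observed)):
                return rating
        raise ValueError("no matching rule for %s" % (observed,))

    print(detectability("L", "H", "H", "L"))  # -> 5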

8.12. Calculate the RPN


A measure of criticality is the Risk Priority Number, or RPN. The RPN is the product of
the probability of occurrence, the severity and the detectability, each of which is usually
a value between 1 and 10:

RPN = O*S*D

where:

O = Probability of occurrence
S = Severity
D = Detectability

Another definition of criticality is provided in MIL-STD-1629. In this case, the
criticality is the product of the failure rate, the failure effect probability and the failure
mode ratio, and is expressed as:

C = λβα

where:

C = Criticality
λ = Failure rate
β = Failure effect probability
α = Failure mode ratio

The failure rate is the rate of occurrence of failure, expressed in failures per million
cumulative operating hours, or in FITs (failures per billion operating hours). The failure
effect probability is the conditional probability that, if the failure mode occurs, the
severity level identified in the FMEA will be the result. The failure mode ratio is the
fraction of the failure rate that can be attributed to the specific failure mode under
analysis. In other words, the sum of the failure mode ratios for all failure modes of an
item will be 1.0.

The same logic applies to the RPN methodology, in that the occurrence rating (O) is the
product of “the probability of the failure cause occurring” times “the probability that the
failure cause will result in the identified effect”.
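
The two measures can be computed side by side; the sketch below (Python, with hypothetical input values) implements RPN = O*S*D and the MIL-STD-1629 criticality C = λβα as defined above.

    def rpn(occurrence, severity, detectability):
        # Risk Priority Number: each factor is rated from 1 to 10
        return occurrence * severity * detectability

    def criticality(failure_rate, beta, alpha):
        # MIL-STD-1629 criticality: C = lambda * beta * alpha
        #   failure_rate: e.g., failures per million operating hours
        #   beta:  conditional probability that the identified effect results
        #   alpha: fraction of the item failure rate due to this failure mode
        return failure_rate * beta * alpha

    # Hypothetical values for a single failure cause
    print(rpn(occurrence=6, severity=9, detectability=4))       # -> 216
    print(criticality(failure_rate=2.5, beta=0.8, alpha=0.35))  # -> 0.7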

Since severity is not included in this calculation, failure modes are usually sorted by
criticality for each severity level. This is done since a true measure of criticality must
include the severity of the failure mode.

The RPN methodology is generally the most commonly used across industries. However,
in some cases, the criticality metric is more applicable. Such cases occur when the
system under analysis is complex, or when quantitative failure rate estimates are
available. These failure rate estimates are generally derived from reliability modeling, as
summarized in this book.

8.13. Determine Appropriate Corrective Action


The FMEA failure causes are then sorted by RPN, from highest to lowest. After the RPN
of each failure cause is identified, the team will identify the actions that should be taken
to mitigate the most important failure causes, i.e., those with the highest RPN values.

An issue that needs to be addressed is the identification of a critical RPN value above
which corrective action should take place. The RPN is a qualitative measure of risk and,
therefore, there is no single critical value. Usually, the number of failure causes that can
be addressed with corrective actions will be determined by the availability of resources
and the criticality or severity of failure. Some organizations state to their suppliers that
RPNs of 40 or 50, or greater, shall be addressed; however, any such value is somewhat
arbitrary. Also, in many cases, it is required that all failure causes with high severity be
addressed, regardless of their occurrence or detectability.

The other factor that determines the failure causes that are to be addressed is the Pareto
ranking of the RPNs. In some cases, a well-defined number of causes comprises the bulk
of the total risk to the system; this situation becomes evident in a Pareto analysis of the
RPNs.
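
A sketch of this ranking (Python, with hypothetical causes) follows; both the fixed RPN threshold and the cumulative Pareto view shown are illustrative policies rather than prescribed values.

    # Hypothetical FMEA causes: (name, occurrence, severity, detectability)
    causes = [
        ("Solder joint fatigue", 6, 8, 5),
        ("Connector corrosion", 4, 7, 3),
        ("Seal wear", 7, 5, 2),
        ("EMI susceptibility", 2, 9, 8),
    ]

    ranked = sorted(((o * s * d, name) for name, o, s, d in causes), reverse=True)

    threshold = 50  # illustrative organizational threshold
    total = sum(r for r, _ in ranked)
    cumulative = 0
    for rpn_value, name in ranked:
        cumulative += rpn_value
        flag = "ACTION" if rpn_value >= threshold else ""
        print("%-22s RPN=%4d  cum=%3.0f%%  %s"
              % (name, rpn_value, 100.0 * cumulative / total, flag))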

Corrective actions will generally fall into three categories:

• First, the design can be modified such that the effect of failure is minimized,
thus effectively lowering the severity level of the failure effect. Options for
this include the addition of redundant elements or fault tolerance, the selection
of better materials, and/or the use of more robust components.
• The second general option is to reduce the likelihood of the failure mode
  occurring in the first place. This often can be achieved by the use of more
  robust components, i.e., components of higher quality levels or components
  able to handle higher stress levels. Another option for reducing this likelihood
  is the control of environmental stresses, reducing the stress to which the
  component is exposed.
• The third general corrective action is to improve detectability. Many products
will have failure modes for which the only viable corrective action is to make
the failure mode detectable. A common example of this is digital circuitry. If
redundancy is not an option, and higher quality components are not available,
the failure mode can be made detectable through the use of alarms or built-in-
test (BIT) features.

Examples of these corrective actions are shown in Figure 8.13-1.

[Figure 8.13-1 shows the three branches of corrective action: modify the design (fault tolerance, better materials), reduce the likelihood of failure (more robust components, reduce stress, control environment), and improve detectability.]

Figure 8.13-1: Potential Corrective Actions

Potential corrective actions include:

• Severity Reduction:
o Add redundancy
o Add a fail-safe feature
o Use personal protection equipment (for safety critical items)
• Occurrence Reduction:
o “Design out” the cause
o Reduce the rate of occurrence
• Detection Improvement:
o Implement alarm features
o Implement screening tests
o Design more relevant tests to detect the failure cause
o Develop better characterization methods

8.14. Update the RPN


The objective of the FMEA is to improve the reliability of the product or system under
analysis. Updating the FMEA with information that is learned during the analysis allows
the FMEA to be used as a means by which the reliability status of the item can be tracked
and improved. As failure causes are identified and eliminated, the reliability will
improve and the resulting RPN will decrease. This RPN value can be an effective means
for tracking the reliability growth of a product or system during both the design and
development phases of the life cycle.

8.15. Using Quality Function Deployment to Feed the FMEA


Quality Function Deployment (QFD) analysis can provide valuable information in
support of an FMEA. The manner in which this can be done is illustrated in Figure
8.15-1. Note that it is assumed that the reader has knowledge of the QFD process, so
those details are not included in this book.

Figure 8.15-1: QFD-to-FMEA Links

The QFD elements are defined as:

1. Characteristics (or “Whats”) – These are the high level characteristics of the
product that need to be achieved for the product’s customer to be satisfied. The
lack of these characteristics is synonymous with failure modes of the product,
which in turn are failure effects of failure modes at the next lower level of the
product hierarchy.
2. Importance (or “Ranking of Needs”) – The QFD will generally include a rating
of the importance of the characteristic (#1). The severity of the failure effects will
then be proportional to this importance. The dimensions of this importance should
include the dimensions of severity as described previously.
3. Measures (or “Hows”) – The “measures” in the QFD pertain to the
characteristics of the items comprising the design. These measures can be
comprised of a hierarchical listing of the items and their critical characteristics or
functions. The manner in which these characteristics can fail can be identified
with the IPOUND analysis previously discussed, and these become the failure
modes in the FMEA.
4. Relationships – the relationships matrix in the QFD identifies if the measure is
related to the characteristic and, if so, whether it is a strong, medium or weak
relationship. These relationships essentially identify the effects (i.e. the negative
of the characteristic) that will occur if the failure mode (i.e. failure modes of the
measures) occur.

If the FMEA elements are obtained from the QFD in this manner, the first five columns
in the FMEA can be populated directly, as shown in Figure 8.15-2; these columns
correspond to the characteristics, importance, measures and relationships described
above.
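
The following Python sketch suggests how such a mapping might be mechanized. The data structures, field names and example entries are illustrative assumptions, not a prescribed format; the strong/medium/weak relationship weights follow the QFD convention described above.

```python
# Minimal sketch of carrying QFD elements into the first FMEA columns.
# All example data are invented for illustration.

# "Whats" and their customer importance ratings (QFD elements 1 and 2)
characteristics = {"quiet operation": 8, "long battery life": 9}

# "Hows": item-level measures whose failure modes feed the FMEA (element 3)
measures = ["fan bearing wear", "battery cell capacity fade"]

# Relationships matrix (element 4): which measure affects which "what"
relationships = {
    ("fan bearing wear", "quiet operation"): "strong",
    ("battery cell capacity fade", "long battery life"): "strong",
}

# Populate FMEA rows: the failure mode, its effect (the negation of the
# characteristic), and a severity proportional to the QFD importance.
fmea_rows = [
    {
        "failure_mode": measure,
        "failure_effect": f"loss of {what}",
        "severity": characteristics[what],
        "relationship": strength,
    }
    for (measure, what), strength in relationships.items()
]

for row in fmea_rows:
    print(row)
```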


[Figure: an FMEA worksheet with a callout indicating the columns that are populated directly from the QFD/IPOUND analysis.]

Figure 8.15-2: QFD-FMEA

Failure modes can be interpreted in several ways:

1. Inability to perform an intended function
2. Inability to meet customer expectations

Number 2 is a broader definition of failure, in that it encompasses whether a product or
system has features that customers want. Number 1 relates to whether a set of features
that are assumed to meet customer wants can be sustained over the design life of the
product. If the QFD results are used as summarized above, then “failure” will generally
be defined as in #2, since the QFD should encompass all customer expectations.

8.16. References
1. SAE J1739 (R) Potential Failure Mode and Effects Analysis in Design (Design
FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and
Assembly Processes (Process FMEA)


9. Concluding Remarks
Reliability modeling has been used successfully as a reliability engineering tool for many
years. It is, however, only one element of a well-structured reliability program, and to be
effective it must be integrated with the other elements of such a program. This book has
reviewed the options an analyst has for developing a reliability model of a product or
system, and has provided guidance on applying the appropriate methodology based on
the analyst's specific needs and constraints.

The premise of the holistic approach described in this book is that the reliability model is
a living model that needs to be continuously updated throughout the program
development and deployment phases. This approach to modeling consists of predictions,
assessments and estimation, each performed at specific points in the development cycle
and with different purposes and approaches. Reliability predictions are performed very
early, before any empirical data exist on the item under analysis. Reliability assessments
are made to determine the effects of certain factors on reliability, and to identify and
study specific failure causes. Reliability estimates are made from empirical data;
together, the holistic approach encompasses all three elements.
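
As one concrete example of an estimate made from empirical data, the following Python sketch computes a one-sided lower confidence bound on MTBF for a time-terminated test, assuming exponentially distributed times to failure and using the standard chi-square relationship; the test hours and failure count are invented for illustration.

```python
# Minimal sketch of a reliability estimate from test data: a lower
# one-sided confidence bound on MTBF for a time-terminated test,
# under the exponential (constant failure rate) assumption.
from scipy.stats import chi2

def mtbf_lower_bound(total_hours: float, failures: int, cl: float = 0.90) -> float:
    """MTBF_lower = 2T / chi2(CL; 2r + 2) for a time-terminated test."""
    return 2.0 * total_hours / chi2.ppf(cl, 2 * failures + 2)

if __name__ == "__main__":
    T, r = 10_000.0, 3  # invented: 10,000 device-hours with 3 failures
    print(f"point estimate MTBF: {T / r:,.0f} h")
    print(f"90% lower bound MTBF: {mtbf_lower_bound(T, r):,.0f} h")
```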

A critical theme of this book has been that the purpose of a reliability model must be
clearly defined before an appropriate methodology is chosen. Each modeling need must
be fully defined in terms of the customers being served (their roles, educational
background, requirements, etc.), the constraints placed upon those customers (legal and
contractual, as well as technical and engineering), and the purpose of the model (what
decisions are being supported, and in what manner).

A summary of recommendations for an analyst developing a product or system reliability
model follows:

1. Clearly define the purpose and objectives of the model
2. Apply the appropriate methodologies in the appropriate program phase
3. Identify critical failure causes early in the program, as they will require the most
attention in terms of modeling
4. Fully leverage all available expertise in areas of design analysis, testing,
measurement, etc.
5. Use all available data and information, and be diligent about seeking needed data
6. Strategically perform tests to characterize critical failure causes
7. Engage suppliers and customers to maximize the consistency of models
throughout all system hierarchical levels

8. Use multiple modeling techniques, and work toward the goal of having them
reasonably agree with each other. In this manner, confidence in the results will be
greater.
9. Continuously update the model based on data that is obtained throughout all
program phases of the product or system life cycle
10. Identify and use available reliability software tools. These tools have become
very cost-effective and readily available, making it easy to apply techniques
that were impractical several decades ago.

It is hoped that this book has provided the reader with knowledge of approaches, tools
and interpretations that will allow a better understanding of the usefulness and limitations
of the various reliability modeling techniques. Given its stochastic nature, reliability
modeling is part science and part art, and there are many ways to approach it. But if the
analyst keeps the goals in mind and uses common sense, there is a high probability that
the model will succeed in achieving its objectives.
