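$$L_I\left[(T_{a1},T_{b1}),\ldots,(T_{aL},T_{bL}) \mid \theta\right] \propto \prod_{i=1}^{L}\left[F(T_{bi} \mid \theta) - F(T_{ai} \mid \theta)\right]$$

$$\lambda = \lambda_b \, e^{-\frac{E_a}{KT}} \, S^n$$

$$1 - CL = \sum_{k=0}^{r} \frac{(\lambda t)^k e^{-\lambda t}}{k!} = e^{-\lambda t}\left[1 + \lambda t + \cdots + \frac{(\lambda t)^{r-1}}{(r-1)!} + \frac{(\lambda t)^{r}}{r!}\right]$$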
RIAC is a DoD Information Analysis Center sponsored by the Defense Technical Information Center. RIAC is operated by a
team of Wyle Laboratories, Quanterion Solutions, the University of Maryland, the Penn State University Applied Research
Laboratory and the State University of New York Institute of Technology.
Ordering No.: RPAE
Reliability Modeling - The RIAC Guide to Reliability Prediction, Assessment and Estimation
Prepared by:
The intent of this book is to provide guidance on modeling techniques that can be used to quantify the reliability of a product or system. In
this context, reliability modeling is the process of constructing a mathematical model that is used to estimate the reliability characteristics of
a product. There are many ways in which this can be accomplished, depending on the product or system and the type of information that
is available, or practical to obtain, to the analyst. This book will review possible approaches, summarize their advantages and
disadvantages, and provide guidance on selecting a methodology based on the specific goals and constraints of the analyst. While this
book will not discuss the use of specific published methodologies, the examples that are provided draw on tools and methodologies that
the author has personal experience in developing, such as life modeling, NPRD, MIL-HDBK-217 and 217Plus.
The data contained in the RIAC databases is collected on a continuous basis from a broad range of
sources, including testing laboratories, device and equipment manufacturers, government laboratories
and equipment users (government and industry). Automatic distribution lists, voluntary data
submittals and field failure reporting systems supplement an intensive data solicitation program.
Users of RIAC are encouraged to submit their RMQSI data to enhance these data collection efforts.
RIAC publishes documents for its users in a variety of formats and subject areas. While most are
intended to meet the needs of RMQSI practitioners, many are also targeted to managers and designers.
RIAC also offers RMQSI consulting, training and responses to technical and bibliographic inquiries.
Copyright © 2010 by Quanterion Solutions Incorporated. This handbook was developed by Quanterion
Solutions Incorporated, in support of the prime contractor (Wyle Laboratories) in the operation of the Department
of Defense Reliability Information Analysis Center (RIAC) under Contract HC1047-05-D-4005. The Government
has a fully paid up perpetual license for free use of and access to this publication and its contents among all the
DOD IACs in both hardcopy and electronic versions, without limitation on the number of users or servers. Subject
to the rights of the Government, this document (hardcopy and electronic versions) and the content contained within
it are protected by U.S. Copyright Law and may not be copied, automated, re-sold, or redistributed to multiple
users without the express written permission. The copyrighted work may not be made available on a server for use
by more than one person simultaneously without the express written permission. If automation of the technical
content for other than personal use, or for multiple simultaneous user access to a copyrighted work is desired,
please contact 877.363.RIAC (toll free) or 315.351.4202 for licensing information.
Table of Contents: Reliability Modeling – The RIAC Guide
Table of Contents
Page
1. INTRODUCTION 1
1.1. Scope 2
1.2. Book Organization 5
1.3. Reliability Program Elements 7
1.4. The History of Reliability Prediction 11
1.5. Acronyms 17
1.6. References 18
2. GENERAL ASSESSMENT APPROACH 19
2.1. Define System 20
2.2. Identify the Purpose of the Model 22
2.3. Determine the Appropriate Level at Which to Perform the Modeling 25
2.3.1. Level vs. Data Needed 26
2.3.2. Using an FMEA as the basis for a reliability model 28
2.3.3. Model Form vs. Level 34
2.4. Assess Data Available 36
2.5. Determine and Execute Appropriate Approach 38
2.5.1. Empirical 44
2.5.1.1. Test 44
2.5.1.2. Field Data 77
2.5.2. Physics 106
2.5.2.1. Stress/Strength Modeling 106
2.5.2.2. First Principles 111
2.6. Combine Data 114
2.6.1. Bayesian Inference 121
2.7. Develop System Model 123
2.7.1. Monte Carlo Analysis 127
2.8. References 133
3. FUNDAMENTAL CONCEPTS 135
3.1. Reliability Theory Concepts 135
3.2. Probability Concepts 142
3.2.1. Covariance 142
3.2.2. Correlation Coefficient 142
3.2.3. Permutations and Combinations 143
3.2.4. Mutual Exclusivity 144
3.2.5. Independent Events 144
3.2.6. Non‐independent (Dependent) Events 145
3.2.7. Non‐independent (Dependent) Events: Bayes Theorem 146
3.2.8. System Models 146
3.2.9. K‐out‐of‐N Configurations 151
3.3. Distributions 153
3.3.1. Exponential 159
3.3.2. Weibull 160
3.3.3. Lognormal 166
3.4. References 169
4. DOE‐BASED APPROACHES TO RELIABILITY MODELING 171
4.1. Determine the Feature to be Assessed 172
4.2. Determine Factors 172
4.3. Determine the Factor Levels 172
4.4. Design the Tests 174
4.5. Perform Tests and Measurements 180
4.6. Analyze the Data 181
4.7. Develop the Life Model 183
4.8. References 183
5. LIFE DATA MODELING 185
5.1. Selecting a Distribution 185
5.2. Parameter Estimation Overview 186
5.2.1. Closed Form Parameter Approximations 189
5.2.2. Least Squares Regression 190
5.2.3. Parameter Estimation Using MLE 192
5.2.3.1. Brief Historical Remarks 193
5.2.3.2. Likelihood Function 193
5.2.3.3. Maximum Likelihood Estimator (MLE) 195
5.2.4. Confidence Bounds and Uncertainty 198
5.2.4.1. Confidence Bounds with MLE 198
5.2.4.2. Confidence Bounds Approximations 199
5.3. Acceleration Models 206
5.3.1. Fundamental Acceleration Models 207
5.3.1.1. Examples 208
5.3.2. Combined Models 210
5.3.3. Cumulative Damage Model 214
5.4. MLE Equations 216
5.4.1. Likelihood Functions 217
5.5. References 221
6. INTERPRETATION OF RELIABILITY ESTIMATES 223
6.1. Bathtub Curve 223
6.2. Common Cause vs. Special Cause 225
6.3. Confidence Bounds 238
6.3.1. Traditional Techniques for Confidence Bounds 238
6.3.2. Uncertainty in Reliability Prediction Estimates 240
6.4. Failure Rate vs pdf 243
6.5. Practical Aspects of Reliability Assessments 245
6.6. Weibayes 245
6.7. Weibull Closure Property 246
6.8. Estimating Event‐Related Reliability 247
6.9. Combining Different Types of Assessments at Different Levels 248
6.10. Estimating the Number of Failures 250
6.11. Calculation of Equivalent Failure Rates 251
6.12. Failure Rate Units 252
6.13. Factors to be Considered When Developing Models 253
6.13.1. Causes of Electronic System Failure 253
6.13.2. Selection of Factors 255
6.13.3. Reliability Growth of Components 257
6.13.4. Relative vs. Absolute Humidity 259
6.14. Addressing Data with No Failures 259
6.15. Reliability of Components Used Outside of Their Rating 261
6.16. References 262
7. EXAMPLES 263
7.1. MIL‐HDBK‐217 Model Development Methodology 264
7.1.1. Identify Possible Variables 266
7.1.2. Develop Theoretical Model 266
7.1.3. Collect and QC Data 267
7.1.4. Correlation Coefficient Analysis 268
7.1.5. Stepwise Multiple Regression Analysis 270
7.1.6. Goodness‐of‐Fit Analysis 271
7.1.7. Extreme Case Analysis 272
7.1.8. Model Validation 272
7.2. 217Plus Reliability Prediction Models 273
7.2.1. Background 273
7.2.2. System Reliability Prediction Model 274
7.2.2.1. 217Plus Background 274
7.2.2.2. Methodology Overview 277
7.2.2.3. System Reliability Model 278
7.2.2.4. Initial Failure Rate Estimate 279
7.2.2.5. Process Grading Factors 280
7.2.2.6. Basis Data for the Model 281
7.2.2.7. Uncertainty in Traditional Approach Estimates 281
7.2.2.8. System Failure Causes 282
7.2.2.9. Environmental Factor 287
7.2.2.10. Reliability Growth 291
7.2.2.11. Infant Mortality 292
7.2.2.12. Combining Predicted Failure Rate with Empirical Data 292
7.2.3. Development of Component Reliability Models 292
7.2.3.1. Model Form 292
7.2.3.2. Acceleration Factors 294
7.2.3.3. Time Basis of Models 294
7.2.3.4. Failure Mode to Failure Cause Mapping 295
7.2.3.5. Derivation of Base Failure Rates 296
7.2.3.6. Combining the Predicted Failure Rate with Empirical Data 296
7.2.3.7. Estimating Confidence Levels 298
7.2.3.8. Using the 217Plus Model in a Top-Down Analysis 298
7.2.3.9. Capacitor Model Example 299
7.2.3.10. Default Values 301
7.2.4. Photonic Model Development Example 303
7.2.4.1. Introduction 303
7.2.4.2. Model development methodology and results 306
7.2.4.3. Uncertainty Analysis 322
7.2.4.4. Comments on Part Quality Levels 325
7.2.4.5. Explanation of Failure Rate Units 325
7.2.5. System‐Level Model 326
7.2.5.1. Model Presentation 326
7.2.5.2. 217Plus Process Grading Criteria 328
7.2.5.3. Design Process Grade Factor Questions 330
7.2.5.4. Manufacturing Process Grade Factor Questions 336
7.2.5.5. Part Quality Process Grade Factor Questions 340
7.2.5.6. System Management Process Grade Factor Questions 342
7.2.5.7. Can Not Duplicate (CND) Process Grade Factor Questions 346
7.2.5.8. Induced Process Grade Factor Questions 347
7.2.5.9. Wearout Process Grade Factor Questions 348
7.2.5.10. Growth Process Grade Factor Questions 349
7.3. Life Modeling Example 350
7.3.1. Introduction 350
7.3.2. Approach 350
7.3.3. Reliability Test Plan 350
7.3.4. Results 352
7.3.4.1. Times to Failure Summary 352
7.3.4.2. Life Models 354
7.4. NPRD Description 357
7.4.1. Data Collection 358
7.4.2. Data Interpretation 361
7.4.3. Document Overview 366
7.4.3.1. "Part Summaries" Overview 366
7.4.3.2. "Part Details" Overview 373
7.4.3.3. Section 4 "Data Sources" Overview 374
7.4.3.4. Section 5 "Part Number/MIL Number" Index 374
7.4.3.5. Section 6 "National Stock Number Index with Federal Stock Class" 375
7.4.3.6. Section 7 "National Stock Number Index without Federal Stock Class Prefix" 375
7.5. References 375
8. THE USE OF FMEA IN RELIABILITY MODELING 377
8.1. Introduction 377
8.2. Definitions 381
8.3. FMEA Logistics 383
8.3.1. When initiated 383
8.3.2. FMEA Team 383
8.3.3. FMEA Facilitation 384
8.3.4. Implementation 385
8.4. How to Perform an FMEA 385
8.5. Identify System Hierarchy 387
8.6. Function Analysis 388
8.7. IPOUND Analysis 388
8.8. Identify the Severity 390
8.9. Identify the Possible Effect(s) that Result from Occurrence of Each Failure Mode 392
8.10. Identify Potential Causes of Each Failure Mode 392
8.11. Identify Factors for Each Failure Cause 398
8.11.1. Accelerating Stress(es) or Potential Tests 398
8.11.2. Occurrence 398
8.11.2.1. Occurrence Rankings 398
8.11.3. Preventions 401
8.11.4. Detections 401
8.11.5. Detectability 401
8.12. Calculate the RPN 404
8.13. Determine Appropriate Corrective Action 405
8.14. Update the RPN 408
8.15. Using Quality Function Deployment to Feed the FMEA 408
8.16. References 410
9. CONCLUDING REMARKS 411
List of Figures: Reliability Modeling – The RIAC Guide
List of Figures
Page
FIGURE 1.1‐1: PHASES OF A RELIABILITY PROGRAM ..................................................................................... 2
FIGURE 1.1‐2: RELATIVE COST OF FAILURES VS. PHASE ................................................................................ 3
FIGURE 1.1‐3: RELIABILITY PREDICTION, ASSESSMENT AND ESTIMATION.................................................... 4
FIGURE 1.1‐4: PERCENT OF COMPANIES USING RELIABILITY ENGINEERING TOOLS ..................................... 5
FIGURE 1.3‐1: EXAMPLE RELIABILITY PROGRAM APPROACH ........................................................................ 7
FIGURE 2.0‐1: GENERAL MODELING APPROACH ......................................................................................... 20
FIGURE 2.1‐1: FAULT TREE REPRESENTATION OF SYSTEM MODEL ............................................................. 21
FIGURE 2.1‐2: FAULT TREE REPRESENTATION TO THE FAILURE CAUSE LEVEL ............................................ 21
FIGURE 2.2‐1: BREAKDOWN OF POTENTIAL RELIABILITY MODELING PURPOSES ....................................... 23
FIGURE 2.3‐1: TYPICAL DATA REQUIREMENTS VS. LEVEL OF HIERARCHY ................................................... 27
FIGURE 2.3‐2: THE BASIC FMEA APPROACH ................................................................................................. 28
FIGURE 2.3‐3: HIERARCHICAL RELATIONSHIP BETWEEN CAUSE, MODE AND EFFECT ................................. 29
FIGURE 2.3‐4: APPROACH TO IDENTIFYING CAUSES .................................................................................... 29
FIGURE 2.3‐5: FAULT TREE OF PRODUCT OR SYSTEM ................................................................................. 32
FIGURE 2.3‐6: FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE AS THE LOWEST LEVEL ....................... 32
FIGURE 2.3‐7: FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE ABOVE THE LOWEST LEVEL ................ 33
FIGURE 2.3‐8: FAULT TREE OF PRODUCT OR SYSTEM WITH CAUSE TWO LEVELS ABOVE THE LOWEST
LEVEL ................................................................................................................................................... 33
FIGURE 2.5‐1: BREAKDOWN OF RELIABILITY ASSESSMENT OPTIONS .......................................................... 38
FIGURE 2.5‐2: QUALIFICATION CONCEPTS AND TERMINOLOGY .................................................................. 46
FIGURE 2.5‐3: EVT, DVT AND PVT RELATIONSHIPS....................................................................................... 48
FIGURE 2.5‐4: ACCELERATION LEVELS ......................................................................................................... 51
FIGURE 2.5‐5: UNCERTAINTY IN EXTRAPOLATION ...................................................................................... 52
FIGURE 2.5‐6: ACCELERATION LEVELS ......................................................................................................... 53
FIGURE 2.5‐7: ACCELERATION ALTERNATIVES ............................................................................................. 53
FIGURE 2.5‐8: RELATIVE LIFETIME VS. STRESS ............................................................................................. 54
FIGURE 2.5‐9: RELIABILITY REQUIREMENT VS. SMALL POPULATION RELIABILITY INFERENCE ................... 60
FIGURE 2.5‐10: LIFE MODELING METHODOLOGY ....................................................................................... 62
FIGURE 2.5‐11: IDENTIFICATION OF TEST STRESSES BASED ON THE FMEA ................................................. 64
FIGURE 2.5‐12: USING THE DESTRUCT LIMIT TO DEFINE THE LIFE TEST MAX STRESS ................................ 66
FIGURE 2.5‐13: POSSIBLE STRESS PROFILES ................................................................................................ 67
FIGURE 2.5‐14: MEASUREMENT POINTS FOR AN INFANT MORTALITY FAILURE CAUSE .............................. 69
FIGURE 2.5‐15: MEASUREMENT POINTS FOR A WEAROUT FAILURE CAUSE ............................................... 69
FIGURE 2.5‐16: ACCELERATION WHEN THE DISTRIBUTIONS FOR AT LEAST TWO STRESSES ARE AVAILABLE
............................................................................................................................................................ 71
FIGURE 2.5‐17: ACCELERATION WHEN THE DISTRIBUTIONS FOR LOW STRESSES ARE NOT AVAILABLE ..... 71
FIGURE 2.5‐18: LIFE MODEL SEQUENCE ....................................................................................................... 72
FIGURE 2.5‐19: DEGRADATION MODELING APPROACH ............................................................................... 75
FIGURE 2.5‐20: DEGRADATION DATA EXAMPLE .......................................................................................... 76
FIGURE 2.5‐21: DEGRADATION DATA CONVERSION TO TIMES TO FAILURE ................................................ 77
FIGURE 2.5‐22: RELIABILITY ESTIMATES FROM FIELD DATA ........................................................................ 78
FIGURE 2.5‐23: FMEA AS A TOOL FOR ASSESSING SIMILARITY ..................................................................... 81
FIGURE 2.5‐24: MIL‐HDBK‐217 PART COUNT EXAMPLE ............................................................................... 85
FIGURE 2.5‐25: MIL‐HDBK‐217 PART STRESS EXAMPLE ............................................................................... 86
FIGURE 2.5‐26: TELCORDIA SR‐332 (BELLCORE) ........................................................................................... 87
FIGURE 2.5‐27: RAC PRISM REPLACED BY RIAC 217PLUS ............................................................................. 88
FIGURE 2.5‐28: CNET/RDF 2000 ................................................................................................................... 89
FIGURE 2.5‐29: CNET/RDF 2000 MODEL EXAMPLE ...................................................................................... 90
FIGURE 2.5‐30: FIDES .................................................................................................................................... 91
FIGURE 2.5‐31: USES OF PROGRAM DATA ELEMENTS ................................................................................. 93
FIGURE 2.5‐32: PROGRAM DATABASE STRUCTURE ..................................................................................... 93
FIGURE 2.5‐33: DATABASE INFORMATION FLOW ........................................................................................ 95
FIGURE 2.5‐34: HIERARCHY OF MAINTENANCE ACTIONS ............................................................................ 97
FIGURE 2.5‐35: CALCULATION OF PART LIFE UNIT ..................................................................................... 100
FIGURE 2.5‐36: FAILURE TIMES BASED ON OPERATING TIME .................................................................... 101
FIGURE 2.5‐37: FAILURE TIMES BASED ON CALENDAR TIME ..................................................................... 102
FIGURE 2.5‐38: FAILURE RATE SIMULATION WITH WEIBULL BETA = 20 .................................................... 103
FIGURE 2.5‐39: FAILURE RATE SIMULATION WITH WEIBULL BETA = 5.0 ................................................... 103
FIGURE 2.5‐40: FAILURE RATE SIMULATION WITH WEIBULL BETA = 2.0 ................................................... 104
FIGURE 2.5‐41: FAILURE RATE SIMULATION WITH WEIBULL BETA = 1.0 ................................................... 104
FIGURE 2.5‐42: FAILURE RATE SIMULATION WITH WEIBULL BETA = 0.5 ................................................... 105
FIGURE 2.5‐44: STRESS/STRENGTH INTERFERENCE ................................................................................... 108
FIGURE 2.5‐45: STRESS/STRENGTH INTERFERENCE VS. TIME .................................................................... 109
FIGURE 2.6‐1: 217PLUS APPROACH TO FAILURE RATE ESTIMATION ......................................................... 114
FIGURE 2.6‐3: BAYESIAN INFERENCE OUTLINE .......................................................................................... 122
FIGURE 2.7‐1: COMBINING SEVEN FAILURE CAUSE DISTRIBUTIONS .......................................................... 125
FIGURE 2.7‐2: POSSIBLE FAULT TREE REPRESENTATION OF A SERIES RELIABILITY BLOCK DIAGRAM ........ 126
FIGURE 2.7‐3: PDF OF NORMAL DISTRIBUTION WITH MEAN OF 10 AND STANDARD DEVIATION OF 3. ... 128
FIGURE 2.7‐4: CUMULATIVE NORMAL DISTRIBUTION WITH MEAN OF 10 AND STANDARD DEVIATION OF 3
.......................................................................................................................................................... 128
FIGURE 2.7‐5: VALUE SELECTION FROM A DISTRIBUTION ......................................................................... 129
FIGURE 2.7‐6: VALUE SELECTION FROM A WEIBULL DISTRIBUTION .......................................................... 130
FIGURE 2.7‐7: RELIABILITY BLOCK DIAGRAM OF REDUNDANT EXAMPLE .................................................. 131
FIGURE 2.7‐8: SYSTEM MONTE CARLO EXAMPLE....................................................................................... 131
FIGURE 2.7‐9: MONTE CARLO SIMULATION OF EXAMPLE SYSTEM ........................................................... 132
FIGURE 3.1‐1: DISCRETE PROBABILITY DISTRIBUTION ............................................................................... 135
FIGURE 3.1‐2: CONTINUOUS PROBABILITY DISTRIBUTION ........................................................................ 136
FIGURE 3.2‐1: EXAMPLES OF CORRELATION COEFFICIENTS ....................................................................... 142
FIGURE 3.2‐2: VENN DIAGRAM OF MUTUALLY EXCLUSIVE EVENTS ........................................................... 144
FIGURE 3.2‐3: INDEPENDENT EVENTS ........................................................................................................ 145
FIGURE 3.2‐4: FAULT TREE OR GATE .......................................................................................................... 147
FIGURE 3.2‐5: RELIABILITY BLOCK DIAGRAM FOR AN OR GATE ................................................................. 147
FIGURE 3.2‐6: FAULT TREE AND GATE ........................................................................................................ 148
FIGURE 3.2‐7: RELIABILITY BLOCK DIAGRAM FOR AN AND GATE .............................................................. 149
FIGURE 3.2‐8: FAULT TREE OF AN AND/OR COMBINATION ....................................................................... 150
FIGURE 3.2‐9: RBD OF AND/OR COMBINATION ......................................................................................... 150
FIGURE 3.3‐1: SHAPES OF FAILURE DENSITY AND RELIABILITY FUNCTIONS OF COMMONLY USED DISCRETE
DISTRIBUTIONS (FROM MIL‐HDBK‐338B) ......................................................................................... 157
FIGURE 3.3‐2: SHAPES OF FAILURE DENSITY, RELIABILITY AND HAZARD RATE FUNCTIONS FOR COMMONLY
USED CONTINUOUS DISTRIBUTIONS (FROM MIL‐HDBK‐338B) ........................................................ 158
FIGURE 3.3‐3: EXAMPLE PDF PLOTS FOR THE WEIBULL DISTRIBUTION .................................................... 164
FIGURE 3.3‐4: EXAMPLE HAZARD RATE PLOTS FOR THE WEIBULL DISTRIBUTION .................................... 164
FIGURE 3.3‐5: EXAMPLE PROBABILITY PLOTS FOR WEIBULL DISTRIBUTION ............................................. 165
FIGURE 3.3‐6: EXAMPLE PDF PLOTS FOR THE LOGNORMAL DISTRIBUTION .............................................. 167
FIGURE 3.3‐7: EXAMPLE HAZARD RATE PLOTS FOR THE LOGNORMAL DISTRIBUTION .............................. 168
FIGURE 3.3‐8: EXAMPLE PROBABILITY PLOTS FOR THE LOGNORMAL DISTRIBUTION ............................... 168
FIGURE 4.0‐1: THE DOE CONCEPT .............................................................................................................. 171
FIGURE 4.3‐1: POSSIBLE RESPONSE‐FACTOR LEVEL RELATIONSHIP ........................................................... 173
FIGURE 4.4‐1: DOE TERMINOLOGY ............................................................................................................ 174
FIGURE 4.4‐2: ONE‐FACTOR‐AT‐A‐TIME EXPERIMENTS ............................................................................. 176
FIGURE 4.4‐3: STANDARD DOE NOMENCLATURE ...................................................................................... 177
FIGURE 4.4‐4: POTENTIAL INTERACTIONS .................................................................................................. 178
FIGURE 4.6‐1: ANALYSIS OF MEANS ........................................................................................................... 182
FIGURE 4.6‐2: LINEARIZATION OF THE ARRHENIUS RELATIONSHIP ........................................................... 182
FIGURE 4.6‐3: OPTIMAL FACTOR SETTINGS................................................................................................ 183
FIGURE 5.4‐1: LIKELIHOOD CONTOUR EXAMPLE........................................................................................ 220
FIGURE 6.1‐1: BATHTUB CURVE ................................................................................................................. 223
FIGURE 6.2‐1: EXAMPLE OF NON‐MONOMODAL DISTRIBUTION .............................................................. 228
FIGURE 6.2‐2: MULTIMODAL DISTRIBUTION EXAMPLE 1 ........................................................................... 229
FIGURE 6.2‐3: MULTIMODAL DISTRIBUTION EXAMPLE 2 ........................................................................... 230
FIGURE 6.2‐4: MULTIMODAL DISTRIBUTION EXAMPLE 3 ........................................................................... 231
FIGURE 6.2‐5: MULTIMODAL DISTRIBUTION EXAMPLE 4 ........................................................................... 232
FIGURE 6.2‐6: MULTIMODAL DISTRIBUTION EXAMPLE 5 ........................................................................... 233
FIGURE 6.2‐7: MULTIMODAL DISTRIBUTION EXAMPLE OF POOLED DATA SET ......................................... 234
FIGURE 6.2‐8: AGE AT DEATH DATA ........................................................................................................... 235
FIGURE 6.2‐9: PDF OF MULTIMODE DISTRIBUTION OF AGES .................................................................... 236
FIGURE 6.2‐10: FAILURE RATE OF AGE DATA ............................................................................................. 236
FIGURE 6.2‐11: PROBABILITY PLOT OF AGE DATA ...................................................................................... 237
FIGURE 6.2‐12: SINGLE MODE WEIBULL FIT TO THE AGE DATA ................................................................. 238
FIGURE 6.3‐1: SOURCES OF ERROR IN EMPIRICAL MODELS ....................................................................... 241
FIGURE 6.3‐2: CONFIDENCE LEVEL THROUGH PREDICTION, ASSESSMENT AND ESTIMATION .................. 243
FIGURE 6.6‐1: WEIBAYES EXAMPLE ............................................................................................................ 246
FIGURE 6.13‐1: NOMINAL FAILURE CAUSE DISTRIBUTION OF ELECTRONIC SYSTEMS ............................... 254
FIGURE 6.13‐2: IPO MODEL ........................................................................................................................ 256
FIGURE 6.13‐3: RELATIONSHIP BETWEEN ABSOLUTE AND RELATIVE HUMIDITY....................................... 259
FIGURE 6.14‐1: ESTIMATED UPPER BOUND FAILURE RATES VS OPERATING TIME AT 60 AND 90%
CONFIDENCE ..................................................................................................................................... 260
FIGURE 7.1‐1: MIL‐HDBK‐217 MODEL DEVELOPMENT METHODOLOGY ................................................... 265
FIGURE 7.2‐1: FAILURE CAUSE DISTRIBUTION OF ELECTRONIC SYSTEMS .................................................. 275
FIGURE 7.2‐2: OPTICAL AMPLIFIER FAILURE CAUSE DISTRIBUTION ........................................................... 277
FIGURE 7.2‐3: ΠG VS. TIME AND GROWTH RATES ..................................................................................... 291
FIGURE 7.2‐4: MODEL DEVELOPMENT METHODOLOGY FLOWCHART ...................................................... 306
FIGURE 7.2‐5: DISTRIBUTION OF LOG10 PREDICTED/OBSERVED FAILURE RATE RATIO FOR ALL DATA .... 323
FIGURE 7.2‐6: DISTRIBUTION OF LOG10 PREDICTED/OBSERVED RATIO FOR FIELD DATA ONLY ............... 324
FIGURE 7.2‐7: DISTRIBUTIONS OF THE PREDICTED/OBSERVED FAILURE RATE RATIO FOR ALL DATA AND
FOR FIELD DATA ONLY ...................................................................................................................... 324
FIGURE 7.3‐1: TIMES TO FAILURE DISTRIBUTIONS ..................................................................................... 354
FIGURE 7.3‐2: PROBABILITY OF FAILURE VS. TEMPERATURE AND RELATIVE HUMIDITY AT 50,000 HOURS
.......................................................................................................................................................... 357
FIGURE 7.4‐1: APPARENT FAILURE RATE FOR REPLACEMENT UPON FAILURE........................................... 362
FIGURE 7.4‐3: EXAMPLE OF PART DETAIL ENTRIES ................................................................................... 374
FIGURE 8.1‐1: TWO BASIC TYPES OF FMEA ................................................................................................ 378
FIGURE 8.4‐1: FMEA PROCESS FLOW ......................................................................................................... 386
FIGURE 8.7‐1: FAILURE CAUSE‐MODE‐EFFECT RELATIONSHIP .................................................................. 390
FIGURE 8.10‐1: FAILURE CAUSE, MODE AND EFFECT HIERARCHY ............................................................. 393
FIGURE 8.10‐2: FAILURE CAUSES ................................................................................................................ 395
FIGURE 8.11‐1: OCCURRENCE DEFINITIONS ............................................................................................... 399
FIGURE 8.11‐2: OCCURRENCE GUIDELINES ................................................................................................ 400
FIGURE 8.11‐3: DETECTABILITY DEFINITIONS ............................................................................................. 402
FIGURE 8.11‐4: LIFE CYCLE VS DETECTABILITY DIMENSION ....................................................................... 403
FIGURE 8.13‐1: POTENTIAL CORRECTIVE ACTIONS .................................................................................... 407
FIGURE 8.15‐1: QFD‐TO‐FMEA LINKS ......................................................................................................... 408
FIGURE 8.15‐2: QFD‐FMEA ......................................................................................................................... 410
List of Tables: Reliability Modeling – The RIAC Guide
List of Tables
Page
TABLE 1.3‐1: RANGES OF POTENTIAL CUSTOMER REACTIONS...................................................................... 8
TABLE 2.2‐1: RELIABILITY ASSESSMENT PURPOSES ..................................................................................... 24
TABLE 2.2‐2: PROGRAM PHASE VS. RELIABILITY ASSESSMENT PURPOSE ................................................... 25
TABLE 2.3‐1: EXAMPLES OF INITIAL CONDITIONS, STRESSES AND MECHANISMS ...................................... 30
TABLE 2.3‐2: RELATIONSHIP BETWEEN CAUSE, MODE AND EFFECT. .......................................................... 31
TABLE 2.5‐1: SUMMARY OF RELIABILITY ASSESSMENT OPTIONS ............................................................... 39
TABLE 2.5‐1: SUMMARY OF ASSESSMENT OPTIONS (CONTINUED) ............................................................ 40
TABLE 2.5‐2: RELEVANCY OF APPROACH TO PREDICTION, ASSESSMENT AND ESTIMATION....................... 41
TABLE 2.5‐3: IDENTIFICATION OF APPROPRIATE APPROACHES BASED ON THE PURPOSE ......................... 43
TABLE 2.5‐4: RANKING THE ATTRIBUTES OF EMPIRICAL DATA ................................................................... 44
TABLE 2.5‐5: EVT, DVT AND PVT PURPOSE AND APPROACH ....................................................................... 47
TABLE 2.5‐6: RELIABILITY DEMONSTRATION EXAMPLE ............................................................................... 50
TABLE 2.5‐7: EXAMPLE OF A QUALIFICATION PLAN FOR AN ASSEMBLY ..................................................... 57
TABLE 2.5‐8: QUALIFICATION EXAMPLE FOR A LASER DIODE ..................................................................... 58
TABLE 2.5‐9: STRESS PROFILE OPTION ADVANTAGES AND DISADVANTAGES ............................................. 68
TABLE 2.5‐10: SIMILARITY ANALYSIS ............................................................................................................ 80
TABLE 2.5‐11: DIGITAL CIRCUIT BOARD FAILURE RATES (IN FAILURES PER MILLION PART HOURS) ........... 83
TABLE 2.5‐12: TEST CONDITIONS ............................................................................................................... 111
TABLE 2.5‐13: DATA TO ESTIMATE DIFFUSION RATE ................................................................................. 112
TABLE 2.5‐14: PREDICTED LIFETIMES VS. OBSERVED ................................................................................. 113
TABLE 3.1‐1: PROBABILITY DISTRIBUTION NOTATION & MATHEMATICAL REPRESENTATIONS ............... 141
TABLE 3.2‐1: COMBINATIONS EXAMPLE .................................................................................................... 143
TABLE 3.2‐2: COMBINATIONS OF AN OR CONFIGURATION ....................................................................... 147
TABLE 3.2‐3: COMBINATIONS OF AN AND CONFIGURATION ..................................................................... 149
TABLE 3.2‐4: EXAMPLE OF “K‐OUT‐OF‐N” PROBABILITY CALCULATIONS................................................... 151
TABLE 3.2‐5: EXAMPLE OF “2‐OUT‐OF‐3” REQUIRED FOR SUCCESS .......................................................... 152
TABLE 3.3‐1: PROBABILITY DISTRIBUTIONS APPLICABLE TO RELIABILITY ENGINEERING .......................... 154
TABLE 3.3‐2: EXPONENTIAL DISTRIBUTION PARAMETERS ........................................................................ 160
TABLE 3.3‐3: CONFUSING TERMINOLOGY OF THE WEIBULL DISTRIBUTION ............................................. 162
TABLE 3.3‐4: WEIBULL DISTRIBUTION PARAMETERS ................................................................................ 163
TABLE 4.3‐1: POSSIBLE CONCLUSIONS FOR A NON‐LINEAR RESPONSE‐FACTOR RELATIONSHIP ............... 173
TABLE 4.4‐1: FULL‐FACTORIAL EXAMPLE .................................................................................................... 175
TABLE 4.4‐2: FULL AND HALF FACTORIAL EXAMPLE FOR CORROSION ...................................................... 179
TABLE 5.2‐1: TERMINOLOGY USED IN PARAMETER ESTIMATION ............................................................. 187
TABLE 5.2‐2: TECHNIQUES FOR PARAMETER ESTIMATION ....................................................................... 188
TABLE 5.2‐3: PARAMETERS TYPICALLY ESTIMATED FROM STATISTICAL DISTRIBUTIONS ......................... 189
TABLE 5.2‐4: CONFIDENCE BOUNDS FOR THE POISSON DISTRIBUTION ................................................... 200
TABLE 5.2‐5: CONFIDENCE BOUNDS FOR THE BINOMIAL DISTRIBUTION ................................................. 201
TABLE 5.2‐6: CONFIDENCE BOUNDS FOR THE EXPONENTIAL DISTRIBUTION ........................................... 202
TABLE 5.2‐8: CONFIDENCE BOUNDS FOR THE NORMAL DISTRIBUTION ................................................... 203
TABLE 5.3‐10: CONFIDENCE BOUNDS FOR THE WEIBULL DISTRIBUTION ................................................. 205
TABLE 6.1‐1: CATEGORIES OF FAILURE EFFECTS ........................................................................................ 227
TABLE 6.2‐2: BIMODAL POPULATION EXAMPLE 1 ...................................................................................... 229
TABLE 6.2‐3: BIMODAL POPULATION EXAMPLE 2 ...................................................................................... 230
TABLE 6.1‐4: BIMODAL POPULATION EXAMPLE 3 ...................................................................................... 231
TABLE 6.1‐5: BIMODAL POPULATION EXAMPLE 4 ...................................................................................... 232
TABLE 6.1‐6: BIMODAL POPULATION EXAMPLE 5 ...................................................................................... 233
TABLE 6.1‐7: FOUR MODE WEIBULL DISTRIBUTION PARAMETERS ............................................................ 235
TABLE 6.3‐1: FAILURE RATE UNCERTAINTY LEVEL MULTIPLIERS ................................................................ 242
TABLE 6.9‐1: EXAMPLE OF COMBINING DIFFERENT TYPES OF MODELS........................................................ 248
TABLE 6.13‐1: FACTORS TO BE CONSIDERED IN A RELIABILITY MODEL ..................................................... 256
TABLE 6.13‐2: FAILURE RATE DATA SUMMARY ......................................................................................... 258
TABLE 7.1‐1: DATA COLLECTED FOR MODEL DEVELOPMENT .................................................................... 269
TABLE 7.1‐2: DATA TRANSFORMS .............................................................................................................. 270
TABLE 7.1‐3: REGRESSION DATA INCLUDING CATEGORICAL VARIABLES ................................................... 271
TABLE 7.2‐1: UNCERTAINTY LEVEL MULTIPLIER ......................................................................................... 282
TABLE 7.2‐2: PERCENTAGE OF FAILURES ATTRIBUTABLE TO EACH FAILURE CAUSE .................................. 283
TABLE 7.2‐3: WEIBULL PARAMETERS FOR FAILURE CAUSE PERCENTAGES ................................................ 283
TABLE 7.2‐4: MULTIPLIERS AS A FUNCTION OF PROCESS GRADE ............................................................. 284
TABLE 7.2‐5: EXAMPLE OF FAILURE MODE‐TO‐FAILURE CAUSE CATEGORY MAPPING ............................. 295
TABLE 7.2‐6: CAPACITOR PARAMETERS ..................................................................................................... 301
TABLE 7.2‐7: DEFAULT ENVIRONMENTAL STRESS VALUES ........................................................................ 302
TABLE 7.2‐8: DEFAULT OPERATING PROFILE VALUES................................................................................. 303
TABLE 7.2‐9: FAILURE CAUSE SUMMARY FOR CONNECTORS .................................................................... 308
TABLE 7.2‐10: FAILURE MODE TO FAILURE CAUSE CATEGORY FOR CONNECTORS (SC AND FC) .............. 309
TABLE 7.2‐11: FAILURE CAUSE PERCENTAGES FOR CONNECTORS ............................................................. 311
TABLE 7.2‐12: DATA COLLECTED FOR CONNECTORS.................................................................................. 312
TABLE 7.2‐13: CATEGORIES OF ACCELERATION MODEL PARAMETERS ...................................................... 315
TABLE 7.2‐14: ACCELERATION MODEL PARAMETERS ................................................................................ 315
TABLE 7.2‐15: DEFAULT MODEL PARAMETERS .......................................................................................... 316
TABLE 7.2‐16: SUMMARY OF PI‐FACTOR CALCULATIONS .......................................................................... 317
TABLE 7.2‐17: APPLICABILITY OF TEST DATA .............................................................................................. 318
TABLE 7.2‐18: BASE FAILURE RATES (FAILURES PER MILLION CALENDAR HOURS) .................................... 319
TABLE 7.2‐19: PART QUALITY PROCESS GRADE FACTOR QUESTIONS FOR PHOTONIC DEVICE MODELS .. 320
TABLE 7.2‐20: SUMMARY OF UNCERTAINTY METRICS ............................................................................... 323
TABLE 7.2‐21: PARAMETERS FOR THE PROCESS GRADE FACTORS ............................................................. 327
TABLE 7.2‐22: INDEX OF PROCESS GRADE TYPE QUESTIONS .................................................................... 328
TABLE 7.2‐23: DESIGN PROCESS GRADE FACTOR QUESTIONS .................................................................. 330
TABLE 7.2‐24: MANUFACTURING PROCESS GRADE FACTOR QUESTIONS ................................................. 336
TABLE 7.2‐25: PART QUALITY PROCESS GRADE FACTOR QUESTIONS ....................................................... 340
TABLE 7.2‐26: SYSTEM MANAGEMENT PROCESS GRADE FACTOR QUESTIONS ........................................ 342
TABLE 7.2‐27: CAN NOT DUPLICATE (CND) PROCESS GRADE FACTOR QUESTIONS .................................. 346
TABLE 7.2‐28: INDUCED PROCESS GRADE FACTOR QUESTIONS ............................................................... 347
TABLE 7.2‐29: WEAROUT PROCESS GRADE FACTOR QUESTIONS ............................................................. 348
TABLE 7.2‐30: GROWTH PROCESS GRADE FACTOR QUESTIONS ............................................................... 349
TABLE 7.3‐1: PARAMETER LEVELS .............................................................................................................. 350
TABLE 7.3‐2: TEST PLAN SUMMARY ........................................................................................................... 351
TABLE 7.3‐3: LIFE TEST RESULTS ................................................................................................................. 352
TABLE 7.3‐4: TIMES TO FAILURE DISTRIBUTION PARAMETERS .................................................................. 353
TABLE 7.3‐5: ESTIMATED PARAMETER 80% 2‐SIDED CONFIDENCE BOUNDS ............................................ 356
TABLE 7.4‐1: DATA SUMMARIZATION PROCESS ........................................................................................ 359
TABLE 7.4‐2: TIME AT WHICH ASYMPTOTIC VALUE IS REACHED ............................................................... 363
TABLE 7.4‐3: α/MTTF RATIO AS A FUNCTION OF β ..................................................................................... 363
TABLE 7.4‐4: PERCENT FAILURE FOR WEIBULL DISTRIBUTION ................................................................... 364
TABLE 7.4‐5: FIELD DESCRIPTIONS ............................................................................................................. 367
TABLE 7.4‐6: APPLICATION ENVIRONMENTS DEFINED IN NPRD ............................................................... 368
TABLE 8.7‐1: FAILURE MODE RELATIONSHIP TO TAGUCHI LOSS FUNCTION ............................................. 389
TABLE 8.8‐1: DIMENSIONS OF FUNCTIONAL SEVERITY .............................................................................. 391
TABLE 8.8‐2: DIMENSIONS OF SEVERITY .................................................................................................... 392
TABLE 8.11‐1: CATEGORIES OF FAILURE EFFECTS ...................................................................................... 401
TABLE 8.11‐2: RECOMMENDED DETECTABILITY RATING CRITERIA ............................................................ 404
Chapter 1: Introduction
1. Introduction
Few engineering techniques have caused as much controversy in the last several decades
as the topic of reliability prediction. One of the primary reasons for this is the stochastic
nature of reliability. Whereas many engineering disciplines are governed by
deterministic processes, reliability is governed by a complex interaction of stochastic
processes. As a result, the metrics of interest in other engineering disciplines are
generally much more quantifiable by their very nature. While there is always a stochastic
element in any engineering model, the topic of reliability quantification must address its
extreme stochastic nature.
Many highly respected reliability engineering texts treat the topic of reliability modeling
thoroughly and in great detail. Included in these texts are detailed ways to model system
reliability using techniques like Failure Modes and Effects Analysis (FMEA), Fault Tree
Analysis (FTA), Markov models, fault tolerant design techniques, etc. However, these
texts often gloss over a fundamental requirement for using these techniques effectively,
i.e., the ability to quantify the reliability of the constituent components and subsystems
comprising the system.
The intent of this book is to provide guidance on reliability modeling techniques that can
be used to quantify the reliability of a product or system. In this context, reliability
modeling is the process of constructing a mathematical model that is used to estimate the
reliability characteristics of an item. There are many ways in which this can be
accomplished, depending on the item and the type of information that is available to, or
practical to obtain by, the analyst. This book will review possible approaches, summarize
their advantages and disadvantages, and provide guidance on selecting a methodology
based on specific goals and constraints. While this book will not discuss the use of
specific published methodologies, the examples that are provided use tools and
methodologies that the author has personal experience in developing, such as life
modeling, NPRD, MIL-HDBK-217 and 217Plus.
The Reliability Information Analysis Center (RIAC) has prepared many documents in the
past relating to many different reliability engineering techniques, such as FMEA, FTA,
Worst Case Analysis (WCA), etc. However, one noteworthy omission from this list is
reliability modeling. This, coupled with (1) the RIAC’s history of providing reliability
modeling data and solutions, and (2) the need to objectively address some of the
confusion and misconceptions related to this topic, formed the inspiration for this book.
In years past, DoD contracts would require that specific reliability prediction
methodologies, usually MIL-HDBK-217, be used. This left system developers with very little
flexibility in applying different reliability prediction practices. Since the DoD has not,
until very recently, supported updates to MIL-HDBK-217, companies were encouraged
to use best practices in quantifying product reliability. The difficult question to be
addressed is “what are the best practices that should be used?” This book attempts to
provide guidance on selecting an appropriate methodology based on the specific
conditions and constraints of the company and its products or systems.
It is hoped that the author’s experience gained by attempting many different reliability
assessment approaches, including physics and empirical approaches, can be used to the
advantage of the reader in a practical way.
1.1. Scope
The intent of a reliability program is to identify and mitigate failure modes/mechanisms,
verify their removal through reliability testing, implement corrective actions for
“discovered” failures, and maintain reliability levels after reliability has been designed in.
These correspond to the designing-in reliability, reliability growth and ensuring on-going
reliability goals, respectively, as illustrated in Figure 1.1-1.
The use of reliability engineering techniques early in the development cycle of a system
is critical to achieving high reliability. An important part of these efforts is the modeling
of reliability before the product or system is fielded.
The term “Reliability Prediction” has had a relatively narrow connotation, primarily
associated with “handbook” approaches. This document attempts to take a broader view
of this topic by investigating the various approaches for quantifying reliability, and their
effectiveness when used to achieve specific objectives. For this reason, the book is
entitled “Reliability Modeling – the RIAC Guide to Reliability Prediction, Assessment
and Estimation”. The definitions of these are:
Predictions are performed very early, before there is any empirical data on the item under
analysis. Reliability assessments are made to determine the effects of certain factors on
reliability and to identify failure causes. Reliability estimates are made based on
empirical data. This book covers all three areas, as illustrated in Figure 1.1-3.
Chapter 2 covers the primary topic of this book, and includes information on the various
ways in which a product can be modeled and guidance on selecting an approach. It
presents a generic approach, and describes the elements of this approach.
These examples are provided to give the reader a better appreciation for the tools,
techniques and limitations of various approaches to reliability modeling.
There are many possible approaches to “designing in” reliability. The specific approach
used will depend on the needs of the specific organization. Figure 1.3-1 presents one
possible approach, and includes the elements that should be included in all approaches.
The premise of this approach is to identify the critical parts and material which warrant
detailed attention. Since it is impractical to perform some reliability modeling
approaches on all system parts, it is imperative to identify the critical parts which are the
highest risk. Since one of the most effective ways to verify the robustness of parts or
materials is from experience, an effective reliability program must leverage knowledge
gained in the development and deployment of previous systems. It will be shown that
reliability assessments impact many of the elements of this approach.
1. Design requirements: The first step in any product development process is the
identification of requirements. These requirements include items pertaining to
Performance, Reliability (failure rate, life), Maintainability, Diagnostics, and Use
Environment and Operational stresses (i.e., mission profiles). Typically, the medium for
communicating these requirements is the product specification. While the specification
usually contains details regarding the required performance of the product or system, it is
often lacking relative to quantifying the reliability attributes required. The following
questions should be answered to determine these reliability requirements:
• What is the required failure rate of the item in its useful life?
• What is the service life required?
• What criteria will be used to determine when the requirements are not met?
• Whose responsibility will it be to take corrective action if these requirements are
not met?
• What are the operating and environmental profiles expected in field deployed
conditions?
The reliability that is considered acceptable will, of course, be specific to the industry,
criticality of failure, etc. The specific value may be specified, or it may not be,
depending on the industry and the maturity of the product. The range of potential
customer reactions to various scenarios is summarized in Table 1.3-1.
If the requirement is not specified, an estimate of the requirement must be made so that
there is a goal that can be used in the development process.
2. Initial Design: After the product requirements are understood, the design team
generally derives an initial, or preliminary, design for the product or system. Inputs to
this initial design should be in the form of design rules and a standard parts list. Design
rules are the culmination of lessons learned from previous development activities, both
from empirical field or test data and from analysis. These design rules should be a living
document which is continuously updated based on current information. Effective use of
design rules also saves much effort, since attributes which have a reliability history or
which have been previously studied do not need to be addressed in detail, thus freeing
resources to be applied to the study of critical parts.
4. Identify attributes that are similar: Similar attributes are those that have a reliability
history.
5. Assess robustness of attribute: If the part or attribute does have a history, previous test
data or field experience data can be used to assess the robustness of the part or attribute.
6. Identify attributes that are not similar: Attributes that are not similar do not have a
reliability history.
7. Perform design analysis: Although any attribute that is potentially different in the new
design relative to the previous design must be analyzed, particular attention is given to
the attributes that are not similar. Design techniques that are used for this purpose are
FMEA, tolerance or worst case analysis, thermal analysis, stress analysis, and reliability
predictions.
8. Implement corrective action: From the results of the design analysis, corrective action
should be taken to improve the robustness of the design.
9. Identify critical parts/materials: Based on the results of the analysis, critical parts or
materials are identified.
10. Model critical parts/materials: Once critical parts are identified, action must be taken
to ensure that the parts or materials are robust enough to meet the reliability and
durability requirements. More details of the approach used for this purpose will be
presented later in the book.
11. Identify effective tests for non-similar attributes: Based on the identification of
critical parts and the design analysis that was performed, specific tests that will assess the
reliability and durability of the attribute can be determined. Part of the FMEA should
include identification of the stresses that will accelerate failure of the attribute under
analysis; this analysis is therefore important for identifying the appropriate stress tests.
12. Develop a test plan and execute tests: Based on the design analysis performed and
the identification of tests for non-similar attributes, a test plan can be determined. In the
context of this approach, the goal of these tests is to assess the robustness of the product
by subjecting the product to test stresses that are intended to accelerate the critical parts
and non-similar attributes to failure. In addition to these tests, other test requirements
should be incorporated into this test plan. These additional test requirements include any
tests required by the customer, such as qualification or reliability demonstration tests.
13. Document the test results: Once the tests have been performed and the data analyzed,
the results should be fully documented, since they subsequently will be used for a variety
of purposes.
14. Monitor field reliability: Once the product is deployed, field reliability experience
data should be carefully gathered, since it will be used for a variety of purposes.
Elements of the data to be gathered include:
15. Update reliability database: A database is required to manage the reliability data, and
should include both test data and field data. This data can be used to generate a
company-specific reliability prediction methodology.
16. Update Design Rules: Data acquired from tests and field surveillance should be used
to update the design rules. Field data is probably the most valuable type of data for this
purpose since it represents the actual product or system in the intended use environment.
The process of maintaining design rules and ensuring that they are used in new designs is
the cornerstone of the means by which reliability is improved in a reliability growth
process.
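The link between step 1 (a quantitative requirement) and step 15 (a reliability database) can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from this book: it assumes a constant failure rate during useful life, pools test and field hours without any acceleration or environment adjustment, and uses hypothetical record fields, part names and numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class ReliabilityRecord:
    """One database entry; the field names are hypothetical."""
    part_type: str
    source: str        # "test" or "field"
    part_hours: float  # cumulative operating hours observed
    failures: int      # failures observed in those hours


def failure_rate_per_hour(records, part_type):
    """Point estimate: total failures divided by total part hours."""
    hours = sum(r.part_hours for r in records if r.part_type == part_type)
    failures = sum(r.failures for r in records if r.part_type == part_type)
    if hours <= 0:
        raise ValueError(f"no operating hours recorded for {part_type!r}")
    return failures / hours


def reliability(rate_per_hour, hours):
    """Survival probability at `hours`, assuming a constant failure rate."""
    return math.exp(-rate_per_hour * hours)


def required_rate_per_hour(target_reliability, hours):
    """Constant failure rate that gives `target_reliability` at `hours`."""
    return -math.log(target_reliability) / hours


# Hypothetical data and requirement, for illustration only.
records = [
    ReliabilityRecord("connector", "test", 2.0e5, 1),
    ReliabilityRecord("connector", "field", 4.8e6, 4),
]
service_life_hours = 5 * 8760  # assumed 5-year service life
goal = required_rate_per_hour(0.95, service_life_hours)
estimate = failure_rate_per_hour(records, "connector")

print(f"required:  {goal * 1e6:.2f} failures per million hours")
print(f"estimated: {estimate * 1e6:.2f} failures per million hours")
print("meets goal" if estimate <= goal else "does not meet goal")
```

A point estimate of this kind says nothing about its own uncertainty. With only a handful of observed failures the confidence bounds on the failure rate would be wide, so a comparison against the requirement should be read with that in mind.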
Critical parts are those which may result in a significant risk to the project. This risk can
be related to reliability, lifetime, availability, or maintainability. Some of the factors that
constitute critical parts are:
These critical parts or items warrant additional attention in assessing their reliability, as
they generally will represent the greatest reliability risk.
During World War II, electronic tubes were by far the most unreliable component used in
DoD electronic systems. This observation led to various studies and ad hoc groups
whose purpose was to identify ways that their reliability, and the reliability of the systems
in which they operated, could be improved. One group in the early 1950’s concluded
that:
Item 5, above, was implemented in the form of the Advisory Group on Reliability of
Electronic Equipment (AGREE), whose charter was to identify actions that could be
taken to provide more reliable electronic equipment. This time period was the advent of
the reliability engineering discipline. It soon became clear that the emerging discipline
was using several different methods to achieve its goal of higher reliability. One was the
identification of root causes of field failure and determination of mitigating actions.
Another was the specification of quantitative reliability requirements. The specification
of requirements in turn led to the desire to have a means of estimating reliability before
an equipment is built and tested so that the probability of achieving its reliability goal
could be estimated. This, of course, was the beginning of reliability prediction. The
1950’s also saw much pioneering work in the reliability discipline, including:
In addition to these accomplishments, the 1950’s also included pioneering work in the area
of quantitative reliability prediction. In 1956, RCA released TR-1100, “Reliability Stress
Analysis for Electronic Equipment”, which presented mathematical models for the
estimation of component failure rates. This report turned out to be the predecessor of
MIL-HDBK-217.
Several additional early works in the area of reliability prediction were produced in the
early 1960’s, including D.R. Earles’ report (Reference 2) and the Earles and Eddins paper
(Reference 3). In 1962, the first version of MIL-HDBK-217 was published by the Navy.
Once issued, MIL-HDBK-217 quickly became the standard by which reliability
predictions were performed, and other sources of failure rates gradually disappeared.
Part of the reason for the demise of other sources was the fact that MIL-HDBK-217 was
often a contractually cited document and defense contractors did not have the option of
using other sources of data.
These early sources of failure rates also often included design guidance on the reliable
application of electronic components. However, subsequent versions of the documents,
primarily MIL-HDBK-217, would delete the application information because it was
treated in more detail elsewhere.
By now, the reliability discipline was working under the tenet that reliability was a
quantitative discipline that needed quantitative data sources to support its many
statistically based techniques, such as allocations and redundancy modeling. However,
another branch of the reliability discipline focused on the physical processes by which
components were failing. The first symposium devoted to this topic was the “Physics of
Failure In Electronics” Symposium sponsored by the Rome Air Development Center
(RADC) and IIT Research Institute (IITRI) in 1962¹. This symposium later became
known as the International Reliability Physics Symposium (IRPS). In this period of time,
the two branches of reliability engineering seemed to be diverging, with the “systems”
engineers devoted to the tasks of specifying, allocating, predicting and demonstrating
reliability, while the physics-of-failure (PoF) engineers and scientists were devoting their
efforts to identifying and modeling the physical causes of failure. Both branches were
integral parts of the reliability discipline, and both were hosted at RADC (later to become
Rome Laboratory). The physics-based information was necessary to develop part
qualification, screening and application requirements, and the “systems” tasks of
specifying, allocating, predicting and demonstrating reliability were necessary to ensure
that reliability requirements were met. The component research efforts of the 1950’s and
1960’s culminated with the implementation of the “ER” and “TX” families of
specifications. This complicated the issue of predicting component reliability because there
were now many different combinations of quality levels and environments that needed to
be addressed in MIL-HDBK-217.
In the early 1970’s, the responsibility for preparing MIL-HDBK-217 was transferred to
RADC, which published revision B in 1974. However, other than the transition to RADC,
the 1970’s maintained the status quo in the area of reliability prediction. MIL-HDBK-
217 was updated to reflect the technology at that time, but there were few other efforts
that changed the manner in which predictions were performed. One exception, however,
was that there was a shift in the complexity of the models being developed for MIL-
HDBK-217. There were several efforts to develop new and innovative models for
reliability prediction. The results of these efforts were extremely complex models that
may have been technically sound, but were criticized by the user community as being too
complex, too costly, and unrealistic given the low level of detailed design information
available at the point in time when the models were needed. RCA, under contract to
RADC, had developed PoF-based models which were rejected as unusable, since the
detailed design and construction data for microcircuits were simply unavailable to typical
model users. These models were never incorporated into MIL-HDBK-217.
¹ IITRI was the original contractor of the Reliability Analysis Center (RAC). In 2005, the RAC contract was awarded as RIAC to the current team of
Wyle Labs (prime), Quanterion Solutions Incorporated, the University of Maryland Center for Risk and Reliability, the Pennsylvania State University Applied
Research Laboratory (ARL), and the State University of New York Institute of Technology (SUNYIT).
While MIL-HDBK-217 was updated again several times in the 1980’s, there were
agencies that were developing reliability prediction models unique to their industries. As
an example, the automotive industry, under the auspices of the Society of Automotive
Engineers (SAE) Reliability Standards Committee, developed a series of models specific
to automotive electronics. The SAE committee felt that there were no existing prediction
methodologies that were applicable to the specific quality levels and environments of
automotive applications. The Bellcore reliability prediction standard is another example
of a specific industry developing methodologies for its unique conditions and
equipment. It originally was developed by modifying MIL-HDBK-217 to better reflect
the conditions of interest of the telecommunications industry. It has since taken on its
own identity with models derived from telecommunications equipment and is now used
widely within that industry.
The 1980’s also saw explosive growth in integrated circuit technology. Very dense
circuits were being fabricated using feature sizes as small as 0.5 microns. This presented
unique challenges to reliability modelers. The VHSIC (Very High Speed Integrated
Circuit) program was the government’s attempt to leverage from the technological
advancements of the commercial industry and, at the same time, produce circuits capable
of meeting the unique requirements of military applications. From the VHSIC program
came the Qualified Manufacturers List (QML) - a qualification methodology that
qualified an integrated circuit manufacturing line, unlike the traditional qualification of
specific parts. The government realized that it needed a QML-like process if it were to
leverage from the advancements in commercial technologies and, at the same time, have
a timely and effective qualification scheme for military parts. A reliability prediction
model was also developed for VHSIC devices in 1989 (Reference 9) in support of a MIL-
HDBK-217 update. An interesting observation was made during that study that deviated
from the premise on which most of the MIL-HDBK-217 models were based. The
traditional approach to developing models was to collect as much field failure rate data as
possible, statistically analyze it, and quantify model factors based on the results of the
statistical analysis. For integrated circuits, one of the factors that was quantified was
inevitably device complexity. This complexity was measured by the number of gates or
transistors and was the primary factor on which the models were based. The correlation
between failure rate and complexity was strong and could be quantified because the
failure rate of circuits was much higher than it is today and the defect rate was
directly proportional to the complexity. As technology has advanced, the gate or
transistor count became so high that it could no longer effectively be used as the measure
of complexity in a reliability model. Furthermore, transistor or gate count data was often
difficult or impossible to obtain. Therefore, the model developed for VHSIC
microcircuits needed another measure of complexity on which to base the model. The
best measures, and the ones most highly correlated to reliability, are defect density and
silicon area. It can be shown that the failure rate (for small cumulative percent failure) is
directly proportional to the product of the area and defect density. However, another
factor that is highly correlated to defect density and area is the yield of the die, or the
percent of die that are functional upon manufacture. Ideally, a reliability model would
use either yield or defect density/area as the primary factor(s) on which to base the
model. The problem in using these factors in a model is that they are considered highly
proprietary parameters from a market competition viewpoint and, therefore, are rarely
released by the manufacturers. Therefore, the single most important driver of reliability
cannot be obtained by the user of the device, which is unfortunate because the accuracy
of the model suffers. The conflict between the usability of a model and its accuracy has
always been a difficult tradeoff to address for model developers.
Much of the literature in the 1990’s on the topic of reliability prediction has centered
around the debate as to whether the reliability discipline should focus on PoF-based or
empirically-based models (such as MIL-HDBK-217) for the quantification of reliability.
In the author’s opinion, many of the primary criticisms of MIL-HDBK-217 stem from the
fact that it was often used for purposes for which it was not intended. For example, it
was often used as a means by which the reliability of a product was demonstrated. Since
its use was contractually required, contractors would try to demonstrate compliance to the
specified reliability requirements by “adjusting” factors in the model to make it appear
that the reliability would meet requirements. Sometimes these adjustments had a
technical basis, and sometimes they did not. Les Gubbins, one of the government’s first
project managers for the handbook, once made the analogy that engaging in the use of
these adjustment factors is like pushing the needle on your car’s speedometer up, and
convincing yourself you’re going faster. This, of course, is not good engineering
practice, but rather was done for nontechnical reasons.
Another key development in the area of reliability predictions was related to the
implications of acquisition reform. In 1994, Military Specifications and Standards
Reform (MSSR) was initiated which decreed the adoption of performance-based
specifications as a means of acquiring and modifying weapon systems. It also overhauled
The 2000’s was a time in which there was progress on development of new standards,
some of which will be summarized in this book. Also, the DoD has initiated efforts to
resurrect MIL-HDBK-217 by updating it with models reflecting state-of-the-art
technologies.
1.5. Acronyms
Acronyms and abbreviations that are used in this book are defined as follows:
AL Accelerated Life
ALM Accelerated Life Model
ALT Accelerated Life Testing
CA Constant acceleration
CDF Cumulative Distribution Function
CRR Center for Risk and Reliability
D Detectability
DoD Department of Defense
DPA Destructive Physical Analysis
DVT Design Verification Test
ED Electrical distributions
ELFR Early life failure rate
EPRD Electronic Parts Reliability Data
ESD Electrostatic discharge
EV External visual
EVT Engineering Verification Test
FMEA Failure Mode and Effect Analysis
FMECA Failure Mode and Effect Criticality Analysis
FRU Field Replaceable Unit
GFL Gross/fine leak
HALT Highly Accelerated Life Test (simultaneous temperature cycling and vibration)
HASS Highly Accelerated Stress Screening
HAST Highly Accelerated Stress Testing
HTB High temperature bake
HTOL High temperature operating life
HTRB High temp. reverse bias
IOL Intermittent operational life
IPL Inverse Power Law
IWV Internal water vapor
KPSI Pounds per square inch, in thousands
LI Lead integrity
MCMC Markov Chain Monte Carlo
MLE Maximum Likelihood Estimator
MS Mechanical shock
MTTF Mean Time to Failure
NPRD Non-Electronic Parts Reliability Data
O Occurrence
PD Physical dimensions
PDF Probability Density Function
PVT Process Verification Test
RBD Reliability Block Diagram
RPN Risk Priority Number
RSH Resistance to solder heat
S Severity
SD Solderability
TBD To Be Defined
TC Temperature cycling
TR Thermal resistance
TST Pre and post electrical test
TTF Time to Failure
VVF Vibration - variable freq.
1.6. References
1. Coppola, A., “Reliability Engineering of Electronic Equipment, A Historical
Perspective,” IEEE Transactions on Reliability, Vol. R-33, No. 1, April 1984.
2. Erles, D.R., “Reliability Application and Analysis Guide,” The Martin Company, July
1961.
3. Erles D.R. and M.F. Edins, “Failure Rates,” AVCO Corp. April, 1962.
4. Knight, C.R., “Four Decades of Reliability Progress,” 1991 Proceedings Annual
Reliability and Maintainability Symposium.
5. “Reliability Prediction Methodologies For Electronic Equipment,” AIR 5286, SAE G-
11 Committee, Electronic Reliability Prediction Committee, 31 Jan. 1998
6. “Reliable Application of Plastic Encapsulated Microcircuits,” Reliability Analysis
Center Publication PEM2.
7. Morris, S.F. and J.F. Reilly (Rome Laboratory), “MIL-HDBK-217 - A Favorite
Target.”
8. Denson, W. and P. Brusius, “VHSIC and VHSIC-Like Reliability Modeling,” RADC-TR-89-177.
9. Reliability Analysis Center, “Benchmarking Commercial Reliability Practices”
• What is the goal of the model, and what decisions will be made based on it?
• What data is currently available on the product?
• Is field data available? If so, is it from the product or system operating in the
same manner and environment as the one under analysis?
• Is test data available? If so, what types of tests (i.e., accelerated life tests, non-
accelerated life tests, qualification tests, etc.)
• Is data, either field or test, available on a predecessor (i.e., earlier version) of the
product?
• Have models been developed for specific failure modes, mechanisms and/or
causes of the product?
o Life models?
o Stress-strength models?
o Models from first principles?
• Have critical failure causes of the product been identified?
• How much support can be expected from suppliers regarding identification and
quantification of the failure causes of their product?
[Figure: assessment process flow. Boxes: Assess data available; Assess feasibility of performing reliability tests; Determine appropriate approach and execute; Combine data.]
the root failure mode cause or mechanism level. Tools for this “system model” include
FMEA and FTA.
Fault tree representation of a system breakdown in which the level at which reliability
estimates are made are the components, represented by circles (basic events) is illustrated
in Figure 2.1-1. This Figure represents a reliability prediction performed using MIL-
HDBK-217 or 217Plus.
[Figure 2.1-1: fault tree with the System at the top, Assembly 1 and Assembly 2 below it, and the components (Comp. 1a1 through Comp. 2c2) as the basic events.]
Fault tree representation of a system breakdown in which the level at which reliability
estimates are made are the failure mechanisms of the components, represented by circles
(basic events) is illustrated in Figure 2.1-2. This would be the representation of a
reliability prediction performed using a physics approach in which the intent is to
estimate the reliability of specific root-cause failure mechanisms.
[Figure 2.1-2: fault tree with the System at the top, Assembly 1 and Assembly 2 below it, the components (Comp. 1a1 through Comp. 2c2) below the assemblies, and the failure mechanisms of each component (FM1, FM2, FM3) as the basic events.]
Approaches such as this, in which the reliability of each failure mechanism is estimated,
are practical if:
This same representation is relevant to performing FMEAs. In this case, the lowest level
events in the fault tree are the constituent failure modes of the component. If a failure
mechanism modeling approach is to be used, it needs to be applied to all failure
mechanisms in order for the assessment to quantify the reliability of the entire system.
All of the approaches described in this book have merit. All have their strengths and
weaknesses. A successful assessment will leverage the strengths of specific
methodologies toward the specific goals of the assessment. Toward this end, the intent of
this section (and the following sections) is to provide guidance on the applicable
approaches for specific assessment purposes.
[Figure: purposes of a reliability model, grouped under Risk assessment, Reliability demo, Maintainability and Design aid. Recovered entries: anticipated failure; observed failure; determine if minimum reliability rqmt is achieved; determine if robustness is achieved; allocate maintenance personnel; PM schedules; input to FMEA/FTA for ID of failure cause priority; warranty cost predictions; spares allocation; determine impact of factors on reliability; compare competing designs; model reliability growth; determine feasibility of meeting rel. rqmt; determine screening rqmt.]
Specific reliability modeling purposes are generally suited to specific program phases, as
summarized in Table 2.2-2.
Maintainability:
  PM schedules                     x x x x
  Spares allocation                x x x x
  Allocate maintenance personnel   x x x x
System
Subsystem
Assembly
Component
Failure Modes (Root)
Failure Causes/Mechanisms (Root)
If the level to be analyzed is failure causes, then additional detailed data and information
is required. Therefore, the practicality of obtaining the required data must be a
consideration when choosing an appropriate approach. The degree of difficulty of
obtaining required data generally increases as you go lower in the hierarchy. This
concept is illustrated in Figure 2.3-1.
[Figure 2.3-1 content. Subsystem, assembly and component levels: parts lists, environmental conditions, part stresses. Failure causes/mechanisms level: yield, defect density, internal part stresses and distributions.]
Figure 2.3-1: Typical Data Requirements vs. Level of Hierarchy
As shown in Figure 2.3-1, the data required for the assessment of specific failure causes
can be factors like yield, defect density, internal part stresses and distributions. Because
these factors are often difficult for outside organizations to obtain, the best approach is
generally to have the manufacturer assess the reliability of the causes in the event that the
selected approach requires this sort of data.
The appropriate approaches for a reliability assessment will, therefore, generally depend
on the location of a company’s product in the hierarchy of the product or system.
[Figure: FMEA elements across the system hierarchy. Recovered labels: System Hierarchy; How functions can fail; Occurrence; Detectability; Risk Priority Number (RPN); Improve design.]
The hierarchical relationship between cause, mode and effect is shown in Figure 2.3-3.
For example, a failure mode can have any number of potential effects, and also can have
any number of potential causes.
[Figure 2.3-3: hierarchical relationship between failure cause, failure mode and failure effect.]
If the reliability assessment is to be performed at the failure cause level, then all possible
causes need to be identified. One of the FMEA objectives is to identify all conceivable
failure causes. One way to accomplish this is to identify all combinations of initial
conditions, stresses and mechanisms, as illustrated in Figure 2.3-4 and Table 2.3-1.
One of the keys to a successful FMEA is to understand the relationship between cause,
mode and effect. In general, there is a natural tiering effect that occurs in an FMEA as a
function of the product or system level, as illustrated in Table 2.3-2. For example, at the
most basic level, the part manufacturing process, the cause of failure may be a process
step that is out of control. The ultimate effect of that cause becomes the failure mode at
the part level, the failure effect of the part becomes the failure mode at the next level of
assembly, and so forth. It is very important that the cause, mode and effect are not
confounded in the analysis.
[Table 2.3-2. Column headings: System; Assembly; Part; Part Manufacturing Process. Recovered cells: Effect; Mode, Effect.]
Figures 2.3-5 through 2.3-8 illustrate, with fault trees, how the relationship between
cause, mode and effect scale up or down the product or system hierarchy, depending on
the hierarchical level at which the analysis is to take place.
[Figure 2.3-5: fault tree of product or system (TOP event, OR and AND gates, basic events).]
Figure 2.3-6: Fault Tree of Product or System with Cause as the Lowest Level
Figure 2.3-7: Fault Tree of Product or System with Cause Above the Lowest Level
Figure 2.3-8: Fault Tree of Product or System with Cause Two Levels Above the Lowest Level
Therefore, if the FTA view of the product or system is to be consistent with the reliability
assessment, then the lowest level in the tree must be the level at which reliability
estimates are made.
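As an illustration of how reliability estimates made at the lowest level of a fault tree roll up to the top event, the following is a minimal sketch. It assumes that the basic events are statistically independent, and all probabilities and the tree layout are hypothetical values chosen only for illustration; they are not taken from the figures above.

```python
from math import prod


def or_gate(probs):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(probs):
    """Probability that all of several independent events occur."""
    return prod(probs)


# Hypothetical failure probabilities for the basic events (lowest level of the tree).
assembly_1 = or_gate([0.01, 0.02, 0.005])   # any one component failure fails the assembly
assembly_2 = and_gate([0.10, 0.20])         # redundant pair: both must fail
system = or_gate([assembly_1, assembly_2])  # TOP event

print(f"Assembly 1 failure probability: {assembly_1:.5f}")
print(f"Assembly 2 failure probability: {assembly_2:.5f}")
print(f"System failure probability:     {system:.5f}")
```

If the basic events were failure mechanisms rather than components, the same gate functions would apply one level lower in the tree.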
The section above describes the hierarchical level at which a reliability model will be
developed, whether it be a failure cause, failure mode, a component or an assembly.
Once this physical level is determined, there are several model forms possible to
Reliability Information Analysis Center
Chapter 2: General Assessment Approach
construct a model to describe its reliability. This form will, of course, depend on the
specific approach and data used to develop the reliability model. Some of these forms are
described below. More detail on each of these is provided in subsequent sections.
$$\lambda = \frac{\text{Failures}}{\text{operating time}}$$
If a life model is developed from life tests performed at various stress levels, the result
will be a time-to-failure (TTF) distribution (described by the Weibull, lognormal or other
statistical distributions) that is a function of stress levels. If a Weibull distribution is
used, the general model will be:
$$R(t) = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$
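The Weibull reliability function above can be evaluated directly. The sketch below is only an illustration: the function name and the parameter values are hypothetical, with alpha taken as the characteristic life and beta as the shape parameter. In a life model, alpha would itself be expressed as a function of the stress levels, which is not shown here.

```python
import math


def weibull_reliability(t, alpha, beta):
    """Return R(t) = exp(-(t / alpha) ** beta) for a two-parameter Weibull."""
    if t < 0 or alpha <= 0 or beta <= 0:
        raise ValueError("t must be >= 0 and alpha, beta must be > 0")
    return math.exp(-((t / alpha) ** beta))


# Hypothetical parameters: characteristic life of 10,000 hours, shape of 1.5.
for hours in (0, 1_000, 5_000, 10_000):
    r = weibull_reliability(hours, alpha=10_000.0, beta=1.5)
    print(f"R({hours}) = {r:.4f}")
```

At t equal to alpha the function returns exp(-1), about 0.368, regardless of beta, which is a convenient check on any implementation.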
If models are to be derived from the analysis of field data, there are several possible
model forms. Traditional methods of reliability prediction model development have
included the statistical analysis of empirical failure rate data. When using multiple linear
regression techniques with highly variable data (which is often the case with empirical
field failure rate data), a requirement of the model form is that it be multiplicative (i.e. the
predicted failure rate is the product of a base failure rate and several factors that account
for the stresses and component variables that influence reliability). An example of a
multiplicative model is as follows:
$$\lambda_p = \lambda_b \pi_e \pi_q \pi_s$$
where:
However, a primary disadvantage of the multiplicative model form is that the predicted
failure rate value can become unrealistically large or small under extreme value
conditions (i.e., when all factors are at their lowest or highest values). This is an inherent
limitation of multiplicative models, primarily due to the fact that individual failure
mechanisms, or classes of failure mechanisms, are not explicitly accounted for.
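The extreme-value behavior of the multiplicative form can be seen with a small numerical sketch. The base failure rate and the factor ranges below are hypothetical and are not taken from any published handbook; they only show how quickly the product spreads when every factor sits at its lowest or highest value.

```python
from math import prod


def multiplicative_failure_rate(base_rate, pi_factors):
    """Predicted failure rate as a base rate multiplied by all pi factors."""
    return base_rate * prod(pi_factors)


# Hypothetical base failure rate (failures per million hours) and factor ranges.
base_rate = 0.01
factor_ranges = {
    "environment": (0.5, 20.0),
    "quality": (0.25, 10.0),
    "stress": (0.5, 8.0),
}

lowest = multiplicative_failure_rate(base_rate, [low for low, _ in factor_ranges.values()])
highest = multiplicative_failure_rate(base_rate, [high for _, high in factor_ranges.values()])

print(f"All factors at minimum: {lowest:.6f}")
print(f"All factors at maximum: {highest:.2f}")
print(f"Ratio of extremes:      {highest / lowest:.0f}")
```

With only three factors, each spanning roughly one to two orders of magnitude, the predicted failure rate already spans more than four orders of magnitude.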
Another possible approach to model reliability is to segment the failure rate for each
group of failure causes that are accelerated by stresses incurred during specific portions
of a mission. Each of these failure rate terms is then accelerated by the appropriate
stress or component characteristic. This is the model form used in the RIAC 217Plus
methodology. This model form is as follows:
$$\lambda_p = \lambda_o \pi_o + \lambda_e \pi_e + \lambda_c \pi_c + \lambda_i + \lambda_{sj} \pi_{sj}$$
where:
The concept of this approach is that the occurrence of each group of failure causes is
mutually exclusive, and their failure rates can be modeled separately and summed. By
modeling the failure rate in this manner, factors that account for the application and
component-specific variables that affect reliability (π factors) can be applied to the
appropriate additive failure rate term. Additional advantages to this approach are that
they:
the applicable failure rate term, thereby eliminating many of the extreme value
problems that plague multiplicative models.
o Are based on observed failure mode distributions, so that observed component
root failure causes are empirically modeled
o Can be tailored with test data (if available) by applying it in a Bayesian fashion to
the appropriate failure rate term. As examples, temperature cycling data can be
combined with the failure rate from power or temperature cycling stresses (λc), or
high temperature operating life can be combined with the failure rate from
operational stresses term (λo).
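The additive form can be sketched in the same way. The function below mirrors the terms of the equation above by subscript only; the numerical values are hypothetical, and the comments describe only the two terms that the text itself identifies (operational stresses and power or temperature cycling stresses).

```python
def additive_failure_rate(lam_o, pi_o, lam_e, pi_e, lam_c, pi_c, lam_i, lam_sj, pi_sj):
    """Sum of separately accelerated failure rate terms.

    lam_o: failure rate from operational stresses
    lam_c: failure rate from power or temperature cycling stresses
    The remaining terms follow the subscripts used in the model equation.
    """
    return lam_o * pi_o + lam_e * pi_e + lam_c * pi_c + lam_i + lam_sj * pi_sj


# Hypothetical values, in failures per million hours.
predicted = additive_failure_rate(
    lam_o=0.002, pi_o=1.5,
    lam_e=0.001, pi_e=2.0,
    lam_c=0.003, pi_c=0.5,
    lam_i=0.0005,
    lam_sj=0.0004, pi_sj=1.2,
)
print(f"Predicted failure rate: {predicted:.5f}")
```

Because each π factor multiplies only its own term, an extreme value of one factor changes one part of the sum rather than scaling the entire prediction.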
Perhaps the most important element of a reliability program is the reliability testing of the
product. Reliability test data is, in turn, a critical element for assessing reliability. In this
context, a reliability test consists of two primary elements: measurement and exposure.
The measurement is the means of assessing the performance of the product or system
relative to its requirements. It usually consists of quantifying parameters that are
specifiable attributes. It may include both continuous variables (e.g., gain, power output,
etc.) and attribute data (i.e., a binomial representation of whether a product possesses an
attribute or not). Exposure is the application of a stress or stresses. These stresses may
consist of operational stresses or environmental stresses. Operational stresses are defined
as those stresses to which the product will be exposed by the act of operating the product.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
For example, a transistor is designed to have a voltage applied, and pass a given amount
of current. As such, these are operational stresses. It will also be exposed to externally
applied environmental stresses such as temperature, temperature cycling, vibration, etc.
Additional considerations for testing products and systems are provided in Chapter 5.
Table 2.5-1 describes each approach, its strengths and its weaknesses. This information is
presented in the context of the intent of this book, which is to present options for
quantifying the reliability of a product as it is used by customers in actual use conditions.
Effective techniques also include using a combination of the approaches in this section.
The manner in which these approaches can be combined will be addressed in Section 2.6.
(continued from the preceding row)
  Strengths: Test data can be collected and applied before the system is fielded
  Weaknesses: Cannot quantify special cause failure modes; Correlation to field use conditions is difficult

2  Qual
  Description: Exposure to industry standard “trade and commerce” tests
  Strengths: Can demonstrate a degree of robustness to the specific qualification tests; Reflects the actual reliability; Test data can be collected and applied before the system is fielded
  Weaknesses: Can excite non-relevant failure modes (i.e., those that are not representative under field environmental conditions); Cannot quantify special cause failure modes

3  DOE multicell
  Description: Life tests under a variety of stress levels
  Strengths: Can accurately model lifetime due to common cause mechanisms; Can quantify acceleration factors; Can estimate reliability at use conditions
  Weaknesses: Can be expensive to execute; Difficult to quantify special cause failure modes due to large sample sizes sometimes required

6  Field data – same product
  Description: Use of field experience data on the product or system under analysis
  Strengths: Reflects the actual reliability; The most representative data; Can quantify failure causes that exhibit low percent failures
  Weaknesses: Usually, the data is not available in time for use in product or system development; Collecting field data is prone to errors
Selecting a methodology
The various approaches summarized here are suited to various program phases,
corresponding to prediction, assessment and estimation. This is shown in Table 2.5-2
(note that the shaded area indicates where the approach can be applied). For example,
MIL-HDBK-217 should only be used for prediction, meaning that its usefulness is
limited for assessment and estimation. Conversely, 217Plus was designed to provide a
framework for all three reliability modeling phases.
[Table 2.5-2: applicability of each approach to Prediction, Assessment and Estimation. Recovered row labels: Empirical; Test; Accelerated (HALT, Qualification, DOE multicell, Reliability demo); Non-Accelerated (Reliability demo); Field data; Same product; Similar product; Models (217Plus, MIL-HDBK-217, Bellcore); Raw data (EPRD, NPRD); First principles.]
• The severity of product failure. In this context, severity can mean that there are
significant financial ramifications of failure, that there are safety-related risks, or
that the system is not maintainable. The reasons that high reliability may
be required in the first place are the same reasons that the reliability model must
be acceptably accurate. Since reliability is a stochastic process, reliance on any
one of the methodologies discussed in this book is susceptible to uncertainties.
Sometimes these uncertainties can be very large. This is true for any of the
methods. If, however, several methodologies can be employed, and their results
are consistent with each other, then this adds much more credibility to the
modeled reliability of the product. This is especially true if a physics approach is
coupled with an empirical approach.
• The amount and level of detailed information available to the analyst. Often, this
will dictate the available choices for the analysis.
• Complexity of the product. If the product or system is very complex, has many
levels of indenture, and there is a complex supply chain involving many suppliers,
then the available suitable choices for analysis at the top of the supply chain will
be limited. For example, as discussed previously, it is very difficult for organizations
higher in the supply chain to obtain the data required to utilize one of the physics
approaches. If, however, the entire supply chain utilizes the PoF approach
for the product or system, it can be a viable approach.
If empirical data is to be used as a basis for one or more of the approaches, there are
various factors that will influence the uncertainty in assessments made with this empirical
data. These include the following data attributes:
Quality – this pertains to the accuracy inherent in the data itself. For test data, the
accuracy is generally much better than with field data, since test data is usually
much better controlled with known sample sizes, failure times, etc. Field data, on
the other hand, is usually fraught with many problems and sources of uncertainty.
This will be discussed in Section 5.2.1.2.
[Table: applicability of each approach by modeling purpose. Recovered labels: Purpose; Empirical; Physics; Accelerated; Non-Accelerated; HALT; Qualification; DOE multicell; Reliability demo; Same product; Similar product; Models; Raw data; Stress/Strength modeling; First principles.]
Relevancy is a function of the type of data that is available, and the product or system on
which that data is available. To further address the relevancy issue for assessments made
with empirical data, consider the information in Table 2.5-4, which summarizes the
various attributes of empirical data. This notion is valid, regardless of the level of
assembly, ranging from root failure causes to the system level.
There has been much information published in the literature comparing and contrasting
empirical and physics-based models. However, they are not mutually exclusive
methodologies. For example, empirical models generally utilize PoF principles in their
derivation, and PoF models utilize empirical data in their derivation and parameter
estimation.
The majority of component field failures are a result of special causes. These causes may
be an anomaly in the manufacturing process, an application anomaly, or a host of other
assignable causes. They are rarely the result of a common cause failure mechanism,
which can generally be modeled by life modeling techniques.
Guidelines and examples are provided in the following sections for each of the
approaches.
2.5.1. Empirical
2.5.1.1. Test
Testing product or system reliability is performed for many reasons, including:
An important consideration for all of the tests described above is the definition of
“failure”, i.e., the "failure criteria" that will be used to determine if a product passes or
fails. Industry guidelines, specifications, or an understanding of end-use application
tolerances are often used to set pass/fail criteria.
Therefore, for a product or system to be considered fit for use for a specific application, it
must conform to the requirements of its specification over its intended life (verification)
and the specification must adequately capture the requirements of the end user
(validation). The various elements of qualification are illustrated in Figure 2.5-2.
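[Figure 2.5-2: elements of qualification. Qualification comprises Validation and Verification. Verification comprises Specification Compliance and Reliability Testing. Reliability Testing comprises EVT, DVT and PVT, supported by root cause analysis and corrective action.]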
Verification ensures that the product or system meets the specified requirements both
initially (specification compliance) and over its intended lifetime (reliability testing).
Specification compliance ensures that, at the beginning of its lifetime, the item meets
specified performance requirements and that the distribution of performance parameters
over the population of items is within acceptable limits. Reliability testing ensures that
the product is robust and that it meets the specified performance requirements over its
intended lifetime. Reliability testing consists of several test phases, each of which has its
own purposes and approaches.
The testing sequence can be grouped into three categories: Engineering Verification Tests
(EVT), Design Verification Tests (DVT), and Production Verification Tests (PVT).
These are further explained below, along with their relationship to the establishment of a
life model, the prediction of product reliability and the various elements of each test
approach. This is provided specifically to highlight how reliability testing can be used in
a reliability program.
Engineering Verification Tests (EVT) are intended to identify and assess “high risk”
critical items so that corrective action can be taken, if necessary. The intent is to uncover
weaknesses or to identify product capability, not to pass a set of predefined tests, as is the
case with traditional qualification testing. The purpose and approach of these tests are
described in Table 2.5-5. These tests are also used to identify the maximum stress
capability of a product, which is a prerequisite for developing a complete test plan (in
DVT) to assess lifetime. Step stress tests are often used for this purpose, and the results
can support establishment of an upper bound on subsequent test stresses.
One of the primary purposes of DVT testing is to provide the data required to develop life
models. Often, there are multiple accelerating stresses, in which case life tests must be
conducted for various stress combinations. Design of Experiments (DOE) is used to
develop an effective and cost-efficient test plan. DOE concepts, as they pertain to
reliability testing, are covered in Chapter 4.
PVT tests demonstrate that the robustness of production units is equivalent to that of the
EVT/DVT samples. Whereas EVT and DVT demonstrate the intrinsic robustness, PVT
demonstrates the “as-built” robustness.
Some reliability practitioners choose to separate qualification tests from reliability tests.
In this case, reliability tests are those that have a purpose similar to the DVT tests. The
reason for separation is that reliability tests are more of an engineering test that is
not dictated by industry standards. As such, the results may or may not be shared with
customers. Qualification tests, in contrast, are required and thus shared with customers to
demonstrate compliance.
[Figure: test phases over time (EVT, DVT, PVT, Ongoing Reliability Test (ORT)) at the component and assembly levels.]
2.5.1.1.1. Non-Accelerated
Nonaccelerated reliability tests are those in which samples are tested in a manner that
recreates the use conditions the product will experience in its intended use environment
as used by customers. These tests may be performed for several reasons:
Generally, if the purpose is #1, a more effective way of achieving this is with accelerated
testing, discussed in the next section of this book. If the purpose is #2, then concepts of
reliability demonstration can be used, as discussed in the next section.
2.5.1.1.2. Reliability Demonstration
The fundamental concept of reliability demonstration is the following:
1 − CL = R
This is essentially a hypothesis test in which the hypothesis is that the true product
reliability is “R” or greater. For example, consider a case in which the reliability
requirement is 0.95 at 5000 hours, and the desired confidence level is 0.80 (80%). In this
case, the implied failure rate is 0.0000103 failures per hour.
If the hypothesis is true and the test is run such that there is less than a 20% probability of
experiencing the observed number of failures (or fewer), then the analyst can be 80%
certain that the reliability requirements have been met.
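The implied failure rate quoted in this example follows from assuming a constant failure rate, so that R(t) = exp(-λt) and λ = -ln(R)/t. A minimal sketch of that arithmetic is shown below; the function name is hypothetical.

```python
import math


def implied_failure_rate(reliability, mission_hours):
    """Constant failure rate implied by a reliability requirement at a given time."""
    return -math.log(reliability) / mission_hours


# Requirement from the example: reliability of 0.95 at 5000 hours.
rate = implied_failure_rate(0.95, 5000.0)
print(f"Implied failure rate: {rate:.7f} failures per hour")
```

The result rounds to 0.0000103 failures per hour, the value used in the remainder of the example.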
Table 2.5-6 summarizes the probability as a function of the number of failures and
cumulative operating time. The values in the cells are the Poisson probability that there
will be “F” or fewer failures, under the hypothesis that the true failure rate is 0.0000103
(failures per hour). In this example, if the test can be run until 200,000 hours are
accumulated, with no failures, then the test is passed and the hypothesis is verified. This
is the first opportunity to pass the test, as this is the shortest time at which the Poisson
probability falls below 0.20 (i.e., 0.13). In this example, 0.20 is the risk of concluding
that the failure rate is less than 0.0000103 when it is not.
The test is run until the number of failures and time combinations falls either above or
below the shaded red area. If it falls above the red area, then the null hypothesis is
confirmed (that the failure rate is greater than the required). If it falls below the red area,
the hypothesis is confirmed. If the combination of hours and failures remains in the red
area, the hypothesis cannot be confirmed or denied, and further testing is required.
The probability values are generally calculated from the binomial or Poisson
distributions, depending on whether the probability is time-based (Poisson) or attribute-
based (binomial). Poisson is used in the case of constant failure rates.
                     Cumulative operating time (thousands of hours)
Number of
Failures     50   100   150   200   250   300   350   400   450   500   550   600   650   700   750   800
   10      1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.99  0.98  0.97  0.95  0.92  0.89  0.85  0.79
    9      1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.99  0.98  0.96  0.94  0.90  0.86  0.81  0.75  0.69
    8      1.00  1.00  1.00  1.00  1.00  1.00  0.99  0.98  0.95  0.92  0.88  0.83  0.77  0.71  0.64  0.56
    7      1.00  1.00  1.00  1.00  1.00  0.99  0.97  0.94  0.90  0.85  0.79  0.72  0.65  0.57  0.50  0.42
    6      1.00  1.00  1.00  0.99  0.98  0.96  0.93  0.88  0.82  0.74  0.66  0.58  0.50  0.42  0.35  0.29
    5      1.00  1.00  0.99  0.98  0.95  0.91  0.85  0.77  0.68  0.59  0.50  0.42  0.35  0.28  0.22  0.17
    4      1.00  1.00  0.98  0.94  0.88  0.80  0.71  0.61  0.51  0.42  0.34  0.26  0.21  0.16  0.12  0.09
    3      1.00  0.98  0.93  0.85  0.74  0.63  0.52  0.41  0.32  0.25  0.19  0.14  0.10  0.07  0.05  0.04
    2      0.98  0.91  0.80  0.66  0.53  0.41  0.30  0.22  0.16  0.11  0.08  0.06  0.04  0.03  0.02  0.01
    1      0.91  0.73  0.54  0.39  0.27  0.19  0.13  0.08  0.06  0.04  0.02  0.02  0.01  0.01  0.00  0.00
    0      0.60  0.36  0.21  0.13  0.08  0.05  0.03  0.02  0.01  0.01  0.00  0.00  0.00  0.00  0.00  0.00
One of the critical aspects of accelerated testing is the degree to which acceleration takes
place. Consider the situation depicted in Figure 2.5-4. The reliability requirement, in
terms of lifetime in this example, will be specified at a specific stress condition. If tests
are performed at the accelerated conditions of Test 1, there will be some extrapolation to
lifetimes at use conditions (if the purpose is to quantify life). If tests are performed at the
accelerated conditions of Test 2, there will be additional extrapolation to lifetimes at use
conditions. Life modeling is the means of performing this extrapolation, and will be
covered in Section 2.5.1.1.2.3 and Chapter 5.
The larger the extrapolation distance, the larger the uncertainty in the reliability estimate
at use conditions. This is illustrated in Figure 2.5-5.
The relevancy of failure causes must be considered when using accelerated test data to
model product or system reliability in field deployed conditions. For example, if failures
occur in an accelerated test, the questions to be addressed are:
1. Can the failure cause occur under field conditions? Or has it been induced by the
test?
2. If the failure cause is relevant, can its reliability characteristics be scaled to field
use conditions with an acceleration model?
For example, consider several scenarios illustrated in Figure 2.5-6. Case 1 illustrates the
situation in which the failure cause observed in accelerated testing is relevant, and its
probability of occurrence can be extrapolated to use conditions with an acceleration
model. Case 2 illustrates the situation in which the failure cause observed in accelerated
testing is not relevant, and its probability of occurrence cannot be extrapolated to use
conditions with an acceleration model. Case 2 is representative of a situation in which
there is a “threshold” stress, above which the failure cause has been induced by the test.
The higher the acceleration, the higher the risk is that Case 2 will occur. For this reason,
for the purposes of quantifying reliability under field use conditions, highly accelerated
tests (like HALT) must be used with caution.
Alternatives are also available that will cover more of the “life-stress” space, as shown in
Figure 2.5-7. This approach is desirable because there is minimal extrapolation to field
use conditions, and validity of the acceleration models over a broader stress range can be
ascertained.
Another factor to consider in accelerated testing, when used to quantify reliability at use
conditions, is the relative probability of occurrence of various failure causes as a function
of stress level. Each failure cause will have unique acceleration characteristics as a
function of stress, depicted as the slope of the life-stress line. They will also have unique
probabilities of occurrence, depicted as the vertical position of the life-stress line.
These factors together indicate that the relative probabilities of the causes require a model
for each. This is illustrated in Figure 2.5-8. In this life-stress plot, the slope represents
the dependency of life as a function of stress, and the position of the line represents the
absolute life. As can be seen, the relative probabilities of the causes will depend on the
stress level.
HALT requires a different mindset than "conventional" accelerated testing. One is not
trying to predict or demonstrate life, but rather to induce failures of the weakest links in
the design, strengthen those links, and thereby greatly extend the life of the design. Root
cause failure analyses are conducted and repairs and redesign are carried out, as feasible
and cost-effective. Output results from HALT may include a Pareto chart showing the
weak links in the design, and design guidance that can be used to create a more robust
design.
Testing a new design and comparing it against a proven previous generation design using
the same accelerated test provides an efficient benchmarking test. Based on HALT
results, a determination of "optimum" design characteristics can be made using statistical
design of experiments (DOE).
stabilize the product's internal temperature (the thermal rate of change between
each temperature transition step should be ~100°C/minute)
The vibration stress is provided by mechanically impacting the table with “hammers”.
As such, the frequency spectrum is not truly random, but rather is “pseudorandom”. The
purpose of the vibration survey is to detect weakness in the design as a function of the
stresses created by the increased vibration levels.
In this example, the vibration is applied during temperature dwells, but if failure causes
are possible that are accelerated by vibration stresses during temperature transitions, the
stress profile can be modified to apply vibration continuously throughout the temperature
cycle.
This is a typical stress profile, and will be varied (and should be tailored) based on the
limits of the product or system being tested. The purpose of the step-stress temperature
test is to detect sensitivity of design functionality to temperature and temperature change
rates.
The purpose of the combined environment test is to highlight weaknesses that result
from the interaction effects of simultaneous exposure to temperature and vibration.
Quantifying reliability is generally not the objective of HALT. The ability to improve the
inherent reliability/robustness of the product or systems design is. However, in some
cases it can be used as an indicator of field reliability performance. The fundamental
question to address is this: Does the HALT test excite failure causes that the item may
experience in the field? The answer to this question will depend entirely on the
characteristics of the item under test, and the stresses to which it will be exposed in field
use. For example, if the product or system critical failure causes are accelerated by
thermal cycling and random vibration, and the item will experience these stresses in the
field, then HALT test results may be indicative of field reliability. Conversely, if the
product or system critical failure causes are not accelerated by thermal cycling and
vibration, and/or the product will operate in a benign environment, then the HALT
results will provide very little information regarding field reliability.
2.5.1.1.5. Qualification Testing
Qualification testing is a term used to describe a series of tests that a product or system
must be exposed to, and pass, for it to be considered “qualified” by the industry or
standards body governing the qualification requirements. Several examples of
qualification requirements are provided in Tables 2.5-7 and 2.5-8, for an assembly, and
for a laser diode component, respectively.
There are many qualification standards in existence, governed by standards bodies within
specific industries. Some noteworthy standards organizations are IEC (International
Electrotechnical Commission), the U.S. Military (via MIL-specs), ISO, and Telcordia
(for telecommunication components and equipment).
There are several factors which will impact the usefulness of qualification data as an
indicator of field reliability. These are:
• The degree to which the stress is accelerated, and the acceleration factor between
the test and field environments
• The degree to which the stress accelerates critical failure causes that the product
or system will experience in the field
• The sample sizes used, which impact the statistical significance of the data
The first two bullets are treated in detail elsewhere in this book. The last bullet is
discussed next.
A common way in which sample size requirements are identified in standards is with a
Lot Tolerance Percent Defective (LTPD) methodology. This concept is identical to the
reliability demonstration idea presented previously. In this case, two parameters are
specified:
From before:
$1 - CL = R$
In this case, the value of “R” is the reliability of the entire sample size. So, if the test
plan is established to allow no failures (this will require the minimum sample size), the
equation becomes:
$1 - CL = R^n$
where “n” is the sample size and “R” is now the reliability of a single sample. For example,
if the allowable percentage of defects is 20% (R = 0.8), and the desired confidence level is
0.90 (i.e., 90%), then n = 11 is the minimum sample size required, as shown:
$0.8^{11} \approx 0.086 \leq 1 - 0.9$
With n = 10, $0.8^{10} \approx 0.107$, which exceeds 0.1, so 10 samples are not sufficient.
So, if the test is performed on 11 samples with no failures, then there is a 90% confidence
that the true reliability is greater than 0.8 (i.e., the probability of failure is less than 0.2).
Other plans are also available that allow a certain number of failures. These require
larger sample sizes, and are determined with binomial statistics.
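The sample size calculation above, including plans that allow a certain number of failures, can be sketched with binomial statistics. This is an illustrative sketch only; the function names are assumptions, and published LTPD tables in a governing standard should take precedence over values computed this way.

```python
import math

def binomial_cdf(c, n, p):
    """Probability of c or fewer failures among n samples, each failing with probability p."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

def min_sample_size(max_defect_fraction, confidence, allowed_failures=0):
    """Smallest n such that passing the test (<= allowed_failures) meets the confidence level."""
    risk = 1.0 - confidence
    n = allowed_failures + 1
    # Increase n until the chance of passing with a lot at the defect limit is <= risk
    while binomial_cdf(allowed_failures, n, max_defect_fraction) > risk:
        n += 1
    return n

n_zero_failures = min_sample_size(0.20, 0.90)     # no failures allowed
n_one_failure = min_sample_size(0.20, 0.90, 1)    # one failure allowed

print(n_zero_failures, n_one_failure)
```

With a 20% allowable defect fraction and 90% confidence, the zero-failure plan requires 11 samples, matching the worked example, and allowing one failure raises the requirement to 18 samples.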
Since the reliability demonstrated by an LTPD plan is generally less than the required
reliability, qualification data is usually not sufficient, in and of itself, to demonstrate
reliability requirements. It can, however, be valuable data when used in combination with
other data sources.
operation, the only statistical statement that can be made is that there is a 90% confidence
that the true unreliability is less than 0.2 at 300 hours, shown as the solid star and arrow.
Here, the data is not sufficient to determine if the actual distribution is “Case 1,” or that
the reliability requirement is met. However, testing 11 samples may be sufficient to
determine if we have a wearout mode occurring at a time less than 300 hours, as
illustrated in “Case 2.”
[Figure: Weibull probability plot of unreliability, F(t), versus time, t, showing the Case 1 and Case 2 distributions]
If the goal of the test is to demonstrate that the infant mortality percent fail value from
the first of the distribution modes is, for example, less than 1%, it can be seen that testing
11 samples will not come close to demonstrating this requirement.
This example is shown to illustrate the fact that the demonstration of reliability due to
wearout related failure causes can be done with relatively small populations, whereas low
percent fail values typical of infant mortality cannot.
In any case, however, the goal of a reliability program is to ensure that the actual
probability line is to the right of the reliability requirement point.
2.5.1.1.6. DOE-Based Multicell
The methodology of a DOE (Design of Experiment)-based multicell involves subjecting
a sample of products to a combination of factors, or accelerants. These factors can be
stresses or categorical variables. The intent of these tests is to generate the data that is
required to develop a life model that is capable of predicting reliability under a variety of
use conditions. Life modeling is usually performed for specific failure causes. A goal of
a reliability program is to identify those causes that warrant the work required to develop
a life model. Characteristics of these “critical failure causes” often include:
After the identification of critical failure causes of a product or system that require life
modeling, action must be taken to ensure that those items are sufficiently robust to meet
product/system reliability and durability requirements. Life modeling is used for this
purpose, and involves the characterization and quantification of specific failure causes,
making it a critical element of a reliability program.
[Figure: life modeling process diagram. Recoverable labels: Tools; Measurement (Environment, Stresses, Duty Cycle, Extreme Event); DOE; Life Modeling; FMEA; FTA; Statistics; Characterize operating stresses; Actions]
Each of the elements in Figure 2.5-10 is further examined below. Additionally, the
topics of Design of Experiments (DOE) and life modeling are treated in more detail in
Chapters 4 and 5, due to their relatively complex nature and their importance to life
modeling. A detailed example of a developed life model is also provided in Chapter 7.
Identify Factors
Factors are the independent variables that can influence the product reliability, and the
response variable is the dependent variable. DOE is a common technique used to study
the relationships amongst many types of factors. In the context of this book, the response
variables specifically refer to the reliability metric of interest.
Critical failure causes and the factors that potentially affect their probability of
occurrence need to be identified. This can be done through testing, through analysis, or
both. EVT testing that is performed as part of the overall product/system reliability
program can be used for the identification of these factors, as previously described.
FMEA is also a popular analytical technique for this and will be used in the upcoming
example.
• Stresses
o Environmental
o Operational
• Product/System Attributes
o Design factors
o Manufacturing processes
• A continuous variable is one that can assume any value within a given range
• Categorical variables are those that assume a discrete number of possibilities
Some factors can be modeled as either. For example, environmental stress can be
modeled with continuous variables of the specific environmental stresses (e.g.,
temperature, vibration, humidity), or it can be modeled as a categorical variable.
The latter case is the approach that has historically been used in MIL-HDBK-217, which
uses environmental categories like “Ground, Benign,” “Airborne, Inhabited,” etc. The
217Plus methodology treats them as continuous variables, but default values are provided
for the categorical values of environment.
There are several ways in which these factors can be identified. One method that has
proven to be an efficient means of accomplishing this is to utilize the FMEA. This
involves modifying the FMEA to include several additional columns that correspond to
the above listed factors. At the analyst's discretion, from one to four additional columns
can be included. This will depend on the type of product or system under analysis and
the level of rigor desired. In this approach, the FMEA team (or at least someone
knowledgeable with the item design and process attributes) identifies the specific stresses
or attributes that will affect the probability of occurrence of the specific failure cause that
was identified in the FMEA. Since each failure cause will generally have an associated
risk priority number (RPN), the cumulative RPN can be calculated for all failure causes
affected by the specific stress or product/system attribute.
For example, consider the case in which an FMEA was accomplished in this manner, and
the results in Figure 2.5-11 were obtained. Here, only the environmental stresses are
shown, but the same methodology would apply to whichever additional factors are
included in the FMEA.
In this case, the sum of the RPN values for all failure causes accelerated by mechanical
shock is about 500. This cumulative RPN value is a relative number only, but can
provide valuable insight into the most important stresses to be addressed in the reliability
test plan.
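The cumulative RPN calculation described above amounts to summing the RPN of every failure cause that a given stress accelerates. The sketch below uses entirely hypothetical FMEA rows (the causes, RPN values, and stress names are invented for illustration and are not taken from Figure 2.5-11).

```python
from collections import defaultdict

# Hypothetical FMEA rows: (failure cause, RPN, stresses that accelerate the cause)
fmea_rows = [
    ("solder joint fatigue", 240, ["thermal cycling", "vibration"]),
    ("connector fretting", 120, ["vibration", "mechanical shock"]),
    ("component lead fracture", 180, ["mechanical shock"]),
]

def cumulative_rpn(rows):
    """Sum the RPN over all failure causes affected by each stress, highest first."""
    totals = defaultdict(int)
    for _cause, rpn, stresses in rows:
        for stress in stresses:
            totals[stress] += rpn
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

ranking = cumulative_rpn(fmea_rows)
for stress, total in ranking.items():
    print(f"{stress}: {total}")
```

As the text notes, the totals are relative numbers only. Their value lies in ranking the stresses to be addressed in the reliability test plan, not in the absolute magnitudes.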
In this example, the test stresses shown pertain to all of the failure causes addressed in the
FMEA. In performing life tests on specific failure causes, the information identified in
the FMEA should be used to identify the test stresses to be considered in the DOE plan.
Reliability Tests
If critical item failure mechanisms are time dependent, then time-based life tests are
required. Life tests are conducted by subjecting test samples to a defined stress level and
measuring the times when failure occurs. The process is repeated for various
combinations of factor levels. Considerations for the reliability tests are described below.
Test Plan
If there are multiple accelerating stresses, then life tests must be conducted at various
combinations of stress magnitudes. A plan should be developed using an effective tool
such as Design of Experiments. The plan should consider all aspects of testing so that the
test program generates data in a cost effective way. It is easy to lapse into the mentality
of testing “one factor at a time”, in which tests are conducted to assess specific factors,
but this approach is generally not time- or cost-effective.
Factors to consider in establishing an appropriate DOE include (1) the sample size per
test cell, (2) stress levels, (3) the number of stress levels for each stress, (4) stress
interactions, (5) stress durations, (6) failure criteria, and (7) measurement methodology
(i.e., in-situ or periodic). The principles of DOE are treated in more detail in Chapter 4.
In many cases, it is desirable to establish the upper bound of the test stress for each
specific stressor. An efficient way to determine this stress level, often called the
“destruct limit”, is to perform a step stress test. Here, a sample of units is exposed to a
stress level well below the suspected destruct limit. Then, the stress is increased until the
product is overstressed. This step-stress test can include a linearly ramped stress, or a
stepped-stress in which the samples are exposed to a constant stress for a given dwell
time, after which the stress is increased, dwelled, and so on until failure. An example of
the identification of these maximum stresses was mentioned previously in the HALT
discussion.
The destruct limit can be used as the upper limit of all subsequent life tests. Usually, the
actual life tests will be performed at a maximum stress that is a certain percentage level
below the destruct limit. This percentage is dictated primarily by the sensitivity of the
TTFs to the stress. For example, consider the two cases illustrated in Figure 2.5-12.
Case 1 is a situation in which the lifetime, and subsequent reliability, is moderately
sensitive to the stress level. Case 2 is a situation in which the lifetime has an extreme
sensitivity to the stress level.
Figure 2.5-12: Using the Destruct Limit to Define the Life Test Max Stress
For example, if a power law acceleration model is used, the life-stress relationship is:
$\text{Life} = \frac{A}{S^n}$
A typical value of “n” for Case 1 would be 1 to 3, whereas a typical value of “n” for Case
2 would be greater than 20.
In Case 1, the maximum stress for the life tests may be 10-20% below the destruct limit.
For Case 2, however, the maximum stress should be only a few percentage points below
the destruct limit. Otherwise, the risk is taken that the product or system will not fail
within a reasonable time period, which is required for reliability model development.
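The effect of the exponent “n” on the choice of maximum test stress can be illustrated numerically. The sketch below assumes the power law relationship above, under which the constant A cancels when life at a reduced stress is expressed relative to life at the destruct limit. The exponents 2 and 25 are assumed values chosen from within the typical ranges quoted for Case 1 and Case 2.

```python
def life_multiplier(stress_fraction, n):
    """Life at (stress_fraction * destruct limit) relative to life at the destruct limit.

    From Life = A / S**n, the ratio of lives is (S_destruct / S)**n, so A cancels.
    """
    return stress_fraction ** (-n)

# Testing 15% below the destruct limit
case_1 = life_multiplier(0.85, 2)     # moderately sensitive (assumed n = 2)
case_2 = life_multiplier(0.85, 25)    # extremely sensitive (assumed n = 25)

# Testing only 3% below the destruct limit for the sensitive case
case_2_close = life_multiplier(0.97, 25)

print(round(case_1, 2), round(case_2, 1), round(case_2_close, 2))
```

Under these assumptions, backing off 15% lengthens life by less than a factor of 1.5 in Case 1 but by more than a factor of 50 in Case 2, which is why the maximum test stress in Case 2 needs to remain within a few percent of the destruct limit.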
Stress Profile
The two main types of stress profiles are steady-state and time varying. Steady state tests
are those in which a sample set is exposed to constant stress levels, and the response
(performance parameter(s)) is measured. Several examples are shown in Figure 2.5-13.
Any of the profiles in Figure 2.5-13 can be used to develop life models. If the time-
varying stress profiles are used, a cumulative damage model is usually appropriate. In
this case, the stress function is integrated to obtain the cumulative damage. This will be
explained in more detail later.
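As a rough illustration of integrating a time-varying stress profile, the sketch below assumes linear damage accumulation combined with a power law life-stress relationship (Life = A / S^n), so that damage accrues at a rate of S^n / A and failure is predicted when the accumulated damage reaches 1. This is one common simplification and is an assumption here, not necessarily the cumulative damage model developed later in the book. The profile is treated as piecewise constant, and all numerical values are hypothetical.

```python
def time_to_failure(profile, A, n):
    """Return the time at which cumulative damage reaches 1, or None if it never does.

    profile: list of (duration_hours, stress) segments applied in order.
    Damage accrues at rate stress**n / A, i.e. 1 / Life(stress).
    """
    damage, elapsed = 0.0, 0.0
    for duration, stress in profile:
        rate = stress**n / A
        if damage + rate * duration >= 1.0:
            # Failure occurs partway through this segment
            return elapsed + (1.0 - damage) / rate
        damage += rate * duration
        elapsed += duration
    return None

# Hypothetical step-stress profile: 500 h at stress 1.0, then up to 1000 h at stress 2.0
step_profile = [(500.0, 1.0), (1000.0, 2.0)]
ttf = time_to_failure(step_profile, A=1000.0, n=2)
print(ttf)
```

In this example the first step consumes half of the life (500 of 1,000 hours at stress 1.0), and the remaining half is consumed in 125 hours at stress 2.0, where the life is only 250 hours.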
Some of the advantages and disadvantages of the two generic approaches are listed in
Table 2.5-9.
1. Use constant intervals. While this approach may not be optimal, it can be
appropriate in cases where the failure characteristics are completely unknown
2. If the rate of occurrence of failure (ROCOF) is expected to decrease over time,
the measurement intervals can start out very frequent, and decrease in frequency
as the failure rate decreases. This is shown in Figure 2.5-14.
[Figure 2.5-14: failure rate curve with measurement points marked]
3. If the ROCOF is expected to increase over time, the measurement intervals can start out
very infrequent, and increase in frequency as the failure rate increases. This is shown in
Figure 2.5-15.
[Figure 2.5-15: failure rate curve with measurement points marked]
This case is generally much more difficult to implement because the failure
characteristics need to be known before the tests. Therefore, one of the first two
approaches is usually desirable.
If the failure cause is a common cause mechanism, meaning that the entire population is
at risk, then many fewer items would be required. In this case, test data on enough
samples is required such that differences in reliability as a function of the factors (i.e.,
stresses, indicator variables) can be determined in a statistically significant manner. This
will be a function of how much inherent variability there is in the population, and how
sensitive the reliability is as a function of the factors under analysis. Essentially, if these
variabilities are known, then statistical techniques, like the Fisher F-test, could be used.
However, in practice, these variabilities are rarely known a priori. Therefore, sample
sizes as large as possible are preferred. In practice, the sample sizes are usually dictated
by programmatic constraints, in which case it is the reliability practitioner’s responsibility
to lobby program managers for the required samples.
Test Time
The question as to how long tests should be run before stopping them inevitably needs to
be addressed. This is especially true in cases where the stress levels are low and the
resulting lifetimes are long. While it is usually difficult to determine an appropriate test
duration before the test is run, a general rule of thumb is that tests should be run for
durations sufficient to cause at least 50% of the items to fail. This facilitates
quantification of the median life. Keep in mind that tests are used to characterize the
statistical distribution at a specific stress level, and therefore enough failures need to be
experienced to quantify the distribution.
Consider the illustration in Figure 2.5-16. In this case, tests were performed at two stress
levels, and the resulting TTF distributions were obtainable for each level. The
acceleration in this case can be quantified, along with confidence bounds around the
acceleration model parameters.
Figure 2.5-16: Acceleration When the Distributions for at Least Two Stresses are
Available
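When life estimates are available at two stress levels, the acceleration can be quantified directly. The sketch below assumes the power law form Life = A / S^n and uses a single representative life (for example, the median) at each stress. The stress and life values are hypothetical, and a full analysis would fit the entire TTF distributions and report confidence bounds on the model parameters.

```python
import math

def fit_power_law(stress_1, life_1, stress_2, life_2):
    """Solve Life = A / S**n for (A, n) from two (stress, life) points."""
    n = math.log(life_1 / life_2) / math.log(stress_2 / stress_1)
    A = life_1 * stress_1**n
    return A, n

def life_at(stress, A, n):
    """Predicted life at a given stress under the fitted power law."""
    return A / stress**n

# Hypothetical median lives: 800 h at stress 10, 100 h at stress 20
A, n = fit_power_law(10.0, 800.0, 20.0, 100.0)

# Extrapolate to a hypothetical use stress of 5
use_life = life_at(5.0, A, n)
acceleration_factor = use_life / life_at(20.0, A, n)

print(round(n, 3), round(use_life, 1), round(acceleration_factor, 1))
```

With these inputs the fitted exponent is 3, so halving the stress multiplies life by 8, and the extrapolated life at the use stress is 6,400 hours.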
Now, consider the case in which the lower stress samples are not tested until enough
failures have occurred. This is shown in Figure 2.5-17. In this case, the distribution
cannot be quantified. All that is possible is the estimation of the lower bound of life, via
techniques like Weibayes analysis (shown as the star).
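A lower bound of this kind can be sketched for the zero-failure case. The example assumes a Weibull distribution whose shape parameter (beta) is known from prior experience, which is the central assumption of a Weibayes analysis. The function name, run times, and beta value are hypothetical.

```python
import math

def weibayes_eta_lower(run_times, beta, confidence):
    """Lower confidence bound on Weibull characteristic life when no failures occurred.

    Solves exp(-sum((t / eta)**beta)) = 1 - confidence for eta.
    """
    total = sum(t**beta for t in run_times)
    return (total / -math.log(1.0 - confidence)) ** (1.0 / beta)

# Hypothetical test: 10 units, each run 1,000 hours with no failures
run_times = [1000.0] * 10

eta_exponential = weibayes_eta_lower(run_times, beta=1.0, confidence=0.90)
eta_wearout = weibayes_eta_lower(run_times, beta=3.0, confidence=0.90)

print(round(eta_exponential, 1), round(eta_wearout, 1))
```

The result is only a bound on life, not a fitted distribution, and it is sensitive to the assumed shape parameter, which is why the text treats this situation as less informative than one in which failures are observed at each stress level.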
Figure 2.5-17: Acceleration When the Distributions for Low Stresses are Not
Available
This 50% objective can sometimes be offset if enough data is available in at least two
other, more stressful conditions, to compensate for the lack of data in the low stress
condition.
Collect data
• TTFs
• Acceleration variables
• Stress(es)
• Indicator
The TTF distribution can typically be modeled using the Weibull, exponential or
lognormal distributions. For sample "subpopulations" that exhibit different reliability
behavior than the main population, TTF distributions may manifest themselves as
bimodal. It is important that bimodal distributions be characterized. If one of the two
"modes" in the distribution appears to be the result of early failures from workmanship,
materials or process defects, then this information should be used to develop an
appropriate reliability screen. This topic is discussed in detail later in this book.
There are a variety of sources that can be used to estimate the stresses to which an item
will be exposed. First, customers will usually specify nominal and worst case
environmental requirements in the product or system specification. However, the data in
specifications are often very generic and lack sufficient detail for reliability analysis.
Field maintenance personnel can also often provide qualitative information pertaining to
stresses, especially when those stresses have resulted in failures.
• Customer specifications
• Customer usage information
• Measurement of conditions:
• Stresses
• Duty cycle
• Extreme event statistics
Degradation Modeling
In many cases, the reliability response variable will not be a TTF, but rather it will be the
behavior of a critical parameter as a function of time. In these cases, there are several
choices:
1. Develop a model that predicts the parameter as a function of all factors that need
to be quantified.
2. Derive a simple model (linear, logarithmic, exponential or power law) that
describes the parameter as a function of time, and then use this model to estimate
a time to failure (i.e., the time at which the parameter is predicted to degrade to some
predefined failure threshold).
In many cases, Option 2 is a good choice. Option 1 is a good choice in the following
cases:
1. When the failure mechanism can reach an asymptotic value of degradation. This
condition is difficult to model using the conventional life modeling techniques.
2. If the goal of the analysis is to feed other analytical techniques, like worst case
analysis (WCA).
[Figure: degradation modeling process diagram. Recoverable labels: Data of a performance parameter vs. time; Regression; Non linear model parameter estimation; Model of performance vs. time; Prediction of delta; Prediction of percent fail; Life modeling; Life model; Prediction of life distribution]
This approach starts with data pertaining to the value of a critical parameter as a function
of an independent parameter. This independent parameter is usually time, but can be
other parameters, such as cycles. An example of such data is shown in Figure 2.5-20, in
which five samples were put on test and the critical parameter was measured in situ.
Next, models of performance vs. time are developed. This can be accomplished by using
standard model forms such as linear, exponential, logarithmic, or polynomial, or a more
sophisticated non-linear model form. The standard model forms can be quantified by
applying a linearizing transform to the data and applying regression techniques. Such fits
are easily performed in MS Excel with the trend line functions. Non-linear model
forms can be quantified using numerical methods. The “Solver” utility in MS Excel is
an example of this solution type.
Once these degradation models are available, predictions can be made regarding the
degradation value or the percent of the population failing in accordance with a predefined
failure criterion (e.g., percent degradation). Another option is to convert the
degradation data to failure times, as shown in Figure 2.5-21. The estimated TTFs are
then used to generate a life model using the techniques covered elsewhere in this book.
Note that the resulting TTF distribution can sometimes be counterintuitive. For example,
when dealing with what is believed to be a wearout-related phenomenon, the conversion
of degradation into TTFs can reveal a TTF distribution that is not usually considered to
be a wearout characteristic (for example a Weibull distribution with a shape parameter
that is less than one).
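The conversion of degradation data into estimated failure times can be sketched as follows. The example assumes the simplest case, a linear degradation model fitted to each unit by ordinary least squares, with failure defined as the fitted line crossing a fixed threshold. The measurements, threshold, and function names are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares fit: value = intercept + slope * time."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return mean_v - slope * mean_t, slope

def estimated_ttf(times, values, threshold):
    """Time at which the fitted line is predicted to reach the failure threshold."""
    intercept, slope = fit_line(times, values)
    return (threshold - intercept) / slope

# Hypothetical in-situ measurements of a critical parameter (percent of initial value)
hours = [0, 100, 200, 300]
unit_a = [100.0, 98.0, 96.0, 94.0]
unit_b = [100.0, 99.0, 98.0, 97.0]
FAILURE_THRESHOLD = 80.0  # assumed failure criterion: 20% degradation

ttfs = [estimated_ttf(hours, unit, FAILURE_THRESHOLD) for unit in (unit_a, unit_b)]
print(ttfs)
```

The resulting list of estimated TTFs (one per unit) would then be fitted with a life distribution in the same way as observed failure times. Because each value is an extrapolation, the uncertainty grows with the distance between the last measurement and the threshold crossing.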
2.5.1.1.7. Reliability Demonstration
An accelerated reliability demonstration is conceptually the same as a non-accelerated
test, but the level of acceleration needs to be quantified. For this, life modeling
approaches are used.
The first approach is to utilize the field data directly, and the second is to utilize the data,
via an interim model developed from the data. This is shown in Figure 2.5-22.
[Figure 2.5-22: field data used to produce a reliability estimate, either directly or through an empirical model]
This data has been the primary source of data used to develop most of the empirical
prediction methodologies such as MIL-HDBK-217, 217Plus, etc. Due to the author’s
experience with these prediction methodologies, they will be used as examples in Chapter
7 to illustrate the concepts discussed in this section.
2.5.1.2.1. Same Product
Field data on the exact item under analysis is the best information on which to estimate
the reliability of the product or system. Unfortunately, it is usually available too late to
do any good. Reliability predictions and estimates are required long before product or
system deployment. This type of data is a lagging indicator of reliability, whereas the
other techniques discussed in this book are leading indicators. In other words, we need
leading indicators to estimate the reliability that will ultimately be observed with the field
data. This data, however, which should always be collected on products, is valuable in
the reliability assessment of future products.
2.5.1.2.2. Similar product
When using data on a similar product or system to assess a new product or system, the
degree of similarity needs to be accounted for to estimate the new item reliability based
on the empirical data available on the similar product. There are several ways in which
similarity can be assessed. The first approach is to utilize a reliability prediction
technique. This technique can be any of those covered in this document. The
technique’s ability to assess similarity is dependent on the ability of the specific
methodology to:
1. Address the factors that drive the reliability for the two products under analysis
2. Be reasonably sensitive to these drivers.
Regarding #2, for example, if a system is being developed that represents an evolutionary
change to the system for which a reliability estimate is available, estimating the reliability
of the new system based on the data from the old system requires that the prediction
methodology be sensitive to the design differences between the old and new systems. If
these differences consist of the addition of new components, an increase in the operating
temperature, and the addition of software, then the methodology used to assess the
“delta” in reliability between the new and old system must be capable of assessing these
elements, and the reliability prediction approach must be reasonably sensitive to these
factors. The methodology of 217Plus was designed to accommodate this type of
situation, and is further detailed in Section 2.6.
Additionally, it is not necessary that a single methodology be used to assess this “delta”.
Different techniques can be used to assess each of the elements of the design, and the
cumulative effect can be pooled together to form a complete system model. The
techniques used to assess each of the design elements will generally fall into the
categories described in this document.
Another more qualitative technique is to simply list the general attributes of the design, as
shown in Table 2.5-10. The relative expected reliability of each of these elements for the
new and old designs is then listed. This is a qualitative method, but can be useful in
some cases.
This approach needs to be developed for each product or system, since the reliability
attributes will be unique to that particular item type.
Another approach that can be used to assess similarity is to utilize the FMEA, if
available. This is illustrated in Figure 2.5-23. Here, the FMEA is performed on both the
new and the predecessor system. The failure causes identified represent a cumulative
listing of all failure causes, whether they are applicable to either or both items. Then, the
Occurrence rating is determined for each failure cause for both items. If a specific failure
cause is not applicable to one of the items, then it gets a rating of zero. The sum of the
Occurrence ratings is then calculated for each of the products or systems. The ratio of
these sums is an indicator of the relative reliability levels of the two items, and is a good
measure of the degree to which the items are similar.
[Figure 2.5-23: FMEA worksheet used to assess similarity. Columns: Component function, Failure mode, Failure effects, Causes, Severity, Occurrence, Detectability, Recommended Actions, and Applicable to (Old System, New System). The worksheet represents the FMEA results for both the new and old systems, and lists a cumulative set of failure modes, causes, etc. The two “Applicable to” columns identify whether the failure cause is applicable to the old, new or both systems. The sum of the “O” values of the causes applicable to the old system and the sum of the “O” values of the causes applicable to the new system are shown at the bottom of the respective columns.]
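The Occurrence-sum comparison described above can be sketched in a few lines. This is a minimal illustration only: the failure causes and ratings are hypothetical, the names `FailureCause` and `occurrence_ratio` are not from the source, and expressing the ratio as new over old is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FailureCause:
    description: str
    occurrence_old: int  # Occurrence rating for the predecessor; 0 if not applicable
    occurrence_new: int  # Occurrence rating for the new item; 0 if not applicable


def occurrence_ratio(causes):
    """Return (sum_old, sum_new, sum_new / sum_old) for a cumulative cause list."""
    sum_old = sum(c.occurrence_old for c in causes)
    sum_new = sum(c.occurrence_new for c in causes)
    if sum_old == 0:
        raise ValueError("predecessor has no applicable failure causes")
    return sum_old, sum_new, sum_new / sum_old


# Hypothetical cumulative cause list; ratings are illustrative only.
causes = [
    FailureCause("solder joint fatigue", 4, 4),
    FailureCause("connector fretting corrosion", 3, 0),  # connector absent in new design
    FailureCause("capacitor dielectric breakdown", 2, 3),
    FailureCause("software fault", 0, 5),                # software added in new design
]

sum_old, sum_new, ratio = occurrence_ratio(causes)
print(f"old = {sum_old}, new = {sum_new}, new/old = {ratio:.2f}")
```

A ratio near 1 suggests comparable expected reliability; how far from 1 is still "similar" remains a judgment for the analyst.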
For the most part, methodologies such as EPRD (Electronic Parts Reliability Data),
NPRD, MIL-HDBK-217, and 217Plus rely on field data from similar products or systems
in order to make reliability estimates. The manner in which they do this differs, but they
all share the same fundamental type of data as their basis.
2.5.1.2.4. Models
The use of models derived from empirical data to estimate the reliability of a product or
system is just one option for estimating reliability. Empirical models can be developed
and used by the analyst, or he/she can use empirical models developed by others. Models
developed by others include the industry standards or methodologies that many reliability
analysts are familiar with.
This section of the book deals with such models that are derived from the analysis of
empirical field data. Modeling is the means by which mathematical equations are
developed for the purpose of estimating the reliability of a specific item used and applied
in a specific manner. There are many ways in which models can be derived, and there is
no single “correct” way to develop these models. There are many such models in
existence. These models are generally easy to use, in that they are of a closed form and
simply require the analyst to identify the appropriate values of the input variables. The
developers of each of these models had their own perspective in terms of the user
community to be served, the variables that were to be modeled, the data that was
available, etc. It is not the intent of this book to review the specifics of these models, or
to compare them in detail. It is the intent, however, to discuss the rationale and options
for development of the models, and to provide some examples.
The analyst must first decide what variables are to be modeled. Factors that should be
considered as indicators of reliability include:
• Environmental stresses
• Operational stresses
• Reliability growth
• Time dependency
o Infant mortality
o Wearout
• Engineering practices
• Technology
o Feature sizes
o Materials
• Defect rates
• Yields
Which ones are actually included depends on whether the data is available to support the
quantification of a factor, whether a valid theoretical basis exists for its inclusion, and whether
the factor can be empirically shown to be an indicator of reliability.
There are always many more potential factors influencing reliability than can realistically
be included in a model. The analyst must choose which ones are considered to be the
predominant reliability drivers, and include them in the model. The next step of model
development is to theorize a model form. This is generally accomplished by attempting
to establish a model consistent with the fundamental physics of reliability. Examples of
the development of several empirically-based models are provided in Chapter 7.
To compare various empirical methodologies, Table 2.5-11 contains the predicted failure
rate of various empirical methodologies for a digital circuit board. The failure rates in
this table were calculated for each combination of environment, temperature and stress.
As can be seen from the data, there can be significant differences between the predicted
failure rate values, depending on the method used. Differences are expected because
each methodology is based on unique assumptions and data. The RIAC data in the last
row of the table is based on observed component failure rates in a ground benign
application.
Table 2.5-11: Digital Circuit Board Failure Rates (in Failures per Million Part Hours)
Environment               Ground Benign                     Ground Fixed
Temperature               10 Deg. C       70 Deg. C         10 Deg. C       70 Deg. C
Stress                    10%     50%     10%     50%       10%     50%     10%     50%
ALCATEL                   6.59    10.18   13.30   19.89     22.08   29.79   32.51   47.27
Bellcore Issue 4          5.72    7.09    31.64   35.43     8.56    10.63   47.46   53.14
Bellcore Issue 5          8.47    9.25    134.45  137.85    16.94   18.49   268.90  275.70
British Telecom HDR4      6.72    6.72    6.72    6.72      9.84    9.84    9.84    9.84
British Telecom HDR5      2.59    2.59    2.59    2.59      2.59    2.59    2.59    2.59
MIL-HDBK-217 E Notice 1   10.92   20.20   94.37   111.36    36.38   56.04   128.98  165.91
MIL-HDBK-217 F Notice 1   9.32    18.38   20.15   35.40     28.31   48.78   45.44   79.46
MIL-HDBK-217 F Notice 2   6.41    9.83    18.31   26.76     24.74   40.15   73.63   119.21
217Plus Version 2.0 0.28 4.89 0.51 6.04
RIAC data 3.3
class of components being considered. For example, for almost all electronic components,
the predicted failure rate is found to be a function of operating temperature and applied
electrical stress. In general, the lower the operating temperature and applied electrical
stress, the lower the predicted failure rate will be. Therefore, the parts stress method
includes model factors for these specific stresses. However, if specific stress values
cannot be determined, it is still possible to perform a prediction using the more general
parts count methodology. For the parts count method, model stress levels have been set
to typical default levels to allow a failure rate estimate simply by knowing the generic
type of component (such as chip resistor) and its intended use environment (such as
ground mobile). It should be noted that these reliability prediction handbook approaches
are, by necessity, generic in nature. Actual test or field data from other similar items is
always more desirable, given sufficient similarity, as was discussed previously.
MIL-HDBK-217
MIL-HDBK-217, “Reliability Prediction of Electronic Equipment”, has historically been
the most widely used of all of the empirically-based reliability prediction methodologies.
The basic premise of the handbook is the use of historical piece part test and field failure
rate data as the basis for predicting future system reliability. The handbook includes
failure rate models for most electronic part types. The latest released version of
MIL-HDBK-217 is “F, Notice 2”, dated 28 February 1995 [2]. The handbook was almost a
casualty of the DoD Acquisition Reform initiative, but it survived primarily because of its
widespread use, the dependency on it throughout the military-industrial complex, and the
lack of a suitable replacement.
Figure 2.5-24 presents a brief example of the MIL-HDBK-217 parts count method, where
the product or system failure rate is the sum of the failure rates of the generic electrical
and electromechanical components of which it is comprised [3]. Each piece-part failure rate
is derived by assigning “typical” defaults to the generic component category stress
models. The only factors considered in these parts count component models are (1) the
generic base failure rate for that part type (represented by λg) that is based on an assumed
application environment and default temperature, (2) a generic quality factor (πQ) that is
used to modify this part type base failure rate, and (3) the quantity of that part type used
in the equipment. In the example shown here, the λg for a bipolar microcircuit comprised
of between 1 and 100 gates in a 16-pin dual-in-line package used in a ground, fixed (i.e.,
GF) environment and operating at an assumed junction temperature of 60 degrees C is
0.012 failures per million hours. The quality factor for a parts count prediction is also
determined from a table (not shown in this example).
[2] As of the publication date of this book, a Draft version of MIL-HDBK-217G is in development and is expected to be released some
time in 2010.
[3] The current version of MIL-HDBK-217 does not predict the reliability of mechanical components or non-hardware reliability
elements, such as software, human reliability, and processes. Field failures of mechanical components and non-hardware items should
not be scored against MIL-HDBK-217 or any other electronics-based empirical methodologies.
The parts count prediction approach is intended for use early in the design phase of the
equipment life cycle, prior to the start of detailed design, when there is little known about
the specific characteristics of the parts being used, or how they will be applied (such as
individual operating and environmental stresses).
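The parts count summation described above (quantity times generic failure rate times quality factor, summed over part types) can be sketched as follows. Only the 0.012 failures per million hours value comes from the text; the quantities, the quality factors, the second part type and the function name are hypothetical.

```python
def parts_count_failure_rate(parts):
    """Sum quantity * lambda_g * pi_Q over part types (failures per 10^6 hours)."""
    return sum(quantity * lambda_g * pi_q for quantity, lambda_g, pi_q in parts)


# (quantity, lambda_g in failures per 10^6 hours, pi_Q)
parts = [
    (10, 0.012, 1.0),  # bipolar microcircuit, 1-100 gates, 16-pin DIP, GF; pi_Q hypothetical
    (40, 0.002, 3.0),  # hypothetical second part type, illustrative values only
]

total_failure_rate = parts_count_failure_rate(parts)
mtbf_hours = 1e6 / total_failure_rate  # valid only under the constant failure rate assumption
print(f"{total_failure_rate:.2f} failures per 10^6 hours, MTBF = {mtbf_hours:,.0f} hours")
```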
The parts stress approach of MIL-HDBK-217 is applied later in the design development
phase of the equipment life cycle, when more details of the design are becoming
available and specific part applications are being identified. The use of this approach
requires detailed knowledge of the applied stresses and physical characteristics of the
device, including ambient and/or operating junction temperatures, electrical stress levels
(such as voltage, power, or current) vs. rated parameters, device complexity (such as gate
counts or transistor counts for semiconductors), etc. An example of a MIL-HDBK-217
parts stress model is shown in Figure 2.5-25 for gate/logic arrays and microprocessors.
The form of the model separately addresses failure rate contributions from the
microcircuit die and the microcircuit package. The quantitative values for C1 (shown
here) and C2 (not shown) are device- and package technology-dependent. The values for
the temperature and environmental factors, πT and πE respectively, are also technology
dependent and are seen to independently impact the die and package failure rate
contributions. The quality factor, πQ, represents the amount of pre-condition screening or
testing that the part might get, and the learning factor (πL) reflects the level of maturity
associated with the manufacture of the device. Component maturity has been shown to
be a predominant reliability driver of components, since the maturity is inversely
proportional to the defect density, which in turn is proportional to the failure rate.
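The structure described above, with separate die and package contributions scaled by quality and learning factors, corresponds to the microcircuit model form published in MIL-HDBK-217F, λp = (C1·πT + C2·πE)·πQ·πL. The sketch below assumes that form; every numeric value is a hypothetical placeholder rather than a handbook table entry, and the function name is illustrative.

```python
def microcircuit_failure_rate(c1, pi_t, c2, pi_e, pi_q, pi_l):
    """Parts stress form: (C1*pi_T + C2*pi_E) * pi_Q * pi_L, in failures per 10^6 hours."""
    die_contribution = c1 * pi_t       # die complexity scaled by the temperature factor
    package_contribution = c2 * pi_e   # package term scaled by the environment factor
    return (die_contribution + package_contribution) * pi_q * pi_l


# Hypothetical inputs, for illustration only.
mature_part = microcircuit_failure_rate(c1=0.02, pi_t=0.5, c2=0.005, pi_e=2.0,
                                        pi_q=1.0, pi_l=1.0)
new_part = microcircuit_failure_rate(c1=0.02, pi_t=0.5, c2=0.005, pi_e=2.0,
                                     pi_q=1.0, pi_l=2.0)
print(mature_part, new_part)
```

Because πQ and πL multiply the whole sum, a less mature device (larger πL) scales the die and package contributions by the same amount.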
The primary discriminator between the two approaches is that the Telcordia models have
been tailored specifically for the telecommunications industry, meaning that the breadth
of available environmental factors in the Telcordia models is much narrower than in
MIL-HDBK-217. There are three basic reliability prediction methods in the Telcordia
methodology:
It uses the Bayesian methodology to combine data from various sources, in a manner
similar to the RAC PRISM/RIAC 217Plus methodology.
PRISM/217Plus
The original RAC PRISM® [4] system reliability assessment tool was developed and
released by the RAC in January 2000 as a potential replacement for MIL-HDBK-217.
With the subsequent transition to RIAC in June 2005, the RIAC 217Plus methodology
replaced the RAC PRISM tool and added additional component models. As a result, the
217Plus methodology currently addresses all the major component types found in
MIL-HDBK-217. Figure 2.5-27 symbolizes the replacement of the RAC PRISM tool by
217Plus.
With no DoD sponsorship and funding, the models contained within MIL-HDBK-217F,
Notice 2, and the data upon which they were based, were becoming increasingly
outdated, thereby subject to increasing criticism. The part failure rate models
incorporated within PRISM, and ultimately within 217Plus, are based on a much larger
and more recent dataset, reflecting the improvements made in semiconductor device and
packaging technologies, resulting in more “accurate” part failure rate predictions. The
PRISM/217Plus software tool incorporates many of the ideas contained in the “New
System Reliability Assessment Study” performed by the RIAC for what was then Rome
Laboratory (Reference 7). These ideas include the ability to update an analytical
reliability prediction using in-house test data or field experience through Bayesian
techniques, and the ability to factor in system-level reliability impacts resulting from the
robustness (or lack thereof) of the system development process. This methodology is
discussed in more detail in Chapter 7.
[4] PRISM is a registered trademark of Alion Science and Technology.
CNET/RDF 2000
The CNET/RDF 2000 Reliability Prediction Standard, shown in Figure 2.5-28, covers
most of the same component categories as MIL-HDBK-217.
This model has many similarities to the PRISM/217Plus models, in that it addresses the
year of manufacture, dormancy failure rates, thermal cycling characteristics and electrical
overstress failure rates. The form of the integrated circuits model is shown here, and
bears a resemblance to the format of the MIL-HDBK-217F, Notice 2 microcircuit failure
rate model, in that it partitions the predicted device failure rate into die- and package-
related contributions. As can be seen from Figure 2.5-29, there is quite a bit of
information that the analyst must have access to in order to use the model, but this is
typical for virtually all parts stress reliability prediction models.
FIDES
The FIDES methodology, illustrated in Figure 2.5-30, was created by a French
consortium of reliability experts from various companies. It has similarities to the
CNET, RAC PRISM and RIAC 217Plus methods.
Other Methodologies
Other methodologies include:
IEC 62380 TR Ed.1 (2003), Reliability Data Handbook - A universal model for
reliability prediction of electronics components, PCBs and equipment.
Reliability prediction models such as those summarized above can easily become
outdated as technology advances. As previously mentioned, maintaining the currency
and accuracy of a reliability prediction model can be a prohibitively costly and labor-
intensive effort. Failure to invest in this activity, however, will doom a reliability
prediction methodology to eventual irrelevancy and obsolescence.
Good data collection is the key to an effective process for utilizing data obtained from a
reliability tracking system. This information includes:
The intent of this section is to outline a reliability data collection and analysis system that
can provide the data required. Although a reliability tracking system outlined herein has
similarities to a FRACAS program, there are distinct differences. While a FRACAS
program is intended to identify the causes of failures so that corrective action can take
place, the program outlined herein is intended to be more comprehensive in that it assists
its user in more than the implementation of corrective actions, as it also provides the data
required to quantify reliability, in accordance with the methodologies outlined in this
book. This concept is illustrated in Figure 2.5-31.
[Figure 2.5-31: Uses of reliability tracking system data. Labels: Reliability; Vendor Selection; Implement Design Improvements; Warranty Claims; RCM Implementation.]
A data system consists of several basic elements: a database, software analysis tools, and
an interface to the data system users. The database is the core of the system that captures
the raw maintenance data that is necessary to perform the required data analysis. A
typical structure of a database is provided in Figure 2.5-32.
[Figure 2.5-32: Typical database structure. Labels: Maintenance Data; Root Failure Cause/Analysis Data.]
The blocks in the above figure correspond to records in a relational database structure.
The data elements associated with each record are defined below. The System
Information record consists of population statistics and needs to be updated whenever the
product or system status changes. Such a change occurs when new or modified items are
fielded.
The parts breakdown data element consists of a hierarchical description of the system.
This description is necessary to avoid confusion as to which FRUs (Field Replaceable
Units) belong to which assemblies and the number of FRUs in the assembly, as well as in
the entire system.
The maintenance data element consists of a record of the maintenance action taken to
maintain or repair the system. It also consists of a description of the anomaly, the failure
mode, and the failure mechanism of the failed unit as determined by the maintenance
technician. One record corresponds to a single maintenance action, and there can be any
number of them for each FRU in the system (i.e., a FRU in the system can be replaced
any number of times over the life cycle of the system).
The root failure cause/analysis data element consists of information on the results of the
detailed failure analysis that may be performed on the failed unit. It is a separate record
because not all maintenance actions will result in the failure analysis of a removed unit.
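The four record types described above can be sketched as simple Python data classes. This is one possible layout, not the schema of Figure 2.5-32: all class names, field names and sample values are assumptions chosen to mirror the descriptions in the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SystemInformation:
    system_number: str               # unique identifier for each fielded system
    date_fielded: date
    location: Optional[str] = None


@dataclass
class PartsBreakdownEntry:
    part_id_code: str                # position of the FRU in the hierarchy
    part_number: str
    parent_assembly: Optional[str]   # None for the top-level assembly
    quantity_in_system: int


@dataclass
class FailureAnalysisRecord:
    root_cause: str


@dataclass
class MaintenanceRecord:
    system_number: str
    part_id_code: str
    action_date: date
    anomaly: str
    failure_mode: str
    failure_mechanism: str
    # Separate, optional record: not every maintenance action leads to failure analysis.
    failure_analysis: Optional[FailureAnalysisRecord] = None


def actions_per_fru(records):
    """Count maintenance actions for each FRU identifier."""
    counts = {}
    for record in records:
        counts[record.part_id_code] = counts.get(record.part_id_code, 0) + 1
    return counts


# Hypothetical maintenance history.
records = [
    MaintenanceRecord("SYS-001", "A1-PS1", date(2009, 3, 2), "no output", "open", "fatigue"),
    MaintenanceRecord("SYS-001", "A1-PS1", date(2009, 9, 14), "low output", "drift", "aging",
                      FailureAnalysisRecord("electrolytic capacitor dry-out")),
    MaintenanceRecord("SYS-002", "A2-FAN1", date(2009, 6, 20), "noise", "binding", "wear"),
]
print(actions_per_fru(records))
```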
There are two primary interfaces required of the system. The first is the maintenance
technician interface. This interface is the means by which maintenance data is entered
into the database. Ideally, this interface would consist of computers located within the
maintenance facility for direct data entry. The second interface is the one utilized by
individuals that need the results of the data analysis. The flow of the interface to the
system from the perspective of the system user is given in Figure 2.5-33.
[Figure 2.5-33: Flow of the interface to the data system. Maintenance side: system maintenance is required and maintenance commences; the maintenance technician identifies the part requiring maintenance and enters the part data into the database; the technician performs the required maintenance; the maintenance technician enters maintenance data into the database. User side: the system user enters the part breakdown and maintains system usage status; the user runs the appropriate analysis and obtains the necessary reliability metric(s). Both sides connect to the central database.]
Important elements of the data system that should be considered for inclusion are
summarized below:
• System information
• Number of systems fielded
• Dates of fielding for each system j
• Location of operation (optional)
• System Numbers (unique identifier for each system)
Parts Breakdown
A description of every level of assembly must be available, down to the lowest level of
repair. For the purposes of this example, this assembly will be called a FRU (Field Replaceable Unit).
• Part number
• Serial number
• Part identification code (unique descriptor of part in hierarchical breakdown of
system; sometimes referred to as a Reference Designator)
• Number of parts in the product or system
• Applicable life unit (e.g., hours, miles, cycles, operations)
• Indication of whether the specific part has its own elapsed time meter (or counter of
miles, cycles, or operations) or whether system life units must be used
• Manufacturer name
Maintenance Information
A critical element of an effective reliability data collection and analysis system is the
accurate quantification of the failure cause. Not all perceived failures are real failures
and, therefore, it is important to identify whether part removals are indeed true failures.
Figure 2.5-34 illustrates the hierarchy of maintenance actions.
[Figure 2.5-34: Hierarchy of maintenance actions. Labels: Maintenance Action; Scheduled; Unscheduled; Necessary Repair; Unnecessary Repair; Faulty Unit Gets Put Back into Field; Cannot Duplicate; Analysis.]
From the data collected and captured in the database, several fundamental reliability
parameters, including those listed below, can be calculated.
For many of these parameters, it is necessary to calculate the number of life units to
which each part has been exposed. This is done by calculating the number of life units on
the part since the last time that the part was replaced. This calculation procedure is
illustrated in Figure 2.5-35.
[Figure 2.5-35: Flowchart of the procedure for calculating the life units accumulated by a part.]
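A minimal sketch of the life-unit calculation is given below. The decision logic is an assumption based on the data elements listed earlier (a part either carries its own meter or relies on system life units); it is not a transcription of Figure 2.5-35, and the function and parameter names are hypothetical.

```python
def life_units_since_replacement(system_units_now, replacement_readings,
                                 fielding_reading=0.0, part_meter_reading=None):
    """Life units accumulated by the part currently installed.

    replacement_readings holds the system life-unit reading at each past
    replacement of this part position (empty if it was never replaced).
    """
    if part_meter_reading is not None:
        # The part has its own elapsed time meter, so use it directly.
        return part_meter_reading
    # Otherwise fall back to system life units since the most recent installation.
    last_install = max(replacement_readings, default=fielding_reading)
    return system_units_now - last_install


# Hypothetical example: system at 5000 hours, part replaced at 1200 and 3100 hours.
print(life_units_since_replacement(5000.0, [1200.0, 3100.0]))
```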
Outputs
Typical output parameters are listed below:
Drenick’s Theorem
An important aspect of interpreting field reliability data is distinguishing between
calendar time and operating time. Consider a situation in which five items are fielded at
the same time, as illustrated in Figure 2.5-36. They will each have a failure time (or other
appropriate life unit) that is described by the TTF distribution as a function of operating
time.
[Figure 2.5-36: Failure times of items 1 through 5 plotted against operating time.]
Now, consider the same five items that were placed in the field at different calendar
times, as illustrated in Figure 2.5-37. They will have the same failure times relative to
their operating time, but the apparent failure times relative to calendar time will be quite
different.
[Figure 2.5-37: Failure times of items 1 through 5 plotted against calendar time.]
Furthermore, if the product or system is repairable (in which case the failed items are
replaced upon failure with a new item), an interesting effect occurs in which the apparent
failure rate will reach an asymptotic value that appears to represent a constant failure rate.
This occurs as the “time zero” values become randomized as items fail and are replaced
with new items.
To illustrate the relationship between the beta value (Weibull shape) and the
instantaneous failure rate as a function of calendar time when parts are replaced upon
failure, a simulation was performed. In this simulation example, the failure rate of 1100
items as a function of calendar time was calculated.
Figures 2.5-38 through 2.5-42 illustrate the results. These figures correspond to
Weibull-distributed TTFs with shape parameters of 20, 5, 2, 1 and 0.5, respectively. The time axis
is calendar time, normalized to a time unit of one characteristic life.
Consider the case where the Weibull beta = 20 (Figure 2.5-38). When the populations
start operating at the same time at t = 0, the failures occur at a rate described by the
Weibull distribution with a beta value of 20. The peak of the failure rate occurs at
approximately the characteristic life value of time. As units fail and are replaced, the
“time zeros” start to become randomized. As enough time passes, the “time zeros” will
eventually become completely randomized. At this point, the asymptotic value of failure
rate is reached, which is the reciprocal of the characteristic life (in this case, 100). Figure
2.5-39, depicting the simulation results for a beta value of 5.0, indicates a similar effect.
The asymptotic failure rate, however, is reached sooner. This happens because the
variance in failure time is greater for a beta of 5.0 relative to a beta of 20, which, in turn,
means that the population “time zeros” become randomized sooner. The plot illustrating
a beta of 2.0 (Figure 2.5-40) is similar, with a corresponding asymptotic value reached
sooner. The plot corresponding to a beta of 1.0 (Figure 2.5-41) indicates that the random
failure rate is present from t = 0, which makes intuitive sense, since a beta of 1.0
corresponds, by definition, to a randomly occurring failure rate.
However, when the beta is less than 1.0 (Figure 2.5-42), the asymptotic failure rate value
is zero. This occurs because, when enough time has passed, the failed items have been
replaced with items that have a higher probability of living longer. The lower the beta
value, the shorter the time period required to achieve a zero failure rate.
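A simulation of this kind can be sketched with the Python standard library. This is not the simulation used to produce the figures: the population size of 1100 follows the text, while the horizon, bin count, seed and function name are assumptions. Each item is replaced on failure with a new item whose TTF is drawn from the same Weibull distribution, and failures are counted in calendar-time bins.

```python
import math
import random


def simulate_failure_rate(beta, n_items=1100, horizon=10.0, n_bins=50, eta=1.0, seed=0):
    """Failures per item per unit calendar time, in equal-width calendar-time bins."""
    rng = random.Random(seed)
    bin_width = horizon / n_bins
    counts = [0] * n_bins
    for _ in range(n_items):
        t = 0.0
        while True:
            t += rng.weibullvariate(eta, beta)  # scale eta, shape beta
            if t >= horizon:
                break
            # Guard against floating-point rounding at the last bin edge.
            counts[min(int(t / bin_width), n_bins - 1)] += 1
    return [c / (n_items * bin_width) for c in counts]


for beta in (20.0, 5.0, 2.0, 1.0):
    rate = simulate_failure_rate(beta)
    late = rate[len(rate) // 2:]             # second half of the calendar window
    mean_life = math.gamma(1.0 + 1.0 / beta)  # mean TTF for eta = 1
    print(f"beta={beta:>4}: late-time rate = {sum(late) / len(late):.3f}, "
          f"1/mean life = {1.0 / mean_life:.3f}")
```

For the shape values in the loop, the late-time rate per item can be compared with the reciprocal of the mean life, η·Γ(1 + 1/β), which is close to the reciprocal of the characteristic life when beta is large. Plotting the returned list against the bin centers shows the transient oscillation for large beta values.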
2.5.2. Physics
The generic approaches covered here in using a physics approach are stress-strength
interference models and models from first principles. Each is described below.
This technique is general in nature and applies equally to any situation in which the two
distributions can be quantified, as long as the X-axis represents the same variable for both
distributions. The variable can be electrical, such as voltage or current, or it can be
mechanical strength, for example, in units of KPSI.
The goal of any design for robustness effort is to minimize the variance of both
distributions, and maximize the separation of the distribution means. In this manner, the
probability of distribution intersection, or failure, is minimized.
[Figure: Stress-strength modeling example. Labels: Dimensions; Extrinsic Stresses; Modulus; CTE; Strength; Probability of Failure vs Time.]
In this example, a mechanical item has certain physical properties, for example its
modulus and its coefficient of thermal expansion (CTE). These material properties are
used in addition to the design variables (i.e., dimensions, extrinsic stresses) to estimate the
stresses to which the item is exposed. This stress can be modeled in several ways. One is
the use of handbooks that contain closed-form equations that estimate the stress to which
a material is exposed as a function of dimensions, force, deflections, etc. This is usually
only viable for simple structures. For more complex mechanical structures, finite
element models and analysis (FEA) may be required to simulate stresses.
For the strength portion of the model, two factors need to be considered:
An example of strength as a function of time is the fatigue properties of the material. The
fatigue properties pertain to the strength degradation over time.
At time = 0, the probability of failure is the intersection of the stress and the strength
distributions, as illustrated in Figure 2.5-44.
Z = Standard Normal variate (i.e., the number of standard deviations from the
mean of the standardized Normal distribution). The probability corresponding to “Z” can be
obtained from:
1. Tables of the Standard Normal distribution
2. MS Excel formula =NORMSDIST(Z)
μx = the mean of the strength
μy = the mean of the stress
σx = the standard deviation of the strength
σy = the standard deviation of the stress
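For Normally distributed stress and strength, the standard interference calculation uses Z = (μx − μy) / sqrt(σx² + σy²), with the probability of failure equal to the Standard Normal cumulative probability at −Z. The sketch below assumes that formulation and that sign convention; the input values (in KPSI) and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def interference_probability(mu_strength, sigma_strength, mu_stress, sigma_stress):
    """Return (Z, probability that stress exceeds strength) for independent Normals."""
    z = (mu_strength - mu_stress) / sqrt(sigma_strength ** 2 + sigma_stress ** 2)
    # Failure occurs when (strength - stress) < 0, i.e. more than Z sigmas below its mean.
    return z, NormalDist().cdf(-z)


# Hypothetical values: strength 50 +/- 5 KPSI, stress 30 +/- 4 KPSI.
z, p_fail = interference_probability(50.0, 5.0, 30.0, 4.0)
print(f"Z = {z:.3f}, probability of failure = {p_fail:.2e}")
```

Reducing either standard deviation or widening the gap between the means increases Z and lowers the failure probability, which is the design-for-robustness goal stated above.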
In many real situations, distributions other than the Normal are used, requiring alternate
methods of calculating the interference probability. Readily available software tools can
be used for this purpose (Reference 3).
An example of a model that has been successfully used for brittle materials is the
following:
$$P = 1 - \exp\left[-\left(\frac{V}{V_0}\right)\left(\frac{\sigma}{S_0}\right)^{m}\left(\frac{t}{t_0}\right)^{m/n}\right]$$
where:
P = probability of failure
m = Weibull slope of the initial strength
S0 = characteristic strength
n = fatigue constant
V and V0 are volume parameters to account for the effects of size (i.e., they
account for the effect that the more volume or surface area that there is, the more
likely it is to have a strength limiting flaw)
σ = stress
Now, if a screen is applied to the material to eliminate defects having strength values
below the applied screen stress threshold (Sth), the probability of failure becomes:
$$P = 1 - \exp\left[-\left(\frac{V}{V_0}\right)\left(\frac{\sigma - S_{th}\left(\frac{t_0}{t}\right)^{1/n}}{S_0}\right)^{m}\left(\frac{t}{t_0}\right)^{m/n}\right]$$
This is only one example of a stress strength model. Many others can be found in the
literature.
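The brittle-material model and its screened variant can be evaluated with a short function. This sketch assumes the two expressions given above; treating a non-positive screened stress term as zero failure probability is an added assumption, and all numeric values and names are hypothetical.

```python
from math import exp


def failure_probability(stress, t, m, n, s0, t0=1.0, volume_ratio=1.0, screen_stress=0.0):
    """Weibull-type failure probability for a brittle material under static stress.

    With screen_stress = 0 this reduces to the unscreened form.
    """
    effective = stress - screen_stress * (t0 / t) ** (1.0 / n)
    if effective <= 0.0:
        # Assumption: flaws that would fail at this stress and time were removed by the screen.
        return 0.0
    exponent = volume_ratio * (effective / s0) ** m * (t / t0) ** (m / n)
    return 1.0 - exp(-exponent)


# Hypothetical inputs, for illustration only.
p_unscreened = failure_probability(stress=100.0, t=1000.0, m=5.0, n=20.0, s0=400.0)
p_screened = failure_probability(stress=100.0, t=1000.0, m=5.0, n=20.0, s0=400.0,
                                 screen_stress=50.0)
print(f"unscreened: {p_unscreened:.2e}, screened: {p_screened:.2e}")
```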
The original test plan included Accelerated Aging Tests on Fused Splitters for 3
conditions, as shown in Table 2.5-12.
The “X” values in the cells indicate which test conditions have a constant value for the
stress indicated in each column. The values were chosen to assess whether relative
humidity or absolute humidity was the predominant mechanism of the failure mode. In
this case, two of the three conditions have equivalent relative humidity and two of three
have equivalent absolute humidity.
The results of the accelerated tests did not agree with a previously hypothesized failure
mechanism that proposed epoxy creep as the coupling ratio drift mechanism. Therefore,
in an effort to obtain a model that was consistent with empirical evidence, the
fundamental physics were investigated. This process is described below:
From optical component physics, it can be shown that the coupling between two fibers is:
$$c = \frac{3\pi\lambda}{32\, n_2\, a^{2}} \cdot \frac{1}{\left(1 + 1/V\right)^{2}}$$
where:
$$V \equiv a k \left(n_2^{2} - n_3^{2}\right)^{1/2}$$
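The coupling expression can be evaluated numerically to see how it responds to a change in index. The source does not define the symbols here, so the sketch assumes that λ is the wavelength, a is a radius of the fused region, n2 is the index of the fused region, n3 is the index of its surroundings, and k = 2π/λ. All values are hypothetical.

```python
from math import pi, sqrt


def coupling_coefficient(wavelength, a, n2, n3):
    """Evaluate c = 3*pi*lambda / (32*n2*a^2) * 1 / (1 + 1/V)^2."""
    k = 2.0 * pi / wavelength                # assumed: free-space wavenumber
    v = a * k * sqrt(n2 ** 2 - n3 ** 2)      # requires n2 > n3
    return 3.0 * pi * wavelength / (32.0 * n2 * a ** 2) / (1.0 + 1.0 / v) ** 2


# Hypothetical values (meters, dimensionless indices).
c_higher_n3 = coupling_coefficient(1.55e-6, 5e-6, 1.444, 1.30)
c_lower_n3 = coupling_coefficient(1.55e-6, 5e-6, 1.444, 1.00)
print(c_higher_n3, c_lower_n3)
```

Under these assumptions, a lower n3 gives a larger V and therefore a larger coupling coefficient.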
Additionally, the diffusion of water vapor into silica can be represented as:
⎜ ⎟
⎝ n =1 ⎠
where:
$$B_n \equiv \frac{2}{j_n J_1(j_n)}$$
The hypothesis of the physical mechanism is that water diffuses into the outer surface of
the fused region very slowly and slightly decreases the index of refraction of this outer
surface. This increases the coupling coefficient, thereby increasing the coupling ratio.
As time goes by, more and more water diffuses in, and the coupling ratio increases until
the device goes out of spec. The amount of water in the silica is determined by the number of
water molecules hitting the surface of the silica per unit time (directly proportional to the
absolute humidity) and by the diffusion rate at that temperature. Therefore, if the time to
failure at a specific condition is known, the time to failure at a new condition is the
known TTF multiplied by the ratio of the absolute humidity levels times the ratio of the
diffusion rates.
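The scaling rule in the preceding paragraph can be written as a one-line function. The direction of each ratio (reference over new, so that more humidity or faster diffusion shortens life) is an assumption inferred from the mechanism; the values and names are hypothetical.

```python
def scaled_ttf(ttf_ref, abs_humidity_ref, abs_humidity_new, diffusion_ref, diffusion_new):
    """Scale a known time to failure to a new humidity and temperature condition."""
    humidity_ratio = abs_humidity_ref / abs_humidity_new
    diffusion_ratio = diffusion_ref / diffusion_new
    return ttf_ref * humidity_ratio * diffusion_ratio


# Hypothetical: doubling both the absolute humidity and the diffusion rate.
print(scaled_ttf(ttf_ref=1000.0, abs_humidity_ref=20.0, abs_humidity_new=40.0,
                 diffusion_ref=1.0, diffusion_new=2.0))
```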
The data obtained in the tests were used to estimate the diffusion rate and the temperature
dependence of this diffusion rate, as shown in Table 2.5-13.
The predictions from the model were then obtained. The predicted and observed
lifetimes are shown in Table 2.5-14.
As can be seen above, the model is extremely accurate in predicting the failure
mechanism behavior.
Models developed from first principles like the one shown in this example can be very
accurate and, thus, beneficial to a reliability program. However, several pieces of
information were required in order to make this approach a viable alternative:
The primary difference between this approach and the DOE-based life modeling
approach previously described is the manner in which the model form is determined. In
the DOE approach, the model forms are assumed and are based on standard forms like
the power law or the Arrhenius law. In the physics approach, the model forms are
determined from first principles of physics. In both cases, however, certain model
parameters are generally estimated from empirical data.
Figure 2.6-1 summarizes the 217Plus methodology for estimating the failure rate of a
product or system. In this example, only constant failure rates are addressed. If specific
items are described by non-constant failure rates, the mathematics become more difficult,
but the basic approach remains the same.
The specific approach that can be used depends on several factors, including:
• Whether the analyst chooses to evaluate and assess the processes used in the
development of the product or system
The types of data that may be available can be any of the types summarized previously in
this section of the book.
If the product or system under analysis is an evolution of a predecessor item, the field
experience of the predecessor product can be leveraged and modified to account for the
differences between the new product and the predecessor product. A predecessor is
defined as a product or system that is based on similar technology and uses
design/manufacturing processes similar to the new item under development for which a
reliability prediction is desired. In this case, the new product or system is an evolution of
its predecessor. In this analysis, a prediction is performed on both the predecessor item
and the new item under development. These two predictions form the basis of a ratio that
is used to modify the observed failure rate of the predecessor, and account for the degree
of similarity between the new and predecessor products or systems. The result of the
predecessor analysis is expressed as λ1, as presented in Figure 2.6-1.
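The predecessor adjustment can be sketched as the observed predecessor failure rate multiplied by the ratio of the two predictions. This is a simplified reading of the paragraph above, not the full 217Plus procedure; the function name and all values are hypothetical.

```python
def predecessor_based_failure_rate(observed_failures, operating_hours,
                                   predicted_predecessor, predicted_new):
    """lambda_1: observed predecessor rate scaled by predicted_new / predicted_predecessor."""
    observed_rate = observed_failures / operating_hours  # point estimate, failures per hour
    return observed_rate * (predicted_new / predicted_predecessor)


# Hypothetical: 12 failures in 2 million hours; predictions of 10 and 8
# failures per million hours for the predecessor and the new item.
lambda_1 = predecessor_based_failure_rate(12, 2.0e6, 10.0e-6, 8.0e-6)
print(f"lambda_1 = {lambda_1 * 1e6:.2f} failures per 10^6 hours")
```

The two predictions need only share units, since they enter as a ratio.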
If enough empirical data (field, test, or both) is available on the new product or system
under development, it can be combined with the reliability prediction on the new item to
form the best failure rate estimate possible. A Bayesian approach is used for this
combination, which merges the reliability prediction with the available data. As the
quantity of empirical data increases, the failure rate using the Bayesian combination will
be increasingly dominated by the empirical data. The result of the Bayesian combination
is defined as λ2, as presented in Figure 2.6-1.
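One common way to implement such a combination is to treat the prediction as a gamma prior on a constant failure rate and update it with the observed failures and hours. The sketch below assumes that approach and a hypothetical prior weight of 0.5 equivalent failures; it is not presented as the exact 217Plus formula, which the source defers to Chapter 7.

```python
def bayesian_failure_rate(predicted_rate, failures, hours, prior_weight=0.5):
    """Posterior mean failure rate under a gamma prior centered on the prediction.

    prior_weight is the equivalent number of failures represented by the prior (assumed).
    """
    a0 = prior_weight
    b0 = a0 / predicted_rate  # equivalent prior hours, so that a0 / b0 = predicted_rate
    return (a0 + failures) / (b0 + hours)


predicted = 5.0e-6  # hypothetical prediction, failures per hour
no_data = bayesian_failure_rate(predicted, failures=0, hours=0.0)
some_data = bayesian_failure_rate(predicted, failures=2, hours=1.0e6)
much_data = bayesian_failure_rate(predicted, failures=200, hours=1.0e8)
print(no_data, some_data, much_data)
```

With no data the result equals the prediction; as failures and hours accumulate, it approaches the observed rate of 2 failures per million hours used in this example.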
The minimum amount of analysis required to obtain a predicted failure rate for a product
or system is the summation of the component estimated failure rates. The component
failure rates are determined from the component models, along with other data that may
be available to the analyst. The result of this component-based prediction is λIA,new. This
value can be further modified by incorporating the optional data, resulting in λpredicted,new,
as shown in Figure 2.6-1. All methods of analysis require that a prediction be performed
on the new product or system under development in accordance with the component
prediction methodology. Predictions based solely on the component analysis should
be used only when there is no field or test reliability history for the new item and no
suitable predecessor item with a field reliability history. In this case, the reliability
model is purely predictive in nature. After a product or system has been fielded, and
there has been a significant amount of operating time, the best data on which to base a
failure rate estimate is field observed data, or a combination of prediction and observed
failure data. In this case, the reliability model yields an estimate of reliability, because
the reliability is estimated from empirical data.
Each element of the 217Plus methodology is further described in the following sections.
λIA,predecessor
λIA,predecessor is the initial reliability assessment of the predecessor product or system. It is
the sum of the predicted component failure rates, and uses any of the methods described
in this book.
λobserved, predecessor
λobserved, predecessor is the observed failure rate of the predecessor product or system. It is
the point estimate of the failure rate, which is equal to the number of observed failures
divided by the cumulative number of operating hours (see footnote 5).
Optional data
Optional data is used to enhance the predicted failure rate by adding more detailed data
pertaining to environmental stresses, operating profile factors, and process grades (the
concept of process grades is explained in detail in Chapter 7). The 217Plus models
contain default values for the environmental stresses and operational profile, but in the
event that actual values of these parameters are known, either through analysis or
measurements, they should be used. The application of the process grades is also
optional, in that the user has the option of evaluating specific processes used in the
design, development, manufacturing and sustainment of a product or system. If process
grades are not used in a 217Plus analysis, default values are provided for each process
(failure cause), so that the user can evaluate any or all of the processes.
λpredicted, predecessor
λpredicted, predecessor is the predicted failure rate of the predecessor product or system after
combining the initial assessment with any optional data, if appropriate.
λIA,new
λIA,new is the initial reliability assessment of the new product or system. This is the sum
of the predicted component failure rates, and uses the 217Plus component failure rate
models or other methods (such as data from NPRD or other data sources). A reliability
prediction performed in accordance with this method is the minimum level of analysis
that will result in a predicted reliability value. Applying any optional data can further
enhance this value.
Footnote 5: Note that “operating hours” can be replaced by any other life unit, such as calendar hours, miles, cycles, etc. The 217Plus
methodology predicts failure rates in terms of calendar hours. The important point is that all life units used in the assessment must be
consistent.
λpredicted, new
λpredicted, new is the predicted failure rate of the new system after combining the initial
reliability assessment with any optional data, if used. If optional data is not used, then
λpredicted, new is equal to λIA,new.
λ1
λ1 is the failure rate estimate of the new system after the predicted failure rate of the new
system is combined with the information on the predecessor product (predicted and
observed data). The equation that translates the failure rate from the old product or
system to the new one is:
\[ \lambda_1 = \lambda_{predicted,new} \times \frac{\lambda_{observed,predecessor}}{\lambda_{predicted,predecessor}} \]
The values for λpredicted,new and λpredicted,predecessor are obtained using the component
reliability prediction procedures. The ratio of λobserved,predecessor /λpredicted,predecessor inherently
accounts for the differences in the predicted and observed failure rates of the predecessor
system, i.e., it inherently accounts for the differences in the products or systems analyzed
in the component reliability prediction methodology.
This methodology can be used when the new product or system is an evolutionary
extension of predecessor designs. If similar processes are used to design and
manufacture a new item, and the same reliability prediction processes and data are used,
then there is every reason to believe that the observed/predicted ratio of the new system
will be similar to that observed on the predecessor system. This methodology implicitly
assumes that there is enough operating time and failures on which to base a value of
λobserved,predecessor. For this purpose, the observance of failures is critical to derive a point
estimate of the failure rate (i.e., failures divided by hours). A single-sided confidence
level estimate of the failure rate should not be used.
ai
ai is the number of failures for the ith set of data on the new product or system.
bi
bi is the cumulative number of operating hours for the ith set of data on the new product or
system.
AFi
AFi is the acceleration factor (AF) between the conditions of the test or field data on the
new product or system and the conditions under which the predicted failure rate is
desired. If the data is from a field application in the same environment for which the
prediction is being performed, then the AF value will be 1.0. If the data is from
accelerated test data or from field data in a different environment, then the AF value
needs to be determined. If the applied stresses are higher than the anticipated field use
environment of the new system, AF will have a value greater than 1.0. The AF can be
determined by performing a reliability prediction at both the test and use conditions. The
AF can only be determined in this manner, however, if the reliability prediction model is
capable of discerning the effects of the accelerating stress(es) of the test. As an example,
consider a life test in which the product was exposed to a temperature higher than what it
would be exposed to in field-deployed conditions. In this case, the AF can be calculated
as follows:
\[ AF = \frac{\lambda_{T1}}{\lambda_{T2}} \]
where:
λT1 = the predicted failure rate at the test conditions obtained by performing a
reliability prediction of the system at temperature 1
λT2 = the predicted failure rate at the use conditions obtained by performing a
prediction at temperature 2
bi’
bi’ is the effective cumulative number of hours of the test or field data used. If the tests
were performed at accelerated conditions, the equivalent number of hours needs to be
converted to the conditions of interest, as follows:
\[ b_i' = b_i \times AF_i \]
ao
ao is the effective number of failures associated with the predicted failure rate. If this
value is unknown, then use a default value of 0.5. In the event that predicted and
observed data is available on enough predecessor products or systems, this value can be
tailored. See the next section for the appropriate tailoring methodology.
λ2
λ2 is the best estimate of the new system failure rate after using all available data and
information. As much empirical data as possible should be used in the reliability
assessment. This is done by mathematically combining λ1 with empirical data. Bayesian
techniques are used for this purpose. The technique accounts for the quantity of data by
weighting large amounts of data more heavily than small quantities. λ1 forms the “prior”
distribution, characterized by a0 equivalent failures and a0/λ1 equivalent operating hours. If empirical data (i.e., test or field data) is
available on the system under analysis, it is combined with λ1 using the following
equation:
\[ \lambda_2 = \frac{a_0 + \sum_{i=1}^{n} a_i}{\dfrac{a_0}{\lambda_1} + \sum_{i=1}^{n} b_i'} \]
where λ2 is the best estimate of the failure rate, and ao is the “equivalent” number of
failures of the prior distribution corresponding to the reliability prediction. For these
calculations, 0.5 should be used unless a tailored value can be derived. An example of
this tailoring is provided in the next section.
a1 through an are the number of failures experienced in each source of empirical data.
There may be “n” different sources of data available (for example, each of the n sources
corresponds to individual tests or field data from the total population of products or
systems).
b1’ through bn’ are the equivalent number of cumulative operating hours experienced for
each individual data source. These values must be converted to equivalent hours by
accounting for any accelerating effects between the test and use conditions.
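The Bayesian combination can be sketched as follows. This is a minimal illustration of the λ2 equation together with the AF and bi' definitions given earlier; the function names and all data values are hypothetical, and the default a0 of 0.5 is the untailored value stated in the text.

```python
def acceleration_factor(lam_test, lam_use):
    """AF = predicted failure rate at test conditions / at use conditions."""
    return lam_test / lam_use


def bayesian_failure_rate(lambda_1, data_sets, a0=0.5):
    """Combine the prior estimate lambda_1 with empirical data.

    data_sets is an iterable of (a_i, b_i, AF_i) tuples: failures, cumulative
    hours, and acceleration factor for each source of test or field data.
    """
    total_failures = a0
    total_hours = a0 / lambda_1          # equivalent hours of the prior
    for a_i, b_i, af_i in data_sets:
        total_failures += a_i
        total_hours += b_i * af_i        # b_i' = b_i * AF_i
    return total_failures / total_hours


# Hypothetical example: prior of 2e-5 failures/hour, one field data set
# (AF = 1.0) and one accelerated test whose AF comes from two predictions.
af_test = acceleration_factor(lam_test=5e-5, lam_use=5e-6)
lambda_2 = bayesian_failure_rate(
    lambda_1=2e-5,
    data_sets=[(1, 100_000.0, 1.0), (2, 5_000.0, af_test)],
)
print(af_test, lambda_2)
```

With no empirical data the function returns λ1 unchanged, and as the accumulated hours grow the result is increasingly dominated by the observed failures and hours, which matches the behavior described for the Bayesian combination.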
To estimate the value of ao that should be used, a distribution of the following metric is
calculated for all products for which both predicted and observed data is available:
\[ \frac{\lambda_{observed,predecessor}}{\lambda_{predicted,predecessor}} \]
The lognormal distribution will generally fit this metric well, but others (for example,
Weibull) can also be used. The cumulative value of this distribution is then plotted.
Next, failure rate multipliers (as calculated by a chi square distribution) are calculated
and plotted. This chi-square distribution should be calculated and plotted for various
numbers of failures, to ensure that the distribution of observed/predicted failure rate
ratios falls between the chi-square values. In most cases, one, two and three failures
should be sufficient. Next, the plots are compared to determine which chi-square
distribution most closely matches the observed uncertainty values. The number of
failures associated with that distribution then becomes the value of a0. Figure 2.6-2
illustrates an example for which this analysis was performed.
Figure 2.6-2: Comparison of Observed Uncertainty with the Uncertainty Calculated With
the Chi-square Distribution
As can be seen from Figure 2.6-2, the observed uncertainty does not precisely match the
Chi-square calculated uncertainty for any of the one, two or three failures used in this
analysis. This is likely due to the fact that the population of products on which this
analysis is based is not homogeneous, as assumed by the chi-square calculation.
However, the confidence levels of interest are generally in the range 60 to 90 percent. In
this range, the chi-square calculated uncertainty with 2 failures most closely
approximates the observed uncertainty. Therefore, in this example, an a0 value of 2 was
used. This value is also consistent with the Telcordia GR-332 reliability prediction
methodology (Reference 6).
[Figure: the model for the failure data and the failure data together define the likelihood, L(Failure Data | θ)]
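\[ f(\theta) = f(\theta \mid \mathrm{DATA}) = \frac{f_0(\theta) \times L(\mathrm{DATA} \mid \theta)}{\int_{\theta} f_0(\theta) \times L(\mathrm{DATA} \mid \theta)\, d\theta} \]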
where,
θ = the vector of model parameters, (θ1, θ2, …, θn)
f(θ) = the posterior joint distribution of parameters
f0(θ) = the prior joint distribution of parameters
L(data|θ) = the likelihood of data given the model parameters
In practice, the features of this distribution include the updated marginal and conditional
distribution of each parameter given the provided information. The marginal distribution
of a single parameter is defined by the next equation. The marginal distribution is
estimated by integrating the posterior joint distribution, f(θ), over the range of other
parameters, as shown. The other important outcome of the posterior joint distribution is
the conditional distribution of each parameter, when other elements of vector θ are given.
The conditional distribution is constructed by substituting the known parameters in the
joint distribution, f(θ). Here again, the function needs to be scaled by a normalization
factor, as demonstrated in the equation below, in order to be consistent with the basic
characteristics of the distribution functions.
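\[ g_j\left(\theta_j \mid \hat{\theta}_{-j}\right) = \frac{f\left(\theta_j, \hat{\theta}_{-j}\right)}{\int_{\theta_j} f\left(\theta_j, \hat{\theta}_{-j}\right) d\theta_j} \]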
where:
The integrals necessary for Bayesian computation usually require analytic or numerical
approximations. While the computations for non-constant failure rate distributions can
get quite involved, they are relatively straightforward for the exponential distribution.
The method explained in the previous section details this situation.
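For the exponential case, the numerical and analytic routes can be compared directly. The sketch below assumes a gamma prior with a0 equivalent failures and a0/λ1 equivalent hours (consistent with the λ2 equation of the previous section) and a Poisson-type likelihood for r failures in T hours; it evaluates the posterior mean by midpoint-rule integration on a grid and compares it with the closed-form result. The function name, grid settings and input values are illustrative assumptions.

```python
import math


def posterior_mean_grid(a0, lambda_1, failures, hours, n_points=20000):
    """Posterior mean of a constant failure rate by numerical integration."""
    b0 = a0 / lambda_1                      # equivalent hours of the prior
    shape = a0 + failures
    rate = b0 + hours
    # Integrate far enough into the upper tail of the posterior.
    upper = (shape + 12.0 * math.sqrt(shape)) / rate
    step = upper / n_points
    weighted_sum = 0.0
    total = 0.0
    for k in range(n_points):
        lam = (k + 0.5) * step              # midpoint of the k-th cell
        prior = lam ** (a0 - 1.0) * math.exp(-b0 * lam)
        likelihood = lam ** failures * math.exp(-hours * lam)
        posterior = prior * likelihood      # unnormalized posterior density
        weighted_sum += lam * posterior
        total += posterior
    return weighted_sum / total             # normalization cancels the step


# Hypothetical inputs: a0 = 2, lambda_1 = 1e-4 per hour, 3 failures in 50,000 h.
numeric = posterior_mean_grid(a0=2.0, lambda_1=1e-4, failures=3, hours=50_000.0)
closed_form = (2.0 + 3) / (2.0 / 1e-4 + 50_000.0)
print(numeric, closed_form)
```

The two values agree closely, which illustrates why the exponential case needs no numerical integration in practice.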
1. Perform a Monte Carlo analysis, where the TTF distribution of each element is
preserved, and the operating time and number of failures is modeled
2. If all of the failure causes are independent, the options are:
a. Calculate the reliability of each cause at a specific time of interest, and
then calculate the reliability as:
\[ R(t) = \prod_{i=1}^{n} R_i(t) \]
\[ \lambda = \sum_{i=1}^{n} \lambda_i \]
1. The analysis is performed only to the component level, without modeling the
specific failure causes
2. A constant failure rate distribution is used
3. All components are required for the product or system to meet its requirements
(i.e., failure probability values are independent)
Then, the product reliability is simply the product of the reliabilities of the individual
components, or likewise the failure rate is the sum of the failure rates of the individual
constituent components. This has been the traditional approach when using the
“handbook” types of methodologies.
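When the three conditions hold, the calculation reduces to a sum and a product. The following sketch uses hypothetical component failure rates and shows that the product of the component reliabilities equals the reliability computed from the summed failure rate.

```python
import math


def series_failure_rate(rates):
    """Failure rate of a series system of constant-failure-rate items."""
    return sum(rates)


def series_reliability(rates, t):
    """Product of the individual reliabilities R_i(t) = exp(-lambda_i * t)."""
    return math.prod(math.exp(-lam * t) for lam in rates)


# Hypothetical component failure rates in failures per hour.
rates = [2e-6, 5e-6, 1e-6]
t = 10_000.0
lam_total = series_failure_rate(rates)
print(lam_total)
print(series_reliability(rates, t), math.exp(-lam_total * t))
```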
If all of the above listed conditions are not present, then more sophisticated techniques
are required. For example, consider the situation in which Condition 1 and 2 are not
satisfied, but Condition 3 is. In this example, let’s say that there are seven failure causes
for which the life modeling has resulted in an estimate of the TTF distribution under field
use conditions. These distributions can be any arbitrary shape, dependent entirely on the
characteristics of the specific failure causes. This situation is depicted in Figure 2.7-1,
where the reliability block diagram is shown as a series configuration. Each failure cause
is represented by Events 1 through 7, each of which has its own probability density
function.
[Figure 2.7-1: Series reliability block diagram of Events 1 through 7; each event has its own pdf, f(t), plotted against time, t]
For repairable systems, in which repairs are made as failures occur, the system reliability
would be simulated over a given time period, such as the mission duration or the
warranty period. In this case, failure times are simulated from time = 0 to the specified
time period. In this simulation, multiple systems are simulated, for which the failure
times of the constituent components are also simulated. As failures occur, new
replacement components are installed which have a new component time zero (the
system operating time will not be zero, but will be the cumulative operating time). This
continues until the duration is exceeded for each of the simulated systems. The resulting
failure times for the system can then be analyzed, and the distribution parameters defined.
The resultant distribution will generally not be a mono-modal distribution; rather, it will
be a distribution of an arbitrary shape that is usually represented by a multi-modal
distribution.
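A simulation of this kind can be sketched with the standard library alone. The example below assumes Weibull-distributed component lives, instantaneous replacement with a new component at each failure, and a series system in which every component failure is a system failure; the function name, the parameters and the number of simulated systems are illustrative assumptions. A shape parameter of 1.0 is used in the example so that the result can be checked against the constant failure rate expectation.

```python
import random


def simulate_system_failures(components, duration, rng):
    """Return the sorted system failure times for one simulated system.

    components is a list of (alpha, beta) Weibull scale and shape parameters.
    Each failed component is replaced by a new one that starts at its own
    time zero, while the system clock keeps running.
    """
    failure_times = []
    for alpha, beta in components:
        clock = 0.0                                   # system operating time
        while True:
            clock += rng.weibullvariate(alpha, beta)  # life of a new component
            if clock > duration:
                break
            failure_times.append(clock)
    return sorted(failure_times)


rng = random.Random(1)
components = [(1000.0, 1.0), (1000.0, 1.0)]           # hypothetical items
runs = [simulate_system_failures(components, 1000.0, rng) for _ in range(2000)]
mean_failures = sum(len(run) for run in runs) / len(runs)
print(mean_failures)  # close to 2.0 for these exponential components
```

The pooled failure times in `runs` are the data to which a distribution would then be fitted.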
It is noteworthy that the above model is valid for any situation in which all items are
critical, i.e., the failure of any one item results in product or system failure. For example,
the Fault Tree for this situation may look like Figure 2.7-2. In this case, all gates are OR
gates, which means that all failures of items represented by Events 1 through 7 constitute
critical failures. This is shown to illustrate the fact that the analysis does not necessarily
need to be performed at the same level of hierarchy. The most important thing is that all
of the critical failure causes are accounted for.
[Figure 2.7-2: Fault tree in which the TOP event and all intermediate gates are OR gates over Events 1 through 7]
For non-repairable systems, in which the first failure causes system failure and all items
represented by each failure cause are required for the product or system to function, this
becomes a competing risk situation in which the first failure cause to occur will define
the item’s TTF distribution. The Type 1 extreme value distribution, also known as the
Gumbel distribution, is sometimes used to model this situation when components have
the same reliability distribution. This competing risk situation, modeled with times to
first failure (TTFF), will not yield the same results as taking either the product of the
reliability values or the sum of the failure rates (in the case of constant failure rates)
because, in the latter cases, there is a probability that multiple failures will occur in the
time period analyzed, which is not the case for the competing risk situation.
For all but the simplest of situations, closed-form solutions cannot be obtained. These
require solutions with numerical simulation, like Monte Carlo analysis, which is
described in the next section.
For #2, there are handbooks available which provide estimates of interference probability
based on the individual stress and strength distributions. Or, a statistical simulation can
be performed to estimate the degree of interference via numerical techniques. This is
generally a more efficient and effective way of performing the simulation, given software
tools that are readily available.
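A statistical simulation of stress/strength interference can be sketched as follows. Both distributions are assumed to be normal and independent; the strength parameters (mean 10, standard deviation 3) match the example used in this section, while the stress parameters (mean 5, standard deviation 2) are hypothetical. For two normal distributions an exact result is available, which is used here only as a check on the simulation.

```python
import math
import random
from statistics import NormalDist


def simulate_interference(strength_mean, strength_sd, stress_mean, stress_sd,
                          n_trials, rng):
    """Estimate P(stress > strength) by random sampling."""
    failures = 0
    for _ in range(n_trials):
        strength = rng.gauss(strength_mean, strength_sd)
        stress = rng.gauss(stress_mean, stress_sd)
        if stress > strength:                 # interference: the item fails
            failures += 1
    return failures / n_trials


rng = random.Random(42)
p_sim = simulate_interference(10.0, 3.0, 5.0, 2.0, 100_000, rng)

# Exact interference probability for independent normal stress and strength.
z = (10.0 - 5.0) / math.sqrt(3.0 ** 2 + 2.0 ** 2)
p_exact = NormalDist().cdf(-z)
print(p_sim, p_exact)
```

For distributions without a closed-form interference probability, only the simulated estimate is available, and its precision improves with the number of trials.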
The basic principle behind Monte Carlo analysis, as applied to stress/strength interference
analysis is shown here:
where:
The first step is to randomly select a value from each of the stress and strength
distributions. As an example, consider a normally distributed strength with a mean of 10
and standard deviation of 3, the pdf of which is shown in Figure 2.7-3.
Figure 2.7-3: pdf of Normal Distribution with Mean of 10 and Standard Deviation of 3.
Next, the cumulative function of this distribution is calculated, as shown in Figure 2.7-4.
[Figure 2.7-4: Cumulative distribution function of the Normal distribution with a mean of 10 and standard deviation of 3]
Next, the randomly selected value from the distribution is obtained by:
NORMINV(rand(),mean,standard deviation)
where:
The Weibull distribution is simpler to use than the Normal distribution since an integral
of the pdf is not required to derive the CDF. The closed-form pdf of the Weibull
distribution is:
\[ f(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta - 1} e^{-\left(\frac{t}{\alpha}\right)^{\beta}} \]
The Weibull distribution is one of the most widely used distributions in reliability
engineering due to its versatility. It also has the advantage of having a closed-form
solution for its cumulative function.
To select a random value from this distribution, a random number between 0 and 1 is
selected; this value is substituted for R(t), and the corresponding TTF is determined from
the equation. In this example, time(t) is shown as the independent variable, but the
specific parameter could be any parameter whose distribution is used in a Monte Carlo
analysis. The inverse cumulative function is shown in Figure 2.7-6, along with the
selection of the random value.
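The two sampling procedures described above can be sketched in Python. The normal case mirrors the NORMINV(rand(), mean, standard deviation) spreadsheet call using the inverse CDF from the standard library. The Weibull case assumes the standard two-parameter reliability function R(t) = exp(-(t/α)^β), which inverts to t = α(-ln R)^(1/β). Function names and parameter values are illustrative.

```python
import math
import random
from statistics import NormalDist


def sample_normal(mean, sd, rng):
    """Inverse-CDF sample, equivalent in intent to NORMINV(rand(), mean, sd)."""
    u = rng.random()
    while u == 0.0:               # inv_cdf requires 0 < u < 1
        u = rng.random()
    return NormalDist(mean, sd).inv_cdf(u)


def sample_weibull_ttf(alpha, beta, rng):
    """Substitute a random number for R(t) and solve for the TTF."""
    r = 1.0 - rng.random()        # uniform on (0, 1], avoids log(0)
    return alpha * (-math.log(r)) ** (1.0 / beta)


rng = random.Random(7)
normal_draws = [sample_normal(10.0, 3.0, rng) for _ in range(20_000)]
weibull_draws = [sample_weibull_ttf(1000.0, 2.0, rng) for _ in range(20_000)]

normal_mean = sum(normal_draws) / len(normal_draws)
weibull_mean = sum(weibull_draws) / len(weibull_draws)
print(normal_mean)                                  # close to 10
print(weibull_mean, 1000.0 * math.gamma(1.5))       # sample vs. theoretical mean
```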
Now, let’s consider another application of Monte Carlo simulation. In this example, a
simple relationship between items for a repairable system is shown in Figure 2.7-7. Here,
the items can be failure causes, components or assemblies, in accordance with the level to
which the analysis is performed.
The behavior for each item, along with the resultant system behavior, is shown in Figure
2.7-8.
For example, item A operates until it fails at TTFA1. At that point in time, it takes TTRA1
to repair it. Items B and C fail and get repaired at rates determined by the simulated
times for each, and governed by the specific distribution of each. The resultant system
availability (Asystem) is shown on the bottom.
A simulation was performed on this hypothetical system using a software tool, the results
of which are shown in Figure 2.7-9. In this case, the following metrics were calculated
from the Monte Carlo analysis:
Simulations of product reliability, as described above, are generally the best way to
combine life estimates of constituent parts in a system. If a system is comprised of
redundant elements, closed-form equations are available that calculate the effective
failure rate of the redundant elements. However, care must be taken when using these
equations. For example, the manner in which they are generally derived is to calculate
the failure characteristics as time approaches infinity. Only in this manner are closed-
form solutions possible. The results are “effective” failure rate estimates that often
underestimate the benefits of redundancy. This is especially true when mission times are
relatively short. As a result, calculating reliability based on the failure probability
examples described above is generally a more sound approach. Additionally, the
availability of software tools has made it much easier to perform these calculations.
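The effect can be illustrated for two identical, independent units in active parallel redundancy without repair. The sketch below assumes that the "effective" failure rate is taken as the reciprocal of the redundant pair's MTTF, which is 3/(2λ); this is one common derivation and may differ from the specific equations the text has in mind. The values of λ and the mission time are hypothetical.

```python
import math


def parallel_pair_reliability(lam, t):
    """Exact reliability of two parallel units, each with failure rate lam."""
    q = 1.0 - math.exp(-lam * t)          # failure probability of one unit
    return 1.0 - q * q                    # the pair fails only if both fail


def effective_rate_reliability(lam, t):
    """Reliability implied by an effective constant rate 1/MTTF = 2*lam/3."""
    lam_eff = 2.0 * lam / 3.0
    return math.exp(-lam_eff * t)


lam = 1e-4                                # failures per hour (hypothetical)
t_short = 100.0                           # short mission, hours
exact = parallel_pair_reliability(lam, t_short)
approx = effective_rate_reliability(lam, t_short)
print(exact, approx)
print((1.0 - approx) / (1.0 - exact))     # factor by which unreliability is overstated
```

For this short mission the effective failure rate overstates the probability of failure by well over an order of magnitude, consistent with the caution given above.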
2.8. References
1. “Production Part Approval Process (PPAP)”, Third Edition, Daimler-Chrysler,
Ford, General Motors, 1999
2. Modarres, M., “Accelerated Testing”, ENRI 641, Univ. of Maryland, May 2005
3. Weibull++, Reliasoft Corp.
4. Colm V. Cryan, James R. Curley, Frederick J. Gillham, David R. Maack, Bruce
Porter, and David W. Stowe, “Long Term Splitting Ratio Drifts in Singlemode
Fused Fiber Optic Splitters”, NFOEC 95
5. David R. Maack, David W. Stowe and Frederick J. Gillham, “Confirmation of a
Water Diffusion Model For Splitter Coupling Ratio Drift Using Long Term
Reliability Data”, NFOEC 96
6. Telcordia GR-332, “Reliability Prediction Methodology”
7. Denson, W.K. and S. Keene, “A New System Reliability Assessment
Methodology – Final Report”, Available from the Reliability Information
Analysis Center, 1998
3. Fundamental Concepts
The intent of this book is not to cover the basics of probability or reliability theory. The
understanding of some of these fundamental concepts, however, is critical to the
interpretation of reliability estimates. The definition of reliability is a probability, the
value of which is estimated by the techniques covered in this book. Therefore, the basics
of reliability terminology, and the basis for various theoretical concepts are covered in
this section.
[Figure: discrete probability distribution, showing the probability p(xi) for each of the values x1 through x9]
The probability that a random variable “x” takes on a specific value “xi” is expressed as:
P{x = xi } = p(xi )
A continuous variable is one that is measured on a continuous scale, and its probability
distribution is defined as a continuous distribution. For example, the distribution of the
TTF would be a continuous distribution, since an infinite number of positive time values
can be represented in the distribution. Figure 3.1-2 illustrates a continuous distribution.
The probability that a random variable “x” lies in the interval from “a” to “b” is
expressed as:
\[ P\{a \le x \le b\} = \int_{a}^{b} f(x)\,dx \]
The cumulative distribution function F(t) is defined as the probability in a random trial
that the random variable is not greater than t:
\[ F(t) = \int_{-\infty}^{t} f(t)\,dt \]
The Cumulative Distribution Function (CDF) is the probability that the value of a
corresponding random variable will not be exceeded. Cumulative distribution functions
are non-negative and non-decreasing. Given a random variable that cannot be negative,
the value of the CDF at the origin is zero. The upper limit of a CDF is always 1.0, as
illustrated in Figure 3.1-3. The CDF is the integral of the pdf, and is illustrated in Figure
3.1-3 for discrete and continuous distributions, respectively.
The reliability function, R(t), is the probability of a device surviving (not failing) prior to
time “t”, and is given by:
\[ R(t) = 1 - F(t) = \int_{t}^{\infty} f(t)\,dt \]
Note that for the reliability, the integral of the pdf is from “t” to infinity for the
probability of success, as opposed to minus infinity to “t” as in the case of the failure
probability. The sum of the probability of success and the probability of failure needs to
be 1.0, consistent with the definition of a pdf.
\[ -\frac{dR(t)}{dt} = f(t) \]
The probability of failure in a given time interval between t1 and t2 can be expressed by
the reliability function:
\[ \int_{t_1}^{\infty} f(t)\,dt - \int_{t_2}^{\infty} f(t)\,dt = R(t_1) - R(t_2) \]
The rate at which failures occur in the interval t1 to t2, the failure rate “λ(t)”, is defined as
the probability that a failure occurs within the interval, given that it has not
occurred prior to t1 (the start of the interval), divided by the interval length. Thus:
\[ \lambda(t) = \frac{R(t_1) - R(t_2)}{(t_2 - t_1)\,R(t_1)} = \frac{R(t) - R(t + \Delta t)}{(\Delta t)\,R(t)} \]
where t = t1 and t2 = t + Δt. The hazard rate, h(t), or instantaneous failure rate, is defined
as the limit of the failure rate as the interval length approaches zero, or:
\[ -\frac{dR(t)}{dt} = f(t) \]
Then,
\[ h(t) = \frac{f(t)}{R(t)} \]
• The hazard rate, h(t), is the rate at which failures occur, providing the item has not
failed before the time h(t) is evaluated
• f(t) is the normalized percentage of the population failing in a given time interval
(Δt), such that the population size times value of f(t) is equal to the number of
failures in the interval of time.
• The denominator, R(t), is the probability of survival at t, which is equivalent to
the percentage of the population surviving at time t.
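As a numerical illustration of h(t) = f(t)/R(t), the sketch below uses the two-parameter Weibull distribution, with the pdf given in Chapter 2 and the standard reliability function R(t) = exp(-(t/α)^β). The parameter values are hypothetical. A central finite difference of R(t) is included to check the relationship -dR(t)/dt = f(t).

```python
import math


def weibull_pdf(t, alpha, beta):
    """f(t) for the two-parameter Weibull distribution."""
    return (beta / alpha) * (t / alpha) ** (beta - 1.0) * math.exp(-(t / alpha) ** beta)


def weibull_reliability(t, alpha, beta):
    """R(t) = exp(-(t/alpha)^beta)."""
    return math.exp(-(t / alpha) ** beta)


def hazard_rate(t, alpha, beta):
    """h(t) = f(t) / R(t)."""
    return weibull_pdf(t, alpha, beta) / weibull_reliability(t, alpha, beta)


alpha, beta, t = 1000.0, 2.0, 500.0
print(hazard_rate(t, alpha, beta))            # increasing hazard when beta > 1
print(hazard_rate(t, alpha, 1.0), 1 / alpha)  # constant hazard when beta = 1

# Check -dR/dt = f(t) with a central finite difference.
h = 1e-3
slope = (weibull_reliability(t - h, alpha, beta)
         - weibull_reliability(t + h, alpha, beta)) / (2 * h)
print(slope, weibull_pdf(t, alpha, beta))
```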
Multiplying R(t) by the population size yields the total number of units surviving until
“t”. This is the binomial probability, or expected value of the number of survivors at “t”.
Since this population will have accrued an operating time of “RN*Δt”, the denominator is
equivalent to the cumulative operating time on the population in the time interval.
Therefore,
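\[ h(t) = \frac{f(t)}{R(t)} = \frac{f(t) \times N}{R(t) \times N} = \frac{\#\ \text{failures in}\ \Delta t}{\#\ \text{units surviving} \times \Delta t} = \frac{\text{Failures}}{\text{item hours}} = \text{Failure rate} \]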
\[ h(t) = \frac{1}{R(t)}\left[-\frac{dR(t)}{dt}\right] \]
Resulting in:
\[ R(t) = e^{\left[-\int_{0}^{t} h(t)\,dt\right]} \]
This is the general expression for the reliability function. If h(t) can be considered a
constant failure rate (λ), which is often the case, the equation becomes:
\[ R(t) = e^{-\lambda t} \]
The mean time to failure (MTTF) is the expected value of the time to failure, and is:
\[ MTTF = \int_{0}^{\infty} R(t)\,dt \]
If the reliability function can be easily integrated, this is a convenient way to calculate the
mean time to failure (MTTF). If not, then numerical techniques can be used.
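A numerical evaluation of the MTTF integral can be sketched with the trapezoidal rule. The example assumes a Weibull reliability function with hypothetical parameters and truncates the integral at an upper limit where R(t) is negligible; the known closed-form Weibull mean, α·Γ(1 + 1/β), is used only to check the result.

```python
import math


def mttf_numeric(reliability, upper, n_steps=20_000):
    """Trapezoidal approximation of the integral of R(t) from 0 to upper."""
    step = upper / n_steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for k in range(1, n_steps):
        total += reliability(k * step)
    return total * step


alpha, beta = 1000.0, 2.0                      # hypothetical Weibull parameters
mttf = mttf_numeric(lambda t: math.exp(-(t / alpha) ** beta), upper=10 * alpha)
closed_form = alpha * math.gamma(1.0 + 1.0 / beta)
print(mttf, closed_form)
```

The upper limit must be chosen large enough that the truncated tail of R(t) does not contribute materially; ten times the characteristic life is more than sufficient for this shape parameter.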
If all parts in a population are operated until failure, the mean life is:
\[ \theta = \frac{\sum_{i=1}^{n} t_i}{n} \]
where:
\[ MTBF = \frac{T(t)}{r} \]
where:
Failure rate and MTBF are applicable only to the situation in which the failure rate is
constant, i.e., the exponential TTF distribution. Per the definitions above, it can be seen
that the failure rate and MTBF are reciprocals of each other:
\[ \lambda = \frac{1}{MTBF} \]
The failure rate is the number of failures divided by the cumulative operating time of the
entire population (failure/part hours), whereas the MTBF is the cumulative operating time
of the entire population divided by the number of failures (part hours per failure).
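The reciprocal relationship can be shown with a short calculation. The failure count and cumulative operating time below are hypothetical.

```python
def failure_rate(failures, total_hours):
    """Point estimate: failures per cumulative part hour."""
    return failures / total_hours


def mtbf(failures, total_hours):
    """Cumulative part hours per failure."""
    return total_hours / failures


# Hypothetical population: 4 failures in 200,000 cumulative part hours.
lam = failure_rate(4, 200_000.0)
mean_time_between_failures = mtbf(4, 200_000.0)
print(lam)                          # 2e-05
print(mean_time_between_failures)   # 50000.0
```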
Table 3.1-1 provides an overview of the basic notation and mathematical representations
that are common among the various types of probability distributions.
E[u(X)]   Expected Value
          Discrete distribution:    \[ E[u(X)] = \sum_{w=0}^{\infty} u(w)\,f(w) \]
          Continuous distribution:  \[ E[u(X)] = \int_{0}^{\infty} u(w)\,f(w)\,dw \]
μ         Mean                      \[ \mu = E(X) \]
σ         Standard Deviation        \[ \sigma = \sqrt{E[(X - \mu)^2]} \]
Note: These definitions are based on the assumption that all realizations of a random variable
must be non-negative.
3.2.1. Covariance
Covariance is a measure of the extent to which one variable is related to another, and is
expressed as:
\[ Cov(X, Y) = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]
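The covariance expression can be evaluated directly, as in the following sketch. The two data series are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sum of (x_i - xbar)(y_i - ybar), divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical paired observations; y increases with x, so Cov > 0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
cov_xy = sample_covariance(xs, ys)
print(cov_xy)
```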
\[ {}_{n}P_{r} = \frac{n!}{(n - x)!} \]
A combination is defined as the number of distinct combinations of “n” items taken “x”
at a time, when ordering is not relevant, and is mathematically expressed as:
\[ {}_{n}C_{r} = \frac{n!}{x!\,(n - x)!} \]
As an example of permutations and combinations, define n=4 and x=2. The number of
combinations is:
\[ {}_{n}C_{r} = \frac{n!}{x!\,(n - x)!} = \frac{4!}{2!\,(4 - 2)!} = 6 \]
Consider these combinations, as illustrated in Table 3.2-1. Here, there are 4 items (n=4),
each of which can have two possible values (blank or “x”).
Item (n):   1   2   3   4
            x   x
            x       x
            x           x
                x   x
                x       x
                    x   x
Each set of 2 can be reversed, thus the number of permutations is double the number of
combinations for n=4 and x=2:
\[ {}_{n}P_{r} = \frac{n!}{(n - x)!} = \frac{4!}{(4 - 2)!} = 12 \]
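The counts for this example can be confirmed with the Python standard library, which provides both the counting functions and the enumerations themselves.

```python
from itertools import combinations, permutations
from math import comb, perm

n, x = 4, 2
items = range(1, n + 1)

print(comb(n, x))   # 6 combinations (order not relevant)
print(perm(n, x))   # 12 permutations (order relevant)

pairs = list(combinations(items, x))
ordered_pairs = list(permutations(items, x))
print(pairs)        # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```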
Mutually exclusive sets are those with no common members, shown in the Venn diagram
in Figure 3.2-2.
or
\[ P(a\ \text{and}\ b) = P(b)\,P(a \mid b) \]
where:
\[ P(a_1 \mid b) = \frac{P(b \mid a_1) \times P(a_1)}{\sum_{i} P(b \mid a_i) \times P(a_i)} \]
where:
The event set a is mutually exclusive. Therefore their probabilities can be added.
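The Bayes expression above can be evaluated as in the following sketch. The two mutually exclusive events, their prior probabilities and the conditional probabilities of b are hypothetical values chosen for illustration.

```python
def bayes_posteriors(priors, likelihoods):
    """Return P(a_i | b) for mutually exclusive, exhaustive events a_i.

    priors[i] is P(a_i) and likelihoods[i] is P(b | a_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                 # denominator: sum of P(b | a_i) P(a_i)
    return [j / total for j in joint]


# Hypothetical example: P(a1) = 0.3, P(a2) = 0.7, P(b|a1) = 0.9, P(b|a2) = 0.2.
posteriors = bayes_posteriors([0.3, 0.7], [0.9, 0.2])
print(posteriors)
```

Because the events a_i are mutually exclusive and exhaustive, the posterior probabilities sum to 1.0.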
If the failure rate is constant, the probability of survival for a specific cause is:
\[ R = e^{-\lambda t} \]
\[ \lambda_{total} = \lambda_1 + \lambda_2 + \lambda_3 + \cdots + \lambda_n \]
The above equations are relevant to a series configuration of items, each with a constant
failure rate. The fault tree representation of this configuration is shown in Figure 3.2-4.
Here, the system reliability is represented by a logical OR gate, since the failure of A or
B or C will cause system failure.
[Figure 3.2-4: Fault tree with an OR gate over items A, B and C]
The corresponding reliability block diagram representation for this scenario is shown in
Figure 3.2-5.
[Figure 3.2-5: Reliability block diagram with items A, B and C in series]
All possible outcomes for this example are shown in Table 3.2-2.
Note that each of these eight possible outcomes in the table is mutually exclusive, in
that there is only one possible way in which each of the eight can occur.
RA = 0.95
RB = 0.92
RC = 0.99
The reliability of the series configuration (i.e., the probability of exactly zero failures) of
the three items is:
R = RA RB RC
Now, suppose that several items must fail in order for the system to fail. This scenario is
represented by an AND gate in a fault tree representation, as is shown in Figure 3.2-6.
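[Figure 3.2-6: Fault tree with an AND gate over items A, B and C]
[Figure: corresponding reliability block diagram, with items A, B and C in parallel between the starting block and the ending block]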
All possible outcomes for this example are shown in Table 3.2-3.
RA = 0.95
RB = 0.92
RC = 0.99
R = 1 − (1 − RA )(1 − RB )(1 − RC )
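Both the series and the parallel results can be checked by enumerating the eight mutually exclusive outcomes for items A, B and C, using the reliability values given above. This is a minimal sketch; the variable and function names are illustrative.

```python
from itertools import product

R = {"A": 0.95, "B": 0.92, "C": 0.99}


def outcome_probability(states):
    """Probability of one outcome; states maps item name to True (works)."""
    p = 1.0
    for name, works in states.items():
        p *= R[name] if works else 1.0 - R[name]
    return p


# All eight combinations of working and failed items.
outcomes = [dict(zip(R, s)) for s in product([True, False], repeat=len(R))]

# Series (OR gate): the system works only if every item works.
series = sum(outcome_probability(o) for o in outcomes if all(o.values()))
# Parallel (AND gate): the system works if at least one item works.
parallel = sum(outcome_probability(o) for o in outcomes if any(o.values()))

print(series)     # approximately 0.86526, equal to RA * RB * RC
print(parallel)   # approximately 0.99996, equal to 1 - (1-RA)(1-RB)(1-RC)
```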
As an example of a slightly more complex situation, consider the fault tree representation
of a system in Figure 3.2-8.
[Figure 3.2-8: Fault tree with a TOP event and a combination of AND and OR gates over Events 1 through 7]
[Figure: corresponding reliability block diagram of Events 1 through 7, combining series and parallel elements]
Combining the series and parallel events yields the following reliability expression for
this configuration
As an example, let us assume that there are three units operating in parallel, two of which
are required for the system to perform adequately. If R=0.9 and Q=0.1, then the
probabilities associated with each possible combination of outcomes is summarized in
Table 3.2-4.
In this example, the probability of each combination of possible outcomes (in this case,
eight) is calculated. Note that the sum of the probabilities for all possible outcomes is
1.0, since each of the eight possibilities is mutually exclusive and their probabilities can,
therefore, be added. This approach of calculating the probability of every possible
outcome is always valid, regardless of whether the reliability values of each of the
elements are the same or not. For example, if two of the three units are required for the
system to perform adequately, the system will “pass” if there are either no failures or if
there is one failure, as shown below. This is summarized in Table 3.2-5.
It can be seen that the system will pass with outcomes 4, 6, 7 and 8. Outcomes 4, 6 and 7
correspond to exactly one failure (i.e., there are three ways in which one failure can
occur), and outcome 8 corresponds to exactly zero failures (there is only one way in
which this can occur).
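The enumeration approach described above can be sketched in a few lines. This is an illustrative sketch only, assuming independent units with R = 0.9 and Q = 0.1 as in the text; the order in which outcomes are generated is not meant to match the outcome numbering of the tables.

```python
from itertools import product

R, Q = 0.9, 0.1              # unit reliability and unreliability (from the text)
n_units, n_required = 3, 2   # two of three units must work

total = 0.0    # sum over all mutually exclusive outcomes (should be 1.0)
p_pass = 0.0   # sum over outcomes in which the system passes

for outcome in product((1, 0), repeat=n_units):  # 1 = unit works, 0 = unit fails
    p = 1.0
    for works in outcome:
        p *= R if works else Q
    total += p
    if sum(outcome) >= n_required:
        p_pass += p

print(f"sum of all outcome probabilities: {total:.3f}")
print(f"probability the system passes:    {p_pass:.3f}")
```

Because each unit carries its own probability inside the loop, the same enumeration works when the units have different reliability values.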
If the probability of failure of all of the units is the same and they are independent, then
the binomial or Poisson distributions can be used:
• If the metric used in the reliability analysis is the probability of failure, use the
binomial distribution
• If the metric is a failure rate, use the Poisson distribution
Since this example pertains to items with defined probabilities, the binomial distribution
applies. As defined previously:
$$F(x; r) = \sum_{x=0}^{r} \binom{n}{x} p^{x} q^{n-x} = \sum_{x=0}^{r} \frac{n!}{(n-x)!\,x!}\, p^{x} q^{n-x}$$
where:
The probability of exactly no failures (i.e., the first term in the above summation) is:
$$F(3,0) = \frac{n!}{(n-x)!\,x!}\, p^{x} q^{n-x} = \frac{3!}{(3-3)!\,3!}\,(0.9)^{3}\, q^{3-3} = 1 \times 0.729 = 0.729$$
The probability of exactly one failure (i.e. the second term in the above summation) is:
$$F(2,1) = \frac{n!}{(n-x)!\,x!}\, p^{x} q^{n-x} = \frac{3!}{(3-2)!\,2!}\,(0.9)^{2}\,(0.1)^{3-2} = 3 \times 0.81 \times 0.1 = 0.243$$
Therefore, the cumulative binomial expression for 0 or 1 failures (r = 0 or 1) is:
$$F(x; r) = \sum_{x=0}^{r} \frac{n!}{(n-x)!\,x!}\, p^{x} q^{n-x} = 0.729 + 0.243 = 0.972$$
Because the first term in the binomial probability expression is the number of
combinations of a specific number of failures (or survivals) occurring, the number of
combinations (as calculated by the first term) essentially adds the probabilities associated
with the mutually exclusive events.
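The cumulative binomial calculation above can be checked numerically. The sketch below assumes identical, independent units and counts failures directly; the function name is hypothetical.

```python
from math import comb

def prob_at_most_failures(n, r, p_success):
    """Probability of r or fewer failures among n independent units,
    each with success probability p_success (binomial distribution)."""
    q = 1.0 - p_success
    return sum(comb(n, k) * q**k * p_success**(n - k) for k in range(r + 1))

p_zero_failures = prob_at_most_failures(3, 0, 0.9)   # exactly zero failures
p_pass = prob_at_most_failures(3, 1, 0.9)            # zero or one failure
print(f"{p_zero_failures:.3f}")  # 0.729
print(f"{p_pass:.3f}")           # 0.972
```

The `comb(n, k)` factor is the number of mutually exclusive ways in which exactly k failures can occur, which is why it reproduces the result of summing the individual outcomes.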
3.3. Distributions
Reliability distributions are at the heart of a reliability model. They represent the
fundamental relationship between the reliability metric of interest (probability of failure,
failure rate, etc.) and the independent variable (TTF, cycles to failure, etc.). This
independent variable is called the “life unit”. Table 3.3-1 summarizes probability
distributions often used in reliability modeling, along with a description of their primary
uses.
Distribution | Type | Primary use
Exponential | Continuous | Used to describe the distribution of the time to failure when the failure rate is constant
Gamma | Continuous | Used to determine the distribution of the time by which a specified number of failures will occur when the failure rate is constant
Normal | Continuous | Used to describe the statistical mean of a sample taken from any population with a finite mean and variance. Often used to model parameter distributions. Rarely used for time to failure distributions.
Standard Normal | Continuous | The Standard Normal distribution (Z) is derived from the Normal for ease of analysis and interpretation (mean = 0; standard deviation = 1).
Student t | Continuous | Used to test for statistical significance of the difference between the means of two samples
F Distribution | Continuous | Used to test for statistical significance of differences between the variances of two samples
The following section discusses several of the distributions used in reliability assessment.
While the intent of this book is not to cover the statistical aspects of distributions, some
fundamental concepts are critical to the understanding of the basis for certain techniques
pertaining to reliability assessment, namely confidence level calculations and
demonstrating reliability levels. In particular, the binomial and Poisson distributions are
critical for these purposes.
The binomial distribution is used when there are only two outcomes, such as success or
failure, and the probability remains the same for all trials. The probability density
function (pdf) of the binomial distribution is:
$$f(x) = \binom{n}{x} p^{x} q^{(n-x)}$$
where:
$$\binom{n}{x} = \frac{n!}{(n-x)!\,x!}$$
and q = 1 – p.
The function “f(x)” is the probability of obtaining exactly “x” good items and “(n-x)” bad
items in a sample of “n” items, where “p” is the probability of obtaining a good item
(success) and “q” (or 1-p) is the probability of obtaining a bad item (failure).
The CDF, i.e., the probability of obtaining “r” or fewer successes in “n” trials, is given
by:
$$F(x; r) = \sum_{x=0}^{r} \binom{n}{x} p^{x} q^{n-x} = \sum_{x=0}^{r} \frac{n!}{(n-x)!\,x!}\, p^{x} q^{n-x}$$
The Poisson distribution is the limiting form of the binomial distribution as “n” becomes
very large and “p” becomes small. In practice, it is used to approximate the binomial
distribution when n ≥ 20 and p ≤ 0.05.
If events are Poisson-distributed, they occur at a constant average rate and the number of
events occurring in any given time interval is independent of the number of events
occurring in any other time interval. Since the TTF distribution for this situation is the
exponential (i.e., constant failure rate), the Poisson distribution will predict the number of
failures for specific values of time and failure rates. The number of failures in a given
time would be given by:
$$f(x) = \frac{a^{x} e^{-a}}{x!}$$
where “x” is the actual number of failures and “a” is the expected number of failures.
Since the expected number of failures (i.e., the expected value) for the exponential
distribution is “λt”, the Poisson expression becomes:
$$f(x) = \frac{(\lambda t)^{x} e^{-\lambda t}}{x!}$$
where:
λ= failure rate
t= length of time being considered
x= number of failures
The reliability function, R(t), or the probability of zero failures in time “t” is given by:
$$R(t) = \frac{(\lambda t)^{0} e^{-\lambda t}}{0!} = e^{-\lambda t}$$
There are many cases where the probability of experiencing a given number of failures (r)
or fewer is required. Examples are reliability demonstration, test planning, etc. For these
cases, the CDF is used:
$$R(x) = \sum_{x=0}^{r} \frac{(\lambda t)^{x} e^{-\lambda t}}{x!}$$
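The Poisson CDF above is straightforward to evaluate. The following is a minimal sketch; the failure rate and time used in the example are hypothetical values chosen so that the expected number of failures is 1.0.

```python
from math import exp, factorial

def poisson_cdf(r, failure_rate, t):
    """Probability of r or fewer failures in time t at a constant failure rate."""
    a = failure_rate * t   # expected number of failures
    return sum(a**x * exp(-a) / factorial(x) for x in range(r + 1))

# Hypothetical example: 0.001 failures per hour observed for 1000 hours
p_zero = poisson_cdf(0, 0.001, 1000.0)         # same as R(t) = exp(-lambda * t)
p_one_or_fewer = poisson_cdf(1, 0.001, 1000.0)
print(f"{p_zero:.4f}")          # 0.3679
print(f"{p_one_or_fewer:.4f}")  # 0.7358
```

With r = 0 the sum collapses to the single term exp(-λt), which is the reliability function given earlier.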
Continuous distributions are used when analyzing time to failure data, since time to
failure is a continuous variable. The most common distributions used in reliability
modeling to describe time to failure characteristics are the exponential, Weibull and
lognormal distributions. These are described in more detail in the following sections.
3.3.1. Exponential
The exponential distribution is most commonly applied in reliability to describe the times
to failure for repairable items. For non-repairable items, the Weibull distribution is
popular due to its flexibility. In general, the exponential distribution has numerous
applications in statistics, especially in reliability and queuing theory.
The exponential distribution describes products whose failure rates are the same
(constant) at each point in time (i.e., the “flat” portion of the reliability bathtub curve,
where failures occur randomly, by “chance”). This is also called a Poisson process. This
means that if an item has survived for "t" hours, the chance of it failing during the next
hour is the same as if it had just been placed in service. It is sometimes referred to as the
distribution with no memory. It is an appropriate distribution for complex systems that
are comprised of different electronic and electromechanical component types, the
individual failure rates of which may not follow an exponential distribution.
Since the exponential distribution is relatively easy to fit to data, it can be misapplied to
data sets that would be better described using a more complex distribution.
Table 3.3-2 lists the properties of the exponential distribution: the probability density
function (pdf), the cumulative distribution function (CDF), the mean, the variance, and
the standard deviation. Another useful property of continuous distributions is the
100pth percentile of a population, i.e., the age by which a proportion “p” of the population has failed.
The 50% point is the median life. The mean of the exponential distribution is equal to the
63rd percentile. Thus, if an item with a 1000 hour MTBF had to operate continuously for
1000 hours, there would only be a 0.37 probability of success.
As an example, consider a software system with a failure rate (λ) of 0.0025 failures per
processor hour. Its corresponding mean time between failure (MTBF) is calculated as:
$$\text{MTBF} = \theta = \frac{1}{\lambda} = \frac{1}{0.0025} = 400 \text{ processor hours}$$
The reliability function (i.e., the probability, or population fraction that survives beyond
age “t”) at 100 and 1000 processor hours is:
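Under the constant failure rate assumption, the reliability function is R(t) = exp(-λt), so the two values for this example can be computed directly. The sketch below uses the failure rate from the text; the function name is illustrative.

```python
from math import exp

failure_rate = 0.0025        # failures per processor hour (from the text)
mtbf = 1.0 / failure_rate    # 400 processor hours

def reliability(t, lam=failure_rate):
    """Exponential reliability: fraction of the population surviving beyond age t."""
    return exp(-lam * t)

for t in (100, 1000):
    print(f"R({t}) = {reliability(t):.4f}")
# R(100) = 0.7788
# R(1000) = 0.0821
```

Evaluating the same function at t equal to the MTBF (400 processor hours) gives approximately 0.37, consistent with the statement that the mean corresponds to the 63rd percentile.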
3.3.2. Weibull
The Weibull distribution is important in reliability modeling since it represents a general
distribution which can model a wide range of life characteristics. It can accommodate
increasing, decreasing and constant failure rates. Weibull analysis assumes that there has
been no repair of failed items and is often used to model single failure causes. The basic
features of the Weibull are:
• The scale (or characteristic life) parameter, α, is the value at which the 63rd percentile
of the distribution occurs
• The location parameter, γ (or gamma), is only used in the three parameter version
of the Weibull distribution, and is the value that represents the failure free period
for the item. If an item does not have a period where the probability of failure is
zero, then γ = 0 and the Weibull distribution becomes a two parameter distribution.
This third parameter is used when there are threshold effects.
• The parameters β, α and γ can be estimated using Weibull probability
paper or by using available Weibull software programs
• A multi-mode version of the Weibull distribution can be used to determine the
points on the bathtub curve where the failure rate is changing from decreasing, to
constant, to increasing
There are two general versions of the Weibull distribution, the first being the two-
parameter Weibull and the second being the three-parameter Weibull. The two-
parameter Weibull uses a shape parameter that reflects the tendency of the failure rate
(increasing, decreasing, or constant) and a scale parameter that reflects the characteristic
life of items being measured ( ≅ 63.2% of the population will have failed). The three-
parameter Weibull adds a location parameter used to represent the minimum life of the
population (e.g., a failure mode that does not immediately cause system failure at time
zero, such as a software algorithm whose degrading calculation accuracy does not cause
system failure until four calls to the algorithm have been made). Note that in most cases,
the location parameter is set to zero (failures assumed to start at time zero) and the
Weibull distribution reverts to the two-parameter case. The three parameter Weibull
distribution is also commonly used to characterize strength distributions (i.e., when using
a stress/strength model), where the γ-value represents a screen value, or proof test, in
which case this value of stress is applied to the item as a screen. It is also used to model
failure causes that are not initiated until a time equal to the gamma value has passed.
For much life data, the Weibull distribution is more suitable than the exponential, normal
and extreme value distributions, so it should be the distribution of first resort. The
characteristics of various shape parameters are summarized below:
• For shape parameter < 1.0, the Weibull pdf takes the form of the gamma
distribution (see Section 3.7.1.4) with a decreasing failure rate (i.e., infant
mortality)
• For shape parameter = 1.0, the failure rate is constant so that the Weibull pdf
takes the form of the simple exponential distribution with failure rate parameter
“λ” (the flat part of the reliability bathtub)
• For shape parameter = 2.0, the Weibull pdf takes the form of the Rayleigh
distribution (similar in shape to the lognormal), with a failure rate that is linearly
increasing with time (i.e., wearout). This is often used to model software reliability.
• For 3 < shape parameter < 4, the Weibull pdf approximately takes the form of the
Normal distribution
• For shape parameter > 10, the Weibull distribution is close to the shape of the
smallest extreme value distribution
The basic parameters of the 2-parameter Weibull distribution are presented in
Table 3.3-4. To have the mathematical expressions reflect a 3-parameter Weibull, replace all
values of “x” with “(x-x0)”, where x0 represents the γ value as described above.
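Shape parameter: β
Scale parameter: α
Failure rate:
$$\lambda(x) = \frac{\beta}{\alpha}\left(\frac{x}{\alpha}\right)^{\beta-1}$$
Mean:
$$\mu = \alpha\,\Gamma\left(1 + \frac{1}{\beta}\right)$$
Variance:
$$\sigma^{2} = \alpha^{2}\left[\Gamma\left(1 + \frac{2}{\beta}\right) - \Gamma^{2}\left(1 + \frac{1}{\beta}\right)\right]$$
Standard deviation:
$$\sigma = \alpha\left[\Gamma\left(1 + \frac{2}{\beta}\right) - \Gamma^{2}\left(1 + \frac{1}{\beta}\right)\right]^{0.5}$$
Reliability:
$$R(x) = e^{-\left(\frac{x}{\alpha}\right)^{\beta}}$$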
Figure 3.3-3 provides a graphical example of the Weibull distribution pdf with a
characteristic life of 1000 hours for a variety of shape parameters (β). Figures 3.3-4 and
3.3-5 illustrate the hazard rate and probability plot, respectively, for the same values of
the shape parameter.
Figure 3.3-4: Example Hazard Rate Plots for the Weibull Distribution
For an example, consider that very early in the system integration phase of a large
software development effort, there have been numerous failures due to software that have
caused the system to crash (the predominant system failure cause). Plotting the failure
times of this specific failure mode (other failure modes are ignored for now) on Weibull
probability paper resulted in a shape parameter value of 0.77 and a scale parameter value
of approximately 32 hours. Based on these parameters, the calculated reliability and
failure rate of the software at 10 system hours is expected to be:
$$\lambda(10) = \frac{0.77}{32}\left(\frac{10}{32}\right)^{0.77-1} = 0.0314 \text{ failures per hour}$$
$$R(10) = e^{-\left(\frac{10}{32}\right)^{0.77}} = 0.6647$$
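The two-parameter Weibull failure rate and reliability expressions can be evaluated as follows. This is a minimal sketch using the shape (0.77) and scale (32 hours) values from the example; the function names are illustrative.

```python
from math import exp

def weibull_hazard(x, shape, scale):
    """Two-parameter Weibull failure rate: (beta/alpha) * (x/alpha)**(beta - 1)."""
    return (shape / scale) * (x / scale) ** (shape - 1.0)

def weibull_reliability(x, shape, scale):
    """Two-parameter Weibull reliability: exp(-(x/alpha)**beta)."""
    return exp(-((x / scale) ** shape))

beta, alpha = 0.77, 32.0   # shape and scale from the software example
lam10 = weibull_hazard(10.0, beta, alpha)
r10 = weibull_reliability(10.0, beta, alpha)
print(f"failure rate at 10 hours: {lam10:.4f} failures per hour")  # 0.0314
print(f"reliability at 10 hours:  {r10:.4f}")                       # 0.6647
```

Because the shape parameter is below 1.0, calling `weibull_hazard` at later times returns smaller values, reflecting the decreasing failure rate expected early in system integration.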
3.3.3. Lognormal
The lognormal distribution is the distribution of a random variable whose natural
logarithm is distributed normally; in other words, it is the normal distribution with “ln t”
as the independent variable. The probability density function is
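$$f(t) = \frac{1}{\sigma t \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln(t)-\mu}{\sigma}\right)^{2}}$$
and the standard deviation of t is:
$$\left(e^{2\mu+2\sigma^{2}} - e^{2\mu+\sigma^{2}}\right)^{\frac{1}{2}}$$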
where μ and σ are the mean and standard deviation (SD), respectively, of ln (t).
The lognormal distribution is used in the reliability analysis of semiconductors and the
fatigue life of certain types of mechanical components. This distribution is also
commonly used in maintainability analysis.
$$F(t) = \int_{0}^{t} \frac{1}{t\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln(t)-\mu}{\sigma}\right)^{2}\right] dt$$
$$F(t) = P\left(Z \le \frac{\ln(t)-\mu}{\sigma}\right)$$
$$R(t) = P\left(Z > \frac{\ln(t)-\mu}{\sigma}\right)$$
$$h(t) = \frac{f(t)}{R(t)} = \frac{\phi\left(\frac{\ln(t)-\mu}{\sigma}\right)}{t\sigma R(t)}$$
where φ is the standard normal probability density function, and μ and σ are the mean and
standard deviation of the natural logarithm of the random variable, t.
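The lognormal expressions for F(t), R(t) and h(t) reduce to standard normal evaluations at z = (ln(t) − μ)/σ. The sketch below uses Python's standard library; the parameter values (a median life of 1000 hours and σ = 1) are assumptions chosen for illustration, not values taken from the figures.

```python
from math import log
from statistics import NormalDist

_Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def lognormal_cdf(t, mu, sigma):
    """F(t) = P(Z <= (ln t - mu) / sigma)."""
    return _Z.cdf((log(t) - mu) / sigma)

def lognormal_reliability(t, mu, sigma):
    """R(t) = P(Z > (ln t - mu) / sigma)."""
    return 1.0 - lognormal_cdf(t, mu, sigma)

def lognormal_hazard(t, mu, sigma):
    """h(t) = phi(z) / (t * sigma * R(t)), with phi the standard normal pdf."""
    z = (log(t) - mu) / sigma
    return _Z.pdf(z) / (t * sigma * lognormal_reliability(t, mu, sigma))

# Hypothetical parameters: mu = ln(1000) puts the median life at 1000 hours
mu, sigma = log(1000.0), 1.0
f_median = lognormal_cdf(1000.0, mu, sigma)
h_median = lognormal_hazard(1000.0, mu, sigma)
print(f"F(1000) = {f_median:.3f}")
print(f"h(1000) = {h_median:.6f} failures per hour")
```

Note that μ is the mean of ln(t), so exp(μ) is the median of t rather than its mean.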
Figures 3.3-6 through 3.3-8 illustrate the lognormal distribution for a mean value of 1000
and standard deviations of 0.1, 1 and 3. Shown are the pdf, the hazard rate, and the
cumulative unreliability function, F(t), respectively.
Figure 3.3-7: Example Hazard Rate Plots for the Lognormal Distribution
3.4. References
1. Lyu, M.R. (Editor), “Handbook of Software Reliability Engineering”, McGraw-
Hill, April 1996, ISBN 0070394008
2. Musa, J.D., “Software Reliability Engineering: More Reliable Software, Faster
Development and Testing”, McGraw-Hill, July 1998, ISBN 0079132715
3. Nelson, W., “Applied Life Data Analysis”, John Wiley & Sons, 1982, ISBN
0471094587
4. Musa, J.D.; Iannino, A.; and Okumoto, K.; “Software Reliability: Measurement,
Prediction, Application”, McGraw-Hill, May 1987, ISBN 007044093X
5. Montgomery, D.C., “Introduction to Statistical Quality Control – 2nd Edition”,
John Wiley & Sons, 1991, ISBN 047151988X
6. Shooman, M., "Probabilistic Reliability, An Engineering Approach," McGraw-
Hill, 1968.
7. Abernethy, Dr. R.B., "The New Weibull Handbook," Gulf Publishing Co., 1994.
The central tenet of DOE is that one or more of a product’s or system’s responses are observed as
a function of pertinent factors that may affect that response, as illustrated in Figure 4.0-1.
At the heart of this technique is the product/system or process under analysis. This is the
item whose behavior we want to quantify. The independent variables are called
the factors. These represent the inputs to the product/system or process and are the things
that can potentially change how the product behaves. The output of the DOE activity is
the response, which is a measure of how well the product/system or process behaves.
The levels for each factor are varied, tests are performed, and the resulting response is
measured. The resultant data is analyzed to quantify the item or process response as a
function of the factor levels. The generic steps in applying DOE to generate life models
are:
drawback is that it cannot detect non-linearity in the relationship between the factor and
the response. For example, consider the relationship in Figure 4.3-1.
In this example, the levels “a” and “d” represent the operating space of the product. The
conclusions will be very different, depending on the levels chosen within this operating
space. For example, if levels “a” and “b” are chosen, the conclusion will be that there is
a strong positive relationship; if levels “b” and “d” are chosen, the conclusion will be that
the factor has no effect on the response; and if “a” and “d” are chosen, which is a typical
approach, the conclusion will be that there is a moderate relationship. These results are
summarized in Table 4.3-1.
Levels | Conclusion
a-b | High positive relationship
c-d | High negative relationship
b-d | No relationship
a-d | Moderate positive relationship
The number of levels for each factor should be chosen, in part, based on knowledge of
the physics of the manner in which the factor affects the response. Otherwise, there can
be large uncertainty in using the resulting model to interpolate or extrapolate the response
behavior as a function of the factor. For example, if the response under analysis is
corrosion, and the relationship between the factor, temperature, and the corrosion rate is
expected to be governed by the Arrhenius relationship over the entire operating space,
then a two-level temperature test may be appropriate. If, however, it is hypothesized that
there is a temperature threshold within the operating space, then more than two levels
may be required.
In this example, there are three factors to be assessed, A, B and C, represented by the
three right-hand columns. Each factor has two levels, a “+” indicating the high level and
a “–” indicating the low level. This experiment has four runs, each one representing a
treatment. A treatment refers to the combination of levels used in the tests.
Repetition and replication are techniques used to increase the number of runs. The
advantage of increasing the number of runs is that obtaining multiple responses with
exactly the same factor levels is valuable in quantifying the amount of variability and
error in the measurements obtained. Repetition is the practice of repeating the same run
sequentially. Replication is the practice of repeating a set of runs sequentially. Both
practices will result in multiple responses for a given set of factor levels, but the
advantage of replication over repetition is that it is better able to quantify measurement
error when there is a gradually changing parameter in the test or
measurement system.
The full-factorial approach will be used as an example for illustrating the concepts of data
analysis, followed by a discussion of other approaches.
Run A B C R (response)
1 + + + R1
2 + + - R2
3 + - + R3
4 + - - R4
5 - + + R5
6 - + - R6
7 - - + R7
8 - - - R8
The number of required runs is calculated as $y^x$, where “y” is the number of levels per factor
(2, for this example), and “x” is the number of factors (3). In Table 4.4-1, then, the number
of runs is $2^3 = 8$.
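A full-factorial run list can be generated mechanically as the Cartesian product of the factor levels. The following is a minimal sketch; the function name and the dictionary layout are illustrative choices.

```python
from itertools import product

def full_factorial(factors, levels=("+", "-")):
    """Return one dict per run, covering every combination of factor levels."""
    return [dict(zip(factors, combo))
            for combo in product(levels, repeat=len(factors))]

runs = full_factorial(["A", "B", "C"])
print("Run A B C")
for number, run in enumerate(runs, start=1):
    print(number, run["A"], run["B"], run["C"])
print("number of runs:", len(runs))  # 2**3 = 8
```

With the levels listed as ("+", "-"), the runs come out in the same order as the rows of the full-factorial table above. Adding a fourth two-level factor doubles the list to 16 runs, which is why full-factorial tests quickly become impractical.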
A full-factorial array can be scaled down such that the resultant array retains the
characteristic of orthogonality. Such arrays are referred to as fractional factorial arrays, since
they require only a fraction of the full-factorial runs, yet are still orthogonal. The naming
convention for these arrays is determined from:
$$L_a(y^x)$$
where:
In the previous examples, “y” and “x” were the number of levels per factor and the number
of factors, respectively. In the standard DOE nomenclature, “La” refers to the
number of runs. For example, a seven-factor, two-level experiment for which there will
be eight runs is shown in Figure 4.4-3.
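One common way to build an eight-run, seven-factor, two-level orthogonal array is to start from a full factorial in three base columns and add every product of those columns. The sketch below follows that construction with +1/-1 coding; the column order it produces is an assumption and may differ from the layout of the array in the figure.

```python
from itertools import combinations, product

def l8_array():
    """Eight runs by seven two-level columns: a, b, c and all of their products."""
    rows = []
    for a, b, c in product((1, -1), repeat=3):
        rows.append((a, b, c, a * b, a * c, b * c, a * b * c))
    return rows

array = l8_array()
columns = list(zip(*array))

# Balanced: each column contains the high and low level equally often
balanced = all(sum(col) == 0 for col in columns)
# Orthogonal: every pair of columns has a zero dot product
orthogonal = all(sum(u * v for u, v in zip(c1, c2)) == 0
                 for c1, c2 in combinations(columns, 2))

for row in array:
    print(" ".join("+" if level > 0 else "-" for level in row))
print("balanced:", balanced, "orthogonal:", orthogonal)
```

The array is saturated: all seven columns are assigned to factors, so no columns remain free for estimating interactions.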
Another critical element that must be considered when defining reliability tests is the
potential interactions between factors. Everything discussed thus far in this section has
assumed that the effects of each of the factors are independent of each other. In practice,
there are often interactions between factors that must be accounted for. Graphical
representations of potential interactions are shown in Figure 4.4-4. Referring to the
Figure, if the responses for the two levels of the “B-factor” plotted against the two levels
of the “A-factor” are parallel, then this is an indication that there is no interaction
between the two factors. This is shown on the top left. In other words, the relative
magnitudes of the B-response are independent of the level of “A”. If, however, the
plots of the same factors resemble the plot on the top right, then this is an indication that
there is a strong interaction between factors A and B. In this example, the levels of “A”
change the entire relationship between the B-levels and the response. The plot on the
bottom indicates that there is a mild interaction between the two factors.
If the potential interactions are not accounted for in the reliability test plan, the risk is that
the effects of the factors cannot be deconvolved (separated) from the interactions between
the factors. There are many DOE test plans and tools that assist in identifying the
capability of various plans to identify main effects and interactions.
A detailed treatment of DOE principles is beyond the scope of this book, as this has been
done extensively in the literature, but it is important to understand the impact of some of
the principles as they pertain to reliability testing.
Resolution is a term that describes the degree to which the main effects of factors are
aliased, or confounded, with the interactions amongst factors. In general, the resolution
number of a design is one more than the smallest order interaction with which some main
effects are aliased. For example, if some main effects are confounded with some two-factor
interactions, the resolution number of the DOE is 3. Since full-factorial designs test the
response of every possible combination of factors, there is no confounding and, therefore,
they have infinite resolution. As stated previously, since the implementation of a full-
factorial test is often not practical, weaker tests are often necessary. The key is to select
the aliasing structure of the test such that the actual critical interactions can be
deconvolved from the main effects.
Another possible plan would be a half factorial, also shown in Table 4.4-2. Notice that,
for the half-factorial design, the temperature-humidity (T*H) interaction (i.e., the product
of the two) is the same as for ionic contamination (I). Also, the T*I interaction is the
same as H, and the H*I interaction is the same as T. Therefore, this Resolution 3 plan is
incapable of deconvolving the main effects of T, H or I from the interactions of the other
two.
From physics, we know that both humidity and ionic contamination are required for
corrosion. Therefore, the fact that H*I is the same as T (i.e., they are confounded) is
unacceptable, since we would not be able to determine if the lifetime is governed by
temperature, or the combination of humidity and ionic contamination. Therefore, we
need a better DOE test plan. The full-factorial plan would be the best, if it could be
executed, since none of this confounding exists. For the full-factorial plan, notice that
none of the interaction terms are the same as the main effects.
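The confounding described above can be demonstrated by comparing interaction columns with main-effect columns. The sketch below assumes +1/-1 coding and a half-factorial plan generated by setting I = T × H, which matches the aliasing pattern described in the text; the function and variable names are hypothetical.

```python
from itertools import product

def alias_map(runs):
    """Return {interaction: main effect} for each two-factor interaction column
    that is identical to a main-effect column over the given runs."""
    main = {
        "T": [t for t, h, i in runs],
        "H": [h for t, h, i in runs],
        "I": [i for t, h, i in runs],
    }
    interactions = {
        "T*H": [t * h for t, h, i in runs],
        "T*I": [t * i for t, h, i in runs],
        "H*I": [h * i for t, h, i in runs],
    }
    return {
        name: effect
        for name, column in interactions.items()
        for effect, effect_column in main.items()
        if column == effect_column
    }

full_plan = list(product((1, -1), repeat=3))                        # 8 runs
half_plan = [(t, h, t * h) for t, h in product((1, -1), repeat=2)]  # 4 runs

half_aliases = alias_map(half_plan)
full_aliases = alias_map(full_plan)
print("half factorial:", half_aliases)
print("full factorial:", full_aliases)
```

In the half-factorial plan every two-factor interaction column coincides with a main-effect column, so the two cannot be separated; in the full-factorial plan the map is empty.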
If we were to actually model this failure cause based on the tests defined in these plans,
the general form of the reliability model may be based on the two parameter Weibull
distribution, which is:
$$R = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$
where:
The characteristic life is then developed as a function of the applicable variables. The
model in this case is:
$$\alpha = e^{\alpha_0}\, e^{\frac{\alpha_1}{T}}\, H^{\alpha_2}\, I^{\alpha_3}\, (HI)^{\alpha_4}$$
where:
All model parameters, $\alpha_0$ through $\alpha_4$, could be adequately quantified with the
full-factorial design, but not with the half-factorial.
There are many other potential test plans that would be adequate, providing that the
required model variables can be quantified and are not confounded with one another
(Reference 1).
experiment must be kept as constant as possible. Make sure that all results are fully
documented. This also must include any anomalies or potential sources of error that may
have occurred. The order of the runs must be kept intact, per the experimental plan. If
repetition is used, the same run or treatment is repeated sequentially. If replication is
used, then the set of runs to be repeated have been identified in the experimental design.
For in-situ measurements, careful time stamping of the data is required. Life models to
be developed from the collected data often represent parameter degradation data and not
actual TTF data. As a result, a model of degradation rate as a function of time may be
used as the response to predict failure times. All test samples should be carefully stored,
as root-cause failure analysis may be required at some future time.
The means can be pictorially represented, as shown in Figure 4.6-1. This is a convenient
way to illustrate the sensitivity of the response to each factor. Data analysis techniques
more sophisticated than the analysis of means shown here are also often used, and there
are many good software tools available to aid in this analysis. However, if a balanced,
orthogonal design is used, analysis of means can be very straightforward and effective.
In the event that it is known that the response does not behave linearly with the factor
level, the response can sometimes be linearized by making the appropriate data
transformation. For example, if the response under analysis is corrosion that is governed
by the Arrhenius relationship over the entire operating space, then the response, life in
this case, would be exponential with temperature. However, if the transformation shown
in Figure 4.6-2 is applied, the response will be linear. This is especially useful when a
goal of the analysis is to determine the activation energy.
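As an illustration of such a linearizing transformation, the sketch below assumes the common Arrhenius form life = A·exp(Ea/(kT)), so that ln(life) is linear in 1/T with slope Ea/k. The activation energy, pre-factor, temperatures and lives are synthetic values invented for the demonstration; they are not test data.

```python
from math import exp, log

BOLTZMANN_EV = 8.617e-5   # Boltzmann constant in eV/K

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

true_ea = 0.7                                  # eV, hypothetical
temps_k = [323.15, 348.15, 373.15, 398.15]     # test temperatures in kelvin
lives = [1e-6 * exp(true_ea / (BOLTZMANN_EV * T)) for T in temps_k]  # noise-free

xs = [1.0 / T for T in temps_k]      # transformed factor: reciprocal temperature
ys = [log(life) for life in lives]   # transformed response: ln(life)
slope, intercept = fit_line(xs, ys)
ea_estimate = slope * BOLTZMANN_EV   # slope equals Ea / k
print(f"estimated activation energy: {ea_estimate:.3f} eV")
```

With real test data the points will scatter about the line, and the quality of the straight-line fit gives an indication of whether a single Arrhenius relationship holds over the operating space.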
After the data has been analyzed, the optimal combination of factor levels can be
determined. The goal of this approach is to determine the factor levels that will result in
minimal variability of the product response and maximum probability of the product
meeting its requirements. This is the payoff in this approach, since it results in a more
robust design.
In this example, if the desirable response is high, then a high value of A and B with a low
value of C provides the best response, as shown in Figure 4.6-3.
4.8. References
1. William Y. Fowlkes and Clyde M. Creveling, “Engineering Methods For Robust
Product Design: Using Taguchi Methods In Technology And Product
Development,”
If all samples are tested to failure, or have been tested in exactly the same manner, then
traditional statistical analysis techniques (like regression, F-tests, t-tests, ANOVA, etc.)
will generally suffice for reliability modeling purposes. However, most real world cases
include censored data, unbalanced datasets, uncertain failure times, etc. It is these cases
where life modeling techniques are most effectively used.
1. TTF distributions
2. Acceleration factors (which provide a relative value of the reliability parameter as
a function of the stress level)
Each of these two major elements is discussed in the following sections, which present
more detailed information regarding development of the models after the life data has
been obtained.
In cases where the failure mechanism is random in nature, the exponential is applicable.
Parameter estimation provides a means for the effective use of data to aid in life
modeling and the estimation of constants appearing in those models. The constants that
appear in distribution functions (e.g., “p” in the binomial distribution; “λ” in the Poisson
distribution; “μ” and “σ” in the Normal distribution; “λ” or “θ” in the exponential
distribution; and “α” and “β” in the Weibull distribution) are called parameters. The true
value of the parameters from a given distribution may not be known or measurable, so it
becomes more practical to obtain approximate or estimated values of these parameters
from a sample of data. In the larger context, parameter estimation is typically applied to
one of the following scenarios.
Point estimation is frequently used in reliability analysis to quantify parameters like the
failure rate in the exponential distribution.
Formally, a statistic, Y, is a function of random variables that does not depend on any
unknown parameter:
$$Y = u(X_1, \ldots, X_n)$$
Let “θ” denote the parameter to be estimated. Consider functions w(Y) of the statistic,
which might serve as point estimates of the parameter. Since w(Y) is a random variable,
it has a probability distribution. Statisticians have defined certain properties for assessing
the quality of estimators. These properties are defined in terms of this probability
distribution.
A loss function, L[θ,w(Y)], assigns a number to the deviation between a parameter and an
estimator. A typical loss function is the square of the difference, and is the value used in
least squares regression:
An unbiased estimator that minimizes the risk function for the above loss function is
referred to as a minimum variance unbiased estimator. An estimator that minimizes this
risk function uniformly in θ is called a minimum mean squared error estimator. Table 5.2-1
summarizes the terms most commonly used in parameter estimation.
Least Squares
Description: Least square estimators may be better when small or medium sample sizes are involved, since they may have smaller bias, or approach normality faster. Least squares estimation minimizes the variance around the estimated parameter. The technique is familiar to those comfortable with linear regression modeling.
Steps:
• Express the sum of the squared distance between actual and predicted values as a function of parameter estimates
• Determine the parameter estimators that minimize the sum of this squared distance (typically using differential calculus)

Method of Moments
Description: This technique works by equating statistical sample moments calculated from a data set to actual population moments. Population moments are determined by the parameters to be estimated. As many moments are equated as there are parameters to be estimated. In most cases of practical interest, these can be found in closed form, but their theoretical justification is not as rigorous as for other parameter estimation methods.
Steps:
• Determine the distribution whose parameters are to be estimated (suppose there are “n” parameters to be estimated)
• Find the first “n” moments of the distribution, either around zero, or around the mean for moments higher than the first
• Equate these moments to sample moments
• Solve for the parameters as a function of the realizations of the random variables in the sample.
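As an illustration of the method of moments steps, the sketch below estimates the two parameters of a gamma distribution, for which the mean is (shape × scale) and the variance is (shape × scale²). The choice of the gamma distribution and the sample values are assumptions made purely for the example.

```python
from statistics import fmean, pvariance

def gamma_method_of_moments(sample):
    """Equate the sample mean and variance to the gamma population moments
    (mean = shape * scale, variance = shape * scale**2) and solve."""
    mean = fmean(sample)
    variance = pvariance(sample, mu=mean)   # second moment about the mean
    shape = mean * mean / variance
    scale = variance / mean
    return shape, scale

times = [12.0, 30.0, 45.0, 70.0, 93.0]   # hypothetical times to failure (hours)
shape, scale = gamma_method_of_moments(times)
print(f"shape = {shape:.4f}, scale = {scale:.3f}")
```

Two parameters require two moments, so exactly two equations are solved; the closed-form solution here is what makes the method attractive for quick estimates.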
The parameter estimates shown in Table 5.2-3 are rather simplistic and easy to use, and
often provide adequate estimates. There are more rigorous techniques available that do a
better, more accurate job of estimating parameters, but their complexity requires the use
of software tools.
The most popular techniques used in reliability modeling are least squares regression and
maximum likelihood. These are described in the next sections.
For example, if a two parameter Weibull distribution is used (Step 1), the linear transform
is performed as follows (Step 2):
$$R = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$
Taking the natural log (base e) of both sides, twice, yields:
$$\ln[-\ln(R)] = \beta \ln(t) - \beta \ln(\alpha)$$
This is now a linear model with ln(t) being the independent variable, β being the slope,
and −β ln(α) being the intercept.
Step 3 calculates the plotting position (i.e., the estimated percentage of the population
that has failed by each failure time) for each data point. A common way to accomplish this
is by using Benard’s formula:
$$F = \frac{i - 0.3}{N + 0.4}$$
where “i” is the failure order number (rank) and “N” is the total number of items.
For example, if there are ten items, the value of F after the second failure is:
$$F = \frac{i - 0.3}{N + 0.4} = \frac{2 - 0.3}{10 + 0.4} = 0.163$$
The value of F is calculated for each failure. These pairs of x-y points are the values to
which a linear model will be fit.
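Taking x = ln(t) and y = ln[−ln(1 − F)] for each failure, the least squares estimates are:
$$\hat{\beta} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^{2}}$$
$$-\hat{\beta}\ln(\hat{\alpha}) = \bar{y} - \hat{\beta}\,\bar{x} \;\Rightarrow\; \ln(\hat{\alpha}) = \bar{x} - \frac{\bar{y}}{\hat{\beta}}$$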
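The regression steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the data set is assumed to be complete (no censoring), and the transform assumed is x = ln(t), y = ln[−ln(1 − F)], so that the fitted intercept equals −β ln(α).

```python
import math

def benard_median_rank(i, n):
    """Benard's approximation to the median rank of the i-th of n failures."""
    return (i - 0.3) / (n + 0.4)

def weibull_rank_regression(failure_times):
    """Least squares fit of a two-parameter Weibull to complete failure data.

    Returns (alpha, beta). Assumes x = ln(t) and y = ln(-ln(1 - F)).
    """
    times = sorted(failure_times)
    n = len(times)
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(1.0 - benard_median_rank(i, n)))
          for i in range(1, n + 1)]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx                      # slope of the fitted line
    intercept = y_bar - beta * x_bar      # equals -beta * ln(alpha)
    alpha = math.exp(-intercept / beta)
    return alpha, beta
```

Calling `weibull_rank_regression([16, 34, 53, 75, 93, 120])` (illustrative times only) returns the characteristic life and shape parameter of the fitted line.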
Times-to-failure data are seldom complete. A complete sample is one in which all items
observed have failed during a given observation period, and all the failure times are
known. When “n” items are placed on test or observed in the field, whether with
replacement or not, it is sometimes necessary (due to the long life of certain components)
to terminate the test and perform the reliability analysis based on the observed data up to
the time of termination.
There are two basic types of possible life observation termination. The first type is time
terminated (which results in Type I right censored data), and the second is failure
terminated (resulting in Type II right-censored data). In the time-terminated life
observation, “n” units are monitored and the observation is terminated after a
predetermined time has elapsed. The number of items that failed during the observation
time, and the corresponding TTF of each component, are recorded. In the failure-
terminated life observations, “n” units are monitored and the observation is terminated
when a predetermined number of component failures have occurred. The times to failure
of all failed items, including the time at which the last failure occurred, are recorded.
The MLE method is one of the most widely used methods for estimating reliability model
parameters. In the first part of this section, a brief historical review of the MLE method
is presented. The likelihood function concept for different types of failure data, as well
as the mathematical approach to solve likelihood equations, is presented next. The last
part of this section reviews the basic equations of the MLE approach for specific case
studies, including exponential, Weibull and lognormal distribution likelihood functions.
Fisher used the conditional probability of occurrence for each failure event as a measure
for his mathematical curve fitting. He argued that, using a subjective assumption about
the TTF model, one can characterize the probability of each failure event, conditioned to
the model parameter. He then derived the posterior probability of failure events in a
Bayesian framework using a uniform distribution as a prior for the model parameters. He
later calculated the best estimate for model parameters by maximizing the posterior.
Note that, in a Bayesian framework, a uniform distribution cancels out from the equation
since it is a constant. The normalizing factor in the denominator is also a constant, which
has no impact when one is interested in the extremes of the function. Therefore, this
method was eventually called the maximum likelihood estimator, because it is basically
the likelihood function that is maximized in this process.
Let “[f (t) × dt]” be the chance of a failure observation falling within the range “dt”.
Fisher introduces the method of maximum likelihood by claiming that the factor “dt” is
independent of the theoretical curve, and the probability is proportional to “f(t)”.
Using the notion of the conditional probability density function, f(t|θ), helps to integrate
many different types of failure data into the likelihood function. For example, the
likelihood of the right-censored observations will be the reliability function, because this
is the probability that the component remains reliable up to the censored time. Therefore,
the likelihood of “M” independent right-censored observations will be the product of the
reliability functions as illustrated in the second equation above. For left-censored times
(that is, the time before which a failure has occurred) the likelihood is also the definition
of probability of failure at that time. In the case of many left-censored times, the total
likelihood will be the multiplication of the likelihood values of individual components
using the independency assumption, as shown in the third equation on the previous page.
For interval times, the likelihood is the probability of having one failure in that interval
(which is basically the integral of the probability density function between the upper and
lower bounds of the interval). This is simply the difference between the cumulative
distribution function when it is evaluated at the upper and lower bounds, respectively, as
shown in the final equation from the previous page. Assuming the independency of
failure or censored time events, these likelihood functions can be multiplied with the
likelihood of the complete failure data in order to build the likelihood function for the
entire population.
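The four contributions described above can be collected into a single expression. The counts N_C, N_R, N_L and N_I (complete, right-censored, left-censored and interval observations) and the interval bounds T_a < T_b are notation assumed here for illustration; f, R and F are the pdf, reliability function and CDF conditioned on the parameter vector θ:

$$L(\theta) = \prod_{i=1}^{N_C} f(t_i|\theta)\;\prod_{j=1}^{N_R} R(T_j|\theta)\;\prod_{k=1}^{N_L} F(T_k|\theta)\;\prod_{l=1}^{N_I}\left[F(T_{b,l}|\theta) - F(T_{a,l}|\theta)\right]$$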
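The practical way to find the modes of the likelihood function is differentiation. A
multivariable function has its maximum value at a point at which the first-order partial
derivative of the function with respect to each variable is zero, as shown below:
$$\Lambda = \ln(L) \;\Rightarrow\; \begin{cases} \dfrac{\partial\Lambda}{\partial\theta_1} = 0 \\ \dfrac{\partial\Lambda}{\partial\theta_2} = 0 \\ \quad\vdots \\ \dfrac{\partial\Lambda}{\partial\theta_n} = 0 \end{cases} \;\Rightarrow\; \hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_n)$$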
where:
Note that the likelihood function, as explained before, is a product of terms. This makes
differentiation very cumbersome. The likelihood is always positive, so one may take the
natural logarithm of this function to convert the products to sums. This significantly
reduces the complexity of the differentiation, while still providing the same best
estimates for the mode of the likelihood function. Having constructed the likelihood
function, L, as explained before, one may set up the equations that need to be solved for
the modes of this function.
In the following sections, three examples of the likelihood function, representing the
exponential, Weibull and lognormal distributions, are presented for further clarification.
$$L = \sum_{i=1}^{F} N_i \ln\left(\lambda e^{-\lambda t_i}\right) - \sum_{j=1}^{S} N_j \lambda T_j$$
The only variable in this equation is λ. In the MLE method, the best estimate of λ is
found by maximizing the likelihood (or log-likelihood) function. The next equation
gives the condition that this best estimate must satisfy. The uncertainty of the estimate
can be expressed as confidence bounds on λ, which are calculated using the
corresponding local Fisher information matrix. This step will be explained in detail later.
$$\frac{\partial L}{\partial\lambda} = \sum_{i=1}^{F} N_i\left(\frac{1}{\lambda} - t_i\right) - \sum_{j=1}^{S} N_j T_j = 0$$
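Solving this equation for λ gives a closed-form estimate: the number of failures divided by the total accumulated time of the failed and suspended units. A minimal sketch follows; the function name and the (count, time) input layout are assumptions made for illustration.

```python
def exponential_mle(failures, suspensions=()):
    """Closed-form MLE of the exponential failure rate.

    failures:    iterable of (N_i, t_i) pairs, N_i units failing at time t_i
    suspensions: iterable of (N_j, T_j) pairs, N_j units suspended at time T_j
    """
    failures = list(failures)
    suspensions = list(suspensions)
    n_failed = sum(n for n, _ in failures)
    total_time = (sum(n * t for n, t in failures)
                  + sum(n * t for n, t in suspensions))
    return n_failed / total_time
```

For three failures at 100, 200 and 300 hours plus two units suspended at 500 hours (hypothetical data), the estimate is 3 failures over 1,600 unit-hours.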
complete (uncensored) data, is not a trivial problem. It can be easily shown that, in the
situation where all “r” units out of “n” observed units fail, the log-likelihood estimates of
the Weibull distribution are represented by the equation below:
$$L = \sum_{i=1}^{F} N_i \ln\left[\frac{\beta}{\alpha}\left(\frac{t_i}{\alpha}\right)^{\beta-1} e^{-\left(\frac{t_i}{\alpha}\right)^{\beta}}\right] - \sum_{j=1}^{S} N_j\left(\frac{T_j}{\alpha}\right)^{\beta}$$
The best estimates of the parameters “α” and “β” are found by setting the first derivatives
of the log-likelihood function to zero, as shown in the next equations. The best estimates
are the solution of this set of two equations in two unknowns:
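$$\frac{\partial L}{\partial\beta} = \frac{1}{\beta}\sum_{i=1}^{F} N_i + \sum_{i=1}^{F} N_i \ln\left(\frac{t_i}{\alpha}\right) - \sum_{i=1}^{F} N_i\left(\frac{t_i}{\alpha}\right)^{\beta}\ln\left(\frac{t_i}{\alpha}\right) - \sum_{j=1}^{S} N_j\left(\frac{T_j}{\alpha}\right)^{\beta}\ln\left(\frac{T_j}{\alpha}\right) = 0$$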
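$$\frac{\partial L}{\partial\alpha} = -\frac{\beta}{\alpha}\sum_{i=1}^{F} N_i + \frac{\beta}{\alpha}\sum_{i=1}^{F} N_i\left(\frac{t_i}{\alpha}\right)^{\beta} + \frac{\beta}{\alpha}\sum_{j=1}^{S} N_j\left(\frac{T_j}{\alpha}\right)^{\beta} = 0$$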
Note that, despite the complexity of the mathematical representations of the likelihood
and log-likelihood functions and their derivatives, the basic concept is fairly simple. In
advanced numerical approaches using computers, the entire mathematical derivation is
done through numerical simulations using predefined tool boxes and library functions.
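As a rough sketch of such a numerical solution, the two Weibull likelihood equations can be reduced to one: setting ∂L/∂α = 0 gives α^β in closed form, and substituting it into ∂L/∂β = 0 leaves a single equation in β that can be solved by bisection. The function name, the search interval for β and the (count, time) input layout are assumptions made for this illustration; commercial tools use more robust optimizers.

```python
import math

def weibull_mle(failures, suspensions=(), lo=0.01, hi=50.0, tol=1e-10):
    """Return (alpha, beta) that zero both Weibull log-likelihood derivatives.

    failures, suspensions: lists of (count, time) pairs with time > 0.
    lo, hi: assumed bracket for beta (very large times may overflow at hi).
    """
    failures = list(failures)
    data = failures + list(suspensions)
    n_failed = sum(n for n, _ in failures)
    mean_log = sum(n * math.log(t) for n, t in failures) / n_failed

    def profile_score(beta):
        # dL/dbeta divided by the number of failures, with alpha eliminated
        s0 = sum(n * t ** beta for n, t in data)
        s1 = sum(n * t ** beta * math.log(t) for n, t in data)
        return 1.0 / beta + mean_log - s1 / s0

    # profile_score decreases as beta grows, so bisect on its sign
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if profile_score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    alpha = (sum(n * t ** beta for n, t in data) / n_failed) ** (1.0 / beta)
    return alpha, beta
```

Bisection is used here only because it needs no derivatives and no third-party libraries; any root finder or general-purpose optimizer applied to the log-likelihood should lead to the same estimates.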
5.2.3.3.3. Lognormal Distribution
In the case of estimating the parameters of the lognormal distribution, the only difference
is in the construction of the likelihood function for which the pdf and CDF of the
distribution are used for complete and suspended failure data, respectively. The equation
below shows the log-likelihood function for a combination of complete failure and
suspended (right-censored) data.
$$L = \sum_{i=1}^{F} N_i \ln\left[\frac{1}{\sigma t_i}\,\phi\left(\frac{\ln(t_i) - \mu}{\sigma}\right)\right] + \sum_{j=1}^{S} N_j \ln\left[1 - \Phi\left(\frac{\ln(T_j) - \mu}{\sigma}\right)\right]$$
Having the log-likelihood function of failure data, the MLE approach can be executed
using the first derivative approach, as explained in previous sections. The first derivative
of the log-likelihood function with respect to the mean and standard deviation is
illustrated in the following two equations:
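$$\frac{\partial L}{\partial\mu} = \frac{1}{\sigma^{2}}\sum_{i=1}^{F} N_i\left(\ln(t_i) - \mu\right) + \frac{1}{\sigma}\sum_{j=1}^{S} N_j\,\frac{\phi\left(\frac{\ln(T_j) - \mu}{\sigma}\right)}{1 - \Phi\left(\frac{\ln(T_j) - \mu}{\sigma}\right)} = 0$$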
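$$\frac{\partial L}{\partial\sigma} = \sum_{i=1}^{F} N_i\left(\frac{\left(\ln(t_i) - \mu\right)^{2}}{\sigma^{3}} - \frac{1}{\sigma}\right) + \frac{1}{\sigma}\sum_{j=1}^{S} N_j\,\frac{\left(\frac{\ln(T_j) - \mu}{\sigma}\right)\phi\left(\frac{\ln(T_j) - \mu}{\sigma}\right)}{1 - \Phi\left(\frac{\ln(T_j) - \mu}{\sigma}\right)} = 0$$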
where:
$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x)^{2}}$$
$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-\frac{1}{2}(t)^{2}}\,dt$$
The capital Φ in the above equation is basically the cumulative Normal distribution,
which is defined as the integral of the small φ (i.e., normal pdf). The derivative of the
CDF always becomes the pdf, since the derivative operator cancels out the integration.
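The lognormal log-likelihood for complete and suspended data can be evaluated directly with the standard library, since Φ can be written in terms of the error function. The sketch below is illustrative only; the function names and the (count, time) input layout are assumptions.

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lognormal_loglik(mu, sigma, failures, suspensions=()):
    """Log-likelihood for exact failures and right-censored (suspended) units.

    failures, suspensions: iterables of (count, time) pairs.
    """
    total = 0.0
    for n, t in failures:
        z = (math.log(t) - mu) / sigma
        total += n * math.log(phi(z) / (sigma * t))   # pdf term
    for n, T in suspensions:
        z = (math.log(T) - mu) / sigma
        total += n * math.log(1.0 - Phi(z))           # reliability term
    return total
```

Maximizing this function over μ and σ (for example with a general-purpose optimizer) is equivalent to solving the two derivative equations above.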
parameters. The next equation represents these uncertainties for a general case for which
there are “n” parameters in the likelihood function.
where:
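$$F = \begin{bmatrix}
-\dfrac{\partial^{2}\Lambda}{\partial\theta_1^{2}} & -\dfrac{\partial^{2}\Lambda}{\partial\theta_1\partial\theta_2} & \cdots & -\dfrac{\partial^{2}\Lambda}{\partial\theta_1\partial\theta_n} \\
-\dfrac{\partial^{2}\Lambda}{\partial\theta_2\partial\theta_1} & -\dfrac{\partial^{2}\Lambda}{\partial\theta_2^{2}} & \cdots & -\dfrac{\partial^{2}\Lambda}{\partial\theta_2\partial\theta_n} \\
\vdots & \vdots & \ddots & \vdots \\
-\dfrac{\partial^{2}\Lambda}{\partial\theta_n\partial\theta_1} & \cdots & -\dfrac{\partial^{2}\Lambda}{\partial\theta_n\partial\theta_{n-1}} & -\dfrac{\partial^{2}\Lambda}{\partial\theta_n^{2}}
\end{bmatrix}$$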
Having the variance and the best estimate of each parameter, one may estimate the
uncertainty bounds for any given confidence level. Note that the important underlying
assumptions here are independence of the parameters and a Normal distribution for each.
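For a one-parameter illustration, consider the exponential log-likelihood given earlier. Its second derivative with respect to λ is −r/λ², where r is the number of failures, so the 1×1 Fisher matrix is r/λ̂² and the variance of the estimate is approximately λ̂²/r. The sketch below applies the Normal assumption to obtain two-sided bounds; the function name and arguments are hypothetical.

```python
import math
from statistics import NormalDist

def exponential_normal_bounds(n_failures, total_time, confidence=0.90):
    """Two-sided Normal-approximation bounds on the exponential failure rate."""
    lam = n_failures / total_time              # MLE of the failure rate
    fisher = n_failures / lam ** 2             # -d2(Lambda)/d(lambda)2 at the MLE
    std_err = math.sqrt(1.0 / fisher)          # square root of the inverse Fisher term
    z = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return lam - z * std_err, lam, lam + z * std_err
```

With few failures the lower bound from this approximation can become negative, which is one reason exact chi-square limits are usually preferred for the exponential distribution.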
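Future Occurrence Rate, y

Lower limit from the F distribution:
One-sided: $\frac{s}{t}\cdot\frac{n}{(y_L + 1)} = F[\gamma; (2y_L + 2); 2n]$
Two-sided: $\frac{s}{t}\cdot\frac{n}{(y_L + 1)} = F[(1+\gamma)/2; (2y_L + 2); 2n]$

Normal Approximation
When “n” and “y” are large (e.g., each is > 10)
One-sided: $y_L \cong \hat{y} - z_{\gamma}\left(\hat{\lambda}\, s\,(t+s)/t\right)^{0.5}$ ; $y_U \cong \hat{y} + z_{\gamma}\left(\hat{\lambda}\, s\,(t+s)/t\right)^{0.5}$
Two-sided: $y_L \cong \hat{y} - z_{(1+\gamma)/2}\left(\hat{\lambda}\, s\,(t+s)/t\right)^{0.5}$ ; $y_U \cong \hat{y} + z_{(1+\gamma)/2}\left(\hat{\lambda}\, s\,(t+s)/t\right)^{0.5}$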
True Proportion, p

Normal Approximation
When “x” and “n-x” are large (e.g., each is > 10)
One-sided: $p_L \cong \hat{p} - z_{\gamma}\left(\hat{p}(1-\hat{p})/n\right)^{0.5}$ ; $p_U \cong \hat{p} + z_{\gamma}\left(\hat{p}(1-\hat{p})/n\right)^{0.5}$
Two-sided: $p_L \cong \hat{p} - z_{(1+\gamma)/2}\left(\hat{p}(1-\hat{p})/n\right)^{0.5}$ ; $p_U \cong \hat{p} + z_{(1+\gamma)/2}\left(\hat{p}(1-\hat{p})/n\right)^{0.5}$

Poisson Approximation
When “n” is large and “x” is small (e.g., when “x” < n/10)
One-sided: $p_L \cong 0.5\,\chi^{2}[(1-\gamma); 2x]/n$ ; $p_U \cong 0.5\,\chi^{2}[\gamma; 2x+2]/n$
Two-sided: $p_L \cong 0.5\,\chi^{2}[(1-\gamma)/2; 2x]/n$ ; $p_U \cong 0.5\,\chi^{2}[(1+\gamma)/2; 2x+2]/n$
Given: Given the observed probability above, the prediction for the number of “y” future
category units is:
$\hat{y} = m\hat{p} = m(x/n)$
where,
x, n = as defined above
m = future sample size

Prediction of Future Probability of “Success”, y

Normal Approximation
When “x”, “n-x”, “y” and “m-y” are all large (say, > 10)
One-sided: $y_L \cong \hat{y} - z_{\gamma}\left[m\hat{p}(1-\hat{p})(m+n)/n\right]^{0.5}$ ; $y_U \cong \hat{y} + z_{\gamma}\left[m\hat{p}(1-\hat{p})(m+n)/n\right]^{0.5}$
Two-sided: $y_L \cong \hat{y} - z_{(1+\gamma)/2}\left[m\hat{p}(1-\hat{p})(m+n)/n\right]^{0.5}$ ; $y_U \cong \hat{y} + z_{(1+\gamma)/2}\left[m\hat{p}(1-\hat{p})(m+n)/n\right]^{0.5}$

Poisson Approximation
When “n” is large and “x” is small (e.g., when “x” < n/10)
Closest integer solutions for yL and yU from the following equations
One-sided: $\frac{y_U}{(x+1)}\cdot\frac{n}{m} = F[\gamma; 2x+2; 2y_U]$ ; $\frac{m}{n}\cdot\frac{x}{(y_L+1)} = F[\gamma; (2y_L+2); 2x]$
Two-sided: $\frac{y_U}{(x+1)}\cdot\frac{n}{m} = F[(1+\gamma)/2; (2x+2); 2y_U]$ ; $\frac{m}{n}\cdot\frac{x}{(y_L+1)} = F[(1+\gamma)/2; (2y_L+2); 2x]$
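mean, θ
One-sided: $\theta_U = \dfrac{2n\bar{x}}{\chi^{2}[(1-\gamma); 2(n+1)]}$
Two-sided: $\theta_U = \dfrac{2n\bar{x}}{\chi^{2}[(1-\gamma)/2; 2(n+1)]}$
where,
θhat = sample mean
xi = individual times to failure for each of the observations of sample size “n”
n = number of statistically independent sample observations

Exponential Limits (exact) for Failure Truncated Tests

True value of the failure rate, λ
One-sided: $\lambda_L = \dfrac{1}{\theta_U} = \dfrac{\chi^{2}[(1-\gamma); 2n]}{2n\bar{x}}$ ; $\lambda_U = \dfrac{1}{\theta_L} = \dfrac{\chi^{2}[\gamma; 2n]}{2n\bar{x}}$
Two-sided: $\lambda_L = \dfrac{1}{\theta_U} = \dfrac{\chi^{2}[(1-\gamma)/2; 2n]}{2n\bar{x}}$ ; $\lambda_U = \dfrac{1}{\theta_L} = \dfrac{\chi^{2}[(1+\gamma)/2; 2n]}{2n\bar{x}}$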
Given: The usual estimate of the reliability, R(t), at any age, t, is:
$R^{*}(t) = e^{-(t/\bar{x})}$
where,
R = reliability as a function of time, distance, etc.
t = period at which reliability is assessed (time, distance, etc.)

True value of reliability at end of period, R(t)
One-sided: $R_L(t) = e^{-(t/\theta_L)} = \exp\left(-t\,\{\chi^{2}[\gamma; 2n]/2n\bar{x}\}\right)$ ; $R_U(t) = e^{-(t/\theta_U)} = \exp\left(-t\,\{\chi^{2}[(1-\gamma); 2n]/2n\bar{x}\}\right)$
Two-sided: $R_L(t) = e^{-(t/\theta_L)} = \exp\left(-t\,\{\chi^{2}[(1+\gamma)/2; 2n]/2n\bar{x}\}\right)$ ; $R_U(t) = e^{-(t/\theta_U)} = \exp\left(-t\,\{\chi^{2}[(1-\gamma)/2; 2n]/2n\bar{x}\}\right)$
Given: The estimate of the reliability at any age “t”, R(t), is:
$R^{*}(t) = 1 - \Phi(z)$
where,
R = reliability as a function of time, distance, etc.
t = period at which reliability is assessed (time, distance, etc.)
Φ(z) = estimate of the fraction of a population failing by age “t”

True value of reliability at end of period, R(t)
$R_L(t) = 1 - F_U(t) = 1 - \Phi(z_U)$ and $R_U(t) = 1 - F_L(t) = 1 - \Phi(z_L)$, where $z = \dfrac{(x - \bar{x})}{s}$
One-sided: $z_U \cong z + \dfrac{z_{\gamma}}{\sqrt{n}}\left(1 + \dfrac{z^{2}(n/2)}{n-1}\right)^{0.5}$ ; $z_L \cong z - \dfrac{z_{\gamma}}{\sqrt{n}}\left(1 + \dfrac{z^{2}(n/2)}{n-1}\right)^{0.5}$
Two-sided: $z_U \cong z + \dfrac{z_{(1+\gamma)/2}}{\sqrt{n}}\left(1 + \dfrac{z^{2}(n/2)}{n-1}\right)^{0.5}$ ; $z_L \cong z - \dfrac{z_{(1+\gamma)/2}}{\sqrt{n}}\left(1 + \dfrac{z^{2}(n/2)}{n-1}\right)^{0.5}$
Accelerated testing is often used for this purpose, in which case tests are performed at
stress levels higher than the item will experience in use, to speed up failure processes.
Acceleration models consist of two generic types:
There are four basic forms of accelerated life models. Combinations of these are also
possible:
In all of these equations, “y” is the dependent variable, usually either lifetime (as
measured by characteristic life or mean life, depending on the TTF distribution used), or
failure rate. Since the failure rate is the reciprocal of the mean life (in the case of the
exponential distribution), the constant “a” will generally be positive in one case and
negative in the other.
The most commonly used reliability models are the power law and exponential models.
Several points regarding acceleration models are:
5.3.1.1. Examples
Several commonly used acceleration models are summarized in this section.
Arrhenius
The Arrhenius relationship is a widely used model describing the effect that temperature
has on the rate of a simple chemical reaction:
$$L \propto A\, e^{\left[\frac{E_a}{KT}\right]}$$
where:
L= the lifetime
A= a life constant
Ea = the activation energy in eV
T= the absolute temperature in degrees Kelvin
It can be seen that this is the exponential model, with the reciprocal of temperature used
as the stress. The Arrhenius model is the most widely used for evaluating the effect of
temperature on reliability. It is applicable to situations in which the failure mechanism is
a function of the steady state temperature, such as corrosion, diffusion, etc. Notable
observations about the Arrhenius acceleration model are that:
• In the formative years of the electronics industry, many failure mechanisms were
related to corrosion and contamination, which are inherently chemical reaction
rates for which the Arrhenius factor applies reasonably well
• It has since been applied to many other failure mechanisms, with an assumed
applicability
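Taking the ratio of the Arrhenius life at use temperature to the life at stress temperature cancels the constant “A” and gives an acceleration factor. The short sketch below illustrates this; the function name and the example values of activation energy and temperature are hypothetical, and Boltzmann's constant is taken as 8.617 x 10^-5 eV/K.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K

def arrhenius_af(ea_ev, t_use_k, t_stress_k):
    """Arrhenius acceleration factor, life at use divided by life at stress.

    ea_ev:      activation energy in eV
    t_use_k:    use temperature in kelvin
    t_stress_k: stress (test) temperature in kelvin
    """
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)

# Hypothetical example: Ea = 0.7 eV, use at 55 C, test at 125 C
af = arrhenius_af(0.7, 55.0 + 273.15, 125.0 + 273.15)
```

Temperatures must be absolute; passing Celsius values directly would give a badly wrong factor.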
Eyring
The Eyring model is:
$$L \propto \frac{1}{T}\, e^{-\left[A - \frac{B}{T}\right]}$$
Coffin-Manson
A form of fatigue life strain models is the Coffin-Manson “life vs. plastic strain”, which
is often used for solder joint reliability modeling:
$$AF = \left(\frac{\Delta T_S}{\Delta T_U}\right)^{\beta}$$
where:
AF = acceleration factor
ΔTU = product temperature in service use, °K
ΔTS = product temperature in stress conditions, °K
β= constant for a specific failure mechanism
$$N_f = A\left(\frac{1}{\Delta e_p}\right)^{\beta}$$
where:
Since ΔT ∝ Δep, a simplified acceleration factor for temperature cycling fatigue testing
is:
$$AF = \frac{N_{use}}{N_{test}} = \left(\frac{\Delta T_{test}}{\Delta T_{use}}\right)^{\beta}$$
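As a small worked illustration of this simplified factor, the sketch below computes AF and converts a number of test cycles to equivalent use cycles. The function name, the temperature ranges and the exponent value are hypothetical; the exponent depends on the failure mechanism and must come from data or the literature.

```python
def coffin_manson_af(delta_t_test, delta_t_use, beta):
    """Simplified Coffin-Manson acceleration factor for temperature cycling."""
    return (delta_t_test / delta_t_use) ** beta

# Hypothetical example: 100-degree test cycle, 50-degree use cycle, beta = 2
af = coffin_manson_af(100.0, 50.0, 2.0)
equivalent_use_cycles = 500 * af   # 500 test cycles expressed as use cycles
```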
The Coffin-Manson model is also sometimes used to model the acceleration due to
vibration stresses. Random vibration input and response curves are typically plotted on
log-log paper, with the power spectral density (PSD) expressed in squared acceleration
units per hertz (G²/Hz), plotted along the vertical axis, and the frequency (Hz) plotted
along the horizontal axis.
$$P = \lim_{\Delta f \to 0} \frac{G^{2}}{\Delta f}$$
In the above equation, “G” is the root mean square (RMS) of the acceleration, expressed
in gravity units, and “Δf” is the bandwidth of the frequency range expressed in hertz.
Since “G” is the agent of failure that causes fatigue, the following inverse power model
applies:
$$L(G) \propto \frac{1}{G^{\beta}} \;\Rightarrow\; Life = \frac{1}{K G^{\beta}}$$
The acceleration factor for vibration based on Grms for similar product responses is
represented by:
$$AF = \frac{N_{use}}{N_{test}} = \left(\frac{G_{test}}{G_{use}}\right)^{\beta}$$
$$L(U,V) = \frac{C}{U^{n}\, e^{-\frac{B}{V}}}$$
where:
The T-NT relationship can be linearized and plotted on a Life vs. Stress plot by taking the
natural logarithm of both sides:
$$\ln[L(U,V)] = \ln(C) - n\ln(U) + \frac{B}{V}$$
Here, the log of the life is a linear function of ln(U) and 1/V, where the intercept is
ln(C), the slope with respect to ln(U) is “−n” and the slope with respect to 1/V is “B”.
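$$AF = \frac{L_{Use}}{L_{Accelerated}} = \frac{\dfrac{C}{U_u^{n}}\, e^{\frac{B}{V_u}}}{\dfrac{C}{U_A^{n}}\, e^{\frac{B}{V_A}}} = \left(\frac{U_A}{U_u}\right)^{n} e^{B\left(\frac{1}{V_u} - \frac{1}{V_A}\right)}$$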
where:
Temperature-Humidity Models
A variation of the Eyring relationship is the Temperature-Humidity (TH) relationship.
This combination model is expressed as:
$$L(V,U) = A\, e^{\left(\frac{\phi}{V} + \frac{b}{U}\right)}$$
where “φ” and “b” are parameters to be determined (the parameter “b” is also known as
the activation energy for humidity), “A” is a constant, “U” is the relative humidity,
and “V” is the temperature (in absolute units, K). Note that the relative humidity can
be expressed in either a decimal format or as a percentage, as long as it is consistent
throughout the analysis. The relationship is linearized by taking the natural logarithm
of both sides of the equation:
$$\ln[L(V,U)] = \ln(A) + \frac{\phi}{V} + \frac{b}{U}$$
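$$AF = \frac{L_{Use}}{L_{Accelerated}} = \frac{A\, e^{\left(\frac{\phi}{V_u} + \frac{b}{U_u}\right)}}{A\, e^{\left(\frac{\phi}{V_A} + \frac{b}{U_A}\right)}} = e^{\phi\left(\frac{1}{V_u} - \frac{1}{V_A}\right) + b\left(\frac{1}{U_u} - \frac{1}{U_A}\right)}$$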
where:
Peck Model
The Peck model (Reference 6) is:
$$L \propto (RH)^{-n}\, e^{\left[\frac{E_a}{KT}\right]}$$
where:
RH = Relative Humidity
T = temperature
n = constant
Ea = activation energy
K = Boltzmann’s constant = 8.617 x 10^-5 eV/K
Note that this is a multiplicative model consisting of a power law for humidity and the
Arrhenius model for temperature.
$$L \propto e^{\left[\frac{E_a}{KT}\right] + n(RH)^{2}}$$
This model includes the effects of both temperature and relative humidity.
Harris Model
Wearout data published by the Harris Corporation shows a good fit to Peck’s model
(Reference 8) in representing aluminum corrosion. This model is:
$$AF = e^{\left[\frac{E_a}{k}\left(\frac{1}{T_U} - \frac{1}{T_S}\right)\right]}\left(\frac{RH_S}{RH_U}\right)^{a}\left(\frac{V_S}{V_U}\right)^{b}$$
where:
AF = acceleration factor
Ea = activation energy
k = Boltzmann’s constant = 8.617 x 10^-5 eV/K
TU = product temperature in service use, K
In estimating fatigue life for materials, the model is used as the analytical representation
of the so-called “S-N” curves, where “S” is stress amplitude and “N” is life (in cycles to
failure), such that N = kS^(-b), where “b” and “k” are material parameters either estimated
from test data or published in handbooks.
Miner’s Rule
Miner’s rule states that the amount of damage sustained by a metal is proportional to the
number of cycles it experiences, as follows:
$$\sum_{i=1}^{k} \frac{n_i}{N_i} = C$$
There are “k” stress levels, “n_i” is the number of cycles applied at stress level “i”,
“N_i” is the total number of cycles to failure at that constant stress reversal, and “C” is
usually assumed to be 1.0. The rule essentially estimates the percentage of life used by
the stress reversals at each specific magnitude.
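The damage sum is simple to compute once the applied cycles and the cycles to failure at each stress level are known. The sketch below uses a hypothetical function name and hypothetical loading values; cycles to failure would normally come from an S-N curve.

```python
def miner_damage(loading):
    """Cumulative damage fraction per Miner's rule.

    loading: iterable of (cycles_applied, cycles_to_failure) pairs,
             one pair per stress level.
    """
    return sum(n_i / N_i for n_i, N_i in loading)

# Hypothetical loading history at three stress levels
damage = miner_damage([(1000, 10000), (500, 2000), (200, 400)])
life_remaining = 1.0 - damage   # assuming failure at C = 1.0
```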
$$R(t) = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$
where:
S= stressor
a= life constant
n= fatigue exponent in time space
The premise of the cumulative damage model is that the amount of life used per cycle is
proportional to the stressor raised to the “n” power:
$$t_e = t_1\left(\frac{S_1}{S_0}\right)^{n}$$
where:
This cumulative damage model is particularly useful when the stresses are time varying,
since an equivalent amount of damage can be estimated per unit time, regardless of the
behavior of stress as a function of time. This model is also consistent with fatigue, which
is essentially a cumulative damage scenario.
The form of a life model is the distribution equation, with the mean or characteristic life
(depending on the distribution) replaced with the acceleration model. For example, if a
Weibull distribution is used, the reliability function is:
$$R(t) = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$
where:
where:
S= stressor
a= life constant
n= fatigue exponent in time space
The modeling process estimates β, “a” and “n”. Once these parameters are estimated, the
life distribution for any stress level can be obtained.
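Exponential-Arrhenius:
$$\begin{aligned}
L ={}& \sum_{i=1}^{N} N_i\left(-\frac{B}{V_i} - \ln(C) - \frac{t_i}{C}\, e^{-\frac{B}{V_i}}\right) - \sum_{i=1}^{M} N_i\,\frac{T_{Ri}}{C}\, e^{-\frac{B}{V_i}} \\
&+ \sum_{i=1}^{K} N_i \ln\left[1 - e^{-\left(\frac{T_{Li}}{C} e^{-\frac{B}{V_i}}\right)}\right] + \sum_{i=1}^{L} N_i \ln\left[e^{-\left(\frac{T_{ai}}{C} e^{-\frac{B}{V_i}}\right)} - e^{-\left(\frac{T_{bi}}{C} e^{-\frac{B}{V_i}}\right)}\right]
\end{aligned}$$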
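Exponential-IPL:
$$\begin{aligned}
L ={}& \sum_{i=1}^{N} N_i\left(\ln K + n\ln(S_i) - K S_i^{n} t_i\right) - \sum_{i=1}^{M} N_i K S_i^{n} T_{Ri} \\
&+ \sum_{i=1}^{K} N_i \ln\left(1 - e^{-K S_i^{n} T_{Li}}\right) + \sum_{i=1}^{L} N_i \ln\left(e^{-K S_i^{n} T_{ai}} - e^{-K S_i^{n} T_{bi}}\right)
\end{aligned}$$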
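To make the structure of these four-term log-likelihoods concrete, the exponential-IPL case can be coded directly, with one loop per data type. This is a sketch under assumptions: the function name and the tuple layouts are hypothetical, and each interval is taken to run from a lower bound to a larger upper bound.

```python
import math

def exp_ipl_loglik(K, n, exact=(), right=(), left=(), interval=()):
    """Log-likelihood of the exponential inverse power law (IPL) life model.

    exact, right, left: iterables of (count, stress, time)
    interval:           iterable of (count, stress, t_lower, t_upper)
    The failure rate at stress S is K * S**n.
    """
    total = 0.0
    for cnt, s, t in exact:                  # exact failure times (pdf)
        rate = K * s ** n
        total += cnt * (math.log(rate) - rate * t)
    for cnt, s, t in right:                  # right-censored (reliability)
        total -= cnt * K * s ** n * t
    for cnt, s, t in left:                   # left-censored (CDF)
        total += cnt * math.log(1.0 - math.exp(-K * s ** n * t))
    for cnt, s, t_lo, t_hi in interval:      # interval (CDF difference)
        rate = K * s ** n
        total += cnt * math.log(math.exp(-rate * t_lo) - math.exp(-rate * t_hi))
    return total
```

The other distribution and stress-model combinations follow the same pattern, with the pdf, reliability and CDF expressions swapped for those of the chosen life model.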
Weibull-Arrhenius:
$$\begin{aligned}
L ={}& \sum_{i=1}^{N} N_i \ln\left[\frac{\beta}{C e^{\frac{B}{V_i}}}\left(\frac{t_i}{C e^{\frac{B}{V_i}}}\right)^{\beta-1} e^{-\left(\frac{t_i}{C e^{B/V_i}}\right)^{\beta}}\right] - \sum_{i=1}^{M} N_i\left(\frac{T_{Ri}}{C e^{\frac{B}{V_i}}}\right)^{\beta} \\
&+ \sum_{i=1}^{K} N_i \ln\left[1 - e^{-\left(\frac{T_{Li}}{C e^{B/V_i}}\right)^{\beta}}\right] + \sum_{i=1}^{L} N_i \ln\left[e^{-\left(\frac{T_{ai}}{C e^{B/V_i}}\right)^{\beta}} - e^{-\left(\frac{T_{bi}}{C e^{B/V_i}}\right)^{\beta}}\right]
\end{aligned}$$
Weibull-IPL:
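$$\begin{aligned}
L ={}& \sum_{i=1}^{N} N_i \ln\left[\beta K S_i^{n}\left(K S_i^{n} t_i\right)^{\beta-1} e^{-\left(K S_i^{n} t_i\right)^{\beta}}\right] - \sum_{i=1}^{M} N_i\left(K S_i^{n} T_{Ri}\right)^{\beta} \\
&+ \sum_{i=1}^{K} N_i \ln\left[1 - e^{-\left(K S_i^{n} T_{Li}\right)^{\beta}}\right] + \sum_{i=1}^{L} N_i \ln\left[e^{-\left(K S_i^{n} T_{ai}\right)^{\beta}} - e^{-\left(K S_i^{n} T_{bi}\right)^{\beta}}\right]
\end{aligned}$$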
Lognormal-Arrhenius:
$$\begin{aligned}
L ={}& \sum_{i=1}^{N} N_i \ln\left[\frac{1}{\sigma t_i}\,\phi\left(\frac{\ln(t_i) - \ln(C) - \frac{B}{V_i}}{\sigma}\right)\right] + \sum_{i=1}^{M} N_i \ln\left[1 - \Phi\left(\frac{\ln(T_{Ri}) - \ln(C) - \frac{B}{V_i}}{\sigma}\right)\right] \\
&+ \sum_{i=1}^{K} N_i \ln\left[\Phi\left(\frac{\ln(T_{Li}) - \ln(C) - \frac{B}{V_i}}{\sigma}\right)\right] \\
&+ \sum_{i=1}^{L} N_i \ln\left[\Phi\left(\frac{\ln(T_{bi}) - \ln(C) - \frac{B}{V_i}}{\sigma}\right) - \Phi\left(\frac{\ln(T_{ai}) - \ln(C) - \frac{B}{V_i}}{\sigma}\right)\right]
\end{aligned}$$
Lognormal-IPL:
$$\begin{aligned}
L ={}& \sum_{i=1}^{N} N_i \ln\left[\frac{1}{\sigma t_i}\,\phi\left(\frac{\ln(t_i) + \ln(K) + n\ln(S_i)}{\sigma}\right)\right] \\
&+ \sum_{i=1}^{M} N_i \ln\left[1 - \Phi\left(\frac{\ln(T_{Ri}) + \ln(K) + n\ln(S_i)}{\sigma}\right)\right] + \sum_{i=1}^{K} N_i \ln\left[\Phi\left(\frac{\ln(T_{Li}) + \ln(K) + n\ln(S_i)}{\sigma}\right)\right] \\
&+ \sum_{i=1}^{L} N_i \ln\left[\Phi\left(\frac{\ln(T_{bi}) + \ln(K) + n\ln(S_i)}{\sigma}\right) - \Phi\left(\frac{\ln(T_{ai}) + \ln(K) + n\ln(S_i)}{\sigma}\right)\right]
\end{aligned}$$
Exponential Arrhenius:
Exponential IPL:
Weibull Arrhenius:
Weibull IPL:
Lognormal Arrhenius:
Lognormal IPL:
The likelihood function will yield a value for all possible combinations of parameter
values. A useful tool in data analysis is a plot of the likelihood value. As an example,
Figure 5.4-1 illustrates a contour plot of the likelihood value for an exponential-IPL
model.
In this example, the plot lines represent values of equal likelihood as a function of the
two parameters of interest (i.e., the Weibull slope and the exponent in the power law
acceleration model). The center position represents the combination of beta and “n” at
which the maximum value of likelihood occurs. The height of the likelihood value
increases as the center of the contour lines is approached. The spread in the contour lines
of equal likelihood is proportional to the uncertainty in the parameter estimates, and in
fact is one way to estimate confidence bounds on the model parameters. Also, the
dispersion of the likelihood values on the “n” axis can be thought of as the spread of the
TTFs in the stress dimension, and the dispersion of the likelihood values on the “beta”
axis can be thought of as the spread of the TTFs in the time dimension.
5.5. References
1. Lyu, M.R. (Editor), “Handbook of Software Reliability Engineering”, McGraw-Hill,
April 1996, ISBN 0070394008
2. Musa, J.D.; Iannino, A.; and Okumoto, K.; “Software Reliability: Measurement,
Prediction, Application”, McGraw-Hill, May 1987, ISBN 007044093X
3. Musa, J.D., “Software Reliability Engineering: More Reliable Software, Faster
Development and Testing”, McGraw-Hill, July 1998, ISBN 0079132715
4. Nelson, W., “Applied Life Data Analysis”, John Wiley & Sons, 1982,
ISBN 0471094587
5. Fisher, R. A., 1912, “On an Absolute Criterion for Fitting Frequency Curves”,
Messenger of Mathematics, Vol. 41, pp. 155-160. [Reprinted in Statistical
Science, Vol. 12, (1997) pp. 39-41.]
6. Peck, S., IRPS tutorial, 1990
7. Telcordia GR1221
8. Peck and Hallberg, “Quality and Reliability Engineering International”, 1991
9. Hald, A., 1999, “On the Maximum Likelihood in Relation to Inverse Probability
and Least Squares.” Statistical Science, Vol. 14, No. 2, pp. 214-222.
10. Accelerated Life Testing Analysis (ALTA), Reliasoft Corp.
Chapter 6: Interpretation of Reliability Estimates

Infant Mortality. In this first portion of the bathtub curve, the failure rate is relatively
high because a portion of the population may contain parts with defects. These parts
generally fail earlier than those in the main population. The shape of the failure rate
curve is decreasing, with its rate of decrease dependent on the maturity of the design and
manufacturing processes, as well as the applied stresses.
Useful Life. The second portion of the bathtub curve is known as the “useful life” and is
characterized by a relatively constant failure rate caused by randomly occurring failures.
It should be noted that the failure rate is only related to the height of the curve, not to the
length of the curve, which is a representation of product or system life. If items are
exhibiting randomly occurring failures, then they fail according to the exponential
distribution, in accordance with a Poisson process. Since the exponential distribution
exhibits a constant hazard rate, we can simply add the failure rates of all the lower-level
items making up an item to estimate the overall failure rate of that item during its useful life.
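A minimal sketch of this summation is shown below. It assumes a series configuration (any part failure fails the item) and constant failure rates expressed in the same units; the function name and the part failure rates are hypothetical.

```python
def series_failure_rate(part_failure_rates):
    """Item failure rate as the sum of its parts' constant failure rates."""
    return sum(part_failure_rates)

# Hypothetical part failure rates, in failures per hour
item_rate = series_failure_rate([2e-6, 5e-6, 3e-6])
item_mtbf = 1.0 / item_rate   # MTBF in hours under the exponential assumption
```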
Wearout. The last part of the curve is the wearout portion. This is where items start to
deteriorate to such a degree that they are approaching, or have reached, the end of their
useful life. This is often relevant to mechanical parts, but can also apply to any failure
cause that exhibits wearout behavior.
It is important to understand the difference between the MTBF of an item and the useful
life of that same item. Items that experience wearout failure modes/mechanisms will
have some period of useful life before they fail as a result of wearout. This useful life is
not the same as the item MTBF. During useful life, an item may also experience
randomly occurring “freak” failures caused by weak components or faulty workmanship,
especially if the item is subjected to high stress conditions. The occurrence of these
random failures during an item’s useful life results in higher failure rates, or lower
MTBF, for that item.
Mechanical items are usually most prone to wearout and, therefore, we are usually most
concerned with the useful life, or MTTF, associated with these items. Electronic items
usually become obsolete long before any significant wearout takes place (see footnote 6).
Therefore, the infant mortality and constant failure rate portions of the bathtub curve are
of the most interest for these items.
The bathtub curve conceptually offers a good view of the three primary types of failure
categories. It is essentially a composite failure rate curve comprised of three generic
types of failure causes. In practice, however, the well defined curve of Figure 6.1-1 is
rare. The actual curve for a product or system will depend on many factors. A specific
failure cause will generally exhibit characteristics of only one segment of the bathtub
curve, but when the characteristics of all of the other failure causes for that product or
system are considered, and a composite model is generated, the curve will have a shape
that deviates from the classic bathtub curve, even though it will often contain elements of
each of the three portions. Usually, the composite curve will be dominated by the
characteristics of those failure causes that dominate the overall reliability of the item.
Footnote 6: It should be noted, however, that with the progressively decreasing feature
sizes of current state-of-the-art microelectronic devices, the issues associated with
wearout and useful life are becoming of greater concern.
It is also important to note that defects do not always manifest themselves as infant
mortality failures. They can appear to be infant mortality, random or wearout, depending
on the specific characteristics of the failure mechanism and factors, such as defect
severity distributions.
The accuracy of a reliability model is a strong function of the manner in which defects
are accounted for. Therefore, there is a trade-off between the usability of the model and
the level of detailed data that it requires. This highlights the fact that the purpose of a
reliability prediction must be clearly understood before a methodology is chosen.
Practical considerations for choosing an approach will inevitably include the types and
level of detail of information available to the analyst. Given the practical time and cost
constraints that most reliability practitioners face, it is usually important that the chosen
reliability prediction methodology be based on data and information accessible to them.
Model developers have long known that many of the factors which had a major influence
on the reliability of the end product were not included in traditional methods like MIL-
HDBK-217, but under the “constraints” of handbook users, these factors could not be
included in the models. For example, it was known that manufacturing processes had a
major impact on end item reliability, but those are the factors which corporations hold
most proprietary. As an example of this, a physics-of-failure-like model was developed
several years ago for small-scale CMOS technology. This model required many input
variables, such as metallization cross-sectional area, silicon area, oxide field strength,
oxide defect density, metallization defect density etc. While the model has the potential
to be much more accurate than the other MIL-HDBK-217 models, it is essentially
unusable by anyone other than the component manufacturers who have access to such
information. The model is useful, however, for these manufacturers to improve the
reliability of their component designs.
The two primary purposes for performing a quantitative reliability assessment of systems
are (1) to assess the capability of the parts and design to operate reliably in a given
application (robustness), and (2) to estimate the number of field failures or the probability
of mission success. The first does not require statistically-based data or models, but
rather sound part and materials selection/qualification and robust design techniques. It is
for this purpose that physics approaches have merit. The second, however, requires
empirical data and models derived from that data. This is due to the fact that field
component failures are predominantly caused by component and manufacturing defects
which can only be quantified through the statistical analysis of empirical data. This can
be seen by observing the TTF characteristics of components and systems, whose failure rates are
almost always decreasing, indicating the predominance of defect-driven failure
mechanisms. The “handbook” models described in this book provide the data to quantify
average failure rates which are a function of those defects.
It has been shown that system reliability failure causes are not driven by deterministic
processes, but rather by stochastic processes that must be treated as such in a successful
model. There is a similarity between reliability prediction and chaotic processes. This
likeness stems from the fact that the reliability of a complex system is entirely dependent
upon initial conditions (e.g., manufacturing variation) and use variables (i.e., field
application). Both the initial conditions and the use application variables are often
unknowable to any degree of certainty. For example, the likelihood of a specific system
containing a defect is often unknown, depending on the defect type, because the
propensity for defects is a function of many variables and deterministically modeling
them all is virtually impossible. However, the reliability can be predicted within bounds
by using empirically based stochastic models.
A critical factor that must be considered when choosing a reliability assessment method
is whether the failure mechanism under analysis is a special cause or a common cause
mechanism. A special cause mechanism is one for which there is an assignable
cause of the failure and to which only a subpopulation of the item is susceptible.
Common cause mechanisms are those affecting the entire population.
Table 6.1-1 summarizes the characteristics of various categories of failure causes, and
identifies whether they are typically common cause or special cause. The categories of
failure types encompass the ways a failure cause can manifest itself. These are also
categories that can be used in a FMEA.
If it is erroneously assumed that special cause mechanisms will affect the entire
population, gross errors in the reliability estimates of the population will result. This
error results from the assumption of a mono-modal TTF distribution when, in fact, the
actual distribution is multimodal.
If the distribution is truly mono-modal, only the parameters applicable to a single mode
distribution need to be estimated. However, if there are really several sub-populations
within the entire population, the parameters of each of the distributions need to be
estimated, along with the percentage of the entire population represented by each
distribution.
This is especially critical when dealing with defects. In this case, it is critical to
understand the percentage of the population that is at risk of failure. To illustrate this,
consider the probability plot in Figure 6.2-1. As can be seen in this plot, there is an
apparent “knee” in the plot at about 400 hours, an indication of several subpopulations.
If a mono-modal distribution is assumed (i.e., the straight line), errors in the cumulative
percent fail at a given time will occur. Likewise, if a multimodal distribution is assumed,
a much more accurate representation of the situation results (the line through the data
points).
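The difference between a single-mode and a multimodal description can be made concrete with a short sketch. The example below evaluates a mixed Weibull cumulative distribution as the portion-weighted sum of its subpopulation distributions. It assumes two-parameter Weibull subpopulations and uses the three (β, η, P) triples printed with the mixed-Weibull plot in this section; the function names are illustrative only.

```python
import math

# (beta, eta, portion) for each subpopulation, as printed with the plot.
SUBPOPULATIONS = [
    (1.3341, 307.1460, 0.0646),
    (0.7505, 2.1367e4, 0.4240),
    (4.2735, 1.1624e5, 0.5114),
]


def weibull_cdf(t, beta, eta):
    """Two-parameter Weibull unreliability F(t)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def mixed_weibull_cdf(t, subpopulations):
    """Portion-weighted sum of the subpopulation unreliabilities."""
    return sum(p * weibull_cdf(t, beta, eta) for beta, eta, p in subpopulations)


for hours in (100.0, 400.0, 10_000.0):
    print(hours, mixed_weibull_cdf(hours, SUBPOPULATIONS))
```

The portion values act as the "percentage of the entire population represented by each distribution", so they should sum to one.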
[Figure: Weibull-Mixed probability plot (MLE, F=98/S=139), showing data points, suspended points and the fitted probability line. Axes: Unreliability, F(t) versus Time, (t). Fitted parameters: β[1]=1.3341, η[1]=307.1460, P[1]=0.0646; β[2]=0.7505, η[2]=2.1367E+4, P[2]=0.4240; β[3]=4.2735, η[3]=1.1624E+5, P[3]=0.5114]
If accelerated tests are used to model life, the risk in assuming mono-modality must be
considered. For this reason, techniques like stress/strength and first principles are often
difficult to use to quantify multimodality.
[Figures: five mixed-Weibull probability plots, each with axes Unreliability, F(t) versus Time, (t). Panel titles and fitted parameters:
Folio1\1-.5: β[1]=0.6010, η[1]=61.0860, P[1]=0.4219; β[2]=0.5936, η[2]=918.6935, P[2]=0.5781
Folio1\1-1: β[1]=0.8633, η[1]=341.4656, P[1]=0.6303; β[2]=1.4062, η[2]=863.2767, P[2]=0.3697
Folio1\5-1: β[1]=1.8144, η[1]=98.4428, P[1]=0.1881; β[2]=1.2385, η[2]=679.4469, P[2]=0.8119
Folio1\.5-5: β[1]=1.1808, η[1]=206.1968, P[1]=0.1940; β[2]=4.6943, η[2]=497.6359, P[2]=0.8060
Folio1\5-5: β[1]=5.7163, η[1]=44.7891, P[1]=0.0999; β[2]=4.2932, η[2]=483.7042, P[2]=0.9001]
A distribution was then obtained by pooling all of the individual distributions described
previously. This is shown in Figure 6.2-7. Pooling the various distributions
from many failure causes has the effect of randomizing the apparent failure
characteristics of the resultant pooled population. This is one of the reasons that a
[Figure: "all data" Weibull-Mixed probability plot (NLRR, F=498/S=0), showing data points and the fitted probability line. Axes: Unreliability, F(t) versus Time, (t). Fitted parameters: β[1]=0.7208, η[1]=432.4614, P[1]=0.6694; β[2]=4.4232, η[2]=491.0120, P[2]=0.3306]
The raw data is contained in Figure 6.2-8, which presents the number of deaths occurring
at each age. This is the discrete version of the pdf.
[Figure: Number of Deaths (0 to 4000) versus Age (0 to 120)]
From this graphic, it can be seen that there are several distinct distributions present. The
first, Mode 1, is the infant mortality period. The second mode, Mode 2, represents deaths
in the late teens and early twenties. The third and fourth modes represent deaths from
old age.
Next, a multimode Weibull distribution was fit to the data, using ReliaSoft’s Weibull++
software tool, which allows fitting failure data to multimode distributions. The results
are summarized in Table 6.1-7.
[Figures: two plots over Time, (t) from 0.100 to 110.100. The first shows the pdf, f(t); the second shows the Failure Rate, f(t)/R(t). Both vertical axes run from 0.000 to 0.032.]
The Weibull probability plot is shown in Figure 6.2-11. Note that, in this graph, the plot
is shown using Weibull scales, i.e. the log of time on the x-axis and double log of
unreliability on the y-axis. If this plot were close to a straight line, it would indicate that
the distribution could be described adequately with a mono-modal Weibull distribution.
Clearly, this is not the case.
[Figure: "Probability - Weibull" plot. Axes: Unreliability, F(t) (0.100 to 99.990) versus Time, (t) (0.100 to 110.000)]
Figure 6.2-12 illustrates a single mode Weibull fit (straight line) to the data. As can be
seen, if the single mode fit is used to estimate probability of death at a specific age,
significant errors would result. For example, it would imply that about 20% of the
population would live to 110 years, and that there is less than a 0.001%
probability of death in the first year.
This example illustrates the fact that, if there is a “sub-population” of samples with
different reliability behavior than the main population, then the TTF distributions may
manifest themselves as bimodal or multimodal. It is important that these multimodal
distributions be characterized. If one of the two “modes” in the distribution appears as
early failures resulting from defects, this information is required to develop an
appropriate reliability screen.
[Figure: Weibull probability plot. Axes: Unreliability, F(t) (0.001 to 90.000) versus Time, (t) (1.000 to 110.000)]
$$\lambda = \frac{\chi^2(1 - CL,\; 2r + 2)}{2t}$$
where the numerator is a value taken from a chi-square table, and “t” is the number of
device hours. A question sometimes arises as to how the confidence bounds calculated in
this manner compare to those calculated with the use of the Poisson distribution.
From the binomial and Poisson distributions, Farachi (Reference 2) has shown that:
$$1 - CL = \sum_{k=0}^{r} \frac{n!}{k!\,(n-k)!}\,(1-q)^{n-k}\, q^{k}$$

$$\frac{n!}{k!\,(n-k)!}\,(1-q)^{n-k}\, q^{k} \approx \frac{(nq)^{k}}{k!}\, e^{-nq}$$

$$1 - CL = \sum_{k=0}^{r} \frac{(nq)^{k}}{k!}\, e^{-nq} = e^{-nq}\left[1 + nq + \cdots + \frac{(nq)^{r-1}}{(r-1)!} + \frac{(nq)^{r}}{r!}\right]$$
Since:
$$nq = \lambda t$$
Then:
$$1 - CL = \sum_{k=0}^{r} \frac{(\lambda t)^{k}}{k!}\, e^{-\lambda t} = e^{-\lambda t}\left[1 + \lambda t + \cdots + \frac{(\lambda t)^{r-1}}{(r-1)!} + \frac{(\lambda t)^{r}}{r!}\right]$$
The chi-square value is the exact solution to the above equation. Note that the chi-square
values are for the product “λt”, not “λ” alone: for a given confidence level and number of
failures, the tabulated chi-square value equals 2λt, which is why it is divided by 2t to
obtain λ. The chi-square values are therefore entirely consistent with the Binomial and
Poisson distributions.
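The consistency between the two approaches can be checked numerically. The sketch below solves the Poisson expression above for the product λt by bisection, using only the standard library; the function names and the bisection tolerance are illustrative assumptions rather than part of any published tool. Doubling the result should reproduce the tabulated chi-square value at confidence CL with 2r + 2 degrees of freedom.

```python
import math


def poisson_cdf(r, m):
    """P(X <= r) for a Poisson variable with mean m = lambda * t."""
    term = math.exp(-m)
    total = term
    for k in range(1, r + 1):
        term *= m / k          # m^k e^-m / k!, built up recursively
        total += term
    return total


def upper_bound_mean(r, confidence, tol=1e-12):
    """Solve 1 - CL = sum_{k=0}^{r} m^k e^-m / k! for m = lambda * t."""
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    while poisson_cdf(r, hi) > target:   # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if poisson_cdf(r, mid) > target:  # the sum decreases as m grows
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def upper_bound_failure_rate(r, confidence, device_hours):
    """Upper confidence bound on lambda for r failures in device_hours."""
    return upper_bound_mean(r, confidence) / device_hours


for failures in (0, 1, 2):
    print(failures, 2.0 * upper_bound_mean(failures, 0.90))
```

For zero failures the equation reduces to e^(−λt) = 1 − CL, so the bound has the closed form λ = −ln(1 − CL)/t.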
It is important to note that the confidence bounds based on the chi-square distribution
summarized above pertain to the uncertainty from statistical considerations alone. They
do not account for variations in failure rate due to other noise factors, such as:
For example, consider the following summary of the model development and use
approach, along with the potential sources of error, as shown in Figure 6.3-1. The
sources of error are highlighted in the gray boxes. From this, it can be seen that there are
many sources of noise. The model output results reflect the cumulative effects of the
uncertainties in all of the noise sources shown.
Although a theoretical basis for the calculation of the confidence bounds around
reliability predictions is extremely difficult to derive, it is possible to empirically observe
the degree of uncertainty. Reliability predictions performed using empirical models
developed from field data result in a failure rate estimate with relatively wide confidence
bounds. Table 6.3-1 presents the multipliers of the failure rate point estimate as a
function of confidence level. This data was obtained by analyzing data on systems for
which both predicted and observed data was available. For example, using traditional
approaches, one could be 90% certain that the true failure rate was less than 7.57 times
the predicted value.
[Figure 6.3-1: Model development and use approach, with potential sources of error. Labels recoverable from the figure:
Raw Input Data:
  Item Information: Manufacturer, Manufacturing Date, Quality, Defect Rate
  Environmental Stresses: Temperature, Humidity, Delta T, Radiation, Contaminants
  Operational Profile: Duty Cycle, Cycling Rate, Operating Stress, Electrical Stress, Mechanical, Extreme Events
  Data: Operating Hours, Time to Failure, Number of Failures, Failure Relevancy, Degradation vs Catastrophic
Model Development: Censored Data; Biased Estimators; Assumptions Made in Modeling
Unmodeled Noise Factors
Model:
  Modeled Factors: Environmental (Temperature, Humidity, Delta T, Radiation, Contaminants); Operational Profile (Duty Cycle, Cycling Rate, Operating Stress, Electrical Stress, Mechanical, Extreme Events)
User: Item Information (Manufacturing Date, Quality, Defect Rate)
Model Output]
An interesting effect occurs when combining the distributions that describe the
uncertainties of the individual components comprising a system. The uncertainties are
wider at the piece-part level than at the system level. If one were to take the distributions
of failure rate from the regression analysis used to derive the component model (i.e.,
standard error estimate), and statistically combine them with a Monte Carlo summation,
the resultant distribution describing the system prediction uncertainty will have a
variance much smaller than that of the individual components comprising the system.
The reason for this is the effect of the Central Limit Theorem which quantifies the
variance of summed distributions. For example, the variance around the component
failure rate estimate is higher than the variance suggested by the above table. However,
the variance in the above table is observed to be much larger than that theoretically
derived by summing the component failure rate distributions. This implies that there are
system-level effects that contribute to the uncertainty that are not accounted for in the
component-based estimate.
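The narrowing effect of the Monte Carlo summation can be illustrated with a small simulation. The sketch below is hypothetical: it assumes independent components whose failure rate uncertainty is lognormal with an arbitrary spread, and it compares the relative spread (standard deviation divided by mean) of a single component with that of the summed system failure rate. None of the numerical values come from the component models discussed in this book.

```python
import random
import statistics


def relative_spread(samples):
    """Coefficient of variation: standard deviation divided by mean."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


def simulate(n_components=50, n_trials=5000, sigma=1.0, seed=1):
    """Return (component CV, system CV) from a Monte Carlo summation."""
    rng = random.Random(seed)
    component_draws = []
    system_draws = []
    for _ in range(n_trials):
        # One uncertain failure rate per component (assumed lognormal).
        rates = [rng.lognormvariate(0.0, sigma) for _ in range(n_components)]
        component_draws.append(rates[0])
        system_draws.append(sum(rates))   # series system: failure rates add
    return relative_spread(component_draws), relative_spread(system_draws)


component_cv, system_cv = simulate()
print("component relative spread:", component_cv)
print("system relative spread:   ", system_cv)
```

Under these assumptions the relative spread of the sum shrinks roughly with the square root of the number of components, which is why the summed distribution understates the uncertainty actually observed at the system level.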
Bayesian techniques, such as those used in the 217Plus system reliability assessment
methodology, allow the refinement of analytical predictions over time to reflect the
experienced reliability of an item as it progresses through in-house testing, initial field
deployment and subsequent use by the customer. In-house testing can consist of
accelerated tests at the component or equipment level, reliability growth tests, and
reliability screens or accelerated screening techniques.
We will not discuss Bayesian methods in detail here. The primary benefit of using
Bayesian techniques can be implied from Figure 6.3-2, however. As more and more test
and experience data is factored into the initial analytical reliability prediction, the
statistical confidence levels represented by the outside (red) lines on the graph continue
to converge on the “True MTBF” of the subject item. Using Bayesian techniques, as
time approaches infinity the predicted inherent MTBF and the true MTBF of the device,
product or system population become one and the same. This, of course, assumes that
MTBF is the appropriate metric, but the same situation conceptually applies to other
metrics such as failure rate and reliability (R).
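[Figure 6.3-2: MTBF versus TIME, with Upper Confidence Level and Lower Confidence Level lines converging on the “True MTBF”. Phase labels: Prediction, Assessment, Estimation. Data source labels: Paper Analysis, In-House Testing, Field Data]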
On the other hand, consider a component that has a failure rate of 2 FITs (see footnote 7), typical for
many modern electronic components. In the five year design life, assuming continuous
operation, the reliability would be:
$$R = e^{-\lambda t} = 0.999912$$
Therefore, if there were 10,000 of these components operating in a system, the expected
number of failures in the five year period would be less than one.
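The arithmetic behind these figures is short enough to verify directly. The sketch below assumes 8760 operating hours per year for continuous operation, which is an assumption of this example rather than a value stated in the text.

```python
import math

FIT = 1.0e-9                       # one failure per billion hours
failure_rate = 2 * FIT             # 2 FITs, in failures per hour
hours = 5 * 8760                   # five years of continuous operation (assumed 8760 h/yr)

reliability = math.exp(-failure_rate * hours)
expected_failures = 10_000 * (1.0 - reliability)   # 10,000 components in the system

print("R =", round(reliability, 6))
print("expected failures =", expected_failures)
```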
Predicting the reliability behavior of a failure cause based on the extreme left tail of the
TTF distribution of the main population is dangerous, since the accuracy of the
distribution breaks down in its extreme tails. As an example, consider a state-of-the-art
integrated circuit. One failure mechanism is electromigration of the metal lines.
Manufacturers will typically perform life tests of the metal line structures to assess their
lifetime. These tests are done in a manner similar to the practices detailed in this book.
They are accelerated tests performed under a variety of temperature and current density
conditions. Failure times are collected and models are developed to predict lifetimes
under deployment conditions. A goal of a good manufacturer is to design the metal lines
such that the probability of failure is acceptably low when the part is used under specified
conditions. While, as stated, the models developed can be used to estimate the reliability
under deployment conditions, rarely will the prediction be reasonably close to the
observed failure data. The reasons for this are:
A multimode distribution can be used to model this situation, the first mode being
applicable to the defects, and the second being applicable to the main population.
Footnote 7: Two FITs is defined as 2.0 failures per billion hours. This corresponds to 0.002 failures per million hours.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
However, in many cases, it is only the first mode that will impact the field reliability
within the useful life of the component.
Some researchers have attempted to use extreme value statistics for such cases, but they
also have limited usefulness because the data on low failure rate items, like electronic
components, is generally not consistent with these distributions. As a result, low failure
rate items are usually modeled with a constant failure rate (exponential distribution), or a
Weibull distribution. The Weibull is usually used in this case to model the effects of
infant mortality.
Companies engaged in highly competitive industries face extreme time pressures, which
are in stark contrast to the tenets of good reliability engineering practice. The goal of the
reliability engineer should be to select an optimal approach that achieves the desired
purpose of the analysis, while conforming to the practical constraints to which he or she is
subjected.
6.6. Weibayes
There are many cases in reliability modeling in which there are few or no failures. For
these, a Weibayes technique can be used. This approach is practical when there are few
or no failures and a reasonable shape parameter can be estimated. This approach
essentially fixes a plotting position using:
The result of this analysis is a lower single-sided bound of the life distribution. As an
example, consider the following case:
3. The median rank at 1000 hours is 1.39%. A line is drawn through this point with
a beta slope of 3.
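Drawing a line of known slope through a single plotting position is equivalent to solving the Weibull equation for the characteristic life. The sketch below assumes a two-parameter Weibull distribution and uses only the values quoted in the step above (1.39% at 1000 hours, slope of 3); the function names are illustrative.

```python
import math


def weibayes_eta(time, unreliability, beta):
    """Characteristic life of the Weibull line through (time, unreliability)."""
    return time / (-math.log(1.0 - unreliability)) ** (1.0 / beta)


def weibull_cdf(t, beta, eta):
    """Two-parameter Weibull unreliability F(t)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


beta = 3.0
eta = weibayes_eta(1000.0, 0.0139, beta)
print("characteristic life bound (hours):", eta)
print("check F(1000):", weibull_cdf(1000.0, beta, eta))
```

Because the plotting position comes from a test with few or no failures, the resulting line is read as a bound on the life distribution, not as a best estimate.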
where αi and β represent the Weibull distribution parameters for individual items. This is
applicable when β is the same, but αi can be different for each item.
$$\lambda(t) = \lambda_d(t)\, P\left[h_d \cdot \frac{G}{h} > G_{th}\right]$$
where:
λ(t) = the failure rate of the device due to shock-related failure causes
λd(t) = the rate at which the drops occur
hd = the drop height distribution
G/h = the relationship between the G-level and the drop height
Gth = the failure threshold distribution
Since hd and Gth are random variables described by distributions, λ(t) can generally be
estimated with a Monte Carlo analysis, as described earlier in this book.
In this case, the conditional probability of failure if the device is dropped is:
$$P\left[h_d \cdot \frac{G}{h} > G_{th}\right]$$
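A Monte Carlo estimate of the shock-related failure rate can be sketched as follows. The example assumes that a dropped device fails when the G-level it experiences (drop height times the G-per-height relationship) exceeds its failure threshold. The drop height distribution, the threshold distribution, the G/h value and the drop rate are all hypothetical placeholders chosen only to make the example run.

```python
import random


def conditional_failure_probability(n_trials=100_000, g_per_meter=500.0, seed=7):
    """Estimate P(failure | drop) by sampling h_d and G_th."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_trials):
        drop_height = rng.lognormvariate(-0.5, 0.5)   # h_d in meters (hypothetical)
        threshold = rng.normalvariate(400.0, 50.0)    # G_th in G (hypothetical)
        if drop_height * g_per_meter > threshold:     # experienced G exceeds threshold
            failures += 1
    return failures / n_trials


drop_rate = 1.0e-4                  # lambda_d: drops per hour (hypothetical)
p_fail = conditional_failure_probability()
shock_failure_rate = drop_rate * p_fail

print("P(failure | drop) =", p_fail)
print("shock failure rate per hour =", shock_failure_rate)
```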
The 217Plus methodology summarized previously in this book presents one possible
approach for using the initial estimated reliability based on the predictions made from
empirical models, and combining it with empirical data on the same product or system.
This combination is done using Bayesian principles. This is a general approach that can
be extended to include the combination of estimates from different methods that are made
at different levels. For example, consider the case summarized in Table 6.9-1. It may be
possible to characterize specific failure causes with one of the physics-based techniques
summarized herein, but it also may be unlikely that all failure causes can be modeled in
this manner.
In this example, the objective is to estimate the reliability of the assembly, which is
comprised of two components. Component A has physics-based models available for
two of the three primary failure causes.
$$\lambda_{A\text{-}preliminary} = \lambda_1 + \lambda_2 + \lambda_3$$
where λ1, λ2 and λ3 are the failure rates obtained from the model or data available on each
failure cause. Of course, these values should represent the failure rate under the use
conditions for which the assessment is to be made. In this example, λ is used, which
indicates a constant failure rate. However, if the failure rates are time-dependent, the
corresponding time-dependent failure rates or hazard rates can be used. Also, the
methodology to be illustrated in this example is similar to the data combination
methodology described in the 217Plus section, the main difference being that this
example deals with the situation in which there are different types of data at different
hierarchical levels of the product or system, whereas the 217Plus methodology deals with
different types of data within the same configuration item.
Now, since Component A has life test data available from tests performed on the
component, λA-preliminary is the failure rate estimate before accounting for the life test data
on the entire component. This life data will account for any failure causes not included in
the three failure causes considered, and it will also provide additional data on the three
failure causes considered. A better estimate of reliability can be obtained by combining
λA-preliminary with the life test data, using Bayesian techniques. This technique accounts
for the quantity of data by weighting large amounts of data more heavily than small
amounts. λA-preliminary forms the “prior” distribution, comprised of a0 and a0/λA-preliminary.
The empirical data (i.e., test data in this case) is combined with λA-preliminary using the
following equation:
$$\lambda_A = \frac{a_0 + \sum_{i=1}^{n} a_i}{\dfrac{a_0}{\lambda_{A\text{-}preliminary}} + \sum_{i=1}^{n} b_i'}$$
λA is the best estimate of the Component A failure rate, while a0 is the “equivalent”
number of failures of the prior distribution corresponding to λA-preliminary. For these
calculations, 0.5 should be used unless a tailored value can be derived. An example of
this tailoring is provided in Section 2.6 of this book. The equivalent number of hours
associated with λA-preliminary is represented by a0/λA-preliminary. The number of failures
experienced in each source of empirical data is a1 through an. There may be “n” different
sources of data available (for example, each of the “n” sources may correspond to an individual
test or to field data from the population of products). The equivalent number of cumulative
operating hours experienced for each individual data source is b1’ through bn’. These
values must be converted to equivalent hours by accounting for any acceleration
between the test conditions and the use conditions.
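The combination equation is simple to apply once the data sources have been converted to equivalent hours. The sketch below is a minimal illustration; the function name and the numerical inputs are hypothetical, and only the default a0 = 0.5 comes from the text.

```python
def combine_failure_rate(prior_rate, data, a0=0.5):
    """Combine a preliminary failure rate with empirical data.

    prior_rate: preliminary failure rate (failures per hour)
    data: list of (failures, equivalent_hours) pairs, one per data source
    a0: equivalent number of failures assigned to the prior
    """
    total_failures = a0 + sum(failures for failures, _ in data)
    total_hours = a0 / prior_rate + sum(hours for _, hours in data)
    return total_failures / total_hours


# Hypothetical example: a prior of 1 failure per million hours, plus one
# life test with 2 failures in 1,000,000 equivalent hours.
print(combine_failure_rate(1.0e-6, [(2, 1.0e6)]))
```

With no empirical data the function returns the preliminary estimate unchanged, and as the accumulated hours grow the result is dominated by the observed failures per hour.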
The same methodology is, in turn, applied at the parent level assembly, in which case, the
preliminary estimate is:
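$$\lambda_{Assembly\text{-}preliminary} = \lambda_A + \lambda_B$$
and the combined estimate, which has the same form as the Component A equation, is:
$$\lambda_{Assembly} = \frac{a_0 + \sum_{i=1}^{n} a_i}{\dfrac{a_0}{\lambda_{Assembly\text{-}preliminary}} + \sum_{i=1}^{n} b_i'}$$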
If the outcome of the analysis is a failure rate, then the expected number of failures is:
$$N_f = \lambda t$$
where:
If the output of the analysis is a life model that describes the distribution of TTFs for a
specific set of conditions, the number of failures is:
$$N_f = N\left[F(t_2) - F(t_1)\right]$$
where:
In this case, since “F” is a (unitless) probability value, the total population is scaled by
the probability of failure in the time interval of interest. This is identical to the expected
value of the binomial distribution of the number of failures.
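As a sketch of the interval calculation, the example below assumes that the life model is a two-parameter Weibull distribution; the population size and the Weibull parameters are hypothetical values chosen for illustration.

```python
import math


def weibull_cdf(t, beta, eta):
    """Two-parameter Weibull unreliability F(t)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def expected_failures(population, t1, t2, beta, eta):
    """Expected number of failures between t1 and t2: N * [F(t2) - F(t1)]."""
    return population * (weibull_cdf(t2, beta, eta) - weibull_cdf(t1, beta, eta))


# Hypothetical case: 1000 units, beta = 1.5, eta = 50,000 hours.
print(expected_failures(1000, 0.0, 8760.0, 1.5, 50_000.0))       # first year
print(expected_failures(1000, 8760.0, 17_520.0, 1.5, 50_000.0))  # second year
```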
$$R = e^{-\lambda t}$$
The equivalent failure rate can be obtained by solving the above equation for the failure
rate:
$$\lambda = \frac{-\ln(R)}{t}$$
The resulting failure rate value is equal to a failure rate that will result in the same
cumulative percent fail as predicted by the non-constant model at the specific time that
the reliability is calculated. If a different time is chosen, a different value will be
obtained.
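The dependence of the equivalent failure rate on the chosen time can be seen in a short sketch. The Weibull parameters below (shape 2, characteristic life 10,000 hours) are hypothetical and serve only as an example of a non-constant failure rate model.

```python
import math


def weibull_reliability(t, beta, eta):
    """Reliability R(t) of a two-parameter Weibull distribution."""
    return math.exp(-((t / eta) ** beta))


def equivalent_failure_rate(reliability, t):
    """Constant failure rate giving the same reliability at time t."""
    return -math.log(reliability) / t


for hours in (1000.0, 5000.0):
    r = weibull_reliability(hours, 2.0, 10_000.0)
    print(hours, r, equivalent_failure_rate(r, hours))
```

The two equivalent rates differ because the underlying failure rate is increasing with time, which is the reason a different value is obtained when a different time is chosen.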
This technique can be used when the reliability of some parts of a system is calculated
with non-constant failure rate models and others are calculated with a constant failure
rate. It can also be used when modeling “one-shot” devices, which will simply have a
probability of failure instead of a failure rate.
• Mean life
• Median life
• MTBF
• Failure rate
• Time to X% fail
• B10 life
• Distribution parameters:
o Weibull characteristic life and shape parameter
o Lognormal mean and standard deviation
If a constant failure rate distribution is used, there are various units of failure rate
possible. Some of these are:
“Failures per hour” is the fundamental unit. All of these failure rate units can be
translated to each other with a constant multiplication factor. For example, “Failures per
million hours” times 1000 equals “Failures per billion hours” and “Percent failure per
thousand hours” is equivalent to “Failures per hundred thousand hours”.
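Because every unit is a constant multiple of failures per hour, conversions can be routed through that fundamental unit. The sketch below is illustrative; the dictionary keys are labels chosen for this example and cover only the units named in the paragraph above.

```python
# Each unit expressed in failures per hour.
PER_HOUR = {
    "failures per hour": 1.0,
    "failures per million hours": 1.0e-6,
    "failures per billion hours (FITs)": 1.0e-9,
    "percent failure per thousand hours": 0.01 / 1000.0,
}


def convert(value, from_unit, to_unit):
    """Convert a failure rate between units via failures per hour."""
    return value * PER_HOUR[from_unit] / PER_HOUR[to_unit]


print(convert(1.0, "failures per million hours", "failures per billion hours (FITs)"))
print(convert(1.0, "percent failure per thousand hours", "failures per million hours"))
```

One percent per thousand hours is 0.01/1000 = 1 × 10^-5 failures per hour, i.e. one failure per hundred thousand hours.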
In the above cases, the “life unit” shown is in hours (i.e., time), but it does not necessarily
need to be. Other possible life units are cycles, miles, missions, operations, etc.
Additionally, if the life unit in the above listed metrics is time (hours), it can refer to the
number of operating hours, calendar hours, flight hours, etc. Reliability prediction
methods like MIL-HDBK-217 use operating hours as the life unit, whereas 217Plus uses
calendar hours as the life unit. Calculation of the operating failure rate using MIL-
HDBK-217 makes the implicit assumption that the failure rate during non-operating
periods is zero, unless the non-operating failure rate is otherwise accounted for.
However, in all cases the life unit refers to the cumulative value of the population. For
example, if the failure rate unit of “Failures per million hours” is used, the million hours
refers to the cumulative time of the entire population, i.e. the sum of each component’s
number of hours.
While there are reliability assessment methods for the specific causes listed above (i.e.,
components, software, etc.), there are few methodologies that attempt to take a holistic
view of system reliability and integrate them into a single methodology. One example of
a methodology that attempts to do this is 217Plus, which is described in Chapter 7.
[Figure: IPO diagram. Inputs: Initial Conditions, Stresses. Output: Reliability Metrics]
Examples of the IPO variables, as applied to reliability modeling, are shown in Table
6.13-1.
Given the stochastic nature of reliability prediction for many failure causes, it is
impossible to develop a model that is adequately sensitive to all conceivable factors. At
least, this is true in all but the simplest of cases. That’s why model developers need to
select what are believed to be the most relevant factors, and then model accordingly.
This highlights the fact that reliability assessment falls into two distinct categories: the
modeling of intrinsic and extrinsic failure causes. Intrinsic failure causes are generally
those whose root cause is from a known failure mechanism that affects the entire
population of product. These can often be predicted within acceptable bounds by
understanding the stresses, the material properties, etc.
Extrinsic failure causes are those resulting from unpredictable causes, often a complex
sequence of events that ultimately results in the failure cause. Unfortunately, many real
world situations fall into this category. It’s unfortunate because these are the ones whose
likelihood is most difficult to predict. Generally, components that have very low failure
rates are governed by these mechanisms. It is often those unexpected, unpredictable
things that happen somewhere upstream in the process, or in the supplier’s process. This
is the premise behind the 217Plus system assessment methodology. While it is difficult
to predict the likelihood of these “extreme events”, or even identify the failure cause a
priori, it is possible to assess controllable factors that have a relationship to the likelihood
of experiencing the failure cause.
to collect. The faster the growth, the more difficult it is to derive an accurate (i.e.,
“current”) model.
As an example of this reliability growth effect, Table 6.13-2 contains, for each generic
electronic component type, the growth rate that has been observed from data collected by
the RIAC. These reliability growth factors are included in the 217Plus component
models. The growth rate model used for each component for this purpose is:
$$\lambda \propto e^{-\beta (t_1 - t_2)}$$
where:
Figure 6.14-1: Estimated Upper Bound Failure Rates vs. Operating Time at 60% and 90% Confidence
Using a single-sided failure rate bound for reliability estimates can be dangerous, because
it can be very pessimistic. Exactly how pessimistic is determined by the number of
operating hours relative to the true failure rate. Moreover, if the upper bound is used on
multiple components in an assembly, then the pessimism in the assembly failure rate
estimate is compounded.
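The degree of pessimism can be illustrated for the zero-failure case, where the Poisson expression reduces to e^(−λt) = 1 − CL and the upper bound is λ = −ln(1 − CL)/t. In the sketch below, the "true" failure rate, the operating hours and the number of components are hypothetical values chosen for illustration.

```python
import math


def zero_failure_upper_bound(hours, confidence):
    """Single-sided upper bound on failure rate with zero observed failures."""
    return -math.log(1.0 - confidence) / hours


true_rate = 1.0e-8                      # hypothetical true failure rate per hour
for hours in (1.0e5, 1.0e6, 1.0e7):
    for confidence in (0.60, 0.90):
        bound = zero_failure_upper_bound(hours, confidence)
        print(hours, confidence, bound, bound / true_rate)

# Compounding: summing the bounds of 20 identical components, each with
# 1e6 failure-free hours, versus summing their (hypothetical) true rates.
assembly_bound = 20 * zero_failure_upper_bound(1.0e6, 0.90)
assembly_true = 20 * true_rate
print(assembly_bound, assembly_true)
```

The ratio of bound to true rate falls as failure-free hours accumulate, which is the behavior plotted against operating time at the two confidence levels.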
The Bayesian techniques described previously are a way to address the issue of few or no
failures. This is, in fact, the premise of the 217Plus methodology. This approach, while
it requires a prior estimate, can alleviate the pessimistic nature of reliability estimates
made only from an observed number of hours with no failures.
Another related approach is to pool “like” data together for the purpose of estimating a
failure rate. For example, if a component has no failures, but there is also data available
on other components within the “family” of components, the data can be combined. An
example of this approach is described in the section on NPRD (Section 7.4). In that case,
the pooling occurs as a function of part type, quality and environment. The algorithm
used in that case was similar to a Bayesian approach, but was tailored to the specific
constraints of the data.
The application of a component beyond its rated value of stress can result in one or more
undesired effects. First, there can be reliability ramifications, which can manifest
themselves in a variety of ways: either as a sudden, catastrophic failure or as a latent
failure. The detectability of the first is much better, since it can be observed with product
or system testing. Latent failures are much more difficult to detect, and require more
testing and modeling using the techniques described in this book. The second type of
undesired effect is related to component performance. Performance characteristics can
either be permanently degraded or they may be subject to a “reversible” process in which
the performance recovers after the overstress condition is taken away. In any event, these
possible undesired effects should be studied and understood before applying components
beyond their rated stress values.
6.16. References
1. http://www.mortality.org
2. Farachi, V., “Electronic Component Failure Rate Prediction Analysis,” RIAC
Journal, Nov., 2006.
7. Examples
This chapter presents several examples of reliability models that are intended to provide a
cross section of several different methodologies. The focus of the examples is to present
methodologies that the author has personally developed and, thus, can provide insight
into the logic and rationale for their development. Several examples were previously
presented in Chapter 2, but not in detail. This section presents more detail regarding
model factors, development methods, etc.
4. NPRD – This section, covering the RIAC “Nonelectronic Parts Reliability Data
(NPRD)” publication, is presented to illustrate the nuances of field reliability data,
the manner in which data is merged, and the manner in which it is used in
reliability modeling. Some of this information was previously presented in
Chapter 2 in the section on the use of field data, but more detail will be presented
here. This will hopefully provide the user with an appreciation for both the uses
and limitations of this type of data.
The examples presented in this section were selected to provide a cross-section of various
methodologies, including prediction, assessment and estimation. They are presented to
complement the information previously provided in Chapter 2.
The models that are currently contained in MIL-HDBK-217 have been developed by
various organizations, which use various techniques for their development. However,
Reference 1 will be used to illustrate a typical model development methodology. The
study documented in this report developed the models for discrete semiconductor
devices. Excerpts from this report are summarized within this section. The model
development methodology is shown in Figure 7.1-1. Each of the elements in this
methodology is further examined below.
8
As noted previously, as of the publication date of this book, a Draft of MIL-HDBK-217G is currently in the works, with an
anticipated release in 2010.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
• Device Style
• Power Rating
• Package Type
• Semiconductor Material
• Structure (NPN, PNP)
• Electrical Stress
• Circuit Application
• Quality Level
• Duty Cycle
• Operating Frequency
• Junction Temperature
• Application Environment
• Complexity
• Power Cycling
The development of the theoretical device failure rate prediction models is an integral
part of the overall model development process. Information collected through literature
searches and discrete semiconductor user and vendor surveys is reviewed and evaluated
to aid in the development of theoretical models for each discrete semiconductor device
type group. The theoretical models serve the following functions:
1. Assure that the prediction models conform to physical and chemical principles
2. Select variables when they cannot be determined using purely statistical techniques
$$\lambda = \lambda_b \pi_T \pi_E \pi_Q \prod_{i=1}^{n} \pi_i$$
where:
The second task was an extensive survey of discrete semiconductor manufacturers and
users.
The third task was in-person visits to organizations where data could not be accessed by
other means.
The final data collection task was the compilation of data referenced in the literature and
documented technical studies. Also, as part of this task, additional contact was made
with the authors and/or study sponsors to determine whether more data was available.
The results of the four specific data collection tasks are described in the following
sections.
Five minimum criteria were established to define an acceptable data source. Each
potential equipment selection was evaluated with these criteria before proceeding with
data summarization. These five criteria were:
Data summarization consisted of the extraction and compilation of the desired data
elements from the source reports and/or supporting documentation, and coding the data
for computer entry. Data summarization consisted of the following five tasks for sources
of field data:
The data collected for this effort is summarized on the next page, in Table 7.1-1.
Included are, for each part type, the number of observed failures and operating hours. In
addition to this data, other information was captured, such as quality level, environment,
etc.
An example of this is the correlation between quality and environment. This correlation
exists because higher quality parts are often used in the more severe environments. As
such, the analyst’s options are to:
1. Keep the factors as derived, with the caveat that they may be in error
2. Treat the factors as a combined, “pooled” factor representing the correlated
variables
3. Use alternate approaches to quantifying the effects of either or all correlated
variables
For example, consider the following model in which the factors to be included are the
base failure rate, a temperature factor and a stress factor:
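$$\lambda = \lambda_b \pi_T \pi_S$$
or:
$$\lambda = \lambda_b\, e^{-E_a/(KT)}\, S^n$$
Taking the natural logarithm of both sides:
$$\ln\lambda = \ln\lambda_b + \ln e^{-E_a/(KT)} + \ln S^n$$
$$\ln\lambda = \ln\lambda_b - \frac{E_a}{KT} + n \ln S$$
or:
$$\lambda = e^{\ln\lambda_b - \frac{E_a}{KT} + n \ln S}$$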
When the regression is performed, the intercept is “ln λ b”, and the temperature factor and
stress coefficients are “–Ea/K” and “n”, respectively. In MS Excel, the LINEST function
is used to determine the model coefficients. These are the values used in the original
equation:
$$\lambda = \lambda_b\, e^{-E_a/(KT)}\, S^n$$
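The log-linear regression described above can be sketched in a few lines of Python as a stand-in for the spreadsheet LINEST step. This is a minimal illustration, not the procedure used in the study: the data are synthetic and noise-free, the parameter values (50, 0.4 eV, 2) are made up, and the function names are hypothetical. The temperature regressor is taken as 1/(KT) so that the fitted slope is −Ea directly; regressing on 1/T instead yields −Ea/K, as stated in the text.

```python
import math

K_EV = 8.617e-5  # Boltzmann constant in eV/K

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_failure_rate_model(rates, temps_kelvin, stresses):
    """Least-squares fit of ln(rate) = ln(lambda_b) - Ea/(K*T) + n*ln(S)."""
    rows = [[1.0, 1.0 / (K_EV * t), math.log(s)]
            for t, s in zip(temps_kelvin, stresses)]
    y = [math.log(r) for r in rates]
    p = 3
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    intercept, slope, n = solve_linear(xtx, xty)
    return {"lambda_b": math.exp(intercept), "Ea": -slope, "n": n}

# Synthetic, noise-free data generated from known (made-up) parameters.
temps = [300.0, 320.0, 340.0, 360.0, 380.0, 400.0]
stresses = [0.2, 0.8, 0.4, 0.6, 0.3, 0.9]
rates = [50.0 * math.exp(-0.4 / (K_EV * t)) * s ** 2.0
         for t, s in zip(temps, stresses)]
fit = fit_failure_rate_model(rates, temps, stresses)
print(fit)
```

With real field data the fit will not be exact, and the residuals should be examined before the coefficients are substituted back into the original equation.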
If categorical variables are to be modeled, they can be modeled with regression analysis
by assigning a “1” or a “0” to the variable, and performing the regression as described
above. As an example, consider the case in which the product or system to be modeled
has temperature, stress, environment and quality as the four variables affecting the
reliability. This is shown in Table 7.1-3.
The equation above, expanded with the inclusion of the categorical variables, becomes:
$$\lambda = e^{\ln\lambda_b - \frac{E_a}{KT} + n \ln S + A_1 GB + A_2 AI + A_3 GM + A_4\,\text{Comm.} + A_5\,\text{Ind.} + A_6\,\text{Mil}}$$
where Ai are the coefficients of the categorical variables determined from the regression
analysis.
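The 0/1 coding of categorical variables can be illustrated with a short Python sketch. It only shows how a data record is turned into a regressor row and how a fitted coefficient set is evaluated; the coefficient values are invented for illustration and the function names are hypothetical. When such a model is fitted with an intercept, one level of each categorical variable is normally held at zero as the reference level, which is why two of the illustrative coefficients below are zero.

```python
import math

K_EV = 8.617e-5  # Boltzmann constant in eV/K
ENVIRONMENTS = ("GB", "AI", "GM")
QUALITIES = ("Comm", "Ind", "Mil")

def encode(temp_kelvin, stress, environment, quality):
    """Regressor row: [1, 1/T, ln S, environment flags, quality flags]."""
    row = [1.0, 1.0 / temp_kelvin, math.log(stress)]
    row += [1.0 if environment == e else 0.0 for e in ENVIRONMENTS]
    row += [1.0 if quality == q else 0.0 for q in QUALITIES]
    return row

def predict(coefficients, row):
    """lambda = exp(sum of coefficient * regressor)."""
    return math.exp(sum(c * x for c, x in zip(coefficients, row)))

# Hypothetical coefficients: ln(lambda_b), -Ea/K, n, A1..A3 (GB, AI, GM),
# A4..A6 (Comm, Ind, Mil). GB and Mil act as reference levels here.
coefficients = [math.log(0.01), -0.4 / K_EV, 2.0,
                0.0, 0.7, 1.1,
                0.5, 0.2, 0.0]

benign = predict(coefficients, encode(330.0, 0.5, "GB", "Mil"))
mobile = predict(coefficients, encode(330.0, 0.5, "GM", "Mil"))
print(mobile / benign)  # ratio equals exp(A3) for these two records
```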
Residual plots are also useful in assessing how good the model is as a predictor of
reliability. The smaller the residuals, the better the model is.
Another useful plot, similar to a residual plot, is obtained when plotting the log10 of the
observed-to-predicted ratio. If this metric is relatively tightly clustered and centered
around zero, this is an indication of a good model.
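A minimal Python sketch of the observed-to-predicted metric is shown below. The function name and the sample values are hypothetical; the mean and standard deviation are used here only as simple summaries of how well the ratios are centered and clustered.

```python
import math
import statistics

def log_ratio_summary(observed, predicted):
    """Summarize log10(observed / predicted) for paired failure rates."""
    ratios = [math.log10(o / p) for o, p in zip(observed, predicted)]
    return {
        "ratios": ratios,
        "mean": statistics.mean(ratios),    # near zero: no systematic bias
        "stdev": statistics.stdev(ratios),  # small: tightly clustered
    }

# Made-up failure rates (same units for observed and predicted).
summary = log_ratio_summary(observed=[1.0, 10.0, 0.1], predicted=[1.0, 1.0, 1.0])
print(summary)
```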
Another thing that must be accounted for in the model validation effort is the scaling of
base failure rates to account for data in which there were no observed failures. The
methodology presented in this section is based on the premise that there exists a point
estimate of the dependent variable, in this case the failure rate. In cases where there are
no failures, a point estimate is not possible; only a single-sided upper confidence
bound on the failure rate can be calculated. This confidence bound value cannot be used to represent
the data since the resultant model will be pessimistic (i.e., the failure rate will be
artificially increased). Using only the data points for which there are failures is also not
appropriate because it, too, will bias the model pessimistically. Potential
solutions to this situation include:
• Scaling the base failure rates to reflect the zero failure data. One possible
alternative to accomplish this is to scale the base failure rates with the boundary
condition that the predicted number of failures in the entire dataset equals the
observed number.
• Use of maximum likelihood (MLE) parameter estimation techniques. These MLE
techniques are especially suited to censored data such as zero failures.
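The first option, scaling so that the predicted number of failures in the whole dataset equals the observed number, can be sketched as follows. The record layout, units and numbers are assumptions made for illustration: each record holds a predicted failure rate (failures per million hours), the accumulated hours (in millions) and the observed failure count, including records with zero failures.

```python
def base_rate_scale_factor(records):
    """Factor that makes total predicted failures equal total observed failures.

    records: iterable of (predicted_rate, million_hours, observed_failures).
    """
    predicted_failures = sum(rate * hours for rate, hours, _ in records)
    observed_failures = sum(failures for _, _, failures in records)
    return observed_failures / predicted_failures

# Made-up dataset; the second record has zero observed failures.
records = [(2.0, 1.5, 4), (0.5, 4.0, 0), (1.0, 3.0, 2)]
scale = base_rate_scale_factor(records)
scaled_total = sum(scale * rate * hours for rate, hours, _ in records)
print(scale, scaled_total)
```

Multiplying every base failure rate by this factor keeps the relative ranking of the part types while letting the zero-failure hours pull the overall level down.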
The RIAC is chartered with the collection, analysis and dissemination of reliability data
and information. To this end, it publishes quantitative reliability data such as failure rate
and failure mode/mechanism compendiums, as well as failure rate models. It is not
required to provide these services, but does so because there is a need for this data in the
reliability engineering community. It will continue to engage in such activities as long as
there appears to be this need by reliability practitioners. For this reason, the 217Plus
models and methodology were developed.
There are two primary elements to 217Plus, component reliability prediction models and
system-level models. A system failure rate estimate is first made by using the component
models to estimate the failure rate of each component. These failure rates are then
summed to estimate the system failure rate. This is the traditional methodology used in
many reliability predictions, and represents the reliability prediction, i.e., a reliability
estimate that is made before empirical data or detailed assessments are available. This
prediction is then modified in accordance with system level factors, which account for
non-component, or system level, effects. This is an example of a reliability
“assessment”, in which the process and design factors are assessed. Finally, the
prediction and assessment are combined with empirical data to form the reliability
“estimate” of the product, which is the best estimate of reliability based on all analysis
and data available to the analyst.
A flow diagram of the entire approach was presented in Chapter 2, which guides the user
in the application of the component models and the system level models. The basis for
the 217Plus methodology is the component reliability models, which estimate a system’s
reliability by summing the predicted failure rates of the constituent components in the
system. This estimate of the system reliability is further modified by the application of
“System-Level” factors, called Process Grade Factors (PGF). Development of the
component models is presented in Sections 7.2.3 through 7.2.5.
The primary intent of this section is to detail the development of the 217Plus
methodology. It is provided to familiarize the reader with the issues faced by model
developers in order to allow a better understanding of 217Plus and similar models. It
provides details related to certain aspects of model development.
requirements, induced failures, etc., that have not been explicitly addressed in prediction
methods.
The data in Figure 7.2-1, presented previously, contains the nominal percentage of
failures attributable to each of eight identified predominant failure causes based on data
collected by the RIAC. The data in this figure represents nominal percentages. The
actual percentages can vary significantly around these nominal values.
[Figure 7.2-1: Nominal failure cause distribution (pie chart). Parts 22%; No Defect 20%; Manufacturing 15%; Induced 12%; Design 9%; Wearout 9%; Software 9%; System Management 4%.]
Another example that this author has experience with is shown in Figure 7.2-2, which
represents the distribution observed for Erbium Doped Fiber Amplifiers (EDFAs) used in
long haul telecommunications systems. The distribution is different than the above chart,
which is a pooled result from various system types and manufacturers. This example is
provided to illustrate the notion that the system type and manufacturing practices will
dictate the specific distribution obtained.
[Figure 7.2-2: Failure cause distribution for EDFAs (pie chart). Component, Pumps and Other Optical Components 63%; Manufacturing Defect 21%; No Fault Found 8%; Component, Electrical 7%; Component, Mechanical 1%.]
These process grades correspond to the degree to which actions have been taken to
mitigate the occurrence of product or system failure due to these failure categories. Once
the base estimate is modified with the process grades, the reliability estimate is further
modified by empirical data taken throughout item development and testing. This
modification is accomplished using Bayesian techniques that apply the appropriate
weights for the different data elements.
Advantages of the 217Plus methodology are that it uses all available information to form
the best estimate of field reliability, it is tailorable, it has quantifiable confidence bounds,
and it has sensitivity to the predominant product or system reliability drivers. The
methodology represents a holistic approach to predicting, assessing and estimating
product or system reliability by accounting for all primary factors that influence the
inability of an item to perform its intended function. It factors in all available reliability
data as it becomes available on the program. It, thus, integrates test and analysis data,
which provides a better prediction foundation and a means for estimating variances from
different reliability measures.
$$\lambda_P = \lambda_{IA}\left(\Pi_P + \Pi_D + \Pi_M + \Pi_S + \Pi_I + \Pi_N + \Pi_W\right) + \lambda_{SW}$$
The sum of the Pi-factors in the parentheses represents the cumulative multiplier that
accounts for all of the processes used in system development and sustainment. The sum
of these values is normalized to unity for processes that are considered to be the mean of
industry practices. The individual model factors are:
Additional factors included in the model account for the effects of infant mortality,
environment, and reliability growth. Since each of these factors does not influence all of
the factors in the above equation, they are applied selectively to the applicable factors.
For example, environmental stresses will generally accelerate part defects and
manufacturing defects to failure. These additional factors are normalized to unity under
average conditions, so that the value inside the parentheses is one under nominal
conditions and for nominal processes.
$$\lambda_P = \lambda_{IA}\left(\Pi_P \Pi_{IM} \Pi_E + \Pi_D \Pi_G + \Pi_M \Pi_{IM} \Pi_E \Pi_G + \Pi_S \Pi_G + \Pi_I + \Pi_N + \Pi_W\right) + \lambda_{SW}$$
where,
The initial assessment of the failure rate, λIA, is the seed failure rate value, which is
obtained by using the 217Plus component reliability prediction models, along with other
available data. This failure rate is then modified by the Pi-factors that account for
specific processes used in the design and manufacture of the product or system, along
with the environment, reliability growth and infant mortality characteristics of the item.
The above failure rate expression represents the total failure rate of the system, which
includes "induced" and "no defect found" failure causes. If the inherent failure rate is
desired, then the "induced" and "no defect found" Pi-factors should be set to zero, since
they represent operational and non-inherent failure causes.
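The system-level expression, including the option of zeroing the "induced" and "no defect found" factors to obtain the inherent failure rate, can be sketched in Python as follows. The Pi-factor values are illustrative only (chosen to sum to one, with the infant mortality, environment and growth factors at their nominal value of one), and the function and key names are hypothetical.

```python
def system_failure_rate(lam_ia, pi, lam_sw=0.0, inherent_only=False):
    """Evaluate the system failure rate from the initial assessment rate.

    pi: dict with keys P, D, M, S, I, N, W (process factors) and
        IM, E, G (infant mortality, environment, growth factors).
    """
    p = dict(pi)
    if inherent_only:
        # Induced and no-defect-found causes are operational, not inherent.
        p["I"] = 0.0
        p["N"] = 0.0
    multiplier = (p["P"] * p["IM"] * p["E"]
                  + p["D"] * p["G"]
                  + p["M"] * p["IM"] * p["E"] * p["G"]
                  + p["S"] * p["G"]
                  + p["I"] + p["N"] + p["W"])
    return lam_ia * multiplier + lam_sw

# Illustrative factors for a nominal ("average") case.
nominal = {"P": 0.25, "D": 0.10, "M": 0.15, "S": 0.05,
           "I": 0.15, "N": 0.20, "W": 0.10,
           "IM": 1.0, "E": 1.0, "G": 1.0}

total = system_failure_rate(10.0, nominal, lam_sw=1.0)
inherent = system_failure_rate(10.0, nominal, lam_sw=1.0, inherent_only=True)
print(total, inherent)
```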
All variables in the model default to average values, not worst-case values. As a result,
the user has the option of applying any or all factors, depending on the level of
knowledge of the product or system and the amount of time or resources available for the
assessment. If a traditional reliability prediction is desired, the user can perform it using
the component models and the RIAC database failure rates contained in 217Plus (see footnote 9). As
additional data and information become available, the analysis can be expanded to
include these system-level factors.
The sum of the Π factors within the parentheses in the failure rate model is equal to
unity for the average grade. Each factor will increase if "less than average" processes are
in use and decrease if “better than average” processes are in use.
9
The RIAC 217Plus software contains databases that hold the RIAC’s NPRD and EPRD failure rate data, converted to failures per
million calendar hours. The RIAC “Handbook of 217Plus Reliability Prediction Models” does not contain this supplementary data.
The RIAC NPRD and EPRD databooks are available for separate purchase from the RIAC, and are in units of failures per million
operating hours.
Reference 2 presents the results of the study in which the process grades were
determined.
Data was collected by the RIAC on systems for which both predicted and observed
MTBF data was available. This was done for the purpose of quantifying the uncertainty
in traditional component-based predictions. Table 7.2-1 presents the multipliers of a
failure rate point estimate as a function of confidence level that were derived from analysis
of this data. For example, using traditional approaches, one could be 90% certain that the
true failure rate was less than 7.575 times the predicted value.
An analysis was then performed on the Table 7.2-2 data to quantify the distributions of
percentages for each failure cause. This was accomplished by performing a Weibull
analysis of each column. The resulting distributions are summarized in Table 7.2-3.
Table 7.2-4 summarizes the failure rate multiplier values for each of the eight failure
causes as a function of the process grade for that cause. The generic formula for the
multiplier is given as:
$$\Pi_i = \alpha \left(-\ln R_i\right)^{1/\beta}$$
In this calculation, the characteristic percentages listed in Table 7.2-3 are scaled by a
factor of 1.11 to ensure that the sum of the multipliers is equal to one when each grade is
equal to 0.50. In this case, a grade of 0.50 represents an "average" process, and since the
model is normalized to an average process, the total multiplier of the initial assessment
failure rate is equal to one under these conditions.
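The multiplier formula is the inverse of a Weibull reliability function, so it can be evaluated directly once the scale (α) and shape (β) parameters of a failure cause are known. The sketch below assumes that R_i is the process grade expressed as a fraction between 0 and 1; the α and β values are hypothetical and are not taken from Table 7.2-3.

```python
import math

def process_multiplier(grade, alpha, beta):
    """Pi = alpha * (-ln(grade)) ** (1 / beta), for 0 < grade < 1."""
    if not 0.0 < grade < 1.0:
        raise ValueError("grade must lie strictly between 0 and 1")
    return alpha * (-math.log(grade)) ** (1.0 / beta)

# Hypothetical Weibull parameters for one failure cause.
ALPHA, BETA = 0.30, 1.6

for grade in (0.25, 0.50, 0.75, 0.95):
    print(grade, round(process_multiplier(grade, ALPHA, BETA), 3))
```

A higher grade (better process) gives a smaller multiplier, which is the behavior seen down each column of the table.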
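Table 7.2-4 (continued). The first column is the Cumulative Percentage (Grade); the remaining columns are the failure rate multipliers for the Parts, Design, Manufacturing, System Management, Wearout, Induced and No Defect failure causes. The order of the multiplier columns could not be recovered from the source layout.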
0.22 0.365 0.298 0.160 0.113 0.157 0.214 0.330
0.23 0.358 0.288 0.156 0.108 0.154 0.210 0.325
0.24 0.352 0.280 0.152 0.103 0.151 0.206 0.320
0.25 0.345 0.271 0.149 0.098 0.149 0.203 0.315
0.26 0.339 0.263 0.146 0.094 0.146 0.199 0.310
0.27 0.333 0.256 0.143 0.090 0.144 0.196 0.306
0.28 0.328 0.248 0.140 0.086 0.141 0.192 0.301
0.29 0.322 0.241 0.137 0.083 0.139 0.189 0.297
0.30 0.317 0.234 0.134 0.079 0.137 0.185 0.293
0.31 0.311 0.228 0.131 0.076 0.134 0.182 0.288
0.32 0.306 0.221 0.128 0.072 0.132 0.179 0.284
0.33 0.301 0.215 0.125 0.069 0.130 0.176 0.280
0.34 0.296 0.209 0.123 0.067 0.128 0.173 0.276
0.35 0.291 0.203 0.120 0.064 0.126 0.170 0.272
0.36 0.286 0.198 0.118 0.061 0.124 0.167 0.269
0.37 0.281 0.192 0.115 0.059 0.122 0.164 0.265
0.38 0.277 0.187 0.113 0.056 0.120 0.161 0.261
0.39 0.272 0.181 0.110 0.054 0.118 0.159 0.257
0.40 0.267 0.176 0.108 0.052 0.116 0.156 0.254
0.41 0.263 0.171 0.106 0.049 0.114 0.153 0.250
0.42 0.259 0.167 0.104 0.047 0.112 0.151 0.247
0.43 0.254 0.162 0.101 0.045 0.111 0.148 0.243
0.44 0.250 0.157 0.099 0.043 0.109 0.146 0.240
0.45 0.246 0.153 0.097 0.042 0.107 0.143 0.236
0.46 0.241 0.148 0.095 0.040 0.105 0.140 0.233
0.47 0.237 0.144 0.093 0.038 0.104 0.138 0.229
0.48 0.233 0.140 0.091 0.036 0.102 0.136 0.226
0.49 0.229 0.136 0.089 0.035 0.100 0.133 0.223
0.50 0.225 0.132 0.087 0.033 0.098 0.131 0.219
0.51 0.221 0.128 0.085 0.032 0.097 0.128 0.216
0.52 0.217 0.124 0.083 0.030 0.095 0.126 0.213
0.53 0.213 0.120 0.081 0.029 0.093 0.124 0.210
0.54 0.209 0.117 0.080 0.028 0.092 0.121 0.206
0.55 0.205 0.113 0.078 0.026 0.090 0.119 0.203
0.56 0.202 0.109 0.076 0.025 0.088 0.117 0.200
0.57 0.198 0.106 0.074 0.024 0.087 0.114 0.197
0.58 0.194 0.103 0.072 0.023 0.085 0.112 0.194
0.59 0.190 0.099 0.071 0.022 0.084 0.110 0.190
0.60 0.186 0.096 0.069 0.021 0.082 0.108 0.187
0.61 0.183 0.093 0.067 0.020 0.080 0.106 0.184
0.62 0.179 0.090 0.065 0.019 0.079 0.103 0.181
0.63 0.175 0.086 0.064 0.018 0.077 0.101 0.178
0.64 0.172 0.083 0.062 0.017 0.076 0.099 0.174
0.65 0.168 0.080 0.060 0.016 0.074 0.097 0.171
0.66 0.164 0.077 0.059 0.015 0.073 0.095 0.168
0.67 0.160 0.074 0.057 0.014 0.071 0.092 0.165
0.68 0.157 0.072 0.055 0.013 0.069 0.090 0.162
0.69 0.153 0.069 0.054 0.013 0.068 0.088 0.158
0.70 0.149 0.066 0.052 0.012 0.066 0.086 0.155
0.71 0.146 0.063 0.050 0.011 0.065 0.084 0.152
0.72 0.142 0.061 0.049 0.010 0.063 0.081 0.149
0.73 0.138 0.058 0.047 0.010 0.062 0.079 0.145
0.74 0.135 0.055 0.046 0.009 0.060 0.077 0.142
0.75 0.131 0.053 0.044 0.008 0.058 0.075 0.139
0.76 0.127 0.050 0.042 0.008 0.057 0.073 0.135
0.77 0.123 0.048 0.041 0.007 0.055 0.071 0.132
0.78 0.119 0.045 0.039 0.007 0.053 0.068 0.129
0.79 0.116 0.043 0.038 0.006 0.052 0.066 0.125
0.80 0.112 0.040 0.036 0.006 0.050 0.064 0.122
0.81 0.108 0.038 0.035 0.005 0.048 0.062 0.118
0.82 0.104 0.036 0.033 0.005 0.047 0.059 0.114
0.83 0.100 0.034 0.031 0.004 0.045 0.057 0.111
0.84 0.096 0.031 0.030 0.004 0.043 0.055 0.107
0.85 0.092 0.029 0.028 0.003 0.042 0.052 0.103
0.86 0.088 0.027 0.027 0.003 0.040 0.050 0.099
0.87 0.084 0.025 0.025 0.003 0.038 0.047 0.095
0.88 0.079 0.023 0.023 0.002 0.036 0.045 0.091
0.89 0.075 0.021 0.022 0.002 0.034 0.042 0.087
0.90 0.070 0.019 0.020 0.002 0.032 0.040 0.082
0.91 0.066 0.017 0.019 0.001 0.030 0.037 0.078
0.92 0.061 0.015 0.017 0.001 0.028 0.034 0.073
0.93 0.056 0.013 0.015 0.001 0.026 0.031 0.068
0.94 0.051 0.011 0.013 0.001 0.023 0.028 0.062
0.95 0.045 0.009 0.012 0.001 0.021 0.025 0.057
0.96 0.039 0.007 0.010 0.000 0.018 0.022 0.050
0.97 0.033 0.005 0.008 0.000 0.015 0.018 0.043
0.98 0.025 0.003 0.006 0.000 0.012 0.014 0.035
0.99 0.016 0.002 0.003 0.000 0.008 0.009 0.024
where:
$$D_{removed} = D_{in} - D_{remaining}$$
where:
Since SS is the percentage of defects removed from the population, it follows that:
The SSfield is the effective screening strength of the stresses that the product or system
will encounter in the field, and SSESS is the screening strength that the system is exposed
to during environmental stress screening (ESS). It also follows that Dfield is equal to the
cumulative (integral of) field failure rate:
$$D_{field} = \int \lambda(t)\,dt$$
$$D_{field} = \int \lambda_{postscreened}(t)\,dt$$
$$D_{field} = SS \int \lambda_{prescreened}(t)\,dt$$
$$\lambda_{postscreened} = SS \cdot \lambda_{prescreened}$$
This indicates that, in addition to estimating the effect that ESS has on system reliability,
the screening strength calculated from field stresses (SSfield) can be effectively used as a
failure rate multiplier that accounts for the environmental stresses:
$$SS_{field}(t) = \frac{1 - e^{-kt}}{t}$$
where,
The total screening strength, SStotal , after accounting for both the temperature cycling and
vibration-related portions, is:
where:
Algorithms for calculating screening strength are given in a subsequent section. If the
actual values of PTC and PRV are unknown, the default values that should be used are:
PTC = 0.80
PRV = 0.20
Since the component failure rates described above are relative to a ground benign
environment, the failure rate multiplier is the ratio of the SS value in the use environment
to the SS value in a ground benign environment:
where:
As previously indicated, the SS value is the screening strength and has been derived from
MIL-HDBK-344. It is an estimate of the probability of both precipitating a defect to
failure and detecting it once it is precipitated by the test.
$$SS_{TC} = 1 - e^{-k_{TC} t}$$
$$SS_{RV} = 1 - e^{-k_{RV} t}$$
$$k_{TC} = 0.0017\,(\Delta T + 0.6)^{0.6}\,\left[\ln(RATE + 2.718)\right]^{3}$$
where:
ΔT = Tmax − Tmin (in degrees C)
RATE = degrees C/minute
t = # of cycles
$$k_{RV} = 0.0046\,G^{1.71}$$
The parameter “G” is the magnitude of vibration stress, in units of Grms. Whenever
possible, the actual values of delta T (ΔT) and vibration (Grms) should be used for the use
application environment when calculating SS values. If the actual values are not known,
then the default values of ΔT (summarized in the component model descriptions later)
can be used. A discussion of the values of “k” follows.
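The screening strength expressions for temperature cycling and random vibration can be sketched in Python as shown below. Only the two individual strengths are computed. The stress values are made up, the function names are hypothetical, and the duration passed for the vibration case is simply whatever unit the k_RV constant assumes, which this section does not state.

```python
import math

def k_temperature_cycling(delta_t, rate):
    """k_TC = 0.0017 * (dT + 0.6)^0.6 * [ln(RATE + 2.718)]^3."""
    return 0.0017 * (delta_t + 0.6) ** 0.6 * math.log(rate + 2.718) ** 3

def k_random_vibration(g_rms):
    """k_RV = 0.0046 * G^1.71, with G in Grms."""
    return 0.0046 * g_rms ** 1.71

def screening_strength(k, t):
    """SS = 1 - exp(-k * t)."""
    return 1.0 - math.exp(-k * t)

# Made-up stress levels for illustration.
k_tc = k_temperature_cycling(delta_t=40.0, rate=2.0)  # degrees C, degrees C/minute
k_rv = k_random_vibration(g_rms=3.0)

ss_tc = screening_strength(k_tc, t=10)  # t = number of cycles
ss_rv = screening_strength(k_rv, t=10)  # duration unit is an assumption
print(ss_tc, ss_rv)
```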
For RV screens it is necessary to include an axis sensitivity factor. The RV applied in the
axis perpendicular to the plane of the board will have the greatest effect. When selecting
and modeling RV stress, the precipitation efficiency is, thus, given by:
It should also be noted that the expressions and tables for precipitation efficiency are only
approximate and, as in the estimation of initial defects, should be refined based upon
actual user data according to the techniques of Procedure D of MIL-HDBK-344.
Under the average temperature cycling and random vibration conditions that represent
the data used in development of the models, the denominator is 0.205. This value is a
normalization constant such that the environment factor is equal to 1.0 when a product or
system is subjected to the average stress levels.
The values assumed for the rate and duration are 2 degrees C per minute and 10 hours,
respectively. Therefore, the environment factor is:
$$\Pi_G = \frac{1.12\,(t + 2)^{-\alpha}}{2^{-\alpha}}$$
The denominator in the above expression is necessary to ensure that the value of the
factor is 1.12 at the time of field deployment, regardless of the growth rate (α). Figure
7.2-3 illustrates the growth Pi-factor multiplier for various values of growth rates as a
function of time.
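The effect of the normalizing denominator can be checked numerically with a short Python sketch. The function name is hypothetical and the growth rates used are arbitrary illustrative values.

```python
def growth_factor(t_years, alpha):
    """Pi_G = 1.12 * (t + 2)^(-alpha) / 2^(-alpha)."""
    return 1.12 * (t_years + 2.0) ** (-alpha) / 2.0 ** (-alpha)

# At deployment (t = 0) the factor is 1.12 whatever the growth rate.
at_deployment = [growth_factor(0.0, a) for a in (0.0, 0.5, 1.0)]

# With alpha = 1, the factor halves after two years in the field.
after_two_years = growth_factor(2.0, 1.0)
print(at_deployment, after_two_years)
```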
[Figure 7.2-3: Growth Pi-factor, Pi (Growth), versus time (years), plotted for several values of the growth rate α.]
The value of “α” is estimated by determining the degree to which the potential for growth
exists. This estimation is accomplished in a manner similar to the process grading
methodology by assessing and grading the processes that can contribute to reliability
growth.
10
The system reliability growth factor is different from, and in addition to, the reliability growth factors used in the 217Plus
component models to reflect component technology improvements from their respective baseline years.
t = time in years
SSESS = the screening strength of the screen(s) applied, if any.
The value of SS can be determined by using the stress screening strength equations as
presented in Section 7.2.2.9.
The above expression represents the instantaneous failure rate. If the average failure rate
for a given time period is desired, this expression must be integrated and divided by the
time period.
variable data (which is often the case with empirical failure rate data), a requirement of
the model form is that it be multiplicative (i.e., the predicted failure rate is the product of
a base failure rate and several factors that account for the stresses and component
variables that influence reliability). An example of a multiplicative model is as follows:
$$\lambda_p = \lambda_b \pi_e \pi_q \pi_s$$
where:
However, a primary disadvantage of the multiplicative model form is that the predicted
failure rate value can become unrealistically large or small under extreme value
conditions (i.e., when all factors are at their lowest or highest values). This is an inherent
limitation of multiplicative models, primarily due to the fact that individual failure
mechanisms, or classes of failure mechanisms, are not explicitly accounted for. A better
approach is an additive model which predicts a separate failure rate for each generic class
of failure mechanisms. Each of these failure rate terms is then accelerated by the
appropriate stress or component characteristic. This model form is as follows:
$$\lambda_p = \lambda_o \pi_o + \lambda_e \pi_e + \lambda_c \pi_c + \lambda_i + \lambda_{sj} \pi_{sj}$$
where:
By modeling the failure rate in this manner, factors that account for the application and
component specific variables that affect reliability (π factors) can be applied to the
appropriate additive failure rate term. Additional advantages to this approach are that
they:
failures per million calendar hours (F/10^6 CH). This is necessary (and appropriate)
because it is the common basis for all failure rate contribution terms used in the model
(operating, non-operating, cycling, and induced). If an equivalent operating failure rate is
desired (in units of failures per million operating hours), the failure rate (in F/10^6 CH) can
be divided by the duty cycle to yield a failure rate in F/10^6 operating hours.
Solving for λb, and adding a factor to account for data points which have had no observed
failures, yields:
$$\lambda_b = \frac{PFC \cdot \lambda_{obs}}{\pi_o} \cdot PF$$
The PF parameter is the percentage of total observed calendar hours associated with
components that have had observed failures. This factor is necessary to pro-rate the base
failure rate which was calculated from those data records containing failures. Once this
value of λb was calculated for each data record, the geometric mean was used as the best
estimate of the base failure rate.
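The per-record calculation and the geometric mean can be sketched in Python as follows. The record values are invented, the function name is hypothetical, and PFC is treated here simply as the numeric factor appearing in the equation, since its definition is not given in this passage.

```python
import math

def base_failure_rate(records, pf):
    """Geometric mean of (PFC * lambda_obs / pi_o) * PF over the data records.

    records: iterable of (pfc, lambda_obs, pi_o) for records with failures.
    pf: fraction of total calendar hours belonging to records with failures.
    """
    values = [pfc * lam_obs / pi_o * pf for pfc, lam_obs, pi_o in records]
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Made-up records: (PFC, observed failure rate, operating Pi-factor product).
records = [(0.5, 4.0, 2.0), (0.5, 16.0, 2.0)]
lam_b = base_failure_rate(records, pf=0.8)
print(lam_b)
```

The geometric mean is computed through logarithms, which matches the multiplicative nature of failure rate data and limits the influence of a single very large record.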
heavily than small quantities. The failure rate estimate obtained above forms the “prior”
distribution, comprised of a0 and b0.
If empirical data (i.e., test or field data) is available on the system under analysis, it can
be combined with the best pre-build failure rate estimate using the following equation:
$$\lambda = \frac{a_0 + a_1 + \cdots + a_n}{b_0 + b_1 + \cdots + b_n}$$
where:
$$a_0 = 0.5$$
$$b_0 = \frac{a_0}{\lambda_p}$$
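The combination of the predicted failure rate with empirical data can be sketched in Python as follows. This is a minimal illustration under the assumption that each a_i is the number of failures observed in the i-th dataset and each b_i is the corresponding number of hours, expressed in the same units as the predicted failure rate (here, millions of hours); the function name and the data are hypothetical.

```python
def combined_failure_rate(lam_predicted, datasets=(), a0=0.5):
    """lambda = (a0 + sum a_i) / (b0 + sum b_i), with b0 = a0 / lam_predicted.

    datasets: iterable of (failures, hours) pairs; hours in the same units
    as 1 / lam_predicted (an assumption of this sketch).
    """
    b0 = a0 / lam_predicted
    a = a0 + sum(failures for failures, _ in datasets)
    b = b0 + sum(hours for _, hours in datasets)
    return a / b

# Prediction of 2.0 failures per million hours, then one made-up field
# dataset with 3 failures in 1.0 million hours.
prior_only = combined_failure_rate(2.0)
with_data = combined_failure_rate(2.0, datasets=[(3, 1.0)])
print(prior_only, with_data)
```

With no empirical data the result reduces to the prediction, and as hours accumulate the empirical data increasingly dominate the estimate.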
If test data is available that was taken at accelerated conditions, it needs to be converted
to the conditions of interest. A traditional reliability prediction can be performed at both
the test and use conditions, and the equivalent number of hours (bi) can be accelerated by
the failure rate ratio between the test and use temperatures, as follows:
$$H_{Eq} = \frac{\lambda_{T1}}{\lambda_{T2}} \cdot H_T$$
where:
The benefits of including empirical data in the failure rate estimate are that it:
• Integrates all reliability data that is available at the point in time when the
estimate is performed (analogous to the statistical process called “meta-analysis”)
• Provides flexibility for the user to customize the reliability model with actual
historical experience data
$$\lambda_{predicted} = \lambda_{predecessor} \cdot \frac{\lambda_{predicted,\,new}}{\lambda_{predicted,\,predecessor}}$$
The (predicted, new)/(predicted, predecessor) failure rate ratio accounts for the
differences in application environment, complexity, stresses, date, etc. The predicted
failure rates for the predecessor and the new system are determined using the complete
$$\lambda_P = \pi_G \pi_C \left(\lambda_{OB}\,\pi_{DCO}\,\pi_{TO}\,\pi_S + \lambda_{EB}\,\pi_{DCN}\,\pi_{TE} + \lambda_{TCB}\,\pi_{CR}\,\pi_{DT}\right) + \lambda_{SJB}\,\pi_{SJDT} + \lambda_{EOS}$$
$$\pi_G = e^{-\beta (Y - 1993)}$$
β = growth constant. A function of capacitor type (see Table 7.2-6)
πC = capacitance failure rate multiplier:
$$\pi_C = \left(\frac{C}{C_1}\right)^{CE}$$
C = capacitance, in microfarads
C1 = constant. A function of capacitor type (see Table 7.2-6)
CE = constant. A function of capacitor type (see Table 7.2-6)
λOB = base failure rate, operating
πDCO = failure rate multiplier for duty cycle, operating:
$$\pi_{DCO} = \frac{DC}{DC_{1op}}$$
$$\pi_{TO} = \exp\left[\frac{-Ea_{op}}{0.00008617}\left(\frac{1}{T_{AO} + 273} - \frac{1}{298}\right)\right]$$
Eaop = activation energy, operating. A function of capacitor type (see Table 7.2-6)
πS = failure rate multiplier for electrical stress:
$$\pi_S = \left(\frac{S_A}{S_1}\right)^{n}$$
SA = stress ratio, the applied voltage stress divided by the rated voltage
S1 = constant. A function of capacitor type (see Table 7.2-6)
n = constant. A function of capacitor type (see Table 7.2-6)
λEB = base failure rate, environmental (see Table 7.2-6)
πDCN = failure rate multiplier, duty cycle – nonoperating:
$$\pi_{DCN} = \frac{1 - DC}{DC_{1nonop}}$$
$$\pi_{TE} = \exp\left[\frac{-Ea_{nonop}}{0.00008617}\left(\frac{1}{T_{AE} + 273} - \frac{1}{298}\right)\right]$$
$$\pi_{CR} = \frac{CR}{CR_1}$$
$$\pi_{SJDT} = \left(\frac{T_{AO} - T_{AE}}{44}\right)^{2.26}$$
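A few of the capacitor model factors can be evaluated with a short Python sketch. The parameter values passed in below are illustrative placeholders, not a reading of any particular row of Table 7.2-6, and the function names are hypothetical. Note that the temperature factor equals one at 25 degrees C (298 K), which is the reference temperature built into the expression.

```python
import math

K_EV = 0.00008617  # Boltzmann constant in eV/K, as written in the model

def pi_temperature(ea, temp_c):
    """Arrhenius factor relative to 298 K; temp_c in degrees C."""
    return math.exp(-ea / K_EV * (1.0 / (temp_c + 273.0) - 1.0 / 298.0))

def pi_stress(s_a, s_1, n):
    """Electrical stress factor (S_A / S_1)^n."""
    return (s_a / s_1) ** n

def pi_capacitance(c, c_1, ce):
    """Capacitance factor (C / C_1)^CE, with C in microfarads."""
    return (c / c_1) ** ce

def pi_solder_joint_delta_t(t_ao, t_ae):
    """Solder joint factor ((T_AO - T_AE) / 44)^2.26."""
    return ((t_ao - t_ae) / 44.0) ** 2.26

# Illustrative inputs only.
print(pi_temperature(ea=0.5, temp_c=65.0))
print(pi_stress(s_a=0.8, s_1=0.6, n=5))
print(pi_capacitance(c=10.0, c_1=7.6, ce=0.23))
print(pi_solder_joint_delta_t(t_ao=69.0, t_ae=25.0))
```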
Part Type  λOB  λEB  λTCB  λIND  λSJB  β  DC1op  Eaop  TRdefault  DC1nonop  Eanonop  CR1  DT1  n  C1  S1  CE
Aluminum 0.000465 0.00022 0.000214 0.000768 .00095 0.229 0.17 0.5 0 0.83 0.4 1140.35 21 5 7.6 0.6 0.23
Ceramic 0.001292 0.000645 0.000096 0.00014 .00095 0.0082 0.17 0.3 0 0.83 0.3 1140.35 21 3 0.1 0.6 0.09
General 0.000634 0.000351 0.000083 0.000259 .00095 0.033 0.17 0.3 0 0.83 0.3 1140.35 21 7 0.1 0.6 0.09
Mica/Glass 0.000826 0.000997 0.000888 0.000764 .00095 0.0082 0.17 0.4 0 0.83 0.4 1140.35 21 10 0.1 0.6 0.09
Paper 0.000663 0.000075 0.000882 0.000042 .00095 0.0082 0.17 0.2 0 0.83 0.2 1140.35 21 5 0.1 0.6 0.09
Plastic 0.000994 0.001462 0.001657 0.002531 .00095 0.0082 0.17 0.2 0 0.83 0.2 1140.35 21 6 0.1 0.6 0.09
Tantalum 0.000175 0.000049 0.000032 0.000816 .00095 0.229 0.17 0.2 0 0.83 0.2 1140.35 21 17 7.6 0.6 0.23
Tantalum 0.000175 0.000049 0.000032 0.000816 .00095 0.229 0.17 0.2 0 0.83 0.2 1140.35 21 17 7.6 0.6 0.23
Variable, Air 0.002683 0.005193 0.002066 0.000566 .00095 0.0082 0.17 0.3 0 0.83 0.3 1140.35 21 6 0.35 0.5 0.09
Variable, Ceramic 0.002683 0.005193 0.002066 0.000566 .00095 0.0082 0.17 0.3 0 0.83 0.1 1140.35 21 3 0.35 0.5 0.09
Variable, FEP 0.002683 0.005193 0.002066 0.000566 .00095 0.0082 0.17 0.3 0 0.83 0.2 1140.35 21 6 0.35 0.5 0.09
Variable, General 0.002683 0.005193 0.002066 0.000566 .00095 0.0082 0.17 0.3 0 0.83 0.2 1140.35 21 6 0.35 0.5 0.09
Variable, Glass 0.002683 0.005193 0.002066 0.000566 .00095 0.0082 0.17 0.3 0 0.83 0.2 1140.35 21 3 0.35 0.5 0.09
Variable, Mica 0.002683 0.005193 0.002066 0.000566 .00095 0.0082 0.17 0.3 0 0.83 0.2 1140.35 21 10 0.35 0.5 0.09
Variable, Plastic 0.002683 0.005193 0.002066 0.000566 .00095 0.0082 0.17 0.3 0 0.83 0.2 1140.35 21 6 0.35 0.5 0.09
7.2.4.1. Introduction
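$$\pi_{DCO} = \frac{DC}{DC_{1op}}$$
πV = vibration factor:
$$\pi_V = \left(\frac{V_a + 1}{V_c}\right)^{n_{vib}}$$
$$\pi_{DCN} = \frac{1 - DC}{1 - DC_{1op}}$$
$$\pi_{TE} = \exp\left[\frac{-Ea_{nonop}}{0.00008617}\left(\frac{1}{T_{AE} + 273} - \frac{1}{298}\right)\right]$$
$$\pi_{RH} = \left(\frac{RH_a + 1}{RH_c}\right)^{n_{RH}}$$
$$\pi_{CR} = \frac{CR}{CR_1}$$
$$\pi_{DT} = \left(\frac{T_{AO} + T_R - T_{AE}}{14}\right)^{n_{PC}}$$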
Since failure mode data is typically not classified according to these categories, it is
necessary to transform the failure mode distribution data into the failure cause
distribution. This failure mode distribution data was obtained from several sources:
An example of this is summarized in Table 7.2-9, in which the failure causes for a
connector are hypothesized (2nd column), and then an occurrence rating is given for each
cause. This rating is in the 3rd column, and is scored as a 1, 3 or 9. This weighting
scheme is often used in FMEA analysis. The result is a fractional value for each failure
cause that is proportional to the weighting. The sum of all of these values for each
component type equals 1.0.
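The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the study: the function name `occurrence_fractions` and the cause names and ratings are hypothetical. Only the 1/3/9 rating scale and the requirement that the fractions sum to 1.0 come from the text.

```python
def occurrence_fractions(ratings):
    """Convert occurrence ratings (1, 3 or 9) into fractions that sum to 1.0."""
    total = sum(ratings.values())
    return {cause: rating / total for cause, rating in ratings.items()}


# Hypothetical failure causes and ratings (illustrative only, not Table 7.2-9 data).
ratings = {
    "contamination": 9,
    "spring relaxation": 3,
    "o-ring degradation": 3,
    "ferrule cracking": 1,
}

fractions = occurrence_fractions(ratings)
for cause, fraction in fractions.items():
    print(f"{cause}: {fraction:.2%}")
```

As a check on the arithmetic, a component with one cause rated 9, eight rated 3 and eight rated 1 has a rating total of 41, giving fractions of 9/41, 3/41 and 1/41, i.e., about 21.95%, 7.32% and 2.44%, which are the percentage values that appear in Table 7.2-10.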
The methodology used in the photonics device models to derive the fraction of
occurrence differs from the methodology presented previously for the 217Plus
components, in that failure mode distributions were not available during the photonics
model development effort. For the 217Plus models, the components were more mature
and therefore, there was considerable history of both failure mode and failure rate data to
draw upon.
7.2.4.2.2. Map Observed Failure Modes into the Failure Cause Categories
To transform the failure mode distribution data into the failure cause distribution, the
following process was used:
The last item is accomplished by assessing whether each stress is a primary accelerant of
the failure mode, a secondary accelerant, or is not an accelerant. A 3:1 weighting
between primary and secondary accelerant was then used in estimating the percentage of
failures that could be attributed to those stresses.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
308
Chapter 7: Examples
The primary stresses that potentially accelerate operational failure modes are operating
temperature, vibration, current/voltage and optical power. The stresses that accelerate
environmental failure causes are nonoperating ambient temperature, corrosive stresses
(contaminants/heat/humidity), and aging stresses (time). As an example, Table 7.2-10
summarizes this process for our connector example.
Table 7.2-10: Failure Mode to Failure Cause Category for Connectors (SC and FC)
O-ring failure
Spring failure
Total
21.95%
7.32%
7.32%
2.44%
2.44%
2.44%
7.32%
2.44%
2.44%
7.32%
2.44%
2.44%
7.32%
7.32%
7.32%
7.32%
2.44%
100 %
Each of the failure modes is listed across the top of the table, and each of the accelerating
stresses/causes is listed down the left side. Each combination is identified with a “blank”
(no acceleration from the factor), a "p" (primary) or an "s" (secondary). The associated
relative percentage of failures attributable to the accelerating stress/cause is listed down
the right columns.
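The allocation that such a table performs can be sketched as follows. This is a hedged illustration under one reading of the method: each failure mode's percentage is split among its accelerating stresses in proportion to a weight of 3 for a primary accelerant and 1 for a secondary accelerant, and the shares are then summed per stress. The failure mode percentages, stress names and p/s marks below are hypothetical, as is the function name `stress_percentages`.

```python
# 3:1 weighting between primary ("p") and secondary ("s") accelerants;
# a blank (missing) entry means the stress does not accelerate that failure mode.
WEIGHTS = {"p": 3.0, "s": 1.0}


def stress_percentages(mode_pct, marks):
    """Allocate each failure mode's percentage to its accelerating stresses.

    mode_pct: {failure_mode: percent of failures}
    marks:    {failure_mode: {stress: "p" or "s"}}
    """
    totals = {}
    for mode, pct in mode_pct.items():
        weights = {stress: WEIGHTS[mark] for stress, mark in marks.get(mode, {}).items()}
        weight_sum = sum(weights.values())
        if weight_sum == 0:
            continue  # no accelerating stress identified for this mode
        for stress, w in weights.items():
            totals[stress] = totals.get(stress, 0.0) + pct * w / weight_sum
    return totals


# Hypothetical two-mode example (not the Table 7.2-10 data).
mode_pct = {"O-ring failure": 60.0, "Spring failure": 40.0}
marks = {
    "O-ring failure": {"ambient temperature": "p", "humidity": "s"},
    "Spring failure": {"vibration": "p", "ambient temperature": "s"},
}

result = stress_percentages(mode_pct, marks)
print(result)
```

In this example, ambient temperature receives 60 × 3/4 + 40 × 1/4 = 55% of the failures, vibration 30% and humidity 15%.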
$$\% = \sum_{FM_1}^{n} FM\% \left( \frac{w_i}{\sum_{AC_1}^{n} w_i} \right)$$
where:
For example, the % value for ambient temperature (as part of the environmental failure
cause category) is:
The results of this data collection effort, for connectors, are summarized in Table 7.2-12.
Part Type | Data Type | stress columns (TAO, TAE, DC, CR, VA, TR, RHa, Delta T) | Failures | Hours | Observed Lambda
The first column is the part type; the second is the data type. Data types used in the
photonics device study included:
• Field data
• Test data
o Thermal cycling
o Vibration
o Damp heat
o High temperature storage
o Low temperature storage
o Operating life test
The third through tenth columns are the estimates of the actual stresses to which the part
was exposed in the field or during the test. These stresses are defined as follows:
For test data, these values were generally readily available. For data collected from
fielded systems, the actual stress values were not available. Therefore, they had to be
estimated. The default values of the environmental and operating profile factors were
summarized in Tables 7.2-7 and 7.2-8. Only field data from telecommunication
applications used in a ground, stationary, indoors environment was available to the
photonics device modeling study, so only the values pertaining to those conditions were
estimated in this manner.
7.2.4.2.6. Estimate Acceleration Model Constants for Each Part
Acceleration factors (or Pi-factors) were used in the component models to estimate the
effects of various stress and component variables on the failure rate. The two
predominant forms of acceleration factors are the Arrhenius and the power law models.
The Arrhenius model is generally used for modeling temperature effects and is:
$$AF_T = e^{\frac{E_a}{KT}}$$
where “AFT” is the temperature acceleration factor, “Ea” is the activation energy, “K” is
Boltzmann’s constant, and “T” is the temperature (in degrees K).
The specific forms of these acceleration factors that were used in the models are
summarized below.
πV = vibration factor:
$$\pi_V = \left(\frac{V_a + 1}{V_c}\right)^{n_{vib}}$$
$$\pi_{TE} = e^{\left(\frac{-Ea_{nonop}}{0.00008617}\left(\frac{1}{T_{AE} + 273} - \frac{1}{298}\right)\right)}$$
$$\pi_{RH} = \left(\frac{RH_a + 1}{RH_c}\right)^{n_{RH}}$$
$$\pi_{DT} = \left(\frac{T_{AO} + T_R - T_{AE}}{14}\right)^{n_{PC}}$$
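The Pi-factor forms above translate directly into code. The sketch below is illustrative only: the function and argument names are assumptions, and the numerical inputs in the example (activation energy, exponents, stress levels) are hypothetical rather than values from the model tables. Temperatures are taken to be in degrees C, as implied by the "+ 273" term.

```python
import math

BOLTZMANN_EV_PER_K = 0.00008617  # constant appearing in the pi_TE expression


def pi_v(v_a, v_c, n_vib):
    """Vibration factor: ((Va + 1) / Vc) ** n_vib."""
    return ((v_a + 1.0) / v_c) ** n_vib


def pi_te(t_ae, ea_nonop):
    """Environmental temperature factor (Arrhenius form), referenced to 298 K."""
    return math.exp((-ea_nonop / BOLTZMANN_EV_PER_K) * (1.0 / (t_ae + 273.0) - 1.0 / 298.0))


def pi_rh(rh_a, rh_c, n_rh):
    """Relative humidity factor: ((RHa + 1) / RHc) ** n_RH."""
    return ((rh_a + 1.0) / rh_c) ** n_rh


def pi_dt(t_ao, t_r, t_ae, n_pc):
    """Temperature cycling factor: ((TAO + TR - TAE) / 14) ** n_PC."""
    return ((t_ao + t_r - t_ae) / 14.0) ** n_pc


# Hypothetical inputs for illustration.
print(pi_te(t_ae=40.0, ea_nonop=0.3))        # greater than 1 above 25 C
print(pi_rh(rh_a=85.0, rh_c=51.0, n_rh=2.0))
print(pi_dt(t_ao=45.0, t_r=10.0, t_ae=25.0, n_pc=2.0))
```

Each function returns 1.0 when its stress equals the reference condition built into the expression (for example, an environmental temperature of 25 degrees C for π_TE).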
Each model has a single factor that needs to be estimated, i.e., “Ea” for the Arrhenius and
“n” for the power law. These were estimated in one of the following ways:
For #2, the accelerations were categorized from “no acceleration” to “very high
acceleration” for each specific accelerating stress. Table 7.2-13 summarizes the values of
the applicable parameters as a function of the relationship.
Table 7.2-14 summarizes the specific parameter values used in the connector models.
The default values for the applicable photonics device model Pi-factors are summarized
in Table 7.2-15.
Default RH
Default DC
Default CR
Default DT
Default Tr
Vibration
Model Category
Default
Connector 0
Passive Micro-Optic Component 10
Passive Fiber-Based Component 0
Isolator 5
VOA 20
Fiber 0
0.25 1000 1 50 20
Splice 0
Cable 0
Laser Diode Module 15
Photodiode 5
Transmitter 15
Receiver 15
Transceiver 15
7.2.4.2.8. Estimate the Acceleration Factors (Pi-factors) for Each Part from Each Data Source
The acceleration factors used in the models are Pi-factors, which are the acceleration
factors normalized to a given stress level. These factors were calculated for each part
from each data source. To derive these factors, two pieces of information were required:
1. The estimate of the stress for each data point (in this case, a data point is a single
observation of reliability (failures and hours) at a known set of stress conditions).
The manner in which these were quantified was previously explained.
2. The default stress level of the data for each stress parameter in the model
The Pi-factor was then the acceleration model normalized to the default stress level. An
example of this calculation is shown in Table 7.2-16. Every data point available from
field or test data had its associated Pi-factor values calculated. Note that some of the Pi-
factors were zero. This occurs because test data was not applicable to all failure causes.
This concept will be further explained in the next section.
Pi DCO
Pi DCN
Pi RH
Pi CR
Pi TO
Pi DT
Pi TE
Pi V
Cable Field 3.200 1.000 32.000 0.267 1.000 0.137 0.368 1.063
Cable Thermal Cycling 4.000 1.000 1.000 0.000 1.000 0.000 2.037 1.180
Cable Thermal Cycling 4.000 1.000 1.000 0.000 1.000 0.000 4.037 1.149
Cable Thermal Cycling 4.000 1.000 1.000 0.000 1.000 0.000 4.037 1.162
Cable Vibration 4.000 1.000 4084101 0.000 1.000 0.000 0.000 0.000
Cable Vibration 4.000 1.000 496874 0.000 1.000 0.000 0.000 0.000
Cable Vibration 4.000 1.000 785027 0.000 1.000 0.000 0.000 0.000
Cable Vibration 4.000 1.000 4084101 0.000 1.000 0.000 0.000 0.000
Connector Field 3.200 1.135 1.000 0.267 0.974 0.137 0.368 0.360
Connector Damp heat 0.000 1.921 1.000 1.333 1.921 227 0.000 0.000
Connector Damp heat 0.000 1.506 1.000 1.333 1.506 681 0.000 0.000
Connector Damp heat 0.000 1.921 1.000 1.333 1.921 227 0.000 0.000
Connector Damp heat 0.000 1.921 1.000 1.333 1.921 227 0.000 0.000
Connector Damp heat 0.000 1.921 1.000 1.333 1.921 227 0.000 0.000
Connector High temperature storage 0.000 1.921 1.000 1.333 1.921 0.000 0.000 0.000
Connector High temperature storage 0.000 1.921 1.000 1.333 1.921 0.000 0.000 0.000
Connector Low temperature storage 0.000 0.337 1.000 1.333 0.337 0.000 0.000 0.000
7.2.4.2.9. Calculate the Base Failure Rates for Each Cause Such That the Observed Failure
Rates = the Predicted Failure Rates
In the case of the 217Plus models, which were based solely on field data, the base failure
rates were obtained as follows for each failure cause category:
$$\lambda_{Bi} = \frac{\sum_{1}^{m} \left(F_{obs} \times \%_i\right)_{field}}{\sum_{1}^{m} H_{obs_{field}} \times \prod_{1}^{k} \pi}$$
where:
λBi = the base failure rate for the ith failure rate term
Fobs = the number of observed field failures
Hobs = the number of observed field hours
Ππ= the product of the applicable Pi-factors to the applicable field environment
i= the number of failure causes
m= the number of field data sources
k= the number of correction factors
%i = the percentage of failure rate attributable to the specific failure causes
Reliability Information Analysis Center
317
Chapter 7: Examples
The product of the Pi-factors converts the actual hours to an equivalent “effective”
number of hours normalized to the default stress values.
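A minimal sketch of the field-data calculation follows, assuming each data source supplies its observed failures, observed hours and the list of Pi-factors applicable to its environment. The function name, the data layout and all numbers are hypothetical.

```python
from math import prod


def base_failure_rate(field_sources, cause_fraction):
    """Base failure rate for one failure cause from field data only.

    field_sources:  iterable of (failures, hours, pi_factors) tuples
    cause_fraction: fraction of failures attributed to this cause (the %i term)
    """
    failures = sum(f * cause_fraction for f, _, _ in field_sources)
    # Product of Pi-factors converts actual hours to effective hours at default stress.
    effective_hours = sum(h * prod(pis) for _, h, pis in field_sources)
    return failures / effective_hours


# Hypothetical field sources: (failures, hours, Pi-factors).
sources = [
    (4, 2.0e6, [1.0, 0.5]),
    (1, 1.0e6, [2.0]),
]
lam = base_failure_rate(sources, cause_fraction=0.25)
print(f"{lam * 1e6:.4f} failures per million hours")
```

Here 5 failures × 0.25 = 1.25 attributed failures over 3.0e6 effective hours.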
However, in the case of the photonic models developed for the study, it was necessary to
utilize a significant amount of test data since there was not enough field data available.
This is due to the fact that there are few field data sources for photonic components.
Therefore, the modeling methodology needed to be tailored to accommodate the specific
data available on the parts addressed in the photonics device study. This was
accomplished by using a Bayesian technique in which the field data becomes the prior
distribution, and the summation of the failure and hours from all data sources forms the
basis of the posterior distribution. The failure rate parameter of the exponential
distribution was, therefore:
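$$\lambda_{Bi} = \frac{\sum_{1}^{m} \left(F_{obs} \times \%_i\right)_{field} + \sum_{1}^{j} F_{obs_{test}}}{\sum_{1}^{m} H_{obs_{field}} \times \prod_{1}^{k} \pi + \sum_{1}^{j} H_{obs_{test}} \times \prod_{1}^{k} \pi}$$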
Each specific type of test data that was collected for the study was applicable to only one
of the four specific failure causes, as summarized in Table 7.2-17. Field data, however,
encompassed all four failure causes.
One of the advantages to the model structure was this ability to modify the base failure
rates of specific failure causes with test data applicable to only that failure cause.
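The pooled estimate can be sketched as below, under the assumption that only test data applicable to the failure cause in question is passed in, so the test failures are not scaled by the cause fraction. Names, data layout and numbers are hypothetical.

```python
from math import prod


def pooled_base_failure_rate(field_sources, test_sources, cause_fraction):
    """Base failure rate for one failure cause from field plus applicable test data.

    Each source is a (failures, hours, pi_factors) tuple. Field failures are
    apportioned by cause_fraction; test failures are counted in full because each
    test type is taken to address a single failure cause.
    """
    failures = sum(f * cause_fraction for f, _, _ in field_sources)
    failures += sum(f for f, _, _ in test_sources)
    effective_hours = sum(h * prod(pis) for _, h, pis in field_sources)
    effective_hours += sum(h * prod(pis) for _, h, pis in test_sources)
    return failures / effective_hours


# Hypothetical data: one field source and one accelerated test.
field = [(4, 2.0e6, [0.5])]      # 1.0e6 effective hours
test = [(2, 1000.0, [500.0])]    # 5.0e5 effective hours at default stress
lam = pooled_base_failure_rate(field, test, cause_fraction=0.25)
print(f"{lam * 1e6:.3f} failures per million hours")
```

With no test data, the function reduces to the field-only expression, which is one way to see how test results for a single cause shift only that cause's base failure rate.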
The connector base failure rates resulting from this analysis are listed in Table 7.2-18.
Table 7.2-18: Base Failure Rates (Failures per Million Calendar Hours)
Component | Operating | Environmental | Cycling | Induced
Connector | 0.0002 | 0.3053 | 2.7952 | 0.0110
The quality factor (πQ) is calculated in a manner similar to the 217Plus methodology, but
tailored to the unique concerns of photonic components. This factor is calculated as
follows:
$$\pi_Q = \alpha_i \left(-\ln(R_i)\right)^{\frac{1}{\beta_i}}$$
Where αi and βi are Weibull parameters representing the distribution of the percentage of
failures attributable to components (parts). The quality factor is scaled within this
distribution based on how good the parts control program is. The parameter “Ri” is the
rating of the parts control program and is calculated from:
$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$
where,
Ri = rating of the process for the ith failure cause, from 0.0 to 1.0
Gij = the grade for the jth item of the ith failure cause. This grade is the rating
between 0.0 and 1.0 (worst to best).
Wij = the weight of the jth item of the ith failure cause
n i = the number of grading criteria associated with the ith failure cause
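The two expressions above can be combined in a short sketch. The grades, weights and the Weibull parameters α and β used here are hypothetical placeholders; the actual parameter values are not reproduced in this section. Note that the logarithm requires a rating strictly between 0 and 1 (a rating of exactly 1.0 yields a factor of 0).

```python
import math


def process_rating(grades, weights):
    """Weighted rating R_i = sum(G_ij * W_ij) / sum(W_ij), between 0.0 and 1.0."""
    return sum(g * w for g, w in zip(grades, weights)) / sum(weights)


def quality_factor(alpha, beta, rating):
    """alpha * (-ln(R)) ** (1 / beta); defined for 0 < R <= 1."""
    if not 0.0 < rating <= 1.0:
        raise ValueError("rating must be in (0, 1]")
    return alpha * (-math.log(rating)) ** (1.0 / beta)


# Hypothetical grades (0.0 worst to 1.0 best) and weights for four criteria.
grades = [1.0, 0.0, 1.0, 0.4]
weights = [5, 7, 10, 10]

r = process_rating(grades, weights)
print(r)                                   # 19 / 32
print(quality_factor(alpha=1.0, beta=1.0, rating=r))
```

A better parts control program (rating closer to 1.0) gives a smaller factor, and a poorer one a larger factor.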
The 217Plus grading criteria, as applied to the photonics device models, are provided in
Table 7.2-19. These were tailored specifically for photonic components.
Table 7.2-19: Part Quality Process Grade Factor Questions for Photonic Device Models
Parts Contribution to Reliability | Rating | Input Range | User Input | Highest Possible Score | Actual Score
Is there a documented part selection and part management process? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Are teaming relationships established with all critical component suppliers? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Will all suppliers provide timely failure reporting and corrective action support (FRACAS) for both critical and custom parts? (Timely reporting implies a 2 week turnaround with faster response on priority demand.) | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Have suppliers identified the likely failure modes on critical and custom parts, and does the design take these failure modes into account? | yes = 10; no = 0 | Y,N | N | 10 | 0.0
Are operational failure rate and failure mode data provided by the suppliers of critical and custom parts being used? | yes = 7; no = 0 | Y,N | N | 7 | 0.0
Is there a device specification for all critical and custom parts? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Is an optical path adhesive (OPA) used in the component? | A. No OPA = 10; B. yes, MFD < 2 um = 0; C. yes, MFD = 2 - 5 um = 4; D. yes, MFD = 5 - 10 um = 6; E. yes, MFD > 10 um = 8 | A,B,C,D,E | B | 10 | 0.0
Are there thin films (AR coatings, filter elements) in the light path? | A. No thin film = 0; B. yes, and surface is prepared by sputtering = 2; C. yes, and surface is not prepared by sputtering = 3 | A,B,C | A | 3 | 0.0
Does the component contain fused fiber? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Does the component contain fiber? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Was the package thermally designed to safely dissipate heat by understanding and modeling the thermal characteristics? | yes = 3; no = 0 | Y,N | N | 3 | 0.0
Has the manufacturer characterized the power handling capability of the component? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Have acceleration factors for power and temperature been quantified and are they used to determine the derating requirements? | yes = 5; no = 0 | Y,N | N | 5 | 0.0
Does the component contain absorbers at wavelengths for which the component will be exposed (i.e. garnet, shutter, etc.)? | yes = 4; no = 0 | Y,N | N | 4 | 0.0
How is dissipated power intended to be dumped? | A. with a heat sink = 4; B. dissipation not actively managed = 0 | A,B | B | 4 | 0.0
Does the component rely on alignment of free space components attached with organics? | yes = 3; no = 0 | Y,N | N | 3 | 0.0
Cleanliness precautions | A. stringent cleaning procedures = 3; B. some cleaning procedures = 2; C. no cleaning procedures = 0 | A,B,C | C | 3 | 0.0
For components that have a fiber/epoxy interface, is the fiber tip inspected to ensure it is free of defects and contamination? | yes = 3; no = 0 | Y,N | N | 3 | 0.0
Figures 7.2-5 and 7.2-6 illustrate the distribution of this metric for all data and for just
field data. For this analysis, only data for which failures occurred were included, since
data with no observed failures only have a single-sided bound on the failure rate and,
therefore, cannot be compared to the predicted value. The result of not including zero
failure data is that the metric is biased. As can be seen in these figures, the distribution of
all failures is significantly wider than the distribution of just the field failure rates. This
is due to the fact that the non-field data, i.e. test data, is typically at extreme conditions.
Therefore, the uncertainty in these extreme cases is typically larger than for nominal
conditions.
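The comparison metric and the exclusion of zero-failure data can be sketched as follows. The data points are hypothetical, and the variable and function names are illustrative assumptions.

```python
import math
from statistics import mean


def log10_ratio(predicted, observed):
    """Log10 of the predicted/observed failure rate ratio (0 means perfect agreement)."""
    return math.log10(predicted / observed)


# Hypothetical data points: (predicted rate, observed failures, observed hours in millions).
data = [
    (0.50, 2, 5.0),
    (1.20, 0, 3.0),   # zero failures: only a one-sided bound exists, so it is excluded
    (0.08, 4, 10.0),
    (3.00, 1, 2.0),
]

# Keep only data points with at least one observed failure.
ratios = [log10_ratio(pred, f / hours) for pred, f, hours in data if f > 0]

print(ratios)
print("mean log10 ratio:", mean(ratios))
```

A positive value indicates that the model over-predicts the observed failure rate and a negative value that it under-predicts.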
[Histogram: Frequency vs. LOG 10 (PREDICTED/OBSERVED), horizontal axis from -3 to 3]
Figure 7.2-5: Distribution of Log10 Predicted/Observed Failure Rate Ratio for All Data
[Histogram: Frequency vs. LOG 10 (PREDICTED/OBSERVED), bins from -0.25 to 1.75 and "More"]
Figure 7.2-6: Distribution of Log10 Predicted/Observed Ratio for Field Data Only
The distributions of the predicted/observed failure rate ratio are illustrated in Figure 7.2-7.
With this metric, the value should be centered about one, since the log of this ratio has
not been taken.
[Lognormal probability plot (ReliaSoft Weibull++ 7, www.ReliaSoft.com): cumulative probability vs. predicted/observed ratio, horizontal axis from 0.001 to 100]
Folio1\Data 1: μ = −1.5585, σ = 2.4925, ρ = 0.9880
Folio1\Data 2: μ = 0.4556, σ = 0.9547, ρ = 0.8813
Figure 7.2-7: Distributions of the Predicted/Observed Failure Rate Ratio for All Data
and For Field Data Only
One of the problems that developers had when developing MIL-HDBK-217 models was
de-convolving the effects of quality and environment. For example, multiple linear
regression analysis of field failure rate data was usually used to quantify model variables
as a function of independent variables such as quality and environment. A basic
assumption of such techniques is that the independent variables are statistically
independent of each other. However, in reality they are not, since the “higher” quality
components are generally used in the severe environments and the commercial quality
components are used in the more benign environments. This correlation makes it
difficult to discern the effects of each of the variables individually. Additionally, there
are several attributes pooled into the quality factor, including qualification, process
certification, screening and quality systems.
The approach used in the 217Plus model to quantify the effects of part quality is to treat it
as one of the failure causes for which a process grade is determined. In this manner,
issues related to qualification, process certification, screening and quality systems were
individually addressed.
If an equivalent operating failure rate is desired in units of failures per million operating
hours, the 217Plus reliability prediction should be performed with the actual duty cycle to
which the unit will be subjected. The resulting failure rate (in f/10^6 calendar hours) is
then divided by the duty cycle to yield a failure rate in f/10^6 operating hours. The
resulting “operating” failure rate will be artificially increased to account for the
nonoperating and cycling failures that would not otherwise be accounted for. The
incorrect way to predict a 217Plus failure rate in units of failures per million operating
hours is to set the duty cycle equal to 1.0. The resulting failure rate in this case would be
valid only if the actual duty cycle is 100%. If the actual duty cycle is not 100%, then the
failures during non-operating periods will not be accounted for.
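The conversion is a single division, sketched below with hypothetical numbers. The function name is an assumption; the calendar-hour failure rate is taken to come from a prediction already performed at the actual duty cycle.

```python
def operating_failure_rate(calendar_rate, duty_cycle):
    """Convert f/10^6 calendar hours to f/10^6 operating hours.

    calendar_rate: predicted failure rate per million calendar hours,
                   computed with the actual duty cycle
    duty_cycle:    fraction of calendar time the unit operates (0 < DC <= 1)
    """
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty cycle must be in (0, 1]")
    return calendar_rate / duty_cycle


# Hypothetical prediction: 2.0 f/10^6 calendar hours at a 25% duty cycle.
print(operating_failure_rate(2.0, 0.25))
```

Because all failures (operating, nonoperating and cycling) are charged against the operating hours only, the result is larger than the calendar-hour rate whenever the duty cycle is below 100%.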
where:
where αi and βi are constants for each failure cause category, as given in Table 7.2-21.
The parameter Ri is calculated as:
$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$
where:
Ri = rating of the process for the ith failure cause, from 0.0 to 1.0.
Gij = the grade for the jth item of the ith failure cause. This grade is the rating
between 0.0 and 1.0 (worst to best).
Wij = the weight of the jth item of the ith failure cause
n i = the number of grading criteria associated with the ith failure cause
where:
t= time in years. This is the instantaneous time at which the failure rate is
to be evaluated. If the average failure rate for a given time period is
desired, this expression must be integrated and divided by the time
period.
SSESS = the screening strength of the screen(s) applied, if any
ΠE = environmental factor
$$\pi_E = \frac{0.855 \times \left[0.8\left(1 - e^{-0.065(\Delta T + 0.6)^{0.6}}\right) + 0.2\left(1 - e^{-0.046 G^{1.71}}\right)\right]}{0.205}$$
where:
$$\Pi_G = \frac{1.12\,(t + 2)^{-\alpha}}{2^{-\alpha}}$$
where:
$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$
The rating for each process grade type, Ri, is given as:
$$R_i = \frac{\sum_{j=1}^{n_i} G_{ij} W_{ij}}{\sum_{j=1}^{n_i} W_{ij}}$$
where:
Ri = rating of the process for the ith failure cause, from 0.0 to 1.0.
Gij = the grade for the jth item of the ith failure cause. This grade is the rating
between 0.0 and 1.0 (worst to best).
Wij = the weight of the jth item of the ith failure cause
n i = number of grading criteria associated with the ith failure cause
These tables are organized as follows. Column 1 contains the criteria associated with the
specific Process Grade Type. Column 2 is the grading criteria (Gij). Most of the
questions are designated with a Y/N in this column. In these cases, a Yes (Y) answer
equals "1" and a "No" answer equals “0”. The question will receive the full weighted
score for a "Yes" answer and a zero for a "No" answer. In some cases, the grading
criteria is not binary, but rather can be one of three or four possible values. The grading
criteria for these are noted in this column. Column 3 identifies the scoring weight (Wij)
associated with the specific question.
In the event that a model user does not wish to answer all of the questions, he/she can
choose a subset of the most important questions by using only those with weight values
of seven or higher. Questions that are not scored should not be counted in the number of
grading criteria (ni) associated with the ith failure score.
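The effect of scoring only a subset of questions can be sketched as follows. The question list is hypothetical; an unanswered question is represented by a grade of `None` (an assumption of this sketch) and is left out of both the numerator and the denominator, consistent with not counting it among the grading criteria.

```python
def process_rating(questions, min_weight=0):
    """Rating from (grade, weight) pairs, using only scored questions.

    questions:  list of (grade, weight); grade is None if the question was not answered
    min_weight: only questions with weight >= min_weight are scored
    """
    scored = [(g, w) for g, w in questions if g is not None and w >= min_weight]
    if not scored:
        raise ValueError("no questions were scored")
    return sum(g * w for g, w in scored) / sum(w for _, w in scored)


# Hypothetical answers: (grade, weight); Yes = 1, No = 0.
questions = [(1, 10), (0, 8), (1, 5), (None, 3), (1, 7)]

print(process_rating(questions))                # all answered questions
print(process_rating(questions, min_weight=7))  # only weights of seven or higher
```

In this example the full set gives 22/30 and the reduced set gives 17/25, showing that the shortcut can shift the rating in either direction depending on how the lower-weight questions would have been answered.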
Question | Gij | Wij
Can any of the line or quality personnel "stop the line" if that person believes a serious problem exists? | Yes = 1, No = 0 | 5
Has the majority of the manufacturing leadership had direct field or customer contact in the past year? | Yes = 1, No = 0 | 3
Do manufacturing people have measurable goals to improve production metrics, including quality and cycle time? | Yes = 1, No = 0 | 5
If answer to 3.2.13 is yes, do direct manufacturing people participate in developing the goals? | Yes = 1, No = 0 | 4
Do manufacturing personnel have goals for continuous quality improvement? | Yes = 1, No = 0 | 3
Are there quality circles that meet regularly? | Yes = 1, No = 0 | 3
Are teams rewarded or recognized for improving quality? | Yes = 1, No = 0 | 3
Are key metrics for quality and cost monitored and tracked? | Yes = 1, No = 0 | 5
Is the cost of defect prevention measures tracked (proactive quality)? | Yes = 1, No = 0 | 3
Table 7.2-27: Can Not Duplicate (CND) Process Grade Factor Questions
Question | Gij | Wij
Is the system required to isolate to a single Field Replaceable Unit (FRU) on 90% of failures? | Yes = 1, No = 0 | 6
Is there a specified time limit to isolate a fault, effect a repair and restore the system? | Yes = 1, No = 0 | 5
Is there a requirement for 90% or greater test coverage within the FRU being analyzed? | Yes = 1, No = 0 | 6
Does the system promote remote serviceability with failure status communicated via Ethernet, serial port, parallel port, serial bus, etc., to a central maintenance station? | Yes = 1, No = 0 | 4
Is there any remote failure protection for this FRU residing on a separate FRU (e.g., an arc suppression circuit that is located on a different FRU than the relay FRU)? | Yes = 1, No = 0 | 5
Is this FRU designed to be hot-pluggable? | Yes = 1, No = 0 | 4
Does the FRU designer also design the fault isolation software that supports fault diagnosis? | Yes = 1, No = 0 | 6
Are multiple occurrences of "Can Not Duplicate" (CND) incidents analyzed for root cause of the problem? | Yes = 1, No = 0 | 6
Are test, warranty, early-life, and high fallout FRUs subjected to double fault verification (this procedure re-inserts the faulted FRU to ensure the problems track the replaced FRU)? | Yes = 1, No = 0 | 10
Do your current products experience 40% or less Can Not Duplicate (CND) failures (note that CNDs are synonymous with No Defects Found (NDF) and No Trouble Found (NTF))? | Yes = 1, No = 0 | 8
Is a failure mode and effect analysis (FMEA) performed down to the FRU level or the Circuit Card Assembly (CCA) level, whichever is lower? | Yes = 1, No = 0 | 5
Do design personnel participate directly in performing the FMEA? | Yes = 1, No = 0 | 5
Are maintenance analysis procedures (MAPs) developed to map failure symptoms to the failing FRU? | Yes = 1, No = 0 | 5
Are the MAPs verified by inserting faults in a maintainability test? | Yes = 1, No = 0 | 4
Are the MAPs updated with actual test and field data? | Yes = 1, No = 0 | 5
Has your company established the cost impact of a field failure? | Yes = 1, No = 0 | 3
Does the system contain error logging and reporting capability? | Yes = 1, No = 0 | 5
Does the system promote ongoing analysis of soft error conditions that might predict when a likely failure will occur? | Yes = 1, No = 0 | 4
Will the contractor developing this equipment also be responsible for maintaining it? | Yes = 1, No = 0 | 5
Does the repair facility have the ability to recreate the conditions under which a true false alarm occurred (sequence of events, operator error, sneak circuit, etc.) and are these techniques used to try to recreate the failure? | Yes = 1, No = 0 | 5
Does the repair facility have the ability to recreate the conditions under which a real failure occurred (high/low temperature, thermal cycling/shock, vibration/mechanical shock, etc.) and are these techniques used to try to recreate the failure? | Yes = 1, No = 0 | 5
Will the maintainer be motivated to provide timely and complete documentation of the diagnosis and repair action? | Yes = 1, No = 0 | 5
Do the system maintenance personnel receive feedback on their repair reports and the actions taken to mitigate the failure reoccurrence? | Yes = 1, No = 0 | 5
Are the performance specification limits of the test equipment used to troubleshoot/repair the system, FRU, etc., equal to or more stringent than the performance specification limits of the system, FRU, etc., in its actual application? | Yes = 1, No = 0 | 5
Are CND failures included in the Failure Reporting and Corrective Action System (FRACAS) system and closed out through corrective action verification? | Yes = 1, No = 0 | 5
7.3.2. Approach
All samples were tested under a variety of temperature and relative humidity conditions.
In addition, samples included two factors which were varied in the life tests: Process
Force and Hardness. These stresses and product/process variables were expected to be
the ones that most heavily influenced the product reliability.
The tests were performed by first inspecting each sample, then exposing them to the
specific combination of variables as previously summarized, and, finally, re-inspecting
them at various intervals. The exposure times and inspection intervals were structured
such that short lifetimes could be observed in the event that acceleration factors were
higher than anticipated. Therefore, more frequent inspections were performed early in
the test, followed by less frequent inspections for the surviving samples. Failed samples
were removed from the test.
Data was then summarized in a format suitable for life modeling. The required data
elements included stress and product/process variables, plus life variables, as follows:
• Variables:
o Temperature
o Humidity
o Process force
o Hardness
• Life variables
o Last known good time
o First known bad time
7.3.4. Results
The 2-parameter Weibull distribution parameters for the TTF distributions for the
samples are shown in Table 7.3-4.
The TTF distributions for each of the three test conditions are illustrated in Figure 7.3-1.
[Weibull probability plot (2-parameter, MLE): Unreliability, F(t) vs. Time, (t), horizontal axis from 1 to 10000. Plot stamp: Bill Denson, Corning, 11/24/2008 5:05:22 PM]
Folio1\SL-130, 100: F=33/S=9, β = 3.2183, η = 62.1195
Folio1\SL-130, 85: F=43/S=0, β = 2.7221, η = 268.2479
Folio1\SL-85, 85: F=2/S=40, β = 5.0505, η = 2109.0635
$$R = e^{-\left(\frac{t}{\alpha}\right)^{\beta}}$$
where:
The characteristic life is then developed as a function of the applicable variables. The
model form is:
$$\alpha = e^{\alpha_0} \, e^{\frac{\alpha_1}{T}} \, RH^{\alpha_2} \, H^{\alpha_4} \, F^{\alpha_3}$$
Where:
$$L = \prod f(t_i, \beta, \alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4) \times \prod \left(1 - F(t_j, \beta, \alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4)\right) \times \cdots$$
where:
The first of the three product terms represents failures at known times, the second
represents survivals, and the third represents failures that occur within intervals but whose
precise failure times are not known.
Once the model parameters are estimated in this fashion, the reliability at any time, and
for any combination of variables, can be estimated.
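The three-term likelihood described above can be sketched in log form for a Weibull distribution. This is an illustration under stated assumptions: the interval term is written in the standard form F(first known bad time) − F(last known good time), the characteristic life α is passed in directly rather than computed from the stress variables, and all data values are hypothetical.

```python
import math


def weibull_cdf(t, beta, alpha):
    """F(t) = 1 - exp(-(t / alpha) ** beta); alpha is the characteristic life."""
    return -math.expm1(-((t / alpha) ** beta))


def weibull_logpdf(t, beta, alpha):
    """Log of the Weibull density f(t)."""
    z = t / alpha
    return math.log(beta / alpha) + (beta - 1.0) * math.log(z) - z ** beta


def log_likelihood(beta, alpha, exact=(), survivors=(), intervals=()):
    """Sum of log terms for exact failures, survivors and interval-censored failures."""
    ll = sum(weibull_logpdf(t, beta, alpha) for t in exact)
    ll += sum(-((t / alpha) ** beta) for t in survivors)          # log(1 - F(t))
    ll += sum(
        math.log(weibull_cdf(bad, beta, alpha) - weibull_cdf(good, beta, alpha))
        for good, bad in intervals                                 # (last good, first bad)
    )
    return ll


# Hypothetical inspection data, in hours.
intervals = [(50.0, 100.0), (100.0, 150.0)]
survivors = [200.0]

for alpha in (50.0, 100.0, 200.0, 400.0):
    print(alpha, log_likelihood(2.0, alpha, survivors=survivors, intervals=intervals))
```

In practice the parameters would be found by maximizing this function with a numerical optimizer, with α replaced by the stress-dependent model form so that β and the model constants are estimated together.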
The estimated parameters are summarized in Table 7.3-5. In this table, the best estimate
is provided along with the 80% two-sided confidence limits around the estimate. A small
difference between the lower and upper confidence bounds is indicative of a significant
variable.
$$\alpha = e^{23.98} \, e^{\frac{8015.7}{T}} \, RH^{-8.83} \, H^{0.2150} \, F^{0.0388}$$
Once the model parameters are estimated, then a variety of output formats are possible.
For example, Figure 7.3-2 illustrates the probability of failure as a function of
temperature and relative humidity at a time of 50,000 hours.
Figure 7.3-2: Probability of Failure vs. Temperature and Relative Humidity at 50,000
Hours
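An output of this kind can be sketched by evaluating the fitted characteristic life and then the Weibull probability of failure. The sketch reads the fitted equation as α = e^23.98 · e^(8015.7/T) · RH^(−8.83) · H^(0.2150) · F^(0.0388). Several inputs are assumptions rather than values from the text: T is taken to be absolute temperature in kelvins, RH is taken in percent, the hardness and force values are placeholders, and the shape parameter β is a hypothetical value because Table 7.3-5 is not reproduced here.

```python
import math


def characteristic_life(temp_k, rh, hardness, force):
    """Characteristic life from the fitted model (units as assumed in the lead-in)."""
    return (
        math.exp(23.98)
        * math.exp(8015.7 / temp_k)
        * rh ** -8.83
        * hardness ** 0.2150
        * force ** 0.0388
    )


def prob_failure(t, alpha, beta):
    """Weibull unreliability F(t) = 1 - exp(-(t / alpha) ** beta)."""
    return -math.expm1(-((t / alpha) ** beta))


BETA = 3.0                  # hypothetical shape parameter
HARDNESS, FORCE = 1.0, 1.0  # placeholder product/process values

for temp_c in (25.0, 55.0, 85.0):
    for rh in (30.0, 50.0, 85.0):
        alpha = characteristic_life(temp_c + 273.15, rh, HARDNESS, FORCE)
        print(temp_c, rh, prob_failure(50_000.0, alpha, BETA))
```

Sweeping temperature and humidity in this way yields the grid of values from which a surface like the one in Figure 7.3-2 could be drawn.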
there are few sources of failure rate data for other component types. All part types and
assemblies for which RIAC has data are included in NPRD with the exception of
standard electronic component types. Although the data contained in NPRD were
collected from a wide variety of sources, RIAC has screened the data such that only high
quality data is added to the database and presented in this document. In addition, only
field failure rate data is included. The intent of this section is to provide the user with
information to adequately interpret and use data to supplement standard reliability
prediction methodologies.
The following generic sources of data were used for this publication:
An example of the process by which RIAC identifies candidate systems and extracts
reliability data on military systems is summarized in Table 7.4-1.
Perhaps the most important aspect of this data collection process is identifying viable
sources of high quality data. Large automated maintenance databases, such as the Air
Force REMIS system or the Navy's 3M and Avionics 3M systems, typically will not
provide accurate data on piece parts. They can, however, provide acceptable data on
assemblies or LRUs, if used judiciously. Additionally, there are specific instances in
which they can be used to obtain piece-part data. Piece-part data from these maintenance
systems is used in the RIAC's data collection efforts only when it can be verified that
they accurately report data at this level. Reliability Improvement Warranty (RIW) data
are another high quality data source which has been used.
Inherent limitations in data collection efforts can result in errors and inaccuracies in
summary data. Care must be taken to ensure that the following factors are considered
when using a data source. Some of the sources of error are:
1. There are many more factors affecting reliability than can be identified
2. There is a degree of uncertainty in any failure rate data collection effort. This
uncertainty is due to the following factors:
a. Uncertainty as to whether the failure was inherent (common cause) or
event-related (special cause)
b. Difficulty in separating primary and secondary failures
c. Much of the collected data is generic and not manufacturer specific,
indicating that variations in the manufacturing process are not accounted
for
d. It is very difficult to distinguish between the effects of highly correlated
variables. For example, the fact that higher quality components are
typically used in more severe environments makes it impossible to
distinguish the effect that each has, independently, on reliability.
e. Operating hours can be reported inaccurately
f. Maintenance logs can be incomplete
Actual component stresses are rarely known. Even if nominal stresses are known, actual
stresses which significantly impact reliability can vary significantly about this nominal
value. The impact of complex environmental stresses on reliability during field
operation of a product or system is also extremely difficult, if not impossible, to discern.
When collecting field failure data, a very important variable is the criteria used to define,
detect and classify failures. Many of the failures reported in NPRD-2010 were
identified by maintenance technicians performing a repair action, indicating that the
criterion for failure is that a part in a particular application has failed in a manner that
makes it apparent to the technician. In some data sources, the criterion for failure was
that the component replacement must have remedied the failure symptom.
Data in the summary section of NPRD represent an "estimate" of the expected failure
rate. The "true" value will lie within some confidence interval about that estimate. The
traditional method of identifying confidence limits for components with exponentially
distributed lifetimes has been the use of the Chi-Square distribution. This distribution
relies on the observance of failures from a homogeneous population and, therefore, has
limited applicability to merged data points from a variety of sources.
To give users of NPRD a better understanding of the confidence they can place in the
presented failure rates, RIAC data were analyzed in the past. That analysis concluded that,
for a given generic part type, the natural logarithm of the observed failure rate is normally
distributed with a standard deviation of 1.5. This means that 68% of the actual experienced
failure rates will be between 0.22 and 4.5 times the mean value. Similarly, 90% of actual
failure rates will be between 0.08 and 11.9 times the presented mean value. As a general
rule of thumb, this level of precision is typical of probabilistic reliability prediction
models and point-estimate failure rates such as those contained within NPRD. It should
be noted that this precision applies to predicted failure rates at the component level,
and that confidence will increase as the statistical distributions of components are
combined when analyzing modules or systems.
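The multipliers quoted above can be checked with a short calculation. The following is a minimal sketch, assuming only the stated standard deviation of 1.5 on the natural logarithm of the failure rate; the function name is illustrative and is not part of NPRD. Strictly, the multipliers bracket the central (median) value of a lognormal distribution, and small differences from the quoted figures reflect rounding of the normal quantiles.

```python
import math
from statistics import NormalDist

SIGMA_LN = 1.5  # standard deviation of ln(failure rate), as stated in the text


def multiplicative_bounds(coverage, sigma=SIGMA_LN):
    """Return (low, high) multipliers on the central value that bracket
    the requested two-sided coverage of a lognormal failure rate."""
    # Two-sided normal quantile, e.g. about 1.645 for 90% coverage.
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return math.exp(-z * sigma), math.exp(z * sigma)


for coverage in (0.68, 0.90):
    low, high = multiplicative_bounds(coverage)
    print(f"{coverage:.0%} of failure rates: {low:.2f}x to {high:.1f}x")
```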
In virtually all of the field failure data collected for NPRD, TTF was not available. Few
current DoD or commercial data tracking systems report elapsed time indicator (ETI)
meter readings that would allow TTF compilations. Those that do lose accuracy
following removal and replacement of failed items. To accurately monitor these times,
each replaceable item would require its own individual time recording device. Data
collection efforts typically track only the total number of item failures, part populations,
and the number of system operating hours. This means that the assumed underlying TTF
distribution for all failure rates presented in NPRD is the exponential distribution.
Unfortunately, many part types for which data are presented typically do not follow the
exponential failure law, but rather exhibit wearout characteristics, or an increasing failure
rate in time. While the actual TTF distribution may be Weibull or lognormal, it may
appear to be exponentially distributed if a long enough time has elapsed. This
assumption is accurate only under the condition that components are replaced upon
failure, which is true for the vast majority of data contained in NPRD. To illustrate this,
refer to Figure 7.4-1, which depicts the apparent failure rate for a population of
components that are replaced upon failure, each of which follows the Weibull TTF
distribution. This illustrates Drenick’s theorem, which was discussed earlier in this book.
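The behavior described above can be reproduced with a small Monte Carlo simulation. This is a sketch under stated assumptions, not the method used to generate the figure: the characteristic life of 1000 hours, the shape parameter of 3, the population of 2000 sockets and the observation windows are all hypothetical values chosen for illustration. Each socket starts at time zero and its part is replaced immediately upon failure.

```python
import math
import random


def apparent_failure_rate(alpha, beta, n_units, t_start, t_end, seed=0):
    """Failures per unit-hour observed in [t_start, t_end) for sockets
    whose parts are replaced immediately upon failure."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_units):
        t = 0.0
        while True:
            # Time of the next failure in this socket (Weibull lifetime).
            t += rng.weibullvariate(alpha, beta)
            if t >= t_end:
                break
            if t >= t_start:
                failures += 1
    return failures / (n_units * (t_end - t_start))


alpha, beta = 1000.0, 3.0  # hypothetical characteristic life (hours) and shape
mttf = alpha * math.gamma(1.0 + 1.0 / beta)

early = apparent_failure_rate(alpha, beta, 2000, 0.0, 0.3 * alpha)
late = apparent_failure_rate(alpha, beta, 2000, 8.0 * alpha, 10.0 * alpha)

print(f"early window rate x MTTF: {early * mttf:.2f}")
print(f"late window rate x MTTF:  {late * mttf:.2f}")
```

In the early window the apparent rate is a small fraction of its long-run value, while in the late window the product of rate and MTTF is close to 1, which is the constant apparent failure rate expected from renewal theory.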
Additionally, since MTTF is often used instead of characteristic life, their relationship
should be understood. The ratio of alpha/MTTF is a function of beta and is given in
Table 7.4-3.
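For a two-parameter Weibull distribution, MTTF = alpha × Γ(1 + 1/beta), so the ratio alpha/MTTF depends only on beta. The sketch below evaluates that standard relationship for a few illustrative beta values; it is offered as a cross-check and is not a reproduction of the table.

```python
import math


def alpha_over_mttf(beta):
    """Ratio of Weibull characteristic life (alpha) to MTTF for shape beta."""
    return 1.0 / math.gamma(1.0 + 1.0 / beta)


for beta in (0.5, 1.0, 2.0, 3.0):
    print(f"beta = {beta:3.1f}  alpha/MTTF = {alpha_over_mttf(beta):.4f}")
```

For beta = 1 (the exponential case) the ratio is exactly 1, so alpha and MTTF coincide.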
Based on the previous discussion, it is apparent that the time period over which data is
collected is very important. For example, if the data is collected from “time zero” to a
time which is a fraction of alpha, the failure rate will be increasing over that period and
the average failure rate will be much less than the asymptotic value. If, however, the data
are collected after the failure rate has reached its asymptote, the apparent failure rate
will be constant, with a value of 1/MTTF, which is approximately 1/alpha for typical
values of beta (see Table 7.4-3). The detailed data
section in NPRD presents part populations which provide the user the ability to further
analyze the time logged to an individual part or assembly, and to estimate the
characteristic life. For example, the detailed section presents the population and the total
number of operating hours for each data record. Dividing the part operating hours by the
population yields the average number of operating hours for the system/equipment in
which the part/assembly was operating. An entry for a commercial quality mercury
battery in a ground, fixed (GF) environment indicates that a population of 328 batteries
had experienced a total of 0.8528 million part hours of operation. This indicates that
each battery had experienced an average of 0.0026 million hours of operation in the time
period over which the data was collected. If a shape parameter, beta, of the Weibull
distribution is known for a particular part/assembly, the user can use this data to
extrapolate the average failure rate presented in NPRD to a Weibull characteristic life
(alpha). If the percentage of the population that has failed is relatively low, the
methodology is of limited value. If a significant percentage of the population has failed,
the methodology will yield results in which the user can have a higher degree of
confidence. The methodology presented is
useful only in cases where TTF characteristics are needed. In many instances, knowledge
of the part characteristic life is of limited value if the logistics demand is the concern.
This data can, however, be used to estimate characteristic life in support of preventive
maintenance efforts. The assumptions in the use of this methodology are:
1. Data were collected from "time zero" of the part/assembly field usage
2. The Weibull distribution is valid and β is known
Table 7.4-4 contains cumulative percent failure as a function of the Weibull beta shape
parameter and the time/characteristic life ratio (t/α). The percent failure from the NPRD
detailed data section can be converted to a (t/alpha) ratio using the data in Table 7.4-4.
Once this ratio is determined, a characteristic life can be determined by dividing the
average operating hours per part (part hours/population) by the (t/alpha) ratio. It should
be noted here that the percentage failures in the table can be greater than 100, since parts
are replaced upon failure and there can be an unlimited number of replacements for any
given part.
As an example, consider the NPRD detailed data for “Electrical Motors, Sensor”;
Military Quality Grade; Airborne, Uninhabited (AU) environment; and a Population Size
of 960 units. Assume for this data entry that there were 359 failures in 0.7890 million
part-operating hours. The data may be converted to a characteristic life in the following
manner:
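1. Determine the percent failure: 359 failures / 960 parts = 37.4%
2. Determine a typical Weibull shape parameter (β). For motors, a typical beta
value is 3.0 (Reference 5).
3. Convert the percent failure to a t/alpha ratio using Table 7.4-4 (for % fail = 37.4
and β = 3):
\[ \frac{t}{\alpha} \cong 0.65 \quad \text{(interpolating between 31 and 42)} \]
4. Determine the average operating hours per part: 0.7890 / 960 = 0.00082 million hours
5. Calculate α:
\[ \alpha = \frac{\left( \dfrac{\text{Part Hours}}{\text{Population Count}} \right)}{\left( \dfrac{t}{\alpha} \right)} = \frac{0.00082}{0.65} = 0.00126 \text{ million hours} \]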
Based on this data, an approximate Weibull characteristic life is 1260 hours. The user of
this methodology is cautioned that this is a very approximate method for determining the
characteristic life of an item when TTF data is not available. It should also be noted that
for small values of time (i.e., t < 0.1 alpha), random failures can predominate, effectively
masking wearout characteristics and rendering the methodology inaccurate.
Additionally, for small operating times relative to α, the results are dependent on the
extreme tail of the distribution, thus significantly decreasing the confidence in the derived
alpha value.
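The arithmetic of the motor example can be scripted as follows. This is a minimal sketch: the t/alpha ratio of 0.65 is taken directly from the text as the value read from Table 7.4-4 and is not recomputed here, and the function name is illustrative.

```python
def characteristic_life(part_hours, population, t_over_alpha):
    """Approximate Weibull characteristic life from summary field data."""
    average_hours_per_part = part_hours / population
    return average_hours_per_part / t_over_alpha


failures = 359
population = 960
part_hours = 0.7890e6  # total part-operating hours

percent_failed = 100.0 * failures / population
t_over_alpha = 0.65  # table lookup for about 37.4% failed and beta = 3 (from the text)

alpha_hours = characteristic_life(part_hours, population, t_over_alpha)
print(f"percent failed: {percent_failed:.1f}%")
print(f"characteristic life: about {alpha_hours:.0f} hours")
```

The result agrees with the approximately 1260 hours quoted above; the small difference comes from rounding the intermediate value to 0.00082 million hours.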
For part types exhibiting wearout characteristics, the failure rate presented represents an
average failure rate over the time period in which the data was collected. It should also
be noted that for complex nonelectronic devices or assemblies, the exponential
distribution is a reasonable assumption. The user of this data should also be aware of
how data on cyclic devices such as circuit breakers is presented in NPRD. Ideally, these
devices should have failure rates presented in terms of failures per operating cycle.
Unfortunately, from the field data collected, the number of actuations is rarely known
and, therefore, the listed failure rates are presented in terms of failures per operating hour
for the equipment in which the part is used.
Section 1: Introduction
Section 2: Part Summaries
Section 3: Part Details
Section 4: Data Sources
Section 5: Part Number/Mil Number Index
Section 6: National Stock Number Index with Federal Stock Class Prefix
Section 7: National Stock Number Index without Federal Stock Class Prefix
Section 8: Part Description Index
Part Description: Description of the part, including the major family of parts and specific
part-type breakdown within the part family. The RIAC does not distinguish parts from
assemblies within NPRD. Information is presented on parts/assemblies at the indenture level
at which it was available. The description of each item for which data exists is made as
clear as possible so that the user can choose a failure rate on the most similar part or
assembly. The parts/assemblies for which data is presented can be comprised of several part
types, or they can be a constituent part of a larger assembly. In general, however, data on
the part type listed first in the data table is representative of the part type listed and
not of the higher level of assembly. For example, a listing for “Stator, Motor” represents
failure experience on the stator portion of the motor and not the entire motor assembly.
Added descriptors to the right, separated by commas, provide further details on the part
type listed first. Additional detailed part/assembly characteristics can be found, if
available, in the Part Details section of NPRD.

App. Env.: The Application Environment describes the conditions of field operation. See
Table 7.4-6 for a detailed list of the application environments and their descriptions.
These environments are consistent with MIL-HDBK-217. In some cases, environments more
generic than those used in MIL-HDBK-217 are used. For example: "A" indicates the part was
used in an Airborne environment, but the precise location and aircraft type was not known.
Additionally, some environments are more specific than the current version of MIL-HDBK-217,
since the current version has merged many of the environment categories and the NPRD data
was originally categorized into the more specific environment. Environments preceded by the
term "NO" are indicative of components used in a non-operating product or system in the
specified environment.

Data Source: Source of data comprising the NPRD data entry. The source number may be used
as a reference to Section 4 of NPRD to review the specific data source description.

Failure Rate, Fails/(E6): The failure rate presented for each unique part type, environment,
quality, and source combination. It is the total number of failures divided by the total
number of life units. No letter suffix indicates that the failure rate is in failures per
million operating hours. An "M" suffix indicates the unit is failures per million miles. For
roll-up data entries (i.e., those without sources listed), the failure rate is derived using
the data merge algorithm described in this section. A failure rate preceded by a "<" is
representative of entries with no failures. The failure rate listed was calculated by using
a single failure divided by the given number of operating hours. The resulting number is a
“worst case” failure rate and the real failure rate is less than this value. All failure
rates are presented in NPRD in a fixed format of four decimal places after the decimal
point. The user is cautioned that the presented data has inherently high variability and
that four decimal places does not imply any level of precision or accuracy.

Total Failed: The total number of failures observed in the merged data records.

Op. Hours/Miles (E6): The total number of operating life unit (in millions) observed in
merged data records. Absence of a suffix indicates operating hours is the life unit and "M"
indicates that miles is the life unit.

Detail Page: The page number containing the detail data source description which comprises
the summary record.
AIA Airborne Inhabited Attack - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on high performance aircraft
such as used for ground support.
AIB Airborne Inhabited Bomber - Typical conditions in bomber compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on long mission bomber
aircraft.
AIC Airborne Inhabited Cargo - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on long mission transport
aircraft.
AIF Airborne Inhabited Fighter - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on high performance aircraft
such as fighters and interceptors.
AIT Airborne Inhabited Transport - Typical conditions in cargo compartments occupied by aircrew without
environment extremes of pressure, temperature, shock and vibration and installed on high performance aircraft
such as trainer aircraft.
ARW Airborne Rotary Wing - Equipment installed on helicopters; includes laser designators and fire control systems.
AU Airborne Uninhabited - General conditions of such areas as cargo storage areas, wing and tail installations
where extreme pressure, temperature, and vibration cycling exist.
AUA Airborne Uninhabited Attack - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on high performance aircraft such as used for ground support.
AUB Airborne Uninhabited Bomber - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on long mission bomber aircraft.
AUF Airborne Uninhabited Fighter - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on high performance aircraft such as fighters and interceptors.
AUT Airborne Uninhabited Transport - Bomb bay, equipment bay, tail, or where extreme pressure, vibration, and
temperature cycling may be aggravated by contamination from oil, hydraulic fluid and engine exhaust.
Installed on high performance aircraft such as used for trainer aircraft.
DOR Dormant - Component or equipment is connected to a system in the normal operational configuration and
experiences non-operational and/or periodic operational stresses and environmental stresses. The system may
be in a dormant state for prolonged periods before being used in a mission.
GB Ground Benign - Non-mobile, laboratory environment readily accessible to maintenance; includes laboratory
instruments and test equipment, medical electronic equipment, business and scientific computer complexes.
GBC GBC refers to a commercial application of a commercial part.
GF Ground Fixed - Conditions less than ideal such as installation in permanent racks with adequate cooling air and
possible installation in unheated buildings; includes permanent installation of air traffic control, radar and
communications facilities.
GM Ground Mobile - Equipment installed on wheeled or tracked vehicles; includes tactical missile ground support
equipment, mobile communication equipment, tactical fire direction systems.
ML Missile Launch - Severe conditions related to missile launch (air and ground), and space vehicle boost into
orbit, vehicle re-entry and landing by parachute. Conditions may also apply to rocket propulsion powered
flight.
MP Manpack - Portable electronic equipment being manually transported while in operation; includes portable field
communications equipment and laser designations and rangefinders.
N Naval - The most generalized normal fleet operation aboard a surface vessel.
NS Naval Sheltered - Sheltered or below deck conditions, protected from weather; includes surface ship
communication, computer, and sonar equipment.
NSB Naval Submarine - Equipment installed in submarines; includes navigation and launch control systems.
NU Naval Unsheltered - Nonprotected surface shipborne equipment exposed to weather conditions; includes most
mounted equipment and missile/projectile fire control equipment.
N/R Not Reported - Data source did not report application environment.
SF Spaceflight - Earth orbital. Approaches benign ground conditions. Vehicle neither under powered flight nor in
atmosphere re-entry; includes satellites and shuttles.
Data records are also merged and presented at each level of part description (categorized
from most generic to most specific). The data entries with no source listed represent
these merged records. Merging data becomes a particular problem due to the wide
dispersion in failure rates, and because many data points consist of only survival data in
which no failures occurred, thus making it impossible to derive a failure rate. Several
approaches were considered in defining an optimum data merge routine. These options
are summarized as follows:
1. Summing all failures and dividing by the sum of all hours. The advantages of
this methodology are its simplicity and the fact that all observed operating
hours are accounted for. The primary disadvantage is that it does not weigh
outlier data points less than those clustering about a mean value. This can
cause a single failure rate to dominate the resulting value.
3. Deriving the arithmetic mean of all observed failure rates which are from data
records with failures, and modifying the resulting value in accordance with the
percentage of operating hours associated with the zero failure records.
Advantages of this method are that modifying the mean in accordance with
the percentage of operating hours from survival data will ensure that all
observed part hours are accounted for, regardless of whether they have
experienced failures. Disadvantages are that the arithmetic mean does not
apply less weight to those data points substantially beyond the mean and,
therefore, a single data point could dominate the calculated failure rate.
4. Using a mean failure rate by taking the lower 60% confidence level (Chi-
square) for zero failure data records and combining them with failure rates
from failure records. The disadvantages of this methodology are that the 60%
lower confidence limit can be a pessimistic approximation of the failure rate,
especially in the case where there are few observed part hours of operation;
and an arithmetic mean failure rate of these values (combined with the failure
rates from failure records) could yield a failure rate which is dominated by a
single failure rate, which itself may be based on a zero failure data point. The
use of a geometric mean would alleviate some of this effect. The problem
with the pessimistic nature of using the confidence level, however, will
remain.
5. Deriving the geometric mean of all the failure rates associated with records
having failures and multiplying the derived failure rate by the proportion:
(observed hours with failures) / (total observed hours)
Option 5 was selected for NPRD, since it is the only one that (1) accounts for all
operating hours and (2) applies less weighting to the outliers. The resulting algorithm
used to merge data within NPRD is:
\[ \lambda_{merged} = \left( \prod_{i=1}^{n'} \lambda_i \right)^{\frac{1}{n'}} \cdot \left( \frac{\sum_{i=1}^{n'} h'_i}{\sum_{i=1}^{n} h_i} \right) \]
where,
\( \prod_{i=1}^{n'} \lambda_i \) = The product of failure rates from NPRD Section 2 records with failures*
\( \sum_{i=1}^{n'} h'_i \) = The sum of hours from NPRD Section 2 records with failures*
\( \sum_{i=1}^{n} h_i \) = The sum of hours from NPRD Section 2 records
In NPRD Section 2, part descriptions with "(Summary)" following the part name
comprise a merge of all data related to the generic part listed. An example of the NPRD
summary section is given in Figure 7.4-2.
To illustrate how the data was rolled up, consider the entries for linear mechanical
actuators. The failure rate of 41.7293 listed for "Actuator, Mechanical, Linear" is a roll-
up of three individual data entries for which there are sources listed (two for commercial
quality, AUC environment and one for unknown quality in an Airborne environment).
The listing of 5.5413 for "Actuator, Mechanical" is a roll-up of four individual data
entries (two for Mil/AIF, one for Unk/AUT, and one for Unk/GM). Using the algorithm
described previously, the roll-up was calculated as follows:
\[ \lambda_{summary} = \left[ (5.110)(33.6241) \right]^{\frac{1}{2}} \left[ \frac{0.1957 + 0.0595}{0.1957 + 0.0595 + 0.0830 + 0.2655} \right] = 5.5413 \]
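The merge algorithm can be sketched in a few lines of code. This is an illustrative implementation of the formula as described in this section, not RIAC's own software; the function name and the record layout (a failure rate, or None for a zero-failure record, paired with operating hours in millions) are assumptions made for the example. The all-survival branch follows the convention stated in this section of assuming a single failure.

```python
import math


def merge_failure_rates(records):
    """Merge (failure_rate, hours) records.

    failure_rate is None for zero-failure (survival-only) records.
    Returns (merged_rate, is_upper_bound); is_upper_bound is True when no
    record had failures and the rate should be shown with a "<" prefix.
    """
    with_failures = [(rate, hours) for rate, hours in records if rate is not None]
    total_hours = sum(hours for _, hours in records)

    if not with_failures:
        # Survival data only: assume one failure over all observed hours.
        return 1.0 / total_hours, True

    # Geometric mean of the failure rates from records with failures.
    log_mean = sum(math.log(rate) for rate, _ in with_failures) / len(with_failures)
    geometric_mean = math.exp(log_mean)

    # Scale by the share of operating hours that came from those records.
    failure_hours = sum(hours for _, hours in with_failures)
    return geometric_mean * failure_hours / total_hours, False


# "Actuator, Mechanical" example: two records with failures, two without.
actuator_records = [
    (5.110, 0.1957),
    (33.6241, 0.0595),
    (None, 0.0830),
    (None, 0.2655),
]
merged_rate, is_upper_bound = merge_failure_rates(actuator_records)
print(f"merged failure rate: {merged_rate:.4f} failures per million hours")
```

Working in logarithms avoids overflow or underflow when many failure rates are multiplied together. The result matches the listed value of 5.5413 to within the rounding of the input data.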
Now consider the entry for "Actuator, Mechanical (Summary)". This listing is a roll-up
of all "Actuator, Mechanical" data (in this case Actuator, Mechanical and Actuator,
Mechanical, Linear) using the algorithm described previously. In other words, the failure
rate of 25.8092 is a summary of failure data from seven individual data sources. For
these "(Summary)" data entries, sources are not listed since they represent a merge of one
or more data sources which are presented below the summary level. Roll-up values are
presented for each specific quality level and application environment for all components
having multiple part type entries at the same indenture level. If there is no summary
record indicated for a particular part type, the listed part description represents the lowest
level of indenture available. For example, the listing for "Actuator, Mechanical,"
although being identical to the generic level for which the summary data is presented,
was the most detailed description available for the particular data entry. More detailed
part level information may be available in NPRD Section 3. Each failure rate record
listed in the NPRD summary section is a merge of all detailed data from Section 3 for a
specific part type, quality, environment and unique data source. Each of these failure rate
records refers to a Section 3 page which contains all detailed records, including part
details, when they were known. Roll-ups are performed at every combination of part
description (down to 4 levels), quality level, and application environment. The data
points being merged in the NPRD summary section include only those records for which
a data source is listed. These individual data points were already combined by summing
part hours and failures (associated with the detailed records) for each unique data source.
Roll-ups performed on only zero-failure data records are accomplished simply by
summing the total operating hours, calculating a failure rate by assuming one failure, and
denoting the resulting worst case failure rate with a "<" (“less than”) sign.
The roll-ups were performed in this manner to give the NPRD user maximum flexibility
in choosing data on the most specific part type possible. For example, if the user needs
data on a part type which is not specified in detail or for conditions for which data does
not exist in this document, the user can choose data on a more generic part type or
summary condition for which there is data.
The user is cautioned that individual data points from the detailed section may be of
limited value relative to the merged summary data in NPRD Section 2, which combines
records from several sources and typically results in many more part hours. Under no
circumstance should the NPRD detailed data or summary data be used to blindly
cherry-pick the most favorable or “optimistic” failure rate for a particular part or
assembly type.
NPRD Section 3 contains a listing of all field experience records contained in the RIAC
part databases. The detailed data section presents individual data records that are
representative of the specific part types used in a particular application from a single data
source. For example, if 20 relays of the same type were used in a specific military
system, for which there were 300 systems in service, each with 1300 hours of operation
over the time in which the data was collected, the part population is 20x300 = 6000, and
the total part operating hours are 6000x1300 = 7,800,000 hours. If the same part is used
in another system, or if the system is used in different operating environments, or if the
information came from a different source, then separate NPRD data records were
generated. If known, the population size is given for each data record as the last element
in the “Part Characteristics” field. An example of NPRD Section 3 is shown in Figure
7.4-3.
7.4.3.5. Section 6 “National Stock Number Index with Federal Stock Class”
This NPRD section provides an index of those Section 3 data entries that contain a
National Stock Number (NSN), including the four digit Federal Stock Class (FSC) prefix.
This index contains all parts for which the NSN is known.
7.4.3.6. Section 7 "National Stock Number Index without Federal Stock Class Prefix"
This NPRD section provides an index similar to the Section 6 index, with the exception
that the four-digit FSC is omitted.
7.5. References
1. RADC-TR-88-97, “RELIABILITY PREDICTION MODELS FOR DISCRETE
SEMICONDUCTOR DEVICES”, Final Technical Report, 1988
2. Denson, W.K. and S. Keene, “A New System Reliability Assessment
Methodology”, Final Report, 1998
3. “Photonic Component and Subsystem Reliability Process Final Report”,
Subcontract 0044-SC-20100-0203, Prepared for Penn State University Electro-
Optics Center, September 25, 2008
4. “Nonelectronic Parts Reliability Data (NPRD)”, Reliability Information Analysis
Center
5. RADC-TR-77-408, “Electric Motor Reliability Model”
6. MIL-HDBK-344A, “Environmental Stress Screening of Electronic Equipment”,
August 1993
8.1. Introduction
In order to “build” reliability into a product or system, it is necessary to anticipate failure
causes, and ensure that they are eliminated or, at least, that their probability of occurring
is made acceptably low. This “anticipation” can be accomplished empirically through
test, or analytically through analysis and modeling. Failure Mode and Effects Analysis
(FMEA) is a structured way of identifying root cause failure modes, and is the backbone
of an effective reliability program, particularly as it relates to reliability growth during the
design and development phase.
A successful product or system depends on the requirements being fully understood, the
design being robust, and the manufacturing process also being robust. A Design FMEA
(DFMEA) assesses the first two of these, and a Process FMEA (PFMEA) assesses the
third. This is illustrated in Figure 8.1-1.
[Figure 8.1-1: DFMEA and PFMEA]
Generally, the best manner in which to perform the FMEA is to separate the design and
process attributes and perform separate process and design FMEAs. However, in some
cases, these can essentially be combined into a single design FMEA by incorporating the
manufacturing process-related failure modes into the DFMEA “failure cause/mechanism”
column. The circumstances when this is appropriate are generally those when the item
under analysis is not complex from both a design and manufacturing perspective. This
book primarily addresses the DFMEA, since a reliability model is generally driven more
by the design than the process. However, process variables are often used as factors in
the reliability model.
An FMEA is the cornerstone of a reliability program, having many uses. The primary
purpose of the FMEA is to acquire an understanding of the reliability characteristics of a
product or system, such that corrective action can be taken to make the item more reliable
(reliability growth). The results of an FMEA are also used to support other reliability
engineering tasks, such as test plan development, the evaluation of engineering changes,
the assessment of detectability, the preparation of troubleshooting manuals, and the
development of reliability models.
The logical, bottom-up analysis technique of the FMEA facilitates the understanding of
the reliability characteristics of a product or system. This understanding is a core
requirement for the attainment of the reliability objectives, and, as such, it will help
reduce the total program cost. While reliability engineering tasks are sometimes
considered to be costly to a program, the reality is that they will save significant amounts
of money, if properly implemented. Costs incurred when reliability problems are
identified in the field will be orders of magnitude higher than the upfront cost of the
reliability engineering tasks that solve them during design and development. Since the
success of a reliability program depends largely on the effectiveness of FMEA,
implementation of the FMEA is a critical element of the cost avoidance of field failures
(see footnote 11).
• The assurance that all conceivable root failure causes and their effects have been
considered in the early stages of the product or system design and development
process, and that corrective actions are taken to mitigate the risk associated with
critical failure modes.
• If elements such as accelerating stresses are included in the FMEA analysis, it can
be used to develop reliability growth, demonstration and screening test plans, as
well as environmental qualification test plans. In this case, the importance of
each potential accelerating stress can be quantified and prioritized in accordance
with the severity, criticality or failure rate of the individual failure modes
accelerated by the specific stress. For example, if temperature is determined to
accelerate the majority of critical failure modes, then it should be used as a stress
in reliability and qualification testing.
• If and when reliability problems occur after a product or system is delivered to the
customer, the FMEA can be used as a basis for determining the root cause of
failure. Given the failure symptoms, the possible causes can be identified from
the FMEA that was performed.
• It can be used as a basis for the reliability model, in which the reliability of each
high risk failure cause is quantified.
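The stress prioritization described in the second bullet above can be sketched as follows. The failure modes, criticality scores and stress names are hypothetical values invented for illustration, and summing criticality per stress is only one possible weighting; severity or failure rate could be used instead.

```python
from collections import Counter

# Hypothetical FMEA excerpt: each failure mode lists the stresses that accelerate it.
failure_modes = [
    {"mode": "Solder joint fatigue", "criticality": 8,
     "stresses": ["temperature cycling", "vibration"]},
    {"mode": "Capacitor dry-out", "criticality": 6,
     "stresses": ["temperature"]},
    {"mode": "Connector fretting corrosion", "criticality": 4,
     "stresses": ["vibration", "humidity"]},
    {"mode": "Dielectric breakdown", "criticality": 9,
     "stresses": ["temperature", "voltage"]},
]


def rank_stresses(modes):
    """Rank accelerating stresses by the total criticality of the modes they accelerate."""
    totals = Counter()
    for entry in modes:
        for stress in entry["stresses"]:
            totals[stress] += entry["criticality"]
    return totals.most_common()


stress_ranking = rank_stresses(failure_modes)
for stress, score in stress_ranking:
    print(f"{stress}: {score}")
```

With these illustrative inputs, temperature ranks first, which would argue for including it as a stress in reliability and qualification testing.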
Footnote 11: It should be noted that an FMEA is only technically effective if it has an impact on the design of the product or system. An FMEA
that does an excellent job of identifying root failure causes, but is performed “after-the-fact” so as to have no impact on the actual
design, is a waste of reliability program resources. An FMEA is only cost effective if it impacts the design of the product or system
before the design is finalized and “bending metal” has started. A poorly timed FMEA that results in extensive and costly redesign
efforts to eliminate or mitigate root failure causes is also counterproductive.
Chapter 8: The Use of FMEA in Reliability Modeling
Another benefit of the FMEA is that it can be used as a basis for evaluating the risk
associated with engineering changes. If a design change is proposed, the FMEA can be
consulted to determine if the change will result in new failure modes or an increase in the
probability of failure of identified modes. Based on this information, the change can be
accepted, or additional reliability characterization can be performed to further assess the
reliability impact of the proposed change.
Detectability can also be assessed by the FMEA. This is particularly useful in instances
where failures that are undetectable are of special importance to the project. An example
of this is when alarms are used as a means to detect failures. Some failure modes may
not result in an alarm, and, therefore, the criticality associated with the failure mode can
be high. In this case, the FMEA can be used to assess these failure modes.
Troubleshooting manuals are essentially an FMEA that is presented in reverse order. The
FMEA is generally presented in the order of functional elements, or components. If the
FMEA is sorted by the effect of failure (or symptom), it essentially becomes a
troubleshooting aid, since the analyst can review the specific failure modes that will
result in the observed symptom. Additionally, if the probability of failure is included in
the FMEA analysis, the possible failure modes or causes can be ranked in accordance
with their probability. This can aid in the troubleshooting process.
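The re-sorting described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the row layout, the item names and the probability values are all hypothetical assumptions chosen only to show the grouping and ranking.

```python
from collections import defaultdict

# Hypothetical FMEA rows: (item, failure mode, effect/symptom, probability of failure).
# All names and values are illustrative assumptions, not data from this guide.
fmea_rows = [
    ("Power supply", "No output", "Unit does not power on", 0.020),
    ("Fuse", "Open", "Unit does not power on", 0.005),
    ("Display driver", "Intermittent output", "Display flickers", 0.010),
    ("Connector", "High resistance", "Display flickers", 0.030),
]

def troubleshooting_table(rows):
    """Group FMEA rows by effect and rank candidate failure modes by probability."""
    by_effect = defaultdict(list)
    for item, mode, effect, prob in rows:
        by_effect[effect].append((prob, item, mode))
    # Most probable failure mode first within each symptom.
    return {effect: sorted(cands, reverse=True) for effect, cands in by_effect.items()}

table = troubleshooting_table(fmea_rows)
for effect, candidates in table.items():
    print(effect)
    for prob, item, mode in candidates:
        print(f"  {item}: {mode} (p={prob})")
```

Sorting on the effect column rather than the item column is the only change relative to the usual FMEA presentation; the underlying data are the same.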
8.2. Definitions
FMEA refers to a generic analysis methodology and, while there are industry standards
that define the specifics of the analysis, there are many different ways in which the
analysis can be accomplished. The following list of terms and definitions summarizes
the data elements that an FMEA typically includes as columns in the FMEA
worksheet template, presented in the order in which they usually appear.
Additional functions may also be required, depending on the specific organization and
nature of the product. These additional functions can include component engineering,
procurement, measurements, and marketing. There are also instances where an FMEA
might include the direct involvement of the customer, particularly for critical or highly
complex products or systems.
Not all of the above listed functions are required for every part of the FMEA. For
instance, the initial parts of the FMEA can efficiently be performed by only the
Reliability engineering and the Applications engineering (or the Project Manager)
functions. After this, engagement by the entire team is critical, especially to gain “buy-
in” on corrective actions, which can be the responsibility of any of the disciplines.
The ideal team size is 5 to 8 people. Any larger, and the efficiency of the analysis is
compromised. It is also more efficient to break up the analysis into distinct functional
elements of the design, i.e. mechanical, electrical, optical, software/firmware, etc.,
although it is also imperative to account for failure causes that are due to interactions of
these functional elements. The FMEA facilitator needs to ensure that these interactions
are accounted for, since the individuals cognizant of their functional element will often
overlook these interactions.
• Document the results of analysis (it is also beneficial to have a separate “scribe”
that documents the results, and allows the facilitator to concentrate on the
additional items listed below)
• Keep the group focused on the task at hand
• Ensure that all components or processes are accounted for
• Prompt the group for participation, as required
• Spark the discussion by suggesting failure modes
• Ensure that the analysis is kept moving
• Ensure that the inputs of all participants are heard and captured. This includes
making sure that certain people are not allowed to dominate the analysis, and that
the ideas of quiet people are brought out.
• Manage conflicts – Professional people take a great deal of pride in their work.
Since the FMEA goal is to find fault with the product or system, FMEA sessions
can sometimes get contentious. The facilitator must manage this by keeping the
session constructive and not allow emotions to dictate the course of the analysis.
The facilitator is often from the reliability group, but does not have to be. It is more
important that the facilitator be skilled in the responsibilities listed above.
100 Sherman Rd., Suite C101, Utica, NY 13502-1348 PH: 877.363.RIAC
8.3.4. Implementation
Some suggestions for implementing an effective FMEA are listed here:
Next, the analysis determines what happens if the failure mode were to occur. These are
the effects, and are determined at the various levels of the assembly architecture
(progressing from low to high), such as the surrounding components, the sub-assembly,
the assembly and the entire product or system. The specific levels used in the analysis
are very item-specific, in that the more complex the product or system is, the more
levels may be required for analysis. If the product or system is relatively simple,
only one level may be required.
The next step is the determination of the possible corrective actions. This will be
discussed later in this chapter.
There are many ways in which an FMEA can be performed. This section outlines one
approach that has been successful based on the experience of the author. The process
flow is illustrated in Figure 8.4-1.
Each of these steps is summarized below, along with guidance and tips on performing the
step.
The complexity of the system will dictate the number of hierarchical levels. Items
comprised of a single component will have only a single level. More complex products
or systems can have from four, up to eight, levels.
It may be necessary to treat the “system” as the customer’s system, since the effects of
failure will be manifested at that level. Whether this is necessary also depends on whether the
FMEA is being driven by customer requests.
these potential modes. Ideally, the capacitor manufacturer will have performed an FMEA
on their product, in which case specific failure causes have already been identified and,
hopefully, mitigated.
I: Intermittent
P: Partial
O: Over
U: Unintended
N: Negative
D: Degraded
The loss function used in the Taguchi methodology can be used to determine some of the
applicable failure modes. Here, functions or attributes are categorized as “larger the
better”, “nominal the best” and “smaller the better”. For a “larger the better”
function/attribute, a failure can occur when there is too little of the function/attribute, but
not when there is too much of it. This is illustrated in Table 8.7-1. It relates only
to the “over” function and the “partial” function IPOUND categories. The other
IPOUND categories are used when appropriate, and will generally be independent of the
Taguchi categories.
The IPOUND categories are intended to represent a complete and mutually exclusive set
of the ways in which a function or attribute can fail. When identifying failure modes in
this manner, it is helpful to set up a matrix of functions/attributes and the IPOUND
categories, and proceed to fill it in. When filling in this matrix, it is not necessary to
identify failure modes for all categories of IPOUND. Likewise, for any single category,
multiple failure modes are possible. The IPOUND methodology is simply a way to get
the team to think about all possible ways in which a function/attribute can fail.
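One way to set up the function/attribute versus IPOUND matrix is sketched below. The category names come from the list above; the functions and the failure modes entered are hypothetical examples, and the dictionary layout is just one convenient assumption.

```python
# IPOUND categories as listed in the text.
IPOUND = ["Intermittent", "Partial", "Over", "Unintended", "Negative", "Degraded"]

def empty_matrix(functions):
    """One row per function/attribute, one (initially empty) list per category."""
    return {f: {c: [] for c in IPOUND} for f in functions}

# Hypothetical functions and failure modes, for illustration only.
matrix = empty_matrix(["Provide 5 V output", "Dissipate heat"])
matrix["Provide 5 V output"]["Partial"].append("Output voltage below 4.75 V")
matrix["Provide 5 V output"]["Over"].append("Output voltage above 5.25 V")
# A single category may hold more than one failure mode.
matrix["Provide 5 V output"]["Intermittent"].append("Output drops out under vibration")
matrix["Provide 5 V output"]["Intermittent"].append("Output drops out at cold start")

def failure_modes(m):
    """Flatten the matrix into (function, category, failure mode) rows."""
    return [(f, c, mode) for f, cats in m.items() for c, modes in cats.items() for mode in modes]

for row in failure_modes(matrix):
    print(row)
```

Cells left empty are acceptable, consistent with the point that not every category needs a failure mode for every function/attribute.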
The flow diagram in Figure 8.7-1 depicts a simple system consisting of a two-level
hierarchy. In practice, there can be any number of levels in the system hierarchy. The
failure cause-mode-effect relationship shifts in the FMEA as a function of the system
level, as illustrated in Figure 8.7-1. For example, at the most basic level, the part
manufacturing process, the cause of failure may be a process step that is out of control.
The ultimate effect of that cause becomes the failure mode at the part level. The failure
effect of the part becomes the failure mode at the next level of assembly, and so forth. It
is very important that the failure cause, mode and effect are not confounded in the analysis.
The failure modes of the system functions/attributes are the effects of failure modes at the
subordinate hierarchical level. This tiering continues as the system is broken down to the
lowest level that the analysis will take place.
For simple, single-level products (for example, a component made from a monolithic
material) there is only a single level and, therefore, this is not an issue. Also, for
relatively simple products with two levels, a “local effects” column can be added to
capture the effects of the failure mode on the subassembly function. In this case, the
effects are relative to the functional requirements of the subassembly.
When identifying failure modes, the assumption is made that the failure could occur but
may not necessarily occur.
In the identification of severity, effects of failure modes that are potentially safety-related
are usually considered to be the most severe. For these, a severity value of 9 or 10 is
used, regardless of the above listed factors.
The “when occurs” dimension of severity pertains to the life cycle phase in which the
failure mode and its effect occur. If the failure mode occurs in the R&D phase, it is
either because the design or process is not capable, or because intrinsic or extrinsic
failure causes occur. In either case, this is the best phase to identify these, since they can
be corrected in the most cost-effective manner possible.
If the failure mode occurs “in process”, the effect is essentially a yield reduction.
Failures occurring during inspections or quality checks by the customer are similar, but
they occur at the customer’s site and are, therefore, more severe than when the defects are
caught in-house.
Failures occurring in deployment represent the most severe type of failure effect (with the
possible exception of safety-related failures). These failures can be represented by the
three types of failures in the bathtub curve: infant mortality, random and wearout.
Usually, the severity of an effect is treated as one factor in the FMEA. However,
separating the severity into three factors and subsequent columns can be beneficial. For
example, if the “when occurs” dimension of severity is separated, the failure modes that
can be caught “in process” are identified, and this, in turn, can be used to establish in-
process checks and screening protocols.
If they are separated, any convenient numeric scale can be used, including 1-to-10, 1-to-
3, or others. If 1-to-10 is used, the RPN of a failure cause will range from 1 to 1,000.
If these dimensions are not separated (which will usually be the case), each of the three
should be represented in the criteria used to define the severity levels. One way in which
this can be accomplished is to use the guidelines in Table 8.8-2, in which each dimension
is assumed to have a value between 1 and 3, directly proportional to its severity. The
total severity is then the sum of each of the three values.
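The summation can be illustrated with a short sketch. It assumes only what is stated above: three dimensions, each scored from 1 to 3, added together. The function name and the example scores are hypothetical, and the scoring criteria of Table 8.8-2 are not reproduced here.

```python
def total_severity(dimension_scores):
    """Sum three severity dimension scores, each rated 1 (least) to 3 (most severe)."""
    scores = list(dimension_scores)
    if len(scores) != 3:
        raise ValueError("expected exactly three dimension scores")
    for s in scores:
        if s not in (1, 2, 3):
            raise ValueError("each dimension score must be 1, 2 or 3")
    return sum(scores)

# Hypothetical ratings for one failure effect.
print(total_severity((3, 1, 2)))  # 3 + 1 + 2 = 6
```

Under this scheme the combined severity ranges from 3 to 9, so it does not span the full 1-to-10 scale.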
required. It is only required that someone knowledgeable in the system and part
functional and attribute requirements be involved.
There should be one severity rating for each failure effect, since severity has a direct
1:1 relation to the effect. The maximum of the severities associated with a
failure mode is the severity used in the RPN calculation, since the RPN is applicable to
the cause. Here, the failure mode can result in several effects, but can also be initiated by
several causes. Therefore, a single cause can result in several effects, the worst of which
should be used in the RPN calculation. The relationship between failure cause, mode and
effect is illustrated in Figure 8.10-1.
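The worst-case severity rule can be sketched as follows, taking the RPN as the product of the occurrence, severity and detectability ratings as defined in this chapter. The effect names and all ratings are hypothetical values on 1-to-10 scales.

```python
def rpn_for_cause(occurrence, detectability, effect_severities):
    """RPN of a cause, using the worst (maximum) severity among its effects."""
    worst = max(effect_severities)
    return occurrence * worst * detectability

# Hypothetical effects of one failure mode and their severity ratings.
effects = {"Loss of output": 8, "Reduced efficiency": 4, "Audible noise": 2}

rpn = rpn_for_cause(occurrence=3, detectability=5, effect_severities=effects.values())
print(rpn)  # 3 * 8 * 5 = 120
```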
• “Failures resulting from operational stress” refers to failures resulting from the
inability of a product or system to tolerate the applied stresses to which the
component, item or material within the item is exposed.
• “Tolerance stack up” refers to the initial tolerance at time zero, and the failure
of a product or system to tolerate the cumulative effect of those tolerances.
Failures can also be a result of short-term exposure to extreme stresses. While the
product or system is not designed to tolerate these stresses under steady-state conditions,
it should be able to tolerate short-term extreme stress exposure. There is a limit to the
stress level(s) that the product or system should be able to tolerate. However, design
actions can be taken to minimize the probability of failure due to these stresses.
The information presented here is generic in nature and applies equally to mechanical and
electronics failures. The specific failure mechanisms will vary, but the concepts are the
same.
Failure causes are often the result of a combination of conditions and events. Therefore,
when identifying causes, the analyst needs to consider these combinations. The factors
whose combinations can cause failure generically include:
When hypothesizing failure causes, it is useful to think about them in terms of their initial
conditions, stresses, and failure mechanisms, as illustrated in Figure 8.10-2.
A list of typical initial conditions, stresses, and failure mechanisms is provided below.
• Initial conditions:
o Defect free (the item is made “as designed”)
o Defects:
Intrinsic:
• Voids
• Material property variation
• Geometry variation
• Contamination
• Ionic contamination
• Crystal defects
• Stress concentrations
Extrinsic:
• Organic contamination
• Nonconductive particles
• Conductive particles
• Contamination
• Ionic contamination
o Stresses:
Operation - steady state
Operation - cycling
Chemical exposure
Salt fog
Mechanical shock
UV exposure
Drop
Vibration
Temperature-high
Temperature-low
Temperature cycling
Damp heat
Pressure - low
Pressure - high
Radiation - EMI
Radiation - cosmic
Sand and dust
o Failure mechanisms (physical process):
Electromigration
Dielectric breakdown
Corrosion
Dendritic growth
Tin whiskers
Metal fatigue
Stress corrosion cracking
Melting
Creep
Warping
Brinelling
Fracture
Fretting fatigue
Galvanic corrosion
Pitting corrosion
Chemical attack
Fretting corrosion
Spalling
Crazing
Abrasive wear
Adhesive wear
Surface fatigue
Erosive wear
Cavitation pitting
Stress corrosion cracking
Elastic deformation
Material migration
Oxidation
Cracking
Plastic deformation
Elastic deformation
Brittle fracture
Expansion
Contraction
Emod change
Outgassing
Index of refraction changes
Photodarkening
Condensation
Crystallization
Each failure cause can be characterized with a specific combination of initial condition,
stress and degradation process. For example, a cause could be represented as:
Defect1-temperature–corrosion
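As an aid to brainstorming, candidate causes can be enumerated as combinations of the three lists. The sketch below uses short, arbitrarily chosen subsets of the lists above; it only generates candidates, and judging which combinations are physically credible remains a task for the FMEA team.

```python
from itertools import product

# Short subsets of the lists above, chosen arbitrarily for illustration.
initial_conditions = ["Defect free", "Voids", "Ionic contamination"]
stresses = ["Temperature-high", "Damp heat"]
mechanisms = ["Corrosion", "Dendritic growth"]

# Each candidate cause is one (initial condition, stress, mechanism) combination.
candidates = [
    f"{condition} - {stress} - {mechanism}"
    for condition, stress, mechanism in product(initial_conditions, stresses, mechanisms)
]

for cause in candidates:
    print(cause)
```

With full lists the number of combinations grows quickly, so in practice the enumeration would be filtered before it is reviewed.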
After identifying all of the FMEA elements in accordance with the guidelines presented
herein, it is very useful to check the completeness of the analysis by hypothesizing what
would happen if the product or system:
In these cases, you are filling in the FMEA backwards by essentially hypothesizing the
cause a priori and then determining what the resulting failure mode would be. For
example, the cause identified in this manner will result in a failure mode, which in turn
will have an effect at the next higher level in the system.
8.11.2. Occurrence
Ideally, a reliability model would be available from which to determine the occurrence,
but this is usually impractical due to the fact that the FMEA is generally performed
before the reliability modeling activities commence.
The occurrence should be based on engineering judgment and on empirical data. The
resulting occurrence value is based on both, as illustrated in Figure 8.11-2. For example,
if empirical information exists on a specific cause, it should be used as part of the
assessment of the Occurrence level. In this case, heavier weighting should be given to
field data over manufacturing and test data. If no empirical data exists, engineering
judgment should be used, and should be based on the collective experience of the FMEA
team.
The occurrence should be based on the likelihood that the cause will occur and that the
resulting mode will occur. Some FMEA methodologies, like the cancelled MIL-STD-1629,
include a separate factor that accounts for the probability that the effect will occur
if the mode occurs (the same concept can be used for the cause-mode relationship).
[Figure 8.11-2. Vertical axis: failure rate estimate based on experience, from low (1) to high (10). Horizontal axis: how often the failure cause/mechanism has been observed in the past, from “not at all” to “frequently” (heavier weighting should be given to field data over manufacturing and test data).]
The frequencies of occurrence should be rated relative to the required reliability for a
specific failure cause. For example, if a reliability allocation is performed to allocate the
product or system failure rate (or unreliability) to its constituent components, then the
occurrence value should be relative to this allocated value.
8.11.3. Preventions
Preventions are the actions taken to prevent the cause/mechanism of failure or the failure
mode from occurring, or to reduce their rate of occurrence. These will generally be
design-related actions. Examples include: “Ensured proper derating for all components”
or “Use of a conformal coating”.
8.11.4. Detections
Detections are actions taken to detect the cause/mechanism of failure. This can be via
either test or analysis.
8.11.5. Detectability
Detectability is a value between 1 and 10 that is inversely proportional to the degree to
which the failure cause can be detected, i.e., the less likely the detection of the failure
cause, the higher the detectability value. The traditional detectability definitions are
listed in Figure 8.11-3.
There are four aspects of detection that should be captured in the FMEA:
Screening
This aspect of detectability addresses the question: “What will be done in the
manufacturing process to detect and eliminate the items prone to the failure
cause/mechanism?” Reliability screening is a common technique for accomplishing this.
If screening is to be employed, the screening effectiveness must be determined. This
screening effectiveness is directly related to the “Probability of detection if the failure
cause/mechanism occurs”.
Degree of Warning
The fourth aspect of detectability relates to how detectable a failure cause/mechanism is
before it results in the worst case effect identified.
The life cycle phases to which each of these four dimensions is applicable are illustrated
in Figure 8.11-4.
x L x L 10
L H L H 8
x L x H 7
H H L L 5
L H H L 5
H H L H 2
L H H H 2
H H H H 1
RPN = O*S*D
where:
O= Probability of occurrence
S= Severity
D= Detectability
C = λβα
where:
C = Criticality
λ = Failure rate
β = Failure effect probability
α = Failure mode ratio
The failure rate is the rate of occurrence of failure, expressed in failures per million
cumulative operating hours, or in FITs (failures per billion operating hours). The failure
effect probability is the conditional probability that, if the failure mode occurs, the
severity level identified in the FMEA will be the result. The failure mode ratio is the
fraction of the failure rate that can be attributed to the specific failure mode under
analysis. In other words, the failure mode ratios for all failure modes of an item
sum to 1.0.
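A short sketch of the criticality calculation follows. It assumes the conventional symbol assignment, with β as the failure effect probability and α as the failure mode ratio; the item failure rate, the failure modes and their values are hypothetical.

```python
def mode_criticality(failure_rate, effect_probability, mode_ratio):
    """C = lambda * beta * alpha for a single failure mode."""
    return failure_rate * effect_probability * mode_ratio

# Hypothetical item failure rate, in failures per million operating hours.
item_failure_rate = 2.0

# Hypothetical failure modes: mode -> (alpha = failure mode ratio,
#                                      beta = failure effect probability)
modes = {
    "Short": (0.5, 1.0),
    "Open": (0.3, 0.5),
    "Drift": (0.2, 0.1),
}

# The failure mode ratios of one item should sum to 1.0.
assert abs(sum(alpha for alpha, _ in modes.values()) - 1.0) < 1e-9

criticality = {
    mode: mode_criticality(item_failure_rate, beta, alpha)
    for mode, (alpha, beta) in modes.items()
}
for mode, c in criticality.items():
    print(f"{mode}: C = {c:.2f}")
```

The resulting criticality carries the units of the failure rate, which is one reason it is a quantitative measure while the RPN is not.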
The same logic applies to the RPN methodology, in that the occurrence rating (O) is the
product of “the probability of the failure cause occurring” times “the probability that the
failure cause will result in the identified effect”.
Since severity is not included in this calculation, failure modes are usually sorted by
criticality for each severity level. This is done since a true measure of criticality must
include the severity of the failure mode.
The RPN methodology is the most commonly used in many industries. However,
in some cases, the criticality metric is more applicable. Such cases occur when the
system under analysis is complex, or when quantitative failure rate estimates are
available. These failure rate estimates are generally derived from reliability modeling, as
summarized in this book.
An issue that needs to be addressed is the identification of a critical RPN value above
which corrective action should take place. The RPN is a qualitative measure of risk and,
therefore, there is not a single value. Usually, the number of failure causes that can be
addressed with corrective actions will be determined by the availability of resources and
the criticality or severity of failure. Some organizations state to their suppliers that RPNs
of 40, or 50, or greater shall be addressed. However, this value is somewhat arbitrary.
Also, in many cases, it is required that all failure causes with high severity be addressed,
regardless of their occurrence or detectability.
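The two selection rules described above (an RPN cutoff, plus a rule that high-severity causes are always addressed) can be combined as in the sketch below. The cutoff of 50 is one of the example values mentioned above, and the severity cutoff of 9 is an assumption based on the safety-related severity values discussed earlier; neither is a recommended value, and the causes and ratings are hypothetical.

```python
def needs_action(cause, rpn_threshold=50, severity_threshold=9):
    """Flag a cause if its RPN meets the cutoff or its severity is high."""
    rpn = cause["O"] * cause["S"] * cause["D"]
    return rpn >= rpn_threshold or cause["S"] >= severity_threshold

# Hypothetical failure causes with occurrence (O), severity (S), detectability (D).
causes = [
    {"name": "Solder joint fatigue", "O": 4, "S": 6, "D": 5},   # RPN 120
    {"name": "Connector corrosion", "O": 2, "S": 4, "D": 3},    # RPN 24
    {"name": "Insulation breakdown", "O": 1, "S": 10, "D": 2},  # RPN 20, high severity
]

selected = [c["name"] for c in causes if needs_action(c)]
print(selected)
```

The third cause is selected despite its low RPN, which is the situation the severity rule is intended to catch.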
The other factor that determines the failure causes that are to be addressed is the Pareto
ranking of the RPNs. In other words, in some cases, a well-defined number of
causes accounts for most of the total risk to the system. This situation becomes evident in the
Pareto analysis of RPNs.
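A Pareto ranking of RPNs can be sketched as follows. The cause labels and RPN values are hypothetical. Because the RPN is a qualitative measure, the cumulative fraction computed here is only a rough guide to where the risk is concentrated, not a quantitative share of risk.

```python
def pareto(rpns):
    """Return (cause, RPN, cumulative fraction of summed RPN), largest RPN first."""
    total = sum(rpns.values())
    ranked = sorted(rpns.items(), key=lambda kv: kv[1], reverse=True)
    rows, running = [], 0
    for cause, value in ranked:
        running += value
        rows.append((cause, value, running / total))
    return rows

# Hypothetical RPN values for five failure causes.
rpns = {"C": 40, "A": 200, "E": 16, "B": 120, "D": 24}

rows = pareto(rpns)
for cause, value, cumulative in rows:
    print(f"{cause}: RPN={value}, cumulative={cumulative:.0%}")
```

In this example the two highest-ranked causes account for 80% of the summed RPN, which is the kind of concentration that makes a natural cutoff evident.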
• First, the design can be modified such that the effect of failure is minimized,
thus effectively lowering the severity level of the failure effect. Options for
this include the addition of redundant elements or fault tolerance, the selection
of better materials, and/or the use of more robust components.
• The second general option is to reduce the likelihood of the failure mode
occurring in the first place. This often can be achieved by the use of more
robust components. This robustness can be achieved with components of
higher quality levels or the ability to handle high stress levels. Another option
for reducing this likelihood is the control of environmental stresses and
reducing the stress to which the component is exposed.
• The third general corrective action is to improve detectability. Many products
will have failure modes for which the only viable corrective action is to make
the failure mode detectable. A common example of this is digital circuitry. If
redundancy is not an option, and higher quality components are not available,
the failure mode can be made detectable through the use of alarms or built-in-
test (BIT) features.
[Figure: Corrective action options: fault tolerance, better materials, more robust components, reduce stress, control environment.]
• Severity Reduction:
o Add redundancy
o Add a fail-safe feature
o Use personal protection equipment (for safety critical items)
• Occurrence Reduction:
o “Design out” the cause
o Reduce the rate of occurrence
• Detection Improvement:
o Implement alarm features
o Implement screening tests
o Design more relevant tests to detect the failure cause
o Develop better characterization methods
1. Characteristics (or “Whats”) – These are the high level characteristics of the
product that need to be achieved for the product’s customer to be satisfied. The
lack of these characteristics is synonymous with failure modes of the product,
which in turn are failure effects of failure modes at the next lower level of the
product hierarchy.
2. Importance (or “Ranking of Needs”) – The QFD will generally include a rating
of the importance of the characteristic (#1). The severity of the failure effects will
then be proportional to this importance. The dimensions of this importance should
include the dimensions of severity as described previously.
3. Measures (or “Hows”) – The “measures” in the QFD pertain to the
characteristics of the items comprising the design. These measures can
consist of a hierarchical listing of the items and their critical characteristics or
functions. The manner in which these characteristics can fail can be identified
with an IPOUND analysis previously discussed, and become the failure modes in
the FMEA.
4. Relationships – The relationships matrix in the QFD identifies whether the measure is
related to the characteristic and, if so, whether it is a strong, medium or weak
relationship. These relationships essentially identify the effects (i.e. the negative
of the characteristic) that will occur if the failure mode (i.e. a failure mode of the
measure) occurs.
If the FMEA elements are obtained from the QFD in this manner, the first five columns
in the FMEA can be populated directly, as shown in Figure 8.15-2. These columns are
identified in bold in the above descriptions.
8.16. References
1. SAE J1739 (R) Potential Failure Mode and Effects Analysis in Design (Design
FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and
Assembly Processes (Process FMEA)
9. Concluding Remarks
Reliability modeling has been used successfully as a reliability engineering tool for many
years. It is only one element of a well-structured reliability program and, to be effective,
it must be integrated into a complete reliability program. This book has reviewed options
an analyst has for developing a reliability model of a product or system, and has provided
guidance on applying the appropriate methodology based on the specific needs and
constraints of the analyst.
The premise of the holistic approach described in this book is that the reliability model is
a living model that needs to be continuously updated throughout the program
development and deployment phases. This approach to modeling consists of predictions,
assessments and estimation. Each of these is performed at specific points in the
development cycle and has different purposes and approaches. Reliability predictions are
performed very early, before there is any empirical data on the item under analysis.
Reliability assessments are made to determine the effects of certain factors on reliability,
and to identify and study specific failure causes. Reliability estimates are made based on
empirical data, and encompass all three elements.
A critical theme of this book has been that the purpose of a reliability model must be
clearly defined, and then an appropriate methodology should be chosen. Each model
need must be fully defined in terms of the customers being served (their roles,
educational background, requirements, etc.), the constraints placed upon that customer
(including legal and contractual, as well as technical and engineering), and the purpose of
the model (what decisions are being supported and in what manner).
8. Use multiple modeling techniques, and work toward the goal of having them
reasonably agree with each other. In this manner, confidence in the results will be
greater.
9. Continuously update the model based on data that is obtained throughout all
program phases of the product or system life cycle.
10. Identify and use available reliability software tools. These tools have become
very cost effective and are readily available, making it easy to apply techniques
that were impractical several decades ago.
It is hoped that this book has provided the reader with a knowledge of approaches, tools,
and interpretations that will allow a better understanding of the usefulness and limitations
of various reliability modeling techniques. Given its stochastic nature, reliability
modeling is part science and part art, and there are many ways to approach it. But, if the
analyst keeps the goals in mind and uses common sense, there is a high probability that
the model will be successful in achieving its objectives.