
FAILURE TRENDS IN A LARGE DISK DRIVE POPULATION


Paper: Eduardo Pinheiro, Wolf-Dietrich Weber, Luiz André Barroso
Slides: Tim Disney for CMPS 229 - Spring 2010

Thursday, April 15, 2010


In 2002, an estimated 90% of all new information was stored on magnetic media.



Problem: Insufficient studies of disk failure

Manufacturers
• Accelerated life test extrapolation
• Returned units

Field studies
• Small populations
• Too little monitoring



Solution: Google it

Population:
• > 100,000 drives
• From 2001
• 5400 to 7200 rpm
• 80 to 400 GB

Collection:
• S.M.A.R.T.
• Environment (temp)
• Activity Level

Figure 1: Collection, storage, and analysis architecture.

From the paper: "The information collected includes environmental factors (such as temperatures), activity levels and many of the Self-Monitoring Analysis and Reporting Technology (SMART) parameters that are believed to be good indicators of disk drive health. We mine through these data and attempt to find evidence that corroborates or contradicts many of the commonly held beliefs about how various factors can affect disk drive lifetime. Our paper is unique in that it is based on data from a disk population size that is typically only available from vendor warranty databases, but has the depth of deployment visibility and detailed lifetime follow-up that only an end-user study can provide. Our key findings are:
• Contrary to previously reported results, we found very little correlation between failure rates and either elevated temperature or activity levels.
• Some SMART parameters (scan errors, reallocation counts, offline reallocation counts, and probational counts) have a large impact on failure probability.
• Given the lack of occurrence of predictive SMART signals on a large fraction of failed drives, it is unlikely that an accurate predictive failure model can be built based on these signals alone."

From Section 2 (Background), on the collection infrastructure: "It is imperative that this daemon's resource usage be very light, so not to interfere with the applications. One way to assure this is to have the machine-level collector poll individual machines relatively infrequently (every few minutes). Other slower changing data (such as configuration information) and data from other existing databases can be collected even less frequently than that."
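A minimal sketch of that kind of lightweight, infrequent SMART poller, assuming smartmontools' smartctl is available; the parsing, device path, and interval are illustrative assumptions, not Google's actual collector:

import re
import subprocess
import time

POLL_SECONDS = 300  # "every few minutes", per the paper

def read_smart_attributes(device="/dev/sda"):
    """Parse `smartctl -A` output into {attribute_name: raw_value}."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    attrs = {}
    for line in out.splitlines():
        # SMART attribute rows begin with a numeric attribute ID.
        m = re.match(r"\s*(\d+)\s+(\S+)\s+.*\s(\S+)$", line)
        if m:
            attrs[m.group(2)] = m.group(3)  # name -> raw value
    return attrs

if __name__ == "__main__":
    while True:  # stay light: sleep between polls, do no heavy work
        print(time.strftime("%F %T"), read_smart_attributes())
        time.sleep(POLL_SECONDS)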
Data

Failure:
• Some drives fail in the field but test as good afterwards
• A drive counts as "failed" if it was replaced as part of a repairs procedure

Filtering:
• Clean up bad data
• Some drives reported being hotter than the sun
• Filtering reduced the sample set by less than 0.1%
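A minimal sketch of this kind of sanity filter, with assumed record fields and an assumed plausible temperature range (the paper does not give its exact bounds):

def plausible(record):
    """Keep only records with believable temperature readings."""
    return 0 <= record["temp_c"] <= 100  # assumed valid range, in Celsius

records = [
    {"drive": "a", "temp_c": 38},
    {"drive": "b", "temp_c": 5_800_000},  # spurious: "hotter than the sun"
]
clean = [r for r in records if plausible(r)]
print(f"kept {len(clean)} of {len(records)} records "
      f"({1 - len(clean) / len(records):.1%} dropped)")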



Say it with Charts!

Results

AFR: the percentage of drives that fail in a year
• The 3-month, 6-month, and 1-year groups overlap
• Drive models are mixed within each group

Figure 2: Annualized failure rates broken down by age groups.

From the paper: "We now analyze the failure behavior of our fleet of disk drives using detailed monitoring data collected over a nine-month observation window. During this time we recorded failure events as well as all the available environmental and activity data and most of the SMART parameters from the drives themselves. Failure information spanning a much longer interval (approximately five years) was also mined from an older repairs database. The results presented here were tested for their statistical significance using the appropriate tests."

On baseline failure rates: "Figure 2 presents the average Annualized Failure Rates (AFR) for all drives in our study, aged zero to 5 years, as derived from our older repairs database. … The higher baseline AFR for 3 and 4 year old drives is more strongly influenced by the underlying reliability of the particular models in that vintage than by disk drive aging effects. It is interesting to note that our 3-month, 6-months and 1-year data points do seem to indicate a noticeable influence of infant mortality phenomena, with 1-year AFR dropping significantly from …"
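The paper derives AFR from its repairs database; the sketch below shows one plausible form of the computation (failures annualized over accumulated drive-days), with made-up numbers:

def annualized_failure_rate(drive_days, failures):
    """AFR as a percentage: failures per drive-year of observation."""
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years

# e.g., 1,000 drives each observed for a 9-month (~274-day) window,
# with 17 failures recorded in that time:
print(f"AFR = {annualized_failure_rate(1000 * 274, 17):.1f}%")  # ~2.3%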
Utilization
• Weekly averages of read/write bandwidth per drive
• Categorized per drive model: low, medium, high
• Expected: high correlation to failure

Figure 3: Utilization AFR.

From the paper: "It is difficult for us to arrive at a meaningful numerical utilization metric given that our measurements do not provide enough detail to derive what 100% utilization might be for any given disk model. We choose instead to measure utilization in terms of weekly averages of read/write bandwidth per drive. We categorize utilization in three levels: low, medium and high, corresponding respectively to the lowest 25th percentile, 50-75th percentiles and top 75th percentile. This categorization is performed for each drive model, since the maximum bandwidths have significant variability across drive families. We note that using number of I/O operations and bytes transferred as utilization metrics provide very similar results. Figure 3 shows the impact of utilization on AFR across the different age groups.

Overall, we expected to notice a very strong and consistent correlation between high utilization and higher failure rates. However our results appear to paint a more complex picture. First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization …"
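A sketch of the per-model bucketing described above, on assumed data; quartile cut points stand in for the paper's 25th/75th percentile boundaries:

from collections import defaultdict
from statistics import quantiles

# (model, weekly average read/write bandwidth in arbitrary units)
drives = [
    ("m1", 10), ("m1", 40), ("m1", 55), ("m1", 90),
    ("m2", 200), ("m2", 350), ("m2", 500), ("m2", 900),
]

by_model = defaultdict(list)
for model, bw in drives:
    by_model[model].append(bw)

def bucket(model, bw):
    """Label a drive's utilization relative to its own model."""
    q1, _, q3 = quantiles(by_model[model], n=4)  # 25th/50th/75th percentiles
    if bw <= q1:
        return "low"
    return "high" if bw >= q3 else "medium"

for model, bw in drives:
    print(model, bw, bucket(model, bw))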
Temperature
• High temp often thought to be an important factor in drive failure
• Correlation with failure seen only at the very low and very high ends

From the paper: "Temperature is often quoted as the most important environmental factor affecting disk drive reliability. Previous studies have indicated that temperature deltas as low as 15C can nearly double disk drive failure rates [4]. Here we take temperature readings from the SMART records every few minutes during the entire 9-month window …"

Figure 4: Distribution of average temperatures and failure rates.




Temperature
• More pronounced difference for very old drives

Figure 5: AFR for average drive temperature.

The paper confirms previously reported temperature effects only for the high end of its temperature range, and especially for older drives: "In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates. This is a fairly surprising result, which could indicate that data-center or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives."
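A sketch of how a view like Figure 5 could be assembled: bucket drives by average temperature, then compute AFR per bucket. Bucket edges, field layout, and the sample data are assumptions for illustration:

def afr_by_temp(drives, edges=(25, 30, 35, 40, 45)):
    """drives: (avg_temp_c, observed_drive_days, failures). Returns {bucket: AFR%}."""
    totals = {}
    for temp, days, failed in drives:
        # Label each drive with the first temperature edge it falls under.
        label = next((f"<{e}C" for e in edges if temp < e), f">={edges[-1]}C")
        d, f = totals.get(label, (0, 0))
        totals[label] = (d + days, f + failed)
    return {k: 100.0 * f / (d / 365.0) for k, (d, f) in totals.items()}

sample = [(22, 270, 0), (28, 270, 0), (33, 270, 1), (47, 270, 1)]
print(afr_by_temp(sample))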
SMART - Scan Errors
• Found by drives scanning the disk surface in the background

Figure 6: AFR for scan errors.



SMART - Scan Errors

Figure 6: AFR for scan errors.
Figure 7: AFR for reallocation counts.

Figure 8: Impact of scan errors on survival probability. Left figure shows aggregate survival probability for all drives after first scan error. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of scan errors.

From the paper: "The critical threshold analysis confirms what the charts visually imply: the critical threshold for scan errors is one. After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors."

On reallocation counts (Section 3.5.2): "… groups (Figure 7), even if slightly less pronounced. Drives with one or more reallocations do fail more often than those with none. The average impact on AFR appears to be between a factor of 3-6x. Figure 11 shows the survival probability after the first reallocation. We truncate the graph to 8.5 months, due to a drastic decrease in the confidence levels after that …"
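A sketch of the relative-risk comparison behind the 39x scan-error figure quoted above, on toy cohorts (the paper's actual survival analysis is more careful about observation windows and confidence intervals):

def p_fail_within_60(group):
    """group: booleans, True if the drive failed within 60 days."""
    return sum(group) / len(group)

# Toy cohorts; the paper reports a 39x ratio on its real data.
after_first_scan_error = [True, True, True] + [False] * 7
no_scan_error = [True] + [False] * 99

ratio = p_fail_within_60(after_first_scan_error) / p_fail_within_60(no_scan_error)
print(f"{ratio:.0f}x more likely to fail within 60 days")  # 30x on toy data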
SMART - Reallocation Counts
• Number of times a bad sector has been remapped

Figure 7: AFR for reallocation counts.



SMART - Reallocation Counts

Figure 9: AFR for offline reallocation count.
Figure 10: AFR for probational count.

Figure 11: Impact of reallocation count values on survival probability. Left figure shows aggregate survival probability for all drives after first reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of reallocations.

From the paper (Section 3.5.3, Offline Reallocations): "Offline reallocations are defined as a subset of the reallocation counts studied previously, in which only reallocated sectors found during background scrubbing are counted. In other words, it should exclude sectors that are reallocated as a result of errors found during actual I/O operations. Although this definition mostly holds, …"

"… points were not within high confidence intervals). Drives in the older age groups appear to be more highly affected by it, although we are unable to attribute this effect to age given the different model mixes in the various age groups. After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days than drives without offline reallocations; an effect that is …"
SMART - Offline Reallocation Counts
• Reallocated sectors found only during background scrubbing

Figure 9: AFR for offline reallocation count.



SMART - Offline Reallocation Counts

Figure 12: Impact of offline reallocation on survival probability. Left figure shows aggregate survival probability for all drives after first offline reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of offline reallocations.



SMART - Probational Count
• Suspicious sectors that haven't yet been reallocated

Figure 9: AFR for offline reallocation count.
Figure 10: AFR for probational count.



SMART - Probational Count

Figure 13: Impact of probational count values on survival probability. Left figure shows aggregate survival probability for all
drives after first probational count. Middle figure breaks down survival probability per drive ages in months. Right figure breaks
down drives by their number of probational counts.

From the paper: "… therefore, can be seen as a softer error indication. It could provide earlier warning of possible problems but might also be a weaker signal, in that sectors on probation may indeed never be reallocated. About 2% of our drives had non-zero probational count values. We note that this number is lower than both online and offline reallocation counts, likely indicating that sectors …"

Section 3.5.5 (Miscellaneous Signals): "In addition to the SMART parameters described in the previous sections, which we have found to most closely impact failure rates, we have also studied several other parameters from the SMART set as well as other environmental factors. Here we briefly mention our relevant …"
SMART - Others

Seek Errors
• Fails to track to a sector
• Only seen in a single drive manufacturer

Calibration Retries
• Authors couldn't find a consistent definition
• Weakly tied to failure rates

CRC Errors
• Errors between the physical media and the interface
• Some correlation with failure, but not much

Spin Retries
• Number of retries when the disk spins up
• No counts observed in the population

Power Cycles
• Number of times the drive is powered on/off
• Correlated with failure only in older drives

Power-on Hours
• In this population, drive age is basically the same



Predictive power
• Hoped to form a predictive failure model from SMART signals

Failed Drives:
• 44% with SMART errors
• 56% without SMART errors
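A sketch of how a breakdown like this could be computed, with made-up drive records and stand-in names for the four strong signals (scan errors, reallocations, offline reallocations, probational counts):

STRONG_SIGNALS = ("scan_errors", "realloc", "offline_realloc", "probational")

failed_drives = [  # made-up records, for failed drives only
    {"scan_errors": 2, "realloc": 0, "offline_realloc": 0, "probational": 0},
    {"scan_errors": 0, "realloc": 5, "offline_realloc": 1, "probational": 0},
    {"scan_errors": 0, "realloc": 0, "offline_realloc": 0, "probational": 0},
    {"scan_errors": 0, "realloc": 0, "offline_realloc": 0, "probational": 0},
]

with_signal = sum(any(d[s] > 0 for s in STRONG_SIGNALS) for d in failed_drives)
share = with_signal / len(failed_drives)
print(f"{share:.0%} with SMART errors, {1 - share:.0%} without")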



Predictive power

Figure 14: Percentage of failed drives with SMART errors.
So
• Huge study of disk failure
• Temperature and activity levels less important than previously believed
• Several SMART parameters correlate strongly with failure
• But SMART data alone won't support accurate failure prediction

