
FAILURE TRENDS IN A LARGE DISK DRIVE POPULATION


Paper: Eduardo Pinheiro, Wolf-Dietrich Weber, Luiz André Barroso
Slides: Tim Disney for CMPS 229 - Spring 2010

Thursday, April 15, 2010


In 2002, an estimated 90% of all new information was stored on magnetic media.



Problem: Insufficient studies of disk failure

Manufacturers
• Accelerated life test extrapolation
• Returned units

Field studies
• Small populations
• Too little monitoring



Solution: Google it

Population:
• > 100,000 drives
• From 2001
• 5400 to 7200 rpm
• 80 to 400 GB

Collection:
• S.M.A.R.T.
• Environment (temp)
• Activity Level

Figure 1: Collection, storage, and analysis architecture.

From the paper: "The information collected includes environmental factors (such as temperatures), activity levels and many of the Self-Monitoring Analysis and Reporting Technology (SMART) parameters that are believed to be good indicators of disk drive health. We mine through these data and attempt to find evidence that corroborates or contradicts many of the commonly held beliefs about how various factors can affect disk drive lifetime. Our paper is unique in that it is based on data from a disk population size that is typically only available from vendor warranty databases, but has the depth of deployment visibility and detailed lifetime follow-up that only an end-user study can provide. Our key findings are:
• Contrary to previously reported results, we found very little correlation between failure rates and either elevated temperature or activity levels.
• Some SMART parameters (scan errors, reallocation counts, offline reallocation counts, and probational counts) have a large impact on failure probability.
• Given the lack of occurrence of predictive SMART signals on a large fraction of failed drives, it is unlikely that an accurate predictive failure model can be built based on these signals alone."

From Section 2 (Background), on the collection infrastructure: "It is imperative that this daemon's resource usage be very light, so not to interfere with the applications. One way to assure this is to have the machine-level collector poll individual machines relatively infrequently (every few minutes). Other slower changing data (such as configuration information) and data from other existing databases can be collected even less frequently than that."
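A minimal sketch of that kind of lightweight, infrequent SMART poller, assuming smartmontools' smartctl is available; the parsing, device path, and interval are illustrative assumptions, not Google's actual collector:

import re
import subprocess
import time

POLL_SECONDS = 300  # "every few minutes", per the paper

def read_smart_attributes(device="/dev/sda"):
    """Parse `smartctl -A` output into {attribute_name: raw_value}."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    attrs = {}
    for line in out.splitlines():
        # SMART attribute rows begin with a numeric attribute ID.
        m = re.match(r"\s*(\d+)\s+(\S+)\s+.*\s(\S+)$", line)
        if m:
            attrs[m.group(2)] = m.group(3)  # name -> raw value
    return attrs

if __name__ == "__main__":
    while True:  # stay light: sleep between polls, do no heavy work
        print(time.strftime("%F %T"), read_smart_attributes())
        time.sleep(POLL_SECONDS)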
Data

Failure:
• Some drives fail in the field but test as good afterwards
• A drive counts as "failed" if it was replaced as part of a repairs procedure

Filtering:
• Clean up bad data
• Some drives reported being hotter than the sun
• Filtering reduced the sample set by less than 0.1%
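A minimal sketch of this kind of sanity filter, with assumed record fields and an assumed plausible temperature range (the paper does not give its exact bounds):

def plausible(record):
    """Keep only records with believable temperature readings."""
    return 0 <= record["temp_c"] <= 100  # assumed valid range, in Celsius

records = [
    {"drive": "a", "temp_c": 38},
    {"drive": "b", "temp_c": 5_800_000},  # spurious: "hotter than the sun"
]
clean = [r for r in records if plausible(r)]
print(f"kept {len(clean)} of {len(records)} records "
      f"({1 - len(clean) / len(records):.1%} dropped)")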



Say it with Charts!

Results

AFR: the percentage of drives that fail in a year
• The 3-month, 6-month, and 1-year groups overlap
• Drive models are mixed within each group

Figure 2: Annualized failure rates broken down by age groups.

From the paper: "We now analyze the failure behavior of our fleet of disk drives using detailed monitoring data collected over a nine-month observation window. During this time we recorded failure events as well as all the available environmental and activity data and most of the SMART parameters from the drives themselves. Failure information spanning a much longer interval (approximately five years) was also mined from an older repairs database. The results presented here were tested for their statistical significance using the appropriate tests."

On baseline failure rates: "Figure 2 presents the average Annualized Failure Rates (AFR) for all drives in our study, aged zero to 5 years, as derived from our older repairs database. … The higher baseline AFR for 3 and 4 year old drives is more strongly influenced by the underlying reliability of the particular models in that vintage than by disk drive aging effects. It is interesting to note that our 3-month, 6-months and 1-year data points do seem to indicate a noticeable influence of infant mortality phenomena, with 1-year AFR dropping significantly from …"
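The paper derives AFR from its repairs database; the sketch below shows one plausible form of the computation (failures annualized over accumulated drive-days), with made-up numbers:

def annualized_failure_rate(drive_days, failures):
    """AFR as a percentage: failures per drive-year of observation."""
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years

# e.g., 1,000 drives each observed for a 9-month (~274-day) window,
# with 17 failures recorded in that time:
print(f"AFR = {annualized_failure_rate(1000 * 274, 17):.1f}%")  # ~2.3%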
Utilization
• Weekly averages of read/write bandwidth per drive
• Categorized per drive model: low, medium, high
• Expected: high correlation to failure

Figure 3: Utilization AFR.

From the paper: "It is difficult for us to arrive at a meaningful numerical utilization metric given that our measurements do not provide enough detail to derive what 100% utilization might be for any given disk model. We choose instead to measure utilization in terms of weekly averages of read/write bandwidth per drive. We categorize utilization in three levels: low, medium and high, corresponding respectively to the lowest 25th percentile, 50-75th percentiles and top 75th percentile. This categorization is performed for each drive model, since the maximum bandwidths have significant variability across drive families. We note that using number of I/O operations and bytes transferred as utilization metrics provide very similar results. Figure 3 shows the impact of utilization on AFR across the different age groups.

Overall, we expected to notice a very strong and consistent correlation between high utilization and higher failure rates. However our results appear to paint a more complex picture. First, only very young and very old age groups appear to show the expected behavior. After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization …"
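A sketch of the per-model bucketing described above, on assumed data; quartile cut points stand in for the paper's 25th/75th percentile boundaries:

from collections import defaultdict
from statistics import quantiles

# (model, weekly average read/write bandwidth in arbitrary units)
drives = [
    ("m1", 10), ("m1", 40), ("m1", 55), ("m1", 90),
    ("m2", 200), ("m2", 350), ("m2", 500), ("m2", 900),
]

by_model = defaultdict(list)
for model, bw in drives:
    by_model[model].append(bw)

def bucket(model, bw):
    """Label a drive's utilization relative to its own model."""
    q1, _, q3 = quantiles(by_model[model], n=4)  # 25th/50th/75th percentiles
    if bw <= q1:
        return "low"
    return "high" if bw >= q3 else "medium"

for model, bw in drives:
    print(model, bw, bucket(model, bw))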
Temperature
• High temp often thought to be an important factor in drive failure
• Correlation with failure seen only at the very low and very high ends

From the paper: "Temperature is often quoted as the most important environmental factor affecting disk drive reliability. Previous studies have indicated that temperature deltas as low as 15C can nearly double disk drive failure rates [4]. Here we take temperature readings from the SMART records every few minutes during the entire 9-month window …"

Figure 4: Distribution of average temperatures and failure rates.




Temperature
• More pronounced difference for very old drives

Figure 5: AFR for average drive temperature.

The paper confirms previously reported temperature effects only for the high end of its temperature range, and especially for older drives: "In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates. This is a fairly surprising result, which could indicate that data-center or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives."
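A sketch of how a view like Figure 5 could be assembled: bucket drives by average temperature, then compute AFR per bucket. Bucket edges, field layout, and the sample data are assumptions for illustration:

def afr_by_temp(drives, edges=(25, 30, 35, 40, 45)):
    """drives: (avg_temp_c, observed_drive_days, failures). Returns {bucket: AFR%}."""
    totals = {}
    for temp, days, failed in drives:
        # Label each drive with the first temperature edge it falls under.
        label = next((f"<{e}C" for e in edges if temp < e), f">={edges[-1]}C")
        d, f = totals.get(label, (0, 0))
        totals[label] = (d + days, f + failed)
    return {k: 100.0 * f / (d / 365.0) for k, (d, f) in totals.items()}

sample = [(22, 270, 0), (28, 270, 0), (33, 270, 1), (47, 270, 1)]
print(afr_by_temp(sample))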
SMART - Scan Errors
• Found by drives scanning the disk surface in the background

Figure 6: AFR for scan errors.



SMART - Scan Errors

Figure 6: AFR for scan errors.
Figure 7: AFR for reallocation counts.

Figure 8: Impact of scan errors on survival probability. Left figure shows aggregate survival probability for all drives after first scan error. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of scan errors.

From the paper: "The critical threshold analysis confirms what the charts visually imply: the critical threshold for scan errors is one. After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan errors."

On reallocation counts (Section 3.5.2): "… groups (Figure 7), even if slightly less pronounced. Drives with one or more reallocations do fail more often than those with none. The average impact on AFR appears to be between a factor of 3-6x. Figure 11 shows the survival probability after the first reallocation. We truncate the graph to 8.5 months, due to a drastic decrease in the confidence levels after that …"
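A sketch of the relative-risk comparison behind the 39x scan-error figure quoted above, on toy cohorts (the paper's actual survival analysis is more careful about observation windows and confidence intervals):

def p_fail_within_60(group):
    """group: booleans, True if the drive failed within 60 days."""
    return sum(group) / len(group)

# Toy cohorts; the paper reports a 39x ratio on its real data.
after_first_scan_error = [True, True, True] + [False] * 7
no_scan_error = [True] + [False] * 99

ratio = p_fail_within_60(after_first_scan_error) / p_fail_within_60(no_scan_error)
print(f"{ratio:.0f}x more likely to fail within 60 days")  # 30x on toy data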
SMART - Reallocation Counts
• Number of times a bad sector has been remapped

Figure 7: AFR for reallocation counts.



SMART - Reallocation Counts

Figure 9: AFR for offline reallocation count.
Figure 10: AFR for probational count.

Figure 11: Impact of reallocation count values on survival probability. Left figure shows aggregate survival probability for all drives after first reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of reallocations.

From the paper (Section 3.5.3, Offline Reallocations): "Offline reallocations are defined as a subset of the reallocation counts studied previously, in which only reallocated sectors found during background scrubbing are counted. In other words, it should exclude sectors that are reallocated as a result of errors found during actual I/O operations. Although this definition mostly holds, …"

"… points were not within high confidence intervals). Drives in the older age groups appear to be more highly affected by it, although we are unable to attribute this effect to age given the different model mixes in the various age groups. After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days than drives without offline reallocations; an effect that is …"
SMART - Offline Reallocation Counts
• Reallocated sectors found only during background scrubbing

Figure 9: AFR for offline reallocation count.



SMART - Offline Reallocation Counts

Figure 12: Impact of offline reallocation on survival probability. Left figure shows aggregate survival probability for all drives after first offline reallocation. Middle figure breaks down survival probability per drive ages in months. Right figure breaks down drives by their number of offline reallocations.



SMART - Probational Count
• Suspicious sectors that haven't yet been reallocated

Figure 9: AFR for offline reallocation count.
Figure 10: AFR for probational count.



SMART - Probational Count

Figure 13: Impact of probational count values on survival probability. Left figure shows aggregate survival probability for all
drives after first probational count. Middle figure breaks down survival probability per drive ages in months. Right figure breaks
down drives by their number of probational counts.

From the paper: "… therefore, can be seen as a softer error indication. It could provide earlier warning of possible problems but might also be a weaker signal, in that sectors on probation may indeed never be reallocated. About 2% of our drives had non-zero probational count values. We note that this number is lower than both online and offline reallocation counts, likely indicating that sectors …"

Section 3.5.5 (Miscellaneous Signals): "In addition to the SMART parameters described in the previous sections, which we have found to most closely impact failure rates, we have also studied several other parameters from the SMART set as well as other environmental factors. Here we briefly mention our relevant …"
SMART - Others

Seek Errors
• Fails to track to a sector
• Only seen in a single drive manufacturer

Calibration Retries
• Authors couldn't find a consistent definition
• Weakly tied to failure rates

CRC Errors
• Errors between the physical media and the interface
• Some correlation with failure, but not much

Spin Retries
• Number of retries when the disk spins up
• No counts observed in the population

Power Cycles
• Number of times the drive is powered on/off
• Correlated with failure only in older drives

Power-on Hours
• In this population, drive age is basically the same



Predictive power
• Hoped to form a predictive failure model from SMART signals

Failed Drives:
• 44% with SMART errors
• 56% without SMART errors
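A sketch of how a breakdown like this could be computed, with made-up drive records and stand-in names for the four strong signals (scan errors, reallocations, offline reallocations, probational counts):

STRONG_SIGNALS = ("scan_errors", "realloc", "offline_realloc", "probational")

failed_drives = [  # made-up records, for failed drives only
    {"scan_errors": 2, "realloc": 0, "offline_realloc": 0, "probational": 0},
    {"scan_errors": 0, "realloc": 5, "offline_realloc": 1, "probational": 0},
    {"scan_errors": 0, "realloc": 0, "offline_realloc": 0, "probational": 0},
    {"scan_errors": 0, "realloc": 0, "offline_realloc": 0, "probational": 0},
]

with_signal = sum(any(d[s] > 0 for s in STRONG_SIGNALS) for d in failed_drives)
share = with_signal / len(failed_drives)
print(f"{share:.0%} with SMART errors, {1 - share:.0%} without")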



Predictive power

Figure 14: Percentage of failed drives with SMART errors.
So
• Huge study of disk failure
• Temperature and activity levels less important than previously believed
• Several SMART parameters correlate strongly with failure
• But SMART data alone won't support accurate failure prediction

