Figure 1. A simplified logical view of a single channel of the DRAM memory subsystem on each Cielo and Hopper node.

Figure 2. DRAM device-hours per vendor on Hopper. Even for the least-populous vendor (A), we have almost 1B device-hours of data.
Sixteen of the DRAM devices are used to store data bits and two are used to store check bits. A lane is a group of DRAM devices on different ranks that shares data (DQ) signals. DRAMs in the same lane also share a strobe (DQS) signal, which is used as a source-synchronous clock signal for the data signals. A memory channel has 18 lanes, each with two ranks (i.e., one DIMM per channel). Each DRAM device contains eight internal banks that can be accessed in parallel. Logically, each bank is organized into rows and columns. Each row/column address pair identifies a 4-bit word in the DRAM device. Figure 1 shows a diagram of a single memory channel.

Physically, all DIMMs on Cielo and Hopper (from all vendors) are identical. Each DIMM is double-sided. DRAM devices are laid out in two rows of nine devices per side. There are no heatsinks on DIMMs in Cielo or Hopper.

The primary difference between the two memory subsystems is that Cielo uses chipkill-correct ECC on its memory subsystem, while Hopper uses chipkill-detect ECC.

5. Experimental Setup

For our analysis we use three different data sets: corrected error messages from console logs, uncorrected error messages from event logs, and hardware inventory logs. Together, these three logs allow us to map each error message to the specific hardware present in the system at that point in time.

Corrected error logs contain events from nodes at specific time stamps. Each node in the system has a hardware memory controller that logs corrected error events in registers provided by the x86 machine-check architecture (MCA) [5]. Each node's operating system is configured to poll the MCA registers once every few seconds and record any events it finds to the node's console log.

Uncorrected error event logs are similar and contain data on uncorrected errors logged in the system. These are typically not logged via polling, but instead logged after the node reboots as a result of the uncorrected error.

Both console and event logs contain a variety of other information, including the physical address and ECC syndrome associated with each error. These events are decoded further using configuration information to determine the physical DRAM location associated with each error. For this analysis, we decoded the location to show the DIMM, as well as the DRAM bank, column, row, and chip.

We make the assumption that each DRAM device experiences a single fault during our observation interval. The occurrence time of each DRAM fault corresponds to the time of the first observed error message per DRAM device. We then assign a specific type and mode to each fault based on the associated errors in the console log. Our observed fault rates indicate that fewer than two DRAM devices will suffer multiple faults within our observation window, and thus the error in this method is low. Prior studies have also used and validated a similar methodology (e.g., [34]).
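To make the per-device aggregation and mode assignment concrete, the following sketch shows one way the step described above could be implemented. It is illustrative only: the record fields (device, time, bank, row, col) and the mode-assignment rules are our own assumptions, not the exact decoding used in our tooling.

```python
from collections import defaultdict

def classify_faults(errors):
    """Group decoded error records by DRAM device and assign one fault
    per device, following the single-fault-per-device assumption.

    `errors` is an iterable of dicts with (assumed) keys:
    device, time, bank, row, col.
    """
    by_device = defaultdict(list)
    for e in errors:
        by_device[e["device"]].append(e)

    faults = {}
    for dev, errs in by_device.items():
        errs.sort(key=lambda e: e["time"])       # fault time = first error
        banks = {e["bank"] for e in errs}
        rows = {(e["bank"], e["row"]) for e in errs}
        cols = {(e["bank"], e["col"]) for e in errs}

        # Illustrative mode assignment: widen the mode as the
        # affected footprint grows.
        if len(rows) == 1 and len(cols) == 1:
            mode = "single-bit"
        elif len(rows) == 1:
            mode = "single-row"
        elif len(cols) == 1:
            mode = "single-column"
        elif len(banks) == 1:
            mode = "single-bank"
        else:
            mode = "multi-bank"

        faults[dev] = {"time": errs[0]["time"], "mode": mode}
    return faults
```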
Hardware inventory logs are separate logs and provide snapshots of the hardware present in each machine at different points in its lifetime. These files provide unique insight into the hardware configuration that is often lacking on other systems. In total we analyzed over 400 individual hardware inventory logs that covered a span of approximately three years on Hopper and two years on Cielo. Each of these files consists of between 800 thousand and 1.5 million lines of explicit description of each host's hardware, including configuration information and information about each DIMM such as the manufacturer and part number.

We limit our analysis to the subset of dates for which we have both hardware inventory logs and error logs. For Hopper, this includes 18 months of data from April 2011 to January 2013. For Cielo, this includes 12 months of data from July 2011 to November 2012. For both systems, we exclude the first several months of data to avoid attributing hard faults that occurred prior to the start of our dataset to the first months of our observation. In addition, for both systems, time periods where the system was not in a consistent state or was in a transition state were excluded.

For confidentiality purposes, we anonymize all DIMM vendor information. Because this is not possible for the CPU vendor, we present all CPU data in arbitrary units (i.e., all data is normalized to one of the data points in the graph). While this obscures absolute rates, it still allows relative comparisons.
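Throughout the paper, fault rates are reported in FIT: failures per billion (10^9) device-hours, with device-hour populations like those in Figure 2 as the denominator. The conversion is shown below; the numbers in the example call are hypothetical, not measured values.

```python
def fit_rate(num_faults, device_hours):
    """FIT = faults per 10^9 device-hours."""
    return num_faults * 1e9 / device_hours

# Hypothetical example: 66 faults observed over 1.2B device-hours.
print(fit_rate(66, 1.2e9))  # -> 55.0 FIT
```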
Figure 4. Fault rate per DRAM vendor. Hopper sees substantial variation in fault rates by DRAM vendor.

Figure 5. Cielo DRAM transient fault rates relative to Hopper (Hopper == 1.0). Vendor A shows a significantly higher transient fault rate in Cielo, likely attributable to altitude.

Cielo and Hopper contain DIMMs from the same DRAM vendors, share the same DDR-3 memory technology, and use the same memory controllers. A major difference between the two systems is their altitude: the neutron flux, relative to New York City, experienced by Cielo is 5.53 and the relative neutron flux experienced by Hopper is 0.91 [1]. Therefore, by comparing fault rates in Cielo and Hopper, we can determine what effect, if any, altitude has on DRAM fault rates.

Figure 5 plots the rate of transient faults on Cielo relative to the rate of transient faults on Hopper, using Cielo data from [34]. The figure shows that vendor A experiences a substantially higher transient fault rate on Cielo, while vendor B experiences a modestly higher transient fault rate and vendor C experiences virtually the same transient fault rate.

Fault Mode      Vendor   Cielo   Hopper
Single-bit      A        31.85   18.75
                B        15.61    9.70
                C         6.10    7.93
Single-column   A         6.05    0.0
                B         0.55    0.0
                C         0.18    0.0
Single-bank     A        14.65    0.0
                B         0.07    0.09
                C         0.18    0.10

Table 3. Rate of single-bit, single-column, and single-bank transient DRAM faults in Cielo and Hopper (in FIT). The higher rate of transient faults in Cielo may indicate that these faults are caused by high-energy particles.

Table 3 breaks down this data and shows the single-bit, single-column, and single-bank transient fault rates for each system by DRAM vendor. The table shows that the single-bit transient fault rates for vendors A and B are substantially higher in Cielo than in Hopper, as are the single-column and single-bank transient fault rates for vendor A. The rates of all other transient fault modes were within 1 FIT of each other on both systems. Due to the similarities between the two systems, the most likely cause of the higher single-bit, single-column, and single-bank transient fault rates in Cielo is particle strikes from high-energy neutrons.
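As a quick sanity check on the altitude explanation, the flux ratio can be compared against the observed rate ratios. The two-point decomposition below, into an altitude-independent base rate plus a flux-dependent term, is our own illustration using the vendor A numbers from Table 3, not an analysis from our study.

```python
# rate = base + k * flux  (illustrative linear two-point fit)
flux_cielo, flux_hopper = 5.53, 0.91    # relative to New York City [1]
rate_cielo, rate_hopper = 31.85, 18.75  # vendor A single-bit FIT, Table 3

k = (rate_cielo - rate_hopper) / (flux_cielo - flux_hopper)
base = rate_hopper - k * flux_hopper
print(f"flux-dependent slope: {k:.2f} FIT per unit flux")
print(f"altitude-independent base: {base:.2f} FIT")
# -> roughly 2.8 FIT per unit flux over a ~16 FIT base, consistent
#    with particle strikes explaining only part of the total rate.
```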
Our main observation from this data is that particle-induced transient faults remain a noticeable effect in DRAM devices, although the susceptibility varies by vendor and there are clearly other causes of faults (both transient and permanent) in modern DRAMs. Also, our results show that while altitude does have an impact on system reliability, judicious choice of the memory vendor can reduce this impact.

6.1.2 Multi-bit DRAM Faults Figure 6 shows an example of a single-bank fault from one Hopper node that produced bit flips in multiple adjacent DRAM rows. The figure shows that errors from this fault were limited to several columns in three logically adjacent rows. Multiple faults within our dataset matched this general pattern, although the occurrence rate was relatively low. All errors from this fault were corrected by ECC.

The pattern of errors from this fault appears to be similar to a fault mode called DRAM disturbance faults or "row-hammer" faults [19][26]. This fault mode results in corruption of a "victim row" when an "aggressor row" is opened repeatedly in a relatively short time window. We did not do root-cause analysis on this part; therefore, we cannot say for certain that the fault in question is a disturbance fault. However, it is clear that fault modes with similar characteristics do occur in practice and must be accounted for.
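Patterns like the one in Figure 6 can be screened for mechanically. Below is a minimal heuristic of our own devising, not part of the study's tooling, that flags faults whose errors cluster in a small set of logically adjacent rows of one bank; the record format and the adjacency window are assumptions.

```python
def looks_like_disturbance(errors, row_gap=4):
    """Heuristic: all errors confined to one bank and to a cluster of
    logically nearby rows (e.g., victim rows around an aggressor).

    `errors` is a list of dicts with (assumed) keys: bank, row.
    `row_gap` is the assumed maximum spacing between affected rows.
    """
    banks = {e["bank"] for e in errors}
    if len(banks) != 1:
        return False
    rows = sorted({e["row"] for e in errors})
    if len(rows) < 2:
        return False
    # Every neighboring pair of affected rows must be close together.
    return all(b - a <= row_gap for a, b in zip(rows, rows[1:]))
```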
Our primary observation from data in this section is that new and unexpected fault modes may occur after a system is architected, designed, and even deployed: row-hammer faults were only identified by industry after Hopper was designed, and potentially after it began service [26]. Prior work has looked at tailoring detection to specific fault modes (e.g., [35]), but the data show that we need to be able to handle a larger variety of fault modes and that robust detection is crucial.

6.2 SRAM Faults

In this section, we examine fault rates of SRAM in the AMD Opteron™ processors in Hopper.
Figure 6. A single-bank fault from one Hopper node.

Figure 7. Rate of SRAM faults per socket in Hopper, relative to the fault rate in the L2TLB. The fault rate is affected by structure size and organization, as well as workload effects.

Figure 8. Per-bit rate of SRAM faults in Hopper's L2 and L3 data and tag arrays, relative to accelerated testing.

6.2.1 SRAM Fault Rate Figure 7 shows the rate of SRAM faults per socket in Hopper, broken down by hardware structure. The figure makes clear that SRAM fault rates are dominated by faults in the L2 and L3 data caches, the two largest structures in the processor. However, it is clear that faults occur in many structures, including smaller structures such as tag and TLB arrays.

A key finding from this figure is that, at scale, faults occur even in small on-chip structures, and detection and correction in these structures is important to ensure correctness for high-performance computing workloads running on large-scale systems. This result is significant for system designers considering different devices on which to base their systems. Devices should either offer robust coverage on all SRAM and significant portions of sequential logic, or provide alternate means of protection such as redundant execution [36].

6.2.2 Comparison to Accelerated Testing Accelerated particle testing is used routinely to determine the sensitivity of SRAM devices to single-event upsets from energetic particles. To be comprehensive, the accelerated testing must include all particles to which the SRAM cells will be exposed in the field (e.g., neutrons, alpha particles), with energy spectra that approximate real-world conditions. Therefore, it is important to correlate accelerated testing data with field data to ensure that the conditions are in fact similar.

Accelerated testing on the SRAM used in Hopper was performed using a variety of particle sources, including the high-energy neutron beam at the LANSCE ICE House [22] at LANL and an alpha particle source. The testing was "static" testing, rather than operational testing. Static testing initializes the SRAM to a known state, exposes it to the particle beam, and then compares the results to the initial state to look for errors.
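The static-testing procedure reduces to a very small comparison harness: write a known pattern, expose the part, read back, and XOR to locate flipped bits. A schematic sketch is shown below; the byte-sequence interface stands in for real hardware access and is our own simplification.

```python
def static_test(write_pattern, read_back):
    """Compare an SRAM image read after beam exposure against the
    pattern written before exposure; return flipped bit positions.

    `write_pattern` and `read_back` are equal-length byte sequences.
    """
    flips = []
    for addr, (before, after) in enumerate(zip(write_pattern, read_back)):
        diff = before ^ after            # nonzero bits were upset
        for bit in range(8):
            if diff & (1 << bit):
                flips.append((addr, bit))
    return flips
```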
The L2 and L3 caches employ hardware scrubbers, so we expect the field error logs to capture the majority of bit flips that occur in these structures. Figure 8 compares the per-bit rate of SRAM faults in Hopper's L2 and L3 data and tag arrays to results obtained from the accelerated testing campaign on the SRAM cells. The figure shows that accelerated testing predicts a lower fault rate than seen in the field in all structures except the L2 tag array. The fault rate in the L2 tag is approximately equal to the rate from accelerated testing.

Overall, the figure shows good correlation between rates measured in the field and rates measured from static SRAM testing. Our conclusion from this correlation is that the majority of SRAM faults in the field are caused by known particles. While an expected result, confirming expectations with field data is important to ensure that parts are functioning as specified, to identify potential new or unexpected fault modes that may not have been tested in pre-production silicon, and to ensure accelerated testing reflects reality.

7. Fallacies in Measuring System Health

All data and analyses presented in the previous section refer to fault rates, not error rates. Some previous studies have reported system error rates instead of fault rates [12][14][16][30]. Error rates are heavily dependent on a system's software configuration and its workloads' access patterns in addition to the health of the system, which makes error rates an imperfect measure of hardware reliability. In this section, we show that measuring error rates can lead to erroneous conclusions about hardware reliability.
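The distinction matters because a single fault (e.g., a stuck-at bit in a frequently accessed row) can generate an unbounded stream of corrected errors. A minimal sketch of the two metrics computed over the same log follows; the record format is the same assumed format as in the earlier sketches.

```python
def error_rate(errors, hours):
    """Errors per hour: inflated by how often faulty bits are accessed,
    so it reflects workload behavior as much as hardware health."""
    return len(errors) / hours

def fault_rate(errors, hours):
    """Faults per hour: each faulty DRAM device is counted once,
    per the single-fault-per-device assumption of Section 5."""
    return len({e["device"] for e in errors}) / hours
```

A device whose stuck bit is re-read thousands of times inflates error_rate but leaves fault_rate unchanged, which is why the two metrics can rank systems differently.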
More importantly, however, this type of analysis would suggest that Hopper's ECC performs better than Cielo's ECC. However, Cielo's ECC actually had a 3x lower uncorrected error rate than that of Hopper.

Figure 10. Rate of faults that generate errors which are potentially undetectable by SEC-DED ECC.

Figure 11. Rate of DRAM address parity errors relative to uncorrected data ECC errors.
Both Cielo and Hopper log the bits in error for each corrected error. By analyzing each error, we can determine whether a given error would be undetectable by SEC-DED ECC. For the purposes of this study, we assume that errors larger than 2 bits in a single ECC word are undetectable by SEC-DED (an actual SEC-DED ECC typically can detect some fraction of these errors). We call a fault that generates an error larger than 2 bits in an ECC word an undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects more than two bits in any ECC word, and the data written to that location does not match the value produced by the fault. For instance, writing a 1 to a location with a stuck-at-1 fault will not cause an error.

Not all multi-bit faults are undetectable-by-SECDED faults. For instance, many single-column faults only affect a single bit per DRAM row [33], and manifest as a series of ECC words with a single-bit error that is correctable by SEC-DED ECC.
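The per-error test described above is mechanical: count the bits in error within each ECC word. A minimal sketch, under the same conservative simplification stated in the text:

```python
def undetectable_by_secded(error_bitmask):
    """An error is treated as undetectable by SEC-DED if more than
    2 bits of a single ECC word are in error (a conservative
    simplification; real SEC-DED codes detect some such errors)."""
    return bin(error_bitmask).count("1") > 2

def fault_is_undetectable(error_bitmasks):
    """A fault is undetectable-by-SECDED if any error it generates
    flips more than two bits in one ECC word."""
    return any(undetectable_by_secded(m) for m in error_bitmasks)
```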
Figure 10 shows the rate of undetectable-by-SECDED faults on Cielo, broken down by vendor. The rate of these faults exhibits a strong dependence on vendor. Vendor A has an undetectable-by-SECDED fault rate of 21.7 FIT per DRAM device. Vendors B and C, on the other hand, have much lower rates of 1.8 and 0.2 FIT/device, respectively. A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6 FIT per node for vendors A, B, and C, respectively. This translates to one undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine the size of Cielo.
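For reference, the node-level arithmetic is FIT/node = FIT/device × 288 devices, and the mean time between undetected errors on the full machine is 10^9 hours divided by (FIT/node × node count). The node count below (~8,500) is our assumption, chosen only because it reproduces the quoted intervals; this excerpt does not state Cielo's size.

```python
DEVICES_PER_NODE = 288
NODES = 8_500  # assumed machine size, not stated in the text

def days_between_undetected(fit_per_device):
    fit_per_node = fit_per_device * DEVICES_PER_NODE
    hours = 1e9 / (fit_per_node * NODES)  # FIT is per 10^9 hours
    return hours / 24

for vendor, fit in [("A", 21.7), ("B", 1.8), ("C", 0.2)]:
    print(vendor, round(days_between_undetected(fit), 1), "days")
# -> roughly 0.8, 9.5, and 85 days, matching the text.
```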
Figure 10 also shows that 30% of the undetectable-by-SECDED faults on Cielo generated 2-bit errors (which are detectable by SEC-DED) before generating a larger (e.g., 3-bit) error, while the remaining 70% did not. We emphasize, however, that this detection is not a guarantee: different workloads would write different data to memory, and thus potentially exhibit a different pattern of multi-bit errors.

Our main conclusion from this data is that SEC-DED ECC is poorly suited to modern DRAM subsystems. The rate of undetected errors is too high to justify its use in very large-scale systems comprising thousands of nodes where fidelity of results is critical. Even the most reliable vendor (vendor C) has a high rate of undetectable-by-SECDED faults when considering multiple nodes operating in parallel. Hopper and Cielo use chipkill-detect ECC and chipkill-correct ECC, respectively, and therefore exhibit much lower undetected error rates than if they used SEC-DED ECC.

8.2 DDR Command and Address Parity

A key feature of DDR3 (and now DDR4) memory is the ability to add parity-check logic to the command and address bus. The wires on this bus are shared from the memory controller to each DIMM's register. Therefore, errors on these wires will be seen by all DRAM devices on a DIMM. Though command and address parity is optional on DDR2 memory systems, we are aware of no prior study that examines the potential value of this parity mechanism.

The on-DIMM register calculates parity on the received address and command pins and compares it to the received parity signal. On a mismatch, the register signals the memory controller of a parity error. The standard does not provide for retry on a detected parity error, but requires the on-DIMM register to disallow any faulty transaction from writing data to the memory, thereby preventing data corruption.
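Functionally, the register's check reduces to recomputing parity over the sampled command and address bits and comparing against the incoming parity pin. A behavioral sketch follows; the even-parity convention and the return-value names are illustrative, not taken from the DDR3 specification.

```python
def even_parity(bits):
    """Parity bit such that (bits + parity) has an even number of 1s."""
    return sum(bits) % 2

def register_check(cmd_addr_bits, parity_pin):
    """Model of the on-DIMM register: on mismatch, block the faulty
    transaction and signal a command/address parity error."""
    if even_parity(cmd_addr_bits) != parity_pin:
        return {"execute": False, "signal_parity_error": True}
    return {"execute": True, "signal_parity_error": False}
```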
The DDR3 sub-system in Cielo includes command and address parity checking. Figure 11 shows the rate of detected command/address parity errors relative to the rate of detected, uncorrected data ECC errors. The figure shows that the rate of command/address parity errors was 72% of the rate of uncorrected ECC errors.

The conclusion from this data is that command/address parity is a valuable addition to the DDR standard. Furthermore, increasing DDR memory channel speeds will likely cause an increase in signaling-related errors. Therefore, we expect the ratio of address parity to ECC errors to increase with increased DDR frequencies.

8.3 SRAM Error Protection

In this section, we examine uncorrected errors from SRAM in Hopper and Cielo.

Figure 12 shows the rate of SRAM uncorrected errors on Cielo in arbitrary units. The figure separately plots uncorrected errors from parity-protected structures and uncorrected errors from the ECC-protected L1, L2, and L3 caches, which comprise the majority of the die area in the processor.
Figure 12. Rate of SRAM uncorrected errors on Cielo, in arbitrary units, for parity-protected and ECC-protected structures.

Figure 13. Rate of SRAM uncorrected errors from representative structures in Hopper, in arbitrary units.

The figure shows that the majority of uncorrected errors in Cielo come from parity-protected structures, even though these structures account for a small fraction of the processor's SRAM bits: parity can only detect faults, while ECC will correct single-bit faults. Therefore, the majority of SRAM uncorrected errors in Cielo are the result of single-bit, rather than multi-bit, faults.

Our primary conclusion in this section is that the best way to reduce SRAM uncorrected error rates is simply to extend single-bit correction (e.g., SEC-DED ECC) to additional structures in the processor. While this is a non-trivial effort due to performance, power, and area concerns, this solution does not require extensive new research or novel technologies.

Once single-bit faults are addressed, the remaining multi-bit faults may be more of a challenge to address, especially in highly scaled process technologies in which the rate and spread of multi-bit faults may increase substantially [17]. In addition, novel technologies such as ultra-low-voltage operation will likely change silicon failure characteristics [37]. Current and near-future process technologies, however, will see significant benefit from increased use of on-chip ECC.

8.4 Analysis of SRAM Errors

In this section, we delve further into the details of observed SRAM uncorrected errors, to determine the root cause and potential avenues for mitigation.

Figure 13 shows the rate of uncorrected errors (in arbitrary units) from several representative SRAM structures in Hopper. In the processors used in Hopper, the L1 data cache tag (L1DTag) is protected by parity, and all errors are uncorrectable. The L2 cache tag (L2Tag), on the other hand, is largely protected by ECC. However, a single bit in each L2 tag entry is protected by parity. Because the L2 cache is substantially larger than the L1, there are approximately half as many parity-protected bits in the L2 tag as there are bits in the entire L1 tag. This is reflected in the observed uncorrected error rate from the L2 tag, which is approximately half the rate of uncorrected errors from the L1 tag¹.

¹ This L2 Tag behavior has been addressed on more recent AMD processors.
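If uncorrected errors come almost entirely from parity-protected (detect-only) bits, the expected uncorrected error rate of a structure scales with its parity-protected bit count. A toy model of the L1-tag vs. L2-tag comparison above follows; the entry counts and tag width are hypothetical, chosen only to reflect the roughly 2:1 bit ratio described in the text.

```python
# Hypothetical geometries, for illustration only.
L1_TAG_ENTRIES, L1_TAG_BITS_PER_ENTRY = 2048, 32  # all parity-protected
L2_TAG_ENTRIES = 32768                            # 1 parity bit per entry

l1_parity_bits = L1_TAG_ENTRIES * L1_TAG_BITS_PER_ENTRY  # 65536
l2_parity_bits = L2_TAG_ENTRIES * 1                      # 32768

# With a uniform per-bit fault rate, uncorrected errors scale with
# the number of parity-protected bits:
print(l2_parity_bits / l1_parity_bits)  # -> 0.5, i.e., ~half the L1 rate
```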
Our observation from this data is that seemingly small microarchitectural decisions (e.g., the decision to exclude a single bit from ECC protection) can have a large impact on overall system failure rates. Therefore, detailed modeling and analysis of the microarchitecture (e.g., AVF analysis [25]) is critical to ensuring a resilient system design.

Prior field studies have shown that SRAM fault rates have a strong altitude dependence due to the effect of atmospheric neutrons [34]. True to this trend, Cielo experiences substantially more corrected SRAM faults than Hopper. While first principles would predict a similar increase in SRAM uncorrected errors at altitude, prior studies have not quantified whether this is true in practice.

Figure 14. Rate of SRAM uncorrected errors in Cielo relative to Hopper. ECC and aggressive interleaving seem to mitigate the expected increase in uncorrected error rate due to altitude.

Figure 14 shows the per-bit SRAM uncorrected error rate in Cielo relative to Hopper in the L1 instruction tag, L2 tag, and L3 data arrays. The figure shows that Cielo's error rate in the L2 and L3 structures is not substantially higher than Hopper's error rate, despite being at a significantly higher altitude. We attribute this to the error protection in these structures. These structures have both ECC and aggressive bit interleaving; therefore, a strike of much larger than two bits is required to cause an uncorrected error.
Therefore, the data points to the conclusion that the multi-bit error rate from high-energy neutrons in these structures is smaller than the error rate from other sources, such as alpha particles.

If per-bit fault rates decrease at future technology nodes, the uncorrected error rate per socket will be lower than our projections. For instance, according to Ibe et al., the per-bit SER decreased by 62% between 45nm and 22nm technology.
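As a worked example of that adjustment, a projected per-socket rate can be scaled by the quoted 62% per-bit SER reduction; the baseline rate below is a hypothetical placeholder, not a value from our data.

```python
projected_fit_per_socket = 1_000.0  # hypothetical baseline projection
ser_reduction_45nm_to_22nm = 0.62   # per-bit SER decrease quoted above

adjusted = projected_fit_per_socket * (1 - ser_reduction_45nm_to_22nm)
print(adjusted)  # -> 380.0 FIT: projections shrink if per-bit SER falls
```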