3 EDAC - Error Detection And Correction
5 Written by Doug Thompson <norsk5@xmission.com>
11 modified by Dave Peterson, Doug Thompson, et al,
12 from the bluesmoke.sourceforge.net project.
15 ============================================================================
18 The 'edac' kernel module goal is to detect and report errors that occur
19 within the computer system. In the initial release, memory Correctable Errors
20 (CE) and Uncorrectable Errors (UE) are the primary errors being harvested.
22 Detecting CE events, then harvesting those events and reporting them,
23 CAN be a predictor of future UE events. With CE events, the system can
24 continue to operate, but with less safety. Preventive maintenance and
25 proactive part replacement of memory DIMMs exhibiting CEs can reduce
26 the likelihood of the dreaded UE events and system 'panics'.
29 In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
30 in order to determine if errors are occurring on data transfers.
31 The presence of PCI Parity errors must be examined with a grain of salt.
32 There are several add-in adapters that do NOT follow the PCI specification
with regard to Parity generation and reporting. The specification says
34 the vendor should tie the parity status bits to 0 if they do not intend
35 to generate parity. Some vendors do not do this, and thus the parity bit
36 can "float" giving false positives.
38 The PCI Parity EDAC device has the ability to "skip" known flaky
39 cards during the parity scan. These are set by the parity "blacklist"
40 interface in the sysfs for PCI Parity. (See the PCI section in the sysfs
section below.) There is also a parity "whitelist" which is used as
an explicit list of devices to scan, while the blacklist is a list
of devices to skip.
45 EDAC will have future error detectors that will be added or integrated
46 into EDAC in the following list:
48 MCE Machine Check Exception
49 MCA Machine Check Architecture
50 NMI NMI notification of ECC errors
51 MSRs Machine Specific Register error cases
These errors are usually bus errors, ECC errors, thermal throttling
and the like.
58 ============================================================================
61 EDAC is composed of a "core" module (edac_mc.ko) and several Memory
62 Controller (MC) driver modules. On a given system, the CORE
63 is loaded and one MC driver will be loaded. Both the CORE and
64 the MC driver have individual versions that reflect current release
level of their respective modules. Thus, to "report" on what version
a system is running, one must report both the CORE's and the
MC driver's versions.
72 If 'edac' was statically linked with the kernel then no loading is
73 necessary. If 'edac' was built as modules then simply modprobe the
74 'edac' pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary core
modules.

Example:

$> modprobe amd76x_edac

loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
core module.
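
As a rough check that a modprobe pulled in both pieces, the list of
loaded modules can be filtered for EDAC names. This is only a sketch: it
assumes the usual /proc/modules layout (module name in the first column)
and that MC driver modules follow the '<chipset>_edac' naming seen in
amd76x_edac. The function name is made up for this example.

```shell
# List loaded EDAC modules: the core (edac_mc) plus any *_edac MC driver.
# The optional argument overrides /proc/modules, which helps when testing.
edac_loaded_modules() (
    modules_file=${1:-/proc/modules}
    # The first column of /proc/modules is the module name.
    awk '$1 == "edac_mc" || $1 ~ /_edac$/ { print $1 }' "$modules_file"
)
```

If 'edac' was statically linked with the kernel, nothing is listed, since
no modules were loaded.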
86 ============================================================================
EDAC presents a 'sysfs' interface for control and for reporting of errors
and attributes.
92 EDAC lives in the /sys/devices/system/edac directory. Within this directory
there currently reside 2 'edac' components:

        mc      memory controller(s) system
        pci     PCI Bus Parity control and status system
99 ============================================================================
100 Memory Controller (mc) Model
102 First a background on the memory controller's model abstracted in EDAC.
103 Each mc device controls a set of DIMM memory modules. These modules are
104 laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
105 be multiple csrows and two channels.
107 Memory controllers allow for several csrows, with 8 csrows being a typical value.
108 Yet, the actual number of csrows depends on the electrical "loading"
109 of a given motherboard, memory controller and DIMM characteristics.
Dual channels allow for 128 bit data transfers to the CPU from memory.
              Channel 0   Channel 1
===================================
csrow0      | DIMM_A0   | DIMM_B0 |
csrow1      | DIMM_A0   | DIMM_B0 |
===================================

===================================
csrow2      | DIMM_A1   | DIMM_B1 |
csrow3      | DIMM_A1   | DIMM_B1 |
===================================
In the above example table there are 4 physical slots on the motherboard
for memory DIMMs: DIMM_A0, DIMM_B0, DIMM_A1 and DIMM_B1.
133 Labels for these slots are usually silk screened on the motherboard. Slots
134 labeled 'A' are channel 0 in this example. Slots labeled 'B'
135 are channel 1. Notice that there are two csrows possible on a
136 physical DIMM. These csrows are allocated their csrow assignment
137 based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
138 is placed in each Channel, the csrows cross both DIMMs.
140 Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
141 Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
142 will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
143 when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
csrow1 will be populated. The pattern repeats itself for csrow2 and
csrow3.
147 The representation of the above is reflected in the directory tree
148 in EDAC's sysfs interface. Starting in directory
149 /sys/devices/system/edac/mc each memory controller will be represented
by its own 'mcX' directory, where 'X' is the index of the MC.
Under each 'mcX' directory each 'csrowX' is again represented by a
'csrowX' directory, where 'X' is the csrow index:

        /sys/devices/system/edac/mc/mc0/
                |
                |->csrow0
                |->csrow2
                |->csrow3
Notice that there is no csrow1, which indicates that csrow0 is
composed of single ranked DIMMs. This should also apply in both
173 Channels, in order to have dual-channel mode be operational. Since
174 both csrow2 and csrow3 are populated, this indicates a dual ranked
175 set of DIMMs for channels 0 and 1.
178 Within each of the 'mc','mcX' and 'csrowX' directories are several
179 EDAC control and attribute files.
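
To see which csrows are populated on a running system, the directory
tree can be walked with a few lines of shell. The sketch below only
relies on the 'mcX'/'csrowX' directory naming described above; the
function name is hypothetical, and the base directory can be overridden
for testing.

```shell
# Print every csrow directory as "mcX/csrowY", one per line.
# $1 is the EDAC mc directory (default: /sys/devices/system/edac/mc).
edac_list_csrows() (
    base=${1:-/sys/devices/system/edac/mc}
    for row in "$base"/mc[0-9]*/csrow[0-9]*; do
        [ -d "$row" ] || continue      # unmatched glob or not a directory
        mc=$(basename "$(dirname "$row")")
        echo "$mc/$(basename "$row")"
    done
)
```

A gap in the output (csrow0 followed directly by csrow2, for instance)
points at single ranked DIMMs in that pair of slots.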
182 ============================================================================
185 In directory 'mc' are EDAC system overall control and attribute files:
188 Panic on UE control file:
192 An uncorrectable error will cause a machine panic. This is usually
193 desirable. It is a bad idea to continue when an uncorrectable error
194 occurs - it is indeterminate what was uncorrected and the operating
195 system context might be so mangled that continuing will lead to further
196 corruption. If the kernel has MCE configured, then EDAC will never
199 LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]
201 RUN TIME: echo "1" >/sys/devices/system/edac/mc/panic_on_ue
Log UE control file:

Generate kernel messages describing uncorrectable errors. These errors
209 are reported through the system message log system. UE statistics
210 will be accumulated even when UE logging is disabled.
212 LOAD TIME: module/kernel parameter: log_ue=[0|1]
214 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue
Log CE control file:

Generate kernel messages describing correctable errors. These
222 errors are reported through the system message log system.
223 CE statistics will be accumulated even when CE logging is disabled.
225 LOAD TIME: module/kernel parameter: log_ce=[0|1]
227 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce
230 Polling period control file:
234 The time period, in milliseconds, for polling for error information.
235 Too small a value wastes resources. Too large a value might delay
necessary handling of errors and might lose valuable information for
237 locating the error. 1000 milliseconds (once each second) is about
LOAD TIME: module/kernel parameter: poll_msec=<milliseconds>
242 RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec
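
When the polling period is set from a script, it is worth rejecting
anything that is not a plain number before it reaches the control file.
The following is a minimal sketch; the function name is made up, and it
checks only that the value consists of digits (whether the driver applies
further limits is not covered here).

```shell
# Write a polling period (milliseconds) to the poll_msec control file.
# $1 = period, $2 = EDAC mc directory (default: the path used above).
edac_set_poll_msec() (
    msec=$1
    base=${2:-/sys/devices/system/edac/mc}
    case $msec in
        ''|*[!0-9]*)
            echo "poll_msec: not a number: $msec" >&2
            exit 1 ;;                  # leaves only this function's subshell
    esac
    echo "$msec" > "$base/poll_msec"
)
```

For example, 'edac_set_poll_msec 1000' has the same effect as the RUN
TIME echo above. Writing to the real control file requires root.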
245 Module Version read-only attribute file:
249 The EDAC CORE module's version and compile date are shown here to
250 indicate what EDAC is running.
254 ============================================================================
258 In 'mcX' directories are EDAC control and attribute files for
259 this 'X" instance of the memory controllers:
262 Counter reset control file:
266 This write-only control file will zero all the statistical counters
267 for UE and CE errors. Zeroing the counters will also reset the timer
268 indicating how long since the last counter zero. This is useful
269 for computing errors/time. Since the counters are always reset at
270 driver initialization time, no module/kernel parameter is available.
RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters
This resets the counters on memory controller 0.
277 Seconds since last counter reset control file:
279 'seconds_since_reset'
This attribute file displays how many seconds have elapsed since the
last counter reset. This can be used with the error counters to
compute error rates.
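
As an illustration of computing errors/time, the sketch below divides a
CE counter by 'seconds_since_reset' and scales the result to errors per
hour. It ASSUMES the correctable error counter of a memory controller is
exposed as a file named 'ce_count' in the same 'mcX' directory; that
name, like the function name, is an assumption of this example.

```shell
# Print correctable errors per hour for one memory controller directory.
# ASSUMPTION: the CE counter file is named ce_count; adjust if it differs.
edac_ce_per_hour() (
    mcdir=$1                                  # e.g. .../edac/mc/mc0
    ce=$(cat "$mcdir/ce_count") || exit 1
    secs=$(cat "$mcdir/seconds_since_reset") || exit 1
    [ "$secs" -gt 0 ] || { echo 0; exit 0; }  # avoid dividing by zero
    echo $(( ce * 3600 / secs ))              # integer arithmetic, rounds down
)
```

Because the arithmetic is integer-only, low rates round down to 0; a
longer interval between counter resets gives a more useful figure.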
287 DIMM capability attribute file:
291 The EDAC (Error Detection and Correction) capabilities/modes of
292 the memory controller hardware.
295 DIMM Current Capability attribute file:
297 'edac_current_capability'
299 The EDAC capabilities available with the hardware
300 configuration. This may not be the same as "EDAC capability"
301 if the correct memory is not used. If a memory controller is
302 capable of EDAC, but DIMMs without check bits are in use, then
303 Parity, SECDED, S4ECD4ED capabilities will not be available
304 even though the memory controller might be capable of those
305 modes with the proper memory loaded.
308 Memory Type supported on this controller attribute file:
312 This attribute file displays the memory type, usually
313 buffered and unbuffered DIMMs.
316 Memory Controller name attribute file:
320 This attribute file displays the type of memory controller
321 that is being utilized.
324 Memory Controller Module name attribute file:
This attribute file displays the memory controller module name,
version and date built. It also gives the name of the memory
controller hardware: some drivers work with multiple controllers,
and this field shows which hardware is present.
334 Total memory managed by this memory controller attribute file:
This attribute file displays the amount of memory, in megabytes,
that this instance of the memory controller manages.
342 Total Uncorrectable Errors count attribute file:
346 This attribute file displays the total count of uncorrectable
347 errors that have occurred on this memory controller. If panic_on_ue
348 is set this counter will not have a chance to increment,
349 since EDAC will panic the system.
Total UE count that had no information attribute file:

This attribute file displays the number of UEs that
have occurred with no information as to which DIMM
slot is having errors.
361 Total Correctable Errors count attribute file:
365 This attribute file displays the total count of correctable
366 errors that have occurred on this memory controller. This
367 count is very important to examine. CEs provide early
indications that a DIMM is beginning to fail. This count
field should be monitored for non-zero values, and any such
values reported to the system administrator.
Total CE count that had no information attribute file:

This attribute file displays the number of CEs that
have occurred with no information as to which DIMM slot
is having errors. Memory is handicapped, but operational,
yet no information is available to indicate which slot
the failing memory is in. This count field should also
be monitored for non-zero values.
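
Monitoring the CE counters for non-zero values is easy to script, for
example from cron. The sketch below ASSUMES the two counters are exposed
as files named 'ce_count' and 'ce_noinfo_count' in each 'mcX' directory;
those names and the function name are assumptions of this example, not
taken from the text above.

```shell
# Report memory controllers whose CE counters are non-zero.
# ASSUMPTION: counters are files named ce_count and ce_noinfo_count.
# Exit status: 0 if all counters are zero, 1 if any is non-zero.
edac_check_ce() (
    base=${1:-/sys/devices/system/edac/mc}
    status=0
    for f in "$base"/mc[0-9]*/ce_count "$base"/mc[0-9]*/ce_noinfo_count; do
        [ -r "$f" ] || continue        # unmatched glob or unreadable file
        n=$(cat "$f")
        if [ "$n" -gt 0 ]; then
            echo "$(basename "$(dirname "$f")") $(basename "$f")=$n"
            status=1
        fi
    done
    exit $status
)
```

A non-zero exit status can then be used to trigger a mail or a log entry
for the system administrator.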
388 Symlink to the memory controller device
392 ============================================================================
395 In the 'csrowX' directories are EDAC control and attribute files for
396 this 'X" instance of csrow:
399 Total Uncorrectable Errors count attribute file:
403 This attribute file displays the total count of uncorrectable
404 errors that have occurred on this csrow. If panic_on_ue is set
405 this counter will not have a chance to increment, since EDAC
406 will panic the system.
409 Total Correctable Errors count attribute file:
413 This attribute file displays the total count of correctable
414 errors that have occurred on this csrow. This
415 count is very important to examine. CEs provide early
indications that a DIMM is beginning to fail. This count
field should be monitored for non-zero values, and any such
values reported to the system administrator.
421 Total memory managed by this csrow attribute file:
This attribute file displays the amount of memory, in megabytes,
that this csrow contains.
429 Memory Type attribute file:
433 This attribute file will display what type of memory is currently
434 on this csrow. Normally, either buffered or unbuffered memory.
437 EDAC Mode of operation attribute file:
441 This attribute file will display what type of Error detection
442 and correction is being utilized.
445 Device type attribute file:
449 This attribute file will display what type of DIMM device is
450 being utilized. Example: x4
453 Channel 0 CE Count attribute file:
457 This attribute file will display the count of CEs on this
458 DIMM located in channel 0.
461 Channel 0 UE Count attribute file:
465 This attribute file will display the count of UEs on this
466 DIMM located in channel 0.
469 Channel 0 DIMM Label control file:
473 This control file allows this DIMM to have a label assigned
474 to it. With this label in the module, when errors occur
475 the output can provide the DIMM label in the system log.
476 This becomes vital for panic events to isolate the
477 cause of the UE event.
479 DIMM Labels must be assigned after booting, with information
480 that correctly identifies the physical slot with its
481 silk screen label. This information is currently very
482 motherboard specific and determination of this information
483 must occur in userland at this time.
486 Channel 1 CE Count attribute file:
490 This attribute file will display the count of CEs on this
491 DIMM located in channel 1.
494 Channel 1 UE Count attribute file:
498 This attribute file will display the count of UEs on this
DIMM located in channel 1.
502 Channel 1 DIMM Label control file:
506 This control file allows this DIMM to have a label assigned
507 to it. With this label in the module, when errors occur
508 the output can provide the DIMM label in the system log.
509 This becomes vital for panic events to isolate the
510 cause of the UE event.
512 DIMM Labels must be assigned after booting, with information
513 that correctly identifies the physical slot with its
514 silk screen label. This information is currently very
515 motherboard specific and determination of this information
516 must occur in userland at this time.
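
Since labels have to be assigned from userland after booting, a small
helper run from a boot script is one way to do it. The sketch below
ASSUMES the label control file of channel N is named 'ch<N>_dimm_label'
inside the 'csrowX' directory; that file name and the function name are
assumptions of this example, and the mapping from csrow/channel to silk
screen label still has to come from knowledge of the motherboard.

```shell
# Assign a silk-screen label to the DIMM on one channel of a csrow.
# ASSUMPTION: the control file is named ch<N>_dimm_label.
# $1 = csrow directory, $2 = channel number, $3 = label text.
edac_set_dimm_label() (
    csrow_dir=$1; channel=$2; label=$3
    file="$csrow_dir/ch${channel}_dimm_label"
    [ -w "$file" ] || { echo "no writable label file: $file" >&2; exit 1; }
    printf '%s\n' "$label" > "$file"
)
```

For the example table earlier in this document, channel 0 of csrow0
would get the label DIMM_A0 and channel 1 the label DIMM_B0.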
519 ============================================================================
If logging for UEs and CEs is enabled, then system logs will have
523 error notices indicating errors that have been detected:
525 MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
526 channel 1 "DIMM_B1": amd76x_edac
528 MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
529 channel 1 "DIMM_B1": amd76x_edac
The structure of the message is:
        the memory controller                   (MC0)
        Error type                              (CE)
        memory page                             (0x283)
        offset in the page                      (0xce0)
        the byte granularity                    (grain 8)
                or resolution of the error
        the error syndrome                      (0x6ec3)
        memory row                              (row 0)
        memory channel                          (channel 1)
        DIMM label, if set prior                (DIMM_B1)
        and then an optional, driver-specific message that may
                have additional information.
546 Both UEs and CEs with no info will lack all but memory controller,
547 error type, a notice of "no info" and then an optional,
548 driver-specific error message.
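
For log post-processing, the fields can be pulled out of a message with
sed. This is a sketch only: it assumes each message arrives as a single
log line (the examples above are wrapped for display), that the field
order matches the examples, and that a DIMM label is present. The
function name is made up for this example.

```shell
# Split one EDAC CE/UE log line (read from stdin) into key=value pairs,
# printed on one line. Lines lacking the full field set print nothing.
edac_parse_log_line() {
    sed -n 's/^.*\(MC[0-9][0-9]*\): \([CU]E\) page \(0x[0-9a-fA-F]*\), offset \(0x[0-9a-fA-F]*\), grain \([0-9]*\), syndrome \(0x[0-9a-fA-F]*\), row \([0-9]*\), channel \([0-9]*\) "\([^"]*\)".*$/mc=\1 type=\2 page=\3 offset=\4 grain=\5 syndrome=\6 row=\7 channel=\8 label=\9/p'
}
```

Messages with "no info" do not carry these fields and are skipped by the
pattern; they would need a separate, simpler match.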
552 ============================================================================
553 PCI Bus Parity Detection
556 On Header Type 00 devices the primary status is looked at
557 for any parity error regardless of whether Parity is enabled on the
558 device. (The spec indicates parity is generated in some cases).
On Header Type 01 bridges, the secondary status register is also
looked at to see if a parity error occurred on the bus on the other
side of the bridge.
566 Under /sys/devices/system/edac/pci are control and attribute files as follows:
569 Enable/Disable PCI Parity checking control file:
574 This control file enables or disables the PCI Bus Parity scanning
575 operation. Writing a 1 to this file enables the scanning. Writing
576 a 0 to this file disables the scanning.
579 echo "1" >/sys/devices/system/edac/pci/check_pci_parity
582 echo "0" >/sys/devices/system/edac/pci/check_pci_parity
586 Panic on PCI PARITY Error:
588 'panic_on_pci_parity'
This control file enables or disables panicking when a parity
592 error has been detected.
595 module/kernel parameter: panic_on_pci_parity=[0|1]
598 echo "1" >/sys/devices/system/edac/pci/panic_on_pci_parity
601 echo "0" >/sys/devices/system/edac/pci/panic_on_pci_parity
This attribute file will display the number of parity errors that
have been detected.
613 PCI Device Whitelist:
615 'pci_parity_whitelist'
617 This control file allows for an explicit list of PCI devices to be
618 scanned for parity errors. Only devices found on this list will
be examined. The list is a line of hexadecimal VENDOR and DEVICE
ID tuples:

        1022:7450,1434:16a6

624 One or more can be inserted, separated by a comma.
To write the above list, run the following as one command line:
echo "1022:7450,1434:16a6" \
        > /sys/devices/system/edac/pci/pci_parity_whitelist
633 To display what the whitelist is, simply 'cat' the same file.
636 PCI Device Blacklist:
638 'pci_parity_blacklist'
640 This control file allows for a list of PCI devices to be
641 skipped for scanning.
The list is a line of hexadecimal VENDOR and DEVICE ID tuples:

        1022:7450,1434:16a6

646 One or more can be inserted, separated by a comma.
To write the above list, run the following as one command line:
echo "1022:7450,1434:16a6" \
        > /sys/devices/system/edac/pci/pci_parity_blacklist
To display what the blacklist currently contains,
simply 'cat' the same file.
657 =======================================================================
659 PCI Vendor and Devices IDs can be obtained with the lspci command. Using
660 the -n option lspci will display the vendor and device IDs. The system
administrator will have to determine which devices should be scanned or
skipped.
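
Once the IDs have been collected, the comma-separated list can be built
and sanity-checked before it is written to a list file. The sketch below
assumes each ID is written as two 4-digit hexadecimal numbers, which is
how 'lspci -n' prints them; the function name is made up for this
example.

```shell
# Join vendor:device IDs into the comma-separated form the list files expect.
# Each argument must be two 4-digit hex numbers, e.g. 1022:7450.
edac_pci_id_list() (
    list=
    for id in "$@"; do
        if ! printf '%s\n' "$id" | grep -Eq '^[0-9a-fA-F]{4}:[0-9a-fA-F]{4}$'; then
            echo "bad vendor:device ID: $id" >&2
            exit 1
        fi
        list=${list:+$list,}$id        # add a comma only between entries
    done
    printf '%s\n' "$list"
)
```

The output can be redirected straight into pci_parity_whitelist or
pci_parity_blacklist, e.g. 'edac_pci_id_list 1022:7450 1434:16a6'
prints the list used in the examples above.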
The two lists (white and black) are prioritized. The blacklist has the
lower priority and will NOT be utilized when a whitelist has been set.
668 Turn OFF a whitelist by an empty echo command:
670 echo > /sys/devices/system/edac/pci/pci_parity_whitelist
672 and any previous blacklist will be utilized.
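
The priority rule can be modelled in a few lines of shell, which may help
when checking a planned whitelist/blacklist combination before applying
it. This is a model of the behaviour as described in this section, not
the kernel's implementation; the function name is hypothetical and IDs
are compared as plain strings (so hex digit case must match).

```shell
# Decide whether a device would be scanned, following the rule above:
# a non-empty whitelist wins; otherwise the blacklist is consulted.
# $1 = vendor:device ID, $2 = whitelist, $3 = blacklist (comma-separated).
# Exit status 0 means "scanned", 1 means "skipped".
edac_pci_would_scan() (
    id=$1; white=$2; black=$3
    in_list() { case ",$2," in *",$1,"*) return 0 ;; esac; return 1; }
    if [ -n "$white" ]; then
        in_list "$id" "$white"       # only whitelisted devices are scanned
    else
        ! in_list "$id" "$black"     # everything except blacklisted devices
    fi
)
```

With a whitelist of "1022:7450", device 1434:16a6 is skipped no matter
what the blacklist holds; with an empty whitelist, only devices on the
blacklist are skipped.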