Change Details

While rebooting cp2006 today:

```
Enumerating Boot options...
Enumerating Boot options... Done
UEFI0107: One or more memory errors have occurred on memory slot: B2. Remove input power to the system, reseat the DIMM module and restart the system. If the issues persist, replace the faulty memory module identified in the message.
UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning. Check the System Event Log (SEL) to identify the non-functioning DIMM, and then replace it.
```

The SEL features a few related events, one of which is from today:

```
$ sudo ipmi-sel -v | grep memory
8 | Feb-05-2016 | 19:05:57 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
13 | Jun-01-2016 | 17:52:21 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
21 | Mar-23-2018 | 16:55:08 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
```

The same thing happened on cp2010, SEL follows:

```
15 | Mar-23-2018 | 17:41:46 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
```

== Memory Testing for cp2001-cp2026 ==

Due to the large number of memory failures in this batch (all 26 systems were ordered together on RT#9336), we'll need to step through the systems and test memory across the order. We'll also email Dell in an attempt to resolve this situation. However, they are unlikely to want to foot the bill for proactively swapping all 416 16GB DIMMs in that 26-system order.

Memory testing procedure:

All tests should be recorded on [[ https://docs.google.com/spreadsheets/d/1zw8lXpqh9KxgjUpGKufnQ76_kramD8_is4abBPo3TU0/edit?usp=sharing | this google sheet ]].

* Start at cp2001, and work upwards. Systems that are already depooled can be tested at any time (suggest testing all depooled systems first, then starting at cp2001 and working upwards).
* Only take down 1 odd-numbered server and 1 even-numbered server at a time.
* Systems will be automatically depooled by pybal/confctl, no need to manually depool.
* Copy down any errors in the SEL before memory testing; the SEL has to be wiped prior to tests (or the tests will report on the existing SEL errors). (command: racadm getsel)
* Clear out the SEL on the drac. (command: racadm delsel)
* Run memory tests on the system, monitor the result and update [[ https://docs.google.com/spreadsheets/d/1zw8lXpqh9KxgjUpGKufnQ76_kramD8_is4abBPo3TU0/edit?usp=sharing | this google sheet ]].
* If the system passes memory testing, reboot it back into its OS and note it on the google sheet.

Coordinate in #wikimedia-traffic to bring servers on and offline.

Once we have memtested the entire batch, we'll have a better idea of how widespread this issue is.
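
The ordering rule in this version of the procedure (lowest-numbered first, at most one odd-numbered and one even-numbered server down at a time) could be sketched roughly as below. The helper names `host_number` and `next_to_test` and the `tested` set are hypothetical, made up for illustration; the sketch only picks hostnames and does not talk to pybal/confctl or the spreadsheet.

```python
def host_number(host):
    """Return the numeric part of a name like 'cp2006' (here 2006)."""
    return int(host[2:])


def next_to_test(hosts, tested):
    """Pick the lowest untested odd host and the lowest untested even host.

    Returns a list with at most two hostnames, so that no more than one
    odd-numbered and one even-numbered server is down at the same time.
    """
    pending = sorted((h for h in hosts if h not in tested), key=host_number)
    odd = next((h for h in pending if host_number(h) % 2 == 1), None)
    even = next((h for h in pending if host_number(h) % 2 == 0), None)
    return [h for h in (odd, even) if h is not None]


# The batch covered by this task: cp2001 through cp2026.
hosts = [f"cp{n}" for n in range(2001, 2027)]

# Hypothetical progress so far, for illustration only.
tested = {"cp2001", "cp2002", "cp2004"}
batch = next_to_test(hosts, tested)
```

Systems that are already depooled would be handled outside this selection, since they can be tested at any time.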

While rebooting cp2006 today:

```
Enumerating Boot options...
Enumerating Boot options... Done
UEFI0107: One or more memory errors have occurred on memory slot: B2. Remove input power to the system, reseat the DIMM module and restart the system. If the issues persist, replace the faulty memory module identified in the message.
UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning. Check the System Event Log (SEL) to identify the non-functioning DIMM, and then replace it.
```

The SEL features a few related events, one of which is from today:

```
$ sudo ipmi-sel -v | grep memory
8 | Feb-05-2016 | 19:05:57 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
13 | Jun-01-2016 | 17:52:21 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
21 | Mar-23-2018 | 16:55:08 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
```

The same thing happened on cp2010, SEL follows:

```
15 | Mar-23-2018 | 17:41:46 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
```

== Memory Testing for cp2001-cp2026 ==

Due to the large number of memory failures in this batch (all 26 systems were ordered together on RT#9336), we'll need to step through the systems and test memory across the order. We'll also email Dell in an attempt to resolve this situation. However, they are unlikely to want to foot the bill for proactively swapping all 416 16GB DIMMs in that 26-system order.

Memory testing procedure:

All tests should be recorded on [[ https://docs.google.com/spreadsheets/d/1zw8lXpqh9KxgjUpGKufnQ76_kramD8_is4abBPo3TU0/edit?usp=sharing | this google sheet ]].

* Start at cp2001, and work upwards.
** Systems that are already depooled can be tested at any time (suggest testing all depooled systems first, then starting at cp2001 and working upwards).
** Systems that are perma-depooled as spare can be tested at any time.
* Of the pooled systems, do not take down more than 1 server per pool (text/upload).
* Systems will be automatically depooled by pybal/confctl, no need to manually depool.
* Copy down any errors in the SEL before memory testing; the SEL has to be wiped prior to tests (or the tests will report on the existing SEL errors). (command: racadm getsel)
* Clear out the SEL on the drac. (command: racadm delsel)
* Run memory tests on the system, monitor the result and update [[ https://docs.google.com/spreadsheets/d/1zw8lXpqh9KxgjUpGKufnQ76_kramD8_is4abBPo3TU0/edit?usp=sharing | this google sheet ]].
* If the system passes memory testing, reboot it back into its OS and note it on the google sheet.

Coordinate in #wikimedia-traffic to bring servers on and offline.

Once we have memtested the entire batch, we'll have a better idea of how widespread this issue is.
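
Since SEL entries have to be copied down before the log is cleared, it may help to pull out just the uncorrectable memory events in a form that is easy to paste into the spreadsheet. The following is a minimal sketch, assuming the pipe-separated layout seen in the `ipmi-sel -v` output quoted above (it does not parse `racadm getsel` output, whose format is not shown in this task). The function name `parse_sel_memory_errors` is hypothetical, and the sample lines are copied from the cp2006 output.

```python
from datetime import datetime


def parse_sel_memory_errors(text):
    """Extract uncorrectable memory error records from ipmi-sel style output.

    Expects lines of the form:
    ID | Mon-DD-YYYY | HH:MM:SS | sensor name | sensor type | direction | event
    Lines that do not match that shape (prompts, headers) are skipped.
    """
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        if "Uncorrectable memory error" not in fields[6]:
            continue
        # %b matches English month abbreviations under the default C locale.
        when = datetime.strptime(f"{fields[1]} {fields[2]}", "%b-%d-%Y %H:%M:%S")
        events.append(
            {"id": int(fields[0]), "when": when, "sensor": fields[3], "event": fields[6]}
        )
    return events


SAMPLE = """\
$ sudo ipmi-sel -v | grep memory
8 | Feb-05-2016 | 19:05:57 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
13 | Jun-01-2016 | 17:52:21 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
21 | Mar-23-2018 | 16:55:08 | ECC Uncorr Err | Memory | Assertion Event | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
"""

events = parse_sel_memory_errors(SAMPLE)
for e in events:
    print(e["id"], e["when"].isoformat(), e["sensor"])
```

Keeping the record ID and timestamp together makes it easier to tell old events (2016) apart from the ones logged during the current reboot.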
