db2139 s4 (commonswiki) instance crashed (backup source)
Closed, Resolved · Public

Description

At 2023-04-25 17:43:24 UTC, mysqld on db2139 (s4) crashed with signal 7 (SIGBUS):

systemd log
Apr 18 13:31:09 db2139 mysqld[1333]: 2023-04-18 13:31:09 38341035 [Warning] Aborted connection 38341035 to db: 'unconnected' user: 'orchestrator' host: '208.80.155.103' (Got timeout reading communication packets)
Apr 18 13:31:09 db2139 mysqld[1333]: 2023-04-18 13:31:09 38341031 [Warning] Aborted connection 38341031 to db: 'unconnected' user: 'orchestrator' host: '208.80.155.103' (Got timeout reading communication packets)
Apr 18 13:31:09 db2139 mysqld[1333]: 2023-04-18 13:31:09 38341032 [Warning] Aborted connection 38341032 to db: 'unconnected' user: 'orchestrator' host: '208.80.155.103' (Got timeout reading communication packets)
Apr 25 17:43:24 db2139 mysqld[1333]: 230425 17:43:24 [ERROR] mysqld got signal 7 ;
Apr 25 17:43:24 db2139 mysqld[1333]: This could be because you hit a bug. It is also possible that this binary
Apr 25 17:43:24 db2139 mysqld[1333]: or one of the libraries it was linked against is corrupt, improperly built,
Apr 25 17:43:24 db2139 mysqld[1333]: or misconfigured. This error can also be caused by malfunctioning hardware.
Apr 25 17:43:24 db2139 mysqld[1333]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Apr 25 17:43:24 db2139 mysqld[1333]: We will try our best to scrape up some info that will hopefully help
Apr 25 17:43:24 db2139 mysqld[1333]: diagnose the problem, but since we have already crashed,
Apr 25 17:43:24 db2139 mysqld[1333]: something is definitely wrong and this may fail.
Apr 25 17:43:24 db2139 mysqld[1333]: Server version: 10.4.25-MariaDB
Apr 25 17:43:24 db2139 mysqld[1333]: key_buffer_size=1048576
Apr 25 17:43:24 db2139 mysqld[1333]: read_buffer_size=131072
Apr 25 17:43:24 db2139 mysqld[1333]: max_used_connections=14
Apr 25 17:43:24 db2139 mysqld[1333]: max_threads=252
Apr 25 17:43:24 db2139 mysqld[1333]: thread_count=18
Apr 25 17:43:24 db2139 mysqld[1333]: It is possible that mysqld could use up to
Apr 25 17:43:24 db2139 mysqld[1333]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 555601 K  bytes of memory
Apr 25 17:43:24 db2139 mysqld[1333]: Hope that's ok; if not, decrease some variables in the equation.
Apr 25 17:43:24 db2139 mysqld[1333]: Thread pointer: 0x7f41059f3f18
Apr 25 17:43:24 db2139 mysqld[1333]: Attempting backtrace. You can use the following information to find out
Apr 25 17:43:24 db2139 mysqld[1333]: where mysqld died. If you see no messages after this, something went
Apr 25 17:43:24 db2139 mysqld[1333]: terribly wrong...
Apr 25 17:43:24 db2139 mysqld[1333]: stack_bottom = 0x7f718c41d760 thread_stack 0x49000
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(my_print_stacktrace+0x2e)[0x55db122f8dde]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(handle_fatal_signal+0x54d)[0x55db11d7fd8d]
Apr 25 17:43:24 db2139 mysqld[1333]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7f71a4d5b140]
Apr 25 17:43:24 db2139 mysqld[1333]: /lib/x86_64-linux-gnu/libc.so.6(+0x162c96)[0x7f71a49cbc96]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xc42b57)[0x55db120e9b57]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xc45ffb)[0x55db120ecffb]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xd27215)[0x55db121ce215]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xd2a3e9)[0x55db121d13e9]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xd07e57)[0x55db121aee57]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xc6a350)[0x55db12111350]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xc700c7)[0x55db121170c7]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xc7080d)[0x55db1211780d]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xc81b2d)[0x55db12128b2d]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xbc587f)[0x55db1206c87f]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(_ZN7handler12ha_write_rowEPKh+0x31d)[0x55db11d8c4ad]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(_Z12write_recordP3THDP5TABLEP12st_copy_info+0x19d)[0x55db11b4091d]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(_Z12mysql_insertP3THDP10TABLE_LISTR4ListI4ItemERS3_IS5_ES6_S6_15enum_duplicatesb+0xb23)[0x55db11b4b3b3]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(_Z21mysql_execute_commandP3THD+0x1880)[0x55db11b789d0]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x223)[0x55db11b7f4f3]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(_ZN15Query_log_event14do_apply_eventEP14rpl_group_infoPKcj+0x6e9)[0x55db11e948f9]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0x61f604)[0x55db11ac6604]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(handle_slave_sql+0x1662)[0x55db11ad0512]
Apr 25 17:43:24 db2139 mysqld[1333]: /opt/wmf-mariadb104/bin/mysqld(+0xb640c2)[0x55db1200b0c2]
Apr 25 17:43:25 db2139 mysqld[1333]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7)[0x7f71a4d4fea7]
Apr 25 17:43:25 db2139 mysqld[1333]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f71a4966def]
Apr 25 17:43:25 db2139 mysqld[1333]: Trying to get some variables.
Apr 25 17:43:25 db2139 mysqld[1333]: Some pointers may be invalid and cause the dump to abort.
Apr 25 17:43:25 db2139 mysqld[1333]: Query (0x7f4104049bca): INSERT /* MediaWiki\Deferred\LinksUpdate\LinksTable::doWrites  */ IGNORE INTO `externallinks` (el_to,el_index,el_index_60,el_to_domain_index,el_to_path,el_from) VALUES ('https://creativecommons.org/publicdomain/zero/1.0/deed.en','https://org.creativecommons./publicdomain/zero/1.0/deed.en','https://org.creativecommons./publicdomain/zero/1.0/deed.en','https://org.creativecommons.','/publicdomain/zero/1.0/deed.en',131147954),('https://apicollections.parismusees.paris.fr/iiif/280002070/manifest','https://fr.paris.parismusees.apicollections./iiif/280002070/manifest','https://fr.paris.parismusees.apicollections./iiif/280002070/','https://fr.paris.parismusees.apicollections.','/iiif/280002070/manifest',131147954),('https://www.parismuseescollections.paris.fr/fr/musee-de-la-vie-romantique/oeuvres/madame-viardot-profil-gauche-0#infos-principales','https://fr.paris.parismuseescollections.www./fr/musee-de-la-vie-romantique/oeuvres/madame-viardot-profil-gauche-0#infos-principales','https://fr.paris.parismuseescollections.www./fr/musee-de-la-','https://fr.paris.parismuseescollections.www.','/fr/musee-de-la-vie-romantique/oeuvres/madame-viardot-profil-gauche-0#infos-principales',131147954)
Apr 25 17:43:25 db2139 mysqld[1333]: Connection ID (thread ID): 40975044
Apr 25 17:43:25 db2139 mysqld[1333]: Status: NOT_KILLED
Apr 25 17:43:25 db2139 mysqld[1333]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=on,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
Apr 25 17:43:25 db2139 mysqld[1333]: The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
Apr 25 17:43:25 db2139 mysqld[1333]: information that should help you find out what is causing the crash.
Apr 25 17:43:25 db2139 mysqld[1333]: Writing a core file...
Apr 25 17:43:25 db2139 mysqld[1333]: Working directory at /srv/sqldata.s4
Apr 25 17:43:25 db2139 mysqld[1333]: Resource Limits:
Apr 25 17:43:25 db2139 mysqld[1333]: Limit                     Soft Limit           Hard Limit           Units
Apr 25 17:43:25 db2139 mysqld[1333]: Max cpu time              unlimited            unlimited            seconds
Apr 25 17:43:25 db2139 mysqld[1333]: Max file size             unlimited            unlimited            bytes
Apr 25 17:43:25 db2139 mysqld[1333]: Max data size             unlimited            unlimited            bytes
Apr 25 17:43:25 db2139 mysqld[1333]: Max stack size            8388608              unlimited            bytes
Apr 25 17:43:25 db2139 mysqld[1333]: Max core file size        0                    0                    bytes
Apr 25 17:43:25 db2139 mysqld[1333]: Max resident set          unlimited            unlimited            bytes
Apr 25 17:43:25 db2139 mysqld[1333]: Max processes             2058269              2058269              processes
Apr 25 17:43:25 db2139 mysqld[1333]: Max open files            200001               200001               files
Apr 25 17:43:25 db2139 mysqld[1333]: Max locked memory         65536                65536                bytes
Apr 25 17:43:25 db2139 mysqld[1333]: Max address space         unlimited            unlimited            bytes
Apr 25 17:43:25 db2139 mysqld[1333]: Max file locks            unlimited            unlimited            locks
Apr 25 17:43:25 db2139 mysqld[1333]: Max pending signals       2058269              2058269              signals
Apr 25 17:43:25 db2139 mysqld[1333]: Max msgqueue size         819200               819200               bytes
Apr 25 17:43:25 db2139 mysqld[1333]: Max nice priority         0                    0
Apr 25 17:43:25 db2139 mysqld[1333]: Max realtime priority     0                    0
Apr 25 17:43:25 db2139 mysqld[1333]: Max realtime timeout      unlimited            unlimited            us
Apr 25 17:43:25 db2139 mysqld[1333]: Core pattern: /var/tmp/core/core.%h.%e.%p.%t
Apr 25 17:43:25 db2139 systemd[1]: mariadb@s4.service: Main process exited, code=killed, status=7/BUS
Apr 25 17:43:25 db2139 systemd[1]: mariadb@s4.service: Failed with result 'signal'.
Apr 25 17:43:25 db2139 systemd[1]: mariadb@s4.service: Consumed 1month 2w 1d 23h 11min 42.554s CPU time.
Apr 25 17:43:31 db2139 systemd[1]: mariadb@s4.service: Scheduled restart job, restart counter is at 1.
Apr 25 17:43:31 db2139 systemd[1]: Stopped mariadb database server.
Apr 25 17:43:31 db2139 systemd[1]: mariadb@s4.service: Consumed 1month 2w 1d 23h 11min 42.554s CPU time.
Apr 25 17:43:31 db2139 systemd[1]: Starting mariadb database server...
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] /opt/wmf-mariadb104/bin/mysqld (mysqld 10.4.25-MariaDB) starting as process 4023805 ...
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Warning] Could not increase number of max_open_files to more than 200001 (request: 800313)
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Warning] You need to use --log-bin to make --binlog-format work.
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] mysqld: Aria engine: starting recovery
Apr 25 17:43:31 db2139 mysqld[4023805]: recovered pages: 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 4 3 2 1 0
Apr 25 17:43:31 db2139 mysqld[4023805]:  (0.0 seconds);
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] mysqld: Aria engine: recovery done
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] InnoDB: Using Linux native AIO
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] InnoDB: Uses event mutexes
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] InnoDB: Number of pools: 1
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] InnoDB: Using SSE2 crc32 instructions
Apr 25 17:43:31 db2139 mysqld[4023805]: 2023-04-25 17:43:31 0 [Note] InnoDB: Initializing buffer pool, total size = 192G, instances = 8, chunk size = 128M
Apr 25 17:43:36 db2139 mysqld[4023805]: 2023-04-25 17:43:36 0 [Note] InnoDB: Completed initialization of buffer pool
Apr 25 17:43:36 db2139 mysqld[4023805]: 2023-04-25 17:43:36 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
Apr 25 17:43:36 db2139 mysqld[4023805]: 2023-04-25 17:43:36 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=88140742101854
Apr 25 17:43:43 db2139 mysqld[4023805]: 2023-04-25 17:43:43 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 2 row operations to undo
Apr 25 17:43:43 db2139 mysqld[4023805]: 2023-04-25 17:43:43 0 [Note] InnoDB: Trx id counter is 61957172214
Apr 25 17:43:43 db2139 mysqld[4023805]: 2023-04-25 17:43:43 0 [Note] InnoDB: Starting final batch to recover 332195 pages from redo log.
Apr 25 17:43:51 db2139 mysqld[4023805]: 2023-04-25 17:43:51 0 [Note] InnoDB: To recover: 92528 pages from log
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: Last binlog file './db2073-bin.000017', position 150251405
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: 128 out of 128 rollback segments are active.
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: Creating shared tablespace for temporary tables
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: Waiting for purge to start
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: Rolled back recovered transaction 61957172213
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: Rollback of non-prepared transactions completed
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: 10.4.25 started; log sequence number 88140742104209; transaction id 61957172216
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] InnoDB: Loading buffer pool(s) from /srv/sqldata.s4/ib_buffer_pool
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] Plugin 'FEEDBACK' is disabled.
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [ERROR] mysqld: Can't open shared library '/opt/wmf-mariadb104/lib/plugin/semisync_slave.so' (errno: 0, cannot open shared object file: No such file or directory)
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [ERROR] mysqld: Plugin 'unix_socket' already installed
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] Server socket created on IP: '::'.
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] Server socket created on IP: '::'.
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 3 [Note] Event Scheduler: scheduler thread started with id 3
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] Reading of all Master_info entries succeeded
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] Added new Master_info '' to hash table
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MariaDB server acts as a replica and has its hostname changed. Please use '--log-basename=#' or '--relay-log=db2139-relay-bin' to avoid this problem.
Apr 25 17:43:54 db2139 mysqld[4023805]: 2023-04-25 17:43:54 0 [Note] /opt/wmf-mariadb104/bin/mysqld: ready for connections.
Apr 25 17:43:54 db2139 mysqld[4023805]: Version: '10.4.25-MariaDB'  socket: '/run/mysqld/mysqld.s4.sock'  port: 3314  MariaDB Server
Apr 25 17:43:54 db2139 systemd[1]: Started mariadb database server.
Apr 25 17:50:13 db2139 mysqld[4023805]: 2023-04-25 17:50:13 0 [Note] InnoDB: Buffer pool(s) load completed at 230425 17:50:13
racadm getsel
racadm>>racadm getsel
Record:      1
Date/Time:   04/25/2020 22:25:10
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   06/21/2021 14:50:57
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   06/21/2021 14:50:57
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   06/21/2021 14:51:02
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   06/21/2021 14:51:02
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   06/21/2021 14:51:24
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   06/21/2021 14:51:29
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   06/21/2021 14:53:19
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   06/21/2021 14:53:24
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   10/18/2021 14:27:44
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   10/18/2021 14:27:45
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   10/18/2021 15:07:44
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   10/18/2021 15:07:45
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   06/28/2022 14:33:57
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   06/28/2022 14:33:58
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   06/28/2022 15:01:23
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   06/28/2022 15:01:24
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   06/28/2022 15:07:09
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   06/28/2022 15:07:10
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   06/28/2022 15:33:15
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   06/28/2022 15:33:19
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   04/25/2023 17:44:00
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   04/25/2023 17:44:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   04/25/2023 17:44:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   04/25/2023 17:44:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   04/25/2023 17:44:00
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A7.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/25/2023 18:50:51
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   04/25/2023 18:50:51
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   04/25/2023 18:50:51
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   04/25/2023 18:50:51
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   04/25/2023 21:42:43
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      32
Date/Time:   04/25/2023 21:42:43
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   04/25/2023 21:42:43
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   04/25/2023 21:42:43
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   04/25/2023 22:43:23
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   04/25/2023 22:43:23
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   04/25/2023 22:43:23
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      38
Date/Time:   04/25/2023 22:43:23
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      39
Date/Time:   04/26/2023 02:34:03
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      40
Date/Time:   04/26/2023 02:34:03
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      41
Date/Time:   04/26/2023 02:34:03
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      42
Date/Time:   04/26/2023 02:34:03
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      43
Date/Time:   04/26/2023 05:30:00
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      44
Date/Time:   04/26/2023 05:30:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      45
Date/Time:   04/26/2023 05:30:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      46
Date/Time:   04/26/2023 05:30:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      47
Date/Time:   04/26/2023 05:43:27
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      48
Date/Time:   04/26/2023 05:43:28
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      49
Date/Time:   04/26/2023 05:43:28
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      50
Date/Time:   04/26/2023 05:43:28
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      51
Date/Time:   04/26/2023 07:14:13
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      52
Date/Time:   04/26/2023 07:14:13
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      53
Date/Time:   04/26/2023 07:14:13
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      54
Date/Time:   04/26/2023 07:14:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
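The SEL above is long, and the relevant entries are the Critical ones that name a DIMM. As a minimal sketch (not part of the original report), the output could be summarized along these lines. It assumes the record layout matches what is pasted above (`Record:`, `Date/Time:`, `Source:`, `Severity:`, `Description:` fields, with records separated by dashed lines). The function names `parse_sel` and `critical_dimm_counts` are hypothetical helpers, and `SAMPLE` is a three-record excerpt copied from the log.

```python
import re
from collections import Counter

# Three records copied from the racadm getsel output above.
SAMPLE = """\
Record:      21
Date/Time:   06/28/2022 15:33:19
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   04/25/2023 17:44:00
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   04/25/2023 17:44:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
"""

FIELDS = ("Record", "Date/Time", "Source", "Severity", "Description")


def parse_sel(text):
    """Split SEL text into a list of dicts, one per record."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("---"):
            # A dashed line closes the record collected so far.
            if current:
                records.append(current)
            current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            current[key] = value.strip()
    if current:
        records.append(current)
    return records


def critical_dimm_counts(records):
    """Count Critical records per DIMM slot named in the description."""
    counts = Counter()
    for rec in records:
        if rec.get("Severity") != "Critical":
            continue
        for dimm in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            counts[dimm] += 1
    return counts


if __name__ == "__main__":
    for dimm, n in critical_dimm_counts(parse_sel(SAMPLE)).items():
        print(dimm, n)
```

Run against the full output, this kind of summary would show whether every Critical memory event points at the same slot, which is the pattern visible in the records from 04/25/2023 onward.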
dmesg
[Tue Apr 25 17:37:31 2023] Disabling lock debugging due to kernel taint
[Tue Apr 25 17:37:31 2023] mce: Uncorrected hardware memory error in user-access at 2b4f4bca40
[Tue Apr 25 17:37:31 2023] mce: [Hardware Error]: Machine check events logged
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue Apr 25 17:37:31 2023] Memory failure: 0x2b4f4bc: Sending SIGBUS to mysqld:3690622 due to hardware memory corruption
[Tue Apr 25 17:37:31 2023] Memory failure: 0x2b4f4bc: recovery action for dirty LRU page: Recovered
[Tue Apr 25 17:37:31 2023] mce: [Hardware Error]: CPU 14: Machine Check Exception: 7 Bank 1: bd80000000100134
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]: event severity: corrected
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]:  Error 0, type: corrected
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]:  fru_text: A7
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]:   section_type: memory error
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600 
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
[Tue Apr 25 17:37:31 2023] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Tue Apr 25 17:37:31 2023] mce: [Hardware Error]: RIP 33:<00007f71a49cbc96> 
[Tue Apr 25 17:37:31 2023] mce: [Hardware Error]: TSC 1600a0776d4499a ADDR 2b4f4bca40 MISC 86 PPIN aa7d4c7a4d3b7408 
[Tue Apr 25 17:37:31 2023] mce: [Hardware Error]: PROCESSOR 0:50657 TIME 1682444603 SOCKET 0 APIC 8 microcode 5003302
[Tue Apr 25 17:37:31 2023] mce: [Hardware Error]: Machine check events logged
[Tue Apr 25 17:37:31 2023] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[Tue Apr 25 17:37:31 2023] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[Tue Apr 25 17:37:31 2023] EDAC skx MC0: TSC 0x0 
[Tue Apr 25 17:37:31 2023] EDAC skx MC0: ADDR 0x2b4f4bca40 
[Tue Apr 25 17:37:31 2023] EDAC skx MC0: MISC 0x0 
[Tue Apr 25 17:37:31 2023] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682444604 SOCKET 0 APIC 0x0
[Tue Apr 25 17:37:31 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[Tue Apr 25 17:37:32 2023] MCE: Killing mysqld:3690622 due to hardware memory corruption fault at 7f675f4bca38
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]: event severity: corrected
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]:  Error 0, type: corrected
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]:  fru_text: A7
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]:   section_type: memory error
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]:   error_status: 0x0000000000000400
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600 
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]:   error_type: 3, multi-bit ECC
[Tue Apr 25 18:44:22 2023] {2}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Tue Apr 25 18:44:22 2023] mce: [Hardware Error]: Machine check events logged
[Tue Apr 25 18:44:22 2023] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[Tue Apr 25 18:44:22 2023] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[Tue Apr 25 18:44:22 2023] EDAC skx MC0: TSC 0x0 
[Tue Apr 25 18:44:22 2023] EDAC skx MC0: ADDR 0x2b4f4bca40 
[Tue Apr 25 18:44:22 2023] EDAC skx MC0: MISC 0x0 
[Tue Apr 25 18:44:22 2023] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682448614 SOCKET 0 APIC 0x0
[Tue Apr 25 18:44:22 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]: event severity: corrected
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]:  Error 0, type: corrected
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]:  fru_text: A7
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]:   section_type: memory error
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]:   error_status: 0x0000000000000400
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600 
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]:   error_type: 3, multi-bit ECC
[Tue Apr 25 21:36:14 2023] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Tue Apr 25 21:36:14 2023] mce: [Hardware Error]: Machine check events logged
[Tue Apr 25 21:36:14 2023] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[Tue Apr 25 21:36:14 2023] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[Tue Apr 25 21:36:14 2023] EDAC skx MC0: TSC 0x0 
[Tue Apr 25 21:36:14 2023] EDAC skx MC0: ADDR 0x2b4f4bca40 
[Tue Apr 25 21:36:14 2023] EDAC skx MC0: MISC 0x0 
[Tue Apr 25 21:36:14 2023] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682458927 SOCKET 0 APIC 0x0
[Tue Apr 25 21:36:14 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]: event severity: corrected
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]:  Error 0, type: corrected
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]:  fru_text: A7
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]:   section_type: memory error
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]:   error_status: 0x0000000000000400
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600 
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]:   error_type: 3, multi-bit ECC
[Tue Apr 25 22:36:53 2023] {4}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Tue Apr 25 22:36:53 2023] mce: [Hardware Error]: Machine check events logged
[Tue Apr 25 22:36:53 2023] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[Tue Apr 25 22:36:53 2023] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[Tue Apr 25 22:36:53 2023] EDAC skx MC0: TSC 0x0 
[Tue Apr 25 22:36:53 2023] EDAC skx MC0: ADDR 0x2b4f4bca40 
[Tue Apr 25 22:36:53 2023] EDAC skx MC0: MISC 0x0 
[Tue Apr 25 22:36:53 2023] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682462567 SOCKET 0 APIC 0x0
[Tue Apr 25 22:36:53 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]: event severity: corrected
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]:  Error 0, type: corrected
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]:  fru_text: A7
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]:   section_type: memory error
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]:   error_status: 0x0000000000000400
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600 
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]:   error_type: 3, multi-bit ECC
[Wed Apr 26 02:27:33 2023] {5}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Wed Apr 26 02:27:33 2023] mce: [Hardware Error]: Machine check events logged
[Wed Apr 26 02:27:33 2023] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[Wed Apr 26 02:27:33 2023] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[Wed Apr 26 02:27:33 2023] EDAC skx MC0: TSC 0x0 
[Wed Apr 26 02:27:33 2023] EDAC skx MC0: ADDR 0x2b4f4bca40 
[Wed Apr 26 02:27:33 2023] EDAC skx MC0: MISC 0x0 
[Wed Apr 26 02:27:33 2023] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682476407 SOCKET 0 APIC 0x0
[Wed Apr 26 02:27:33 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]: event severity: corrected
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]:  Error 0, type: corrected
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]:  fru_text: A7
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]:   section_type: memory error
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]:   error_status: 0x0000000000000400
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600 
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]:   error_type: 3, multi-bit ECC
[Wed Apr 26 05:23:30 2023] {6}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Wed Apr 26 05:23:30 2023] mce: [Hardware Error]: Machine check events logged
[Wed Apr 26 05:23:30 2023] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[Wed Apr 26 05:23:30 2023] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[Wed Apr 26 05:23:30 2023] EDAC skx MC0: TSC 0x0 
[Wed Apr 26 05:23:30 2023] EDAC skx MC0: ADDR 0x2b4f4bca40 
[Wed Apr 26 05:23:30 2023] EDAC skx MC0: MISC 0x0 
[Wed Apr 26 05:23:30 2023] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682486964 SOCKET 0 APIC 0x0
[Wed Apr 26 05:23:30 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]: It has been corrected by h/w and requires no further action
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]: event severity: corrected
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]:  Error 0, type: corrected
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]:  fru_text: A7
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]:   section_type: memory error
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]:   error_status: 0x0000000000000400
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600 
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]:   error_type: 3, multi-bit ECC
[Wed Apr 26 05:36:57 2023] {7}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Wed Apr 26 05:36:57 2023] mce: [Hardware Error]: Machine check events logged
[Wed Apr 26 05:36:57 2023] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[Wed Apr 26 05:36:57 2023] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[Wed Apr 26 05:36:57 2023] EDAC skx MC0: TSC 0x0 
[Wed Apr 26 05:36:57 2023] EDAC skx MC0: ADDR 0x2b4f4bca40 
[Wed Apr 26 05:36:57 2023] EDAC skx MC0: MISC 0x0 
[Wed Apr 26 05:36:57 2023] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682487771 SOCKET 0 APIC 0x0
[Wed Apr 26 05:36:57 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[Wed Apr 26 06:19:10 2023] Process accounting resumed
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]: It has been corrected by h/w and requires no further action
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]: event severity: corrected
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]:  Error 0, type: corrected
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]:  fru_text: A7
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]:   section_type: memory error
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]:   error_status: 0x0000000000000400
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600 
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]:   error_type: 3, multi-bit ECC
[Wed Apr 26 07:07:43 2023] {8}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000 
[Wed Apr 26 07:07:43 2023] mce: [Hardware Error]: Machine check events logged
[Wed Apr 26 07:07:43 2023] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[Wed Apr 26 07:07:43 2023] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[Wed Apr 26 07:07:43 2023] EDAC skx MC0: TSC 0x0 
[Wed Apr 26 07:07:43 2023] EDAC skx MC0: ADDR 0x2b4f4bca40 
[Wed Apr 26 07:07:43 2023] EDAC skx MC0: MISC 0x0 
[Wed Apr 26 07:07:43 2023] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682493217 SOCKET 0 APIC 0x0
[Wed Apr 26 07:07:43 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
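
A quick cross-check of the dmesg output above, as a minimal Python sketch (the helper names `mce_time_utc` and `edac_physical_address` are made up for illustration, and the EDAC sample line is abbreviated). The `TIME` field of the first mce record is a Unix epoch: 1682444603 converts to 17:43:23 UTC, one second before the mysqld signal 7 at 17:43:24 in the systemd log, while the bracketed timestamp on the same line reads 17:37:31. The bracketed values are presumably derived from uptime and have drifted. The sketch also rebuilds the physical address from the EDAC `page`/`offset` pair, assuming 4 KiB pages, to confirm it matches `ADDR 0x2b4f4bca40`.

```python
import re
from datetime import datetime, timezone

# Sample lines copied from the dmesg output above (EDAC line abbreviated).
MCE_LINE = ("mce: [Hardware Error]: PROCESSOR 0:50657 TIME 1682444603 "
            "SOCKET 0 APIC 8 microcode 5003302")
EDAC_LINE = ("EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 "
             "(channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0)")


def mce_time_utc(line):
    """Return the TIME field (Unix epoch) of an mce/EDAC line as a UTC datetime."""
    match = re.search(r"\bTIME (\d+)\b", line)
    if match is None:
        raise ValueError("no TIME field in line")
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


def edac_physical_address(line, page_shift=12):
    """Rebuild the physical address from EDAC page/offset.

    page_shift=12 is an assumption (4 KiB pages).
    """
    match = re.search(r"page:(0x[0-9a-f]+) offset:(0x[0-9a-f]+)", line)
    if match is None:
        raise ValueError("no page/offset pair in line")
    page = int(match.group(1), 16)
    offset = int(match.group(2), 16)
    return (page << page_shift) | offset


if __name__ == "__main__":
    print(mce_time_utc(MCE_LINE).isoformat())
    print(hex(edac_physical_address(EDAC_LINE)))
```

All eight logged events carry the same address, so the same check applies to each of them.
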
racadm getsel
-------------------------------------------------------------------------------
Record:      59
Date/Time:   05/10/2023 14:58:16
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      60
Date/Time:   05/10/2023 14:58:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      61
Date/Time:   05/10/2023 14:58:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      62
Date/Time:   05/10/2023 14:58:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      63
Date/Time:   05/10/2023 15:00:42
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      64
Date/Time:   05/10/2023 15:00:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      65
Date/Time:   05/10/2023 15:00:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      66
Date/Time:   05/10/2023 15:00:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      67
Date/Time:   05/10/2023 15:08:23
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      68
Date/Time:   05/10/2023 15:08:24
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      69
Date/Time:   05/10/2023 15:08:24
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      70
Date/Time:   05/10/2023 15:08:24
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      71
Date/Time:   05/10/2023 15:10:57
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      72
Date/Time:   05/10/2023 15:10:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      73
Date/Time:   05/10/2023 15:10:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      74
Date/Time:   05/10/2023 15:10:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      75
Date/Time:   05/10/2023 15:22:20
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      76
Date/Time:   05/10/2023 15:22:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      77
Date/Time:   05/10/2023 15:22:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      78
Date/Time:   05/10/2023 15:22:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      79
Date/Time:   05/10/2023 15:45:59
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      80
Date/Time:   05/10/2023 15:45:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      81
Date/Time:   05/10/2023 15:45:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      82
Date/Time:   05/10/2023 15:45:59
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      83
Date/Time:   05/10/2023 16:26:21
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      84
Date/Time:   05/10/2023 16:26:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      85
Date/Time:   05/10/2023 16:26:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      86
Date/Time:   05/10/2023 16:26:21
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      87
Date/Time:   05/10/2023 16:50:03
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      88
Date/Time:   05/10/2023 16:50:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      89
Date/Time:   05/10/2023 16:50:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      90
Date/Time:   05/10/2023 16:50:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      91
Date/Time:   05/10/2023 16:54:36
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      92
Date/Time:   05/10/2023 16:54:37
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      93
Date/Time:   05/10/2023 16:54:37
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      94
Date/Time:   05/10/2023 16:54:37
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      95
Date/Time:   05/10/2023 16:55:23
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      96
Date/Time:   05/10/2023 16:55:24
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      97
Date/Time:   05/10/2023 16:55:24
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      98
Date/Time:   05/10/2023 16:55:24
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      99
Date/Time:   05/10/2023 16:59:20
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      100
Date/Time:   05/10/2023 16:59:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      101
Date/Time:   05/10/2023 16:59:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      102
Date/Time:   05/10/2023 16:59:20
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      103
Date/Time:   05/10/2023 17:27:55
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      104
Date/Time:   05/10/2023 17:27:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      105
Date/Time:   05/10/2023 17:27:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      106
Date/Time:   05/10/2023 17:27:55
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      107
Date/Time:   05/10/2023 17:37:18
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      108
Date/Time:   05/10/2023 17:37:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      109
Date/Time:   05/10/2023 17:37:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      110
Date/Time:   05/10/2023 17:37:19
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      111
Date/Time:   05/10/2023 17:57:43
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      112
Date/Time:   05/10/2023 17:57:43
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      113
Date/Time:   05/10/2023 17:57:43
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      114
Date/Time:   05/10/2023 17:57:43
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      115
Date/Time:   05/10/2023 17:57:43
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B6.
-------------------------------------------------------------------------------
Record:      116
Date/Time:   05/10/2023 18:00:31
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      117
Date/Time:   05/10/2023 18:00:31
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B6.
-------------------------------------------------------------------------------
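
The SEL dump above is long and repetitive, so here is a small Python sketch for tallying it. It assumes the `racadm getsel` output has been saved as plain text in exactly the layout shown (dashed separator lines, then `Key: value` lines). `parse_sel` and `critical_dimms` are hypothetical helper names, and the sample is just the first two records.

```python
import re
from collections import Counter

# First two records from the racadm getsel output above.
SAMPLE = """\
-------------------------------------------------------------------------------
Record:      59
Date/Time:   05/10/2023 14:58:16
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      60
Date/Time:   05/10/2023 14:58:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split SEL text into a list of dicts, one per record."""
    records = []
    current = {}
    for line in text.splitlines():
        if line.startswith("---"):
            # A separator closes the record collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so the time of day stays intact.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_dimms(records):
    """Count DIMM locations named in records with Severity 'Critical'."""
    counts = Counter()
    for record in records:
        if record.get("Severity") != "Critical":
            continue
        counts.update(re.findall(r"DIMM_[A-Z]\d+", record.get("Description", "")))
    return counts


if __name__ == "__main__":
    print(critical_dimms(parse_sel(SAMPLE)))
```

Run against the full dump, this gives a per-DIMM count of the Critical entries without having to scroll past the OEM diagnostic records.
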

Event Timeline

jcrespo triaged this task as High priority. Apr 26 2023, 8:44 AM
[16212571.545882] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])

So there's a broken DIMM (A7):

[16167970.072688] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[16178282.316673] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[16178282.316675] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[16178282.316676] {3}[Hardware Error]: event severity: corrected
[16178282.316678] {3}[Hardware Error]:  Error 0, type: corrected
[16178282.316678] {3}[Hardware Error]:  fru_text: A7
[16178282.316679] {3}[Hardware Error]:   section_type: memory error
[16178282.316680] {3}[Hardware Error]:   error_status: 0x0000000000000400
[16178282.316681] {3}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[16178282.316682] {3}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600
[16178282.316683] {3}[Hardware Error]:   error_type: 3, multi-bit ECC
[16178282.316684] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[16178282.316711] mce: [Hardware Error]: Machine check events logged
[16178282.316718] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[16178282.316719] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[16178282.316720] EDAC skx MC0: TSC 0x0
[16178282.316720] EDAC skx MC0: ADDR 0x2b4f4bca40
[16178282.316721] EDAC skx MC0: MISC 0x0
[16178282.316722] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682458927 SOCKET 0 APIC 0x0
[16178282.316733] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[16181921.927923] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[16181921.927924] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[16181921.927925] {4}[Hardware Error]: event severity: corrected
[16181921.927926] {4}[Hardware Error]:  Error 0, type: corrected
[16181921.927927] {4}[Hardware Error]:  fru_text: A7
[16181921.927928] {4}[Hardware Error]:   section_type: memory error
[16181921.927929] {4}[Hardware Error]:   error_status: 0x0000000000000400
[16181921.927929] {4}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[16181921.927931] {4}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600
[16181921.927932] {4}[Hardware Error]:   error_type: 3, multi-bit ECC
[16181921.927933] {4}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[16181921.927962] mce: [Hardware Error]: Machine check events logged
[16181921.927975] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[16181921.927977] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[16181921.927978] EDAC skx MC0: TSC 0x0
[16181921.927979] EDAC skx MC0: ADDR 0x2b4f4bca40
[16181921.927979] EDAC skx MC0: MISC 0x0
[16181921.927981] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682462567 SOCKET 0 APIC 0x0
[16181921.927992] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[16195761.615965] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[16195761.615966] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[16195761.615967] {5}[Hardware Error]: event severity: corrected
[16195761.615968] {5}[Hardware Error]:  Error 0, type: corrected
[16195761.615968] {5}[Hardware Error]:  fru_text: A7
[16195761.615969] {5}[Hardware Error]:   section_type: memory error
[16195761.615970] {5}[Hardware Error]:   error_status: 0x0000000000000400
[16195761.615971] {5}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[16195761.615972] {5}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600
[16195761.615973] {5}[Hardware Error]:   error_type: 3, multi-bit ECC
[16195761.615975] {5}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[16195761.616000] mce: [Hardware Error]: Machine check events logged
[16195761.616010] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[16195761.616014] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[16195761.616015] EDAC skx MC0: TSC 0x0
[16195761.616016] EDAC skx MC0: ADDR 0x2b4f4bca40
[16195761.616017] EDAC skx MC0: MISC 0x0
[16195761.616018] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682476407 SOCKET 0 APIC 0x0
[16195761.616029] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[16206318.886113] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[16206318.886115] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[16206318.886115] {6}[Hardware Error]: event severity: corrected
[16206318.886116] {6}[Hardware Error]:  Error 0, type: corrected
[16206318.886117] {6}[Hardware Error]:  fru_text: A7
[16206318.886118] {6}[Hardware Error]:   section_type: memory error
[16206318.886119] {6}[Hardware Error]:   error_status: 0x0000000000000400
[16206318.886120] {6}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[16206318.886121] {6}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600
[16206318.886122] {6}[Hardware Error]:   error_type: 3, multi-bit ECC
[16206318.886123] {6}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[16206318.886144] mce: [Hardware Error]: Machine check events logged
[16206318.886152] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[16206318.886153] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[16206318.886154] EDAC skx MC0: TSC 0x0
[16206318.886154] EDAC skx MC0: ADDR 0x2b4f4bca40
[16206318.886155] EDAC skx MC0: MISC 0x0
[16206318.886156] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682486964 SOCKET 0 APIC 0x0
[16206318.886168] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[16207125.883471] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[16207125.883472] {7}[Hardware Error]: It has been corrected by h/w and requires no further action
[16207125.883473] {7}[Hardware Error]: event severity: corrected
[16207125.883475] {7}[Hardware Error]:  Error 0, type: corrected
[16207125.883475] {7}[Hardware Error]:  fru_text: A7
[16207125.883477] {7}[Hardware Error]:   section_type: memory error
[16207125.883478] {7}[Hardware Error]:   error_status: 0x0000000000000400
[16207125.883478] {7}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[16207125.883480] {7}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600
[16207125.883481] {7}[Hardware Error]:   error_type: 3, multi-bit ECC
[16207125.883482] {7}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[16207125.883507] mce: [Hardware Error]: Machine check events logged
[16207125.883515] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[16207125.883521] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[16207125.883524] EDAC skx MC0: TSC 0x0
[16207125.883525] EDAC skx MC0: ADDR 0x2b4f4bca40
[16207125.883526] EDAC skx MC0: MISC 0x0
[16207125.883527] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682487771 SOCKET 0 APIC 0x0
[16207125.883538] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
[16209658.953694] Process accounting resumed
[16212571.545788] {8}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[16212571.545790] {8}[Hardware Error]: It has been corrected by h/w and requires no further action
[16212571.545791] {8}[Hardware Error]: event severity: corrected
[16212571.545793] {8}[Hardware Error]:  Error 0, type: corrected
[16212571.545794] {8}[Hardware Error]:  fru_text: A7
[16212571.545795] {8}[Hardware Error]:   section_type: memory error
[16212571.545796] {8}[Hardware Error]:   error_status: 0x0000000000000400
[16212571.545797] {8}[Hardware Error]:   physical_address: 0x0000002b4f4bca40
[16212571.545800] {8}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 3 device: 0 row: 58443 column: 600
[16212571.545801] {8}[Hardware Error]:   error_type: 3, multi-bit ECC
[16212571.545803] {8}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[16212571.545832] mce: [Hardware Error]: Machine check events logged
[16212571.545839] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[16212571.545841] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0x940000000000009f
[16212571.545842] EDAC skx MC0: TSC 0x0
[16212571.545849] EDAC skx MC0: ADDR 0x2b4f4bca40
[16212571.545858] EDAC skx MC0: MISC 0x0
[16212571.545865] EDAC skx MC0: PROCESSOR 0:0x50657 TIME 1682493217 SOCKET 0 APIC 0x0
[16212571.545882] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 -  err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658 retry_rd_err_log[0000a80f 00000000 10002000 0496c100 0000e44b] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
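For anyone cross-checking the two log formats above, here is a minimal sketch (not part of the original investigation) that decodes one `EDAC MC0` summary line and one `TIME` value. It assumes 4 KiB pages, so the physical address is rebuilt as `page << 12 | offset`. The names `decode_edac`, `EDAC_LINE` and `PAGE_SHIFT` are hypothetical. With the values from this log, the rebuilt address and the hexadecimal row should agree with the `physical_address` and `row` fields of the APEI record.

```python
import re
from datetime import datetime, timezone

# One EDAC summary line copied from the log above (cut after the col field).
EDAC_LINE = (
    "EDAC MC0: 0 CE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 "
    "(channel:0 slot:1 page:0x2b4f4bc offset:0xa40 grain:32 syndrome:0x0 - "
    " err_code:0x0000:0x009f socket:0 imc:0 rank:0 bg:0 ba:3 row:0xe44b col:0x658"
)

PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def decode_edac(line: str) -> dict:
    """Pull key:value fields out of an EDAC line and rebuild the address."""
    fields = dict(re.findall(r"(\w+):(0x[0-9a-fA-F]+|\d+)", line))
    page = int(fields["page"], 16)
    offset = int(fields["offset"], 16)
    return {
        "channel": int(fields["channel"]),
        "slot": int(fields["slot"]),
        "row": int(fields["row"], 16),
        "physical_address": (page << PAGE_SHIFT) | offset,
    }


decoded = decode_edac(EDAC_LINE)
print(hex(decoded["physical_address"]), decoded["row"])

# The TIME field of the first "PROCESSOR ... TIME" line is a Unix timestamp.
first_event = datetime.fromtimestamp(1682462567, tz=timezone.utc)
print(first_event.isoformat())
```

The regular expression only understands the simple `key:value` pairs in this line; it is not a general EDAC parser.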

HW logs:

-------------------------------------------------------------------------------
Record:      51
Date/Time:   04/26/2023 07:14:13
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------

Mentioned in SAL (#wikimedia-operations) [2023-04-26T09:05:22Z] <jynus@cumin1001> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2139.codfw.wmnet with reason: T335396

Mentioned in SAL (#wikimedia-operations) [2023-04-26T09:05:35Z] <jynus@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2139.codfw.wmnet with reason: T335396

On boot:
           iDRAC, Update FW, Install OS)
F11      = Boot Manager
F12      = PXE Boot
iDRAC IPV4:  10.193.2.182
                                
Initializing Firmware Interfaces...
 





Enumerating Boot options...
Enumerating Boot options... Done

UEFI0106: One or more memory correctable training errors have occurred on
memory slot: A7.
Remove input power to the system, reseat the DIMM module and restart the
system. If the correctable errors persist, replace the faulty memory module
identified in the message.

UEFI0079: One or more uncorrectable Memory errors occurred in the previous
boot.
Check the System Event Log (SEL) to identify the non-functional DIMM, and then
replace the DIMM.
 

Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for Lifecycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager
jcrespo added a project: ops-codfw.
jcrespo added subscribers: KOfori, wiki_willy, Papaul.

@wiki_willy @Papaul Can we get a replacement DIMM? This is urgent because we believe the warranty expires today or very soon. CC @KOfori

Create Dispatch: Success
You have successfully submitted request SR166997440.

Thank you, Papaul, for the quick reaction. I will leave the host up and running for now.

I pasted the same error we are seeing into the Dell ticket, and here is what Dell is telling me:

Denial Notes

Troubleshooting/System Failure information provided is insufficient for Memory.

@Papaul what else do they need? We have already pasted the iDRAC log.

@Marostegui sorry, just getting back to you on this. In the meantime, we can power the server down and swap DIMM A7 with DIMM B7 to see whether the error then shows up on DIMM B7.
@Jhancock.wm ^

Let me do it @Papaul. I should be the first point of contact for this ticket.

@jcrespo I swapped DIMM A7 with DIMM B6. (The server's DIMM layout is asymmetrical for some reason. There is no B7, so I used B6 instead.) It's been under power for a few minutes and I haven't seen any new errors in the logs.

Can I put it back to work and test its memory under regular usage?

Yes. Let me know if any errors pop up. Thanks!

Expected post messages:

Message PR1: Replaced part detected for device: DDR4 DIMM(DIMM A7).
Message PR1: Replaced part detected for device: DDR4 DIMM(DIMM B6).

@jcrespo that is normal; it is just telling you that changes were made to those DIMMs.

The error recurred today, but a Dell TSR report was submitted and a new part is most likely on the way.

jcrespo updated the task description.
-------------------------------------------------------------------------------
Record:      59
Date/Time:   05/10/2023 14:58:16
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
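The swap test discussed above can be spelled out as a small sketch: if the critical error is reported at the slot the suspect module was moved to, the module itself is the likely culprit; if it stays at the original slot, the slot or channel is suspect instead. The helper names `failing_slots` and `swap_verdict` are hypothetical, and the slot-versus-module reading is a general troubleshooting heuristic rather than anything stated by Dell. The input is the two Critical records quoted in this task, reduced to the fields used here.

```python
import re

# The two Critical SEL records quoted in this task (separator lines omitted).
SEL_TEXT = """\
Record:      51
Date/Time:   04/26/2023 07:14:13
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
Record:      59
Date/Time:   05/10/2023 14:58:16
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
"""


def failing_slots(sel_text: str) -> list:
    """Return the DIMM locations named in the records, in log order."""
    return re.findall(r"at the location (DIMM_\w+)\.", sel_text)


def swap_verdict(before: str, after: str, moved_from: str, moved_to: str) -> str:
    """Interpret a swap test: did the error follow the module or stay put?"""
    if before == moved_from and after == moved_to:
        return "module"        # error moved with the DIMM
    if before == moved_from and after == moved_from:
        return "slot"          # error stayed at the original location
    return "inconclusive"


slots = failing_slots(SEL_TEXT)
verdict = swap_verdict(slots[0], slots[1], "DIMM_A7", "DIMM_B6")
print(slots, verdict)
```

Here the error moved from A7 to B6 together with the module, which is consistent with replacing the DIMM rather than investigating the slot.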

Mentioned in SAL (#wikimedia-operations) [2023-05-11T05:47:59Z] <jynus@cumin1001> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2139.codfw.wmnet with reason: T335396

Mentioned in SAL (#wikimedia-operations) [2023-05-11T05:48:12Z] <jynus@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2139.codfw.wmnet with reason: T335396

@jcrespo this part has been received.
Is it currently safe to replace this DIMM? If not, I can take care of it tomorrow after 13:00 UTC (or in the next two hours if you are available).

@Jhancock.wm Jaime is out today, but the server is OFF (or unreachable), so please go ahead and replace the DIMM today if you can. Please leave the server UP once you're done. Thank you!

@Marostegui DIMM_B6 has been replaced. The server is UP, and I can reach it via iDRAC and ping the IP. Do you want to leave this ticket open for observation?
Return tracking number for the bad DIMM: 398150374954

Thanks @Jhancock.wm. I can reach the host. I prefer to leave the ticket open: Jaime owns this server, so I want him to decide when he feels comfortable closing it. Thanks a lot for your help. CC @jcrespo

jcrespo closed this task as Resolved. (Edited Mon, May 15, 8:03 AM)

I've put db2139 back into service; no errors or issues observed so far. I reloaded all data from the most recent backup. Thank you!