
Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March
Closed, Resolved · Public

Event Timeline


Corruption?

160323 13:11:49 [ERROR] ombudsmenwiki.blobs_cluster25: 1 client is using or hasn't closed the table properly

After start:

160425 18:12:15 mysqld_safe Starting mysqld daemon with databases from /srv/sqldata
160425 18:12:16 [Note] /opt/wmf-mariadb10/bin/mysqld (mysqld 10.0.23-MariaDB-log) starting as process 20239 ...
160425 18:12:16 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used;  This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=es2019' or '--log-bin=es2019-bin' to avoid this problem.
160425 18:12:16 [Note] InnoDB: Using mutexes to ref count buffer pool pages
160425 18:12:16 [Note] InnoDB: The InnoDB memory heap is disabled
160425 18:12:16 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
160425 18:12:16 [Note] InnoDB: Memory barrier is not used
160425 18:12:16 [Note] InnoDB: Compressed tables use zlib 1.2.3
160425 18:12:16 [Note] InnoDB: Using Linux native AIO
160425 18:12:16 [Note] InnoDB: Using CPU crc32 instructions
160425 18:12:16 [Note] InnoDB: Initializing buffer pool, size = 94.0G
160425 18:12:20 [Note] InnoDB: Completed initialization of buffer pool
160425 18:12:21 [Note] InnoDB: Highest supported file format is Barracuda.
160425 18:12:21 [Note] InnoDB: Log scan progressed past the checkpoint lsn 2824766898466
160425 18:12:21 [Note] InnoDB: Database was not shutdown normally!
160425 18:12:21 [Note] InnoDB: Starting crash recovery.
160425 18:12:21 [Note] InnoDB: Reading tablespace information from the .ibd files...
160425 18:12:32 [Note] InnoDB: Restoring possible half-written data pages 
160425 18:12:32 [Note] InnoDB: from the doublewrite buffer...
InnoDB: Doing recovery: scanned up to log sequence number 2824772141056
InnoDB: Doing recovery: scanned up to log sequence number 2824777383936
InnoDB: Doing recovery: scanned up to log sequence number 2824782626816
InnoDB: Doing recovery: scanned up to log sequence number 2824787869696
InnoDB: Doing recovery: scanned up to log sequence number 2824793112576
InnoDB: Doing recovery: scanned up to log sequence number 2824798355456
InnoDB: Doing recovery: scanned up to log sequence number 2824803598336
InnoDB: Doing recovery: scanned up to log sequence number 2824808841216
InnoDB: Doing recovery: scanned up to log sequence number 2824814084096
InnoDB: Doing recovery: scanned up to log sequence number 2824819326976
InnoDB: Doing recovery: scanned up to log sequence number 2824824569856
InnoDB: Doing recovery: scanned up to log sequence number 2824829812736
InnoDB: Doing recovery: scanned up to log sequence number 2824835055616
InnoDB: Doing recovery: scanned up to log sequence number 2824840298496
InnoDB: Doing recovery: scanned up to log sequence number 2824845541376
InnoDB: Doing recovery: scanned up to log sequence number 2824850784256
InnoDB: Doing recovery: scanned up to log sequence number 2824856027136
InnoDB: Doing recovery: scanned up to log sequence number 2824861270016
InnoDB: Doing recovery: scanned up to log sequence number 2824866512896
InnoDB: Doing recovery: scanned up to log sequence number 2824871755776
InnoDB: Doing recovery: scanned up to log sequence number 2824876998656
InnoDB: Doing recovery: scanned up to log sequence number 2824882241536
InnoDB: Doing recovery: scanned up to log sequence number 2824887484416
InnoDB: Doing recovery: scanned up to log sequence number 2824892727296
InnoDB: Doing recovery: scanned up to log sequence number 2824897970176
InnoDB: Doing recovery: scanned up to log sequence number 2824903213056
InnoDB: Doing recovery: scanned up to log sequence number 2824908455936
InnoDB: Doing recovery: scanned up to log sequence number 2824913698816
InnoDB: Doing recovery: scanned up to log sequence number 2824918941696
InnoDB: Doing recovery: scanned up to log sequence number 2824924184576
InnoDB: Doing recovery: scanned up to log sequence number 2824929427456
InnoDB: Doing recovery: scanned up to log sequence number 2824934670336
InnoDB: Doing recovery: scanned up to log sequence number 2824939913216
InnoDB: Doing recovery: scanned up to log sequence number 2824945156096
InnoDB: Doing recovery: scanned up to log sequence number 2824950398976
InnoDB: Doing recovery: scanned up to log sequence number 2824955641856
InnoDB: Doing recovery: scanned up to log sequence number 2824960884736
InnoDB: Doing recovery: scanned up to log sequence number 2824966127616
InnoDB: Doing recovery: scanned up to log sequence number 2824971370496
InnoDB: Doing recovery: scanned up to log sequence number 2824976613376
InnoDB: Doing recovery: scanned up to log sequence number 2824981856256
InnoDB: Doing recovery: scanned up to log sequence number 2824987099136
InnoDB: Doing recovery: scanned up to log sequence number 2824992342016
InnoDB: Doing recovery: scanned up to log sequence number 2824997584896
InnoDB: Doing recovery: scanned up to log sequence number 2825002827776
InnoDB: Doing recovery: scanned up to log sequence number 2825008070656
InnoDB: Doing recovery: scanned up to log sequence number 2825013313536
InnoDB: Doing recovery: scanned up to log sequence number 2825018556416
InnoDB: Doing recovery: scanned up to log sequence number 2825023799296
InnoDB: Doing recovery: scanned up to log sequence number 2825029042176
InnoDB: Doing recovery: scanned up to log sequence number 2825034285056
InnoDB: Doing recovery: scanned up to log sequence number 2825039527936
InnoDB: Doing recovery: scanned up to log sequence number 2825044770816
InnoDB: Doing recovery: scanned up to log sequence number 2825050013696
InnoDB: Doing recovery: scanned up to log sequence number 2825055256576
InnoDB: Doing recovery: scanned up to log sequence number 2825060499456
InnoDB: Doing recovery: scanned up to log sequence number 2825065742336
InnoDB: Doing recovery: scanned up to log sequence number 2825070985216
InnoDB: Doing recovery: scanned up to log sequence number 2825076228096
InnoDB: Doing recovery: scanned up to log sequence number 2825081470976
InnoDB: Doing recovery: scanned up to log sequence number 2825086713856
InnoDB: Doing recovery: scanned up to log sequence number 2825091956736
InnoDB: Doing recovery: scanned up to log sequence number 2825097199616
InnoDB: Doing recovery: scanned up to log sequence number 2825102442496
InnoDB: Doing recovery: scanned up to log sequence number 2825107685376
InnoDB: Doing recovery: scanned up to log sequence number 2825112928256
InnoDB: Doing recovery: scanned up to log sequence number 2825118171136
InnoDB: Doing recovery: scanned up to log sequence number 2825123414016
InnoDB: Doing recovery: scanned up to log sequence number 2825128656896
InnoDB: Doing recovery: scanned up to log sequence number 2825133899776
InnoDB: Doing recovery: scanned up to log sequence number 2825139142656
InnoDB: Doing recovery: scanned up to log sequence number 2825144385536
InnoDB: Doing recovery: scanned up to log sequence number 2825149628416
InnoDB: Doing recovery: scanned up to log sequence number 2825154871296
InnoDB: Doing recovery: scanned up to log sequence number 2825160114176
InnoDB: Doing recovery: scanned up to log sequence number 2825165357056
InnoDB: Doing recovery: scanned up to log sequence number 2825170599936
InnoDB: Doing recovery: scanned up to log sequence number 2825175842816
InnoDB: Doing recovery: scanned up to log sequence number 2825178407575
InnoDB: Transaction 5076870391 was in the XA prepared state.
InnoDB: Transaction 5076870391 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 5076870656
160425 18:12:35 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 223473047, file name es1009-bin.001486
InnoDB: Last MySQL binlog file position 0 748329631, file name ./es2019-bin.000106
160425 18:12:47 [Note] InnoDB: 128 rollback segment(s) are active.
InnoDB: Starting in background the rollback of uncommitted transactions
2016-04-25 18:12:47 7fc50afff700  InnoDB: Rollback of non-prepared transactions completed
160425 18:12:47 [Note] InnoDB: Waiting for purge to start
160425 18:12:47 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.26-76.0 started; log sequence number 2825178407575
2016-04-25 18:12:47 7fc4edff9700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
160425 18:12:47 [Note] Recovering after a crash using es2019-bin
160425 18:12:48 [Note] Starting crash recovery...
2016-04-25 18:12:48 7fde9fb06780  InnoDB: Starting recovery for XA transactions...
2016-04-25 18:12:48 7fde9fb06780  InnoDB: Transaction 5076870391 in prepared state after recovery
2016-04-25 18:12:48 7fde9fb06780  InnoDB: Transaction contains changes to 3 rows
2016-04-25 18:12:48 7fde9fb06780  InnoDB: 1 transactions in prepared state after recovery
160425 18:12:48 [Note] Found 1 prepared transaction(s) in InnoDB
160425 18:12:48 [Note] Crash recovery finished.
160425 18:12:48 [Note] Server socket created on IP: '::'.
160425 18:12:48 [Note] Server socket created on IP: '::'.
[filtered]
160425 18:12:49 [Note] Event Scheduler: scheduler thread started with id 2
160425 18:12:49 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--log-basename=#' or '--relay-log=es2019-relay-bin' to avoid this problem.
160425 18:12:49 [Note] /opt/wmf-mariadb10/bin/mysqld: ready for connections.
Version: '10.0.23-MariaDB-log'  socket: '/tmp/mysql.sock'  port: 3306  MariaDB Server
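
The startup log above ends with one transaction left in the XA prepared state, which the server resolved on its own during binlog-based recovery. As a hedged aside (nothing had to be done manually here), this is how prepared XA transactions can be listed and, if the server does not resolve them, handled by hand:

# Hedged sketch: inspecting prepared XA transactions after crash recovery.
mysql -e "XA RECOVER"      # lists transactions still in the prepared state
# XA COMMIT '<xid>' or XA ROLLBACK '<xid>' would resolve a specific one manually;
# with a binlog present, letting the server decide on startup is normally the safer option.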

Expected corruption on start, on data and on the heartbeat table:

160425 18:14:31 [Note] Slave I/O thread: connected to master 'repl@es2018.codfw.wmnet:3306',replication started in log 'es2018-bin.000108' at position 1036403185
160425 18:14:33 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table wikidatawiki.blobs_cluster25; Duplicate entry '164237230' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log es2018-bin.000108, end_log_pos 1036404510, Gtid 0-171978868-164558713, Internal MariaDB error code: 1062
160425 18:14:33 [Warning] Slave: Duplicate entry '164237230' for key 'PRIMARY' Error_code: 1062
160425 18:14:33 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'es2018-bin.000108' position 1036403185
160425 18:14:37 [ERROR] mysqld: Table './heartbeat/heartbeat' is marked as crashed and should be repaired
160425 18:14:37 [Warning] Checking table:   './heartbeat/heartbeat'

As expected, there are no shutdown log entries; it was a hard reset / power loss.
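
A minimal sketch of the checks that typically follow this kind of hard reset (not necessarily the exact commands run on es2019): repair the crashed heartbeat table and look at the duplicated row before deciding whether replication can continue or the replica needs re-cloning. The blobs column names follow the standard external storage schema and are assumptions here:

# Heartbeat likely uses a non-transactional engine (implied by "marked as crashed"), so it can simply be repaired:
mysql -e "CHECK TABLE heartbeat.heartbeat; REPAIR TABLE heartbeat.heartbeat"
# Inspect the row behind the duplicate-key replication error before touching the slave threads:
mysql -e "SELECT blob_id, LENGTH(blob_text) FROM wikidatawiki.blobs_cluster25 WHERE blob_id = 164237230"
# If the InnoDB data itself is suspect, re-cloning from a healthy replica (as done later in this
# task) is preferable to skipping events.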

Meanwhile, I will keep the ticket open for 2 months unless something happens.

jcrespo changed the task status from Open to Stalled. Apr 25 2016, 6:26 PM
jcrespo removed jcrespo as the assignee of this task.
jcrespo changed the task status from Stalled to Open. May 10 2016, 10:57 AM
jcrespo claimed this task.

Several days have passed and no new power loss. I will reimage it and set up its master with transactional replication.

jcrespo changed the task status from Open to Stalled. May 10 2016, 6:01 PM (edited)
jcrespo moved this task from In progress to Backlog on the DBA board.

es2019 has been reimaged from es2017. Let's now wait a month to see whether it fails again while I test GTID here.
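
For context on the GTID test, a minimal MariaDB 10.0 sketch (not necessarily the exact statements used on this host) of switching a replica to GTID-based replication:

# Hedged sketch: move the re-imaged replica to GTID replication and verify it.
mysql -e "STOP SLAVE"
mysql -e "CHANGE MASTER TO master_use_gtid = slave_pos"   # resume from the recorded GTID position
mysql -e "START SLAVE"
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Using_Gtid|Gtid_IO_Pos'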

Change 283946 merged by Volans:
DBtools: add script to check external storage

https://gerrit.wikimedia.org/r/283946
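
For illustration only (this is not the merged DBtools script): the kind of spot check such a tool performs, comparing a blobs table between an external storage master and a replica. The host and table names echo the logs above, but the full domain names and the approach itself are assumptions:

# Hedged sketch: compare table checksums between master and replica. On a table that is still
# being written to, two runs taken at different moments will legitimately differ.
MASTER=es1009.eqiad.wmnet
REPLICA=es2019.codfw.wmnet
QUERY="CHECKSUM TABLE wikidatawiki.blobs_cluster25"
diff <(mysql -h "$MASTER" -e "$QUERY") <(mysql -h "$REPLICA" -e "$QUERY") \
  && echo "checksums match" || echo "checksums differ"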

jcrespo raised the priority of this task from Medium to High. May 26 2016, 1:35 PM
jcrespo moved this task from Backlog to In progress on the DBA board.

es2017 just (crashed?) at 12:25 today; I do not think that is unrelated.

jcrespo changed the task status from Stalled to Open. May 26 2016, 1:35 PM

Nothing on syslog:

May 26 12:05:01 es2017 CRON[135137]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 26 12:15:01 es2017 CRON[135912]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 26 12:17:01 es2017 CRON[136071]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 26 12:41:37 es2017 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1446" x-info="http://www.rsyslog.com"] start

Or kernel.log:

May 25 06:25:02 es2017 kernel: [1255457.765907] Process accounting resumed
May 26 06:25:03 es2017 kernel: [1341854.374319] Process accounting resumed
May 26 12:41:37 es2017 kernel: [    0.000000] Initializing cgroup subsys cpuset
May 26 12:41:37 es2017 kernel: [    0.000000] Initializing cgroup subsys cpu
May 26 12:41:37 es2017 kernel: [    0.000000] Initializing cgroup subsys cpuacct
May 26 12:41:37 es2017 kernel: [    0.000000] Linux version 4.4.0-1-amd64 (debian-kernel@lists.debian.org) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Debian 4.4.2-3+wmf1 (2016-02-29)

No interactive user sessions, and the last sudo activity is a successful RAID check:

May 26 12:21:51 es2017 sshd[136433]: Set /proc/self/oom_score_adj to 0
May 26 12:21:51 es2017 sshd[136433]: Connection from 208.80.154.14 port 1888 on 10.192.0.142 port 22
May 26 12:21:51 es2017 sshd[136433]: Connection closed by 208.80.154.14 [preauth]
May 26 12:22:09 es2017 sudo:   nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/check-raid.py
May 26 12:22:09 es2017 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
May 26 12:22:09 es2017 sudo: pam_unix(sudo:session): session closed for user root

No swapping; the increased load after the reboot is from the reboot itself and replication catching up:
https://grafana-admin.wikimedia.org/dashboard/db/server-board?from=1464261097242&to=1464270254562&var-server=es2017&var-network=eth0

jcrespo renamed this task from es2019 crashed with no logs to es2017 and es2019 crashed with no logs. May 26 2016, 2:04 PM

I think I got it:

"Normal","Mon Feb 08 2016 16:06:18","Log cleared."
"Critical","Thu May 26 2016 12:22:06","CPU 2 has an internal error (IERR)."
"Normal","Thu May 26 2016 12:24:04","A problem was detected related to the previous server boot."
"Critical","Thu May 26 2016 12:24:04","Multi-bit memory errors detected on a memory device at location(s) DIMM_A2."

For es2017, the CPU seems to have failed:

CPU 2	Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz	E5	2600 MHz	IERR	10

Memory show currently as ok, but A2 slot had an error at the same time, we should not trust it:

root@es2017:~# ipmitool sel list
   1 | 02/08/2016 | 16:06:18 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 05/26/2016 | 12:22:06 | Processor #0x61 | IERR | Asserted
   3 | 05/26/2016 | 12:24:04 | Unknown #0x2e |  | Asserted
   4 | 05/26/2016 | 12:24:04 | Memory #0x02 | Uncorrectable ECC (UnCorrectable ECC |  DIMMA2) | Asserted

We can shut down this server relatively easily, and it should be fully under warranty; replacing these 2 components seems wise.

es2019 seems to have suffered the same CPU and memory errors:

MEM0001: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
 2016-04-22T14:48:59-0500
Log Sequence Number: 142
Detailed Description:
The memory has encountered a uncorrectable error. System performance may be degraded. The operating system and/or applications may fail as a result.
Recommended Action:
Re-install the memory component. If the problem persists, contact technical support. Refer to the product documentation to choose a convenient contact method.
CPU0000: CPU 1 has an internal error (IERR).
 2016-03-23T02:15:10-0500
Log Sequence Number: 114
Detailed Description:
System event log and OS logs may indicate that the exception is external to the processor.
Recommended Action:
Review System Event Log and Operating System Logs. If the issue persists, contact technical support. Refer to the product documentation to choose a convenient contact method.
Comment: root

(the dates are consistent with the crash dates)
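
A hedged sketch of how the same hardware event log can be pulled for es2019, either locally or over IPMI (the management hostname and credentials file are assumptions, not taken from this task):

# Locally on the host:
sudo ipmitool sel elist | tail -n 20
# Or remotely against the management controller:
ipmitool -I lanplus -H es2019.mgmt.codfw.wmnet -U root -f /root/.ipmi_password sel elist | tail -n 20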

We will be receiving the memory replacement tomorrow:
Service Request 930250087 <<#7521282-32655863#>>
Service Request 930256880 <<#7521282-32654588#>>

Mentioned in SAL [2016-05-27T17:10:52Z] <volans> Stop slave, stop mysql and shutdown es2017 and es2019 for hardware maintenance T130702
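
A minimal sketch of the shutdown sequence described in that SAL entry (the service management details are assumptions and depend on how mysqld is run on these hosts):

mysql -e "STOP SLAVE"                                   # stop replication cleanly first
mysql -e "SET GLOBAL innodb_buffer_pool_dump_now = 1"   # optional: dump the buffer pool for a faster warm-up
sudo service mysql stop                                 # or mysqladmin shutdown, depending on the setup
sudo poweroff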

es2017 and es2019 were restarted after @Papaul replaced the memory.

Hardware logs have been cleared; time will tell whether it's fixed.

Resolving for now; we can reopen it if it happens again.

es2017:

Correctable memory error rate exceeded for DIMM_A2, just after booting for the first time after the memory replacement.

"Disk 0 in Backplane 1 of Integrated RAID Controller 1 is inserted." was logged (for every disk) at 2016-05-30T13:52:21-0500, in what seems to be a RAID controller failure.

As confirmation that I/O was stuck, from dmesg, after a bunch of call traces, we got:

[246461.498936] megaraid_sas 0000:03:00.0: pending commands remain after waiting, will reset adapter scsi0.
[246461.509527] megaraid_sas 0000:03:00.0: resetting fusion adapter scsi0.
[246468.254786] megaraid_sas 0000:03:00.0: Waiting for FW to come to ready state
[246476.510617] megaraid_sas 0000:03:00.0: FW now in Ready state
[246477.398591] megaraid_sas 0000:03:00.0: Init cmd success
[246477.426595] megaraid_sas 0000:03:00.0: firmware type	: Extended VD(240 VD)firmware
[246477.426601] megaraid_sas 0000:03:00.0: controller type	: MR(1024MB)
[246477.426604] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR)	: Enabled
[246477.426607] megaraid_sas 0000:03:00.0: Secure JBOD support	: No
[246477.451018] megaraid_sas 0000:03:00.0: Jbod map is not supported megasas_setup_jbod_map 4612
[246477.451039] megaraid_sas 0000:03:00.0: Reset successful for scsi0.
[246477.456982] megaraid_sas 0000:03:00.0: 2270 (2s/0x0020/CRIT) - Controller encountered a fatal error and was reset

And after that the server became responsive again.

Looking at the manufacturer's website with the serial number of es2017, it looks like there are newer, recommended versions of the RAID controller's firmware.

We could give it a try. @Papaul, @jcrespo thoughts?
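
A hedged sketch of one way to read the currently running controller firmware before deciding on the upgrade (the tool path is an assumption and depends on what is installed on the host):

sudo /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -i 'FW Package Build'
# The vendor support page for the host's service tag then shows which package this corresponds to.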

@Volans it is not a problem to update the RAID controller firmware.

Do you need downtime for that? If yes, let's schedule a time.

Yes, I will need downtime for that, but let me download the firmware first and I will let you or Volans know.

thanks.

Probably just me (Volans is away). We may want to do it on all ES servers in the end, but there is no hurry for the other hosts.

Ping me on IRC at almost any time, 5 minutes before you are ready, and I will shut them down cleanly.

These 2 hosts have given more issues than all the rest combined!

Tracking numbers for the memory modules that were returned to Dell on 5/27/2016:

20160531_111959.jpg (photo attachment)

20160531_112004.jpg (photo attachment)

@jcrespo or @Volans: I have the firmware file; please let me know the best time next week to schedule downtime for those systems. Thanks.

Tuesday, whenever you start working and are available (my afternoon)?

Update notes on both systems (a verification sketch follows the list):

BIOS 1.5.4 to 2.0.2
IDRAC 2.21 to 2.30
Dell uEFI diagnostics
Dell Os Driver Pack 15.10 to 16.03
PERC H730 Controller 25.3 to 25.4
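
A hedged sketch for confirming the new versions from the operating system side after the updates (racadm is only available if Dell's tooling is installed on the host, which is an assumption):

sudo dmidecode -s bios-version     # should now report 2.0.2
sudo racadm getversion             # BIOS, iDRAC and other component versions, when racadm is present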

A bunch of errors are making netfilter and ntp fail on es2017. On the admin console:

MEM0701: Correctable memory error rate exceeded for DIMM_A2.
 2016-06-07T15:28:36-0500
Log Sequence Number: 227
Detailed Description:
The memory may not be operational. This an early indicator of a possible future uncorrectable error.
Recommended Action:
Re-install the memory component. If the problem continues, contact support.
Comment: root

I am going to restart es2017 to try to have at least a clean start.

The software errors do not happen on es2019, but the log says the same:

MEM0702: Correctable memory error rate exceeded for DIMM_A1.
 2016-06-07T17:27:31-0500
Log Sequence Number: 251
Detailed Description:
The memory may not be operational. This an early indicator of a possible future uncorrectable error.
Recommended Action:
Re-install the memory component. If the problem continues, contact support.
Comment: root

Consider pushing for a full machine replacement.

jcrespo reopened this task as Open.
jcrespo reassigned this task from jcrespo to Papaul.

Restarting es2017 fixed the software issues, but this is clearly not in a closed state. It is not the highest priority, but there is clearly a hardware defect here (board?).

@jcrespo I am sending the log file to the Dell support engineer, I will update you on the status.

@jcrespo here is the reply from the Dell support team, so I will need another downtime on those systems tomorrow or Friday.

Hello Papaul,

We have just found a fix for this issue this morning.

Is the Bios version 2.1.6?

http://www.dell.com/support/home/us/en/04/Drivers/DriversDetails?driverId=48H0V&fileId=3543814277&osCode=NAA&productCode=poweredge-r730xd&languageCode=en&categoryId=BI

This version may be too new to have been in the update package.

This has new firmware to handle dimm errors correctly.

Please let me know what version you have.

Thanks,

John

This is OK; just ping me early today or next week to get it done.

Papaul added a subscriber: ori.

BIOS update complete on both systems. Both systems are now running version 2.1.6.
@ori The system is back up; it is all yours. Thanks.

Mentioned in SAL [2016-06-15T18:54:46Z] <ori> Started MySQL on es2019 (T130702)

I can confirm the error did not happen on either system after the BIOS update and restart.

CORRECTION: it is not happening on es2019, but it happened again on es2017. I would double-check by rebooting that host (es2017) once more; the error was logged seconds before the new BIOS version was detected.

Logs:

SUP0516: Updating firmware for PowerEdge BIOS to version 2.1.6.
 2016-06-15T16:52:33-0500
Log Sequence Number: 273
Detailed Description:
Do not turn off the system while firmware update is in progress.
Recommended Action:
No response action required.
Comment: root

SUP0518: Successfully updated the PowerEdge BIOS firmware to version 2.1.6.
 2016-06-15T16:55:12-0500
Log Sequence Number: 275
Detailed Description:
The specified firmware for the component was successfully updated.
Recommended Action:
No response action is required.
Comment: root

SYS1001: System is turning off.
 2016-06-15T16:55:16-0500
Log Sequence Number: 276
Detailed Description:
System is turning off.
Recommended Action:
No response action is required.
Comment: root

SYS1000: System is turning on.
 2016-06-15T16:55:24-0500
Log Sequence Number: 278
Detailed Description:
System is turning on.
Recommended Action:
No response action is required.
Comment: root

MEM0701: Correctable memory error rate exceeded for DIMM_A2.
 2016-06-15T16:57:32-0500
Log Sequence Number: 281
Detailed Description:
The memory may not be operational. This an early indicator of a possible future uncorrectable error.
Recommended Action:
Re-install the memory component. If the problem continues, contact support.
Comment: root

MEM0702: Correctable memory error rate exceeded for DIMM_A2.
 2016-06-15T16:58:06-0500
Log Sequence Number: 282
Detailed Description:
The memory may not be operational. This an early indicator of a possible future uncorrectable error.
Recommended Action:
Re-install the memory component. If the problem continues, contact support.
Comment: root

PR36: Version change detected for BIOS firmware. Previous version:2.0.2, Current version:2.1.6
 2016-06-15T16:59:59-0500
Log Sequence Number: 283
Detailed Description:
The system has detected a different firmware version than previously recorded for the indicated device. This may be due to a firmware update, rollback, or part replacement.
Recommended Action:
No response action is required.
Comment: root
jcrespo lowered the priority of this task from High to Medium. Jun 27 2016, 11:21 AM

No, no crash. But I was expecting to hear back from support.

@jcrespo the log file shows that since 6/15/2016 we haven't had any memory error reported for es2017, and since 6/7/2016 none reported for es2019. To double-check that everything is clear on es2017 and es2019, can you please reboot those systems and confirm before I contact Dell? Thanks.

I consider this resolved; let's track the BIOS issue and follow up on T139714.

jcrespo renamed this task from es2017 and es2019 crashed with no logs to Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March. Oct 31 2016, 12:03 PM
jcrespo reopened this task as Open.
  • Number of crashes es2019: 23rd March & 22nd April & 30th Oct
  • Number of crashes es2017: 26th May & 30th May
  • Number of crashes es2015: 10th Oct
  • Number of crashes es2014: 17th Oct
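
A hedged sketch of one way to gather the hardware event logs from all the es20xx hosts at once, to correlate SEL timestamps with the crash dates listed above (mirrors the salt usage later in this task):

sudo salt 'es201*' cmd.run 'ipmitool sel elist | tail -n 10'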

I'll review all the past and linked ticket histories. We'll need to generate a list of each system, and the overall errors and messages from each, along with the steps we've taken on each system.

So systems are:

es2014
es2015
es2017
es2019

I'll try to gather this info later today along with the related tasks and steps taken.

  • Number of crashes es2019: 23rd March & 22nd April & 30th Oct
  • Number of crashes es2017: 26th May & 30th May
  • Number of crashes es2015: 10th Oct
  • Number of crashes es2014: 17th Oct

No crashes in the last 4 months, it seems?

No crashes in the last 4 months, it seems?

Indeed, the uptimes are looking very promising: a bit over 2 months now!

root@neodymium:~# sudo salt 'es201*' cmd.run 'uptime'
es2011.codfw.wmnet:
     15:39:45 up 76 days, 48 min,  0 users,  load average: 0.00, 0.00, 0.00
es2018.codfw.wmnet:
     15:39:45 up 75 days, 23:19,  0 users,  load average: 0.00, 0.00, 0.00
es2013.codfw.wmnet:
     15:39:45 up 76 days, 3 min,  0 users,  load average: 0.00, 0.00, 0.00
es2019.codfw.wmnet:
     15:39:45 up 83 days, 20:09,  0 users,  load average: 0.08, 0.04, 0.01
es2016.codfw.wmnet:
     15:39:45 up 75 days, 23:42,  0 users,  load average: 0.10, 0.07, 0.02
es2012.codfw.wmnet:
     15:39:45 up 76 days, 22 min,  0 users,  load average: 0.00, 0.02, 0.00
es2014.codfw.wmnet:
     15:39:45 up 76 days,  1:09,  0 users,  load average: 0.04, 0.03, 0.00
es2015.codfw.wmnet:
     15:39:45 up 76 days,  3:01,  0 users,  load average: 0.00, 0.00, 0.00
es2017.codfw.wmnet:
     15:39:45 up 76 days,  1:55,  0 users,  load average: 0.00, 0.01, 0.00

es2015 crashed on 2017-03-11; faulty CPU and board replaced.

Number of crashes es2019: 23rd March & 22nd April & 30th Oct & 22 Apr
Number of crashes es2017: 26th May and 30th May
Number of crashes es2015: 10th Oct, 11 Mar (CPU and board replaced already)
Number of crashes es2014: 17th Oct

^@Papaul @RobH

jcrespo raised the priority of this task from Medium to High. Apr 22 2017, 7:51 AM

Going to close this for now, as we have had no more crashes lately.