(OoW) wtp2019 shows error messages in the racadm getsel's output
Closed, Resolved, Public

Description

I just rebooted wtp2019 since it was completely frozen (no ssh, mgmt console showed only a "Starting.." and nothing more). The racadm getsel output shows:

/admin1-> racadm getsel
Record:      1
Date/Time:   10/26/2016 15:58:55
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/22/2019 00:48:31
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/22/2019 00:48:34
Source:      system
Severity:    Critical
Description: An over current fault detected on power supply 1.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   04/22/2019 00:48:34
Source:      system
Severity:    Critical
Description: An over current fault detected on power supply 2.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   04/22/2019 00:48:36
Source:      system
Severity:    Ok
Description: Power supply 2 is operating normally.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   04/22/2019 00:48:36
Source:      system
Severity:    Ok
Description: Power supply 1 is operating normally.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   04/23/2019 05:44:28
Source:      system
Severity:    Critical
Description: The system board PS1 PG Fail voltage is outside of range.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/23/2019 05:44:33
Source:      system
Severity:    Ok
Description: The system board PS1 PG Fail voltage is within range.
-------------------------------------------------------------------------------
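
For scanning a log like the one above without reading every record, a minimal Python sketch follows. It assumes the `racadm getsel` output keeps the "Key: value" layout with dashed separator lines shown in this task; the function names `parse_sel` and `critical_events` are hypothetical helpers, not part of racadm.

```python
SAMPLE = """\
/admin1-> racadm getsel
Record:      2
Date/Time:   04/22/2019 00:48:31
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
Record:      5
Date/Time:   04/22/2019 00:48:36
Source:      system
Severity:    Ok
Description: Power supply 2 is operating normally.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split getsel-style text into one dict per record (hypothetical helper)."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # A line made only of dashes closes the current record.
        if set(line) == {"-"}:
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_events(records):
    """Keep only the records whose Severity field is Critical."""
    return [r for r in records if r.get("Severity") == "Critical"]


if __name__ == "__main__":
    for event in critical_events(parse_sel(SAMPLE)):
        print(event["Date/Time"], event["Description"])
```

Lines without a colon, such as the prompt line, are skipped, so the full console paste can be fed in unchanged.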

Event Timeline

elukey edited projects, added ops-codfw; removed ops-eqiad.

History of this host:

  • wtp2019 - hardware (RAM) check (T146113)
  • wtp2019 has faulty memory (T146009)
  • wtp2019 issues an uncorrectable memory error (T148710)
  • wtp2019.codfw.wmnet is down (T149110)

and it shows up in more general tickets as well

It's a lemon.

Dzahn triaged this task as Medium priority. Apr 30 2019, 9:31 PM
wiki_willy renamed this task from wtp2019 shows error messages in the racadm getsel's output to (OoW) wtp2019 shows error messages in the racadm getsel's output. Jul 15 2019, 8:57 PM
wiki_willy assigned this task to Papaul.

Tue Apr 23 2019 05:44:33    The system board PS1 PG Fail voltage is within range.
Tue Apr 23 2019 05:44:28    The system board PS1 PG Fail voltage is outside of range.
Mon Apr 22 2019 00:48:36    Power supply 1 is operating normally.
Mon Apr 22 2019 00:48:36    Power supply 2 is operating normally.
Mon Apr 22 2019 00:48:34    An over current fault detected on power supply 2.
Mon Apr 22 2019 00:48:34    An over current fault detected on power supply 1.
Mon Apr 22 2019 00:48:31    CPU 2 has an internal error (IERR).
Wed Oct 26 2016 15:58:55    Log cleared.

Mentioned in SAL (#wikimedia-operations) [2019-08-06T19:42:53Z] <subbu> depooled wtp2019 ( to assist papaul with T221572 )

Depooled the server just now (logged in SAL).

ssastry@deploy1001:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service wtp2019 'depool service=parsoid'
Depooling parsoid on wtp2019.codfw.wmnet
codfw/parsoid/parsoid/wtp2019.codfw.wmnet: pooled changed yes => no
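
If the depool step is ever scripted, the state change can be confirmed from the output instead of by eye. The sketch below assumes the status line always has the "<object>: pooled changed <old> => <new>" shape seen here; `parse_pool_change` is a hypothetical helper, not part of the pool/depool tooling.

```python
import re

# Matches lines such as:
#   codfw/parsoid/parsoid/wtp2019.codfw.wmnet: pooled changed yes => no
POOL_CHANGE = re.compile(
    r"^(?P<object>\S+): pooled changed (?P<old>\w+) => (?P<new>\w+)$"
)


def parse_pool_change(line):
    """Return (object, old_state, new_state), or None if the line does not match."""
    match = POOL_CHANGE.match(line.strip())
    if match is None:
        return None
    return match.group("object"), match.group("old"), match.group("new")


if __name__ == "__main__":
    output = [
        "Depooling parsoid on wtp2019.codfw.wmnet",
        "codfw/parsoid/parsoid/wtp2019.codfw.wmnet: pooled changed yes => no",
    ]
    for line in output:
        change = parse_pool_change(line)
        if change is not None:
            print(change)
```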

Tailing /srv/log/parsoid/main.log shows traffic has stopped.

The Grafana graphs are still showing traffic, but maybe the sampling lags a bit. Waiting a bit to see if it dies down.

Oh, I was looking at the cluster graphs. The wtp2019 graph does indeed show zero traffic to the host now.

@ssastry The upgrade is complete and no more errors are showing in the iDRAC log. I am leaving the task open until next week and will resolve it then if no errors appear.

The server can be repooled.

Thanks.

Mentioned in SAL (#wikimedia-operations) [2019-08-06T20:17:58Z] <subbu> repooled wtp2019 ( after papaul finished upgrade as part of T221572 )

After the upgrade, verified the code version:

ssastry@wtp2019:~$ curl http://localhost:8000/_version
{"name":"parsoid","version":"0.10.0+git","sha":"7232dfff04a305db11f6c6de33cecfae0b1a7801"}
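
The same check can be done programmatically. This is a rough sketch assuming the `_version` endpoint returns a JSON object with `name`, `version` and `sha` keys, as in the response above; `parse_version` and `sha_matches` are hypothetical names introduced for illustration.

```python
import json


def parse_version(body):
    """Extract (name, version, sha) from a _version-style JSON body."""
    info = json.loads(body)
    return info["name"], info["version"], info["sha"]


def sha_matches(body, expected_sha):
    """True if the reported sha starts with expected_sha (allows a short sha)."""
    _, _, sha = parse_version(body)
    return bool(expected_sha) and sha.startswith(expected_sha)


if __name__ == "__main__":
    # Response body copied from the curl output in this task.
    body = (
        '{"name":"parsoid","version":"0.10.0+git",'
        '"sha":"7232dfff04a305db11f6c6de33cecfae0b1a7801"}'
    )
    print(parse_version(body))
    print(sha_matches(body, "7232dfff"))
```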

Then, repooled

ssastry@deploy1001:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service wtp2019 'pool service=parsoid'
Pooling parsoid on wtp2019.codfw.wmnet
codfw/parsoid/parsoid/wtp2019.codfw.wmnet: pooled changed no => yes

The tail of the logs and Grafana both show traffic is back on the server.

I checked the server this morning and no errors are showing in the log. Closing the task.