Labservices1001 crashing, probable overheating
Closed, ResolvedPublic

Description

Just now, at 8:44 PM CDT, Labservices1001 froze up and stopped responding.

I'm trained to ignore temperature warnings in the syslog, but here's the syslog from just before the freeze:

Jun  3 01:36:36 labservices1001 pdns[1750]: gmysql Connection successful. Connected to database 'pdns' on 'localhost'.
Jun  3 01:36:36 labservices1001 pdns[1750]: AXFR done for 'eqiad.wmflabs', zone committed with serial number 1527989790
Jun  3 01:36:36 labservices1001 pdns[1750]: AXFR started for 'eqiad.wmflabs'
Jun  3 01:36:36 labservices1001 pdns[1750]: Transaction started for 'eqiad.wmflabs'
Jun  3 01:36:37 labservices1001 pdns[1750]: AXFR done for 'eqiad.wmflabs', zone committed with serial number 1527989790
Jun  3 01:37:01 labservices1001 CRON[48470]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958866] CPU1: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958872] CPU0: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958877] CPU4: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958882] CPU2: Package temperature above threshold, cpu clock throttled (total events = 366)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958890] CPU5: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958896] CPU3: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958901] CPU6: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958906] CPU7: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.959793] CPU0: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959795] CPU4: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959811] CPU2: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959826] CPU3: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959830] CPU5: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959835] CPU6: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959839] CPU7: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358343.038185] CPU1: Package temperature/speed normal
Jun  3 01:38:01 labservices1001 CRON[48620]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:39:01 labservices1001 CRON[48730]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:40:01 labservices1001 CRON[48878]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:41:01 labservices1001 CRON[49009]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:42:01 labservices1001 CRON[49137]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)

(then silence until I rebooted it)
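To compare throttle activity between incidents, it can help to pull the per-CPU counters out of a paste like the one above. This is a minimal sketch that assumes the kernel message wording shown in this task ("Core" or "Package temperature above threshold ... total events = N"). The function name and the sample list are illustrative, not an existing tool.

```python
import re

# Matches the kernel throttle messages as they appear in this task's syslog paste.
THROTTLE_RE = re.compile(
    r"kernel: \[\s*[\d.]+\] CPU(\d+): (Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (\d+)\)"
)


def summarize_throttle_events(lines):
    """Return {(cpu, scope): highest 'total events' counter seen} (hypothetical helper)."""
    summary = {}
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match is None:
            continue  # not a throttle message (e.g. pdns, CRON, "speed normal")
        key = (int(match.group(1)), match.group(2))
        summary[key] = max(summary.get(key, 0), int(match.group(3)))
    return summary


# Sample lines copied from the paste above.
sample = [
    "Jun  3 01:37:54 labservices1001 kernel: [4358342.958866] CPU1: Package temperature above threshold, cpu clock throttled (total events = 367)",
    "Jun  3 01:37:54 labservices1001 kernel: [4358342.958882] CPU2: Package temperature above threshold, cpu clock throttled (total events = 366)",
    "Jun  3 01:37:54 labservices1001 kernel: [4358342.959793] CPU0: Package temperature/speed normal",
]

for (cpu, scope), total in sorted(summarize_throttle_events(sample).items()):
    print(f"CPU{cpu} {scope}: {total} throttle events so far")
```

The counters are cumulative since boot, so the useful signal is how quickly they grow between two samples rather than their absolute value.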

Andrew created this task. · Jun 3 2018, 2:26 AM
Restricted Application added a subscriber: Aklapper. · Jun 3 2018, 2:26 AM
Krenair added a subscriber: Krenair. · Jun 3 2018, 2:53 AM
Paladox added a subscriber: Paladox. · Jun 3 2018, 7:36 AM

This server is out of warranty. In the past I've reapplied thermal paste. Let me know if you would like to schedule a time to do that.

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · Jun 4 2018, 3:34 PM
Andrew added a comment. · Jun 4 2018, 3:37 PM

In the past I've reapplied thermal paste. Let me know if you would like to schedule a time to do that.

You've done that on this exact server? If so, is it likely that a second attempt would make a difference?

Oh, then the issue is probably not related to the thermal paste. Was something else going on that would stress the CPU at that time?

Andrew added a comment. · Edited · Jun 4 2018, 5:54 PM

Sorry, @Cmjohnson, to be clear I was asking if you've already replaced the paste on this server, not saying that I think you have.

chasemp added a subscriber: chasemp. · Jun 4 2018, 6:01 PM

A note on not-paging for our weekly meeting: https://phabricator.wikimedia.org/T152368#2849231

@Andrew sorry, I did misunderstand. I need to purchase more thermal paste. Can this wait a few days?

Andrew added a comment. · Jun 4 2018, 7:53 PM

Can this wait a few days?

Sure.

Cmjohnson moved this task from Up next to Being worked on on the ops-eqiad board. · Jun 5 2018, 6:17 PM
Joe triaged this task as Normal priority. · Jun 18 2018, 12:42 PM
Cmjohnson mentioned this in Unknown Object (Task). · Jun 27 2018, 12:55 PM

@Andrew We need thermal paste. I have created a procurement task https://phabricator.wikimedia.org/T198326. Once it arrives I will ping you regarding a good day/time to power off.

@Cmjohnson Sounds good, thanks for the update. That server seems to be holding steady for now.

Vvjjkkii renamed this task from Labservices1001 crashed to vrbaaaaaaa. · Jul 1 2018, 1:06 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Krenair renamed this task from vrbaaaaaaa to Labservices1001 crashed. · Jul 1 2018, 2:48 AM
Krenair lowered the priority of this task from High to Normal.
Krenair updated the task description.
Krenair added a subscriber: Aklapper.

This just happened again

Andrew added a comment. · Jul 1 2018, 2:55 AM
Jul  1 02:32:51 labservices1001 pdns[2129]: Domain 'deployment-prep.wmflabs.org' is fresh (not presigned, no RRSIG check)
Jul  1 02:32:51 labservices1001 pdns[2129]: Domain 'design.wmflabs.org' is fresh (not presigned, no RRSIG check)
Jul  1 02:32:51 labservices1001 pdns[2129]: Domain 'discovery-stats.wmflabs.org' is fresh (not presigned, no RRSIG check)
Jul  1 02:33:01 labservices1001 CRON[45175]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul  1 02:34:01 labservices1001 CRON[45298]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@ [... long run of NUL bytes trimmed ...] ^@^@^@^@Jul  1 02:47:07 labservices1001 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="935" x-info="http://www.rsyslog.com"] start
Jul  1 02:47:07 labservices1001 rsyslogd: rsyslogd's groupid changed to 104
Jul  1 02:47:07 labservices1001 rsyslogd: rsyslogd's userid changed to 101
Jul  1 02:47:07 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpuset
Jul  1 02:47:07 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpu
Jul  1 02:47:07 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpuacct
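The pattern in this paste (ordinary entries, a run of NUL bytes, then an rsyslogd "start" line) gives a rough measure of how long the host was down. This is a minimal sketch that assumes the classic syslog timestamp format used here, which carries no year, so the year is passed in. The helper names are illustrative, and NUL bytes are handled both as raw `\x00` and as the `^@` notation a pager displays.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the leading 'Mon D HH:MM:SS' syslog timestamp; the year is supplied by the caller."""
    month, day, clock = line.split(None, 3)[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def find_restart_gaps(lines, year, min_gap=timedelta(minutes=1)):
    """Return (last_entry, rsyslogd_start) pairs separated by at least min_gap."""
    gaps = []
    last_seen = None
    for raw in lines:
        # Drop NUL padding left behind by the unclean shutdown.
        line = raw.replace("\x00", "").replace("^@", "").strip()
        try:
            stamp = parse_stamp(line, year)
        except ValueError:
            continue  # blank or non-syslog line
        is_start = " rsyslogd: " in line and line.endswith("start")
        if is_start and last_seen is not None and stamp - last_seen >= min_gap:
            gaps.append((last_seen, stamp))
        last_seen = stamp
    return gaps


# Sample lines copied from the paste above (NUL run shortened).
sample = [
    "Jul  1 02:34:01 labservices1001 CRON[45298]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)",
    '^@^@^@^@Jul  1 02:47:07 labservices1001 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="935" x-info="http://www.rsyslog.com"] start',
    "Jul  1 02:47:07 labservices1001 rsyslogd: rsyslogd's groupid changed to 104",
]

outages = [after - before for before, after in find_restart_gaps(sample, 2018)]
for outage in outages:
    print(f"logging gap before restart: {outage}")
```

The gap is only an upper bound on the freeze itself, since it also includes the time taken to notice the failure and reboot.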

Change 443448 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Openstack: page wmcs team for any host failure

https://gerrit.wikimedia.org/r/443448

Change 443448 merged by Andrew Bogott:
[operations/puppet@production] Openstack: page wmcs team for host failures

https://gerrit.wikimedia.org/r/443448

Andrew added a comment. · Aug 5 2018, 2:47 AM

This just happened again -- thanks to better paging I caught it sooner :) There's nothing of interest in the syslog, just a sudden stop:

Aug  5 02:32:06 labservices1001 pdns[22726]: While checking domain freshness: Query to '208.80.155.117:5354' for SOA of 'wikivoyage.wmflabs.org' produced no results (error code: Refused)
Aug  5 02:32:06 labservices1001 pdns[22726]: While checking domain freshness: Query to '208.80.155.117:5354' for SOA of 'hashtags.wmflabs.org' produced no results (error code: Refused)
Aug  5 02:32:09 labservices1001 pdns[22726]: Received serial number updates for 0 zones, had 24 timeouts
Aug  5 02:33:01 labservices1001 CRON[30537]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@ [... long run of NUL bytes trimmed ...] ^@^@^@^@Aug  5 02:43:28 labservices1001 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="928" x-info="http://www.rsyslog.com"] start
Aug  5 02:43:28 labservices1001 rsyslogd: rsyslogd's groupid changed to 104
Aug  5 02:43:28 labservices1001 rsyslogd: rsyslogd's userid changed to 101
Aug  5 02:43:28 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpuset
Aug  5 02:43:28 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpu
Aug  5 02:43:28 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpuacct
Aug  5 02:43:28 labservices1001 kernel: [    0.000000] Linux version 3.13.0-143-generic (buildd@lcy01-amd64-010) (gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.4) ) #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 (Ubuntu 3.13.0-143.192-generic 3.13.11-ckt39)
Andrew added a comment. · Aug 5 2018, 2:48 AM

Ah, there were some temp warnings a few minutes earlier:

Aug  5 02:29:02 labservices1001 kernel: [3025868.972351] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1981)
Aug  5 02:29:02 labservices1001 kernel: [3025868.982073] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2491)

etc.

Andrew renamed this task from Labservices1001 crashed to Labservices1001 crashing, probable overheating. · Aug 5 2018, 2:49 AM
Andrew raised the priority of this task from Normal to High. · Aug 5 2018, 3:02 AM
RobH assigned this task to Andrew. · Aug 6 2018, 4:42 PM
RobH added a subscriber: RobH.

So, this thread is mildly confusing. From what I can see, labservices1001 (warranty expired 2017-04) had its thermal paste replaced at some earlier point:

In the past I've reapplied thermal paste. Let me know if you would like to schedule a time to do that.

You've done that on this exact server? If so, is it likely that a second attempt would make a difference?

However, it seems worth doing again, as requested by @Andrew and the cloud team via the SRE weekly meeting.

IRC Sync: @Andrew will schedule downtime with @Cmjohnson. I'm assigning this to @Andrew until the time is scheduled.

Mentioned in SAL (#wikimedia-operations) [2018-08-06T16:53:08Z] <andrewbogott> power down labservices1001 for thermal paste fix, T196252

Added thermal paste.

Andrew closed this task as Resolved. · Aug 6 2018, 6:05 PM

Hopefully resolved; we'll see if it overheats again. Thanks @Cmjohnson