
high replication lag on wdqs1002
Closed, ResolvedPublic

Description

Since about noon UTC today (May 29th), wdqs1002 has been lagging behind on replication. Updates are still happening, just not fast enough to catch up. Restarting the updater and Blazegraph does not solve the issue. The logs don't seem to show anything suspicious, and load on that machine is reasonable.

Event Timeline

Gehel created this task. May 29 2017, 7:09 PM
Restricted Application added projects: Wikidata, Discovery. May 29 2017, 7:09 PM
Restricted Application added a subscriber: Aklapper.
Alphos added a subscriber: Alphos. May 29 2017, 7:39 PM

Looking at dmesg, there are a lot of warnings about CPU temperature and throttling:

[9098037.343804] CPU23: Package temperature above threshold, cpu clock throttled (total events = 52647618)

This has been going on at least since May 22, but seems to happen more often lately. This might or might not be related to the issue.
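To judge whether throttling is getting more frequent, the event counter in these kernel messages can be pulled out and tracked over time. The following is only a minimal sketch, assuming the messages follow the format of the line quoted above (with "Core" as a possible alternative to "Package"); the function names `parse_throttle_line` and `summarize` are hypothetical helpers, not part of any existing tooling on the WDQS hosts.

```python
import re

# Matches kernel thermal-throttle warnings in the format quoted above.
# Assumption: the scope is either "Package" or "Core".
THROTTLE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+CPU(?P<cpu>\d+): "
    r"(?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)"
)


def parse_throttle_line(line):
    """Return a dict describing one throttle warning, or None if no match."""
    match = THROTTLE_RE.search(line)
    if match is None:
        return None
    return {
        "uptime_s": float(match.group("uptime")),  # seconds since boot
        "cpu": int(match.group("cpu")),
        "scope": match.group("scope"),
        "events": int(match.group("events")),
    }


def summarize(lines):
    """Return the highest event counter seen per (cpu, scope) pair."""
    latest = {}
    for line in lines:
        record = parse_throttle_line(line)
        if record is None:
            continue  # ignore unrelated kernel messages
        key = (record["cpu"], record["scope"])
        latest[key] = max(latest.get(key, 0), record["events"])
    return latest
```

Feeding the output of `dmesg` to `summarize` line by line at two different times and comparing the counters would give a rough rate of throttle events per CPU.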

Mentioned in SAL (#wikimedia-operations) [2017-05-29T19:56:21Z] <gehel> removing wdqs1002 from LVS pending investigation of T166524

@Cmjohnson is this high temperature an indication that you should do some magic with thermal paste?

If it happens on a single server, it is not a load issue. Combined with the temperature warnings, it looks like a hardware problem. I'll make a pass through the logs tomorrow to see if it might still be a software bug, but so far it looks like the machine needs some care. Maybe a fan dropped dead or something?

Mentioned in SAL (#wikimedia-operations) [2017-05-30T07:38:33Z] <gehel> wdqs1002 back in LVS - T166524

Gehel added a subscriber: RobH. May 30 2017, 7:42 AM

@RobH: from racktables, it looks like wdqs1002 is 4.5 years old (purchase date = 2012-12-05, same as wdqs1001 - other servers are newer). I'm not sure about the warranty status, or when we should think about renewing those servers. Any idea?

thiemowmde moved this task from incoming to monitoring on the Wikidata board.
thiemowmde added subscribers: Lydia_Pintscher, Jonas, hoo.

Taking wdqs1002 out of LVS seems to have given it sufficient breathing space to catch up on replication. I added it back and it seems stable so far. I still don't trust it entirely...

Gehel claimed this task. May 30 2017, 1:13 PM
debt triaged this task as Medium priority. May 30 2017, 5:36 PM

I think there was talk about replacing these older servers with new ones. Maybe we should start with wdqs1002... We don't want it to die on us at the worst possible time.

Gehel reassigned this task from Gehel to Cmjohnson. Jun 1 2017, 12:38 PM

wdqs1002 has not had any issues since then. The hardware request is tracked on a separate ticket. It probably still makes sense to have a look at the thermal paste, but I'll let @Cmjohnson decide on the way to go there (the ticket is now assigned to him).

Gehel mentioned this in Unknown Object (Task). Jun 1 2017, 12:41 PM

Mentioned in SAL (#wikimedia-operations) [2017-06-01T17:41:32Z] <gehel> shutting down wdqs1002 for maintenance - T166524

Mentioned in SAL (#wikimedia-operations) [2017-06-01T18:20:47Z] <gehel> wdqs1002 back in LVS - thermal paste added - T166524

Gehel claimed this task. Jun 1 2017, 6:21 PM

Thermal paste has been added by @Cmjohnson, so this can be closed.

debt closed this task as Resolved. Jun 1 2017, 10:39 PM