mc1024 broke - replace it or remove it from configs
Closed, Resolved; Public

Description

Today mc1024 went down unexpectedly. Icinga alerted about HOST down.

  • Connecting to mgmt console was not possible
  • Power cycling failed
  • DCops was unable to physically reboot

Notes:

This task is about the config part: mc1024 is shard06 in redis::shards, and it also appears in mcrouter_wancache.yaml.


18:58 <+icinga-wm> PROBLEM - Host mc1024 is DOWN: PING CRITICAL - Packet loss = 100%
19:03 < mutante> !log mc1024 - attempting to power on via mgmt, went down and power down
19:06 < elukey> cmjohnson1: sorry to ping you, mc1024 in B6 went down a couple of mins ago, if you have a min can you check if the host is dead/fried?
grep -r 10.64.16.107
hieradata/common/profile/mediawiki/mcrouter_wancache.yaml:        host: 10.64.16.107
hieradata/common/redis.yaml:        host: 10.64.16.107
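
The same lookup can be sketched in Python for cases where the matches need to be processed further. This is only an illustration: `find_references` is a hypothetical helper, not part of any Wikimedia tooling, and it does a plain substring match on each line (so, like the grep above, it is not a strict IP matcher).

```python
import os


def find_references(root, needle):
    """Return sorted (relative_path, line_number, stripped_line) tuples
    for every line under `root` that contains `needle` as a substring."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going on non-UTF-8 files
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            rel = os.path.relpath(path, root)
                            hits.append((rel, lineno, line.strip()))
            except OSError:
                # Unreadable file (permissions, broken symlink): skip it
                continue
    return sorted(hits)


if __name__ == "__main__":
    # Assumes the current directory is a checkout of the puppet repo
    for rel, lineno, line in find_references(".", "10.64.16.107"):
        print(f"{rel}:{lineno}: {line}")
```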

Event Timeline

mc2024 is the replication partner in codfw:

<+icinga-wm> ACKNOWLEDGEMENT - Check health of redis instance on 6379 on mc2024 is CRITICAL: CRITICAL: replication_delay is 3066 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases  (db0) with 339767 keys, up 24 days 1 hours - replication_delay is 3066
Dzahn mentioned this in Unknown Object (Task). Jan 14 2021, 8:43 PM
Dzahn added a subtask: Unknown Object (Task).

As noted in the description, traffic is being served by the gutter pool, so we can wait to hear from DCops about when this server can be replaced. In the meantime we can leave things as they are and review again next week. Thank you @Dzahn for taking care of it!

@Krinkle @aaron the gutter pool caps the TTL at 600s for any key with a TTL over 600s. Do you think it is fine to keep the gutter pool substituting for the missing server?

Seems OK for now.
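
For reference, the TTL behaviour of the gutter pool asked about above can be sketched as follows. This is a minimal illustration of the rule as stated in this thread (any TTL over 600s is reduced to 600s), not mcrouter's actual implementation. The names `GUTTER_MAX_TTL` and `gutter_ttl` are hypothetical, and the sketch only covers positive TTLs; how keys without an expiry are treated is not discussed here.

```python
GUTTER_MAX_TTL = 600  # seconds, the cap mentioned in this thread


def gutter_ttl(requested_ttl: int, max_ttl: int = GUTTER_MAX_TTL) -> int:
    """Return the TTL a key would effectively get while served by the
    gutter pool: unchanged if at or below the cap, otherwise the cap."""
    return min(requested_ttl, max_ttl)


if __name__ == "__main__":
    for ttl in (60, 600, 3600, 86400):
        print(f"requested={ttl}s effective={gutter_ttl(ttl)}s")
```

The practical effect is that long-lived keys on the failed shard are regenerated at least every 10 minutes while the gutter pool stands in for it.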

Change 661740 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: Remove mc1024 from config

https://gerrit.wikimedia.org/r/661740

Change 661740 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: Remove mc1024 from config

https://gerrit.wikimedia.org/r/661740

jijiki claimed this task.

The server is resting in peace and new servers have been bought, so we can close this task.

Did the decom script run for mc1024? I can still see it in debmonitor.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Mar 17 2021, 11:59 PM