Change Details

Today mc1024 went down unexpectedly. Icinga alerted about HOST down. When connecting to the mgmt console there was nothing on it and the server was powered off. Attempting to power cycle it failed; it just stayed powered off. We then contacted dcops and Chris tried to physically reboot it, but also to no avail.

The server is out of warranty, it is an HP machine, and this has happened before. It is not going to come back. So we need to migrate off of it and replace it with something else, unless it can be removed completely.

There was no immediate user-facing issue because [[ https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All | traffic failed over to the gutter pool ]] (very good! :). That made this ticket High prio but not UBN.

There will be separate tasks for decom'ing the broken hardware (T272074) and for requesting a replacement (T272085).

This task is about the config part: the host is `shard06` in `redis::shards` and it also appears in `mcrouter_wancache.yaml`.

---

```lang=irc
18:58 <+icinga-wm> PROBLEM - Host mc1024 is DOWN: PING CRITICAL - Packet loss = 100%
19:03 < mutante> !log mc1024 - attempting to power on via mgmt, went down and power down
19:06 < elukey> cmjohnson1: sorry to ping you, mc1024 in B6 went down a couple of mins ago, if you have a min can you check if the host is dead/fried?
```

```
grep -r 10.64.16.107
hieradata/common/profile/mediawiki/mcrouter_wancache.yaml: host: 10.64.16.107
hieradata/common/redis.yaml: host: 10.64.16.107
```
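To re-check for leftover references once the config change is merged, the grep above could be scripted roughly as follows. This is only a sketch: the helper name `find_host_references` and the choice to scan only `.yaml` files are assumptions for illustration, not existing tooling in the puppet repo. Like the grep, it does a plain substring match on the IP.

```python
import os
from typing import List, Tuple


def find_host_references(root: str, needle: str) -> List[Tuple[str, int, str]]:
    """Return (path, line_number, stripped_line) for each line containing needle.

    Hypothetical helper: walks `root` and scans only *.yaml files.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".yaml"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((path, lineno, line.strip()))
    # Sort so the output order is stable regardless of filesystem order.
    return sorted(hits)


if __name__ == "__main__":
    # IP taken from the grep in this task; "hieradata" assumes the
    # script is run from the root of a puppet checkout.
    for path, lineno, text in find_host_references("hieradata", "10.64.16.107"):
        print(f"{path}:{lineno}: {text}")
```

An empty result after the change would suggest that no hieradata YAML file still points at the dead host.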