When going to https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/, the last change is from before the outage (T329535), but there are known changes that happened after.
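A quick way to confirm this without the web UI is to look at the newest entry in the repository's log. The sketch below is a hedged illustration, not the tooling used in this task. It assumes the response body comes from Gitiles' JSON log view (a `+log/...?format=JSON` request), which is prefixed with an anti-XSSI line `)]}'` before the JSON payload. The function name `latest_commit` is hypothetical, and fetching the response is left out.

```python
import json

# Gitiles prefixes JSON responses with this line to prevent XSSI.
XSSI_PREFIX = ")]}'"


def latest_commit(raw_response):
    """Return (commit_sha, committer_time) for the newest log entry, or None.

    `raw_response` is the text body of a Gitiles JSON log request.
    The committer time is returned as the string Gitiles sent, unparsed.
    """
    text = raw_response.lstrip()
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    entries = json.loads(text).get("log", [])
    if not entries:
        return None
    newest = entries[0]  # the log is ordered newest first
    return newest["commit"], newest["committer"]["time"]
```

Comparing the returned committer time with the outage window (T329535) shows whether the Gerrit copy has received any commits since.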
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | BUG REPORT | taavi | T329589 gerrit copy of cloud/instance-puppet stopped replicating |
| Resolved | BUG REPORT | dcaro | T329535 Cloud Ceph outage 2023-02-13 |
| Resolved | | dcaro | T329709 [cookbooks.ceph] Add a cookbook to drain a ceph osd in a safe manner |
| Resolved | | dcaro | T329711 [ceph] Add monitoring for inter-osd/mon/cloudvirt connectivity |
| Open | | None | T329778 [ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network |
| Resolved | Request | Papaul | T330754 hw troubleshooting: Link hard down (probably cable) for cloudcephosd2002-dev.codfw.wmnet |
| Resolved | | cmooney | T329799 Add network-layer protections to avoid inadvertently lowering IRB MTU |
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2023-02-13T23:59:48Z] <bd808> enc-1.cloudinfra.eqiad1.wikimedia.cloud: service uwsgi-puppet-enc restart (T329589)
Mentioned in SAL (#wikimedia-cloud) [2023-02-14T00:13:13Z] <bd808> enc-2.cloudinfra.eqiad1.wikimedia.cloud: service puppet-enc-git-worker restart (T329589)
On enc-2.cloudinfra.eqiad1.wikimedia.cloud:
$ journalctl -u puppet-enc-git-worker --no-pager
...
Feb 14 00:12:48 enc-2 puppet-enc-git-worker[8737]: 2023-02-14 00:12:48.437 8737 ERROR puppet-enc-git-worker pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'cloudinfra-db03.cloudinfra.eqiad1.wikimedia.cloud' ([Errno -3] Temporary failure in name resolution)")
Feb 14 00:12:48 enc-2 puppet-enc-git-worker[8737]: 2023-02-14 00:12:48.437 8737 ERROR puppet-enc-git-worker
Feb 14 00:12:48 enc-2 systemd[1]: puppet-enc-git-worker.service: Main process exited, code=exited, status=1/FAILURE
Feb 14 00:12:48 enc-2 systemd[1]: puppet-enc-git-worker.service: Failed with result 'exit-code'.
$ host cloudinfra-db03.cloudinfra.eqiad1.wikimedia.cloud
;; connection timed out; no servers could be reached
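The `[Errno -3] Temporary failure in name resolution` in the traceback is the resolver's "try again" error, which points at the host's DNS setup rather than at the database. A minimal sketch of checking for that condition from Python follows. The function name `check_resolution` and its return labels are hypothetical, and the injectable `resolver` argument exists only so the logic can be exercised without a network.

```python
import socket


def check_resolution(hostname, resolver=socket.getaddrinfo):
    """Classify a lookup as 'ok', 'temporary-failure', or 'failed'."""
    try:
        resolver(hostname, None)
    except socket.gaierror as exc:
        # EAI_AGAIN is what glibc reports as
        # "Temporary failure in name resolution".
        if exc.errno == socket.EAI_AGAIN:
            return "temporary-failure"
        return "failed"
    return "ok"
```

Calling `check_resolution("cloudinfra-db03.cloudinfra.eqiad1.wikimedia.cloud")` on the affected host would be expected to report `temporary-failure` while no resolver is reachable, which matches the `host` output above.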
Mentioned in SAL (#wikimedia-cloud) [2023-02-14T00:18:02Z] <bd808> enc-2.cloudinfra.eqiad1.wikimedia.cloud: shutdown -r now (T329589)
DNS is working on enc-2.cloudinfra.eqiad1.wikimedia.cloud after the reboot, but keyholder needs to be manually armed there now. I poked in the #wikimedia-cloud-admin IRC channel to see if someone can do that for me as I do not know the passphrase or where to find it.
I cannot find any docs on any of this on wikitech either. That seems like something @taavi could help fix, as he did the work in T318504 (ENC API should update cloud/instance-puppet.git instead of requiring the caller to do so) to move the git commits from Horizon to the backend enc service.
Mentioned in SAL (#wikimedia-cloud) [2023-02-14T08:09:28Z] <taavi> arm keyholder on enc-2 T329589