
gerrit copy of cloud/instance-puppet stopped replicating
Closed, Resolved | Public | Bug Report

Description

At https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/, the most recent change shown is from before the outage (T329535), even though changes are known to have been made since then.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2023-02-13T23:59:48Z] <bd808> enc-1.cloudinfra.eqiad1.wikimedia.cloud: service uwsgi-puppet-enc restart (T329589)

Mentioned in SAL (#wikimedia-cloud) [2023-02-14T00:13:13Z] <bd808> enc-2.cloudinfra.eqiad1.wikimedia.cloud: service puppet-enc-git-worker restart (T329589)

On enc-2.cloudinfra.eqiad1.wikimedia.cloud:

$ journalctl -u puppet-enc-git-worker --no-pager
...
Feb 14 00:12:48 enc-2 puppet-enc-git-worker[8737]: 2023-02-14 00:12:48.437 8737 ERROR puppet-enc-git-worker pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'cloudinfra-db03.cloudinfra.eqiad1.wikimedia.cloud' ([Errno -3] Temporary failure in name resolution)")
Feb 14 00:12:48 enc-2 puppet-enc-git-worker[8737]: 2023-02-14 00:12:48.437 8737 ERROR puppet-enc-git-worker
Feb 14 00:12:48 enc-2 systemd[1]: puppet-enc-git-worker.service: Main process exited, code=exited, status=1/FAILURE
Feb 14 00:12:48 enc-2 systemd[1]: puppet-enc-git-worker.service: Failed with result 'exit-code'.
$ host cloudinfra-db03.cloudinfra.eqiad1.wikimedia.cloud
;; connection timed out; no servers could be reached
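
The check above can also be scripted. The following is a minimal sketch, not part of puppet-enc-git-worker: `check_resolution` and its `resolver` parameter are hypothetical names introduced here for illustration. It uses only the standard library `socket.getaddrinfo` call and reports whether a hostname resolves before a service tries to connect to it.

```python
import socket


def check_resolution(hostname, resolver=socket.getaddrinfo):
    """Return (ok, detail) for a DNS lookup of hostname.

    `resolver` is injectable (hypothetical parameter) so the function can be
    exercised without network access; it defaults to socket.getaddrinfo.
    """
    try:
        infos = resolver(hostname, None)
    except socket.gaierror as exc:
        # Resolver failures such as "Temporary failure in name resolution"
        # surface here as socket.gaierror.
        return False, f"{exc.errno}: {exc.strerror}"
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in infos})
    return True, ", ".join(addresses)


if __name__ == "__main__":
    ok, detail = check_resolution(
        "cloudinfra-db03.cloudinfra.eqiad1.wikimedia.cloud"
    )
    print("resolved" if ok else "lookup failed", detail)
```

Run on a host in the state shown above, this would be expected to report a failed lookup rather than raise, which makes it easy to tell a DNS problem apart from a database problem.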

Mentioned in SAL (#wikimedia-cloud) [2023-02-14T00:18:02Z] <bd808> enc-2.cloudinfra.eqiad1.wikimedia.cloud: shutdown -r now (T329589)

DNS is working on enc-2.cloudinfra.eqiad1.wikimedia.cloud after the reboot, but keyholder now needs to be manually armed there. I poked in the #wikimedia-cloud-admin IRC channel to see if someone can do that for me, as I do not know the passphrase or where to find it.

I cannot find any docs on any of this on wikitech either. That seems like something @taavi could help fix, since he did the work in T318504 (ENC API should update cloud/instance-puppet.git instead of requiring the caller to do so), which moved the git commits from Horizon to the backend enc service.

bd808 changed the subtype of this task from "Task" to "Bug Report". (Feb 14 2023, 12:43 AM)

I also can't find the passphrase. Hopefully @taavi will clarify.

Armed the Keyholder and updated the docs.

taavi claimed this task.