Hi folks!
Not sure if this is the right set of tags, but I'd like to report an issue that affected toolhub this morning. I noticed some pybal alerts about pods not responding properly in eqiad, and found the following state:
NAME                                  READY   STATUS             RESTARTS          AGE
toolhub-main-5557c9fc9c-7ftvl         3/4     CrashLoopBackOff   572 (5m8s ago)    38d
toolhub-main-5557c9fc9c-d59s8         3/4     CrashLoopBackOff   858 (67s ago)     38d
toolhub-main-crawler-28139400-7vj6v   1/2     Error              0                 43m
Then I realized that toolhub.wikimedia.org was down. The codfw pods were up, and no log was emitted by the toolhub-main containers. I got the following error from the crawler:
MySQLdb._exceptions.OperationalError: (2002, "Can't connect to MySQL server on 'm5-master.eqiad.wmnet' (115)")
Taavi suggested that it could be related to a change in the dbproxies in eqiad, so I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/934866, and everything came back up.
As a follow-up, I am wondering if we could add more logging to the toolhub-main containers so that failures like this are easier to diagnose. Also, is there a procedure to fail over to codfw if needed? I would have used DNS in this case, but I wasn't sure whether any follow-up was needed (and I didn't find any docs).
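
To make the logging suggestion a bit more concrete, here is a minimal sketch of the kind of check I have in mind. It is not existing Toolhub code: the function name `check_db_reachable` and the logger name `toolhub.dbcheck` are made up for illustration, and I am assuming the database listens on the default MySQL port 3306. It only tests TCP reachability and logs the reason on failure, so an unreachable `m5-master.eqiad.wmnet` would at least show up in the container logs.

```python
import logging
import socket

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("toolhub.dbcheck")


def check_db_reachable(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    On failure, log the underlying error and return False instead of
    failing silently. This checks network reachability only; it does not
    authenticate or run a query.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # Covers refused connections, timeouts and DNS resolution errors.
        logger.error("Cannot reach database at %s:%s: %s", host, port, exc)
        return False
```

Something like `check_db_reachable("m5-master.eqiad.wmnet")` could run at container startup or from a health check, so that the error appears in the logs before the pod enters CrashLoopBackOff.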