
Improve logging and alerting if toolhub is down
Open, Needs Triage, Public

Description

Hi folks!

Not sure if this is the right set of tags, but I'd like to report an issue that happened to toolhub this morning. I noticed some pybal alerts about pods in eqiad not responding well, and I found the following scenario:

NAME                                  READY   STATUS             RESTARTS         AGE
toolhub-main-5557c9fc9c-7ftvl         3/4     CrashLoopBackOff   572 (5m8s ago)   38d
toolhub-main-5557c9fc9c-d59s8         3/4     CrashLoopBackOff   858 (67s ago)    38d
toolhub-main-crawler-28139400-7vj6v   1/2     Error              0                43m

Then I realized that toolhub.wikimedia.org was down. The codfw pods were up, but no logs were being emitted by the toolhub-main containers. I got the following error from the crawler:

MySQLdb._exceptions.OperationalError: (2002, "Can't connect to MySQL server on 'm5-master.eqiad.wmnet' (115)")

Taavi suggested that it could have been related to a change to the dbproxies in eqiad, so I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/934866 and everything came back up.

As a follow-up, I am wondering if we could add more logging to the toolhub-main containers so we can respond better in cases like this. Moreover, is there a procedure to fail over to codfw if needed? I would have used DNS in this case, but I wasn't sure whether any follow-up was needed (and I didn't find any docs).
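
On the logging side, to make the ask a bit more concrete: something along these lines could help (purely a sketch on my part, I don't know what health checks Toolhub already exposes, and the view name is made up). A health endpoint that touches the database would leave an explicit log line and trip the readiness/pybal checks instead of failing silently:

```python
# Hypothetical sketch, not Toolhub's actual code: a health-check view that
# touches the database so a DB outage is logged explicitly and surfaces as a
# failing readiness/LVS check rather than a silent hang.
import logging

from django.db import OperationalError, connections
from django.http import HttpResponse, HttpResponseServerError

logger = logging.getLogger(__name__)


def healthz(request):
    """Return 200 only if the default database answers a trivial query."""
    try:
        with connections["default"].cursor() as cursor:
            cursor.execute("SELECT 1")
    except OperationalError:
        logger.exception("Database health check failed")
        return HttpResponseServerError("database unavailable")
    return HttpResponse("ok")
```

For this to react quickly, the database connection attempt itself also needs a short timeout; otherwise the check just hangs along with everything else.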

Event Timeline

> As a follow-up, I am wondering if we could add more logging to the toolhub-main containers so we can respond better in cases like this.

What sort of logging would you be looking for? Based on the big gap I see at https://logstash.wikimedia.org/goto/a31db31b9f22482d9a2f4b3608f06561, I think something actually blocked all of toolhub's log events from getting to the ELK stack during the outage.

> Moreover, is there a procedure to fail over to codfw if needed? I would have used DNS in this case, but I wasn't sure whether any follow-up was needed (and I didn't find any docs).

I have not described any failover procedure as part of https://wikitech.wikimedia.org/wiki/Toolhub.wikimedia.org. T288685: Establish active/active multi-dc support for Toolhub generally describes why I never bothered. There is no writable database for Toolhub in codfw nor is there currently an Elasticsearch index. In T329319: What should happen to Toolhub during the 2023 DC switch? everyone seemed to decide that it was more trouble than it was worth to figure out how to fail over along with the wikis and most of their support services.

>> As a follow-up, I am wondering if we could add more logging to the toolhub-main containers so we can respond better in cases like this.

> What sort of logging would you be looking for? Based on the big gap I see at https://logstash.wikimedia.org/goto/a31db31b9f22482d9a2f4b3608f06561, I think something actually blocked all of toolhub's log events from getting to the ELK stack during the outage.

I didn't notice any logs when inspecting the pods with kubectl logs, so I guess that toolhub hung while connecting to the database (since we didn't have firewall rules for the new proxy). I only noticed something in the toolhub-main-crawler pod, which was also failing periodically (is it supposed to behave like that?).
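
If the main containers really were stuck on the initial connection, a driver-level connect timeout might be enough to turn the silent hang into an OperationalError that gets logged. Something like this (a sketch assuming the standard Django DATABASES / mysqlclient setup; the database name is a placeholder):

```python
# Sketch only, assuming Toolhub's settings use Django's DATABASES dict with the
# mysqlclient (MySQLdb) driver; "toolhub" is a placeholder database name.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "HOST": "m5-master.eqiad.wmnet",  # host from the crawler error above
        "NAME": "toolhub",
        "OPTIONS": {
            # Passed through to MySQLdb.connect(): fail fast instead of
            # hanging when the proxy silently drops packets.
            "connect_timeout": 5,
            "read_timeout": 30,
            "write_timeout": 30,
        },
    },
}
```

That wouldn't have prevented the outage, but at least kubectl logs and Logstash would have shown connection errors instead of nothing.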

>> Moreover, is there a procedure to fail over to codfw if needed? I would have used DNS in this case, but I wasn't sure whether any follow-up was needed (and I didn't find any docs).

> I have not described any failover procedure as part of https://wikitech.wikimedia.org/wiki/Toolhub.wikimedia.org. T288685: Establish active/active multi-dc support for Toolhub generally describes why I never bothered. There is no writable database for Toolhub in codfw nor is there currently an Elasticsearch index. In T329319: What should happen to Toolhub during the 2023 DC switch? everyone seemed to decide that it was more trouble than it was worth to figure out how to fail over along with the wikis and most of their support services.

I think that adding this information to the wikitech page would be valuable, so we know what to do next time :)

>> What sort of logging would you be looking for? Based on the big gap I see at https://logstash.wikimedia.org/goto/a31db31b9f22482d9a2f4b3608f06561, I think something actually blocked all of toolhub's log events from getting to the ELK stack during the outage.

> I didn't notice any logs when inspecting the pods with kubectl logs, so I guess that toolhub hung while connecting to the database (since we didn't have firewall rules for the new proxy). I only noticed something in the toolhub-main-crawler pod, which was also failing periodically (is it supposed to behave like that?).

If your question is whether the crawler pod is supposed to end up stuck between runs because of lingering sidecar containers, that is covered by T292861: Find a better solution than `concurrencyPolicy: Replace` for sidecars in CronJob.

>> I have not described any failover procedure as part of https://wikitech.wikimedia.org/wiki/Toolhub.wikimedia.org. T288685: Establish active/active multi-dc support for Toolhub generally describes why I never bothered. There is no writable database for Toolhub in codfw nor is there currently an Elasticsearch index. In T329319: What should happen to Toolhub during the 2023 DC switch? everyone seemed to decide that it was more trouble than it was worth to figure out how to fail over along with the wikis and most of their support services.

> I think that adding this information to the wikitech page would be valuable, so we know what to do next time :)

This is now covered at https://wikitech.wikimedia.org/wiki/Toolhub.wikimedia.org#Datacenter_failover. Improvements welcome.