
Improve logging and alerting if toolhub is down
Open, Needs Triage, Public

Description

Hi folks!

Not sure if this is the right set of tags, but I'd like to report an issue that happened to toolhub this morning. I noticed some pybal alerts about pods in eqiad not responding well, and I found the following scenario:

NAME                                  READY   STATUS             RESTARTS         AGE
toolhub-main-5557c9fc9c-7ftvl         3/4     CrashLoopBackOff   572 (5m8s ago)   38d
toolhub-main-5557c9fc9c-d59s8         3/4     CrashLoopBackOff   858 (67s ago)    38d
toolhub-main-crawler-28139400-7vj6v   1/2     Error              0                43m

Then I realized that toolhub.wikimedia.org was down. The codfw pods were up, but no logs were being emitted by the toolhub-main containers. I got the following error from the crawler:

MySQLdb._exceptions.OperationalError: (2002, "Can't connect to MySQL server on 'm5-master.eqiad.wmnet' (115)")

Taavi suggested that it could have been related to a change to the dbproxies in eqiad, so I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/934866 and everything came back up.

As a follow-up, I am wondering if we could add more logging to the toolhub-main containers so we can respond better in cases like this. Moreover, is there a procedure to fail over to codfw if needed? I would have used DNS in this case, but I wasn't sure whether any follow-up was needed (and I didn't find any docs).
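
On the logging side, to make the ask a bit more concrete: something along these lines could help (purely a sketch on my part, I don't know what health checks Toolhub already exposes, and the view name is made up). A health endpoint that touches the database would leave an explicit log line and trip the readiness/pybal checks instead of failing silently:

```python
# Hypothetical sketch, not Toolhub's actual code: a health-check view that
# touches the database so a DB outage is logged explicitly and surfaces as a
# failing readiness/LVS check rather than a silent hang.
import logging

from django.db import OperationalError, connections
from django.http import HttpResponse, HttpResponseServerError

logger = logging.getLogger(__name__)


def healthz(request):
    """Return 200 only if the default database answers a trivial query."""
    try:
        with connections["default"].cursor() as cursor:
            cursor.execute("SELECT 1")
    except OperationalError:
        logger.exception("Database health check failed")
        return HttpResponseServerError("database unavailable")
    return HttpResponse("ok")
```

For this to react quickly, the database connection attempt itself also needs a short timeout; otherwise the check just hangs along with everything else.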

Event Timeline

> As a follow-up, I am wondering if we could add more logging to the toolhub-main containers so we can respond better in cases like this.

What sort of logging would you be looking for? Based on the big gap I see at https://logstash.wikimedia.org/goto/a31db31b9f22482d9a2f4b3608f06561, I think something actually blocked all of toolhub's log events from getting to the ELK stack during the outage.

> Moreover, is there a procedure to fail over to codfw if needed? I would have used DNS in this case, but I wasn't sure whether any follow-up was needed (and I didn't find any docs).

I have not described any failover procedure as part of https://wikitech.wikimedia.org/wiki/Toolhub.wikimedia.org. T288685: Establish active/active multi-dc support for Toolhub generally describes why I never bothered. There is no writable database for Toolhub in codfw nor is there currently an Elasticsearch index. In T329319: What should happen to Toolhub during the 2023 DC switch? everyone seemed to decide that it was more trouble than it was worth to figure out how to fail over along with the wikis and most of their support services.

>> As a follow-up, I am wondering if we could add more logging to the toolhub-main containers so we can respond better in cases like this.

> What sort of logging would you be looking for? Based on the big gap I see at https://logstash.wikimedia.org/goto/a31db31b9f22482d9a2f4b3608f06561, I think something actually blocked all of toolhub's log events from getting to the ELK stack during the outage.

I didn't notice any logs when inspecting the pods with kubectl logs, so I guess that toolhub hung while connecting to the database (since we didn't have firewall rules for the new proxy). I only noticed something in the toolhub-main-crawler pod, which was also failing periodically (is it supposed to behave like that?).
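
If the main containers really were stuck on the initial connection, a driver-level connect timeout might be enough to turn the silent hang into an OperationalError that gets logged. Something like this (a sketch assuming the standard Django DATABASES / mysqlclient setup; the database name is a placeholder):

```python
# Sketch only, assuming Toolhub's settings use Django's DATABASES dict with the
# mysqlclient (MySQLdb) driver; "toolhub" is a placeholder database name.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "HOST": "m5-master.eqiad.wmnet",  # host from the crawler error above
        "NAME": "toolhub",
        "OPTIONS": {
            # Passed through to MySQLdb.connect(): fail fast instead of
            # hanging when the proxy silently drops packets.
            "connect_timeout": 5,
            "read_timeout": 30,
            "write_timeout": 30,
        },
    },
}
```

That wouldn't have prevented the outage, but at least kubectl logs and Logstash would have shown connection errors instead of nothing.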

>> Moreover, is there a procedure to fail over to codfw if needed? I would have used DNS in this case, but I wasn't sure whether any follow-up was needed (and I didn't find any docs).

> I have not described any failover procedure as part of https://wikitech.wikimedia.org/wiki/Toolhub.wikimedia.org. T288685: Establish active/active multi-dc support for Toolhub generally describes why I never bothered. There is no writable database for Toolhub in codfw nor is there currently an Elasticsearch index. In T329319: What should happen to Toolhub during the 2023 DC switch? everyone seemed to decide that it was more trouble than it was worth to figure out how to fail over along with the wikis and most of their support services.

I think that adding this information to the wikitech page would be valuable, so we know what to do next time :)

>> What sort of logging would you be looking for? Based on the big gap I see at https://logstash.wikimedia.org/goto/a31db31b9f22482d9a2f4b3608f06561, I think something actually blocked all of toolhub's log events from getting to the ELK stack during the outage.

> I didn't notice any logs when inspecting the pods with kubectl logs, so I guess that toolhub hung while connecting to the database (since we didn't have firewall rules for the new proxy). I only noticed something in the toolhub-main-crawler pod, which was also failing periodically (is it supposed to behave like that?).

If your question is whether the crawler pod is supposed to end up stuck between runs because of lingering sidecar containers, that is covered by T292861: Find a better solution than `concurrencyPolicy: Replace` for sidecars in CronJob.

>> I have not described any failover procedure as part of https://wikitech.wikimedia.org/wiki/Toolhub.wikimedia.org. T288685: Establish active/active multi-dc support for Toolhub generally describes why I never bothered. There is no writable database for Toolhub in codfw nor is there currently an Elasticsearch index. In T329319: What should happen to Toolhub during the 2023 DC switch? everyone seemed to decide that it was more trouble than it was worth to figure out how to fail over along with the wikis and most of their support services.

> I think that adding this information to the wikitech page would be valuable, so we know what to do next time :)

This is now covered at https://wikitech.wikimedia.org/wiki/Toolhub.wikimedia.org#Datacenter_failover. Improvements welcome.