Primary s4 db Incident report review
Closed, Declined (Public)

Description

Please review and update the following incident document:

https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-04-27_Commons_wiki_primary_db

Specifically:

  • There was a mention of "InnoDB Monitor starts to warn about threadpool blocked"; however, I couldn't see this alert firing in operations. Can someone copy the contents of the alert to the Detection section?
  • Add details to the impact section
  • Conclusions section
  • Links to relevant documentation

Event Timeline

Marostegui renamed this task from MySQL Replication Inciident repot review to MySQL Replication Incident repot review. Apr 27 2021, 1:32 PM
jbond renamed this task from MySQL Replication Incident repot review to MySQL Replication Incident report review. Apr 27 2021, 1:34 PM

There was a mention of "InnoDB Monitor starts to warn about threadpool blocked"; however, I couldn't see this alert firing in operations. Can someone copy the contents of the alert to the [[ https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-04-27_MySQL_Replication#Detection | Detection section ]]

That has no alerting; it was extracted from the logs, so there's no need to add it to the Detection section. It was read after the incident, while searching for the culprit.

Add details to the impact section

Done

Conclusions section

Expanded a bit

"InnoDB Monitor" is the name of the MariaDB/MySQL component that starts dumping warnings into the error log. Normally warnings are not logged there, but it starts periodically dumping information to the MySQL error log (and from there, to syslog) when something at the engine level is in a very bad state: https://dev.mysql.com/doc/refman/5.6/en/innodb-enabling-monitors.html Maybe the confusion was with our monitoring infrastructure(?), but that is the name of that MySQL component.
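For anyone searching the logs after the fact, here is a minimal sketch of how those dumps could be spotted. It assumes the error log text contains the standard InnoDB Monitor banner lines ("INNODB MONITOR OUTPUT" and "END OF INNODB MONITOR OUTPUT"); the function name is hypothetical and this is not the tooling that was used during the incident.

```python
import re

# Banner lines the InnoDB Standard Monitor prints around each dump.
# Assumption: the log being searched uses these standard banners.
START = re.compile(r"INNODB MONITOR OUTPUT\s*$")
END = "END OF INNODB MONITOR OUTPUT"


def count_monitor_dumps(log_text: str) -> int:
    """Count complete InnoDB Monitor dumps found in error-log text."""
    dumps = 0
    inside = False
    for line in log_text.splitlines():
        # Check the END banner first, because it also ends with the
        # same words as the START banner.
        if END in line:
            if inside:
                dumps += 1
            inside = False
        elif START.search(line):
            inside = True
    return dumps
```

A non-zero count for the incident window would only say that the monitor was dumping to the log, not that any alert fired.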

jcrespo renamed this task from MySQL Replication Incident report review to Primary s4 db Incident report review. Apr 27 2021, 1:53 PM

jbond: please note that I renamed the incident on Google Docs. Could you rename it on wikitech, too? "Replication incident" is misleading.

Maybe the confusion was with our monitoring infrastructure(?), but that is the name of that mysql component.

The confusion was mine; I thought this had fired in Icinga. This can be ignored now.

please note that I renamed the incident on Google Docs. Could you rename it on wikitech, too? "Replication incident" is misleading.

done

jbond updated the task description.

BTW, "Replication incident" was a good name when you took control: we didn't know what was going on at first, and when replication breaks it is either the replicas or the primary server :-). We do now, which is why I renamed it, so that it is not confused with a "regular" replication error.

BTW, "Replication incident" was a good name when you took control: we didn't know what was going on at first, and when replication breaks it is either the replicas or the primary server :-). We do now, which is why I renamed it, so that it is not confused with a "regular" replication error.

Ack SGTM thanks

lmata subscribed.

Closing, as this will not be addressed further.