We got an alert on #wikimedia-databases IRC saying:
PROBLEM - MariaDB sustained replica lag on db1111 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1111&var-port=9104
However, the lag spike was seen on all servers; db1111 just happened to be the one sensitive enough to trigger the alert:
Logstash indicated thousands of client errors due to lag:
The log around that time seems to indicate maintenance on the Wikidata property suggester:
09:44:09 <hoo> !log Updated the Wikidata property suggester with data from the 2021-01-11 JSON dump and applied the T132839 workarounds
hoo on IRC seemed to agree that it was likely the cause:
09:48:10 <hoo> The maintenance script rebuilds the entire table (yuck...)
We can cope with temporary lag on a single server, but lag on all servers at once is quite impactful for editors, recentchanges, etc.
Let's try to avoid production impact by one of: refactoring the script, adding pauses (e.g. waitForReplica()), not running it, or any other method; the choice is up to the development team.
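
To illustrate the "adding pauses" option, here is a minimal sketch in Python of a batched rebuild that waits for replicas to catch up between batches. It is not the actual maintenance script: `rebuild_in_batches`, `wait_for_replicas`, `write_batch` and `get_max_lag` are hypothetical names, and the batch size and lag threshold are assumed values that would need tuning.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` rows."""
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def wait_for_replicas(
    get_max_lag: Callable[[], float],  # hypothetical: returns worst replica lag in seconds
    max_lag_seconds: float = 1.0,      # assumed threshold, kept below the alert level
    poll_seconds: float = 0.5,
    timeout_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until the reported lag is at or below the threshold.

    Returns False if the timeout is reached while replicas are still lagging.
    """
    waited = 0.0
    while get_max_lag() > max_lag_seconds:
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True


def rebuild_in_batches(
    rows: Iterable[T],
    write_batch: Callable[[List[T]], None],  # hypothetical: writes one batch to the primary
    get_max_lag: Callable[[], float],
    batch_size: int = 1000,                  # assumed value
    **wait_kwargs,
) -> int:
    """Write rows in small batches, pausing after each one so replicas can catch up."""
    written = 0
    for batch in chunked(rows, batch_size):
        write_batch(batch)
        written += len(batch)
        wait_for_replicas(get_max_lag, **wait_kwargs)
    return written
```

The point of the design is that no single write is large enough to stall replication, and the script backs off whenever replicas fall behind instead of pushing the whole table at once.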