
Updating Wikidata's property suggester caused replica lag on all wikidata databases
Closed, ResolvedPublic

Description

We got an alert on #wikimedia-databases IRC saying:

PROBLEM - MariaDB sustained replica lag on db1111 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1111&var-port=9104

However, the lag spike was seen on all servers; db1111 just happened to be the most sensitive, so it was the one reported:

Screenshot from 2021-01-21 10-57-14.png (1×2 px, 254 KB)

Logstash indicated thousands of client errors due to lag:

Screenshot from 2021-01-21 10-59-06.png (1×1 px, 177 KB)

The log around that time seems to indicate maintenance on the Wikidata property suggester:

09:44:09 <hoo> !log Updated the Wikidata property suggester with data from the 2021-01-11 JSON dump and applied the T132839 workarounds

hoo on IRC seemed to agree that it was likely the cause:

09:48:10 <hoo> The maintenance script rebuilds the entire table (yuck...)

We can cope with temporary lag on one server, but lag on all servers has quite an impact on editors, recentchanges, etc.

Let's try to avoid production impact by one of: refactoring the script, adding pauses (e.g. waitForReplica()), not running it at all, or any other method; the choice is up to the devel team.
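To illustrate the "adding pauses" option, here is a minimal sketch of a cleanup that deletes in small batches and waits between them, so replicas can catch up. It is not the actual maintenance script: the table and column names, the batch size, and the `wait_for_replicas` callback (a stand-in for something like waitForReplica()) are all assumptions, and SQLite is used only to keep the example self-contained.

```python
import sqlite3
import time
from typing import Callable


def delete_in_batches(
    conn: sqlite3.Connection,
    table: str,
    where: str,
    batch_size: int = 1000,
    wait_for_replicas: Callable[[], None] = lambda: time.sleep(1.0),
) -> int:
    """Delete rows matching `where` in small committed batches.

    `table` and `where` are trusted, hypothetical inputs for this sketch;
    `wait_for_replicas` is a placeholder hook that would block until
    replication lag is back under an acceptable threshold.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE {where} LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # each batch is its own short transaction
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
        wait_for_replicas()  # pause before the next batch
    return total
```

Each batch is a short transaction, so replicas apply many small writes rather than one huge one, and the pause bounds how far they can fall behind.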

Event Timeline

Currently, the procedure for updating the table involves a hand-written script that cleans up the data (deleting LOTS of rows) in the database table after the dataset is inserted, which is probably what caused these issues. See PropertySuggester_update.

I spent a bit of time on this today and rewrote it as a small Python script that can be applied to the dataset before it is inserted into the table (thus no more deletes), which should solve the issue at hand. I will use it for the next update, and we can closely monitor the logs to confirm the fix before closing this.
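The actual script is not shown in this task. As a rough illustration of the idea of filtering before loading, a sketch might look like the following; the CSV layout, the column names (`pid1`, `pid2`, `probability`) and the threshold are assumptions made for the example, not the real cleanup rules.

```python
import csv
import io
from typing import TextIO


def filter_csv(src: TextIO, dst: TextIO, min_probability: float = 0.01) -> int:
    """Copy rows from `src` to `dst`, dropping unwanted ones up front.

    The filter criterion (a minimum `probability` value) is hypothetical;
    the point is that rows are discarded before they reach the database,
    so no large DELETE is needed afterwards.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        if float(row["probability"]) >= min_probability:
            writer.writerow(row)
            kept += 1
    return kept


if __name__ == "__main__":
    sample = io.StringIO("pid1,pid2,probability\n31,21,0.5\n31,279,0.001\n")
    out = io.StringIO()
    print(filter_csv(sample, out), "rows kept")
```

With this approach the table only ever receives rows that are meant to stay, so the post-insert cleanup pass disappears.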

We should tackle this topic as part of T72037: [Story] Automate Entity Suggester Data Updates, which will likely lead to substantial refactoring and a different update mechanism.

hoo claimed this task.

I just did another suggester update with the script described above (T272571#6771954), and we encountered no issues this time, so this should be fine.