Updating Wikidata's property suggester caused replica lag on all Wikidata databases
Closed, ResolvedPublic
Assigned To

Authored By

	jcrespo
	Jan 21 2021, 10:07 AM

Description

We got an alert on #wikimedia-databases IRC saying:

PROBLEM - MariaDB sustained replica lag on db1111 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1111&var-port=9104

However, the lag spike was seen on all servers; db1111 just happened to be the one sensitive enough to trigger the alert:

Screenshot from 2021-01-21 10-57-14.png (1×2 px, 254 KB)

Logstash indicated thousands of client errors due to lag:

Screenshot from 2021-01-21 10-59-06.png (1×1 px, 177 KB)

The log around that time points to maintenance on the Wikidata property suggester:

09:44:09 <hoo> !log Updated the Wikidata property suggester with data from the 2021-01-11 JSON dump and applied the T132839 workarounds

hoo on IRC seemed to agree that it was likely the cause:

09:48:10 <hoo> The maintenance script rebuilds the entire table (yuck...)

We can cope with temporary lag on one server, but lag on all servers at once has quite an impact on editors, recentchanges, etc.

Let's try to avoid production impact by one of: refactoring the script, adding pauses (e.g. waitForReplica()), not running it at all, or any other method; the choice is up to the development team.
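The "adding pauses" option can be illustrated with a minimal sketch that deletes rows in small batches and pauses after each one, so replicas get a chance to catch up. This is not the maintenance script itself: `delete_in_batches` and the `wait_for_replicas` callback are hypothetical names, and `sqlite3` is used only to keep the example self-contained (production runs on MariaDB through MediaWiki).

```python
import sqlite3
from typing import Callable


def delete_in_batches(
    conn: sqlite3.Connection,
    table: str,
    where: str,
    batch_size: int,
    wait_for_replicas: Callable[[], None],
) -> int:
    """Delete rows matching `where` in batches, pausing after each batch.

    `table` and `where` are interpolated into the SQL, so they must come
    from trusted code, never from user input. `wait_for_replicas` is a
    hypothetical stand-in for whatever replication wait the real
    environment provides.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE {where} LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
        wait_for_replicas()  # let replicas catch up before the next batch
    return total
```

Each batch is its own short transaction, which is the point of the approach: replicas replay many small writes with pauses in between instead of one huge write.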

Related Objects

Mentioned Here: T72037: [Story] Automate Entity Suggester Data Updates

Event Timeline

jcrespo created this task. Jan 21 2021, 10:07 AM

Restricted Application added a subscriber: Aklapper. Jan 21 2021, 10:07 AM

Currently, updating the table involves a hand-written script that cleans up the data (deleting LOTS of rows) in the database table after the dataset has been inserted, which is probably what caused these issues. See PropertySuggester_update.

I spent a bit of time on this today and rewrote the cleanup as a small Python script that can be applied to the dataset before it is inserted into the table (so no more deletes), which should effectively solve the issue at hand. I will use it for the next update, and we can closely monitor the logs before conclusively closing this.
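As a rough illustration of that idea (not the actual script, which is not shown in this task), the cleanup can run on the CSV dataset before it ever reaches the database. The function name, the column names, and the drop rule in the usage note are all hypothetical.

```python
import csv
from typing import Callable, Dict, TextIO, Tuple


def filter_dataset(
    src: TextIO,
    dst: TextIO,
    keep: Callable[[Dict[str, str]], bool],
) -> Tuple[int, int]:
    """Copy CSV rows from `src` to `dst`, keeping only rows where keep(row) is true.

    Returns (kept, dropped). The real cleanup rules are not given in this
    task, so the predicate is supplied by the caller.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
    writer.writeheader()
    kept = dropped = 0
    for row in reader:
        if keep(row):
            writer.writerow(row)
            kept += 1
        else:
            dropped += 1
    return kept, dropped
```

For example, `filter_dataset(src, dst, lambda r: int(r["count"]) > 0)` would drop rows with a zero count, assuming the dataset has a `count` column. Because the unwanted rows never reach the table, no large DELETE has to replicate afterwards.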

Restricted Application added a project: [DEPRECATED] wdwb-tech. Jan 24 2021, 6:35 PM

Addshore subscribed. Jan 25 2021, 3:47 PM

We should tackle this as part of T72037: [Story] Automate Entity Suggester Data Updates, which will likely lead to substantial refactoring and a different update mechanism.

Ladsgroup subscribed. Jan 26 2021, 9:19 AM

Lucas_Werkmeister_WMDE subscribed. Jan 26 2021, 11:09 AM

Krinkle moved this task from Untriaged to Jan 2021 on the Wikimedia-production-error board. Jan 27 2021, 7:38 PM

jcrespo awarded a token. Feb 15 2021, 4:16 PM

I just did another suggester update using the script described above (T272571#6771954), and we encountered no issues this time, so this should be fine.

	F34010823: Screenshot from 2021-01-21 10-57-14.png
	Jan 21 2021, 10:07 AM

	F34010825: Screenshot from 2021-01-21 10-59-06.png
	Jan 21 2021, 10:07 AM
