
Wikidata editing via API (and UI) slow (timeouts)
Closed, Resolved · Public

Assigned To
Authored By
Addshore
Oct 30 2019, 4:53 PM

Event Timeline

Addshore triaged this task as Unbreak Now! priority. Oct 30 2019, 4:54 PM
Addshore updated the task description.

Edit rate also through the floor.
I'm guessing this affects all of the API modules.

image.png (289×920 px, 62 KB)

This probably relates to the changes that were backported in T234948.
The maintenance script for the terms migration restarted at around the time this issue seemed to appear.

Addshore renamed this task from Wikidata editing via API slow (timeouts) to Wikidata editing via API (and UI) slow (timeouts). Edited Oct 30 2019, 4:58 PM

5:59 PM <jynus> !log killed rebuildItemTerms on mwmaint1002
5:59 PM <+stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log

Looks like the API request rate is recovering.

image.png (372×1 px, 51 KB)

Addshore lowered the priority of this task from Unbreak Now! to High.
Addshore added a subscriber: Ladsgroup.

@Ladsgroup will write a patch removing the cron for now so that the script doesn't start up again.

Change 547264 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] mediawiki: Temporary disable rebuildTermIndex

https://gerrit.wikimedia.org/r/547264

Change 547264 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: Temporary disable rebuildTermIndex

https://gerrit.wikimedia.org/r/547264

Edit rate recovered, then got worse again.

image.png (295×574 px, 30 KB)

We are seeing 5-15s lag on db slaves.

image.png (282×903 px, 28 KB)

Going to revert the 2 backports

Mentioned in SAL (#wikimedia-operations) [2019-10-30T17:26:36Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Revert 16:02 UTC T236928 (duration: 01m 04s)

Mentioned in SAL (#wikimedia-operations) [2019-10-30T17:29:15Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Revert 16:05 UTC T236928 (duration: 01m 05s)

The replag seemed to all come from one host, which was having issues:

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104

That seems to have recovered now.

Edits seem to be recovering

image.png (304×593 px, 30 KB)

From @jcrespo

Rows read on s8 shot up from around 16:12

image.png (282×1 px, 73 KB)

https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1572451682946&to=1572452256601&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-shard=s8&var-role=All

The original code deploys were:

16:05 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547244|Stop locking and use DISTINCT when finding used terms to delete]] (T234948) (duration: 01m 04s)
16:02 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547243|Stop locking and use DISTINCT when finding used terms to delete]] (T236466) (duration: 01m 05s)

The maintenance script restarted every hour at XX:30 minutes according to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/547264/1/modules/mediawiki/manifests/maintenance/wikidata.pp

So the issue has existed since the deployment, and the maintenance script just amplified it to a visible level.
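The commit titles in the deploy log above describe replacing a locking read with a DISTINCT select. A purely illustrative sketch of that kind of query-shape change (the table and column names below are invented, not the actual Wikibase schema or the reverted code):

```python
# Hypothetical illustration only: invented table/column names, not the
# real Wikibase queries from gerrit:547243/547244.
locking_select = (
    "SELECT text_id FROM term_text"
    " WHERE text_id IN (1, 2, 3)"
    " FOR UPDATE"  # takes row locks; concurrent deleters can deadlock
)
lock_free_select = (
    "SELECT DISTINCT text_id FROM term_text"
    " WHERE text_id IN (1, 2, 3)"  # no row locks, but the server may
)                                  # read more rows to de-duplicate

assert "FOR UPDATE" in locking_select
assert "FOR UPDATE" not in lock_free_select
```

The trade-off sketched here matches the symptoms in this ticket: fewer deadlocks, but a jump in rows read on s8.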

We are seeing 5-15s lag on db slaves.

Off-topic: I don't believe the lag was real. There was so much overload that the lag-checking reads/connections made by MediaWiki took a long time to finish and return results. This is irrelevant for this ticket (the state was bad either way), but it is something that could perhaps be optimized, or that could be misleading in other scenarios, in the lag-checking code.
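That point can be put as a toy model (this is not MediaWiki's actual lag-check code): a heartbeat-style check compares "now" against a replicated timestamp, so any delay in executing the check query itself is folded into the reported lag.

```python
def reported_lag(true_lag_s: float, check_latency_s: float) -> float:
    """Toy model of a heartbeat-style lag check: the client reads a
    replicated timestamp and compares it to the current time, so delay
    in running the check itself inflates the reported lag."""
    return true_lag_s + check_latency_s

# A replica that is merely too overloaded to answer promptly can look
# as if it has many seconds of replication lag:
assert reported_lag(0.5, 9.5) == 10.0
```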

The maintenance script restarted every 30 mins

Small correction: every hour, at XX:30 minutes. Maybe you already knew that, but just in case, it could lead to a misleading analysis.

Fixed
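In cron terms, the correction above is the difference between a minute field of `30` (once an hour, at :30) and `*/30` (twice an hour). A toy illustration, assuming a standard crontab-style schedule (the real schedule lives in the linked puppet manifest):

```python
def fire_minutes(minute_field: str) -> list[int]:
    """Toy expansion of a cron minute field, covering only the two
    forms discussed here: a literal minute, or a */step interval."""
    if minute_field.startswith("*/"):
        step = int(minute_field[2:])
        return list(range(0, 60, step))
    return [int(minute_field)]

assert fire_minutes("30") == [30]       # every hour at XX:30 -> 1 run/hour
assert fire_minutes("*/30") == [0, 30]  # every 30 mins -> 2 runs/hour
```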

Looking at things, everything still looks nice and recovered.

This may deserve incident documentation.

I can do that.

Change 547397 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Revert "Stop locking and use DISTINCT when finding used terms to delete"

https://gerrit.wikimedia.org/r/547397

Change 547397 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Revert "Stop locking and use DISTINCT when finding used terms to delete"

https://gerrit.wikimedia.org/r/547397

Incident doc draft is up, and the patch involved has been reverted on master so as not to surprise us next week.
So closing this now.