Page MenuHomePhabricator

Wikidata editing via API (and UI) slow (timeouts)
Closed, ResolvedPublic

Details

Related Gerrit Patches:

Event Timeline

Addshore created this task.Wed, Oct 30, 4:53 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptWed, Oct 30, 4:53 PM
Addshore triaged this task as Unbreak Now! priority.Wed, Oct 30, 4:54 PM
Addshore updated the task description. (Show Details)
Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptWed, Oct 30, 4:54 PM
Addshore updated the task description. (Show Details)Wed, Oct 30, 4:55 PM
Addshore added a comment.EditedWed, Oct 30, 4:55 PM

Edit rate also through the floor.
I'm guessing this is all of the API modules

Relates probably to things that were backported in T234948.
The maintenance script for the terms migration restarted at around the time this issue seemed to appear.

Addshore renamed this task from Wikidata editing via API slow (timeouts) to Wikidata editing via API (and UI) slow (timeouts).EditedWed, Oct 30, 4:58 PM

5:59 PM <jynus> !log killed rebuildItemTerms on mwmaint1002
5:59 PM <+stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log

Looks like API request rate is recovering

Addshore lowered the priority of this task from Unbreak Now! to High.
Addshore added a subscriber: Ladsgroup.

@Ladsgroup will write a patch removing the cron for now so that the script doesn't start up again.

Restricted Application added a project: User-Ladsgroup. · View Herald TranscriptWed, Oct 30, 5:06 PM

Change 547264 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] mediawiki: Temporary disable rebuildTermIndex

https://gerrit.wikimedia.org/r/547264

Change 547264 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: Temporary disable rebuildTermIndex

https://gerrit.wikimedia.org/r/547264

Edit rate recovered, then got worse again

We are seeing 5-15s lag on db slaves.

Going to revert the 2 backports

Mentioned in SAL (#wikimedia-operations) [2019-10-30T17:26:36Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Revert 16:02 UTC T236928 (duration: 01m 04s)

Mentioned in SAL (#wikimedia-operations) [2019-10-30T17:29:15Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Revert 16:05 UTC T236928 (duration: 01m 05s)

The replag all seemed to come from one host which was having issues

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1126&var-port=9104

That seems to have recovered now.

Edits seem to be recovering

Addshore added a subscriber: jcrespo.EditedWed, Oct 30, 5:34 PM

From @jcrespo

Rows read on s8 shot up from around 16:12

https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1572451682946&to=1572452256601&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-shard=s8&var-role=All

The original code deploys were:

16:05 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547244|Stop locking and use DISTINCT when finding used terms to delete]] (T234948) (duration: 01m 04s)
16:02 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547243|Stop locking and use DISTINCT when finding used terms to delete]] (T236466) (duration: 01m 05s)

The maintenance script restarted every hour at XX:30 minutes according to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/547264/1/modules/mediawiki/manifests/maintenance/wikidata.pp

So the issue existed since the deployment, and the maintenance script just increased the issue to a visible level.

We are seeing 5-15s lag on db slaves.

Offtopic- I don't believe the lag was real- it just that there was so much overload that lag checking reads/connections by mw would take so much time to finish and return results. This is irrelevant for this ticket (the state was bad), but something that maybe could be optimized or could be missleading in other scenarios in the lag checking code?

The maintenance script restarted every 30 mins

Small correction, every hour at XX:30 minutes. Maybe you already knew that, but just in case it could lead to missleading analysis.

The maintenance script restarted every 30 mins

Small correction, every hour at XX:30 minutes. Maybe you already knew that, but just in case it could lead to missleading analysis.

Fixed

Looking at things, everything still looks nice and recovered.

This may deserve an incident documentation.

I can do that

Change 547397 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Revert "Stop locking and use DISTINCT when finding used terms to delete"

https://gerrit.wikimedia.org/r/547397

Change 547397 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Revert "Stop locking and use DISTINCT when finding used terms to delete"

https://gerrit.wikimedia.org/r/547397

Addshore closed this task as Resolved.Thu, Oct 31, 8:10 AM

Incident doc draft up, and the patch involved has been reverted on master so as not to surprise us next week.
So closing this now

Maintenance_bot moved this task from Incoming to Done on the User-Ladsgroup board.Thu, Oct 31, 8:15 AM