Page MenuHomePhabricator

Wikidata.org currently very slow
Closed, ResolvedPublic

Description

The wikidata.org website is currently very slow.
Loading time for https://www.wikidata.org/wiki/Special:Watchlist is 246071 ms which is absurd high (this is plain loading time for the page self according to my browser console, network tab)

Event Timeline

Legoktm triaged this task as Unbreak Now! priority.Aug 13 2017, 6:13 PM
Legoktm added a project: SRE.
Legoktm added a subscriber: Legoktm.

this is being investigated. Currently looking at db1082 being slow.

I don't see anything wrong with db1082 (or the s5 master).
db1082 had some spikes but traffic is back to normal: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1082&var-port=9104&refresh=1m&from=1502611167573&to=1502654367573

db1063 (the master) had a big drop in traffic, but it also recovered: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&refresh=1m&from=1502611227789&to=1502654427790&var-dc=eqiad%20prometheus%2Fops&var-server=db1063&var-port=9104

I have checked both servers and they had no HW or MySQL issues, so I would say this could be a consequence of whatever is/was going on rather than the cause.

I am seeing one disk on db1063 (the master) with errors and predictive failure, but it has been on predictive failure since 15th Jul, and the raid is OPTIMAL status.
As the traffic has recovered on the master after the drop, I would say that disk predictive failure and errors (only 379 errors only) has nothing to do with this issue.

Marostegui lowered the priority of this task from Unbreak Now! to Medium.Aug 14 2017, 2:48 AM

This graphs shows that the traffic got back to normal rate hours ago.
What caused this...I don't know. But I think we can lower (if not close) this issue.

https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&panelId=6&fullscreen&orgId=1&from=1502505928446&to=1502678728447

It looks like something just happened on db1082 again?
at roughly 16:54 edits on wikidata dropped off again and there was a spike in the slave lag.

<icinga-wm> IRC echo bot PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1953 bytes in 0.132 second response time
5:05 PM I<icinga-wm> IRC echo bot PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 413.03 seconds
5:05 PM PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 432.75 seconds

I'm on a plane but the lag is probably due to the massive spike in UPDATES the master (db1063) had

jcrespo raised the priority of this task from Medium to Unbreak Now!.Aug 17 2017, 3:48 PM
jcrespo added a subscriber: jcrespo.

The execution of:

www-data 20284  0.0  0.1 328216 60744 ?        S    15:30   0:00 php5 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/pruneChanges.php --wiki wikidatawiki --number-of-days=3

Matches the start of the issues.

I think this is just UPDATE /* Title::invalidateCache */ based on the binlogs, not the above script.

which means most probably a direct cause of T164173, getting worse?

This has not happened since T164173 got fixed I believe, so maybe it was indeed a direct cause.

jcrespo claimed this task.

Resolving for now.