The wikidata.org website is currently very slow.
Loading time for https://www.wikidata.org/wiki/Special:Watchlist is 246071 ms, which is absurdly high (this is the plain loading time for the page itself, according to the network tab in my browser console).
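For reference, a rough way to reproduce such a measurement outside the browser is to time the raw request, e.g. with Python's requests library. This is only a sketch: the 246071 ms above came from the browser's network tab, and Special:Watchlist needs a logged-in session, so an anonymous request is not strictly equivalent.

```python
import time
import requests

URL = "https://www.wikidata.org/wiki/Special:Watchlist"

# Time a single request; this approximates only the server response time,
# not the browser's full page load (no rendering, no asset loading, and
# an anonymous request will just be bounced to the login page).
start = time.monotonic()
resp = requests.get(URL, timeout=600)
elapsed_ms = (time.monotonic() - start) * 1000
print(f"HTTP {resp.status_code}, {len(resp.content)} bytes in {elapsed_ms:.0f} ms")
```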
Event Timeline
Editing is very slow and edits aren't being saved: https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1&from=1502637379598&to=1502647684878
I don't see anything wrong with db1082 (or the s5 master).
db1082 had some spikes but traffic is back to normal: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1082&var-port=9104&refresh=1m&from=1502611167573&to=1502654367573
db1063 (the master) had a big drop in traffic, but it also recovered: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&refresh=1m&from=1502611227789&to=1502654427790&var-dc=eqiad%20prometheus%2Fops&var-server=db1063&var-port=9104
I have checked both servers and they had no HW or MySQL issues, so I would say this could be a consequence of whatever is/was going on rather than the cause.
I am seeing one disk on db1063 (the master) with errors and predictive failure, but it has been on predictive failure since 15 Jul, and the RAID is in OPTIMAL status.
As the traffic has recovered on the master after the drop, I would say that the disk predictive failure and errors (only 379 errors) have nothing to do with this issue.
These graphs show that the traffic got back to its normal rate hours ago.
What caused this, I don't know, but I think we can lower the priority of this issue (if not close it).
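For completeness, the kind of traffic check behind those dashboards can also be run directly against the Prometheus API. The sketch below assumes the standard mysqld_exporter metric mysql_global_status_queries, a placeholder Prometheus endpoint, the instance label format "db1063:9104", and an approximate time window taken from the dashboard links above; the real values are whatever those dashboards use.

```python
import requests

# Placeholder Prometheus endpoint; substitute the real ops Prometheus host.
PROM = "http://prometheus.example.org/ops"

# Rate of statements executed on db1063 across the incident window,
# to confirm the drop in traffic and the subsequent recovery.
query = 'rate(mysql_global_status_queries{instance="db1063:9104"}[5m])'
resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={
        "query": query,
        "start": "2017-08-13T08:00:00Z",  # approximate window, as in the links above
        "end": "2017-08-13T20:00:00Z",
        "step": "300",
    },
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    values = [float(v) for _, v in series["values"]]
    print(series["metric"].get("instance"),
          "min/max qps:", round(min(values), 1), round(max(values), 1))
```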
It looks like something just happened on db1082 again?
At roughly 16:54, edits on Wikidata dropped off again and there was a spike in the slave lag.
<icinga-wm> PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1953 bytes in 0.132 second response time
5:05 PM <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 413.03 seconds
5:05 PM <icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 432.75 seconds
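For reference, the number icinga alerts on corresponds to replication lag measured on the replica. A minimal manual check would look something like the sketch below; the hostname and credentials are placeholders, and the production slave_sql_lag check may compute lag differently (e.g. via heartbeat) than Seconds_Behind_Master, so the values can differ.

```python
import pymysql

# Placeholder host/credentials; point this at the lagging replica.
conn = pymysql.connect(host="db2038.example.net", user="watchdog",
                       password="secret",
                       cursorclass=pymysql.cursors.DictCursor)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
        # Seconds_Behind_Master is the classic lag indicator; the icinga
        # check may be based on a heartbeat table instead.
        print("Seconds_Behind_Master:", status["Seconds_Behind_Master"])
finally:
    conn.close()
```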
I'm on a plane, but the lag is probably due to the massive spike in UPDATEs that the master (db1063) had.
And this happened again just now:
https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&panelId=1&fullscreen&orgId=1&from=1502980683784&to=1502984283784
And spike on the master again: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1063&var-port=9104&from=1502980743939&to=1502984343939
The execution of:
www-data 20284 0.0 0.1 328216 60744 ? S 15:30 0:00 php5 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata/extensions/Wikibase/repo/maintenance/pruneChanges.php --wiki wikidatawiki --number-of-days=3
matches the start of the issues.
I think this is just UPDATE /* Title::invalidateCache */ based on the binlogs, not the above script.
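A rough way to confirm that from the binlogs is to decode them and tally UPDATE statements by their MediaWiki function-name comment. This is only a sketch, assuming statement-based replication (so the query text is visible in the binlog) and using a placeholder binlog file name.

```python
import subprocess
from collections import Counter

# Placeholder binlog file; use whichever file on db1063 covers the spike.
BINLOG = "/srv/sqldata/db1063-bin.000123"

# Decode the binlog and count UPDATE statements by their leading
# /* Function::name */ comment, e.g. /* Title::invalidateCache */.
proc = subprocess.run(["mysqlbinlog", BINLOG], capture_output=True, text=True)
counts = Counter()
for line in proc.stdout.splitlines():
    if line.startswith("UPDATE") and "/*" in line:
        comment = line.split("/*", 1)[1].split("*/", 1)[0].strip()
        counts[comment] += 1
for comment, n in counts.most_common(10):
    print(f"{n:8d}  {comment}")
```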
This has not happened since T164173 got fixed, I believe, so maybe it was indeed a direct cause.