
2017-07-03 Save Timing spike (300% increase)
Closed, Resolved · Public

Description

The p75 shot all the way up to 2.4s. The edit rate around that time is twice last week's at the same time. Further investigation is required.

https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1&from=1497956786997&to=1499128586997

Screen Shot 2017-07-03 at 17.37.00.png (1×2 px, 158 KB)
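
For reference, "p75" is the 75th percentile of backend save timing: the value below which 75% of save requests complete. A minimal, purely illustrative sketch of that calculation (the sample values are made up; the real metric is aggregated by the statsd/Graphite pipeline, not computed like this):

  <?php
  // Illustrative percentile calculation over made-up save-timing samples (seconds).
  function percentile( array $samples, float $p ): float {
      sort( $samples );
      $rank = (int)ceil( $p * count( $samples ) ) - 1;
      return $samples[ max( 0, $rank ) ];
  }

  $saveTimes = [ 0.6, 0.7, 0.8, 0.9, 1.1, 1.4, 2.4, 3.0 ];
  echo percentile( $saveTimes, 0.75 ), "\n"; // prints 1.4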

Event Timeline

The biggest part of the spike is over; however, we're not back to usual levels yet. It would be nice to understand these spikes regardless.

The spike from today isn't reflected at all in frontend save timing, which might suggest that it's about bot/API edits?

Gilles lowered the priority of this task from High to Medium. Jul 3 2017, 12:08 PM
2017-07-03 02:42 UTC – Increase starts (red vertical line below)
2017-07-03 05:18 UTC – Second increase starts
2017-07-03 06:44 UTC – Peak
2017-07-03 07:12 UTC – Decrease starts
2017-07-03 10:30 UTC – Decrease ends

Edit count, while higher than last week around that time, actually shows a sharp drop right at 02:42 when the increase starts. Probably unrelated, but interesting nonetheless.

Very few events in the Server admin log around that time:

02:40 <l10nupdate@tin> scap sync-l10n completed (1.30.0-wmf.7) (duration: 14m 45s)
05:31 <marostegui> Run redact sanitarium on db1095 - T160869
05:44 <marostegui> Run redact sanitarium on db1069 - T160869
07:24 <volans> bounced uwsgi-graphite-web on graphite1003, log stopped since Jul 2 10:23:45
07:40 <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1034 - T166208 (duration: 03m 00s)
07:51 <marostegui> Deploy alter table db1039 - s7 - T166208
08:02 <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Add comments about db1039 status - T166208 (duration: 02m 49s)
08:25 <gehel> banning elastic1020 from elasticsearch eqiad waiting for its recovery
08:29 <marostegui> Compress dewiki on dbstore2001 - T168354
08:49 <gehel> unbanning elastic1020 from elasticsearch eqiad
09:24 <marostegui> Deploy alter table on s1 directly on codfw master (db2016) and let it replicate - T166204
09:37 <marostegui> Global rename of Markos90 → Mαρκος - T169396

Krinkle renamed this task from 2017-07-03 Save Timing spike to 2017-07-03 Save Timing spike (300% increase). Jul 4 2017, 12:50 AM
Krinkle updated the task description.

Another day has passed, and two more medium-size spikes are on record:

  • Spike 3 on 2017-07-04 from 20:00 to 23:59.
  • Recovery drop on 2017-07-05 from 01:30 to 04:20.
  • Spike 4 on 2017-07-05 from 18:00 to 23:00.
  • Final recovery drop on 2017-07-06 from 01:00 to 04:00.

It's now finally back to last week's levels.

Screen Shot 2017-07-05 at 22.17.24.png (1×2 px, 188 KB)

A few more potentially relevant log entries:

Server Admin Log

2017-07-04 10:40 marostegui: Stop replication on db1102 (sanitarium3) on s2 shard for maintenance - T153743
2017-07-04 14:15 paravoid: reset db2038's iLO
2017-07-04 16:10 moritzm: rebooting radium for kernel update
2017-07-04 16:54 jynus: dropping ukwikimedia from several labsdb hosts
2017-07-04 21:40 volans: ACK'ed puppet not running on stat100[2-3],snapshot100[1,5-7] due to NFS overloaded on dataset1001 - T169680
2017-07-05:
02:27 l10nupdate@tin: scap sync-l10n completed (1.30.0-wmf.7) (duration: 10m 23s)
05:08 marostegui: Force a relearn on db1046's BBU - T166141
05:13 marostegui: Deploy alter table on s7 master - db1062 - T168661
05:18 marostegui: Deploy alter table on s3 master - db1075 - T168661
09:17 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Depool db1085 - T153743 (duration: 02m 50s)
09:27 marostegui: Stop MySQL on db1085 for maintenance - T153743
09:57 marostegui: Deploy alter table on s1 eqiad hosts - T168661
10:41 marostegui: Run redact_sanitarium on s6 databases db1102 - T153743
12:33 marostegui: Stop all replication threads on db1095 for maintenance - T153743
12:49 marostegui: Force BBU relearn on db1016 - T166344
15:54 jynus: restart mysql on db2072
16:33 jynus: restart mysql on db2062
18:14 niharika29@tin: Synchronized wmf-config/InitialiseSettings.php: Enable OOjs UI EditPage buttons on es/fr/it/ja/ru-wiki and meta [mediawiki-config] - https://gerrit.wikimedia.org/r/360370 (duration: 00m 45s)
19:48 RainbowSprinkles: commonswiki: running updateArticleCount.php (against the vslow slave)
20:14 RainbowSprinkles: commonswiki: nevermind that article count thing
22:16 mutante: subra/suhail: disabling puppet, stopping poolcounterd, stopping other services, first step of decom, replaced by poolcounter200[12] (T169506)
22:29 mutante: subra/suhail: re-enabled puppet, now with role::spare, no more poolcounter, scheduled icinga downtimes for decom (T169506)
22:47 Reedy: running mwscript updateArticleCount.php --wiki=commonswiki --update on screen on terbium T169822
2017-07-06 02:31 l10nupdate@tin: scap sync-l10n completed (1.30.0-wmf.7) (duration: 09m 25s)

It seems none of these SAL entries really stands out as a particularly likely cause. Looking at the Logstash "mediawiki-errors" dashboard, however, I found a significant spike that may point to the cause:

https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors (last 7 days)

Screen Shot 2017-07-05 at 22.44.23.png (386×2 px, 88 KB)

Screen Shot 2017-07-05 at 22.46.25.png (376×2 px, 89 KB)

This error count graph roughly fits the Backend Save Timing graph:

Screen Shot 2017-07-05 at 22.17.24.png (1×2 px, 188 KB)

The spike from the normal ~75k errors/hour to over 1M errors/hour happened mostly in the DBReplication channel (31,000,000 entries in 2 days).

Trending messages:

  • Server {host} has {lag} seconds of lag (>= {maxlag}) – 29 million – https://logstash.wikimedia.org/goto/432ace551fa51bc71887d4b852143545 – mostly s1 (enwiki), s4 (commonswiki), and s5 (dewiki, wikidatawiki)
  • All replica DBs lagged. Switch to read-only mode – 2 million – https://logstash.wikimedia.org/goto/6aba1f941f48addb71ade3da2ac5baac – mostly s1 (enwiki), s4 (commonswiki), and s5 (dewiki, wikidatawiki)
  • Pool error on {key}: {error} – 200,000
  • Error connecting to {db_server}: {error} – 150,000
  • Wikimedia\Rdbms\LoadMonitor::getServerStates: host {db_server} is unreachable – 150,000
  • Lowered set() TTL for enwiki:page-restrictions:*:* due to replication lag. – 25,000
  • Lowered set() TTL for global:resourceloader:titleinfo:enwiki:* due to replication lag. – 25,000

While these peaks do align with the save timing regression, the vast majority of the errors come from codfw, where the masters were undergoing database maintenance.

Same query without codfw:

https://logstash.wikimedia.org/goto/290e86c376f96c4ece5aa125986edac3

No longer 4 spikes, just one. Might still be worth investigating:

Screen Shot 2017-07-05 at 23.11.14.png (384×2 px, 87 KB)


Taking a portion of the delayed logs: https://logstash.wikimedia.org/goto/1f62bc3a37a984e10a3ed065b71efe07
That shows codfw being delayed, which was real because we were running a big alter table there (T166204#3400068) but that shouldn't affect anything, no?

Does the backend save timing metric, as it is measured, also include post-response work? If some of the DB work is done after responding to the client and is subject to performance swings, that would explain why we're seeing spikes on the backend save time not reflected on the frontend metric.

If this helps, normally, when we depool the slaves we make sure nothing is running before starting the alters

Does the backend save timing metric, as it is measured, also include post-response work?

We only measure up to request shutdown.
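
To illustrate the distinction being asked about: if deferred work runs after the response has been flushed to the client but before request shutdown, a timer that stops at shutdown includes that work even though the client (and the frontend metric) never waits for it. A rough, self-contained sketch of the effect (the timers and sleeps are stand-ins, not the actual MediaWiki instrumentation):

  <?php
  // Illustrative only: why a metric measured at request shutdown can include
  // post-send work that the client never sees.
  $requestStart = microtime( true );

  usleep( 200000 ); // pretend handling the edit and building the response took ~200ms

  // The response is flushed here; this is roughly what the frontend metric reflects.
  $timeToResponse = microtime( true ) - $requestStart;

  usleep( 300000 ); // pretend post-send deferred work (e.g. secondary DB updates) took ~300ms

  // Measured "up to request shutdown": includes the deferred work above.
  $timeToShutdown = microtime( true ) - $requestStart;

  printf( "to response: %.2fs, to shutdown: %.2fs\n", $timeToResponse, $timeToShutdown );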

It just happened again this UTC morning and is still ongoing:

https://grafana.wikimedia.org/dashboard/db/save-timing?from=1499319574565&to=1499362774565

Last 12 hours:
Screen Shot 2017-07-06 at 10.39.21.png (1×2 px, 212 KB)
Last 7 days:
Screen Shot 2017-07-06 at 10.41.11.png (1×2 px, 193 KB)
Server Admin Log on 2017-07-06

04:56 marostegui: Deploy alter table on s1 eqiad hosts - T168661
05:12 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1055 - T168661 (duration: 00m 43s)
05:51 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1073 - T168661 (duration: 00m 42s)
06:20 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1080 - T168661 (duration: 00m 42s)
06:38 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1083, depool db1089 - T168661 (duration: 00m 43s)
06:45 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1089 - T168661 (duration: 00m 44s)
07:42 moritzm: reboot wasat for kernel update
07:48 jynus@tin: Synchronized wmf-config/db-eqiad.php: Revert parsercaches to pc100[456] (duration: 00m 43s)
12:14 marostegui: Deploy alter table db1083 - https://phabricator.wikimedia.org/T166204
12:48 moritzm: rebooting mc* servers in codfw for kernel update
14:50 moritzm: rebooting prometheus1003/1004 for kernel update
15:12 herron: extend mx[1,2]001 exim log retention to 60 days - T167333
16:45 godog: manually create mwdeploy's new home

Read-only errors:
Screen Shot 2017-07-06 at 11.19.56.png (1×2 px, 249 KB)
Read-only errors (biggest spike):
Screen Shot 2017-07-06 at 11.01.20.png (658×1 px, 76 KB)
Logstash on 2017-07-06 from 09:00 to 17:30

08:20 – 08:40: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (300x per minute). Only on s2.
09:20 – 10:00: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (300x per minute). Only on s5 (dewiki, wikidata) and s7 (eswiki, huwiki, ..).
10:45 – 10:50: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (800x per minute). Only on s2 (enwiktionary, svwiki, cswiki, fiwiki, trwiki)
14:20 – 14:40: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (300x per minute). Only on s1 (enwiki), s5 (dewiki, wikidata), and others.
16:30 – 16:50: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (200x per minute). Only on s2.

ERROR DBReplication Server db1035 is not replicating? (peak: 200x per minute)
ERROR DBReplication Server db1035 is not replicating?
...

The aggregated graph for lag is: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1499327621364&to=1499414021364&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All

Checking some of the DB servers that are supposed to be lagging behind, I cannot see anything on the graphs (i.e. db1026 and db1045):

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1026&from=1499241454200&to=1499414254200
https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1045&from=1499241478591&to=1499414278592

Regarding that error about db1035: that is weird. I don't see any issues in its log (i.e. no broken replication or disconnections from its master).
On that server there is a spike in writes around 02:30 AM: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1035&from=1499384970549&to=1499409131965
But it doesn't correlate with a spike in lag or load or anything similar: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1035&from=1499384970549&to=1499409131965

The write spike happens on the master (so it is expected to propagate to the slave): https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1075&from=1499384970549&to=1499409131965
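
One way to double-check whether a replica such as db1035 is actually lagging or has stopped replicating, independently of the dashboards, is to ask MySQL directly. A minimal sketch (the host and credentials are placeholders; SHOW SLAVE STATUS and its Slave_*_Running / Seconds_Behind_Master columns are standard MySQL, the rest is illustrative):

  <?php
  // Illustrative check of replication state on a single replica.
  $db = new mysqli( 'db1035.example', 'readonly_user', 'secret' );

  $result = $db->query( 'SHOW SLAVE STATUS' );
  $status = $result ? $result->fetch_assoc() : null;

  if ( $status === null ) {
      echo "Not configured as a replica (or the query failed)\n";
  } elseif ( $status['Slave_IO_Running'] !== 'Yes' || $status['Slave_SQL_Running'] !== 'Yes' ) {
      echo "Replication threads not running\n"; // would match the "is not replicating?" error
  } else {
      // NULL here also means replication is effectively broken.
      echo 'Seconds_Behind_Master: ' . var_export( $status['Seconds_Behind_Master'], true ) . "\n";
  }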

We are having more traffic than usual: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1&from=1499241833077&to=1499414633077

Closing for now, as the regression has recovered. Unfortunately we didn't find out what caused it. We did find various suspicious errors in the logs, which I hope will be looked into further, though perhaps better tracked in their own task (e.g. check whether those read-only warnings happen more often, and if so, what is causing them – having several regular 30-minute read-only periods every day is obviously not good service). It could be that MediaWiki is activating read-only mode too soon, or that there are actual lag problems, or something in between.

Krinkle claimed this task.
Krinkle added a subscriber: aaron.

Read-only mode activates for bots sooner, which might lower the edit rate and change the distribution of edits (slower/faster, the ratio with navtiming going up, and such). I'm not sure that would cause spikes by itself. Replication lag does add to save time, though, so it can certainly cause spikes in navtiming due to ChronologyProtector.
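
For anyone following along: the ChronologyProtector effect mentioned here is, roughly, that after a write the client's next request waits for a replica to catch up to the recorded master position before reading from it, so replica lag translates directly into added request time. A much-simplified sketch of that wait (the function names and position values are hypothetical, not the actual Wikimedia\Rdbms\ChronologyProtector code):

  <?php
  // Illustrative only: wait until a replica has reached a recorded master position.
  function waitForReplica( callable $getReplicaPos, int $masterPosAfterWrite, float $timeoutSec ): bool {
      $deadline = microtime( true ) + $timeoutSec;
      while ( microtime( true ) < $deadline ) {
          if ( $getReplicaPos() >= $masterPosAfterWrite ) {
              return true; // replica caught up; safe to read from it
          }
          usleep( 100000 ); // each 100ms poll adds directly to the request time
      }
      return false; // timed out; the lag shows up as extra latency (or stale reads)
  }

  // Example: a replica that advances 40 "positions" per poll catches up after ~0.2s of waiting.
  $pos = 0;
  $caughtUp = waitForReplica( function () use ( &$pos ) {
      return $pos += 40; // simulated replica position advancing each poll
  }, 100, 5.0 );
  var_dump( $caughtUp ); // bool(true)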