
2017-07-03 Save Timing spike (300% increase)
Closed, Resolved · Public

Description

The p75 shot all the way up to 2.4s. The edit rate around that time is twice last week's at the same time. Further investigation is required.

https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1&from=1497956786997&to=1499128586997

Screen Shot 2017-07-03 at 17.37.00.png (1×2 px, 158 KB)
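
For reference, "p75" is the 75th percentile of backend save timing: the value below which 75% of save requests complete. A minimal, purely illustrative sketch of that calculation (the sample values are made up; the real metric is aggregated by the statsd/Graphite pipeline, not computed like this):

  <?php
  // Illustrative percentile calculation over made-up save-timing samples (seconds).
  function percentile( array $samples, float $p ): float {
      sort( $samples );
      $rank = (int)ceil( $p * count( $samples ) ) - 1;
      return $samples[ max( 0, $rank ) ];
  }

  $saveTimes = [ 0.6, 0.7, 0.8, 0.9, 1.1, 1.4, 2.4, 3.0 ];
  echo percentile( $saveTimes, 0.75 ), "\n"; // prints 1.4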

Event Timeline

The biggest part of the spike is over; however, we're not back to usual levels yet. It would be nice to understand these spikes regardless.

The spike from today isn't reflected at all in frontend save timing, which might suggest that it's about bot/API edits?

Gilles lowered the priority of this task from High to Medium. Jul 3 2017, 12:08 PM
2017-07-03 02:42 UTC – Increase starts (red vertical line below)
2017-07-03 05:18 UTC – Second increase starts
2017-07-03 06:44 UTC – Peak
2017-07-03 07:12 UTC – Decrease starts
2017-07-03 10:30 UTC – Decrease ends

Edit count, while higher than last week around that time, actually shows a sharp drop right at 02:42 when the increase starts. Probably unrelated, but interesting nonetheless.

Very few events in the Server admin log around that time:

02:40 <l10nupdate@tin> scap sync-l10n completed (1.30.0-wmf.7) (duration: 14m 45s)
05:31 <marostegui> Run redact sanitarium on db1095 - T160869
05:44 <marostegui> Run redact sanitarium on db1069 - T160869
07:24 <volans> bounced uwsgi-graphite-web on graphite1003, log stopped since Jul 2 10:23:45
07:40 <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1034 - T166208 (duration: 03m 00s)
07:51 <marostegui> Deploy alter table db1039 - s7 - T166208
08:02 <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Add comments about db1039 status - T166208 (duration: 02m 49s)
08:25 <gehel> banning elastic1020 from elasticsearch eqiad waiting for its recovery
08:29 <marostegui> Compress dewiki on dbstore2001 - T168354
08:49 <gehel> unbanning elastic1020 from elasticsearch eqiad
09:24 <marostegui> Deploy alter table on s1 directly on codfw master (db2016) and let it replicate - T166204
09:37 <marostegui> Global rename of Markos90 → Mαρκος - T169396

Krinkle renamed this task from 2017-07-03 Save Timing spike to 2017-07-03 Save Timing spike (300% increase). Jul 4 2017, 12:50 AM
Krinkle updated the task description.

Another day has passed, and two more medium-size spikes are on record:

  • Spike 3 on 2017-07-04 from 20:00 to 23:59.
  • Recovery drop on 2017-07-05 from 01:30 to 04:20.
  • Spike 4 on 2017-07-05 from 18:00 to 23:00.
  • Final recovery drop on 2017-07-06 from 01:00 to 04:00.

It's now finally back to last week's levels.

Screen Shot 2017-07-05 at 22.17.24.png (1×2 px, 188 KB)

A few more potentially relevant log entries:

Server Admin Log

2017-07-04 10:40 marostegui: Stop replication on db1102 (sanitarium3) on s2 shard for maintenance - T153743
2017-07-04 14:15 paravoid: reset db2038's iLO
2017-07-04 16:10 moritzm: rebooting radium for kernel update
2017-07-04 16:54 jynus: dropping ukwikimedia from several labsdb hosts
2017-07-04 21:40 volans: ACK'ed puppet not running on stat100[2-3],snapshot100[1,5-7] due to NFS overloaded on dataset1001 - T169680
2017-07-05:
02:27 l10nupdate@tin: scap sync-l10n completed (1.30.0-wmf.7) (duration: 10m 23s)
05:08 marostegui: Force a relearn on db1046's BBU - T166141
05:13 marostegui: Deploy alter table on s7 master - db1062 - T168661
05:18 marostegui: Deploy alter table on s3 master - db1075 - T168661
09:17 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Depool db1085 - T153743 (duration: 02m 50s)
09:27 marostegui: Stop MySQL on db1085 for maintenance - T153743
09:57 marostegui: Deploy alter table on s1 eqiad hosts - T168661
10:41 marostegui: Run redact_sanitarium on s6 databases db1102 - T153743
12:33 marostegui: Stop all replication threads on db1095 for maintenance - T153743
12:49 marostegui: Force BBU relearn on db1016 - T166344
15:54 jynus: restart mysql on db2072
16:33 jynus: restart mysql on db2062
18:14 niharika29@tin: Synchronized wmf-config/InitialiseSettings.php: Enable OOjs UI EditPage buttons on es/fr/it/ja/ru-wiki and meta [mediawiki-config] - https://gerrit.wikimedia.org/r/360370 (duration: 00m 45s)
19:48 RainbowSprinkles: commonswiki: running updateArticleCount.php (against the vslow slave)
20:14 RainbowSprinkles: commonswiki: nevermind that article count thing
22:16 mutante: subra/suhail: disabling puppet, stopping poolcounterd, stopping other services, first step of decom, replaced by poolcounter200[12] (T169506)
22:29 mutante: subra/suhail: re-enabled puppet, now with role::spare, no more poolcounter, scheduled icinga downtimes for decom (T169506)
22:47 Reedy: running mwscript updateArticleCount.php --wiki=commonswiki --update on screen on terbium T169822
2017-07-06 02:31 l10nupdate@tin: scap sync-l10n completed (1.30.0-wmf.7) (duration: 09m 25s)

It seems none of these SAL entries really stands out as a particularly likely cause. Looking at the Logstash "mediawiki-errors" dashboard, however, I found a significant spike that may point to the cause:

https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors (last 7 days)

Screen Shot 2017-07-05 at 22.44.23.png (386×2 px, 88 KB)

Screen Shot 2017-07-05 at 22.46.25.png (376×2 px, 89 KB)

This error count graph roughly fits the Backend Save Timing graph:

Screen Shot 2017-07-05 at 22.17.24.png (1×2 px, 188 KB)

The spike from the normal ~75k errors/hour to over 1M errors/hour happened mostly in the DBReplication channel (31,000,000 entries in 2 days).

Trending messages:

  • Server {host} has {lag} seconds of lag (>= {maxlag}) – 29 million – https://logstash.wikimedia.org/goto/432ace551fa51bc71887d4b852143545 – mostly s1 (enwiki), s4 (commonswiki), and s5 (dewiki, wikidatawiki)
  • All replica DBs lagged. Switch to read-only mode – 2 million – https://logstash.wikimedia.org/goto/6aba1f941f48addb71ade3da2ac5baac – mostly s1 (enwiki), s4 (commonswiki), and s5 (dewiki, wikidatawiki)
  • Pool error on {key}: {error} – 200,000
  • Error connecting to {db_server}: {error} – 150,000
  • Wikimedia\Rdbms\LoadMonitor::getServerStates: host {db_server} is unreachable – 150,000
  • Lowered set() TTL for enwiki:page-restrictions:*:* due to replication lag. – 25,000
  • Lowered set() TTL for global:resourceloader:titleinfo:enwiki:* due to replication lag. – 25,000

While these peaks do align with the save timing regression, the vast majority of the errors come from codfw, where the masters were undergoing database maintenance.

Same query without codfw:

https://logstash.wikimedia.org/goto/290e86c376f96c4ece5aa125986edac3

No longer 4 spikes, just one. Might still be worth investigating:

Screen Shot 2017-07-05 at 23.11.14.png (384×2 px, 87 KB)


Taking a portion of the delayed logs: https://logstash.wikimedia.org/goto/1f62bc3a37a984e10a3ed065b71efe07
That shows codfw being delayed, which was real because we were running a big alter table there (T166204#3400068) but that shouldn't affect anything, no?

Does the backend save timing metric, as it is measured, also include post-response work? If some of the DB work is done after responding to the client and is subject to performance swings, that would explain why we're seeing spikes on the backend save time not reflected on the frontend metric.

If this helps, normally, when we depool the slaves we make sure nothing is running before starting the alters

Does the backend save timing metric, as it is measured, also include post-response work?

We only measure up to request shutdown.
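
To illustrate the distinction being asked about: if deferred work runs after the response has been flushed to the client but before request shutdown, a timer that stops at shutdown includes that work even though the client (and the frontend metric) never waits for it. A rough, self-contained sketch of the effect (the timers and sleeps are stand-ins, not the actual MediaWiki instrumentation):

  <?php
  // Illustrative only: why a metric measured at request shutdown can include
  // post-send work that the client never sees.
  $requestStart = microtime( true );

  usleep( 200000 ); // pretend handling the edit and building the response took ~200ms

  // The response is flushed here; this is roughly what the frontend metric reflects.
  $timeToResponse = microtime( true ) - $requestStart;

  usleep( 300000 ); // pretend post-send deferred work (e.g. secondary DB updates) took ~300ms

  // Measured "up to request shutdown": includes the deferred work above.
  $timeToShutdown = microtime( true ) - $requestStart;

  printf( "to response: %.2fs, to shutdown: %.2fs\n", $timeToResponse, $timeToShutdown );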

It just happened again this UTC morning and is still ongoing:

https://grafana.wikimedia.org/dashboard/db/save-timing?from=1499319574565&to=1499362774565

Last 12 hours:
Screen Shot 2017-07-06 at 10.39.21.png (1×2 px, 212 KB)
Last 7 days:
Screen Shot 2017-07-06 at 10.41.11.png (1×2 px, 193 KB)
Server Admin Log on 2017-07-06

04:56 marostegui: Deploy alter table on s1 eqiad hosts - T168661
05:12 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1055 - T168661 (duration: 00m 43s)
05:51 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1073 - T168661 (duration: 00m 42s)
06:20 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1080 - T168661 (duration: 00m 42s)
06:38 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1083, depool db1089 - T168661 (duration: 00m 43s)
06:45 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1089 - T168661 (duration: 00m 44s)
07:42 moritzm: reboot wasat for kernel update
07:48 jynus@tin: Synchronized wmf-config/db-eqiad.php: Revert parsercaches to pc100[456] (duration: 00m 43s)
12:14 marostegui: Deploy alter table db1083 - https://phabricator.wikimedia.org/T166204
12:48 moritzm: rebooting mc* servers in codfw for kernel update
14:50 moritzm: rebooting prometheus1003/1004 for kernel update
15:12 herron: extend mx[1,2]001 exim log retention to 60 days - T167333
16:45 godog: manually create mwdeploy's new home

Read-only errors:
Screen Shot 2017-07-06 at 11.19.56.png (1×2 px, 249 KB)
Read-only errors (biggest spike):
Screen Shot 2017-07-06 at 11.01.20.png (658×1 px, 76 KB)
Logstash on 2017-07-06 from 09:00 to 17:30

08:20 – 08:40: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (300x per minute). Only on s2.
09:20 – 10:00: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (300x per minute). Only on s5 (dewiki, wikidata) and s7 (eswiki, huwiki, ..).
10:45 – 10:50: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (800x per minute). Only on s2 (enwiktionary, svwiki, cswiki, fiwiki, trwiki)
14:20 – 14:40: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (300x per minute). Only on s1 (enwiki), s5 (dewiki, wikidata), and others.
16:30 – 16:50: Spike in DBReplication with "All replica DBs lagged. Switch to read-only mode" (200x per minute). Only on s2.

ERROR DBReplication Server db1035 is not replicating? (peak: 200x per minute)
ERROR DBReplication Server db1035 is not replicating?
...

The aggregated graph for lag is: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=1499327621364&to=1499414021364&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All

Checking some of the DB servers that are supposed to be lagging behind, I cannot see anything on the graphs (i.e. db1026 and db1045):

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1026&from=1499241454200&to=1499414254200
https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1045&from=1499241478591&to=1499414278592

Regarding that error about db1035: that is weird. I don't see any issues in its log (i.e. no broken replication or disconnections from its master).
On that server there is a spike in writes around 02:30 AM: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1035&from=1499384970549&to=1499409131965
But it doesn't correlate with a spike in lag or load or anything similar: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1035&from=1499384970549&to=1499409131965

The write spike happens on the master (so it is expected to propagate to the slave): https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1075&from=1499384970549&to=1499409131965
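
One way to double-check whether a replica such as db1035 is actually lagging or has stopped replicating, independently of the dashboards, is to ask MySQL directly. A minimal sketch (the host and credentials are placeholders; SHOW SLAVE STATUS and its Slave_*_Running / Seconds_Behind_Master columns are standard MySQL, the rest is illustrative):

  <?php
  // Illustrative check of replication state on a single replica.
  $db = new mysqli( 'db1035.example', 'readonly_user', 'secret' );

  $result = $db->query( 'SHOW SLAVE STATUS' );
  $status = $result ? $result->fetch_assoc() : null;

  if ( $status === null ) {
      echo "Not configured as a replica (or the query failed)\n";
  } elseif ( $status['Slave_IO_Running'] !== 'Yes' || $status['Slave_SQL_Running'] !== 'Yes' ) {
      echo "Replication threads not running\n"; // would match the "is not replicating?" error
  } else {
      // NULL here also means replication is effectively broken.
      echo 'Seconds_Behind_Master: ' . var_export( $status['Seconds_Behind_Master'], true ) . "\n";
  }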

We are having more traffic than usual: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1&from=1499241833077&to=1499414633077

Closing for now, as the regression has recovered. Unfortunately we didn't find out what caused it. We did find various suspicious errors in the logs, which I hope will be looked into further, though perhaps better tracked in their own task (e.g. check whether those read-only warnings happen more often, and if so, what is causing them – having several regular 30-minute read-only periods every day is obviously not good service). It could be that MediaWiki is activating read-only mode too soon, or that there are actual lag problems, or something in between.

Krinkle claimed this task.
Krinkle added a subscriber: aaron.

Read-only mode activates for bots sooner, which might lower the edit rate and change the distribution of edits (slower/faster, the ratio with navtiming going up, and such). I'm not sure that would cause spikes by itself. Replication lag does add to save time, though, so it can certainly cause spikes in navtiming due to ChronologyProtector.
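
For anyone following along: the ChronologyProtector effect mentioned here is, roughly, that after a write the client's next request waits for a replica to catch up to the recorded master position before reading from it, so replica lag translates directly into added request time. A much-simplified sketch of that wait (the function names and position values are hypothetical, not the actual Wikimedia\Rdbms\ChronologyProtector code):

  <?php
  // Illustrative only: wait until a replica has reached a recorded master position.
  function waitForReplica( callable $getReplicaPos, int $masterPosAfterWrite, float $timeoutSec ): bool {
      $deadline = microtime( true ) + $timeoutSec;
      while ( microtime( true ) < $deadline ) {
          if ( $getReplicaPos() >= $masterPosAfterWrite ) {
              return true; // replica caught up; safe to read from it
          }
          usleep( 100000 ); // each 100ms poll adds directly to the request time
      }
      return false; // timed out; the lag shows up as extra latency (or stale reads)
  }

  // Example: a replica that advances 40 "positions" per poll catches up after ~0.2s of waiting.
  $pos = 0;
  $caughtUp = waitForReplica( function () use ( &$pos ) {
      return $pos += 40; // simulated replica position advancing each poll
  }, 100, 5.0 );
  var_dump( $caughtUp ); // bool(true)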