As can be seen here:
the amount of state transitions in pybal backends has grown a lot over the last few weeks. We need to understand why, at times, fetching the enwiki blank page takes so long that pybal fails to get its response in time.
Joe | |
May 21 2019, 6:45 AM |
F29180185: s1.png | |
May 21 2019, 9:49 AM |
F29181587: db1099.png | |
May 21 2019, 9:49 AM |
F29181625: db1106.png | |
May 21 2019, 9:49 AM |
F29181668: db1118.png | |
May 21 2019, 9:49 AM |
F29181168: db1080.png | |
May 21 2019, 9:49 AM |
F29181355: db1089.png | |
May 21 2019, 9:49 AM |
F29181064: s1master.png | |
May 21 2019, 9:49 AM |
F29181304: db1083.png | |
May 21 2019, 9:49 AM |
As can be seen here:
the amount of state transitions in pybal backends has grown a lot over the last few weeks. We need to understand why, at times, fetching the enwiki blank page takes so long that pybal fails to get its response in time.
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Switch off php7 for investigation of production instabilities | operations/mediawiki-config | master | +1 -1 |
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | Joe | T223952 Increased instability in MediaWiki backends (according to load balancers) | |||
Resolved | Anomie | T224017 Slow query ApiQueryRevisions on enwiki |
While there is no evidence that the increase in traffic sent to php7 is the cause of this increase in errors, there are a couple tasks that we'd need to investigate better and that could point at some form of resource starvation due to the coexistence of php7 and HHVM. See T223310 and T223647. So my first order of business will be to turn the php7 sampling back to zero for a couple days, to see if that changes the status quo for pybal's checks.
Change 511632 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Switch off php7 for investigation of production instabilities
Change 511632 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch off php7 for investigation of production instabilities
Mentioned in SAL (#wikimedia-operations) [2019-05-21T06:59:03Z] <oblivian@deploy1001> Synchronized wmf-config/CommonSettings.php: Turning off php7 sampling for investigation in T223952 (duration: 00m 53s)
While we have to wait and see if the absence of php7 traffic improves the situation (and in that case, why is that the case), I've noticed one fact about the latest cases of such instabilities:
This all makes me think that if the problem is concentrated on enwiki alone, across both API and appserver (where the problem was less marked anyways), it could have something to do with some database suffering. This will need to be dug into further to confirm.
Having confirmed pybal is not an issue here, while databases might have a part in the issue, I'll change the tags accordingly.
Also, I think we're clear to restore the php7 A/B test at this point.
From the DB land this is what I have seen regarding s1:
17:59 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SDC: Enable Wikidata federation on Commons again T214075 (duration: 01m 00s) 17:20 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enabling EventGate api-request logging on group1 wikis (duration: 01m 00s) 17:18 mutante: LDAP - added 'brennen' to group 'gerritadmin' (T218858) 17:18 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/extensions/OATHAuth/: UBN T221257 train un-blocker (duration: 01m 02s) 17:09 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/extensions/Echo/includes/formatters/: Notifications: Revert 7121b9c4 per I8f9a6a19ba (duration: 01m 01s) 16:49 tzatziki: deleting three files for legal compliance 16:47 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/extensions/WikibaseMediaInfo/: SDC: Various fixes T218922 T221071 T221110 T221123 (duration: 01m 02s) 16:41 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/autoload.php: Update to point to new maintenance scripts (duration: 01m 00s) 16:39 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/maintenance/language/generateUpperCharTable.php: Maintenance script for _joe_ (duration: 00m 59s) 16:38 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/maintenance/language/generateUcfirstOverrides.php: Maintenance script for _joe_ (duration: 01m 00s) 16:21 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/languages/Language.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1 try two (duration: 01m 00s) 16:18 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/includes/DefaultSettings.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1b (duration: 01m 03s) 16:13 jforrester@deploy1001: scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) 16:11 XioNoX: set fasw-c-eqiad:ge-[0-1]/0/17 in admin vlan - T221232 16:09 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT T220434 Deploy Partial blocks to Chinese Wikipedia (duration: 01m 02s) 14:37 ariel@deploy1001: Finished deploy [dumps/dumps@dcf04a0]: fix up paths for 1.34_wmf.1 for AbstractFilter (duration: 00m 04s) 14:36 ariel@deploy1001: Started deploy [dumps/dumps@dcf04a0]: fix up paths for 1.34_wmf.1 for AbstractFilter 14:35 otto@deploy1001: scap-helm eventgate-analytics finished 14:35 otto@deploy1001: scap-helm eventgate-analytics cluster eqiad completed 14:35 otto@deploy1001: scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] 14:34 otto@deploy1001: scap-helm eventgate-analytics finished 14:34 otto@deploy1001: scap-helm eventgate-analytics cluster codfw completed 14:34 otto@deploy1001: scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] 14:13 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 14:12 elukey@cumin1001: START - Cookbook sre.hosts.downtime 13:56 otto@deploy1001: scap-helm eventgate-analytics finished 13:56 otto@deploy1001: scap-helm eventgate-analytics cluster staging completed 13:56 otto@deploy1001: scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] 13:52 elukey: upgrading hadoop cdh distrubition to 5.16.1 on all the Hadoop-related nodes - T218343 13:48 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 13:48 elukey@cumin1001: START - Cookbook sre.hosts.downtime 13:48 godog: reimage prometheus2004 - T187987 12:57 filippo@puppetmaster1001: conftool action : set/pooled=yes; selector: name=prometheus1004.eqiad.wmnet 12:44 godog: bounce prometheus instances on prometheus[12]003 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/499742 12:33 moritzm: running some ferm tests on graphite2002 12:10 godog: briefly stop all prometheus on prometheus1003 to finish metrics rsync - T187987 11:39 Lucas_WMDE: EU SWAT done 11:38 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable suggestion constraint status on testwikidata (T221108, T204439) (duration: 01m 01s)
The increase on query latency can be seen on individual hosts (specially on API hosts), including the master:
db1067 - s1 master:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1067&var-port=9104&from=1554564819008&to=1556943710021
db1080 - api + main traffic host:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&from=1554564819008&to=1556943710021&var-dc=eqiad%20prometheus%2Fops&var-server=db1080&var-port=9104
db1083 - main traffic host:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&from=1554564819008&to=1556943710021&var-dc=eqiad%20prometheus%2Fops&var-server=db1083&var-port=9104
db1089 - main traffic host:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&from=1554564819008&to=1556943710021&var-dc=eqiad%20prometheus%2Fops&var-server=db1089&var-port=9104
This behaviour isn't seeing on the special slaves (those only serving recentchanges, watchlist, logpager, recentchangeslinked, contributions)
db1099 - special slave:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&from=1554564819008&to=1556943710021&var-dc=eqiad%20prometheus%2Fops&var-server=db1099&var-port=13311
db1105, the other special slaves shows the same traffic as db1099, so not pasting it here.
db1106 - dump slave with very little amount of main traffic weight shows a smaller increase:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&from=1554564819008&to=1556943710021&var-dc=eqiad%20prometheus%2Fops&var-server=db1106&var-port=9104
db1118 - main traffic + api traffic shows almost no increase:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&from=1554564819008&to=1556943710021&var-dc=eqiad%20prometheus%2Fops&var-server=db1118&var-port=9104
db1119 - main traffic + api traffic shows a smaller increase as well
https://grafana.wikimedia.org/d/000000273/mysql?panelId=40&fullscreen&orgId=1&from=1554564819008&to=1556943710021&var-dc=eqiad%20prometheus%2Fops&var-server=db1119&var-port=9104
LB weights that day for s1:
's1' => [ 'db1067' => 0, # C6 2.8TB 160GB, # master 'db1080' => 200, # A2 3.6TB 512GB, api 'db1083' => 500, # B1 3.6TB 512GB # candidate master 'db1089' => 500, # C3 3.6TB 512GB 'db1099:3311' => 50, # B2 3.6TB 512GB # rc, log: s1 and s8 'db1105:3311' => 50, # C3 3.6TB 512GB # rc, log: s1 and s2 'db1106' => 50, # D3 3.6TB 512GB, vslow, dump # master for sanitarium db1124 'db1118' => 200, # B1 3.6TB 512GB, api 'db1119' => 200, # B8 3.6TB 512GB, api 'groupLoadsBySection' => [ 's1' => [ 'watchlist' => [ 'db1099:3311' => 1, 'db1105:3311' => 1, ], 'recentchanges' => [ 'db1099:3311' => 1, 'db1105:3311' => 1, ], 'recentchangeslinked' => [ 'db1099:3311' => 1, 'db1105:3311' => 1, ], 'contributions' => [ 'db1099:3311' => 1, 'db1105:3311' => 1, ], 'logpager' => [ 'db1099:3311' => 1, 'db1105:3311' => 1, ], 'dump' => [ 'db1106' => 1, ], 'vslow' => [ 'db1106' => 1, ], 'api' => [ 'db1080' => 1, 'db1118' => 1, 'db1119' => 1,
db-eqiad.php DB config was changed twice that day but for s3, and it was just a repool from a host that was down for maintenance:
https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/504392/
https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/504513/
Regarding slow queries, I cannot go and search for the ones happening that exact day (as it is more than a month ago), but I have taken a look at the last 5 days and there are not outstanding queries there either.
However, by looking at those graphs, it is clear that something happened and increased query response time on s1.
There is also something that started to generate more temporary tables starting the 2nd of May on all s1 hosts:
db1089
https://grafana.wikimedia.org/d/000000273/mysql?panelId=7&fullscreen&orgId=1&from=1556640896961&to=1557094763628&var-dc=eqiad%20prometheus%2Fops&var-server=db1083&var-port=9104
19:27 thcipriani@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.3
That increase on temporary tables has happened on pretty much any big wiki, not only on s1, but also on s8 for instance.
Let's add Performance-Team to see if the can provide more ideas.
Overall, while the response time increased on s1 as has been seen above, I am not sure if it can explain all the symptoms that have been seen so far.
We saw one of these events at 14:48 today and pybal reported fetch failures for -- and wanted to depool -- basically the entire appserver fleet https://phabricator.wikimedia.org/P8551
Which matches a spike of the queries reported at T224017: Slow query ApiQueryRevisions on enwiki that arrived to the API dbs for enwiki at 14:48 too.
Update from the subtask related to the slow query detected (which might or might not be the cause for this): I have removed db1118 from API and placed db1134 instead which looks like it has a good query plan, choosing the right index T224017#5202152
I am removing the DBA tag for now here (but remain subscribed), as we are following up on T224017: Slow query ApiQueryRevisions on enwiki (which is not yet clear if it is definitely causing this or not although the timing matches quite often and needs fixing anyways)
The slow query that is being investigated at T224017 is only reported as slow when 'rev_page'= '35159700'.
Which is https://en.wikipedia.org/wiki/User:Cyberpower678/RfX_Report and that page has around 80k revisions
root@db2103.codfw.wmnet[enwiki]> select count(*) from revision where rev_page='35159700'; +----------+ | count(*) | +----------+ | 80040 | +----------+ 1 row in set (0.07 sec)
Also:
https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/User:Cyberpower678/RfX_Report
I've looked back in the last week or so and we don't see those kind of instabilities specific to api servers anymore. Resolving.