Page MenuHomePhabricator

Increased instability in MediaWiki backends (according to load balancers)
Closed, ResolvedPublic

Assigned To
Authored By
May 21 2019, 6:45 AM
Referenced Files
F29180185: s1.png
May 21 2019, 9:49 AM
F29181587: db1099.png
May 21 2019, 9:49 AM
F29181625: db1106.png
May 21 2019, 9:49 AM
F29181668: db1118.png
May 21 2019, 9:49 AM
F29181168: db1080.png
May 21 2019, 9:49 AM
F29181355: db1089.png
May 21 2019, 9:49 AM
F29181064: s1master.png
May 21 2019, 9:49 AM
F29181304: db1083.png
May 21 2019, 9:49 AM


As can be seen here:

the amount of state transitions in pybal backends has grown a lot over the last few weeks. We need to understand why, at times, fetching the enwiki blank page takes so long that pybal fails to get its response in time.

Event Timeline

Joe triaged this task as High priority.

While there is no evidence that the increase in traffic sent to php7 is the cause of this increase in errors, there are a couple tasks that we'd need to investigate better and that could point at some form of resource starvation due to the coexistence of php7 and HHVM. See T223310 and T223647. So my first order of business will be to turn the php7 sampling back to zero for a couple days, to see if that changes the status quo for pybal's checks.

Change 511632 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Switch off php7 for investigation of production instabilities

Change 511632 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch off php7 for investigation of production instabilities

Mentioned in SAL (#wikimedia-operations) [2019-05-21T06:59:03Z] <oblivian@deploy1001> Synchronized wmf-config/CommonSettings.php: Turning off php7 sampling for investigation in T223952 (duration: 00m 53s)

While we have to wait and see if the absence of php7 traffic improves the situation (and in that case, why is that the case), I've noticed one fact about the latest cases of such instabilities:

  • They have a correspondence between the fetch time reported by pybal and what the appservers report
  • Looking at logs, the effect seems to be much more prominent in the last two days (starting on May 20th): on couple random application servers, the number of timeouts (e.g. requests taking more than 5 seconds to complete) more than doubled compared to preceding periods
  • Again looking at apache logs, the slow requests around the time of pybal's show the problem is concentrated on enwiki. All data from a single appserver:
    • enwiki constitutes 35% of requests of the total in the 10 seconds around pybal's detection; of those, 22% took 5 seconds or longer to complete
    • In the same time period, only 0.3% of all requests not going to enwiki took longer than 5 seconds
  • Verifying across the fleet with some cumin sorcery, it seems almost all requests taking more than 5 seconds at the time were on enwiki

This all makes me think that if the problem is concentrated on enwiki alone, across both API and appserver (where the problem was less marked anyways), it could have something to do with some database suffering. This will need to be dug into further to confirm.

Having confirmed pybal is not an issue here, while databases might have a part in the issue, I'll change the tags accordingly.

Also, I think we're clear to restore the php7 A/B test at this point.

Marostegui added a subscriber: Marostegui.

From the DB land this is what I have seen regarding s1:

  • Small increase on query response time starting the 17th of April which hasn't fully gotten back to previous values

s1.png (500×1 px, 102 KB)

SAL entries around that time range:

17:59 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SDC: Enable Wikidata federation on Commons again T214075 (duration: 01m 00s)
17:20 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Enabling EventGate api-request logging on group1 wikis (duration: 01m 00s)
17:18 mutante: LDAP - added 'brennen' to group 'gerritadmin' (T218858)
17:18 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/extensions/OATHAuth/: UBN T221257 train un-blocker (duration: 01m 02s)
17:09 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/extensions/Echo/includes/formatters/: Notifications: Revert 7121b9c4 per I8f9a6a19ba (duration: 01m 01s)
16:49 tzatziki: deleting three files for legal compliance
16:47 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/extensions/WikibaseMediaInfo/: SDC: Various fixes T218922 T221071 T221110 T221123 (duration: 01m 02s)
16:41 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/autoload.php: Update to point to new maintenance scripts (duration: 01m 00s)
16:39 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/maintenance/language/generateUpperCharTable.php: Maintenance script for _joe_ (duration: 00m 59s)
16:38 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/maintenance/language/generateUcfirstOverrides.php: Maintenance script for _joe_ (duration: 01m 00s)
16:21 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/languages/Language.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1 try two (duration: 01m 00s)
16:18 jforrester@deploy1001: Synchronized php-1.34.0-wmf.1/includes/DefaultSettings.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1b (duration: 01m 03s)
16:13 jforrester@deploy1001: scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see for details)
16:11 XioNoX: set fasw-c-eqiad:ge-[0-1]/0/17 in admin vlan - T221232
16:09 jforrester@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT T220434 Deploy Partial blocks to Chinese Wikipedia (duration: 01m 02s)
14:37 ariel@deploy1001: Finished deploy [dumps/dumps@dcf04a0]: fix up paths for 1.34_wmf.1 for AbstractFilter (duration: 00m 04s)
14:36 ariel@deploy1001: Started deploy [dumps/dumps@dcf04a0]: fix up paths for 1.34_wmf.1 for AbstractFilter
14:35 otto@deploy1001: scap-helm eventgate-analytics finished
14:35 otto@deploy1001: scap-helm eventgate-analytics cluster eqiad completed
14:35 otto@deploy1001: scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad]
14:34 otto@deploy1001: scap-helm eventgate-analytics finished
14:34 otto@deploy1001: scap-helm eventgate-analytics cluster codfw completed
14:34 otto@deploy1001: scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw]
14:13 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
14:12 elukey@cumin1001: START - Cookbook sre.hosts.downtime
13:56 otto@deploy1001: scap-helm eventgate-analytics finished
13:56 otto@deploy1001: scap-helm eventgate-analytics cluster staging completed
13:56 otto@deploy1001: scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging]
13:52 elukey: upgrading hadoop cdh distrubition to 5.16.1 on all the Hadoop-related nodes - T218343
13:48 elukey@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
13:48 elukey@cumin1001: START - Cookbook sre.hosts.downtime
13:48 godog: reimage prometheus2004 - T187987
12:57 filippo@puppetmaster1001: conftool action : set/pooled=yes; selector: name=prometheus1004.eqiad.wmnet
12:44 godog: bounce prometheus instances on prometheus[12]003 after
12:33 moritzm: running some ferm tests on graphite2002
12:10 godog: briefly stop all prometheus on prometheus1003 to finish metrics rsync - T187987
11:39 Lucas_WMDE: EU SWAT done
11:38 lucaswerkmeister-wmde@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable suggestion constraint status on testwikidata (T221108, T204439) (duration: 01m 01s)

The increase on query latency can be seen on individual hosts (specially on API hosts), including the master:
db1067 - s1 master:

s1master.png (500×1 px, 91 KB)

db1080 - api + main traffic host:

db1080.png (500×1 px, 118 KB)

db1083 - main traffic host:

db1083.png (500×1 px, 104 KB)

db1089 - main traffic host:

db1089.png (500×1 px, 108 KB)

This behaviour isn't seeing on the special slaves (those only serving recentchanges, watchlist, logpager, recentchangeslinked, contributions)
db1099 - special slave:

db1099.png (500×1 px, 76 KB)

db1105, the other special slaves shows the same traffic as db1099, so not pasting it here.

db1106 - dump slave with very little amount of main traffic weight shows a smaller increase:

db1106.png (500×1 px, 44 KB)

db1118 - main traffic + api traffic shows almost no increase:

db1118-1.png (500×1 px, 53 KB)

db1119 - main traffic + api traffic shows a smaller increase as well

db1118.png (500×1 px, 42 KB)

LB weights that day for s1:

	's1' => [
		'db1067' => 0,      # C6 2.8TB 160GB, # master
		'db1080' => 200,    # A2 3.6TB 512GB, api
		'db1083' => 500,    # B1 3.6TB 512GB # candidate master
		'db1089' => 500,    # C3 3.6TB 512GB
		'db1099:3311' => 50, # B2 3.6TB 512GB # rc, log: s1 and s8
		'db1105:3311' => 50, # C3 3.6TB 512GB # rc, log: s1 and s2
		'db1106' => 50,     # D3 3.6TB 512GB, vslow, dump # master for sanitarium db1124
		'db1118' => 200,    # B1 3.6TB 512GB, api
		'db1119' => 200,    # B8 3.6TB 512GB, api

'groupLoadsBySection' => [
	's1' => [
		'watchlist' => [
			'db1099:3311' => 1,
			'db1105:3311' => 1,
		'recentchanges' => [
			'db1099:3311' => 1,
			'db1105:3311' => 1,
		'recentchangeslinked' => [
			'db1099:3311' => 1,
			'db1105:3311' => 1,
		'contributions' => [
			'db1099:3311' => 1,
			'db1105:3311' => 1,
		'logpager' => [
			'db1099:3311' => 1,
			'db1105:3311' => 1,
		'dump' => [
			'db1106' => 1,
		'vslow' => [
			'db1106' => 1,
		'api' => [
			'db1080' => 1,
			'db1118' => 1,
			'db1119' => 1,

db-eqiad.php DB config was changed twice that day but for s3, and it was just a repool from a host that was down for maintenance:

Regarding slow queries, I cannot go and search for the ones happening that exact day (as it is more than a month ago), but I have taken a look at the last 5 days and there are not outstanding queries there either.
However, by looking at those graphs, it is clear that something happened and increased query response time on s1.

There is also something that started to generate more temporary tables starting the 2nd of May on all s1 hosts:

db1089-temp tables.png (500×1 px, 56 KB)

That increase doesn't correlate with an increase of query response time, but need to also be looked at.
As per SAL, that matches:

19:27 thcipriani@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.3

That increase on temporary tables has happened on pretty much any big wiki, not only on s1, but also on s8 for instance.

Let's add Performance-Team to see if the can provide more ideas.

Overall, while the response time increased on s1 as has been seen above, I am not sure if it can explain all the symptoms that have been seen so far.

We saw one of these events at 14:48 today and pybal reported fetch failures for -- and wanted to depool -- basically the entire appserver fleet

We saw one of these events at 14:48 today and pybal reported fetch failures for -- and wanted to depool -- basically the entire appserver fleet

Which matches a spike of the queries reported at T224017: Slow query ApiQueryRevisions on enwiki that arrived to the API dbs for enwiki at 14:48 too.

Update from the subtask related to the slow query detected (which might or might not be the cause for this): I have removed db1118 from API and placed db1134 instead which looks like it has a good query plan, choosing the right index T224017#5202152

I am removing the DBA tag for now here (but remain subscribed), as we are following up on T224017: Slow query ApiQueryRevisions on enwiki (which is not yet clear if it is definitely causing this or not although the timing matches quite often and needs fixing anyways)

The slow query that is being investigated at T224017 is only reported as slow when 'rev_page'= '35159700'.
Which is and that page has around 80k revisions

root@db2103.codfw.wmnet[enwiki]> select count(*) from revision where rev_page='35159700';
| count(*) |
|    80040 |
1 row in set (0.07 sec)


I've looked back in the last week or so and we don't see those kind of instabilities specific to api servers anymore. Resolving.