Investigate increase in tx bandwidth usage for mc1033
Closed, ResolvedPublic0 Estimated Story Points
Actions

Assigned To

Authored By

	elukey
	May 14 2019, 5:33 PM

Description

@Addshore alerted us that from his dashboard it seems that tx bandwidth usage for mc1033 spiked on 2019-05-02 19:26 UTC.

Screenshot 2019-05-20 at 23.05.30.png (1×2 px, 671 KB)

In T97368 a similar pattern was observed and investigated, but it might not be related.

From SAL, we see 1.34.0-wmf.3 was deployed at that time, so new keys might be involved.

Details

Subject	Repo	Branch	Lines +/-
rdbms: fix ChronologyProtector client IDs to not be empty	mediawiki/core	master	+127 -34
rdbms: fix ChronologyProtector client IDs to not be empty	mediawiki/core	wmf/1.34.0-wmf.6	+127 -34
watchlist: make getLatestNotificationTimestamp() method use the correct user	mediawiki/core	wmf/1.34.0-wmf.5	+1 -1
watchlist: make getLatestNotificationTimestamp() method use the correct user	mediawiki/core	wmf/1.34.0-wmf.6	+1 -1
watchlist: make getLatestNotificationTimestamp() method use the correct user	mediawiki/core	master	+1 -1
mediawiki::php: bump the APC shm size on php7	operations/puppet	production	+1 -0

Customize query in gerrit

Related Objects

Mentioned In: T224454: Create an alert for high memcached bw usage
T223978: 1.34.0-wmf.3 generating lots of temporary tables on MySQL slaves
T223952: Increased instability in MediaWiki backends (according to load balancers)
T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster
Mentioned Here: T223978: 1.34.0-wmf.3 generating lots of temporary tables on MySQL slaves
rMW66556bf37e88: rdbms: fix ChronologyProtector client IDs to not be empty
T223180: Monitoring PHP 7 APC usage
T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters
T221266: Homepage: Deploy to target wikis in production
T212550: Implement support for ChronologyProtection in events sent when editing Mediawiki/Wikidata
T203786: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions
T97368: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern

Event Timeline

elukey triaged this task as Medium priority.May 14 2019, 5:33 PM

elukey created this task.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 14 2019, 5:33 PM

From the memcached's bytes_written and bytes_read metrics I don't see anything changing dramatically:

https://grafana.wikimedia.org/d/000000316/memcache?panelId=44&fullscreen&orgId=1&from=now-14d&to=now
https://grafana.wikimedia.org/d/000000316/memcache?panelId=45&fullscreen&orgId=1&from=now-14d&to=now

Same thing in slab usage:

https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1033&var-slab=All&from=now-14d&to=now

Interestingly, I do see an increase in GETs issued by the appservers via mcrouter:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=now-14d&to=now&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All

But the increase starts at around 9 UTC, way before the deployment. At around that time I can see @jijiki logging that PHP 7.2 was getting 5% of the traffic.

I verified with ifstat on mc1033 that we the tx bandwidth is definitely used:

  Time           eth0
HH:MM:SS   Kbps in  Kbps out
17:21:45  97526.38  654420.3
17:21:46  102252.5  657875.6
17:21:47  99799.39  626790.2
17:21:48  106664.9  649912.2
17:21:49  102245.5  656182.2
17:21:50  98895.46  629060.7
17:21:51  105330.4  651062.4
17:21:52  111281.0  633934.4
17:21:53  110871.7  624383.0
17:21:54  104879.6  641662.9
17:21:55  104054.8  627020.4
17:21:56  101483.6  631687.5
17:21:57  101457.5  639215.5

We have also redis on those nodes, wondering if this could be related to it. Need to investigate more, the metrics that I found don't really make sense (yet).

@aaron hi! I am trying to figure out if Redis traffic for mc1033 is causing this increase in bandwidth usage. From tcpdump I can see a lot of traffic for GET global:Wikimedia\Rdbms\ChronologyProtector::v2. Is it something added by 1.34.0-wmf.3 ? If not I'll look more deeper :)

Found https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/505340/1/includes/libs/rdbms/ChronologyProtector.php, but not sure if it may cause this..

For things added my wmf.3, it looks like there were indeed some chronology protector changes https://github.com/wikimedia/mediawiki/compare/wmf/1.34.0-wmf.1...wmf/1.34.0-wmf.3

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/505340/
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/504815/
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/505864/

I have modified the ops/s of the Redis dashboard (https://grafana.wikimedia.org/d/000000174/redis) to show per host metrics:

Screen Shot 2019-05-16 at 7.50.33 AM.png (686×2 px, 389 KB)

Definitely something has changed during the deployment, but the Redis MainStash is the problem :)

elukey raised the priority of this task from Medium to High.May 16 2019, 5:53 AM

elukey added subscribers: Krinkle, Joe.

@Krinkle @aaron I raised the priority to high since the tx network bandwidth usage grew a lot and we might end up saturating it with spikes in memcached traffic (causing timeouts etc.. like T203786). Can we check what caused the increase of traffic asap?

jijiki added projects: serviceops, User-jijiki.May 16 2019, 6:54 AM

In T223310#5185185, @Addshore wrote:

For things added my wmf.3, it looks like there were indeed some chronology protector changes https://github.com/wikimedia/mediawiki/compare/wmf/1.34.0-wmf.1...wmf/1.34.0-wmf.3

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/505340/
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/504815/
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/505864/

None of those look interesting in terms of this problem.

I would expect https://github.com/wikimedia/mediawiki/commit/1ef64b2f2683051066a33ba34c6d04723252c3c0 to have some increase though. Unless the process cache doesn't work, it would seem odd for that to cause problems.

In T223310#5186065, @aaron wrote:

I would expect https://github.com/wikimedia/mediawiki/commit/1ef64b2f2683051066a33ba34c6d04723252c3c0 to have some increase though. Unless the process cache doesn't work, it would seem odd for that to cause problems.

Any specific key/traffic-pattern that I can check with tcpdump?

I grabbed a pcap (~200MB) from 11:03:25 to 11:04:41 of traffic directed to the mc1033's redis port (only one direction to avoid having to deal with huge pcaps), grepped for GET and tried to aggregate some results:

[..]
 317 "GET" "ptwiki:MWSession
 318 "GET" "kowiki:MWSession
 332 "GET" "dawiki:MWSession
 371 "GET" "plwiki:MWSession
 375 "GET" "hewiki:MWSession
 399 "GET" "enwiki:page-recent-delete
 414 "GET" "nlwiki:MWSession
 437 "GET" "arwiki:MWSession
 437 "GET" "svwiki:MWSession
 438 "GET" "bnwikibooks:abusefilter
 466 "GET" "metawiki:MWSession
 466 "GET" "ukwiki:abusefilter
 503 "GET" "urwiki:MWSession
 570 "GET" "nlwiktionary:MWSession
 590 "GET" "dewiki:abusefilter
 634 "GET" "loginwiki:abusefilter
 654 "GET" "frwiki:abusefilter
 682 "GET" "enwiktionary:MWSession
 694 "GET" "commonswiki:sitestatsupdate
 874 "GET" "itwiki:MWSession
 957 "GET" "global:watchlist-recent-updates
1157 "GET" "trwiki:MWSession
1278 "GET" "centralauth:session
1351 "GET" "frwiki:MWSession
1368 "GET" "enwiktionary:abusefilter
1518 "GET" "jawiki:MWSession
1525 "GET" "zhwiki:MWSession
1543 "GET" "eswiki:MWSession
2183 "GET" "ruwiki:MWSession
2609 "GET" "global:echo
3526 "GET" "wikidatawiki:MWSession
3634 "GET" "dewiki:MWSession
4082 "GET" "commonswiki:MWSession
5041 "GET" "enwiki:abusefilter
11031 "GET" "wikidatawiki:abusefilter
11502 "GET" "enwiki:MWSession
17447 "GET" "commonswiki:abusefilter
713937 "GET" "global:Wikimedia\Rdbms\ChronologyProtector

I have also checked that the GETs are spread evenly across mw hosts (so there are not few top talkers). Is it normal that we do that amount of GETs to global:Wikimedia\Rdbms\ChronologyProtector ? If so, is there anything out-of-the-ordinary that you can see in the above list?

EDIT: 713937 GETs for global:Wikimedia\Rdbms\ChronologyProtector in 76 seconds are ~9400 rps.

@aaron what about T212550 ? I thought it was https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504819/ but it is still not merged, possibly something is calling a lot https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/504815/ ?

elukey claimed this task.May 17 2019, 5:27 PM

elukey mentioned this in T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.May 17 2019, 5:39 PM

elukey added projects: Performance-Team, MediaWiki-libs-BagOStuff.May 20 2019, 4:03 PM

• Gilles moved this task from Inbox, needs triage to Radar on the Performance-Team board.May 20 2019, 8:10 PM

• Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

Given we don't have past data to compare to, the raw key hits doesn't necessarily point to the cause. ChronologyProtector is currently expected to be near the top, but we don't have reason to believe its use has increased recently.

@elukey Would it be possible to have a breakdown that accounts for value size? That might help surface a key that is not the most frequent on its own, but responsible for the most bandwidth.

Given the significant increase, I would expect it to show in such a breakdown.

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.May 20 2019, 8:12 PM

@Krinkle the graph of ops/second clearly shows a bump from ~3k ops to ~12k ops, and the amount of gets to the chronology protector per second is close to 9k ops from tcpdump. I'll follow up as you suggest but it seems to me a volume of calls problem, rather than big keys hitting Redis.

Thanks, I didn't see the op/s stat. That correlation seems fair indeed. I wonder how that would have regression, but certainly reason to look at it again.

Given the lack of changes on the MW backend around this (as far as we know), we may need to expand the scope of things causing ChronologyProtector look ups.

For example:

An MW consumer that may've disabled ChronologyProtector in the past through an opt-out, may've stopped doing to, or had its logic broken. This could be both internal or external to WMF.
The traffic layer (e.g. Varnish frontend-backend, ATS, Apache) may've had an unintended change causing some ChronologyProtector to be enabled where it normally wouldn't. Not sure what this could be, but seems plausible somehow.

(Moving back from Radar per @elukey's comment showing pretty clear correlation that would make it impossible for that number of op/s to have existed previously, even without a breakdown of the previous week, because we do know the total numbers.)

From the Task description:

spiked on 2019-05-02 19:26 UTC.

From SAL:

08:55 <jiji@deploy1001> Synchronized wmf-config/CommonSettings.php: Send 5% of anonymous users to PHP7.2 - T219150
…
15:42 <reedy@deploy1001> Synchronized wmf-config/CommonSettings.php: Re-enable account creation on wikitech
…
18:10 <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Enable SpecialHomepage on cswiki and kowiki (T221266)
…
19:27 <thcipriani@deploy1001> rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.3

@kostajh @Catrope Could you rule out whether it's probably that the SpecialHomepage may have caused an increase in uncached Varnish traffic (e.g. anons starting sessions via edits or by viewing user sign-up page), or some other very chatty or large-value use of Memcached or Redis storage?

@alaa_wmde Could you review Wikibase-related changes between 1.34-wmf.2 and 1.34-wmf.3 for something that may've cause regression in CacheAwarePropertyInfoStore behaviour (T97368), and/or that may've caused a major increase in MediaWiki requests that do not opt-out of ChronologyProtector?

Krinkle added projects: Wikidata-Campsite, Growth-Team.May 20 2019, 10:17 PM

Krinkle added a subscriber: • alaa_wmde.

Krinkle added a subscriber: Tarrow.

Restricted Application added a project: Wikidata. · View Herald TranscriptMay 20 2019, 10:18 PM

Change 511612 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] watchlist: make getLatestNotificationTimestamp() method use the context user

https://gerrit.wikimedia.org/r/511612

gerritbot added a project: Patch-For-Review.May 21 2019, 12:34 AM

I think I know what happened here - and it's possibly in relation with T223180 .

PHP7's APC memory was perfectly ok when I looked into it (and we just had the beta feature enabled), but it's not sufficient by far when real traffic hits php-fpm. So evictions were happening from APC and this might cause what we're seeing there. I'm modifying the APC shm size across all php servers to see, amongst other things, if this fixes this effect.

@kostajh @Catrope Could you rule out whether it's probably that the SpecialHomepage may have caused an increase in uncached Varnish traffic (e.g. anons starting sessions via edits or by viewing user sign-up page), or some other very chatty or large-value use of Memcached or Redis storage?

The traffic to SpecialHomepage is very, very low (50% of newly registered users are added to experiment group on two wikis) so I think we can rule that out.

I’ll be interested to see the effect of increasing the APC shm size, though.

Joe mentioned this in T223952: Increased instability in MediaWiki backends (according to load balancers).May 21 2019, 6:47 AM

@kostajh for now I'm switching off php7 for other investigations, so we will know immediately if the additional traffic is due to that or not.

So, after turning off php7 this morning we saw no modification in the rate of requests to mc1033.

It seems extremely probable that the switch of larger wikis to .wmf3 is what caused this regression.

Would it be possible to deploy https://gerrit.wikimedia.org/r/511612 before the weekly train?

Change 511659 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: bump the APC shm size on php7

https://gerrit.wikimedia.org/r/511659

Change 511659 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: bump the APC shm size on php7

https://gerrit.wikimedia.org/r/511659

Change 511783 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.6] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511783

Change 511612 merged by jenkins-bot:
[mediawiki/core@master] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511612

Krinkle added a project: Wikimedia-Incident.May 21 2019, 11:08 PM

Krinkle removed a project: Wikimedia-Incident.

Krinkle removed a project: MediaWiki-libs-BagOStuff.

Change 511794 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.5] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511794

Change 511783 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511783

Change 511794 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.5] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511794

Krenair subscribed.May 22 2019, 2:18 AM

Thanks a lot for the deploy! I checked metrics and nothing seems changed :(

Change 511951 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/511951

Nikerabbit subscribed.May 23 2019, 10:01 AM

ReleaseTaggerBot added a project: MW-1.34-notes (1.34.0-wmf.5; 2019-05-14).May 25 2019, 1:01 AM

Change 511951 merged by jenkins-bot:
[mediawiki/core@master] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/511951

Change 512514 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.6] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/512514

Change 512514 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/512514

Mentioned in SAL (#wikimedia-operations) [2019-05-27T18:41:16Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/libs/rdbms: rMW66556bf37e88 / T223310, T223978 (duration: 00m 50s)

Stashbot mentioned this in T223978: 1.34.0-wmf.3 generating lots of temporary tables on MySQL slaves.May 27 2019, 6:41 PM

Recovery:

Screenshot 2019-05-27 at 19.47.29.png (1×2 px, 563 KB)

jijiki mentioned this in T224454: Create an alert for high memcached bw usage.May 28 2019, 8:44 AM

	F29260036: Screenshot 2019-05-27 at 19.47.29.png
	May 27 2019, 6:47 PM

	F29166140: Screenshot 2019-05-20 at 23.05.30.png
	May 20 2019, 10:05 PM

	F29062823: Screen Shot 2019-05-16 at 7.50.33 AM.png
	May 16 2019, 5:52 AM

Investigate increase in tx bandwidth usage for mc1033Closed, ResolvedPublic0 Estimated Story PointsActions

Description

Details

Related Objects

Event Timeline

Investigate increase in tx bandwidth usage for mc1033
Closed, ResolvedPublic0 Estimated Story Points
Actions