
Investigate increase in tx bandwidth usage for mc1033
Closed, Resolved · Public · 0 Story Points

Description

@Addshore alerted us that from his dashboard it seems that tx bandwidth usage for mc1033 spiked on 2019-05-02 19:26 UTC.

In T97368 a similar pattern was observed and investigated, but it might not be related.

From SAL, we see 1.34.0-wmf.3 was deployed at that time, so new keys might be involved.

Event Timeline

elukey triaged this task as Normal priority.May 14 2019, 5:33 PM
elukey created this task.
Restricted Application added a subscriber: Aklapper.May 14 2019, 5:33 PM
elukey added a subscriber: jijiki.May 14 2019, 5:53 PM

From memcached's bytes_written and bytes_read metrics I don't see anything changing dramatically:

https://grafana.wikimedia.org/d/000000316/memcache?panelId=44&fullscreen&orgId=1&from=now-14d&to=now
https://grafana.wikimedia.org/d/000000316/memcache?panelId=45&fullscreen&orgId=1&from=now-14d&to=now

Same thing in slab usage:

https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=mc1033&var-slab=All&from=now-14d&to=now

Interestingly, I do see an increase in GETs issued by the appservers via mcrouter:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=now-14d&to=now&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All

But the increase starts at around 9 UTC, way before the deployment. At around that time I can see @jijiki logging that PHP 7.2 was getting 5% of the traffic.

I verified with ifstat on mc1033 that the tx bandwidth is definitely being used:

  Time           eth0
HH:MM:SS   Kbps in  Kbps out
17:21:45  97526.38  654420.3
17:21:46  102252.5  657875.6
17:21:47  99799.39  626790.2
17:21:48  106664.9  649912.2
17:21:49  102245.5  656182.2
17:21:50  98895.46  629060.7
17:21:51  105330.4  651062.4
17:21:52  111281.0  633934.4
17:21:53  110871.7  624383.0
17:21:54  104879.6  641662.9
17:21:55  104054.8  627020.4
17:21:56  101483.6  631687.5
17:21:57  101457.5  639215.5

We also have Redis on those nodes, so I am wondering if this could be related to it. Need to investigate more; the metrics that I found don't really make sense (yet).

elukey added a subscriber: aaron.EditedMay 15 2019, 2:23 PM

@aaron hi! I am trying to figure out if Redis traffic for mc1033 is causing this increase in bandwidth usage. From tcpdump I can see a lot of traffic for GET global:Wikimedia\Rdbms\ChronologyProtector::v2. Is it something added by 1.34.0-wmf.3? If not I'll look deeper :)

Found https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/505340/1/includes/libs/rdbms/ChronologyProtector.php, but not sure if it may cause this.

I have modified the ops/s of the Redis dashboard (https://grafana.wikimedia.org/d/000000174/redis) to show per host metrics:

Definitely something has changed during the deployment, but the Redis MainStash is the problem :)

elukey raised the priority of this task from Normal to High.May 16 2019, 5:53 AM
elukey added subscribers: Krinkle, Joe.

@Krinkle @aaron I raised the priority to high since the tx network bandwidth usage grew a lot and we might end up saturating it with spikes in memcached traffic (causing timeouts etc., like T203786). Can we check what caused the increase in traffic ASAP?

aaron added a comment.May 16 2019, 7:41 AM

None of those look interesting in terms of this problem.

I would expect https://github.com/wikimedia/mediawiki/commit/1ef64b2f2683051066a33ba34c6d04723252c3c0 to have some increase though. Unless the process cache doesn't work, it would seem odd for that to cause problems.

Any specific key/traffic-pattern that I can check with tcpdump?

elukey added a comment.EditedMay 16 2019, 9:24 AM

I grabbed a pcap (~200MB) from 11:03:25 to 11:04:41 of traffic directed to the mc1033's redis port (only one direction to avoid having to deal with huge pcaps), grepped for GET and tried to aggregate some results:

[..]
 317 "GET" "ptwiki:MWSession
 318 "GET" "kowiki:MWSession
 332 "GET" "dawiki:MWSession
 371 "GET" "plwiki:MWSession
 375 "GET" "hewiki:MWSession
 399 "GET" "enwiki:page-recent-delete
 414 "GET" "nlwiki:MWSession
 437 "GET" "arwiki:MWSession
 437 "GET" "svwiki:MWSession
 438 "GET" "bnwikibooks:abusefilter
 466 "GET" "metawiki:MWSession
 466 "GET" "ukwiki:abusefilter
 503 "GET" "urwiki:MWSession
 570 "GET" "nlwiktionary:MWSession
 590 "GET" "dewiki:abusefilter
 634 "GET" "loginwiki:abusefilter
 654 "GET" "frwiki:abusefilter
 682 "GET" "enwiktionary:MWSession
 694 "GET" "commonswiki:sitestatsupdate
 874 "GET" "itwiki:MWSession
 957 "GET" "global:watchlist-recent-updates
1157 "GET" "trwiki:MWSession
1278 "GET" "centralauth:session
1351 "GET" "frwiki:MWSession
1368 "GET" "enwiktionary:abusefilter
1518 "GET" "jawiki:MWSession
1525 "GET" "zhwiki:MWSession
1543 "GET" "eswiki:MWSession
2183 "GET" "ruwiki:MWSession
2609 "GET" "global:echo
3526 "GET" "wikidatawiki:MWSession
3634 "GET" "dewiki:MWSession
4082 "GET" "commonswiki:MWSession
5041 "GET" "enwiki:abusefilter
11031 "GET" "wikidatawiki:abusefilter
11502 "GET" "enwiki:MWSession
17447 "GET" "commonswiki:abusefilter
713937 "GET" "global:Wikimedia\Rdbms\ChronologyProtector

I have also checked that the GETs are spread evenly across mw hosts (so there aren't a few top talkers). Is it normal that we do that amount of GETs to global:Wikimedia\Rdbms\ChronologyProtector? If so, is there anything out of the ordinary that you can see in the above list?

EDIT: 713937 GETs for global:Wikimedia\Rdbms\ChronologyProtector in 76 seconds are ~9400 rps.
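For reference, the grep/sort/uniq-style aggregation above can be reproduced offline with a short script. This is only a sketch: it assumes the pcap has already been dumped to ASCII text (e.g. with tcpdump -A), and the key-prefix regex is an approximation of the grouping used in the list above.

```python
import re
from collections import Counter

# Count GET commands per key prefix (wiki + key family, i.e. the part up to
# the second ':'), mirroring the grep | sort | uniq -c | sort -n pipeline.
GET_RE = re.compile(r'"GET" "([^":\s]+:[^":\s]+)')

def count_get_prefixes(lines):
    counts = Counter()
    for line in lines:
        for prefix in GET_RE.findall(line):
            counts[prefix] += 1
    return counts

# Hypothetical sample lines standing in for the real ASCII pcap dump:
sample = [
    '"GET" "enwiki:MWSession:abc123"',
    '"GET" "enwiki:MWSession:def456"',
    '"GET" "global:Wikimedia\\Rdbms\\ChronologyProtector::v2"',
]
for prefix, n in count_get_prefixes(sample).most_common():
    print(n, prefix)
```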

elukey added a comment.EditedMay 17 2019, 9:44 AM

@aaron what about T212550 ? I thought it was https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/504819/ but it is still not merged, possibly something is calling a lot https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/504815/ ?

elukey claimed this task.May 17 2019, 5:27 PM
Gilles moved this task from Inbox to Radar on the Performance-Team board.May 20 2019, 8:10 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

Given we don't have past data to compare to, the raw key hits don't necessarily point to the cause. ChronologyProtector is currently expected to be near the top, but we have no reason to believe its use has increased recently.

@elukey Would it be possible to have a breakdown that accounts for value size? That might help surface a key that is not the most frequent on its own, but responsible for the most bandwidth.

Given the significant increase, I would expect it to show in such a breakdown.

elukey added a comment.EditedMay 20 2019, 8:58 PM

@Krinkle the graph of ops/second clearly shows a bump from ~3k ops to ~12k ops, and the amount of gets to the chronology protector per second is close to 9k ops from tcpdump. I'll follow up as you suggest but it seems to me a volume of calls problem, rather than big keys hitting Redis.
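Written out explicitly, the back-of-the-envelope check is (all numbers taken from the comments above; nothing here is measured anew):

```python
# Does the tcpdump sample account for the observed ops/s bump on the dashboard?
gets = 713_937        # ChronologyProtector GETs counted in the capture
window_s = 76         # capture duration in seconds (11:03:25 -> 11:04:41)
rate = gets / window_s
print(f"{rate:.0f} GET/s from tcpdump")   # ~9.4k/s

bump = 12_000 - 3_000  # ops/s after vs. before the deploy, per the Redis dashboard
print(f"~{bump} ops/s bump on the dashboard")
```

The ~9.4k GET/s seen in the capture matches the ~9k ops/s jump almost exactly, which is what makes the volume-of-calls explanation convincing.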

Krinkle added a comment.EditedMay 20 2019, 9:19 PM

Thanks, I didn't see the op/s stat. That correlation seems fair indeed. I wonder how that would have regressed, but it is certainly reason to look at it again.

Given the lack of changes on the MW backend around this (as far as we know), we may need to expand the scope of things that could be causing ChronologyProtector lookups.

For example:

  • An MW consumer that may've disabled ChronologyProtector in the past through an opt-out may've stopped doing so, or had its logic broken. This could be either internal or external to WMF.
  • The traffic layer (e.g. Varnish frontend-backend, ATS, Apache) may've had an unintended change causing ChronologyProtector to be enabled for some requests where it normally wouldn't be. Not sure what this could be, but it seems plausible.
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).EditedMay 20 2019, 9:22 PM

(Moving back from Radar per @elukey's comment, which shows a pretty clear correlation: even without a breakdown from the previous week, we do know the total numbers, and that number of op/s could not have existed previously.)

Krinkle updated the task description. (Show Details)May 20 2019, 10:05 PM
From the Task description:

spiked on 2019-05-02 19:26 UTC.

From SAL:

  • 08:55 <jiji@deploy1001> Synchronized wmf-config/CommonSettings.php: Send 5% of anonymous users to PHP7.2 - T219150
  • 15:42 <reedy@deploy1001> Synchronized wmf-config/CommonSettings.php: Re-enable account creation on wikitech
  • 18:10 <catrope@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Enable SpecialHomepage on cswiki and kowiki (T221266)
  • 19:27 <thcipriani@deploy1001> rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.3
Krinkle added subscribers: kostajh, Catrope.EditedMay 20 2019, 10:15 PM

@kostajh @Catrope Could you rule out whether it's possible that the SpecialHomepage may have caused an increase in uncached Varnish traffic (e.g. anons starting sessions via edits or by viewing the user sign-up page), or some other very chatty or large-value use of Memcached or Redis storage?


@alaa_wmde Could you review Wikibase-related changes between 1.34-wmf.2 and 1.34-wmf.3 for something that may've caused a regression in CacheAwarePropertyInfoStore behaviour (T97368), and/or that may've caused a major increase in MediaWiki requests that do not opt out of ChronologyProtector?

Krinkle added a subscriber: alaa_wmde.
Krinkle added a subscriber: Tarrow.
Restricted Application added a project: Wikidata.May 20 2019, 10:18 PM

Change 511612 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] watchlist: make getLatestNotificationTimestamp() method use the context user

https://gerrit.wikimedia.org/r/511612

Joe added a comment.May 21 2019, 5:56 AM

I think I know what happened here - and it's possibly in relation with T223180 .

PHP7's APC memory was perfectly ok when I looked into it (and we just had the beta feature enabled), but it's not sufficient by far when real traffic hits php-fpm. So evictions were happening from APC and this might cause what we're seeing there. I'm modifying the APC shm size across all php servers to see, amongst other things, if this fixes this effect.
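The eviction effect described above can be sketched with a toy LRU cache standing in for APC (the run_workload helper and key names are hypothetical, not the actual APCu implementation): once the cache is smaller than the working set, every access falls through to the backend.

```python
from collections import OrderedDict

# Toy LRU cache: count how many "requests" miss the cache and hit the backend.
def run_workload(cache_size, keys):
    cache = OrderedDict()
    backend_fetches = 0
    for key in keys:
        if key in cache:
            cache.move_to_end(key)         # refresh LRU position on a hit
        else:
            backend_fetches += 1           # miss -> fetch from the backend
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least-recently-used entry
    return backend_fetches

# Working set of 10 keys, accessed cyclically 100 times each:
workload = [f"k{i}" for i in range(10)] * 100

print(run_workload(cache_size=20, keys=workload))  # cache fits: only cold misses
print(run_workload(cache_size=5, keys=workload))   # cache too small: it thrashes
```

With a cache larger than the working set there are only the 10 cold misses; with an undersized cache under cyclic access, every single request misses, which is the kind of amplification that would show up as extra memcached/Redis traffic.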

@kostajh @Catrope Could you rule out whether it's possible that the SpecialHomepage may have caused an increase in uncached Varnish traffic (e.g. anons starting sessions via edits or by viewing the user sign-up page), or some other very chatty or large-value use of Memcached or Redis storage?

The traffic to SpecialHomepage is very, very low (50% of newly registered users are added to the experiment group on two wikis), so I think we can rule that out.

I’ll be interested to see the effect of increasing the APC shm size, though.

Joe added a comment.May 21 2019, 6:52 AM

@kostajh for now I'm switching off php7 for other investigations, so we will know immediately if the additional traffic is due to that or not.

Joe added a comment.May 21 2019, 7:16 AM

So, after turning off php7 this morning, we saw no change in the rate of requests to mc1033.

It seems extremely probable that the switch of the larger wikis to wmf.3 is what caused this regression.

Would it be possible to deploy https://gerrit.wikimedia.org/r/511612 before the weekly train?

Change 511659 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: bump the APC shm size on php7

https://gerrit.wikimedia.org/r/511659

Change 511659 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: bump the APC shm size on php7

https://gerrit.wikimedia.org/r/511659

Change 511783 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.6] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511783

Change 511612 merged by jenkins-bot:
[mediawiki/core@master] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511612

Change 511794 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.5] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511794

Change 511783 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511783

Change 511794 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.5] watchlist: make getLatestNotificationTimestamp() method use the correct user

https://gerrit.wikimedia.org/r/511794

Thanks a lot for the deploy! I checked the metrics and nothing seems to have changed :(

Change 511951 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/511951

Change 511951 merged by jenkins-bot:
[mediawiki/core@master] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/511951
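For illustration, here is a toy model (hypothetical key scheme and simulate() helper, not the actual MediaWiki code) of why empty client IDs could concentrate ChronologyProtector traffic on a single, ever-growing key: with distinct IDs each client reads a small per-client value, while with empty IDs every request reads and appends to the same shared blob, so tx bandwidth grows with every request.

```python
from collections import defaultdict

# Toy model: each request stores its DB position under a per-client key and
# then reads the value back; bytes_sent approximates tx bandwidth.
def simulate(client_ids):
    store = defaultdict(list)            # key -> list of stored positions
    bytes_sent = 0
    for cid in client_ids:
        key = f"global:ChronologyProtector::v2:{cid}"
        store[key].append("db-position")               # write this request's position
        bytes_sent += sum(len(v) for v in store[key])  # read the whole value back
    return len(store), bytes_sent

distinct = simulate(f"client-{i}" for i in range(1000))   # proper client IDs
collapsed = simulate("" for _ in range(1000))             # empty client IDs

print("distinct IDs  -> keys, bytes:", distinct)
print("empty IDs     -> keys, bytes:", collapsed)
```

In this sketch the empty-ID case produces a single key whose transfer cost grows quadratically with the request count, which would match the observed tx-bandwidth spike better than any change in request volume alone.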

Change 512514 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.6] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/512514

Change 512514 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] rdbms: fix ChronologyProtector client IDs to not be empty

https://gerrit.wikimedia.org/r/512514

Mentioned in SAL (#wikimedia-operations) [2019-05-27T18:41:16Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/libs/rdbms: 66556bf37e8 / T223310, T223978 (duration: 00m 50s)

Krinkle closed this task as Resolved.May 27 2019, 6:46 PM
Krinkle reassigned this task from elukey to aaron.
Krinkle removed a project: Patch-For-Review.

Recovery: