Phased rollout of sessionstore to production fleet
Open, High, Public

Description

The remaining steps for session storage rollout are:

  • Update testwiki from kask-transition (multi-write w/ redis) to kask-session (Kask-only)
  • Update the remainder of testwikis to kask-transition (multi-write w/ redis)
  • Update the remainder of testwikis to kask-session (Kask-only)
  • Update group0 & group1 to kask-transition (r569678)
  • Update group0 & group1 to kask-session (r570393)
  • Update group1 to kask-transition (skipping...)
  • Update group1 to kask-session (skipping...)
  • Update all remaining wikis (default) to kask-transition (r570395)
  • Update all remaining wikis (default) to kask-session (r570396)

Each step from kask-transition to kask-session should be spaced apart by either a) $wgObjectCacheSessionExpiry (1 hour), or b) the amount of time necessary to be confident everything is working as expected (whichever is longer).
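
To illustrate what "multi-write" means in the steps above, here is a minimal sketch in Python. It is not the MediaWiki implementation; `DictStore` and `MultiWriteStore` are hypothetical names, and the in-memory dictionaries merely stand in for Kask and redis. Writes go to every backend, while reads try the backends in order.

```python
class DictStore:
    """In-memory stand-in for a session backend (hypothetical; not Kask or redis)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class MultiWriteStore:
    """Replicates every write to all backends; reads try backends in order."""

    def __init__(self, backends):
        self.backends = list(backends)

    def set(self, key, value):
        for backend in self.backends:
            backend.set(key, value)

    def get(self, key):
        for backend in self.backends:
            value = backend.get(key)
            if value is not None:
                return value
        return None


kask, redis = DictStore(), DictStore()

# A session created before the transition exists only in redis.
redis.set("session:old", {"user": "Legacy"})

# During the transition phase, new writes land in both stores.
transition = MultiWriteStore([kask, redis])
transition.set("session:abc", {"user": "Example"})
```

In this sketch `session:old` is still readable through the second backend but is absent from the first one. That is presumably why each Kask-only step waits at least one session expiry period: by then, sessions that were only ever written to redis have expired.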

Event Timeline

Eevans created this task. Jan 17 2020, 9:36 PM
Eevans updated the task description. Jan 17 2020, 9:40 PM
Eevans updated the task description.

Change 565696 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/mediawiki-config@master] Configure remainder of testwikis group for kask-transition

https://gerrit.wikimedia.org/r/565696

Change 569575 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] Upgrade staging to Kask v1.0.6

https://gerrit.wikimedia.org/r/569575

Change 569575 merged by Eevans:
[operations/deployment-charts@master] Upgrade staging to Kask v1.0.6

https://gerrit.wikimedia.org/r/569575

Change 565696 merged by jenkins-bot:
[operations/mediawiki-config@master] Configure remainder of testwikis group for kask-transition

https://gerrit.wikimedia.org/r/565696

Mentioned in SAL (#wikimedia-operations) [2020-02-03T19:09:44Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: 7bb6a12: Configure remainder of testwikis group for kask-transition (T243106) (duration: 01m 14s)

Mentioned in SAL (#wikimedia-operations) [2020-02-04T00:03:47Z] <jforrester@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Configure remainder of testwikis group for kask-session T243106 (duration: 00m 58s)

Eevans updated the task description. Feb 4 2020, 12:11 AM
Eevans updated the task description.

Change 569678 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/mediawiki-config@master] Configure group1 for kask-transition (multi-write kask/redis)

https://gerrit.wikimedia.org/r/569678

Eevans updated the task description. Feb 4 2020, 7:54 PM

Change 569678 merged by jenkins-bot:
[operations/mediawiki-config@master] Configure group0 & group1 for kask-transition (multi-write kask/redis)

https://gerrit.wikimedia.org/r/569678

Mentioned in SAL (#wikimedia-operations) [2020-02-05T16:37:03Z] <ppchelko@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:569678]] Config: Enable sessionstore on group0 and 1 T243106 (duration: 01m 08s)

Eevans updated the task description. Feb 5 2020, 4:45 PM

Change 570393 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Session Strore: Switch group0 and group1 to kask-session

https://gerrit.wikimedia.org/r/570393

Pchelolo updated the task description. Feb 5 2020, 6:35 PM

Change 570395 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Session Store: Switch group2 to kask-transition

https://gerrit.wikimedia.org/r/570395

Change 570396 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Session Store: Switch everything to kask-session

https://gerrit.wikimedia.org/r/570396

Pchelolo updated the task description. Feb 5 2020, 6:43 PM

Change 570393 merged by jenkins-bot:
[operations/mediawiki-config@master] Session Strore: Switch group0 and group1 to kask-session

https://gerrit.wikimedia.org/r/570393

Mentioned in SAL (#wikimedia-operations) [2020-02-10T19:31:47Z] <ppchelko@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:570393]] Config: Session Store: Switch group0 and group1 to kask-session T243106 (duration: 01m 06s)

Eevans updated the task description. Feb 10 2020, 7:41 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-12T11:53:53Z] <akosiaris> mangle sessionstore on mw1331 so that it is unreachable. Testing for T243106

Mentioned in SAL (#wikimedia-operations) [2020-02-12T13:23:46Z] <akosiaris> mangle sessionstore on mw1331, mw1348 so that it timesout instead of returning TCP RSTs. Testing for T243106

Mentioned in SAL (#wikimedia-operations) [2020-02-12T13:39:28Z] <akosiaris> revert sessionstore on mw1331, mw1348 so that it times out instead of returning TCP RSTs. Testing for T243106

akosiaris added a subscriber: akosiaris. Edited Feb 12 2020, 3:52 PM

I've just conducted 2 separate tests on 2 selected mw hosts, one appserver and one apiserver: mw1331 and mw1348 respectively. The tests (done by mangling /etc/hosts) were:

  1. Emulate sessionstore returning Connection refused. Possible scenarios that would cause this:
    • firewall misconfiguration
    • mediawiki misconfiguration
    • all sessionstore pods being depooled, e.g. due to a bad deploy, code issues, very heavy load, etc.
  2. Emulate sessionstore connections timing out (Connection timed out). Possible scenarios that would cause this:
    • firewall misconfiguration
    • mediawiki misconfiguration
    • sessionstore backends taking too long to respond, e.g. because of heavy load, but not long enough for them to be depooled by the infrastructure

Between the 2, the 2nd test is the one simulating what has happened more frequently up to now.
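
The difference between the two failure modes can be reproduced locally. The sketch below is only an illustration using loopback sockets, not the /etc/hosts method used in the tests; the `probe` helper and the 0.5s timeout are assumptions made for the example. A refused connection fails almost immediately, while a peer that accepts the connection but never answers holds the caller for the full timeout.

```python
import socket
import time


def probe(host, port, timeout):
    """Return (outcome, seconds_elapsed) for one connect-and-read attempt."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(b"ping")
            conn.recv(1)  # blocks until data arrives or the timeout fires
            outcome = "ok"
    except ConnectionRefusedError:
        outcome = "refused"
    except socket.timeout:
        outcome = "timeout"
    return outcome, time.monotonic() - start


# A listening socket that never accepts or replies: the kernel completes
# the TCP handshake, but the read stalls until the client gives up.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen(1)
silent_port = silent.getsockname()[1]

# A port with nothing listening: bind to obtain a free port, then close it.
closed = socket.socket()
closed.bind(("127.0.0.1", 0))
closed_port = closed.getsockname()[1]
closed.close()

refused = probe("127.0.0.1", closed_port, timeout=0.5)
stalled = probe("127.0.0.1", silent_port, timeout=0.5)
silent.close()
```

The caller in the stalled case is blocked for roughly the whole timeout on every request, which is the cost that matters for the worker pool.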

Connection refused test

The Connection refused test did not cause any kind of measurable consequence (except a lot of logs in logstash). Nothing was noticed in the metrics dashboards. mediawiki seems to have performed quite well and fallen back to redis pretty quickly.

Connection timed out test

The Connection timed out test however had different results.

I'll split the results by type of server (appserver vs api) but both are concerning:

appserver

On the appserver we had:

  • An increase in the rate of responses taking >0.5s to respond. From ~5 to ~50 with peaks at ~80. [1]
  • A sizeable 95th percentile increase. Depending on operation type (POST, GET, etc.), anything between 5x and 10x.
  • Almost all php-fpm processes becoming active. [1]

Despite this being a single server, it's a rather hefty one receiving a lot of traffic, so the effect made it into the global stats as well [2]. The increases there are obviously more subtle, in the range of 2x, but that's to be expected since one server is averaged in with the rest of the cluster. There seem to be some organic traffic increases there as well, however.

[1] https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1331&from=1581505365076&to=1581515546863
[2] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1581501544743&to=1581515746318&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200

apiserver

On the apiserver we had:

  • No requests below 0.5s latency [3]
  • 95th percentile increase of 10x to 5s
  • A backlog of requests being served for a period after the event.
  • All fpm processes being consumed.

On the cluster level, all of these are pronounced as well [4].

[3] https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&from=1581505405054&to=1581515857984&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-node=mw1348
[4] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1581505209037&to=1581515974533&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200

Logs

Not directly related to the above, but during that event Logstash logged ~700k and ~430k messages for the 2 servers [5]. It does add up math-wise, but on a side note we may want to somehow limit this a bit.

[5] https://logstash.wikimedia.org/goto/19968eb5e2c2424d1f15334c43d92b6b

Takeaways

Interestingly enough, the fallback to redis, while it probably saved users from receiving error messages, did nothing to keep the latencies low or keep the ratio of active php-fpm workers stable/manageable.

My current takeaway is that if for whatever reason connections from mediawiki to sessionstore time out, we will have a severe to major incident on our hands.

At best, latencies will increase for all users by anything between 2x and 10x, leading to severe service degradation.

Chances are however that this won't be a stable equilibrium state and the following will happen.

At worst, all php-fpm workers on all nodes will become active, leaving 0 idle workers. In the beginning, load balancers will try to depool the problematic mw nodes, but since all of them will soon be problematic, this won't save us; in fact it will only exacerbate the effect. Request latencies will skyrocket and requests will pile up, leading to a stampede problem. From that point on, if history is any guide (see https://wikitech.wikimedia.org/wiki/Incident_documentation/20200108-mw-api), we might have a domino effect bringing down the caching proxies as well. It will take considerable time and effort from multiple people to recover from that state.
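
The worker exhaustion described above follows from simple queueing arithmetic (Little's law): the average number of busy workers is roughly the request rate multiplied by the mean time each request holds a worker. The numbers in this Python sketch (100 requests/s, 100ms baseline latency, a pool of 64 workers) are purely hypothetical and are not measurements from mw1331 or mw1348.

```python
def busy_workers(requests_per_second, mean_latency_ms):
    """Little's law: average concurrency = arrival rate * time in system."""
    return requests_per_second * mean_latency_ms / 1000


POOL_SIZE = 64      # hypothetical php-fpm pool size
RATE = 100          # hypothetical requests per second on one host
BASELINE_MS = 100   # hypothetical mean latency when all is well

normal = busy_workers(RATE, BASELINE_MS)         # workers busy normally
degraded = busy_workers(RATE, BASELINE_MS * 10)  # workers needed at 10x latency

pool_exhausted = degraded > POOL_SIZE
```

With these assumed figures a 10x latency increase would need more workers than the pool has, so requests start to queue even though the request rate itself has not changed.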

Solution

An actionable item, IMHO, would be to revisit the mediawiki timeout and set it to something really low. With sessionstore p99 latencies of <50ms per https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 and cross-DC latency at 40ms, I would suggest 100ms (0.1s).
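
One way to read the numbers in this suggestion, sketched in Python: adding the p99 bound and the cross-DC latency gives the slowest response a healthy sessionstore would be expected to produce, and the proposed timeout sits just above it. Summing the two figures is an interpretation, not something stated in the comment, and the 2000ms "long timeout" and 5ms fallback cost are hypothetical values chosen only for comparison.

```python
P99_MS = 50                # upper bound on sessionstore p99 quoted above
CROSS_DC_MS = 40           # cross-DC latency quoted above
PROPOSED_TIMEOUT_MS = 100  # suggested mediawiki timeout

worst_expected_ms = P99_MS + CROSS_DC_MS
headroom_ms = PROPOSED_TIMEOUT_MS - worst_expected_ms


def hung_request_ms(timeout_ms, fallback_ms):
    """Time one request holds a worker when the primary hangs until the
    timeout fires and the fallback store then answers."""
    return timeout_ms + fallback_ms


LONG_TIMEOUT_MS = 2000  # hypothetical larger timeout, for comparison only
FALLBACK_MS = 5         # hypothetical cost of the redis fallback

with_long_timeout = hung_request_ms(LONG_TIMEOUT_MS, FALLBACK_MS)
with_proposed_timeout = hung_request_ms(PROPOSED_TIMEOUT_MS, FALLBACK_MS)
```

Under these assumptions a hung sessionstore costs each request about 105ms instead of about 2 seconds, which bounds how long each php-fpm worker stays occupied.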

Thanks for this @akosiaris, I'll check into it today to make sure we can make the change.

Mentioned in SAL (#wikimedia-operations) [2020-05-08T12:49:09Z] <akosiaris> T243106 redo experiment with REJECT, DROP iptable rules now that we have envoy in the middle

Mentioned in SAL (#wikimedia-operations) [2020-05-08T12:49:32Z] <akosiaris> T243106 redo experiment with REJECT, DROP iptable rules now that we have envoy in the middle. Use mw1331, mw1348

Mentioned in SAL (#wikimedia-operations) [2020-05-08T13:16:30Z] <akosiaris> T243106 undo experiment with REJECT, DROP iptable rules now that we have envoy in the middle. Use mw1331, mw1348. Experiment done successfully, no issues to the infrastructure.

Mentioned in SAL (#wikimedia-operations) [2020-05-08T13:20:52Z] <akosiaris> T243106 redo experiment with DROP iptable rules this time around. Use mw1331, mw1348

Mentioned in SAL (#wikimedia-operations) [2020-05-08T14:05:53Z] <akosiaris> T243106 undo experiment with DROP iptable rules this time around. Use mw1331, mw1348

As an update,

I've rerun the above scenarios today to make sure we've addressed them.

Thanks to @Joe and the envoy-based middleware, I can safely say that this is no longer a concern. In both cases envoy very quickly did the correct thing: it marked sessionstore as failing and started returning 503s to mediawiki, which fell back to redis.
The reaction of mediawiki to both of the above scenarios (connection refused/connection timeout) was exactly the same. The tests are in the following 4 dashboards. I don't see any noticeable degradation. There is a minor increase in php worker busy time on the 2 hosts, but it could very well be due to organic traffic patterns.
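
The behaviour described here, failing fast once a backend is known to be unhealthy, is the general circuit-breaker pattern. The Python sketch below illustrates that pattern only; it is not how envoy implements it, and `CircuitBreaker`, its thresholds and `hanging_backend` are hypothetical names and values.

```python
import time


class CircuitBreaker:
    """Fail fast with a 503 after repeated backend failures."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return 503, None  # open: do not touch the backend at all
            # Cool-down elapsed: let one trial request through.
        try:
            result = fn()
        except OSError:  # covers refused connections and timeouts
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return 503, None
        self.failures = 0
        self.opened_at = None
        return 200, result


backend_calls = []


def hanging_backend():
    backend_calls.append(1)
    raise TimeoutError("simulated sessionstore timeout")


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(max_failures=3, reset_after=5.0, clock=lambda: now[0])
statuses = [breaker.call(hanging_backend)[0] for _ in range(10)]

now[0] = 6.0  # after the cool-down, a healthy backend closes the circuit
recovered = breaker.call(lambda: "session-data")
```

In this sketch only the first three of ten requests pay for the failing backend; the rest get an immediate 503, which lets the caller fall back right away regardless of whether the underlying failure is a refusal or a timeout.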

Thank you @akosiaris! Does this unblock us to continue with the rollout? If so I'll arrange to schedule that work on our side and coordinate with @thcipriani

Joe added a comment. Mon, May 11, 9:12 AM

@WDoranWMF yes we should be cleared to proceed with a wider deployment.

Indeed, that's the recommendation on our side. @wkandek will be reaching out to you about this as well.

@WDoranWMF, should this be on clinic duty, green team, or neither?

BPirkle triaged this task as High priority. Tue, May 19, 8:44 PM

@CCicalese_WMF It should be on Clinic Duty.

Change 570395 merged by jenkins-bot:
[operations/mediawiki-config@master] Session Store: Switch group2 to kask-transition

https://gerrit.wikimedia.org/r/570395