Phased rollout of sessionstore to production fleet
Closed, ResolvedPublic
Actions

Description

The remaining steps for session storage rollout are:

Update testwiki from kask-transition (multi-write w/ redis) to kask-session (Kask-only)
Update the remainder of testwikis to kask-transition (multi-write w/ redis)
Update the remainder of testwikis to kask-session (Kask-only)
Update group0 & group1 to kask-transition (r569678)
Update group0 & group1 to kask-session (r570393)
Update group1 to kask-transition (skipping...)
Update group1 to kask-session (skipping...)
Update all remaining wikis (default) to kask-transition (r570395)
Update all remaining wikis (default) to kask-session (r570396)

Each step from kask-transition to kask-session should be spaced apart by either a) $wgObjectCacheSessionExpiry (1 hour), or b) the amount of time necessary to be confident everything is working as expected (whichever is longer).
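The spacing rule above can be written down as a small calculation. This is only a sketch: the function name and the confidence window are illustrative assumptions, and the only value taken from the description is the 1 hour session expiry.

```python
from datetime import datetime, timedelta

# Taken from the task description: $wgObjectCacheSessionExpiry is 1 hour.
SESSION_EXPIRY = timedelta(hours=1)


def earliest_switch(transition_deployed_at, confidence_window):
    """Earliest time a group may move from kask-transition to kask-session.

    confidence_window is a hypothetical input: however long the deployer
    wants to watch the multi-write setup before trusting it.
    """
    # Wait for whichever is longer: session expiry or the confidence window.
    return transition_deployed_at + max(SESSION_EXPIRY, confidence_window)


if __name__ == "__main__":
    deployed = datetime(2020, 2, 5, 16, 37)
    print(earliest_switch(deployed, timedelta(minutes=30)))
    print(earliest_switch(deployed, timedelta(days=5)))
```

With a confidence window shorter than the session expiry, the expiry dominates, so every session written only to redis before the transition has had time to expire.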

Details

Subject	Repo	Branch	Lines +/-
Session Store: Switch everything to kask-session	operations/mediawiki-config	master	+1 -23
Session Store: Switch group2 to kask-transition	operations/mediawiki-config	master	+1 -1
Session Strore: Switch group0 and group1 to kask-session	operations/mediawiki-config	master	+2 -2
Configure group0 & group1 for kask-transition (multi-write kask/redis)	operations/mediawiki-config	master	+2 -0
Configure remainder of testwikis group for kask-transition	operations/mediawiki-config	master	+2 -0
Upgrade staging to Kask v1.0.6	operations/deployment-charts	master	+1 -1


Related Objects

Status	Assigned	Task
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved	Eevans	T206016 Create a service for session storage
Resolved	Eevans	T243106 Phased rollout of sessionstore to production fleet

Event Timeline

Eevans created this task.Jan 17 2020, 9:36 PM

Eevans updated the task description. (Show Details)Jan 17 2020, 9:40 PM

Eevans updated the task description. (Show Details)

Change 565696 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/mediawiki-config@master] Configure remainder of testwikis group for kask-transition

https://gerrit.wikimedia.org/r/565696

gerritbot added a project: Patch-For-Review.Jan 17 2020, 9:50 PM

BPirkle subscribed.Jan 19 2020, 1:30 AM

Change 569575 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] Upgrade staging to Kask v1.0.6

https://gerrit.wikimedia.org/r/569575

Change 569575 merged by Eevans:
[operations/deployment-charts@master] Upgrade staging to Kask v1.0.6

https://gerrit.wikimedia.org/r/569575

Eevans mentioned this in rDEPLOYCHARTS121eba82cd0c: Upgrade staging to Kask v1.0.6.Feb 3 2020, 5:34 PM

Eevans mentioned this in rDEPLOYCHARTS6751955304bc: Upgrade sessionstore production to Kask v1.0.6.

Change 565696 merged by jenkins-bot:
[operations/mediawiki-config@master] Configure remainder of testwikis group for kask-transition

https://gerrit.wikimedia.org/r/565696

Mentioned in SAL (#wikimedia-operations) [2020-02-03T19:09:44Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: 7bb6a12: Configure remainder of testwikis group for kask-transition (T243106) (duration: 01m 14s)

Maintenance_bot removed a project: Patch-For-Review.Feb 3 2020, 7:10 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-04T00:03:47Z] <jforrester@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Configure remainder of testwikis group for kask-session T243106 (duration: 00m 58s)

Eevans updated the task description. (Show Details)Feb 4 2020, 12:11 AM

Eevans updated the task description. (Show Details)

Change 569678 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/mediawiki-config@master] Configure group1 for kask-transition (multi-write kask/redis)

https://gerrit.wikimedia.org/r/569678

gerritbot added a project: Patch-For-Review.Feb 4 2020, 7:41 PM

Eevans updated the task description. (Show Details)Feb 4 2020, 7:54 PM

Change 569678 merged by jenkins-bot:
[operations/mediawiki-config@master] Configure group0 & group1 for kask-transition (multi-write kask/redis)

https://gerrit.wikimedia.org/r/569678

Mentioned in SAL (#wikimedia-operations) [2020-02-05T16:37:03Z] <ppchelko@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:569678]] Config: Enable sessionstore on group0 and 1 T243106 (duration: 01m 08s)

Eevans updated the task description. (Show Details)Feb 5 2020, 4:45 PM

Maintenance_bot removed a project: Patch-For-Review.Feb 5 2020, 5:10 PM

Change 570393 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Session Strore: Switch group0 and group1 to kask-session

https://gerrit.wikimedia.org/r/570393

gerritbot added a project: Patch-For-Review.Feb 5 2020, 6:33 PM

• Pchelolo updated the task description. (Show Details)Feb 5 2020, 6:35 PM

Change 570395 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Session Store: Switch group2 to kask-transition

https://gerrit.wikimedia.org/r/570395

Change 570396 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Session Store: Switch everything to kask-session

https://gerrit.wikimedia.org/r/570396

• Pchelolo updated the task description. (Show Details)Feb 5 2020, 6:43 PM

Change 570393 merged by jenkins-bot:
[operations/mediawiki-config@master] Session Strore: Switch group0 and group1 to kask-session

https://gerrit.wikimedia.org/r/570393

Mentioned in SAL (#wikimedia-operations) [2020-02-10T19:31:47Z] <ppchelko@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:570393]] Config: Session Store: Switch group0 and group1 to kask-session T243106 (duration: 01m 06s)

Eevans updated the task description. (Show Details)Feb 10 2020, 7:41 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-12T11:53:53Z] <akosiaris> mangle sessionstore on mw1331 so that it is unreachable. Testing for T243106

Mentioned in SAL (#wikimedia-operations) [2020-02-12T13:23:46Z] <akosiaris> mangle sessionstore on mw1331, mw1348 so that it timesout instead of returning TCP RSTs. Testing for T243106

Mentioned in SAL (#wikimedia-operations) [2020-02-12T13:39:28Z] <akosiaris> revert sessionstore on mw1331, mw1348 so that it times out instead of returning TCP RSTs. Testing for T243106

I've just conducted 2 separate tests on 2 selected mw hosts, one appserver and one apiserver: mw1331 and mw1348 respectively. The tests (done by mangling /etc/hosts) were:

1. Emulate sessionstore returning Connection refused. Possible scenarios that would cause this:
- firewall misconfiguration
- mediawiki misconfiguration
- all sessionstore pods being depooled, e.g. due to a bad deploy, code issues, very heavy load, etc.
2. Emulate sessionstore returning Connection timed out. Possible scenarios that would cause this:
- firewall misconfiguration
- mediawiki misconfiguration
- sessionstore backends taking too long to respond, e.g. because of heavy load, but not long enough for them to be depooled by the infrastructure

Between the 2, the 2nd test is the one simulating what has happened more frequently up to now.
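To make the difference between the two failure modes concrete, here is a minimal sketch of a client-side probe that tells them apart. It is not what the tests above ran; the function name, host, port and timeout value are assumptions for illustration only.

```python
import socket


def probe(host, port, timeout=0.1):
    """Classify a TCP connection attempt as ok, refused, or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # The peer (or a REJECT rule) answered with a TCP RST: fails fast.
        return "refused"
    except socket.timeout:
        # Nothing answered at all: the caller is blocked for the full timeout.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Hypothetical target; replace with a real host and port to try it.
    print(probe("127.0.0.1", 8081))
```

A refused connection returns almost immediately, while a timed out one holds the calling worker for as long as the configured timeout, which is why the second scenario is the more expensive one.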

Connection refused test

The Connection refused test did not cause any kind of measurable consequence (except a lot of logs in logstash). Nothing was noticed in the metrics dashboards. mediawiki seems to have performed quite well and fell back to redis pretty quickly.

Connection timed out test

The Connection timed out test however had different results.

I'll split the results by type of server (appserver vs api), but both are concerning:

appserver

On the appserver we had:

An increase in the rate of responses taking >0.5s to respond, from ~5 to ~50 with peaks at ~80. [1]
A sizeable 95th percentile increase: depending on operation type (POST, GET, etc.), anywhere between 5x and 10x.
Almost all php-fpm processes becoming active. [1]

Despite this being a single server, it's a rather hefty one, receiving a lot of traffic, so the effect made it to the global stats as well [2]. The increases there are obviously more subtle, in the range of 2x, but that's to be expected because the effect is averaged over the whole cluster. There seem to be some organic traffic increases there as well, however.

[1] https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1331&from=1581505365076&to=1581515546863
[2] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1581501544743&to=1581515746318&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200

apiserver

No requests below 0.5s latency [3]
95th percentile increase of 10x to 5s
A backlog of requests being served for a period after the event.
All fpm processes being consumed.

On the cluster level, all of these are pronounced as well. [4]

[3] https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&from=1581505405054&to=1581515857984&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-node=mw1348
[4] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1581505209037&to=1581515974533&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200

Logs

Not directly related to the above, but during that event Logstash logged ~700k and ~430k messages for the 2 servers [5]. It does add up math-wise, but as a side note we may want to somehow limit this a bit.

[5] https://logstash.wikimedia.org/goto/19968eb5e2c2424d1f15334c43d92b6b

Takeaways

Interestingly enough, the fallback to redis, while it probably saved the users from receiving error messages, did nothing to keep the latencies low or keep the ratio of active php-fpm workers stable/manageable.

My current takeaway is that if, for whatever reason, connections from mediawiki to sessionstore time out, we will have a severe to major incident on our hands.

At best, latencies will increase for all users by anything between 2x and 10x, leading to severe service degradation.

Chances are however that this won't be a stable equilibrium state and the following will happen.

At worst, all php-fpm workers on all nodes will become active, leaving 0 idle workers. In the beginning, load balancers will try to depool the problematic mw nodes, but since all of them will soon be problematic, this won't save us; in fact it will only exacerbate the effect. Request latencies will skyrocket and requests will pile up, leading to a stampede problem. From that point on, if history (see https://wikitech.wikimedia.org/wiki/Incident_documentation/20200108-mw-api) is any guide, we might have a domino effect bringing down the caching proxies as well. It will take considerable time and effort from multiple people to recover from that state.
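The link between higher latency and worker exhaustion can be illustrated with Little's law (average busy workers = arrival rate x average time per request). The sketch below uses made-up numbers; the request rate, latencies and pool size are assumptions, not measurements from mw1331 or mw1348.

```python
def busy_workers(requests_per_second, mean_latency_seconds):
    """Little's law: average number of workers busy at any moment."""
    return requests_per_second * mean_latency_seconds


def saturated(requests_per_second, mean_latency_seconds, pool_size):
    """True when demand meets or exceeds the worker pool, so requests queue."""
    return busy_workers(requests_per_second, mean_latency_seconds) >= pool_size


if __name__ == "__main__":
    POOL_SIZE = 64  # hypothetical php-fpm pool size
    RATE = 80       # hypothetical requests per second on one host

    print(busy_workers(RATE, 0.25), saturated(RATE, 0.25, POOL_SIZE))
    # Same traffic with 10x the latency, as in the timed out scenario.
    print(busy_workers(RATE, 2.5), saturated(RATE, 2.5, POOL_SIZE))
```

At constant traffic, a 10x latency increase needs 10x the workers, so a pool sized with comfortable headroom for normal latency can still be exhausted.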

Solution

An actionable item, IMHO, would be to revisit the mediawiki timeout and set it to something really low. With sessionstore p99 latencies of <50ms per https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 and cross DC latency at 40ms, I would suggest 100ms (0.1s).
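One way to read the suggested number is as a simple budget: the timeout has to cover the p99 service latency plus a cross DC hop, with a little slack on top. The sketch below is that reading only; adding the two figures together is an assumption about the reasoning, and the names are illustrative.

```python
# Figures quoted in the comment above.
P99_MS = 50        # sessionstore p99 latency (upper bound)
CROSS_DC_MS = 40   # cross DC latency
SUGGESTED_TIMEOUT_MS = 100


def headroom_ms(timeout_ms, p99_ms=P99_MS, cross_dc_ms=CROSS_DC_MS):
    """Slack left after covering p99 latency plus one cross DC hop."""
    return timeout_ms - (p99_ms + cross_dc_ms)


if __name__ == "__main__":
    print(headroom_ms(SUGGESTED_TIMEOUT_MS))
```

A negative result would mean the timeout is tighter than the slowest expected healthy request, which would cause spurious fallbacks to redis.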

akosiaris added a project: serviceops-radar.Feb 12 2020, 3:53 PM

akosiaris added subscribers: Joe, jijiki, Dzahn and 2 others.

Paladox subscribed.Feb 12 2020, 3:59 PM

Thanks for this @akosiaris, I'll check into it today to make sure we can make the change.

CCicalese_WMF moved this task from Multi-DC (TEC1) to Session Management Service (CDP2) on the Platform Team Initiatives board.Mar 24 2020, 10:22 PM

CCicalese_WMF edited projects, added Platform Team Initiatives (Session Management Service (CDP2)); removed Platform Team Initiatives (Multi-DC (TEC1)).

Mentioned in SAL (#wikimedia-operations) [2020-05-08T12:49:09Z] <akosiaris> T243106 redo experiment with REJECT, DROP iptable rules now that we have envoy in the middle

Mentioned in SAL (#wikimedia-operations) [2020-05-08T12:49:32Z] <akosiaris> T243106 redo experiment with REJECT, DROP iptable rules now that we have envoy in the middle. Use mw1331, mw1348

Mentioned in SAL (#wikimedia-operations) [2020-05-08T13:16:30Z] <akosiaris> T243106 undo experiment with REJECT, DROP iptable rules now that we have envoy in the middle. Use mw1331, mw1348. Experiment done successfully, no issues to the infrastructure.

Mentioned in SAL (#wikimedia-operations) [2020-05-08T13:20:52Z] <akosiaris> T243106 redo experiment with DROP iptable rules this time around. Use mw1331, mw1348

Mentioned in SAL (#wikimedia-operations) [2020-05-08T14:05:53Z] <akosiaris> T243106 undo experiment with DROP iptable rules this time around. Use mw1331, mw1348

As an update,

I've rerun the above scenarios today to make sure we've addressed them.

Thanks to @Joe and the envoy-based middleware, I can safely say that this is no longer a concern. In both cases envoy very quickly did the correct thing: it marked sessionstore as failing and started returning 503s to mediawiki, which fell back to redis.
The reaction of mediawiki to both of the above scenarios (connection refused/connection timeout) was exactly the same. The tests are in the following 4 dashboards. I don't see any noticeable degradation. There is a minor increase in php worker busy time on the 2 hosts, but it could very well be due to organic traffic patterns.
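The general pattern described here (stop waiting on a failing backend, fail fast, let the caller fall back) can be sketched as follows. This is not envoy's implementation or configuration; the class, its parameters and the thresholds are hypothetical and only illustrate the idea.

```python
import time


class FailFast:
    """After repeated failures, skip the primary store for a cool-down period."""

    def __init__(self, primary, fallback, max_failures=3, cooldown=30.0,
                 clock=time.monotonic):
        self.primary = primary          # e.g. a sessionstore lookup
        self.fallback = fallback        # e.g. a redis lookup
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.open_until = 0.0

    def call(self, key):
        now = self.clock()
        if self.open_until > now:
            # Primary is marked as failing: do not even try it.
            return self.fallback(key)
        try:
            value = self.primary(key)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open_until = now + self.cooldown
                self.failures = 0
            return self.fallback(key)
        self.failures = 0
        return value
```

Once the primary is marked as failing, callers no longer pay the timeout on every request, which is the property that keeps workers from piling up.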

Thank you @akosiaris! Does this unblock us to continue with the rollout? If so I'll arrange to schedule that work on our side and coordinate with @thcipriani

In T243106#6118920, @WDoranWMF wrote:

Thank you @akosiaris! Does this unblock us to continue with the rollout? If so I'll arrange to schedule that work on our side and coordinate with @thcipriani

@WDoranWMF yes we should be cleared to proceed with a wider deployment.

In T243106#6118920, @WDoranWMF wrote:

Thank you @akosiaris! Does this unblock us to continue with the rollout? If so I'll arrange to schedule that work on our side and coordinate with @thcipriani

Indeed, that's the recommendation on our side. @wkandek will be reaching out to you about this as well.

elukey mentioned this in T252391: Reimage one memcached shard per DC to Buster.May 13 2020, 9:55 AM

elukey subscribed.

WDoranWMF added a project: Platform Engineering.May 13 2020, 11:57 AM

@WDoranWMF, should this be on clinic duty, green team, or neither?

CCicalese_WMF moved this task from Inbox to Triage Meeting Inbox on the Platform Engineering board.May 13 2020, 9:51 PM

BPirkle edited projects, added Platform Team Workboards (Clinic Duty Team); removed Platform Engineering.May 19 2020, 8:41 PM

BPirkle triaged this task as High priority.May 19 2020, 8:44 PM

BPirkle moved this task from Inbox to Ready (WIP:5) on the Platform Team Workboards (Clinic Duty Team) board.

@CCicalese_WMF It should be on Clinic Duty

Change 570395 merged by jenkins-bot:
[operations/mediawiki-config@master] Session Store: Switch group2 to kask-transition

https://gerrit.wikimedia.org/r/570395

Krinkle mentioned this in T229062: Look into a simple way to have global keys with db-replicated.Jun 2 2020, 7:13 PM

Change 570396 merged by jenkins-bot:
[operations/mediawiki-config@master] Session Store: Switch everything to kask-session

https://gerrit.wikimedia.org/r/570396

Maintenance_bot removed a project: Patch-For-Review.Jun 3 2020, 6:10 PM

Grafana: Session store

Notice how the latency breakdown on the right seems largely unaffected by the increase in traffic.

Krinkle added a project: Wikimedia-Performance-publish.Jun 3 2020, 6:21 PM

Restricted Application added a project: Performance-Team. · View Herald TranscriptJun 3 2020, 6:21 PM

Krinkle moved this task from Untriaged to Misc bookmarks on the Wikimedia-Performance-publish board.Jun 3 2020, 6:21 PM

Krinkle removed a project: Performance-Team.

• Pchelolo closed this task as Resolved.Jun 3 2020, 7:19 PM

• Pchelolo updated the task description. (Show Details)

Krinkle mentioned this in T212129: Move MainStash out of Redis to a simpler multi-dc aware solution.Jul 6 2020, 8:35 PM

	F31852647: Screenshot 2020-06-03 at 19.21.00.png
	Jun 3 2020, 6:20 PM
