We currently use the default in-memory ticket registry, but a prerequisite for HA is a replicated ticket registry (e.g. Redis, Memcached, ActiveMQ, MySQL/JPA).
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T233921 Further steps for CAS/web SSO
Resolved | | MoritzMuehlenhoff | T233931 Cross data center setup for CAS
Resolved | | jbond | T233933 Replicated ticket registry
Event Timeline
Looking at the list of supported ticket registries, we have the following options:
- Hazelcast
- Ehcache
- Ignite
- CouchDb
- Memcached
I believe only Memcached is deployed in the current production network, so it seems the most sensible option to explore, as we can make use of existing infrastructure and expertise. The following options exist for memcached:
cas.ticket.registry.memcached.servers=localhost:11211
cas.ticket.registry.memcached.locatorType=ARRAY_MOD
cas.ticket.registry.memcached.failureMode=Redistribute
cas.ticket.registry.memcached.hashAlgorithm=FNV1_64_HASH
cas.ticket.registry.memcached.shouldOptimize=false
cas.ticket.registry.memcached.daemon=true
cas.ticket.registry.memcached.maxReconnectDelay=-1
cas.ticket.registry.memcached.useNagleAlgorithm=false
cas.ticket.registry.memcached.shutdownTimeoutSeconds=-1
cas.ticket.registry.memcached.opTimeout=-1
cas.ticket.registry.memcached.timeoutExceptionThreshold=2
cas.ticket.registry.memcached.maxTotal=20
cas.ticket.registry.memcached.maxIdle=8
cas.ticket.registry.memcached.minIdle=0
cas.ticket.registry.memcached.transcoder=KRYO|SERIAL|WHALIN|WHALINV1
cas.ticket.registry.memcached.transcoderCompressionThreshold=16384
cas.ticket.registry.memcached.kryoAutoReset=false
cas.ticket.registry.memcached.kryoObjectsByReference=false
cas.ticket.registry.memcached.kryoRegistrationRequired=false
From reading the guide, most of these can remain the same, although it would be useful to have someone with more memcached knowledge validate this. We will need the cas.ticket.registry.memcached.servers setting, and the advice is to set cas.ticket.registry.memcached.transcoder to KRYO.
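As a rough illustration of how small the override would be, here is a sketch that renders just those two properties. The helper name `render_memcached_overrides` is hypothetical, and the server value is simply the default from the list above, not a real production endpoint.

```python
# Hypothetical helper: renders only the two properties discussed above.
# Everything else is assumed to stay at the CAS defaults.
PREFIX = "cas.ticket.registry.memcached"


def render_memcached_overrides(servers, transcoder="KRYO"):
    """Return the override lines for cas.properties as one string."""
    lines = [
        f"{PREFIX}.servers={servers}",
        f"{PREFIX}.transcoder={transcoder}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # localhost:11211 is the default shown in the property list above.
    print(render_memcached_overrides("localhost:11211"))
```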
Finally, I assume we will need something configured on the memcached side.
Change 550682 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] build.gradle: add memcached support to cas blob
Change 550695 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] apereo_cas: add ability to configure basic memcached support
Change 550682 merged by Jbond:
[operations/software/cas-overlay-template@master] build.gradle: add memcached support to cas blob
Change 551795 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] memcache: comment out memcache
Change 551795 merged by Jbond:
[operations/software/cas-overlay-template@master] memcache: comment out memcache
Change 592642 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp: add mcrouter (WIP)
Change 592659 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] build.gradle: add memcached support to cas blob
Change 592661 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] apero_cas: enable memcached on idp_test
Change 592660 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] apero_cas: alow ability to use memcached for tickets
Change 592642 merged by Jbond:
[operations/puppet@production] profile::idp: add mcrouter
Change 592660 merged by Jbond:
[operations/puppet@production] apero_cas: alow ability to use memcached for tickets
Change 592661 merged by Jbond:
[operations/puppet@production] apero_cas: enable memcached on idp_test
Change 601301 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp: split memcached function to its on profile
Change 601301 merged by Jbond:
[operations/puppet@production] profile::idp: split memcached function to its on profile
Change 601315 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp: use mcrouter port
Change 601315 merged by Jbond:
[operations/puppet@production] profile::idp: use mcrouter port
Change 601321 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/dns@master] IPv6: add AAAA records for francium & htmldumper1001
Change 601325 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp::memcached: mcrouter also needs to own the cert
Change 601325 merged by Jbond:
[operations/puppet@production] profile::idp::memcached: mcrouter also needs to own the cert
Change 601327 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] role:idp* add arguments for default memcached port
Change 601327 merged by Jbond:
[operations/puppet@production] role:idp* add arguments for default memcached port
Change 601321 merged by Jbond:
[operations/dns@master] IPv6: add AAAA records for francium & htmldumper1001
I have configured memcached and mcrouter for CAS; however, there is currently an error. If CAS talks directly to memcached then everything works fine. However, when CAS talks to mcrouter we get an amplification storm before a timeout is returned to CAS. After speaking with @elukey we think the following may be happening:
1. The CAS on idp-test2001 sends a set to mcrouter.
2. The idp-test2001 mcrouter sends this command to the idp-test1001 mcrouter instance and to localhost using SyncRoute, meaning the worst result should be returned to CAS.
3. localhost on idp-test2001 responds STORED for the set; idp-test2001 is still waiting for a response from idp-test1001.
4. The idp-test1001 mcrouter forwards this command to the idp-test2001 mcrouter instance and to localhost using SyncRoute, meaning the worst result should be returned.
5. localhost on idp-test1001 responds STORED for the set; idp-test1001 is still waiting for a response from idp-test2001.
6. As idp-test1001 has not finished handling its request, it cannot send a confirmation back to idp-test2001.
7. Steps 2 to 6 repeat for a second, until the timeout expires and SERVER_ERROR Reply timeout is sent back to the CAS on idp-test2001.
We could resolve this by having mcrouter talk directly to memcached in the other DC; however, this requires memcached 1.5.13 or later for TLS support, which is not currently in buster. We could of course wrap the current memcached in stunnel, but that seems a bit of a hack.
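The suspected loop, and why pointing mcrouter straight at the remote memcached would break it, can be sketched with a toy model. This is not real mcrouter code: the classes, the `budget` counter standing in for the one-second reply timeout, and the ticket key are all assumptions made for illustration.

```python
class Memcached:
    """Stand-in for a memcached instance: always stores and replies STORED."""

    def __init__(self):
        self.data = {}

    def set(self, key, value, budget):
        self.data[key] = value
        return "STORED"


class McRouter:
    """Forwards every set to all destinations and returns the worst reply."""

    def __init__(self, name):
        self.name = name
        self.destinations = []
        self.sets_handled = 0

    def set(self, key, value, budget):
        self.sets_handled += 1
        if budget <= 0:
            # The hop budget models the reply timeout expiring.
            return "SERVER_ERROR Reply timeout"
        replies = [d.set(key, value, budget - 1) for d in self.destinations]
        if all(r == "STORED" for r in replies):
            return "STORED"
        return "SERVER_ERROR Reply timeout"


def simulate(peer_via_mcrouter, budget=50):
    """Send one set from the idp-test2001 CAS and count mcrouter requests."""
    mem1, mem2 = Memcached(), Memcached()
    r1, r2 = McRouter("idp-test1001"), McRouter("idp-test2001")
    # Each router writes locally and to the other DC, either through the
    # peer's mcrouter (the looping setup) or directly to its memcached.
    r1.destinations = [mem1, r2 if peer_via_mcrouter else mem2]
    r2.destinations = [mem2, r1 if peer_via_mcrouter else mem1]
    reply = r2.set("TGT-1", "ticket", budget)
    return reply, r1.sets_handled + r2.sets_handled


if __name__ == "__main__":
    print("mcrouter -> mcrouter: ", simulate(peer_via_mcrouter=True))
    print("mcrouter -> memcached:", simulate(peer_via_mcrouter=False))
```

In the model, routing through the peer mcrouter bounces a single set back and forth until the budget runs out, while the direct topology handles it once.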
I had a quick look at the MediaWiki config and I believe MediaWiki resolves this problem in its own code by modifying the memcached key prefix; however, we don't [easily] have that control with CAS. Currently exploring further options.
@elukey please add anything I missed and correct anything I got wrong, thanks :)
> We could resolve this by having mcrouter talk directly to memcache in the other DC however this requires 1.5.13 which is not currently in buster.
1.6.6 is available in testing and seems to build correctly on buster.
@MoritzMuehlenhoff what is your opinion on using a backported memcache?
@elukey would this be a good path forward, or does mcrouter offer more than just TLS termination and replication?
I'll let Luca comment on what's best option-wise, but building a backport (and maintaining it with custom patches in case of security issues) seems like an acceptable option to me:
- most memcached security issues are harmless given its nature
- it's temporary until we move the IDPs to bullseye
- given how much we rely on memcached for our core services, gaining some advance hands-on experience with changes in 1.6 seems useful as well
I am all for testing new versions of memcached to get experience, so on this front you'll always have my +1 :)
Upstream is also very available to help and give feedback, especially if we test recent versions.
One consideration: I tried to check what the use case is for CAS using memcached, and I ended up here (not sure if it is the right link). It seems that CAS already has a mcrouter-like way of handling HA, so another solution could be to configure mcrouter with only the pool for the other DC (to leverage TLS) and configure two servers in CAS: localhost:11211 and localhost:11213. The latter will proxy transparently to the other DC's memcached without requiring us to have memcached listening on TLS.
Thanks for looking at this. This is the correct page; however, this solution seems sub-optimal. From my reading, all this does is automatically switch to a backup server in the event that the primary server fails. However, when it does so, any sessions already established will be gone and users will be forced to re-authenticate, or am I missing something? This is what we would like to avoid, i.e. we want to be able to transparently fail over services from codfw to eqiad without users noticing or being interrupted.
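To make the difference concrete, here is a toy comparison of failover-only writes against replicated writes. It is only a sketch of the concern, not CAS or memcached client code: `TicketStore`, the write helpers and the ticket id are all made up for illustration.

```python
class TicketStore:
    """Stand-in for one memcached instance holding CAS tickets."""

    def __init__(self):
        self.tickets = {}
        self.up = True


def write_failover(stores, ticket_id, ticket):
    """Failover only: write to the first store that is up."""
    for store in stores:
        if store.up:
            store.tickets[ticket_id] = ticket
            return


def write_replicated(stores, ticket_id, ticket):
    """Replication: write to every store that is up."""
    for store in stores:
        if store.up:
            store.tickets[ticket_id] = ticket


def read(stores, ticket_id):
    """Read from the first store that is up."""
    for store in stores:
        if store.up:
            return store.tickets.get(ticket_id)
    return None


def session_survives(write):
    """Create a session, take the primary down, then look the ticket up."""
    primary, backup = TicketStore(), TicketStore()
    stores = [primary, backup]
    write(stores, "TGT-1", "alice")
    primary.up = False  # simulated failover to the other DC
    return read(stores, "TGT-1") == "alice"


if __name__ == "__main__":
    print("failover only:", session_survives(write_failover))
    print("replicated:   ", session_survives(write_replicated))
```

With failover-only writes the backup has never seen the ticket, so the user has to log in again; with replicated writes the lookup still succeeds after the switch.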
Ah OK, thanks for the explanation, now it is clearer. We have to keep in mind that mcrouter can mark a shard as TKO and stop sending keys to it if it is not responsive, so even in this case the replication could end up inconsistent.
I didn't find a good solution for this use case, all docs in mcrouter seem to assume that the pools are configured to contact memcached hosts directly (so not using mcrouter as proxy).
> We have to keep in mind that mcrouter can mark a shard as TKO and stop sending keys to it if it is not responsive, so even in this case the replication could end up inconsistent.
Good to know. I think it's acceptable to drop the odd session here or there; it would just mean that one or two users would need to re-authenticate, which is much better than everyone having to.
> I didn't find a good solution for this use case, all docs in mcrouter seem to assume that the pools are configured to contact memcached hosts directly (so not using mcrouter as proxy).
Ack, thanks for looking. I will try out memcached 1.6 in that case and let you know how it goes.
Change 604603 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Create repository component for memcached 1.6
Change 604603 merged by Muehlenhoff:
[operations/puppet@production] Create repository component for memcached 1.6
Change 604626 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Make memcached 1.6 an option for the memcached class and enable for the IDPs
Change 604626 merged by Muehlenhoff:
[operations/puppet@production] Make memcached 1.6 an option for the memcached class and enable for the IDPs
Change 605937 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] memcached: add TLS support
Change 605947 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp::memcached: move SSL termination to memcached
Change 605937 merged by Jbond:
[operations/puppet@production] memcached: add TLS support
Change 605947 merged by Jbond:
[operations/puppet@production] profile::idp::memcached: move SSL termination to memcached
Change 606433 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet
Change 606433 merged by Muehlenhoff:
[operations/puppet@production] Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet
Mentioned in SAL (#wikimedia-operations) [2020-06-19T06:47:47Z] <moritzm> force reinstall of memcached 1.6 deb packages to ensure that the override is used in addition to the unmodified systemd unit from the deb T233933
Change 592659 abandoned by Jbond:
build.gradle: add memcached support to cas blob
Reason:
already deployed
Did some tests on idp-test* and it's working nicely: sessions persisted across a Tomcat restart, and they were also present when explicitly addressing the failover IDP.
Change 592659 merged by Jbond:
[operations/software/cas-overlay-template@master] build.gradle: add memcached support to cas blob
Change 629344 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Add a helper to dump/restore memcached for reboots
Change 629344 merged by Muehlenhoff:
[operations/puppet@production] Add a helper to dump/restore memcached for reboots
Change 550695 abandoned by Jbond:
[operations/puppet@production] apereo_cas: add ability to configure basic memcached support
Reason:
superseded