Replicated ticket registry
Closed, Resolved (Public)

Description

We currently use the default in-memory ticket registry, but a prerequisite for HA is a replicated ticket registry (Redis, Memcached, ActiveMQ, MySQL/JPA).

Details

Repo                                      Branch      Lines +/-
operations/puppet                         production  +76 -42
operations/puppet                         production  +64 -1
operations/software/cas-overlay-template  master      +8 -2
operations/puppet                         production  +2 -0
operations/puppet                         production  +20 -23
operations/puppet                         production  +63 -11
operations/puppet                         production  +12 -3
operations/puppet                         production  +1 -0
operations/dns                            master      +9 -2
operations/puppet                         production  +2 -0
operations/puppet                         production  +1 -0
operations/puppet                         production  +1 -0
operations/puppet                         production  +69 -63
operations/puppet                         production  +2 -0
operations/puppet                         production  +124 -57
operations/puppet                         production  +65 -5
operations/software/cas-overlay-template  master      +2 -2
operations/software/cas-overlay-template  master      +2 -0

Event Timeline

herron triaged this task as Medium priority. Sep 26 2019, 5:18 PM

Looking at the list of supported ticket registries, we have the following options:

  • Hazelcast
  • Ehcache
  • Ignite
  • CouchDb
  • Memcached

I believe only Memcached is deployed in the current production network, so it seems the most sensible option to explore, as we can make use of existing infrastructure and expertise. The following options exist for Memcached:

cas.ticket.registry.memcached.servers=localhost:11211
cas.ticket.registry.memcached.locatorType=ARRAY_MOD
cas.ticket.registry.memcached.failureMode=Redistribute
cas.ticket.registry.memcached.hashAlgorithm=FNV1_64_HASH
cas.ticket.registry.memcached.shouldOptimize=false
cas.ticket.registry.memcached.daemon=true
cas.ticket.registry.memcached.maxReconnectDelay=-1
cas.ticket.registry.memcached.useNagleAlgorithm=false
cas.ticket.registry.memcached.shutdownTimeoutSeconds=-1
cas.ticket.registry.memcached.opTimeout=-1
cas.ticket.registry.memcached.timeoutExceptionThreshold=2
cas.ticket.registry.memcached.maxTotal=20
cas.ticket.registry.memcached.maxIdle=8
cas.ticket.registry.memcached.minIdle=0
cas.ticket.registry.memcached.transcoder=KRYO|SERIAL|WHALIN|WHALINV1
cas.ticket.registry.memcached.transcoderCompressionThreshold=16384
cas.ticket.registry.memcached.kryoAutoReset=false
cas.ticket.registry.memcached.kryoObjectsByReference=false
cas.ticket.registry.memcached.kryoRegistrationRequired=false

From reading the guide, most of these can remain the same, although it would be useful to have someone with more memcached knowledge validate this. We will need the cas.ticket.registry.memcached.servers setting, and the advice is to set cas.ticket.registry.memcached.transcoder to KRYO.
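
As a rough sketch of what the override might look like, the snippet below renders just the two settings mentioned above and leaves everything else at the listed defaults. The host names are placeholders, and it assumes the servers property takes a comma-separated list, which should be checked against the CAS documentation.

```python
# Hypothetical host names; replace with the real memcached endpoints.
SERVERS = ["memcached1.example.org:11211", "memcached2.example.org:11211"]


def render_overrides(servers, transcoder="KRYO"):
    """Return the CAS properties lines that differ from the listed defaults."""
    prefix = "cas.ticket.registry.memcached"
    return [
        # Assumption: multiple servers are given as a comma-separated list.
        f"{prefix}.servers={','.join(servers)}",
        f"{prefix}.transcoder={transcoder}",
    ]


if __name__ == "__main__":
    print("\n".join(render_overrides(SERVERS)))
```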

Finally, I assume we will need to configure something on the memcached side.

Change 550682 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] build.gradle: add memcached support to cas blob

https://gerrit.wikimedia.org/r/550682

Change 550695 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] apereo_cas: add ability to configure basic memcached support

https://gerrit.wikimedia.org/r/550695

Change 550682 merged by Jbond:
[operations/software/cas-overlay-template@master] build.gradle: add memcached support to cas blob

https://gerrit.wikimedia.org/r/550682

Change 551795 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] memcache: comment out memcache

https://gerrit.wikimedia.org/r/551795

Change 551795 merged by Jbond:
[operations/software/cas-overlay-template@master] memcache: comment out memcache

https://gerrit.wikimedia.org/r/551795

Change 592642 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp: add mcrouter (WIP)

https://gerrit.wikimedia.org/r/592642

Change 592659 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] build.gradle: add memcached support to cas blob

https://gerrit.wikimedia.org/r/592659

Change 592661 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] apero_cas: enable memcached on idp_test

https://gerrit.wikimedia.org/r/592661

Change 592660 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] apero_cas: alow ability to use memcached for tickets

https://gerrit.wikimedia.org/r/592660

Change 592642 merged by Jbond:
[operations/puppet@production] profile::idp: add mcrouter

https://gerrit.wikimedia.org/r/592642

Change 592660 merged by Jbond:
[operations/puppet@production] apero_cas: alow ability to use memcached for tickets

https://gerrit.wikimedia.org/r/592660

Change 592661 merged by Jbond:
[operations/puppet@production] apero_cas: enable memcached on idp_test

https://gerrit.wikimedia.org/r/592661

Change 601301 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp: split memcached function to its on profile

https://gerrit.wikimedia.org/r/601301

Change 601301 merged by Jbond:
[operations/puppet@production] profile::idp: split memcached function to its on profile

https://gerrit.wikimedia.org/r/601301

Change 601315 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp: use mcrouter port

https://gerrit.wikimedia.org/r/601315

Change 601315 merged by Jbond:
[operations/puppet@production] profile::idp: use mcrouter port

https://gerrit.wikimedia.org/r/601315

Change 601321 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/dns@master] IPv6: add AAAA records for francium & htmldumper1001

https://gerrit.wikimedia.org/r/601321

Change 601325 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp::memcached: mcrouter also needs to own the cert

https://gerrit.wikimedia.org/r/601325

Change 601325 merged by Jbond:
[operations/puppet@production] profile::idp::memcached: mcrouter also needs to own the cert

https://gerrit.wikimedia.org/r/601325

Change 601327 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] role:idp* add arguments for default memcached port

https://gerrit.wikimedia.org/r/601327

Change 601327 merged by Jbond:
[operations/puppet@production] role:idp* add arguments for default memcached port

https://gerrit.wikimedia.org/r/601327

Change 601321 merged by Jbond:
[operations/dns@master] IPv6: add AAAA records for francium & htmldumper1001

https://gerrit.wikimedia.org/r/601321

I have configured memcached and mcrouter for CAS; however, there is currently an error. If CAS talks directly to memcached, everything works fine. However, when CAS talks to mcrouter we get an amplification storm before a timeout is sent to CAS. After speaking with @elukey, we think the following may be happening:

  1. idp-test2001 CAS sends a set to mcrouter
  2. idp-test2001 mcrouter sends this command to the idp-test1001 mcrouter instance and to localhost using SyncRoute, meaning the worst result should be returned to CAS
  3. localhost on idp-test2001 responds STORED for the set; idp-test2001 is still waiting for a response from idp-test1001
  4. idp-test1001 mcrouter sends this command to the idp-test2001 mcrouter instance and to localhost using SyncRoute, meaning the worst result should be returned to CAS
  5. localhost on idp-test1001 responds STORED for the set; idp-test1001 is still waiting for a response from idp-test2001
  6. as idp-test1001 has not finished its request, it cannot send a confirmation back to idp-test2001
  7. steps 2 to 6 repeat for a second, until the timeout expires and SERVER_ERROR Reply timeout is sent back to the idp-test2001 CAS
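
The loop in the steps above can be sketched as a toy simulation. This is not mcrouter code: it only models each router forwarding a set to whatever its remote destination is, with the local memcached assumed to answer STORED immediately. The 1 ms per hop is an invented figure; the one-second timeout follows step 7.

```python
TIMEOUT_REPLY = "SERVER_ERROR Reply timeout"


def simulate_set(remote, origin, hop_ms=1, timeout_ms=1000):
    """Follow one SET from the router on `origin` until a reply reaches CAS.

    `remote` maps a router's host to (kind, target), where kind is either
    "mcrouter" (the target forwards again) or "memcached" (the target just
    stores). Returns (reply, number_of_router_to_router_forwards).
    """
    elapsed_ms = 0
    router_hops = 0
    router = origin
    while True:
        kind, target = remote[router]
        elapsed_ms += hop_ms  # assumed cost of one forward
        if elapsed_ms >= timeout_ms:
            return TIMEOUT_REPLY, router_hops
        if kind == "memcached":
            return "STORED", router_hops
        # The peer router repeats the same replication, back to the sender.
        router_hops += 1
        router = target


# Each mcrouter replicates to the *other mcrouter* (the failing setup).
LOOPED = {
    "idp-test2001": ("mcrouter", "idp-test1001"),
    "idp-test1001": ("mcrouter", "idp-test2001"),
}
# Each mcrouter replicates to the other host's memcached directly.
DIRECT = {
    "idp-test2001": ("memcached", "idp-test1001"),
    "idp-test1001": ("memcached", "idp-test2001"),
}

if __name__ == "__main__":
    print(simulate_set(LOOPED, "idp-test2001"))
    print(simulate_set(DIRECT, "idp-test2001"))
```

In the looped topology the set bounces between the two routers until the timeout fires; with a memcached as the remote destination it terminates after a single forward.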

We could resolve this by having mcrouter talk directly to memcached in the other DC; however, this requires memcached 1.5.13, which is not currently in buster. We could of course wrap the current memcached in stunnel, but that seems a bit of a hack?

I had a quick look at the MediaWiki config and I believe MediaWiki resolves this problem in its own code by modifying the memcached key prefix; however, we don't [easily] have that control with CAS. Currently exploring further options.

@elukey please add anything I missed and correct anything I got wrong, thanks :)

> We could resolve this by having mcrouter talk directly to memcached in the other DC; however, this requires memcached 1.5.13, which is not currently in buster.

1.6.6 is available in testing and seems to build correctly on buster.

@MoritzMuehlenhoff what is your opinion on using a backported memcached?

@elukey would this be a good path forward, or does mcrouter offer more than just TLS termination and replication?

I'll let Luca comment on what's best option-wise, but building a backport (and maintaining it with custom patches in case of security issues) seems like an acceptable option to me:

  • most memcached security issues are harmless given its nature
  • it's temporary until we move the IDPs to bullseye
  • given how much we rely on memcached for our core services, gaining some advance hands-on experience with changes in 1.6 seems useful as well

I am all for testing new versions of memcached to get experience, so on this front you'll always have my +1 :)
Upstream is also very available to help and give feedback, especially if we test recent versions.

One consideration - I tried to check what the use case is for CAS using memcached, and I ended up in here (not sure if the right link). It seems that CAS already has a mcrouter-like way of handling HA, so another solution could be to configure mcrouter with only the pool for the other DC (to leverage TLS) and configure two servers in CAS: localhost:11211 and localhost:11213. The latter will proxy transparently to the other DC's memcached without requiring us to have memcached listening on TLS.

> I am all for testing new versions of memcached to get experience, so on this front you'll always have my +1 :)
> Upstream is also very available to help and give feedback, especially if we test recent versions.
>
> One consideration - I tried to check what is the use case for CAS using memcached, and I ended up in here (not sure if the right link). It seems that CAS already have a mcrouter-like way of handling HA,

Thanks for looking at this. This is the correct page; however, this solution seems sub-optimal. From my reading, all this does is automatically switch to a backup server in the event that the primary server fails. However, when it does so, any sessions already established will be gone and users will be forced to re-authenticate, or am I missing something? This is what we would like to avoid, i.e. we want to be able to transparently fail over services from codfw to eqiad without users noticing or being interrupted.
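
To make the difference concrete, here is a toy model of the two behaviours as I read them. It does not reflect CAS or memcached client internals: the Node class, the function names and the TGT-1 ticket id are all made up for illustration.

```python
class Node:
    """Stand-in for one memcached server holding tickets in a dict."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.tickets = {}


def failover_set(nodes, key, value):
    """Write only to the first node that is up (primary, else backup)."""
    for node in nodes:
        if node.up:
            node.tickets[key] = value
            return


def replicated_set(nodes, key, value):
    """Write to every node that is up."""
    for node in nodes:
        if node.up:
            node.tickets[key] = value


def get(nodes, key):
    """Read from the first node that is up."""
    for node in nodes:
        if node.up:
            return node.tickets.get(key)
    return None


def survives_primary_loss(write):
    """Store a ticket, take the primary down, and check it is still readable."""
    nodes = [Node("primary"), Node("backup")]
    write(nodes, "TGT-1", "session")  # hypothetical ticket id
    nodes[0].up = False
    return get(nodes, "TGT-1") is not None


if __name__ == "__main__":
    print("failover only:", survives_primary_loss(failover_set))
    print("replicated:   ", survives_primary_loss(replicated_set))
```

With failover only, the backup never saw the ticket, so the user has to log in again; with replicated writes the ticket is still there after the primary goes away.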

Ah ok, thanks for the explanation, now it is clearer. We have to keep in mind that mcrouter can have TKOs and stop sending keys to a particular shard if it is not responsive, so even in this case the replication could end up inconsistent.

I didn't find a good solution for this use case, all docs in mcrouter seem to assume that the pools are configured to contact memcached hosts directly (so not using mcrouter as proxy).

> We have to keep in mind that mcrouter can have TKOs and stop sending keys to a particular shard if it is not responsive, so even in this case the replication could end up inconsistent.

Good to know. I think it's acceptable to drop the odd session here or there; it would just mean that one or two users would need to re-authenticate, which is much better than everyone having to.

> I didn't find a good solution for this use case, all docs in mcrouter seem to assume that the pools are configured to contact memcached hosts directly (so not using mcrouter as proxy).

Ack, thanks for looking. I will try out memcached 1.6 in that case and let you know how it goes.

Change 604603 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Create repository component for memcached 1.6

https://gerrit.wikimedia.org/r/604603

Change 604603 merged by Muehlenhoff:
[operations/puppet@production] Create repository component for memcached 1.6

https://gerrit.wikimedia.org/r/604603

Change 604626 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Make memcached 1.6 an option for the memcached class and enable for the IDPs

https://gerrit.wikimedia.org/r/604626

Change 604626 merged by Muehlenhoff:
[operations/puppet@production] Make memcached 1.6 an option for the memcached class and enable for the IDPs

https://gerrit.wikimedia.org/r/604626

Change 605937 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] memcached: add TLS support

https://gerrit.wikimedia.org/r/605937

Change 605947 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::idp::memcached: move SSL termination to memcached

https://gerrit.wikimedia.org/r/605947

Change 605937 merged by Jbond:
[operations/puppet@production] memcached: add TLS support

https://gerrit.wikimedia.org/r/605937

Change 605947 merged by Jbond:
[operations/puppet@production] profile::idp::memcached: move SSL termination to memcached

https://gerrit.wikimedia.org/r/605947

Change 606433 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet

https://gerrit.wikimedia.org/r/606433

Change 606433 merged by Muehlenhoff:
[operations/puppet@production] Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet

https://gerrit.wikimedia.org/r/606433

Mentioned in SAL (#wikimedia-operations) [2020-06-19T06:47:47Z] <moritzm> force reinstall of memcached 1.6 deb packages to ensure that the override is used in addition to the unmodified systemd unit from the deb T233933

Change 592659 abandoned by Jbond:
build.gradle: add memcached support to cas blob

Reason:
already deployed

https://gerrit.wikimedia.org/r/592659

Did some tests on idp-test* and it's working nicely: sessions persisted across a Tomcat restart, and when explicitly addressing the failover IDP they were also present.

Change 592659 restored by Jbond:
build.gradle: add memcached support to cas blob

https://gerrit.wikimedia.org/r/592659

Change 592659 merged by Jbond:
[operations/software/cas-overlay-template@master] build.gradle: add memcached support to cas blob

https://gerrit.wikimedia.org/r/592659

Change 629344 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Add a helper to dump/restore memcached for reboots

https://gerrit.wikimedia.org/r/629344

Change 629344 merged by Muehlenhoff:
[operations/puppet@production] Add a helper to dump/restore memcached for reboots

https://gerrit.wikimedia.org/r/629344

Change 550695 abandoned by Jbond:
[operations/puppet@production] apereo_cas: add ability to configure basic memcached support

Reason:
superseded

https://gerrit.wikimedia.org/r/550695

jbond claimed this task.