
Add redundancy to IRC recent changes service
Closed, ResolvedPublic

Description

At the moment, IRC and RCStream services are available only in codfw. Adding those services in eqiad as well would provide redundancy in case of codfw going down.

Further analysis required, but it seems fairly straightforward to have those services in active/active mode.

Event Timeline

Krinkle renamed this task from Add redundancy to IRC and RCStream changelog to Add redundancy to IRC and RCStream services. Apr 21 2016, 3:24 PM
Krinkle updated the task description.

As part of T123729 (migrate irc.wikimedia.org to Jessie) the metal server (argon) in Eqiad running Ubuntu Precise was effectively replaced by a VM (kraz) in Codfw running Debian Jessie.

It seems odd, though, to operate a SPOF service permanently out of the secondary data centre. On the other hand, switching over the IRC service without user disruption is impossible in the current setup. So it probably isn't too bad to keep it as it is right now.

So I guess that makes this task about setting up a VM similar to kraz in Eqiad, as a stand-by? Running active-active is gonna be more complicated, but having a stand-by would be a good first step.

@Krinkle I don't see how running this active-active would be complicated. Set up a second VM somewhere, tell MW to send all changes there as well, add an extra A record to irc.wikimedia.org for it?
Edit: Okay it's a CNAME, but still I don't think this would be a major problem.

(That being said, if this were done today it'd probably make more sense to do the stretch/buster migration work *first* and start the new VM with that, instead of creating a new jessie VM that'd just need rebuilding in a few months anyway.)

@Krenair What you describe sounds like a standby ready for manual failover, not active-active.

I consider a service active-active if users are routed to either of the available instances, and that upon issues with one of them, internal routing can depool one of them and users mostly don't notice anything.

As long as people connect directly by public IP to one of these two instances, I don't see that happening without some kind of proxy or other intermediary mechanism to ensure that upon failure and client reconnect you end up on the other instance?

Krinkle renamed this task from Add redundancy to IRC and RCStream services to Add redundancy to IRC recent changes service. May 14 2019, 12:52 PM
Krinkle updated the task description.
Krinkle moved this task from Later to Discuss next on the Sustainability (MediaWiki-MultiDC) board.

People connect via hostname, and we can put multiple A records in for the different VMs?
Edit: Though I'm not sure how well clients will handle one of them failing while still listed in DNS. Still would be less effort for ops to just alter DNS in such a case instead of having to build a new VM during an incident

That's an old task! @Ottomata et al may have an opinion.

Also see T185319, plus I'm sure we've discussed various ideas over time in other tasks, to either deprecate or transform the IRC feed into more of a stateless, HA service that doesn't depend on patched ancient software (e.g. this PoC of mine from 2016: https://gist.github.com/paravoid/3419e0b5ae1f24b6ea21906a142f2f47).

[..] Still would be less effort for ops to just alter DNS in such a case instead of having to build a new VM during an incident

Yes, this task is for setting up a standby for manual failover. How ops do that failover exactly is up to them. What is described here is not active-active; that is indeed more complicated, out of scope for this task, and likely won't happen as part of minimal maintenance effort without additional resourcing or priority.

It'd be really fun to build a replacement IRC service based on the mediawiki.recentchange Kafka topic. Perhaps one day I will have time... :)

Per today's Multi-DC meeting, I'm detaching this from the current workboard. It was our understanding that the messages here are largely, and perhaps even exclusively, sent from the primary DC (assuming RC events only originate from write actions and from GET requests we classify as write actions, per T91820).

And while that is not enforced, for the theoretical case of secondary emissions, those are:

  • not in the critical path (post-send) and thus not latency sensitive or in need of a local proxy,
  • naturally public and without any PII, hence not in dire need of TLS for cross-dc.

This task is still important: no longer dispatching to IRC from MW would reduce maintenance cost and overall system complexity, and making the service a stateless active-active proxy in both DCs, based on our current Kafka pipeline, would improve its general availability and stability.

As such, not closing and keeping on the general Sustainability board.

Current status: irc2001 is irc.wm.o, and irc1001 is receiving events from MediaWiki and is a hot spare that can be failed over to by adjusting the irc.wm.o CNAME (on a 5min TTL).

@Krenair What you describe sounds like a standby ready for manual failover, not active-active.

I consider a service active-active if users are routed to either of the available instances, and that upon issues with one of them, internal routing can depool one of them and users mostly don't notice anything.

As long as people connect directly by public IP to one of these two instances, I don't see that happening without some kind of proxy or other intermediary mechanism to ensure that upon failure and client reconnect you end up on the other instance?

Is it even possible for IRC to be active-active? Doesn't the client have to maintain a connection with a single server, and if that server drops, they disconnect, retry and get a connection again (maybe internally to a different server)? In that downtime though you're going to miss a few events. Unless the server remembers what your last position was (which the EventStreams protocol does!), I'm not sure how we avoid that.
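To illustrate the "remembers your last position" point: a rough sketch of a server-side replay buffer, where a reconnecting client passes the last event ID it saw and gets back whatever it missed. This is only an illustration of the idea, not how EventStreams is implemented; `ReplayBuffer`, `publish` and `since` are hypothetical names, and plain IRC has no equivalent of the `last_seen_id` handshake.

```python
from collections import deque


class ReplayBuffer:
    """Keep the most recent events so a reconnecting client can catch up.

    Hypothetical helper for illustration only.
    """

    def __init__(self, maxlen=1000):
        # Bounded buffer of (event_id, payload) pairs; oldest entries drop off.
        self._events = deque(maxlen=maxlen)
        self._next_id = 1

    def publish(self, payload):
        """Store a new event and return the ID assigned to it."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, payload))
        return event_id

    def since(self, last_seen_id):
        """Return buffered events with an ID greater than last_seen_id."""
        return [(i, p) for i, p in self._events if i > last_seen_id]
```

A client that saw event 1 before disconnecting would call `since(1)` on reconnect and receive events 2 onwards, as long as they are still in the buffer.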

Most clients/frameworks I'm aware of allow you to set fallback servers if the primary one isn't available, so we could encourage people to set the server list to: ['irc.wikimedia.org', 'irc1001.wikimedia.org', 'irc2001.wikimedia.org']. So even if irc.wm (currently irc2001) goes down, it'll cycle to the next one on the list and try it. And while specific server hostnames are hardcoded, they'll still be able to connect to irc.wm.o if/when we set up new IRC VMs.
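For clients that don't have this built in, the fallback behaviour can be sketched roughly as below. The hostnames are the ones from the list above; the port (6667, the conventional plain-text IRC port) and the function name `connect_with_fallback` are assumptions for illustration, and a real client would still need to do the IRC handshake and reconnect handling on top of this.

```python
import socket

# Hostnames from the comment above. The port is an assumption.
SERVERS = ["irc.wikimedia.org", "irc1001.wikimedia.org", "irc2001.wikimedia.org"]
PORT = 6667


def connect_with_fallback(servers, port=PORT, timeout=10.0,
                          connect=socket.create_connection):
    """Return (hostname, connection) for the first server that accepts.

    `connect` is injectable so the logic can be exercised without a network.
    """
    errors = []
    for host in servers:
        try:
            return host, connect((host, port), timeout)
        except OSError as exc:
            # Remember the failure and move on to the next server in the list.
            errors.append((host, exc))
    raise ConnectionError(f"all servers failed: {errors}")
```

Calling `connect_with_fallback(SERVERS)` tries each hostname in order and only raises once every server in the list has refused or timed out.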

Is it even possible for IRC to be active-active? Doesn't the client have to maintain a connection with a single server, and if that server drops, they disconnect, retry and get a connection again (maybe internally to a different server)? In that downtime though you're going to miss a few events. Unless the server remembers what your last position was (which the EventStreams protocol does!), I'm not sure how we avoid that.

Exactly, I rebooted kraz last week to drain the last remaining connected clients and they all failed over/reconnected almost immediately to the secondary host.

I think this task can be closed; the current solution is about as good as it can get until we replace it with the new service or everything moves to EventStreams.

Ack, not missing messages != active-active. So long as reconnecting to the same hostname is expected to work within a reasonable amount of time, I guess we can close this. Requiring a public DNS change, and requiring that clients not be subject to a stale cache, is not ideal; e.g. an internal service IP or some other indirection seems better, but that's an improvement for later perhaps.

Krinkle assigned this task to Legoktm.