
Add redundancy to IRC recent changes service
Open, Normal, Public

Description

At the moment, IRC and RCStream services are available only in codfw. Adding those services in eqiad as well would provide redundancy in case of codfw going down.

Further analysis required, but it seems fairly straightforward to have those services in active/active mode.

Event Timeline

Gehel created this task. Mar 2 2016, 2:39 PM
Restricted Application added a subscriber: Aklapper. Mar 2 2016, 2:39 PM
Krinkle renamed this task from Add redundancy to IRC and RCStream changelog to Add redundancy to IRC and RCStream services. Apr 21 2016, 3:24 PM
Krinkle updated the task description.
Krinkle added a subscriber: Krinkle. May 5 2017, 3:08 AM

As part of T123729 (migrate irc.wikimedia.org to Jessie) the metal server (argon) in Eqiad running Ubuntu Precise was effectively replaced by a VM (kraz) in Codfw running Debian Jessie.

It seems odd, though, to operate a SPOF service permanently out of the secondary data centre. On the other hand, switching over the IRC service without user disruption is impossible in the current setup. So it probably isn't too bad to keep it as it is right now.

So I guess that makes this task about setting up a VM similar to kraz in Eqiad, as a stand-by? Running active-active is gonna be more complicated, but having a stand-by would be a good first step.

Krenair added a subscriber: Krenair. Edited May 3 2019, 6:11 PM

@Krinkle I don't see how running this active-active would be complicated. Set up a second VM somewhere, tell MW to send all changes there as well, add an extra A record to irc.wikimedia.org for it?
Edit: Okay it's a CNAME, but still I don't think this would be a major problem.

Krenair updated the task description. May 3 2019, 6:12 PM

(That being said, if this were done today it'd probably make more sense to do the stretch/buster migration work *first* and start the new VM with that, instead of creating a new jessie VM that'd just need rebuilding in a few months anyway.)

Krinkle added a comment. Edited May 14 2019, 12:51 PM

@Krenair What you describe sounds like a standby ready for manual failover, not active-active.

I consider a service active-active if users are routed to either of the available instances and, when one of them has issues, internal routing can depool it and users mostly don't notice anything.

As long as people connect directly by public IP to one of these two instances, I don't see that happening without some kind of proxy or other intermediary mechanism to ensure that upon failure and client reconnect you end up on the other instance?

Krinkle renamed this task from Add redundancy to IRC and RCStream services to Add redundancy to IRC recent changes service. May 14 2019, 12:52 PM
Krinkle updated the task description.
Krinkle moved this task from Backlog to Next-up on the Availability (MediaWiki-MultiDC) board.
Krenair added a comment. Edited May 14 2019, 12:53 PM

People connect via hostname, and we can put multiple A records in for the different VMs?
Edit: Though I'm not sure how well clients will handle one of them failing while still listed in DNS. Still would be less effort for ops to just alter DNS in such a case instead of having to build a new VM during an incident
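
To make the open question above concrete: with multiple A records, an outage is only invisible to users if the client itself walks the address list on reconnect. Below is a minimal sketch of that client-side behaviour in Python. It is an illustration only; whether the IRC clients and bots actually in use behave this way is not established in this task. The function names, the timeout, and the injectable `connect` parameter are all made up for the example.

```python
import socket


def resolve_all(host, port):
    """Return every distinct address the hostname resolves to, in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def connect_first_available(addresses, port, timeout=5.0,
                            connect=socket.create_connection):
    """Try each address in turn and return (address, socket) for the first success.

    `connect` is injectable (hypothetical design choice) so the failover
    logic can be exercised without touching the network.
    """
    last_error = None
    for address in addresses:
        try:
            return address, connect((address, port), timeout)
        except OSError as exc:
            # This instance is down or unreachable: fall through to the next record.
            last_error = exc
    raise ConnectionError(f"no address reachable: {last_error}")
```

A client would call `connect_first_available(resolve_all(hostname, port), port)` on every (re)connect. A client that only ever tries the first address returned would keep hitting a dead instance for as long as its record stays in DNS.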

That's an old task! @Ottomata et al may have an opinion.

Also see T185319, plus I'm sure we've discussed various ideas over time in other tasks, to either deprecate or transform the IRC feed into more of a stateless, HA service that doesn't depend on patched ancient software (e.g. this PoC of mine from 2016: https://gist.github.com/paravoid/3419e0b5ae1f24b6ea21906a142f2f47).

[..] Still would be less effort for ops to just alter DNS in such a case instead of having to build a new VM during an incident

Yes, this task is for setting up a standby for manual failover. How ops do that failover exactly is up to them. What is described here is not active-active; that is indeed more complicated, out of scope for this task, and likely won't happen as part of minimal maintenance effort without additional resourcing or priority.

It'd be really fun to build a replacement IRC service based on the mediawiki.recentchange Kafka topic. Perhaps one day I will have time... :)
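
For illustration, the stateless core of such a replacement could be a pure function from one event to one IRC line. The Python sketch below assumes each event is a JSON object with `server_name`, `title`, `type`, `user` and `comment` fields, and that channels are named after the server name without its `.org` suffix. Both are assumptions made for this example. The output text is also not the existing feed's line format, which existing consumers would presumably need preserved.

```python
import json


def channel_for(server_name):
    """Map a wiki's server name to a channel name (assumed naming scheme)."""
    if server_name.endswith(".org"):
        server_name = server_name[: -len(".org")]
    return "#" + server_name


def format_irc_line(raw_event):
    """Turn one JSON-encoded change event into a single IRC PRIVMSG line.

    The field names used here are assumptions for the sketch, not a
    confirmed schema.
    """
    event = json.loads(raw_event)
    channel = channel_for(event["server_name"])
    text = "[[{title}]] {type} * {user} * {comment}".format(
        title=event["title"],
        type=event["type"],
        user=event["user"],
        comment=event.get("comment", ""),
    )
    # Strip CR/LF so user-supplied text cannot inject extra IRC commands.
    text = text.replace("\r", " ").replace("\n", " ")
    return f"PRIVMSG {channel} :{text}\r\n"
```

Because the function keeps no state between events, any number of identical instances could consume the same topic, which is the property that would make the HA side of this task simpler.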

ArielGlenn triaged this task as Normal priority. May 15 2019, 10:42 AM