Add redundancy to IRC recent changes service
Closed, ResolvedPublic

Assigned To

Authored By

	Gehel
	Mar 2 2016, 2:39 PM

Description

At the moment, IRC ~~and RCStream~~ services are available only in codfw. Adding those services in eqiad as well would provide redundancy in case of codfw going down.

Further analysis required, but it seems fairly straightforward to have those services in active/active mode.

Related Objects

Status	Assigned	Task
Resolved	Legoktm	T128592 Add redundancy to IRC recent changes service
Resolved	Dzahn	T123729 Migrate irc.wikimedia.org to Jessie
Resolved	MoritzMuehlenhoff	T132427 Build ircd-ratbox for jessie
Resolved	MoritzMuehlenhoff	T133101 build python-irclib for jessie
Resolved	Dzahn	T122933 Remove the "HTTPS to HTTP" url filter in the IRC feed
Resolved	Dzahn	T105422 enable IPv6 on irc.wikimedia.org
Declined	None	T105804 schedule maintenance for IRC server
Open	None	T234234 Port architecture of irc-recentchanges to Kafka
Resolved	Ottomata	T240181 Documentation improvements for Eventstreams
Open	None	T240182 Create EventStream's equivalent to irc.wikimedia.org's #central channel
Open	None	T242712 Deprecation (if possible) of the #central channel on irc.wikimedia.org
Resolved	Operator873	T244542 Use EventStreams for SULWatcher
Resolved	elukey	T244719 Create a replacement for kraz.wikimedia.org
Invalid	None	T245279 decommission kraz.wikimedia.org
Resolved	MoritzMuehlenhoff	T278255 Set up spare irc1001.wikimedia.org in eqiad

Event Timeline

Gehel created this task.Mar 2 2016, 2:39 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 2 2016, 2:39 PM

Krinkle renamed this task from Add redundancy to IRC and RCStream changelog to Add redundancy to IRC and RCStream services.Apr 21 2016, 3:24 PM

Krinkle updated the task description. (Show Details)

Krinkle added a subtask: T123729: Migrate irc.wikimedia.org to Jessie.

• MZMcBride subscribed.Apr 21 2016, 3:24 PM

Dzahn closed subtask T123729: Migrate irc.wikimedia.org to Jessie as Resolved.May 3 2016, 1:56 AM

Ricordisamoa subscribed.May 8 2016, 4:14 PM

As part of T123729 (migrate irc.wikimedia.org to Jessie) the metal server (argon) in Eqiad running Ubuntu Precise was effectively replaced by a VM (kraz) in Codfw running Debian Jessie.

It seems odd though, to operate a SPOF-service permanently out of the secondary data centre. On the other hand, switching over the irc service without user disruption is impossible in the current setup. So it probably isn't too bad to keep it as it is right now.

So I guess that makes this task about setting up a VM similar to kraz in Eqiad, as stand-by? Running active-active is gonna be more complicated, but having a stand-by would be a good first step.

@Krinkle I don't see how running this active-active would be complicated. Set up a second VM somewhere, tell MW to send all changes there as well, add an extra A record to irc.wikimedia.org for it?
Edit: Okay it's a CNAME, but still I don't think this would be a major problem.

Krenair updated the task description. (Show Details)May 3 2019, 6:12 PM

(That being said, if this were done today it'd probably make more sense to do the stretch/buster migration work *first* and start the new VM with that, instead of creating a new jessie VM that'd just need rebuilding in a few months anyway.)

@Krenair What you describe sounds like a standby ready for manual failover, not active-active.

I consider a service active-active if users are routed to either of the available instances, and that upon issues with one of them, internal routing can depool one of them and users mostly don't notice anything.

As long as people connect directly by public IP to one of these two instances, I don't see that happening without some kind of proxy or other intermediary mechanism to ensure that upon failure and client reconnect you end up on the other instance?

Krinkle renamed this task from Add redundancy to IRC and RCStream services to Add redundancy to IRC recent changes service.May 14 2019, 12:52 PM

Krinkle updated the task description. (Show Details)

Krinkle added projects: Sustainability (MediaWiki-MultiDC), SRE.

Krinkle moved this task from Later to Discuss next on the Sustainability (MediaWiki-MultiDC) board.

People connect via hostname, and we can put multiple A records in for the different VMs?
Edit: Though I'm not sure how well clients will handle one of them failing while still listed in DNS. Still would be less effort for ops to just alter DNS in such a case instead of having to build a new VM during an incident

That's an old task! @Ottomata et al may have an opinion.

Also see T185319, plus I'm sure we've discussed various ideas over time in other tasks, to either deprecate or transform the IRC feed into more of a stateless, HA service that doesn't depend on patched ancient software (e.g. this PoC of mine from 2016: https://gist.github.com/paravoid/3419e0b5ae1f24b6ea21906a142f2f47).

In T128592#5180472, @Krenair wrote:

[..] Still would be less effort for ops to just alter DNS in such a case instead of having to build a new VM during an incident

Yes, this task is for setting up a standby for manual failover. How ops do that failover exactly is up to them. What is described here is not active-active; that is indeed more complicated, out of scope for this task, and likely won't happen as part of minimal maintenance effort without additional resourcing or priority.

It'd be really fun to build a replacement IRC service based on the mediawiki.recentchange Kafka topic. Perhaps one day I will have time... :)

ArielGlenn triaged this task as Medium priority.May 15 2019, 10:42 AM

Krinkle moved this task from Discuss next to Current: Performance Team on the Sustainability (MediaWiki-MultiDC) board.May 11 2020, 9:05 PM

Krinkle added subtasks: T234234: Port architecture of irc-recentchanges to Kafka, T232483: Port IRCRecentChanges to Kafka.

Krinkle removed a subtask: T232483: Port IRCRecentChanges to Kafka.May 11 2020, 9:08 PM

Krinkle moved this task from Current: Performance Team to Discuss next on the Sustainability (MediaWiki-MultiDC) board.Jul 2 2020, 2:48 PM

Krinkle moved this task from Discuss next to Later on the Sustainability (MediaWiki-MultiDC) board.

Krinkle edited projects, added Sustainability; removed Sustainability (MediaWiki-MultiDC).Jul 2 2020, 4:32 PM

Per today's Multi-DC meeting, I'm detaching this from the current workboard. It was our understanding that the messages here are largely and perhaps even exclusively sent from the primary DC (assuming RC events only originate from write actions and from GET requests we classify as write actions, per T91820).

And while that is not enforced, for the theoretical case of secondary emissions, those are:

not in the critical path (post-send) and thus not latency sensitive or in need of a local proxy,
naturally public and without any PII, hence not in dire need of TLS for cross-dc.

This task is still important as it would help reduce maintenance cost and overall system complexity by not having the IRC dispatch from MW anymore, and for general availability and stability of the service by being a stateless active-active proxy in both DCs based on our current Kafka pipeline.

As such, not closing and keeping on the general Sustainability board.

Legoktm added a subtask: T278255: Set up spare irc1001.wikimedia.org in eqiad.Mar 23 2021, 7:06 PM

Legoktm closed subtask T278255: Set up spare irc1001.wikimedia.org in eqiad as Resolved.Apr 13 2021, 11:30 PM

Current status: irc2001 is irc.wm.o, and irc1001 is receiving events from MediaWiki and is a hot spare that can be failed over to by adjusting the irc.wm.o CNAME (on a 5min TTL).

In T128592#5180449, @Krinkle wrote:

@Krenair What you describe sounds like a standby ready for manual failover, not active-active.

I consider a service active-active if users are routed to either of the available instances, and that upon issues with one of them, internal routing can depool one of them and users mostly don't notice anything.

As long as people connect directly by public IP to one of these two instances, I don't see that happening without some kind of proxy or other intermediary mechanism to ensure that upon failure and client reconnect you end up on the other instance?

Is it even possible for IRC to be active-active? Doesn't the client have to maintain a connection with a single server, and if that server drops, they disconnect, retry and get a connection again (maybe internally to a different server)? In that downtime though you're going to miss a few events. Unless the server remembers what your last position was (which the EventStreams protocol does!), I'm not sure how we avoid that.
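To illustrate the "server remembers your last position" point: EventStreams is served as Server-Sent Events, where each event can carry an `id:` field and a reconnecting client sends the last one it saw back in a `Last-Event-ID` header. Below is a minimal sketch of the client side only, assuming the stream has already been decoded into text lines. The function names `iter_sse_events` and `resume_headers` are made up for this example and are not part of any Wikimedia library.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (last_event_id, data) for each event in an SSE text stream.

    An event is a run of "field: value" lines terminated by a blank line.
    The id is sticky: it stays set until a later event overrides it.
    """
    event_id: Optional[str] = None
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line ends the current event; dispatch it if it has data.
            if data:
                yield event_id, "\n".join(data)
            data = []
            continue
        if line.startswith(":"):
            continue  # comment line, typically used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "id":
            event_id = value
        elif field == "data":
            data.append(value)


def resume_headers(last_event_id: Optional[str]) -> Dict[str, str]:
    """Headers to send on reconnect so the server can resume after that id."""
    if last_event_id is None:
        return {}
    return {"Last-Event-ID": last_event_id}
```

A consumer would remember the id from the most recent event it processed and pass `resume_headers(that_id)` when it reconnects. IRC has no equivalent of this, which is why a reconnecting IRC client loses whatever was sent while it was disconnected.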

Most clients/frameworks I'm aware of allow you to set fallback servers if the primary one isn't available, so we could encourage people to set the server list to: ['irc.wikimedia.org', 'irc1001.wikimedia.org', 'irc2001.wikimedia.org']. So even if irc.wm (currently irc2001) goes down, it'll cycle to the next one on the list and try it. And while specific server hostnames are hardcoded, they'll still be able to connect to irc.wm.o if/when we set up new IRC VMs.
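For clients or frameworks without built-in fallback support, the cycling behaviour could look roughly like the sketch below. It is an illustration only: `connect_with_fallback` is a hypothetical helper, and port 6667 (the conventional plain-text IRC port) is an assumption.

```python
from typing import Any, Callable, Iterable, Optional, Tuple

# Server list as suggested above. PORT is an assumption for this sketch.
SERVERS = ["irc.wikimedia.org", "irc1001.wikimedia.org", "irc2001.wikimedia.org"]
PORT = 6667


def connect_with_fallback(
    servers: Iterable[str],
    port: int,
    connect: Callable[[str, int], Any],
    max_rounds: int = 3,
) -> Tuple[str, Any]:
    """Try each server in order, cycling through the list up to max_rounds times.

    `connect` is any callable taking (host, port) that returns a connection
    object or raises OSError on failure.
    Returns (host, connection); raises ConnectionError if every attempt fails.
    """
    hosts = list(servers)
    last_error: Optional[OSError] = None
    for _ in range(max_rounds):
        for host in hosts:
            try:
                return host, connect(host, port)
            except OSError as exc:
                last_error = exc  # remember the failure and try the next host
    raise ConnectionError(f"no server reachable: {last_error}")
```

With the standard library this could be called as `connect_with_fallback(SERVERS, PORT, lambda h, p: socket.create_connection((h, p), timeout=10))`. Keeping irc.wikimedia.org first means clients still follow the CNAME whenever it is repointed.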

In T128592#6996726, @Legoktm wrote:

Is it even possible for IRC to be active-active? Doesn't the client have to maintain a connection with a single server, and if that server drops, they disconnect, retry and get a connection again (maybe internally to a different server)? In that downtime though you're going to miss a few events. Unless the server remembers what your last position was (which the EventStreams protocol does!), I'm not sure how we avoid that.

Exactly, I rebooted kraz last week to drain the last remaining connected clients and they all failed over/reconnected almost immediately to the secondary host.

I think this task can be closed; the current solution is about as good as it can get until we replace it with the new service or everything moves to EventStreams.

Ack, not missing messages != active-active. So long as reconnecting to the same hostname is expected to work within a reasonable amount of time, I guess we can close this. Requiring a public DNS change, and relying on clients not being subject to a stale cache, is not ideal; a service IP internally or some other indirection seems better, but that's an improvement for later perhaps.

Krinkle closed this task as Resolved.Apr 14 2021, 11:31 PM

Krinkle assigned this task to Legoktm.
