
IRC RecentChanges feed: code stewardship request
Closed, Resolved · Public

Description

  • Current maintainer:

Operations (barely)

  • Number, severity, and age of known and confirmed security issues

None that we know of. The third-party software in use is old and unmaintained, though, and may well harbor unknown vulnerabilities.

  • Was it a cause of production outages or incidents? List them.

The current design is a single daemon running on a single server in one datacenter. Perhaps surprisingly, it hasn't been part of a large unscheduled outage in the last few years, but even trivial maintenance tasks like a server reboot cause short outages and require planning long in advance (for example, a scheduled reboot for Feb 22nd, 2018 was announced on Jan 18th, 2018).

  • Does it have sufficient hardware resources for now and the near future (to take into account expected usage growth)?

Sort of. Resources for serving users are ample, and the service isn't growing. However, it operates without any redundancy against either individual hardware failure or datacenter failure.

  • Is it a frequent cause of monitoring alerts that need action, and are they addressed timely and appropriately?

No.

  • When it was first deployed to Wikimedia production

Many, many years ago (May 2005).

  • Usage statistics based on audience(s) served

Serving almost exclusively bots. The IRC server's own statistics say: "Current global users 288, max 540".

  • Changes committed in last 1, 3, 6, and 12 months

None.

  • Reliance on outdated platforms (e.g. operating systems)

Relies on a custom, patched version of an IRC server (ircd-ratbox, found in operations/debs/ircd-ratbox) that we have to forward-port on every operating system upgrade. See T134271 for an old task that calls for its replacement.

  • Number of developers who committed code in the last 1, 3, 6, and 12 months

Zero.

  • Number and age of open patches

Probably zero.

  • Number and age of open bugs

Six, from 2014-2016.

  • Number of known dependencies?

None internally. Dozens of external bots depend on it, however, for tasks like counter-vandalism. Rumour has it that the projects absolutely depend on it for their operation, and that downtime of the service takes a real toll on editors.

  • Is there a replacement/alternative for the feature? Is there a plan for a replacement?

The original replacement was RCStream, which was eventually replaced by https://wikitech.wikimedia.org/wiki/EventStreams. EventStreams is recommended for new projects, and while it has been a successful, well-maintained alternative, it requires porting those bots over to a new protocol (from IRC to HTTP SSE), and these are controlled by volunteers at best (some may even be abandoned). Several people (including @ori, the author of the original replacement, RCStream) feel that fully deprecating the IRC feed is unrealistic and that we should keep and expand it.
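To make the porting cost concrete, below is a minimal sketch of what consuming the public EventStreams recent-change feed over HTTP SSE looks like for a bot. It uses the third-party `requests` library; the stream URL is the real public endpoint, but the parsing here is a simplified take on the SSE wire format (it ignores `event:`/`id:` fields and does not resume after disconnects).

```python
import json

# Public EventStreams endpoint for the recent-changes stream.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"


def parse_sse_data(line):
    """Return the JSON payload of an SSE 'data:' line, or None for
    comments, 'event:'/'id:' fields, and keep-alive blank lines."""
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())


def follow_recent_changes():
    """Yield recent-change events as dicts. Reconnection and resuming
    (via the Last-Event-ID header) are left out for brevity."""
    import requests  # third-party; pip install requests
    with requests.get(STREAM_URL, stream=True, timeout=60) as resp:
        for raw in resp.iter_lines(decode_unicode=True):
            event = parse_sse_data(raw or "")
            if event is not None:
                yield event


# Usage: for change in follow_recent_changes(): print(change["title"])
```

Even in this stripped-down form, the difference from "connect to IRC, read lines" is visible, which is part of why volunteer-run bots are slow to migrate.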

As an alternative, @faidon proposed building an API-compatible replacement instead, one that would consume from Kafka but be designed with scalability in mind. He provided a rudimentary proof-of-concept on his GitHub, but estimates it will require more of an investment to productionize. @Ottomata and the Analytics team are potentially interested in doing so and in maintaining it going forward.
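The shape of such a Kafka-backed replacement can be sketched as follows. This is not @faidon's actual proof-of-concept: the topic name is an assumption, the consumer uses the third-party `kafka-python` library, and `format_rc_line` is only a plain-text approximation of the feed's message style (the real feed wraps fields in IRC colour codes).

```python
import json


def format_rc_line(event):
    """Render a mediawiki.recentchange event roughly in the style of
    the irc.wikimedia.org feed (the real feed adds IRC colour codes)."""
    title = event.get("title", "")
    user = event.get("user", "")
    comment = event.get("comment", "")
    url = event.get("meta", {}).get("uri", "")
    old = event.get("length", {}).get("old") or 0
    new = event.get("length", {}).get("new") or 0
    return "[[%s]] %s * %s * (%+d) %s" % (title, url, user, new - old, comment)


def run(brokers):
    """Consume recent-change events from Kafka and emit feed lines."""
    from kafka import KafkaConsumer  # third-party; pip install kafka-python
    consumer = KafkaConsumer(
        "eqiad.mediawiki.recentchange",  # topic name is an assumption
        bootstrap_servers=brokers,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(format_rc_line(message.value))
```

Because every consumer instance reads the same Kafka topic independently, such a daemon can be run per datacenter (or several per datacenter) without the servers needing to know about each other, which is the scalability property the single ircd lacks.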

Event Timeline

faidon created this task.Jan 19 2018, 3:44 PM
Restricted Application added a subscriber: Aklapper. Jan 19 2018, 3:44 PM
faidon added a subscriber: MoritzMuehlenhoff.
bd808 added a subscriber: bd808.
Rxy added a subscriber: Rxy.Jan 19 2018, 7:47 PM
revi added a subscriber: revi.Jan 20 2018, 7:01 PM
Volans added a subscriber: Volans.Jan 21 2018, 7:16 PM
Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.Jan 22 2018, 4:47 PM
greg triaged this task as Normal priority.Jan 22 2018, 11:35 PM

However, it's operating without any redundancy, in terms of both individual hardware failure, and datacenter failure.

This seems like something that could easily be fixed by placing an IRC server in each datacenter and linking them the old way.

Making edits from any datacenter propagate everywhere, all using the same nickname, would be a bit harder, but probably doable with little effort too.

Platonides updated the task description. Jan 23 2018, 12:24 AM

However, it's operating without any redundancy, in terms of both individual hardware failure, and datacenter failure.

This seems like something that could easily be fixed by placing an IRC server in each datacenter and linking them the old way.
Making edits from any datacenter propagate everywhere, all using the same nickname, would be a bit harder, but probably doable with little effort too.

Well, sort of.

  • First off, we'd need multiple servers per datacenter, not just one in each (so we're now talking about a 4-8 server network at minimum).
  • Second, reboots or failures of those individual servers would still be service-impacting, as users would get disconnected and need to reconnect, possibly losing messages in the meantime (a problem that's hard to solve with IRC in general, though).
  • Third, this still doesn't address running multiple bots (even one per datacenter). I doubt multiple bots emitting the same changes under different nicknames would work with our current user base (the bot still has the nickname "rc-pmtpa", after all...). Patching ircd-ratbox further to support this isn't a very realistic option either.
  • Fourth, it doesn't address the fact that our server is a fork of an old piece of software.

Of all of our options, I think operating an IRC network, with all of its complexity, is one of the poorest. It may be doable with a lot of duct tape, but it would certainly not be my preferred one.

I was just trying to address the stated concern, not offer a perfect solution (we will probably still have irc.wikimedia.org for some time, so it would be good for things to be better). Also, I am assuming bots will reconnect and rejoin with near-zero downtime, trying several A records as needed.
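The reconnect behaviour assumed here, trying every A record in turn until one accepts the connection, is a short client-side loop. A hedged sketch (hostname and port are illustrative; a real bot would wrap this in retry-with-backoff):

```python
import socket


def connect_any(host, port, timeout=5.0):
    """Try every address the name resolves to, in order, and return
    the first TCP connection that succeeds; raise if all of them fail."""
    last_error = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock
        except OSError as exc:
            last_error = exc
            sock.close()
    raise last_error or OSError("no addresses for %s" % host)


# Usage: sock = connect_any("irc.wikimedia.org", 6667)
```

Whether the existing volunteer-run bots actually behave this way is an open question; many likely resolve once and cache a single address.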

That said:

  1. Multiple servers per datacenter would probably not be needed: the secondary IRC servers in each datacenter would be basically idle, waiting in case they needed to replace the main one. They could handle connections, but a single server seems to handle that load perfectly well. (Unless you planned to put several IRC servers behind an HA floating IP or similar, but that seems like too much effort and complexity for a service in this state.) The servers in the other datacenters could act as a backup if the server somehow failed, while the issue was investigated or a new one procured (quite hypothetical, considering that kraz runs on Ganeti). Not that it would be hard to add extra VMs, anyway.
  2. Per the above assumption, lost messages would be minimal and implicitly accepted.
  3. I wasn't proposing multiple bots; I was thinking that each datacenter would output (with the same nick) the changes performed in its own datacenter (they could easily use different nicks, but I suspect bots may be checking the sending nick). If a server failed, one would simply need to update $wgRC2UDPAddress to point at the other IRC server (itself running udpmxircecho.py), which would take over its job.

Of all of our options, I think operating an IRC network, with all of its complexity, is one of the poorest. It may be doable with a lot of duct tape, but it would certainly not be my preferred one.

The long-term solution is probably a daemon that translates the output of EventStreams into IRC.
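The server side of such a bridge daemon could be a small asyncio service that speaks just enough IRC for feed readers and broadcasts whatever the EventStreams consumer hands it. This is a toy sketch, not an RFC 1459-complete server; the nick `rc-bot` and server name `bridge` are illustrative.

```python
import asyncio

clients = set()  # StreamWriters for currently connected IRC clients


async def handle_client(reader, writer):
    """Speak just enough IRC for a feed reader: acknowledge NICK,
    answer PING; everything the client needs arrives as PRIVMSG."""
    clients.add(writer)
    try:
        while data := await reader.readline():
            parts = data.decode("utf-8", "replace").split()
            if not parts:
                continue
            if parts[0] == "NICK" and len(parts) > 1:
                writer.write((":bridge 001 %s :Welcome\r\n" % parts[1]).encode())
            elif parts[0] == "PING":
                writer.write(b":bridge PONG bridge\r\n")
            await writer.drain()
    finally:
        clients.discard(writer)
        writer.close()


def broadcast(channel, text):
    """Send one feed line to every connected client."""
    line = (":rc-bot PRIVMSG %s :%s\r\n" % (channel, text)).encode()
    for writer in clients:
        writer.write(line)


# The other half of the daemon would consume EventStreams and call
# broadcast("#en.wikipedia", formatted_line) for each event;
# asyncio.start_server(handle_client, "::", 6667) runs the listener.
```

Because the state is just a set of writers, any number of these bridges can run independently behind DNS, sidestepping the server-linking complexity discussed above.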

Krenair added a subscriber: Krenair. (Edited) Jan 23 2018, 2:01 AM

Pretty sure MW has supported multiple destinations for these streams for years now. So you could have multiple servers (not linked in an IRC network) each receiving the changes from MW and available for clients to connect to. The same set of changes (the ones that introduced RCFeeds) would have been the ones that deprecated and eventually killed $wgRC2UDPAddress.

And then when you want to reboot the main one (i.e. the one at the primary irc.wikimedia.org address), you just swap the DNS records, wait for the TTL to expire, and kill connections on the box being rebooted. Clients re-connecting should hit the new active box.
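Each such standalone server would run a receiver like `udpmxircecho.py`: listen for MediaWiki's UDP feed and echo the lines into its local IRC daemon. A hedged sketch of that component; the channel<TAB>message payload layout and the port number are assumptions here, so check the real `udpmxircecho.py` for the actual wire format.

```python
import socket


def split_payload(payload):
    """Split one UDP datagram into (channel, message). The tab-separated
    layout is an assumption, not the verified wire format."""
    channel, _, text = payload.decode("utf-8", "replace").partition("\t")
    return channel, text


def serve(bind_addr=("0.0.0.0", 9390)):
    """Receive RC lines from MediaWiki over UDP and hand them on.
    A real receiver would echo each line into the local IRC daemon."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(bind_addr)
    while True:
        payload, _ = sock.recvfrom(65535)
        channel, text = split_payload(payload)
        print(channel, text)
```

Since UDP delivery is fire-and-forget, MediaWiki can emit to several such receivers at once without caring whether any particular one is up, which is what makes the multiple-standalone-servers design simple.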

Nuria added a comment.Apr 3 2018, 4:11 PM

Analytics agrees to be stewards of this service once it is migrated to sit on top of a Kafka stream, cc @greg

greg added a comment.Apr 3 2018, 4:45 PM

once it is migrated to sit on top of a Kafka stream

Great! But who is on point for doing that migration?

Nuria added a comment.Apr 3 2018, 4:48 PM

One of the analytics engineers.

One of the analytics engineers.

Which analytics engineer, and is there a task?

elukey added a subscriber: elukey.

There is now a task to track the work: T232483