
Port architecture of irc-recentchanges to Kafka
Open, Medium, Public

Description

In T185319 the Analytics team took ownership of the next developments of IRCRecentChanges. This task is meant to track the work to be done.

Background

The irc.wikimedia.org backend service currently runs on kraz.wikimedia.org, a single host running Debian Jessie. The host runs a patched IRC daemon, offering recent-changes feeds for several wikis in separate IRC channels as the user rc-pmtpa. Example from the en.wikipedia channel:

16:45 @<rc-pmtpa> [[Wikipedia:Correct typos in one click/13]] M https://en.wikipedia.org/w/index.php?diff=927910088&oldid=927910078 * Uziel302 * (-299) .Goa in [[Shripad Hegde]] was dismissed
16:46 @<rc-pmtpa> [[Feria de Agosto]] M https://en.wikipedia.org/w/index.php?diff=927910091&oldid=927269885 * HaeB * (+1) Reverted edits by [[Special:Contribs/192.124.203.1|192.124.203.1]] ([[User talk:192.124.203.1|talk]]) to last version by
                  Crimsonfox
16:46 @<rc-pmtpa> [[Calvin Henry Fix]] !N https://en.wikipedia.org/w/index.php?oldid=927910090&rcid=1208394836 * Aboudaqn * (+28) [[WP:AES|←]]Created page with '#REDIRECT to [[Calvin Fixx]]'
16:46 @<rc-pmtpa> [[User:What Looks Like Crazy on an Ordinary Day]]  https://en.wikipedia.org/w/index.php?diff=927910089&oldid=927909390 * Claudiagclark * (+257)
16:46 @<rc-pmtpa> [[Special:Log/move]] move  * Litlok *  moved [[User talk:LiverpoolUniversityHospitals]] to [[User talk:SteveatLiverpoolUniversityHospitals]]: Automatically moved page while renaming the user
                  "[[Special:CentralAuth/LiverpoolUniversityHospitals|LiverpoolUniversityHospitals]]" to "[[Special:CentralAuth/SteveatLiverpoolUniversityHospitals|SteveatLiverpoolUniversityHospitals]]"
16:46 @<rc-pmtpa> [[Club Puebla (women)]]  https://en.wikipedia.org/w/index.php?diff=927910092&oldid=926264808 * 192.122.250.248 * (+27) /* Current squad */
16:46 @<rc-pmtpa> [[Eastern Illinois Panthers football]]  https://en.wikipedia.org/w/index.php?diff=927910093&oldid=926984512 * PWHIT66 * (+0) Updated information box

How it works in detail:

  • every MediaWiki appserver sends a pre-formatted string, representing a change like the examples above, via UDP to kraz.wikimedia.org port 9390.
  • a Python-based service, acting as the rc-pmtpa bot, reads the UDP messages and posts them to the right IRC channel.
  • the IRC server comes from the ircd-ratbox package, a patched version that we maintain.
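As a rough illustration of the first two bullets, a fire-and-forget UDP send and a matching receive loop might look like this (host and port are from the description above; the message text is illustrative, and the exact wire format, e.g. any channel prefix, would need to be checked against MediaWiki's UDP feed code):

```python
import socket

RC_HOST, RC_PORT = "kraz.wikimedia.org", 9390  # from the task description

def send_rc_line(line, host=RC_HOST, port=RC_PORT):
    """What each appserver effectively does: fire-and-forget one line over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("utf-8"), (host, port))
    finally:
        sock.close()

def receive_rc_lines(port=RC_PORT):
    """What the relay on kraz effectively does: read datagrams, yield lines."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, _addr = sock.recvfrom(65535)
        yield data.decode("utf-8", errors="replace")
```

In the real setup, the receiving side then posts each line to the appropriate IRC channel as rc-pmtpa; note that UDP gives no delivery guarantee, which is one of the drawbacks discussed later in this task.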

There are currently (Nov 2019) ~286 bots connected to irc.wikimedia.org. They were written by the community in various languages and styles, and we currently don't know exactly whether all of them are maintained, or by whom. Some bots perform important functions, most notably counter-vandalism, and losing them would be problematic (see https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org).

Proposed solutions

Write a new daemon offering an IRC interface and gathering recent changes from EventStreams. Some ideas about what the daemon should look like in T232483#5536868:

Yesterday I had a chat with @faidon about this project and this is what I gathered:

  • we currently run a patched ircd daemon on kraz (role::mw_rc_irc in site.pp) that serves irc.wikimedia.org
  • bots join channels (like #enwiki, etc..) and listen for updates from rc-pmtpa, like recent changes feed, and take actions accordingly. There are ~264 clients using irc.wikimedia.org.

Outstanding issues:

  • kraz runs Debian Jessie, so it needs to be upgraded to Stretch/Buster in the coming months, before Jessie's LTS deadline expires.
  • maintaining a patched ircd daemon is cumbersome and not scalable

Proposal from Faidon:

  • write a custom stateless daemon to run on Kubernetes based on https://gist.github.com/paravoid/3419e0b5ae1f24b6ea21906a142f2f47
  • the daemon should be stateless, with no state shared between instances
  • the daemon should offer a "sandbox" to each client/bot joining: a "private"-like IRC channel with only rc-pmtpa writing updates. This way, running the daemon on multiple pods in Kubernetes wouldn't require sharing state (like the list of connected clients, etc.)
  • the daemon should pull recent changes from EventStreams and feed them to every channel/client.

The above is not mandatory but only a suggestion about how to proceed :)
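A minimal sketch of the sandbox idea (all names and the message format here are illustrative, not taken from the actual prototype): each connection only tracks which channels it joined, the feed is the only writer, and no state is shared across connections or instances.

```python
import asyncio

class SandboxServer:
    """Each client gets a private, read-only view of the channels it joins.

    The only per-connection state is the set of joined channels, so any
    number of replicas can run side by side without coordination.
    """

    def __init__(self):
        self.clients = {}  # writer -> set of joined channel names

    async def handle_client(self, reader, writer):
        self.clients[writer] = set()
        try:
            while line := await reader.readline():
                verb, _, rest = line.decode(errors="replace").strip().partition(" ")
                if verb.upper() == "JOIN" and rest:
                    self.clients[writer].add(rest)
                # everything else is ignored: clients cannot write to channels
        finally:
            del self.clients[writer]
            writer.close()

    async def broadcast(self, channel, text):
        """Fan one feed message out to every connection joined to `channel`."""
        for writer, channels in list(self.clients.items()):
            if channel in channels:
                writer.write(f":rc-pmtpa PRIVMSG {channel} :{text}\r\n".encode())
                await writer.drain()
```

A single feed consumer per instance (UDP now, Kafka later) would call broadcast() for each incoming change.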

Event Timeline

Nuria created this task. Sep 30 2019, 3:50 PM
fdans triaged this task as Medium priority. Sep 30 2019, 3:51 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
elukey updated the task description. Nov 22 2019, 10:38 AM
elukey added a subscriber: faidon.
elukey removed a subscriber: faidon.
elukey added a comment (edited). Nov 22 2019, 10:48 AM

Today I quickly joined irc.wikimedia.org with my IRC client, and checked a couple of channels like en.wikimedia. The rc-pmtpa bot indeed writes a ton of updates in a specific format, and there seem to be ~286 clients connected (all bots listening for updates).

One example of bot is http://wikistream.wmflabs.org/, source code in https://github.com/edsu/wikistream/blob/master/public/js/app.js

All these bots are IRC clients that join irc.wikimedia.org and listen for updates on certain channels. Initially I thought it would be feasible to run a survey and reach out to people to see if anybody was willing to migrate to EventStreams, but it looks like a huge amount of work, difficult to complete before the Debian Jessie EOL deadline (end of June 2020), since it would require pushing the community to update a lot of code in various repos.

Faidon's idea (briefly outlined in the description) seems sound, but it should be discussed more broadly in the team.

On the other hand, if we were able to 1) figure out which bots among the 286 are active/used and 2) follow up with their authors to migrate to EventStreams, it would remove a complex tool from our infrastructure and reduce technical debt.

@Krinkle I am pinging you since you probably have the most context: do you think it would be feasible to migrate the bots consuming from irc.wikimedia.org during the next 6 months? Or should we just implement a new tool that replaces it?

difficult to complete before the Debian Jessie EOL deadline (end of June 2020)

JFTR, the jessie deadline is Q3. Jessie was released 25th April 2015 and LTS supports 5 years (but our internal deadline for production is end of Q3)

difficult to complete before the Debian Jessie EOL deadline (end of June 2020)

JFTR, the jessie deadline is Q3. Jessie was released 25th April 2015 and LTS supports 5 years (but our internal deadline for production is end of Q3)

Ack then, so even worse :D

Krinkle renamed this task from "Architecture of recent changes on top of kafka. Produce Design Document." to "Redesign architecture of irc-recentchanges on top of Kafka". Nov 24 2019, 3:40 AM
Krinkle added a subscriber: faidon.

@Krinkle I am pinging you since you have probably the most context: do you think that it could be something doable to migrate the bots consuming from irc.wikimedia.org during the next 6 months?

I do not think we can/should replace this more than 10 year old service within 6 months. See https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org for more info on its usage.

(Speaking in my volunteer role for a moment:) I wouldn't know where to begin to migrate the CVNBot infra, which runs on C-sharp. This is a tool I inherited maintenance and operational responsibility for, but I have very little experience with C-sharp and very little spare time to maintain it, in addition to everything else. Yet, it is a crucial part of many curation workflows used by Wikipedia editors to ensure quality of content and combat vandalism.

@Krinkle thanks for the links, I have a better picture now.

I can see three layers of technical debt in here:

  1. There are super useful bots, once written by community members, that patrol our projects daily. We don't really know all of them, or what would happen if irc.wikimedia.org went down for a day or more (for example, if the kraz host failed and it took a while to move the service elsewhere).
  2. irc.wikimedia.org is maintained as best effort; its code is old and not really portable to more recent platforms like Debian Buster.
  3. A new software component will be created, to bridge the current bot interface (IRC) with a more recent and simpler one (EventStreams).

Just to be clear, I don't mean "technical debt == bad code", just software that is complex, with a long history, and that we don't fully maintain (in the sense that it is not actively developed/improved, but kept alive as best effort).

It feels like a serious problem to be tackled, maybe not now but during this fiscal year or the next? Point 3) is probably needed and we cannot really skip it, but it will only add complexity in the long term (my 2c).

elukey updated the task description. Nov 25 2019, 3:52 PM
Ottomata added a comment (edited). Nov 25 2019, 4:28 PM

new software component will be created, to bridge the current bot interface (IRC) with a more recent and simple one (EventStreams).

+1! I don't know if using 'EventStreams' is quite right; this will likely be a separate IRC-based service backed by Kafka. But yes to the general idea!

If I were to do this, I would try to do it in NodeJS with service-runner before Python, but I don't think that is a requirement.

Ottomata added a comment (edited). Nov 25 2019, 5:42 PM

Qs:

  • Do we need to support the full IRC spec? I see in Faidon's PoC he's attempting to at least respond to them all. AFAICT we don't actually support any commands other than joining channels. I guess we want to keep track of user nicks too? Do we need to? We could just keep the list of connected clients in channels and broadcast the incoming messages. Could we just ignore things we don't support? E.g. if not JOIN or whatever else we need: respond with some 'command unsupported' message.
  • Alternatively, is there a reason we want to write our own 'fake' IRC server? Could we use an existing one and have our service just forward as an IRC client the recentchange messages from Kafka to the right channels?
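A sketch of the permissive approach from the first question above (function name and reply texts are illustrative; 421 is the standard ERR_UNKNOWNCOMMAND numeric):

```python
def handle_line(nick, line):
    """Handle one inbound IRC line, ignoring or rejecting what we don't support."""
    verb, _, params = line.strip().partition(" ")
    verb = verb.upper()
    if verb == "PING":
        return "PONG " + params          # clients expect the token echoed back
    if verb == "JOIN":
        return ":%s JOIN %s" % (nick, params)
    if verb in ("MODE", "WHOIS"):
        return None                      # too noisy to reject; silently ignore
    # standard ERR_UNKNOWNCOMMAND numeric for everything else
    return "421 %s %s :Unknown command" % (nick, verb)
```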

Qs:

  • Do we need to support the full IRC spec? I see in Faidon's PoC he's attempting to at least respond to them all. AFAICT we don't actually support any commands other than joining channels. I guess we want to keep track of user nicks too? Do we need to? We could just keep the list of connected clients in channels and broadcast the incoming messages. Could we just ignore things we don't support? E.g. if not JOIN or whatever else we need: respond with some 'command unsupported' message.

We could sample some bots' code and see what commands are needed, or possibly see if kraz offers logging of IRC commands and start from there?

  • Alternatively, is there a reason we want to write our own 'fake' IRC server? Could we use an existing one and have our service just forward as an IRC client the recentchange messages from Kafka to the right channels?

I think that the idea is to move away from a single-host service and have something more scalable and resilient to failures in Kubernetes. Let's add @faidon to the task and see if he has time to follow up :)

If I were to do this, I would try to do it in NodeJS with service-runner before Python, but I don't think that is a requirement.

We discussed this point within the team after Andrew raised it. It would probably be wise to decide who will own the service when it's ready: CPT might be a good candidate, and if they accept, then NodeJS is probably the best solution. If SRE owns it, then Python is probably the best choice.

  • Do we need to support the full IRC spec? I see in Faidon's PoC he's attempting to at least respond to them all. AFAICT we don't actually support any commands other than joining channels. I guess we want to keep track of user nicks too? Do we need to? We could just keep the list of connected clients in channels and broadcast the incoming messages. Could we just ignore things we don't support? E.g. if not JOIN or whatever else we need: respond with some 'command unsupported' message.

The PoC wasn't attempting to respond to all of the commands; there are way more. The ones being handled right now are a) the ones necessary for a client to be able to properly connect, b) some nice-to-haves for compatibility with the existing server (e.g. showing the channel topic), and c) commands that some clients were sending but we don't care about; it was basically too noisy to see them in the logs as "unsupported command" (e.g. mode, whois), so I just added a couple of def handle_...(self, params): pass.

There are a few more missing (e.g. PING), but all should be easy to add given that this is a very simple and stateless implementation. What's in the PoC now is functional with at least irssi. Ideally we'd just let this run with a debug log for a few clients like CVN, and let it log some "unknown commands" and handle (most likely: explicitly ignore) more commands commonly seen in the wild.
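The handler pattern described above can be sketched as a getattr-based dispatcher (names are illustrative, not the PoC's actual code):

```python
import logging

log = logging.getLogger("ircstream")

class Client:
    """Dispatch IRC commands to handle_<verb> methods; log unknown ones."""

    def dispatch(self, line):
        verb, _, params = line.strip().partition(" ")
        handler = getattr(self, "handle_" + verb.lower(), None)
        if handler is None:
            # running with debug logging against real clients (e.g. CVN)
            # reveals which commands still need an explicit handler
            log.debug("unknown command: %r", verb)
            return None
        return handler(params)

    def handle_ping(self, params):
        return "PONG " + params

    def handle_mode(self, params):
        pass  # explicitly ignored, just to keep the logs quiet
```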

  • Alternatively, is there a reason we want to write our own 'fake' IRC server? Could we use an existing one and have our service just forward as an IRC client the recentchange messages from Kafka to the right channels?

Which server, and how would you make that redundant/resilient? IRC servers are both complicated to run and have to maintain state (channels, topics, joins, etc.), making them hard to scale out. Furthermore, they typically don't even allow the restrictions we've made in this service to avoid running a full-fledged IRC network (e.g. no PMs) -- part of the problem here is that we've been running a custom-patched IRC server for something like a decade now... so we can't even run a single instance of a standard piece of software, let alone a cluster.

Plus, you'd need to also write and maintain a proper IRC bot anyway - so the other side of the same protocol :)

I don't feel strongly about it and don't have a horse in this race, but last time I looked at it (years ago), it was way more complicated to build a properly configured IRC server (or cluster!) + bot, than it was to build a tiny fake IRC server... The PoC is ~600 lines of code right now. Even with cleanups + more commands + Prometheus + Kafka support it would almost certainly still stay under 1000 lines of code.

Nuria added a comment. Nov 26 2019, 6:17 PM

I think we will have availability to work on porting this next quarter. Regarding ownership, Analytics is not staffed to own a critical feed like this one, which could be considered tier-1/critical. Would it be possible for the SRE team to own it once migrated?

I think conceptually this belongs together with EventStreams, as a product offering and, by extension, to the same owners and maintainers. This is just another (non-HTTP) API for streaming events, like RCStream was, and its fate and evolution should be viewed together as a whole. For example, a valid product decision -now or in the future- may be "we'll sunset this by date X, and we recommend users to migrate to Y".

So, while SRE can certainly help, I don't think this belongs with us. Perhaps we can talk offline for further follow-up?

I agree it is a similar product, but its usage seems quite different: it seems to support quite a few critical bots. EventStreams' availability and support are those of a tier-2 service; this one feels more like tier-1 in terms of availability. We are happy to provide the code to migrate it and probably co-own it together with Core Platform (cc @Pchelolo) and SRE, but given our staffing we cannot sign up for production support.

We can talk more offline as needed.

elukey updated the task description. Nov 29 2019, 5:39 PM
elukey added a comment (edited). Nov 29 2019, 5:42 PM

Today I dug a bit more into the current architecture of irc.wikimedia.org, and I have added all the info to the task description. It seems a bit crazy to me that every appserver sends a UDP message to kraz when a recent-change entry is generated; we should pull data from Kafka nowadays.

I created the following script that mimics MediaWiki's format (still WIP and supporting only the edit type, but it works):

from kafka import KafkaConsumer
import json

# Consume the recent-change topic directly from Kafka.
consumer = KafkaConsumer('eqiad.mediawiki.recentchange',
                         bootstrap_servers='kafka-jumbo1001.eqiad.wmnet:9092')

for msg in consumer:
    dec_msg = json.loads(msg.value.decode())
    # Only 'edit' events are supported for now.
    if dec_msg['type'] != 'edit':
        continue
    print(dec_msg)
    diff_link = (dec_msg['server_url'] + '/w/index.php?diff='
                 + str(dec_msg['revision']['new'])
                 + '&oldid=' + str(dec_msg['revision']['old']))
    # Byte-size delta of the edit, shown as (N) in the IRC feed.
    size_diff = int(dec_msg['length']['new']) - int(dec_msg['length']['old'])
    # Derive the IRC channel name from the wiki's domain,
    # e.g. en.wikipedia.org -> en.wikipedia
    channel = (dec_msg['meta']['domain']
               .replace('.org', '').replace('.wikimedia', '').replace('www.', ''))
    print('channel: ' + channel)
    print(dec_msg['title'] + ' ' + dec_msg['type'] + ' ' + diff_link
          + ' * ' + dec_msg['user'] + ' * (' + str(size_diff) + ') '
          + dec_msg['comment'])

It uses kafka-python, which we package and deploy on various Analytics hosts.

elukey added a comment. Dec 3 2019, 8:07 PM

Faidon improved his code in https://gist.github.com/paravoid/3419e0b5ae1f24b6ea21906a142f2f47; the Python IRC prototype is now able to do everything that should be needed.

High level plan from a chat with Faidon:

  • set up a new VM with Buster in codfw (kraz is in codfw)
  • provision the new Python ircstream to run there.
  • add iptables -j TEE on kraz towards <newbox> (and iptables -j DNAT) to mirror the traffic from kraz to the new VM (to start debugging issues, etc.)
  • create a CNAME irc-next.wikimedia.org for the new box; ask Timo to run CVN against it, and maybe a couple of other volunteers who want to test things out.
  • <fix bugs>
  • once things seem OK, switch irc-next -> irc

The above plan covers a v1, namely a replacement of what we have now (receiving UDP traffic from all the MW appservers and dispatching the messages to the right channel/bot). A v2 would then replace the UDP traffic with a Kafka client that pulls from the recentchange topic.

About ownership, it is still not clear who will maintain this tool, likely SRE?

Hm, a thought. Since each fake-IRCd server will consume directly from Kafka and then fan out the messages to connected clients, they could each use a distinct Kafka consumer group. Then, when a server is restarted, it will be able to resume consuming from where it stopped.

Although, maybe this won't work with load balancing the clients? If clients are failed over to a different fake-IRCd, they'll just be given whatever messages that fake-IRCd is currently consuming. These might be from different offsets than the messages the clients were receiving before they were failed over.

Because of this problem, maybe none of this matters? Maybe no consumer group offset tracking is needed, and each fake-IRCd should just always start from latest in Kafka when it starts up. That'd match the behavior clients are used to now.

The IRC messages don't look like a machine-readable format, but one should assume that every byte of every message is part of a stable/frozen interface.

I'd recommend starting by putting some unit tests in place that assert byte-identical output for a given input, where the input is a pair of the JSON-Kafka and UDP2IRC messages MW currently broadcasts, and the output is what UDP2IRC currently becomes (presumably unmodified, but I don't recall whether some of it ends up modified or normalised in some way by the IRCd pipe).
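A sketch of such a golden test, where format_irc_line stands in for the ported formatter and the fixture pair is illustrative rather than a real recorded message (the real tests would use captured JSON/UDP2IRC pairs, and the real feed additionally contains colour codes not shown here):

```python
def format_irc_line(event):
    # stand-in for the ported IRCColourfulRCFeedFormatter logic
    diff = "{}/w/index.php?diff={}&oldid={}".format(
        event["server_url"], event["revision"]["new"], event["revision"]["old"])
    delta = event["length"]["new"] - event["length"]["old"]
    return "[[{}]] {} * {} * ({:+d}) {}".format(
        event["title"], diff, event["user"], delta, event["comment"]).encode()

# hypothetical (JSON event, expected UDP2IRC bytes) fixture pairs
GOLDEN = [
    (
        {"server_url": "https://en.wikipedia.org", "title": "Feria de Agosto",
         "revision": {"new": 927910091, "old": 927269885},
         "length": {"new": 101, "old": 100},
         "user": "HaeB", "comment": "Reverted edits"},
        b"[[Feria de Agosto]] https://en.wikipedia.org/w/index.php"
        b"?diff=927910091&oldid=927269885 * HaeB * (+1) Reverted edits",
    ),
]

def test_golden():
    # byte-identical comparison: any drift from the frozen contract fails
    for event, expected in GOLDEN:
        assert format_irc_line(event) == expected
```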

Colour coding

This contract includes colour coding, which is currently the only unambiguous way to identify the different values embedded in the message, given that the separator characters we use are not themselves illegal in most values. The bots I know of all use the colour codes as the basis for their message parsers.

I think this will be fine because the logic for this in IRCColourfulRCFeedFormatter is quite standalone and not subject to site customisation or other variables.
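A hypothetical sketch of the colour-code-based parsing the bots do, assuming mIRC-style codes (\x03 followed by up to two digits); the actual codes and field order would have to be taken from IRCColourfulRCFeedFormatter:

```python
import re

# mIRC-style colour code: \x03 optionally followed by one or two digits
COLOUR = re.compile(r"\x03[0-9]{0,2}")

def strip_colours(line):
    """Remove colour codes, leaving the plain text of the message."""
    return COLOUR.sub("", line)

def coloured_fields(line):
    """Return the text runs delimited by colour codes (empty runs dropped)."""
    return [part for part in COLOUR.split(line) if part]
```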

Message parameters

This contract also includes the quirky hybrid of only-partially-localised messages and an alternate localisation backend compared to what MW normally uses. For example, the way we encode the source and destination page when a user has renamed an article is through the "comment" string at the end, e.g. * moved [[User talk:Foo]] to [[User talk:Bar]]. An example of this is also in the task description's sample. The parameters to this message are already encoded in the JSON messages (as log_params).

For most of MediaWiki, these messages have evolved quite a bit over the years and actually no longer match this simple string. For that reason, the localisation messages for user actions were forked: the legacy ones frozen as-is for UDP2IRC, and the new ones used elsewhere, e.g. on MW page views.

Most of IRCColourfulRCFeedFormatter is deterministic, standalone logic that could be re-implemented on top of the JSONRCFeedFormatter output we send to EventBus/EventStreams. The exception to this is the alternate "comment" formatting, which is backed by LogFormatter::getIRCActionComment(): a separate, highly variable string, not exposed in JSONRCFeedFormatter. It varies by user action and wiki language.

From what I can tell, though, this IRC-specific legacy string is actually exposed through JSONRCFeedFormatter as log_action_comment. So we could just use that, at least initially. It is generated by LogFormatter::getIRCActionText, and the example of a renamed page maps to the 1movedto2 interface message, which is localised to the current wiki and HTML-escaped (for no reason except legacy/compat).

As these are frozen, we could eventually move them out of MediaWiki and into this service, to make it more self-contained and to decouple it from MW. It consists of a simple 2D map of a few dozen user actions, and a hundred or so translations. Eternally frozen.

Thanks @Krinkle, I very much appreciate all this! I have code from a couple of weeks ago that basically implements all of this: consuming from SSE and formatting into IRC messages, by using log_action_comment. It needs some more polishing, repository creation, etc. I'll add you as code reviewer once I find some time to work on something better than a Gist; hopefully during the end-of-year holidays.

I have only one question on the above:

I think this will be fine because the logic for this in IRCColourfulRCFeedFormatter is quite standalone and not subject to site customisation or other variables.

Unless I'm misunderstanding how this works, I did notice some site customisation/variables in the code. Specifically:

  • $wgCanonicalServer . $wgScript: I've mapped that to msg["server_url"] + msg["server_script_path"] + "/index.php" (where msg = the JSON RC event), but that hardcodes /index.php, which I'm not sure is OK for all cases.
  • add_interwiki_prefix/wgLocalInterwikis and omit_bots, which I *think* are set to false on all Wikimedia wikis. Could you confirm?
  • wgUseRCPatrol and wgUseNPPatrol, but for these I've used the existence of the patrolled attribute instead.
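The first bullet's mapping can be sketched as follows; as noted, /index.php is hardcoded, which may not be correct for every wiki (field names follow the mediawiki.recentchange JSON events used earlier in this task):

```python
def script_url(msg):
    # msg["server_script_path"] is usually "/w"; appending "/index.php"
    # mirrors $wgCanonicalServer . $wgScript only when the default
    # script path is in use -- a known caveat of this mapping.
    return msg["server_url"] + msg["server_script_path"] + "/index.php"

def diff_url(msg):
    """Build the diff link the IRC feed shows for an edit event."""
    rev = msg["revision"]
    return "{}?diff={}&oldid={}".format(script_url(msg), rev["new"], rev["old"])
```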
elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. Jan 3 2020, 10:55 AM

EventStreams or Kafka!?

Hm, the proposal in this ticket isn't specific, there are some mentions of backed by 'EventStreams', and there are others about Kafka consumers.

Unless you wanted to run this thing outside of WMF production (in a CloudVPS?), I don't see a good reason to back this service by EventStreams. Doing so would just add another layer between this service and Kafka for no obvious gain...unless I'm missing something?

Krinkle added a comment (edited). Feb 11 2020, 8:20 PM

EventStreams or Kafka!?

Hm, the proposal in this ticket isn't specific, there are some mentions of backed by 'EventStreams', and there are others about Kafka consumers.

Unless you wanted to run this thing outside of WMF production (in a CloudVPS?), I don't see a good reason to back this service by EventStreams. […]

I believe all involved parties mean the same thing but are using different words. "EventStreams the service" is indeed the public HTTP-SSE service. I don't think we'd want the new IRC daemon to be implemented on top of that.

Rather, it would subscribe to the recentchange topic, with messages from MediaWiki emitted "for EventStreams". In other words, atop Kafka, as populated by MediaWiki+EventBus.

This is easily paraphrased as "building on top of EventStreams' recentchange topic", but that indeed technically means something different.

See also T242712 about irc.wikimedia.org/#central, which currently doesn't yet have an alternative in the machine-readable/EventStreams world; we're thus blocked on finding a way forward for that before we can re-implement irc.wm.o wholesale atop EventStreams' internal Kafka topic(s).

he wrote a Python prototype of a new irc.wikimedia.org backend that uses eventstreams rather than the UDP recent changes feed […]

To be precise, the new irc.wm.org backend uses Kafka (not EventStreams), […]

My impression from @faidon was that he wanted to use EventStreams, not Kafka directly, […]

Would be good to confirm whether there is an operational preference for EventStreams over Kafka, or if the choice was mainly about not using UDP. I'll also note that, given how quickly it was written, it may have used EventStreams simply because Kafka is private. It seems natural to me that it would use Kafka when actually deployed: doing so avoids mixing internal and external subscribers, or otherwise letting user-generated load on the public EventStreams cluster influence the internal IRC daemon. It also reduces levels of indirection and service dependencies, which seems preferable.

The messages use the same JSON format and require similar continuation logic (EventStreams is based on Kafka, after all), and we've got numerous Python-Kafka microservices in production to use as examples of good practices.

It seems natural to me that it would use Kafka when actually deployed

+1

First off: I have prototype code that supports UDP Echo and SSE, but not Kafka. It's not fully ready or tested yet. It has been developed over weekends/holidays etc. as a fun project -- and I can't promise I'll find spare time to add more to it right now. Someone who can commit to it -staff or volunteer- should pick it up at some point, and maybe also add Kafka in the process. We still have an open item and a pending conversation on where ownership for the service itself lies.

With all that set aside, the choice of SSE over Kafka was mainly made for two reasons:

  • Ease of local development, especially given that the piece that needed the most development work was porting the message formatter
  • The ability to ship this daemon to end users and give them the option of running it as a sidecar to their app. Their app could then connect to the daemon on localhost. That could be a low-effort migration path towards the deprecation of the IRC-based API on our end.

HTH!

I'm unconvinced any proposal here is paying down technical debt. Perhaps refinancing or doing some kind of balance transfer.

UDP --> IRC has been incredibly stable for over a decade. This is impressive and perhaps other services should be taking notes! The existing architecture predates almost all Wikimedia Foundation Inc. staff and has outlasted the entire lifecycle of services such as RCStream. It's unclear to me what's gained by changing the existing architecture.

There are mentions of scalability in this task, but those appear to be unsubstantiated given the relatively low usage of irc.wikimedia.org. Is anybody asking for an IRC cluster? Is there any need for one? Yes, it would be quite bad if irc.wikimedia.org suddenly went offline and yes, we should definitely have a way of spinning up a clone of it. However, these points alone don't necessitate making major changes as far as I can tell. What am I missing?

I'm unconvinced any proposal here is paying down technical debt. Perhaps refinancing or doing some kind of balance transfer.

UDP --> IRC has been incredibly stable for over a decade. This is impressive and perhaps other services should be taking notes! The existing architecture predates almost all Wikimedia Foundation Inc. staff and has outlasted the entire lifecycle of services such as RCStream. It's unclear to me what's gained by changing the existing architecture.

There are mentions of scalability in this task, but those appear to be unsubstantiated given the relatively low usage of irc.wikimedia.org. Is anybody asking for an IRC cluster? Is there any need for one? Yes, it would be quite bad if irc.wikimedia.org suddenly went offline and yes, we should definitely have a way of spinning up a clone of it. However, these points alone don't necessitate making major changes as far as I can tell. What am I missing?

Hi :)

irc.wikimedia.org currently has some drawbacks:

  1. all the MW app/api servers send UDP traffic to a single host (kraz) in an unreliable way. Our network has been really stable thanks to the big effort that SRE has always put into it, but nothing guarantees that UDP packets are not lost in transit (since it is not TCP). The RecentChanges feed sent to kraz could, in theory, suffer dropped packets every now and then due to some unknown issue, and we wouldn't know it.
  2. over the past years, rebooting kraz (kernel updates, regular maintenance, etc.) has been really painful; every time it had to be done (critical security patches, etc.) some bots were impacted, and the community immediately alerted us. SRE needs to do maintenance every now and then to keep a good level of overall security for hosts like kraz, which are in the production network and hold a public IP address. Currently that is not really possible without impact.
  3. related to the above, there are a lot of incredibly useful bots that are not maintained anymore (or that are maintained as best effort, rightfully, by some members of the community) and that cannot be ported to a more modern service like https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams (unknown/inherited codebases, old code, etc.). We therefore chose to keep IRC as the protocol, to avoid adding work for members of the community (porting bots to EventStreams, for example), and this is exactly what Faidon's prototype does. The interesting part is that, even though the same protocol is used, it will be possible in the future to increase the number of backends via Kubernetes or multiple VMs etc., to ease future maintenance tasks for SRE.
  4. about the low usage of irc.wikimedia.org: yes, I agree that few bots are using it (~300), but as mentioned above it would surely be a problem if we impacted them in any way, since some of them remove hours of tedious work from the community. The goal for SRE/Analytics is to make the service more robust and maintainable for the future; this is why we talk about technical debt.

Hope this adds a bit more clarity; I'm available for more questions if needed.

Summary of my understanding and next steps:

  • ownership - this project is shared between SRE/Analytics, and I think that we could keep maintaining it as a shared effort, as has been done up to now for EventStreams. I don't think we really need to define clear boundaries, but if that is needed, let's do it.
  • first milestone - Faidon's prototype, using the UDP feed from MW, deployed in place of kraz's stack, seems the first good goal to reach. It will allow us to finally move to Buster, which is what is pressing us at the moment (since the Jessie EOL deadline is not far off).
  • second milestone - sort out events missing between UDP and Kafka/EventStreams, like T242712. Doing this without rushing against the Debian Jessie deadline would surely be helpful, since it doesn't seem like something that can be quickly resolved.
  • third milestone - Kafka replacing UDP, and possibly moving the service to Kubernetes or LVS with multiple VMs, etc.

How does it sound?

  1. About the low usage of irc.wikimedia.org - yes I agree that few bots are using it (~300)

Am I going mad or isn't that actually quite a lot of bots given the context?

As for replacing the internal UDP feed with something more reliable, that sounds sensible for something we consider this important. I wonder how many random events have been dropped and never shown up in the IRC feed over the years.

  1. About the low usage of irc.wikimedia.org - yes I agree that few bots are using it (~300)

Am I going mad or isn't that actually quite a lot of bots given the context?

Not many bots, and in theory some of them might be unused or non-critical. The main problem is not their number, but the amplification factor (namely, the number of working hours) that some of them going down would have on the community. The absence of clear ownership for some of them (at least, this is my understanding after a bit of research) is another point to worry about.

As for replacing the internal UDP feed with something more reliable, that sounds sensible for something we consider this important. Wonder how many random events have been dropped and not showed up in the IRC feed over the years.

I suspect not many, since the network has always been reliable and well maintained, but of course I cannot guarantee that events weren't dropped over the years :)

I appreciate the effort being made toward maintaining capability. Since 2011, I've run the "snatch" IRC bot that sits on irc.wikimedia.org and relays to the "snitch" IRC bot on irc.freenode.net.

In late 2012 and throughout 2013, some previous iteration of the Wikimedia Foundation Inc.'s analytics team made a very similar effort to the one being described in this task and others. It was a mess and nothing ended up coming of it except people continuing to baselessly insult UDP as unreliable. Ori actually wrote https://github.com/atdt/UdpKafka after getting annoyed at the team's bloviating and posturing over its "Kraken" vaporware.

Obviously the people involved in this task are different, for example I like and trust Faidon and Platonides, who both seem to have blessed this approach (cf. T185319#3919427, &c.). And I'm certainly not going to deny anyone the joy of working on a fun side project. It's been five years since the unfounded rumors of deprecating irc.wikimedia.org in T87780 and as long as that interface continues to operate as well as it has, shrug.

Change 589166 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] rcfeed: Add 'notify_url' and 'title_url' to MachineReadableRCFeedFormatter

https://gerrit.wikimedia.org/r/589166

Krinkle renamed this task from "Redesign architecture of irc-recentchanges on top of Kafka" to "Port architecture of irc-recentchanges to Kafka". May 11 2020, 9:06 PM
Krinkle updated the task description.