
Deprecation (if possible) of the #central channel on irc.wikimedia.org
Open, Needs Triage, Public

Description

Hi! I hope that this is the right tag for the goal that I have in mind; if not, apologies for the spam :)

In T240182 (and parent tasks) we are trying to move the current irc.wikimedia.org infrastructure to EventStreams/Kafka. The idea is to create a new tool that offers the same API (namely an IRC feed with channels, etc.) to bots, but backed by EventStreams/Kafka rather than MediaWiki (see T234234 for more info).

The #central feed on irc.wikimedia.org seems to be used by only a few bots, and its data may no longer be needed. For example:

09:08 @<rc-pmtpa> [[User:Ashutosh kumar1205]]@commonswiki https://commons.wikimedia.org/wiki/User:Ashutosh_kumar1205 * Ashutosh kumar1205 *
09:08 @<rc-pmtpa> [[Wikipedysta:Old Defender 1814]]@plwiki https://pl.wikipedia.org/wiki/Wikipedysta:Old_Defender_1814 * Old Defender 1814 *

Do you think that it will be possible to try to deprecate this feed?
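For anyone assessing what a consumer of this feed has to handle, here is a minimal parsing sketch. The pattern is inferred from the two sample lines above only, so it is an assumption about the format rather than a specification; the names `CENTRAL_LINE`, `AccountEvent` and `parse_central_line` are hypothetical and are not taken from SULWatcher or any other bot.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the two sample lines only (assumption): the real feed
# may contain variants, or IRC colour codes, that this does not handle.
CENTRAL_LINE = re.compile(
    r"\[\[(?P<page>[^\]]+)\]\]@(?P<wiki>\S+) "
    r"(?P<url>https?://\S+) \* (?P<user>.+?) \*\s*$"
)


class AccountEvent(NamedTuple):
    page: str  # local user page title, e.g. "User:Example"
    wiki: str  # wiki database name, e.g. "commonswiki"
    url: str   # URL of the user page
    user: str  # account name


def parse_central_line(line: str) -> Optional[AccountEvent]:
    """Return the parsed event, or None if the line does not match."""
    match = CENTRAL_LINE.search(line)
    if match is None:
        return None
    return AccountEvent(**match.groupdict())


if __name__ == "__main__":
    sample = (
        "09:08 @<rc-pmtpa> [[User:Ashutosh kumar1205]]@commonswiki "
        "https://commons.wikimedia.org/wiki/User:Ashutosh_kumar1205 "
        "* Ashutosh kumar1205 *"
    )
    print(parse_central_line(sample))
```

Using `search` rather than `match` lets the timestamp and nick prefix added by an IRC client be ignored.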

Event Timeline

Hi @elukey, and thanks for the ping. Our SUL watcher bots certainly do use this channel to report account creations to #cvn-unifications and to deal with account creation vandalism if needed. The bot code (rTSTW SULWatcher/SULWatcher.py) parses the feed from that channel and filters it. We're having difficulty maintaining our bots due to a lack of stewards with knowledge of Python. So if this channel goes, I'm not sure we'll be able to craft a replacement, sadly. Best regards.

@MarcoAurelio thanks a lot for the feedback! I am trying to find use cases for the IRC channel; we don't want to cause disruptions to important bots. Do you have more info about how the account creation report is used by consumers of #cvn-unifications? I am wondering whether the info is actively used for counter-vandalism work, since you were saying that some bots are old and may not be up to date (the info could also come from other places, etc.).

#cvn-unifications is intensively used by stewards to detect abusive user names. It's the easiest (if not only useful) source of global account creations.

As a steward, I very much rely on #cvn-unifications.
The SUL watcher bots report all new account creations there, with a link to https://meta.wikimedia.org/wiki/Special:CentralAuth/username. That's very useful to lock LTAs very quickly or at least examine suspicious accounts. Also, there are some regexes to highlight potentially unsuitable user names. And of course you could use your IRC client to set "private" pings for names/name patterns that are of special interest for you.
Some LTAs troll in a specific wiki with a specific name pattern, and when their IP/IP range is blocked locally, switch to another wiki to continue. Without #cvn-unifications it's next to impossible to follow them, because you couldn't watch ~ 950 account creation logs manually. (Okay, you could watch Special:Log/newusers @ Loginwiki, but you need to reload that page permanently ...)

Krinkle added a comment. Edited Jan 22 2020, 12:24 AM

("LTA" is a Wikimedia abbreviation for "long-term abuser", often involving multiple IPs and/or accounts.)

@elukey do you know if SULWatcher is currently the only consumer of #central? If so, I think I can help port it to eventstreams once the current Python 3 migration happens (T216020).

revi added a subscriber: revi. Jan 22 2020, 1:33 PM

For the record: I don't use #cvn-unifications that often, but I always get pinged by someone who uses it whenever the bot is down (within 12 hours, usually).

STRONG OPPOSE for now - until we migrate SULWatcher bot's system to use EventStreams/Kafka. The #cvn-unifications channel is highly valuable for us SWMT members. I second Schniggendiller's comments in T242712#5821559.

Xqt added a subscriber: Xqt. Jan 22 2020, 7:34 PM
Xqt added a comment. Jan 22 2020, 7:38 PM

Pywikibot has a powerful EventStreams handler. We can probably use it, or parts of its code, for this proposal.

until we migrate SULWatcher bot's system to use EventStreams/Kafka

... Can we do that?

until we migrate SULWatcher bot's system to use EventStreams/Kafka

... Can we do that?

@Ottomata Sure, I guess that's possible, but we have a problem: none of us (current stewards) has either the time or the knowledge to do this. Now I see @Legoktm's very kind offer at T242712#5821880 to help us, which I accept of course. The code used by SULWatcher and StewardBot was written by users who are no longer around, and we've just been able to keep it running until now.

STRONG OPPOSE for now <snip>

Comments like these aren't productive on Phabricator, thanks. Bold and all caps doesn't help make your point.

until we migrate SULWatcher bot's system to use EventStreams/Kafka

... Can we do that?

Yes. We just need there to be a public EventStreams feed to consume first :) Which I suppose is T240182: Create EventStream's equivalent to irc.wikimedia.org's #central channel.

Ah right, ok. I think we were hoping we wouldn't need the centralauth/account-add data as a stream, but it seems like we do, even if we are able to deprecate #central in irc.wikimedia.org. Hm. Once the new event stream exists, though, it'll be trivial for T234234: Port architecture of irc-recentchanges to Kafka to consume it in the same way as the other streams, so we might as well keep the #central channel around anyway.

Porting SULWatcher to EventStreams would still be nice, but I guess it won't really let us deprecate #central.

Krinkle added a comment. Edited Jan 22 2020, 10:45 PM

[…]
(Okay, you could watch Special:Log/newusers @ Loginwiki, but you need to reload that page permanently ...)

The #central feed is currently powered by CentralAuth in MediaWiki directly. The same code that is called after recent changes and log events is called by CentralAuth directly from a hook to power #central.

The mention of loginwiki is interesting here. Note that loginwiki did not exist until relatively recently; it is certainly much newer than the #central feed.

It is worth checking whether the recent changes feed for loginwiki/logging/newusers contains enough information to reproduce the essence of #central. If so, then T240182 could be implemented as a light transformation step by consuming the existing machine-readable recentchanges feed, without needing support from MediaWiki PHP and CentralAuth to provide it explicitly. If the basic information is equal or a superset, then I suppose the two feeds could be compared side-by-side for a window of say 24 hours to uncover any discrepancies (e.g. missing or unexpected items).
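The side-by-side comparison could be as simple as a set difference over the two samples. The sketch below assumes each feed has already been reduced to (user name, wiki) pairs for the same time window; the function name `compare_feeds` and the choice of key are hypothetical, and the pairs in the example are made up.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (user name, wiki id): an assumed correlation key


def compare_feeds(
    central: Iterable[Key], candidate: Iterable[Key]
) -> Dict[str, List[Key]]:
    """Report items seen in only one of the two feeds.

    "missing" holds items that #central reported but the candidate feed
    did not; "unexpected" holds the reverse.
    """
    central_set = set(central)
    candidate_set = set(candidate)
    return {
        "missing": sorted(central_set - candidate_set),
        "unexpected": sorted(candidate_set - central_set),
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    from_irc = [("ExampleA", "commonswiki"), ("ExampleB", "plwiki")]
    from_rc = [("ExampleB", "plwiki"), ("ExampleC", "enwiki")]
    print(compare_feeds(from_irc, from_rc))
```

Items near the edges of the window will show up as false discrepancies if the two samples are not cut at exactly the same moment, so those are worth checking by hand.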

@Legoktm sorry for the late reply, thanks a lot for the help!

@Krinkle thanks a lot for the info, can you add a bit more detail about:

It is worth checking whether the recent changes feed for loginwiki/logging/newusers contains enough information to reproduce the essence of #central.

Namely, what concrete things do I need to check? The EventStreams recentchanges feed for wiki: login or similar? Or other feeds, etc.? (Just to understand how to do the comparison; this test seems promising, but it should be done ASAP since kraz is running Jessie and needs to be moved to Buster soon.)

Get a sample from #central on IRC and from wiki=login.wikimedia on EventStreams, then in the latter look for log/newusers entries that correlate with the lines in #central and see if all metadata (numbers, user names, other strings, etc.) is present there.
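A rough sketch of the filtering half of that check, operating on already-decoded recentchange events. It assumes the events carry `server_name`, `type`, `log_type` and `user` fields; treat those names as assumptions to verify against a real sample. The helper names and the sample events are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional

Event = Dict[str, Any]


def loginwiki_newusers(events: Iterable[Event]) -> List[Event]:
    """Keep only log/newusers entries coming from login.wikimedia.org."""
    return [
        event
        for event in events
        if event.get("server_name") == "login.wikimedia.org"
        and event.get("type") == "log"
        and event.get("log_type") == "newusers"
    ]


def find_by_user(events: Iterable[Event], user: str) -> Optional[Event]:
    """Return the first event attributed to `user`, or None if absent."""
    for event in events:
        if event.get("user") == user:
            return event
    return None


if __name__ == "__main__":
    # Made-up events, for illustration only.
    sample = [
        {"server_name": "login.wikimedia.org", "type": "log",
         "log_type": "newusers", "user": "ExampleA"},
        {"server_name": "en.wikipedia.org", "type": "edit",
         "user": "ExampleB"},
    ]
    candidates = loginwiki_newusers(sample)
    print(find_by_user(candidates, "ExampleA"))
```

Each user name taken from a #central line can then be looked up with `find_by_user`, and the matching event inspected field by field.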

elukey added a comment. Edited Feb 6 2020, 4:12 PM

I tried a couple of things:

In both cases I don't see any event related to wiki=login, are we sure that we have it in recent changes?

Also I see some differences between the info in https://login.wikimedia.org/wiki/Special:Log and https://login.wikimedia.org/wiki/Special:RecentChanges, for example the latter doesn't show info about user creation.

Krinkle added a subscriber: Tgr. Feb 7 2020, 12:56 AM

[…] I don't see any event related to wiki=login, are we sure that we have it in recent changes?

Also I see some differences between the info in https://login.wikimedia.org/wiki/Special:Log and https://login.wikimedia.org/wiki/Special:RecentChanges, for example the latter doesn't show info about user creation.

Hm.. you're right. It is one of a few rare LogEntry objects that we insert() on (for the logging table behind Special:Log), but not publish() (for the recentchanges table behind Special:RecentChanges, and RCFeed/EventStreams).

For new account registrations from any wiki, source code calls insert() and publish().

However when visiting a (different) wiki for the first time and auto-creating/attaching a local account, we only call insert() (source code).

As an example:

  • If you create a new account on say en.wikipedia.org or www.mediawiki.org, then that is a newusers/create log entry and RCFeed event for that specific wiki.
  • Immediately after sign up, via async JS requests, the same account is also auto-created locally on login.wikimedia.org, Commons, Meta-Wiki and a handful of others. That creates newusers/autocreate log entries, but no RCFeed events. Also during the autocreation procedure, the CentralAuth extension emits a hardcoded message directly to RC2IRC #central channel (source code).
  • At any point in the weeks/months/years after this, whenever you first visit a public WMF wiki you haven't visited before, the same autocreation procedure happens (after handshaking and bouncing in the background via login.wikimedia.org).

So as-is we can't simply filter the existing recentchanges feed to get this information. Some options:

  • Chat with stewards of MediaWiki's "Authentication and Authorization" (CPT and @Tgr) and determine based on product wishes, scalability, and community/UX aspects whether it's okay to have these go to recent changes as well. This means adding a call to $logEntry->publish(). Then from there, the above idea could still work.
  • A lighter and less controversial option would be to add a call to $logEntry->publish('udp'). Passing the string 'udp' here is a confusing way of telling it to only notify RCFeeds (such as irc.wikimedia.org and EventStreams) without adding database entries to recentchanges. This means we don't have any database or UX considerations.
  • Alternatively, we could go from a hardcoded optional "IRC" plug in CentralAuth, to a hardcoded optional "EventBus" plug in CentralAuth. Some minimal conditional statement that then calls EventBus to emit the event in some form directly into some kind of centralauth related topic.

@Ottomata any opinion about the last two ideas from Timo?

Ottomata added a comment. Edited Feb 7 2020, 2:24 PM

I think any of these ideas would work. For the third option (the EventBus option), if we do that, maybe a more generic account creation stream would be more useful? Is there a hook that would allow us to fire an event every time an account is created? The #central IRC channel could then just filter and transform that stream.
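To make "filter and transform" concrete, here is a sketch that turns events from such a generic stream into lines shaped like the #central samples in the task description. No such stream exists yet, so the event fields (`page_title`, `wiki`, `page_url`, `user`, `autocreated`) are entirely hypothetical, as are the function names.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Event = Dict[str, Any]


def format_central_line(event: Event) -> str:
    """Render one event in the shape of the sample #central lines."""
    return "[[{page_title}]]@{wiki} {page_url} * {user} *".format(**event)


def to_central_lines(
    events: Iterable[Event], keep: Callable[[Event], bool]
) -> Iterator[str]:
    """Filter the stream with `keep`, then transform what remains."""
    for event in events:
        if keep(event):
            yield format_central_line(event)


if __name__ == "__main__":
    # Made-up events with a hypothetical schema, for illustration only.
    stream = [
        {"page_title": "User:ExampleA", "wiki": "commonswiki",
         "page_url": "https://commons.wikimedia.org/wiki/User:ExampleA",
         "user": "ExampleA", "autocreated": True},
        {"page_title": "User:ExampleB", "wiki": "enwiki",
         "page_url": "https://en.wikipedia.org/wiki/User:ExampleB",
         "user": "ExampleB", "autocreated": False},
    ]
    for line in to_central_lines(stream, lambda e: e["autocreated"]):
        print(line)
```

Passing the filter in as a predicate keeps the question of which account creations belong in #central separate from the formatting.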

Tgr added a comment. Feb 8 2020, 7:04 AM

Autocreation would probably be too spammy for recentchanges. (It's less annoying than wikibase / category changes in that it's a log event and so it gets collapsed into a single item per day in the modern RC interface, but not everyone uses that.) Only publishing to RCFeed sounds fine.

In general, using loginwiki autocreations seems unnecessarily fragile. They are not a necessary byproduct of account creation, we just have a job that autocreates there after registration since the antivandal people said having all accounts searchable there helps their work. If that job breaks, reporting of account creations should ideally not break.
(OTOH, I guess it would at least make it very obvious when it breaks.)

Wrt EventStreams, it's not clear to me whether that's production infrastructure or analytics infrastructure (with weaker commitments on how long it is acceptable to delay fixing it if it breaks)?

elukey added a comment. Edited Feb 8 2020, 9:14 AM

Autocreation would probably be too spammy for recentchanges. (It's less annoying than wikibase / category changes in that it's a log event and so it gets collapsed into a single item per day in the modern RC interface, but not everyone uses that.) Only publishing to RCFeed sounds fine.

In general, using loginwiki autocreations seems unnecessarily fragile. They are not a necessary byproduct of account creation, we just have a job that autocreates there after registration since the antivandal people said having all accounts searchable there helps their work. If that job breaks, reporting of account creations should ideally not break.
(OTOH, I guess it would at least make it very obvious when it breaks.)

Thanks a lot for the feedback!

Wrt EventStreams, it's not clear to me whether that's production infrastructure or analytics infrastructure (with weaker commitments on how long it is acceptable to delay fixing it if it breaks)?

EventStreams will be migrated to Kubernetes this quarter, and it is a service with shared responsibility between Analytics and SRE basically. It can be considered production in my opinion.

Quick summary for everybody about the current status:

irc.wikimedia.org runs on kraz, a Debian Jessie host that needs to be migrated to Buster very soon (by the end of quarter would be ideal). The current architecture has been revamped by Faidon (see parent tasks, especially T234234) and he wrote a Python prototype of a new irc.wikimedia.org backend that uses eventstreams rather than the UDP recent changes feed (that hopefully will be decommed at the end of this migration). In order to test this properly, we'd need to have the #central equivalent channel/feed to be populated from eventstreams, that is what we are discussing in this task. So it would be really important in my opinion that we'd reach an agreement about how to proceed, in order to unblock testing of the new stack :)

Any help is really appreciated, thanks all for the collaboration given the short notice!

@Tgr do you have any preference among the solutions proposed by Timo? If not, is there anything that you would recommend exploring or thinking about?

Ottomata added a comment. Edited Feb 10 2020, 2:18 PM

he wrote a Python prototype of a new irc.wikimedia.org backend that uses eventstreams rather than the UDP recent changes feed (that hopefully will be decommed at the end of this migration). In order to test this properly, we'd need to have the #central equivalent channel/feed to be populated from eventstreams

To be precise, the new irc.wm.org backend uses Kafka (not EventStreams), and we need the #central equivalent channel/feed to be produced to Kafka, likely via EventBus either as a new stream (sounds like this would be better) or in RCFeed (which EventBus itself uses to produce the recentchange event stream to Kafka).

https://wikitech.wikimedia.org/wiki/Event* :p

he wrote a Python prototype of a new irc.wikimedia.org backend that uses eventstreams rather than the UDP recent changes feed (that hopefully will be decommed at the end of this migration). In order to test this properly, we'd need to have the #central equivalent channel/feed to be populated from eventstreams

To be precise, the new irc.wm.org backend uses Kafka (not EventStreams), and we need the #central equivalent channel/feed to be produced to Kafka, likely via EventBus either as a new stream (sounds like this would be better) or in RCFeed (which EventBus itself uses to produce the recentchange event stream to Kafka).

https://wikitech.wikimedia.org/wiki/Event* :p

My impression from @faidon was that he wanted to use EventStreams, not Kafka directly, so the final version of the tool may change. I think that we should assume that the #central feed replacement should be available in ES, not only Kafka, but I might be wrong.

Hm. I didn't realize that. Moving to other ticket to discuss...

Ottomata moved this task from Incoming to Event Platform on the Analytics board. Feb 20 2020, 5:38 PM