
Determine technical approach for Automoderator edit revert component
Closed, Resolved · Public · Spike

Description

One of the core components of Automoderator will be the technology for reviewing edits and making reverts based on Automoderator's configuration on a given wiki. We need to decide where this software will be hosted, what technologies it will use, and how it will interface with other components.

In T345092 we had some ideas and suggestions; we'll use this ticket to come to conclusions on this specific topic.

Features
This component will be responsible for:

  • Taking a revision ID and a language-agnostic revert risk score as input
  • Filtering based on some internal criteria (e.g. project, namespace)
  • Reverting revisions which meet internal/community configuration and the revert risk score threshold
  • Providing an auditable trail of activity, including:
    • the revert risk probability check
    • the action taken (e.g. revert, notify, no-op)

Questions

  • How will we monitor new edits?
    • Should this operate on a stream of events? (scalable, with the expectation that a revision gets processed eventually) If so:
      • Not at this time, though we should leave ourselves open to the possibility in the future.
      • Where should this be hosted? (e.g. production Kubernetes service, Toolforge, Cloud VPS)
      • What technologies will we use? (e.g. Flink, ChangeProp)
    • Or should this operate in a MediaWiki extension hook implementation (e.g. RecentChange_save, RevisionFromEditComplete) like the ORES extension? (simpler design, leverages existing production MediaWiki ops support)
      • We should start with this approach as our basic implementation; we can start prototyping, easily do local testing, leverage Patch Demo, etc.
  • Do we need to coalesce Liftwing requests? (e.g. add a post-score hook to allow extensions to reuse scores without making additional requests)
    • We've been advised that this is not where bottlenecks would happen (see T349295#9347046).
    • Work is being done in T348298: Add revertrisk-language-agnostic to RecentChanges filters to add language-agnostic revert risk scores to the ORES extension, which saves the scores and makes them available upon save via the ORESRecentChangeScoreSaved hook; we can potentially take advantage of that to get our scores.
  • Where should the code that actually does the reverting go?
    • Within a MediaWiki extension; that will make localization of edit summaries and potentially revision tag labels more convenient. We can expose an API for an external stream-based tool to call if needed.
  • How do we know if we have checked revisions that have not been reverted? (e.g. logfile, "autopatrolled" flag)
    • We can define and apply revision tags for revisions that we've checked, reverted, etc., with a single revision able to carry multiple tags (see the sketch after this list).
  • How can we implement and support this with limited engineering time?
    • Moderator Tools does not have the capacity to provide ops/SRE support for this, so it will need to be deployed within services that are supported by other teams; MediaWiki is an obvious choice, but another service could be used so long as we can keep to our lane of design, software development, community collaboration, etc.
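
As a rough illustration of the tagging idea above — a sketch only, not settled design: the tag names, hook wiring, and the point at which tags get applied are all hypothetical:

```php
<?php
// Registering software-defined change tags and applying one to a revision.
namespace MediaWiki\Extension\AutoModerator;

use ChangeTags;

class TagHooks {
	private const TAGS = [ 'automoderator-checked', 'automoderator-reverted' ];

	/** ListDefinedTags hook: make the tags show up on Special:Tags. */
	public function onListDefinedTags( &$tags ) {
		$tags = array_merge( $tags, self::TAGS );
	}

	/** ChangeTagsListActive hook: mark them as actively applied by software. */
	public function onChangeTagsListActive( &$tags ) {
		$tags = array_merge( $tags, self::TAGS );
	}

	/**
	 * After scoring a revision that we decide not to revert, we could record
	 * that it was checked. A single revision can carry several tags, so a
	 * reverted revision could get both the "checked" and "reverted" tags.
	 */
	public static function markChecked( int $revId ) {
		ChangeTags::addTags( [ 'automoderator-checked' ], null, $revId );
	}
}
```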

Determination
We'll implement the revert component within the Automoderator MediaWiki extension; see the Questions section for details.

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Oct 19 2023, 12:59 PM
jsn.sherman changed the task status from Open to In Progress. · Oct 26 2023, 1:38 PM
jsn.sherman claimed this task.
jsn.sherman triaged this task as High priority.
jsn.sherman updated the task description.

@Krinkle, could you offer me any guidance or insight on exploring the performance/reliability/scalability of operating in a hook per revision vs a stream via Flink/ChangeProp? I have to admit that my understanding of the characteristics of these services as we run them is very limited.

@jsn.sherman this might be of your interest: T338792

That is, in fact, very interesting to me! If an extension looks the most promising for our use case, I wonder if y'all have considered building out a minimal hooks-only Liftwing extension that offers up the score data in its own hook. That would speak to this extension implementation question we had:

Do we need to coalesce Liftwing requests? (e.g. add a post-score hook to allow extensions to reuse scores without making additional requests)

If you had both a stream and a MediaWiki hook, you could reduce API calls within MediaWiki as well as for external tools.

This is a really interesting feature!

A couple of questions that popped to my mind as I was reading (and I apologize if this was written somewhere else that I missed, please let me know!)

These two questions:

Where should this be hosted? (eg production kube service, toolforge, cloud vps)
[...]
Or should this operate in a mediawiki extension hook implementation (eg RecentChange_save, RevisionFromEditComplete) like the ORES extension? (simpler design, leverages existing production mediawiki ops support), if so:

Those aren't really mutually exclusive; the ORES system was both a service (for the machine learning models) and an extension (mostly for the bridging with MW + UX/UI stuff, etc).

I think you might need a hybrid approach here too, if I'm reading the behavior correctly, especially when this seems to include some heavier machine learning operation.

In my mind, this might also hint at a possible answer to this:

How do we know if we have checked revisions that have not been reverted? (eg. logfile, "autopatrolled" flag)

We can talk about the ideal of whether to use an event stream or not, but there are issues not only with reverted edits, but also with suppressed ones, which are not shared in the stream. On top of that, since this is a new feature, you could consider working in an iterative manner and start with the "path of least resistance" for MVP/testing/iterating, while keeping the architecture of the software able (as much as possible) to pivot later, by working with abstraction layers inside the product that separate the "layers of behavior" (front end, business logic, validation, storage), so that if in the future you need to scale up it won't require a full rewrite.

From the very brief read, I think a hybrid approach is probably what will be best here -- the machine learning part as a service (you'll probably need to do that anyway), and start with an extension that has layered abstractions for the inner workings (hooks on edit [or whatever else makes sense; the specific implementation can vary]), storage, and perhaps an API module for others (similar to ORES), etc.

I would just caution to try and do these with some abstraction layers that allow you to later pivot if you need to, and that will also allow you to be a bit more agile in testing behavior and figuring out what types of products would need to read the output and what behavior you need to implement on top. If you then need to expand this, then the pieces are more or less aligned and it will be less risky to iterate and pivot.

I would also look into both the successes and challenges that ORES faced, since this is somewhat similar, and we already have something we can learn from.

Thank you for taking a look and offering feedback!

This is a really interesting feature!

A couple of questions that popped to my mind as I was reading (and I apologize if this was written somewhere else that I missed, please let me know!)

These two questions:

Where should this be hosted? (eg production kube service, toolforge, cloud vps)
[...]
Or should this operate in a mediawiki extension hook implementation (eg RecentChange_save, RevisionFromEditComplete) like the ORES extension? (simpler design, leverages existing production mediawiki ops support), if so:

Those aren't really mutually exclusive; the ORES system was both a service (for the machine learning models) and an extension (mostly for the bridging with MW + UX/UI stuff, etc).

Based on what we've discovered so far in T345092: Determine high-level technical approach for Automoderator, we'll want to house user-facing interactions in an extension: notifications, false positive reporting, and possibly configuration.

I think you might need a hybrid approach here too, if I'm reading the behavior correctly, especially when this seems to include some heavier machine learning operation.

In my mind, this might also hint at a possible answer to this:

How do we know if we have checked revisions that have not been reverted? (eg. logfile, "autopatrolled" flag)

We can talk about the ideal of whether to use an event stream or not, but there are issues not only with reverted edits, but also with suppressed ones, which are not shared in the stream. On top of that, since this is a new feature, you could consider working in an iterative manner and start with the "path of least resistance" for MVP/testing/iterating, while keeping the architecture of the software able (as much as possible) to pivot later, by working with abstraction layers inside the product that separate the "layers of behavior" (front end, business logic, validation, storage), so that if in the future you need to scale up it won't require a full rewrite.

I'm really glad you said something about suppressed edits; that was not on my radar. I will chew on your abstraction advice; I've just been thinking in terms of separation of concerns so far.

From the very brief read, I think a hybrid approach is probably what will be best here -- the machine learning part as a service (you'll probably need to do that anyway), and start with an extension that has layered abstractions for the inner workings (hooks on edit [or whatever else makes sense; the specific implementation can vary]), storage, and perhaps an API module for others (similar to ORES), etc.

We only have two engineers in Moderator Tools, so we definitely need to minimize the operational footprint that we are responsible for. Starting with everything besides Liftwing in an extension is looking appealing at the moment.

I would just caution to try and do these with some abstraction layers that allow you to later pivot if you need to, and that will also allow you to be a bit more agile in testing behavior and figuring out what types of products would need to read the output and what behavior you need to implement on top. If you then need to expand this, then the pieces are more or less aligned and it will be less risky to iterate and pivot.

^ this. This is going to be our first new bit of software since we formed last year, and we have spent a lot of time up until this project extending existing things; some of it has been pretty frustrating due to tight coupling. I would like future us to not be mad at present us.

I would also look into both the successes and challenges that ORES faced, since this is somewhat similar, and we already have something we can learn from.

I think you've just pointed out what my next iteration on this spike will be about. I suspect doing this right will involve digging through past work as well as talking to people. I'm very open to suggestions about where to start / who to talk to.

I think you'll need an extension to do the actual reverting - the standard options exposed by MediaWiki (edit, undo, revert) just aren't very appealing from a usability POV. You'll probably want the edit summaries to be localizable by the community (and multilingual on multilingual wikis); you'll want to set a custom change tag. You might want the revert to be more like action=revert than action=edit (i.e. not go through the usual edit filters - say a URL in the article gets blacklisted and then a vandal blanks the article, you don't want the revert to be prevented by SpamBlacklist) without using actual action=revert, which has a very uninformative edit summary.

(This doesn't answer the harder question of what should be triggering the revert - the extension could have its own revert API, called by an external component, or it could be doing everything.)
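
To make this concrete, here is a rough sketch of what an extension-side revert-as-edit could look like. The message key, tag name, and how we obtain the last good content are placeholders of mine, not settled design:

```php
<?php
// Saving a revert as a normal edit, with a localizable summary and a custom tag.
use MediaWiki\MediaWikiServices;
use MediaWiki\Revision\SlotRecord;

function autoModeratorRevert( Title $title, Content $lastGoodContent, User $botUser ) {
	$page = MediaWikiServices::getInstance()->getWikiPageFactory()->newFromTitle( $title );

	$updater = $page->newPageUpdater( $botUser );
	// Restore the content of the last known-good revision.
	$updater->setContent( SlotRecord::MAIN, $lastGoodContent );
	// Tag the edit so it is auditable and filterable ("automoderator-revert" is a placeholder).
	$updater->addTag( 'automoderator-revert' );

	// The summary comes from an i18n message, so communities can localize it
	// (e.g. via translatewiki.net); rendered in the wiki's content language.
	$summary = CommentStoreComment::newUnsavedComment(
		wfMessage( 'automoderator-revert-summary' )->inContentLanguage()->text()
	);
	$updater->saveRevision( $summary, EDIT_UPDATE );
}
```

Saving directly through PageUpdater, rather than going through the edit API, would also likely sidestep EditPage-level filters such as SpamBlacklist, which speaks to the point above about not wanting the revert itself blocked.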

I think you'll need an extension to do the actual reverting - the standard options exposed by MediaWiki (edit, undo, revert) just aren't very appealing from a usability POV.

You'll probably want the edit summaries to be localizable by the community (and multilingual on multilingual wikis);

Definitely; it looks like existing external tools are doing rollbacks via the API and setting summaries there. Are these edits instead of reverts? If we need to create an API in our extension, we'll want to mimic that so that the messages can be localized by whatever codebase is calling for the revert. That way a translatewiki integration (whether in an extension or another tool) could take care of that workflow.

you'll want to set a custom change tag. You might want the revert to be more like action=revert than action=edit (i.e. not go through the usual edit filters - say a URL in the article gets blacklisted and then a vandal blanks the article, you don't want the revert to be prevented by SpamBlacklist) without using actual action=revert, which has a very uninformative edit summary.

I have learning to do here, but I think I see what you mean. Browsing through core, I realize that I'm not sure about the language around reverts. There are things that use the word revert in the messages, but it looks like they might actually be edits. It looks like the revert action is currently used for images only; do I have that right?

(This doesn't answer the harder question of what should be triggering the revert - the extension could have its own revert API, called by an external component, or it could be doing everything.)

It doesn't, but I really appreciate your perspective and additional context on this. I can see that I'm still in the "I don't know what I don't know" stage here. Thank you!

Sorry, I also have a hard time remembering the correct vocabulary for reverts. I meant action=rollback (there is no action=revert) - revert is the general concept of reversing some changes that were made to a page, which can be done via rollback, undo and normal edit (and restore / mcrundo but those aren't used for wikitext-based pages), which are actions / API endpoints. Rollback is very streamlined but very rigid, more useful for humans than tools. Undo is functionally mostly equivalent to editing, it just pre-generates the text (and so also meant to make things easier for humans). Most tools probably use the edit API, but even that is somewhat limited compared to what an extension can do (e.g. by using autocomments or change tags).

(There is a mediawiki.org page with more information on the various revert methods.)
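
For reference, a hedged sketch of what the two Action API flavours look like from a tool's point of view (titles, revision IDs, the tag name, and token values are placeholders):

```php
<?php
// Tokens would come from action=query&meta=tokens (values here are placeholders).
$csrfToken = 'CSRF-TOKEN-PLACEHOLDER';
$rollbackToken = 'ROLLBACK-TOKEN-PLACEHOLDER';

// action=rollback: streamlined but rigid; reverts the latest consecutive edits
// by one user, and allows a custom summary.
$rollbackParams = [
	'action' => 'rollback',
	'title' => 'Example_page',
	'user' => 'SomeVandal',
	'summary' => 'Reverting likely vandalism',
	'token' => $rollbackToken,
];

// action=edit with undo/undoafter: functionally an ordinary edit, but more
// flexible, e.g. it can apply registered change tags.
$undoParams = [
	'action' => 'edit',
	'title' => 'Example_page',
	'undo' => 123460,                     // newest revision to undo
	'undoafter' => 123455,                // undo everything after this revision, up to 'undo'
	'summary' => 'Reverting likely vandalism',
	'tags' => 'automoderator-revert',     // the tag must already be defined and applicable
	'token' => $csrfToken,
];
```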

Update: I'm experimenting with adding the language-agnostic revert risk requests to the ORES extension itself; if that looks workable, that would let us keep our scope narrowed down to the actual reverts. The key here will be thinking through ways in which those scores could come from another source (e.g. something driven by an event stream) so that we have a path for a different implementation in the future. I was previously considering the revert action itself as needing to be implemented within the component that was requesting the scores.

Update: I'm experimenting with adding the language-agnostic revert risk requests to the ORES extension itself; if that looks workable, that would let us keep our scope narrowed down to the actual reverts. The key here will be thinking through ways in which those scores could come from another source (e.g. something driven by an event stream) so that we have a path for a different implementation in the future. I was previously considering the revert action itself as needing to be implemented within the component that was requesting the scores.

Some things that stick out about the ORES extension so far:

  • it looks like it is accessing the Liftwing models via an inference API that is structured differently from the public Liftwing APIs
  • it is very much oriented toward recent changes rather than revision creation

I don't totally understand all of the implications yet, but I'm learning.
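
For comparison, this is roughly what a call to the public Lift Wing revertrisk-language-agnostic endpoint looks like from MediaWiki. This is only a sketch; the response-shape handling is my assumption and would need verifying against the model server:

```php
<?php
// Fetch a revert risk probability from the public Lift Wing endpoint.
use MediaWiki\MediaWikiServices;

function fetchRevertRiskScore( int $revId, string $lang ): ?float {
	$url = 'https://api.wikimedia.org/service/lw/inference/v1/models/' .
		'revertrisk-language-agnostic:predict';

	$request = MediaWikiServices::getInstance()->getHttpRequestFactory()->create( $url, [
		'method' => 'POST',
		'postData' => json_encode( [ 'rev_id' => $revId, 'lang' => $lang ] ),
	], __METHOD__ );
	$request->setHeader( 'Content-Type', 'application/json' );

	$status = $request->execute();
	if ( !$status->isOK() ) {
		return null;
	}
	$data = json_decode( $request->getContent(), true );
	// Response shape assumed: output.probabilities.true holds the revert risk probability.
	return $data['output']['probabilities']['true'] ?? null;
}
```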

(There is a mediawiki.org page with more information on the various revert methods.)

helpful! thanks!

Others have already commented on the integration benefits, user expectations, and operational cost/complexity trade-offs. I'll mention that I agree with Moriel and Gergő. I think an extension yields product results quicker, and costs less both in human effort and operational cost.

Some related benefits:

  • Ability for volunteers and other staff to understand and contribute (including for SREs to investigate during an incident)
  • Getting a local dev environment started for new staff
  • Coordinating changes to deploy, and ease of keeping/breaking backward-forward compatibility
  • CI test coverage (i.e. how much do you have to mock, and how much value a passing test carries for the overall product; less value if your own business logic is split across another service)

These values inherently get lowered when you split it up. That doesn't mean it's never worthwhile, but it's part of the long-term equation. I generally see these values as more valuable earlier in development, and less so as a product nears maturity after a few years. And of course, not every new feature makes it there (cost savings, big picture).

@Krinkle, could you offer me any guidance or insight on exploring the performance/reliability/scalability of operating in a hook per revision vs a stream via Flink/ChangeProp? […]

Before I look at performance, I like to quantify the magnitude. I.e. if we say something is "better" or "cheaper", are we talking about significant benefits (in cost or UX) today, or soon, or in a hypothetical future? For example, if the performance of two approaches is indistinguishable at a small scale, then improving it might not be worthwhile. In other words: does it "matter"?

For this feature, I believe the answer is Yes. Automoderator is meant to operate on nearly every edit on a given wiki. Apart from page views, that basically puts you in the highest scale category. It is not bound by organic adoption in a way that might stay "small". For comparison, a feature like "Thanks" is opt-in, and we didn't know how often people would want or need to use it (and how often is "too often" for the receiver). Likewise, "revision delete" and "check user" are used orders of magnitude less than "edit" (and that's a good thing!).

In terms of reliability and scalability, the JobQueue can handle this easily. You'd be in good company with other extensions in production that have reacted to edits in the same way over the past decade. Some things to keep in mind:

  • You might want to artificially ramp up traffic to reduce operational risk (X% of edits, and deploy to 1 wiki at a time). This would be a temporary measure, but it helps increase how quickly one could safely start impacting larger wikis. You may still favour smaller wikis first for social reasons, but from a technical perspective you could deploy on large wikis fairly quickly if sampling is applied at first.
  • We typically apply a "fast" check in the hook, before queueing the Job. This is where you'd return early based on aspects you can statically determine with negligible overhead to the request. For example, limiting to certain users (i.e. ignore bots), namespaces, content models (wikitext), and the (temporary) random ramp-up percentage. (A sketch of this pattern follows this list.)
  • The JobRunner lets you control overall concurrency as an additional safety measure to control unexpected infrastructure cost and social impact. For example, you could start with a concurrency of 1, so the code takes events one at a time and provides breathing room to spread out peaks. I expect that until you reach double digits on one of the largest wikis (Wikidata, Commons, enwiki), that will probably suffice for a good while. (Based on RecentChanges stats and Grafana: JobQueue Job: ORESFetchJob.)
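
A minimal sketch of that pattern; the job type name, namespace filter, and sampling percentage are placeholders, not the real implementation:

```php
<?php
// RecentChange_save handler: cheap, static checks only, then defer to the job queue.
namespace MediaWiki\Extension\AutoModerator;

use MediaWiki\MediaWikiServices;

class Hooks {
	public function onRecentChange_save( $recentChange ) {
		// Ignore bots and anything that isn't an edit.
		if ( $recentChange->getAttribute( 'rc_bot' ) ||
			(int)$recentChange->getAttribute( 'rc_type' ) !== RC_EDIT
		) {
			return;
		}
		// Illustrative namespace filter: mainspace only.
		if ( (int)$recentChange->getAttribute( 'rc_namespace' ) !== NS_MAIN ) {
			return;
		}
		// Temporary ramp-up sampling: only queue a percentage of edits at first.
		$samplePercentage = 10;
		if ( mt_rand( 1, 100 ) > $samplePercentage ) {
			return;
		}
		// The expensive parts (Liftwing request, revert decision) run in the job,
		// where JobRunner concurrency settings throttle overall throughput.
		MediaWikiServices::getInstance()->getJobQueueGroup()->push(
			new \JobSpecification( 'AutoModeratorCheck', [
				'revid' => $recentChange->getAttribute( 'rc_this_oldid' ),
			] )
		);
	}
}
```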

Do we need to coalesce Liftwing requests? (e.g. add a post-score hook to allow extensions to reuse scores without making additional requests)

I wouldn't worry about coalescing. It isn't where the bottleneck should be. I expect LiftWing to respond fast with information it has recently computed. Keeping the jobs independent is easier to reason about, and is important for idempotence (automatic retry) as well.

In addition, if you happen to overlap with the first consumer (i.e. with the ORES extension's FetchOREsJob that pre-caches the scores), I expect Liftwing to internally coalesce those as well. But if you wanted to specifically start your job after that one, you could switch from onRecentChange_save() to onORESRecentChangeScoreSaved(). In that case, you'd use the ORES job only to discover new revision IDs, and to decide whether to queue your own job.
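
In sketch form, that variant might look like the following; the hook's parameter shape and the score key are assumptions that would need checking against the ORES extension:

```php
<?php
// Queue our own job only once ORES has saved a score for the revision.
namespace MediaWiki\Extension\AutoModerator;

use MediaWiki\MediaWikiServices;

class OresIntegrationHooks {
	/** ORESRecentChangeScoreSaved handler; the exact signature is assumed here. */
	public function onORESRecentChangeScoreSaved( $revision, $scores ) {
		// Hypothetical lookup: how the revert risk score is keyed in $scores
		// would need to be confirmed against the ORES extension's data model.
		$revertRisk = $scores['revertrisklanguageagnostic'] ?? null;
		if ( $revertRisk === null || $revertRisk < 0.9 ) {
			// Below a (configurable) threshold: nothing to do.
			return;
		}
		MediaWikiServices::getInstance()->getJobQueueGroup()->push(
			new \JobSpecification( 'AutoModeratorRevert', [
				'revid' => $revision->getId(),
				'score' => $revertRisk,
			] )
		);
	}
}
```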

@Krinkle thank you for the feedback and advice!