
Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context
Open, Needs Triage, Public

Description

Context

It would be really useful to have predictions for a proposed revision before the revision is saved. One could use this score, for example, in AbuseFilter in combination with other heuristics to decide whether to deny an edit for a certain type of user (for example, one with 0 edits so far, or one creating a temp account), or to show a CAPTCHA, etc.

Looking at https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Multilingual_revert_risk#Model and https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/mediawiki.py?ref_type=heads, we should be able to provide this metadata in a pre-save context -- we can get the wikitext, a diff against parent revision, the texts that were added/removed/changed, etc.

It would be great to know 1) if Research / Machine-Learning-Team think this is feasible and 2) what the timeline or process might be for implementing something along those lines.

The idea would be to use this with other signals to perform an action. See this document for why preventing edits based on revert risk score alone is problematic.
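To illustrate what "use this with other signals" could mean, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `EditContext` fields, the `decide_action` function, and both thresholds are hypothetical and are not part of AbuseFilter or the revert risk API. The point it shows is that the score alone never triggers an action; it only matters when another heuristic also matches.

```python
from dataclasses import dataclass


@dataclass
class EditContext:
    """Signals available before save. All field names are hypothetical."""
    revert_risk_score: float  # probability-like score in [0, 1] from the model
    user_edit_count: int      # number of prior edits by this user
    is_temp_account: bool     # True when the edit would create a temp account


def decide_action(ctx: EditContext,
                  captcha_threshold: float = 0.7,
                  deny_threshold: float = 0.95) -> str:
    """Return "allow", "captcha", or "deny".

    The thresholds are placeholders, not tuned values.
    """
    # The score is only considered for users matching another heuristic.
    is_new_user = ctx.user_edit_count == 0 or ctx.is_temp_account
    if not is_new_user:
        return "allow"
    if ctx.revert_risk_score >= deny_threshold:
        return "deny"
    if ctx.revert_risk_score >= captcha_threshold:
        return "captcha"
    return "allow"
```

With these placeholder thresholds, an established user with a high score is still allowed, while a zero-edit user with the same score is denied.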

Related:

Proposal

  1. Add a POST endpoint to the revert risk multilingual and revert risk language agnostic services that accepts the metadata currently retrieved by querying MediaWiki with a revision ID

Consequences

  1. Clients can receive revert risk results for an edit before it is saved to MediaWiki

Event Timeline

kostajh updated the task description.

@calbon following up from T299436 (How impactful would pre-save automoderation be on edit save times?), which is basically the same request: would it be possible to have a modified version of the existing endpoint that accepts a POST with the parameters that revertrisk currently fetches by parsing a revision?

@kostajh to follow up on this - the ML team do not intend to do any work on this, and from our side the language agnostic model can be invoked in either situation.

Any further actions you'd like us to take?

[...] from our side the language agnostic model can be invoked in either situation.

To clarify, what I am asking for: my request is that we have an endpoint where we can call revert risk language agnostic and revert risk multilingual APIs using a POST request that provides the diff, or the old text/new text, etc, so that we can check a not-yet-saved edit and use the API response as a signal in whether to accept or deny the edit.

The existing API only supports GET requests with a revision ID for an already saved edit, unless I am missing something.
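As a rough sketch of what such a request body might contain, the Python snippet below assembles a JSON payload from the parent text and the proposed text. The function name `build_presave_payload` and every field name in the payload are hypothetical; the real schema would be whatever the model server ends up accepting. The diff is computed with the standard library's `difflib` purely for illustration.

```python
import difflib
import json


def build_presave_payload(lang: str, page_title: str,
                          parent_text: str, new_text: str,
                          user_edit_count: int) -> str:
    """Serialize a not-yet-saved edit as JSON. Field names are assumptions."""
    # Unified diff of the parent revision against the proposed wikitext.
    diff_lines = difflib.unified_diff(
        parent_text.splitlines(),
        new_text.splitlines(),
        fromfile="parent",
        tofile="proposed",
        lineterm="",
    )
    payload = {
        "lang": lang,
        "page_title": page_title,
        "parent_text": parent_text,
        "text": new_text,
        "diff": "\n".join(diff_lines),
        "user_edit_count": user_edit_count,
    }
    # There is deliberately no revision ID: the edit has not been saved.
    return json.dumps(payload, ensure_ascii=False)
```

A client would send this string as the body of the POST request instead of passing a revision ID.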

Thanks for the clarification @kostajh. It sounds to me like this is to be discussed with the ML team - it is more about when to invoke the revert risk model (before vs after edit), while the current functionality of the model itself is unchanged.

My two cents:

The existing API only supports GET requests with a revision ID for an already saved edit, unless I am missing something.

True, the current approach works from the revision ID: it fetches the data for that revision and for its parent revision.
If someone would like to implement a solution for unpublished revisions, I can imagine two ways:

  • MediaWiki based: Create a temporary revision, assign the existing (current) revision as its parent, and send that info to the existing API.
  • Liftwing modification: The Knowledge Integrity package separates feature extraction from model inference, so it should be possible to add a method that accepts JSON instead of a revision ID. As @XiaoXiao-WMF said, you should discuss this with the ML team. Maybe @achou could estimate how difficult that would be.

If someone would like to implement a solution for unpublished revisions, I can imagine two ways:

  • MediaWiki based: Create a temporary revision, assign the existing (current) revision as its parent, and send that info to the existing API.

That would be a lot of effort, and I don't think it's necessary given that we can send the data that knowledge_integrity repo needs.

  • Liftwing modification: The Knowledge Integrity package separates feature extraction from model inference, so it should be possible to add a method that accepts JSON instead of a revision ID. As @XiaoXiao-WMF said, you should discuss this with the ML team. Maybe @achou could estimate how difficult that would be.

Thanks, that is what I am proposing as well. @achou, how feasible do you think this is from your side? It would involve accepting a POST with all the features (https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/featureset.py?ref_type=heads) needed.

kostajh renamed this task from Explore using revertrisk language agnostic API in a pre-save context to Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context. Mar 13 2024, 1:34 PM

@achou please let us know if there is a corresponding ML board related to this task. We'd like to know if this is prioritized.

@kostajh @XiaoXiao-WMF thanks for tagging. Sorry I was unaware of the discussion here. The ML team is currently in the middle of quarterly planning. I will bring up the proposal during our planning and get back to you shortly!

Thanks, that is what I am proposing as well. @achou, how feasible do you think this is from your side? It would involve accepting a POST with all the features (https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/featureset.py?ref_type=heads) needed.

@kostajh It would be feasible if clients provide the JSON data for the properties required to construct a Revision object. In Liftwing, we could bypass the preprocessing step that retrieves a revision from the MediaWiki API and make a prediction directly (with some code changes needed in the Liftwing model server).

The definition of Revision can be found in schema.py in the knowledge integrity package, and we will then use the from_json method to construct the Revision object. You can find an example in test_schema.py.
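The bypass described above could look roughly like the Python sketch below. It does not use the real knowledge_integrity classes: `RevisionStub` is a stand-in with made-up fields, its `from_json` only mimics the idea of the package's method, and `preprocess`, `fetch_revision`, and the `revision_data` key are hypothetical names. The actual required properties are the ones defined in schema.py.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class RevisionStub:
    """Stand-in for the package's Revision; fields are illustrative only."""
    lang: str
    text: str
    parent_text: str

    @classmethod
    def from_json(cls, data: Dict[str, Any]) -> "RevisionStub":
        # Reject incomplete payloads early instead of failing in the model.
        missing = [k for k in ("lang", "text", "parent_text") if k not in data]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        return cls(data["lang"], data["text"], data["parent_text"])


def preprocess(request: Dict[str, Any],
               fetch_revision: Callable[[str, int], RevisionStub]) -> RevisionStub:
    """Build the model input from either kind of request.

    `fetch_revision` represents the existing step that queries the
    MediaWiki API; it is skipped when the client supplies the data itself.
    """
    if "revision_data" in request:  # hypothetical key for the pre-save path
        return RevisionStub.from_json(request["revision_data"])
    if "rev_id" in request and "lang" in request:
        return fetch_revision(request["lang"], request["rev_id"])
    raise ValueError("request needs either revision_data or rev_id and lang")
```

Keeping both paths behind one preprocessing function would leave the inference step unchanged, which matches the point that feature extraction and model inference are already separate.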

If you have any questions, I would be happy to discuss further.

Thanks, that is what I am proposing as well. @achou, how feasible do you think this is from your side? It would involve accepting a POST with all the features (https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/featureset.py?ref_type=heads) needed.

@kostajh It would be feasible if clients provide the JSON data for the properties required to construct a Revision object. In Liftwing, we could bypass the preprocessing step that retrieves a revision from the MediaWiki API and make a prediction directly.

Thanks @achou!

(with some code changes needed in the Liftwing model server)

Is that something that your team is planning to work on? And if so, what would the timeline be for that?

Hi @kostajh, yes, this is something we can work on this quarter. I am wondering if there's an ongoing project or product in development that needs this feature. If so, could you provide the links? Also, do you have an estimate of the expected traffic for this feature? I'm assuming it will be requested via the external endpoint, correct?