
Integration of Revert Risk Scores to Recent Changes as a filter
Closed, Duplicate · Public

Description

The Revert Risk model (T314385) is already running on Lift Wing. The ML team is working on creating a new stream with its results (T326179). It would be very useful to integrate these results into Recent Changes in MediaWiki. Currently users use filters such as "Likely to have problems". I (@diego) don't know how those filters are created, nor the thresholds used there, but my understanding is that they are based on ORES scores.

It would be important to create/replace such filters using Revert Risk.
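For illustration, a minimal sketch of how a client might build a revert-risk request and bucket the returned score for an RC filter. The payload shape (rev_id/lang), the response layout, and the 0.8 threshold are all assumptions here, not confirmed details of the deployed model; the HTTP call itself is omitted.

```python
import json

# Hypothetical cutoff: the real RC-filter thresholds would need tuning,
# just like the ORES-based "Likely to have problems" filter.
REVERT_RISK_THRESHOLD = 0.8

def build_payload(rev_id: int, lang: str) -> str:
    """Build a JSON body for a revert-risk prediction request (shape assumed)."""
    return json.dumps({"rev_id": rev_id, "lang": lang})

def likely_revert(prediction: dict, threshold: float = REVERT_RISK_THRESHOLD) -> bool:
    """Map a model response to a boolean RC-filter flag.

    Assumes the positive-class probability lives at
    output -> probabilities -> true (layout assumed, not verified).
    """
    return prediction["output"]["probabilities"]["true"] >= threshold

# Example response shape (assumed):
sample = {"output": {"prediction": True,
                     "probabilities": {"true": 0.93, "false": 0.07}}}
```

The thresholding step is the part the RC filters would actually need; everything upstream of it is just transport.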

Event Timeline

We discussed this use case during the ML/Research sync yesterday, and several good points came up. Some highlights:

  • The ORES extension in MediaWiki generates another "stream" (if we can call it that) that overlaps with the mediawiki.revision-score one. They are separate, and the overlap of their traffic is non-zero (we don't know exactly how much).
    • The main difference between the two is that the one generated by MediaWiki currently ends up in database tables (ores_classification), whereas the mediawiki.revision-score one ends up in Kafka so people can re-use the event stream for other purposes (so the latter is definitely a proper stream, the former not so much).
  • The ML team is going to offer the possibility to create streams like mediawiki.revision-score for each model, using Change-prop as stream processor.
  • In cases like Revert-risk, maybe we want to have a single way to score requests.
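To make the distinction above concrete, here is a sketch of pulling per-model probabilities out of a revision-score style event. The field layout is an assumption loosely modeled on mediawiki.revision-score and should be checked against the real schema.

```python
def extract_scores(event: dict) -> dict:
    """Return {model_name: probability_map} from a revision-score style event.

    The 'scores' layout below is assumed, not taken from the real schema.
    """
    return {
        model: data.get("probability", {})
        for model, data in event.get("scores", {}).items()
    }

# Example event (shape assumed):
event = {
    "database": "enwiki",
    "rev_id": 123456,
    "scores": {"revertrisk": {"probability": {"true": 0.91, "false": 0.09}}},
}
```

A Kafka consumer of the proper stream would run something like this per message; the MediaWiki-side "stream" instead lands the same information in ores_classification rows.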

All the above assumes the use of ORES, which is not able to generate/send events to EventGate. Lift Wing is able to do it, and this opens up new possibilities:

  • Instead of having two separate streams for Revert Risk, we could configure a MediaWiki extension to contact Lift Wing upon certain events (like page create/edit). I would personally use a different extension, not the ORES one, calling it "Revision quality" or something generic (so not referring to any tool like ORES or Lift Wing).
  • The extension would retrieve scores and publish them to a MediaWiki database, but in the meantime Lift Wing would be able to produce an event stream to Event Gate (behind the scenes).

This would certainly reduce the traffic, and provide both functionalities (RC flags and a proper stream) for people interested in Revert Risk's data. The alternative is to build both (MW extension AND a separate stream via Change-prop), but it seems a waste of resources (although it would allow a cleaner separation of concerns).

Another alternative could be to have a canonical place (like Cassandra/AQS) where we store "scores", populated by a stream processor that would read from the event stream produced by Lift Wing. MediaWiki would read from this canonical datastore rather than from its tables, but this would mix DSE-related infrastructure with MediaWiki (Tier1/pageable support).

@Ottomata you are our guidance in the field, what is your recommendation? :)

Adding some thoughts after the weekend pause - maybe we can just keep the two streams separate without any problem; the number of rps to handle shouldn't be a big issue. It will surely be less efficient but far more flexible, and we'd keep concerns separated.

Tgr subscribed.

Currently users use filters such as "Likely to have problems". I (@diego) don't know how those filters are created, nor the thresholds used there, but my understanding is they are based on ORES scores.

See here for the thresholds and ORES/RCFilters for some information on how they are created.

IMO your best chance is either to make ORES proxy these new scores, if there's an easy way to do that (I'd assume not, but I'm not that familiar with ORES internals), or to give the ORES MediaWiki extension different API clients for different score types and provide an API similar to ORES for fetching the scores. The amount of work needed to integrate a new extension with RecentChanges etc. in a way similar to ORES wouldn't be that small.

I'm at 80% understanding, but let me try to summarize.

There is a new ML model that is deployed as a service on Lift Wing. You'd like to have a stream of page-edit info with the results of the model applied to each edit. You'd also like to have this data stored somewhere so that MW can use it to build a user-facing RecentChanges feature?

IIUC, the proposed idea is: on edit, MW requests a score from the Revert Risk Lift Wing endpoint and stores the result in the MW database. When Lift Wing receives this request, it also produces an event to a stream?

That would be fine I think. A more ideal design is the 'canonical place' alternative. I think that design is called a 'materialized view'. In that case, the events would be the 'canonical' source of the revert risk scores, and downstream services that want to serve them would just consume the events and maintain the state they need. In this case, the 'downstream service' is MediaWiki.

Q: does RecentChanges need the score immediately after the edit? If so, then your proposed solution is probably better, as MediaWiki handles getting the score from LiftWing and storing it as part of the edit request. If not, and an eventual score is okay, then the materialized view / event sourcing approach is much better. There is only one event, and all consumers build their view from that.
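A toy sketch of the materialized-view idea, assuming a simplified event shape (rev_id/score): the stream is the canonical source, and each downstream consumer keeps only the state it needs.

```python
class RevertRiskView:
    """Toy materialized view: consume revision-score events and keep
    only the state RecentChanges would need (rev_id -> score)."""

    def __init__(self):
        self.scores = {}

    def consume(self, event: dict) -> None:
        # Idempotent upsert: replaying the stream from the start
        # rebuilds exactly the same view.
        self.scores[event["rev_id"]] = event["score"]

    def score_for(self, rev_id: int):
        return self.scores.get(rev_id)

view = RevertRiskView()
for ev in [{"rev_id": 1, "score": 0.2},
           {"rev_id": 2, "score": 0.95},
           {"rev_id": 1, "score": 0.21}]:  # later event wins
    view.consume(ev)
```

In this design MediaWiki would be one such consumer; the trade-off, as noted, is that the view is eventually consistent with the edit.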

The image suggestions project is doing something similar, although with much more latency. They produce a new set of image recommendations as a batch job, and publish those somehow (I forget where..cassandra?), and then they serve Image Suggestions via MediaWiki. Ideally they'd do this via streams.

FWIW This kind of pattern (event enrichment + automated ingestion to a datastore) is something we are trying to standardize in Event Platform.

MediaWiki would read from this canonical datastore rather than from its tables

One more Q: Is there any reason why an external pipeline (streaming or batch) can't write to a MariaDB table that MW can use?

@Ladsgroup Hi :) Sorry to drag you in this task but I think that you are definitely one of the best people that can provide some guidelines.. If you have time of course :)

Summary:
The Research team is looking for a way to add, when the models are fully production-ready, Revert-Risk filters to Recent Changes. The model(s) aim to replace the ORES damaging/reverted ones, but we are wondering what's best in the context of the ORES extension. Current ideas are:

  • Either integrate Revert Risk scores into the current ORES extension (calling Lift Wing instead) or create a completely new extension. In this case we'd keep the current workflow: upon edit/page-create actions from users, a JobQueue event is created to call ORES asynchronously and then populate the related ORES MariaDB tables (which the ORES RC filters are based on).
  • Use ChangeProp to create a stream of events in Kafka (upon every mediawiki.revision-create or mediawiki.page-change event, Lift Wing will be called and will in turn emit a revision-score event to EventGate and Kafka). Then a consumer (even external to MediaWiki) would pull the data and insert it into a datastore (Cassandra/AQS or even a MariaDB table), and MediaWiki would simply look up scores (for a given set of rev-ids) in the datastore.

The last point seems to be the more flexible option, since with one stream we'll get multiple things:

  1. a central point in Kafka where people can listen to revision-score events.
  2. a way to de-duplicate score requests between MediaWiki (ORES extension) and Change-prop.

I am not sure, though, whether the idea is even possible or highly discouraged; this is why I am asking for an expert opinion :)

Hi, my sincere apologies for the late answer; we are understaffed even more than usual. Anyway, on paper the ORES extension should be able to handle any sort of model; it's not bound to ORES or reverting/damaging. The class for building the value out of a JSON response is well encapsulated, so you should be able to make some changes to the ORES extension and get it to work with Lift Wing models. Renaming a deployed extension is practically impossible, though. People have tried it with Flow before. It doesn't matter much, it's just internal/technical facing.

Regarding the jobs, the reason the ORES extension doesn't trigger a job for every edit is not that it can't; it's because it could 1) overwhelm the ORES service and 2) fill the MW MySQL tables with crap. The biggest example is the 22M edits done monthly on Wikidata, of which only a very small fraction is valuable for the ORES extension (patrollers need the edits that are not auto-patrolled by MediaWiki), so the extension simply ignores edits done by auto-patrolled users (including bots), which filters out 99.9% of edits.

This can be handled in a much better way and de-duplicate extra requests without hurting any functionality:

  • Decouple the saving logic from the job execution logic.
  • Then allow the job to be executed on every edit regardless, but store the result only if it's needed.
  • The same job then emits an event for the event stream once it has received the results from ORES or Lift Wing.
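The three steps above could look roughly like this (all names are hypothetical, and the real implementation would be PHP inside the extension; this is only a sketch of the control flow):

```python
def handle_edit(edit: dict, score_fn, emit_event, store_score):
    """Sketch of the decoupled flow: score every edit, always emit an
    event to the stream, but only write the MariaDB-backed row when
    patrollers actually need it (i.e. the edit is not auto-patrolled)."""
    score = score_fn(edit["rev_id"])
    emit_event({"rev_id": edit["rev_id"], "score": score})  # stream gets everything
    if not edit.get("autopatrolled", False):                # ~99.9% filtered on Wikidata
        store_score(edit["rev_id"], score)
    return score

# Usage with stand-in collaborators:
events, stored = [], []
fake_score = lambda rev_id: 0.9
handle_edit({"rev_id": 1, "autopatrolled": True}, fake_score,
            events.append, lambda r, s: stored.append((r, s)))
handle_edit({"rev_id": 2, "autopatrolled": False}, fake_score,
            events.append, lambda r, s: stored.append((r, s)))
```

The point of the decoupling is visible in the two collaborators: the event emission is unconditional, the storage write is not.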

I can help on code-review and architectural decisions.

Thanks a lot!

Regarding the jobs, the reason the ORES extension doesn't trigger a job for every edit is not that it can't; it's because it could 1) overwhelm the ORES service and 2) fill the MW MySQL tables with crap. The biggest example is the 22M edits done monthly on Wikidata, of which only a very small fraction is valuable for the ORES extension (patrollers need the edits that are not auto-patrolled by MediaWiki), so the extension simply ignores edits done by auto-patrolled users (including bots), which filters out 99.9% of edits.

Just to understand - the extension does trigger async jobs in the job queue, right? IIUC calling ORES and inserting into the DB is not done at edit time, but later on. Forgive my ignorance, but I'd like to be sure about these things; MediaWiki is not my area of expertise :)

I can help on code-review and architectural decisions.

Really appreciated :) I checked the ratio between ChangeProp requests to ORES and MediaWiki requests to ORES; the former definitely calls ORES far more than the latter. My impression at the moment is that it would be better to keep the streams separated, since they have different scopes and configurations:

  • A stream handled by MediaWiki will hit Lift Wing to produce entries in the related MariaDB tables.
  • A stream handled by ChangeProp/Flink/etc. will hit Lift Wing to produce events to EventGate.

There is a non-zero overlap between the two, but it should be fine to have both served by Lift Wing. It would be great to have only one, but in both cases I think there will be a non-trivial amount of technical decisions and changes to make it happen.

Thanks a lot!

Regarding the jobs, the reason the ORES extension doesn't trigger a job for every edit is not that it can't; it's because it could 1) overwhelm the ORES service and 2) fill the MW MySQL tables with crap. The biggest example is the 22M edits done monthly on Wikidata, of which only a very small fraction is valuable for the ORES extension (patrollers need the edits that are not auto-patrolled by MediaWiki), so the extension simply ignores edits done by auto-patrolled users (including bots), which filters out 99.9% of edits.

Just to understand - the extension does trigger async jobs in the job queue, right? IIUC calling ORES and inserting into the DB is not done at edit time, but later on. Forgive my ignorance, but I'd like to be sure about these things; MediaWiki is not my area of expertise :)

Yes, it's post-edit. One of the pillars of MediaWiki is to save the edit as soon as possible, build the canonical entry, and then trigger a massive set of secondary data updates (via deferred updates or jobs) covering a wide range of tasks, from CDN purges to ORES to updating the search index. This is called the "outbox pattern" in the industry. MediaWiki is basically event-driven, just not in an obvious way.
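A toy sketch of that outbox-style flow (MediaWiki's actual machinery is DeferredUpdates and the JobQueue; the task names and functions below are made up for illustration):

```python
def save_edit(db: list, outbox: list, edit: dict) -> None:
    """Commit the canonical revision first, then record secondary
    updates in an 'outbox' to be run later, asynchronously."""
    db.append(edit)  # canonical write: fast path, committed first
    for task in ("cdn_purge", "score_revision", "update_search_index"):
        outbox.append({"task": task, "rev_id": edit["rev_id"]})

def drain_outbox(outbox: list, handlers: dict) -> None:
    """Job-runner side: process queued secondary updates in FIFO order."""
    while outbox:
        job = outbox.pop(0)
        handlers[job["task"]](job["rev_id"])

# Usage with stand-in stores and handlers:
db, outbox, ran = [], [], []
save_edit(db, outbox, {"rev_id": 42, "text": "..."})
drain_outbox(outbox, {t: (lambda rid, t=t: ran.append(t))
                      for t in ("cdn_purge", "score_revision",
                                "update_search_index")})
```

The key property is that the user-visible save never waits on the secondary updates; an ORES/Lift Wing scoring call fits naturally as one more outbox task.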

I can help on code-review and architectural decisions.

Really appreciated :) I checked the ratio between ChangeProp requests to ORES and MediaWiki requests to ORES; the former definitely calls ORES far more than the latter. My impression at the moment is that it would be better to keep the streams separated, since they have different scopes and configurations:

  • A stream handled by MediaWiki will hit Lift Wing to produce entries in the related MariaDB tables.
  • A stream handled by ChangeProp/Flink/etc. will hit Lift Wing to produce events to EventGate.

There is a non-zero overlap between the two, but it should be fine to have both served by Lift Wing. It would be great to have only one, but in both cases I think there will be a non-trivial amount of technical decisions and changes to make it happen.

There is a trade-off here: either accept the overlap or have one stream. Each has pros and cons; I'm more on the side of one stream, but I can only provide my knowledge, this is the team's decision.

Thanks a lot!

Regarding the jobs, the reason the ORES extension doesn't trigger a job for every edit is not that it can't; it's because it could 1) overwhelm the ORES service and 2) fill the MW MySQL tables with crap. The biggest example is the 22M edits done monthly on Wikidata, of which only a very small fraction is valuable for the ORES extension (patrollers need the edits that are not auto-patrolled by MediaWiki), so the extension simply ignores edits done by auto-patrolled users (including bots), which filters out 99.9% of edits.

Just to understand - the extension does trigger async jobs in the job queue, right? IIUC calling ORES and inserting into the DB is not done at edit time, but later on. Forgive my ignorance, but I'd like to be sure about these things; MediaWiki is not my area of expertise :)

Yes, it's post-edit. One of the pillars of MediaWiki is to save the edit as soon as possible, build the canonical entry, and then trigger a massive set of secondary data updates (via deferred updates or jobs) covering a wide range of tasks, from CDN purges to ORES to updating the search index. This is called the "outbox pattern" in the industry. MediaWiki is basically event-driven, just not in an obvious way.

Perfect this was my understanding, thanks for the clarification :)

I can help on code-review and architectural decisions.

Really appreciated :) I checked the ratio between ChangeProp requests to ORES and MediaWiki requests to ORES; the former definitely calls ORES far more than the latter. My impression at the moment is that it would be better to keep the streams separated, since they have different scopes and configurations:

  • A stream handled by MediaWiki will hit Lift Wing to produce entries in the related MariaDB tables.
  • A stream handled by ChangeProp/Flink/etc. will hit Lift Wing to produce events to EventGate.

There is a non-zero overlap between the two, but it should be fine to have both served by Lift Wing. It would be great to have only one, but in both cases I think there will be a non-trivial amount of technical decisions and changes to make it happen.

There is a trade-off here: either accept the overlap or have one stream. Each has pros and cons; I'm more on the side of one stream, but I can only provide my knowledge, this is the team's decision.

The one-stream solution would definitely be better, but I think that we have two options:

  1. We use MediaWiki as the "stream processor": an event is generated by a user edit/page-create/etc., and eventually a Job will hit Lift Wing, which in turn will generate an event to EventGate.
  2. We use ChangeProp/Flink/etc. as the "stream processor", listening for page-change/page-create events in Kafka and hitting Lift Wing, which generates an event to EventGate.

The problem that I see with 1) is that we are already filtering out (and rightfully so) a lot of events, whereas researchers may want the whole stream scored. Solution 2) seems more integrated and in line with what the Event Platform folks are building, but its main problem is that we wouldn't be able to populate the MariaDB table that MediaWiki uses in Special pages for the ORES filters. An idea could be to use Cassandra or similar to store "scores" generated via ChangeProp/Flink, but MediaWiki's special pages would need to read from it, and I don't think this is a possibility (correct me if I am wrong, or if there is a way for the ORES extension's filters to read from Cassandra instead).

This is why I was proposing to keep things separate; the streams overlap but they are not really the same...

wouldn't be able to populate the Mariadb table that Mediawiki uses in Special pages for the ORES filters.

I wonder if there is some way to do this. It is probably a fundamental question that we'd need to spend some architectural time thinking about.

Ideas:

  • Could we insert MW Jobs into the JobQueue (by emitting mediawiki.job.xxx events) from non MW?
  • Could we make a special MW PHP (stream or batch?) job queue processor that instead of responding to mediawiki.job.xxx RPC type events, could be told to subscribe to certain streams and THEN launch the appropriate job? This would mean that devs need to write the appropriate stream event -> MW job translator code. I suppose it could even just then emit mediawiki.job.xxx events instead of running the job directly.
  • Could we have a special MW PHP stream processing job that knows how to consume events and inserts into MW MariaDB tables?
  • Could we have a discussion of whether it is (one day?) acceptable to write to MW MariaDB from something that is not MW? (not sure about this one.)
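The second and third bullets could be sketched as a tiny stream-event-to-job translator (the job name and field layout below are made up for illustration; the real translator would live in PHP alongside the JobQueue machinery):

```python
def translate(event: dict):
    """Hypothetical translator: map a revision-score style stream event
    to a mediawiki.job.xxx style event that a JobQueue runner could
    pick up and use to insert into the MW MariaDB tables."""
    if "rev_id" not in event:
        return None  # drop malformed events
    return {
        "type": "mediawiki.job.recordRevertRiskScore",  # job name is made up
        "params": {"rev_id": event["rev_id"], "score": event.get("score")},
    }
```

Developers would write one such translator per stream, as the second bullet notes; the translator could either run the job directly or just re-emit the job event.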

The problem that I see with 1) is that we are already filtering out (and rightfully so) a lot of events, whereas researchers may want the whole stream scored.

@elukey, do we know which data is being filtered out?

The problem that I see with 1) is that we are already filtering out (and rightfully so) a lot of events, whereas researchers may want the whole stream scored.

The thing is that it doesn't skip those edits for JobQueue reasons, or because it can't; it skips them for storage reasons. It's pretty easy to make the ORES extension simply queue the job and call ORES/Lift Wing for every edit, but not store the result when it doesn't need to. You can then make it emit an event for every edit for free. The change is quite minimal.

wouldn't be able to populate the Mariadb table that Mediawiki uses in Special pages for the ORES filters.

I wonder if there is some way to do this. It is probably a fundamental question that we'd need to spend some architectural time thinking about.

Ideas:

  • Could we insert MW Jobs into the JobQueue (by emitting mediawiki.job.xxx events) from non MW?

That seems the wrong direction of coupling and easily prone to breakage.

  • Could we make a special MW PHP (stream or batch?) job queue processor that instead of responding to mediawiki.job.xxx RPC type events, could be told to subscribe to certain streams and THEN launch the appropriate job? This would mean that devs need to write the appropriate stream event -> MW job translator code. I suppose it could even just then emit mediawiki.job.xxx events instead of running the job directly.
  • Could we have a special MW PHP stream processing job that knows how to consume events and inserts into MW MariaDB tables?
  • Could we have a discussion of whether it is (one day?) acceptable to write to MW MariaDB from something that is not MW? (not sure about this one.)

All of these are quite a lot of work and a lot of unneeded coupling for something that can be achieved much simpler via the suggested idea.

much simpler via the suggested idea.

I might be getting lost, but ah, is the suggested idea to use a MW Job to send a request to Lift Wing (causing an event to be generated), and then, if the score should be stored in MariaDB (not all should), store it? Sounds fine to me!

More generally though, we are going to want to answer the question of how best to use 'generated data' for 'wiki experiences'. Is using the MW Job Queue to insert into MariaDB going to always/usually be the answer? Is storing in Cassandra (or other non-MW-managed stores) and then having MW extensions query those better? Do we want to support both?

It would be nice to be able to answer this question (there may be multiple preferred solutions) generically, so that we don't have to solve it for every new 'generated data' product feature that comes around.

The problem that I see with 1) is that we are already filtering out (and rightfully so) a lot of events, whereas researchers may want the whole stream scored.

The thing is that it doesn't skip those edits for JobQueue reasons, or because it can't; it skips them for storage reasons. It's pretty easy to make the ORES extension simply queue the job and call ORES/Lift Wing for every edit, but not store the result when it doesn't need to. You can then make it emit an event for every edit for free. The change is quite minimal.

Makes sense, but honestly I'd rather use ChangeProp to handle streams than a MediaWiki extension. Configuring a new stream is easy and now takes a single code review in deployment-charts (for every new model, not only revscoring ones), whereas we cannot really say the same for a MediaWiki extension (also, all management tasks like starting/stopping/metrics/alerts/etc. are far easier outside MediaWiki). I am not advocating for hybrid solutions, but I feel that, even if Job Queues are a power tool, they may not be the best fit for these kinds of use cases.