
Expose ORES topics in recent changes filters
Open, Needs Triage, Public

Description

User story: As a patroller, I want to filter Recent Changes to a topic of my interest, so that I can focus my efforts where my expertise and interest lie.

https://www.mediawiki.org/wiki/ORES/Articletopic

https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Language_agnostic_link-based_article_topic

This would presumably be done on top of T240558: Update ORES articletopic data score in ElasticSearch when an article gets edited, which fetches ORES topics on edit; but they would also have to be stored somewhere (the ORES extension, presumably?) and we'd have to end up with a non-terrible join/filter for the RC query.

This work presumably depends on T380825: Make ORES topics and their translations easily available to MediaWiki extensions so that we have a central place to retrieve the topics for display.

Topics are already available via the LiftWing API here: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_articletopic_outlink_prediction

Future work
The topic model is going through some refreshes in the 24/25 Fiscal Year:

  • In Q3 (Jan-Mar) a new country taxonomy (T371897) will be made available, replacing the geographic topics currently available.
  • A broader taxonomy change will also be made (draft), but the timeline is less clear.

These new taxonomies broadly map to the existing topics, so this isn't a complete rework.

Event Timeline


I don't think it's on any team's roadmap at this point.

@Chtnnh if you're interested in working on this and you have questions, let's talk about it here on the task.

Great!

T240558 is not yet done, so how do you suggest I go about this task? Also, could you shed some light on storing the data in the ORES extension? I have almost no experience with the Wikimedia DB.

Thanks so much @kostajh

Claiming the task to clear any confusion regarding work. I will be working on this as soon as possible.

@Chtnnh cool! So, I think T240558: Update ORES articletopic data score in ElasticSearch when an article gets edited is not so relevant here.

As far as I can tell (@Tgr please comment if you have feedback), what we need to do is something like this:

  • store the articletopic scores in the ores_classifications table, like we do for other ORES models
    • OresModels and OresModelClasses need to be updated in the ORES extension to handle the articletopic model
    • wgOresModels (in operations/mediawiki-config repo) needs to be updated to add a reference to articletopic
    • when that is done, the ORES extension includes/Hooks/RecentChangeSaveHookHandler.php should fetch articletopic scores and store them in the database
  • once they are in the database, then we can look at updating onChangesListSpecialPageStructuredFilters() in the ORES extension to add this data to the UI. We can look at the existing examples from the damaging and goodfaith models

@RHo (and @Pginer-WMF as the original designer of this page) when you have time, could you please provide some input on how the topics could be presented in the RCFilters (and Watchlist) UI? Maybe as an "Advanced filter" like Namespaces and tagged edits?

@kostajh Thank you for the pointers. I will treat them as a step-by-step on the task.

Could you just tell me which repo to find the files in? I have been searching the Wikimedia GitHub but no luck :(

I'm worried that adding topic predictions to ores_classifications might make the table too big.

Right now, we have 64 target classes. Reporting 64 probabilities per page would be an awful lot: 64 * 32-bit float * 10 million pages ≈ 19 Gbit (about 2.4 GB). We could choose the top 5 to limit the output: 5 * 32-bit float * 10 million ≈ 1.5 Gbit (about 190 MB). Also, I wonder if we might consider using a bit field: 64 bits * 10 million (or so) pages ≈ 610 Mbit (about 76 MB) of data.
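As a quick sanity check on these estimates, here is a back-of-the-envelope sketch in Python. It uses the same figures (64 topics, 32-bit floats, roughly 10 million pages) and divides by 8 to report bytes rather than bits. It also shows what the bit-field idea could look like. The function names are made up for illustration and are not part of ORES or the ORES extension.

```python
from typing import Iterable, List

PAGES = 10_000_000  # rough page count used in the estimate
TOPICS = 64         # number of target classes
FLOAT_BITS = 32     # one 32-bit float per stored probability


def total_bytes(values_per_page: int, bits_per_value: int, pages: int = PAGES) -> int:
    """Storage needed for the raw values only (no row or index overhead)."""
    return values_per_page * bits_per_value * pages // 8


all_scores = total_bytes(TOPICS, FLOAT_BITS)  # 2_560_000_000 bytes (about 2.4 GiB)
top_five = total_bytes(5, FLOAT_BITS)         # 200_000_000 bytes (about 190 MiB)
bit_field = total_bytes(TOPICS, 1)            # 80_000_000 bytes (about 76 MiB)


def pack_topics(topic_ids: Iterable[int]) -> int:
    """Pack topic IDs (0..63) into one 64-bit integer, one bit per topic."""
    mask = 0
    for topic_id in topic_ids:
        if not 0 <= topic_id < TOPICS:
            raise ValueError(f"topic id out of range: {topic_id}")
        mask |= 1 << topic_id
    return mask


def unpack_topics(mask: int) -> List[int]:
    """Recover the sorted topic IDs whose bits are set in the mask."""
    return [t for t in range(TOPICS) if (mask >> t) & 1]
```

A bit field only records whether a topic applies, so the probabilities themselves are lost; that is the trade-off for the much smaller footprint.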

Alternatively, we already have topic data in the search index. I wonder if there would be a tractable way to filter RecentChanges using ElasticSearch. It doesn't sound like it, but it'd be pretty cool if there was.

@Halfak is there a way to loop in someone who would know about ElasticSearch so that we can explore that possibility? The table size for ores_classifications is quite important as you mentioned.

@Gehel or someone on his team might be able to help with some insights. I imagine @kostajh might know or know of someone who could give us some pointers too.

Using elastic sounds complicated here I think.
I don't know enough about RCFilter but this would require joining two distinct backends (mysql and elastic). I'm afraid that it is going to be very inefficient.
For reference the list of fields we index is visible here: https://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump

@RHo (and @Pginer-WMF as the original designer of this page) when you have time, could you please provide some input on how the topics could be presented in the RCFilters (and Watchlist) UI? Maybe as an "Advanced filter" like Namespaces and tagged edits?

I think it makes sense as an additional advanced filter.

One relevant question is whether multiple topics should work as a union or intersection. I think it makes more sense as a union: when selecting "food" and "sports" I assume the expectation is to see edits affecting articles about any of those topics, not only sport activities involving food. However, it is possible that someone expects "food" and "Europe" to lead to European food related articles. In any case, I don't think it makes sense to integrate with other options such as "Tags", since you definitely want only mobile edits when selecting such a tag.
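To illustrate the difference between the two semantics, here is a small Python sketch. The topic names and the matches() helper are hypothetical and only stand in for whatever the filter would actually use.

```python
from typing import Set


def matches(article_topics: Set[str], selected: Set[str], mode: str = "union") -> bool:
    """Decide whether an edit should be shown for the selected topic filters."""
    if not selected:
        return True  # no topic filter selected: show everything
    if mode == "union":
        # show the edit if the article has ANY of the selected topics
        return bool(article_topics & selected)
    if mode == "intersection":
        # show the edit only if the article has ALL of the selected topics
        return selected <= article_topics
    raise ValueError(f"unknown mode: {mode}")


pizza = {"food", "Europe"}      # hypothetical topics for an article about pizza
marathon = {"sports"}           # hypothetical topics for an article about marathons
selection = {"food", "sports"}

union_hits = [matches(a, selection, "union") for a in (pizza, marathon)]
intersection_hits = [matches(a, selection, "intersection") for a in (pizza, marathon)]
```

With the union rule both articles are shown for "food" + "sports"; with the intersection rule neither is, which is the "sport activities involving food" case described above.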

In terms of UI adjustments, 3 would be at the limit of the advanced filters we may want to show at once. So it is worth thinking whether it is ok to keep the 3 visible or show one with a "more" option to access the other 2. Maybe that decision can be postponed until a 4th option is considered, at which point I think it will definitely become necessary.

Fetching data from ElasticSearch can be done for a limited number of tasks (T243478: Newcomer tasks: fetch ElasticSearch data for search results is a similar use case) but recent changes is up to a thousand articles so even just for annotating a result set retrieved from DB it would not be trivial; and actually using it for filtering would mean retrieving data for every recentchanges row (tens of thousands, possibly). In an ideal world, all of RC would come from ElasticSearch and this wouldn't be an issue... but for now, storing the topic predictions in the DB is the only option.
(Also, the ElasticSearch data is only updated once a week. That's a minor annoyance usually, but for recent changes it seems particularly ill-suited.)

As for DB size, there is little point in storing all 64 predictions. For goodfaith/damaging it is useful to store the prediction score so that different tools can use different thresholds, like only showing edits the classifier is confident are bad to auto-revert tools but showing everything suspicious to humans; so RC has multiple filters, such as "maybe bad" / "likely bad" / "very likely bad". For topics though, there is no use for a separate "maybe about art" / "likely about art" / "very likely about art"; and especially since search already uses a specific set of thresholds for topic filtering, user expectation will be to match that. So that would mean something like 2-3 predictions per article, and no need to store the score either, just the model ID. That doesn't fit well with the current DB schema though; and reproducing the way score thresholds are derived for search would be nontrivial. (OTOH, if we do T240558: Update ORES articletopic data score in ElasticSearch when an article gets edited in the future, it will have to be done anyway.)
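A minimal Python sketch of that idea: keep only the topics whose score clears a per-topic threshold, and store their integer IDs rather than the scores. The topic names, thresholds and IDs below are invented for illustration; the real thresholds would have to match whatever search uses.

```python
from typing import Dict, List


def topics_to_store(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    class_ids: Dict[str, int],
) -> List[int]:
    """Return the IDs of topics whose score meets that topic's threshold."""
    return sorted(
        class_ids[topic]
        for topic, score in scores.items()
        if topic in thresholds and score >= thresholds[topic]
    )


# Hypothetical values, not real ORES output or configuration.
example_scores = {"Culture.Food": 0.91, "Geography.Europe": 0.64, "STEM.Physics": 0.02}
example_thresholds = {"Culture.Food": 0.5, "Geography.Europe": 0.7, "STEM.Physics": 0.6}
example_ids = {"Culture.Food": 7, "Geography.Europe": 21, "STEM.Physics": 40}

stored = topics_to_store(example_scores, example_thresholds, example_ids)
```

Here only one integer would be written for the article, instead of 64 floating-point scores.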

Also, the ElasticSearch data is only updated once a week. That's a minor annoyance usually, but for recent changes it seems particularly ill-suited

That is truly a matter of concern, @Tgr @Halfak.
We are faced with two alternatives here: either we change ElasticSearch to refresh at a greater frequency (which may be complicated and beyond the scope of this task), or we consider an alternative way to achieve this.

but for now, storing the topic predictions in the DB is the only option.

Will we need a new DB table to store these predictions? How would that process work? What do you suggest we store in the DB?

In terms of UI adjustments, 3 would be at the limit of the advanced filters we may want to show at once. So it is worth thinking whether it is ok to keep the 3 visible or show one with a "more" option to access the other 2. Maybe that decision can be postponed until a 4th option is considered, at which point I think it will definitely become necessary.

Sounds good, thanks @Pginer-WMF

So that would mean something like 2-3 predictions per article, and no need to store the score either, just the model ID. That doesn't fit well with the current DB schema though; and reproducing the way score thresholds are derived for search would be nontrivial.

Well, it might be useful to store metadata about the match rankings, not necessarily for RC Filters but so that other applications could make use of the top ranked topic match for a given article.

Will we need a new DB table to store these predictions?

Yes, maybe doing something like:

The easiest (but more wasteful in terms of space) option would be to keep using the table as is: the model goes into the ores_model table, topic names are mapped to integers in $wgOresModelClasses and for each article those integers and the topic scores go into the ores_classification table (but rows with low scores are discarded). The table is 36 bytes per row; the more tailored schema suggested by Kosta would be something like 15 bytes per row, so about half the size. ORES tends to give 3-4 good predictions per article, and we only keep this data around for revisions in the recentchanges table, so for enwiki (10M rows) that's about a gigabyte with the easy solution and half of that with the nice solution. You should probably ask DBAs what they think about that.
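The estimate can be reproduced with a few lines of Python. The row sizes (36 and 15 bytes), the 10M recentchanges rows and the 3-4 predictions per article are taken from the comment above; index and storage-engine overhead are ignored, so treat the results as rough lower bounds.

```python
RC_ROWS = 10_000_000    # enwiki recentchanges rows, as stated above
EASY_ROW_BYTES = 36     # existing table layout
TAILORED_ROW_BYTES = 15  # more tailored schema


def table_bytes(predictions_per_revision: int, row_bytes: int, revisions: int = RC_ROWS) -> int:
    """One stored row per kept prediction per recentchanges revision."""
    return revisions * predictions_per_revision * row_bytes


easy_low = table_bytes(3, EASY_ROW_BYTES)          # 1_080_000_000 bytes
easy_high = table_bytes(4, EASY_ROW_BYTES)         # 1_440_000_000 bytes
tailored_low = table_bytes(3, TAILORED_ROW_BYTES)   # 450_000_000 bytes
tailored_high = table_bytes(4, TAILORED_ROW_BYTES)  # 600_000_000 bytes
```

So the easy option lands at roughly 1.1 to 1.4 GB and the tailored one at roughly 0.45 to 0.6 GB, in line with "about a gigabyte" and "half of that".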

In terms of code, the easy solution would probably consist of adding an option for a model-specific filter callback in or near ServiceScoreLookup, adding a class for fetching thresholds from ORES and caching them locally, and adding a callback for the articletopic model which uses the thresholds to filter the data. Getting the thresholds is probably too slow to do just in time, so those might have to be kept up to date by some kind of cronjob.
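The shape of that approach can be sketched in Python (the extension itself is PHP, so this is only an illustration of the design, not its API). All names here (ThresholdStore, MODEL_FILTERS, filter_scores) are hypothetical. The store is refreshed by a separate periodic job, so the save path never has to wait on a threshold fetch.

```python
from typing import Callable, Dict

Scores = Dict[str, float]
FilterCallback = Callable[[Scores], Scores]


class ThresholdStore:
    """Holds the last thresholds fetched by a periodic job (e.g. a cron script)."""

    def __init__(self) -> None:
        self._thresholds: Dict[str, float] = {}

    def refresh(self, fetch: Callable[[], Dict[str, float]]) -> None:
        # Called out of band; 'fetch' stands in for a request to the scoring service.
        self._thresholds = dict(fetch())

    def get(self) -> Dict[str, float]:
        return self._thresholds


def make_topic_filter(store: ThresholdStore) -> FilterCallback:
    def keep_confident(scores: Scores) -> Scores:
        thresholds = store.get()
        # Topics without a known threshold are dropped rather than guessed at.
        return {t: s for t, s in scores.items() if s >= thresholds.get(t, float("inf"))}

    return keep_confident


# Model name -> optional filter applied before scores are written to the DB.
MODEL_FILTERS: Dict[str, FilterCallback] = {}


def filter_scores(model: str, scores: Scores) -> Scores:
    callback = MODEL_FILTERS.get(model)
    return callback(scores) if callback else scores


store = ThresholdStore()
store.refresh(lambda: {"Culture.Food": 0.5})  # made-up threshold
MODEL_FILTERS["articletopic"] = make_topic_filter(store)
```

Models without a registered callback, such as damaging or goodfaith, pass through unchanged, so existing behaviour would not be affected.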

A potential problem with that (other than the non-optimal space usage) is that oresc_class (the topic ID) is a tinyint (so 256 possible values). That's enough for now but does not look super future proof.

Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!