
Newcomer tasks: set initial thresholds for ORES articletopic
Closed, Resolved · Public

Description

To help us pursue T244192: Newcomer tasks: ORES ontology mapping and score thresholds, @Halfak is going to give us recommended thresholds for each topic model in each language that he expects will give us about 70% precision or better.

He'll also put together some instructions for how we can determine and adjust such thresholds in the future.

@Halfak -- if you get the sense that there are lots of articles that can be captured at 70% precision, then maybe you can recommend thresholds that will get us higher than 70% but still with sufficient recall to supply plenty of articles. Or we can try that and adjust from there.

Event Timeline

Here's a gist that I put together with my initial explorations and discussions of choosing thresholds: https://gist.github.com/halfak/630dc3fd811995c2a0260d43da462645

I'm working on generating some output that will help us choose thresholds.

How do we think these thresholds should be applied? It sounds like we need to inject them prior to the indexing pipeline? Or is this something that should be handled much earlier in the pipeline?

If there are a lot of articles, precision won't really matter, since the articles will be sorted by score and cut off at some point for pragmatic reasons (currently at the top 250), so the real precision will be higher anyway. What we are really talking about here is a trade-off between having fewer than 250 articles, or having 250 articles but less relevant ones.

What are the thresholds for, exactly? The articletopic: search keyword in general, or just the suggested edits interface specifically?

How do we think these thresholds should be applied? It sounds like we need to inject them prior to the indexing pipeline? Or is this something that should be handled much earlier in the pipeline?

For RCFilters, which has a similar threshold system (for mapping damaging / goodfaith ORES scores to recentchanges labels like "maybe damaging" and "very likely damaging"), the threshold definitions (things like "precision >= 15%") live in MediaWiki configuration, and are turned into threshold values dynamically via the ORES API. It's not very pleasant but it's an option.

They could also live in the script that loads data from Hadoop to ES (and currently uses a cutoff of 0.5 for discarding low scores). That would reduce ES space usage, but seems like an even more unpleasant location to manage such config, especially since it will be different for each wiki (or will it? thresholds, for sure, but threshold definitions?)

Or they could be provided by the ORES API, and handled as part of the data that goes through the ORES -> ES pipeline. That seems conceptually wrong, though, since these are thresholds specific to one application of the ORES scores, not the scores in general. (OTOH, ORES already uses a threshold to calculate the prediction field in the API response. Currently that just seems to be a constant 0.5, but maybe it would make sense to set it in a more meaningful way?)

Hm, I guess putting the threshold config in MediaWiki and applying via the search keyword definition is not really feasible, right? ES has a range search, but that would require a wholly different index structure with a huge number of fields (and then we'd have to commit to only using the ORES scores for filtering and not scoring); and match queries don't seem to provide that level of control.
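
For illustration only, a range-based approach would require something like the query sketched below (expressed as a Python dict; the per-topic numeric field name is hypothetical and does not exist in the current index), with one numeric field per topic:

```
# Hypothetical sketch of a range-filtered topic query.
# Each topic would need its own numeric field; the field name below is
# invented for illustration and is not part of the current index mapping.
range_query = {
    "query": {
        "bool": {
            "filter": [
                # Only return pages whose Physics score clears the threshold.
                {"range": {"ores_articletopic_physics": {"gte": 0.9}}}
            ],
            "must": [
                {"match": {"text": "quantum"}}
            ],
        }
    }
}
# This body would be POSTed to the index's _search endpoint.
```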

They could also live in the script that loads data from Hadoop to ES (and currently uses a cutoff of 0.5 for discarding low scores). That would reduce ES space usage, but seems like an even more unpleasant location to manage such config, especially since it will be different for each wiki (or will it? thresholds, for sure, but threshold definitions?)

This is where I was initially thinking it would fit (re-reading my previous comment, I realize it wasn't particularly clear). There is already a generic threshold option provided to the script; currently it applies >= 0.5 to everything, but that can be adjusted. It could also load a json/yaml/etc file containing some mapping of wiki/topic to threshold. But indeed, maintaining that list doesn't sound like a fun job. I don't fully grok the way thresholds are set, but it seems the ORES API can provide them. One option might be a script that runs before export, queries the ORES API, and generates a data file containing the thresholds to apply that week. I don't know if that actually makes sense though.
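
A minimal sketch of that pre-export idea, assuming we can obtain a threshold per wiki and per topic (the fetch_thresholds helper and its values here are hypothetical; it could wrap the ORES model_info API or reuse the logic from Aaron's gist) and write them to a JSON file the export script reads instead of its hard-coded 0.5 cutoff:

```
import json

def fetch_thresholds(wiki):
    """Return {topic: threshold} for one wiki.

    Placeholder: this could call the ORES model_info API or reuse the
    selection logic from Aaron's gist; hard-coded here for illustration.
    """
    return {"STEM.Physics": 0.73, "Culture.Sports": 0.61}

def build_threshold_file(wikis, path):
    # Build a {wiki: {topic: threshold}} mapping and dump it to disk for
    # the Hadoop -> elasticsearch export job to pick up.
    thresholds = {wiki: fetch_thresholds(wiki) for wiki in wikis}
    with open(path, "w") as f:
        json.dump(thresholds, f, indent=2, sort_keys=True)

if __name__ == "__main__":
    build_threshold_file(
        ["arwiki", "cswiki", "enwiki", "kowiki", "viwiki"],
        "articletopic_thresholds.json",
    )
```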

Or they could be provided by the ORES API, and handled as part of the data that goes through the ORES -> ES pipeline. That seems conceptually wrong, though, since these are thresholds specific to one application of the ORES scores, not the scores in general. (OTOH, ORES already uses a threshold to calculate the prediction field in the API response. Currently that just seems to be a constant 0.5, but maybe it would make sense to set it in a more meaningful way?)

I'm again not super familiar with this, so I might be totally off base, but it seems like this is a question of how we define what we log. Today we log the prediction as the raw model output, but perhaps raw model outputs are not directly actionable? What we actually want to know is "How good of a match is article A to topic B"; we don't care that a good match for topic B is 0.9 and a good match for topic C is 0.3. I don't know if it's possible or sensible, but munging the scores with thresholds such that 0.5 means the same thing for all possible topics would simplify downstream tasks, as they can treat all topics as equivalent.
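
One way to read that suggestion (purely illustrative; nothing like this exists in the pipeline today) is to rescale each raw score around its topic's threshold so that the threshold always lands at 0.5:

```
def normalize_score(raw_score, threshold):
    """Map a raw topic score to [0, 1] so the topic's threshold becomes 0.5.

    Scores below the threshold land in [0, 0.5), scores above it in (0.5, 1],
    so downstream consumers can treat 0.5 as "good match" for every topic.
    Purely illustrative; not how scores are currently produced or stored.
    """
    if raw_score < threshold:
        return 0.5 * raw_score / threshold
    return 0.5 + 0.5 * (raw_score - threshold) / (1 - threshold)

# With a threshold of 0.9 for topic B and 0.3 for topic C, both
# "just good enough" matches map to the same normalized value:
assert normalize_score(0.9, 0.9) == 0.5
assert normalize_score(0.3, 0.3) == 0.5
```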

Hm, I guess putting the threshold config in MediaWiki and applying via the search keyword definition is not really feasible, right? ES has a range search, but that would require a wholly different index structure with a huge number of fields (and then we'd have to commit to only using the ORES scores for filtering and not scoring); and match queries don't seem to provide that level of control.

Right, to use range search every possible topic would have to be its own numeric field in elasticsearch. The scores are instead stored as term frequencies, which are relatively cheap to access; when the results say 1-20 of 100,000, that means we looked up term frequencies for 100,000 documents. Not the end of the world, elasticsearch is pretty good at visiting the docs quickly. Visiting a bunch of docs that by definition will never be returned isn't ideal, but will primarily only be visible as additional query latency. Even searching for "the the the", which basically has to look up the frequencies for all ~6M docs, only takes 1-1.5s.
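
A rough sketch of the term-frequency idea mentioned above (the quantization factor and layout are illustrative, not the actual index format): the score is quantized to an integer and stored as the frequency of the topic's term, so reading it back means visiting the document's term frequencies rather than a dedicated numeric field.

```
def encode_topic_scores(scores, resolution=1000):
    """Quantize [0, 1] topic scores to integer term frequencies.

    Illustrative only; the real index layout may differ. A score of 0.734
    becomes a frequency of 734 for that topic's term, which elasticsearch
    can read back cheaply at scoring time.
    """
    return {topic: max(1, round(score * resolution))
            for topic, score in scores.items()
            if score > 0}

print(encode_topic_scores({"STEM.Physics": 0.734, "Culture.Sports": 0.05}))
# {'STEM.Physics': 734, 'Culture.Sports': 50}
```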

I would prefer, I suppose more so from an optimization standpoint, not to index topics on documents that will never be returned. Essentially, looking up all documents matching topic X is very cheap, but then we have to visit all of those documents to score them. If we know, for example, that we will only ever display results from topic X if the score is >0.9, then we shouldn't even index the topic on pages that score 0.5. I don't know enough about the actual distribution of predictions, but perhaps the distribution is such that this is a premature optimization that we can revisit later. Especially since any threshold changes will be much more tedious to apply if they are part of the indexing pipeline.

It does sound like, no matter what, the prediction >= 0.5 cutoff in the script that assembles the predictions into a format to ship to elasticsearch will need to be adjusted. Overall I'm still not sure; there seem to be many options but no clear winner.

Another concern I just realized with respect to thresholds will be updating the models. If a new articletopic model is released and topic A's threshold goes from 0.9 to 0.8, we will have an index containing scores mixed between old and new models, with no real way to distinguish which version of the model a prediction came from.

I don't fully grok the way thresholds are set, but it seems the ORES API can provide them. One option might be a script that runs before export, queries the ORES API, and generates a data file containing the thresholds to apply that week.

ORES has an API that takes a condition like "threshold with precision >= 50% and highest recall possible" and turns it into an actual threshold. But (at least in the case of RCFilters) those abstract threshold definitions are still going to vary by wiki, and might have to be changed sometimes when the model changes. And it seems like they will be different for each topic. So that's a lot of configuration.

But maybe the script could just use the logic shown in Aaron's gist; that seems like not much added complexity, and it would get rid of configuration entirely.
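
A minimal sketch of what that selection logic might look like in the export script, assuming per-candidate-threshold precision/recall statistics are available for each topic (the shape of the statistics below is invented for illustration; the real numbers would come from the model's test data, as in Aaron's gist or the ORES model_info API):

```
def select_threshold(threshold_stats, min_precision=0.7):
    """Pick the threshold with the highest recall at >= min_precision.

    threshold_stats is a list of dicts like
    {"threshold": 0.62, "precision": 0.74, "recall": 0.41}
    (shape assumed for illustration). Returns None if no candidate
    threshold reaches the target precision.
    """
    candidates = [s for s in threshold_stats
                  if s["precision"] >= min_precision]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s["recall"])["threshold"]

stats = [
    {"threshold": 0.3, "precision": 0.55, "recall": 0.90},
    {"threshold": 0.5, "precision": 0.72, "recall": 0.70},
    {"threshold": 0.7, "precision": 0.85, "recall": 0.45},
]
print(select_threshold(stats))  # 0.5: highest recall among candidates at >= 70% precision
```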

Today we log the prediction as the raw model output, but perhaps raw model outputs are not directly actionable? What we actually want to know is "How good of a match is article A to topic B"; we don't care that a good match for topic B is 0.9 and a good match for topic C is 0.3. I don't know if it's possible or sensible, but munging the scores with thresholds such that 0.5 means the same thing for all possible topics would simplify downstream tasks, as they can treat all topics as equivalent.

The question is, will the same configuration work equally well for all clients of ORES? The clients could specify the configuration, but 1) the threshold-precision-recall relationship is a lot harder for the average user to understand than the API response itself, and 2) that would complicate caching.

+1 to @Tgr. "Useful threshold" depends on what you are optimizing for.

I've added my python script and proposed thresholds for each of the 5 wikis to https://gist.github.com/halfak/630dc3fd811995c2a0260d43da462645/edit

@Halfak any thoughts on how to set thresholds for the interwiki-based cross-wiki scores? Should those just use enwiki thresholds?

That's a good question. If they are using the enwiki model -- even cross-wiki -- they should probably use enwiki thresholds.

Another concern I just realized with respect to thresholds will be updating the models. If a new articletopic model is released and topic A's threshold goes from 0.9 to 0.8, we will have an index containing scores mixed between old and new models, with no real way to distinguish which version of the model a prediction came from.

The ORES MediaWiki extension handles that by having a model version field in the table that stores predictions, and essentially discarding everything (or maybe only using as a fallback, I can't remember for sure) when the model version changes.

Change 571790 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/analytics@master] ores articletopic: per-topic thresholding

https://gerrit.wikimedia.org/r/571790

Change 571790 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] ores articletopic: per-topic thresholding

https://gerrit.wikimedia.org/r/571790

The per-topic thresholding is now deployed. I ran only the threshold selection and prediction extraction portion of last week's job to see how it would look. It will do a full run, where the predictions are also shipped to elasticsearch, on Sunday (Feb 23rd).

The selected thresholds can be seen in HDFS at /mnt/hdfs/wmf/data/discovery/ores/thresholds/articletopic/20200209.json. The extracted predictions for Feb 9 through 16 work out as:

wiki   | pages with predictions
arwiki | 33,573
cswiki | 8,038
enwiki | 378,872
kowiki | 10,824
viwiki | 11,521

Do these numbers match up with our expectations?

Enwiki has 750K content edits per week, and given that not all of those are to different articles, it seems plausible.