
Newcomer tasks: set initial thresholds for ORES articletopic
Closed, Resolved · Public

Description

To help us pursue T244192: Newcomer tasks: ORES ontology mapping and score thresholds, @Halfak is going to give us recommended thresholds for each topic model in each language that he expects will give us about 70% precision or better.

He'll also put together some instructions for how we can determine and adjust such thresholds in the future.

@Halfak -- if you get the sense that there are lots of articles that can be captured at 70% precision, then maybe you can recommend thresholds that will get us higher than 70% but still with sufficient recall to supply plenty of articles. Or we can try that and adjust from there.

Event Timeline

Here's a gist that I put together with my initial explorations and discussions of choosing thresholds: https://gist.github.com/halfak/630dc3fd811995c2a0260d43da462645

I'm working on generating some output that will help us choose thresholds.

How do we think these thresholds should be applied? It sounds like we need to inject them prior to the indexing pipeline? Or is this something that should be handled much earlier in the pipeline?

If there are a lot of articles, precision won't really matter, since the articles will be sorted by score and cut off at some point for pragmatic reasons (currently at the top 250), so the real precision will be higher anyway. What we are really talking about here is a trade-off between having fewer than 250 articles, or having 250 articles but less relevant ones.

What are the thresholds for, exactly? The articletopic: search keyword in general, or just the suggested edits interface specifically?

How do we think these thresholds should be applied? It sounds like we need to inject them prior to the indexing pipeline? Or is this something that should be handled much earlier in the pipeline?

For RCFilters, which has a similar threshold system (for mapping damaging / goodfaith ORES scores to recentchanges labels like "maybe damaging" and "very likely damaging"), the threshold definitions (things like "precision >= 15%") live in MediaWiki configuration, and are turned into threshold values dynamically via the ORES API. It's not very pleasant but it's an option.

They could also live in the script that loads data from Hadoop to ES (and currently uses a cutoff of 0.5 for discarding low scores). That would reduce ES space usage, but seems like an even more unpleasant location to manage such config, especially since it will be different for each wiki (or will it? thresholds, for sure, but threshold definitions?)

Or they could be provided by the ORES API, and handled as part of the data that goes through the ORES -> ES pipeline. That seems conceptually wrong, though, since these are thresholds specific to one application of the ORES scores, not the scores in general. (OTOH, ORES already uses a threshold to calculate the prediction field in the API response. Currently that just seems to be a constant 0.5, but maybe it would make sense to set it in a more meaningful way?)

Hm, I guess putting the threshold config in MediaWiki and applying via the search keyword definition is not really feasible, right? ES has a range search, but that would require a wholly different index structure with a huge number of fields (and then we'd have to commit to only using the ORES scores for filtering and not scoring); and match queries don't seem to provide that level of control.
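
For illustration only, a range-based approach would require something like the query sketched below (expressed as a Python dict; the per-topic numeric field name is hypothetical and does not exist in the current index), with one numeric field per topic:

```
# Hypothetical sketch of a range-filtered topic query.
# Each topic would need its own numeric field; the field name below is
# invented for illustration and is not part of the current index mapping.
range_query = {
    "query": {
        "bool": {
            "filter": [
                # Only return pages whose Physics score clears the threshold.
                {"range": {"ores_articletopic_physics": {"gte": 0.9}}}
            ],
            "must": [
                {"match": {"text": "quantum"}}
            ],
        }
    }
}
# This body would be POSTed to the index's _search endpoint.
```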

They could also live in the script that loads data from Hadoop to ES (and currently uses a cutoff of 0.5 for discarding low scores). That would reduce ES space usage, but seems like an even more unpleasant location to manage such config, especially since it will be different for each wiki (or will it? thresholds, for sure, but threshold definitions?)

This is where I was initially thinking it would fit (re-reading my previous comment, I realize it wasn't particularly clear). There is already a generic threshold option provided to the script; currently it applies >= 0.5 to everything, but that can be adjusted. It could also load a json/yaml/etc file containing some mapping of wiki/topic to threshold. But indeed, maintaining that list doesn't sound like a fun job. I don't fully grok the way thresholds are set, but it seems the ORES API can provide them. One option might be a script that runs before export, queries the ORES API, and generates a data file containing the thresholds to apply that week. I don't know if that actually makes sense though.
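
A minimal sketch of that pre-export idea, assuming we can obtain a threshold per wiki and per topic (the fetch_thresholds helper and its values here are hypothetical; it could wrap the ORES model_info API or reuse the logic from Aaron's gist) and write them to a JSON file the export script reads instead of its hard-coded 0.5 cutoff:

```
import json

def fetch_thresholds(wiki):
    """Return {topic: threshold} for one wiki.

    Placeholder: this could call the ORES model_info API or reuse the
    selection logic from Aaron's gist; hard-coded here for illustration.
    """
    return {"STEM.Physics": 0.73, "Culture.Sports": 0.61}

def build_threshold_file(wikis, path):
    # Build a {wiki: {topic: threshold}} mapping and dump it to disk for
    # the Hadoop -> elasticsearch export job to pick up.
    thresholds = {wiki: fetch_thresholds(wiki) for wiki in wikis}
    with open(path, "w") as f:
        json.dump(thresholds, f, indent=2, sort_keys=True)

if __name__ == "__main__":
    build_threshold_file(
        ["arwiki", "cswiki", "enwiki", "kowiki", "viwiki"],
        "articletopic_thresholds.json",
    )
```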

Or they could be provided by the ORES API, and handled as part of the data that goes through the ORES -> ES pipeline. That seems conceptually wrong, though, since these are thresholds specific to one application of the ORES scores, not the scores in general. (OTOH, ORES already uses a threshold to calculate the prediction field in the API response. Currently that just seems to be a constant 0.5, but maybe it would make sense to set it in a more meaningful way?)

I'm again not super familiar with this, so I might be totally off base, but it seems like this is a question of how we define what we log. Today we log the prediction as the raw model output, but perhaps raw model outputs are not directly actionable? What we actually want to know is "How good of a match is article A to topic B"; we don't care that a good match for topic B is 0.9 and a good match for topic C is 0.3. I don't know if it's possible or sensible, but munging the scores with thresholds such that 0.5 means the same thing for all possible topics would simplify downstream tasks, as they can treat all topics as equivalent.
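
One way to read that suggestion (purely illustrative; nothing like this exists in the pipeline today) is to rescale each raw score around its topic's threshold so that the threshold always lands at 0.5:

```
def normalize_score(raw_score, threshold):
    """Map a raw topic score to [0, 1] so the topic's threshold becomes 0.5.

    Scores below the threshold land in [0, 0.5), scores above it in (0.5, 1],
    so downstream consumers can treat 0.5 as "good match" for every topic.
    Purely illustrative; not how scores are currently produced or stored.
    """
    if raw_score < threshold:
        return 0.5 * raw_score / threshold
    return 0.5 + 0.5 * (raw_score - threshold) / (1 - threshold)

# With a threshold of 0.9 for topic B and 0.3 for topic C, both
# "just good enough" matches map to the same normalized value:
assert normalize_score(0.9, 0.9) == 0.5
assert normalize_score(0.3, 0.3) == 0.5
```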

Hm, I guess putting the threshold config in MediaWiki and applying via the search keyword definition is not really feasible, right? ES has a range search, but that would require a wholly different index structure with a huge number of fields (and then we'd have to commit to only using the ORES scores for filtering and not scoring); and match queries don't seem to provide that level of control.

Right, to use range search every possible topic would have to be its own numeric field in elasticsearch. The scores are instead stored as term frequencies, which are relatively cheap to access; when the results say 1-20 of 100,000, that means we looked up term frequencies for 100,000 documents. Not the end of the world, elasticsearch is pretty good at visiting the docs quickly. Visiting a bunch of docs that by definition will never be returned isn't ideal, but will primarily only be visible as additional query latency. Even searching for "the the the", which basically has to look up the frequencies for all ~6M docs, only takes 1-1.5s.
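
A rough sketch of the term-frequency idea mentioned above (the quantization factor and layout are illustrative, not the actual index format): the score is quantized to an integer and stored as the frequency of the topic's term, so reading it back means visiting the document's term frequencies rather than a dedicated numeric field.

```
def encode_topic_scores(scores, resolution=1000):
    """Quantize [0, 1] topic scores to integer term frequencies.

    Illustrative only; the real index layout may differ. A score of 0.734
    becomes a frequency of 734 for that topic's term, which elasticsearch
    can read back cheaply at scoring time.
    """
    return {topic: max(1, round(score * resolution))
            for topic, score in scores.items()
            if score > 0}

print(encode_topic_scores({"STEM.Physics": 0.734, "Culture.Sports": 0.05}))
# {'STEM.Physics': 734, 'Culture.Sports': 50}
```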

I would prefer, I suppose more so from an optimization standpoint, not to index topics on documents that will never be returned. Essentially, looking up all documents matching topic X is very cheap, but then we have to visit all of those documents to score them. If we know, for example, that we will only ever display results from topic X if the score is >0.9, then we shouldn't even index the topic on pages that score 0.5. I don't know enough about the actual distribution of predictions, but perhaps the distribution is such that this is a premature optimization that we can revisit later. Especially since any threshold changes will be much more tedious to apply if they are part of the indexing pipeline.

It does sound like, no matter what, the prediction >= 0.5 cutoff in the script that assembles the predictions into a format to ship to elasticsearch will need to be adjusted. Overall I'm still not sure; there seem to be many options but no clear winner.

Another concern I just realized with respect to thresholds will be updating the models. If a new articletopic model is released and topic A's threshold goes from 0.9 to 0.8, we will have an index containing scores mixed between old and new models, with no real way to distinguish which version of the model a prediction came from.

I don't fully grok the way thresholds are set, but it seems the ORES API can provide them. One option might be a script that runs before export, queries the ORES API, and generates a data file containing the thresholds to apply that week.

ORES has an API that takes a condition like "threshold with precision >= 50% and highest recall possible" and turns it into an actual threshold. But (at least in the case of RCFilters) those abstract threshold definitions are still going to vary by wiki, and might have to be changed sometimes when the model changes. And it seems like they will be different for each topic. So that's a lot of configuration.

But maybe the script could just use the logic shown in Aaron's gist; that seems like not much added complexity, and it would get rid of configuration entirely.
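
A minimal sketch of what that selection logic might look like in the export script, assuming per-candidate-threshold precision/recall statistics are available for each topic (the shape of the statistics below is invented for illustration; the real numbers would come from the model's test data, as in Aaron's gist or the ORES model_info API):

```
def select_threshold(threshold_stats, min_precision=0.7):
    """Pick the threshold with the highest recall at >= min_precision.

    threshold_stats is a list of dicts like
    {"threshold": 0.62, "precision": 0.74, "recall": 0.41}
    (shape assumed for illustration). Returns None if no candidate
    threshold reaches the target precision.
    """
    candidates = [s for s in threshold_stats
                  if s["precision"] >= min_precision]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s["recall"])["threshold"]

stats = [
    {"threshold": 0.3, "precision": 0.55, "recall": 0.90},
    {"threshold": 0.5, "precision": 0.72, "recall": 0.70},
    {"threshold": 0.7, "precision": 0.85, "recall": 0.45},
]
print(select_threshold(stats))  # 0.5: highest recall among candidates at >= 70% precision
```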

Today we log the prediction as the raw model output, but perhaps raw model outputs are not directly actionable? What we actually want to know is "How good of a match is article A to topic B"; we don't care that a good match for topic B is 0.9 and a good match for topic C is 0.3. I don't know if it's possible or sensible, but munging the scores with thresholds such that 0.5 means the same thing for all possible topics would simplify downstream tasks, as they can treat all topics as equivalent.

The question is, will the same configuration work equally well for all clients of ORES? The clients could specify the configuration, but 1) the threshold-precision-recall relationship is a lot harder for the average user to understand than the API response itself, and 2) that would complicate caching.

+1 to @Tgr. "Useful threshold" depends on what you are optimizing for.

I've added my python script and proposed thresholds for each of the 5 wikis to https://gist.github.com/halfak/630dc3fd811995c2a0260d43da462645/edit

@Halfak any thoughts on how to set thresholds for the interwiki-based cross-wiki scores? Should those just use enwiki thresholds?

That's a good question. If they are using the enwiki model -- even cross-wiki -- they should probably use enwiki thresholds.

Another concern I just realized with respect to thresholds will be updating the models. If a new articletopic model is released and topic A's threshold goes from 0.9 to 0.8, we will have an index containing scores mixed between old and new models, with no real way to distinguish which version of the model a prediction came from.

The ORES MediaWiki extension handles that by having a model version field in the table that stores predictions, and essentially discarding everything (or maybe only using as a fallback, I can't remember for sure) when the model version changes.

Change 571790 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/analytics@master] ores articletopic: per-topic thresholding

https://gerrit.wikimedia.org/r/571790

Change 571790 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] ores articletopic: per-topic thresholding

https://gerrit.wikimedia.org/r/571790

The per-topic thresholding is now deployed. I ran only the threshold selection and prediction extraction portion of last week's job to see how it would look. It will do a full run, where the predictions are also shipped to elasticsearch, on Sunday (Feb 23rd).

The selected thresholds can be seen in HDFS at /mnt/hdfs/wmf/data/discovery/ores/thresholds/articletopic/20200209.json. The extracted predictions for Feb 9 through 16 work out as:

wiki   | pages with predictions
arwiki | 33,573
cswiki | 8,038
enwiki | 378,872
kowiki | 10,824
viwiki | 11,521

Do these numbers match up with our expectations?

Enwiki has 750K content edits per week, and given that not all of those are to different articles, it seems plausible.