
Regenerate data for prototype using updated ORES models
Closed, Resolved · Public

Description

Per discussion with @MMiller_WMF, we want to update the prototype for https://newcomertasks-ores-drafttopic.netlify.com/ (source: https://github.com/kostajh/newcomertasks-drafttopic) to use the latest models for ORES draft topic. For now this would involve (I think) just re-generating the data with the existing scripts, plus some modifications to the task type loading. Later, when the per-wiki models are available, we'd want to regenerate the dataset again for evaluating the local models.

Event Timeline

@MMiller_WMF I don't think updating the prototype would be a ton of work, but it's not zero time either. If we're willing to wait a few weeks (T240559), then we can evaluate the topic models using Special:Search with an about-topic: {topics} hastemplate: {list-of-templates} query. That has the advantage of querying in real time, whereas the prototype linked in the task description works from a pre-generated dataset that takes a couple of hours to compute per wiki.
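Roughly, that real-time check would be a plain full-text search API call; here is a sketch of the idea (the keyword name and the example values are placeholders, pending what T240559 actually ships):

```python
import requests

# Sketch only: the "about-topic" keyword name and the example values are
# placeholders, pending the search keyword that T240559 ships.
def search_topic_tasks(lang, topic, template, limit=20):
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": f"about-topic:{topic} hastemplate:{template}",
            "srlimit": limit,
            "format": "json",
        },
    )
    resp.raise_for_status()
    return [hit["title"] for hit in resp.json()["query"]["search"]]
```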

@kostajh -- I don't want us to wait until the work of loading the models into Search is done, because I want us to expose issues in the models as soon as we can, so that @Halfak can improve them and so that we can tune the model cutoffs and deploy them to users sooner.

Today I filed T244192: Newcomer tasks: ORES ontology mapping and score thresholds, which I believe is a prerequisite for the work here, because we'll be rolling up the new ontology to be more user-friendly. We also need to pick some initial cutoffs to try out, which @Halfak can help with. I will make decisions on the ontology tomorrow and see what headway I can make on the cutoffs.

@kostajh -- some additional notes for when we're ready to do this (still waiting for us to complete T244192). Feel free to incorporate these into the task description:

  • We'll need to do this for two sets of models:
    • The English models cross-walked to our target languages.
    • The native language models for our target languages.
  • We'll want to use articletopic, not drafttopic models.
  • Is it possible to simulate the steps that we use in production of first selecting a certain number of articles above a score threshold and then randomizing them to display in the prototype? I want to make sure we don't look at a misleadingly high-performing set of articles, which is what we struggled with when evaluating morelike over the API in T243035: Newcomer tasks: investigate discrepancy between topics in module and API.

Sounds good.

Is it possible to simulate the steps that we use in production of first selecting a certain number of articles above a score threshold and then randomizing them to display in the prototype?

Yep, can do.

Now that we have the new ontology mapping and are setting score thresholds via T244192, this is Ready for Development.

Hey @MMiller_WMF, I started on this task today (a preview with a small amount of data from cswiki is here: https://deploy-preview-1--newcomertasks-ores-drafttopic.netlify.com, and the code for it is here: https://github.com/kostajh/newcomertasks-drafttopic/pull/1), but before proceeding further with data generation, I wanted to check a few assumptions:

  1. When you click "cs" or "ar" or whatever language in the prototype, you should be looking at ORES-derived data from the native model for that wiki. In other words, none of the cross-wiki lookup stuff we were doing before.
  2. It seems the thresholds discussion is still not finalized; for now I am including the top 3 topic scores for a given article (rough sketch below), which is what the prototype did earlier. Let me know if I should loosen that.
  3. I am not attempting to build the UX that has subgroups mapping from the ORES ontology to how we want our end users to see and interact with topics, because it's not trivial and seems tangential to the main goal here of assessing whether the ORES topic scores are good enough for each wiki. Also, I'm not sure that UX built here would be reusable in GrowthExperiments, so I'm hesitant to do work that won't be used further. But let me know if you'd like it done.
  4. There's no interleaving of results when you select more than one topic; again, that doesn't seem like the most important thing (maybe I should make the topic selector a single-choice dropdown?), but let me know if you want that adjusted.

I'll aim to start generating the datasets later tonight.
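For reference, the top-3 selection from point 2 is roughly the following (illustrative only, not the exact code in the data-generation script):

```python
# Keep only the three highest-scoring topics for an article (illustrative,
# not the exact data-generation script).
def top_topics(topic_scores, n=3):
    """topic_scores: dict mapping an ORES topic name to its probability."""
    ranked = sorted(topic_scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])
```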

Thanks for the questions, @kostajh.

  1. We want to have two separate prototypes (or one prototype with a toggle). That's because we want to be able to test the cross-walk English model and test the native language models. Is that possible? Also, I want to make sure that you're using the new articletopic models, as opposed to the old drafttopic models.
  2. @Halfak posted his recommended thresholds here: T244297#5866701. Can you use those effectively? If not, please let him know on that task. The way we should use them is: if an article has a score for a given topic that is above the threshold, then that article "counts" for that topic when the user selects it. A given article's inclusion in a topic should not depend on its scores for any of the other topics (see the sketch after this comment).
  3. Thanks for calling this out. That's fine, and I'll let the ambassadors know.
  4. That's fine -- because we ask our ambassadors to test one topic at a time anyway.

I also want to make sure that we simulate the steps that we use in production of first selecting a certain number of articles above a score threshold and then randomizing them to display in the prototype. I want to make sure we don't look at a misleadingly high-performing set of articles, which is what we struggled with when evaluating morelike over the API in T243035.
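As a rough sketch, the selection we want to simulate would look something like this (pool size, display count, and how the pool is capped are placeholders; the real per-topic thresholds come from T244297):

```python
import random

# Rough sketch of the production-like selection described above; pool_size,
# shown, and how the pool is capped are assumptions, and the real per-topic
# thresholds come from T244297.
def articles_for_topic(articles, topic, thresholds, pool_size=200, shown=20):
    """An article counts for a topic iff its score for that topic alone clears
    that topic's threshold; its scores for other topics are irrelevant."""
    matching = [a for a in articles if a["topics"].get(topic, 0.0) >= thresholds[topic]]
    pool = matching[:pool_size]  # cap the candidate pool above the threshold
    random.shuffle(pool)         # then randomize what the prototype displays
    return pool[:shown]
```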

We want to have two separate prototypes (or one prototype with a toggle). That's because we want to be able to test the cross-walk English model and test the native language models. Is that possible?

Yes, it's possible. The caveat is that the prototype mode that uses the cross-walked English models will have far fewer tasks to choose from (IIRC, ~60% of what the native models would have). Does that matter for the comparison? Would you want the mode that shows native model data to include only articles that also have a corresponding cross-walk score? Or, put another way, should the prototype only show articles that exist on both enwiki and the local (cs/ar/vi/ko) wikis?

Also, I want to make sure that you're using the new articletopic models, as opposed to the old drafttopic models.

Yep, I'm using articletopic.

I also want to make sure that we simulate the steps that we use in production of first selecting a certain number of articles above a score threshold and then randomizing them to display in the prototype

Yes, I can do that.

@kostajh -- it is okay for the cross-walk version to have fewer articles. It's just that there have to be enough for the ambassadors to evaluate -- we're thinking at least 20 in each topic, simulating the articles that the actual module would be choosing, so that we get a good sense of the quality of topics that users will experience. The native mode should not be limited to articles that also exist in the cross-walk, because we're not going to be comparing the results article-by-article; we're going to be comparing them topic-by-topic. We want to look at how these models will perform, simulating as closely as we can the results they would be getting in production.

The native mode should not be limited to articles that also exist in the cross-walk, because we're not going to be comparing the results article-by-article; we're going to be comparing them topic-by-topic. We want to look at how these models will perform, simulating as closely as we can the results they would be getting in production.

OK, sounds good. Just spot-checking some data generated by the script: it is interesting to compare the local vs. enwiki ORES model topic scores, but agreed that it's not as useful as the overall topic-by-topic comparison.

@MMiller_WMF I'm nearly done with data generation for cswiki, and I'll have to work on producing the other wikis' data later today / tonight / this weekend. It takes a while to churn through everything. I've pushed the latest code and data (cswiki only, with most of the data in) here: https://deploy-preview-1--newcomertasks-ores-drafttopic.netlify.com (@Tgr and @Catrope, the code is here: https://github.com/kostajh/newcomertasks-drafttopic/pull/1).

@MMiller_WMF please have a look and if it looks good, you could probably pass it on to Martin to review, as there should be plenty of tasks in there to start the evaluation process. I'll post a comment here when I've pushed the full dataset for cswiki and then again when I have the other wikis done.

cswiki is close to complete but not quite done. I've pushed what I have. I'll keep processing the data for the other wikis over the weekend, so it should be ready on Monday.

cswiki data export is done and can be reviewed at https://deploy-preview-1--newcomertasks-ores-drafttopic.netlify.com

I'm working on the other wikis now.

The prototype now has kowiki and viwiki data. Processing arwiki now.

A partial export of arwiki is ready (15,000 tasks); potentially enough that the evaluation could begin?

@kostajh -- thanks for putting in extra time to work on this. The prototypes look good to me. I created T245368: Newcomer tasks: evaluate new ORES topic models so that the ambassadors can get started with evaluating the models.

I pushed another batch of data so there should be ~18k results for arwiki.

@MMiller_WMF I belatedly realized that when using articletopic with the crosswiki approach (getting the articletopic data from enwiki), I'm using the local per-wiki threshold cutoffs rather than the enwiki thresholds. So, when evaluating topics for the crosswiki case, the topics may be more or less accurate than if I used the enwiki-defined thresholds. I can adjust the code and re-run data processing for that use case, but it will take a while. Let me know.

When the prototype is set to use local, per-wiki articletopic (the default in the UI), the correct per-wiki thresholds are used.
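The code change itself would be small; roughly, pick the threshold set based on which model produced the scores (names here are illustrative, not the actual script):

```python
# Illustrative only: choose the threshold set matching the model that scored
# the article, not the wiki the prototype is displaying.
def thresholds_for(scoring_wiki, local_thresholds, enwiki_thresholds):
    """scoring_wiki is "enwiki" when the articletopic scores came from the
    cross-walked enwiki model, otherwise the local wiki (cs/ar/vi/ko)."""
    return enwiki_thresholds if scoring_wiki == "enwiki" else local_thresholds
```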

@kostajh -- how different do those thresholds look, just eyeballing the numbers? If fixing them can be an easy background task, then I think it's worth it, but not if it will take a lot of person time.

From a very quick look (https://github.com/kostajh/newcomertasks-drafttopic/pull/1/files?file-filters%5B%5D=.json#diff-92840933c52fc8843d93e215054a7d84) it's not vastly different, but it's hard to say. It would take a little time to update the code to use the enwiki thresholds; mostly, it would take another long run (roughly 24 hours) to regenerate the data for the crosswiki use case.

@MMiller_WMF I re-exported the data, so when using the prototype with "Only get tasks with topics returned from enwiki ORES, ignore local wiki ORES models" selected, you will only see topics that pass the enwiki threshold (rather than the cs/ar/vi/ko wiki thresholds).

Moving to PM Review just to indicate that nothing out of the ordinary was noticed in how the prototype works.
It's interesting that when a set of articles is really limited (1-5), the two options "Only return article when topic is top-ranked match from ORES" and "Only get tasks with topics returned from enwiki ORES, ignore local wiki ORES models" return articles with quite different ranges of scores.

Do we really need articles that scored as low as 0.4-0.5? I checked a couple of examples, and they do not seem relevant at all.
Example:
Selected wiki: arwiki
Topic: Culture.Media.Radio
Article: كأس السويد 2010 (enwiki: 2010 Svenska Cupen)

{ "id": "138333", "page_title": "كأس السويد 2010", "topic": "{\"Culture.Sports\":0.992882291390404,\"Geography.Regions.Europe.Northern Europe\":0.9637957273434438,\"Geography.Regions.Europe.Europe*\":0.9028177724103539,\ "Culture.Media.Radio\":0.4137321480627542}", "template": "مصدر", "enwiki_title": "2010 Svenska Cupen", "is_foreignwiki": "1", "rev_id": "934014582", "wikibase_id": null, "lang": "ar" }

But, of course, it'd be great to see the results of the evaluation in T245368.

@Etonkovidova's particular example is fine, because although the English score for "Radio" is weirdly high, the local language score is appropriately low. I am resolving this task because the ambassadors are productively using the prototype to test the model scores.