Scale: ORES topic models for uk, hu, hy, eu, sr (needed as soon as available)
Closed, Resolved (Public)

Description

The Growth team will deploy newcomer tasks to five additional wikis in April 2020. At deployment, the feature will use the English crosswalk ORES topic models, but we would like it to use local-language models. We want to build and load local models for these wikis:

  • ukwiki
  • huwiki
  • hywiki
  • euwiki
  • srwiki

Once the models have been created and exposed in the ORES precache endpoint, the ores_bulk_ingest Spark job needs to be run for these wikis to load data for existing revisions.

Event Timeline

@Halfak -- I just made this task for local topic models for additional wikis. Is this a good way to put in this request? How on-demand do you think your team can handle requests like this?

@EBernhardson @Tgr -- I suppose that more models means more loading to elasticsearch. Could you please update the task description with what will need to happen on that front? Is it relatively trivial to do at this point?

MMiller_WMF renamed this task from Scale: ORES topic models for uk, hu, hy, eu, sr to Scale: ORES topic models for uk, hu, hy, eu, sr (needed as soon as available). Apr 3 2020, 10:25 PM
MMiller_WMF updated the task description.

Updated (@EBernhardson please correct me if I got it wrong). AIUI this is trivial (although not necessarily fast, with these wikis together comprising something like 12 million articles).

although not necessarily fast, with these wikis together comprising something like 12 million articles

I confused it with T249383: Scale: ORES topic models for fr, pt, pl, fa, sv, da, it, id, he (needed for May 2020); the wikis on this task are more like 3 million articles. IIRC enwiki took a day for twice as many articles.
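As a rough back-of-the-envelope check of that comparison, here is a minimal sketch. It assumes import time scales linearly with article count, which the thread does not confirm, and it takes "twice as many" to mean roughly 6 million enwiki articles. The function name is hypothetical.

```python
def estimated_import_hours(articles, reference_articles, reference_hours):
    """Scale a known import duration linearly by article count (an assumption)."""
    return reference_hours * articles / reference_articles


# Reference point from the comment above: enwiki took about a day (24 hours)
# for about twice as many articles as the ~3 million on this task.
print(estimated_import_hours(3_000_000, 6_000_000, 24))  # 12.0
```

Under that assumption the five wikis would take on the order of half a day, though the real import may not scale linearly.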

@Halfak -- thanks for looking into this and filing T249520: Fit more topic models into ORES. Can you speak directly to the models on this task, and whether they are possible within ORES's current infrastructure? Is it just the models from T249383: Scale: ORES topic models for fr, pt, pl, fa, sv, da, it, id, he (needed for May 2020) that need the expansion?

There were roughly three parts to turning wikis on in elasticsearch:

  • Reindexing was the longest part last time around. This does not need to be redone for all wikis. *IF* we want to purge the propagated topics before importing the new topics we will have to reindex the individual wikis to be purged.
  • Full export/import of topics from ORES to pre-populate. If we want to generate predictions for all pages and update them, that is a manual process that will take a couple of days. Alternatively, we could let the existing propagated topics be replaced with the wiki-specific predictions as edits happen.
  • Thresholding values are generated for all wikis that ORES reports support for. As soon as ORES reports a wiki at https://ores.wikimedia.org/v3/scores we will generate acceptance thresholds for it. Propagation of enwiki scores to the wiki stops as soon as thresholding has a value for that wiki. In other words, we only propagate to wikis that the model doesn't know about.

TL;DR: Once the wiki is reported as supported by the ORES v3 scores API, we will automagically stop propagating enwiki results to that wiki and instead only send scores that were generated specifically for that wiki. The only manual work will be if we need to purge the old enwiki-propagated predictions and re-populate with wiki-specific predictions.
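The switch-over rule described above can be sketched roughly as follows. This is an illustration only: the function and parameter names are hypothetical and not taken from the actual search pipeline, and the set of supported wikis is assumed to have been read from the ORES v3 scores API beforehand.

```python
def scores_to_send(wiki, local_scores, enwiki_scores, supported_wikis):
    """Pick which topic scores go to the search index for `wiki`.

    Hypothetical helper; `supported_wikis` stands in for the set of wikis
    that ORES reports as having the model.
    """
    if wiki in supported_wikis:
        # ORES knows this wiki: send only wiki-specific scores.
        return local_scores
    # ORES does not know this wiki: fall back to propagated enwiki scores.
    return enwiki_scores
```

With this rule no per-wiki flag has to be flipped by hand. The behaviour changes as soon as the wiki shows up in the supported set.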

Thanks, @EBernhardson. I'm pretty sure that when we want to switch over to local language scores, we will want to re-populate all the articles with them (even if it takes time), instead of waiting to update them when the articles are edited.

Wouldn't the ORES import overwrite all existing scores without a reindex, anyway? (Assuming every page is classified into one topic at least.)

Indeed, purging is likely not necessary, although without it some propagated predictions may remain. Perhaps we could fix that by generating a fake category (Uncategorized?) when the prediction is empty after filtering. That might avoid the lingering predictions.
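A minimal sketch of that fallback idea, assuming predictions arrive as a topic-to-probability mapping and each topic has an acceptance threshold. The function name, the placeholder label and the data shapes are all hypothetical, not the pipeline's real interface.

```python
UNCATEGORIZED = "Uncategorized"  # hypothetical placeholder label


def topics_to_index(probabilities, thresholds):
    """Return the topics that pass their threshold, or a placeholder if none do.

    Writing a non-empty value for every page means an import always
    overwrites whatever propagated prediction was stored before.
    """
    kept = sorted(
        topic
        for topic, probability in probabilities.items()
        if topic in thresholds and probability >= thresholds[topic]
    )
    return kept or [UNCATEGORIZED]
```

The placeholder would then need to be ignored (or handled explicitly) on the search side, which is not covered here.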

See https://github.com/mediawiki-utilities/python-mwtext for building embeddings

This Makefile rule builds the 5 language embeddings we already have: https://github.com/mediawiki-utilities/python-mwtext/blob/master/Makefile#L27

Once we have embeddings for the languages, we'll want to implement the topic models in https://github.com/wikimedia/drafttopic

We've managed to compress our vectors and reduce the memory footprint of ORES. That means we have space for these models and @HAKSOAT is going to start work.

@calbon -- I see that you resolved this task. What does that mean for the status of the models for those five wikis? Do you know where they are in their journey to Elasticsearch? Do we need to notify the Search team to load them?

And also, do you know what we should expect for this task: T249383: Scale: ORES topic models for fr, pt, pl, fa, sv, da, it, id, he (needed for May 2020)? Is that one also in progress? Or is there only bandwidth on your team for the first set?

This is specifically the articletopic model and the corresponding search keyword. Checking the ORES APIs, it doesn't look like the models have been deployed yet (arwiki included as an example of a wiki that has one). Once they are deployed I'll need to kick off some tasks before closing.

$ for wiki in arwiki ukwiki huwiki hywiki euwiki srwiki; do curl "https://ores.wikimedia.org/v3/scores/$wiki?models=articletopic"; done

{
  "arwiki": {
    "models": {
      "articletopic": {
        "version": "1.2.0"
      }
    }
  }
}{
  "error": {
    "code": "not found",
    "message": "Models ('articletopic',) not available for ukwiki"
  }
}{
  "error": {
    "code": "not found",
    "message": "Models ('articletopic',) not available for huwiki"
  }
}{
  "error": {
    "code": "not found",
    "message": "No scorers available for hywiki"
  }
}{
  "error": {
    "code": "not found",
    "message": "Models ('articletopic',) not available for euwiki"
  }
}{
  "error": {
    "code": "not found",
    "message": "Models ('articletopic',) not available for srwiki"
  }
}
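For anyone wanting to script this check rather than eyeball the output, here is a small sketch that interprets one response body at a time. It assumes only the two response shapes shown above (a per-wiki "models" object, or a top-level "error" object). The helper name is hypothetical, and fetching the body is left out.

```python
import json


def has_model(response_text, wiki, model="articletopic"):
    """Return True if a single ORES v3 scores response lists `model` for `wiki`."""
    body = json.loads(response_text)
    if "error" in body:
        # Covers both "Models (...) not available" and "No scorers available".
        return False
    return model in body.get(wiki, {}).get("models", {})
```

Note that the loop above prints the responses back to back, so each body has to be parsed separately.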

FWIW, I believe that @HAKSOAT built these models and that they are basically ready for deployment. The primary concern with doing that deployment was related to memory usage of the models. @HAKSOAT did a lot of work to ensure that the models would fit in memory. In fact, I expect our memory footprint to decrease with the deployment of these new models and their embeddings because they reduce the memory footprint per language by about 90%. In my last conversation with @calbon, he said he wanted to be cautious with new deployment while the team is in transition. I'm happy to make time to advise and support getting these models out the door. Feel free to reach out if/when you're ready to get a new deployment configuration together.

Yes. The models have been built and there was supposed to be a trial deployment. I think it was @Pavol86 who did the work of reducing the memory footprint, though.