
Scale: ORES topic models for uk, hu, hy, eu, sr (needed as soon as available)
Open, Needs Triage · Public

Description

The Growth team will be deploying newcomer tasks to five additional wikis in April 2020. When the feature is deployed, it will be using the English crosswalk ORES topic models, but we would like to be using local language models. We want to build and load local models for these wikis:

  • ukwiki
  • huwiki
  • hywiki
  • euwiki
  • srwiki

Once the models have been created and exposed in the ORES precache endpoint, the ores_bulk_ingest Spark job needs to be run for these wikis to load data for existing revisions.

Event Timeline

Restricted Application added subscribers: Petar.petkovic, Base. Apr 3 2020, 10:23 PM

@Halfak -- I just made this task for local topic models for additional wikis. Is this a good way to put in this request? How on-demand do you think your team can handle requests like this?

@EBernhardson @Tgr -- I suppose that more models means more loading to elasticsearch. Could you please update the task description with what will need to happen on that front? Is it relatively trivial to do at this point?

MMiller_WMF renamed this task from Scale: ORES topic models for uk, hu, hy, eu, sr to Scale: ORES topic models for uk, hu, hy, eu, sr (needed as soon as available). Apr 3 2020, 10:25 PM
MMiller_WMF updated the task description.
Tgr updated the task description. Apr 4 2020, 10:56 AM
Tgr added a comment. Apr 4 2020, 11:03 AM

Updated (@EBernhardson please correct me if I got it wrong). AIUI this is trivial (although not necessarily fast, with these wikis together comprising something like 12 million articles).

Tgr added a comment. Apr 4 2020, 11:08 AM

> although not necessarily fast, with these wikis together comprising something like 12 million articles

I confused it with T249383: Scale: ORES topic models for fr, pt, pl, fa, sv, da, it, id, he (needed for May 2020); these five wikis are more like 3 million articles. IIRC enwiki took a day for twice as many articles.

@Halfak -- thanks for looking into this and filing T249520: Fit more topic models into ORES. Can you speak directly to the models on this task, and whether they are possible within ORES's current infrastructure? Is it just the models from T249383: Scale: ORES topic models for fr, pt, pl, fa, sv, da, it, id, he (needed for May 2020) that need the expansion?

EBernhardson added a comment. Edited Apr 13 2020, 4:09 PM

There were roughly three parts to turning wikis on in Elasticsearch:

  • Reindexing was the longest part last time around. This does not need to be redone for all wikis. *IF* we want to purge the propagated topics before importing the new topics, we will have to reindex the individual wikis to be purged.
  • Full export/import of topics from ORES to pre-populate. If we want to generate predictions for all pages and update them, that is a manual process that will take a couple of days. Alternatively, we could let the existing propagated topics be replaced with the wiki-specific predictions as edits happen.
  • Thresholding values are generated for all wikis that ORES reports support for. As soon as ORES reports a wiki at https://ores.wikimedia.org/v3/scores, we will generate acceptance thresholds for the wiki. Propagation of enwiki scores to the wiki will stop as soon as thresholding has a value for the wiki. In other words, we only propagate to wikis that the model doesn't know about.

TL;DR: When the wiki is reported as supported by the ORES v3 scores API, we will automagically stop propagating enwiki results to that wiki and instead only send scores that were generated specifically for that wiki. The only manual work will be if we need to purge the old enwiki-propagated predictions and re-populate with wiki-specific predictions.
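
The switch-over rule described above can be sketched roughly as follows. This is only an illustration, not the actual search pipeline code: the function name `should_propagate_enwiki`, the shape of the `thresholds` mapping, and the threshold values are all assumptions made for the example.

```python
from typing import Dict, Mapping


def should_propagate_enwiki(wiki: str, thresholds: Mapping[str, Dict[str, float]]) -> bool:
    """Return True if enwiki scores should still be propagated to `wiki`.

    Hypothetical helper: `thresholds` maps a wiki name to its per-topic
    acceptance thresholds. Per the comment above, propagation stops as soon
    as thresholding has a value for the wiki.
    """
    return wiki not in thresholds


# Illustrative values only; real thresholds would be generated from ORES.
thresholds = {
    "enwiki": {"Culture.Sports": 0.5},
    "arwiki": {"Culture.Sports": 0.5},
}

for wiki in ("arwiki", "ukwiki"):
    if should_propagate_enwiki(wiki, thresholds):
        source = "enwiki (propagated)"
    else:
        source = wiki
    print(f"{wiki}: scores come from {source}")
```

Under these assumptions, a wiki only flips from propagated to local scores once it shows up in the thresholds, which mirrors the "only propagate to wikis that the model doesn't know about" behaviour.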

Thanks, @EBernhardson. I'm pretty sure that when we want to switch over to local language scores, we will want to re-populate all the articles with them (even if it takes time), instead of waiting to update them when the articles are edited.

Tgr added a comment. Apr 14 2020, 1:06 PM

Wouldn't the ORES import overwrite all existing scores without a reindex, anyway? (Assuming every page is classified into one topic at least.)

MMiller_WMF moved this task from Inbox to External on the Growth-Team board.

> Wouldn't the ORES import overwrite all existing scores without a reindex, anyway? (Assuming every page is classified into one topic at least.)

Indeed, purging is likely not necessary, although without it some propagated predictions may remain. Perhaps we could fix that by generating a fake category (Uncategorized?) when the prediction is empty after filtering; that might avoid the lingering predictions.
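
A minimal sketch of that idea, assuming predictions arrive as a topic-to-score mapping and are filtered against per-topic thresholds. The function `topics_to_index` and the constant `PLACEHOLDER_TOPIC` are hypothetical names for illustration, not part of ORES or the search pipeline.

```python
from typing import Dict, List

# Hypothetical placeholder name, taken from the "Uncategorized?" suggestion above.
PLACEHOLDER_TOPIC = "Uncategorized"


def topics_to_index(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Keep topics whose score meets its threshold; never return an empty list."""
    kept = sorted(
        topic
        for topic, score in scores.items()
        if topic in thresholds and score >= thresholds[topic]
    )
    # Assumption: an empty result would not overwrite the old propagated
    # topics, so emit a placeholder that replaces them instead.
    return kept or [PLACEHOLDER_TOPIC]


# Illustrative scores and thresholds only.
print(topics_to_index({"Culture.Sports": 0.9, "STEM.Physics": 0.1},
                      {"Culture.Sports": 0.5, "STEM.Physics": 0.5}))
print(topics_to_index({"STEM.Physics": 0.1}, {"STEM.Physics": 0.5}))
```

With a placeholder in place, every page would get at least one value written on import, so the overwrite would apply even to pages with no topic above threshold.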

Halfak added a comment. Jul 1 2020, 5:33 PM

See https://github.com/mediawiki-utilities/python-mwtext for building embeddings

This Makefile rule builds the 5 language embeddings we already have: https://github.com/mediawiki-utilities/python-mwtext/blob/master/Makefile#L27

Once we have embeddings for the languages, we'll want to implement the topic models in https://github.com/wikimedia/drafttopic

Halfak reassigned this task from Halfak to HAKSOAT. Jul 15 2020, 2:42 PM

We've managed to compress our vectors and reduce the memory footprint of ORES. That means we have space for these models and @HAKSOAT is going to start work.

Isaac added a subscriber: Isaac. Jul 17 2020, 5:18 PM
calbon closed this task as Resolved. Sep 23 2020, 4:16 PM
calbon moved this task from Active to Done on the Machine Learning Platform (Current) board.
MMiller_WMF added a subscriber: calbon. Edited Sep 24 2020, 8:46 PM

@calbon -- I see that you resolved this task. What does that mean for the status of the models for those five wikis? Do you know where they are in their journey to ElasticSearch? Do we need to notify the Search team to load them?

And also, do you know what we should expect for this task: T249383: Scale: ORES topic models for fr, pt, pl, fa, sv, da, it, id, he (needed for May 2020) ? Is that one also in progress? Or is there only bandwidth on your team for the first set?

EBernhardson reopened this task as Open. Sep 24 2020, 10:13 PM

This is specifically the articletopic model and corresponding search keyword. Checking the ORES APIs, it doesn't look like they've been deployed yet (arwiki included as an example). Once deployed, I'll need to kick off some tasks before closing.

$ for wiki in arwiki ukwiki huwiki hywiki euwiki srwiki; do curl "https://ores.wikimedia.org/v3/scores/$wiki?models=articletopic"; done

{
  "arwiki": {
    "models": {
      "articletopic": {
        "version": "1.2.0"
      }
    }
  }
}{
  "error": {
    "code": "not found",
    "message": "Models ('articletopic',) not available for ukwiki"
  }
}{
  "error": {
    "code": "not found",
    "message": "Models ('articletopic',) not available for huwiki"
  }
}{
  "error": {
    "code": "not found",
    "message": "No scorers available for hywiki"
  }
}{
  "error": {
    "code": "not found",
    "message": "Models ('articletopic',) not available for euwiki"
  }
}{
  "error": {
    "code": "not found",
    "message": "Models ('articletopic',) not available for srwiki"
  }
}
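
Because the loop prints the responses back to back with no separator, the output above is not a single valid JSON document. One way to summarize it, sketched here on the assumption that the response shapes are exactly those shown above, is to decode the objects one at a time with the standard library's `json.JSONDecoder.raw_decode`. The helper names `iter_json_objects` and `summarize` are made up for this example.

```python
import json
from typing import Dict, Iterator


def iter_json_objects(text: str) -> Iterator[dict]:
    """Yield each JSON object from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def summarize(text: str, model: str = "articletopic") -> Dict[str, str]:
    """Map each wiki to 'deployed <version>' or to the error message."""
    summary = {}
    for obj in iter_json_objects(text):
        if "error" in obj:
            message = obj["error"]["message"]
            # In the messages shown above, the wiki name is the last word.
            summary[message.rsplit(" ", 1)[-1]] = message
        else:
            for wiki, info in obj.items():
                summary[wiki] = "deployed " + info["models"][model]["version"]
    return summary


# Two of the responses from above, concatenated the way the loop prints them.
sample = (
    '{"arwiki": {"models": {"articletopic": {"version": "1.2.0"}}}}'
    '{"error": {"code": "not found", '
    '"message": "Models (\'articletopic\',) not available for ukwiki"}}'
)

for wiki, status in summarize(sample).items():
    print(f"{wiki}: {status}")
```

Relying on the last word of the error message is fragile; it works for the messages in this thread but would break if the API changed its wording.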

FWIW, I believe that @HAKSOAT built these models and that they are basically ready for deployment. The primary concern with doing that deployment was related to memory usage of the models. @HAKSOAT did a lot of work to ensure that the models would fit in memory. In fact, I expect our memory footprint to decrease with the deployment of these new models and their embeddings because they reduce the memory footprint per language by about 90%. In my last conversation with @calbon, he said he wanted to be cautious with new deployment while the team is in transition. I'm happy to make time to advise and support getting these models out the door. Feel free to reach out if/when you're ready to get a new deployment configuration together.

Yes. The models have been built and there was supposed to be a trial deployment. I think it was @Pavol86 that did the work of reducing memory footprint though.