
Scale: ORES topic models for uk, hu, hy, eu, sr (needed as soon as available)
Open, Needs Triage · Public

Description

The Growth team will be deploying newcomer tasks to five additional wikis in April 2020. When the feature is deployed, it will be using the English crosswalk ORES topic models, but we would like to be using local language models. We want to build and load local models for these wikis:

  • ukwiki
  • huwiki
  • hywiki
  • euwiki
  • srwiki

Once the models have been created and exposed in the ORES precache endpoint, the ores_bulk_ingest Spark job needs to be run for these wikis to load data for existing revisions.

Event Timeline

Restricted Application added subscribers: Petar.petkovic, Base. Apr 3 2020, 10:23 PM

@Halfak -- I just made this task for local topic models for additional wikis. Is this a good way to put in this request? How on-demand do you think your team can handle requests like this?

@EBernhardson @Tgr -- I suppose that more models means more loading to elasticsearch. Could you please update the task description with what will need to happen on that front? Is it relatively trivial to do at this point?

MMiller_WMF renamed this task from Scale: ORES topic models for uk, hu, hy, eu, sr to Scale: ORES topic models for uk, hu, hy, eu, sr (needed as soon as available). Apr 3 2020, 10:25 PM
MMiller_WMF updated the task description.
Tgr updated the task description. Apr 4 2020, 10:56 AM
Tgr added a comment. Apr 4 2020, 11:03 AM

Updated (@EBernhardson please correct me if I got it wrong). AIUI this is trivial (although not necessarily fast, with these wikis together comprising something like 12 million articles).

Tgr added a comment. Apr 4 2020, 11:08 AM

although not necessarily fast, with these wikis together comprising something like 12 million articles

I confused this task with T249383: Scale: ORES topic models for fr, pt, pl, fa, sv, da, it, id, he (needed for May 2020); the wikis here are more like 3 million articles. IIRC enwiki took a day for twice as many articles.

@Halfak -- thanks for looking into this and filing T249520: Fit more topic models into ORES. Can you speak directly to the models on this task, and whether they are possible within ORES's current infrastructure? Is it just the models from T249383: Scale: ORES topic models for fr, pt, pl, fa, sv, da, it, id, he (needed for May 2020) that need the expansion?

EBernhardson added a comment. Edited Apr 13 2020, 4:09 PM

There were roughly three parts to turning wikis on in elasticsearch:

  • Reindexing was the longest part last time around. It does not need to be redone for all wikis. *IF* we want to purge the propagated topics before importing the new topics, we will have to reindex each wiki being purged.
  • Full export/import of topics from ORES to pre-populate. Generating predictions for all pages and updating them is a manual process that will take a couple of days. Alternatively, we could let the existing propagated topics be replaced with the wiki-specific predictions as edits happen.
  • Thresholding values are generated for all wikis that ORES reports support for. As soon as ORES reports a wiki at https://ores.wikimedia.org/v3/scores we will generate acceptance thresholds for it. Propagation of enwiki scores to the wiki stops as soon as thresholding has a value for the wiki. In other words, we only propagate to wikis that the model doesn't know about.

TL;DR: When the wiki is reported as supported by the ORES v3 scores API, we will automagically stop propagating enwiki results to that wiki and instead only send scores that were generated specifically for that wiki. The only manual work will be if we need to purge the old enwiki-propagated predictions and re-populate with wiki-specific predictions.
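The switch-over rule described above can be sketched in a few lines of Python. This is only an illustration, not the actual pipeline code: the function names are hypothetical, the model name `articletopic` is an assumption, and the listing is assumed to be a mapping of wiki name to a `models` dictionary, roughly as the v3 scores endpoint reports it. The sample data is made up and no network call is made.

```python
from typing import Dict, Iterable, Set

# Assumed shape of the ORES listing: {wiki: {"models": {model_name: {...}}}}.
OresListing = Dict[str, Dict[str, dict]]


def wikis_with_local_model(listing: OresListing, model: str = "articletopic") -> Set[str]:
    """Return the wikis for which the listing reports the given model."""
    return {
        wiki
        for wiki, info in listing.items()
        if model in info.get("models", {})
    }


def should_propagate_enwiki(wiki: str, listing: OresListing, model: str = "articletopic") -> bool:
    """Propagate enwiki scores only to wikis the model does not know about."""
    return wiki not in wikis_with_local_model(listing, model)


def plan(wikis: Iterable[str], listing: OresListing, model: str = "articletopic") -> Dict[str, str]:
    """Summarise, per wiki, which source of topic scores would be used."""
    return {
        wiki: "propagate enwiki scores"
        if should_propagate_enwiki(wiki, listing, model)
        else "use local scores"
        for wiki in wikis
    }


# Hypothetical listing: huwiki already has a local topic model, ukwiki does not.
SAMPLE_LISTING: OresListing = {
    "enwiki": {"models": {"articletopic": {"version": "x"}}},
    "huwiki": {"models": {"articletopic": {"version": "x"}}},
    "ukwiki": {"models": {"damaging": {"version": "x"}}},
}

if __name__ == "__main__":
    for wiki, action in plan(["ukwiki", "huwiki", "hywiki"], SAMPLE_LISTING).items():
        print(wiki, "->", action)
```

Note that a wiki missing from the listing entirely (hywiki in the sample) is treated the same as one that is listed without the topic model: it keeps receiving propagated enwiki scores.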

Thanks, @EBernhardson. I'm pretty sure that when we want to switch over to local language scores, we will want to re-populate all the articles with them (even if it takes time), instead of waiting to update them when the articles are edited.

Tgr added a comment. Apr 14 2020, 1:06 PM

Wouldn't the ORES import overwrite all existing scores without a reindex, anyway? (Assuming every page is classified into one topic at least.)

MMiller_WMF moved this task from Inbox to External on the Growth-Team board.

Wouldn't the ORES import overwrite all existing scores without a reindex, anyway? (Assuming every page is classified into one topic at least.)

Indeed, purging is likely not necessary, although without it some propagated predictions may remain. Perhaps we could fix that by generating a fake category (Uncategorized?) when the prediction is empty after filtering; that might avoid the lingering predictions.
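A minimal sketch of the fake-category idea, assuming predictions arrive as a topic-to-probability mapping and thresholds as a topic-to-cutoff mapping. The function name, the topic labels, and the `Uncategorized` placeholder are all hypothetical; this is not how the import job is actually written.

```python
from typing import Dict, List

# Placeholder topic written when nothing survives filtering (hypothetical name).
UNCATEGORIZED = "Uncategorized"


def topics_to_index(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    fallback: str = UNCATEGORIZED,
) -> List[str]:
    """Keep topics whose score meets its threshold; never return an empty list.

    Topics without a known threshold are dropped. If nothing is left, the
    fallback topic is returned so the page still gets a non-empty value.
    """
    kept = sorted(
        topic
        for topic, probability in scores.items()
        if probability >= thresholds.get(topic, float("inf"))
    )
    return kept or [fallback]


if __name__ == "__main__":
    thresholds = {"Culture.Sports": 0.5, "STEM.Physics": 0.5}
    print(topics_to_index({"Culture.Sports": 0.9, "STEM.Physics": 0.1}, thresholds))
    print(topics_to_index({"STEM.Physics": 0.1}, thresholds))
```

The point of the fallback is that every page then has something to write, so the import would presumably overwrite an old propagated prediction rather than skip the page.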

Halfak added a comment. Jul 1 2020, 5:33 PM

See https://github.com/mediawiki-utilities/python-mwtext for building embeddings

This Makefile rule builds the 5 language embeddings we already have: https://github.com/mediawiki-utilities/python-mwtext/blob/master/Makefile#L27

Once we have embeddings for the languages, we'll want to implement the topic models in https://github.com/wikimedia/drafttopic

Halfak reassigned this task from Halfak to HAKSOAT. Wed, Jul 15, 2:42 PM

We've managed to compress our vectors and reduce the memory footprint of ORES. That means we have space for these models and @HAKSOAT is going to start work.

Isaac added a subscriber: Isaac. Fri, Jul 17, 5:18 PM