
Copy English Wikipedia drafttopic scores to other wikis somewhere in the CirrusSearch pipeline
Closed, ResolvedPublic

Description

The plan for the initial rollout of T240517: [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics) is that only English Wikipedia will have its own drafttopic ORES model; other wikis will simply use the score from the associated enwiki article, when one exists. More local drafttopic models are planned later, but there will probably always be some use for riding on enwiki scores, at least for small wikis.

Probably the least awkward place for copying the scores is in Hadoop (see discussion in T240517). Per @EBernhardson in T240517#5749611, the rough plan for doing that would be:

  • Loop over all known wikis and load the wikibase_item page_prop from the analytics mysql replicas using spark jdbc integration (I've never tested that, but it should work). The end result should be a table cached in spark memory (or written out to hdfs) of the form <wikiid: str, page_id: int, wikibase_item: str>. Care must be taken to do this with responsible parallelism so as not to overwhelm the replicas. Appropriate indices are in place for queries of the form select pp_page as page_id from page_props where pp_propname = 'wikibase_item'.
  • Extract drafttopic data formatted for elasticsearch usage from the event.mediawiki_revision_score hive table for the appropriate time range. The end result should be a table in spark of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>>.
  • Left join the drafttopics against wikibase_items on wikiid=enwiki and page_id=page_id. The end result should be of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>, wikibase_item: str>.
  • Left join the above against wikibase_items again, this time on wikibase_item=wikibase_item and wikiid != enwiki. The end result should be a propagation of the drafttopics to all wikis by wikibase_item. After dropping columns that are no longer necessary, the end result should be of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>>.
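The join steps above can be sketched with plain Python rows standing in for the Spark tables. Table shapes and column names follow the plan; the sample rows and topic labels are purely illustrative:

```python
# Stand-ins for the two Spark tables described above:
#   wikibase_items: (wikiid, page_id, wikibase_item)
#   drafttopics:    (wikiid, page_id, ores_drafttopics)  -- enwiki only
wikibase_items = [
    ("enwiki", 1, "Q42"),
    ("enwiki", 2, "Q64"),
    ("frwiki", 10, "Q42"),
    ("dewiki", 20, "Q64"),
]
drafttopics = [
    ("enwiki", 1, ["STEM.Biology"]),
    ("enwiki", 2, ["Geography.Europe"]),
]

# Join enwiki drafttopics to their wikibase items on page_id,
# producing a topics-per-item lookup.
enwiki_items = {pid: item for (wiki, pid, item) in wikibase_items if wiki == "enwiki"}
topics_by_item = {
    enwiki_items[pid]: topics
    for (wiki, pid, topics) in drafttopics
    if wiki == "enwiki" and pid in enwiki_items
}

# Join back against wikibase_items to propagate the topics to every
# wiki's page via its wikibase_item, dropping the item column.
propagated = [
    (wiki, pid, topics_by_item[item])
    for (wiki, pid, item) in wikibase_items
    if item in topics_by_item
]
```

In real Spark these would be two DataFrame joins rather than dict lookups, but the keying (page_id for the first join, wikibase_item for the second) is the same.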

I'd prefer not to put this inside the job that ships data to elasticsearch, though. Today we have two jobs, popularity_score and transfer_to_cirrussearch. The first calculates everything and emits a table of the form <wikiid: str, page_id: int, popularity_score: float>. In my mind, another job will be added that emits a table of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>>. The transfer script will essentially outer join the provided datasets together, format the rows as elasticsearch documents, and ship them to prod. Today these are scheduled from the wikimedia/discovery/analytics repository using oozie.
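The transfer script's outer join could look roughly like the sketch below, keyed by (wikiid, page_id). This is a hypothetical illustration of the merge semantics, not the actual transfer code; the field names follow the task, the sample values do not:

```python
# Per-field tables emitted by the two upstream jobs, keyed by (wikiid, page_id).
popularity = {("enwiki", 1): 0.8, ("frwiki", 10): 0.3}
drafttopics = {("enwiki", 1): ["STEM.Biology"], ("dewiki", 20): ["Geography.Europe"]}

# Outer join: every page seen by either job gets a document; each field
# is present only when its source table has a row for that page.
docs = {}
for key in set(popularity) | set(drafttopics):
    wikiid, page_id = key
    doc = {"wikiid": wikiid, "page_id": page_id}
    if key in popularity:
        doc["popularity_score"] = popularity[key]
    if key in drafttopics:
        doc["ores_drafttopics"] = drafttopics[key]
    docs[key] = doc
```

The outer join matters because a page may have a popularity score but no drafttopic (or vice versa); neither field should gate the other's update.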

Event Timeline

Change 565402 had a related patch set uploaded (by Gergő Tisza; owner: EBernhardson):
[wikimedia/discovery/analytics@master] [wip] propagate ores predictions by wikibase item

https://gerrit.wikimedia.org/r/565402

Change 565402 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Propagate ores predictions by wikibase item

https://gerrit.wikimedia.org/r/565402