
[EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics)
Open, Needs Triage · Public

Description

It should be possible to filter articles by topic when using normal wiki search, based on ORES articletopic (link needs to be updated) scores. The immediate use case for this is Newcomer Tasks 1.1, which involves filtering tasks by topic, but it seems like a widely useful capability in general, both for readers and for tools (especially considering the Product plans about neighborhoods).

The high-level plan is to push ORES articletopic scores into ElasticSearch on two (+1) channels:

  • Have ORES calculate the new score on each edit (changeprop already supports this), push it (via EventGate) to Kafka, and collate it into HDFS; use it in the weekly bulk update of the search index.
  • To keep the scores more up to date, fetch and apply the ORES score after each edit in the index update MediaWiki job. This might be beyond the capacity of ORES, so we might want to limit it to some manageable subset (like newly created pages, or wikis where Newcomer Tasks is enabled) and hope that a one-week delay is not too much of a usability problem for the rest (presumably most edits don't affect the topic scores much).
  • To ensure there's data about every article in HDFS, even articles that have never been edited since this functionality was deployed, run a one-time job that goes through all pages on all wikis and pushes their scores to Kafka.

In ElasticSearch the scores would then be stored in a poor man's sparse vector field (real sparse vectors are in the non-free part of ElasticSearch), with topic names and scores represented as document words and word frequencies, so that topics can be queried via tf-idf. This search functionality would be exposed via a search keyword such as topic:.
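
For illustration, the "topic name as word, score as word frequency" encoding could look like the following sketch. The threshold, scale factor, and the `topic|freq` string convention here are assumptions for illustration, not the actual CirrusSearch format:

```python
def scores_to_terms(scores, threshold=0.5, scale=1000):
    """Encode ORES articletopic scores as weighted search terms.

    Each topic above the confidence threshold becomes one term; the
    probability is scaled to an integer standing in for a term frequency,
    so tf-idf style ranking can prefer high-confidence topics.
    """
    return [
        "%s|%d" % (topic, int(prob * scale))
        for topic, prob in sorted(scores.items())
        if prob >= threshold
    ]

print(scores_to_terms({"History_and_Society.Politics": 0.87,
                       "STEM.Biology": 0.12}))
# → ['History_and_Society.Politics|870']
```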

Currently ORES can only score English Wikipedia. That's planned to be fixed soonish, but for the interim period we will just fake scores for other wikis using the score from the English interwiki article. (Details to be specified.)

More specifically, the concrete steps to implement the feature are:

1. Configure ORES to publish new articletopic scores to a Kafka topic when notified by changeprop about a new revision. (T240549; owner: Scoring)
2. Configure the new ElasticSearch field. (T240550; owner: Search)
3. Configure EventGate to consume the Kafka queue and store the data on HDFS (and merge with existing data by title, or maybe page ID). (T240553; owner: no-op?)
3.5. Copy English Wikipedia articletopic scores to other wikis. (T241015; owner: ?)
4. Go through all Wikipedia articles one time, score them, and push the scores to Kafka. (T243357; owner: Growth)
5. Configure the weekly ElasticSearch bulk update job to pull the data from HDFS. (T240556; owner: Search)
6. Make the ORES extension hook into CirrusSearchAddQueryFeatures and provide the ES logic for the topic: keyword. (Needs discussion; there is probably more than one way to implement this.) (T240559; owner: Growth, with support from Search)
7. Make some extension (ORES? GrowthExperiments?) hook into ContentHandler::getDataForSearchIndex (?), fetch the scores from the ORES service, and add them to the ES document. (T240558; owner: Growth, with support from Search)
8. Implement articletopic scoring for all wikis. (T235181 (?); owner: Scoring)
9. Repeat step 4, now with the local articletopic scores for all wikis. (Owner: Growth)
10. (optional) Integrate with the AdvancedSearch extension. (T245905; owner: ?)
11. (optional) Integrate with recent changes. (T245906; owner: ?)
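
For step 6, the topic: keyword would translate roughly into a match query against the new field. A minimal sketch, where the field name ores_articletopics is an assumption rather than the actual CirrusSearch mapping:

```python
def topic_keyword_query(topic, field="ores_articletopics"):
    """Build an ElasticSearch query body for a search like "topic:history".

    A plain match query scores documents by tf-idf, so articles whose
    encoded topic "frequency" (i.e. ORES confidence) is higher rank first.
    """
    return {"query": {"match": {field: topic}}}

print(topic_keyword_query("History_and_Society.Politics"))
```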

The MVP version is steps 1-6, maybe without 4 (or with a limited version of 4 that only covers Newcomer Tasks wikis). The provisional target date for that is mid-January, when Growth plans to deploy topic filtering in Newcomer Tasks. (It's not strictly a blocker for that, though; we'll have a lower-quality alternative search logic via T240512: Newcomer tasks: Morelike backend for topic matching.)

Related Objects


Event Timeline

Tgr created this task. Dec 11 2019, 11:44 PM
Restricted Application added a subscriber: Aklapper. Dec 11 2019, 11:44 PM
Tgr updated the task description. Dec 11 2019, 11:52 PM
Tgr updated the task description. Dec 11 2019, 11:54 PM

One thing we haven't really discussed is how the fake non-English drafttopic will work. Would that be done within ORES, or in the ES bulk update job + the MediaWiki index update job?

Tgr updated the task description. Dec 12 2019, 12:58 AM
EBernhardson updated the task description. Dec 12 2019, 2:08 AM
Tgr updated the task description. Dec 12 2019, 10:45 AM
Gehel added a subscriber: Gehel. Dec 13 2019, 8:30 AM

> One thing we haven't really discussed is how the fake non-English drafttopic will work. Would that be done within ORES, or in the ES bulk update job + the MediaWiki index update job?

We had some discussion in the Search team and we think it makes more sense to handle those "fake drafttopics" outside of the CirrusSearch pipeline. From a search point of view, how those topics are generated (faked, real, manually tagged, magic, ...) should be transparent. And it is likely to evolve in the future.

Tgr added a subscriber: Halfak. Dec 15 2019, 11:38 PM

@Halfak any thoughts on how to handle the "fake" drafttopic scores? I agree that conceptually it would be nicest to implement them in ORES, so it would serve drafttopic scores for all wikis, and whether those are calculated via enwiki or the local wikiproject or category system could be an implementation detail. That would be about 2x the load (number of content edits) and 10x the storage (number of content pages) compared to English Wikipedia only; is that feasible?

We won't be able to re-engineer ORES to serve predictions based on sitelinks across wikis in a reasonable amount of time. ORES is really not designed to serve predictions for one wiki via another wiki. We have no mechanism to relate a page on one wiki to a Wikidata item, and then to a page on another wiki. It seems to me that the most logical place to make connections between enwiki and other wikis is in Hadoop.

BTW, it's weird to call this "fake". We actually do have "fake" predictions in ORES that people use to test out ORES-powered tools in beta and in vagrant. This is more of an "enwiki-proxy" prediction. Nothing fake about it.

Tgr updated the task description. Dec 16 2019, 10:33 PM
Tgr added a comment. Dec 16 2019, 11:30 PM

I imagine in the long term you'd want that capability anyway, since using interwiki data can probably improve predictions even when they are not fully proxy-based, but I can see how that's a significant change in data flows.

The other option I can think of would be to do the cross-wiki lookup in the ES bulk update job and in the MediaWiki individual update jobs. For the MediaWiki jobs that should be pretty easy; for the Hadoop-based bulk update job it would require looking up scores in event.mediawiki_revision_score based on Wikidata QID (which is not recorded in that table currently). @Gehel any idea if that is workable? And would it require ORES to at least emit the QID in its response?

I don't think that ORES is the right place to do any joining of data. ORES output format doesn't allow us to append a QID to it.

The revision-score event contains a lot of fields that ORES does not maintain. E.g., here's an event for an edit to nlwiki:

id: [{"topic":"eqiad.mediawiki.revision-score","partition":0,"timestamp":1576594215297},{"topic":"codfw.mediawiki.revision-score","partition":0,"offset":-1}]
data: {
  "$schema":"/mediawiki/revision/score/2.0.0",
  "meta":{
    "stream":"mediawiki.revision-score",
    "uri":"https://nl.wikipedia.org/wiki/Actief_luisteren",
    "request_id":"XfjrJgpAADsAACCiDOwAAABY",
    "id":"8914d800-20dc-11ea-a175-43a758b0852a",
    "dt":"2019-12-17T14:50:15.296Z",
    "domain":"nl.wikipedia.org",
    "topic":"eqiad.mediawiki.revision-score",
    "partition":0,
    "offset":605287741
  },
  "database":"nlwiki",
  "page_id":808534,
  "page_title":"Actief_luisteren",
  "page_namespace":0,
  "page_is_redirect":false,
  "performer":{"user_text":"84.83.71.74","user_groups":["*"],"user_is_bot":false},
  "rev_id":55265238,
  "rev_parent_id":52428289,
  "rev_timestamp":"2019-12-17T14:50:14Z",
  "scores":{
    "damaging":{
      "model_name":"damaging",
      "model_version":"0.5.1",
      "prediction":["true"],
      "probability":{"false":0.38523596682321115,"true":0.6147640331767888}
    },
    "goodfaith":{
      "model_name":"goodfaith",
      "model_version":"0.5.1",
      "prediction":["true"],
      "probability":{"false":0.0922603340475846,"true":0.9077396659524154}
    }
  }
}

Maybe this event could also contain relevant Wikidata item IDs.
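
A consumer of these events would pull model predictions out of the scores map along these lines (a minimal sketch, not tied to any particular consumer; the field names follow the event example above):

```python
def extract_predictions(event):
    """Map each model name in a mediawiki/revision/score event to its
    predicted labels, e.g. {"damaging": ["true"], "goodfaith": ["true"]}."""
    return {name: score["prediction"]
            for name, score in event.get("scores", {}).items()}

# Trimmed-down version of the nlwiki event shown above.
event = {
    "database": "nlwiki",
    "rev_id": 55265238,
    "scores": {
        "damaging": {"prediction": ["true"],
                     "probability": {"false": 0.385, "true": 0.615}},
        "goodfaith": {"prediction": ["true"],
                      "probability": {"false": 0.092, "true": 0.908}},
    },
}
print(extract_predictions(event))
# → {'damaging': ['true'], 'goodfaith': ['true']}
```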

Tgr added a comment. Dec 17 2019, 6:45 PM

Oh, right, most of that data comes from changeprop, not ORES. That should be an easy place to inject QIDs then, since MediaWiki can pass the QID to changeprop with only an extra local DB lookup.

A drawback of that would be that connecting an article to Wikidata is not an edit, so it will not trigger change propagation and rescoring. When an article is created, then connected to Wikidata, and never edited after that, there's no way to get the QID via changeprop, and we'd miss the topic score for those pages (unless there is a way to handle interwiki connections fully within the ES bulk update job, based only on the page ID / title). But we can probably live with that: articles which do not have an English interwiki won't be scored either, and it doesn't make too much difference compared to that. (And all of this is only for the interim period until local drafttopic scoring is introduced for non-English wikis, anyway.)

EBernhardson added a comment (edited). Dec 17 2019, 11:23 PM

Off the top of my head, the drafttopic propagation could probably happen in Hadoop/Spark. The process would be roughly:

  • Loop over all known wikis and load the wikibase_item page_prop from the analytics MySQL replicas using Spark JDBC integration (I've never tested that, but it should work). The end result should be a table cached in Spark memory (or written out to HDFS) of the form <wikiid: str, page_id: int, wikibase_item: str>. Care must be taken to do this with responsible parallelism so as not to overwhelm the replicas. Appropriate indices are in place for queries of the form select pp_page as page_id, pp_value as wikibase_item from page_props where pp_propname = 'wikibase_item'.
  • Extract drafttopic data formatted for ElasticSearch usage from the event.mediawiki_revision_score Hive table for the appropriate time range. The end result should be a table in Spark of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>>.
  • Left join the draft topics against the wikibase_items on wikiid = enwiki and page_id = page_id. The end result should be of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>, wikibase_item: str>.
  • Left join the above against the wikibase_items again, this time on wikibase_item = wikibase_item and wikiid != enwiki. The end result should be a propagation of the drafttopics to all wikis by wikibase_item. After dropping columns that are no longer necessary, the end result should be of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>>.
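
The join steps above can be sketched in plain Python as a stand-in for the Spark version. The table shapes follow the comment; the function name and sample data are purely illustrative:

```python
def propagate_topics(wikibase_items, enwiki_topics):
    """Propagate enwiki drafttopic scores to other wikis via Wikidata items.

    wikibase_items: list of (wikiid, page_id, wikibase_item) rows
    enwiki_topics:  list of (page_id, ores_drafttopics) rows for enwiki
    Returns (wikiid, page_id, ores_drafttopics) rows for every non-enwiki
    page that shares a Wikidata item with a scored enwiki page.
    """
    # Join enwiki topics to wikibase_item via the enwiki page_id.
    enwiki_item = {pid: item for (wiki, pid, item) in wikibase_items
                   if wiki == "enwiki"}
    item_topics = {enwiki_item[pid]: topics
                   for (pid, topics) in enwiki_topics
                   if pid in enwiki_item}
    # Join back to every non-enwiki page via the shared wikibase_item.
    return [(wiki, pid, item_topics[item])
            for (wiki, pid, item) in wikibase_items
            if wiki != "enwiki" and item in item_topics]

items = [("enwiki", 1, "Q1"), ("nlwiki", 7, "Q1"), ("dewiki", 9, "Q2")]
print(propagate_topics(items, [(1, ["History_and_Society.Politics"])]))
# → [('nlwiki', 7, ['History_and_Society.Politics'])]
```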

I'd prefer not to put this inside the job that ships data to ElasticSearch, though. Today we have two jobs, popularity_score and transfer_to_cirrussearch. The first job calculates everything and emits a table of the form <wikiid: str, page_id: int, popularity_score: float>. In my mind, another job will be added that emits a table of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>>. The transfer script will essentially outer join the provided datasets together, format rows as ElasticSearch documents, and ship them to prod. Today these are scheduled from the wikimedia/discovery/analytics repository using Oozie.

Tgr added a comment. Dec 18 2019, 12:39 AM

Thanks @EBernhardson! Filed as T241015: Copy English Wikipedia drafttopic scores to other wikis somewhere in the CirrusSearch pipeline and added to the list as step 3.5 (with the understanding that explaining how something could be done is not the same as committing to do it).

Tgr updated the task description. Jan 21 2020, 11:14 PM
MMiller_WMF renamed this task from "Allow searching articles by ORES drafttopic" to "[EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics)". Jan 22 2020, 5:40 AM
MMiller_WMF edited projects: added Epic, Growth-Team (Current Sprint); removed Growth-Team.

@Tgr -- I've split this out of T238608: [EPIC] Growth: Newcomer tasks 1.1.0 (topic matching), and made it into an epic in its own right. Let's try and make sure all the related tasks are children of this one.

kostajh updated the task description. Feb 6 2020, 11:19 AM
kostajh updated the task description.
Chtnnh added a subscriber: Chtnnh. Mar 4 2020, 2:40 PM

Change 577325 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Switch GrowthExperiments topic search to ORES

https://gerrit.wikimedia.org/r/577325

Change 577325 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch GrowthExperiments topic search to ORES

https://gerrit.wikimedia.org/r/577325

Mentioned in SAL (#wikimedia-operations) [2020-03-05T19:45:22Z] <tgr@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:577325|Switch GrowthExperiments topic search to ORES (T240517)]] (duration: 00m 58s)