
[EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics)
Open, Needs Triage · Public

Description

It should be possible to filter articles by topic when using normal wiki search, based on ORES articletopic (link needs to be updated) scores. The immediate use case for this is Newcomer Tasks 1.1, which involves filtering tasks by topic, but it seems like a widely useful capability in general, both for readers and for tools (especially considering the Product plans about neighborhoods).

The high-level plan is to push ORES articletopic scores into ElasticSearch on two (+1) channels:

  • Have ORES calculate the new score on each edit (changeprop already supports this), push it (via EventGate) to Kafka, and collate it into HDFS; use it in the weekly bulk update of the search index.
  • To keep the scores more up to date, fetch and apply the ORES score after each edit in the index update MediaWiki job. This might be beyond the capacity of ORES, so we might want to limit it to some manageable subset (like newly created pages, or wikis where Newcomer Tasks is enabled) and hope that a one-week delay is not too much of a usability problem for the rest (presumably most edits don't affect the topic scores much).
  • To ensure there's data about every article in HDFS, even articles that have never been edited since this functionality was deployed, run a one-time job that goes through all pages on all wikis and pushes their scores to Kafka.

In ElasticSearch the scores would then be stored in a poor man's sparse vector field (real sparse vectors are in the non-free part of ElasticSearch), with topic names and scores represented as document words and word frequencies, so that topics can be queried via tf-idf. This search functionality would be exposed via a search keyword such as topic:.
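
For illustration, the "topic name as word, score as word frequency" encoding could look like the following sketch. The threshold, scale factor, and the `topic|freq` string convention here are assumptions for illustration, not the actual CirrusSearch format:

```python
def scores_to_terms(scores, threshold=0.5, scale=1000):
    """Encode ORES articletopic scores as weighted search terms.

    Each topic above the confidence threshold becomes one term; the
    probability is scaled to an integer standing in for a term frequency,
    so tf-idf style ranking can prefer high-confidence topics.
    """
    return [
        "%s|%d" % (topic, int(prob * scale))
        for topic, prob in sorted(scores.items())
        if prob >= threshold
    ]

print(scores_to_terms({"History_and_Society.Politics": 0.87,
                       "STEM.Biology": 0.12}))
# → ['History_and_Society.Politics|870']
```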

Currently ORES can only score English Wikipedia. That's planned to be fixed soonish, but for the interim period we will just fake scores for other wikis using the score from the English interwiki article. (Details to be specified.)

More specifically, the concrete steps to implement the feature are:

1. Configure ORES to publish new articletopic scores to a Kafka topic when notified by changeprop about a new revision. (T240549; owner: Scoring)
2. Configure the new ElasticSearch field. (T240550; owner: Search)
3. Configure EventGate to consume the Kafka queue and store the data on HDFS (and merge with existing data by title, or maybe page ID). (T240553; owner: no-op?)
3.5. Copy English Wikipedia articletopic scores to other wikis. (T241015; owner: ?)
4. Go through all Wikipedia articles one time, score them, and push the scores to Kafka. (T243357; owner: Growth)
5. Configure the weekly ElasticSearch bulk update job to pull the data from HDFS. (T240556; owner: Search)
6. Make the ORES extension hook into CirrusSearchAddQueryFeatures and provide the ES logic for the topic: keyword. (Needs discussion; there is probably more than one way to implement this.) (T240559; owner: Growth, with support from Search)
7. Make some extension (ORES? GrowthExperiments?) hook into ContentHandler::getDataForSearchIndex (?), fetch the scores from the ORES service, and add them to the ES document. (T240558; owner: Growth, with support from Search)
8. Implement articletopic scoring for all wikis. (T235181 (?); owner: Scoring)
9. Repeat step 4, now with the local articletopic scores for all wikis. (Owner: Growth)
10. (optional) Integrate with the AdvancedSearch extension. (T245905; owner: ?)
11. (optional) Integrate with recent changes. (T245906; owner: ?)
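
For step 6, the topic: keyword would translate roughly into a match query against the new field. A minimal sketch, where the field name ores_articletopics is an assumption rather than the actual CirrusSearch mapping:

```python
def topic_keyword_query(topic, field="ores_articletopics"):
    """Build an ElasticSearch query body for a search like "topic:history".

    A plain match query scores documents by tf-idf, so articles whose
    encoded topic "frequency" (i.e. ORES confidence) is higher rank first.
    """
    return {"query": {"match": {field: topic}}}

print(topic_keyword_query("History_and_Society.Politics"))
```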

The MVP version is steps 1-6, maybe without 4 (or with a limited version of 4 that only covers Newcomer Tasks wikis). The provisional target date for that is mid-January, when Growth plans to deploy topic filtering in Newcomer Tasks. (It's not strictly a blocker for that, though; we'll have a lower-quality alternative search logic via T240512: Newcomer tasks: Morelike backend for topic matching.)

Related Objects


Event Timeline

Tgr created this task. Dec 11 2019, 11:44 PM
Restricted Application added a subscriber: Aklapper. Dec 11 2019, 11:44 PM
Tgr updated the task description. Dec 11 2019, 11:52 PM
Tgr updated the task description. Dec 11 2019, 11:54 PM

One thing we haven't really discussed is how the fake non-English drafttopic will work. Would that be done within ORES, or in the ES bulk update job + the MediaWiki index update job?

Tgr updated the task description. Dec 12 2019, 12:58 AM
EBernhardson updated the task description. Dec 12 2019, 2:08 AM
Tgr updated the task description. Dec 12 2019, 10:45 AM
Gehel added a subscriber: Gehel. Dec 13 2019, 8:30 AM

> One thing we haven't really discussed is how the fake non-English drafttopic will work. Would that be done within ORES, or in the ES bulk update job + the MediaWiki index update job?

We had some discussion in the Search team and we think it makes more sense to handle those "fake drafttopics" outside of the CirrusSearch pipeline. From a search point of view, how those topics are generated (faked, real, manually tagged, magic, ...) should be transparent. And it is likely to evolve in the future.

Tgr added a subscriber: Halfak. Dec 15 2019, 11:38 PM

@Halfak any thoughts on how to handle the "fake" drafttopic scores? I agree that conceptually it would be nicest to implement them in ORES, so it would serve drafttopic scores for all wikis, and whether those are calculated via enwiki or the local wikiproject or category system could be an implementation detail. That would be about 2x the load (number of content edits) and 10x the storage (number of content pages) compared to English Wikipedia only; is that feasible?

We won't be able to re-engineer ORES to serve predictions based on sitelinks across wikis in a reasonable amount of time. ORES is really not designed to serve predictions for one wiki via another wiki. We have no mechanism to relate a page on one wiki to a Wikidata item, and then to a page on another wiki. It seems to me that the most logical place to make connections between enwiki and other wikis is in Hadoop.

BTW, it's weird to call this "fake". We actually do have "fake" predictions in ORES that people use to test out ORES-powered tools in beta and in vagrant. This is more of an "enwiki-proxy" prediction. Nothing fake about it.

Tgr updated the task description. Dec 16 2019, 10:33 PM
Tgr added a comment. Dec 16 2019, 11:30 PM

I imagine in the long term you'd want that capability anyway, since using interwiki data can probably improve predictions even when they are not fully proxy-based, but I can see how that's a significant change in data flows.

The other option I can think of would be to do the cross-wiki lookup in the ES bulk update job and in the MediaWiki individual update jobs. For the MediaWiki jobs that should be pretty easy; for the Hadoop-based bulk update job it would require looking up scores in event.mediawiki_revision_score based on Wikidata QID (which is not recorded in that table currently). @Gehel any idea if that is workable? And would it require ORES to at least emit the QID in its response?

I don't think that ORES is the right place to do any joining of data. ORES output format doesn't allow us to append a QID to it.

The revision-score event contains a lot of fields that ORES does not maintain. E.g., here's an event for an edit to nlwiki:

id: [{"topic":"eqiad.mediawiki.revision-score","partition":0,"timestamp":1576594215297},{"topic":"codfw.mediawiki.revision-score","partition":0,"offset":-1}]
data: {
  "$schema":"/mediawiki/revision/score/2.0.0",
  "meta":{
    "stream":"mediawiki.revision-score",
    "uri":"https://nl.wikipedia.org/wiki/Actief_luisteren",
    "request_id":"XfjrJgpAADsAACCiDOwAAABY",
    "id":"8914d800-20dc-11ea-a175-43a758b0852a",
    "dt":"2019-12-17T14:50:15.296Z",
    "domain":"nl.wikipedia.org",
    "topic":"eqiad.mediawiki.revision-score",
    "partition":0,
    "offset":605287741
  },
  "database":"nlwiki",
  "page_id":808534,
  "page_title":"Actief_luisteren",
  "page_namespace":0,
  "page_is_redirect":false,
  "performer":{"user_text":"84.83.71.74","user_groups":["*"],"user_is_bot":false},
  "rev_id":55265238,
  "rev_parent_id":52428289,
  "rev_timestamp":"2019-12-17T14:50:14Z",
  "scores":{
    "damaging":{
      "model_name":"damaging",
      "model_version":"0.5.1",
      "prediction":["true"],
      "probability":{"false":0.38523596682321115,"true":0.6147640331767888}
    },
    "goodfaith":{
      "model_name":"goodfaith",
      "model_version":"0.5.1",
      "prediction":["true"],
      "probability":{"false":0.0922603340475846,"true":0.9077396659524154}
    }
  }
}

Maybe this event could also contain relevant Wikidata item IDs.
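
A consumer of these events would pull model predictions out of the scores map along these lines (a minimal sketch, not tied to any particular consumer; the field names follow the event example above):

```python
def extract_predictions(event):
    """Map each model name in a mediawiki/revision/score event to its
    predicted labels, e.g. {"damaging": ["true"], "goodfaith": ["true"]}."""
    return {name: score["prediction"]
            for name, score in event.get("scores", {}).items()}

# Trimmed-down version of the nlwiki event shown above.
event = {
    "database": "nlwiki",
    "rev_id": 55265238,
    "scores": {
        "damaging": {"prediction": ["true"],
                     "probability": {"false": 0.385, "true": 0.615}},
        "goodfaith": {"prediction": ["true"],
                      "probability": {"false": 0.092, "true": 0.908}},
    },
}
print(extract_predictions(event))
# → {'damaging': ['true'], 'goodfaith': ['true']}
```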

Tgr added a comment. Dec 17 2019, 6:45 PM

Oh, right, most of that data comes from changeprop, not ORES. That should be an easy place to inject QIDs then, since MediaWiki can pass the QID to changeprop with only an extra local DB lookup.

A drawback of that would be that connecting an article to Wikidata is not an edit, so it will not trigger change propagation and rescoring. When an article is created, then connected to Wikidata, and never edited after that, there's no way to get the QID via changeprop, and we'd miss the topic score for those pages (unless there is a way to handle interwiki connections fully within the ES bulk update job, based only on the page ID / title). But we can probably live with that: articles which do not have an English interwiki won't be scored either, and it doesn't make too much difference compared to that. (And all of this is only for the interim period until local drafttopic scoring is introduced for non-English wikis, anyway.)

EBernhardson added a comment (edited). Dec 17 2019, 11:23 PM

Off the top of my head, the drafttopic propagation could probably happen in Hadoop/Spark. The process would be roughly:

  • Loop over all known wikis and load the wikibase_item page_prop from the analytics MySQL replicas using Spark JDBC integration (I've never tested that, but it should work). The end result should be a table cached in Spark memory (or written out to HDFS) of the form <wikiid: str, page_id: int, wikibase_item: str>. Care must be taken to do this with responsible parallelism so as not to overwhelm the replicas. Appropriate indices are in place for queries of the form select pp_page as page_id, pp_value as wikibase_item from page_props where pp_propname = 'wikibase_item'.
  • Extract drafttopic data formatted for ElasticSearch usage from the event.mediawiki_revision_score Hive table for the appropriate time range. The end result should be a table in Spark of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>>.
  • Left join the draft topics against the wikibase_items on wikiid = enwiki and page_id = page_id. The end result should be of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>, wikibase_item: str>.
  • Left join the above against the wikibase_items again, this time on wikibase_item = wikibase_item and wikiid != enwiki. The end result should be a propagation of the drafttopics to all wikis by wikibase_item. After dropping columns that are no longer necessary, the end result should be of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>>.
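
The join steps above can be sketched in plain Python as a stand-in for the Spark version. The table shapes follow the comment; the function name and sample data are purely illustrative:

```python
def propagate_topics(wikibase_items, enwiki_topics):
    """Propagate enwiki drafttopic scores to other wikis via Wikidata items.

    wikibase_items: list of (wikiid, page_id, wikibase_item) rows
    enwiki_topics:  list of (page_id, ores_drafttopics) rows for enwiki
    Returns (wikiid, page_id, ores_drafttopics) rows for every non-enwiki
    page that shares a Wikidata item with a scored enwiki page.
    """
    # Join enwiki topics to wikibase_item via the enwiki page_id.
    enwiki_item = {pid: item for (wiki, pid, item) in wikibase_items
                   if wiki == "enwiki"}
    item_topics = {enwiki_item[pid]: topics
                   for (pid, topics) in enwiki_topics
                   if pid in enwiki_item}
    # Join back to every non-enwiki page via the shared wikibase_item.
    return [(wiki, pid, item_topics[item])
            for (wiki, pid, item) in wikibase_items
            if wiki != "enwiki" and item in item_topics]

items = [("enwiki", 1, "Q1"), ("nlwiki", 7, "Q1"), ("dewiki", 9, "Q2")]
print(propagate_topics(items, [(1, ["History_and_Society.Politics"])]))
# → [('nlwiki', 7, ['History_and_Society.Politics'])]
```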

I'd prefer not to put this inside the job that ships data to ElasticSearch, though. Today we have two jobs, popularity_score and transfer_to_cirrussearch. The first job calculates everything and emits a table of the form <wikiid: str, page_id: int, popularity_score: float>. In my mind, another job will be added that emits a table of the form <wikiid: str, page_id: int, ores_drafttopics: array<str>>. The transfer script will essentially outer join the provided datasets together, format rows as ElasticSearch documents, and ship them to prod. Today these are scheduled from the wikimedia/discovery/analytics repository using Oozie.

Tgr added a comment. Dec 18 2019, 12:39 AM

Thanks @EBernhardson! Filed as T241015: Copy English Wikipedia drafttopic scores to other wikis somewhere in the CirrusSearch pipeline and added to the list as step 3.5 (with the understanding that explaining how something could be done is not the same as committing to do it).

Tgr updated the task description. Jan 21 2020, 11:14 PM
MMiller_WMF renamed this task from "Allow searching articles by ORES drafttopic" to "[EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics)". Jan 22 2020, 5:40 AM
MMiller_WMF edited projects: added Epic, Growth-Team (Current Sprint); removed Growth-Team.

@Tgr -- I've split this out of T238608: [EPIC] Growth: Newcomer tasks 1.1.0 (topic matching), and made it into an epic in its own right. Let's try and make sure all the related tasks are children of this one.

kostajh updated the task description. Feb 6 2020, 11:19 AM
kostajh updated the task description.
Chtnnh added a subscriber: Chtnnh. Mar 4 2020, 2:40 PM

Change 577325 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Switch GrowthExperiments topic search to ORES

https://gerrit.wikimedia.org/r/577325

Change 577325 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch GrowthExperiments topic search to ORES

https://gerrit.wikimedia.org/r/577325

Mentioned in SAL (#wikimedia-operations) [2020-03-05T19:45:22Z] <tgr@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:577325|Switch GrowthExperiments topic search to ORES (T240517)]] (duration: 00m 58s)