[Epic] Add drafttopic predictions to ElasticSearch index for the Draft namespace where available
Closed, ResolvedPublic
Authored By

	Halfak
	Apr 3 2020, 3:01 PM

Description

This is an expansion of work done in T240558: Update ORES articletopic data score in ElasticSearch when an article gets edited (which ended up being more specific to a Growth experiment) to also include the drafttopic predictions for the Draft namespace.

Background:

ORES supports two topic models.

articletopic - Trained and tested against full articles. Designed to be scored against the most recent version of a full article
drafttopic - Trained and tested against initial versions of articles. Designed to be scored against articles that are still early in their development (AKA drafts)

I think we'll want to enable the drafttopic model for all pages in the Draft namespace on English Wikipedia and any other wikis that have such a namespace.

We'll need to run new threshold optimizations for the drafttopic model to choose good thresholds. The results are slightly different from those for the articletopic model. @Halfak has a script for that.
We'll need to gather predictions from HDFS. ORES already produces drafttopic predictions for changes to pages in the Draft namespace and for the first edit to pages in the article namespace.

Details

	Subject	Repo	Branch	Lines +/-
	convert_to_esbulk: Implement multilist handler from super_detect_noop	wikimedia/discovery/analytics	master	+259 -10


Related Objects

Status	Assigned	Task
Resolved	Gehel	T249341 [Epic] Add drafttopic predictions to ElasticSearch index for the Draft namespace where available
Resolved	EBernhardson	T250237 super-detect-noop: Support recognizing and updating subsets within an array
Resolved	EBernhardson	T250238 convert_to_esbulk: Ship cirrussearch updates to non-content indices
Resolved	dcausse	T268272 Implement CirrusSearch keyword for drafttopic

Event Timeline

Halfak created this task. Apr 3 2020, 3:01 PM

Restricted Application added a subscriber: Aklapper. Apr 3 2020, 3:01 PM

dcausse edited projects, added Discovery-Search; removed Discovery-ARCHIVED. Apr 3 2020, 3:04 PM

Pinging @EBernhardson.
The mediawiki_revision_score schema does include the page namespace. My understanding of what needs to be done in the search data pipeline is to keep the namespace around and stop hardcoding the "content" index in spark/convert_to_esbulk.py. The namespace -> index type mapping is available in the mediawiki config repository, which can be read (we already read dblists). The difficulty is that it's written in PHP, but we're executing Python here.

See P10884 for the output of my thresholds script.

Halfak moved this task from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board. Apr 6 2020, 4:55 PM

Do these predict the same set of classes, and should they be found with the same articletopic keyword? Or do we need to distinguish between the two models?

A few thoughts:

Assuming we need to distinguish between the two models, my first question is where the data goes in elasticsearch. Perhaps we should have called the field in elasticsearch ores_predictions and formatted the values slightly differently. Today we index STEM.Computing, but we could instead index articletopic/STEM.Computing (or use some other combiner). This would allow the search keywords to prepend the model name where appropriate, and allow multiple models to be stored in the same field. It would require some custom support in our super-detect-noop plugin to handle taking a field that contains both drafttopic and articletopic predictions and updating the set of articletopic predictions while keeping the unrelated drafttopic predictions. Likely we can live with the odd situation of having the field called ores_articletopics, with articletopics unprefixed and any other models prefixed.
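To make the subset-update idea above concrete, here is a minimal sketch in Python. It is not the super-detect-noop implementation: the function name `merge_prefixed`, the `/` separator, and the choice to prefix every model (including articletopic) are assumptions made for illustration only.

```python
from typing import Iterable, List


def merge_prefixed(existing: Iterable[str], update: Iterable[str], sep: str = "/") -> List[str]:
    """Replace only the model groups named in `update`; keep all other groups.

    Hypothetical helper. Every value is assumed to look like "<model><sep><class>".
    """
    update = list(update)
    # Model names (prefixes) that this update speaks for.
    touched = {value.split(sep, 1)[0] for value in update}
    # Keep stored values that belong to models the update does not mention.
    kept = [value for value in existing if value.split(sep, 1)[0] not in touched]
    return sorted(set(kept) | set(update))


stored = ["articletopic/STEM.Computing", "drafttopic/Culture.Sports"]
incoming = ["articletopic/STEM.Physics"]
merged = merge_prefixed(stored, incoming)
# The articletopic group is replaced, the drafttopic group is untouched.
# The update is a noop when the merged set equals what is already stored.
is_noop = merged == sorted(set(stored))
```

Comparing the merged result with the stored values is one way a noop could be detected, so that an update carrying identical predictions for one model does not trigger a write.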

For the actual shipping of data, indeed this will require knowledge about which namespaces live in which indices. Starting with the PHP code seems difficult: we would need not only the cirrus config, but also to know which namespaces a wiki considers "content" namespaces. I wonder if we can source this from the API somehow? It seems we could do something similar to the thresholds, where a script fetches the appropriate live configuration to provide to the batch process.
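As a rough sketch of sourcing this from the API: the MediaWiki siteinfo query (`action=query&meta=siteinfo&siprop=namespaces`) marks content namespaces with a `content` flag. The helper below only parses an already-fetched response. The function name and the hand-written sample response are assumptions, and any Cirrus-specific namespace-to-index overrides are not covered.

```python
from typing import Any, Dict, Set


def content_namespaces(siteinfo: Dict[str, Any]) -> Set[int]:
    """Return the ids of namespaces flagged as content in a siteinfo response.

    Hypothetical helper. Accepts a boolean flag (formatversion=2 style) as well
    as a key that is present with an empty string (formatversion=1 style).
    """
    namespaces = siteinfo["query"]["namespaces"]
    return {
        int(ns["id"])
        for ns in namespaces.values()
        if ns.get("content", False) is not False
    }


# Hand-written sample, not real API output; 118 stands in for a Draft namespace.
sample = {
    "query": {
        "namespaces": {
            "0": {"id": 0, "content": True},
            "1": {"id": 1, "content": False},
            "118": {"id": 118, "content": False},
        }
    }
}
content_ids = content_namespaces(sample)
```

A script along these lines could run ahead of the batch job and hand the resulting set to it, much like the thresholds are provided.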

One additional problem with shipping data: so far we calculate and ship popularity scores for the non-content pages, but they get thrown away in the reindexing process. Essentially, if we fix the pipeline so it can ship to multiple namespaces, we will start triggering lots of updates that we were not triggering previously. When adding functionality to `convert_to_esbulk.py` to resolve namespaces into indices, we should probably filter out the non-content updates coming from popularity score.
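The filtering described above could look roughly like the following sketch. It is not the actual `convert_to_esbulk.py` code: the row layout, the `general` index name for non-content pages, and the `popularity_score` field name are assumptions chosen for illustration.

```python
from typing import Any, Dict, FrozenSet, Iterable, Iterator, Set

CONTENT = "content"
GENERAL = "general"  # assumed name for the non-content index


def route_updates(
    rows: Iterable[Dict[str, Any]],
    content_ns: Set[int],
    content_only_fields: FrozenSet[str] = frozenset({"popularity_score"}),
) -> Iterator[Dict[str, Any]]:
    """Resolve each row's namespace to an index and drop content-only fields
    from updates that would go to the non-content index."""
    for row in rows:
        target = CONTENT if row["namespace"] in content_ns else GENERAL
        fields = {
            name: value
            for name, value in row["fields"].items()
            if target == CONTENT or name not in content_only_fields
        }
        # Skip rows left with nothing to update, so they trigger no write.
        if fields:
            yield {"page_id": row["page_id"], "index": target, "fields": fields}


rows = [
    {"page_id": 1, "namespace": 0, "fields": {"popularity_score": 0.5}},
    {"page_id": 2, "namespace": 118, "fields": {"popularity_score": 0.1}},
    {"page_id": 3, "namespace": 118,
     "fields": {"ores_articletopics": ["drafttopic/STEM.Computing"], "popularity_score": 0.2}},
]
routed = list(route_updates(rows, content_ns={0}))
```

Here page 2 produces no update at all, while page 3 still ships its topic predictions to the non-content index without the popularity score.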

Overall this is a bit of work, but is all doable I think.

These predict the same set of classes. If adding another keyword is not a big deal, it would be nice to provide this as "drafttopic:<term>". It sounds like it might be a pain but it could also future proof us as we add other models. I might be back soon with a request for "articlequality:<class>" and even "viewrate:<range>" if we don't already have something that does that. :D

CBogen added a project: Epic. Aug 6 2020, 7:26 PM

CBogen moved this task from ML & Data Pipeline to [epic] on the Discovery-Search board. Aug 6 2020, 7:48 PM

CBogen moved this task from [epic] to Current work on the Discovery-Search board.

CBogen edited projects, added Discovery-Search (Current work); removed Discovery-Search.

CBogen moved this task from Incoming to Epics on the Discovery-Search (Current work) board.

Gehel renamed this task from Add drafttopic predictions to ElasticSearch index for the Draft namespace where available to [Epic] Add drafttopic predictions to ElasticSearch index for the Draft namespace where available. Oct 6 2020, 2:20 PM

Change 613345 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[wikimedia/discovery/analytics@master] convert_to_esbulk: Implement multilist handler from super_detect_noop

https://gerrit.wikimedia.org/r/613345

gerritbot added a project: Patch-For-Review. Oct 16 2020, 5:24 PM

Change 613345 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] convert_to_esbulk: Implement multilist handler from super_detect_noop

https://gerrit.wikimedia.org/r/613345

EBernhardson mentioned this in rWDAN5731d0533969: convert_to_esbulk: Implement multilist handler from super_detect_noop. Oct 16 2020, 5:37 PM

Maintenance_bot removed a project: Patch-For-Review. Oct 16 2020, 6:10 PM

Gehel closed subtask T250238: convert_to_esbulk: Ship cirrussearch updates to non-content indices as Resolved. Nov 9 2020, 12:54 PM

Gehel closed subtask T250237: super-detect-noop: Support recognizing and updating subsets within an array as Resolved.

Apologies for the long delay; this has been idle for some time waiting for operational tasks to complete. The data updating portion of this had its first run today. Data looks to be appropriately available in the search clusters now, and imports will continue going forward. CirrusSearch still needs to be adjusted to expose these as a keyword before this can be marked complete.

ACraze moved this task from Backlog/Lift Wing to Backlog/Other on the Machine-Learning-Team board. Jan 20 2021, 12:59 AM

Gehel closed subtask T268272: Implement CirrusSearch keyword for drafttopic as Resolved. Mar 3 2021, 2:14 PM

Is this epic ready to be closed out?

I'm not aware of anything remaining to do; this can probably be closed.

Gehel closed this task as Resolved. Nov 4 2021, 2:56 PM

Gehel claimed this task.

dcausse mentioned this in T328276: Add outlink topic model predictions to CirrusSearch indices. Feb 3 2023, 4:29 PM
