
[Epic] Add drafttopic predictions to ElasticSearch index for the Draft namespace where available
Closed, ResolvedPublic

Description

This is an expansion of work done in T240558: Update ORES articletopic data score in ElasticSearch when an article gets edited (which ended up being more specific to a Growth experiment) to also include the drafttopic predictions for the Draft namespace.

Background:

ORES supports two topic models.

  • articletopic - Trained and tested against full articles. Designed to be scored against the most recent version of a full article.
  • drafttopic - Trained and tested against initial versions of articles. Designed to be scored against articles that are still early in their development (AKA drafts).

I think we'll want to enable the drafttopic model for all pages in the Draft namespace on English Wikipedia and any other wikis that have such a namespace.

  • We'll need to run new threshold optimizations for the drafttopic model to choose good thresholds. They differ slightly from those of the articletopic model. @Halfak has a script for that.
  • We'll need to gather predictions from HDFS. ORES already produces drafttopic predictions for changes to pages in the Draft namespace and for the first edit to pages in the article namespace.

Event Timeline

dcausse added a subscriber: EBernhardson.

Pinging @EBernhardson.
The mediawiki_revision_score schema does include the page namespace. My understanding of what needs to be done on the search data pipeline is to keep the namespace around and stop hardcoding the "content" index in spark/convert_to_esbulk.py. The namespace -> index type mapping is available in the mediawiki config repository, which can be read (we already read dblists); the difficulty is that it's written in PHP but we're executing Python here.

See P10884 for the output of my thresholds script.

Do these predict the same set of classes, and should they be found with the same articletopic keyword? Or do we need to distinguish between the two models?

A few thoughts:

Assuming we need to distinguish between the two models, my first question is where the data goes in elasticsearch. Perhaps we should have called the field in elasticsearch ores_predictions and formatted the values slightly differently. Today we index STEM.Computing, but we could instead index articletopic/STEM.Computing (or some other combiner). This would allow the search keywords to prepend the model name where appropriate, and for multiple models to be stored in the same field. This would require some custom support in our super-detect-noop plugin to handle taking a field that contains drafttopic and articletopic predictions, and updating the set of articletopic predictions while keeping the unrelated drafttopic predictions. Likely we can live with the odd situation of having the field called ores_articletopics, with articletopic predictions unprefixed and predictions from any other model prefixed.
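To make the merge behaviour described above concrete, here is a minimal sketch in Python. It is not the super-detect-noop implementation; the `/` separator, the function names, and the rule that unprefixed values belong to articletopic are all assumptions taken from the proposal in this comment.

```python
from typing import Iterable, List, Optional

# Assumed combiner between model name and topic, e.g. "drafttopic/STEM.Computing".
SEPARATOR = "/"


def model_of(value: str) -> Optional[str]:
    """Return the model prefix of a stored value, or None if it is unprefixed."""
    if SEPARATOR in value:
        return value.split(SEPARATOR, 1)[0]
    return None


def merge_predictions(
    existing: Iterable[str],
    incoming: Iterable[str],
    model: Optional[str] = None,
) -> List[str]:
    """Replace one model's predictions and keep every other model's values.

    model=None stands for the unprefixed group (articletopic in this proposal).
    """
    prefix = "" if model is None else model + SEPARATOR
    # Keep values that belong to any other model.
    kept = [v for v in existing if model_of(v) != model]
    # Add the new predictions for the model being updated.
    updated = [prefix + topic for topic in incoming]
    return sorted(set(kept + updated))
```

For example, updating the unprefixed articletopic values of a field that holds `STEM.Computing` and `drafttopic/Culture.Sports` would swap out `STEM.Computing` and leave the drafttopic entry alone.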

For the actual shipping of data, this will indeed require knowledge about which namespaces live in which indices. Starting with the PHP code seems difficult: we would need not only the cirrus config, but also to know which namespaces a wiki considers "content" namespaces. I wonder if we can source this from the API somehow? It seems we could do something similar to the thresholds, where a script fetches the appropriate live configuration to provide to the batch process.
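One way this could look, as a rough sketch: the MediaWiki action API exposes namespace metadata through `meta=siteinfo` with `siprop=namespaces`. The helper names below are hypothetical, and the sketch assumes a `formatversion=2` response in which each namespace entry carries a boolean `content` flag. Whether that flag matches what CirrusSearch treats as the content index would still need checking.

```python
from typing import Any, Dict, Set
from urllib.parse import urlencode


def siteinfo_url(api_base: str) -> str:
    """Build a siteinfo query URL for a wiki's api.php endpoint."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "format": "json",
        "formatversion": "2",
    }
    return api_base + "?" + urlencode(params)


def parse_content_namespaces(response: Dict[str, Any]) -> Set[int]:
    """Extract ids of namespaces flagged as content from a parsed response.

    Assumes response["query"]["namespaces"] maps id strings to namespace
    objects that have an "id" and a truthy "content" key when applicable.
    """
    namespaces = response["query"]["namespaces"]
    return {int(ns["id"]) for ns in namespaces.values() if ns.get("content")}
```

A script run ahead of the batch job could fetch this per wiki and write the result next to the thresholds, so the Spark job itself never needs to read PHP configuration.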

One additional problem with shipping the data: so far we calculate and ship popularity scores for the non-content pages, but they get thrown away in the reindexing process. Essentially, if we fix the pipeline so it can ship to multiple namespaces, we will start triggering lots of updates that we were not triggering previously. When adding functionality to `convert_to_esbulk.py` to resolve namespaces into indices, we should probably filter out the non-content updates coming from the popularity score.
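A minimal sketch of that routing and filtering step, in plain Python rather than Spark. The update dict shape, the `popularity_score` field name, and the `<wiki>_content` / `<wiki>_general` index naming are assumptions for illustration, not the actual `convert_to_esbulk.py` code.

```python
from typing import Dict, Iterable, Iterator, Set

# Hypothetical field name carried by popularity score updates.
POPULARITY_FIELD = "popularity_score"


def resolve_index(wiki: str, namespace: int, content_ns: Dict[str, Set[int]]) -> str:
    """Map a (wiki, namespace) pair to an index name.

    content_ns maps each wiki to its set of content namespace ids.
    """
    suffix = "content" if namespace in content_ns.get(wiki, set()) else "general"
    return f"{wiki}_{suffix}"


def route_updates(
    updates: Iterable[dict], content_ns: Dict[str, Set[int]]
) -> Iterator[dict]:
    """Attach the target index to each update, dropping non-content popularity updates."""
    for update in updates:
        index = resolve_index(update["wiki"], update["namespace"], content_ns)
        if update["field"] == POPULARITY_FIELD and not index.endswith("_content"):
            # Popularity scores are only wanted for content pages.
            continue
        yield {**update, "index": index}
```

With this shape, a drafttopic update for a page in the Draft namespace would be routed to the non-content index, while a popularity update for the same page would be skipped.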

Overall this is a bit of work, but is all doable I think.

These predict the same set of classes. If adding another keyword is not a big deal, it would be nice to provide this as "drafttopic:<term>". It sounds like it might be a pain but it could also future proof us as we add other models. I might be back soon with a request for "articlequality:<class>" and even "viewrate:<range>" if we don't already have something that does that. :D

CBogen moved this task from [epic] to Current work on the Discovery-Search board.
CBogen moved this task from Incoming to Epics on the Discovery-Search (Current work) board.
Gehel renamed this task from Add drafttopic predictions to ElasticSearch index for the Draft namespace where available to [Epic] Add drafttopic predictions to ElasticSearch index for the Draft namespace where available. Oct 6 2020, 2:20 PM

Change 613345 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[wikimedia/discovery/analytics@master] convert_to_esbulk: Implement multilist handler from super_detect_noop

https://gerrit.wikimedia.org/r/613345

Change 613345 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] convert_to_esbulk: Implement multilist handler from super_detect_noop

https://gerrit.wikimedia.org/r/613345

Apologies for the long delay; this has been idle for some time waiting for operational tasks to complete. The data updating portion of this had its first run today. Data looks to be appropriately available in the search clusters now, and will continue to be imported going forward. CirrusSearch still needs to be adjusted to expose these as a keyword before this can be marked complete.

I'm not aware of anything remaining to do, so this can probably be closed.

Gehel claimed this task.