Load ORES articletopic data into ElasticSearch via the weekly bulk update
Closed, ResolvedPublic

Description

Once ORES articletopic data is in HDFS (via T240553: Consume ORES articletopic data from Kafka and store it in HDFS), the bulk ElasticSearch update job needs to be modified to handle it.

Initially, only enwiki has articletopic data, and we want to fake it for other wikis by reusing the data from each page's English interwiki counterpart. The bulk update job might or might not be the right place for that (cf. T240517#5734191).
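
As a rough illustration of the interwiki idea, the following sketch copies enwiki topics onto pages of other wikis that link to an enwiki page. The function name, the dictionary shapes, and the sample titles are hypothetical and only meant to show the lookup; where this logic should actually live is left open, as noted above.

```python
def propagate_topics(enwiki_topics, interwiki_links):
    """Copy enwiki topics to pages on other wikis via their enwiki link.

    enwiki_topics:   {enwiki_title: [topic, ...]}   (hypothetical shape)
    interwiki_links: {(wiki, title): enwiki_title}  (hypothetical shape)
    """
    result = {}
    for (wiki, title), en_title in interwiki_links.items():
        topics = enwiki_topics.get(en_title)
        if topics:  # pages without a scored enwiki counterpart get nothing
            result[(wiki, title)] = list(topics)
    return result


if __name__ == "__main__":
    links = {("dewiki", "Katze"): "Cat", ("dewiki", "Hund"): "Dog"}
    print(propagate_topics({"Cat": ["TestA"]}, links))
```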

Event Timeline

EBernhardson triaged this task as Medium priority.
EBernhardson moved this task from watching / waiting to Current work on the Discovery-Search board.

Change 564177 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/analytics@master] Import ores_drafttopics

https://gerrit.wikimedia.org/r/564177

Change 564177 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Import ores_drafttopics

https://gerrit.wikimedia.org/r/564177

FYI:

Hey folks, it looks like articletopic (a slightly different model that we now have in production) is a better option than the drafttopic model. Once we release the native models for ar, cs, ko, and viwiki, they'll only have an articletopic model for use.

I guess we need to redo some patches then (e.g. https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/564177) to reference articletopic instead of drafttopic?

Change 566801 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/analytics@master] Rename drafttopics to articletopics

https://gerrit.wikimedia.org/r/566801

A patch is up to change the analytics side to articletopic as well. Since some ores_drafttopic data has already been shipped, we will need to remember to ask the update script to delete it from the source documents when doing the reindex that adds ores_articletopic to the schema.

Change 566801 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Rename drafttopics to articletopics

https://gerrit.wikimedia.org/r/566801

kostajh renamed this task from Load ORES drafttopic data into ElasticSearch via the weekly bulk update to Load ORES articletopic data into ElasticSearch via the weekly bulk update. Feb 6 2020, 11:17 AM
kostajh updated the task description.

I think this is not fully resolved (or might not be, depending on how we handle the threshold issue discussed in the comments of T244297: Newcomer tasks: set initial thresholds for ORES articletopic). I left a comment on the patch with what seemed like the best solution to me; there's also the threshold change issue Erik mentioned (T244297#5858563) which I'm not sure how to deal with.

Or should these rather be tracked in a separate task?

Note to self: if, after an edit, all ORES topic predictions are below threshold, the old predictions won't be cleared (due to a temporary workaround). This is pretty unlikely to be a problem in practice.
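
To make that edge case concrete, here is a rough sketch of the kind of per-topic thresholding under discussion. Everything in it is an assumption for illustration: the function names, the threshold values, and the `topic|score` encoding with scores scaled to integers out of 1000 are not taken from the actual job. It only shows why an edit with no above-threshold topics leaves the old predictions in place when empty results are skipped.

```python
def select_topics(predictions, thresholds, default_threshold=0.5):
    """Keep predictions at or above their topic's threshold.

    Returns strings like "TestA|870" (hypothetical encoding: score * 1000).
    """
    kept = []
    for topic, score in sorted(predictions.items()):
        if score >= thresholds.get(topic, default_threshold):
            kept.append(f"{topic}|{round(score * 1000)}")
    return kept


def apply_update(stored_doc, predictions, thresholds):
    """Mimic the workaround: an empty result is skipped, not written."""
    topics = select_topics(predictions, thresholds)
    if topics:
        stored_doc["ores_articletopics"] = topics
    return stored_doc


if __name__ == "__main__":
    doc = {"ores_articletopics": ["TestA|900"]}
    # All new predictions fall below threshold, so the old value survives.
    print(apply_update(doc, {"TestA": 0.1}, {"TestA": 0.5}))
```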

Looking at a random enwiki article that was edited last week, I don't see any articletopic data in the Cirrus dump. Is that because we need to wait for T240550: Add mapping for ORES topic field in ElasticSearch before it shows up? Is there an easy way to fake the data in a local setup until then?

While this workflow is deployed, it is currently switched off in the Airflow admin. My main concern was that we are adding thresholding, and current runs don't take that into account. It seemed better to wait until per-wiki/topic thresholding is deployed before turning on the data shipping. The workflow was briefly deployed and ran for a week or two before I realized we needed the updated thresholding.

To fake the data, there are a few parts:

  • Allow CirrusSearch to create the field. This needs configuration like: https://gerrit.wikimedia.org/r/573003
  • Re-create the search indices so they have the new field, via updateSearchIndex.php
  • Inject some fake data: curl -XPOST https://localhost:9200/devwiki_content/page/<mw page id>/_update -H 'Content-Type: application/json' -d '{"doc": {"ores_articletopics": ["TestA|123", "TestB|321"]}}'
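
The injection step can also be scripted for several pages at once. The following is a minimal sketch, assuming the same `devwiki_content` index, `page` type, and `ores_articletopics` field as the curl command above. The helper name `build_update_request` and the page ids are hypothetical, and the base URL is a parameter because the scheme and port depend on the local setup.

```python
import json
import urllib.request


def build_update_request(base_url, index, page_id, topics):
    """Build (but do not send) a partial-update request for one page."""
    url = f"{base_url}/{index}/page/{page_id}/_update"
    body = json.dumps({"doc": {"ores_articletopics": topics}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical page ids; replace with real MediaWiki page ids.
    for page_id in (1, 2, 3):
        req = build_update_request(
            "https://localhost:9200", "devwiki_content", page_id,
            ["TestA|123", "TestB|321"],
        )
        print(req.get_method(), req.full_url)
        # To actually send it: urllib.request.urlopen(req)
```

The request is only constructed here, not sent, so the loop can be checked before anything is written to the index.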

Thanks @EBernhardson!

The first two steps are now done by the cirrussearch vagrant role by default (provisioning should also update the indexes).

For injecting the fake data, I ended up writing a small helper script for setting up several test pages: P10461