
Once the ORES articletopic - ElasticSearch pipeline is set up, update data about all articles
Closed, ResolvedPublic

Description

Once the ORES articletopic -> ElasticSearch pipeline is set up and tested, send data about all articles through it (normally the pipeline would only be triggered by edits, so we wouldn't have any data about long-untouched articles).

If the pipeline is via T240556: Load ORES articletopic data into ElasticSearch via the weekly bulk update (this is the current plan, probably around the beginning of February), this would be done via a script hitting the ORES precache endpoint for all articles. According to @Halfak that's 1-2 days for enwiki, and probably a few hours for the smaller wikis that will have a drafttopic model. It probably does not need much coordination, other than making sure the request rate is reasonable.

(If the pipeline would be via T240558: Update ORES articletopic data score in ElasticSearch when an article gets edited, we'd have to use forceSearchIndex.php instead.)

Event Timeline

The pipeline has been set up but will be changed in https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/571790 so this task is blocked on merging that.

> this would be done via a script hitting the ORES precache endpoint for all articles

That doesn't work: changeprop doesn't listen to ORES, it listens to MediaWiki; it calls the ORES precache endpoint and forwards the response. That process can only be triggered on the MediaWiki end (and only by creating new revisions, which we obviously can't do).

So either we implement T240558: Update ORES articletopic data score in ElasticSearch when an article gets edited (we wanted it anyway, but not immediately and maybe not on all wikis) and rely on calling that, or we just manually export the data from ORES (@Halfak points out that oresapi can do that a lot more efficiently than precache would) and load it into HDFS. That seems not too bad (@EBernhardson, any thoughts?), but it would be nice to end up with an easily repeatable process. Maybe it can be turned into a script on stat1007? Or a job, although that seems a bit heavy-handed.

It wouldn't be too hard to adjust mw_prepare_rev_score.py to source its data from somewhere else; at some point in that script we have a DataFrame containing three fields: `(wikiid, page_id, dict from label to prediction probability)`. It sounds like we should be able to generate a dataset in that format from oresapi; for a one-time script I can hack reading that in relatively easily. The simplest exchange format is probably gzip'd files containing one JSON row per line. If the dataset is large these can be split across multiple files.

Example content:

{"wikiid": "eswiki", "page_id": 4, "scores": {"foo": 0.432, "bar": 0.987}}
{"wikiid": "eswiki", "page_id": 5, "scores": {"bang": 0.99}}
...
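To make the exchange format concrete, here is a minimal sketch of writing and reading such gzip'd JSON-lines files. The function names (`write_scores`, `read_scores`) are illustrative, not part of any existing script; only the row shape matches the example above.

```python
import gzip
import json


def write_scores(path, rows):
    """Write (wikiid, page_id, scores) tuples as gzip'd JSON lines,
    one row per line, matching the example content above."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for wikiid, page_id, scores in rows:
            f.write(json.dumps(
                {"wikiid": wikiid, "page_id": page_id, "scores": scores}
            ) + "\n")


def read_scores(path):
    """Yield row dicts back from a gzip'd JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

A large dataset could simply be split across several such files and read back with the same function per file.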

oresapi takes revision IDs as the input, so the script would have to walk through all mainspace MediaWiki pages (via the allpages API, presumably), fetch revids, use oresapi's score method to fetch scores from the ORES API's batch endpoint, write them to disk, then trigger mw_prepare_rev_score.py.
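Since ORES scores requests in batches, the walk described above needs to chunk the collected revision IDs before handing them to oresapi. A small sketch of that step, with the network calls shown only as comments (the batch size of 50 and the glue around `oresapi.Session` are assumptions, not taken from any existing script):

```python
import itertools


def chunked(iterable, size):
    """Split an iterable (e.g. revision IDs collected from the allpages
    walk) into batches suitable for a batch scoring request."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


# Usage sketch (not executed here; assumes the oresapi client library):
#
#   from oresapi import Session
#   ores = Session("https://ores.wikimedia.org", user_agent="...")
#   for batch in chunked(rev_ids, 50):  # 50 per request is an assumption
#       for score in ores.score("enwiki", ["articletopic"], batch):
#           ...  # map back to page_id and write a JSON row to disk
```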

P10507 has a simple test script for fetching all enwiki ORES scores.

A manual all-pages export has been run against arwiki, cswiki, kowiki and viwiki. These are now loaded into the elasticsearch cluster, but some indices are still in the reindex queue before the fields will be searchable.

kostajh renamed this task from Once the ORES drafttopic - ElasticSearch pipeline is set up, update data about all articles to Once the ORES articletopic - ElasticSearch pipeline is set up, update data about all articles.Mar 2 2020, 10:54 AM
kostajh updated the task description.

> A manual all-pages export has been run against arwiki, cswiki, kowiki and viwiki. These are now loaded into the elasticsearch cluster, but some indices are still in the reindex queue before the fields will be searchable.

Thanks @EBernhardson. How do we want to move forward on this for other wikis?

I've also written everything necessary to do the same for enwiki and propagate it out to all the wikis that don't have articletopic models. Just running the enwiki export took most of the weekend; the rest should go out this week.

> I've also written everything necessary to do the same for enwiki and propagate it out to all the wikis that don't have articletopic models. Just running the enwiki export took most of the weekend; the rest should go out this week.

This is complete and all the data has been uploaded to the search clusters. It is available for searching on many wikis, and will be activated on the rest as the reindexing process goes through them all over the next few weeks.

@Tgr this is done on our side, should we close? Or are you still tracking it?

Thanks @EBernhardson and @Gehel! This is finished as far as it matters for the current Growth plans. When more wikis get a native ORES model, those will need to be updated, but no point in keeping the task open for that.