
Reload ORES data into weighted_tags
Closed, Resolved · Public

Description

As a search engineer, I want to finish deprecating the ores_articletopic field from Elasticsearch so we can provide a consistent and standard way of accepting weighted properties from other sources.

We are replacing the ores_articletopics field in Elasticsearch with weighted_tags. To complete this transition and stop using the old fields, we need to load all of the ORES data into the new field.

AC: weighted_tags is populated with ores_(article|draft)topic predictions
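For reference, a minimal sketch of how ORES topic predictions might be mapped into weighted_tags-style `prefix/name|integer-weight` strings. The prefix, threshold, and weight scaling below are illustrative assumptions, not the production configuration:

```python
# Sketch: convert ORES articletopic probabilities into weighted_tags-style
# strings. The "ores.articletopic" prefix, the 0.5 threshold, and the 1-1000
# weight scaling are assumptions for illustration only.
def to_weighted_tags(predictions, prefix="ores.articletopic", threshold=0.5):
    tags = []
    for topic, probability in predictions.items():
        if probability < threshold:
            continue
        # weighted_tags encodes the score as an integer weight after a '|'.
        weight = max(1, min(1000, round(probability * 1000)))
        tags.append(f"{prefix}/{topic}|{weight}")
    return tags

print(to_weighted_tags({"Culture.Biography.Women": 0.93, "STEM.Physics": 0.12}))
# ['ores.articletopic/Culture.Biography.Women|930']
```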

Event Timeline

Change 663283 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[wikimedia/discovery/analytics@master] Add manually triggered dag for ores bulk exports

https://gerrit.wikimedia.org/r/663283

Change 664556 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/vagrant@master] cirrussearch: Follow up on ores_articletopic -> weighted_tag rename

https://gerrit.wikimedia.org/r/664556

Change 664556 merged by jenkins-bot:
[mediawiki/vagrant@master] cirrussearch: Follow up on ores_articletopic -> weighted_tag rename

https://gerrit.wikimedia.org/r/664556

Change 663283 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Add manually triggered dag for ores bulk exports

https://gerrit.wikimedia.org/r/663283

Change 667709 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[wikimedia/discovery/analytics@master] ores_bulk_ingest: Handle unexpected api response

https://gerrit.wikimedia.org/r/667709

Articletopic dumps completed. The drafttopic dump died over the weekend, was retried, and then died again at about the same point today. It looks to be due to a returned page that doesn't contain any revision information. The patch gracefully handles the error; we will need to deploy it and re-enable the DAG so it keeps working.
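The fix amounts to treating a page with no revision data as a recoverable error instead of letting it kill the whole dump. A minimal sketch of that kind of guard, with hypothetical response field names (the real change is the ores_bulk_ingest patch linked above):

```python
# Sketch: tolerate API page entries that carry no revision information.
# The response shape and field names are hypothetical.
def extract_latest_revision(page):
    revisions = page.get("revisions")
    if not revisions:
        # Previously this blew up (e.g. KeyError) and aborted the dump;
        # instead report the page as a recoverable error.
        return None
    return revisions[0].get("revid")

def iter_revisions(pages):
    for page in pages:
        rev_id = extract_latest_revision(page)
        # Callers can count None results against the error threshold.
        yield page.get("pageid"), rev_id
```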

It looks like a full dump of drafttopic will take ~100 hours, due to needing to visit > 40M pages, vs articletopic which only visited 6M. We've never actually loaded a full drafttopic dump, so I'm left wondering if it's actually a useful thing. We could choose some arbitrary date, like Jan 1, 2020, and only dump revisions edited since then. In a quick test against the Elasticsearch APIs, only 10M of the 40M pages have been edited in the last year.
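The quick test is essentially a count of documents edited since a cutoff date. A sketch of that kind of query, where the index name, endpoint, and the `timestamp` field are assumptions about the CirrusSearch mapping rather than confirmed details:

```python
# Sketch: count pages edited since a cutoff date via the _count API.
# Index name, host, and the 'timestamp' field are assumptions.
import json
import urllib.request

query = {"query": {"range": {"timestamp": {"gte": "2020-01-01"}}}}
req = urllib.request.Request(
    "http://localhost:9200/enwiki_content/_count",  # hypothetical endpoint
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["count"])
```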

Change 667709 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] ores_bulk_ingest: Handle unexpected api response

https://gerrit.wikimedia.org/r/667709

While we could reduce the set of pages, it seems perhaps premature. We can still ponder whether we should, though; there seems to be a non-zero chance this yet again doesn't manage to run to completion. I've deployed the above patch and let it try again with the existing configuration.

Change 667892 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[wikimedia/discovery/analytics@master] ores_bulk_ingest: Increase drafttopic error_threshold to 1 per 500

https://gerrit.wikimedia.org/r/667892

Change 667892 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] ores_bulk_ingest: Increase drafttopic error_threshold to 1 per 500

https://gerrit.wikimedia.org/r/667892

Restarted the dump after deploying the change to error_threshold; it was only ~4 hours into the run since the last failure. The last failure was:

Exception: Exceeded error threshold of 0.001, Seen 5084185 items with 5085 errors.

This was repeated for several attempts. It's not clear how previous runs over the weekend made it well past this point before failing, but if it refuses to get past this point now, the sanest thing seemed to be to allow more failures. With any luck this should finish in ~100 hours.
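For context, the threshold check is just a running error rate compared against a limit; a sketch of that logic (names are illustrative, not the actual ores_bulk_ingest implementation) also shows why 5085 errors over 5084185 items just tips over the 0.001 default, and why 1 per 500 (0.002) gives it room to continue:

```python
# Sketch of an error-threshold guard matching the exception message above:
# abort once errors / items exceeds the configured rate.
class ErrorThresholdExceeded(Exception):
    pass

def check_error_threshold(items_seen, errors_seen, threshold=0.001):
    if items_seen and errors_seen / items_seen > threshold:
        raise ErrorThresholdExceeded(
            f"Exceeded error threshold of {threshold}, "
            f"Seen {items_seen} items with {errors_seen} errors."
        )

try:
    # 5085 / 5084185 is roughly 0.001000, just over the default threshold.
    check_error_threshold(5084185, 5085)
except ErrorThresholdExceeded as e:
    print(e)
```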

The node the job was running on was taken down for a reimage; it has restarted on another host.

Same thing: the node it was running on was taken down for a reimage this morning. It's now running on a host that has already been reimaged; letting it try again.

@EBernhardson -- I just wanted to check in on this. We're eager to test and see if this job fixes some issues in the model we've been seeing.

It's still running. Looks like it's requested just under 34M of the expected ~40M predictions, with a current runtime of ~110 hours. Once the dump finishes it should automatically be processed and uploaded to the production clusters.

Change 672524 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[wikimedia/discovery/analytics@master] prepare_rev_score: Rename scores_export to bulk_ingest

https://gerrit.wikimedia.org/r/672524

The script finished, but the processing framework OOM'd while finishing up and putting everything where it belongs. For now I'm bypassing the drafttopic dump, which will allow articletopic to ship to the cluster. To run drafttopic we will need a minor refactor of the orchestration to partition the intermediate data by namespace and re-run drafttopic one namespace at a time.
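A minimal sketch of what partitioning the intermediate data by namespace could look like in Spark; the column names and paths are assumptions, not the actual prepare_rev_score job:

```python
# Sketch: write intermediate predictions partitioned by namespace so a
# downstream task can process one namespace at a time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ores_bulk_ingest_sketch").getOrCreate()
predictions = spark.read.parquet("/wmf/data/ores_predictions")  # hypothetical input

(predictions
    .write
    .partitionBy("namespace")
    .mode("overwrite")
    .parquet("/wmf/data/ores_predictions_by_namespace"))  # hypothetical output
```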

Change 672524 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] prepare_rev_score: Rename scores_export to bulk_ingest

https://gerrit.wikimedia.org/r/672524

The articletopic dumps have been processed and uploaded to Swift. This includes updates for ~35M pages and will likely take a day or two to make it through the indexing pipeline.

Articletopic should be fully loaded into prod now, in both the ores_articletopics and weighted_tags fields. We will have to decide if we are going to push through drafttopic and refactor the orchestration into smaller pieces that don't retry on a week-long window.

Change 674416 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[wikimedia/discovery/analytics@master] airflow: Partition ores export tasks by namespace

https://gerrit.wikimedia.org/r/674416

Change 674416 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] airflow: Partition ores export tasks by namespace

https://gerrit.wikimedia.org/r/674416

Reworked the exports so we can run a task per namespace. Triggered a new run of the ores_predictions_bulk_ingest DAG and manually marked all the articletopic tasks as success so it skips them and only does drafttopic. Now waiting for it to complete.
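Roughly, the reworked DAG has one export task per namespace instead of a single monolithic task. A minimal Airflow sketch of that shape, where the DAG id, namespace list, and command are illustrative rather than the real wikimedia/discovery/analytics code:

```python
# Sketch: manually triggered DAG with one export task per namespace.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ores_predictions_bulk_ingest_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # manually triggered, no schedule
    catchup=False,
) as dag:
    for namespace in [0, 1, 2, 118]:  # illustrative subset of namespaces
        BashOperator(
            task_id=f"export_drafttopic_ns_{namespace}",
            bash_command=f"echo 'export drafttopic for namespace {namespace}'",
        )
```

Marking the already-shipped articletopic tasks as success then leaves only the per-namespace drafttopic tasks to run, and a failure in one namespace no longer forces a retry of the whole week-long export.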

enwiki ns 0 has completed, and ns 1 is working its way through. Optimistically, it looks like this should work out and complete.

Completed a number of namespaces; it's up to 14 now. Taking its time but looking good.

Looks like I forgot to follow up on this: the data ended up shipping March 31st around midnight. Everything downstream looks to have worked appropriately.