Ingest new image suggestions index diffs
Closed, Resolved, Public

Description

Following the bugfix for T314120: The image suggestion data pipeline generates too many weighted_tags, we re-generated the search index diffs (AKA deltas). Note that the bug seems to have affected Commons.

Tasks

  • clean the tags produced by T314120 out of the Commons index

Given the analytics_platform_eng.image_suggestions_search_index_delta Hive table:

  • ingest the 2022-07-11 snapshot
  • ingest the 2022-07-18 snapshot

Summary

Event Timeline

The 2022-07-11 snapshot had already been pushed into the Kafka queues to be ingested, but the daemons were paused while they were running. I've run the DAG to push 2022-07-18 into the queues today.

The daemons that ingest these updates into Elasticsearch are running again with T314078 mostly resolved, but are backlogged on the updates that were sent while they were paused, including the regular weekly batch load. I'm hoping the image suggestion data will be available in eqiad by end of day Thursday and in codfw probably by Friday. They could potentially take another day or so, though, and finish up on Saturday.

@EBernhardson, thanks for the update. Are you talking about the re-generated 2022-07-11 snapshot? This is the one that should be ingested: it contains the bugfix for T314120 and has been available on Hive since August 2 at 5 pm UTC.

Didn't realize they were re-generated, I've re-shipped the 7-11 and 7-18 datasets. They should be live in eqiad by now; codfw is still processing through its backlog but will get there eventually.

I suppose a related thought: are the regenerated diffs diffing against the correct thing? For the diffs to be correct, they need to be diffed against the expected state of the production indices. When you regenerate the 7-11 dataset, is it building against the previous 7-11 dataset that was shipped, or against the 7-4 dataset, which is no longer the expected state since the 7-11 dataset was already shipped once?

I suppose I expected the 7-18 dataset to be a diff against the incorrect 7-11 dataset, which would bring it back to the newly expected state.

The re-generated 7-11 is against 7-4, and 7-18 is against the re-generated 7-11. This is what @Cparle instructed before leaving. Unfortunately, I wasn't aware that the bad 7-11 was already shipped, otherwise I'd have built the correct 7-18 only.
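
For illustration only, here is a toy model of the diffing question above. It assumes a delta is simply a set of tags to add and a set of tags to remove per page; the real delta table schema and ingestion logic are not shown in this thread, and every name and value below is hypothetical. Under that assumption, tags that only the bad 7-11 introduced are never mentioned by deltas built against 7-4 and the re-generated 7-11, so they would stay in the index, whereas a 7-18 delta built against the bad 7-11 would remove them.

```python
# Toy model, NOT the real pipeline: an index state maps page id -> set of tags,
# and a delta maps page id -> (tags to add, tags to remove).

def make_delta(base, target):
    """Compute what must change to turn the `base` state into `target`."""
    delta = {}
    for page in base.keys() | target.keys():
        old = base.get(page, set())
        new = target.get(page, set())
        add, remove = new - old, old - new
        if add or remove:
            delta[page] = (add, remove)
    return delta

def apply_delta(state, delta):
    """Return a new state with the delta applied; pages left without tags are dropped."""
    result = {page: set(tags) for page, tags in state.items()}
    for page, (add, remove) in delta.items():
        tags = (result.get(page, set()) - remove) | add
        if tags:
            result[page] = tags
        else:
            result.pop(page, None)
    return result

# Hypothetical snapshots.
snap_0704 = {"p1": {"a"}}
bad_0711 = {"p1": {"a"}, "p2": {"junk"}}  # produced with the bug
good_0711 = {"p1": {"a", "b"}}            # re-generated
snap_0718 = {"p1": {"b"}}

# Production has already received the bad 7-11 delta.
prod_after_bad = apply_delta(snap_0704, make_delta(snap_0704, bad_0711))

# Case 1: re-generated 7-11 against 7-4, then 7-18 against re-generated 7-11.
prod = apply_delta(prod_after_bad, make_delta(snap_0704, good_0711))
prod = apply_delta(prod, make_delta(good_0711, snap_0718))
leftovers = {p: t for p, t in prod.items() if snap_0718.get(p, set()) != t}
print("leftovers:", leftovers)

# Case 2: a single 7-18 delta built against the bad 7-11 that was shipped.
fixed = apply_delta(prod_after_bad, make_delta(bad_0711, snap_0718))
print("matches 7-18:", fixed == snap_0718)
```

In this sketch the first case leaves the page that only the bad snapshot tagged untouched, which is why a separate cleanup of the bad tags would still be needed under these assumptions.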

Checked all bad_commons_ids as per P32103 as follows:

$ for id in $(cat bad_commons_ids); do curl "https://commons.wikimedia.org/w/?curid=${id}&action=cirrusDump" -H 'Accept: application/json' > "${id}.json"; done
$ ipython
In [1]: import json, os
In [2]: from collections import OrderedDict
In [3]: tags = {}
In [4]: for f in os.listdir('.'):
  ...:     if f.endswith('.json'):
  ...:         with open(f) as fin:
  ...:             j = json.load(fin)
  ...:             tags[f.split('.')[0]] = len(j[0]['_source']['weighted_tags'])
In [5]: print(json.dumps(OrderedDict(sorted(tags.items(), key=lambda x: x[1], reverse=True)), indent=2))

See output in P32288: none of the oversized tags are there anymore.

Also dry-ran the T292147: [L] Send Image Suggestions notifications to experienced users script against the 3 target wikis, i.e., pt, id, and ru. Results:

  • pt: Done. Notified 1558 users about 2910 pages. 118597 pages had no available users.
  • id: Done. Notified 454 users about 849 pages. 41403 pages had no available users.
  • ru: Done. Notified 3725 users about 6884 pages. 77347 pages had no available users.

Also double-checked T313412: Local images not accounted for when looking at unillustrated articles. Illustrated articles counts follow:

  • id: 4 out of 849
  • pt: 17 out of 2910
  • ru: 60 out of 6884
  • total: 81 out of 10643
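
As a quick sanity check on the counts above, a minimal sketch that recomputes the totals and the share of already-illustrated articles per wiki. The numbers are copied from this comment; the variable names are only illustrative.

```python
# (illustrated articles, pages covered by the dry-run notifications) per wiki,
# copied from the lists above.
counts = {"id": (4, 849), "pt": (17, 2910), "ru": (60, 6884)}

total_illustrated = sum(illustrated for illustrated, _ in counts.values())
total_pages = sum(pages for _, pages in counts.values())

for wiki, (illustrated, pages) in counts.items():
    print(f"{wiki}: {illustrated}/{pages} = {illustrated / pages:.2%}")
print(f"total: {total_illustrated}/{total_pages} = {total_illustrated / total_pages:.2%}")
```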

Looks like a few leftovers are still there, maybe due to T314164: Some pages' search index docs indicate they have a suggestion when they do not.
@EBernhardson, @dcausse : not sure what to do here to clean that up.

I think we're good enough to actually send out the notifications!

CBogen claimed this task.