
[L] List articles appearing in articles with image suggestions
Closed, Resolved (Public)

Description

See https://en.wikipedia.org/w/index.php?go=Go&search=hasrecommendation%3Aimage&title=Special:Search&ns0=1 (this is where Growth gets their list of suggestions from)

Lots of the articles are lists and should be excluded from image suggestions

If you look at the image pipeline code here you can see that list articles are explicitly excluded, but somehow they're getting through anyway, so I guess we have a bug

Event Timeline

MarkTraceur renamed this task from List articles appearing in articles with image suggestions to [M] List articles appearing in articles with image suggestions. Oct 19 2022, 5:46 PM
MarkTraceur renamed this task from [M] List articles appearing in articles with image suggestions to [L] List articles appearing in articles with image suggestions.

Checking the first few items in the list above, their timestamps in Cassandra are from last April, so it seems likely that this is old data that got imported into the search indices and never got updated.

I have a draft notebook that I'll run the week of Nov 7th (I'm off next week, and I want to make sure all the most recent data has been imported into the search indices before running it). The notebook will save data to a special snapshot in analytics_platform_eng.image_suggestions_search_index_delta, and then I'd hope @EBernhardson or @dcausse can import it using the search import pipeline and all the data should be repaired

For reference here's the notebook code I ran. Saved the data into my own hive db because I don't have permission to write to analytics_platform_eng from a notebook

from pyspark.sql import functions as F

# `spark` is the SparkSession provided by the notebook environment.
fakeSnapshotString = "fixup-T320656"
latestWeekly = "2022-10-24"
latestMonthly = "2022-10"

# Pages that have an image suggestion in the latest weekly snapshot.
latestSuggestions = spark.sql('SELECT * FROM analytics_platform_eng.image_suggestions_search_index_full where snapshot="{}"'.format(latestWeekly))
# Every main-namespace (ns 0) page on every wiki.
allPages = spark.sql('select wiki_db, page_namespace, page_id from wmf_raw.mediawiki_page where snapshot="{}" and page_namespace=0'.format(latestMonthly))

# left_anti keeps only the pages that have NO current suggestion; each of
# those gets a row marking the recommendation.image tag for deletion.
searchIndexUpdates = allPages.join(
    latestSuggestions,
    on=[
        allPages.page_id == latestSuggestions.page_id,
        allPages.wiki_db == latestSuggestions.wikiid
    ],
    how='left_anti'
).select(
    allPages.wiki_db.alias('wikiid'),
    allPages.page_namespace,
    allPages.page_id
).withColumn(
    'tag', F.lit('recommendation.image')
).withColumn(
    'values', F.array(F.lit('__DELETE_GROUPING__'))
).withColumn(
    'snapshot', F.lit(fakeSnapshotString)
).distinct()

# Written to a personal Hive db, partitioned by the fake snapshot name.
searchIndexUpdates.write.saveAsTable(
    'cormac.image_suggestions_search_index_delta',
    partitionBy=['snapshot'],
    mode='append'
)

Change 855569 had a related patch set uploaded (by DCausse; author: DCausse):

[wikimedia/discovery/analytics@master] image_suggestions: schedule ad hoc dataset to fix improper suggestions

https://gerrit.wikimedia.org/r/855569

cormac.image_suggestions_search_index_delta has more than 250M items in it, but I doubt that there are 250M pages to fix up. @Cparle, could you prune the dataset a bit, e.g. by using the corrupted snapshot that shipped these wrong suggestions on list pages as the base instead of listing all existing pages?

hmm ok ... let me try a slightly different approach
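
For illustration only, the pruning idea suggested above can be sketched in plain Python. This is not the code that was eventually run for this task: `PageKey` and `build_delete_rows` are hypothetical names, and the inputs stand in for the page keys of the old (corrupted) snapshot and of the latest snapshot. In Spark the same logic would be a `left_anti` join with the old snapshot, rather than the full page table, on the left side.

```python
from typing import Iterable, List, NamedTuple


class PageKey(NamedTuple):
    """Hypothetical key identifying a page across wikis."""
    wikiid: str
    page_namespace: int
    page_id: int


def build_delete_rows(
    stale_pages: Iterable[PageKey],
    current_pages: Iterable[PageKey],
    snapshot: str,
) -> List[dict]:
    """Emit one delete row per page that had a suggestion in the old
    snapshot but has none in the current one (an anti-join on the key)."""
    current = set(current_pages)
    rows = []
    # set() also removes duplicate keys, like .distinct() would in Spark.
    for key in sorted(set(stale_pages) - current):
        rows.append({
            "wikiid": key.wikiid,
            "page_namespace": key.page_namespace,
            "page_id": key.page_id,
            "tag": "recommendation.image",
            "values": ["__DELETE_GROUPING__"],
            "snapshot": snapshot,
        })
    return rows


# Made-up example keys, purely for demonstration.
stale = [PageKey("enwiki", 0, 1), PageKey("enwiki", 0, 2), PageKey("dewiki", 0, 1)]
current = [PageKey("enwiki", 0, 2)]
rows = build_delete_rows(stale, current, "fixup-T320656")
```

Starting from the old snapshot bounds the output by the number of pages that ever received a suggestion, instead of by the total number of main-namespace pages.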

Change 855569 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] image_suggestions: schedule ad hoc dataset to fix improper suggestions

https://gerrit.wikimedia.org/r/855569

Mentioned in SAL (#wikimedia-operations) [2022-11-10T17:44:58Z] <dcausse@deploy1002> Started deploy [wikimedia/discovery/analytics@84dd7b5]: T320656: image_suggestions: schedule ad hoc dataset to fix improper suggestions

Mentioned in SAL (#wikimedia-operations) [2022-11-10T17:47:17Z] <dcausse@deploy1002> Finished deploy [wikimedia/discovery/analytics@84dd7b5]: T320656: image_suggestions: schedule ad hoc dataset to fix improper suggestions (duration: 02m 18s)

Change 855670 had a related patch set uploaded (by DCausse; author: DCausse):

[wikimedia/discovery/analytics@master] convert_to_esbulk: fix typo in config

https://gerrit.wikimedia.org/r/855670

Change 855670 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] convert_to_esbulk: fix typo in config

https://gerrit.wikimedia.org/r/855670

Mentioned in SAL (#wikimedia-operations) [2022-11-10T18:15:11Z] <dcausse@deploy1002> Started deploy [wikimedia/discovery/analytics@a030f5f]: T320656: convert_to_esbulk: fix typo in config

Mentioned in SAL (#wikimedia-operations) [2022-11-10T18:17:33Z] <dcausse@deploy1002> Finished deploy [wikimedia/discovery/analytics@a030f5f]: T320656: convert_to_esbulk: fix typo in config (duration: 02m 22s)

Import is done. There are a lot fewer list pages with this recommendation, so I can confirm that this worked. There are still a few on enwiki (most probably outliers): https://en.wikipedia.org/w/index.php?search=hasrecommendation%3Aimage+intitle%3AList&title=Special:Search&profile=advanced&fulltext=1&ns0=1

Ok great! I guess the question now is whether we need to go through the remaining articles and exclude things like "Wikimedia list of persons by gender (P21) and occupation (P106)" and "information list". 585 articles with "list" in the title out of ~87k articles-with-suggestions might be tolerable. @AUgolnikova-WMF ?
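
To put those two numbers in proportion, a quick calculation, assuming the approximate figure of 87k is taken as exactly 87,000 (the variable names are illustrative), gives a share of roughly 0.67%:

```python
# Figures quoted in the comment above; 87_000 stands in for "~87k".
list_titled = 585
with_suggestions = 87_000

share = list_titled / with_suggestions
print(f"{share:.2%} of articles with suggestions have 'list' in the title")
```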

As discussed with Cormac, we won't put in any additional effort for now; this ticket can be considered resolved.