Unable to get list of more than 10k pages with recommendations
Closed, Resolved | Public | Bug Report

Description

SendNotificationsForUnillustratedWatchedTitles.php (the script that generates notifications for articles with image suggestions) currently uses the search API with hasrecommendation:image to get a list of articles with image recommendations. Elastic, however, caps search results at 10k.

This means that we can't process more than 10k articles at a time.
And because we skip many articles for various reasons (couldn't find a relevant editor to notify; suggestion not good enough; ...), we're only sending notifications for a small subset of those 10k.
And because the skipped articles keep showing up in the first 10k results, they essentially block us from progressing: at some point, none of the first 10k results will match the criteria to send a notification, and we can't get to those further down.
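The blocking effect can be illustrated with a toy simulation. This is only a sketch: `run_weekly`, `is_notifiable` and the numbers are made up for illustration, and it assumes that a page stops appearing in the results once a notification has been sent for it, while skipped pages stay where they are.

```python
def run_weekly(page_ids, is_notifiable, cap=10_000, weeks=3):
    """Simulate repeated runs that only ever see the first `cap` results.

    Returns the number of notifications sent in each run.
    """
    pending = list(page_ids)
    sent_per_week = []
    for _ in range(weeks):
        window = pending[:cap]  # everything past the cap is invisible
        sent = {p for p in window if is_notifiable(p)}
        # Assumption: notified pages drop out; skipped pages remain in place.
        pending = [p for p in pending if p not in sent]
        sent_per_week.append(len(sent))
    return sent_per_week


# 30 pages, a cap of 10, and only every 5th page qualifies for a notification.
print(run_weekly(range(30), lambda p: p % 5 == 0, cap=10, weeks=4))
# -> [2, 1, 0, 0]; pages 15, 20 and 25 are never reached
```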

Event Timeline

Potential solutions we've discussed:

  • Narrow down searches in a way that we are never expecting 10k+ results
    • Batch in ranges of 10k pageids (not possible ATM; pageid not indexed in a way that would allow a range query)
    • Batch based on max(pageid) % 10k (not possible ATM; pageid not indexed in a way that would allow a range query)
    • Batch via individual ids (e.g. pageid:1|2|3|...); not realistic ATM because it is too limited (1000 max; even fewer taking into account max query length)
  • Use Elastic's scroll API (which means building the raw elastic query & communicating with elastic, rather than using the search API)
  • Use the data already available (or to be made available) via data pipelines in Cassandra to expose a full list through an API (still under discussion)
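For reference, the scroll option boils down to a loop like the one below. It is a minimal sketch and not the code of the actual patch: `scroll_all` and the `send(method, path, body)` transport are hypothetical names, and the raw query that hasrecommendation:image translates to is left as a parameter. The three endpoints used (an initial `_search?scroll=...`, repeated `_search/scroll`, and a final `DELETE _search/scroll`) are the standard Elasticsearch scroll calls.

```python
from typing import Any, Callable, Dict, Iterator

# Hypothetical transport: takes (method, path, json_body), returns the decoded JSON response.
Transport = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_all(
    send: Transport,
    index: str,
    query: Dict[str, Any],
    batch_size: int = 1000,
    keep_alive: str = "1m",
) -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query`, one batch at a time, past the 10k cap."""
    response = send(
        "POST",
        f"/{index}/_search?scroll={keep_alive}",
        {"size": batch_size, "query": query},
    )
    scroll_id = response.get("_scroll_id")
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            # Ask for the next batch; the scroll id may change between calls.
            response = send(
                "POST",
                "/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context once we are done (or bail out).
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Passing the transport in as a callable keeps the loop independent of whichever HTTP client or Elastic library ends up being used.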

One other option:

  • When fetching the 10k results, order them randomly. This will not realistically allow us to get all suggestions in any given week, but will allow us to get 10k results this week and a mostly different 10k results next week, and avoid the case where our forward progress is blocked (at least in the short-medium term)
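If this were done as a raw Elastic query, random ordering could look roughly like the sketch below, which wraps the query in `function_score` with `random_score`. The helper names, the weekly seed scheme and the use of `_seq_no` as the seed field are assumptions for illustration, and whether the search API exposes an equivalent option is not covered here.

```python
import datetime


def weekly_seed(day: datetime.date) -> int:
    """Derive a seed that is stable within an ISO week and changes every week."""
    year, week, _weekday = day.isocalendar()
    return year * 100 + week


def randomized_query(base_query: dict, seed: int, size: int = 10_000) -> dict:
    """Build a search body that returns `size` hits in seeded random order."""
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": base_query,
                # The same seed gives the same order, so reruns within a week agree.
                "random_score": {"seed": seed, "field": "_seq_no"},
                # Ignore the relevance score; order by the random score only.
                "boost_mode": "replace",
            }
        },
    }


# match_all is a placeholder for the real "has an image recommendation" query.
body = randomized_query({"match_all": {}}, weekly_seed(datetime.date(2022, 7, 5)))
```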

Here's what's involved in doing a Cassandra-based solution

  • create a new Cassandra table
  • populate it from our existing airflow DAG (easy)
  • create an HTTP interface for that new table (easy)
  • BUT the set of articles-with-suggestions for enwiki contains ~250k rows
  • AND the way that we deal with updates to data atm is to always append new data, with a time-based identifier (and allow it to eventually get deleted by Cassandra itself via a ttl) ... and this means the size of the set of articles-with-suggestions for enwiki will be a multiple of 250k
  • AND right now there's no way to use http to page through a Cassandra dataset - you can only get all the data at once. Returning a multiple of 250k rows from a single http call seems like a bad idea
  • Paging through a dataset is possible on the Cassandra side, but it'd need to be implemented in the http gateway. The obvious person to do this is @Eevans but he's moved to SRE

(the way we deal with updates to data isn't really ideal, and needs a re-think, but that's kinda independent of this blocker)
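To make the paging point concrete, here is a rough sketch of cursor-based paging as an HTTP layer might expose it. It runs against an in-memory sorted list rather than Cassandra, and `get_page`, the cursor format and the response shape are all hypothetical; the real gateway would presumably pass Cassandra's own paging state through instead.

```python
import base64
import json
from bisect import bisect_right


def get_page(sorted_page_ids, limit, cursor=None):
    """Return one page of ids plus an opaque cursor for the next page (or None)."""
    start = 0
    if cursor is not None:
        # The cursor only encodes the last id the client has already seen.
        last_seen = json.loads(base64.urlsafe_b64decode(cursor))["after"]
        start = bisect_right(sorted_page_ids, last_seen)
    items = sorted_page_ids[start:start + limit]
    next_cursor = None
    if start + limit < len(sorted_page_ids):
        token = json.dumps({"after": items[-1]}).encode()
        next_cursor = base64.urlsafe_b64encode(token).decode()
    return {"items": items, "next": next_cursor}


# A client keeps following "next" until it comes back as None.
rows = [1, 3, 5, 7, 9]
page = get_page(rows, limit=2)
collected = list(page["items"])
while page["next"] is not None:
    page = get_page(rows, limit=2, cursor=page["next"])
    collected.extend(page["items"])
```

Each response stays bounded by `limit`, so no single call has to return a multiple of 250k rows.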

One thing I think you missed: valid suggestions are a merge between the suggestions & feedback tables, since you want to omit anything no longer considered valid. The feedback table is partitioned with the expectation that you're doing discrete, by-pageID lookups (read: it will not work for this as-is).
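In other words, the full list is effectively an anti-join of the two tables. A minimal sketch, assuming hypothetical `page_id` and `image` fields and assuming that any feedback row for a suggestion means it is no longer valid (the real schema and rules may differ):

```python
def valid_suggestions(suggestions, feedback):
    """Keep only the suggestions that have no matching feedback row."""
    # Assumption: a feedback row keyed on (page_id, image) invalidates that suggestion.
    invalidated = {(f["page_id"], f["image"]) for f in feedback}
    return [s for s in suggestions if (s["page_id"], s["image"]) not in invalidated]


suggestions = [
    {"page_id": 1, "image": "A.jpg"},
    {"page_id": 1, "image": "B.jpg"},
    {"page_id": 2, "image": "C.jpg"},
]
feedback = [{"page_id": 1, "image": "B.jpg"}]
remaining = valid_suggestions(suggestions, feedback)
```

Doing this for the whole dataset needs the feedback rows in bulk, which is exactly what the by-pageID partitioning does not give us.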

Change 810282 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/ImageSuggestions@master] Retrieve pages-with-suggestion via Elastic scroll directly

https://gerrit.wikimedia.org/r/810282

I have a working solution using Elastic's scroll (https://gerrit.wikimedia.org/r/810282) that does seem to be able to get us past 10k results.

Ideally we'd have a fix that doesn't rely on using search, since that likely won't be an option when it comes to doing this for section topics.
But for now (pending CR and actual use in prod...), the heat is off and we can take the time to work on that more robust long-term solution.

Change 810282 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@master] Retrieve pages-with-suggestion via Elastic scroll directly

https://gerrit.wikimedia.org/r/810282

Change 810889 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/ImageSuggestions@wmf/1.39.0-wmf.18] Retrieve pages-with-suggestion via Elastic scroll directly

https://gerrit.wikimedia.org/r/810889

Change 810889 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@wmf/1.39.0-wmf.18] Retrieve pages-with-suggestion via Elastic scroll directly

https://gerrit.wikimedia.org/r/810889

Mentioned in SAL (#wikimedia-operations) [2022-07-05T07:21:38Z] <urbanecm@deploy1002> Synchronized php-1.39.0-wmf.18/extensions/ImageSuggestions/maintenance/SendNotificationsForUnillustratedWatchedTitles.php: d5050b773992aa6100aa14cd328836ff336ef8c1: Retrieve pages-with-suggestion via Elastic scroll directly (T311476) (duration: 03m 32s)

Confirmed working in prod, was able to scroll through 130k+ results.