We plan to bulk load ORES drafttopic scores to ElasticSearch via T240556: Load ORES articletopic data into ElasticSearch via the weekly bulk update, but that will have a delay of up to one week, which is not ideal in some cases (especially for recently created articles; presumably for mature articles the topic scores don't change much over time). To fix that, hook into the ES update logic in MediaWiki (via ContentHandler::getDataForSearchIndex, probably), pull the score from the ORES service and add it to the ES document.
Document updates happen much more often than edits (e.g. via template edits), so a naive implementation might overload the ORES service. We need to figure out what load is acceptable and limit this functionality accordingly. Maybe we can get away with only doing it on real edits; otherwise we could limit it to new / recent articles, or to wikis where GrowthExperiments is deployed.
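To make the limiting options concrete, here is a minimal sketch of the gating decision, written in Python for brevity rather than as MediaWiki code. Everything named here is an assumption for illustration: `UpdateContext`, `should_fetch_topic_scores`, the wiki list and the 30-day cutoff are hypothetical, and the three checks are independent, so any of them could be dropped once we know what load ORES can take.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class UpdateContext:
    """Hypothetical description of one search document update."""
    wiki: str
    caused_by_edit: bool      # False for indirect updates, e.g. template edits
    page_created: datetime    # creation time of the page being reindexed


# Hypothetical configuration values, not taken from any real deployment.
ENABLED_WIKIS = frozenset({"cswiki", "kowiki"})
MAX_PAGE_AGE = timedelta(days=30)


def should_fetch_topic_scores(ctx: UpdateContext, now: datetime) -> bool:
    """Return True if this update should trigger a call to ORES."""
    if ctx.wiki not in ENABLED_WIKIS:
        return False          # option: only selected wikis
    if not ctx.caused_by_edit:
        return False          # option: only real edits
    # option: only new / recent articles
    return now - ctx.page_created <= MAX_PAGE_AGE
```

With this shape, relaxing the policy is a matter of removing one check, e.g. dropping the page age test if ORES turns out to handle all real edits.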
The code would either live in MediaWiki-extensions-ORES (where the search keyword code will live as well) or GrowthExperiments (if we limit it to those wikis).
Open questions:
- if we limit to new articles, and for non-new articles omit the field from ContentHandler::getDataForSearchIndex, will the old value be preserved in ES?
- can ORES latency be a problem? The call will be made from a job, so 1-2 seconds is acceptable; tens of seconds, probably not so much.
- if we can do this for all articles on all wikis (and only filter out indirect updates that happen due to template edits and such), do we even need T240556: Load ORES articletopic data into ElasticSearch via the weekly bulk update & co? (ORES would have to score each edit anyway for T240549: Configure ORES to publish new drafttopic scores to Kafka, so if it caches the results, querying it on every edit should not make a difference.)
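
One way to keep the latency question bounded is to give the ORES call a fixed time budget and leave the field out of the document when the budget is exceeded. The sketch below, in Python for brevity, assumes a hypothetical `fetch` callable that raises `TimeoutError` when the budget runs out, a hypothetical field name `topic_scores`, and a 2 second budget taken from the 1-2 second figure above. It only shows the producing side; whether ES keeps the old value when the field is omitted is still the open question above.

```python
from typing import Callable, Dict

# Hypothetical signature: (revision id, timeout in seconds) -> topic scores.
ScoreFetcher = Callable[[int, float], Dict[str, float]]

LATENCY_BUDGET_S = 2.0  # assumed budget, based on "1-2 seconds is acceptable"


def topic_fields(rev_id: int, fetch: ScoreFetcher,
                 budget_s: float = LATENCY_BUDGET_S) -> Dict[str, Dict[str, float]]:
    """Return the extra fields to add to the search document.

    An empty dict means the score was not available within the budget,
    so the caller adds nothing to the document.
    """
    try:
        scores = fetch(rev_id, budget_s)
    except TimeoutError:
        return {}
    return {"topic_scores": scores}
```

A caller would merge the result into the rest of the document data, so a slow ORES response delays the job by at most the budget instead of blocking it.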