It should be possible to filter articles by topic when using the normal wiki search, based on ORES articletopic (link needs to be updated) scores. The immediate use case for this is NewcomerTasks 1.1, which involves filtering tasks by topic; but it seems like a widely useful capability in general, both for readers and for tools (especially considering the Product plans about neighborhoods).
The high-level plan is to push ORES articletopic scores into ElasticSearch on two (+1) channels:
- Have ORES calculate the new score on edit (changeprop already supports this), and (via EventGate) push it to Kafka and collate into HDFS; use it in the weekly bulk update of the search index.
- To keep the scores more up to date, after each edit fetch and apply the ORES score in the index update MediaWiki job. This might be beyond the capacity of ORES, so we might want to limit it to some manageable subset (like newly created pages or wikis where Newcomer Tasks is enabled), and hope that a one-week delay is not too much of a usability problem for the rest (presumably most edits don't affect the topic scores much).
- To ensure there's data about every article in HDFS, even if it has never been edited since this functionality was deployed, do a one-time job of going through all pages on all wikis and pushing their score to Kafka.
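Since several channels can produce a score for the same page, whatever collates the events needs to keep only the newest one per page before the bulk update runs. A minimal sketch of that collation step, assuming a hypothetical event shape with `wiki`, `page_id`, `rev_id` and `scores` keys (the real event schema is not specified here):

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event shape, for illustration only:
# {"wiki": "enwiki", "page_id": 1, "rev_id": 12, "scores": {"music": 0.8}}
ScoreEvent = dict


def latest_scores(events: Iterable[ScoreEvent]) -> Dict[Tuple[str, int], ScoreEvent]:
    """Keep only the highest-revision score event for each (wiki, page_id)."""
    latest: Dict[Tuple[str, int], ScoreEvent] = {}
    for event in events:
        key = (event["wiki"], event["page_id"])
        current = latest.get(key)
        # Revision IDs grow over time, so a larger rev_id means a newer score.
        if current is None or event["rev_id"] > current["rev_id"]:
            latest[key] = event
    return latest
```

Keying by page ID rather than title would avoid having to handle page moves, but that choice is still open.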
On ElasticSearch the scores would then be put into a poor man's sparse vector field (real sparse vectors are in the nonfree part of ElasticSearch), with topic names and scores being represented as document words and word frequencies, and the topic could be queried via tf-idf. This search functionality would be exposed via some search keyword like topic:.
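To illustrate the encoding, here is a rough sketch of turning a score map into repeated terms, so that the term frequency approximates the score. The function name, the `scale` and `min_score` parameters, and the topic names in the usage note are all made up for the example; the actual encoding is still to be decided.

```python
from typing import Dict, List


def scores_to_terms(scores: Dict[str, float], scale: int = 10,
                    min_score: float = 0.1) -> List[str]:
    """Encode topic scores as repeated terms (term frequency ~ score).

    `scale` and `min_score` are illustrative knobs, not agreed values.
    """
    terms: List[str] = []
    for topic in sorted(scores):
        score = scores[topic]
        if score < min_score:
            # Drop near-zero topics to keep the field sparse.
            continue
        # Repeat the topic name so its frequency reflects the score.
        repeats = max(1, round(score * scale))
        terms.extend([topic] * repeats)
    return terms
```

For example, `scores_to_terms({"music": 0.82, "history": 0.3, "sports": 0.04})` yields `history` three times and `music` eight times, and drops `sports`. The precision of the stored score is bounded by `scale`, which is the trade-off for not having real sparse vectors.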
Currently ORES can only score English Wikipedia; that's planned to be fixed soonish, but for the interim period we will just fake scores for other wikis, using the score from the English interwiki article. (Details to be specified.)
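The details are still to be specified, but the fallback could look roughly like the sketch below. It assumes two hypothetical in-memory lookups: `interwiki_to_en`, mapping a local page to its English Wikipedia title, and `enwiki_scores`, mapping English titles to score maps. Neither corresponds to an existing API.

```python
from typing import Dict, Optional, Tuple


def fake_scores(
    wiki: str,
    title: str,
    interwiki_to_en: Dict[Tuple[str, str], str],
    enwiki_scores: Dict[str, Dict[str, float]],
) -> Optional[Dict[str, float]]:
    """Return the English article's topic scores for a non-English page.

    Returns None when the page has no English counterpart or that
    counterpart has no score yet.
    """
    if wiki == "enwiki":
        return enwiki_scores.get(title)
    en_title = interwiki_to_en.get((wiki, title))
    if en_title is None:
        return None
    return enwiki_scores.get(en_title)
```

Pages without an English interwiki link simply get no score in this sketch, so they would not match any topic filter until local scoring is available.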
More specifically, the concrete steps to implement the feature are:
| Step | Description | Task | Owner |
| --- | --- | --- | --- |
| 1 | Configure ORES to publish new articletopic scores to a Kafka topic when notified by changeprop about a new revision. | T240549 | Scoring |
| 2 | Configure the new ElasticSearch field. | T240550 | Search |
| 3 | Configure EventGate to consume the Kafka queue and store the data on HDFS (and merge with existing data by title, or maybe page ID). | T240553 | no-op? |
| 3.5 | Copy English Wikipedia articletopic scores to other wikis. | T241015 | ? |
| 4 | Go through all Wikipedia articles one time, score them and push the score to Kafka. | T243357 | Growth |
| 5 | Configure the weekly ElasticSearch bulk update job to pull the data from HDFS. | T240556 | Search |
| 6 | Make the ORES extension hook into CirrusSearchAddQueryFeatures and provide the ES logic for the topic: keyword. (Needs discussion, there is probably more than one way to implement this.) | T240559 | Growth (with support from Search) |
| 7 | Make some extension (ORES? GrowthExperiments?) hook into ContentHandler::getDataForSearchIndex (?), fetch the scores from the ORES service and add them to the ES document. | T240558 | Growth (with support from Search) |
| 8 | Implement articletopic score for all wikis. | T235181 (?) | Scoring |
| 9 | Repeat step 4, now with the local articletopic scores for all wikis. | | Growth |
| 10 | (optional) Integrate with AdvancedSearch extension. | T245905 | ? |
| 11 | (optional) Integrate with recent changes. | T245906 | ? |
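As a starting point for the discussion on step 6, here is one possible shape for the topic: keyword, sketched in Python rather than as the actual CirrusSearch PHP hook. It assumes a hypothetical field name `articletopic` and a `|` separator for multiple topics; both are placeholders. The output is a plain Elasticsearch `bool` query with one `term` clause per topic.

```python
import re
from typing import Dict, List, Tuple

# Matches "topic:" followed by a run of non-space characters.
TOPIC_RE = re.compile(r"\btopic:(\S+)")


def parse_topic_keyword(query: str) -> Tuple[str, List[str]]:
    """Split a search string into the remaining text and requested topics."""
    topics: List[str] = []
    for match in TOPIC_RE.finditer(query):
        # "|" as an OR separator is an assumption made for this sketch.
        topics.extend(t for t in match.group(1).split("|") if t)
    remaining = " ".join(TOPIC_RE.sub(" ", query).split())
    return remaining, topics


def build_topic_query(topics: List[str], field: str = "articletopic") -> Dict:
    """Build a bool query that matches any of the topics in `field`."""
    return {
        "bool": {
            "should": [{"term": {field: topic}} for topic in topics],
            "minimum_should_match": 1,
        }
    }
```

With this, `parse_topic_keyword("jazz topic:music|history festival")` gives `("jazz festival", ["music", "history"])`. Because the `term` clauses run in query context, articles where the topic term occurs more often (that is, with a higher score) would rank higher.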
The MVP version is steps 1-6, maybe without 4 (or with a limited version of 4 that only includes Newcomer Tasks wikis). The provisional target date for that is mid-January, when Growth plans to deploy topic filtering in Newcomer Tasks. (It's not strictly a blocker for that, though; we'll have a lower-quality alternative search logic, via T240512: Newcomer tasks: Morelike backend for topic matching.)