
The Search/articletopic page at Wikitech appears to be out of date
Closed, Resolved, Public

Description

Earlier today, I wanted to check articletopic scoring for a particular page at dewiki. Since https://ores.wikimedia.org/v3/scores/dewiki/XXX/articletopic (where XXX is the revision ID) always returns "Models ('articletopic',) not available for dewiki", I decided to check how articletopic scores get into Search, and found that Search/articletopic describes this. The page says topic-related information is sent to mediawiki.revision-score-articletopic, so I checked the event.mediawiki_revision_score_articletopic Hive table. Unfortunately, that table appears to be empty:

hive (event)> select count(*) from mediawiki_revision_score_articletopic where year=2024;
MapReduce Total cumulative CPU time: 30 minutes 44 seconds 340 msec
Ended Job = job_1734703658237_11029
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 105  Reduce: 1   Cumulative CPU: 1844.34 sec   HDFS Read: 210253882 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 30 minutes 44 seconds 340 msec
OK
_c0
0
Time taken: 85.165 seconds, Fetched: 1 row(s)
hive (event)>

Since the Hive table corresponding to the documented schema is empty (and the raw data only include health checks), the documentation page appears to be out of date.

Event Timeline

I attempted to find the events by browsing around related links from the docs page. I found the articletopic-outlink service, which produces its data to the mediawiki.page_outlink_topic_prediction_change.v1 stream. The Hive table corresponding to that stream is not empty, and it contained predictions that looked believable. However, I'm not updating the page based on what I found, as I have no idea whether I found the new way of doing things, or something unrelated that shouldn't be mentioned in this page.
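The emptiness check from the description could be repeated for any stream with a small helper. This is only a sketch: it assumes the Hive table name is the stream name with `.` and `-` replaced by `_`, which matches the one pairing given in this task (mediawiki.revision-score-articletopic and mediawiki_revision_score_articletopic). The function names are hypothetical, and the table name derived for the new stream has not been verified here.

```python
import re


def stream_to_hive_table(stream: str) -> str:
    """Guess the Hive table name for an event stream.

    Assumption: every character that is not a letter, digit or
    underscore is replaced by "_".
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", stream)


def count_query(stream: str, year: int) -> str:
    """Build the same count(*) query that was used in the task description."""
    table = stream_to_hive_table(stream)
    return f"select count(*) from {table} where year={int(year)};"


if __name__ == "__main__":
    # Prints a query to paste into a `hive (event)>` prompt.
    print(count_query("mediawiki.page_outlink_topic_prediction_change.v1", 2024))
```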

If someone has tips for checking predictions that are not accessible via ores.wikimedia.org, I'd appreciate them as well.

Gehel triaged this task as Medium priority. Jan 6 2025, 4:31 PM
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
dcausse moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board.
dcausse subscribed.

Thanks for pointing this out. You were correct: mediawiki.page_outlink_topic_prediction_change.v1 is indeed the new stream being populated and used by the search update pipeline. I updated the doc with new links and stream names. I think predictions can be run by calling Lift Wing; from https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Language_agnostic_link-based_article_topic:

curl https://api.wikimedia.org/service/lw/inference/v1/models/outlink-topic-model:predict -X POST -d '{"page_title": "Frida_Kahlo", "lang": "en", "threshold": 0.1}' -H "Content-type: application/json"

This should generate the predictions, but I believe that using pre-computed predictions from Hive, where possible, is better.
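For scripted debugging, the same request could be made from Python. The sketch below mirrors the curl call (same URL, payload fields and content type) using only the standard library. The function names and the User-Agent string are placeholders that are not part of this task; the response is returned as parsed JSON without assuming anything about its structure.

```python
import json
import urllib.request

# URL taken verbatim from the curl example in this task.
LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/models/"
    "outlink-topic-model:predict"
)


def build_request(page_title: str, lang: str, threshold: float = 0.1) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"page_title": page_title, "lang": lang, "threshold": threshold}
    return urllib.request.Request(
        LIFTWING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder value: replace with a descriptive User-Agent of your own.
            "User-Agent": "articletopic-debug-sketch/0.1 (placeholder)",
        },
        method="POST",
    )


def fetch_prediction(page_title: str, lang: str, threshold: float = 0.1, timeout: float = 30.0):
    """Send the request and return the decoded JSON response as-is."""
    request = build_request(page_title, lang, threshold)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_prediction("Frida_Kahlo", "en")` performs the network request; `build_request` alone can be used to inspect what would be sent.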

Thanks for the update, this is very useful to know. The API works for me, and I can also see the data in the new stream. FTR, I wasn't building any automation around topics, I was checking a possible bug in GrowthExperiments and wanted to know what the topics are.

Sure. If it's for debugging, you might be interested in some Cirrus debug tools as well. For instance, appending ?action=cirrusDump to any article page URL should output the content that is indexed (weighted_tags included), e.g. https://en.wikipedia.org/wiki/Frida_Kahlo?action=cirrusDump. If you spot differences between the Lift Wing online prediction and this content, then it's possible that something got lost in the event machinery.
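A rough sketch of how that comparison could be prepared in Python follows. It builds the dump URL and pulls weighted_tags out of an already-downloaded dump. The function names are hypothetical, and the dump layout (a JSON list of documents, each with a `_source` object holding a `weighted_tags` list) is an assumption that should be checked against real output; the tag strings are returned unparsed because their exact format is not described in this task.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cirrus_dump_url(article_url: str) -> str:
    """Return the article URL with action=cirrusDump set in the query string."""
    parts = urlsplit(article_url)
    # Keep existing parameters, but drop any previous "action" value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "action"
    ]
    query.append(("action", "cirrusDump"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def extract_weighted_tags(dump: list) -> list:
    """Collect weighted_tags from a parsed dump.

    Assumption: `dump` is a list of dicts shaped like
    {"_source": {"weighted_tags": [...]}}; missing keys yield no tags.
    """
    tags = []
    for document in dump:
        tags.extend(document.get("_source", {}).get("weighted_tags", []))
    return tags


if __name__ == "__main__":
    print(cirrus_dump_url("https://en.wikipedia.org/wiki/Frida_Kahlo"))
```

The extracted tags can then be compared by eye, or with further parsing, against the topics returned by the online prediction.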

Thanks for the tip! The dumping URL is useful to know about.