Add outlink topic model predictions to CirrusSearch indices
Closed, Resolved · Public · 5 Estimated Story Points

Description

The outlink topic model can do topic detection in all languages. We should use this data, which could remove the need to import ORES articletopics model predictions.

The search jobs should push outlink predictions to the existing articletopics weighted_tags prefix (the set of topics is the same):

  • the search jobs should consume events from the dedicated mediawiki.revision_score_$model stream (blocked on T328576)
  • there should be no impact on the articletopics search keyword
  • the predictions made by ORES articletopics will be slowly replaced by the outlink model ones as new edits are made to existing pages
  • thresholds will be set statically in the code-base to 0.5 instead of being fetched from the ORES API
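
As a minimal sketch of the static-threshold idea above: the 0.5 value comes from the task description, but the function names, the example prefix, the topic names, the `>=` comparison and the `prefix/topic|weight` string layout are all assumptions made for illustration, not the actual discolytics or CirrusSearch code.

```python
from typing import Dict, List

# Static threshold named in the task description.
STATIC_THRESHOLD = 0.5

# Hypothetical prefix, used only for this example.
TAG_PREFIX = "articletopic.example"


def select_topics(scores: Dict[str, float], threshold: float = STATIC_THRESHOLD) -> Dict[str, float]:
    """Keep only topics whose score meets the threshold (>= is an assumption)."""
    return {topic: score for topic, score in scores.items() if score >= threshold}


def to_weighted_tags(scores: Dict[str, float], prefix: str = TAG_PREFIX) -> List[str]:
    """Render the kept topics as 'prefix/topic|weight' strings, weight scaled to 0-1000.

    The string layout is illustrative, not the confirmed index format.
    """
    kept = select_topics(scores)
    return [f"{prefix}/{topic}|{int(round(score * 1000))}" for topic, score in sorted(kept.items())]


# Illustrative scores; topic names are placeholders.
example = {"Culture.Biography": 0.93, "History_and_Society.History": 0.48, "STEM.Physics": 0.5}
tags = to_weighted_tags(example)
```

Keeping the threshold as a single constant means no call to an external API is needed at ingestion time, which is the simplification the description asks for.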

AC:

  • outlink topic model predictions are pushed to the CirrusSearch indices and are queryable (via the existing articletopics keyword or a new one)
  • crosswiki propagation is no longer required and can be removed from the discolytics codebase.
    • The ORES drafttopic model will still be consumed as is, but it should not be a reason to keep the crosswiki propagation even if it currently relies on it (we could even consider using it only to populate the draft namespace).

Details

Title | Reference | Author | Source Branch | Dest Branch
search: Change articletopic source to the outlink model | repos/data-engineering/airflow-dags!448 | ebernhardson | work/ebernhardson/outlink | main
ores: Adjust ingestion for changing articletopic ingestion to outlink model | repos/search-platform/discolytics!28 | ebernhardson | work/ebernhardson/outlink | main

Event Timeline

Restricted Application added a subscriber: Aklapper.

@AikoChou You might be able to answer some questions

Echoing that I would love Aiko's thoughts as well. My input:

Do we want to replace and reuse the existing articletopics weighted_tags prefix with the data obtained from this model? Existing ORES predictions would stay until the page is edited and the new predictions are pushed.

  • @KStoller-WMF and I discussed this a while back and concluded that we were both fine with allowing existing ORES predictions to remain until an edit is made to the article and they are overwritten (no backfill). This should be much simpler, the existing ORES model is generally fine just limited in scope, and we can't think of any use-case where it's important to do a full refresh beyond potential confusion about the sources of erroneous predictions.

The articletopics model requires some thresholds to be computed to better filter the predictions (https://gist.github.com/halfak/630dc3fd811995c2a0260d43da462645). Does the outlink topic model require something similar, or can we simplify this by using static thresholds for all predictions (e.g. hardcode 0.5)?

We've been hard-coding 0.5 in part for simplicity and because that's how the model is trained. I think thresholds are less of an issue for the outlinks model than the original ORES models because my experience is that the new model tends to have a bit better separation (a lot of scores above 0.9 or below 0.1). If we begin to see issues with that (there are definitely some topics under History/Society where it's more common to see scores hovering around 0.5 and lower thresholds might be beneficial), I can look into something similar to that script for tuning thresholds (thanks for the link to the gist).

We also import the ORES drafttopics model predictions, can they be replaced with the outlink topic model ones?

Good question about drafttopic. I have not done formal testing but thoughts below:

  • Context: drafttopic has identical goals to articletopic or outlinks but is fine-tuned to make predictions for first drafts of articles. First drafts are also when articles are least likely to contain links, which are what the new outlinks model uses for predictions. That means that the outlinks model does perform worse on these articles -- FYI some performance stats. And these articles are important because they're likely less discoverable in general so finding a good solution to handle them would be very nice. The drafttopic model, however, only works for English Wikipedia so it's not a panacea.
  • Potential approaches:
    • Do nothing: run both models and I presume use the union of their predictions?
      • Benefits: minimal changes to existing workflow.
      • Drawbacks: doesn't reduce the maintenance debt that much or help with draft articles in other languages.
    • Simplify: get rid of drafttopic model.
      • Benefit: single model for topic suggestions makes maintenance way easier.
      • Drawbacks: performance will suffer for low-link articles. I also don't know if there are other end-users of drafttopic that would be impacted.
    • Compromise: set up a basic decision tree for which model to use -- e.g., if <5 links on enwiki, use drafttopic, else use outlinks.
      • Benefits: probably improved coverage for English Wikipedia and we only run drafttopic when it's most likely to help.
      • Drawbacks: maintaining multiple models still on LiftWing and doesn't help languages outside of English.
    • Have cake and eat it too (?): for many languages, we have add-a-link models (see T307881) that will take articles and predict what links could be added. We could do something similar to the option above but instead of using drafttopic, we could add a separate pre-processing pipeline that instead predicted links for an article if there aren't enough already and use those predicted links to make the predictions. We might want to do some testing if we go down this route and @AikoChou in particular I'm curious whether you think this is feasible. It wouldn't get us full language coverage but there's separate work to extend add-a-link to many many languages so at least we benefit from that for free essentially.
      • Benefits: in theory great language coverage and good predictions for low-link articles. Could potentially deprecate drafttopic if not being used elsewhere. If we codify this within the outlinks model preprocessing on LiftWing, Search doesn't have to know anything about it so easier maintenance.
      • Disadvantages: untested and new external dependency of add-a-link model outputs for small proportion of articles.
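
The "Compromise" option above can be sketched as a tiny routing function. This is only an illustration of the rule as stated in the comment (fewer than 5 links on enwiki goes to drafttopic, everything else to outlinks); the function name, the return labels and the idea of passing a precomputed link count are hypothetical.

```python
def choose_model(wiki: str, outlink_count: int, min_links: int = 5) -> str:
    """Pick which topic model to run for a page.

    drafttopic only supports English Wikipedia, so it is used only for
    enwiki pages with too few links for the outlink model to work well.
    """
    if wiki == "enwiki" and outlink_count < min_links:
        return "drafttopic"
    return "outlink"


# A new enwiki stub with 2 links vs. a frwiki stub with 2 links.
enwiki_stub = choose_model("enwiki", 2)
frwiki_stub = choose_model("frwiki", 2)
```

The sketch also makes the stated drawback visible: low-link pages outside enwiki still fall through to the outlink model.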

I personally prefer the final approach as it feels the simplest in terms of maintenance and also the best coverage. The logic for it would be implemented within the LiftWing pre-processing scripts so Search could be agnostic to how it works too and implementing it is not a blocker to the switch-over.

have migration steps for the Growth team if a new keyword is required or if the set of predictable topics is different

Just a note of thanks for including this -- I personally would like to make some changes eventually to the model topics but am holding off on that to keep this migration simple.

Hi!

For the first and second questions, Isaac has answered. For the question about the drafttopic model, considering the outlinks model performs worse on low-link/new articles, I think we can keep the drafttopic model for now (although it only supports enwiki).

@Isaac - currently we haven’t deployed add-a-link models on Lift Wing, so it increases the complexity of the final approach you mentioned, like where to get the add-a-link outputs for the need of the outlinks model. If we have add-a-link models in Lift Wing, I think the approach is feasible and a smart way to enhance the prediction for low-link articles in the outlinks model, and also an interesting use case the ML team want to cover. For the task, I think we can continue to use drafttopic for now (your first approach) and maybe put the final approach as future work.

@dcausse - does the current pipeline use the union of the predictions from articletopic and drafttopic model? or how does it aggregate the two predictions for the same article?

To achieve the goal of this ticket, the ML team will need to complete T328576 first. In that task, changeprop will switch from calling the ORES precache API to calling the Lift Wing API, and Lift Wing will generate new streams and send them to EventGate. The new streams are more granular mediawiki.revision_score_$model streams, which will eventually produce multiple event.mediawiki_revision_score_$model tables in HDFS that can then be used for the CirrusSearch use case.

The change for changeprop will be applied to the revscoring models on Lift Wing (including goodfaith, damage, articlequality, articletopic, drafttopic, ...) and the new language-agnostic models (currently the outlink topic model and the revert-risk model).

So the data flow for the CirrusSearch articletopic use case will become:
wiki edit -> changeprop -> Lift Wing -> EventBus -> HDFS -> Airflow / Spark (-> cross-wiki propagation) -> Elasticsearch

@dcausse - does the current pipeline use the union of the predictions from articletopic and drafttopic model? or how does it aggregate the two predictions for the same article?

Currently the two sets of predictions are kept and the user can use one or the other: searching for articletopic:biography vs. searching for drafttopic:biography. As far as I understand, the drafttopic keyword was requested for searching by topic in the Draft namespace (T249341), for which the articletopic model does not work. If the outlink model is not enabled on the Draft namespace, it might make sense to leave this problem out for now and not think too much about how to replace the ORES drafttopic model yet.
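
For readers unfamiliar with the two keywords, here is a small sketch of how such a search could be issued through the standard MediaWiki search API (`action=query&list=search`). It only builds the request URL and performs no network call; the helper name and the choice of `en.wikipedia.org` as host are assumptions for illustration.

```python
from urllib.parse import urlencode


def topic_search_url(host: str, keyword: str, topic: str, limit: int = 10) -> str:
    """Build a MediaWiki search API URL for a topic keyword query.

    keyword is expected to be 'articletopic' or 'drafttopic', as named in the thread.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"{keyword}:{topic}",
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


article_url = topic_search_url("en.wikipedia.org", "articletopic", "biography")
draft_url = topic_search_url("en.wikipedia.org", "drafttopic", "biography")
```

Because the outlink predictions are written under the existing articletopics prefix, a query like the first one should not need to change after the switch.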

So the data flow for the CirrusSearch articletopic use case will become:
wiki edit -> changeprop -> Lift Wing -> EventBus -> HDFS -> Airflow / Spark (-> cross-wiki propagation) -> Elasticsearch

The part around cross-wiki propagation is what I'm hoping we can remove using the outlink model.

I think the question that remains is:

  • do we want to repopulate the indices with the predictions of the outlink model (requires a backfill by re-running the model on all the pages and importing the output to Elasticsearch)
  • are we fine having the outlink predictions slowly replace the articletopics ones

Thanks!

If the outlink model is not enabled on the Draft namespace, it might make sense to leave this problem out for now and not think too much about how to replace the ORES drafttopic model yet.

Thanks for the explanation. The outlink model is used for pages in the main/article namespace (which is the source of the training data), not for the draft namespace yet, so I agree with the point, we can leave the problem out for now.

The part around cross-wiki propagation is what I'm hoping we can remove using the outlink model.

Yes, if we use the outlink topic model, we don’t need cross-wiki propagation as it supports all languages.

  • do we want to repopulate the indices with the predictions of the outlink model (requires a backfill by re-running the model on all the pages and importing the output to Elasticsearch)
  • are we fine having the outlink predictions slowly replace the articletopics ones

I agree with what Isaac said — no backfill as the existing ORES predictions are fine, and we're fine having the outlink predictions slowly replace articletopics ones.

thanks for the extra details @achou and @dcausse! the clear delineation of outlinks for namespace 0 and drafttopic for draft namespace also makes a lot of sense to me and conveniently seems to be the easiest thing to do as well.

currently we haven’t deployed add-a-link models on Lift Wing, so it increases the complexity of the final approach you mentioned, like where to get the add-a-link outputs for the need of the outlinks model. If we have add-a-link models in Lift Wing, I think the approach is feasible and a smart way to enhance the prediction for low-link articles in the outlinks model, and also an interesting use case the ML team want to cover.

@achou, not urgent, but perhaps we can pick this up in the next ML:Research sync. I'd be curious to learn more. Sounds like we can do that outside this scope of work too as it would be a silent change from Search's perspective.

Gehel triaged this task as High priority. · Feb 13 2023, 4:25 PM
Gehel moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.

Thanks for all the input! I've updated the task description accordingly.

@dcausse Hi! I am working in T328576 to split mediawiki.revision-score into multiple substreams, and I am wondering which ones are really needed. IIUC from reading this task your team would still need drafttopic right? If so we could start from it so that you'll be able to migrate away from ORES' revision-score. Does it make sense?

Moreover, do you folks use or plan to use other ORES revision score streams in the future? I recall that we discussed briefly something about it, but I don't remember the exact list of things needed :)

Hey!
Sure, I think we can certainly start using a dedicated stream for ORES drafttopic predictions once it's running on your side (I might file a dedicated ticket for it). Note that we are in the middle of a Spark 3 migration, so we might start this refactoring after the migration, but please let us know if it's a blocker on your side. Regarding other ORES models, I can't think of anything other than articlequality (ex WP10), but Erik found that it was not very useful for search ranking, and I don't remember anyone requesting these scores to be queryable from the search index, so I don't think we'll add it anytime soon. It's very probable that I'm forgetting something though... @EBernhardson might know if there's another ORES model we have plans to use?

Thanks for the feedback! I am collecting info in T328576 so we can use that task for reference, I'll wait for Erik's recommendation about what streams to keep and what not (for example, if we want articlequality etc..).

Once articletopic transfers to the link-based topic modeling, drafttopic should be the only one we still need, afaik.

I looked into this: progress is being made, but it's not quite ready for us to pick up. The event stream (mediawiki.page_outlink_topic_prediction_change) is currently populated with only canary events.

@EBernhardson the correct topic is eqiad.mediawiki.page_outlink_topic_prediction_change.v1. We have been producing events to it via ChangeProp for the past couple of days, let me know how it looks! We are looking forward to deprecating the mediawiki.revision-score stream :)

ebernhardson updated https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/merge_requests/28

ores: Adjust ingestion for changing articletopic ingestion to outlink model

ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/merge_requests/28

ores: Adjust ingestion for changing articletopic ingestion to outlink model

Mentioned in SAL (#wikimedia-operations) [2023-07-10T16:25:39Z] <ebernhardson@deploy1002> Started deploy [airflow-dags/search@8fa416b]: T328276: Change articletopic source to the outlink model

Mentioned in SAL (#wikimedia-operations) [2023-07-10T16:25:58Z] <ebernhardson@deploy1002> Finished deploy [airflow-dags/search@8fa416b]: T328276: Change articletopic source to the outlink model (duration: 00m 20s)

This has shipped: our pipelines are now reading from event.mediawiki_page_outlink_topic_prediction_change_v1 and event.mediawiki_revision_score_drafttopic. This has reduced the update rate a bit. Before, we had ~35-40k updates flowing per hour; with the new model we are down closer to 18k/hr. That seems sensible: previously we were propagating edits from enwiki to other wikis, whereas now we wait until those wikis have seen an edit, so the wikis we previously propagated to are now likely seeing lower update levels.

This is excellent @EBernhardson and agreed on the update rate meeting expectations, especially because in the past you could have enwiki articles with 100+ sitelinks kicking off 100+ updates every time they were edited. I just confirmed that at least one recently-changed article that doesn't have an English equivalent moved from no topics to the appropriate topic. Thank you very much!
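
The update-rate drop discussed above follows from removing the sitelink fan-out. A toy model, assuming each propagated edit produced one update for the edited page plus one per sitelinked wiki (a simplification, not the actual pipeline logic; the function names and sample numbers are hypothetical):

```python
from typing import List


def updates_with_propagation(sitelink_counts: List[int]) -> int:
    """Each edit updates the edited page plus one page per sitelinked wiki (assumed model)."""
    return sum(1 + n for n in sitelink_counts)


def updates_without_propagation(sitelink_counts: List[int]) -> int:
    """Each edit updates only the page that was edited."""
    return len(sitelink_counts)


# Three hypothetical edits to articles with 0, 3 and 120 sitelinks.
edits = [0, 3, 120]
before = updates_with_propagation(edits)
after = updates_without_propagation(edits)
```

In this toy model a single heavily sitelinked article dominates the propagated total, which matches the intuition in the comment about articles with 100+ sitelinks.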

@KStoller-WMF thanks for the patience -- newcomers should hopefully start seeing a much greater diversity of task recommendations if they have topics selected in non-English wikis as incoming edits propagate the new topics! It'll take a while before we start seeing the impact in a quantitative way but I'll revisit some of my analyses in a few months and in the meantime will start suggesting other Product teams running recommender systems (Language, Android) tap into this awesome resource :)