Add outlink topic model predictions to CirrusSearch indices
Closed, Resolved · Public · 5 Estimated Story Points
Assigned To

Authored By

	dcausse
	Jan 30 2023, 10:14 AM

Description

The outlink topic model can do topic detection in all languages; we should use this data to possibly drop the need to import the ORES articletopics model predictions.

The search jobs should push outlink predictions to the existing articletopics weighted_tags prefix (the set of topics is the same):

- the search jobs should consume events from the dedicated mediawiki.revision_score_$model stream (blocked on T328576)
- there should be no impact on the articletopics search keyword
- the predictions made by ores articletopics will be slowly replaced by the outlink model ones as new edits are made to existing pages
- thresholds will be set statically in the code-base to 0.5 instead of being fetched from the ORES api
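
A minimal sketch of the static-threshold step described in the list above, not the actual discolytics code. The prefix string, the `prefix/topic|weight` layout, the scaling to an integer in 1..1000, the `>=` comparison, and the sample topic names are all assumptions made for illustration; only the 0.5 threshold comes from this task.

```python
from typing import Dict, List

THRESHOLD = 0.5  # static threshold from the task description
TAG_PREFIX = "articletopic"  # hypothetical placeholder for the weighted_tags prefix


def to_weighted_tags(scores: Dict[str, float],
                     threshold: float = THRESHOLD,
                     prefix: str = TAG_PREFIX) -> List[str]:
    """Keep topics scoring at or above the threshold and format them as tag strings."""
    tags = []
    for topic, score in sorted(scores.items()):
        if score >= threshold:  # assumption: the threshold itself is inclusive
            # Assumed encoding: probability scaled to an integer in 1..1000.
            weight = max(1, min(1000, round(score * 1000)))
            tags.append(f"{prefix}/{topic}|{weight}")
    return tags


if __name__ == "__main__":
    sample = {"Culture.Biography": 0.93, "History_and_Society.History": 0.48}
    print(to_weighted_tags(sample))
```

With a single hard-coded threshold there is no per-topic lookup against the ORES API, which is the simplification the last bullet asks for.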

AC:

- outlink topic model predictions are pushed to the CirrusSearch indices and are queryable (via the existing articletopics keyword or a new one)
- crosswiki propagation is no longer required and can be removed from the discolytics codebase.
  - The ores drafttopic model will still be consumed as is, but should not be a reason to keep the crosswiki propagation if it uses it (we could even consider only using it to populate the draft namespace?).

Details

	Title	Reference	Author	Source Branch	Dest Branch
	search: Change articletopic source to the outlink model	repos/data-engineering/airflow-dags!448	ebernhardson	work/ebernhardson/outlink	main
	ores: Adjust ingestion for changing articletopic ingestion to outlink model	repos/search-platform/discolytics!28	ebernhardson	work/ebernhardson/outlink	main

Related Objects

Status	Assigned	Task
Open	None	T312518 Migrate ORES clients to LiftWing
Resolved	EBernhardson	T328276 Add outlink topic model predictions to CirrusSearch indices
Resolved	achou	T328899 Add a new outlink topic stream for EventGate main
Open	achou	T331399 Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page
Resolved	pfischer	T325315 Add support for redirects in CirrusSearch
Resolved	bking	T344366 Rollout Elasticsearch extra plugins package and restart cluster to apply
Resolved	achou	T331401 Design event schema for ML scores/recommendations on current page state

Event Timeline

dcausse created this task.Jan 30 2023, 10:14 AM

Restricted Application added a project: Discovery-Search.Jan 30 2023, 10:14 AM

Restricted Application added a subscriber: Aklapper.

dcausse mentioned this in T317768: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score-<model>.Jan 30 2023, 10:15 AM

@AikoChou You might be able to answer some questions

KStoller-WMF awarded a token.Jan 30 2023, 5:10 PM

KStoller-WMF subscribed.

• kostajh subscribed.Jan 30 2023, 8:30 PM

Echoing that I would love Aiko's thoughts as well. My input:

Do we want to replace and reuse the existing articletopics weighted_tags prefix with the data obtained from this model? Existing ORES predictions will stay until the page is edited and the new predictions are pushed.

@KStoller-WMF and I discussed this a while back and concluded that we were both fine with allowing existing ORES predictions to remain until an edit is made to the article and they are overwritten (no backfill). This should be much simpler, the existing ORES model is generally fine just limited in scope, and we can't think of any use-case where it's important to do a full refresh beyond potential confusion about the sources of erroneous predictions.

The articletopics model requires some thresholds to be computed to better filter the predictions (https://gist.github.com/halfak/630dc3fd811995c2a0260d43da462645). Does the outlink topic model require something similar, or can we simplify this by using static thresholds for all predictions (e.g. hardcode 0.5)?

We've been hard-coding 0.5 in part for simplicity and because that's how the model is trained. I think thresholds are less of an issue for the outlinks model than the original ORES models because my experience is that the new model tends to have a bit better separation (a lot of scores above 0.9 or below 0.1). If we begin to see issues with that (there are definitely some topics under History/Society where it's more common to see scores hovering around 0.5 and lower thresholds might be beneficial), I can look into something similar to that script for tuning thresholds (thanks for the link to the gist).

We also import the ORES drafttopics model predictions, can they be replaced with the outlink topic model ones?

Good question about drafttopic. I have not done formal testing but thoughts below:

Context: drafttopic has identical goals to articletopic or outlinks but is fine-tuned to make predictions for first drafts of articles. First drafts are also when articles are least likely to contain links, which are what the new outlinks model uses for predictions. That means that the outlinks model does perform worse on these articles -- FYI some performance stats. And these articles are important because they're likely less discoverable in general so finding a good solution to handle them would be very nice. The drafttopic model, however, only works for English Wikipedia so it's not a panacea.
Potential approaches:
- Do nothing: run both models and I presume use the union of their predictions?
  - Benefits: minimal changes to existing workflow.
  - Drawbacks: doesn't reduce the maintenance debt that much or help with draft articles in other languages.
- Simplify: get rid of drafttopic model.
  - Benefit: single model for topic suggestions makes maintenance way easier.
  - Drawbacks: performance will suffer for low-link articles. I also don't know if there are other end-users of drafttopic that would be impacted.
- Compromise: set up a basic decision tree for which model to use -- e.g., if <5 links on enwiki, use drafttopic, else use outlinks.
  - Benefits: probably improved coverage for English Wikipedia and we only run drafttopic when it's most likely to help.
  - Drawbacks: maintaining multiple models still on LiftWing and doesn't help languages outside of English.
- Have cake and eat it too (?): for many languages, we have add-a-link models (see T307881) that will take articles and predict what links could be added. We could do something similar to the option above but instead of using drafttopic, we could add a separate pre-processing pipeline that instead predicted links for an article if there aren't enough already and use those predicted links to make the predictions. We might want to do some testing if we go down this route and @AikoChou in particular I'm curious whether you think this is feasible. It wouldn't get us full language coverage but there's separate work to extend add-a-link to many many languages so at least we benefit from that for free essentially.
  - Benefits: in theory great language coverage and good predictions for low-link articles. Could potentially deprecate drafttopic if not being used elsewhere. If we codify this within the outlinks model preprocessing on LiftWing, Search doesn't have to know anything about it so easier maintenance.
  - Disadvantages: untested and new external dependency of add-a-link model outputs for small proportion of articles.
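
To make the "Compromise" option above concrete, here is a small sketch of the decision rule. It is only an illustration: the function name and return labels are hypothetical, and the cutoff of 5 links is the example value from the bullet, not a tested threshold.

```python
MIN_LINKS = 5  # example cutoff from the "Compromise" bullet, untested


def choose_model(wiki: str, outlink_count: int, min_links: int = MIN_LINKS) -> str:
    """Return which model to query for a page under the 'Compromise' option."""
    # drafttopic only works for English Wikipedia, so it is only a candidate there.
    if wiki == "enwiki" and outlink_count < min_links:
        return "drafttopic"
    # Everything else goes to the language-agnostic outlinks model.
    return "outlink"


if __name__ == "__main__":
    print(choose_model("enwiki", 2))   # low-link English page
    print(choose_model("frwiki", 2))   # low-link page on another wiki
```

The sketch also shows the drawback listed for this option: a low-link page outside enwiki still falls through to the outlinks model.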

I personally prefer the final approach as it feels the simplest in terms of maintenance and also the best coverage. The logic for it would be implemented within the LiftWing pre-processing scripts so Search could be agnostic to how it works too and implementing it is not a blocker to the switch-over.

have migration steps for the Growth team if a new keyword is required or if the set of predictable topics is different

Just a note of thanks for including this -- I personally would like to make some changes eventually to the model topics but am holding off on that to keep this migration simple.

Hi!

For the first and second questions, Isaac has answered. For the question about the drafttopic model, considering the outlinks model performs worse on low-link/new articles, I think we can keep the drafttopic model for now (although it only supports enwiki).

@Isaac - currently we haven’t deployed add-a-link models on Lift Wing, so it increases the complexity of the final approach you mentioned, like where to get the add-a-link outputs for the need of the outlinks model. If we have add-a-link models in Lift Wing, I think the approach is feasible and a smart way to enhance the prediction for low-link articles in the outlinks model, and also an interesting use case the ML team want to cover. For the task, I think we can continue to use drafttopic for now (your first approach) and maybe put the final approach as future work.

@dcausse - does the current pipeline use the union of the predictions from articletopic and drafttopic model? or how does it aggregate the two predictions for the same article?

To achieve the goal of this ticket, the ML team will need to complete T328576 first, in which changeprop will switch from calling the ORES precache API to calling the Lift Wing API, and Lift Wing will generate new streams and send them to EventGate. The new streams are more granular mediawiki.revision_score_$model streams, which will eventually produce multiple event.mediawiki_revision_score_$model tables in HDFS that can then be used for the CirrusSearch use case.

The change for changeprop will be applied to the revscoring models on Lift Wing (including goodfaith, damage, articlequality, articletopic, drafttopic, ...) and new language-agnostic models (currently we have outlink topic model and revert-risk model).

So the data flow for the CirrusSearch articletopic use case will become:
wiki edit -> changeprop -> Lift Wing -> EventBus -> HDFS -> Airflow / Spark (-> cross-wiki propagation) -> Elasticsearch
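
The comment above pairs each mediawiki.revision_score_$model stream with an event.mediawiki_revision_score_$model table. A rough sketch of that naming relationship follows; the rule used here (replace every character that is not a letter, digit or underscore with `_`, lowercase, prefix the database name) is inferred from those two names and is an assumption, not the documented ingestion logic.

```python
import re


def hive_table_for_stream(stream: str, database: str = "event") -> str:
    """Guess the Hive table name for an event stream (assumed naming rule)."""
    # Dots and hyphens are not usable in table names, so map them to underscores.
    table = re.sub(r"[^A-Za-z0-9_]", "_", stream).lower()
    return f"{database}.{table}"


if __name__ == "__main__":
    for model in ("articletopic", "drafttopic"):
        print(hive_table_for_stream(f"mediawiki.revision_score_{model}"))
```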

achou added a project: Machine-Learning-Team.Feb 3 2023, 3:17 PM

In T328276#8585466, @achou wrote:

@dcausse - does the current pipeline use the union of the predictions from articletopic and drafttopic model? or how does it aggregate the two predictions for the same article?

Currently the two sets of predictions are kept and the user can use one or the other: searching for articletopic:biography vs searching for drafttopic:biography. As far as I understood, the drafttopic keyword was requested for searching by topic on the Draft (T249341) namespace, for which the articletopic model does not work. If the outlink model is not enabled on the Draft namespace, it might make sense to leave this problem out for now and not think too much about how to replace the ORES drafttopic model yet.
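
For readers who have not used the two keywords, a hedged sketch of how the searches mentioned above could be issued through the MediaWiki search API. It only builds the URLs and makes no request. The endpoint, the helper name, and the namespace numbers (0 for articles, 118 for Draft on English Wikipedia) are assumptions for illustration.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the example


def topic_search_url(keyword: str, topic: str, namespace: int, api: str = API) -> str:
    """Build a list=search URL for a topic keyword such as articletopic:biography."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"{keyword}:{topic}",
        "srnamespace": namespace,
        "format": "json",
    }
    return f"{api}?{urlencode(params)}"


if __name__ == "__main__":
    print(topic_search_url("articletopic", "biography", 0))    # main namespace
    print(topic_search_url("drafttopic", "biography", 118))    # Draft namespace (assumed id)
```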

So the data flow for the CirrusSearch articletopic use case will become:
wiki edit -> changeprop -> Lift Wing -> EventBus -> HDFS -> Airflow / Spark (-> cross-wiki propagation) -> Elasticsearch

The part around cross-wiki propagation is what I'm hoping we can remove using the outlink model.

I think the question that remains is:

- do we want to repopulate the indices with the predictions of the outlink model (requires a backfill by re-running the model on all the pages and importing the output to Elasticsearch)?
- are we fine having the outlink predictions slowly replace the articletopics ones?
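
The second option (no backfill) can be pictured with a toy in-memory index. This is only a sketch of the overwrite-on-edit behaviour being asked about; the dictionary layout, the function name, and the topic labels are hypothetical and do not reflect how CirrusSearch stores weighted_tags.

```python
from typing import Dict, List

# Toy index: page title -> tag prefix -> list of topics (hypothetical layout).
Index = Dict[str, Dict[str, List[str]]]


def apply_edit(index: Index, page: str, outlink_topics: List[str],
               prefix: str = "articletopic") -> None:
    """Overwrite one page's topics under the prefix; other pages keep their ORES data."""
    index.setdefault(page, {})[prefix] = list(outlink_topics)


if __name__ == "__main__":
    index: Index = {
        "Page A": {"articletopic": ["old ORES topic"]},
        "Page B": {"articletopic": ["old ORES topic"]},
    }
    apply_edit(index, "Page A", ["new outlink topic"])  # only Page A was edited
    print(index)
```

Pages that are never edited keep their ORES predictions indefinitely, which is the trade-off of skipping the backfill.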

Thanks!

If the outlink model is not enabled on the Draft namespace, it might make sense to leave this problem out for now and not think too much about how to replace the ORES drafttopic model yet.

Thanks for the explanation. The outlink model is used for pages in the main/article namespace (which is the source of the training data), not for the draft namespace yet, so I agree with the point, we can leave the problem out for now.

The part around cross-wiki propagation is what I'm hoping we can remove using the outlink model.

Yes, if we use the outlink topic model, we don’t need cross-wiki propagation as it supports all languages.

do we want to repopulate the indices with the predictions of the outlink model (requires a backfill by re-running the model on all the pages and importing the output to Elasticsearch)

are we fine having the outlink predictions to slowly replace articletopics ones

I agree with what Isaac said — no backfill as the existing ORES predictions are fine, and we're fine having the outlink predictions slowly replace articletopics ones.

achou mentioned this in T328899: Add a new outlink topic stream for EventGate main.Feb 6 2023, 8:58 AM

achou mentioned this in T315994: Connect Outlink topic model to eventgate.Feb 6 2023, 12:14 PM

achou moved this task from Unsorted to Watching on the Machine-Learning-Team board.Feb 7 2023, 3:25 PM

thanks for the extra details @achou and @dcausse! the clear delineation of outlinks for namespace 0 and drafttopic for draft namespace also makes a lot of sense to me and conveniently seems to be the easiest thing to do as well.

currently we haven’t deployed add-a-link models on Lift Wing, so it increases the complexity of the final approach you mentioned, like where to get the add-a-link outputs for the need of the outlinks model. If we have add-a-link models in Lift Wing, I think the approach is feasible and a smart way to enhance the prediction for low-link articles in the outlinks model, and also an interesting use case the ML team want to cover.

@achou, not urgent, but perhaps we can pick this up in the next ML:Research sync. I'd be curious to learn more. Sounds like we can do that outside this scope of work too as it would be a silent change from Search's perspective.

Gehel triaged this task as High priority.Feb 13 2023, 4:25 PM

Gehel moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.

Thanks for all the input! I've updated the task description accordingly.

@dcausse Hi! I am working in T328576 to split mediawiki.revision-score into multiple substreams, and I am wondering which ones are really needed. IIUC from reading this task your team would still need drafttopic right? If so we could start from it so that you'll be able to migrate away from ORES' revision-score. Does it make sense?

Moreover, do you folks use or plan to use other ORES revision score streams in the future? I recall that we discussed briefly something about it, but I don't remember the exact list of things needed :)

elukey mentioned this in T328576: Implement new mediawiki.revision-score streams with Lift Wing.Feb 17 2023, 7:11 AM

In T328276#8624184, @elukey wrote:

@dcausse Hi! I am working in T328576 to split mediawiki.revision-score into multiple substreams, and I am wondering which ones are really needed. IIUC from reading this task your team would still need drafttopic right? If so we could start from it so that you'll be able to migrate away from ORES' revision-score. Does it make sense?

Moreover, do you folks use or plan to use other ORES revision score streams in the future? I recall that we discussed briefly something about it, but I don't remember the exact list of things needed :)

Hey!
Sure, I think we can certainly start using a dedicated stream for ORES drafttopic predictions once it's running on your side (I might file a dedicated ticket for it). Note that we are in the middle of a Spark 3 migration, so we might start this refactoring after the migration, but please let us know if it's a blocker on your side. Regarding other ORES models, I can't think of anything other than articlequality (ex WP10), but Erik found that it was not very useful for search ranking and I don't remember anyone requesting these scores to be queryable from the search index, so I don't think we'll add it anytime soon. It's very probable that I'm forgetting something tho... @EBernhardson might know if there's another ORES model we have plans to use?

Thanks for the feedback! I am collecting info in T328576 so we can use that task for reference, I'll wait for Erik's recommendation about what streams to keep and what not (for example, if we want articlequality etc..).

Once articletopic transfers to the link-based topic model, drafttopic should be the only one we still need, afaik.

mfossati subscribed.Mar 29 2023, 3:00 PM

dcausse edited projects, added Discovery-Search (Current work); removed Discovery-Search.Mar 29 2023, 4:05 PM

achou added a parent task: T312518: Migrate ORES clients to LiftWing.Mar 31 2023, 10:56 AM

achou added a subtask: T328899: Add a new outlink topic stream for EventGate main.

EBernhardson set the point value for this task to 5.Apr 24 2023, 3:49 PM

EBernhardson moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

EBernhardson claimed this task.Apr 24 2023, 8:58 PM

EBernhardson moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

EBernhardson moved this task from In Progress to Ready for Dev -- SRE/Ops on the Discovery-Search (Current work) board.Apr 26 2023, 7:26 PM

EBernhardson moved this task from Ready for Dev -- SRE/Ops to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

EBernhardson moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.Jun 12 2023, 3:27 PM

EBernhardson mentioned this in T333468: Use the mediawiki.revision_score_drafttopic stream instead of mediawiki.revision-score.Jun 12 2023, 7:46 PM

Looked into this; it looks like progress is being made but it's not quite ready for us to pick up. The event streams (mediawiki.page_outlink_topic_prediction_change) are currently populated with only canary events.

In T328276#8924827, @EBernhardson wrote:

Looked into this; it looks like progress is being made but it's not quite ready for us to pick up. The event streams (mediawiki.page_outlink_topic_prediction_change) are currently populated with only canary events.

@EBernhardson the correct topic is eqiad.mediawiki.page_outlink_topic_prediction_change.v1; we started producing events to it via ChangeProp a couple of days ago, lemme know how it looks! We are looking forward to deprecating the mediawiki.revision-score stream :)

EBernhardson moved this task from Blocked/Waiting to In Progress on the Discovery-Search (Current work) board.Jun 26 2023, 3:19 PM

ebernhardson updated https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/merge_requests/28

ores: Adjust ingestion for changing articletopic ingestion to outlink model

ebernhardson opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/448

Draft: search: Change articletopic source to the outlink model

CodeReviewBot added a project: Patch-For-Review.Jun 30 2023, 7:39 PM

ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/merge_requests/28

ores: Adjust ingestion for changing articletopic ingestion to outlink model

EBernhardson moved this task from In Progress to Needs review on the Discovery-Search (Current work) board.Jul 7 2023, 7:40 PM

EBernhardson moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board.Jul 10 2023, 3:07 PM

ebernhardson merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/448

search: Change articletopic source to the outlink model

Mentioned in SAL (#wikimedia-operations) [2023-07-10T16:25:39Z] <ebernhardson@deploy1002> Started deploy [airflow-dags/search@8fa416b]: T328276: Change articletopic source to the outlink model

Mentioned in SAL (#wikimedia-operations) [2023-07-10T16:25:58Z] <ebernhardson@deploy1002> Finished deploy [airflow-dags/search@8fa416b]: T328276: Change articletopic source to the outlink model (duration: 00m 20s)

This has been shipped, our pipelines are now reading from event.mediawiki_page_outlink_topic_prediction_change_v1 and event.mediawiki_revision_score_drafttopic. This has reduced the update rate a bit. It looks like before we had ~35-40k updates flowing per hour, with the new model we are down closer to 18k/hr. That seems sensible, previously we were propagating edits from enwiki to other wikis, whereas now we wait until those wikis have seen an edit and so the wikis we previously propagated to are now likely seeing lower update levels.

This is excellent @EBernhardson and agreed on the update rate meeting expectations, especially because in the past you could have enwiki articles with 100+ sitelinks kicking off 100+ updates every time they were edited. I just confirmed that at least one recently-changed article that doesn't have an English equivalent moved from no topics to the appropriate topic. Thank you very much!
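
A back-of-the-envelope sketch of why dropping cross-wiki propagation lowers the update rate, as discussed in the two comments above. The counting rule (one update for the edited page plus one per sitelinked wiki when propagating) is a simplifying assumption, and the function name is hypothetical.

```python
def updates_per_edit(sitelinks: int, propagate: bool) -> int:
    """Index updates triggered by one edit under the assumed counting rule."""
    # With propagation: the edited page plus every wiki it is sitelinked to.
    # Without propagation: only the edited page itself.
    return 1 + sitelinks if propagate else 1


if __name__ == "__main__":
    # An enwiki article with 100 sitelinks, before and after the change.
    print(updates_per_edit(100, propagate=True))
    print(updates_per_edit(100, propagate=False))
```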

@KStoller-WMF thanks for the patience -- newcomers should hopefully start seeing a much greater diversity of task recommendations if they have topics selected in non-English wikis as incoming edits propagate the new topics! It'll take a while before we start seeing the impact in a quantitative way but I'll revisit some of my analyses in a few months and in the meantime will start suggesting other Product teams running recommender systems (Language, Android) tap into this awesome resource :)

@Isaac Exciting, thanks for the update!

Gehel closed this task as Resolved.Jul 21 2023, 9:39 AM

KStoller-WMF mentioned this in T332089: Community discussion about "add a link" with Arabic Wikipedia.Jul 25 2023, 4:52 PM

elukey closed subtask T328899: Add a new outlink topic stream for EventGate main as Resolved.Sep 5 2023, 1:24 PM

calbon moved this task from Watching to 2023-2024 Q3 Done on the Machine-Learning-Team board.Nov 29 2023, 2:19 PM