
Add a link engineering: Search pipeline
Closed, ResolvedPublic

Description

This is a tracking-only task for the search-related part of the Add Link project, mainly to have a self-contained description of the plan, which might also serve as a reference point for the upcoming image recommendation work. All work is happening in the linked tasks.

Product overview

WMF Product is planning to implement a series of structured tasks that can be recommended for new users; these tasks can be performed without using the editor and without having to understand all wiki policies, thus offering a more gradual learning experience. (More details here.) The first one is recommending links (Add-Link), the second recommending lead images (T254768); more might come based on the success of those.

The shape of the plan for link recommendations (and probably the others): there is some tool that can generate suggested edits (e.g. which words in an article should link where), which is used to provide a task feed or similar interface to the users. Users review the suggestions, which might or might not result in edits (the suggestion might be fully or partially accepted, or rejected). Outcomes are logged, and used to track the user's progress, and possibly to improve the recommendation tool.

Product requirements

  • Users can filter tasks via some selection criteria (currently, the ORES topics of the article). For any criterion, the number of available tasks should stay above a certain threshold (to make sure users can find tasks they are interested in). This is not enforced in real time, but whenever the number of tasks dips below the limit, it should be fixed within a reasonable amount of time.
  • Picking a task from the pool must be fast (faster than the recommendation tool itself).
  • Recommendations need to fulfill certain criteria (e.g. confidence level). This might change per wiki.
  • The same recommendation should not be given to multiple users (this can be probabilistic). If a recommendation has been reviewed (accepted or rejected), it should not be given to users anymore. Same if the article has been edited.
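The threshold requirement above can be captured in a small check that a periodic job would run. This is a sketch only; the names (MIN_TASKS_PER_TOPIC, topics_needing_refill) and the threshold value are hypothetical, not the real Growth-team code or configuration.

```python
from collections import Counter

MIN_TASKS_PER_TOPIC = 500  # illustrative threshold, not a real config value

def topics_needing_refill(pool_topics, threshold=MIN_TASKS_PER_TOPIC):
    """Return, sorted, the topics whose available-task count has dipped
    below the threshold; a periodic job would then fetch more tasks for
    them. pool_topics is one topic label per available task in the pool."""
    counts = Counter(pool_topics)
    return sorted(t for t, n in counts.items() if n < threshold)
```

Note that a topic with zero remaining tasks never appears in the counts at all, so a real job would also need the full list of supported topics to seed from.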

Engineering requirements

  • We don't want to build any kind of custom search system; instead, use the proper one we already have.
  • Minimize the number of needed search index updates.
  • Limited storage size, at least initially (i.e. we probably can't store recommendations for every article). In the future, storage might be reimplemented via the upcoming AI platform.
  • If possible, keep business logic in MediaWiki, since product requirements might change and product teams can change it more easily there.

Design decisions

Since storage is limited, and directly querying the recommendation tool is too slow, we maintain a recommendation pool as a MediaWiki DB table. For finding tasks, we use the search engine (which supports ORES topics; this also gives us flexibility in case the filtering criteria change), with randomized sorting to avoid collisions. We use a cronjob to make sure the pool is large and diverse enough; since the pool is consumed by newly registered users (whose number and activity level don't change that much), this doesn't have to be super fast. We need to keep the search index in sync; this does not have to be real-time (the cronjob isn't anyway), so it should be batched to minimize search index updates. There is no way to do batched updates from within MediaWiki, so we use the EventGate - Kafka - Hadoop - Spark pipeline that other similar features (e.g. ORES scores) use. Invalidating tasks on edits (or a user rejecting the recommendation) does have to be fast; we rely on the normal MediaWiki index update mechanism there, which is more or less real-time.
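The refill part of the design above can be sketched in a few lines. All names here (refill_pool, the recommendation_tool callable, emit_event) are illustrative stand-ins, not the actual implementation; the dict stands in for the MediaWiki DB table.

```python
def refill_pool(pool, recommendation_tool, target_size, emit_event):
    """Top the recommendation pool up to target_size. Each stored
    recommendation triggers one event for the EventGate -> Kafka ->
    Hadoop -> Spark pipeline, which batches the search index updates;
    nothing in this loop needs to be fast."""
    while len(pool) < target_size:
        rec = recommendation_tool()       # slow call to the link recommender
        pool[rec["page_id"]] = rec        # store in the pool table
        emit_event(rec["page_id"])        # queued, applied in batches downstream
    return pool
```

The point of the shape: the slow recommendation tool sits only in this background loop, while task *selection* goes through the (fast) search engine.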

Implementation steps

  • a MediaWiki DB table for storing recommendations (T261410)
  • a search index field for whether an article has recommendations (this got rolled into the new, more generic weighted_tags field)
  • a cronjob that monitors the size of the table and fetches more recommendations when needed (T261408)
  • the cronjob sends an event via EventBus when it adds a new recommendation (T261407)
  • the Elasticsearch update pipeline consumes these events (probably in hourly batches) and updates the index (T262226)
  • a search keyword for filtering for articles which have recommendations (T269493)
  • when an article is edited or a recommendation has been reviewed, discard the recommendation from the MediaWiki DB table (T275790)
  • during the MediaWiki search index update, check the recommendation table and set the search index flag accordingly (note that this happens after an edit, in which case the flag should be set to false; but also after a null edit or other re-render, in which case it should keep its value) (T261409)
  • when a recommendation was reviewed without making an edit, trigger the index update manually (T261409)
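The edit-versus-re-render subtlety in the index-update step above reduces to one predicate. A hedged sketch, with illustrative names (the real GrowthExperiments code looks the row up in the recommendation table):

```python
def has_recommendations_flag(current_rev_id, stored_rev_id):
    """Value of the 'has recommendations' search index flag during a
    MediaWiki index update. A real edit bumps the revision ID, so the
    stored recommendation no longer matches and the flag drops to False;
    a null edit or re-render keeps the revision ID, so the flag keeps its
    value as long as the row is still in the recommendation table."""
    return stored_rev_id is not None and stored_rev_id == current_rev_id
```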

Open questions

  • How do we avoid the same (rejected) recommendation being generated again? We'll probably want to keep some list of rejected recommendations, and make the cron job ignore them (T266446)
  • Should we use EventGate for logging the outcome of users reviewing the recommendations? In theory, that could also be an alternative mechanism for index updates. It doesn't seem to have any benefit, though.
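For the first open question, the exclusion-list idea could look roughly like this (a sketch only; filter_rejected and its arguments are made-up names):

```python
def filter_rejected(candidates, rejected_ids):
    """Drop candidate recommendations the cronjob should ignore because a
    user already rejected them (the list-of-rejections idea in T266446).
    candidates are dicts with a page_id key; rejected_ids is any iterable
    of page IDs whose recommendations were rejected."""
    rejected = set(rejected_ids)
    return [c for c in candidates if c["page_id"] not in rejected]
```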

Event Timeline

I'm curious about the details of this:

when a recommendation was reviewed without making an edit, trigger the index update manually (TBD)

Of course, it is marked "TBD" so it isn't fair to press for more information at this stage. But when we get there, I remain interested in how passive rejection will work.


"TBD" was just a reminder to myself to add a task. The plan is to delete the recommendation from the MediaWiki table and retrace the steps the search code takes after an edit (not 100% sure, but I think that means scheduling a SearchUpdate object). Since we are hooking into the search update to set the "has recommendations" flag based on whether the given revision ID is present in that table, this will result in the flag being unset.
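The reply above amounts to two steps; a minimal sketch, with illustrative names (the pool dict stands in for the DB table, schedule_search_update for MediaWiki's SearchUpdate scheduling):

```python
def handle_review_without_edit(page_id, pool, schedule_search_update):
    """On a review that produced no edit: delete the stored recommendation,
    then re-run the normal post-edit index update. The update hook finds
    no row for the page's current revision, so the 'has recommendations'
    flag ends up unset."""
    pool.pop(page_id, None)            # drop the recommendation row
    schedule_search_update(page_id)    # emulate MediaWiki's post-edit update
```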

AIUI there are two modes in which an Elasticsearch field can be configured: managed by MediaWiki, in which case index updates sent by MediaWiki after edits / reparses need to include that field, or managed separately, in which case the MW index updates won't affect it at all. So there's a design choice here, and both options seem reasonable: we could have a separately managed field, send updates through EventGate from the cron job when it stores recommendations, and send updates through EventGate whenever an article is edited or parsed or a recommendation rejected. Or, make it a managed field, in which case MediaWiki handles edits/parses as long as we provide a fast lookup for the field value, and we can just use simulated MediaWiki search updates to handle rejections (and adding recommendations, in theory, but we stick to the custom pipeline for those as we want to batch them).

Using a managed field means a little more traffic in MediaWiki (for rejections we have to recalculate all the unrelated fields even though they will never change) but far fewer custom EventGate events (rejections are rare, reparsing is common), so that seemed preferable.
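A back-of-the-envelope way to see the tradeoff described above (the rates below are made up; only the relative magnitudes matter):

```python
def daily_custom_events(additions, rejections, reparses, managed_field):
    """Rough count of custom EventGate events each design needs per day.
    With a MediaWiki-managed field, edits/reparses and rejections ride on
    normal MediaWiki index updates, so only the batched additions from the
    cronjob need custom events; a separately managed field needs a custom
    event for every change, and reparses dominate the total."""
    if managed_field:
        return additions
    return additions + rejections + reparses
```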

kostajh subscribed.

@Tgr can we close this?

The only item not done is "when an article is edited or a recommendation has been reviewed, discard the recommendation from the MediaWiki DB table". It's not needed for correctness, and not a blocker for deployment, but eventually we should have a cronjob or something for discarding old entries to save space.