Page MenuHomePhabricator

Add a link engineering: update search index on page edit
Closed, ResolvedPublic

Description

Make sure the "has recommendations" flag is removed from the search index when the page is edited.

Event Timeline

kostajh changed the task status from Open to Stalled.Oct 26 2020, 2:33 PM

I am not sure we will want a job to execute in the first iteration of this product. Marking as stalled until we have some more clarity around the indexing requirements (T261408 is probably the best place to centralize discussion around that).

Tgr renamed this task from Add a link engineering: Job specification to retrieve/cache/search index update on page edit and deletion to Add a link engineering: update search index on page edit.Jan 9 2021, 12:56 AM
Tgr updated the task description. (Show Details)

Repurposed this into just dealing with unindexing. We don't plan to automatically update the recommendation for now, we just consider edited page as not having a recommendation anymore. We need to make sure the search index matches that, though.

This is blocked on Search work, but that is very nearly done now, I think.

@dcausse or @EBernhardson would you be able to advise how the Cirrus query for this would look? (I tried to find the functionality mentioned in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/655083/5#message-f55d5c03f065a9e6b483888be280d8fff35cfb47 but probably was looking at the wrong place.) If it could be done as part of the normal MediaWiki index update (SearchDataForIndex) that would be convenient, but I assume ores_articletopics / weighted_tags is not accessible there as it's not derived from page content?

It could be done using MW core hooks but since this field is mainly owned by Cirrus I wonder if Cirrus would not be better place.

FWIW using a MW core hooks would look like this:

	/** SearchDataForIndex hook */
	function resetRecommendations( &$fields, $handler, $page, $output, $engine ) {
		// possibly check that $engine is CirrusSearch
		$fields['weighted_tags'] = $fields['weighted_tags'] ?? [] + [ 'recommendation.link/__DELETE_GROUPING__' ];
	}

But then cirrus needs to properly some hints to the update request, which could be done via the SearchIndexFields MW core hook but this would mean moving all the logic out of cirrus which I'm not sure would be good idea.

I'd be for having all the logic in the same extension and do all this in Cirrus for simplicity reasons.

In cirrus it should be relatively simple:

  • attach the data to the weighted_tags as shown the php snippet above
  • attach the multilist noop handle to the weighted_tags
  • make sure we do this only for real page edits (e.g. not saneitization requests or other unrelated updates)

Does multilist make it preserve existing content?

There are two scenarios in which the tag needs to be unset: when the page is edited, and when the recommendation is reviewed by a user (this might result in an edit but not necessarily). The latter needs to be handled by the GrowthExperiments extension, and the code would largely be the same. If there's an appropriate Cirrus-specific hook, GrowthExperiments could subscribe to that.

I wonder though, would it make sense to extend SearchDataForIndex with something like a $mergeFields parameter, which gets superimposed on top of the old field data, instead of replacing it? Seems like that could be useful in general for tag-like features.

Does multilist make it preserve existing content?

yes this is the point of this handler.

There are two scenarios in which the tag needs to be unset: when the page is edited, and when the recommendation is reviewed by a user (this might result in an edit but not necessarily). The latter needs to be handled by the GrowthExperiments extension, and the code would largely be the same. If there's an appropriate Cirrus-specific hook, GrowthExperiments could subscribe to that.

I think we could certainly expose a hook from cirrus to reset this data, the GrowthExperiments extension would be responsible for triggering it.

I wonder though, would it make sense to extend SearchDataForIndex with something like a $mergeFields parameter, which gets superimposed on top of the old field data, instead of replacing it? Seems like that could be useful in general for tag-like features.

Might be, I think it's the first time we have this setup (single field fed from multiple sources) but I suggest to first resolve the third point I raised (make sure we do this only for real page edits). Currently all these hooks are not aware of the context they are called and we do not want to reset the recommendation flag for unrelated index updates, I think Cirrus might know this and is why I suggest doing everything in cirrus for now.

I suggest to first resolve the third point I raised (make sure we do this only for real page edits)

My original plan for that was to check whether we are storing a recommendation for that revision in the GrowthExperiments table responsible for that, growthexperiments_link_recommendations (which is a primary key lookup in the database, so I think it would be acceptable performance-wise). That is convenient in that it would also handle user rejections (where they mark a recommendation as bad, so we need to unindex the page without any edit happening) - we'd just delete the table row and trigger an index update. Although now that I think about it that seems a bit wasteful as all document fields would be recalculated.

Anyway, I'm happy to go with a Cirrus-specific hook if you think that's a good approach. The only requirement we have is that we need to trigger a similar indexing change when recommendations are rejected and no normal update event is happening.

I suggest to first resolve the third point I raised (make sure we do this only for real page edits)

My original plan for that was to check whether we are storing a recommendation for that revision in the GrowthExperiments table responsible for that, growthexperiments_link_recommendations (which is a primary key lookup in the database, so I think it would be acceptable performance-wise). That is convenient in that it would also handle user rejections (where they mark a recommendation as bad, so we need to unindex the page without any edit happening) - we'd just delete the table row and trigger an index update. Although now that I think about it that seems a bit wasteful as all document fields would be recalculated.

Oh sorry I missed this point, if you have somewhere this information then implementing the hook SearchDataForIndex might be sufficient indeed.
On edits you would have to make sure that the entry in the growthexperiments_link_recommendations table is cleared before the indexing job is run. Note that cirrus triggers its update during the LinksUpdateComplete hook.

As to allow the GrowthExperiments extension to invalidate a recommendation flag without an edit I'm not aware of something that would allow you to do that. Unless you have something in mind options could be:

  • expose a function that triggers an index refresh for one doc (this will have to go through to the jobqueue and might add some latency)
  • expose a function dedicated to resetting recommendation flags, could be synchronous and would not require rebuilding all the page fields (it'll go async only if the cluster is under maintenance)

Anyway, I'm happy to go with a Cirrus-specific hook if you think that's a good approach. The only requirement we have is that we need to trigger a similar indexing change when recommendations are rejected and no normal update event is happening.

I'm fine with the SearchDataForIndex approach, the weigted_tags field configuration with the multilist noop handler can go to CirrusSearch (as it's owning the mapping).

So, per T273508: CirrusSearch should expose a function to reset its weighted_tags for a particular tag category, we should call (new CirrusSearch())->resetWeightedTags( Title::newFromText( '...' )->toPageIdentity(), 'recommendation.link/' ) when the user rejects a recommendation and it needs to be cleared, right? (Thanks for providing that!) And for edits use a (to be implemented) new functionality in SearchDaraForIndex, since we want to free-ride on the normal index update?

So, per T273508: CirrusSearch should expose a function to reset its weighted_tags for a particular tag category, we should call (new CirrusSearch())->resetWeightedTags( Title::newFromText( '...' )->toPageIdentity(), 'recommendation.link/' ) when the user rejects a recommendation and it needs to be cleared, right? (Thanks for providing that!) And for edits use a (to be implemented) new functionality in SearchDaraForIndex, since we want to free-ride on the normal index update?

Yes exactly, note that the tag prefix should be recommendation.link (without the /) when using the resetWeightedTags.
Hooking into SearchDaraForIndex will probably have to go in your extension (note that this hook is also called in processes unrelated to page edits):

	/** SearchDataForIndex hook */
	function onSearchDataForIndex( &$fields, $handler, $page, $output, $engine ) {
		// possibly check that $engine is CirrusSearch
		if ( !$this->hasLinkRecommendationFor( $page ) ) {
			// append 'recommendation.link/__DELETE_GROUPING__' to the $fields['weighted_tags'] array (might not exist)
		}
	}

String constants could be used from Cirrus classes but it's unsure if this is worthwhile since I think you would have to add stubs for phan.

  • __DELETE_GROUPING__ is \CirrusSearch\Search\CirrusIndexField::MULTILIST_DELETE_GROUPING
  • weighted_tags is \CirrusSearch\Wikimedia\WeightedTagsHooks::FIELD_NAME

Change 662099 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Do not store the link recommendation in the DB on miss

https://gerrit.wikimedia.org/r/662099

Change 662099 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Do not store the link recommendation in the DB on miss

https://gerrit.wikimedia.org/r/662099

Tgr changed the task status from Stalled to Open.Feb 7 2021, 7:15 PM
Tgr claimed this task.

Change 662133 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] [WIP] Update search index after a recommendation has been submitted

https://gerrit.wikimedia.org/r/662133

Change 664411 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/CirrusSearch@master] [WIP] Add functionality to update weighted tags

https://gerrit.wikimedia.org/r/664411

Change 664411 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add helper method and script for updating weighted tags

https://gerrit.wikimedia.org/r/664411

Change 662133 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Update search index after a recommendation has been submitted

https://gerrit.wikimedia.org/r/662133

Nothing QA'able yet, but moving to that column for now anyway.

kostajh moved this task from Backlog to Done / QA on the Add-Link board.
kostajh added a subscriber: Etonkovidova.

@Etonkovidova I think you could QA this now.

Updating my comment : @kostajh - I rechecked some test scenarios and there are still cases that are problematic. Please review the following and let me know if there shuold be some follow-ups. Probably, it's just some limitations of betalabs environment.
There is T294160 - probably it relates to this task?

(1) There are quite few articles in enwiki betalabs that are still have entries in growthexperiments_link_recommendations but would display "Suggestions are no longer available".

e.g. "Anthony Field" (Page ID: 182195 ) - one of the suggested articles for the filters "Oceania"+"Add links between articles
(Machine suggestions)" (total count of suggested articles is 9).

MariaDB [enwiki]> select * from growthexperiments_link_recommendations where gelr_page=182195\G
*************************** 1. row ***************************
gelr_revision: 344358
    gelr_page: 182195
    gelr_data: {"links":[{"link_text":"Queenie Paul","link_target":"Queenie Paul","match_index":0,"wikitext_offset":2899,"score":0.5792330503463745,"context_before":"aunt was \"","context_after":"\", known f","link_index":0},{"link_text":"tin whistle","link_target":"Tin whistle","match_index":0,"wikitext_offset":10092,"score":0.7919995188713074,"context_before":"uding the ","context_after":", bagpipes","link_index":1},{"link_text":"Melbourne","link_target":"Melbourne","match_index":0,"wikitext_offset":10847,"score":0.5287876725196838,"context_before":"Coffee in ","context_after":".","link_index":2},{"link_text":"Australian rugby league","link_target":"Australian Rugby League","match_index":0,"wikitext_offset":12528,"score":0.7031171321868896,"context_before":"an of the ","context_after":" club the ","link_index":3}]}
1 row in set (0.00 sec)

But "Anthony Field" (Page ID: 182195 ) history shows that links were added (and the newer revision is recorded for the page in revision table:

Screen Shot 2021-09-07 at 5.43.15 PM.png (119×1 px, 63 KB)

It seems that the entries in growthexperiments_link_recommendations have not been updated after link recommendation task was done.

(2) The scenario that leads to "Suggestions are no longer available"

  • on Homepage apply "Add links between article(Machine suggestions)" filter.
  • in a different tab, enter the title that you noticed on Homepage
  • edit an article and publish the edit
  • on the tab where Homepage displays the article card, click on the card- "Suggestions are no longer available" warning will be displayed.

(3) Edit conflict will be raised if in VE mode one of the recommended links will be added manually.

  • a user switches to VE mode from recommended link mode and dd manually one of the recommended links. Publish it.
  • a user switches back to the link recommendation mode - the links would reflect reduced number of recommended links and add/reject links workflow will go as usual.

When a user attempts to save added/rejected links- the warning about edit conflict is displayed and edit cannot be saved.

growthexperiments_link_recommendations was updated and the article was excluded.