Page MenuHomePhabricator

Add support for redirects in CirrusSearch
Closed, ResolvedPublic8 Estimated Story Points

Description

CirrusSearch does not index redirects as individual index document but rather maintain a list of redirects to the page they redirect to.

This data is not correlated to a particular revision of the target page and won't be handled by revision based updates. Thus the update will have to only update the redirect field:

{
      "wiki": "testwiki",
      "page_id": 123,
      "namespace": 10,
      "title": "Target title",
      "redirect": [
        {
          "namespace": 10,
          "title": "Redirect to target",
        },
        {
          "namespace": 2,
          "title": "Another Redirect to target",
        },
        {
          "namespace": 0,
          "title": "Yet Another Redirect to target",
        },
      ]
}

Note the absence of the version field here as we cannot attach this data to a particular revision of the page, it is the state of the redirects to the page Target title after one of its redirect is updated (added/removed).

There could be two approaches to do this:

Use the page-state stream state but use a different enrichment API
The cirrus doc api could be extended to support a builder and construct the update fragment with all the redirects targeting this page.

  1. When given the revision of redirect the target page has to be identified and then all redirects to this page must be collected
  2. List all the redirects like what's done in RedirectsAndIncomingLinks.php
  3. Produce the update fragment

While this technique works for adding a redirect, changing or removing a redirect might be tricky to do right:

  • changing a redirect from one target to another might cause 2 updates and the page-state stream might not know what was the previous target page
  • deleting a redirect might cause the call to the cirrus doc to fail as the content of deleted page cannot be extracted
  • what to do when transforming a page into a redirect and vice-versa

So while this approach is tempting as it resembles in many points to the rest of the pipeline the couple unknowns might make it not the best one to implement.

Create a dedicated stream from CirrusSearch
Another approach might be to simply ignore redirects from the page-state stream but create a new stream directly from CirrusSearch.
CirrusSearch might have better access to all the components required to sort out the various edges-cases mentioned above and produce the right events.

AC:

  • changes to redirect pages are properly reflected in the elasticsearch index

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
MPhamWMF set the point value for this task to 8.Feb 6 2023, 4:45 PM

Change 913030 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[mediawiki/extensions/EventBus@master] Encode redirect targets in page change events.

https://gerrit.wikimedia.org/r/913030

@Ottomata, as suggested on Wikimedia-Search via IRC, I just added redirect information to page change events, see https://gerrit.wikimedia.org/r/913030. I'd highly appreciate feedback early on so I don't wast time moving in a wrong direction. Right now I only pass the redirect for the latest version of a page. As far as I understand, extracting the redirect of the parent/prior revision would require parsing the content of that revision. Is that correct? Would that be something within the scope of that event or rather for the (ye to be implemented) events holding the page content?

Change 914803 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[schemas/event/primary@master] Encode redirect targets in page change events.

https://gerrit.wikimedia.org/r/914803

Change 914862 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[schemas/event/primary@master] Encode redirect targets in page change events.

https://gerrit.wikimedia.org/r/914862

Change 914803 abandoned by Peter Fischer:

[schemas/event/primary@master] Encode redirect targets in page change events.

Reason:

https://gerrit.wikimedia.org/r/914803

Change 914867 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[schemas/event/primary@master] Encode redirect targets in page change events.

https://gerrit.wikimedia.org/r/914867

Change 914862 abandoned by Peter Fischer:

[schemas/event/primary@master] Encode redirect targets in page change events.

Reason:

https://gerrit.wikimedia.org/r/914862

Use Cases (with redirect property in page change events, see schema proposal):

  • changelog_kind = insert & page_change_kind = create & page.is_redirect = true & page.redirect.title != null
    • fetch prop=cirrusbuilddoc&titles={page.redirect.page_title} and add in ES
  • changelog_kind = update & page_change_kind = move
    • fetch prop=cirrusbuilddoc&pageids={page.page_id} and add in ES
    • if created_redirect_page != null
      • delete document matching created_redirect_page.page_id from ES
  • changelog_kind = update & page_change_kind = edit & page.is_redirect = true & page.redirect.title != null & prior_state.revision.rev_id != null -> fetch prop=cirrusbuilddoc&rev_ids={prior_state.revision.rev_id}
    • if cirrusbuilddoc.outgoing_link has a size of 0 or cirrusbuilddoc.source_text does not start with #REDIRECT this page transformed into a redirect
      • delete document matching page.page_id from ES
      • fetch prop=cirrusbuilddoc&titles={page.redirect.page_title} and add/update in ES
  • changelog_kind = delete & page_change_kind = delete & page.is_redirect = true & page.redirect.title != null
    • fetch prop=cirrusbuilddoc&titles={page.redirect.page_title} and add/update in ES
  • changelog_kind = insert & page_change_kind = undelete & page.is_redirect = true & page.redirect.title != null
    • fetch prop=cirrusbuilddoc&titles={page.redirect.page_title} and add/update in ES

@pfischer @dcausse Reviewing implementation, I really wonder whether support for RedirectTarget and lookups should be in MediaWiki core. cc @daniel for vetting of this idea and whether MW would accept such a patch. See also T290639: Move redirect lookup logic into a service object.

requirements tl;dr:
From a WikiPage (or other Service) we can:

  • getRedirectTarget as LinkTarget, and as WikiPage LinkTarget points to.
  • Resolve redirect target chains to final WikiPage LinkTarget points to.

Perhaps LinkTarget should have the ability to return the WikiPage it points to?
Or Perhaps RedirectStore should have this ability?

Ottomata renamed this task from Add support for redirects to Add support for redirects in CirrusSearch.May 15 2023, 4:50 PM

@pfischer @dcausse Reviewing implementation, I really wonder whether support for RedirectTarget and lookups should be in MediaWiki core. cc @daniel for vetting of this idea and whether MW would accept such a patch. See also T290639: Move redirect lookup logic into a service object.

RedirectTarget sounds like it's the same as LinkTarget. RedirectLookup and PageLookup exist in core. I'm not sure I understand what's missing...

requirements tl;dr:
From a WikiPage (or other Service) we can:

  • getRedirectTarget as LinkTarget, and as WikiPage LinkTarget points to.
  • Resolve redirect target chains to final WikiPage LinkTarget points to.

Support for redirect chains have been removed from MediaWiki, they never worked properly and were causing confusion, see T296430: Remove support for $wgMaxRedirects.

Perhaps LinkTarget should have the ability to return the WikiPage it points to?

No, it can't. LinkTargets are value objects, it cannot perform lookups. Also, LinkTargets may refer to things that are not pages (relative section links, external links, language links, etc). Resolving links to pages is the purpose of PageLookup.

Or Perhaps RedirectStore should have this ability?

Yes, RedirectStore could have a convenience method like getRedirectTargetPage() wethod returns a PageReference. Not generally a PageIdentity, since the redirect may point to a special page.

Usage of WikiPage and Title objects should be avoided because they are "monster objects". New code should not use them, at least not in public method signatures.

This comment was removed by daniel.

(Oops, realized I linked to the wrong gerrit patch in my previous comment, meant to link to this implementation. Edited).

@daniel, thank you for your feedback! I reduced the implementation to only support direct redirects. As an alternative, I looked into the convenience method RedirectLookup/Store#getRedirectTargetPage(): ?PageReference. Just to make sure, I am on the right track: Did you think of something like that:

	public function getRedirectTargetPage ( PageIdentity $page ): ?PageReference {
		$linkTarget = $this->getRedirectTarget($page);
		if ($linkTarget != null) {
			try {
				return $this->pageLookup->getPageForLink($linkTarget); // TODO - Is that kind of coupling between RedirectLookup and PageLookup OK? It feels wrong.
			} catch (Exception $e) { // TODO - Is there a more specific exception to be caught?
				return new PageReferenceValue($linkTarget->getNamespace(), $linkTarget->getDBkey(), $linkTarget->isExternal() ? $linkTarget->getInterwiki() : WikiAwareEntity::LOCAL);
			}
		}
		return null;
	}

@daniel, thank you for your feedback! I reduced the implementation to only support direct redirects. As an alternative, I looked into the convenience method RedirectLookup/Store#getRedirectTargetPage(): ?PageReference. Just to make sure, I am on the right track: Did you think of something like that:

Yes, something like that. If you make a patch, we can discuss the details there.

@pfischer since we might want to change the current behavior in CirrusSearch (stop treating double redirects) it might make sense to have a function like that in core so that it's re-used by both EventBus and CirrusSearch?

Thanks for the data modeling work yall!

Now that we are looking at the options in T331399#8898566, I wonder if we can get away without a new data model for redirect info in page change. As long as redirect links are always local (are they?), the number of fields to represent a link is pretty low...and is the same as the fields in the entity/page schema.

It'd be nice if we could just add

redirect_target_page:
  $ref: /fragment/mediawiki/state/entity/page/1.0.0

to the entity/page schema..but that'd be circular. We only need 3 of the page fields to represent a redirect, so we could just ref each of them (or hardcode them) individually:

redirect_target_page:
  type: object
  properties:
    page_id: 
      $ref: '#/properties/page_id'
    page_title: 
      $ref: '#/properties/page_title'
    namespace_id: 
      $ref: '#/properties/namespace_id'

When we need a new model for page link target for T331399: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page, we can also $ref page fields.

The question I'm asking myself is...should we add this redirect_target_page field directly to entity/page? Or is it better to keep that fragment schema simple, and just add it to state/change/page directly?

If we add to state/change/page, we could add it into the page field there, and it would look the same. We'd also have to make sure it is added to the prior_state.page field as well.

Thoughts? (I think I'm leaning towards state/change/page directly).
@gmodena @tchin @Milimetric ?

@Ottomata, I'm not sure about redirects always being local. According to the redirect table definition redirects may contain an interwiki component. I discussed this with @dcausse already, we're happy to consume any redirect (link) that resolves to a wiki page. As long as we get all the properties of the target page we need, which are

  • page_id
  • page_domain (so we know how to build the URL to fetch details of the target page via CirrusSearch API; would have to be looked up in EventBus)
  • page_is_redirect (so we can skip double+ redirects)
  • [page_namespace] (optional as page_id is unique across namespaces)
  • [page_title] (optional as long as we have page_id)

Edit: Just figured, that looking up a target page_id of a page from another wiki does not work.

@Ottomata, I'm not sure about redirects always being local. According to the redirect table definition redirects may contain an interwiki component.

Indeed, redirects can be cross-wiki. MediaWiki will not follow interwiki redirects automatically (at least not on WMF sites, maybe this is configurable), but it will recognize, render and record them as redirects.

@daniel, do you know if we can get a canonical name (wiki_id?) of the wiki for interwiki link? If so, we might want to add a wiki_id field to the entity/page schema?

@daniel, do you know if we can get a canonical name (wiki_id?) of the wiki for interwiki link? If so, we might want to add a wiki_id field to the entity/page schema?

Unfortunately, tehre is no easy and reliable way to do this. It would not be possible in general since interwiki prefixes can map to exetrnal sites that might not even be wikis. But it should be possible for "local" interwiki links. In practice however, we don't have that mapping readily available anywhere. I have long been wanting to fix that, but it never happened. See T113034: RFC: Overhaul Interwiki map, unify with Sites and WikiMap.

But it should be possible for "local" interwiki links

Does "local" here mean local to the current wiki, like an interwiki link that should be a regular page link, or does it mean WMF hosted wikis?

Does "local" here mean local to the current wiki, like an interwiki link that should be a regular page link, or does it mean WMF hosted wikis?

No, I mean as in the iw_local database field, as returned by Interwiki::isLocal(): whether the target site is a wiki and lives on our cluster and we can talk to the database directly.

I note that the interwiki table actually has a iw_wikiid field which would do exactly what you want. But we are not using the table in production, we are loading the interwiki table from a php array (see ClassicInterwikiLookup::getAllPrefixesPregenerated). This only supports iw_prefix, iw_url, and iw_local. The other fields are lost.

This comment was removed by daniel.

If we do not resolve non-local page IDs, would it suffice to pass the iw_prefix for now (to indicate that this is non-local), instead of trying to map this to a domain/project (which obviously can't be done reliably at the moment)?

So I guess:

redirect_target_page:
  type: object
  properties:
    page_id: 
      $ref: '#/properties/page_id'
    page_title: 
      $ref: '#/properties/page_title'
    namespace_id: 
      $ref: '#/properties/namespace_id'

Plus a field that IDs the interwiki if needed. interwiki_prefix? Or what is iw_url?

(Sorry I haven't totally had time to investigate all the stuff Daniel linked. :) )

Change 914867 merged by jenkins-bot:

[schemas/event/primary@master] Encode redirect targets in page change events.

https://gerrit.wikimedia.org/r/914867

Change 930211 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] Bump primary schema repo version to get new page/change schema

https://gerrit.wikimedia.org/r/930211

Change 930211 merged by jenkins-bot:

[eventgate-wikimedia@master] Bump primary schema repo version to get new page/change schema

https://gerrit.wikimedia.org/r/930211

Change 930219 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] Bump primary schema repo version to get new page/change schema

https://gerrit.wikimedia.org/r/930219

Change 930219 merged by Ottomata:

[operations/deployment-charts@master] Bump primary schema repo version to get new page/change schema

https://gerrit.wikimedia.org/r/930219

Change 913030 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Encode redirect targets in page change events.

https://gerrit.wikimedia.org/r/913030

Change 933182 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[mediawiki/extensions/EventBus@master] Resolve redirect target before delete, so it can be serialized in the event fired after.

https://gerrit.wikimedia.org/r/933182

Change 933182 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Resolve redirect target before delete, so it can be serialized in the event fired after.

https://gerrit.wikimedia.org/r/933182

Change 935775 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[search/extra@master] Add support for limiting the size of arrays resulting from the `set` handler.

https://gerrit.wikimedia.org/r/935775

Change 935775 merged by jenkins-bot:

[search/extra@master] Add support for limiting the size of arrays resulting from the `set` handler.

https://gerrit.wikimedia.org/r/935775

pfischer subscribed.

Change 938210 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/software/elasticsearch/plugins@master] Bump version of extra plugin

https://gerrit.wikimedia.org/r/938210

Change 938210 merged by Bking:

[operations/software/elasticsearch/plugins@master] Bump version of extra plugin

https://gerrit.wikimedia.org/r/938210

Change 941483 had a related patch set uploaded (by Bking; author: Bking):

[operations/software/elasticsearch/plugins@master] Increment BUILD_VERSION so plugin can build

https://gerrit.wikimedia.org/r/941483

Change 941483 merged by Bking:

[operations/software/elasticsearch/plugins@master] Increment BUILD_VERSION so plugin can build

https://gerrit.wikimedia.org/r/941483

Unfortunately, the package build is failing. We're following the process from these notes , and we're getting the following error:
dpkg-gencontrol: error: unknown option '~bullseye'

Full dump of the output is here . @MoritzMuehlenhoff are you able to offer any suggestions on this one? Let us know if you need more context.

Change 948692 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[mediawiki/extensions/CirrusSearch@master] Count duplicate files in search results.

https://gerrit.wikimedia.org/r/948692

Change 948692 abandoned by Peter Fischer:

[mediawiki/extensions/CirrusSearch@master] Count duplicate files in search results.

Reason:

Something went wrong with this commit.

https://gerrit.wikimedia.org/r/948692

Built wmf-elasticsearch-search-plugins_7.10.2-9 and wmf-elasticsearch-search-plugins_7.10.2-9~bullseye (https://apt.wikimedia.org/wikimedia/pool/thirdparty/elastic710/w/wmf-elasticsearch-search-plugins/); installed on all elastic* hosts (incl. relforge* and cloudelastic*). Rolling restarts not completed yet. relforge* can be restarted at any time, but elastic* and cloudelastic* must wait till after an ongoing reindex of all wikis has completed.

Change 951829 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[schemas/event/primary@master] Reuse existing schema fragments for redirects.

https://gerrit.wikimedia.org/r/951829

Change 951829 merged by jenkins-bot:

[schemas/event/primary@master] cirrussearch: move to development

https://gerrit.wikimedia.org/r/951829