
URL internal links that use parser functions or go to a specified target are hardly searchable
Open, Lowest, Public

Description

The ability to locate internal HTML links is valuable, but currently, if such a link uses URL-related parser functions, locating it is hardly possible.

  • Knowing everything that links to a section allows us to change the section title.
  • Finding any bare URL in ref tags is a style-correction issue.
  • Related functionality, LinkSearch, was a top concern in the recent Wikimedia survey (its stated function: "to search for external links to pages on this site...", //en.wikipedia.org/wiki/Special:LinkSearch).
  • WhatLinksHere cannot find internal URLs.
  • The linksto search parameter cannot find internal URLs. (Linksto tracks [[square brackets]], but only if they are not redirects.)
  • The insource search parameter is the only way, and it is nearly impossible to use, because the only approach is running a large set of queries. Currently a probability is the best we can do when reporting URL internal linkage.

Background
To link, we can create a URL-style internal link to one point in so many ways that there can be literally hundreds of them, each different enough from the others that Search can need a hundred queries to find that one URL internal link.

To start the picture, here are just five of the many text patterns that can link to a point:

  1. [//wikipedia.org/wiki/namespace:pagename]
  2. {{canonicalurl:namespace:pagename}}
  3. [{{fullurl:namespace:pagename}}]
  4. [{{SERVER}}{{localurl:namespace:pagename/}}]
  5. [{{SERVER}}/wiki/namespace:pagename/]

It is not only the generous number of magic words that confounds Search, but also their interplay, whitespace, and letter case. Here is an example of how one parser function accepts spacing:

  1. {{fullurl:namespace:pagename}}
  2. {{fullurl: namespace:pagename}}
  3. {{fullurl:namespace: pagename}}
  4. [{{fullurl: namespace: pagename}} link label]

That single parser-function characteristic alone quadruples the number of insource queries. But there are many more like it, and each one multiplies the number of queries needed to find a link rather than simply adding to it. Each of the many queries needed to find a single link would also require a regexp to verify, for example, that only the last one is an actual link.
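The quadrupling can be illustrated with a minimal Python sketch that enumerates the spacing variants of one fullurl call. The function name and the page "Help:Searching" are hypothetical placeholders, and the sketch only assumes what the list above shows: an optional space after each of the two colons.

```python
from itertools import product


def fullurl_spacing_variants(namespace: str, pagename: str) -> list[str]:
    """Return the fullurl spellings that differ only in an optional space after each colon."""
    variants = []
    # Two independent yes/no choices (space after "fullurl:", space after "namespace:").
    for after_function, after_namespace in product(["", " "], repeat=2):
        variants.append(
            "{{fullurl:" + after_function + namespace + ":" + after_namespace + pagename + "}}"
        )
    return variants


# "Help" and "Searching" are placeholder names used only for illustration.
variants = fullurl_spacing_variants("Help", "Searching")
for variant in variants:
    print(variant)
```

Two independent choices give 2 x 2 = 4 literal strings, and an exact-match insource query would need one query per string.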

There has to be yet another insource query for each of the following text patterns,
all of which multiply the number of queries needed to find a single URL:

  • The namespace can be given as subjectspace, articlespace, or talkspace.
  • The page name can be given as basepagename/subpagename, articlepagename, subjectpagename, talkpagename, rootpagename, or pagename. These six can also take a :fullpagename as a parameter, so that is a multiplier of twelve.
  • There are equivalent magic words for the server: server, servername, or scriptpath. That is three more.
  • Most of these can equally name content by using a :fullpagename colon parameter.
  • Several magic words (the *URL ones) take parameters like |path or |wiki.
  • "Fullurl" and "canonicalurl" also accept "urlencode" or "anchorencode" forms.
  • Many take an "EE" form.
  • There are also things like {{NS:{{NAMESPACENUMBER}}}} that need to be searched to see whether they go to the page name in question.
  • The URL can equally be in HTTP POST or HTTP GET form, so many of these can equally name content by using a |query on the path: server/w/query, where query is index.php?title= or index.php?pageid=

Given a page name, each of these must be tried. That's hundreds of insource queries for a single page name.
So we probably do five queries instead and report what "probably" links there.
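As a rough illustration of how the count reaches the hundreds, the Python sketch below multiplies only the counts quoted in this description. Treating these four characteristics as fully independent is an assumption made for illustration; not every combination is necessarily a valid link, and the remaining characteristics are left out because no count is given for them.

```python
from math import prod

# Counts quoted in the description; treating them as independent multipliers is an assumption.
multipliers = {
    "spacing variants of one parser function": 4,
    "namespace words (subjectspace, articlespace, talkspace)": 3,
    "page-name words, with and without a :fullpagename parameter": 12,
    "server words (server, servername, scriptpath)": 3,
}

total_queries = prod(multipliers.values())
print(total_queries)  # 432
```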

Foreground
Spacing and case can be significant for insource, and matching multiple patterns requires a different regexp for each query. So we can search for URLs in only an ungenerous way. For each single external link construct, Search is narrow and specific.

Each of these characteristics multiplies the number of searches required manyfold:

  • Search is camelCase sensitive, but namespace names and parser functions are not.
  • Insource treats an unspaced colon : character, like:this, as a letter, so the non-indexed strings "like" and "this" cannot be found except with a regex. Insource is the only option, because it is possible but not likely that the sought construct is visible on the page. To find non-indexed strings, a regex needs a filter. As just explained, there is no filter possible. Each search would need its own separate regex for verification purposes. For an insource search the non-spaced colon is no different from a letter or a number. If there is a space after it, the alternative without the space will not match.
  • A namespace with two aliases triples the number of insource searches.
  • You really need yet another variable, a /regex/, just to look for the opening [ bracket. It would need to accompany each query in its own distinct form. Even then it could not prove that a closing ] bracket existed, because the dot . metacharacter represents any character, including a newline.
  • Insource doesn't take OR.
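For comparison, here is a minimal sketch of what one tolerant pattern could look like outside of Search, for example when scanning wikitext from a dump. It uses Python's re module, whose syntax differs from the regex dialect that insource accepts, so it is not a workaround inside Search. The function name, the sample text, and the pages "Help:Searching" and "Help:Contents" are hypothetical. The pattern assumes the parser function name and namespace are case-insensitive, that a space may follow each colon, and that both brackets sit on the same line; the page name is matched exactly as given, which is a simplification.

```python
import re


def find_fullurl_links(wikitext: str, namespace: str, pagename: str) -> list[str]:
    """Find bracketed links whose target is {{fullurl:namespace:pagename}}."""
    pattern = re.compile(
        r"\[\s*\{\{"                               # opening [ bracket, then {{
        r"(?i:fullurl):\s*"                        # function name in any case, optional space
        r"(?i:" + re.escape(namespace) + r"):\s*"  # namespace in any case, optional space
        + re.escape(pagename)                      # page name, matched exactly (simplification)
        + r"\s*(?:\|[^{}\[\]]*)?\}\}"              # optional |query parameter, then }}
        r"[^\[\]\n]*\]"                            # label, then closing ] on the same line
    )
    return [match.group(0) for match in pattern.finditer(wikitext)]


# Hypothetical sample wikitext used only for illustration.
sample = (
    "[{{fullurl: help: Searching}} link label]\n"
    "{{fullurl:Help:Searching}}\n"
    "[{{FULLURL:Help:Searching|action=edit}} edit]\n"
    "[{{fullurl:Help:Contents}} other page]\n"
)

links = find_fullurl_links(sample, "Help", "Searching")
for link in links:
    print(link)
```

In this sample only the first and third lines are reported: the second has no brackets and the fourth points at a different page. Even so, the pattern covers just one of the many constructs listed in the description.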

It is a series of searches few could understand. A template could try to offer to report URLs to a given canonical name of any of: a section, a fullpagename, a prefix, or a namespace. There is no closer-to-singular process available. What is needed is a way for end-users to find URLs.

What is a workaround, please?

Event Timeline

Cpiral created this task. Dec 14 2015, 8:41 AM
Cpiral raised the priority of this task from to Needs Triage.
Cpiral updated the task description.
Cpiral added a project: CirrusSearch.
Cpiral added a subscriber: Cpiral.
Restricted Application added a project: Discovery. Dec 14 2015, 8:41 AM
Restricted Application added subscribers: StudiesWorld, Aklapper.
Cpiral renamed this task from "URL-style internal wikilinks are not well-enough tracked or searchable" to "Wikipedia URL-style internal wikilinks are not well-enough tracked or searchable". Dec 17 2015, 10:30 AM
Cpiral updated the task description.
Cpiral set Security to None.
Cpiral updated the task description. Dec 18 2015, 3:41 AM
Cpiral renamed this task from "Wikipedia URL-style internal wikilinks are not well-enough tracked or searchable" to "URL-style internal wikilinks are not easily searchable". Dec 19 2015, 12:51 AM
Cpiral renamed this task from "URL-style internal wikilinks are not easily searchable" to "URL-style, internal, wikilinks are hardly searchable". Dec 20 2015, 3:12 AM
Cpiral updated the task description.
Cpiral renamed this task from "URL-style, internal, wikilinks are hardly searchable" to "URL internal links are hardly searchable". Dec 30 2015, 7:35 AM
Cpiral updated the task description.
Deskana triaged this task as Lowest priority. Dec 31 2015, 12:28 AM
Deskana moved this task from Needs triage to Search on the Discovery board.
Deskana added a subscriber: Deskana.
Cpiral updated the task description. Jan 2 2016, 8:10 AM
Cpiral renamed this task from "URL internal links are hardly searchable" to "URL internal links that use parser functions or go to a specified target are hardly searchable". Jan 22 2016, 12:58 AM
Cpiral updated the task description.

Wikicheck mentions URL-style internal wikilinks as error 90, "Internal link written as an external link". No bots fix them, but they are found and listed for cleanup.

The best way to find these would be to allow searching the list of page links and external links we store in the search index. Unfortunately the wikitext parser doesn't seem to think most of the examples given qualify as external links (for example, [{{fullurl:{{FULLPAGENAME}}|action=edit}} improve this {{{1|article}}}]).

Any fix here would have to be to the wikitext parser, for it to consider these things to actually be external links and report them as such.