Page MenuHomePhabricator

Add data-title attribute to anchors
Closed, InvalidPublic

Description

This task encompasses the discussion and work to provide API consumers with wiki page title information for every internal link in page content.

For example, the Popups extension provides users with page and reference previews for links in the page content. To present a preview, Popups must identify the page title for an arbitrary link in the content. It does so currently using a JavaScript heuristic. It looks about like this, has lots of tests, and isn't information that's a good idea to guess at:

JavaScript URL to page title heuristic
/**
 * Returns the decoded wiki page title referenced by the passed link as a string when parsable.
 * The title query parameter is returned, if present. Otherwise, a heuristic is used to attempt to
 * extract the title from the path.
 *
 * The API is the source of truth for page titles. This function should only be used in
 * circumstances where the API cannot be consulted.
 *
 * Assuming the current page is on metawiki, consider the following example links and
 * `newFromUri()` outputs:
 *
 *     https://meta.wikimedia.org/wiki/Foo → Foo (path title)
 *     http://meta.wikimedia.org/wiki/Foo → Foo (mismatching protocol)
 *     /wiki/Foo → Foo (relative URI)
 *     /w/index.php?title=Foo → Foo (title query parameter)
 *     /wiki/Talk:Foo → Talk:Foo (non-main namespace URI)
 *     /wiki/Foo bar → Foo_bar (name with spaces)
 *     /wiki/Foo%20bar → Foo_bar (name with percent encoded spaces)
 *     /wiki/Foo+bar → Foo+bar (name with +)
 *     /w/index.php?title=Foo%2bbar → Foo+bar (query parameter with +)
 *     / → null (mismatching article path)
 *     /wiki/index.php?title=Foo → null (mismatching script path)
 *     https://archive.org/ → null (mismatching host)
 *     https://foo.wikimedia.org/ → null (mismatching host)
 *     https://en.wikipedia.org/wiki/Bar → null (mismatching host)
 *
 * This function invokes `Uri.isInternal()` to validate that this link is assuredly a local
 * wiki link and that the internal usage of both the title query parameter and value of
 * wgArticlePath are relevant.
 *
 * This function doesn't throw. `null` is returned for any unparseable input.
 *
 * @param {mw.Uri|Object|string} [uri] Passed to Uri.
 * @param {Object|boolean} [options] Passed to Uri.
 * @param {Object|boolean} [options.validateReadOnlyLink] If true, only links that would show a
 *     page for reading are considered. E.g., `/wiki/Foo` and `/w/index.php?title=Foo` would
 *     validate but `/w/index.php?title=Foo&action=bar` would not.
 * @return {mw.Title|null} A Title or `null`.
 */
Title.newFromUri = function ( uri, options ) {
  var
    mwUri,
    regExp,
    matches,
    title;

  try {
    // uri may or may not be a Uri but the Uri constructor accepts a Uri parameter.
    mwUri = new mw.Uri( uri, options );
  } catch ( e ) {
    return null;
  }

  if ( !mwUri.isInternal() ) {
    return null;
  }

  if ( ( options || {} ).validateReadOnlyLink && !isReadOnlyUri( mwUri ) ) {
    // An unknown query parameter is used. This may not be a read-only link.
    return null;
  }

  if ( mwUri.query.title ) {
    // True if input starts with wgScriptPath.
    regExp = new RegExp( '^' + mw.RegExp.escape( mw.config.get( 'wgScriptPath' ) ) + '/' );

    // URL has a nonempty `title` query parameter like `/w/index.php?title=Foo`. The script path
    // should match.
    matches = regExp.test( mwUri.path );
    if ( !matches ) {
      return null;
    }

    // The parameter was already decoded at Uri construction.
    title = mwUri.query.title;
  } else {
    // True if input starts with wgArticlePath and ends with a nonempty page title. The first
    // matching group (index 1) is the page title.
    regExp = new RegExp( '^' + mw.RegExp.escape( mw.config.get( 'wgArticlePath' ) ).replace( '\\$1', '(.+)' ) );

    // No title query parameter is present so the URL may be "pretty" like `/wiki/Foo`.
    // `Uri.path` should not contain query parameters or a fragment, as is assumed in
    // `Uri.getRelativePath()`. Try to isolate the title.
    matches = regExp.exec( mwUri.path );
    if ( !matches || !matches[ 1 ] ) {
      return null;
    }

    try {
      // `Uri.path` was not previously decoded, as is assumed in `Uri.getRelativePath()`, and
      // decoding may now fail. Do not use `Uri.decode()` which is designed to be paired with
      // `Uri.encode()` and replaces `+` characters with spaces.
      title = decodeURIComponent( matches[ 1 ] );
    } catch ( e ) {
      return null;
    }
  }

  // Append the fragment, if present.
  title += mwUri.fragment ? '#' + mwUri.fragment : '';

  return Title.newFromText( title );
};

It would be much better if this information could be modeled in the content as a data-title attribute. For example, a link to the enwiki banana page may look like:

data-title example (Banana)
<a rel="mw:WikiLink" href="/wiki/Banana" title="Banana" data-title="Banana" class="new">bananas.</a>

And here's an example outside the main namespace to banana's talk page:

data-title example (Talk:Banana)
<a rel="mw:WikiLink" href="/wiki/Talk:Banana" title="Banana (talk)" data-title="Talk:Banana" class="new">banana (talk)</a>

Note that the existing title property is conflated with presentation and so should not be used for modeling.

Related:

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Jdlrobson added a subscriber: Jdlrobson.

This would need buy-in and implementation from the parsing team. Please let us know if you need more clarity around the proposal and priority here!

We are right now heads down in porting Parsoid to PHP and would prefer not to undertake any development on the JS codebase unrelated to the porting itself. So, clarity around priority would be helpful for us to figure out how and when to engage with this request.

@ssastry thanks for the clarity. To be clear this would be a change /in/ the PHP parser to annotate all links with data-title (we'd handle the JS bit). Totally understand if that has to wait until post-porting Parsoid to PHP but was curious how difficult that would be. Given activity in T201339 it looks like we can work around this in the meantime.

My understanding is that if the link is marked with mw:wikilink you can just infer the title from the relative path. Is there a case where this wouldn't work?

My understanding is that if the link is marked with mw:wikilink you can just infer the title from the relative path. Is there a case where this wouldn't work?
VE does this here: https://github.com/wikimedia/mediawiki-extensions-VisualEditor/blob/8846e65e4475fb1942e2d30478d072b5261cd4a7/modules/ve-mw/ve.MWutils.js#L163

I think this would cover the use case in Popups and Minerva. @ssastry, can you confirm that using mw:wikilink anchors with @Esanders' regular expression is safe?

My understanding is that if the link is marked with mw:wikilink you can just infer the title from the relative path. Is there a case where this wouldn't work?
VE does this here: https://github.com/wikimedia/mediawiki-extensions-VisualEditor/blob/8846e65e4475fb1942e2d30478d072b5261cd4a7/modules/ve-mw/ve.MWutils.js#L163

I think this would cover the use case in Popups and Minerva. @ssastry, can you confirm that using mw:wikilink anchors with @Esanders' regular expression is safe?

One of us will take a look soonish .. but do note that @Esanders is talking about Parsoid output with mw:WikiLink annotations but @Jdlrobson mentioned the PHP parser output. Can you clarify? Of course, we are on the path to move everything over to Parsoid output, but that is still 18-24 months out.

@ssastry, this task is about improving the current implementation in Popups and now Minerva which have JavaScript client-side implementations. As far as I know, both use the PHP parser presently and there's no rush to replace the client-side hacks. I just wanted to track the work since 1) @Krinkle identified this improvement in code review and 2) dropping the client side implementations would hopefully drop a bunch of "guesswork" code.

@ssastry, this task is about improving the current implementation in Popups and now Minerva which have JavaScript client-side implementations. As far as I know, both use the PHP parser presently and there's no rush to replace the client-side hacks. I just wanted to track the work since 1) @Krinkle identified this improvement in code review and 2) dropping the client side implementations would hopefully drop a bunch of "guesswork" code.

Given what you said there, how sad would you all be if we didn't do this at all and just waited for the Parsoid switchover?

@ssastry, if it's true that titles can always be derived by removing the relative path on anchors classified as mw:wikilink, I don't think there is any work to be done on the Parsoid side. The client side implementation could then know that every internal page link on an arbitrary wiki page could be stripped of its relative path to obtain the title, including namespace, which would eliminate guesswork and permit the client patch in Core while greatly simplifying the code. In the case, I wouldn't be sad at all. If that's not the case and work is needed, this task is valid but its resolution can easily wait post-switchover AFAIK.

ssastry changed the task status from Open to Stalled.Mar 19 2019, 6:31 PM

Thanks! So, I am going to mark this stalled. But, if priorities change and you need this sooner, please update this ticket accordingly.

Yes, the intention is certainly that title (in normalized form, ie spaces converted) can always be derived by removing the relative path. Furthermore, in modern Parsoid (not historically, but we can ignore that) that relative path is *always* ./. So just strip the first two characters and you've got your title.

In fact, the function of the relative path is just to protect colons in namespaces, since otherwise href="Template:Foo" gets interpreted as a URL *protocol* named Template. To prevent that we'd have to URL encode the colon -- href="Template%3AFoo" -- but (once upon a time) we decided that looked ugly and we'd rather use ./Template:Foo instead.

tl;dr: there shouldn't be any need for a data-title attribute in Parsoid, as it is currently.

That said, it's been brought up that Parsoid output doesn't natively support wikis where the title needs to be passed in the query string. I don't think we've ever decided what to do about that -- that configuration is not used in WMF production and I think the quiet hope was we could just deprecate it.

@cscott, thanks for verifying and the informative aside on protocols. I think the title query parameter discussion can be considered out of scope for use in Popups and Minerva. I see that both missing and present pages and pages with fragments would be rendered with "pretty" URLs and could use './' snipping. E.g.:

https://en.wikipedia.org/api/rest_v1/page/html/User:Jdlrobson%2Fdraft
https://en.wikipedia.org/api/rest_v1/page/html/Barack_Obama

I think this covers the needs of Popups and Minerva so I'm going to close this task. I will open new tasks for updating the client side code but it will depend on the parser migration.