Page MenuHomePhabricator

User talk message notifications from notifications API does not encode link fragments
Closed, DeclinedPublicBUG REPORT

Description

This is branched off from https://phabricator.wikimedia.org/T307529.

How many times were you able to reproduce it?

Every time

Steps to reproduce

  1. Trigger a user talk page message by writing to your talk page under a topic title that contains special or non-latin characters.
  2. Fetch user talk page message notification from Notifications API, e.g.

https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&meta=notifications&notfilter=!read&notformat=model&notlimit=max&notwikis=enwiki

Expected results

Urls within the links object have fragments encoded

Actual results

Urls within the links object contain special characters within fragments that are not encoded

This causes our URLs to fail to decode on the iOS side. We have loosened our decoding a bit to fix the worst bugs, but we are still unable to link directly to a talk page topic without an encoded fragment.

Example response from GU Wikipedia:

#c-Aniket-2022-05-03T16:16:00.000Z-સ્વાગત!"

...
            "header": "‪New user message‬ left a message on <strong>your talk page</strong>.",
            "compactHeader": "‪New user message‬ left you a message.",
            "body": "પ્રિય Tsevener, શુભ સંધ્યા, ગુજરાતી વિકિપીડિયામુક્ત વિશ્વજ્ઞાનકોશમાં જોડાવા બદલ આપનો આભાર અને અહીં આપનું હાર્દિક સ્વાગત છે! જગતભરના જ્ઞાની લોકોથી લ...",
            "icon": "edit-user-talk",
            "links": {
              "primary": {
                "url": "https://gu.wikipedia.org/wiki/%E0%AA%B8%E0%AA%AD%E0%AB%8D%E0%AA%AF%E0%AA%A8%E0%AB%80_%E0%AA%9A%E0%AA%B0%E0%AB%8D%E0%AA%9A%E0%AA%BE:Tsevener?markasread=245545&markasreadwiki=guwiki#c-Aniket-2022-05-03T16:16:00.000Z-સ્વાગત!",
                "label": "સંદેશ જુઓ"
              },
              "secondary": [
                {
                  "url": "https://gu.wikipedia.org/wiki/%E0%AA%B8%E0%AA%AD%E0%AB%8D%E0%AA%AF:New_user_message",
                  "label": "‪New user message‬",
                  "tooltip": "",
                  "description": "",
                  "icon": "userAvatar",
                  "prioritized": true
                },
                {
                  "url": "https://gu.wikipedia.org/w/index.php?title=%E0%AA%B8%E0%AA%AD%E0%AB%8D%E0%AA%AF%E0%AA%A8%E0%AB%80_%E0%AA%9A%E0%AA%B0%E0%AB%8D%E0%AA%9A%E0%AA%BE:Tsevener&oldid=prev&diff=819623",
                  "label": "ફેરફારો જુઓ",
                  "description": "",
                  "icon": "changes",
                  "prioritized": true
                }
              ],
              "legacyPrimary": {
                "url": "//gu.wikipedia.org/wiki/%E0%AA%B8%E0%AA%AD%E0%AB%8D%E0%AA%AF%E0%AA%A8%E0%AB%80_%E0%AA%9A%E0%AA%B0%E0%AB%8D%E0%AA%9A%E0%AA%BE:Tsevener?markasread=245545&markasreadwiki=guwiki",
                "label": "સંદેશ જુઓ"
              }
            },
            "iconUrl": "/w/extensions/Echo/modules/icons/edit-user-talk-progressive.svg"
          }
...

Hash signs within the topic titles are not encoded either:

#c-TSevener_(WMF)-2022-05-04T12:53:00.000Z-Test_adding_title_with_#_mark",
#Test_adding_title_with_#_mark",

...
            "header": "‪TSevener (WMF)‬ te dejó un mensaje en <strong>tu página de discusión</strong> en «<strong>‪Test adding title with # mark‬</strong>».",
            "compactHeader": "‪TSevener (WMF)‬ te dejó un mensaje en «<strong>‪Test adding title with # mark‬</strong>».",
            "body": "asdf",
            "icon": "edit-user-talk",
            "links": {
              "primary": {
                "url": "https://es.wikipedia.org/wiki/Usuario_discusi%C3%B3n:Tsevener?markasread=45724720&markasreadwiki=eswiki#c-TSevener_(WMF)-2022-05-04T12:53:00.000Z-Test_adding_title_with_#_mark",
                "label": "Ver mensaje"
              },
              "secondary": [
                {
                  "url": "https://es.wikipedia.org/wiki/Usuario:TSevener_(WMF)",
                  "label": "‪TSevener (WMF)‬",
                  "tooltip": "",
                  "description": "",
                  "icon": "userAvatar",
                  "prioritized": true
                },
                {
                  "url": "https://es.wikipedia.org/w/index.php?title=Usuario_discusi%C3%B3n:Tsevener&oldid=prev&diff=143315311",
                  "label": "Ver cambios",
                  "description": "",
                  "icon": "changes",
                  "prioritized": true
                }
              ],
              "legacyPrimary": {
                "url": "//es.wikipedia.org/wiki/Usuario_discusi%C3%B3n:Tsevener?markasread=45724720&markasreadwiki=eswiki#Test_adding_title_with_#_mark",
                "label": "Ver mensaje"
              }
            }
...

Event Timeline

MMiller_WMF subscribed.

This has come to Growth team from iOS, for us to triage as a bug.

LGoto added a subscriber: JMinor.
LGoto subscribed.
kostajh subscribed.

Thanks for this report @Tsevener. Can I get an idea from you please on the priority/urgency of fixing this?

Thanks @Tsevener and @kostajh Becsause this effects the ability to click through on welcome messages and other key pages in any language with unicode characters and is in production, I'd say this is a pretty urgent issue for us. The scale maybe is medium/limited but the severity is bad (user experience is broken) and it seems like an equity issue that this works for latin characters but not other scripts. I don't think it is an unbreak now, but it is urgent for us.

@Tsevener happy to get your perspective on this as well, of course.

Thanks @Tsevener and @kostajh Becsause this effects the ability to click through on welcome messages and other key pages in any language with unicode characters and is in production, I'd say this is a pretty urgent issue for us. The scale maybe is medium/limited but the severity is bad (user experience is broken) and it seems like an equity issue that this works for latin characters but not other scripts. I don't think it is an unbreak now, but it is urgent for us.

@Tsevener happy to get your perspective on this as well, of course.

Thanks for this clarification. I'm discussing with others on Growth team to see if we can make some time to look at this.

I could be wrong, but I think that DiscussionTools is likely involved in that includes/Notifications/DiscussionToolsEventTrait.php#getCommentLink is invoked in order to generate the URL, so I'm tagging that project in case this is something the engineers there have encountered or thought about already.

@JMinor Agreed, that sounds good to me. Thanks!

I had the impression that the normal flow of calling Title->getFullURL (as DiscussionTools is doing) is supposed to escape the fragment. Presumably there's some chain of circumstances which is bypassing that...

This is the normal MediaWiki behavior for links to fragments. See e.g. https://en.wikipedia.org/wiki/User:Matma_Rex/sandbox and the sections there. The links to "ąśź" are not encoded: <a href="#ąśź">. That page and the links on it seem to work fine for me in the iOS app, so maybe it's a problem specific to the notification links?

Hash signs within the topic titles are not encoded either:
#c-TSevener_(WMF)-2022-05-04T12:53:00.000Z-Test_adding_title_with_#_mark",

This seems intentional, as far as I can tell. Clicking the link https://es.wikipedia.org/wiki/Usuario_discusi%C3%B3n:Tsevener?markasread=45724720&markasreadwiki=eswiki#c-TSevener_(WMF)-2022-05-04T12:53:00.000Z-Test_adding_title_with_#_mark takes me to the correct location on the talk page.

we are still unable to link directly to a talk page topic without an encoded fragment.

This link (https://gu.wikipedia.org/wiki/%E0%AA%B8%E0%AA%AD%E0%AB%8D%E0%AA%AF%E0%AA%A8%E0%AB%80_%E0%AA%9A%E0%AA%B0%E0%AB%8D%E0%AA%9A%E0%AA%BE:Tsevener?markasread=245545&markasreadwiki=guwiki#c-Aniket-2022-05-03T16:16:00.000Z-સ્વાગત!) seems to work also?

@Tsevener could you please elaborate a bit on what you wrote in the task description, about being unable to decode URLs in iOS? Is it because the URL and its query parameters are encoded but the fragment only is not, so e.g. when parsing #c-TSevener_(WMF)-2022-05-04T12:53:00.000Z-Test_adding_title_with_#_mark your URL parser gets confused as to where the fragment is supposed to begin?

Again assuming this is about DiscussionTools generated URLs, we could run the comment ID through url encoding:

<?php
// For a single-comment notification, make a pretty(ish) direct link to the comment.
// The browser scrolls and we highlight it client-side.
$commentId = $this->event->getExtraParam( 'comment-id' );
if ( !$commentId ) {
	return null;
}
$commentId = wfUrlencode( $commentId );
$title = $this->event->getTitle();
return $title->createFragmentTarget( $commentId )->getFullURL();

...but that would also require some change in DiscussionTools client-side to decode the fragment, as clicking a link with an encoded fragment then doesn't work with DiscussionTools' code that looks for that comment ID to bring you to the correct location on the page.

@kostajh sure - it's worth noting that we take the URL from the notifications API, extract the fragment topic title (we use the legacyPrimary URL since the topic title is more on its own there), and use that to send them to a native user talk page. The native user talk page matches up the topic titles with the fragment string and if it finds a match, sends them straight to that particular topic thread with its replies.

A Swift URL struct fails to instantiate with these decoded fragments, which is what we were trying to use when deserializing the notifications API response. I could just use a string for this, but then it felt like I was starting to make assumptions about when a fragment begins or ends (for example, a topic containing a hash sign broke my initial attempt at this). Plus a URL with encoded paths but unencoded fragments just looked off to me, so I figured it was a bug.

I think a link to https://en.wikipedia.org/wiki/User:Matma_Rex/sandbox#ąśź works from the app because it's just pushing on an embedded web view instead of trying to deal with it natively. Maybe web browsers are inherently able to switch back & forth between encoded and unencoded fragments?

Anyways, I'm happy to discuss alternatives if needed, I was just hoping this would be be easier and a more proper fix area on the backend than client-side. Hope this helps!

Why does the URL not work

MediaWiki URLs are compliant with the WhatWG URL Standard (https://url.spec.whatwg.org/), which is what modern browsers implement.

The Swift URL struct you linked apparently implements RFC 1808 (https://www.ietf.org/rfc/rfc1808.txt), from 1995.

I found that there's also a URLComponents struct, which implements RFC 3986 (https://www.ietf.org/rfc/rfc3986.txt), from 2005. (But that is not too helpful, it still doesn't handle URLs like ours, according to some random people on some random forums on the internet that I could find.)

…and how could you make it work

Apparently there is a library for Swift that implements the WhatWG URL Standard: https://github.com/karwa/swift-url. I don't know whether that's an option for you (entirely reasonable if you don't want to add dependencies).

Why do we emit such weird URLs anyway

You have a good point that a URL with encoded paths but unencoded fragments is pretty weird. I thought that maybe it's for compatibility with some older browsers, so I investigated.

For historical context, MediaWiki used to encode URL fragments using a custom approach, kind of like percent-encoding but using different characters. We still support them for backwards-compatibility, e.g.: https://en.wikipedia.org/wiki/User:Matma_Rex/sandbox#.C4.85.C5.9B.C5.BA. (This behavior is controlled by $wgFragmentMode, and is likely to be removed in the future, so please don't use this anywhere.) It was changed to the current behavior in T152540: Migrate to HTML5 section ids.

I had a look at the subtasks there, and the commits that added it, and I found the explanation here: T157729#3462443 (2017)

This raises the question whether we really want to percent-encode the fragments: in addition to quirks mode (really minor concern these days, but still can unexpectedly flip on for some chthonic reason), it makes fragments unreadable on Chrome (44% of pageviews) and IE/Edge (11%).

…and how could we make it work

It looks like Chrome has fixed this since then (found the link in another task: https://bugs.chromium.org/p/chromium/issues/detail?id=789163), and IE is going the way of the dodo, so it wouldn't be unreasonable to change this behavior in all of MediaWiki. We'd probably want to consult this somewhat more widely though.

We could also change it only for our API instead of everywhere. I think the correct way to do it is like this:

$title->getFullURL() . '#' . wfUrlencode( $commentId );

…and this actually seems to work with the DiscussionTools client-side code, since that is an equivalent WhatWG URL, and we just use the normal browser APIs everywhere. But I'd be mildly concerned about inventing a custom approach again and maybe running into compatibility issues somewhere else (and also, I'd expect that you'll keep running into issues with URL parsing in other places in the iOS app).

@matmarex thanks for the information! I didn't realize the URL standardization was that different with modern web vs Swift URL vs URLComponents, though I'm not wild about introducing a third party library for something as widespread as URL in our codebase. I'll discuss with the wider team in our sync tomorrow and will let you know our collective thoughts.

I found this to be a pretty interesting thread from skimming. The WebURL type from that swift-url library is mentioned often.

@matmarex

So as a team we decided against pulling in the swift-url library. The Swift Foundation URL is just too integral to all of Apple's APIs. It's also basically how we identify articles in the app so it would be a huge change if we were to switch to it across the codebase. If Apple were to come out with a new WhatWG-supported URL type of their own, that sign-off would be enough of a push for us to switch. WWDC is in two weeks - I'm extremely doubtful but maybe we can kick the can down the road until we're sure there's nothing new coming from them in this area.

Another client-side alternative we have is for us to attempt our own parsing here, but I don't think it's a good idea for our team to get into the URL parsing game - we're bound to mess it up. I'm still preferring a backend fix, especially if the original reason for MediaWiki unencoded fragments no longer applies as you say. Let us know if y'all warm up to the idea any further.

I'd expect that you'll keep running into issues with URL parsing in other places in the iOS app

Absolutely a fair point - there are definitely issues elsewhere with our URL parsing of article links, not just here in the notifications APIs. While we await WWDC announcements I can do some digging in this area.

It looks like Chrome has fixed this since then (found the link in another task: https://bugs.chromium.org/p/chromium/issues/detail?id=789163), and IE is going the way of the dodo, so it wouldn't be unreasonable to change this behavior in all of MediaWiki. We'd probably want to consult this somewhat more widely though.

Thank you for digging through that history and documenting it here, @matmarex.

@Tsevener if this is a change your team would like to see made across MediaWiki, a good next step is probably a short email explaining the proposed change, and then it needs to be circulated in wikitech-l for feedback. Would you or someone from your team be interested to lead that process? I'm not sure if running this through the technical decision forum is worth doing (cc @DAbad as chair of the technical forum for product department -- do you think we need to do that?).

@kostajh Sorry for the delay - that sounds good! I'll write something up and send it to wikitech-l early next week.

This seems trivial to fix on the app side, as long as we are talking about URLs in JSON structures (and not HTML parsing). Just find the first # character in the URL and percent-encode everything past it. The app is a controlled environment where behavior is easy to predict and monitor. The web, where hundreds of different clients will process the data, is not a controlled environment - it's hard to predict which browsers will mishandle something and often hard to notice that something does. We shouldn't break standards compliance on the web just for an app's convenience.

Also, such a change would be liable to break navigation via existing external links. The HTML fragment matching algorithm is somewhat intricate, and for less standard-compliant browsers the only reliable thing is bytewise equivalence, so changes to anchors should not be taken lightly.

@Tgr yep, just URLs in JSON structures. I was thinking if an encoded fragment would still be WhatWG-compliant, yet allow other clients with older URL standards to more easily deal with them, it would be win-win. But I see what you mean that the risk isn't worth it when there's a simple workaround on our side.

Just find the first # character in the URL and percent-encode everything past it.

Originally I tried this from grabbing the last # character, but that of course messed up topic titles containing a hash. I wasn't sure if the first # character was a safe assumption. I'm not familiar with all of the URL standard(s) rules, but we can definitely grab the first hash if that sounds reliable to y'all.

Thanks for chiming in!

Going to close this in favor of a client-side workaround. This work will be done in T307604. Thanks everyone!