
DuplicateUnnamedReferences generates ref name "https"
Closed, Resolved, Public

Description

See this diff, for example. I had a look around and found a few things.

There are two reuters references on that page, with different URLs. The first gets named "reuters.com" because DeriveReferenceName() eventually tries a UrlDomain regex, which extracts the domain name. That's in WikiFunctions/Parse/References.cs.

For the second reuters reference, the name is already taken, so DeriveReferenceName() falls through to the CiteTemplateUrl regex. The comment just above the call is "now try title of a citation", but that's not actually what the regex does. It extracts part of the url parameter of any cite template:

\s*\{\{\s*cit[^{}<>]*\|\s*url\s*=\s*([^\/<>{}\|]{4,35})

You'll notice that the group will stop on any of /<>{}|. Since this is applied to a URL, this will typically capture http: or https: (the scheme without //). It will then pass through CleanDerivedReferenceName(), which removes the colon.
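To make the behaviour concrete, here is a minimal Python sketch that applies the regex quoted above to a made-up reference. The sample citation and both helper names are assumptions for illustration only. `strip_colon` is a hypothetical stand-in that models just the colon removal described for CleanDerivedReferenceName(), not the real method.

```python
import re

# The CiteTemplateUrl pattern exactly as quoted in this task.
CITE_TEMPLATE_URL = re.compile(
    r"\s*\{\{\s*cit[^{}<>]*\|\s*url\s*=\s*([^\/<>{}\|]{4,35})"
)


def capture_url_prefix(ref_text):
    """Return what the capture group grabs from a cite template, or None."""
    match = CITE_TEMPLATE_URL.search(ref_text)
    return match.group(1) if match else None


def strip_colon(name):
    """Hypothetical stand-in: models only the colon removal step."""
    return name.replace(":", "")


# Made-up reference text, not taken from the page in the diff.
sample = (
    "{{cite web |url=https://www.reuters.com/article/example "
    "|title=Example headline}}"
)

captured = capture_url_prefix(sample)
print(repr(captured))               # the group stops at the first "/"
print(repr(strip_colon(captured)))  # the derived reference name
```

Because the character class excludes `/`, the group can never get past the scheme of an absolute URL, so the domain is out of reach for this regex regardless of which site is cited.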

My guess would be that when this code was first introduced in commit r4437 (before the split from Parser.cs), the regex was copied from "website URL" above, but url wasn't changed to title, even though that's what the comment says. This was actually reported in April 2014 and a fix was committed in r10079, but all it did was add a special case to ignore the name "http" instead of fixing the root of the problem.

So there are two possible fixes: add a second check for "https" alongside "http" (and hope there are no ftp references), or fix the regex so that it actually captures something meaningful. If my understanding is correct, that would be title instead of url.
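A rough Python sketch of the second option, assuming the only change is swapping `url` for `title` in the pattern. The pattern name, the helper, and the sample citation are all hypothetical, and the actual change in WikiFunctions/Parse/References.cs may well differ.

```python
import re

# Same shape as the quoted regex, but keyed on the title parameter.
# This is an assumed variant, not the pattern from the AWB source.
CITE_TEMPLATE_TITLE = re.compile(
    r"\s*\{\{\s*cit[^{}<>]*\|\s*title\s*=\s*([^\/<>{}\|]{4,35})"
)


def derive_name_from_title(ref_text):
    """Return up to 35 characters of the citation title, or None."""
    match = CITE_TEMPLATE_TITLE.search(ref_text)
    return match.group(1).strip() if match else None


# Made-up reference text for illustration.
sample_ref = (
    "{{cite web |url=https://www.reuters.com/article/example "
    "|title=Example headline}}"
)

print(derive_name_from_title(sample_ref))
```

With this variant the derived name comes from the title text rather than the URL scheme, so the special case for "http" would no longer be the only thing standing between the regex and a useless name. Titles longer than 35 characters would still be cut off by the `{4,35}` quantifier.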

Event Timeline

Rjwilmsi claimed this task.
Rjwilmsi subscribed.

rev 12330 DeriveReferenceName: try to use title of citation as documented