URLs for the same title without extra query parameters should have the same canonical link
Closed, Resolved, Public

Description

These URLs all claim different canonical URLs:

https://en.wikipedia.org/wiki/San_Francisco

<link rel="canonical" href="https://en.wikipedia.org/wiki/San_Francisco">

https://en.wikipedia.org/w/index.php?title=San_Francisco

<link rel="canonical" href="https://en.wikipedia.org/w/index.php?title=San_Francisco" />

https://en.wikipedia.org/w/?title=San_Francisco

<link rel="canonical" href="https://en.wikipedia.org/w/?title=San_Francisco" />

https://en.wikipedia.org/?title=San_Francisco

<link rel="canonical" href="https://en.wikipedia.org/?title=San_Francisco" />

They are exactly the same resource though - so which one is the real canonical URL?
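
For anyone wanting to reproduce the comparison offline, here is a minimal sketch that extracts the canonical href from each of the tags quoted above. It uses only Python's standard `html.parser`; the names `CanonicalLinkParser`, `canonical_hrefs` and `REPORTED` are made up for this illustration and are not part of MediaWiki.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing tags (<link ... />) here.
        if tag != "link":
            return
        attr_map = dict(attrs)
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "canonical" in rel_tokens and href:
            self.hrefs.append(href)


def canonical_hrefs(html_text):
    """Return all canonical hrefs declared in html_text, in document order."""
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Request URL -> the <link> tag reported in the description for that URL.
REPORTED = {
    "https://en.wikipedia.org/wiki/San_Francisco":
        '<link rel="canonical" href="https://en.wikipedia.org/wiki/San_Francisco">',
    "https://en.wikipedia.org/w/index.php?title=San_Francisco":
        '<link rel="canonical" href="https://en.wikipedia.org/w/index.php?title=San_Francisco" />',
    "https://en.wikipedia.org/w/?title=San_Francisco":
        '<link rel="canonical" href="https://en.wikipedia.org/w/?title=San_Francisco" />',
    "https://en.wikipedia.org/?title=San_Francisco":
        '<link rel="canonical" href="https://en.wikipedia.org/?title=San_Francisco" />',
}

declared = {url: canonical_hrefs(tag) for url, tag in REPORTED.items()}
distinct = {href for hrefs in declared.values() for href in hrefs}

# A single resource should yield exactly one distinct canonical href.
print(f"{len(distinct)} distinct canonical URLs for one page")
```

To check live pages instead, the tag strings in `REPORTED` could be replaced by the HTML fetched from each URL.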

Details

Reference
bz65402

Event Timeline

bzimport raised the priority of this task from to Normal.Nov 22 2014, 3:16 AM
bzimport added a project: MediaWiki-Sites.
bzimport set Reference to bz65402.
bzimport added a subscriber: Unknown Object (MLST).

In wmf5 the canonical tag isn't set when the requested page isn't a redirect, if I'm right. Test:

Redirect page:
http://www.houseofcardswiki.de/OTA

Tag URL: /Over_the_Air -> redirect resource

http://www.houseofcardswiki.de/index.php?title=OTA

Tag URL: /Over_the_Air -> redirect resource

Content Page:
http://www.houseofcardswiki.de/Over_the_Air

No Tag

http://www.houseofcardswiki.de/index.php?title=Over_the_Air

No Tag

Maybe it's a configuration setting (my wiki is a clean install without anything added), because the same bug shows up on MediaWiki :)

My local wiki doesn't have a canonical link in the head in either case. Any idea what the purpose of that tag is?

If I see this right, the canonical link is only added if the page requested by the user is a redirect. For example, if the user requests /wiki/OTA, which is a redirect to /wiki/Over_the_Air, a canonical tag with "/wiki/Over_the_Air" is added to the page. That's the behaviour in a clean install, but not on mediawiki and wikipedia (wmf4 and 5).

This gives a good breakdown:
http://moz.com/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps

Essentially a canonical URL points to the one true URL for that content that search engines should index.

On mobile, http://en.m.wikipedia.org/wiki/Main_Page declares http://en.wikipedia.org/wiki/Main_Page as its canonical URL (i.e. the desktop site).

We do this to stop mobile results for the same content showing up in search results.

I can see this being less of a problem for Wikipedia, but for new projects like Wikivoyage we're probably not helping them rise up the rankings with this bug.

Stu added a subscriber: Stu.Mar 23 2015, 12:27 AM
MC8 added a subscriber: MC8.Jun 17 2015, 9:40 AM
Nemo_bis raised the priority of this task from Normal to High.Jun 17 2015, 1:37 PM
Nemo_bis added a project: HTTPS-by-default.
Nemo_bis set Security to None.

Change 100952 had a related patch set uploaded (by Nemo bis):
Let FauxRequest use data from a Title object

https://gerrit.wikimedia.org/r/100952

Krinkle renamed this task from Canonical URL in link tag may not be an actual canonical URL to URLs for the same title without extra query parameters should have the same canonical link.Jun 18 2015, 12:36 AM
Krinkle updated the task description.

This is pretty bad. I'm surprised and a bit disappointed that this issue has been lingering for so long.

It seems there is existing logic for normalising urls.

https://en.wikipedia.org/?title=San%20Francisco
https://en.wikipedia.org/w/index.php?title=San%20Francisco
->
https://en.wikipedia.org/wiki/San_Francisco

It's triggered by the "not canonical" encoding of the title. And as expected (for now), this logic correctly prevents redirects when there are additional query parameters:
https://en.wikipedia.org/w/index.php?title=San%20Francisco&foo=a
-> No redirect.

So depending on where this logic lives, it may be as simple as changing it to look at the URL itself instead of merely the encoding of the title value. We can preserve the logic that denies redirects in case of other query parameters.
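
A rough sketch of that rule, written in Python for illustration rather than as MediaWiki's actual PHP: compare the whole request URL with the canonical one, and skip the redirect whenever any parameter other than `title` is present. The `/wiki/` prefix and the function name `redirect_target` are assumptions made for this example.

```python
from urllib.parse import parse_qsl, quote, urlsplit

ARTICLE_PREFIX = "/wiki/"  # assumed article path, as used on Wikipedia


def redirect_target(request_url):
    """Return the URL to redirect to, or None if no redirect should happen."""
    parts = urlsplit(request_url)
    params = parse_qsl(parts.query, keep_blank_values=True)

    # Only a lone title parameter qualifies; anything extra blocks the redirect.
    if len(params) != 1 or params[0][0] != "title":
        return None

    title = params[0][1]  # parse_qsl has already decoded %20 to a space
    if not title:
        return None

    db_key = quote(title.replace(" ", "_"), safe="/:")
    target = f"{parts.scheme}://{parts.netloc}{ARTICLE_PREFIX}{db_key}"

    # Compare the full URL, not just the encoding of the title value.
    return None if target == request_url else target


for url in (
    "https://en.wikipedia.org/?title=San%20Francisco",
    "https://en.wikipedia.org/w/index.php?title=San_Francisco",
    "https://en.wikipedia.org/w/index.php?title=San%20Francisco&foo=a",
    "https://en.wikipedia.org/wiki/San_Francisco",
):
    print(url, "->", redirect_target(url) or "No redirect")
```

This deliberately ignores namespaces, special pages and title normalisation beyond spaces, all of which the real logic would have to handle.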

Krinkle claimed this task.Jun 19 2015, 7:14 PM

Change 219446 had a related patch set uploaded (by Krinkle):
MediaWiki.php: Redirect non-standard title urls to canonical

https://gerrit.wikimedia.org/r/219446

Thank you for working on this!

Redirecting certain non-canonical URLs seems like a good start, but is it only a partial solution? I'm concerned that after https://gerrit.wikimedia.org/r/219446 is merged and deployed, we'll still be outputting bullshit canonical tags for certain types of requests. For example:

$ curl -s "https://en.wikipedia.org/?title=San_Francisco&action=history" | grep canonical
<link rel="canonical" href="https://en.wikipedia.org/?title=San_Francisco&amp;action=history" />

Yeah this problem is growing in the Google indices. We may actually need to fix the rel=canonical outputs first and let that run for a while before we do the redirects, to make it clearer to google to fix the existing stuff...

(and yes: this hurts our cache performance. It also hurts purge behavior on edits, from the user's POV)

Thank you for working on this!
Redirecting certain non-canonical URLs seems like a good start, but is it only a partial solution? I'm concerned that after https://gerrit.wikimedia.org/r/219446 is merged and deployed, we'll still be outputting bullshit canonical tags for certain types of requests. For example:

$ curl -s "https://en.wikipedia.org/?title=San_Francisco&action=history" | grep canonical
<link rel="canonical" href="https://en.wikipedia.org/?title=San_Francisco&amp;action=history" />

There are at least two dozen isolated issues surrounding canonical link tags, many of which are tracked under T93550. For some, redirects are appropriate; for others, not. Either way, it's a distributed problem that'll need patching in many different places. One step at a time.

I expect this will require standardisation of sorts that needs to be architected carefully. One way to do it would be to implement something like getCanonicalUrl() in Action and WikiPage/SpecialPage sub classes. Possibly defaulting to ignoring query parameters. And perhaps using wgActionPaths.

Note that action=history has a robots=noindex,nofollow so those aren't indexed anyway (found no hits in Google).

We may actually need to fix the rel=canonical outputs first and let that run for a while before we do the redirects, to make it clearer to google to fix the existing stuff...

How does that make anything "clearer" to Google? We already have an established convention to redirect non-standard encodings of the title, and non-canonical versions of e.g. Special page names and namespaces – which seems to be working well (the variations on those are not indexed by Google). This extends that.

I'd like to get rid of these uncontrolled cache hits in Varnish sooner rather than later. It's causing stale content to be served and also issues with our url detection patterns that rely on /wiki or /w/index.php (such as for MobileFrontend redirects). Possibly breaks other patterns and expectations as well.

The only reason I mention rel=canonical first perhaps being clearer is because I'm no longer at all sure how Google is handling any of this. It's very puzzling to me that, for instance, Google has lots of index entries under https://www.google.com/?q=site:www.en.wikipedia.org, when www.en.wp.o returns 301s to en.wp.o (and AFAIK, has since long before any recent stuff), which has a correct domainname in its rel=canonical.

ori added a comment.EditedJun 21 2015, 10:40 PM

I expect this will require standardisation of sorts that needs to be architected carefully. One way to do it would be to implement something like getCanonicalUrl() in Action and WikiPage/SpecialPage sub classes. Possibly defaulting to ignoring query parameters. And perhaps using wgActionPaths.

Yes, I think that's the right approach. We ought to construct a canonical URL based on the object we're operating on and the action that is being performed, rather than derive it from the request URL, which is a bit like reading the label on incoming mail to remind oneself where one lives.

I think the right first step is to construct a clean canonical URL from scratch based on the current request context rather than attempt to arrive at a canonical URL by applying transformations to the raw request URL. Getting it right for normal article views would do a lot to lower the severity of this bug. We can worry about other actions and about special pages later.
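
To illustrate the idea (this is a Python sketch under stated assumptions, not the code from the patch below): the canonical URL is built only from the page title and site configuration, so the request URL never enters into it. The parameter names `server` and `article_path` are modelled on MediaWiki's $wgCanonicalServer and $wgArticlePath settings; the function name and the simplified title handling are assumptions for this example.

```python
from urllib.parse import quote


def canonical_url(server, title, article_path="/wiki/$1"):
    """Build a canonical URL from site config and a page title.

    The request URL is deliberately not an input, so every way of
    requesting the same page yields the same canonical link.
    """
    # Simplified normalisation: MediaWiki titles store spaces as underscores.
    db_key = title.strip().replace(" ", "_")
    return server + article_path.replace("$1", quote(db_key, safe="/:"))


# Four request URLs that all resolve to the title "San Francisco".
requests_for_same_title = [
    "https://en.wikipedia.org/wiki/San_Francisco",
    "https://en.wikipedia.org/w/index.php?title=San_Francisco",
    "https://en.wikipedia.org/w/?title=San_Francisco",
    "https://en.wikipedia.org/?title=San_Francisco",
]

for request_url in requests_for_same_title:
    # request_url is printed for comparison only; it is never passed in.
    print(request_url, "->", canonical_url("https://en.wikipedia.org", "San Francisco"))
```

Other actions and special pages would need their own rules, as noted above; this only covers plain article views.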

ori added a subscriber: tstarling.Jun 21 2015, 10:50 PM

Change 219782 had a related patch set uploaded (by Ori.livneh):
Construct clean canonical URLs for articles, ignoring request URL

https://gerrit.wikimedia.org/r/219782

Change 219782 merged by jenkins-bot:
Construct clean canonical URLs for wiki pages, ignoring request URL

https://gerrit.wikimedia.org/r/219782

Change 219893 had a related patch set uploaded (by Ori.livneh):
Construct clean canonical URLs for wiki pages, ignoring request URL

https://gerrit.wikimedia.org/r/219893

Change 219893 merged by Ori.livneh:
Construct clean canonical URLs for wiki pages, ignoring request URL

https://gerrit.wikimedia.org/r/219893

Nemo_bis added a subscriber: Nemo_bis.

Getting quite serious currently, https://www.google.it/search?q="wikipedia.org/%3Ftitle%3D" says 341k results here.

It's very puzzling to me that, for instance, Google has lots of index entries under https://www.google.com/?q=site:www.en.wikipedia.org, when www.en.wp.o returns 301s to en.wp.o

That's tracked at T28115.

Krinkle closed this task as Resolved.Jun 25 2015, 6:09 PM
Restricted Application added a subscriber: Matanya.Jun 25 2015, 6:09 PM
MC8 added a comment.Jun 25 2015, 6:23 PM

Is this resolved? There's still a patch for review (219446).

Yes, that patch is fixing something slightly different, but related (and would have fixed this bug too if it was merged first).

Change 219446 merged by jenkins-bot:
MediaWiki.php: Redirect non-standard title urls to canonical

https://gerrit.wikimedia.org/r/219446

Does not redirect, but has <link rel="canonical" href="https://de.wikipedia.org/wiki/Rosa_Luxemburg" />.

Same for https://de.wikipedia.org/?oldid=143610757 and https://de.wikipedia.org/?curid=12876.

Krinkle added a subscriber: Seb35.Aug 25 2015, 2:15 PM

I monitored the Google results with the link in the description. We are now two months after the commit.
At first sight it is strange, because I obtain 4,560,000 results, and 81,600,000 with the more general query https://www.google.fr/search?q=site:wikipedia.org+inurl:title+-intitle:title. But when I walk across the results, the last page (for en.wp.org) is the 29th page, announcing 281 results, so the commit seems to be effective (although the last page number should have been recorded beforehand for comparison); and the last page for wp.org is the 35th, with 344 results. However, I don’t know why Google announces such high numbers on the first page.

@Seb35 Yes, but I've come to believe those queries don't matter. Google intentionally exposes non-canonical urls in results when the query forces this. It does so in order to satisfy your query, but I think it does know what the canonical variant is.

Queries like https://www.google.com/search?q=site:en.wikipedia.org+inurl:title+-intitle:title intentionally bypass this due to the use of operators like inurl and intitle. I haven't seen a single result with ?title= in the url when refraining from using inurl. We were definitely getting some "?title" urls in regular search results in the past, but I think our efforts to provide canonical urls in the html and the 301 redirects have fixed this.

We just need to stop trying to find them with clever queries, because Google is cleverererer.

Example result from such query:

Emperor of Japan - Wikipedia, the free encyclopedia
en.wikipedia.org/?title=Emperor_of_Japan

Searching for that subject separately without inurl (https://www.google.com/search?q=site:en.wikipedia.org%20intitle:%22Emperor%20of%20Japan%22) yields only 10 results. Each result is a different page on Wikipedia. The top result is "Emperor of Japan" at the canonical url "https://en.wikipedia.org/wiki/Emperor_of_Japan". So it's aware of what the canonical url is for that result. It allows your query to match any of the indexed non-canonical variants, but if your query doesn't match the canonical variant, then it will show the result with the url of the variant it matched.

The same applies to cross-domain redirects. Searching for site:commons.wikipedia.org (note "Wikipedia") shows several hundred thousand results, including "General diagram types" at https://commons.wikipedia.org/wiki/General_diagram_types. But searching for https://www.google.co.uk/search?q=intitle:%22General%20diagram%20types%22%20intitle:%22Wikimedia%20Commons%22 only shows "General diagram types" at https://commons.wikimedia.org/wiki/General_diagram_types.

I think our efforts to provide canonical urls in the html and the 301 redirects have fixed this

Well I saw a bunch of non-canonical results in normal searches of the past few months, can't remember whether before or after the recent MediaWiki changes.

I guess the only way to really know is to search the request logs.

Seb35 added a comment.Aug 26 2015, 8:36 AM

Thanks for the analysis, I was suspecting such a behaviour and that makes sense.

However, with your example using only intitle in the query, the first result I obtain is still the non-canonical URL http://en.wikipedia.org/?title=Emperor_of_Japan; the same goes for "Batting average", but it works for "Order of the British Empire". Since I had previously run the special query, I used Tor to get anonymous results for https://www.google.com/search?q=site:en.wikipedia.org%20intitle:%22Emperor%20of%20Japan%22 and I still got the non-canonical URL.

But perhaps these pages are still the old version in Google's cache and we should wait some more time before seeing the new URL -- I’m not really knowledgeable in Google/SEO science, so I can't be of much help with a fine analysis of its behaviour.

Paladox added a subscriber: Paladox.Sep 7 2016, 1:16 PM

Fixing this task caused the problem in T131414, since the patches broke encoding support.

I have merged the revert of https://gerrit.wikimedia.org/r/219446 due to the numerous problems described at T106793, that affected both Wikimedia wikis (certain pages were inaccessible using certain tools/browsers) and third-party wikis (depending on the server software and configuration, certain pages were entirely inaccessible).

Note that this task, as written (about generation of <link rel="canonical" …>) is still fixed – https://gerrit.wikimedia.org/r/219782 was the only patch necessary to fix it, and it's not reverted. Someone should probably file a separate task about the redirects if you want to pursue that.

Deskana moved this task from Tag to Done on the SEO board.Jul 6 2018, 10:29 AM