
Fix canonical URL issues (tracking)
Open, Needs Triage, Public

Description

I spent some time looking through our reports on Google Webmaster Tools. They highlighted a few issues with our handling of canonical URLs:

  • we're serving canonical URL tags for non-canonical URLs. For example, https://en.wikipedia.org/w/index.php?title=San_Francisco sends that URL as its canonical when IMHO it should send http://en.wikipedia.org/wiki/San_Francisco. Also saw this one in T67402: URLs for the same title without extra query parameters should have the same canonical link. IMHO https://en.wikipedia.org/w/index.php?title=San_Francisco should 301 to http://en.wikipedia.org/wiki/San_Francisco which would take care of the canonical issue.
  • we're serving canonical URL tags with URL parameters. For example, https://en.wikipedia.org/wiki/Category:Living_people?from=Fe. I don't know what the from parameter specifies, but canonical URLs generally shouldn't include URL parameters, mostly because URL parameters shouldn't be used to identify a unique canonical page. We could address this issue by telling Google, for each of our sites (e.g. all 286 language versions of Wikipedia), to ignore certain URL parameters (e.g. the page for en.wiki). That's a hack, though; we should just fix the app to strip URL parameters from canonicals.
  • we're treating old versions of pages as canonical. For example, http://en.wikipedia.org/w/index.php?title=San_Francisco&oldid=652862076 sends that URL as its canonical when it should send http://en.wikipedia.org/wiki/San_Francisco. This is a specific case of the above issues.
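
A minimal sketch of the canonicalization the three bullets above ask for, in Python rather than MediaWiki's PHP. It is not the actual MediaWiki code: the function name `canonical_url` and the constant `ARTICLE_PREFIX` are made up for illustration, and the sketch assumes articles live under /wiki/ as in the examples. It keeps the scheme and host of the incoming URL and drops every query parameter, including oldid and from.

```python
from typing import Optional
from urllib.parse import parse_qs, quote, unquote, urlsplit

# Assumption: articles are served under /wiki/, as in the examples above.
ARTICLE_PREFIX = "/wiki/"


def canonical_url(url: str) -> Optional[str]:
    """Return the parameter-free /wiki/Title form of a page URL, or None."""
    parts = urlsplit(url)
    if parts.path.startswith(ARTICLE_PREFIX):
        # /wiki/Title form: the title is the rest of the path.
        title = unquote(parts.path[len(ARTICLE_PREFIX):])
    else:
        # index.php?title=Title or /?title=Title form.
        titles = parse_qs(parts.query).get("title")
        if not titles:
            return None
        title = titles[0]
    if not title:
        return None
    title = title.replace(" ", "_")
    # Query parameters such as oldid= or from= are deliberately dropped.
    return f"{parts.scheme}://{parts.netloc}{ARTICLE_PREFIX}{quote(title, safe=':/')}"
```

With this sketch, the index.php?title=San_Francisco form, the oldid form, and the Category:Living_people?from=Fe form all map to their plain /wiki/ URLs. Whether old revisions should really point at the current article is the position taken in the description, not something the sketch settles.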

In smaller language Wikipedias I expect these issues are having a significant impact on search visibility and resulting in a much higher likelihood of search engines preferring an English language article to a local language article. Marking this as an i18n issue because of the disproportionate amount of impact on non-English Wikipedias. They're also pretty easy fixes, so worth a few hours of cleanup.

Event Timeline

Stu created this task. Mar 23 2015, 1:00 AM
Stu raised the priority of this task to Needs Triage.
Stu updated the task description.
Stu added subscribers: Stu, Eloquence, DarTar.
Restricted Application added a subscriber: Aklapper. Mar 23 2015, 1:00 AM
Stu set Security to None.
Stu updated the task description.
Aklapper updated the task description. Mar 26 2015, 2:20 PM
Aklapper updated the task description. Mar 26 2015, 2:23 PM
Florian added a subscriber: Florian. May 2 2015, 4:36 PM
greg added a subscriber: greg. Jun 10 2015, 10:01 PM

I am prioritizing this for Q1 2015, which starts in July.

Split this up into smaller tasks (and moved component labels down to the relevant tasks only, such as i18n). None of the mentioned examples seem related to mobile yet though. If we do find mobile-specific ones, add them as blockers :)

phuedx added a subscriber: phuedx.Jun 19 2015, 4:44 PM

A new(?) development: while canonical URLs were apparently never really canonical, AFAICT so far Google was smart enough to offer them in the nice https://en.wikipedia.org/wiki/Page format. However, recently I noticed that it is often starting to offer them in https://en.wikipedia.org/?title=Page format.

This needs much higher priority, IMHO, as I think some of these problems were mostly flying under the radar until the HTTPS switch, but with the HTTPS switch Google has been re-indexing us on the new HTTPS URLs and all of these related problems are suddenly becoming very large in our search results.

As noted in IRC and in the comments of some of the related tickets under this tracking ticket: for a yet-unknown reason, we're handing out more URLs of the form ?title= (when /wiki/ would have been more common in the past) for users to follow than we did before, and this change seems to be relatively recent. Relatedly, with the HTTPS reindexing going on, a bunch of our results are flipping from /wiki/ links to ?title= links. In the general case, this causes a few distinct problems: it hurts our cache efficiency, it hurts purging on edits (as we're not purging all URL variants...), and it's going to lower our search rankings as we'll have several different "canonical" URLs in Google as hits for the same content (the ?title= and /wiki/ variants each have a distinct rel=canonical for their own URL flavor).

Critically, our mobile redirect code (which is what gets used when a mobile browser google-searches us and clicks a canonical desktop link, to redirect them to the mobile version of the site) did not historically handle the /?title= cases, and therefore as these URLs began to infect Google results more quickly with the HTTPS reindexing, mobile redirects from this could have fallen sharply. A Varnish-level workaround has recently been merged (on Friday the 19th: T103158 - https://gerrit.wikimedia.org/r/#/c/219471) to ensure these redirect to mobile as well, but it came many days too late to avoid impacting initial HTTPS mobile pageviews.
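
To illustrate the gap described above, here is a rough Python sketch of a desktop-to-mobile redirect that recognizes both URL flavors. It is not the Varnish workaround from T103158; the function names are hypothetical, and the host rewrite (inserting an "m" label after the language label) is assumed from the en.m.wikipedia.org URLs seen elsewhere in this task.

```python
from typing import Optional
from urllib.parse import SplitResult, parse_qs, urlsplit, urlunsplit


def is_page_request(parts: SplitResult) -> bool:
    """Recognize /wiki/Title as well as the /?title= and index.php?title= forms."""
    if parts.path.startswith("/wiki/"):
        return True
    # Matching only /wiki/ would miss the ?title= flavor entirely.
    return parts.path in ("/", "/w/index.php") and "title" in parse_qs(parts.query)


def to_mobile_host(host: str) -> str:
    """en.wikipedia.org -> en.m.wikipedia.org (assumed naming scheme)."""
    labels = host.split(".")
    if len(labels) < 2 or labels[1] == "m":
        return host  # unknown shape, or already a mobile host
    return ".".join([labels[0], "m"] + labels[1:])


def mobile_redirect(url: str) -> Optional[str]:
    """Return the mobile URL to redirect to, or None if no redirect applies."""
    parts = urlsplit(url)
    if not is_page_request(parts):
        return None
    return urlunsplit(
        (parts.scheme, to_mobile_host(parts.netloc), parts.path, parts.query, parts.fragment)
    )
```

The sketch leaves out how a request is identified as coming from a mobile browser; it only shows that the path and query pass through unchanged once both URL forms are matched.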

Change 219446 had a related patch set uploaded (by Krinkle):
MediaWiki.php: Redirect non-standard title urls to canonical
https://gerrit.wikimedia.org/r/219446

Thank you for working on this!
Redirecting certain non-canonical URLs seems like a good start, but is only a partial solution? I'm concerned that after https://gerrit.wikimedia.org/r/219446 is merged and deployed, we'll still be outputting bullshit canonical tags for certain types of requests. For example:

$ curl -s "https://en.wikipedia.org/?title=San_Francisco&action=history" | grep canonical
<link rel="canonical" href="https://en.wikipedia.org/?title=San_Francisco&amp;action=history" />
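
The same check could be scripted instead of eyeballing grep output. The following is only a sketch using Python's standard library: `CanonicalFinder`, `find_canonical`, and `has_query_string` are names invented here, and the sample input is the tag from the curl output above.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urlsplit


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<link ... />) also end up here by default.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def has_query_string(url: str) -> bool:
    """Flag canonical URLs that still carry URL parameters."""
    return urlsplit(url).query != ""


sample = '<link rel="canonical" href="https://en.wikipedia.org/?title=San_Francisco&amp;action=history" />'
canonical = find_canonical(sample)
print(canonical, has_query_string(canonical))
```

Note that the parser decodes the &amp; in the attribute value, so the extracted URL contains a plain & between parameters.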

There are at least two dozen isolated issues surrounding canonical link tags, many of which are tracked under T93550. For some, redirects are appropriate, for others not. Either way, it's a distributed problem that'll need patching in many different places. One step at a time.

I have two questions here.

  1. Isn't it better to not provide a canonical tag than it is to provide a wrong canonical tag? It seems like the output logic here is overly aggressive.
  2. I don't understand why outputting the appropriate canonical URL, at least for the most common case of article views, is so difficult. We already have $wgArticlePath... it seems pretty strange that /?title=Foo would generate the wrong canonical title. I suppose I should find and read the relevant code.
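
To make the first question concrete, here is a hedged sketch of a more conservative output rule: emit a canonical tag only for a plain article view and omit it whenever any other parameter is present. This is not how MediaWiki decides; `canonical_tag`, `SERVER`, and the `ARTICLE_PATH` value "/wiki/$1" are assumptions made for illustration ($wgArticlePath is the only name taken from the discussion).

```python
from html import escape
from typing import Mapping, Optional
from urllib.parse import quote

# Assumed values, standing in for the wiki's server and $wgArticlePath settings.
SERVER = "https://en.wikipedia.org"
ARTICLE_PATH = "/wiki/$1"


def canonical_tag(params: Mapping[str, str]) -> Optional[str]:
    """Return a canonical <link> tag for plain views, or None to omit the tag."""
    title = params.get("title")
    # Any parameter other than title (action, oldid, from, ...) means we are
    # not sure what the canonical page is, so output nothing at all.
    if not title or set(params) - {"title"}:
        return None
    path = ARTICLE_PATH.replace("$1", quote(title.replace(" ", "_"), safe=":/"))
    return f'<link rel="canonical" href="{escape(SERVER + path, quote=True)}" />'
```

Under this rule a request for title=San_Francisco gets a /wiki/San_Francisco canonical, while title=San_Francisco&action=history gets no tag rather than a wrong one.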

Note that action=history has a robots=noindex,nofollow so those aren't indexed anyway (found no hits in Google).

Yes, this was a contrived example. Still, this seems like all the more reason not to output a wrong canonical tag. We could just omit the tag.

@Stu, it seems like the fix @ori provided in https://phabricator.wikimedia.org/T67402 handles most of the issues in the Description in this ticket. @Stu, would you please confirm?

we're treating old versions of pages as canonical. For example, http://en.wikipedia.org/w/index.php?title=San_Francisco&oldid=652862076 sends that URL as its canonical when it should send http://en.wikipedia.org/wiki/San_Francisco. This is a specific case of the above issues.

sorry? how would the crawler know about the old versions then?

Stu added a comment.Jun 23 2015, 10:10 AM

sorry? how would the crawler know about the old versions then?

@Gryllida -- Because crawlers crawl. Old versions are linked to from the "View history" tab so Google and others will find them there and then go crawl them. Did I understand your question?

Stu added a comment.EditedJun 23 2015, 10:12 AM

@Stu, it seems like the fix @ori provided in https://phabricator.wikimedia.org/T67402 handles most of the issues in the Description in this ticket. @Stu, would you please confirm?

@dr0ptp4kt -- Based on a quick scan, I think so. Let's get that fix deployed and confirm.

@Stu, I believe it's actually deployed. Able to confirm things look correct from your side? They did look okay from a cursory check on my side.

Stu added a comment.Jun 28 2015, 12:19 AM

Just checked my various examples and it's looking great. Nice work!

dr0ptp4kt added a comment.EditedJun 29 2015, 4:57 PM

@Stu, thanks for validating.

The macro impact of the change on pageviews will probably be a little hard to ascertain given the confluence of (1) HTTPS rollout and the (2) implicit reindexing by intermediaries such as Google followed shortly by (3) this correction by @ori (Performance) and the surge and trickle of re-indexing that's inevitable again, plus (4) seasonal and (5) regional effects. I'm thinking stuff will become a little more clear in the coming weeks.

As @BBlack (Ops) noted in https://phabricator.wikimedia.org/T67402#1385978 , we needed this done to relieve the Varnish caches and user perceived speed impact for "incorrectly" (per spec and our interpretation) canonicalized URLs followed from search indexes and the like. So I think we did the right thing here.

@Stu, would you like @Wwes and me to reach out to our contact at Google to get his take on stuff? I think going forward, as @phuedx has noted, we'll probably want to coordinate sweeping changes of this nature with tech contacts at search providers (Google and Bing are the two biggest) if for no other reason than to give them a heads up in case they see strange stuff in their infrastructure.

Stu added a comment.EditedJul 20 2015, 9:27 PM

@dr0ptp4kt, definitely go ahead and reach out to Google. Side note: Best to avoid delaying anything we think is the right answer until we hear back from Google. They can be.... slow at responding. :)

Related question for you. I did some more testing and was looking through Google Webmaster Tools and noticed some odd behavior around redirects and canonicals for m. articles. For example, let's take the enwiki redirect for Cantilever_brake at https://en.wikipedia.org/w/index.php?title=Cantilever_brake&redirect=no. On the regular site, going to https://en.wikipedia.org/wiki/Cantilever_brake redirects to https://en.wikipedia.org/wiki/Bicycle_brake#Rim_brakes. Makes sense.

But on the mobile site there's a weird outcome. https://en.m.wikipedia.org/wiki/Cantilever_brake redirects to https://en.m.wikipedia.org/wiki/Cantilever_brake#Rim_brakes. Note that instead of Bicycle_brake#Rim_brakes it's redirecting to Cantilever_brake#Rim_brakes, which is nonsensical since it combines the title of the redirect page with the section heading from the actual article. I noticed because Google Webmaster Tools was kicking off a bunch of errors.
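
A small sketch of how that mismatch could be detected automatically, given the Location a site redirects to and the title the redirect is supposed to land on. The helper names are hypothetical, the /wiki/ prefix is assumed from the URLs above, and the sketch says nothing about why the mobile site produces the wrong title.

```python
from urllib.parse import unquote, urlsplit


def landed_title(location: str) -> str:
    """Extract the page title from a /wiki/Title#Section URL."""
    path = urlsplit(location).path
    # Everything after /wiki/ is the title; the #fragment is not part of path.
    return unquote(path.partition("/wiki/")[2])


def redirect_is_wrong(location: str, target_title: str) -> bool:
    """True if the redirect did not land on the redirect's target page."""
    return landed_title(location) != target_title


desktop = "https://en.wikipedia.org/wiki/Bicycle_brake#Rim_brakes"
mobile = "https://en.m.wikipedia.org/wiki/Cantilever_brake#Rim_brakes"
print(redirect_is_wrong(desktop, "Bicycle_brake"))
print(redirect_is_wrong(mobile, "Bicycle_brake"))
```

Run against the two URLs from this comment, the desktop redirect passes and the mobile one is flagged.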

@Stu, I'll check with Google on canonicalization disposition at their end.

I created a new ticket for the mdot redirection thing you noted - T106354. Thanks for noting!