
Google doesn't honor canonical URLs of zh.wiki
Open, Needs Triage, Public

Description

Currently, Google is unexpectedly indexing /zh/ language variant URLs instead of /wiki/ links for Chinese Wikipedia.
A quick example is: https://www.google.com/#q=汉语+wikipedia
As you can see, the first link's URL is https://zh.wikipedia.org/zh/汉语, and many of the other results have /zh/ URLs as well.
If you open it and check its source, it says:

<link rel="canonical" href="https://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD" />

So since the "canonical" version is the /wiki/ link, Google should follow it and index that URL instead. But at the moment it does not, for some reason.
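For anyone who wants to check this without reading the page source by hand, here is a minimal sketch that pulls the canonical URL out of a page's HTML using only the Python standard library. The `CanonicalFinder` class and `find_canonical` function are hypothetical names made up for illustration, and the HTML string is a trimmed stand-in for the real page, not a live fetch.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are also routed here by HTMLParser.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared in html_text, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical


# Trimmed stand-in for the source of https://zh.wikipedia.org/zh/汉语
page_source = (
    '<html><head>'
    '<link rel="canonical" href="https://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD" />'
    '</head><body></body></html>'
)

canonical = find_canonical(page_source)
print(canonical)           # percent-encoded form, as it appears in the source
print(unquote(canonical))  # decoded form, easier to compare with the /zh/ URL
```

Decoding the href makes it easy to see that the page served at /zh/汉语 declares /wiki/汉语 as its canonical URL.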

The weirdest part is that if you search with "site:wikipedia.org": https://www.google.com/#q=汉语+site:wikipedia.org
then suddenly almost all the links on the first page become /wiki/ URLs (the correct behavior).

The problem with /zh/ links is that they ignore the user's language variant settings. I need to manually switch to the /zh-cn/ or /zh-tw/ variant after clicking a link from Google (which is a very common scenario). /wiki/ links, by contrast, automatically jump to the variant that matches the user's preference.

It has been like this for months, if not years. I have no idea whether the problem is on Google's side or Wikipedia's. I asked several times on zh.wiki, but no one takes responsibility or has the ability to fix it.

Event Timeline

fireattack raised the priority of this task from to Needs Triage.
fireattack updated the task description.
fireattack added projects: I18n, SEO.
fireattack added a subscriber: fireattack.
Restricted Application added a subscriber: Aklapper. Aug 8 2015, 6:34 AM
fireattack renamed this task from Google doesn't honer canonical URLs on zh.wiki to Google doesn't honor canonical URLs of zh.wiki. Aug 8 2015, 6:35 AM
fireattack updated the task description.
fireattack set Security to None.
Krenair added a subscriber: Krenair. Aug 9 2015, 3:45 PM

@Aklapper: Although that ticket is closely related to this one (I actually mentioned this bug there years ago), I don't think they are the same problem at all. That's why I deliberately opened a brand new ticket to bring this to developers' attention.

To me, that ticket makes NO SENSE; it should be closed and this one kept instead. Here is why:

  1. In that ticket, the author mentioned that every language variant page (the /zh-tw/ ones, for example) includes a canonical rel pointing to the /wiki/ link. That's true, but it is actually the expected behavior.
  2. The author said "This rel=”canonical” link asks search engines to index the Simplified Chinese page", which is completely wrong, at least today (I don't know whether it was different back then). As I mentioned in this ticket, /wiki/ links are language neutral, not the Simplified Chinese variant. They automatically jump to the user's preferred variant according to Wikipedia settings, or browser settings for guests.
  3. The author there argued that we should let Google index both the Simplified Chinese version and the Traditional Chinese version. I completely disagree. IMO, we should only let Google index the language neutral version, just as we do now.
  4. Luckily, it seems no one agreed with the author, so there has been no progress on that ticket.

In short, that ticket suggested changing our current "canonical" link behavior, and in my opinion the suggestion rests on an unfounded claim (see #2) and should not be followed.

This ticket, however, is about a bug, in either Wikipedia or Google: the intent expressed by our canonical links is somehow not honored by Google.

About T33838, I believe either the user had something set wrong, or it was a different bug that existed at that time and was fixed later, because now if you visit a /zh-tw/ link it will definitely show Traditional Chinese, not Simplified Chinese.

Also I want to bring @liangent to this discussion :)

Restricted Application added a subscriber: StudiesWorld. Jan 7 2016, 6:33 AM
Nemo_bis added a subscriber: Stu. Jan 25 2016, 8:29 AM

This problem is not solved in MW 1.28.0. As far as I know, this is a bug related to the sitemap that has not been fixed for at least three years. The sitemap lists all language variants with the same priority, which causes Google to index them randomly.

Example: https://zh.moegirl.org/sitemap/sitemap-zhmoegirl-NS_0-0.xml.gz

Hacking MediaWiki's core code to drop all Chinese variants, so that the sitemap contains only the canonical links, significantly improved the rate of correct URLs in Google results for Chinese Moegirlpedia.
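To illustrate the idea only (this is not the actual MediaWiki patch, which is PHP and not shown in this thread), here is a minimal Python sketch of filtering a URL list down to canonical entries before writing a sitemap. It assumes the Wikipedia-style layout described in this task, where variant URLs start with /zh/, /zh-cn/ or /zh-tw/; the `VARIANT_PREFIXES` set and both function names are hypothetical, and a real wiki may use other variant codes or paths.

```python
from urllib.parse import urlsplit

# Assumption: variant pages live under these first path segments.
VARIANT_PREFIXES = {"zh", "zh-cn", "zh-tw"}


def is_variant_url(url):
    """True if the first path segment of url is a known variant prefix."""
    # A path such as "/zh-tw/Foo" splits into ["", "zh-tw", "Foo"].
    segments = urlsplit(url).path.split("/")
    return len(segments) > 2 and segments[1].lower() in VARIANT_PREFIXES


def canonical_only(urls):
    """Keep only the URLs that are not language variant URLs."""
    return [url for url in urls if not is_variant_url(url)]


candidate_urls = [
    "https://zh.wikipedia.org/wiki/%E6%B1%89%E8%AF%AD",
    "https://zh.wikipedia.org/zh/%E6%B1%89%E8%AF%AD",
    "https://zh.wikipedia.org/zh-cn/%E6%B1%89%E8%AF%AD",
    "https://zh.wikipedia.org/zh-tw/%E6%B1%89%E8%AF%AD",
]

sitemap_urls = canonical_only(candidate_urls)
print(sitemap_urls)
```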

I reported this bug very early, but the sitemap bug has not been fixed to this day.

Restricted Application added a subscriber: Cosine02. Nov 30 2016, 3:12 PM
Xfiner added a subscriber: Xfiner. May 14 2019, 1:28 AM

According to Google Support, you must use "alternate" XHTML tags for language variants and sitemap.

Multi Language related sitemap rules: https://support.google.com/webmasters/answer/189077?hl=en

Sitemap common rules: https://support.google.com/webmasters/answer/139066?hl=en#sitemap-method
"Don't include non-canonical pages in a sitemap. If using a sitemap, specify only canonical URLs in the sitemap."

My test results are:

  1. If you include all language variant URLs in the sitemap, they are treated as different pages and all of them appear in the search results.
  2. If you include only the canonical URL in the sitemap, Google mixes the results (some language variants are excluded, some are indexed).
  3. If you use "<xhtml:link rel="alternate" hreflang="">" to tag all variants in the sitemap, Google chooses the correct URL based on the user's browser language and IP.
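For reference, option 3 could look roughly like the sketch below, which builds a sitemap whose entries carry xhtml:link alternates, using Python's standard library. The `build_sitemap` function, the `VARIANTS` mapping and the hreflang values (zh-CN, zh-TW) are my assumptions for illustration, not what MediaWiki generates. As far as I understand the Google documentation linked above, each variant URL is also expected to get its own entry listing all of its alternates, which this sketch leaves out for brevity.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize sitemap elements without a prefix and XHTML ones as "xhtml:".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)

# Assumption: URL path prefix -> hreflang value for each variant.
VARIANTS = {"zh-cn": "zh-CN", "zh-tw": "zh-TW"}


def build_sitemap(base_url, titles):
    """Return sitemap XML with one <url> per title plus variant alternates."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for title in titles:
        encoded = quote(title)
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        loc = ET.SubElement(url, f"{{{SITEMAP_NS}}}loc")
        loc.text = f"{base_url}/wiki/{encoded}"  # the canonical /wiki/ URL
        for path_prefix, hreflang in VARIANTS.items():
            ET.SubElement(
                url,
                f"{{{XHTML_NS}}}link",
                {
                    "rel": "alternate",
                    "hreflang": hreflang,
                    "href": f"{base_url}/{path_prefix}/{encoded}",
                },
            )
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap("https://zh.wikipedia.org", ["汉语"])
print(sitemap_xml)
```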