I spent some time looking through our reports on Google Webmaster Tools. They highlighted a few issues with our handling of canonical URLs:
- we're serving canonical URL tags for non-canonical URLs. For example, https://en.wikipedia.org/w/index.php?title=San_Francisco sends that URL as its canonical when IMHO it should send http://en.wikipedia.org/wiki/San_Francisco. This also came up in T67402: URLs for the same title without extra query parameters should have the same canonical link. IMHO https://en.wikipedia.org/w/index.php?title=San_Francisco should 301 to http://en.wikipedia.org/wiki/San_Francisco, which would take care of the canonical issue.
- we're serving canonical URL tags with URL parameters. For example, https://en.wikipedia.org/wiki/Category:Living_people?from=Fe. I don't know what the from parameter specifies, but canonical URLs generally shouldn't include URL parameters, mostly because URL parameters shouldn't be used to identify a unique canonical page. We could address this issue by telling Google, for each of our sites (e.g. all 286 language versions of Wikipedia), to ignore certain URL parameters (e.g. the page for en.wiki). That's a hack, though; we should just fix the app to strip URL parameters from canonicals.
- we're serving canonical URL tags with encoding issues. For example, http://en.wikipedia.org/wiki/R%C3%B3bert_Tomaschek and http://en.wikipedia.org/wiki/R%F3bert_Tomaschek are both served as canonical URLs.
- we're treating old versions of pages as canonical. For example, http://en.wikipedia.org/w/index.php?title=San_Francisco&oldid=652862076 sends that URL as its canonical when it should send http://en.wikipedia.org/wiki/San_Francisco. This is a specific case of the above issues.
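
To make the expected behaviour concrete, here is a rough Python sketch of the normalization I have in mind. It is not how MediaWiki builds canonical links today; the function names, the Latin-1 fallback for badly encoded titles, and the default http scheme (chosen only to match the examples above) are all assumptions for illustration.

```python
from urllib.parse import quote, unquote_to_bytes, urlsplit


def _decode_title(raw: str) -> str:
    """Decode a percent-encoded title, preferring UTF-8.

    Falling back to Latin-1 is an assumption made for this sketch; it
    covers legacy links such as R%F3bert_Tomaschek.
    """
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


def canonical_url(url: str, scheme: str = "http") -> str:
    """Return a hypothetical canonical /wiki/Title URL for an article URL.

    All query parameters (from, oldid, ...) are dropped, and the title is
    re-encoded as UTF-8. URLs that carry no title are returned unchanged.
    """
    parts = urlsplit(url)
    raw_title = ""
    if parts.path.startswith("/wiki/"):
        raw_title = parts.path[len("/wiki/"):]
    elif parts.path == "/w/index.php":
        # Read the raw (still percent-encoded) title parameter.
        for pair in parts.query.split("&"):
            key, _, value = pair.partition("=")
            if key == "title":
                raw_title = value
                break
    if not raw_title:
        return url
    title = _decode_title(raw_title).replace(" ", "_")
    return f"{scheme}://{parts.netloc}/wiki/{quote(title, safe=':/_')}"
```

With this sketch, the index.php, ?from=Fe, and oldid examples above all collapse to a plain /wiki/ URL, and both R%C3%B3bert_Tomaschek and R%F3bert_Tomaschek map to the UTF-8 form.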
In smaller-language Wikipedias, I expect these issues are having a significant impact on search visibility, making search engines much more likely to prefer an English-language article over a local-language one. Marking this as an i18n issue because of the disproportionate impact on non-English Wikipedias. These are also pretty easy fixes, so they're worth a few hours of cleanup.