I spent some time looking through our reports on Google Webmaster Tools. They highlighted a few issues with our handling of canonical URLs:
- we're serving canonical URL tags for non-canonical URLs. For example, `https://en.wikipedia.org/w/index.php?title=San_Francisco` sends that URL as its canonical when IMHO it should send `http://en.wikipedia.org/wiki/San_Francisco`. I also saw this one in https://phabricator.wikimedia.org/T67402. IMHO `https://en.wikipedia.org/w/index.php?title=San_Francisco` should 301 to `http://en.wikipedia.org/wiki/San_Francisco`, which would take care of the canonical issue.
- we're serving canonical URL tags with URL parameters. For example, `https://en.wikipedia.org/wiki/Category:Living_people?from=Fe`. I don't know what the parameter `from` specifies, but canonical URLs generally shouldn't include URL parameters, mostly because URL parameters shouldn't be used to identify a unique canonical page. We could address this issue by telling Google, for each of our sites (e.g. all 286 language versions of Wikipedia), to ignore certain URL parameters (e.g. the [[ https://www.google.com/webmasters/tools/crawl-url-parameters?hl=en&siteUrl=http://en.wikipedia.org/&prop=go | page for en.wiki ]]). That's a hack, though. We should just fix the app to strip URL parameters from canonicals.
- we're serving canonical URL tags with encoding issues. For example, `http://en.wikipedia.org/wiki/R%C3%B3bert_Tomaschek` (UTF-8 percent-encoding of "ó") and `http://en.wikipedia.org/wiki/R%F3bert_Tomaschek` (Latin-1 percent-encoding of the same character) are both served as canonical URLs for the same page.
- we're treating old versions of pages as canonical. For example, `http://en.wikipedia.org/w/index.php?title=San_Francisco&oldid=652862076` sends that URL as its canonical when it should send `http://en.wikipedia.org/wiki/San_Francisco`. This is a specific case of the above issues.
In smaller-language Wikipedias I expect these issues are having a significant impact on search visibility, making search engines much more likely to prefer an English-language article over a local-language article. Marking this as an i18n issue because of the disproportionate impact on non-English Wikipedias. These also look like pretty easy fixes, so they're worth a few hours of cleanup.
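
To make the expected behavior concrete, here's a rough sketch in Python of the normalization I have in mind for the cases above. This is not MediaWiki code. The names `canonical_url` and `_decode_title` are hypothetical, the Latin-1 fallback for titles that aren't valid UTF-8 is my assumption, and the sketch keeps whatever scheme the incoming URL used rather than deciding between `http` and `https`.

```python
from urllib.parse import quote, unquote_to_bytes, urlsplit, urlunsplit

WIKI_PREFIX = "/wiki/"


def _decode_title(raw: str) -> str:
    """Percent-decode a title, trying UTF-8 first and Latin-1 second (assumed fallback)."""
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


def canonical_url(url: str) -> str:
    """Return the /wiki/<Title> form of a page URL, without query parameters."""
    parts = urlsplit(url)

    if parts.path.startswith(WIKI_PREFIX):
        # Already a /wiki/ URL: the title is the rest of the path.
        raw_title = parts.path[len(WIKI_PREFIX):]
    else:
        # index.php-style URL: take the title from the raw query string so
        # that non-UTF-8 escapes such as %F3 are not mangled before decoding.
        raw_title = ""
        for pair in parts.query.split("&"):
            key, _, value = pair.partition("=")
            if key == "title":
                raw_title = value

    if not raw_title:
        # No title found; leave the URL alone rather than guess.
        return url

    title = _decode_title(raw_title).replace(" ", "_")
    # Re-encode as UTF-8, keeping ":" and "/" readable (namespaces, subpages).
    path = WIKI_PREFIX + quote(title, safe=":/")
    # Drop the query string (oldid, from, ...) and any fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

With this sketch, `canonical_url("https://en.wikipedia.org/wiki/Category:Living_people?from=Fe")` drops the `from` parameter, and both the `R%C3%B3bert_Tomaschek` and `R%F3bert_Tomaschek` URLs map to the same UTF-8 encoded form.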