GenerateSitemap.php should not generate all language variants with same priority
Open, Needs Triage, Public

Description

This is a really old bug related to languages with variants (e.g. Chinese) which is still not solved! I reported it at least three years ago through the Bugzilla system with another account.

This bug affects almost every site using MediaWiki with Chinese/Gan/Inuktitut/Kazakh/Kurdish/Serbian/Tachelhit/Tajik/Uzbek. Search engines will index a random language variant link from the sitemap for the same page.

The following is an example segment of https://zh.moegirl.org/sitemap/sitemap-zhmoegirl-NS_0-0.xml.gz . The first entry is the canonical URL, and the other four are variant links.
MediaWiki generates all five links for the same page "Bios" with the same priority, 1.0. A search engine then picks one of these five links, apparently at random. The pick is usually a language variant (4/5 chance), so users who read variant A end up with the article in variant B, which they can't read.

	<url>
		<loc>https://zh.moegirl.org/Bios</loc>
		<lastmod>2016-11-25T04:20:47Z</lastmod>
		<priority>1.0</priority>
	</url>
	<url>
		<loc>https://zh.moegirl.org/zh-hans/Bios</loc>
		<lastmod>2016-11-25T04:20:47Z</lastmod>
		<priority>1.0</priority>
	</url>
	<url>
		<loc>https://zh.moegirl.org/zh-hant/Bios</loc>
		<lastmod>2016-11-25T04:20:47Z</lastmod>
		<priority>1.0</priority>
	</url>
	<url>
		<loc>https://zh.moegirl.org/zh-cn/Bios</loc>
		<lastmod>2016-11-25T04:20:47Z</lastmod>
		<priority>1.0</priority>
	</url>
	<url>
		<loc>https://zh.moegirl.org/zh-tw/Bios</loc>
		<lastmod>2016-11-25T04:20:47Z</lastmod>
		<priority>1.0</priority>
	</url>
  • This causes severe damage to the user experience. For example, Traditional Chinese readers, especially the younger generation, are often unable to read Simplified Chinese. Meanwhile, Simplified Chinese readers can understand only part of Traditional Chinese. It is like searching for a Simple English article titled "My Little Pony: Friendship Is Magic" and having Wikipedia give you an article written in Hebrew.

For a simple and quick fix, we could remove all language variant links from the sitemap and keep only the canonical URL, since MediaWiki can detect users' language settings and serve the proper page. (If the built-in detection does not work, the UniversalLanguageSelector extension can be used.)

For a more proper fix, the priority of language variant links should be reduced relative to the original link. However, at least one other bug, T108443, needs to be fixed for the proper URL to be indexed.
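
The two fixes proposed above can be illustrated as a post-processing step on an already generated sitemap. This is only a rough sketch in Python, not the code of GenerateSitemap.php. The function names, the list of variant codes (taken from the example URLs in this report), and the 0.5 priority value are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard sitemap namespace; registering it keeps the output unprefixed.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", SITEMAP_NS)

# Assumption: variant codes as they appear in the example URLs above.
VARIANT_CODES = {"zh-hans", "zh-hant", "zh-cn", "zh-tw"}


def is_variant_url(loc: str) -> bool:
    """True if the URL path looks like /<variant-code>/<title>."""
    parts = urlparse(loc).path.lstrip("/").split("/", 1)
    return len(parts) == 2 and parts[0].lower() in VARIANT_CODES


def drop_variants(sitemap_xml: str) -> str:
    """Quick fix: keep only the canonical URL entries."""
    root = ET.fromstring(sitemap_xml)
    for url in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        if is_variant_url(loc):
            root.remove(url)
    return ET.tostring(root, encoding="unicode")


def lower_variant_priority(sitemap_xml: str, priority: str = "0.5") -> str:
    """Alternative fix: keep variant entries but give them a lower priority."""
    root = ET.fromstring(sitemap_xml)
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        if is_variant_url(loc):
            prio = url.find(f"{{{SITEMAP_NS}}}priority")
            if prio is None:
                prio = ET.SubElement(url, f"{{{SITEMAP_NS}}}priority")
            prio.text = priority
    return ET.tostring(root, encoding="unicode")
```

Both functions expect a complete sitemap document with a `<urlset>` root in the sitemap namespace, not the bare `<url>` fragment quoted above. Note that a lower `<priority>` is only a hint to crawlers, so the second variant does not guarantee that the canonical URL is the one that gets indexed.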

Event Timeline

This is a realllllllly old bug related to languages with variants (e.g. Chinese) which is still not solved!!!!

If you'd like to see a bug solved, providing a patch will speed up the process. You are very welcome to use developer access to submit a proposed code change as a Git branch directly into Gerrit which makes it easier to review them quickly and provide feedback.

I reported it at-least three years ago though the Bugzilla system with another account.

All Wikimedia Bugzilla tasks got imported into Wikimedia Phabricator so this task might be a duplicate. However, looking at tasks that mention "GenerateSitemap" I'm not sure which one. :(

Hi Aklapper,

We made a temporary fix for the sitemap error in T65098 to work around the 50,000-URL limit bug caused by language variants. But I can't find the old report related to language variants. Sorry.

This bug was submitted here to make the community aware that the language variants problem affects many users. If my colleague or I simply submitted a patch that completely removes the $hasVariant part from generateSitemap.php (line 328, 377-394), it would be rejected outright. People reviewing the patch should be aware that a certain group of people suffers from this bug. A new configuration setting such as $wgSitemapLanguageVariantsLink could be added, too. But solving this problem also requires coordinating with the canonical URL function.

Just to let you know, @Baskice: you could simply use T108443 and T65098, as well as magic words in MW; there is no need to generate such unnecessary hrefs.

I reported it at-least three years ago though the Bugzilla system with another account.

All Wikimedia Bugzilla tasks got imported into Wikimedia Phabricator so this task might be a duplicate. However, looking at tasks that mention "GenerateSitemap" I'm not sure which one. :(

Well, again, he is probably referring to T65098, so are you @Zoglun, Baskice?

I also doubt the accuracy of "This bug affect almost every site using Mediawiki with Chinese/Gan/Inuktitut/Kazakh/Kurdish/Serbian/Tachelhit/Tajik/Uzbek." since it seems to be just copied from MediaWiki-Language-converter. Do all of them really need such a patch? Serbian? Kazakh? Tajik? Uzbek? Has this been reported for all the languages mentioned in this sentence?

Change 609513 had a related patch set uploaded (by VulpesVulpes825; owner: VulpesVulpes825):
[mediawiki/core@master] Write language varaint link as child element rather than individual entry in sitemap

https://gerrit.wikimedia.org/r/609513
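
The patch title suggests listing the variants as child elements of the canonical entry instead of as separate entries. One existing convention for this is the `xhtml:link rel="alternate" hreflang` annotation that search engines document for sitemaps. Whether change 609513 emits exactly this markup is an assumption here, and the helper name `build_url_entry` and its parameters are hypothetical. A minimal Python sketch of such an entry:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_url_entry(base, title, variants, lastmod):
    """Build one <url> entry with variant links as child elements (hypothetical helper)."""
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    # The canonical URL is the only <loc> in the entry.
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = f"{base}/{title}"
    ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    # Each variant becomes an alternate link instead of its own <url> entry.
    for code in variants:
        ET.SubElement(
            url,
            f"{{{XHTML_NS}}}link",
            {"rel": "alternate", "hreflang": code, "href": f"{base}/{code}/{title}"},
        )
    return url


entry = build_url_entry(
    "https://zh.moegirl.org",
    "Bios",
    ["zh-hans", "zh-hant", "zh-cn", "zh-tw"],
    "2016-11-25T04:20:47Z",
)
entry_xml = ET.tostring(entry, encoding="unicode")
```

With this layout the "Bios" example from the description would count as one URL rather than five, which would also ease the 50,000-URL limit mentioned in T65098.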