
generateSitemap.php generates more links in each segment file than the url_limit of 50000
Closed, Duplicate (Public)

Description

Hi,

Recently, Google Search Console has been reporting the error "Too many URLs in sitemap" for us. We found that every single sitemap file contains more than 50000 links (e.g. https://zh.moegirl.org/sitemap/sitemap-zhmoegirl-NS_0-0.xml.gz contains 51975 links, NS_0-1.xml.gz has 59375 links, another has 59040, and so on).

generateSitemap.php is supposed to start a new sitemap file every 50000 links (hardcoded in https://github.com/wikimedia/mediawiki/blob/REL1_31/maintenance/generateSitemap.php line 179: $this->url_limit = 50000;). However, in production it somehow generates sitemap files that contain more links than the 50000 limit.

This error may or may not be related to language variants. A language variant issue was fixed in 2016 by one of our colleagues, @Nbdd0121 from MoegirlPedia, in this patch: https://gerrit.wikimedia.org/r/290143. It seems that all of the patch's code for counting language variant links has since been removed? I am not sure why there are extra links in each sitemap file. No PHP errors were generated. The total number of links in each sitemap segment changes every time generateSitemap.php is re-run (e.g. NS_0-0.xml.gz contained 65615 links the second time and 50835 the third time).
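For anyone who wants to check the per-file counts independently, here is a minimal Python sketch. It assumes that a "link" means one `<url>` element in the standard sitemaps.org namespace and that the segment file has already been downloaded locally. The function names `count_urls` and `over_limit` are made up for illustration and are not part of MediaWiki.

```python
import gzip
import xml.etree.ElementTree as ET

# Standard sitemap namespace (assumed to be what the generated files declare).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Same value as the hardcoded $this->url_limit in generateSitemap.php.
URL_LIMIT = 50000


def count_urls(path):
    """Count <url> entries in a gzip-compressed sitemap segment."""
    url_tag = "{%s}url" % SITEMAP_NS
    count = 0
    with gzip.open(path, "rb") as fh:
        # Stream the file so large segments do not have to fit in memory.
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == url_tag:
                count += 1
                elem.clear()  # release the children of the finished entry
    return count


def over_limit(path, limit=URL_LIMIT):
    """Return True if the segment holds more <url> entries than the limit."""
    return count_urls(path) > limit
```

Running `count_urls` on the same segment after each re-run of generateSitemap.php would show whether the count really changes between runs.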

MediaWiki version 1.31.1