Unescaped commas in XML sitemap cause Google Search Console to report pages not found (404 errors)
Open, Lowest, Public

Description

We observed this bug on our 3rd-party wiki.

Steps to Reproduce:

  1. Create a page whose title contains a comma, e.g. "New York, NY" at https://www.example.com/wiki/New_York,_NY
  2. Generate XML sitemap (php maintenance/generateSitemap.php)
  3. Submit XML sitemap to Google via Google Search Console (https://www.google.com/webmasters/)
  4. Wait (~days) for Google to crawl the sitemap
  5. Google Search Console, under Crawl Errors, will report that it could not find the page "/wiki/New_York" -- note that the ",_NY" portion of the URL is missing. Everything from the first comma onward is truncated.

Actual Results:
Google Search Console reports a missing page "/wiki/New_York".

Expected Results:
Googlebot should correctly find the page "/wiki/New_York,_NY".
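
To see which entries in a generated sitemap are affected, the sitemap can be scanned for <loc> values that contain a raw comma. The following is only a rough sketch: findCommaUrls() is a hypothetical helper (not part of MediaWiki), and the sample XML is made up for illustration.

```php
<?php
// Hypothetical helper, not part of MediaWiki: return every <loc> URL
// in a sitemap XML string that contains an unescaped comma.
function findCommaUrls(string $sitemapXml): array {
    // Capture the text content of each <loc> element.
    preg_match_all('#<loc>\s*(.*?)\s*</loc>#s', $sitemapXml, $matches);

    // Undo XML/HTML entity escaping (e.g. &amp;) before inspecting the URL.
    $urls = array_map('html_entity_decode', $matches[1]);

    // Keep only URLs that contain a literal comma.
    return array_values(array_filter($urls, function (string $url): bool {
        return strpos($url, ',') !== false;
    }));
}

// Made-up sample data in the sitemap format.
$sample = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/wiki/New_York,_NY</loc></url>
  <url><loc>https://www.example.com/wiki/Boston</loc></url>
</urlset>
XML;

print_r(findCommaUrls($sample));
```

In this sample only the "New_York,_NY" URL is reported; the "Boston" URL is not.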

Other people report the same issue on non-MediaWiki sites:
https://webmasters.stackexchange.com/questions/77283/sitemap-xml-generates-404s-for-urls-with-single-quotes-and-commas
https://productforums.google.com/forum/#!topic/webmasters/khrOfwjvP5Q

This is arguably a Googlebot defect. However, so that as many articles as possible get crawled, MediaWiki could be changed to percent-encode commas in sitemap URLs.

By the way, this does not appear to be related to T36666, which introduced HTML entity escaping for sitemap URLs via htmlspecialchars(). Our issue, on the other hand, should be fixed by URL encoding, not by HTML entity escaping.
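
The difference between the two kinds of escaping can be illustrated with plain PHP functions. This is a small sketch using a made-up title fragment; it does not show MediaWiki's own sitemap code.

```php
<?php
// Made-up title fragment containing a comma.
$titlePart = 'New_York,_NY';

// HTML entity escaping only touches characters such as & < > and quotes,
// so the comma passes through unchanged.
$htmlEscaped = htmlspecialchars($titlePart);

// URL (percent) encoding replaces the comma with %2C.
$urlEncoded = rawurlencode($titlePart);

echo $htmlEscaped, "\n"; // New_York,_NY
echo $urlEncoded, "\n";  // New_York%2C_NY
```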

Here is our simple workaround. We added the following code to LocalSettings.php which replaces each comma with %2C:

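// Only register the hook when the GenerateSitemap maintenance class is loaded.
if (class_exists('GenerateSitemap')) {

        $wgHooks['GetCanonicalURL'][] = 'myGetCanonicalURL';

        // Percent-encode commas so the crawler does not cut the URL at the first comma.
        function myGetCanonicalURL($title, &$url, $query) {
                $url = str_replace(',', '%2C', $url);
                return true; // let any other GetCanonicalURL handlers run
        }
}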
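
The handler's effect can be checked outside MediaWiki by calling it directly. This is a sketch only: the function is redefined here so the snippet is self-contained, and null is passed in place of a real Title object because the handler never uses it.

```php
<?php
// Standalone copy of the workaround handler, for illustration only.
function myGetCanonicalURL($title, &$url, $query) {
    // Replace every comma in the URL with its percent-encoded form.
    $url = str_replace(',', '%2C', $url);
    return true;
}

$url = 'https://www.example.com/wiki/New_York,_NY';
myGetCanonicalURL(null, $url, ''); // $url is modified by reference

echo $url, "\n"; // https://www.example.com/wiki/New_York%2C_NY
```

Note that str_replace() rewrites every comma in the URL, including any that appear in a query string.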

Event Timeline

Jun 15 2017, 4:12 PM: richardkmiller renamed this task from "Unescaped characters in XML sitemap cause Google Search Console to report pages not found (404 errors)" to "Unescaped commas in XML sitemap cause Google Search Console to report pages not found (404 errors)".
richardkmiller updated the task description.
richardkmiller updated the task description.
Jun 15 2017, 7:22 PM: Aklapper triaged this task as Lowest priority.