We observed this bug on our 3rd-party wiki.
Steps to Reproduce:
1. Create a page named "New York, NY", e.g. https://www.example.com/wiki/New_York,_NY
2. Generate XML sitemap (`php maintenance/generateSitemap.php`)
3. Submit XML sitemap to Google via Google Search Console (https://www.google.com/webmasters/)
4. Wait (~days) for Google to crawl the sitemap
5. Google Search Console, under Crawl Errors, will report that it could not find the page "/wiki/New_York" -- note it's missing the ",_NY" portion of the URL. The sitemap is correct, but Google is choking on the comma.
Actual Results:
Google Search Console reports a missing page "/wiki/New_York".
Expected Results:
Googlebot should correctly find the page /wiki/New_York,_NY.
Other people have reported the same issue on non-MediaWiki sites:
https://webmasters.stackexchange.com/questions/77283/sitemap-xml-generates-404s-for-urls-with-single-quotes-and-commas
https://productforums.google.com/forum/#!topic/webmasters/khrOfwjvP5Q
This is arguably a Googlebot defect. However, to maximize the number of articles that get crawled, MediaWiki could be altered to percent-encode the comma in sitemap URLs.
By the way, this does not appear to be related to T36666, which introduced HTML entity escaping for sitemap URLs via `htmlspecialchars()`. Our issue, on the other hand, should be fixed by URL encoding, not by HTML entity escaping.
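To illustrate the difference, here is a small standalone sketch (plain PHP, not MediaWiki code) using the sample URL from the steps above. `htmlspecialchars()` only rewrites HTML-special characters, so the comma passes through untouched, whereas percent-encoding turns it into `%2C`:

```php
<?php
// Standalone illustration; the URL is the sample page from this report.
$url = 'https://www.example.com/wiki/New_York,_NY';

// HTML entity escaping: the comma is not an HTML-special character.
$htmlEscaped = htmlspecialchars( $url );

// Percent-encoding of the comma only.
$percentEncoded = str_replace( ',', '%2C', $url );

echo $htmlEscaped, "\n";    // https://www.example.com/wiki/New_York,_NY
echo $percentEncoded, "\n"; // https://www.example.com/wiki/New_York%2C_NY
```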
Here is our simple workaround. We added the following code to `LocalSettings.php`, which replaces each comma in the canonical URL with `%2C` when the sitemap script is loaded:
```
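// GenerateSitemap is defined by maintenance/generateSitemap.php,
// so the hook is only registered for sitemap runs.
if (class_exists('GenerateSitemap')) {
    $wgHooks['GetCanonicalURL'][] = 'myGetCanonicalURL';

    // Percent-encode commas in canonical URLs written to the sitemap.
    function myGetCanonicalURL($title, &$url, $query) {
        $url = str_replace(',', '%2C', $url);
        return true;
    }
}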
```
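To check whether a generated sitemap still contains unencoded commas, something like the following rough sketch may help. The helper name `findLocsWithRawComma` and the second sample URL are made up for illustration, and the regex assumes simple `<loc>` elements without attributes or CDATA:

```php
<?php
// Hypothetical helper: returns the <loc> URLs in a sitemap string
// that still contain a raw (unencoded) comma.
function findLocsWithRawComma( string $sitemapXml ): array {
    preg_match_all( '#<loc>(.*?)</loc>#s', $sitemapXml, $matches );
    return array_values( array_filter(
        $matches[1],
        static function ( string $loc ): bool {
            return strpos( $loc, ',' ) !== false;
        }
    ) );
}

// Made-up sample: one URL with a raw comma, one already encoded.
$sample = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/wiki/New_York,_NY</loc></url>
  <url><loc>https://www.example.com/wiki/Boston%2C_MA</loc></url>
</urlset>
XML;

print_r( findLocsWithRawComma( $sample ) );
```

With the workaround active, the list should come back empty for a freshly generated sitemap.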