
Deploy sitemaps API for Commons
Open, In Progress, Needs Triage (Public)

Description

Deploy a sitemap for Commons to test the theory that it will improve search engine discovery of file description pages and categories.

  • Develop sitemaps API (T396684)
  • Enable sitemaps API in WMF config
  • Measure performance, tune size parameter
  • Submit Commons sitemap to Google
  • Monitor impact
  • Add Commons sitemap to robots.txt
  • Submit Sitemap to Bing (and thus DuckDuckGo)
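For context, the sitemaps API serves a sitemap index in the sitemaps.org protocol format; a minimal illustration of the shape (the URL pattern matches the Commons endpoints discussed below, but the actual output will differ):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://commons.wikimedia.org/w/rest.php/site/v1/sitemap/0/page/0</loc>
  </sitemap>
</sitemapindex>
```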

Event Timeline

There are a very large number of changes, so older changes are hidden.
MusikAnimal changed the task status from Open to In Progress. Jul 21 2025, 6:08 PM

Change #1173575 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Enable sitemaps API

https://gerrit.wikimedia.org/r/1173575

Change #1173575 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable sitemaps API

https://gerrit.wikimedia.org/r/1173575

Mentioned in SAL (#wikimedia-operations) [2025-08-01T05:14:39Z] <tstarling@deploy1003> Started scap sync-world: Backport for [[gerrit:1173575|Enable sitemaps API (T400023)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-01T05:16:41Z] <tstarling@deploy1003> tstarling: Backport for [[gerrit:1173575|Enable sitemaps API (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-01T06:04:39Z] <tstarling@deploy1003> Finished scap sync-world: Backport for [[gerrit:1173575|Enable sitemaps API (T400023)]] (duration: 49m 59s)

Change #1174970 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] In sitemap responses set CC: public

https://gerrit.wikimedia.org/r/1174970

Commons sitemap files typically take 300-400ms to generate; a WAN cache hit takes about 70-100ms.

Headers indicated there was no CDN caching, and there was a Cache-Control: no-cache response header, likely set by this ancient CommonSettings.php hack:

# Godforsaken hack to work around problems with the reverse proxy caching changes...
#
# To minimize damage on fatal PHP errors, output a default no-cache header
# It will be overridden in cases where we actually specify caching behavior.
#
# More modern PHP versions will send a 500 result code on fatal error,
# at least sometimes, but what we're running will send a 200.
if ( PHP_SAPI !== 'cli' ) {
	header( "Cache-control: no-cache" );
}

This hack was introduced in January 2008.
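The fix in change #1174970 works by having the sitemap response explicitly overwrite that blanket default. A minimal model of the interaction (Python sketch, not MediaWiki code; the TTL value is illustrative):

```python
# Model of the CommonSettings.php hack: a no-cache default is set early
# for every request, and any cacheable endpoint must override it.
headers = {"Cache-Control": "no-cache"}  # blanket default, as in the hack

def send_sitemap_response(headers):
    # The CC: public fix, in effect: declare the response publicly
    # cacheable so the CDN can serve hits without touching the backend.
    headers["Cache-Control"] = "public, max-age=3600"  # TTL illustrative
    return headers

print(send_sitemap_response(headers)["Cache-Control"])
# prints: public, max-age=3600
```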

Change #1175121 had a related patch set uploaded (by Krinkle; author: Tim Starling):

[mediawiki/core@wmf/1.45.0-wmf.12] In sitemap responses set CC: public

https://gerrit.wikimedia.org/r/1175121

Change #1174970 merged by jenkins-bot:

[mediawiki/core@master] In sitemap responses set CC: public

https://gerrit.wikimedia.org/r/1174970

Change #1175121 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.12] In sitemap responses set CC: public

https://gerrit.wikimedia.org/r/1175121

Mentioned in SAL (#wikimedia-operations) [2025-08-04T04:48:23Z] <tstarling@deploy1003> Started scap sync-world: Backport for [[gerrit:1175121|In sitemap responses set CC: public (T400023)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-04T05:09:13Z] <tstarling@deploy1003> krinkle, tstarling: Backport for [[gerrit:1175121|In sitemap responses set CC: public (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-04T05:25:26Z] <tstarling@deploy1003> Finished scap sync-world: Backport for [[gerrit:1175121|In sitemap responses set CC: public (T400023)]] (duration: 37m 03s)

After the CC: public fix, I am now seeing CDN cache hits.

I need to add myself as an "owner" in Google Search Console in order to be allowed to submit sitemaps.

Change #1175631 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Authorize self for Google Search Console

https://gerrit.wikimedia.org/r/1175631

Change #1175842 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/dns@master] Authorize self for Google Search Console

https://gerrit.wikimedia.org/r/1175842

Change #1175842 abandoned by Tim Starling:

[operations/dns@master] Authorize self for Google Search Console

Reason:

doesn't work because commons is a CNAME

https://gerrit.wikimedia.org/r/1175842

Change #1175631 merged by jenkins-bot:

[operations/mediawiki-config@master] Authorize self for Google Search Console

https://gerrit.wikimedia.org/r/1175631

Mentioned in SAL (#wikimedia-operations) [2025-08-05T09:02:16Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1175631|Authorize self for Google Search Console (T400023)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-05T09:07:38Z] <hashar@deploy1003> tstarling, hashar: Backport for [[gerrit:1175631|Authorize self for Google Search Console (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-05T09:20:07Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1175631|Authorize self for Google Search Console (T400023)]] (duration: 17m 50s)

Change #1175851 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] In robots.txt permit access to the sitemap API

https://gerrit.wikimedia.org/r/1175851

Change #1175851 merged by jenkins-bot:

[operations/mediawiki-config@master] In robots.txt permit access to the sitemap API

https://gerrit.wikimedia.org/r/1175851

Mentioned in SAL (#wikimedia-operations) [2025-08-05T10:04:12Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1175851|In robots.txt permit access to the sitemap API (T400023 T396684)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-05T10:06:01Z] <hashar@deploy1003> tstarling, hashar: Backport for [[gerrit:1175851|In robots.txt permit access to the sitemap API (T400023 T396684)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-05T10:12:13Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1175851|In robots.txt permit access to the sitemap API (T400023 T396684)]] (duration: 08m 01s)

I submitted the sitemap index to Google. The request was initially denied by robots.txt. That's fixed now. The documentation indicates that Google will automatically request it again at some point.

I'm running a script to prime the cache by requesting the sitemap files.
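A cache-priming pass of this kind can be sketched as follows (hypothetical, not the actual script; the URL pattern matches the endpoints mentioned in this task, and the batch count is an assumption):

```python
# Enumerate the sitemap URLs so each can be fetched once, warming the
# CDN/WAN caches before search engines start crawling.
BASE = "https://commons.wikimedia.org/w/rest.php/site/v1/sitemap"

def sitemap_urls(num_batches):
    # /sitemap/0 is the index; /sitemap/0/page/N are the individual files.
    yield f"{BASE}/0"
    for n in range(num_batches):
        yield f"{BASE}/0/page/{n}"

urls = list(sitemap_urls(2))
print(urls)
```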

I resubmitted the sitemap. Google is downloading it at a rate of 1 req/s.

Google has finished downloading the sitemap files, and it reports 160,704,052 discovered pages.

> Google has finished downloading the sitemap files, and it reports 160,704,052 discovered pages.

Wow, that is, for lack of a better word, awesome!

The indexed page count has gone up from 58M to 118M so far, still rising. There was a crawl request spike, comparable in peak request rate to the previous spikes. Also the click rate according to Google continues to break new records: from 217k on August 4 up to 277k on August 11. I'll post graphs to the parent task once things have settled down.

A couple of things we might want to consider:

  • Is there any reason not to cache the sitemap API URLs at the edge for some time? Even a few hours would be beneficial, I think, if we ever publicize this more.
  • Is there a reason to include User_talk: pages and similar in the sitemap? These pages are rarely cached and relatively expensive to render in some cases; do we care about them being indexed in search engines?

Sorry, for reasons I don't understand, my first request for that page got an x-cache-status: pass; I see now that it's cacheable at the edge.

The sitemap excludes namespaces that are wholly no-indexed via $wgNamespaceRobotPolicies. For enwiki that means User_talk should be excluded.

Checking empirically, https://quarry.wmcloud.org/query/96426 says that the oldest User_talk pages are:

  • https://en.wikipedia.org/wiki/User_talk:AnonymousCoward (page_id: 2311)
  • https://en.wikipedia.org/wiki/User_talk:Armillary/Old (page_id: 2930)

https://en.wikipedia.org/w/rest.php/site/v1/sitemap/0/page/0 covers the corresponding page_id range. There are no User_talk pages in that sitemap file, which suggests they are indeed excluded.
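The mechanism mentioned above, $wgNamespaceRobotPolicies, maps namespaces to robot policies, and the sitemap skips wholly-noindexed namespaces. An illustrative config fragment (the policy string is an example):

```php
// Illustrative: mark the User talk namespace (NS_USER_TALK) as noindexed
// wiki-wide, which also excludes it from the generated sitemap.
$wgNamespaceRobotPolicies[NS_USER_TALK] = 'noindex,nofollow';
```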

> Is there a reason to include User_talk: pages and similar in the sitemap? These pages are rarely cached and relatively expensive to render in some cases; do we care about them being indexed in search engines?

Many wikis have the user and user talk namespaces excluded from indexing. We could consult on extending that to all wikis. I don't know if anyone really wants user pages to be indexed. A lot of configuration variables are like this: we tend to have bad defaults because we have a good process for changing a configuration variable on a single wiki, but no good process for changing something globally.

Now that Commons user talk pages have been discovered, we would need to noindex them to get Google to stop crawling them. Just removing them from the sitemap is not enough to make Google undiscover them.
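For reference, "noindex" means the signal has to be delivered on the pages themselves, e.g. via a robots meta tag (or an equivalent X-Robots-Tag response header); an illustrative fragment:

```html
<!-- A discovered URL is only dropped from Google's index once the
     crawler re-fetches the page and sees a noindex signal on it. -->
<meta name="robots" content="noindex">
```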

I think main user pages should definitely be indexed. They are a place for contributors to represent themselves. Abusive user pages are removed very quickly. For subpages below User:Example/, indexing is often not useful, as most of these pages are just personal notes or temporary maintenance pages.

Talk pages are more difficult. They might contain important information in their archive pages. But these pages are often totally unreviewed and only clear harassment or similar is removed from them. I would lean towards not indexing talk pages.

> I think main user pages should definitely be indexed. They are a place for contributors to represent themselves.

Within the project(s') scope that makes sense, and Wikimedians know how to find a person's user page. In public, that makes way less sense to me, as user pages are not a replacement for your "internet profile".

>> I think main user pages should definitely be indexed. They are a place for contributors to represent themselves.

> Within the project(s') scope that makes sense, and Wikimedians know how to find a person's user page. In public, that makes way less sense to me, as user pages are not a replacement for your "internet profile".

For Wikipedia that might be the case, but for photographers on Commons a findable user page can help to get invitations to events, or to show a portfolio for accreditation. In the end this results in more and better photos for the project. We have strict rules that pages advertising photographic work are only allowed for actual contributors, and not as a social media page for everyone.

Will adding the Commons sitemap to robots.txt mean that other search engines will also pick it up? DuckDuckGo seems not to have many File pages indexed, for example (compare a Commons site image search for "dog" on DuckDuckGo with the same search on Google).

> Will adding the Commons sitemap to robots.txt mean that other search engines will also pick it up? DuckDuckGo seems not to have many File pages indexed, for example (compare a Commons site image search for "dog" on DuckDuckGo with the same search on Google).

According to their docs it should. But as DuckDuckGo largely relies on bing.com for indexing, I'm assuming it will mostly matter that Bing is able to pick it up first :)

@Cparle do you know if we have access to Bing Webmaster Tools, btw? (As Bing indirectly feeds DuckDuckGo, this would let us monitor whether Bing picks up the sitemap when it's added to robots.txt.)

As this is Commons-only so far, and our infra robots.txt is shared across all wikis, we could add it via https://commons.wikimedia.org/wiki/MediaWiki:Robots.txt

Sitemap: https://commons.wikimedia.org/w/rest.php/site/v1/sitemap/0

But it would be nice to be able to observe impact for bing.com if we can.

The sitemap has only been submitted directly to Google; it's not in robots.txt and it hasn't been submitted to Bing. SRE are afraid that adding it to robots.txt will cause too much crawler traffic.

I don't personally have access to Bing Webmaster Tools, although from the documentation it sounds like there is an admin account.

TheDJ updated the task description.

Change #1214148 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google

https://gerrit.wikimedia.org/r/1214148

> The sitemap has only been submitted directly to Google; it's not in robots.txt and it hasn't been submitted to Bing. SRE are afraid that adding it to robots.txt will cause too much crawler traffic.

I synced with SRE. There are no infrastructure concerns with submitting sitemaps for more wikis or to more search engines; indeed, we've been doing this for numerous wikis as part of mitigating T380573: UkWiki article not indexed on Google, without issue.

The sitemap has already been called out in robots.txt for a few months now, but not in the proper format. I'm fixing that now, so that I don't have to do this manually for 900 wikis in Google Search Console, Bing Webmaster Tools, Yandex, etc.

There are, however, other concerns with the Sitemap API in terms of its documentation and long-term access control, which are being worked on in follow-up tasks for WE "5.2.5: Sitemap Endpoint Cleanup". We won't formally announce the sitemap (e.g. docs, blog, mailing list, enterprise) until after that work completes.
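For reference, the proper format is a top-level Sitemap: directive, defined by the sitemaps.org protocol and read by Google, Bing, and others; using the index URL quoted earlier in this task, the robots.txt line would look like:

```
Sitemap: https://commons.wikimedia.org/w/rest.php/site/v1/sitemap/0
```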

Change #1214148 merged by jenkins-bot:

[operations/mediawiki-config@master] Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google

https://gerrit.wikimedia.org/r/1214148

Mentioned in SAL (#wikimedia-operations) [2025-12-03T02:59:36Z] <krinkle@deploy2002> Started scap sync-world: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]]

Mentioned in SAL (#wikimedia-operations) [2025-12-03T03:02:25Z] <krinkle@deploy2002> krinkle: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]] synced to the testservers (see https://wiki

Mentioned in SAL (#wikimedia-operations) [2025-12-03T03:08:02Z] <krinkle@deploy2002> Finished scap sync-world: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]] (duration: 08m 26s)

Change #1214201 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] robots.php: Avoid "404 Not Found" for Sitemap rule

https://gerrit.wikimedia.org/r/1214201

Change #1214201 merged by jenkins-bot:

[operations/mediawiki-config@master] robots.php: Avoid "404 Not Found" for Sitemap rule

https://gerrit.wikimedia.org/r/1214201

Mentioned in SAL (#wikimedia-operations) [2025-12-03T03:15:12Z] <krinkle@deploy2002> Started scap sync-world: Backport for [[gerrit:1214201|robots.php: Avoid "404 Not Found" for Sitemap rule (T400023)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-03T03:17:53Z] <krinkle@deploy2002> krinkle: Backport for [[gerrit:1214201|robots.php: Avoid "404 Not Found" for Sitemap rule (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-03T03:26:20Z] <krinkle@deploy2002> Finished scap sync-world: Backport for [[gerrit:1214201|robots.php: Avoid "404 Not Found" for Sitemap rule (T400023)]] (duration: 11m 08s)