
Sitemaps API
Closed, Resolved. Public. 5 Estimated Story Points.

Description

Sitemap generation is operationally complicated because sitemaps are treated like dumps. But generating them is fast enough that they could be treated like an API instead.

Consider an endpoint like /w/rest.php/site/v1/sitemap/<indexId>/<fileId>. If you allow N URLs per sitemap file, and M sitemap files per index, then this endpoint would stream out the URLs in the page_id range from N(jM + i) to N(jM + i + 1) - 1, where j is the indexId and i is the fileId. The number of URLs produced would depend on how many pages have been deleted in the range, but would be at most N.
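The range arithmetic above can be sketched as follows. This is a minimal illustration, not the patch's code: it assumes j is the indexId and i is the fileId, and the function and parameter names are hypothetical.

```python
def sitemap_page_id_range(
    index_id: int, file_id: int, urls_per_file: int, files_per_index: int
) -> tuple[int, int]:
    """Return the inclusive page_id range covered by one sitemap file.

    index_id is j, file_id is i, urls_per_file is N and files_per_index is M
    in the formula N(jM + i) .. N(jM + i + 1) - 1.
    """
    if index_id < 0:
        raise ValueError("index_id must be non-negative")
    if not 0 <= file_id < files_per_index:
        raise ValueError("file_id must be in [0, files_per_index)")
    # Global position of this file across all indexes.
    file_number = index_id * files_per_index + file_id
    start = urls_per_file * file_number
    end = urls_per_file * (file_number + 1) - 1
    return start, end
```

For example, with N = 10000 and M = 2000 (illustrative values only), index 1, file 3 would cover page_id 20030000 to 20039999.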

In WMF production, on Commons, the query

SELECT page_namespace, page_title, page_touched FROM page WHERE page_id BETWEEN 4000000 AND 4030000

takes only 40-50ms. We might not even need an object cache; we could just rely on the CDN to merge requests and cache responses.
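To illustrate how a range query like this could be turned into a sitemap file, here is a rough sketch using an in-memory SQLite table in place of the real page table. The helper names, the namespace prefix map, and the base URL are assumptions for illustration; real URL generation in MediaWiki handles namespaces, variants and encoding in its own way.

```python
import sqlite3
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from urllib.parse import quote

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Hypothetical, heavily simplified namespace map (assumption for this sketch).
NAMESPACE_PREFIXES = {0: "", 6: "File:"}


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny stand-in for the page table (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,"
        " page_title TEXT, page_touched TEXT)"
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?, ?)",
        [
            (1, 0, "Main_Page", "20250604115918"),
            (2, 6, "Example.jpg", "20230726122722"),
            (3, 0, "Page3", "20230726122722"),
        ],
    )
    return conn


def render_sitemap(conn, start_id: int, end_id: int, base_url: str) -> str:
    """Run the range query and serialize the rows as a sitemap <urlset>."""
    rows = conn.execute(
        "SELECT page_namespace, page_title, page_touched FROM page"
        " WHERE page_id BETWEEN ? AND ? ORDER BY page_id",
        (start_id, end_id),
    )
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for namespace, title, touched in rows:
        prefix = NAMESPACE_PREFIXES.get(namespace)
        if prefix is None:
            continue  # namespace not covered by this sketch
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url + quote(prefix + title, safe="/:")
        # page_touched is stored as a 14-digit YYYYMMDDHHMMSS timestamp in UTC.
        ts = datetime.strptime(touched, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
        ET.SubElement(url, "lastmod").text = ts.strftime("%Y-%m-%dT%H:%M:%SZ")
    return ET.tostring(urlset, encoding="unicode")
```

Calling `render_sitemap(make_demo_db(), 1, 2, "http://localhost:8080/wiki/")` would emit two `<url>` entries and skip the row outside the range. Deleted pages simply produce no row, which is why a file may contain fewer than N URLs.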

The index file, say /w/rest.php/site/v1/sitemap/<indexId>, only needs to know the maximum page_id to figure out how many sitemap files it needs to link to.
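As a sketch of that calculation, assuming sitemap files are numbered from 0 and laid out as described above, the list of files an index must link to could be derived from the maximum page_id like this. The function names and the base URL parameter are hypothetical.

```python
def sitemap_files_for_index(
    index_id: int, max_page_id: int, urls_per_file: int, files_per_index: int
) -> list[int]:
    """Return the fileIds that a given index needs to link to."""
    # The file containing max_page_id is the last one that can have any URLs.
    last_file_number = max_page_id // urls_per_file
    first = index_id * files_per_index
    last = min(first + files_per_index - 1, last_file_number)
    # Convert global file numbers back to per-index fileIds.
    return [n - first for n in range(first, last + 1)]


def sitemap_index_links(
    base_url: str,
    index_id: int,
    max_page_id: int,
    urls_per_file: int,
    files_per_index: int,
) -> list[str]:
    """Build the sitemap file URLs listed by one index."""
    return [
        f"{base_url}/w/rest.php/site/v1/sitemap/{index_id}/{file_id}"
        for file_id in sitemap_files_for_index(
            index_id, max_page_id, urls_per_file, files_per_index
        )
    ]
```

An index whose first file already starts beyond the maximum page_id gets an empty list, so no query against the page rows themselves is needed to build the index.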

It's not clear right now whether T54647 would benefit from having sitemaps for Commons. But if we did need sitemaps for Commons, this is how I think we should make them.

This would be a useful facility to have in core for the benefit of third-party users. They could alias /sitemap.xml to /w/rest.php/site/v1/sitemap/0 to enable search engine discovery of the pages on their wiki.

Event Timeline

Change #1163495 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Sitemaps API

https://gerrit.wikimedia.org/r/1163495

Quick question (which very much could be from my own misunderstanding) — I tested the patch and found that language variants were being included:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost:8080/wiki/Main_Page</loc>
    <lastmod>2025-06-04T11:59:18Z</lastmod>
  </url>
  <url>
    <loc>http://localhost:8080/w/index.php?title=Main_Page&amp;variant=en-x-piglatin</loc>
    <lastmod>2025-06-04T11:59:18Z</lastmod>
  </url>
  <url>
    <loc>http://localhost:8080/wiki/Page2</loc>
    <lastmod>2023-07-26T12:27:22Z</lastmod>
  </url>
  <url>
    <loc>http://localhost:8080/w/index.php?title=Page2&amp;variant=en-x-piglatin</loc>
    <lastmod>2023-07-26T12:27:22Z</lastmod>
  </url>
...

this is desired, correct?

KSiebert changed the task status from Open to In Progress. Jul 16 2025, 9:50 AM

For the routing, would y'all be open to using rest.php/site/v1/sitemap/{id} instead of rest.php/v1/sitemap/{id}? It seems like it's likely that there will be other endpoints related to site metadata to help crawlers in the future. With that assumption, it makes sense to bundle under a 'site' module.

For more context, we will be transitioning the APIs more broadly into a module structure, for ease of management (i.e. independent versioning, self-contained spec definitions) and to reduce cognitive load for users.

Cross linking T395719 since we are going to fix the way that module definition files are registered in that context.

> Quick question (which very much could be from my own misunderstanding) — I tested the patch and found that language variants were being included:
> this is desired, correct?

Yes.

> For the routing, would y'all be open to using rest.php/site/v1/sitemap/{id} instead of rest.php/v1/sitemap/{id}? It seems like it's likely that there will be other endpoints related to site metadata to help crawlers in the future. With that assumption, it makes sense to bundle under a 'site' module.

Done in PS10.

Change #1163495 merged by jenkins-bot:

[mediawiki/core@master] Sitemaps API

https://gerrit.wikimedia.org/r/1163495

Change #1175851 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] In robots.txt permit access to the sitemap API

https://gerrit.wikimedia.org/r/1175851

Change #1175851 merged by jenkins-bot:

[operations/mediawiki-config@master] In robots.txt permit access to the sitemap API

https://gerrit.wikimedia.org/r/1175851

Mentioned in SAL (#wikimedia-operations) [2025-08-05T10:04:12Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1175851|In robots.txt permit access to the sitemap API (T400023 T396684)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-05T10:06:01Z] <hashar@deploy1003> tstarling, hashar: Backport for [[gerrit:1175851|In robots.txt permit access to the sitemap API (T400023 T396684)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-05T10:12:13Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1175851|In robots.txt permit access to the sitemap API (T400023 T396684)]] (duration: 08m 01s)