
[5.2.5 Milestone] Introduce API Gateway access controls on sitemap endpoints
Closed, ResolvedPublic

Description

Background

Sitemaps are intended to supplement crawling to ensure that all pages and resources are known and discoverable by search engines. However, sitemaps are increasingly being used to scrape websites for their content at scale, with countless articles around the internet outlining how to use sitemaps to scrape sites more effectively for AI training and other purposes. To avoid unintentionally encouraging more scraping, we should be more intentional about how we grant access to the sitemap endpoints. The goal of introducing access controls is to incentivize users to self-disclose their activity, so that we can better understand our users and enable them to engage more meaningfully with WMF and our mission.

Problem statement

The introduction of the sitemap endpoints creates a new pathway for users to access Wikimedia project content, at a time when we are trying to better understand our users and direct them to recommended, sustainable solutions. Because the sitemap endpoints themselves carry a somewhat high risk of abuse (for example, lazy scrapers may not retain their own last-updated dates and may always crawl and scrape the full map, resulting in expensive page rendering and cache pollution), we want to know who is using them and how sitemap adoption relates to other API and crawling activity. Additionally, in cases where abusive behavior is detected, we want to ensure that we have a means of contacting and potentially redirecting the developer to a more appropriate and sustainable solution.

Scope

Allow-list implementation
To limit the risk of opening the door to all scrapers while still maintaining simple access for trusted bots and technical partners, we recommend allowing specific traffic through without requiring additional authentication. Specifically, in the case of Google, there are limitations to how a sitemap can be provided to their Search Console: no additional information (such as headers) can be included beyond the URL of the sitemap location.

NOTE: If we move the sitemap to an authenticated or allowlist format, should it still be called out in robots.txt? Does it need to move, conditionally surface, or otherwise include additional instructions for how to use it (like through a comment)?

Conditions of acceptance:

Control access to the sitemap endpoints within the API Gateway.

  • Grant access to known bots.
  • If a token is not provided and the request does not originate from a “trusted bot”, return a helpful error message that directs users towards how to sign up for the trusted bot program.
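The acceptance criteria above can be sketched as a small decision function. This is a hypothetical illustration, not the real gateway configuration: the function name and the exact header values are assumptions, though `x-trusted-request` is the categorization header SRE provides and "B" is the verified-crawler category referenced by the patches on this task; the signup text is a placeholder.

```python
def sitemap_access(headers: dict) -> tuple[int, str]:
    """Return (status, body) for a sitemap request at the gateway (sketch)."""
    if headers.get("authorization"):
        # A token was provided; defer to normal token validation (not shown).
        return 200, "ok"
    if headers.get("x-trusted-request") == "B":
        # The edge categorized this caller as a verified crawler.
        return 200, "ok"
    # No token and not a trusted bot: helpful error pointing at signup docs.
    return 403, ("Sitemap access is restricted. See the trusted bot program "
                 "signup instructions to request access.")
```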

Implementation details

Traffic will be categorized at the edge, with a header that designates caller profile.
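As a rough illustration of what edge-side categorization means, the sketch below maps request attributes to a header value. The real logic lives in the CDN and is maintained by SRE; the tier names other than "B" (verified crawler, per the patches on this task) are assumptions made up for this example.

```python
def categorize_caller(verified_crawler: bool, has_auth_token: bool) -> str:
    """Return an x-trusted-request category for a request (hypothetical tiers)."""
    if verified_crawler:
        # e.g. a search engine bot verified at the edge (real category "B")
        return "B"
    if has_auth_token:
        # authenticated API consumer; tier name "A" is an assumption
        return "A"
    # anonymous, uncategorized traffic; value "none" is an assumption
    return "none"
```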

Blockers & Dependencies

SRE expects the new traffic categorization header to be completed in November. This task is blocked until that work is complete.
Once this work is completed, the spec documentation will need to be updated. Work closely with MWI on timelines so that we can ensure that endpoint expectations and instructions are clear and complete.

Event Timeline

HCoplin-WMF renamed this task from Introduce API Gateway access controls on sitemap endpoints to [5.2.5 Milestone] Introduce API Gateway access controls on sitemap endpoints.

I have a few questions. Apologies if these are already answered elsewhere, or if there is private information I should look at. Let me know.

  • Do we know of scrapers that skip sites without a sitemap, rather than defaulting to recursive scraping?

A scraper generally has two choices: they can crawl recursively from a starting point (e.g. the main page), or they can crawl a list and use it to detect updates (a sitemap). Crawling recursively means they tend to hit lots of uncached URLs that are not valuable to them but expensive for us (old revisions, history queries, special pages with infinite query-parameter permutations, no-indexed namespaces such as User_talk). It also means they won't know when pages have changed, and will repeat all of this periodically. Crawling from a sitemap means they hit only cheap, cacheable, canonical article URLs, and they know when those last changed without loading our infrastructure. I think we should encourage the latter.
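The "know when pages have changed" part of sitemap crawling can be sketched as follows: parse the standard sitemap XML and keep only the URLs whose `<lastmod>` is newer than the previous crawl. This is a minimal illustration of the protocol, not any particular scraper's implementation; the function name is made up.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def changed_urls(sitemap_xml: str, last_crawl: datetime) -> list[str]:
    """Return only the URLs whose <lastmod> is newer than our previous crawl."""
    root = ET.fromstring(sitemap_xml)
    out = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Entries without a lastmod must be re-fetched; dated ones can be skipped.
        if lastmod is None or datetime.fromisoformat(lastmod) > last_crawl:
            out.append(loc)
    return out
```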

I can see how making the list easily available sounds like it would encourage scraping, but recursive scraping is (afaik) the original and most established mode of scraping.

  • Can we detect sitemap scraping and/or is there a specific cost related to sitemaps we want to throttle?

A sitemap is not, by itself, expensive or scraping; it is merely a list of the pages that exist on the wiki. This is functionally equivalent to API:Allpages, Special:AllPages, and the all-titles.xml dumps, all of which allow one to openly paginate through a list of all pages. The difference is that the sitemap implements a standard protocol, as opposed to requiring a MediaWiki-specific API client. Having the list available in a standard way like this is, I think, ultimately beneficial to us, because it means independent search engines and a long tail of other well-behaved scrapers can use it instead of expensive recursion through uncached URLs, and can use the last-updated information to avoid most repeat visits.

To scrape a site via a sitemap means first loading the sitemap, and then slowly visiting the URLs on that list (no recursion). The actual scraping would not show up in our logs as "sitemap"-related, but as regular requests to URLs like https://en.wikipedia.org/wiki/Banana, for which we already have rate limiting and cache-miss thresholds in place (with more to come this year based on heuristics, edge uniques, JWT/authentication rate limits, etc.). As such, the tightening of our main CDN rate limits, and the JWT/auth work to allow opting into higher limits, should already apply here if and when someone uses the sitemap for scraping.
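For readers unfamiliar with how per-client request limiting of this kind typically works, a token bucket is the standard shape: each client accrues tokens at a fixed rate up to a burst cap, and each request spends one. This is a generic sketch, assuming nothing about the actual CDN implementation.

```python
import time

class TokenBucket:
    """Generic token-bucket limiter sketch; not the actual CDN rate limiter."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens replenished per second
        self.burst = burst          # maximum stored tokens (burst capacity)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```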

  • Who is the audience for consuming sitemaps through MW authentication? Is there established client software for crawling non-public sitemaps?

We already have MediaWiki APIs ("action=query&list=allpages") that allow consumers who target us specifically to paginate through such a list. We also provide this monthly as a WMF dump (all-titles.xml on dumps.wikimedia.org, a small file with a page list for each wiki). A sitemap provides the same information as those APIs, in an interoperable format.
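The allpages pagination works through the MediaWiki API's standard continuation mechanism: repeat the query, merging the returned `continue` parameters into the next request until none are returned. A sketch of that loop, with the HTTP call injected as `fetch` so the logic is shown without a network dependency (`fetch` and `all_pages` are illustrative names; in practice `fetch` would GET a wiki's api.php with these parameters):

```python
def all_pages(fetch, limit=500):
    """Yield every page title via action=query&list=allpages continuation."""
    params = {"action": "query", "list": "allpages",
              "aplimit": limit, "format": "json"}
    while True:
        data = fetch(params)
        for page in data["query"]["allpages"]:
            yield page["title"]
        cont = data.get("continue")
        if not cont:
            break  # no continuation block means we reached the end
        params.update(cont)  # e.g. carries "apcontinue" for the next batch
```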

Change #1201740 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] robots.php: Clean up unused site, lang, and x-subdomain

https://gerrit.wikimedia.org/r/1201740

Just noting that we met with SRE, and the "x-trusted-request" categorization header is officially available! Docs here: https://wikitech.wikimedia.org/wiki/CDN/Backend_api#x-trusted-request

Header availability should unblock this work. @JTweed-WMF -- let me know if y'all need anything else, and when you have an expected timeline.

Change #1214156 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Move error message from footer to body for HTTP 4xx responses

https://gerrit.wikimedia.org/r/1214156

Change #1201740 merged by jenkins-bot:

[operations/mediawiki-config@master] robots.php: Clean up unused site, lang, and x-subdomain

https://gerrit.wikimedia.org/r/1201740

Mentioned in SAL (#wikimedia-operations) [2025-12-03T02:59:36Z] <krinkle@deploy2002> Started scap sync-world: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]]

Mentioned in SAL (#wikimedia-operations) [2025-12-03T03:02:25Z] <krinkle@deploy2002> krinkle: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]] synced to the testservers (see https://wiki

Mentioned in SAL (#wikimedia-operations) [2025-12-03T03:08:02Z] <krinkle@deploy2002> Finished scap sync-world: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]] (duration: 08m 26s)

Hey -- just following up on this. Do y'all have an expected delivery date? Can I assume this can be done in early-mid Jan?

Can I assume this can be done in early-mid Jan?

Yes, in one or two weeks by mid-Jan. Note that I was OOO last week, working Wed-Fri this week, and OOO next week.

One patch is up for review, which lays the groundwork for the error page. I'll try to get this reviewed and deployed this week.

[operations/puppet@production] varnish: Move error message from footer to body for HTTP 4xx responses
https://gerrit.wikimedia.org/r/1214156

I expect to finish a second patch by next week, that builds on this one. Those two should resolve this task.

Thanks for the update, @Krinkle! Totally understand about folks being out around the holidays. No worries at all. I appreciate the clarification on the timeline, and it's great news!

Hello again! I know you've been out again, so I just wanted to check in since I didn't see the patches come up. Are you still on track to get this wrapped up by the end of the month? I would like to send out a hypothesis update this week with new confirmed delivery dates.

@Krinkle I also just want to quickly confirm that you were able to provide a reasonable error message with instructions for how to engage with us when someone calls the endpoints without access. The message should include the same instructions for getting in contact with SRE as the bot limiter. I remember we had some back and forth about what that might look like within MediaWiki, and I don't totally remember where we left off, since it was a few months ago.

I'm awaiting code review for the above patch. This is now in progress (thx Brett). I've uploaded the next patch meanwhile, for testing in the Beta Cluster.

Change #1233188 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] varnish: Restrict unauth sitemap access to verified crawlers (cat B)

https://gerrit.wikimedia.org/r/1233188

Preview from Beta Cluster:

Screenshot 2026-01-26 at 13.44.47.png (1×1 px, 115 KB)

This uses the same error page format we use for other bot traffic restrictions, and advertises the e-mail address discussed at the Lisbon offsite.

Perfect!! Thank you so much, @Krinkle . That is great on both fronts :D

@Krinkle and @BCornwall , do you have any updates on the status of the code review?

Change #1214156 merged by BCornwall:

[operations/puppet@production] varnish: Move error message from footer to body for HTTP 4xx responses

https://gerrit.wikimedia.org/r/1214156

Change #1233188 merged by BCornwall:

[operations/puppet@production] varnish: Restrict unauth sitemap access to verified crawlers (cat B)

https://gerrit.wikimedia.org/r/1233188

Krinkle triaged this task as Medium priority.