Background
Sitemaps are intended to supplement crawling so that search engines can discover all pages and resources. However, sitemaps are now being used to scrape website content at scale, and many articles published around the internet explain how to use sitemaps to scrape sites more effectively for AI training and other purposes. To avoid unintentionally encouraging more scraping, we should be more intentional about how we grant access to the sitemap endpoints. The goal of introducing access controls is to incentivize users to self-disclose their activity, so that we can better understand our users and enable them to engage more meaningfully with WMF and our mission.
Problem statement
The introduction of the sitemap endpoints creates a new pathway for users to access Wikimedia project content, at a time when we are trying to better understand our users and direct them to recommended, sustainable solutions. The sitemap endpoints themselves carry a relatively high risk of abuse: for example, lazy scrapers may not retain their own last-updated dates and may instead crawl and scrape the full map every time, resulting in expensive page rendering and cache pollution. We therefore want to know who is using the endpoints and how sitemap adoption relates to other API and crawling activity. Additionally, when abusive behavior is detected, we want a means of contacting the developer and potentially redirecting them to a more appropriate and sustainable solution.
Scope
Allow-list implementation
To limit the risk of opening the door to all scrapers while still maintaining simple access for trusted bots and technical partners, we recommend allowing specific traffic without requiring additional authentication. In the case of Google specifically, there are limitations on how a sitemap can be provided to their search console: no additional information (such as headers) can be included beyond the URL of the sitemap location, so a token cannot be supplied with the request.
Conditions of acceptance:
- Control access to the sitemap endpoints within the API Gateway.
- Grant access to known bots:
  - WE5.4.2 (owned by SRE) is creating a new header to categorize traffic at the edge; one of the categories assigned is for “trusted bots”.
  - In the case of the “trusted bot” flag (category "B"), allow traffic to proceed without authentication. See https://wikitech.wikimedia.org/wiki/CDN/Backend_api#x-trusted-request
- If a token is not provided and the request does not originate from a “trusted bot”, return helpful error messages that direct users towards how to sign up for the trusted bot program.
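The conditions of acceptance above can be sketched as a small decision function. This is only an illustration of the intended logic, not the API Gateway's actual configuration. Assumptions: the header name `x-trusted-request` is inferred from the anchor in the wikitech link, tokens are assumed to arrive in an `Authorization` header, the 401 status code is a placeholder choice, and `decide_sitemap_access` and `Decision` are hypothetical names.

```python
from typing import Mapping, NamedTuple

# Assumption: header name inferred from the wikitech anchor "#x-trusted-request".
TRUSTED_HEADER = "x-trusted-request"
# Category "B" is the "trusted bot" category named in the conditions above.
TRUSTED_BOT_CATEGORY = "B"


class Decision(NamedTuple):
    status: int   # HTTP status the gateway would respond with or pass through
    message: str  # explanation suitable for logs or an error body


def decide_sitemap_access(headers: Mapping[str, str]) -> Decision:
    """Hypothetical gate for sitemap endpoints based on request headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}

    # Trusted bots proceed without authentication.
    if normalized.get(TRUSTED_HEADER) == TRUSTED_BOT_CATEGORY:
        return Decision(200, "Trusted bot: no authentication required.")

    # Assumption: a token is sent in the Authorization header; validating it
    # is left to the gateway's existing authentication step.
    if normalized.get("authorization"):
        return Decision(200, "Token provided: continue to token validation.")

    # Neither a trusted bot nor a token: return a helpful error.
    # Assumption: 401 is used here; the real status code is not specified.
    return Decision(
        401,
        "Sitemap access requires an access token or trusted bot status. "
        "See the trusted bot program documentation for how to sign up.",
    )
```

For example, `decide_sitemap_access({"X-Trusted-Request": "B"})` is allowed through, while a request with neither that header value nor a token receives the error message pointing to the trusted bot program.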
Implementation details
Traffic will be categorized at the edge, with a header that designates caller profile.
Blockers & Dependencies
SRE expects the new traffic categorization header to be completed in November. The access-control work described here is blocked until that header is available.
Once this work is completed, the spec documentation will need to be updated. Work closely with MWI on timelines so that we can ensure that endpoint expectations and instructions are clear and complete.
