
Responsible Use of Infrastructure
Closed, Resolved · Public

Description

Responsible Use of Infrastructure

This session will give an overview of the impact of high-volume bot traffic on Wikimedia’s infrastructure and the work that is underway to establish sustainable pathways for content reuse.

Related links:

For everyone who wants to discuss more, we’ll have a workshop right after this session!

  • Session duration (up to 90min): 45 minutes
  • Session type (presentation, workshop, discussion, etc.): Presentation
  • Language of session (English, Arabic, etc.): English
  • Prerequisites (some Python, etc.): None

Session slides: https://commons.wikimedia.org/wiki/File:Responsible_Use_of_Infrastructure.pdf

Etherpad: https://etherpad.wikimedia.org/p/WMHack25__Responsible_Use_of_Infrastructure

  • Session coordinator: Kurmanbek
  • Session notetaker: Carly Bogen

Presenter(s)

Participants

  • *

Notes

  • How many of you use crawlers as part of your projects?
    • Show of hands: roughly 15-25 people
  • A working group was formed to evaluate what can be done about the impact of scrapers on the projects and how to handle it more systemically going forward. There is a Diff blog post with a summary of what we're covering today.
  • Scraping is an external trend. With the rise of AI, there's much higher demand for human-created content, which is used to train LLMs, which then power chat-based AI search tools. Even though WMF has ~40 SREs, the traffic still has a lot of impact. WMF has observed a 50% increase in bandwidth usage since January 2024, largely caused by automated bots. Crawlers target any URL in our infrastructure: not only the Wikimedia projects are scraped, but also Phabricator, GitLab, tools on cloud services, etc.
  • Framing the challenge: our content is free, our infrastructure is not. Our mission relies on people finding us and joining as readers, contributors and donors. For that to be possible, attribution is critical. It is also hard to distinguish one bot user from another; anyone can pretend to be "AppleBot". Blocking actions can impact legitimate users and abusers alike.
  • Our content delivery network today is optimized for users. The two primary data centers are in the United States, plus five regional caching centers. The network receives about 25 billion page views monthly. We can sustain large traffic spikes from humans: requests by humans frequently terminate in the cache, but requests by bots are more likely to be passed on to the main data centers because they visit less popular pages.
  • Example of a typical spike: someone famous dies. When Kobe Bryant died in 2020, there were over 30 million hits (page views and edits). Everyone links to Wikimedia to read about the person, sometimes there's a video on the page, and editors are frequently updating the page. That means heavy read traffic focuses on an article that is constantly being invalidated from the cache.
  • This is typical and we can handle it. But now, with the baseline increased by 50% from bot traffic, it puts us at risk because load on infrastructure is higher. How can we decrease the overall average to get back to a sustainable balance?
  • Bot traffic is expensive: 65% of our most expensive traffic (not in cache) comes from bots. It means constant disruption and high workload for infrastructure teams, and takes resources away from supporting the Wikimedia projects. Described as "like playing whac-a-mole".
  • Attribution is key, because our mission relies on people finding us. Scraping isn't necessarily bad if it's done for a good reason at a reasonable volume, brings people back, and makes people aware that Wikipedia exists.
  • License-conformant reuse is key to enabling people to find us and come back to Wikipedia, whether the access method is scraping, dumps, APIs, or other mechanisms.
  • It was previously difficult to add attribution data to LLM output if it wasn't in the training set, but today live data is possible via RAG (retrieval-augmented generation): the sources are now known to the code rendering the response (see the attribution sketch after these notes).
  • In summary: we need a sustainable "knowledge as a service" model. That means:
  • reinvesting in the MediaWiki API ecosystem, because APIs are sustainable. What can we offer?
  • enabling governance for API and scraping traffic, including tiered levels of access and the ability to enforce limits systemically. This lets us stop playing whac-a-mole. E.g. wikis have authentication, which gives insight into what the projects need and enables the community to provide governance on wikis. It is much harder to do this in the world of bot traffic; we are working on figuring that out now.
  • a cohesive approach to attribution, incentivising companies to help bring users back. We can provide guidance; what can we ask others to do? What policies?
  • direct users to Enterprise's services where appropriate, which are made for large volume commercial users.
  • The carrot and the stick - one doesn't work without the other. Both providing sustainable options and enforcing responsible reuse.
  • Example: Why don't you just provide a Commons dump?
  • It would not solve the problem. Carrots without sticks won't work.
  • There are significant technical and potentially legal challenges.
  • The community can continue to scrape (trusted bots program)
  • There are likely better ways to provide sustainable access to Commons
  • Volunteers are not the problem - scraping is not usually something our community does (with some exceptions). Solutions that work for our community might not work for general large volume users.
  • Action plan: Wiki Experiences 5 objective: if we know our users (5.1) we can build an API offering that meets their needs (5.2) while also providing guidance on how to reuse content (5.3) and preventing abuse (5.4).
  • Timeline: roughly, we started with getting data insights about scraping, dumps and API usage. Then we created an action plan of what to do right now and what we're working towards, covering the technical approach and how we support the community (this is what is described in the WE5 objective in the FY25-26 annual plan). We then shared a public blog post, which was picked up by 200+ media outlets. Then we updated the technical robot policy, which hadn't been updated since 2009, to clarify and apply limits. Now we are doing outreach with developers about their data access/API needs, and exploring approaches to developer authentication for API traffic and trusted scraping.
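To make the RAG point above concrete, here is a minimal Python sketch of retrieval-augmented generation that carries source attribution through to the rendered answer. Everything in it (the Passage type, the toy retriever, the placeholder model call) is hypothetical illustration, not Wikimedia's or any vendor's implementation; the point is only that the retrieval step knows which source URLs fed the answer, so attribution can be shown alongside it.

```python
# Minimal sketch of retrieval-augmented generation with attribution.
# All names (Passage, retrieve, answer_with_attribution) are hypothetical
# placeholders; the key idea is that source URLs travel with the answer.
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    source_url: str  # e.g. the article the passage came from

def retrieve(query: str, index: list[Passage], k: int = 3) -> list[Passage]:
    """Toy retriever: rank passages by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = sorted(index, key=lambda p: -len(terms & set(p.text.lower().split())))
    return scored[:k]

def answer_with_attribution(query: str, index: list[Passage]) -> dict:
    passages = retrieve(query, index)
    context = "\n".join(p.text for p in passages)
    # In a real system an LLM would generate from `context`; here we just echo it.
    answer = f"(model output conditioned on: {context[:80]}...)"
    return {
        "answer": answer,
        "sources": [p.source_url for p in passages],  # attribution travels with the answer
    }
```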

Plan elements:

  • Reinvest in the MediaWiki API ecosystem
  • Improve bot detection & filtering at the edge (edge uniques help with this)
  • Increase % of traffic that can be associated with a known account, like the trusted bots program and other mechanisms to enable our community
  • Apply limits as per policies (see the rate-limit sketch after this list)
  • Integrate attribution guidelines with applications where possible
  • There are many open questions: how can we bucket traffic, and what methods can we use to identify it? What can be done to enable the right levels for the proposed authorized groups? There will be a workshop next.
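As referenced in the plan element on applying limits, below is a rough Python sketch of how tiered limits per client class could be enforced with a token bucket keyed by client identity. The tier names and numbers are invented for illustration and are not actual Wikimedia policy.

```python
# Illustrative sketch of tiered rate limiting by client class (token bucket).
# Tiers and limits below are hypothetical, not actual Wikimedia values.
import time

TIER_LIMITS = {              # allowed requests per second per tier (hypothetical)
    "anonymous": 1.0,
    "authenticated": 10.0,
    "trusted_bot": 100.0,
}

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.capacity = rate, burst
        self.tokens, self.updated = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def check_request(client_id: str, tier: str) -> bool:
    """Return True if this request is within the client's tier limit."""
    bucket = buckets.setdefault(
        client_id, TokenBucket(TIER_LIMITS[tier], burst=2 * TIER_LIMITS[tier])
    )
    return bucket.allow()
```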

Workshop Notes

  • Exploring whether the proposed association groups are feasible (See slide)
  • These are questions to explore, not saying everything must be authentication. Want there to be nuance so we can minimize the effect on users.
  • See https://phabricator.wikimedia.org/T393165 for the questions for each discussion group to consider
  • Community concerns group report back:
    • Want to make sure our target audience - mission-aligned, educational users who need encyclopedic information - can get it and are not affected by the new policies.
  • Need to do user studies to understand what's causing the 50% increase
  • How to push people into using the
  • Developer impact group report back:
  • We were supposed to explore developer preferences and the impact of enforcing authentication, but we talked about what it means to support different auth types and the complexity that introduces from a developer and infrastructure perspective. We talked about a few options to technically resolve it, and more discussion is needed on tradeoffs, but there are some obvious points of simplification we can explore - e.g. do we really need to continue supporting OAuth 1, what should we do with the complexity around bot passwords and cookies, whether each auth type needs its own specific thresholds or not, and how we should tell the story to developers as they're onboarding about which options to use and why.
  • A lot harder for new developers to know how to do the right thing if we support too many options.
  • Traffic attribution group report back:
  • People who have tools on Toolforge already have a high level of trust. We don't need to be restrictive, but even if we give them a high limit we still need to identify them in order to count their requests (see the User-Agent sketch after these workshop notes). There were some ideas about how to automate this so it doesn't have to be done manually.
  • If multiple limit levels might apply to a request based on different criteria, which one applies depends on the situation.
  • Incident control - when we identify someone as a bad actor, we want to block them completely, but for how long? Probably not forever, maybe someone else uses that IP tomorrow.
  • For appropriate limits for each group, we need more data, but can't get that until we start to identify actors so we know how much they are currently using. For upper limits, should write down what causes ops people to take action. When do they get paged, etc.
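Related to the traffic attribution discussion above, here is a minimal Python sketch of a tool identifying itself with a descriptive User-Agent (per the Wikimedia User-Agent policy) so its requests can be attributed and counted. The tool name, URL and contact address are placeholders.

```python
# Minimal sketch: a tool identifying itself via a descriptive User-Agent so its
# traffic can be attributed. Tool name, version and contact below are placeholders.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "ExampleCommonsTool/0.1 (https://example.toolforge.org; tool-maintainer@example.org)"
})

resp = session.get(
    "https://commons.wikimedia.org/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["query"]["general"]["sitename"])
```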

Questions

  • After the publication of the blog post, did the conversations with large commercial bot companies change?
    • It gives us a different position because there's something to point to about the impact. That plus the policy are important parts for the conversations and visibility. It changed at least 2 or 3 conversations.
  • The solutions you're looking at seem focused on Foundation projects. A lot of other MediaWiki operators are probably suffering from the same problems. Will you be implementing solutions that are also useful for third-party users of the MediaWiki stack, or entirely specific to the Foundation production stack?
    • It's about what happens with our network, so the specific solution may be specific to Wikimedia, but we should share lessons learned as they are applicable. There are other things in our infrastructure - beta cluster, fundraising stack, translatewiki, etc. What tools can we develop that benefit all? The community can reuse those. I suspect some of it will be Wikimedia-specific because it's about who comes to our site.
  • Some MediaWiki community members have started this documentation: https://www.mediawiki.org/wiki/Handling_web_crawlers - please join in to have a good place to document how to best handle web crawlers when you are using MediaWiki
  • Clarification: important to keep in mind that the way the community manages bot accounts and privileges is by limiting write access. But this is focusing on the impact of read access on our infrastructure. Sometimes these get mixed up, but they're different kinds of problems needing different kinds of solutions.
  • What does this mean for third-party Toolforge tools which may need authentication to allow their users to continue use? I assume a lot of Toolforge tools don't have this now. Is this part of the consideration?
    • It is, yes. Looking at privileged access points and how we can support cloud services. Will talk about it in more detail in the next session.
    • Big takeaway is that there's a tradeoff, and we want to stay aligned with the mission.
  • Is it true that you think it's a bad idea to provide a Commons dump? Need to understand the use case.
    • It is not the answer to the scraping problem. There are better ways to address the use case.
  • At Wikimedia Finland, we downloaded all of Commons once to do some calculations, and we need to do it again, so it will be another 3-month project to download another petabyte; the dump would have been useful.
  • https://wikimania.wikimedia.org/wiki/2024:Program/Scalable_duplicate_photo_detection_for_Wikimedia_Commons
  • Please check out the policy and reach out to bottraffic@wikimedia.org and we can think about how to help with your use case.
  • Text dumps haven't kept bots from scraping, so we don't think image dumps will help for Commons.
  • Need to solve the bot problem but also have open data available.
  • APIs are never a solution to all problems: they can be nice and convenient for some tasks, but they won't implement all the functionality that would be convenient or necessary for every use case, so it's also good to be able to download a dump and have the possibility to use your own resources (as an "escape valve" for expectations).
  • From an SRE perspective, is it possible to cache the APIs more aggressively to mitigate the problem?
    • Again, to the whac-a-mole thing. You don't know if you're caching the right things. We want to prioritize the human users.
    • We can't predict what a bot is going to access next. You either cache everything or nothing (see the toy hit-rate simulation at the end of this Questions section).
    • Some have suggested that we do cache everything. That's likely too expensive, but not sure exactly. How much would it cost to cache everything, all languages?
    • We have about 500 GB in the databases, but that's just the wikitext. Rendering it makes it bigger. Then there are things being changed, purged, etc. It might be a bandaid. If you change one thing in a URL, it's another thing to cache. If you add revisions it's even more.
    • Is API caching cheaper? - Same thing.
    • Complex problem to look at from multiple angles. A lot of it hinges on having a better idea of who is trying to do what so we can give appropriate access.
  • Sounds like we're building technical measures to address a social problem. Does the Foundation have a push on the social element, e.g. an official statement that this kind of scraping is bad, that we don't do business with organizations that do it, and pushing for regulation?
    • We updated the robot policy. Now we can point to that and ask companies to behave accordingly. That opens the door for clarity on our expectation. The blog post did some of that too. Companies who do this don't look good after the blog post. Want to incentivize people to do the right thing.
  • Most large internet websites force users to accept terms and conditions before engaging with their service, but Wikipedia does not. Is the bot policy binding? Is it legally enforceable?
    • Yes, though probably depends on your jurisdiction.
  • If we're controlling the flow of bots in the APIs and bots/companies don't care, can they bypass it? What stops it?
    • Most companies are doing that already - scraping rather than using the APIs. There's separate work to deal with that problem, e.g. the trusted bots program. Enforcement around the APIs as well - we don't know what will happen, and we want to better understand who is using the APIs and control that too so we're ready if usage increases.
    • Was the graph with the 50% increase from APIs or bot scraping?
      • Web scraping.

  • Activity: 3 breakout groups
    • Community concerns
    • Developer impacts
    • Traffic attribution - questions: what attribution should be required?
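To illustrate the caching discussion above, here is a toy Python simulation comparing cache hit rates for human-like traffic (concentrated on popular pages) and crawler-like traffic (sweeping the long tail) against the same fixed-size LRU cache. All numbers are arbitrary illustrations, not measured Wikimedia figures, and the skew model is a crude stand-in for real popularity distributions.

```python
# Toy simulation: cache hit rate for human-like (popularity-skewed) vs
# crawler-like (uniform) access patterns. Numbers are arbitrary illustrations.
import random
from collections import OrderedDict

ARTICLES = 1_000_000   # distinct pages
CACHE_SIZE = 50_000    # pages the edge cache can hold
REQUESTS = 200_000

def hit_rate(page_ids) -> float:
    cache, hits = OrderedDict(), 0
    for pid in page_ids:
        if pid in cache:
            hits += 1
            cache.move_to_end(pid)       # mark as recently used
        else:
            cache[pid] = True
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(page_ids)

random.seed(0)
# Humans cluster on popular pages (rough skew via an exponent trick).
human = [int(ARTICLES * random.random() ** 8) for _ in range(REQUESTS)]
# Crawlers sweep the long tail roughly uniformly.
crawler = [random.randrange(ARTICLES) for _ in range(REQUESTS)]

print(f"human-like hit rate:   {hit_rate(human):.0%}")
print(f"crawler-like hit rate: {hit_rate(crawler):.0%}")
```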

Summary from group 3 (rate limits):

  • For WMCS, we still need to identify individual tools, so we have a counter for each. This could be done transparently by sending traffic through a sidecar proxy.
  • If we have limits per user and per tool, should the more permissive or the more restrictive limit win?
  • Need to be clear about when the client should retry (see the sketch after this list)
  • if we block "permanently" because of abuse, how long is "permanently"?
  • To determine appropriate limits for each group, we need to first be able to identify clients in each group, so we can start collecting per-client stats, as a baseline.
  • We should establish an upper limit by observing at what point SREs take action
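A sketch of how the retry and "how long is permanently" questions above could be answered at the response level: over-limit clients get HTTP 429 with a Retry-After header, and abuse blocks expire after a configurable window instead of lasting forever. The durations and status codes below are illustrative only, not actual Wikimedia behavior.

```python
# Sketch of limit responses: tell well-behaved clients when to retry (429 +
# Retry-After) and let abuse blocks lapse on their own. Values are illustrative.
import time

BLOCK_SECONDS = 24 * 3600             # hypothetical "temporary" block duration
blocked_until: dict[str, float] = {}  # client id -> unix time the block expires

def block_client(client_id: str, duration: float = BLOCK_SECONDS) -> None:
    blocked_until[client_id] = time.time() + duration

def respond(client_id: str, over_limit: bool, retry_after: int = 60):
    """Return (status, headers) the edge could send for this request."""
    now = time.time()
    expiry = blocked_until.get(client_id, 0)
    if expiry > now:
        # Blocked for abuse, but the block expires (the IP may be reused tomorrow).
        return 403, {"Retry-After": str(int(expiry - now))}
    if over_limit:
        # Over the rate limit: ask the client to back off rather than banning it.
        return 429, {"Retry-After": str(retry_after)}
    return 200, {}
```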

Useful links

  • Wikimedia Foundation Annual Plan/2024-2025/External Trends: https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2024-2025/External_Trends

Photos

Social