Background
Hypothesis WE5.2.13 will introduce user-agent enforcement for the dumps website. This work is being done because 70% of deduplicated dumps requests are automated in nature, and of that 70%, 69% are completely unidentifiable. This is largely the case because the dumps website exists outside of the CDN, meaning the previous iterations of user-agent policy enforcement did not apply.
The scope of the related work will follow the initial pattern for user-agent enforcement. It will be a simple approach: requests with missing or non-compliant user-agents will be blocked. No tiered access will be implemented at this time.
Scope
Because we are starting to limit access and will begin blocking some requests, we would like to regularly measure the following changes over time:
- The number or percent of blocked requests.
- The conversion rate of users choosing to self-identify proactively and/or after getting blocked.
- Rates of usage for dumps v1 vs MWCH/v2: are we seeing users migrate, and are we seeing a higher rate of compliance for the new version?
For the conversion-rate metric, we can measure total unique user agents accessing the dumps website and/or track previously blocked IP addresses that are now providing user-agent information. We are also open to other approaches or methodologies for inferring the rate of users choosing to update their integrations to match our policies.
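As a rough illustration of the first two metrics, the sketch below computes a blocked-request percentage and an IP-based conversion rate from already-parsed request records. It is only a sketch under stated assumptions: the `Request` record, the function names, and the use of HTTP 403 to mark a blocked request are all hypothetical, since the enforcement response has not been specified here.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumption: blocked requests are answered with HTTP 403.
BLOCKED_STATUS = 403
# Assumption: a missing user-agent is logged as an empty string or "-".
MISSING_UA = ("", "-")


@dataclass(frozen=True)
class Request:
    """Hypothetical parsed log record; field names are illustrative."""
    ts: datetime
    ip: str
    user_agent: str
    status: int


def blocked_percent(requests):
    """Percent of all requests that were blocked."""
    if not requests:
        return 0.0
    blocked = sum(1 for r in requests if r.status == BLOCKED_STATUS)
    return 100.0 * blocked / len(requests)


def conversion_rate(requests):
    """Share of previously blocked IPs that later sent an accepted
    request carrying a user-agent."""
    # Record the first time each IP was blocked.
    first_blocked = {}
    for r in sorted(requests, key=lambda r: r.ts):
        if r.status == BLOCKED_STATUS:
            first_blocked.setdefault(r.ip, r.ts)
    if not first_blocked:
        return 0.0

    # An IP counts as converted if, after its first block, it makes a
    # non-blocked request with a non-missing user-agent.
    converted = set()
    for r in requests:
        blocked_at = first_blocked.get(r.ip)
        if (
            blocked_at is not None
            and r.ts > blocked_at
            and r.status != BLOCKED_STATUS
            and r.user_agent not in MISSING_UA
        ):
            converted.add(r.ip)
    return len(converted) / len(first_blocked)


sample = [
    Request(datetime(2025, 1, 1), "192.0.2.1", "-", 403),
    Request(datetime(2025, 1, 2), "192.0.2.1", "ExampleBot/1.0 (bot@example.org)", 200),
    Request(datetime(2025, 1, 1), "192.0.2.2", "-", 403),
    Request(datetime(2025, 1, 1), "192.0.2.3", "OtherTool/2.0", 200),
]
```

IP addresses are an imperfect proxy for callers (shared or rotating addresses will skew the rate), so this would complement, not replace, the unique user-agent count.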
If possible, it would also be nice to have similar breakdowns to the research conducted earlier in the fiscal year so that we may further break down the analysis into specific cohorts, such as what is likely coming from community-affiliated bots and what is likely external reuse. The API leads can inform those groups, but we will not have the same access to data populated at the edge to help with categorization.
Expected outcomes
This data will allow us to infer how many dumps website users are likely valid users and use cases. In other words, because of the high rate of fully anonymous traffic, we assume it is likely that a high proportion of callers are "dumb" scrapers, inactive, and/or commercial in nature. Getting a better sense of user-agents will also allow us to identify callers who are likely commercial in nature so that we can more effectively redirect them to preferred commercial pathways.
Known limitations
Dumps requests are not currently captured in web requests. Previous research was accomplished using Apache logs. While this may still be sufficient, it will introduce additional complexity and may make some of these measurements harder to produce.
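To give a sense of the extra parsing step that working from Apache logs implies, here is a minimal sketch that extracts the client IP, status, and user-agent from each line and counts requests per user-agent. It assumes the logs use Apache's standard "combined" format, which has not been confirmed for the dumps hosts; the function names and the sample line are illustrative only.

```python
import re
from collections import Counter

# Apache "combined" format:
#   %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
# Assumption: the dumps logs follow this layout.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


def count_by_user_agent(lines):
    """Count requests per user-agent, grouping empty or "-" values as missing."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip lines that are not in the assumed format
        ua = record["user_agent"]
        counts["(missing)" if ua in ("", "-") else ua] += 1
    return counts


example_lines = [
    '203.0.113.7 - - [10/Oct/2025:13:55:36 +0000] "GET /example/file.xml.bz2 HTTP/1.1" '
    '200 2326 "-" "ExampleBot/1.0 (https://example.org/bot; bot@example.org)"',
    '203.0.113.8 - - [10/Oct/2025:13:55:40 +0000] "GET /example/file.xml.bz2 HTTP/1.1" '
    '200 2326 "-" "-"',
]
```

One known gap in this sketch: a user-agent containing an escaped double quote would not be captured in full by the simple `[^"]*` pattern, so real logs may need a more tolerant parser.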
Related work
- Dumps research: https://phabricator.wikimedia.org/T383175
- Dumps follow up: https://phabricator.wikimedia.org/T402963
- User-agent enforcement: https://phabricator.wikimedia.org/T400119
Dependencies
Possibly on DPE to make the logs available?
Next steps
Deadlines
Ideally we would like to have an idea of "standard" traffic before introducing the user-agent constraints. We do not yet have a hard delivery date for user-agent enforcement, but it is expected to be complete before the end of Q4. That means that this work should ideally happen early in the quarter, so that we may have a baseline before the blocks are introduced. Similarly, we would like to be able to monitor the changes in caller behavior for self-identification throughout the course of the project.