NEW BUG REPORT: Investigate rise in May 2025 Reader metrics
Closed, ResolvedPublic

Description

Data Platform Engineering Bug Report or Data Problem Form

Please fill out the following
Please ensure you set priority

What kind of problem are you reporting?

  • Access related problem
  • Service related problem
  • Data related problem
For a data related problem:

Is this a data quality issue?

  • YES. We are seeing unprecedented increases in our Reader metrics: unique devices and pageviews.

What datasets and/or dashboards are affected?

  • Datasets
    • webrequest, all automata labeling tables, pageview_actor, pageview_daily
    • all 4 unique device tables

What are the observed vs expected results? Please include information such as location of data, any initial assessments, sql statements, screenshots.

  • Most of the traffic increase is coming from Brazil. Pageviews (+440% YoY) and unique devices (300M) in Brazil saw a huge spike in May 2025 (higher than the US). Both have been rising since January this year.
  • image.png (1×2 px, 179 KB)
  • This is likely caused by automated traffic, since the majority of the Brazil traffic has no referer ("none" referer traffic) and the unique devices are arriving with no cookies.

Incident document: https://docs.google.com/document/d/1AKGkS6vvNAa25zckqE_uxse3Bt5YhFwpnytHICfLUBs/edit?tab=t.0

Event Timeline

Mayakp.wiki triaged this task as Medium priority.Jun 3 2025, 5:03 PM
  • We looked at daily uniques to spot any per-day spikes
  • Major spikes start at the end of April and keep increasing all through May 2025

image.png (590×986 px, 56 KB)

  • The spike in traffic seems to be directed at the Main Page (uri_path: ../w/index.php)
  • Traffic is coming from many different IP addresses
  • Unique devices are inflated across all project families in Brazil
Mayakp.wiki renamed this task from Investigate rise in Brazil unique devices to Investigate rise in Brazil unique devices and reader metrics.Jun 4 2025, 12:29 AM

A slight note of interest: I recently saw a very similar pattern for Miraheze in our Cloudflare data, so it is possible this affects more than the WMF. I can pull actual data later.

Mayakp.wiki renamed this task from Investigate rise in Brazil unique devices and reader metrics to Investigate rise in May 2025 unique devices and reader metrics.Jun 6 2025, 10:17 PM
Mayakp.wiki updated the task description.

More reports of inflated unique device and pageview counts for non-Wikipedia projects

Some results from our initial analysis

Estimates of how much this is inflating our pageviews?

  • Content interactions: +10.5% YoY
  • User pageviews: +9% YoY
  • User pageviews in Brazil: +440% YoY, 1.5B this year vs 275M last year
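
As a quick sanity check on the Brazil figure, the year-over-year change can be recomputed from the two rounded totals quoted above. This is only a sketch: the function and variable names are illustrative, and the inputs are the rounded 1.5B and 275M values, so the result (about +445%) only approximately matches the quoted +440%.

```python
def yoy_change_pct(current: float, previous: float) -> float:
    """Return the year-over-year change as a percentage of the previous value."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous * 100.0


# Rounded Brazil user pageview totals quoted in this task.
brazil_this_year = 1.5e9
brazil_last_year = 275e6

brazil_yoy = yoy_change_pct(brazil_this_year, brazil_last_year)
print(f"Brazil user pageviews: {brazil_yoy:+.0f}% YoY")
```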

Why is this traffic not being filtered as automated?

  • It comes from many different UAs, and hence many actor_signature_per_project_family values
  • It is not triggering the automata threshold
  • Some requests were identified as automated, but most of the traffic was labeled as user

More investigation required
Open Question: Why so many UAs and actor signatures?

How does this compare to what we have previously seen from Singapore?

  • BR traffic is not dominated by a few ASNs
  • 30% from Google’s ASN: a notable concentration, but not enough to immediately explain all of the traffic increase.
  • 70% from other smaller ASNs
  • SG: it was a few big players like Huawei and Cloudflare

Nature of BR traffic using x-analytics map cloud indicator

  • BR: 99% of user traffic is showing up as not cloud on May 25
  • Singapore: 99% of user traffic is "cloud"
Mayakp.wiki renamed this task from Investigate rise in May 2025 unique devices and reader metrics to NEW BUG REPORT: Investigate rise in May 2025 Reader metrics.Jun 9 2025, 6:00 PM
Mayakp.wiki raised the priority of this task from Medium to High.
Mayakp.wiki updated the task description.

Retention window extension - In anticipation of the need for a backfill and to give teams more time to investigate this issue without data loss, we have been granted a temporary extension to our typical 90 day data retention window for wmf.webrequest and related datasets.

Change #1155178 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Temporarily bump analytics webrequest retention

https://gerrit.wikimedia.org/r/1155178

Change #1155178 merged by Stevemunene:

[operations/puppet@production] Temporarily bump analytics webrequest retention

https://gerrit.wikimedia.org/r/1155178

Change #1155621 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Fix analytics webrequest data purge

https://gerrit.wikimedia.org/r/1155621

Change #1155621 merged by Btullis:

[operations/puppet@production] Fix analytics webrequest data purge

https://gerrit.wikimedia.org/r/1155621

One potential reason for this surge is linked to the activities of an actor, working mostly out of some of the countries named here, whose activity has surged right around April 25th.

We've identified the agent by characteristics in the web requests that I don't feel we can fully share in a public task, but you can see the data if you have access at https://w.wiki/ESZp (set time granularity to "daily" to better see how things are progressing). SRE already rate-limited parts of this traffic last Friday to protect against excessive bandwidth usage from media downloads, but the limits have since been lifted.

More details were added to the incident document as well.

Update of some new findings:

Our heuristics filter currently classifies a user as 'user' if their pageview_count < 10 over an aggregated window of time, before any other heuristics are checked. We are seeing examples of users that are seemingly automata and have high per-minute pageviews, but that slip through the heuristics because their total pageviews are < 10. In other words, short bursts of pageview traffic are coming from these users: each burst is small enough to bypass the first filter, but would be caught by the pageviews-per-minute filter, which labels a user as automated if their pageviews per minute > 30.

Preliminary analysis on sampled data suggests this traffic might constitute the majority of the suspected automated traffic (judging by the traffic that lacks a referer header), so DPE will try to re-order the heuristics to prioritize pageviews per minute, and then run the traffic through the updated heuristics to see what the pageview and unique device counts correct to. This should theoretically have a minimal impact on false positives, because it only applies an existing automata filter to traffic that was bypassing it.
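
The effect of the ordering can be illustrated with a minimal sketch. Only the two thresholds (pageview_count < 10 and pageviews per minute > 30) come from this task; the `ActorWindow` type, the `active_seconds` field, the way the per-minute rate is extrapolated from a short burst, and the function names are assumptions made for illustration, and all other heuristics are omitted.

```python
from dataclasses import dataclass

PAGEVIEW_COUNT_THRESHOLD = 10        # from the task: pageview_count < 10 -> 'user'
PAGEVIEWS_PER_MINUTE_THRESHOLD = 30  # from the task: > 30 per minute -> 'automated'


@dataclass(frozen=True)
class ActorWindow:
    """Hypothetical per-actor aggregate over one time window."""
    pageview_count: int
    active_seconds: float  # assumed: time between first and last pageview

    @property
    def pageviews_per_minute(self) -> float:
        # Assumption: the rate is extrapolated from the active period,
        # with a one-second floor to avoid dividing by zero.
        return self.pageview_count / (max(self.active_seconds, 1.0) / 60.0)


def classify_current_order(actor: ActorWindow) -> str:
    """Low-volume check first: short bursts never reach the rate check."""
    if actor.pageview_count < PAGEVIEW_COUNT_THRESHOLD:
        return "user"
    if actor.pageviews_per_minute > PAGEVIEWS_PER_MINUTE_THRESHOLD:
        return "automated"
    return "user"


def classify_reordered(actor: ActorWindow) -> str:
    """Rate check first: short, fast bursts are labeled automated."""
    if actor.pageviews_per_minute > PAGEVIEWS_PER_MINUTE_THRESHOLD:
        return "automated"
    return "user"


burst = ActorWindow(pageview_count=8, active_seconds=10)    # 48 pageviews/minute
reader = ActorWindow(pageview_count=5, active_seconds=600)  # 0.5 pageviews/minute

print(classify_current_order(burst), classify_reordered(burst))
print(classify_current_order(reader), classify_reordered(reader))
```

In this sketch the burst is labeled 'user' under the current order and 'automated' once the rate check runs first, while the slow reader stays 'user' in both. How the rate is computed for very short windows would matter in practice, since a real reader opening two pages within a second would also show a high extrapolated rate.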

Screenshot shows percentile of 'users' inside each value of pageviews per minute:

image.png (367×696 px, 16 KB)

Yesterday we tried an alternative approach that aims to identify which IPs belong to the bot-net by looking at their combined behavior, as opposed to their individual properties.
So far, it seems the resulting classification could identify a good part of the bots.
I'll try to explain it here:

We realized that the bots are requesting lots of pages with the /w/index.php uri_path, which was a bit puzzling at first.
But then we realized that it was probably a consequence of how a crawler/spider operates: it requests all the links it can find in a page, recursively.
Many of the links that appear in wiki articles, like history or diff, take you to a /w/index.php page.
We saw that such diff requests were very common in this bot traffic, and tried to exploit that to gather the list of IPs of the bot-net. I can explain more later about why diff requests are convenient for this calculation.
Collecting IPs from all diff requests would have included user traffic, though. So we collected IPs only from those diff requests that had hit the same page_title a number of times, each with a different actor, over a short period of time.
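
A minimal sketch of the collection step described in the paragraph above, not the actual DPE job: the `DiffRequest` record, the fixed one-hour windows, and the `min_distinct_actors` threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class DiffRequest(NamedTuple):
    """Hypothetical record for one diff request."""
    ts: int               # epoch seconds
    ip: str
    actor_signature: str
    page_title: str


def suspected_botnet_ips(
    requests: Iterable[DiffRequest],
    window_seconds: int = 3600,    # assumed "short period of time"
    min_distinct_actors: int = 5,  # assumed "a number of times"
) -> Set[str]:
    """Collect IPs from diff requests that hit the same page_title from many
    distinct actors within the same fixed time window."""
    actors = defaultdict(set)
    ips = defaultdict(set)
    for r in requests:
        key = (r.page_title, r.ts // window_seconds)
        actors[key].add(r.actor_signature)
        ips[key].add(r.ip)

    flagged: Set[str] = set()
    for key, distinct_actors in actors.items():
        if len(distinct_actors) >= min_distinct_actors:
            flagged |= ips[key]
    return flagged


# Five different actors request diffs of the same title within one hour,
# plus one unrelated diff request from a regular reader.
sample = [
    DiffRequest(ts=1000 + i, ip=f"10.0.0.{i}", actor_signature=f"a{i}", page_title="Foo")
    for i in range(5)
]
sample.append(DiffRequest(ts=1200, ip="192.0.2.1", actor_signature="reader", page_title="Bar"))

flagged_ips = suspected_botnet_ips(sample)
print(sorted(flagged_ips))
```

Fixed windows keep the sketch short. A sliding window would avoid missing bursts that straddle a window boundary, at the cost of more bookkeeping.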

We tried this approach against 2025-04-12 (baseline traffic), where it filtered out 3-4% of user traffic, and against 2025-05-25 (peak bot traffic), where it filtered out about 85% of user traffic, leaving the remaining traffic close to the baseline.
We think this looks like something we could use, and we want to run more tests across a longer time interval and a wider geographic span, to make sure the approach identifies the bot traffic without eating into user traffic.

Update from June 26 meeting:

  • DPE is finalizing a new heuristic for automated traffic detection. It groups IPs that have visited multiple diff pages within a span of 24 hours, or whose requests belong to a set of requests to diff pages of the same page title made by multiple IPs from a particular country within an hour.
  • Backfilling the data will take ~2-4 weeks from now. Expect the May-July data to be backfilled by the end of July 2025.

This is now hypothesis 1.1.3 under SDS1.1:
If we identify the causes driving recent increases in our pageview metrics and subsequently refine our bot-filtering heuristic and backfill historical data, we will restore confidence in the accuracy of our readership metrics and the metric's distinction between user and automated traffic.

We are currently working on reducing false positives in user pageviews.

Movement Insights is currently testing 1 week of baseline (April) and Issue (May) data; as well as one day of July for false positives to make sure we're not labeling actual user data as 'automated' traffic.

In the last couple hours, we've deployed the improvements to the automated traffic detection.
From this hour forward, the pageviews and unique devices user metrics should show a slight decrease in volume, due to the share of traffic that we are now recognizing as automated.
Shortly, we will begin the backfilling process to also correct the last couple of months' worth of data with the same approach.