
SDS 1.3.6 SPUR bot detection analysis
Closed, ResolvedPublic

Description

Following up on the notebook from @Hghani, let's review it:
https://gitlab.wikimedia.org/hghani/movement-insights-requests/-/tree/main/SDS%201.3?ref_type=heads

Second notebook to summarize and streamline the results of the previous one.

Third notebook looking at HAP signals + Spur to assess false positives: notebook

Details

Other Assignee
Hghani

Event Timeline

Awesome study. Thanks for adding the split by countries.

As you asked me, I've read your notebook and I haven't detected any inconsistencies in the queries.

Regarding the interpretation of those queries: most percentages are calculated against a total that carries preconditions, which is hard to decipher without reading the code. To improve readability, you could add a summary table at the end for other readers.

Thanks for reviewing. I have summarised the observations so far below and added a new notebook to the repo that streamlines the tables; this should hopefully be easier to interpret.

Analysis results:

General overlap between Spur and pageview data – User traffic:

Testing IP overlap on a day's worth of data in each month from April to September 2025, joining user pageviews in wmf.pageview_actor to spur.us IPs (September 18 snapshot) that have any value in the risk category:
The IP overlap starts at 6% (April 20) and rises to 8% (Sept 20); the overlap increases as the pageview data gets closer to the snapshot date.
The pageviews coming from the overlapping IPs amount to about 15-20% of the total if we count any risk category. The share is higher during the incident and lower before the incident period in May 2025.
About 80% of the overlap comes from the CALLBACK_PROXY category and the rest from TUNNEL; WEB_SCRAPING and the other categories have negligible overlap (<1%).

Therefore, if we assume that any risk category means automated, we would expect a substantial conversion of almost 20% of user traffic to automated.
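The overlap computation described above can be sketched in plain Python on toy data. The real work runs against wmf.pageview_actor and the Spur snapshot; here the records, IPs, and counts are all invented for illustration, and the structure only mirrors the join-and-aggregate logic.

```python
# Toy stand-ins for one day of user pageviews and the Spur risk snapshot.
pageviews = [  # (ip, pageview_count) aggregated per IP
    ("203.0.113.1", 120),
    ("203.0.113.2", 40),
    ("198.51.100.7", 300),
    ("192.0.2.9", 15),
]
spur_risky = {  # IPs that carry any non-empty risk category in Spur
    "203.0.113.1": "CALLBACK_PROXY",
    "198.51.100.7": "TUNNEL",
}

# Share of distinct IPs that appear in both datasets.
overlap_ips = [ip for ip, _ in pageviews if ip in spur_risky]
ip_overlap_pct = 100 * len(overlap_ips) / len(pageviews)

# Share of pageviews carried by those overlapping IPs.
total_pv = sum(n for _, n in pageviews)
overlap_pv = sum(n for ip, n in pageviews if ip in spur_risky)
pv_overlap_pct = 100 * overlap_pv / total_pv

print(f"IP overlap: {ip_overlap_pct:.0f}%")
print(f"Pageview overlap: {pv_overlap_pct:.0f}%")
```

Note how the two percentages can diverge sharply: a small set of overlapping IPs can still carry a large share of pageviews, which is exactly the 6-8% IP overlap vs. 15-20% pageview overlap pattern reported above.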

General overlap between Spur and pageview data – Automated traffic:

About 13-15% of pageviews flagged as automated in our pageview table come from IPs/actors that are found in the Spur dataset. As with the user traffic, these fall into the CALLBACK_PROXY and TUNNEL risk categories, with negligible overlap in the other categories. Extending the time period to the last few months in their entirety shows a similar distribution. Therefore, the majority of traffic that our pipeline has classified as 'automated' does not overlap with the IPs in the Spur dataset.
Mapping the overlapping pageviews by label_reason (assigned by our automated labeling pipeline) shows that they are split almost evenly between the new 'ip with spider behavior' reason introduced during the incident and the existing 'over 800 pageview requests' reason.
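The label_reason mapping is a simple group-and-share aggregation; a minimal sketch follows, with the two reason strings taken from the text above but all pageview counts invented.

```python
from collections import defaultdict

# Toy (label_reason, pageviews) rows for the overlapping automated actors.
labeled = [
    ("ip with spider behavior", 5200),
    ("over 800 pageview requests", 4800),
    ("ip with spider behavior", 300),
]

# Sum pageviews per reason, then convert to percentage shares.
per_reason = defaultdict(int)
for reason, n in labeled:
    per_reason[reason] += n

total = sum(per_reason.values())
shares = {r: round(100 * n / total, 1) for r, n in per_reason.items()}
print(shares)
```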
Traffic broken down by country:

Decomposing the traffic by country yielded an interesting result: during the incident period, 60% of the automated pageviews from Brazil (the main origin of the May incident traffic) that were captured by our new heuristic come from IPs present in the Spur dataset. This number goes up to 85% of pageviews in September 2025.

Looking at earlier snapshots (April 30th, before the May incident, and May 30th, during it), we still see ~60% of the pageviews from Brazil that were flagged by the new heuristic overlapping with the Spur dataset.
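The per-country breakdown is the same overlap computation, grouped by country. A hypothetical sketch on made-up data:

```python
from collections import defaultdict

# Toy (country, ip, pageviews) rows flagged automated by the heuristic.
automated_pv = [
    ("BR", "203.0.113.1", 500),
    ("BR", "203.0.113.2", 300),
    ("BR", "203.0.113.3", 200),
    ("US", "198.51.100.7", 400),
    ("US", "198.51.100.8", 600),
]
spur_ips = {"203.0.113.1", "203.0.113.2", "198.51.100.7"}  # invented

# For each country: what share of automated pageviews come from Spur IPs?
totals, overlap = defaultdict(int), defaultdict(int)
for country, ip, n in automated_pv:
    totals[country] += n
    if ip in spur_ips:
        overlap[country] += n

by_country = {c: 100 * overlap[c] / totals[c] for c in totals}
print(by_country)
```

Sliced this way, a country-specific botnet shows up as a high per-country overlap even when the global overlap figure stays modest.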

Conclusion so far:

Assuming all of the Spur IP risk categories reflect automated traffic, we'd see 15-20% of pageviews convert to automated. Moreover, most of what we classified as automated does not overlap with the Spur dataset. On the other hand, the majority of IPs that were flagged by the new heuristics in Brazil did appear in the Spur dataset: 60-85% of those pageviews overlapped with source IPs that are also in Spur (higher overlap when the pageview data is closer to the snapshot date), and even with an older snapshot from before the incident began, about 60% of the May incident IPs in Brazil were overlapping with the Spur dataset. The issue is that the accuracy of the Spur data is questionable: it's hard to tell how many entries are false positives.

Therefore, it seems the Spur dataset is useful for a particular kind of automated traffic: the botnet traffic that occurred in May, but we have to refine it. The general idea that came out of the discussion (as expected, I think) is that the Spur data is useful as a supplementary tool alongside other bot detection techniques, but the how is yet to be determined.

The next step is to try to isolate false positives by using other data that could serve as 'ground truth'. Some candidates are:

  1. the intersection of Spur and hCaptcha's "suspicious" score
  2. x_analytics_map.x_requestctl also gives us an approximate "ground truth" for obvious bots based on SRE's experience

These are in addition to using the May incident IPs as a test case to work from.

Hghani updated the task description. (Show Details)

Following up from the previous update:

HAP signals that were used:

```sql
CASE WHEN requestctl IN ('hap:h1_requests', 'hap:old_tls_requests',
                         'hap:chrome_forged_ua', 'hap:firefox_forged_ua')
     THEN 1 ELSE 0 END AS has_hap
```

Note: These signals are only available after the May incident, starting around late August and September. The numbers fluctuate and the figures are approximate. Exact numbers can be found in the linked notebook for the days that were included in the sample.

HAP Signals by themselves:

  1. Using the HAP signals by themselves would convert 2-4% of user pageviews to automated, if we assume a HAP signal means the traffic is automated.
  2. About 10-20% of the traffic that we have already labelled as automated contains one of the HAP signals indicated above.
  3. Both figures fluctuate substantially from day to day.

Combined with Spur:

  1. Taking IPs from pageview_actor that are on Spur under any risk category and keeping only their pageviews that also match a HAP rule reduces the earlier 13-15% figure (18% on some days in the new analysis) to 2% or less. In other words, about 2% of total pageviews are corroborated as suspicious when the HAP signals are combined with a risk category on the spur.us list.
  2. Excluding Brazil, HAP signals overlap more with pageviews that we classify as automated (~40% of the traffic we classified as automated from the US, and ~16% from the rest of the world).
  3. The inverse is true of the Spur traffic: it captures most of the Brazil automated traffic observed during the May incident (>60%) but much less of the US (~6%) and other countries (~9%). All figures fluctuate day to day.

Therefore, it seems that using HAP signals in conjunction with the Spur traffic might not be ideal to figure out what proportion of the spur IPs are generating legitimate automated traffic because the spur IPs mostly correspond to the kind of traffic we experienced in May (probably botnet) and HAP signals overlap more with other kinds of automated traffic. Further testing against other sources of truth would be informative.
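The combined check above can be sketched as a conjunction of the two signals: a request is corroborated only when its IP is on the Spur list and it matched one of the HAP rules. The HAP rule names come from the query shown earlier; the Spur IPs and request records are invented.

```python
# HAP rules as listed in the requestctl CASE expression above.
HAP_RULES = {"hap:h1_requests", "hap:old_tls_requests",
             "hap:chrome_forged_ua", "hap:firefox_forged_ua"}
spur_ips = {"203.0.113.1", "198.51.100.7"}  # toy Spur risk-category IPs

requests = [  # (ip, requestctl tag or None) - invented records
    ("203.0.113.1", "hap:h1_requests"),        # Spur + HAP -> corroborated
    ("203.0.113.1", None),                     # Spur only
    ("198.51.100.7", "hap:old_tls_requests"),  # Spur + HAP -> corroborated
    ("192.0.2.9", "hap:chrome_forged_ua"),     # HAP only
    ("192.0.2.9", None),                       # neither
]

# Keep only requests where both signals agree.
corroborated = [(ip, tag) for ip, tag in requests
                if ip in spur_ips and tag in HAP_RULES]
share = 100 * len(corroborated) / len(requests)
print(f"{share:.0f}% of requests corroborated by both signals")
```

Requiring both signals shrinks the flagged set dramatically on this toy sample, mirroring how the 13-15% Spur-only overlap drops to ~2% once HAP is required as well.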

PROXIES:

Testing client.proxies in the Spur dataset shows that most of the traffic coming from proxies like Luminati in Brazil has been classified as automated by our heuristics, while the same proxy's traffic from the US has been classified as "user". The split is fairly consistent: about ~20% of proxy traffic from the US is labeled as automated, versus over 80% of proxy traffic from Brazil.
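That proxy split is a per-country share of the automated label among proxy-originated pageviews; a minimal sketch, with all rows invented so the shares land on the ~80%/~20% pattern described above:

```python
# Toy (country, agent_label) rows for pageviews whose IP has
# client.proxies set in the Spur dataset.
rows = [
    ("BR", "automated"), ("BR", "automated"), ("BR", "automated"),
    ("BR", "automated"), ("BR", "user"),
    ("US", "automated"), ("US", "user"), ("US", "user"),
    ("US", "user"), ("US", "user"),
]

def automated_share(country):
    """Percent of this country's proxy pageviews labeled automated."""
    labels = [label for c, label in rows if c == country]
    return 100 * sum(label == "automated" for label in labels) / len(labels)

print(automated_share("BR"))  # high share, as observed for Brazil
print(automated_share("US"))  # low share, as observed for the US
```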

Link to notebook

Antoine_Quhen renamed this task from SDS 1.3.6 First analysis review to SDS 1.3.6 SPUR bot detection analysis.Oct 24 2025, 2:11 PM

Updated my last notebook with some additional findings:

  • using client.proxies as a standalone signal consistently reduces the share of user pageviews overlapping with Spur by about 2-3% compared with using any Spur risk label.
  • every IP on the spur list that has a client.proxy also has a risk category tagged to it.

Waiting for client-side signals for more spur.us dataset evaluations.

In the meantime we may:

  • study using only some proxy providers.
  • perform a cross-analysis with the spur.us DC dataset (already included in hap:is_from_dc).

Following the multi signal analysis we may productionize the anonymous+residential IPs dataset import (with GeoIP data too) to help with more accurate geo-location and filtering of residential IPs.