
Modify the automated traffic detection to be applied at the project family level
Closed, Resolved (Public)

Description

In T373630#10226978, @Hghani wrote:

Problem Statement

Based on further analysis, we've identified that certain actor signatures identified in the findings below are generating highly suspicious traffic. Specifically, they send a request every few seconds, each to a different domain. Upon reviewing our automated filtering logic, it appears that we apply automata labelling at the actor_signature level. This means our current heuristics may fail to label traffic accurately when users frequently switch domains, because each new domain visit generates a different actor signature, which resets metrics such as pages per minute or total pageviews.

This oversight is likely a major factor behind the nearly 2000% increase in unique devices from Singapore, as these actor signatures are bypassing our automata filters. In addition to the redirect problem discussed earlier in this ticket, we believe this pattern accounts for a significant portion of the overall increase in unique devices we've observed.

To address this, we propose applying automata labelling at the actor_signature_project_family level. This would continue to capture automata behavior at the domain level while also including actors who evade detection by switching domains. After we discussed this with @JAllemandou, he agreed the approach sounds reasonable, but said it would require a thorough impact analysis before implementation.
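To illustrate why domain switching resets the per-actor metrics, here is a minimal sketch. The `signature` helper, the fields it hashes, and the `project_family` mapping are simplifying assumptions for illustration only; they are not the actual get_actor_signature logic.

```python
import hashlib
from collections import Counter


def signature(*parts: str) -> str:
    """Hash the given fields into a stable actor key (illustrative only)."""
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


def project_family(host: str) -> str:
    # "en.wikipedia.org" -> "wikipedia"; a simplification of the real mapping.
    return host.split(".")[-2]


# One hypothetical client that hits a different domain on every request.
hosts = ["en.wikipedia.org", "fr.wikipedia.org", "de.wikipedia.org",
         "ja.wikipedia.org", "es.wikipedia.org", "it.wikipedia.org"]
requests = [{"ip": "203.0.113.7", "ua": "Mozilla/5.0", "host": h} for h in hosts]

# Requests counted per signature when the domain is part of the key...
per_domain = Counter(signature(r["ip"], r["ua"], r["host"]) for r in requests)
# ...and when only the project family is part of the key.
per_family = Counter(
    signature(r["ip"], r["ua"], project_family(r["host"])) for r in requests
)

print(len(per_domain), max(per_domain.values()))  # 6 signatures, 1 request each
print(len(per_family), max(per_family.values()))  # 1 signature, 6 requests
```

With the domain in the key, no single signature ever accumulates more than one request, so rate-based heuristics never trigger. Keyed by project family, the same client shows up as one actor with all of its requests.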

@JAllemandou I'd like to add some new findings from looking at Singapore's unique device increase, which extend from a discussion with Traffic.

For September 2024 we see about an 1800% year-over-year increase in unique devices from Singapore. This puts the total unique devices counted in Singapore at 125,911,935 for a population of 5 million, and accounts for about 36% of total new unique devices year-over-year (an even larger contribution to total uniques than the USA). The majority of these uniques (99% of unique devices at peak: see chart below), as in the previous two months, come from cloud-based ISPs such as Huawei Cloud (AS136907). Many of these are actor signatures with thousands of uniques attributed to them.
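As a quick sanity check on these figures, the arithmetic can be written out as below. The only inputs are the numbers quoted above; reading "about 1800% increase" as a 19x multiplier is an approximation, so the implied prior-year figure is a rough estimate, not a reported value.

```python
singapore_uniques = 125_911_935  # unique devices counted for Singapore, Sept 2024
population = 5_000_000           # approximate population quoted above

# Unique devices counted per resident.
devices_per_resident = singapore_uniques / population
print(round(devices_per_resident, 1))  # 25.2

# An increase of 1800% means this year is about 19x last year (1 + 18).
implied_prior_year = singapore_uniques / (1 + 18.0)
print(round(implied_prior_year / 1e6, 1))  # about 6.6 (million)
```

Roughly 25 counted devices per resident is the basis for calling this volume implausible for legitimate traffic.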

[Chart: image.png]

Consulting with @Vgutierrez, we believe a large chunk of this traffic is likely automated because:
a) Lack of referrer header.
b) UA looks like it's being faked.
c) The provider is Cloud infrastructure.
d) The volume of unique devices is essentially impossible to generate legitimately from Singapore.

This increase is occurring on pageviews, so it appears in both the project domain and project family tables. This means that if the traffic is automated, it is not being captured by our automata filter. It also means it is likely a separate issue from the redirect problem identified above, and given the increases we have seen in Singapore every month since Jan 2024, this number may continue to grow rapidly.

[Image: image.png]

September will be another month where we will see about 18% YoY increase in Uniques and it seems like a significant portion is coming from Singapore this month. I’m not sure if there’s an easy fix, given that this traffic is bypassing our automata filter and we’re uncertain of its exact nature. However, the significant impact this suspicious traffic is having on the increase in unique devices adds another piece to the puzzle.

What needs to be done
  1. In the Automata pipeline, starting with webrequest_actor_metrics_hourly, replace get_actor_signature at line #39 with the project family actor signature.
    • This ensures that the automata logic is applied at the project family level instead of the domain level. The project family level is a superset that filters automated bots at both the project and domain levels, so this change works for both the existing and the desired use cases.
  2. Next, the pageview actor table should use actor_signature_per_project_family to join with the automata label table, so that automated actors are filtered out at the project level (pageview_actor.hql#74).
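The two steps above can be sketched as follows. This is a toy model, not the refinery HQL: a single request-count threshold stands in for the real heuristics, and the row layout and the helper names `label_automated` and `filter_pageviews` are hypothetical.

```python
from collections import Counter

REQUEST_THRESHOLD = 800  # placeholder stand-in for the real heuristics


def label_automated(rows, key):
    """Step 1: label every signature (under the given key) that exceeds the threshold."""
    counts = Counter(r[key] for r in rows)
    return {sig for sig, n in counts.items() if n > REQUEST_THRESHOLD}


def filter_pageviews(rows, key):
    """Step 2: join pageviews with the labels on the same key and drop automated actors."""
    automated = label_automated(rows, key)
    return [r for r in rows if r[key] not in automated]


# Hypothetical actor spreading 1,200 requests over 3 domains, plus a normal reader.
bot = [
    {"actor_signature": f"bot@{d}",
     "actor_signature_per_project_family": "bot@wikipedia"}
    for d in ("en", "fr", "de") for _ in range(400)
]
reader = [
    {"actor_signature": "reader@en",
     "actor_signature_per_project_family": "reader@wikipedia"}
    for _ in range(12)
]
rows = bot + reader

kept_domain = filter_pageviews(rows, "actor_signature")
kept_family = filter_pageviews(rows, "actor_signature_per_project_family")
print(len(kept_domain))  # 1212: 400 requests per domain stays under the threshold
print(len(kept_family))  # 12: the 1,200-request actor is labelled and filtered out
```

Both the labelling and the join have to use the same key. Labelling per project family but joining on the per-domain signature would leave the automated rows in place.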

Event Timeline

Things I think we should validate in a one-off analysis before implementing:

  • Using the project-family actor signature with the automated-traffic detection algorithm removes the fake unique-devices rows for an example date
  • The difference between traffic flagged as automated with the per-domain actor signature and with the per-project-family actor signature is judged to be automated by experienced humans (we have no better way of judging...)
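For the second check, the traffic to review manually is the set of rows flagged only under the per-project-family signature. A minimal sketch, assuming hypothetical field names (`domain_sig`, `family_sig`) and made-up sample rows:

```python
def newly_flagged(rows, domain_flagged, family_flagged):
    """Rows flagged under the per-project-family signature but not per-domain."""
    return [
        r for r in rows
        if r["family_sig"] in family_flagged
        and r["domain_sig"] not in domain_flagged
    ]


# Made-up sample rows; signatures are shown as readable strings for clarity.
rows = [
    {"domain_sig": "a@en", "family_sig": "a@wikipedia", "referer": None},
    {"domain_sig": "a@fr", "family_sig": "a@wikipedia", "referer": None},
    {"domain_sig": "b@en", "family_sig": "b@wikipedia", "referer": None},
    {"domain_sig": "c@en", "family_sig": "c@wikipedia",
     "referer": "https://www.google.com/"},
]
domain_flagged = {"b@en"}                        # already flagged today
family_flagged = {"a@wikipedia", "b@wikipedia"}  # flagged under the proposal

to_review = newly_flagged(rows, domain_flagged, family_flagged)
print([r["domain_sig"] for r in to_review])  # ['a@en', 'a@fr']
```

Only actor "a" needs human review here: "b" was already flagged per domain and "c" is not flagged under either key.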

Let me know if what I'm saying makes no sense to you :)

@Hghani - could you take a look at Joseph's comment?

@JAllemandou
The first makes sense.

Can you elaborate on the second point? Do you mean the difference in the traffic between flagging by project family vs project domain can be attributed to a human that is purposefully bypassing automata filtering? Or do you mean something else? Thank you!

I don't think we can easily judge whether someone is trying to "purposefully bypass" our algorithm. I'd like you to verify that the new portion of traffic flagged as automated using the per-project-family actor signature (the portion that was not flagged when using the per-domain actor signature) looks like it was generated by automated actors when inspected manually.

We want to verify that the change doesn't flag as automated any traffic that we think shouldn't be flagged.

Let's talk about this request if you wish @Hghani :)

We completed an initial impact analysis on the proposal to apply automated traffic detection at the actor_signature_per_project_family level, which we expect will significantly reduce the unique devices count for Singapore, and will be consulting with DPE on implementation/next steps this week.

In our analysis we replicated the unique devices pipeline starting at wmf.webrequest, where we added the actor_signature_per_project_family column and applied the heuristics as-is on actor_signature_per_project_family. This was run on a single day's data (Sept 25 2024).

Our results indicate the majority of Singapore unique devices are automated (we saw a drop of ~90%) and there is minimal change on other countries' unique devices count on that day.

Uploaded results of this analysis to gitlab in a notebook.

Summary:
Our analysis shows that labelling traffic as automata by the project family actor signature for Sept 25th 2024 has the following results:

  1. A 4.77% reduction in unique devices. About 70% of the new automata labelling comes from Singapore, which implies there is something unique about Singapore traffic (as expected from its anomalous increase in uniques).
  2. Minimal effect per country on the uniques count.
  3. A 25% reduction in pageviews, half of which comes from Singapore.
  4. Spot checks of actors labelled as automated under the new logic show signs of true automation, based on their webrequest activity and lack of referer.

After consulting with DPE here are next steps:

  1. Evidence that Singapore's data is automated is strong and this should be corrected.
  2. It is evident that applying the automata labelling will 'fix' most if not all of Singapore's unique device overcounting issue.
  3. There will be a non-trivial drop in user pageviews, and some legitimate users could now be flagged as automated as a result of this change (in the case of IPs being conflated).
  4. To minimize (3) and optimize the fix, some additional analysis would be useful:
    • Look at the distribution of label reasons and pageview request counts in Singapore vs other countries to determine whether there is an optimal threshold that will target automated users.
    • Do a per-country analysis of referrer type to determine whether null referrers are distributed across all countries among the actors labelled as automated.
    • Look at pageview seasonality during the day to determine whether the hourly behavior looks automated.
  5. Implementing the fix will not take more than a few days once decision is agreed upon.

I've looked at the additional analysis questions mentioned in the last update and put the results on this google doc along with the underlying queries for review/replication.

Summary:

(1) Referrer count by country shows that, of the newly labelled actor signatures, 78% from the U.S. have null referrers and 90% from Singapore have null referrers. For reference, about 75% of referrers are null in our regular wmf table for this date (Sept 24 2024).

(2) Distribution of label reasons indicates that, of the newly labelled automated agents, 38% are labelled as 'too many requests w/o cookies and small page_title variability' and 33% are labelled as 'more than 800 req'. Compared to our original numbers, 28% fall into 'too many requests w/o cookies and small page_title variability' and 15% fall into 'more than 800 req'. So there is a drop in high request ratio but an increase in requests and request ratio.

(3) Comparing Singapore and US pageviews for those who were newly labelled as automated, there is a large gap where many more automata from Singapore have 2000-3000 pageviews. But labelling of automata is split between 'too many requests w/o cookies...' and pageview_requests > 800.

(4) To assess daily seasonality we plotted the pageviews for the day for only the newly labelled actor signatures to see if the daily pattern matches the usual circadian pattern. Neither the plot for the U.S nor Singapore appears to match a usual circadian plot (Canada's pageview plot for users is referenced).
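The comparison above was done by eye on plots. One crude way to put a number on "does not look circadian" is the coefficient of variation of the 24 hourly counts, sketched below. This metric is an assumption for illustration, not part of the analysis, and the hourly profiles are synthetic. A low value is only a hint of flat, machine-like traffic, not proof.

```python
import math
from statistics import mean, pstdev


def hourly_variation(hourly_counts):
    """Coefficient of variation of hourly counts; near 0 means flat traffic."""
    m = mean(hourly_counts)
    return pstdev(hourly_counts) / m if m else 0.0


# Synthetic profiles: perfectly flat traffic vs. a smooth day/night cycle.
flat = [1000] * 24
diurnal = [1000 + 600 * math.sin(2 * math.pi * (h - 9) / 24) for h in range(24)]

print(hourly_variation(flat))               # 0.0
print(round(hourly_variation(diurnal), 2))  # 0.42
```

Any cutoff between "flat" and "circadian" would need to be calibrated against known human traffic, such as the Canada reference plot mentioned above.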

The evidence is consistent with the previous analysis in suggesting that we are capturing mostly automated traffic, but review/discussion of the findings would be useful.

Thank you @Hghani for the analysis. I'm happy with the evidence you found to move forward with the implementation of the new heuristic.
The backfilling aspect is still under discussion, but I'll try to have a patch ready so that the heuristic change applies from the beginning of December (or as close to the beginning of the month as possible).

Change #1100067 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Update webrequest_actor heuristic

https://gerrit.wikimedia.org/r/1100067

Change #1100067 merged by Joal:

[analytics/refinery@master] Update webrequest_actor heuristic

https://gerrit.wikimedia.org/r/1100067