Phabricator

Investigate recent rise in Unique Devices
Closed, Resolved · Public

Description

For the past 3 months we have been noticing a rise in unique devices. In July we had 1.8B unique devices, a +19% YoY increase, and in August it went up +20% YoY.
The increases are concentrated in certain regions and countries. For example, in July ESEAP increased 23% YoY, driven by large increases in Singapore (+330%) and Hong Kong (+220%). We suspect that the increase in unique devices may be linked to a rise in VPN usage in the countries where we're observing these trends.

We want to investigate further to find out if this is indeed caused by VPNs or due to some other reason.

Conversely, pageviews have been dropping YoY, which can be attributed to our work to disable classes of automated traffic originating from Google Chrome. These declines are in line with our expectations.

Related Objects

Event Timeline

I find this interesting because I can come up with reasons why VPN usage could increase OR decrease the unique device counts. My initial thought was that increased VPN usage would actually lead to a drop in unique device counts. The reason is that unique devices is the sum of known unique devices (based on the WMF-Last-Access cookie: uniques_underestimate in the Hive table) plus an estimate of how many of the pageviews from devices lacking cookies are actually unique (uniques_offset in the Hive table). That latter estimate is based on checking how many of the nocookies pageviews have a unique actor signature (essentially UA+IP).

So if more folks are using VPNs, then we have more folks on shared IP addresses, and therefore a lower count of unique actor signatures and a lower unique device count. However, if VPNs are also clearing folks' cookies (and not just obscuring their IP address), then you could see an increase in unique devices: a lot more devices show up with nocookies, and even if there are more shared IP addresses, these devices still end up getting double-, triple-, etc. counted, inflating the unique device count.

It's a really interesting question too because there's a discussion around improved unique device counting strategies that still rely on cookies (so would still be susceptible to the second possible effect). Knowing a bit more about how VPNs are affecting current unique device estimates might help guide that discussion too.
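The two opposing effects described above can be sketched with a toy model (illustrative only; the function and sample data are made up and are not the production pipeline):

```python
# Toy model of the unique-devices estimate described above (illustrative only).
# uniques = uniques_underestimate (devices seen with a WMF-Last-Access cookie)
#         + uniques_offset (distinct actor signatures, ~UA+IP, among nocookies pageviews)

def estimate_uniques(cookied_devices, nocookie_requests):
    """cookied_devices: set of device ids seen with a last-access cookie.
    nocookie_requests: list of (user_agent, ip) pairs for cookieless pageviews."""
    uniques_underestimate = len(cookied_devices)
    uniques_offset = len(set(nocookie_requests))  # distinct actor signatures
    return uniques_underestimate + uniques_offset

# Effect 1: VPNs only share IPs -> fewer distinct signatures -> estimate drops.
baseline = estimate_uniques({"d1", "d2"}, [("ua1", "ip1"), ("ua1", "ip2")])              # 4
vpn_shared_ip = estimate_uniques({"d1", "d2"}, [("ua1", "vpn-ip"), ("ua1", "vpn-ip")])   # 3

# Effect 2: a VPN that also clears cookies makes one real device reappear
# cookieless through different exit IPs, so it is counted once per signature.
one_device_two_exits = estimate_uniques(set(), [("ua1", "exit-a"), ("ua1", "exit-b")])   # 2
```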

Another weird finding from today: we have a bunch of requests where one of WMF-Last-Access-Global and WMF-Last-Access is set while the other is not. We would love some input from those who know the VCL better.

SELECT if(x_analytics_map['nocookies'] IS NULL, 'accepts cookies', 'does not accept cookies') AS cookies,
       length(x_analytics_map['WMF-Last-Access-Global']) AS global_length,
       length(x_analytics_map['WMF-Last-Access']) AS local_length,
       count(*)
FROM wmf.pageview_actor
WHERE x_analytics_map IS NOT NULL
  AND agent_type = 'user'
  AND is_pageview = TRUE
  AND year = 2024 AND month = 8 AND day = 14 AND hour = 10
GROUP BY x_analytics_map['nocookies'] IS NULL,
         length(x_analytics_map['WMF-Last-Access-Global']),
         length(x_analytics_map['WMF-Last-Access'])


cookies                 global_length   local_length    count(1)
accepts cookies         NULL            NULL            141697
accepts cookies         11              11              15651640
does not accept cookies NULL            NULL            2496307
accepts cookies         11              NULL            1810474
accepts cookies         NULL            11              162134

This does not make sense from reading the VCL code:

https://github.com/wikimedia/operations-puppet/blob/4d3a9dee8a3a636c650aa77c3680c77fefae8a12/modules/varnish/templates/analytics.inc.vcl.erb#L84

We found this by hacking up the uniques logic from here:

WITH last_access_dates AS (
    SELECT
        year,
        month,
        day,
        normalized_host.project_class AS project_family,
        geocoded_data['country_code'] AS country_code,
        -- Sometimes (~1 out of 1B times) WMF-Last-Access-Global is corrupted.
        -- and Spark can not parse it. Check for the length of the string.
        IF(length(x_analytics_map['WMF-Last-Access-Global']) = 11,
           unix_timestamp(x_analytics_map['WMF-Last-Access-Global'], 'dd-MMM-yyyy'),
           NULL) AS last_access_global,
        x_analytics_map['nocookies'] AS nocookies,
        actor_signature_per_project_family AS actor_signature
    FROM wmf.pageview_actor
    WHERE x_analytics_map IS NOT NULL
      AND agent_type = 'user'
      AND (is_pageview OR is_redirect_to_pageview)
      AND year = 2024
      AND month = 8
      AND day = 14
      AND lower(uri_host) like '%wikipedia.org'
      AND geocoded_data['country_code'] = 'US'
)

    SELECT
        project_family,
        country_code,
        SUM(if(project_family IS NOT NULL AND last_access_global IS NULL AND nocookies is NULL, 1, 0)) first_visits,
        SUM(if(last_access_global IS NOT NULL, 1, 0)) has_last_access,
        SUM(if(last_access_global IS NOT NULL and (last_access_global < unix_timestamp('2024-08-14', 'yyyy-MM-dd')), 1, 0)) last_access_before_today
    FROM
        last_access_dates
    GROUP BY
        project_family,
        country_code
;

And another thing we looked at was the automata labeling logic: we were curious how many unique actor signatures were "in the middle" between the things clearly labeled as user and those clearly labeled as automata:

    SELECT
        SUM(if( pageview_count < 10, 1, 0 )) user_less_than_10_req,
        SUM(if( pageview_count > 800, 1, 0)) automated_high_request,
        SUM(if( pageview_rate_per_min >= 3, 1, 0)) automated_high_request_rate,
        SUM(if( (nocookies > 10 AND pageview_count > 100
              AND (avg_distinct_pages_visited_count * rolled_up_hours / CAST(pageview_count AS DOUBLE)) < 0.2)
          , 1, 0)) automated_too_many_without_cookies_few_titles,
        SUM(if( user_agent_length > 400 or user_agent_length < 25 , 1, 0)) automated_suspicious,
        SUM(if( pageview_rate_per_min = 0 , 1, 0)) user_low_ratio,
        SUM(if(
                (pageview_count between 10 and 800)
                and (pageview_rate_per_min < 3)
                and (user_agent_length <= 400)
                and (not (nocookies > 10 AND pageview_count > 100
              AND (avg_distinct_pages_visited_count * rolled_up_hours / CAST(pageview_count AS DOUBLE)) < 0.2))
    , 1, 0)) in_the_middle
    FROM wmf.webrequest_actor_metrics_rollup_hourly
    WHERE
        year=2024 and month=8 and day=14


user_less_than_10_req:                         3958617963
automated_high_request:                        238399
automated_high_request_rate:                   306448165
automated_too_many_without_cookies_few_titles: 175972
automated_suspicious:                          3643074
user_low_ratio:                                1034787245
in_the_middle:                                 163371272
Mayakp.wiki moved this task from Incoming to Doing on the Movement-Insights board.

Update on this investigation:

By investigating the pageviews and unique devices pipelines, we have observed the following, which helps explain the rise in unique devices:

  • We identified significant spikes in unique device counts on certain days in July and August 2024, unlike any other month in the past year. These increases were observed exclusively in the "unique devices by project family" table and not in the "unique devices by domain" table, and they predominantly involved 'fresh sessions' (i.e., sessions where cookies were enabled but no cookie was found). Monthly unique device metrics use the unique devices by project family table. Notably, these spikes were also absent from the pageview_hourly data.
  • Upon reviewing the logic behind both tables, we found that the "unique devices by project family" table includes web requests flagged as either is_pageview or is_redirect_to_pageview (redirects counted as pseudo-pageviews for tracking purposes), whereas the "unique devices by domain" table only accounts for is_pageview requests.
  • Further analysis of the fresh sessions unique to the project family table, which consisted solely of redirects, revealed approximately 200 million unique devices linked to a small number of users. These users had unidentified device types and exhibited similar user_agent strings (for instance, certain actor_signatures were associated with up to 500k unique devices in a single day). These requests consistently targeted the same Wikipedia pages and were resolved with a 301 status code.
  • These actors were not flagged as automated traffic because our detection heuristics are applied exclusively to pageviews, not redirects. This gap explains the disproportionate increase in unique devices during July and August despite no corresponding rise in actual pageviews. It appears we should change the labelling logic so that automated actors are caught in our unique devices counts.
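The filtering difference between the two tables can be sketched as follows (field names follow wmf.pageview_actor; the rows are made-up examples):

```python
# Minimal sketch of the difference between the two unique-devices tables
# (field names follow wmf.pageview_actor; rows are made up).
requests = [
    {"actor": "a", "is_pageview": True,  "is_redirect_to_pageview": False},
    {"actor": "b", "is_pageview": False, "is_redirect_to_pageview": True},   # 301 redirect
    {"actor": "c", "is_pageview": False, "is_redirect_to_pageview": True},   # 301 redirect
]

# "unique devices by domain": is_pageview only.
by_domain = {r["actor"] for r in requests if r["is_pageview"]}

# "unique devices by project family": is_pageview OR is_redirect_to_pageview,
# so redirect-only actors (b, c) count here but not in the domain table --
# and the automata heuristics, applied to pageviews only, never see them.
by_family = {r["actor"] for r in requests
             if r["is_pageview"] or r["is_redirect_to_pageview"]}
```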

Next steps:

  1. Investigate further to determine the extent of the effect of these actors on our unique devices and to uncover any additional contributors to the increase in unique devices.
  2. Consult with Traffic to understand the cookie behavior and other anomalies related to the redirects and web requests.
  3. Consult with DPE about changing the automated-traffic criteria to include redirect requests cc @Milimetric

Is this the same issue as T364872: Unique devices per country spikes on wikifunctions, which was also largely in Singapore and Hong Kong?

Hi @Pcoombe, it is related. T364872 was specifically for the wikifunctions project family, which is relatively new, so the impact was small. However, we are now seeing an impact on our core metrics, and it seems to have affected the wikipedia project family, which is what we're investigating in this task.
I believe the root cause could be similar; we have yet to confirm that.

Change #1076563 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] [analytics][webrequest] Extend refined webrequest retention to 180 days

https://gerrit.wikimedia.org/r/1076563

I've prepared a patch and am waiting for the go-ahead from the privacy/legal teams: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1076563. Note that we delete one day of data every day, so even once the patch is merged, it will take time to accumulate additional days of data.

Affected datasets:

  • refined webrequest
  • pageview_actor
  • webrequest_actor/metrics/hourly
  • webrequest_actor/metrics/rollup/hourly
  • webrequest_actor/label/hourly

These datasets are not purged:

  • pageview_hourly
  • unique devices tables

Raw webrequest data will still be purged after 90 days.

@JAllemandou pointed out a critical issue: we don’t have enough free space on HDFS to accommodate a doubling of webrequest data.

Currently, webrequest data takes up 515TB (replication included) for a 90-day retention period https://superset.wikimedia.org/superset/dashboard/409/. If we extend the retention to 180 days, its size will double.

Our HDFS cluster has a total capacity of 5.6PB, with only 15% of space remaining, approximately 0.8PB https://grafana.wikimedia.org/d/000000585/hadoop.

If we consider that an HDFS cluster works best if its remaining capacity is at least 20%, then it would be risky to double the retention of webrequest.
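The capacity concern can be checked with back-of-the-envelope arithmetic (all figures taken from this comment):

```python
# Back-of-the-envelope check of the HDFS capacity risk (figures from above).
webrequest_90d_tb = 515                       # replicated size at 90-day retention
extra_tb = webrequest_90d_tb                  # doubling retention adds ~the same again

hdfs_total_tb = 5.6 * 1000                    # 5.6 PB cluster
hdfs_free_tb = 0.15 * hdfs_total_tb           # ~15% remaining, ~840 TB

free_after_tb = hdfs_free_tb - extra_tb       # ~325 TB left after the doubling
free_after_ratio = free_after_tb / hdfs_total_tb  # ~6%, well under the 20% comfort margin
```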

Change #1076563 merged by Btullis:

[operations/puppet@production] [analytics][webrequest] Extend retention for unique devices analysis

https://gerrit.wikimedia.org/r/1076563

With regard to the HDFS capacity/utilization issue, we have a number of options that could be explored.

It's worth mentioning that there is a mitigation plan in place, as mentioned by @JAllemandou on the patch.

The mitigation plan is to reclaim some space where possible, monitor growing data, and act quickly on the unique devices issue so that, if possible, the data does not grow too much.

This may be enough on its own to deal with the HDFS space issue, especially if we can finish the investigation into the unique devices metric before the 180-day maximum retention period is reached.
I have not looked closely at the existing data to ascertain whether there are some easy wins that might give enough space for us to feel comfortable, but I know that some work of this nature is already under way, such as T376118: Update druid config to automatically drop unused segments.

Although we can't order new Hadoop worker nodes within the necessary time-frame, I have two alternative suggestions that we could also consider as part of the mitigation plan.
I'll set them out in overview here, then we can look at the details of how each might work.

1) Use the storage of the presto workers and migrate Presto/Trino to Kubernetes

We currently have a cluster of 15 Presto worker nodes, named an-presto10[01-15]. Each of these hosts has 12 hard drives installed, each 4 TB in size. However, these drives are currently unused by Presto.
Five of the nodes (an-presto100[1-5]) are due for refresh, so we have a new set of five nodes (an-presto10[16-20]) ready to replace them.

If we look at the total capacity of unused drives, this is 20 hosts, each with 12 drives of 4 TB: 20*12*4 = 960 TB of raw, unused hard drive capacity.
These servers are the same specification as Hadoop workers, so we could easily rename them and add them to the Hadoop cluster.

This 960 TB of raw capacity will give us 320 TB of additional capacity for HDFS, based on the replication factor of 3 that we apply to files on HDFS.
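The arithmetic for option 1, as stated above, in a few lines:

```python
# Raw capacity of the unused Presto drives, and what it yields on HDFS.
hosts = 20                 # an-presto10[01-15] plus the 5 refresh nodes
drives_per_host = 12
drive_tb = 4

raw_tb = hosts * drives_per_host * drive_tb   # 960 TB of raw, unused capacity
usable_hdfs_tb = raw_tb // 3                  # 320 TB at HDFS replication factor 3
```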

In order to free up the Presto cluster for this migration, we would look to migrate the service from the current hardware to the dse-k8s Kubernetes cluster.
Whilst doing so, we would very likely replace Presto with Trino - which is a migration long-discussed in T266640: Decide whether to migrate from Presto to Trino.

Given that Trino/Presto is effectively stateless, it is a good fit for Kubernetes. We would look to deploy one Trino worker process to each of the dse-k8s-worker[1-9] nodes, plus the coordinator services.
It would take some considerable effort from the Data-Platform-SRE team to effect this change, plus changes would need to be made to Superset and wmfdata-python to make use of the new cluster.

2) Move some of the data on HDFS to the Ceph storage cluster - using the S3 or Swift compatible gateway

We have a new storage platform available, which is the Ceph cluster that is operated by the Data-Platform-SRE team in support of Data-Platform workloads.

This has a tier of storage that uses HDDs. This cluster comprises 5 hosts, each with 12 hard drives of 18 TB in size. That is a total of ~ 1 PB of raw storage, of which 28 TB is in use.
With a replication factor of 3, we would currently have around 311 TB of space available on this HDD based tier.
{F57578168,width=50%}
Currently this tier is the default storage tier for the S3 and Swift based service that runs on https://rgw.eqiad.dpe.anycast.wmnet

We could move some data from HDFS to this service using either of these protocols. This could be a short-term move, with the data moved back to HDFS afterwards, or the data could be permanently migrated along with any workflows that produce or consume it.

These two ideas are not mutually exclusive, so we could potentially implement both, or elements of each individually.

If we did wish to explore option 1, it would certainly take some resources from the Data-Platform-SRE team to work on a migration of Presto to Trino on Kubernetes.
This would, therefore, be likely to have a knock-on effect on the delivery timelines of some of our existing goals, such as T362788: Migrate Airflow to the dse-k8s cluster and T362105: EPIC: OpenSearch on K8s (formerly Mutualized opensearch cluster) - FY25/26 WE4.2.6.

We could, perhaps, reduce risk and/or time taken by sticking with Presto instead of migrating to Trino. Once on Kubernetes, we could likely perform any migration from Presto to Trino afterwards.

For option 2 and the movement of data to Ceph/S3, I would expect this work to be carried out more by the Data-Engineering team than by Data-Platform-SRE.
Once the access credentials are in place and some initial testing has taken place, I feel that the choice of which data files are moved would be more a decision for that team.

I think this task can be closed, the investigation is done. We're now in the "solving problem" state :)

@JAllemandou I'd like to add some new findings from looking at Singapore's unique device increase which extends from a discussion with traffic.

For September 2024 we see about an 1800% YoY increase in Unique Devices from Singapore, which puts total unique devices counted in Singapore at 125,911,935 for a population of 5 million. That is about 36% of total new unique devices year-over-year (an even larger contribution to total uniques than the USA's). The majority of these uniques (at peak, 99%; see chart below), similar to the last two months, are coming from cloud-based ISPs such as Huawei Cloud (specifically AS136907). Many of these are actor signatures with thousands of uniques attributed to them.

image.png (561×1 px, 74 KB)

Consulting with @Vgutierrez, we believe a large chunk of this traffic is likely automated because:
a) Lack of referrer header.
b) UA looks like it's being faked.
c) The provider is Cloud infrastructure.
d) The volume of unique devices is basically impossible to generate legitimately from Singapore.

This increase is occurring on pageviews, so it appears in both the project domain and project family tables, which means that if this traffic is automated, it is not being captured by our automata filter. It is also likely a separate issue from the redirect problem identified above, and given the increases we have seen in Singapore every month since Jan 2024, this number may continue to explode higher.

image.png (115×2 px, 35 KB)

September will be another month with about an 18% YoY increase in uniques, and a significant portion this month seems to be coming from Singapore. I'm not sure if there's an easy fix, given that this traffic is bypassing our automata filter and we're uncertain of its exact nature. However, the significant impact this suspicious traffic is having on the increase in unique devices adds another piece to the puzzle.

Thank you for the great analysis and explanation @Hghani.

As the problem is on pageviews and the actors are not flagged by our current heuristics, this falls into the broader automated-traffic-detection space, which has not been prioritized as of today :(

Based on further analysis, we've identified that certain actor signatures identified in the findings below are generating highly suspicious traffic. Specifically, they are sending requests every few seconds and to a different domain with each request. Upon reviewing our automated filtering logic, it appears that we apply automata labelling at the actor_signature level. This means our current heuristics may fail to label traffic accurately when users frequently switch domains, as each new domain visit generates a different actor signature, resetting metrics like pages per minute or total pageviews.

This oversight is likely a major factor behind the nearly 2000% increase in unique devices from Singapore, as these actor signatures are bypassing our automata filters. In addition to the redirect problem discussed earlier in this ticket, we believe this pattern accounts for a significant portion of the overall increase in unique devices we've observed.

To address this, we propose applying automata labelling at the actor_signature_project_family level. This would continue to capture automata behavior at the domain level while also including actors who evade detection by switching domains. After discussing this with @JAllemandou , he suggested this approach sounds reasonable, but it would require a thorough impact analysis before implementation.
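A sketch of why per-domain signatures miss domain-hopping actors, and how a project-family-level signature would catch them (the threshold and data are made up for illustration; this is not the production labelling job):

```python
from collections import Counter

# One actor (same UA+IP) hitting a different Wikipedia domain on every request.
requests = [("ua1|ip1", f"{lang}.wikipedia.org")
            for lang in ["en", "de", "fr", "es", "it", "ja",
                         "zh", "ru", "pt", "ar", "nl", "pl"]]

PAGEVIEW_THRESHOLD = 10  # hypothetical "too many pageviews" cutoff

# Current behaviour: the signature is per (actor, domain), so every domain
# switch resets the counters and no bucket ever crosses the threshold.
per_domain = Counter((sig, domain) for sig, domain in requests)
flagged_per_domain = [k for k, n in per_domain.items() if n > PAGEVIEW_THRESHOLD]

# Proposed: aggregate the signature per (actor, project family), so the
# 12 requests pool into one bucket and the actor is flagged.
per_family = Counter((sig, "wikipedia") for sig, domain in requests)
flagged_per_family = [k for k, n in per_family.items() if n > PAGEVIEW_THRESHOLD]
```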

OSefu-WMF subscribed.

Closing this task as investigation is largely complete.