
Unique devices per country spikes on wikifunctions
Open, Low | Public | BUG REPORT

Description

Data Platform Engineering Bug Report or Data Problem Form.

What kind of problem are you reporting?

  • Access related problem
  • Service related problem
  • Data related problem
For a data related problem:
  • Is this a data quality issue? Yes
  • What datasets and/or dashboards are affected?

unique_devices_per_domain_daily
unique_devices_per_domain_monthly
unique_devices_per_project_family_daily
unique_devices_per_project_family_monthly

Both druid and hive datasets appear impacted.

  • What are the observed vs expected results? Please include information such as location of data, any initial assessments, sql statements, screenshots.

There are significant and unexpected spikes in monthly and daily unique devices to the wikifunctions project family and domains starting in February 2024.

In unique_devices_per_project_family_monthly and unique_devices_per_project_family_daily, there was about a 20x increase in unique devices to wikifunctions across all countries in February 2024. These levels have been sustained through April 2024, with another spike observed for Hong Kong, where unique devices went from 20K in March 2024 to 513K in April 2024.

Screenshot 2024-05-14 at 10.58.54 AM.png (784×2 px, 159 KB)

Screenshot 2024-05-14 at 11.03.26 AM.png (762×2 px, 207 KB)

The trend differs in unique_devices_per_domain_monthly, where a smaller spike appears only in Singapore and the United States. This increase is not as high as the one observed in the per project family datasets, but it is still unexpected given project trends.

Screenshot 2024-05-14 at 11.00.10 AM.png (798×2 px, 188 KB)

Here's the superset dashboard I created while exploring this issue.

This issue may be related to the one reported in T344381.

Event Timeline

Removing the Data Products tag until the preliminary data quality review is conducted. If any implementation requirements are defined, please tag us again. TY!

@MNeisler: Can you describe the impact of this issue? I would like to know:

  • proportion of wikifunctions unique devices to overall unique devices (now as well as prior to this bug)
  • how is this impacting the work you're doing for Abstract wp team ?
    • how often are the superset dashboards monitored? by whom?
    • what kind of decisions are the Abstract team making from knowing the unique devices that access wikifunctions.
    • any community involvement?
  • anything else that you would like to add

thank you in advance! this is to help prioritize investigating if the root cause is same as T344381.

@Mayakp.wiki

Thanks for following up. See responses inline below and let me know if any additional information would be helpful:

proportion of wikifunctions unique devices to overall unique devices (now as well as prior to this bug)

Unique devices to wikifunctions represented about 0.003% of overall unique devices prior to the bug. With these spikes, they now represent about 0.12%.

This is based on druid.unique_devices_per_project_family_monthly for Jan 2024 (prior to bug) and April 2024 (post bug). Total unique devices are shown below:

project family   Jan 2024 (prior to bug)   April 2024 (post bug)
wikifunctions    50.2K                     2.26M
All              1.86B                     1.86B

Note: I haven't fully investigated yet, but the increase in the proportion of unique devices to wikifunctions appears to be higher for some specific countries. For example, we observed a significant increase in unique devices from Hong Kong to wikifunctions.

Unique devices from Hong Kong to wikifunctions represented about 0.007% of overall unique devices prior to the bug. With these spikes, they now represent about 1.8%.
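
As a quick sanity check, the overall proportions quoted above can be recomputed from the rounded totals in the table. This is only a sketch of the arithmetic: the variable and function names are illustrative, and the Hong Kong figures are left out because the thread does not give their underlying totals.

```python
# Rounded totals from druid.unique_devices_per_project_family_monthly,
# as quoted in the table above.
WIKIFUNCTIONS_JAN = 50.2e3   # Jan 2024, prior to bug
WIKIFUNCTIONS_APR = 2.26e6   # April 2024, post bug
ALL_PROJECTS = 1.86e9        # same rounded total in both months


def share_pct(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


jan_share = share_pct(WIKIFUNCTIONS_JAN, ALL_PROJECTS)
apr_share = share_pct(WIKIFUNCTIONS_APR, ALL_PROJECTS)
growth = WIKIFUNCTIONS_APR / WIKIFUNCTIONS_JAN

print(f"Jan 2024 share: {jan_share:.4f}%")
print(f"Apr 2024 share: {apr_share:.4f}%")
print(f"Apr vs Jan unique devices: {growth:.0f}x")
```

With the rounded inputs this gives roughly 0.003% and 0.12%, consistent with the figures above, and a Jan-to-April increase of about 45x for wikifunctions.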

how is this impacting the work you're doing for Abstract wp team ?

This data is actively referenced by the Abstract Wikipedia team as one of the key top level metrics to monitor the overall growth of the wikifunctions community since the project's initial launch in July 2023. The team is especially interested in the diversity of its community.

Right now, I am unable to use or reference any of the unique device data logged since February 2024 to provide insights into the diversity of wikifunctions readers. This most recently impacted a research report I worked on to understand the current wikifunctions community T355810 (The geographical distribution section of the report was limited to unique devices data prior to spikes in February 2024).

how often are the superset dashboards monitored? by whom?

The Abstract Wikipedia Team actively monitors the Wikifunctions top-level tracker superset dashboard, which includes several charts monitoring unique devices to wikifunctions (See diversity tab).

I present a metrics update to the team every month using this dashboard. It is also used by several team members to monitor metrics on the project and referenced in status updates to the team.

what kind of decisions are the Abstract team making from knowing the unique devices that access wikifunctions.

This is not blocking any specific product decisions or open tasks at the moment, but unique devices is one of the team's key metrics for understanding the wikifunctions community and informing future design changes. One key research question identified by the team is whether there are gaps between the people who contribute content and those who consume it.

any community involvement?

Not at the moment

Thanks @MNeisler for the response. While wikifunctions contributes only a small share of overall unique devices, this does seem like a big problem for the team, as they are unable to look at a key metric.
As a first step to investigating if this is connected to T344381, I tried looking at the Central Notice calendar to see if any banners have been running that may have been causing incorrect redirect requests. I don't see any banner campaigns running in HK so this doesn't give us much.
@TAndic, do you know if there are banners or campaigns running in Hong Kong or Singapore? I checked the calendar and the archive, but didn't see anything. Martin asked me to reach out to you for help with the central notice and campaigns.

As a next step, we will need to query webrequest table to look for 301/302 (redirect) requests.

Hi @Mayakp.wiki -- the only information I have is the Central Notice calendars, I checked the admin deployments and archive as well, and the last campaign I see specifically targeting SG or HK is from March, which ran from the 3rd to the 9th. Pinging @Pcoombe and @JBrungs_WMF in case they know of any relevant information sources that I might not :)

By far the most shown CentralNotice campaign so far this year (by a factor of about 20) has been Wiki Loves Folklore which ran 1 February to 31 March across all countries. So that lines up with the initial all country increase in unique devices, but not the fact that it has been sustained.

I also don't see any CentralNotice campaigns in Hong Kong or Singapore which could explain the spikes there

select day, http_status, count(*) count_by_status
  from pageview_actor
 where year=2024 and month=4 and day in (19,26)
   and geocoded_data['country_code'] = 'HK'
   and normalized_host.project_class = 'wikifunctions'
 group by day, http_status

day   http_status   count_by_status
19    200           389
19    301           4311
19    302           931
26    200           198
26    301           133801
26    302           1028

select user_agent_map['os_family'], user_agent_map['browser_family'], agent_type, count(*)
  from pageview_actor
 where year=2024 and month=4 and day = 26
   and geocoded_data['country_code'] = 'HK'
   and normalized_host.project_class = 'wikifunctions'
   and http_status = '301'
 group by user_agent_map['os_family'], user_agent_map['browser_family'], agent_type

_col0     _col1    agent_type   _col3
Windows   Chrome   user         133801

select referer_class, count(*)
  from pageview_actor
 where year=2024 and month=4 and day = 26
   and geocoded_data['country_code'] = 'HK'
   and normalized_host.project_class = 'wikifunctions'
   and http_status = '301'
 group by referer_class

referer_class   _col1
internal        1
none            133800

(I looked at page_title, and it was "NULL" for all of the 301 requests above. I have some hesitation about whether the Pageview definition parses page_title correctly here, since Wikifunctions uses a different URI pattern, although other requests to wikifunctions seem to have their page title parsed correctly.)

select access_method, count(*)
  from pageview_actor
 where year=2024 and month=4 and day in (26)
   and geocoded_data['country_code'] = 'HK'
   and normalized_host.project_class = 'wikifunctions'
 group by access_method

access_method   _col1
desktop         134872
mobile web      155

Dan and I looked at this a bit more in our data operating theater and have a strong suspicion that this is indeed caused by redirect requests. We compared April 19, which was in line with previous trends, with April 26, which saw spikes, and you can see the difference in 301 requests between the two days.
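
To make the April 19 versus April 26 comparison concrete, here is a small sketch that recomputes the share of 301 responses per day from the counts returned by the first query above. The dictionary layout and names are illustrative only; the numbers are copied from that result set.

```python
# (day, http_status) -> request count, copied from the query result above
# (pageview_actor, April 2024, country_code = 'HK', project_class = 'wikifunctions').
counts = {
    (19, "200"): 389,
    (19, "301"): 4311,
    (19, "302"): 931,
    (26, "200"): 198,
    (26, "301"): 133801,
    (26, "302"): 1028,
}


def day_total(day: int) -> int:
    """Sum the request counts for one day across all HTTP statuses."""
    return sum(n for (d, _status), n in counts.items() if d == day)


def share_301(day: int) -> float:
    """Fraction of that day's requests that returned HTTP 301."""
    return counts[(day, "301")] / day_total(day)


increase_301 = counts[(26, "301")] / counts[(19, "301")]

for day in (19, 26):
    print(f"April {day}: {day_total(day)} requests, {share_301(day):.1%} are 301s")
print(f"301 requests grew about {increase_301:.0f}x between the two days")
```

On these counts, 301s go from about 77% of requests on April 19 to about 99% on April 26, while their absolute number grows roughly 31x. The April 26 total (135027) also matches the sum of the access_method breakdown above.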

I looked at the uri_query of the 301 requests on 4/26 and found that a lot of them are ?from=Essay.svg%7Cenwiki%7C36141410&limit=20&target=Essay.svg&title=Special%3AGlobalUsage , and from a variety of projects like elwiki, cawiki, dewiki etc. When I searched for Special:GlobalUsage I got the Global file usage search page.
I think that once a user (logged in or out) clicks out of the search result, or changes the limit of the results on the page, it probably triggers a redirect and gets counted as a unique device.
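
For readability, the percent-encoded uri_query quoted above can be decoded with Python's standard library. This is just a sketch for inspecting that one example string; it says nothing about how the pageview pipeline itself parses it.

```python
from urllib.parse import parse_qs

# Example uri_query from the 301 requests on 4/26, without the leading "?".
uri_query = (
    "from=Essay.svg%7Cenwiki%7C36141410"
    "&limit=20"
    "&target=Essay.svg"
    "&title=Special%3AGlobalUsage"
)

# parse_qs percent-decodes values and returns a dict of lists.
params = parse_qs(uri_query)

for key, values in params.items():
    print(f"{key} = {values[0]}")
```

Decoded, the title parameter is Special:GlobalUsage and the from parameter is Essay.svg|enwiki|36141410, which is where the wiki names such as enwiki, elwiki, or dewiki show up in these requests.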

Who can we reach out to for confirming this and fixing the behavior of a Special page? Is there a team that manages the Special pages?

OSefu-WMF lowered the priority of this task from High to Low.

@Mayakp.wiki: Special:GlobalUsage comes from Extension:GlobalUsage (GlobalUsage), which is a volunteer-authored extension.

According to https://www.mediawiki.org/wiki/Developers/Maintainers#MediaWiki_extensions_deployed_at_Wikimedia_Foundation and https://www.mediawiki.org/wiki/Wikimedia_Product/Component_responsibility it's owned by Structured Content team (formerly Structured Data) but the SLA is "Non-security patches not reviewed."

I think this means anyone can just fix the behavior and deploy the change.

It also means there's nobody to ask to fix the behavior. I believe this requires engineering help from DPE.

select normalized_host.project_class, ip, count(1) as view_count
from pageview_actor 
where year = 2024 and month = 6 and day = 28
  and http_status = '301'
  and agent_type = 'user'
  and uri_path = '/w/index.php'
  and regexp_like(uri_query, 'title=Special%3AGlobalUsage')
  and (is_redirect_to_pageview or is_pageview)
group by 1, 2
order by view_count desc
limit 1000

There are…a lot of "pageviews" coming from just 2 IP addresses that day.

Special:GlobalUsage on Wikifunctions is particularly utilized:

select normalized_host.project_class, count(1) as view_count
from pageview_actor 
where year = 2024 and month = 6 and day = 28
  and http_status = '301'
  and agent_type = 'user'
  and uri_path = '/w/index.php'
  and regexp_like(uri_query, 'title=Special%3AGlobalUsage')
  and (is_redirect_to_pageview or is_pageview)
group by 1
order by view_count desc
limit 1000

project class   pageviews
wikifunctions   41889
wikimedia       814
wikipedia       <250

These shouldn't actually be counted as pageviews. Technically there are only 4 Special pages that we count pageviews for, but maybe because this ends up redirecting to /w/index.php it bypasses our filter and gets counted.

I think T240676: Develop a consistent rule for which special pages count as pageviews deserves to be revisited.

Jdforrester-WMF subscribed.

There are…a lot of "pageviews" coming from just 2 IP addresses that day.

Special:GlobalUsage on Wikifunctions is particularly utilized:

[…]

These shouldn't actually be counted as pageviews.

Agreed. Possibly on Commons it should count, however?