
Analyze possible bot traffic for enwiki article Index (statistics), Index & XXX:_Return_of_Xander_Cage
Closed, Resolved, Public

Description

The enwiki articles Index (statistics), Index, XXX:_Return_of_Xander_Cage and others are all sitting unusually high in the top pageviews for enwiki in January 2023.

This seems to follow a similar, although less extreme, traffic pattern to that of Cookie (informatique), covered in task T313114:

  • A large number of IP addresses using an outdated user agent

Event Timeline

Pulled pageview counts and flagged automated views as percent of total views: Data

@SNowick_WMF

Where necessary, could you use the raw web request logs to pick out some insights into the requests themselves, including details relating to:

  • Geography
  • IP addresses (both the most active ones and, especially, the distribution of requests per day)
  • User agents
  • Breakdowns between mobile and desktop for each article over time

For those IP addresses that are generating the most traffic, I'd like to know whether they are also generating traffic on other pages.
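As a rough illustration of the per-IP, per-day breakdown requested above, the counting could be sketched as follows. This is only a sketch: the row layout `(ip, day, page_title)` is an assumption, and the addresses and days are made-up placeholders, not data from the request logs.

```python
from collections import Counter

# Hypothetical, hand-made rows: (ip, day, page_title). Real rows would come
# from the web request logs; these values are placeholders, not real data.
requests = [
    ("198.51.100.7", "day-1", "Index_(statistics)"),
    ("198.51.100.7", "day-1", "Index_(statistics)"),
    ("198.51.100.7", "day-2", "Index_(statistics)"),
    ("203.0.113.9", "day-1", "Index"),
    ("203.0.113.9", "day-2", "Index_(statistics)"),
]


def requests_per_ip_per_day(rows):
    """Count requests for each (ip, day) pair."""
    return Counter((ip, day) for ip, day, _page in rows)


def most_active_ips(rows, n):
    """Return the n IPs with the most requests overall."""
    totals = Counter(ip for ip, _day, _page in rows)
    return [ip for ip, _count in totals.most_common(n)]


per_day = requests_per_ip_per_day(requests)
top = most_active_ips(requests, 1)
```

Looking at `per_day` for the IPs in `top` shows whether an address requests the page at a steady daily rate or in short bursts.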

Findings thus far for Index (statistics):

Geography:
The majority of anomalous hits originate from the US; the UK is a distant second.

IP Addresses:
Multiple IPs with 2000+ hits in a 20-day date range.

Useragents:
As noted, the main useragent for suspicious behavior is Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36

Breakdown of Access Method:
Desktop: 81.3%, Mobile Web: 18.5%, Mobile App: 0.2%

IP Traffic on Wikipedia:
There are multiple pages that all 5 top IPs have visited; these pages have "index" in the page name.
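One possible way to find the pages shared by all of the top IPs is a set intersection over each IP's visited pages. The sketch below assumes a hypothetical mapping from IP to page titles; the addresses and page lists are placeholders rather than the actual findings.

```python
def pages_common_to_all(pages_by_ip):
    """Return the set of pages that every IP in the mapping requested."""
    page_sets = [set(pages) for pages in pages_by_ip.values()]
    if not page_sets:
        return set()
    return set.intersection(*page_sets)


# Placeholder data: three made-up IPs and the pages each one requested.
pages_by_ip = {
    "198.51.100.7": ["Index_(statistics)", "Index", "Main_Page"],
    "203.0.113.9": ["Index", "Index_(statistics)"],
    "192.0.2.44": ["Index_(statistics)", "Index", "Cookie"],
}

common = pages_common_to_all(pages_by_ip)
```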

Index (statistics): Turnilo Link
See repeat IP requests: https://w.wiki/6K6g
Useragent - most from distinct outdated useragent: https://w.wiki/6K6f
Geo is US and UK mostly: https://w.wiki/6K6w
Non public cloud traffic: https://w.wiki/6K6x

Next Steps/Possible further investigation:
Referral Data (inbound, outbound):
pageview_actor data has mostly null values for referrer, so this will need further investigation.
If we want to use Google Console, it appears we will need to verify that page by adding a meta tag or other validation, since it's not in the pre-populated Links report for top viewed pages.

Actor labelling:
Manually classify the pageviews_actor data: predictions.actor_label_hourly can be joined with pageview_actor on actor_signature to classify traffic.
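The join described above can be illustrated in plain Python. This is a minimal sketch, not the actual query: only `actor_signature` comes from the note above, while the `label` field name, the row contents, and the `"unlabelled"` fallback are assumptions made for illustration. A real join against an hourly table would presumably also need to match on the hour.

```python
def label_pageviews(pageviews, labels):
    """Left-join pageview rows to actor labels on actor_signature."""
    # "label" is an assumed field name, not taken from the real table schema.
    label_by_signature = {row["actor_signature"]: row["label"] for row in labels}
    return [
        {**pv, "label": label_by_signature.get(pv["actor_signature"], "unlabelled")}
        for pv in pageviews
    ]


# Placeholder rows standing in for pageview_actor and the label table.
pageviews = [
    {"actor_signature": "sig-1", "page_title": "Index"},
    {"actor_signature": "sig-2", "page_title": "Index_(statistics)"},
    {"actor_signature": "sig-3", "page_title": "Index"},
]
labels = [
    {"actor_signature": "sig-1", "label": "automated"},
    {"actor_signature": "sig-2", "label": "user"},
]

labelled = label_pageviews(pageviews, labels)
```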

Findings thus far for XXX:_Return_of_Xander_Cage:

Most notably, the vast majority (99%) of views are from Mobile Web, with the most-used IPs belonging to Google proxies.

Geography:
Primarily originating from India, Bangladesh, and Pakistan, using Google proxy IPs.

IP Addresses:
The top 100 IPs are mostly Google proxy IPs.

Useragents:
A variety of useragents; the top result is Linux with an old Safari version.

Breakdown of Access Method:
Mobile Web: 99%, Desktop: 0.9%, Mobile App: 0.5%

IP Traffic Onsite:
Querying by top IP addresses yields millions of results per IP, as these are static Google proxy IPs.
This analysis may not show anything useful, since an IP cannot be linked to individual users.

Referer Data:
The top referer is Google, followed by multiple referrals from the site 'http s.newsearchers dot com'.

Turnilo Charts

Index: Turnilo Link
IP requests: https://w.wiki/6KXW
Useragent - varied: https://w.wiki/6KXV
Geodistribution - Primarily India: https://w.wiki/6KXT
Public cloud traffic (mostly false): https://w.wiki/6KXb

Findings thus far for Index:

The main anomalous activity is a spike in views that begins on 2022-01-14 and ends on 2022-01-19. The attached IP analysis focuses on IPs active during that date range.

Geography:
Turnilo data reports US traffic, but the most active IPs during the spike are from multiple other regions (possibly VPNs).

IP Addresses:
Top 100 active IPs (with some IPLOOKUP info)

Useragents:
The majority of useragents (96.7%) in the spike date range are the somewhat outdated Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36, followed (3%) by Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36
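Percentage shares like the ones above can be derived from a list of per-request values. The following is a minimal sketch under that assumption; the shortened strings `"ua-a"` and `"ua-b"` are placeholders, not real useragents, and the counts are invented.

```python
from collections import Counter


def share_by_value(values):
    """Return each distinct value's share of the total, as a percentage."""
    total = len(values)
    if total == 0:
        return {}
    return {
        value: round(100 * count / total, 1)
        for value, count in Counter(values).items()
    }


# Placeholder useragent strings, one entry per request.
useragents = ["ua-a", "ua-a", "ua-a", "ua-b"]
shares = share_by_value(useragents)
```

The same helper would apply to the access-method breakdown if the input list held one access method per request.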

Breakdown of Access Method:
Desktop: 85.6%, Mobile Web: 14.3%, Mobile App: 0.1%

IP Traffic on Wikipedia:
There is not much overlap in the pages visited by the top active IPs, except for Index (statistics).

Referer Data:
Nothing unexpected here. Note that the largest share of values is null, so this analysis is limited.

Turnilo links:

Index: Turnilo Link
Hourly spike: https://w.wiki/6KYV
IP requests: https://w.wiki/6KXK
Useragent: https://w.wiki/6KXG
Geo mostly US - this may be incorrect: https://w.wiki/6KXF
Public cloud traffic: https://w.wiki/6KXM