
Unexpected increase in traffic for 4 languages in same region, on smaller projects
Open, Normal, Public

Description

FWIW, I just noticed a considerable increase in page views for up to four languages in recent months across our smaller projects (that is, all projects except Wikipedia).

Those languages are all in the same region, which makes it kind of odd or even suspect.

The overall increase in page views across all projects and languages is negligible.
But it might be wise to check again next month.

wikinews:
https://stats.wikimedia.org/wikinews/EN/TablesPageViewsMonthlyOriginal.htm
Russian, Serbian, Ukrainian, Bulgarian.

wikivoyage
https://stats.wikimedia.org/wikivoyage/EN/TablesPageViewsMonthlyOriginal.htm
Russian, Ukrainian (Serbian and Bulgarian editions don't exist)

wikiversity
https://stats.wikimedia.org/wikiversity/EN/TablesPageViewsMonthlyOriginal.htm
Russian

wikisource
https://stats.wikimedia.org/wikisource/EN/TablesPageViewsMonthlyOriginal.htm
Serbian, Macedonian, Ukrainian (not in Russian)

wikiquote
https://stats.wikimedia.org/wikiquote/EN/TablesPageViewsMonthlyOriginal.htm
Ukrainian, Bulgarian, Serbian (just a bit in Russian)

wiktionary
https://stats.wikimedia.org/wiktionary/EN/TablesPageViewsMonthlyOriginal.htm
Serbian, Macedonian (not in Russian)

wikibooks
https://stats.wikimedia.org/wikibooks/EN/TablesPageViewsMonthlyOriginal.htm
Russian, Serbian, Macedonian, Bulgarian

The following peaks seem coincidental, as these languages aren't exceptional in other projects:
wiktionary Laotian
wikivoyage Hebrew
wikibooks Tatar and Sinhala
wikipedia Cassubian

Event Timeline

ezachte created this task. May 24 2016, 12:28 PM
Restricted Application added subscribers: Zppix, Aklapper. May 24 2016, 12:28 PM
ezachte added a comment. Edited Jun 5 2016, 12:28 PM

I collected all *non-bot* requests to ru.wikinews.org in May [1]
To my amazement I found robots.txt was the second highest requested page [2]. Isn't that odd?

It seems we fail to recognize some bot traffic for East European languages as bot traffic.

Some figures:
May 8-14, 2016: en.wikinews requests for robots.txt are almost non-existent in a 7-day sample: ratio index.php : robots.txt is 624095 : 6282 = roughly 100:1

For full May 2016: ru.wikinews.org ratio index.php : robots.txt is 398979 : 86794 = roughly 5:1
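The two ratios can be recomputed directly from the request counts quoted above. This is only a quick arithmetic check; the dictionary labels and the `ratio` helper are names made up for the sketch.

```python
# Request counts quoted in the comment above: (index.php, robots.txt).
samples = {
    "en.wikinews.org, May 8-14 2016": (624095, 6282),
    "ru.wikinews.org, May 2016": (398979, 86794),
}


def ratio(index_php: int, robots_txt: int) -> float:
    """Return the number of index.php requests per robots.txt request."""
    return index_php / robots_txt


for label, (index_php, robots_txt) in samples.items():
    print(f"{label}: {ratio(index_php, robots_txt):.1f} : 1")
```

The gap between the two wikis is the point of interest: robots.txt is requested far more often, relative to index.php, on ru.wikinews than on en.wikinews.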

SELECT
  uri_host,
  uri_path,
  COUNT(*) AS count
FROM
  webrequest
WHERE
  uri_host = 'ru.wikinews.org'
  AND agent_type = 'user'
  AND year = 2016
  AND month = 5
  AND day [in some range]   (I collected in 5 chunks, then merged with perl)
GROUP BY
  uri_path,
  uri_host
ORDER BY
  count DESC,
  uri_host
LIMIT 10000000;
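The merge step was done in perl and is not shown here. As a rough sketch of what merging the per-chunk results could look like, assuming each chunk is a list of `count,path` lines in the same layout as the output below: the `merge_chunks` function and the sample numbers are hypothetical, not the script that was actually used.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def merge_chunks(chunks: Iterable[Iterable[str]]) -> List[Tuple[str, int]]:
    """Sum per-path counts across chunks of 'count,path' lines."""
    totals = Counter()
    for chunk in chunks:
        for line in chunk:
            line = line.strip()
            if not line or line.startswith("count,"):
                continue  # skip blank lines and the CSV header
            count, path = line.split(",", 1)  # the path may itself contain commas
            totals[path] += int(count)
    # Highest count first, ties broken by path, similar to the ORDER BY above.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up counts for illustration only (not real data).
chunk_a = ["count,title", "10,/w/index.php", "3,/robots.txt"]
chunk_b = ["count,title", "7,/w/index.php", "4,/robots.txt", "1,/w/api.php"]
merged = merge_chunks([chunk_a, chunk_b])
for path, count in merged:
    print(f"{count},{path}")
```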

count,title
1657389,/w/index.php
328434,/robots.txt
274858,/w/load.php
158972,/w/api.php
28966,/w/resources/assets/poweredby_mediawiki_88x31.png
24990,/static/images/wikimedia-button.png
23791,/static/favicon/wikinews.ico
22036,/beacon/event
18105,/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0
5788,/beacon/media
4729,/w/
3911,/
3489,/w/extensions/FlaggedRevs/frontend/modules/img/arrow-down.png
3264,/beacon/statsv

etc
etc

I've always wondered about the fluctuations in certain Russian wikis, including Wikiquote: +50% in February-April 2016 and in March-April 2013. https://stats.wikimedia.org/wikiquote/EN/TablesPageViewsMonthly.htm

1657389,/w/index.php
328434,/robots.txt
274858,/w/load.php
158972,/w/api.php

Only the /wiki URLs are counted, right?

99% of the robots.txt queries on ru.wikipedia.org have the same user agent string:
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1)"
(so no clue here for bot detection)
and the same IP address, which resolves to a location in Amsterdam and belongs to a top 5 global IT corporation.

@Nemo_bis

Only the /wiki URLs are counted, right?

Not sure, but only text/html requests.

My query grabbed all requests, including images, etc, so also robots.txt.

WHERE
  uri_path LIKE '%robots.txt%'
  AND agent_type = 'user'
  AND ip = [same ip address as mentioned above]
  AND year = 2016
  AND month = 5
  AND day < 8

yields 23,803,418 requests for robots.txt on all wikis from this ip address
or 23,803,418 / 7 = 3,400,488 per day

without "uri_path LIKE '%robots.txt%'" in the WHERE clause it's
39,607,724 / 7 = 5,658,246 per day, so not even twice as much (strange)
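A quick check of the per-day figures, plus the share of this IP's requests that are for robots.txt, using only the two totals quoted above. The variable names are made up for the sketch.

```python
DAYS = 7  # May 1-7 2016, from the "day<8" condition

robots_txt_requests = 23_803_418  # requests for robots.txt from this IP
all_requests = 39_607_724         # all requests from this IP

per_day_robots = robots_txt_requests // DAYS
per_day_all = all_requests // DAYS
robots_share = robots_txt_requests / all_requests

print(f"robots.txt per day: {per_day_robots:,}")
print(f"all requests per day: {per_day_all:,}")
print(f"robots.txt share of this IP's requests: {robots_share:.1%}")
```

In other words, well over half of everything this IP requested in that week was robots.txt.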

I discovered earlier that a stuck F5 key can generate that many requests on one single desktop. 5M per day is not that many requests overall, but it is unfortunate that it swamps our smaller projects.

Nuria added a comment. Jul 21 2016, 4:42 PM

This is harder than it seems, as banning requests per IP might ban (or tag as spider) a bunch of lawful traffic, for example all Verizon iPhone users in NYC.
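One possible way around the shared-IP problem, offered only as a sketch and not as anything the Analytics pipeline does: instead of looking at request volume per IP, look at what fraction of a client's requests are for robots.txt. A large shared IP would be expected to have a high volume but a low robots.txt share. The record shape, the `suspicious_clients` function, and both thresholds are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Hypothetical record shape: (ip, user_agent, uri_path).
Request = Tuple[str, str, str]


def suspicious_clients(
    requests: Iterable[Request],
    min_requests: int = 1000,
    min_robots_share: float = 0.5,
) -> List[Tuple[str, str]]:
    """Return (ip, user_agent) pairs whose traffic is mostly robots.txt."""
    totals = defaultdict(int)
    robots = defaultdict(int)
    for ip, user_agent, uri_path in requests:
        key = (ip, user_agent)
        totals[key] += 1
        if uri_path == "/robots.txt":
            robots[key] += 1
    return sorted(
        key
        for key, total in totals.items()
        if total >= min_requests and robots.get(key, 0) / total >= min_robots_share
    )


# Made-up traffic using documentation-only IP ranges.
sample = (
    [("192.0.2.1", "UA-A", "/robots.txt")] * 6
    + [("192.0.2.1", "UA-A", "/w/index.php")] * 4
    + [("198.51.100.7", "UA-B", "/wiki/Main")] * 20
    + [("198.51.100.7", "UA-B", "/robots.txt")]
)
flagged = suspicious_clients(sample, min_requests=5, min_robots_share=0.5)
print(flagged)
```

The second client in the sample sends more requests overall but is not flagged, because only a small fraction of them are for robots.txt.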

Restricted Application added a project: Analytics. Aug 7 2017, 1:32 PM
mforns triaged this task as Normal priority. Apr 16 2018, 3:51 PM
mforns raised the priority of this task from Normal to Needs Triage. Apr 16 2018, 3:52 PM
Milimetric triaged this task as Normal priority. Jul 5 2018, 4:37 PM
Restricted Application added a subscriber: Petar.petkovic. Jul 5 2018, 4:37 PM