On Feb 14th and 15th we received traffic anomaly alerts for a group of countries including Uzbekistan, Kazakhstan, Libya and Pakistan.
All of them showed an increase in traffic that was not recognized as bot traffic. One particularity is that the increase was attributed to en.wikipedia, commons.wikimedia, species.wikimedia and mediawiki.org. The last of these was the clearest example: in Uzbekistan on Feb 14th at 6am UTC, mediawiki.org was the most visited wiki in the country (with other wikis showing normal traffic levels). See chart: https://tinyurl.com/oxwtczba
We were then pointed to this chart (thanks @MusikAnimal): https://pageviews.toolforge.org/siteviews/?platform=desktop&source=pageviews&agent=user&range=latest-20&sites=en.wikibooks.org|en.wikinews.org|en.wikiquote.org|en.wikisource.org|en.wikiversity.org|en.wikivoyage.org
It shows that the problem is actually broader than what we had already seen.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T280565 Improve pageview automated traffic detection heuristics
Resolved | | cchen | T274823 Big increase in traffic for projects except 'wikipedia' family since Feb 14th
Event Timeline
I have done some checking:
- MaxMind database update was on Feb 9th and archived files got deleted on Feb 11th - This seems unrelated.
- There clearly seems to be a small number of IPs making most of the requests for the projects that saw a change (en.wikipedia and commons.wikimedia, for instance).
- The requests show a high variability of user agents, but the number of requests per user agent is extremely regular. This looks like automated traffic trying to disguise itself by changing user agent.
- The requests show a high variability in the visited pages, so the impact on per-page metrics is relatively small.
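The second and third observations can be illustrated with a minimal sketch. It assumes requests are available as `(ip, user_agent)` tuples, which is a hypothetical input shape and not the actual pageview schema, and `user_agent_profile` is an illustrative name. For each IP it reports how many distinct user agents were seen and how regular the per-agent request counts are (coefficient of variation, where a value near 0 means very regular).

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def user_agent_profile(requests):
    """Summarize user-agent usage per IP.

    requests: iterable of (ip, user_agent) tuples (hypothetical input shape).
    Returns {ip: (distinct_user_agents, coefficient_of_variation)}.
    """
    per_ip = defaultdict(Counter)
    for ip, user_agent in requests:
        per_ip[ip][user_agent] += 1

    profile = {}
    for ip, counts in per_ip.items():
        values = list(counts.values())
        # Spread of per-agent request counts relative to their mean.
        cv = pstdev(values) / mean(values)
        profile[ip] = (len(values), cv)
    return profile


# Synthetic data: one IP rotating 50 user agents with exactly 4 requests each,
# and one IP with two agents used very unevenly.
sample = [("10.0.0.1", f"agent-{i}") for i in range(50) for _ in range(4)]
sample += [("10.0.0.2", "browser-a")] * 30 + [("10.0.0.2", "browser-b")] * 2
profile = user_agent_profile(sample)
```

Many distinct user agents combined with a coefficient of variation close to 0 would match the pattern described above; the cut-off values to use are left open here.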
There clearly seems to be a small number of IPs making most of the requests for the projects that saw a change (en.wikipedia and commons.wikimedia, for instance).
Thanks for looking into this! That makes sense. It's curious how the automated traffic detection didn't catch those, if they share IPs. Maybe we can improve the heuristics for this particular case.
It's curious how the automated traffic detection didn't catch those, if they share IPs. Maybe we can improve the heuristics for this particular case.
The traffic has not been flagged because there is no (ip, user_agent) pair making more than 800 requests per moving 24h window. Some IPs are prevalent, but they belong to telecom companies and I assume they do NAT. Also, there is a wide variability in the pages visited. The only heuristic I can think of that could catch low-volume traffic is query regularity (repetitive querying at regular intervals), but this is a complicated heuristic :)
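To make the threshold argument concrete, here is a minimal sketch of a per-(ip, user_agent) moving-window check. The 800-request limit and the 24h window come from the comment above; the function name, the event shape `(timestamp_seconds, ip, user_agent)` and the in-memory approach are assumptions for illustration, not the actual detection job.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60  # moving 24h window
THRESHOLD = 800                # requests per (ip, user_agent) pair


def flagged_pairs(events, threshold=THRESHOLD, window=WINDOW_SECONDS):
    """Return the (ip, user_agent) pairs exceeding `threshold` in any window.

    events: iterable of (timestamp_seconds, ip, user_agent), sorted by
    timestamp (hypothetical input shape).
    """
    recent = defaultdict(deque)  # pair -> timestamps still inside the window
    flagged = set()
    for ts, ip, user_agent in events:
        timestamps = recent[(ip, user_agent)]
        timestamps.append(ts)
        # Drop timestamps that fell out of the moving window.
        while timestamps and timestamps[0] <= ts - window:
            timestamps.popleft()
        if len(timestamps) > threshold:
            flagged.add((ip, user_agent))
    return flagged


# One IP sending 5000 requests spread over 100 user agents (50 per pair),
# and one IP sending 801 requests with a single user agent.
rotating = [(i, "10.0.0.1", f"agent-{i % 100}") for i in range(5000)]
single = [(i, "10.0.0.2", "agent-x") for i in range(801)]
flagged = flagged_pairs(sorted(rotating + single))
```

In this synthetic example only the single-agent IP is flagged, even though the rotating IP sends far more requests overall, which is the gap described above.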
Thanks for opening this task, Marcel.
Joal, thanks for investigating this: it is helpful context for some past and possibly future alerts as well that we may have (had) trouble understanding.
We could add a tag to pageviews generated by actors with high-traffic IPs.
It would not change the way we process, count or classify traffic today,
but we could use it to filter out this type of traffic when doing analyses like traffic anomalies.
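A rough sketch of what such a tag could look like, assuming pageviews are dicts with an `ip` key and that "high-traffic" means more than some per-IP request count. Both the field names and the threshold are hypothetical; the suggestion above does not define them.

```python
from collections import Counter


def tag_high_traffic_ips(pageviews, ip_threshold):
    """Return copies of the pageviews with a boolean 'high_traffic_ip' tag.

    pageviews: list of dicts with at least an 'ip' key (hypothetical shape).
    ip_threshold: per-IP request count above which the IP is tagged.
    """
    per_ip = Counter(pv["ip"] for pv in pageviews)
    return [
        {**pv, "high_traffic_ip": per_ip[pv["ip"]] > ip_threshold}
        for pv in pageviews
    ]


# Counting and classification are untouched; the tag is only used to filter.
views = [{"ip": "10.0.0.1", "page": f"Page_{i}"} for i in range(5)]
views.append({"ip": "10.0.0.2", "page": "Main_Page"})
tagged = tag_high_traffic_ips(views, ip_threshold=3)
for_anomaly_analysis = [pv for pv in tagged if not pv["high_traffic_ip"]]
```

The original records keep all their fields, so existing processing is unaffected and the tag only matters to analyses that choose to filter on it.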
There seems to be a broader issue with related countries: https://pageviews.toolforge.org/siteviews/?platform=desktop&source=pageviews&agent=user&range=latest-20&sites=en.wikibooks.org|en.wikinews.org|en.wikiquote.org|en.wikisource.org|en.wikiversity.org|en.wikivoyage.org
I quickly checked countries for some projects, and for all the ones I checked the rise in traffic always came from the same countries: India, Russia, Uzbekistan, Kazakhstan, Ukraine. Some of these countries (Russia, Kazakhstan, Uzbekistan) were in the list of countries raised by entropy alarms in the past days.
@kzimmerman : Could your team provide help on this?
@JAllemandou it looks like you checked the main dimensions to investigate; the other thing is that the jump only happens on desktop (mobile web looks normal). Connie's going to raise this in our team sharing meeting tomorrow; I'll add you as optional, though I think it's too late in your time zone.
Hi all, I also found that most local-language Wikipedias in Indonesia show the same big increase in traffic, except bug.wiki. Please check bug.wikipedia.org|gor.wikipedia.org|tet.wikipedia.org|su.wikipedia.org|min.wikipedia.org|ace.wikipedia.org|jv.wikipedia.org|bjn.wikipedia.org|map-bms.wikipedia.org for more info. I'd love to know what actually happened and how to handle this in the future.
@cchen can you summarize the findings from you & @JAllemandou here, for future reference? My understanding is that you didn't find solid trends that could identify the traffic as bots, but we still suspect bot traffic and will have to speak to this in the Key Product Metrics.