
Analyse possible bot traffic for ptwiki article Ambev
Open, High, Public

Description

The article https://pt.wikipedia.org/wiki/AMBEV is consistently among the top 3 most accessed pages in ptwiki.

There was a jump in views in October 2020, a 30-fold increase in the access rate, which has stayed consistent for many months.

image.png (445×866 px, 235 KB)

It is a page for a company, so there is always the possibility of marketing strategy via the web app, or similar.
The page for Cleópatra shows a similar jump around the same time.

image.png (521×853 px, 254 KB)

Event Timeline

I quickly played with the more detailed data in Turnilo for those two pages. The requests appear to come from different IP addresses and user agents, and the agents look like proper desktop/mobile agents.

For the record, https://w.wiki/3cXe [restricted link] is the query I used.

@Urbanecm thanks for taking a look at it. It is still weird, though: I follow the Wikipedia app in about 6 different languages, and I've not seen any other items with similar patterns. They are always top-ranked in ptwiki, no matter what happens during the week.

The page for YouTube on ptwiki has developed a similar pattern over the last month too...

image.png (569×862 px, 283 KB)

That's about the same story :/. It has plausible user agents (nothing like Windows 10 on the mobile domain) and plausible ISPs (they match the distribution across all of ptwiki, and match market shares quite well). The IPs are also quite unique.

I agree this kind of behavior on multiple pages is weird, but unfortunately I was unable to find anything bot-like :/. Tagging Analytics in case they want to take another look; maybe I missed something obvious.

odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming to Ops Week on the Analytics board.
Milimetric edited projects, added Analytics-Radar; removed Analytics.

I took a quick look and I agree the requests seem diverse enough to be organic, even if they're really not. The only thing of note I found is that for YouTube, the pattern doesn't show up on the .m domain: https://w.wiki/3dYA. I'm also going to sweep this under the T138207 rug, just to keep all the possible examples in one place.

(TODO) We could use a standard way of handling these tasks at WMF. I feel like it would be ignored if it was on a wiki, but maybe everyone that matters is cc-ed here so...

  • data anomaly that tickles our bot/automata spidey-sense happens
  • who should get tagged first? product-analytics? analytics?
  • quick initial analysis (inspired by @Urbanecm's thoughts above, @JAllemandou do you have other tricks and things you look for?)
    • Purpose: try to find some simple heuristic we can easily add to our current bot classifier
    • check webrequest sampled in Turnilo, split by different dimensions to see if the bump is exclusive to a specific dimension value (e.g. a specific country)
    • look at UAs qualitatively (do they look real and normal?)
    • look at IP and UA diversity
    • look at the ISP distribution, comparing it with the overall wiki project
  • until T138207 is resolved, we can keep adding examples as subtasks there
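
The quick checks in the list above could be sketched roughly like this. It is only a minimal illustration, assuming the sampled requests have been exported as a list of dicts. The field names (`ip`, `user_agent`, `isp`, `country`), the helper names and the sample rows are all made up for the example; they are not the real webrequest schema or Turnilo output.

```python
from collections import Counter


def share_by(rows, dimension):
    """Share of requests per value of one dimension (e.g. 'country')."""
    counts = Counter(row[dimension] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def diversity(rows, dimension):
    """Distinct values per request; 1.0 means every request has its own value."""
    if not rows:
        return 0.0
    return len({row[dimension] for row in rows}) / len(rows)


def total_variation(p, q):
    """Total variation distance between two share dicts (0 = same, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical sampled requests for the suspicious page.
page_rows = [
    {"ip": "ip-1", "user_agent": "ua-1", "isp": "isp-A", "country": "BR"},
    {"ip": "ip-2", "user_agent": "ua-2", "isp": "isp-A", "country": "BR"},
    {"ip": "ip-3", "user_agent": "ua-1", "isp": "isp-B", "country": "BR"},
    {"ip": "ip-4", "user_agent": "ua-3", "isp": "isp-B", "country": "PT"},
]

# Hypothetical sampled requests for the whole wiki project (the baseline).
wiki_rows = [
    {"ip": "ip-5", "user_agent": "ua-1", "isp": "isp-A", "country": "BR"},
    {"ip": "ip-6", "user_agent": "ua-4", "isp": "isp-A", "country": "BR"},
    {"ip": "ip-7", "user_agent": "ua-2", "isp": "isp-B", "country": "PT"},
    {"ip": "ip-8", "user_agent": "ua-5", "isp": "isp-C", "country": "BR"},
]

# Is the bump exclusive to one dimension value?
country_share = share_by(page_rows, "country")

# How diverse are IPs and user agents?
ip_diversity = diversity(page_rows, "ip")
ua_diversity = diversity(page_rows, "user_agent")

# Does the page's ISP mix differ from the overall project?
isp_gap = total_variation(share_by(page_rows, "isp"), share_by(wiki_rows, "isp"))

print(country_share, ip_diversity, ua_diversity, isp_gap)
```

A single dominant dimension value, low diversity, or a large ISP gap would each be a hint worth a closer look; what counts as "large" is a judgment call and would need calibrating against pages known to be organic.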
BTullis subscribed.

Should we close this ticket? We still have the wider topic ticket open: T138207: [Open question] Improve bot identification at scale