
Pageview API: Better filtering of bot traffic on top endpoints
Closed, Resolved · Public

Description

Pageview API: Better filtering of bot traffic on top endpoints

There are pages like "java" and "web scraping" in the top 10 at all times. We know our bot filtering is less than desirable, but could we use the nocookie tagging here? We are going to lose 10% of real user traffic, but it would be a cheap proxy to deal with bots.

Let's discuss

Some users complaining:
https://twitter.com/ReaderMeter/status/684804121208045569

Event Timeline

Nuria raised the priority of this task from to Medium.
Nuria updated the task description.
Nuria added a project: Analytics-Backlog.
Nuria subscribed.

Did a quick check this morning:

  • Top endpoint doesn't contain what we flag as bots (namely spiders); it only contains what we flag as "user".
  • Double-checked the pages "Java_(programming_language)" for Jan 10th and "Web_scraping" for Jan 12th:
    • In the pageview_hourly table, I found 2 to 3 browser + city groups having more than 70k requests per day (a sketch of this kind of check follows this list).
    • In webrequest, I found that for a given hour of the specified day, fewer than 5 IPs make most of the traffic, with request counts unreasonable for a human (thousands).
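
For reference, here's roughly what that pageview_hourly check could look like (a sketch only; the user_agent_map, city and view_count fields are from the standard pageview_hourly schema, and the page/date are just the example above):

SELECT
  user_agent_map['browser_family'] AS browser,
  user_agent_map['browser_major'] AS browser_version,
  city,
  SUM(view_count) AS viewcount
FROM
  pageview_hourly
WHERE
  page_title = 'Java_(programming_language)'
  AND project = 'en.wikipedia'
  AND year = 2016
  AND month = 1
  AND day = 10
GROUP BY
  user_agent_map['browser_family'],
  user_agent_map['browser_major'],
  city
ORDER BY viewcount DESC
LIMIT 20;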

A heuristic could be to remove from pageview_hourly the distinct pages (by project, language_variant and page_title) that receive more than X views from a single IP.
If we go in that direction, we'd need to double-check the data loss first.
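
For illustration, a sketch of what such a check could look like against webrequest (not a worked-out implementation; the 1,000-view threshold and the date are placeholders for X, and it assumes the standard webrequest fields is_pageview, client_ip and pageview_info):

-- Pages where a single client_ip accounts for more than X (here 1000) pageviews in one day
SELECT
  pageview_info['project'] AS project,
  pageview_info['page_title'] AS page_title,
  client_ip,
  COUNT(1) AS views_from_ip
FROM
  webrequest
WHERE
  is_pageview = TRUE
  AND year = 2016
  AND month = 1
  AND day = 12
GROUP BY
  pageview_info['project'],
  pageview_info['page_title'],
  client_ip
HAVING COUNT(1) > 1000
ORDER BY views_from_ip DESC;

Pages that show up here would be the candidates to drop (or re-aggregate without the offending IP), once the data-loss question is answered.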

I like @JAllemandou's approach. Excluding IPs with a completely unrealistic amount of human traffic is probably simple enough. One way to validate this is to compare the results to the human-filtered list in the English Wikipedia Top 25 report. It uses the percentage of mobile views to detect and remove artificial traffic:

Since mobile view data became available to the Report in October 2014, we exclude articles that have almost no mobile views (~2% or less) or almost all mobile views (~95% or more) because they are very likely to be automated views based on our experience and research of the issue.

You can see West.andrew.g's chart of pages (with % mobile) here.
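
For what it's worth, the mobile share per page could be computed straight from pageview_hourly; a sketch (assuming the access_method field with values like 'desktop', 'mobile web' and 'mobile app', and placeholder project/date), where pages under ~2% or over ~95% would be the exclusion candidates:

-- Mobile share of pageviews per page, for high-traffic pages on one day
SELECT
  page_title,
  SUM(view_count) AS total_views,
  SUM(IF(access_method IN ('mobile web', 'mobile app'), view_count, 0)) / SUM(view_count) AS mobile_share
FROM
  pageview_hourly
WHERE
  project = 'en.wikipedia'
  AND agent_type = 'user'
  AND year = 2016
  AND month = 1
  AND day = 10
GROUP BY page_title
HAVING SUM(view_count) > 10000
ORDER BY total_views DESC
LIMIT 1000;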

We deprioritized this task earlier in Q1; moving it back to Q2 just in case we want to take a second look.

I have released a new version of Topviews that shows the percentage of mobile views each page receives: http://tools.wmflabs.org/topviews/?project=en.wikipedia.org&platform=all-access&mobileviews=true

It automatically hides some false positives, but before this was done, you could see that 404.php has 0% mobile views (rounded down). For enwiki at least, this immediately indicates a false positive. There was also Oxford Manifesto, again with 0% mobile views. Then we have pages like XHamster and XXX with over 90% mobile views. Those I'm quite certain are also false positives.

So I guess the big indicator is if the percentage of mobile views is either extremely low or extremely high. However, as you might expect, this is not consistent across projects. For instance, see the results for swwiki: http://tools.wmflabs.org/topviews-test/?project=sw.wikipedia.org&platform=all-access&mobileviews=1&debug=true Here the percentage of mobile views is regularly over 90%, presumably because mobile devices in this part of the world are the most popular portal to the internet. So the logic we use for enwiki won't work there.
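
If the thresholds ever needed to adapt per project rather than being hard-coded at 2%/95%, one option (just a sketch, using the same pageview_hourly fields as above; the date is a placeholder) would be to first compute each project's baseline mobile share and judge pages relative to that:

-- Project-wide mobile share, as a baseline for per-page comparisons
SELECT
  project,
  SUM(IF(access_method IN ('mobile web', 'mobile app'), view_count, 0)) / SUM(view_count) AS project_mobile_share
FROM
  pageview_hourly
WHERE
  agent_type = 'user'
  AND year = 2016
  AND month = 1
  AND day = 10
GROUP BY project
ORDER BY project_mobile_share DESC;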

Not sure if these findings are helpful but I thought I'd share :)

Meant to post this earlier, but great work @MusikAnimal! I'm eager to see this codified into some sort of anti-spam correction, but I'm concerned by articles like "Oxford Manifesto", which also have <0.1% mobile. Though on second thought, the page does look a bit anomalous to be ranking so highly.

Milimetric moved this task from Wikistats to Dashiki on the Analytics board.

So for March 14 we had this: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/ru.wikipedia/all-access/2017/03/14

On the Russian Wikipedia, roughly half of the top 200 pages are false positives. The bot (or whatever) was apparently written to scrape pages alphabetically, starting with the characters "Бе". They begin consistently at around 7,480 hits, and the hits slowly decrease as the bot iterates alphabetically through the pages. Single-page false positives like this happen all the time, but this is the first time I've seen it on a large scale for a single endpoint.

I didn't check all the pages, but the several I did reflect the same scenario I see with most false positives, where the vast majority of traffic comes from a single city. I haven't been checking IPs because those queries take a lot longer, so I can't say for sure if your everyday false positives are from the same IP, but that's most likely the case. Going by city should still be sufficient, provided you make the threshold high enough. So I would suggest that if the top city has over, say, 1,000 times as many pageviews as the next city (an unreasonable amount), it's safe to assume it's a FP. That's a very simple but (in my experience) effective comparison, and it would filter out most of the false positives I've uncovered. The query I typically use:

SELECT
  city,
  SUM(view_count) AS viewcount
FROM
  pageview_hourly
WHERE
  page_title = 'Без_границ_(организация)'
  AND project = 'ru.wikipedia'
  AND year = 2017
  AND month = 3
  AND day = 14
GROUP BY city
ORDER BY viewcount DESC

And from here compare the counts for the top city to the others. In this case the false positives also had less than 0.1% mobile pageviews, a tactic that works for ru.wikipedia. By contrast I think comparing the top cities might work for any wiki, again provided the threshold is crazy high. I can't imagine how pageviews originating in the top city could be 1,000 times more than the next. This could happen with New York City vs Hertford, North Carolina, but I doubt you'd see two cities like that side by side when sorted by viewcount.

What do you think? Is it possible to automate these queries and exclude any pages meeting the criteria? The beeline queries are pretty slow (though you probably have a better way to do it), so if it helps we could maybe only test the top 100 pages. That would however mean you'd first need to compute the top 1,000 then test for false positives. Not sure if we're OK with returning less than the advertised 1,000 pages, but if so maybe compute the top 1,100 or so to give a little wiggle room. If we were somehow able to do this it'd greatly improve the data. There are other mysterious false positives with IPs all around the world, and those we may not ever figure out, but I think we should attempt to filter out the obvious ones.
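
To sketch what automating this could look like in a single pass over pageview_hourly (purely illustrative; the 1,000× ratio and the date are placeholders), a window function would avoid running one query per page:

-- For each page, compare the top city's views to the second city's views
WITH city_counts AS (
  SELECT
    page_title,
    city,
    SUM(view_count) AS viewcount,
    ROW_NUMBER() OVER (PARTITION BY page_title ORDER BY SUM(view_count) DESC) AS city_rank
  FROM pageview_hourly
  WHERE project = 'ru.wikipedia'
    AND year = 2017
    AND month = 3
    AND day = 14
  GROUP BY page_title, city
)
SELECT
  c1.page_title,
  c1.viewcount AS top_city_views,
  c2.viewcount AS second_city_views
FROM city_counts c1
JOIN city_counts c2
  ON c1.page_title = c2.page_title
WHERE c1.city_rank = 1
  AND c2.city_rank = 2
  AND c1.viewcount > 1000 * c2.viewcount;

Anything returned here would be a candidate to exclude before (or just after) computing the top 1,000.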

The problem I'd be worried about is when traffic from a specific city makes sense, like when there is local news about that city that isn't relevant to the rest of the world. We'd have to find events like that and figure out how they're different from false positives like the ones you identified here. More important than the stats, though, is that it seems we're being bombarded with fake traffic. A solution to this seems highly desirable. Will try to up the priority.

Milimetric raised the priority of this task from Medium to High. Mar 16 2017, 10:56 AM

This is on our radar for Q1/Q2 (July 2017–September 2017); we will not be able to tackle this problem any sooner.

The problem I'd be worried about is when traffic from a specific city makes sense, like when there is local news about that city that isn't relevant to the rest of the world. We'd have to find events like that and figure out how they're different from false positives like the ones you identified here. More important than the stats, though, is that it seems we're being bombarded with fake traffic. A solution to this seems highly desirable. Will try to up the priority.

I would assume at least the top 100 would reflect more nationwide or international attention, and the top 100 are what most people are interested in. Usually any huge news that happens in NYC is going to be reported elsewhere, for instance. I can try to find out for sure, but again I bet these bots usually operate from a single IP, so going by that should alleviate your concern. There are some exceptions, like countries or perhaps individual cities where the public predominantly shares the same IP or a small pool of IPs. To account for that you might consider restricting this false positive detection to the more popular projects that don't have weird edge cases and are more subject to fake traffic.

I was talking to someone about bot detection, and they mentioned that they have gotten good mileage in bot filtering by grading IP addresses by the ratio of HTML pages requested. I ran a quick query against a day's webrequest logs to get a top-level idea of what's plausible:

select count(1) as n_ip, percentile_approx(n_html/n_req, array(0.001, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 0.999)) as percentiles from (select sum(if(substring(content_type, 1, 9) == "text/html", 1, 0)) as n_html, count(1) as n_req from webrequest where year=2018 and month=4 and day=17 group by client_ip ) x;

number of IP addresses: 166,962,302
percentiles for the ratio of requests returning HTML over total requests, by IP address:

0.1%    1%      5%      25%     50%     75%     95%     99%     99.9%
0       0       0       0       0.046   0.097   0.285   0.999   0.999

I was surprised how many IPs don't request HTML. I verified with a direct count that indeed 64M out of 167M IP addresses request less than 1 HTML page per 1,000 requests. This could be a problem with my ad hoc classification method.
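
A sketch of what such a direct count could look like (not necessarily the exact query used), reusing the same per-IP subquery with the like-based content-type match:

-- Count IPs whose HTML responses are under 1 per 1,000 requests
SELECT COUNT(1) AS low_html_ips
FROM (
  SELECT
    client_ip,
    SUM(IF(content_type LIKE '%text/html%', 1, 0)) AS n_html,
    COUNT(1) AS n_req
  FROM webrequest
  WHERE year = 2018
    AND month = 4
    AND day = 17
  GROUP BY client_ip
) per_ip
WHERE n_html / n_req < 0.001;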

This seems to at least have the ability to differentiate IP addresses, although it would require a good bit more evaluation to determine whether we could do something useful with it. I'm not sure if apps ever get an HTML response either, or if their content is embedded in a reply of a different content type.

This is so cool, @EBernhardson, thank you. Formatting for my future reference:

select count(1) as n_ip,
       percentile_approx(
           n_html/n_req,
           array(0.001, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 0.999)
       ) as percentiles

  from (select sum(if(content_type like '%text/html%', 1, 0)) as n_html,
               count(1) as n_req
          from webrequest
         where year=2018
           and month=4
           and day=17
         group by client_ip)  html_vs_total_webrequests;

Closing: the automated marker has been deployed and the top endpoints will not report data marked as 'automated'. See: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection
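
For anyone querying the data directly, the marker (per the wikitech page above) surfaces as an additional agent_type value, so a query mirroring what the top endpoint now reports would look roughly like this (a sketch, assuming agent_type values 'user', 'spider' and 'automated'; the project and date are placeholders):

-- Top pages counting only 'user' traffic, i.e. excluding 'spider' and 'automated'
SELECT
  page_title,
  SUM(view_count) AS viewcount
FROM pageview_hourly
WHERE project = 'en.wikipedia'
  AND agent_type = 'user'
  AND year = 2020
  AND month = 6
  AND day = 1
GROUP BY page_title
ORDER BY viewcount DESC
LIMIT 100;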