
"egranary digital library system" UA should be listed as a spider
Closed, Duplicate · Public

Description

When running a query on the webrequest table for 2016/05/10 on the wikidata wiki, looking at the page Special:RecentChangesLinked grouped by agent type, I get:

count	agent_type
299	user
123347	spider

The same timeframe through the Pageview API shows very different numbers.

{"items":[{"project":"wikidata","article":"Special:RecentChangesLinked","granularity":"daily","timestamp":"2016051000","access":"all-access","agent":"all-agents","views":91400}]}
{"items":[{"project":"wikidata","article":"Special:RecentChangesLinked","granularity":"daily","timestamp":"2016051000","access":"all-access","agent":"spider","views":72}]}
{"items":[{"project":"wikidata","article":"Special:RecentChangesLinked","granularity":"daily","timestamp":"2016051000","access":"all-access","agent":"bot","views":0}]}
{"items":[{"project":"wikidata","article":"Special:RecentChangesLinked","granularity":"daily","timestamp":"2016051000","access":"all-access","agent":"user","views":91328}]}

As can be seen, there should be very few actual user hits here, but the API reports tens of thousands.

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/wikidata/all-access/user/Special%3ARecentChangesLinked/daily/20160510/20160510

It seems the data is correct in webrequest but gets messed up when it is further processed for the Pageview API.

The query used on webrequest is below (sorry if it looks terrible):

SELECT
  count(*) as count, agent_type
FROM
  webrequest
WHERE
  year = 2016
  AND month = 5
  AND day = 10
  AND uri_host = "www.wikidata.org"
  AND http_status = 200
  AND http_method = "GET"
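  -- note: the trailing "%" below also matches subpages such as
  -- Special:RecentChangesLinked/Q13215, which turns out to matter (see comments)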
  AND uri_path LIKE "/wiki/Special:RecentChangesLinked%"
GROUP BY agent_type
ORDER BY count
LIMIT 999999;

Further investigation is in the comments below.
The large spike we noticed appears to come from a UA like "egranary digital library system".

Event Timeline

Addshore renamed this task from Pageview API not categorizing spiders correctly to Pageview API not reporting spiders correctly. May 12 2016, 6:46 PM
Addshore moved this task from incoming to monitoring on the Wikidata board.

Results look correct to me with that query:

SELECT
    agent_type,
    count(1) as count
FROM
  webrequest
WHERE
  year = 2016
  AND month = 5
  AND day = 10
  AND uri_host LIKE "%wikidata.org"
  AND is_pageview
  AND pageview_info['page_title'] = "Special:RecentChangesLinked"
GROUP BY agent_type
ORDER BY count
LIMIT 999999;
----------------------------------------------
agent_type	count
spider	72
user	91328

What has changed:

  • include wikidata.org as well as www.wikidata.org
  • filter for pageviews only (instead of all requests)
  • filter on the pageview title instead of the URI (there are multiple ways to request the same page with MediaWiki)

Okay, so the first queries we used while investigating also caught pages such as Special:RecentChangesLinked/Q13215, which are tracked as views of different pages!

Now all of the numbers look like they match up.

It looks like the UA throwing the data off here is not included in the spiders category but should be!

"mozilla [en] egranary digital library system"

That appears to be what caused the flood that was noticed.

Addshore renamed this task from Pageview API not reporting spiders correctly to "egranary digital library system" UA should be listed as a spider. May 12 2016, 9:43 PM
Addshore updated the task description.

> Okay, so the first queries we used while investigating also caught pages such as Special:RecentChangesLinked/Q13215, which are tracked as views of different pages!

"%" is the wildcard character for LIKE in HQL and SQL. So uri_path LIKE "/wiki/Special:RecentChangesLinked%" would be expected to catch these, yes.

BTW, was there a particular reason why you were querying the webrequest table and not the much smaller pageview_hourly table? It gives you the same numbers with much less computational effort:

SELECT
    agent_type,
    SUM(view_count) AS count
FROM
  wmf.pageview_hourly
WHERE
  year = 2016
  AND month = 5
  AND day = 10
  AND project = "wikidata"  
  AND page_title = "Special:RecentChangesLinked"
GROUP BY agent_type
ORDER BY count
LIMIT 999999;
...
agent_type	count
spider	72
user	91328
2 rows selected (108.284 seconds)

> Okay, so the first queries we used while investigating also caught pages such as Special:RecentChangesLinked/Q13215, which are tracked as views of different pages!

My issue here is that I didn't realise these would be tracked as separate page views! I would have thought subpages of special pages would count as views of the main page.

As for webrequest vs pageview_hourly: I was unaware of pageview_hourly until yesterday.
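
As a hedged sketch of how the subpage tracking could have been spotted (the query shape follows the ones above; the LIMIT and any titles it would return are illustrative, not data from this task), listing the distinct pageview titles shows each subpage tracked as its own page:

SELECT
    pageview_info['page_title'] AS page_title,
    count(1) as count
FROM
  webrequest
WHERE
  year = 2016
  AND month = 5
  AND day = 10
  AND uri_host LIKE "%wikidata.org"
  AND is_pageview
  AND pageview_info['page_title'] LIKE "Special:RecentChangesLinked%"
GROUP BY pageview_info['page_title']
ORDER BY count DESC
LIMIT 20;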

@Tbayer : I suggested to @Addshore to query webrequest for a specific hour for detailed user_agent analysis.
For this check, @Addshore, I would really have gone for ONE HOUR of data, making the volume of data to work with much smaller (the data is partitioned down to the hour).
As for using pageview_hourly to double-check the numbers, that is the first thing I did, and I then double-checked that everything was OK on webrequest.
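
For instance (a sketch only; the hour value here is an arbitrary example), restricting the query to a single hour partition scans roughly 1/24 of the day's data:

SELECT
    agent_type,
    count(1) as count
FROM
  webrequest
WHERE
  year = 2016
  AND month = 5
  AND day = 10
  AND hour = 14  -- hypothetical hour; webrequest is partitioned down to the hour
  AND uri_host LIKE "%wikidata.org"
  AND is_pageview
  AND pageview_info['page_title'] = "Special:RecentChangesLinked"
GROUP BY agent_type;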

Of course, the queries I used when initially looking were for a single hour of the day; but not knowing about webrequest_daily meant the only way to get numbers for the whole day was to query the whole day on webrequest.

Please try to notify the owner of this UA about the User-Agent policy. If they add the word "bot" to the UA, it would automatically be marked as a spider.
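
As a hedged sketch of why adding "bot" would help (the pattern below is an illustrative approximation; the real classification lives in the analytics refinery and is more elaborate), the agent_type classification amounts to a regular-expression test against the user agent:

-- Illustrative approximation only, not the production rule.
SELECT
    user_agent,
    CASE
      WHEN lower(user_agent) RLIKE '(bot|crawler|spider)' THEN 'spider'
      ELSE 'user'
    END AS agent_type
FROM
  webrequest
WHERE
  year = 2016
  AND month = 5
  AND day = 10
  AND hour = 14  -- hypothetical hour
LIMIT 10;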

Who is responsible for this?

The policy doesn't have a specific owner, if that is what you are asking. Here it is: https://meta.wikimedia.org/wiki/User-Agent_policy

This is another instance of bot traffic that slips by; this UA might not be causing trouble now, but there will be others. Merging into the parent task.