
View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. [5 pts]
Closed, ResolvedPublic

Description

See also T116531: Monthly page view stats for wikibooks, wikinews, wikiquote, wikisource, wikiversity for July 2015 are extremely anomalous

This is pageviews per day for wikinews according to the hive table pageview_hourly

pageviews_en_wikinews_from_hive_table_pageview_hourly.jpg (393×715 px, 87 KB)

This is pageviews per day for wikinews according to hive based webstatscollector 2.0 data

pageviews_en_wikinews_from_webstatscollector.jpg (354×728 px, 70 KB)

This is also visible in monthly pageview stats

These webstatscollector counts are visualized in Wikistats in several reports, but the hive-based data feed is in error

pasted_file (240×640 px, 30 KB)

From squid logs I get a totally different number yet again (see below)

pageviews_en_wikinews_from_webstatscollector2.jpg (577×291 px, 137 KB)

Event Timeline

ezachte raised the priority of this task from to High.
ezachte updated the task description.
ezachte added a project: Analytics.
ezachte subscribed.

hive query for

pageviews_en_wikinews_from_hive_table_pageview_hourly.jpg (393×715 px, 87 KB)

USE wmf ;

SELECT agent_type, project, access_method, day, sum(view_count) AS count
FROM pageview_hourly
WHERE year=2015 AND month=7 AND project LIKE 'en.wikinews'
GROUP BY agent_type, project, access_method, day
ORDER BY agent_type, project, access_method, day
LIMIT 10000000 ;

webstatscollector 2.0 output:
in stat1002:/mnt/data/xmldatadumps/public/other/pagecounts-raw/2015/2015-07>

grep en.n projectcounts-20150701*

projectcounts-20150701-000000:en.n - 3161 37975119
projectcounts-20150701-010000:en.n - 3236 38936395
projectcounts-20150701-020000:en.n - 3251 38751730
projectcounts-20150701-030000:en.n - 3332 40136550
projectcounts-20150701-040000:en.n - 3032 40128064
projectcounts-20150701-050000:en.n - 2925 38769724
projectcounts-20150701-060000:en.n - 2700 39367180
projectcounts-20150701-070000:en.n - 2415 29279467
projectcounts-20150701-080000:en.n - 2475 35211365
projectcounts-20150701-090000:en.n - 2012 27476388
projectcounts-20150701-100000:en.n - 2409 34169367
projectcounts-20150701-110000:en.n - 3298 35145790
projectcounts-20150701-120000:en.n - 5005 45849253
projectcounts-20150701-130000:en.n - 4415 34307981
projectcounts-20150701-140000:en.n - 7749 52394829
projectcounts-20150701-150000:en.n - 7596 42170727
projectcounts-20150701-160000:en.n - 6983 47572713
projectcounts-20150701-170000:en.n - 5951 48150755
projectcounts-20150701-180000:en.n - 6404 39822572
projectcounts-20150701-190000:en.n - 7007 37231337
projectcounts-20150701-200000:en.n - 7067 40687891
projectcounts-20150701-210000:en.n - 9591 49919920
projectcounts-20150701-220000:en.n - 17140 45253790
projectcounts-20150701-230000:en.n - 11966 36673186

total views 131,120


grep en.n projectcounts-20150710*

projectcounts-20150710-000000:en.n - 254083 58312257
projectcounts-20150710-010000:en.n - 245438 80237806
projectcounts-20150710-020000:en.n - 248881 113398761
projectcounts-20150710-030000:en.n - 244359 75417827
projectcounts-20150710-040000:en.n - 232030 64825281
projectcounts-20150710-050000:en.n - 222069 58658800
projectcounts-20150710-060000:en.n - 234246 97603105
projectcounts-20150710-070000:en.n - 269986 109148193
projectcounts-20150710-080000:en.n - 301375 109564043
projectcounts-20150710-090000:en.n - 325433 146434532
projectcounts-20150710-100000:en.n - 331040 81287251
projectcounts-20150710-110000:en.n - 320869 69203438
projectcounts-20150710-120000:en.n - 337579 73193525
projectcounts-20150710-130000:en.n - 365053 67932029
projectcounts-20150710-140000:en.n - 397178 83460732
projectcounts-20150710-150000:en.n - 411419 122601254
projectcounts-20150710-160000:en.n - 392081 149209960
projectcounts-20150710-170000:en.n - 361272 76487503
projectcounts-20150710-180000:en.n - 350990 66730225
projectcounts-20150710-190000:en.n - 350482 68691343
projectcounts-20150710-200000:en.n - 349702 65758429
projectcounts-20150710-210000:en.n - 338237 62132890
projectcounts-20150710-220000:en.n - 299359 57446447
projectcounts-20150710-230000:en.n - 250146 58156917

total views 7,403,337
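
For reference, each daily total is the sum of the first numeric column (the request count) over the 24 hourly lines for that day. A minimal Python sketch, assuming the `grep` output format shown above (`file:project - count bytes`); the function name `daily_total` is illustrative and not part of webstatscollector:

```python
def daily_total(grep_lines, project="en.n"):
    """Sum the hourly view counts for one project from grep output lines."""
    total = 0
    for line in grep_lines:
        line = line.strip()
        if not line:
            continue
        # Drop the "filename:" prefix that grep adds when several files match.
        record = line.split(":", 1)[-1]
        fields = record.split()
        # Assumed fields: project code, "-", view count, bytes transferred.
        if len(fields) == 4 and fields[0] == project:
            total += int(fields[2])
    return total


sample = [
    "projectcounts-20150701-000000:en.n - 3161 37975119",
    "projectcounts-20150701-010000:en.n - 3236 38936395",
    "projectcounts-20150701-020000:en.n - 3251 38751730",
]
print(daily_total(sample))  # first three hours of July 1 only
```

Feeding it all 24 lines of a day reproduces that day's total.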

When you say 'squid logs' what do you mean?

filtering 1:1000 sampled squid logs for wikinews html requests

zcat sampled-1000.tsv.log-20150710.gz | grep 'wikinews' > ~/wikinews_20150710.txt
cat ~/wikinews_20150701.txt | grep en.wikinews | grep text/html | cut -f 9 | sed 's/\?.*$//' | sort | uniq -c | sort -n -r

returns 255 events for July 10 and 230 for July 01,
almost half of which are WMF related, not user related: 112 for July 01 and 127 for July 10

e.g. for July 01

60 http://en.wikinews.org/wiki/Special:CentralAutoLogin/start
52 http://en.wikinews.org/wiki/Special:CentralAutoLogin/createSession
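
The filtering pipeline above can also be sketched in Python, which makes the individual steps explicit. This assumes, as the `cut -f 9` implies, that the URL is the ninth tab-separated field; the helper names and the sample rows are hypothetical, and the real sampled-log layout may contain other fields:

```python
from collections import Counter


def count_html_urls(lines, host="en.wikinews", url_field=8):
    """Count text/html requests per URL, ignoring query strings."""
    counts = Counter()
    for line in lines:
        # Same substring filters as the two greps in the pipeline.
        if host not in line or "text/html" not in line:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= url_field:
            continue
        # `cut -f 9` is 1-based, so the URL is index 8; drop the query string.
        url = fields[url_field].split("?", 1)[0]
        counts[url] += 1
    return counts


def make_row(url, mime):
    # Hypothetical row: 8 filler fields, then URL (field 9) and MIME type.
    return "\t".join(["-"] * 8 + [url, mime])


rows = [
    make_row("http://en.wikinews.org/wiki/Main_Page?x=1", "text/html"),
    make_row("http://en.wikinews.org/wiki/Main_Page", "text/html"),
    make_row("http://en.wikinews.org/w/load.php", "text/css"),
    make_row("http://de.wikinews.org/wiki/Hauptseite", "text/html"),
]
for url, n in count_html_urls(rows).most_common():
    print(n, url)
```

Because the logs are sampled 1:1000, each counted line stands for roughly a thousand requests.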

Who can make sense of these widely disparate figures?

@Ottomata the squid logs are the 1:1000 sampled logs at stat1002:/a/squid/archive/sampled>

@ezachte: As I understand it, the sampled logs and pageviews cannot be compared directly: the pageview definition does not count many of the incoming requests as pageviews.
cc @Ironholds, who did some work in this regard comparing the old and new definitions.

FYI, there used to be R code to compute pageviews from the sampled logs. Those counts would also differ from the ones found in pageview_hourly, because the R code is quite outdated. Details in that regard are here: https://phabricator.wikimedia.org/T108925

@Nuria Right, I know actually, so yes: 128 (x 1000) for the sampled logs (255 - 127 CentralAutoLogin) comes somewhat close to the hive number from pageview_hourly for July 10: 48k spider + 35k user = 82k. The 128k from squid logs is the upper limit, as that factors in mime type only.
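
The scaling in the comment above is simple arithmetic on the 1:1000 sample. A small sketch, with a function name and parameters chosen here for illustration:

```python
def estimate_from_sample(sampled_hits, excluded_hits=0, sampling_factor=1000):
    """Scale a count from sampled logs up to an estimated full count."""
    return (sampled_hits - excluded_hits) * sampling_factor


# July 10: 255 sampled text/html hits, 127 of them CentralAutoLogin.
print(estimate_from_sample(255, 127))  # 128000
```

With so few sampled lines per day the estimate is coarse, so only the order of magnitude is meaningful when comparing it with the hive numbers.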

That leaves the major issue: why do the projectcounts files go from 131,120 (July 1) to 7,403,337 (July 10), with no such crazy jump in pageview_hourly?

BTW wikinews is just an example here; other projects show similar crazy peaks.
See also https://phabricator.wikimedia.org/maniphest/task/edit/116531/

This text in description is misplaced "From squid logs I get a totally different number yet again (see below)" as both chart above and below this text refer to same data feed. (read below as 'upcoming comments')

We have a table in hive that has the sampled logs loaded for a year; I will look into this a bit.
Now, the bottom line is that the pageview definition does not count the spike you see as pageviews. That is not surprising, as the old and new definitions differ quite a bit. Will report.

Nuria renamed this task from View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. to View counts in squid logs, webstatscollector 2.0 and hive are very dissimilar for several projects. [5 pts]. Oct 30 2015, 3:34 PM

Looking at project counts for en.wikinews for the 10th of July under the old definition, I can see a huge volume of "supposed" pageviews for "Special:HideBanners".

Removing the "HideBanners" pageviews, the total for July 10th is about 90,000 pageviews, more in line with the ~100,000 pageviews of July 1st.

I think this explains the issue, let me know otherwise.

You can see this on hive if you look at table pagecounts-all-sites from which projectcounts (according to old definition) are computed.
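
The check described above amounts to splitting a day's per-title counts into HideBanners requests and everything else. A minimal sketch, assuming the per-title counts have already been exported as (title, count) pairs; the function name and the sample numbers are hypothetical, not taken from the real table:

```python
def split_total(title_counts, excluded_prefix="special:hidebanners"):
    """Return (kept, excluded) view totals for an iterable of (title, count)."""
    kept = excluded = 0
    for title, count in title_counts:
        # Case-insensitive match: this thread writes both
        # "hideBanners" and "HideBanners".
        if title.lower().startswith(excluded_prefix):
            excluded += count
        else:
            kept += count
    return kept, excluded


# Hypothetical counts, for illustration only.
rows = [("Main_Page", 500), ("Special:HideBanners", 9000), ("Special:Search", 40)]
kept, excluded = split_total(rows)
print(kept, excluded)
```

Comparing `kept` across days shows whether the remaining traffic is stable once the banner requests are set aside.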

Thanks Nuria, so we're zooming in on what happened. I'm still wondering though, how can we have 56 Special:HideBanners requests for every real page request? Doesn't that seem odd? Would we have to ask ops to explain?

I'm still wondering though, how can we have 56 Special:HideBanners requests for every real page request?

Very odd, but I know next to nothing about how those banners come to be. Do you want to send a note to fundraising engineering? cc @awight in case he can provide some info

https://wikimediafoundation.org/wiki/Template:Hide_banners gets millions of views and generates a dozen requests.
https://wikimediafoundation.org/w/index.php?title=Template%3AHide_banners&type=revision&diff=103891&oldid=98749 should fix the bug, because the full URL doesn't redirect to the wiki/Special:* URL and is not counted (is it?).

I think we need a general audit of all extensions which programmatically call URLs other than the full URL, because that's something that causes all sorts of problems. Grepping the code of all extensions for "wiki/", "?action" and similar can be productive but is not thorough. A search in logs for patterns like overly frequent "wiki/Special:*" URLs and overly frequent "wiki/[^?]+(\?action=.+)" patterns may help find all bugs worth reporting.
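
One way to approach the log search suggested above is to tally request URLs by special page name or `?action=` value and list the most frequent ones. A rough Python sketch; the two regular expressions are adapted from the patterns quoted above, while the function name, the threshold and the sample URLs are assumptions made for illustration:

```python
import re
from collections import Counter

# /wiki/Special:Name, stopping at a subpage slash or a query string.
SPECIAL = re.compile(r"/wiki/(Special:[^/?]+)")
# /wiki/Title?action=value
ACTION = re.compile(r"/wiki/[^?]+\?action=([^&]+)")


def frequent_patterns(urls, min_count=2):
    """Return (pattern, count) pairs seen at least min_count times."""
    counts = Counter()
    for url in urls:
        match = SPECIAL.search(url)
        if match:
            counts[match.group(1)] += 1
            continue
        match = ACTION.search(url)
        if match:
            counts["?action=" + match.group(1)] += 1
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


urls = [
    "http://en.wikinews.org/wiki/Special:HideBanners?duration=1",
    "http://en.wikinews.org/wiki/Special:HideBanners",
    "http://en.wikinews.org/wiki/Special:HideBanners",
    "http://en.wikinews.org/wiki/Foo?action=raw",
    "http://en.wikinews.org/wiki/Bar?action=raw",
    "http://en.wikinews.org/wiki/Main_Page",
]
print(frequent_patterns(urls))
```

On real logs the threshold would need to be set relative to ordinary page traffic, since some special pages are legitimately popular.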

Hi! Special:HideBanners is used to set "hide" cookies on multiple domains when the user clicks on a banner's close button. It's configured here.

56 requests per real page request is too much, though. See, for example, this community banner:

https://meta.wikimedia.org/w/index.php?title=WMF_Resolutions/Replacement_Board_member_2_2007/ja&banner=WMCZ_Czech_WikiConference_2015&uselang=en&force=1

If you click on the close button, you'll get 10 requests to Special:HideBanners (the same number of URLs that are in the config).

Hmmm just to add, the 56:1 ratio could make sense for a smaller project, since it would be called for all the close-button-clicks from larger projects.

Change 250384 had a related patch set uploaded (by Nemo bis):
Use full URL in $wgNoticeHideUrls

https://gerrit.wikimedia.org/r/250384

The current $wgNoticeHideUrls also explains why Wiktionary and Wikivoyage pageviews are more regular: they were not included in the configuration. Any reason not to add them?

The new pageview definition excludes HideBanners based on the MIME type (we should probably amend it slightly to exclude the uri_path too, just to be on the safe side, in case people change things without telling us); I'm not sure why we should be patching MediaWiki proper to resolve issues with the legacy definition. For that to be valuable we'd have to end up working MW around the definition, which is sort of the opposite of what we want.

We have the new definition generating data. We (are about to) have a public API for that data, too. While exploring the discrepancies between the old and new definition's data is useful, forward-maintenance on the legacy definition is probably not, particularly when that involves changes to the actual functioning of the site.
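
To illustrate the kind of double check proposed above (filtering on MIME type and, in addition, on uri_path), here is a deliberately simplified sketch. It is not the actual pageview definition: the function name, the excluded-path list and the text/html test are assumptions made for this example only.

```python
# Illustrative list only; the real definition maintains its own exclusions.
EXCLUDED_PATH_MARKERS = ("special:hidebanners",)


def looks_like_pageview(uri_path, content_type):
    """Toy filter: require an HTML response and a non-excluded path."""
    if not content_type.lower().startswith("text/html"):
        return False
    path = uri_path.lower()
    # Path check as a safety net in case the MIME type ever changes.
    return not any(marker in path for marker in EXCLUDED_PATH_MARKERS)


print(looks_like_pageview("/wiki/Main_Page", "text/html; charset=UTF-8"))
print(looks_like_pageview("/wiki/Special:HideBanners", "text/html"))
```

The point of the second condition is redundancy: a request to an excluded path is rejected even if it comes back with a content type that would otherwise pass.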

Change 250389 had a related patch set uploaded (by OliverKeyes):
Expand the prohibited uri_paths in the Pageview definition.

https://gerrit.wikimedia.org/r/250389

Change 250389 merged by Nuria:
Expand the prohibited uri_paths in the Pageview definition.

https://gerrit.wikimedia.org/r/250389

For the curious: I've made two tasks about limiting requests or improving performance for Special:HideBanners: T117433 T117435. Thanks much for shouting this out! :)

@AndyRussG thanks for chiming in. Now I understand what this is about.

Does it make sense to you that we have this massive overcount on some recent months, but not on others?
Maybe no central notice was shown in those months?

broken_monthly_counts_for_smaller_projects.JPG (725×647 px, 106 KB)

Thanks, Erik

Does it make sense to you that we have this massive overcount on some recent months, but not on others?

In general, yeah, definitely. Although there are almost always some CentralNotice campaigns going on, the number of users seeing banners, and the nature of the banners shown, can vary greatly. I'd be much more surprised if there weren't big swings in the number of users closing banners. (I haven't looked into the specific peaks in the plots, however.)

Dereckson reopened this task as Open. Edited Mar 19 2016, 1:28 AM
Dereckson subscribed.

@Ironholds @Nuria @awight What should we do with the Gerrit change 250384 submitted by @Nemo_bis?

To be perfectly honest it's pretty pointless; the definition doesn't care if a HideBanners request has got a shortlink or not, it excludes it either way.

Resolving since I'm not seeing a "this hasn't been fixed as evidenced by..."

Nemo_bis reopened this task as Open. Edited Mar 19 2016, 7:35 AM
Nemo_bis added a project: Technical-Debt.

Making MediaWiki behave more consistently is certainly a goal.

P.s.: We could file a separate report for that though.

Well, quite. The bug currently says it needs fixing because [analytics concern that is now moot]; I'd suggest opening a separate issue for MediaWiki proper and then going through that gauntlet. AnEng aren't the best people to determine non-analytics MW concerns.

Moved to T130442: Use standard URL index.php?title= ... for background requests.

AnEng's review of the MediaWiki change wasn't needed indeed; it would be useful if you were able to make a list of fake /wiki/ URLs generated or requested often by MediaWiki, but you can file that in another bug if you desire to close this.

AnEng's review of the MediaWiki change wasn't needed indeed;

True. Analytics does not commit much code to mediawiki and while we like to be cc-ed in CRs such as this one we cannot really merge it.