Pagecounts all sites data issues
Closed, Resolved · Public · 5 Story Points

Description

Pagecounts all sites data issues.

  • Huge spike on meta pagecounts that looks "not real" (see attached screenshot)
  • Data for wikidata not there?

It seems wikidata is not queryable; we have likely stored the counts under the wrong site name:

https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/www.wikidata.org/all-sites/monthly/2015010918/2015040100
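
To try the same request with other project names, the URL can be assembled from its parts. This is only a sketch: the path layout (project / access-site / granularity / start / end) is inferred from the URL above, and `pagecounts_url` is a hypothetical helper, not part of any client library.

```python
from urllib.parse import quote

# Base path taken from the URL in this task.
BASE = "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate"


def pagecounts_url(project, access_site, granularity, start, end):
    """Build a legacy pagecounts URL; timestamps are YYYYMMDDHH strings."""
    parts = [project, access_site, granularity, start, end]
    # Percent-encode each segment so unexpected characters cannot break the path.
    return BASE + "/" + "/".join(quote(p, safe="") for p in parts)


# The request from the task description:
print(pagecounts_url("www.wikidata.org", "all-sites", "monthly",
                     "2015010918", "2015040100"))
```

Swapping the first argument for another candidate project name makes it easy to compare which spelling the API actually resolves.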

Event Timeline

Nuria created this task. Apr 4 2017, 3:19 PM
Restricted Application added a subscriber: Aklapper. Apr 4 2017, 3:19 PM
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. Edited · Apr 4 2017, 8:29 PM

Data for wikidata is in Cassandra, but it is not fetchable through AQS. Querying with project='www.wikidata' returns records:

cassandra@cqlsh> select * from "local_group_default_T_lgc_pagecounts_per_project".data where "_domain"='analytics.wikimedia.org' and "access-site" in ('desktop-site','mobile-site','all-sites') and granularity in ('hourly', 'daily', 'monthly') and project='www.wikidata' and timestamp ='2016010800';

 _domain                 | project      | access-site  | granularity | timestamp  | _tid                                 | _del | count
-------------------------+--------------+--------------+-------------+------------+--------------------------------------+------+---------
 analytics.wikimedia.org | www.wikidata | all-sites    | daily       | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 4943451
 analytics.wikimedia.org | www.wikidata | all-sites    | hourly      | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null |  188718
 analytics.wikimedia.org | www.wikidata | desktop-site | daily       | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 4803906
 analytics.wikimedia.org | www.wikidata | desktop-site | hourly      | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null |  182718
 analytics.wikimedia.org | www.wikidata | mobile-site  | daily       | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null |  139545
 analytics.wikimedia.org | www.wikidata | mobile-site  | hourly      | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null |    6000

So something is going on with AQS parsing, besides the fact that wikidata is persisted as "www.wikidata" when it should be "wikidata". For comparison, the pageviews table stores the project as "wikidata":

cassandra@cqlsh> select * from "local_group_default_T_pageviews_per_project".data where "_domain"='analytics.wikimedia.org' and "access" in ('desktop','mobile','all') and granularity in ('hourly', 'daily', 'monthly') and project='wikidata' and timestamp ='2016010800' and agent='user';

 _domain                 | project  | access  | agent | granularity | timestamp  | _tid                                 | _del | v      | views
-------------------------+----------+---------+-------+-------------+------------+--------------------------------------+------+--------+-------
 analytics.wikimedia.org | wikidata | desktop | user  | daily       | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 283772 | null
 analytics.wikimedia.org | wikidata | desktop | user  | hourly      | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null |   6947 | null

Nuria added a comment. Edited · Apr 4 2017, 8:43 PM

Pageviews on meta are definitely not real; see wikistats, where that interval is removed: https://stats.wikimedia.org/wikispecial/EN/ReportCardTopWikis.htm#lang_meta

Nuria added a comment. Edited · Apr 4 2017, 8:46 PM

mmm.. is this right?

hive (wmf)> select * from domain_abbrev_map where hostname like '%meta%';
OK
domain_abbrev_map.domain_abbrev  domain_abbrev_map.hostname  domain_abbrev_map.access_site
meta.m                           meta.wikimedia.org          desktop
meta.m.m                         meta.wikimedia.org          mobile
meta.zero.m                      meta.wikimedia.org          zero

Wikidata also seems to have issues, but the map seems OK:

www.wd   www.wikidata.org  desktop
m.wd     www.wikidata.org  mobile
zero.wd  www.wikidata.org  zero
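
To make the abbreviation scheme easier to read, here is a small lookup sketch built only from the rows shown in the hive output in this comment. The dict and the `resolve` function are hypothetical names for illustration; the real mapping lives in the domain_abbrev_map table.

```python
# Rows copied from the domain_abbrev_map output in this comment:
# abbreviation -> (hostname, access_site)
ABBREV_MAP = {
    "meta.m": ("meta.wikimedia.org", "desktop"),
    "meta.m.m": ("meta.wikimedia.org", "mobile"),
    "meta.zero.m": ("meta.wikimedia.org", "zero"),
    "www.wd": ("www.wikidata.org", "desktop"),
    "m.wd": ("www.wikidata.org", "mobile"),
    "zero.wd": ("www.wikidata.org", "zero"),
}


def resolve(abbrev):
    """Return (hostname, access_site) for an abbreviation, or None if unknown."""
    return ABBREV_MAP.get(abbrev)


# The trailing ".m" in "meta.m" is part of the wikimedia.org project code,
# not a mobile marker, so this abbreviation maps to the desktop site.
print(resolve("meta.m"))
```

The point of the sketch is that "meta.m" is easy to misread as mobile, while the map says it is desktop.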

Nuria added a comment. Edited · Apr 5 2017, 12:20 AM

Data from projectcounts_raw: note that the meta.m and meta.mw sites both have spikes, but the one present on the reportcard lines up with the data for meta.m, which according to our abbreviation scheme is the meta desktop domain.

The spike is "real" in the table, so I guess we need to dig through the files to see where the mistranslation is happening.

Nuria added a comment. Edited · Apr 6 2017, 6:08 PM

At the time of loading:
www.wikidata.org needs to be changed to wikidata
www.mediawiki.org needs to be changed to mediawiki

The JavaScript code also needs to be corrected (I think) so those two are parsed correctly.
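
A minimal sketch of the load-time trimming described in this comment, assuming the rule is simply "drop a trailing .org and a leading www.". The function name `normalize_project` is hypothetical, and this is not the code from the Gerrit change.

```python
def normalize_project(hostname):
    """Trim a hostname to the project name used as the storage key.

    Assumed rule (illustration only): strip a trailing ".org",
    then strip a leading "www.".
    """
    project = hostname.strip().lower()
    if project.endswith(".org"):
        project = project[: -len(".org")]
    if project.startswith("www."):
        project = project[len("www."):]
    return project


for host in ("www.wikidata.org", "www.mediawiki.org", "meta.wikimedia.org"):
    print(host, "->", normalize_project(host))
```

Under this assumption, trimming only the ".org" suffix would leave "www.wikidata", which matches what was found in the table.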

Change 346802 had a related patch set uploaded (by Nuria):
[analytics/refinery@master] Corrected triming of hostname

https://gerrit.wikimedia.org/r/346802

Nuria added a comment. Apr 6 2017, 10:30 PM

Reloaded data (took about 2 hours) and now the wikidata project column looks correct in the db:

cassandra@cqlsh> select * from "local_group_default_T_lgc_pagecounts_per_project".data where "_domain"='analytics.wikimedia.org' and "access-site" in ('desktop-site','mobile-site','all-sites') and granularity in ('hourly', 'daily', 'monthly') and project='wikidata' and timestamp ='2016010800';

 _domain                 | project  | access-site  | granularity | timestamp  | _tid                                 | _del | count
-------------------------+----------+--------------+-------------+------------+--------------------------------------+------+---------
 analytics.wikimedia.org | wikidata | all-sites    | daily       | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 4943451
 analytics.wikimedia.org | wikidata | all-sites    | hourly      | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null |  188718
 analytics.wikimedia.org | wikidata | desktop-site | daily       | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 4803906
 analytics.wikimedia.org | wikidata | desktop-site | hourly      | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null |  182718
 analytics.wikimedia.org | wikidata | mobile-site  | daily       | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null |  139545
 analytics.wikimedia.org | wikidata | mobile-site  | hourly      | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null |    6000

Nuria moved this task from Done to Ready to Deploy on the Analytics-Kanban board. Apr 7 2017, 7:26 PM

Change 346802 merged by Nuria:
[analytics/refinery@master] Correcting loading of pagecounts into cassandra

https://gerrit.wikimedia.org/r/346802

Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Apr 11 2017, 3:02 PM
Nuria set the point value for this task to 5. Apr 11 2017, 3:06 PM
ezachte added a subscriber: ezachte. Edited · Apr 11 2017, 4:25 PM

I found this in DammitSummarizeProjectviews.pl:

# quick fix: fake counts for meta for a period where we had > 10 billion hits on meta due to fundraiser artefact, all to wiki/Special:RecordImpression
if (($language =~ /^meta/) && (($yyyymm ge '201208') && ($yyyymm le '201504')))
{
  next if $language =~ /^meta\./ ; # ignore mobile and zero, set fixed count for desktop
  $count = sprintf ("%.0f",7000000 / (24 * 30)) ;
}
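
For readers less familiar with Perl, the same rule can be sketched in Python. This is a rough translation for illustration, not code from wikistats; `adjusted_count` is a hypothetical name, and returning None stands in for Perl's `next` (skip the record).

```python
import re


def adjusted_count(language, yyyymm, count):
    """Mirror the quick fix: fixed hourly count for meta, 2012-08 to 2015-04."""
    if language.startswith("meta") and "201208" <= yyyymm <= "201504":
        if re.match(r"meta\.", language):
            return None  # mobile and zero variants are skipped entirely
        # 7,000,000 spread over 24 hours * 30 days, rounded like "%.0f"
        return int("%.0f" % (7000000 / (24 * 30)))
    return count  # outside the window, or not meta: leave the count untouched


print(adjusted_count("meta", "201301", 10_000_000_000))
```

So inside that window wikistats reports a flat value of about 9722 per hour for meta desktop, which would explain why the spike does not appear there.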

Nuria closed this task as Resolved. Apr 11 2017, 11:48 PM