Pagecounts all sites data issues.
- Huge spike on meta pagecounts that looks "not real" (see attached screenshot)
- data for wikidata not there?
It seems wikidata is not queryable; we have likely stored the counts under the wrong site name.
| • Nuria | |
| Apr 4 2017, 3:19 PM |
| F7226405: Screen Shot 2017-04-04 at 5.16.07 PM.png | |
| Apr 5 2017, 12:20 AM |
| F7226417: Screen Shot 2017-04-04 at 5.15.44 PM.png | |
| Apr 5 2017, 12:20 AM |
| F7226476: Screen Shot 2017-04-04 at 5.19.18 PM.png | |
| Apr 5 2017, 12:20 AM |
| F7216784: Screen Shot 2017-04-04 at 11.19.29 AM.png | |
| Apr 4 2017, 6:20 PM |
| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| Correcting loading of pagecounts into cassandra | analytics/refinery | master | +4 -3 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | chasemp | T146308 Kill limn1 |
| Resolved | mforns | T126358 Migrate the simplest limn dashboards to dashiki tabular {frog} |
| Resolved | JAllemandou | T126767 Make Unique Devices dataset public {mole} |
| Resolved | mforns | T130117 Move reportcard to dashiki and new datasources |
| Resolved | mforns | T156388 Populate aqs with legacy page-counts |
| Resolved | Nuria | T162157 Pagecounts all sites data issues |
Data for wikidata is in Cassandra, but it is not fetchable under the expected project name. Querying for project='www.wikidata' returns records:
cassandra@cqlsh> select * from "local_group_default_T_lgc_pagecounts_per_project".data where "_domain"='analytics.wikimedia.org' and "access-site" in ('desktop-site','mobile-site','all-sites') and granularity in ('hourly', 'daily', 'monthly') and project='www.wikidata' and timestamp ='2016010800';
_domain | project | access-site | granularity | timestamp | _tid | _del | count
-------------------------+--------------+--------------+-------------+------------+--------------------------------------+------+---------
analytics.wikimedia.org | www.wikidata | all-sites | daily | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 4943451
analytics.wikimedia.org | www.wikidata | all-sites | hourly | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 188718
analytics.wikimedia.org | www.wikidata | desktop-site | daily | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 4803906
analytics.wikimedia.org | www.wikidata | desktop-site | hourly | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 182718
analytics.wikimedia.org | www.wikidata | mobile-site | daily | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 139545
analytics.wikimedia.org | www.wikidata | mobile-site | hourly | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 6000
So something is going on with AQS parsing, besides the fact that wikidata is persisted as "www.wikidata" when it should be "wikidata". For comparison, the pageviews table stores it as "wikidata":
cassandra@cqlsh> select * from "local_group_default_T_pageviews_per_project".data where "_domain"='analytics.wikimedia.org' and "access" in ('desktop','mobile','all') and granularity in ('hourly', 'daily', 'monthly') and project='wikidata' and timestamp ='2016010800' and agent='user';
_domain | project | access | agent | granularity | timestamp | _tid | _del | v | views
-------------------------+----------+---------+-------+-------------+------------+--------------------------------------+------+--------+-------
analytics.wikimedia.org | wikidata | desktop | user | daily | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 283772 | null
analytics.wikimedia.org | wikidata | desktop | user | hourly | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 6947 | null
The pageviews spike on meta is definitely not real. See wikistats, where that interval has been removed: https://stats.wikimedia.org/wikispecial/EN/ReportCardTopWikis.htm#lang_meta
mmm.. is this right?
hive (wmf)> select * from domain_abbrev_map where hostname like '%meta%';
OK
domain_abbrev_map.domain_abbrev domain_abbrev_map.hostname domain_abbrev_map.access_site
meta.m meta.wikimedia.org desktop
meta.m.m meta.wikimedia.org mobile
meta.zero.m meta.wikimedia.org zero
Also wikidata seems to have issues but the map seems ok:
www.wd www.wikidata.org desktop
m.wd www.wikidata.org mobile
zero.wd www.wikidata.org zero
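To make the abbreviation scheme easier to eyeball, here is a small Python sketch that holds only the six rows shown above in a dictionary and resolves an abbreviation to its hostname and access site. The names `DOMAIN_ABBREV_MAP` and `resolve` are made up for illustration; the real mapping lives in the hive table `domain_abbrev_map`.

```python
from typing import Optional, Tuple

# Only the rows quoted in this task; the real hive table has many more.
DOMAIN_ABBREV_MAP = {
    "meta.m": ("meta.wikimedia.org", "desktop"),
    "meta.m.m": ("meta.wikimedia.org", "mobile"),
    "meta.zero.m": ("meta.wikimedia.org", "zero"),
    "www.wd": ("www.wikidata.org", "desktop"),
    "m.wd": ("www.wikidata.org", "mobile"),
    "zero.wd": ("www.wikidata.org", "zero"),
}


def resolve(abbrev: str) -> Optional[Tuple[str, str]]:
    """Return (hostname, access_site) for an abbreviation, or None if unknown."""
    return DOMAIN_ABBREV_MAP.get(abbrev)


if __name__ == "__main__":
    # "meta.m" looks like a mobile code but maps to the desktop site.
    print(resolve("meta.m"))
```

Note that the trailing ".m" in "meta.m" is the project suffix, not a mobile marker, which is easy to misread.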
Data from projectcounts_raw: both the meta.m and meta.mw sites have spikes, but the one shown on reportcard lines up with the data for meta.m, which according to our abbreviation scheme is the meta desktop domain.
The spike is "real" in the table, so I guess we need to dig through the files to see where the mistranslation is happening.
At the time of loading:
www.wikidata.org needs to be changed to wikidata
www.mediawiki.org needs to be changed to mediawiki
And the code needs to be corrected (I think) in the JavaScript so that those two are parsed correctly.
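A minimal Python sketch of the trimming described above, assuming the rule is simply "drop a leading `www.` and a trailing `.org`". This is not the actual refinery patch; the function name `hostname_to_project` is hypothetical, and the real loader may handle more cases.

```python
def hostname_to_project(hostname: str) -> str:
    """Turn a hostname into the project name expected by the API.

    Assumed rule (illustration only): strip a leading "www." and a
    trailing ".org", leaving everything else untouched.
    """
    project = hostname.strip().lower()
    if project.startswith("www."):
        project = project[len("www."):]
    if project.endswith(".org"):
        project = project[: -len(".org")]
    return project


if __name__ == "__main__":
    for host in ("www.wikidata.org", "www.mediawiki.org", "meta.wikimedia.org"):
        print(host, "->", hostname_to_project(host))
```

With this rule the two hosts called out above become "wikidata" and "mediawiki", while a host without the `www.` prefix only loses its `.org` suffix.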
Change 346802 had a related patch set uploaded (by Nuria):
[analytics/refinery@master] Corrected triming of hostname
Reloaded the data (took about 2 hours) and now the wikidata project column looks correct in the db:
cassandra@cqlsh> select * from "local_group_default_T_lgc_pagecounts_per_project".data where "_domain"='analytics.wikimedia.org' and "access-site" in ('desktop-site','mobile-site','all-sites') and granularity in ('hourly', 'daily', 'monthly') and project='wikidata' and timestamp ='2016010800';
_domain | project | access-site | granularity | timestamp | _tid | _del | count
-------------------------+----------+--------------+-------------+------------+--------------------------------------+------+---------
analytics.wikimedia.org | wikidata | all-sites | daily | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 4943451
analytics.wikimedia.org | wikidata | all-sites | hourly | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 188718
analytics.wikimedia.org | wikidata | desktop-site | daily | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 4803906
analytics.wikimedia.org | wikidata | desktop-site | hourly | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 182718
analytics.wikimedia.org | wikidata | mobile-site | daily | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 139545
analytics.wikimedia.org | wikidata | mobile-site | hourly | 2016010800 | 13814000-1dd2-11b2-8080-808080808080 | null | 6000
Change 346802 merged by Nuria:
[analytics/refinery@master] Correcting loading of pagecounts into cassandra
I found this in DammitSummarizeProjectviews.pl:
if (($language =~ /^meta/) && (($yyyymm ge '201208') && ($yyyymm le '201504')))
{
next if $language =~ /^meta\./ ; # ignore mobile and zero, set fixed count for desktop
$count = sprintf ("%.0f",7000000 / (24 * 30)) ;}
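For readers who don't speak Perl, here is a rough Python transliteration of that fragment, to show what it does to the meta counts. The function name `adjust_meta_count` is made up, returning `None` stands in for Perl's `next`, and reading 7000000 as a monthly total spread over 24 * 30 hours is my interpretation of the arithmetic, not something stated in the script.

```python
import re
from typing import Optional

FIXED_TOTAL = 7_000_000  # constant taken from the Perl fragment


def adjust_meta_count(language: str, yyyymm: str, count: int) -> Optional[int]:
    """Mimic the meta override: returns None where the Perl code skips the row."""
    in_window = "201208" <= yyyymm <= "201504"  # string compare, like Perl ge/le
    if re.match(r"^meta", language) and in_window:
        if re.match(r"^meta\.", language):
            return None  # mobile and zero rows are ignored
        # Desktop gets a fixed hourly count, rounded like sprintf("%.0f", ...)
        return int(f"{FIXED_TOTAL / (24 * 30):.0f}")
    return count  # everything else passes through unchanged


if __name__ == "__main__":
    print(adjust_meta_count("meta", "201301", 123456))
    print(adjust_meta_count("meta.m", "201301", 500))
    print(adjust_meta_count("meta", "201505", 500))
```

So wikistats replaces the desktop meta counts in that window with a constant, which would explain why the spike does not show up there.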