Page MenuHomePhabricator

Count project-wide unique devices (like *.wikipedia.org)
Closed, ResolvedPublic21 Story Points

Description

Hive code to count global unique devices per top domain (like *.wikipedia.org). Initial work will be just quality checking to make sure global counts and per-site counts are in agreement, once that vetting is done we would need to calculate global counts per domain and study what percentage offset/ estimate represent of the total number.

Things to do for deploy:

  • Stop currently running oozie unique_devices_project_wide jobs (daily, monthly and daily-druid)
  • Hive:
    • Create unique_devices_project_wide hive tables (daily and monthly)
    • Move already computed time-partitioned folder structure from /user/joal/wmf/data/wmf/unique_devices/project_wide/ to /wmf/data/wmf/unique_devices/project_wide/
    • Run MSCK repair on both daily and monthly prod tables.
    • drop tables in joal database
  • Archives: Move exisitng project-wide archives to /user/joal, not yet ready for external visibility
  • Restart Oozie jobs with production settings and last-run dates, except for druid-daily that needs to be fully re-run (bug in previous run)
    • Don't forget to setup archive folder to /user/joal

Details

Related Gerrit Patches:
analytics/refinery : masterAdd unique devices project-wide oozie jobs
analytics/refinery/source : masterProvide RedirectToPageview function and UDF
analytics/refinery : masterAdd uniques global jobs and correct uniques

Event Timeline

Nuria renamed this task from Hive code to count global devices to Hive code to count global unique devices per project.Aug 25 2016, 6:55 PM
Nuria created this task.
Nuria moved this task from Incoming to Operational Excellence Future on the Analytics board.
Nuria renamed this task from Hive code to count global unique devices per project to Hive code to count global unique devices per top domain (like *.wikipedia.org).Aug 25 2016, 7:18 PM
Nuria added a comment.EditedFeb 13 2017, 5:02 PM

Let's do first daily counts:

  • code for last access to count global counts on 13 domains
  • early vetting to clear out bugs
  • communicate with reading for 2nd round of vetting
  • can we group the code in any way such "global counts" and "per site" can be calculated on the same pass?
Nuria set the point value for this task to 8.Feb 13 2017, 5:05 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria updated the task description. (Show Details)
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Checked this morning: We haz data !

hourhas global cookienumber of rows
7false185103441
8false200658355
9false163224938
9true46227295
10false79755747
10true136817645

Checked with daily requests, using a modifed version of offset (instead of uri_host splitting, use normalised_host.project_class, as per global cookie definition).
Only project_class having more natural uniques than offset is wikipedia. All others have a lot more offset than uniques, which is not satisfying.

la.project_classuniques_underestimateuniques_offsetuniques_estimateratio underestimate / estimate
mediawiki200482101021418.867000
wikibooks3855129828333683419.955908
wikidata325630208334649.054928
wikimedia987989571719451536.216133
wikimediafoundation2562293883195010.678072
wikinews1052697180239.587219
wikipedia292787181953429048813008133.909145
wikiquote2993224600427593611.755894
wikisource2115116918619033720.086625
wikiversity343352012554459.291238
wikivoyage2963399334289611.404330
wiktionary198795988024118681927.222437
Nuria added a comment.Feb 20 2017, 5:01 PM

I think we probably need to take a second look at this calculation, compare the wikidata numbers with the ones we are already calculating, data below for february.

use wmf;
select month, day,uri_host, max(uniques_offset), max(uniques_underestimate) from last_access_uniques_daily where year=2017 and month=02 and uri_host like '%ikidata%' group by month,day, uri_host.

month day uri_host _c3 _c4
2 1 m.wikidata.org 2090 357
2 1 www.wikidata.org 3321 556
2 2 m.wikidata.org 2219 357
2 2 www.wikidata.org 3208 559
2 3 m.wikidata.org 1866 354
2 3 www.wikidata.org 3009 1212
2 4 m.wikidata.org 2517 360
2 4 www.wikidata.org 2591 307
2 5 m.wikidata.org 2685 327
2 5 www.wikidata.org 2677 479
2 6 m.wikidata.org 2051 318
2 6 www.wikidata.org 3169 631
2 7 m.wikidata.org 2104 320
2 7 www.wikidata.org 3334 468
2 8 m.wikidata.org 2126 360
2 8 www.wikidata.org 3500 1182
2 9 m.wikidata.org 2018 375
2 9 www.wikidata.org 3426 1604
2 10 m.wikidata.org 1865 374
2 10 www.wikidata.org 3157 975
2 11 m.wikidata.org 2158 364
2 11 www.wikidata.org 2594 1316
2 12 m.wikidata.org 2705 352
2 12 www.wikidata.org 2659 553
2 13 m.wikidata.org 2111 335
2 13 www.wikidata.org 3596 1203
2 14 m.wikidata.org 1171 2106
2 14 www.wikidata.org 3576 717
2 15 m.wikidata.org 787 2494
2 15 www.wikidata.org 3330 586
2 16 m.wikidata.org 771 2457
2 16 www.wikidata.org 3309 818
2 17 m.wikidata.org 663 2240
2 17 www.wikidata.org 3100 478
2 18 m.wikidata.org 639 2746
2 18 www.wikidata.org 2389 308
2 19 m.wikidata.org 723 3358
2 19 www.wikidata.org 2582 736

Actually numbers for uniques underestimates are close enough (my computation was for yesterday 2017-02-19, forgot to mention):

  • 3305 using by host method (summing mobile and desktop)
  • 3256 using global method

However it seems computing offset using project_class instead of host doesn' work.

Nuria added a comment.Feb 20 2017, 5:09 PM

Right, offset is "off" (jaja!) but underestimate matches.

I think we probably need to take a second look at this calculation, compare the wikidata numbers with the ones we are already calculating, data below for february.
use wmf;
select month, day,uri_host, max(uniques_offset), max(uniques_underestimate) from last_access_uniques_daily where year=2017 and month=02 and uri_host like '%ikidata%' group by month,day, uri_host;

@Nuria: Using max instead of sum when data is split by country couldn't work :)

use wmf;
select uri_host, **sum**(uniques_offset), **sum**(uniques_underestimate) from last_access_uniques_daily where year=2017 and month=02 and day = 19 and uri_host like '%ikidata%' group by uri_host;

Results:

uri_hostuniques_offsetuniques_understimates
m.wikidata.org395625218
www.wikidata.org264195507
--------------------
sum3037530725
Global cookie numbers302083256

Actually offset work, but not uniques underestimates !!!!

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.Mar 8 2017, 8:04 PM
Nuria renamed this task from Hive code to count global unique devices per top domain (like *.wikipedia.org) to Count global unique devices per top domain (like *.wikipedia.org).Mar 13 2017, 7:31 PM
Nuria updated the task description. (Show Details)
Nuria triaged this task as High priority.Mar 16 2017, 5:19 PM

Change 352099 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add uniques global jobs and correct uniques

https://gerrit.wikimedia.org/r/352099

Change 352181 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add last access uniques global oozie jobs

https://gerrit.wikimedia.org/r/352181

Change 353310 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Provide RedirectToPageview function and UDF

https://gerrit.wikimedia.org/r/353310

Change 352099 abandoned by Joal:
Add uniques global jobs and correct uniques

Reason:
Split in two smaller CR:
https://gerrit.wikimedia.org/r/#/c/352181/
https://gerrit.wikimedia.org/r/#/c/352182/

https://gerrit.wikimedia.org/r/352099

So here's where we are so far:
Successes:

Problem:

  • There still is an issue with numbers: Given that wikidata.org has very few domains, it's possible to compare uniques-per-domain and unique project-wide - And they were different enough for us to stop and investigate (regular difference of ~75% for each days of full month).

Investigation details (one full day in May as dataset):

    • We first have seen that most of the difference was to be accounted for a specific subset rows in the per-domain dataset:
      • on m.wikidata.org domain
      • accepting cookies (nocookies IS NULL)
      • NOT having last-access-per-domain set (x_analytics_map['WMF-Last-Access'] IS NULL]
      • having last-access-project-wide set to today (x_analytics_map['WMF-Last-Access'] == TODAY)
    • We then fingerprinted some of the sessions present the above dataset, and looked at their events in details. It was feeling as if we were missing events, since we couldn't find events (on any of the *.wikidata.org sub-domains) where WMF-Last-Access-Global was NULL or not today, and by definition this cookie is set to today only if it is first seen NULL or before today.
  • Knowing that current per-domain computation uses is_pageview as a filter and these don't account for redirects (http code 301, 302, 307), we thought the issue might come from the cookie being set on a first hit to www.wikidata.org that was redirecting to m.wikidata.org -- This way, both WMF-Last-Access-Global and WMF-Last-Access cookie are set, but on the per-domain one, it's for www.wikidata.org. Since the redirected hit is on m.wikidata.org, uniques computation at the per-domain level counts for this domain and not for the main one (because of redirects being excluded from pageviews). GBut for the global computation, since the redirect hit is removed from computation, we only see the redirected one where cookie is set, therefore we count those hits has being already counted, while they are not.
    • We then looked at sessions including is_redirect_to_pageview events, and the idea explained above showed up as real: users hit first www.wikidata.org, and get redirected to m.wikidata.org.
    • It however allowed us to detect another bug: the first hit (the redirected one) has, as expected, no WMF-Last-Access nor WMF-Last-Access-Global cookie set. However, it has nocookies == 1, which means that the browser doesn't have any other cookie set for the wikidata domain. While this seems reasonable on the HTTP events side, it causes a problem in uniques computation: these specific sessions are not counted in the global cookie computation:
      • Since nocookie is set, it's not counted in the uniques-underestimates.
      • Since the session has more than a single event, it's not counted in the offsets.
      • We tried to count them in the offset side by counting all the sessions there were having exactly one event with nocookies == NULL, but the difference didn't match.
    • The redirect trick also lead to another idea, which is that the uniques counted for m.wikidata.org since nocookie is not NULL since WMF-Last-Access-Global cookie is not null (previsou hit had nocookie is null) are actually an artifact of the new WMF-Last-Access-Global cookie (deployed 2017-02-14) ! We tried to double check that in Pivot, looking at uniques_underestimates and uniques_offset over time for m.wikidata.org, and we found that it actually seems to be the case, but is also related to again another problem occuring 2016-11-10:


While being problematic because of the hole, these 2 charts still allows us to think that the uniques_underestimate + uniques_offset approach is relatively stable !

JAllemandou added a comment.EditedMay 17 2017, 2:46 PM

Another round of explanations, trying to explain my understanding of the issue:

  • First, extract subset ofdata from webrequest
ADD JAR file:///home/joal/code/refinery-source/refinery-hive/target/refinery-hive-0.0.46-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION is_redirect_to_pageview as 'org.wikimedia.analytics.refinery.hive.IsRedirectToPageviewUDF';
DROP TABLE joal.uniques_check;
CREATE TABLE joal.uniques_check AS
SELECT
    LOWER(uri_host) AS uri_host,
    normalized_host.project_class AS project_class,
    x_analytics_map['WMF-Last-Access'] AS last_access,
    x_analytics_map['WMF-Last-Access-Global'] AS last_access_global,
    x_analytics_map['nocookies'] AS nocookies,
    is_pageview,
    is_redirect_to_pageview(uri_host, uri_path, uri_query, http_status, content_type, user_agent, x_analytics) AS is_redirect_to_pageview,
    ip,
    user_agent,
    accept_language
FROM wmf.webrequest
WHERE x_analytics_map IS NOT NULL
    AND agent_type = 'user'
    AND webrequest_source = 'text'
    AND (is_pageview OR is_redirect_to_pageview(uri_host, uri_path, uri_query, http_status, content_type, user_agent, x_analytics))
    AND year = 2017
    AND month = 5
    AND day = 8
    AND lower(uri_host) LIKE '%wikidata.org'
;
  • Triple check data looks good: Use pageviews in pivot
    • Expected pageviews for project wikidata from users on
      • Desktop: 634.4k
      • Mobile (web): 49.1k
SELECT
  uri_host,
  COUNT(1)
FROM joal.uniques_check
WHERE is_pageview
GROUP BY uri_host;

uri_host        _c1
m.wikidata.org  49108
www.wikidata.org        634405

Look good, YAY :)

  • Double checking project_class associated with domain is correct (should be wikidata for every present domain)
SELECT
  uri_host,
  project_class,
  COUNT(1)
FROM joal.uniques_check
GROUP BY uri_host, project_class;

uri_host        project_class   _c2
m.wikidata.org  wikidata        50636
wikidata.org    wikidata        2423
www.wikidata.org        wikidata        643564

Looks correct, forward !

Double checking number of included redirects_to_pageviews (for Global uniques redirect special case):

SELECT
  uri_host,
  COUNT(1)
FROM joal.uniques_check
WHERE is_redirect_to_pageview
GROUP BY uri_host;

m.wikidata.org  1528
wikidata.org    2423
www.wikidata.org        9159

Number is 1 order of magnitude smaller than pageviews, seems acceptable (but difficult to provide more precise quality evaluation)

Checking nocookies validity

SELECT
  nocookies IS NULL as cookie_set,
  last_access IS NOT NULL AS la_set,
  last_access_global IS NOT NULL as lag_set,
  count(1)
FROM joal.uniques_check
GROUP BY
  nocookies IS NULL,
  last_access IS NOT NULL,
  last_access_global IS NOT NULL;

cookie_set      la_set  lag_set _c3
false   false   false   503476        -- Cookies are said not set, and they actually are not - No problem
true    false   false   2883          -- Cookies are said set, but none of LA nor LAG are set - Can be another cookie - No problem
true    false   true    24590         -- Cookies are said set, LAG is set but not LA - Can happen in case of redirects - No problem
true    true    false   354           -- Cookies are said set, LA is set but not LAG - Seems impossible ! -- PROBLEM -- INVESTIGATION: All of those are on old windows browsers for which cookies problem is a known issue. Given the small proportion of those requests, we are not doing anything for those.
true    true    true    165320        -- Cookies are said set, both LA and LAG are set - No problem

Posting this report on data quality, further analysis on the dataset in next comment ...

Another thing we forgot on core data checking: pageview and redirect_to_pageview should not happen conjointly

SELECT
  is_pageview,
  is_redirect_to_pageview,
  COUNT(1)
FROM joal.uniques_check
GROUP BY is_pageview, is_redirect_to_pageview;

is_pageview	is_redirect_to_pageview	_c2
false	true	13110
true	false	683513

Yes, - ok

Some more vetting on uniques computation now:

  • First, last access per domain as it is computed in production:

Expected value from wmf.last_access_uniques_daily table:

SELECT
  uri_host,
  SUM(uniques_underestimate) as uniques_underestimate,
  SUM(uniques_offset) as uniques_offset
FROM wmf.last_access_uniques_daily
WHERE year = 2017
  AND month = 5
  AND day = 8
  AND uri_host LIKE '%wikidata.org'
GROUP BY uri_host;

uri_host        uniques_underestimate   uniques_offset
m.wikidata.org  25350   4159
www.wikidata.org        7198    32474

And the vakues computed (without countries) from the extracted table with the same algoithm:

WITH last_access AS (
SELECT
    uri_host,
    unix_timestamp(last_access, 'dd-MMM-yyyy') AS last_access,
    nocookies,
    ip,
    user_agent,
    accept_language
FROM joal.uniques_check
-- keeping only pageviews, not redirect to pageviews
WHERE is_pageview
),
fresh_sessions_aggregated AS (
    SELECT
        uri_host,
        COUNT(1) AS uniques_offset
    FROM (
        SELECT
            hash(ip, user_agent, accept_language, uri_host) AS id,
            uri_host,
            SUM(CASE WHEN (nocookies IS NOT NULL) THEN 1 ELSE 0 END),
            SUM(CASE WHEN (nocookies IS NULL) THEN 1 ELSE 0 END)
        FROM
            last_access
        GROUP BY
            hash(ip, user_agent, accept_language, uri_host),
            uri_host
        -- Only keeping clients having done
        --    1 event without cookies
        --    0 with cookies (if > 0, already counted with last_access method)
        HAVING SUM(CASE WHEN (nocookies IS NOT NULL) THEN 1 ELSE 0 END) = 1
            AND SUM(CASE WHEN (nocookies IS NULL) THEN 1 ELSE 0 END) = 0
        ) fresh_sessions
    GROUP BY
        uri_host
)
SELECT
    COALESCE(la.uri_host, fresh.uri_host) as uri_host,
    SUM(CASE
        -- Last access not set and client accept cookies --> first visit, count
        WHEN (la.uri_host IS NOT NULL AND last_access IS NULL AND nocookies is NULL) THEN 1
        -- Last access set and date before today --> First visit today, count
        WHEN ((last_access IS NOT NULL)
            AND (last_access < unix_timestamp('2017-05-08', 'yyyy-MM-dd'))) THEN 1
        -- Other cases, don't
        ELSE 0
    END) AS uniques_underestimate,
    COALESCE(fresh.uniques_offset, 0) AS uniques_offset
FROM
    last_access AS la
    FULL OUTER JOIN fresh_sessions_aggregated AS fresh
        ON (la.uri_host = fresh.uri_host)
GROUP BY
    COALESCE(la.uri_host, fresh.uri_host),
    COALESCE(fresh.uniques_offset, 0);


la.uri_host     uniques_underestimate   uniques_offset
m.wikidata.org  25554   4159
www.wikidata.org        7202    32473

Difference is due to a bug in production data computation (see T165661), so we move forward.

  • Computing global uniques
WITH last_access_global AS (
SELECT
    project_class,
    unix_timestamp(last_access_global, 'dd-MMM-yyyy') AS last_access_global,
    nocookies,
    ip,
    user_agent,
    accept_language
FROM joal.uniques_check
WHERE (is_pageview OR is_redirect_to_pageview)
),
fresh_sessions_aggregated AS (
    SELECT
        project_class,
        COUNT(1) AS uniques_offset
    FROM (
        SELECT
            hash(ip, user_agent, accept_language, project_class) AS id,
            project_class,
            SUM(CASE WHEN (nocookies IS NOT NULL) THEN 1 ELSE 0 END),
            SUM(CASE WHEN (nocookies IS NULL) THEN 1 ELSE 0 END)
        FROM
            last_access_global
        GROUP BY
            hash(ip, user_agent, accept_language, project_class),
            project_class
        -- Only keeping clients having done
        --    1 event without cookies
        --    0 with cookies (if > 0, already counted with last_access method)
        HAVING SUM(CASE WHEN (nocookies IS NOT NULL) THEN 1 ELSE 0 END) = 1
            AND SUM(CASE WHEN (nocookies IS NULL) THEN 1 ELSE 0 END) = 0
        ) fresh_sessions
    GROUP BY
        project_class
)
SELECT
    COALESCE(la.project_class, fresh.project_class) AS project_class,
    SUM(CASE
        -- Last access not set and client accept cookies --> first visit, count
        WHEN (la.project_class IS NOT NULL AND last_access_global IS NULL AND nocookies is NULL) THEN 1
        -- Last access set and date before today --> First visit today, count
        WHEN ((last_access_global IS NOT NULL)
            AND (last_access_global < unix_timestamp('2017-05-08', 'yyyy-MM-dd'))) THEN 1
        -- Other cases, don't
        ELSE 0
    END) AS uniques_underestimate,
    COALESCE(fresh.uniques_offset, 0) AS uniques_offset
FROM
    last_access_global AS la
    FULL OUTER JOIN fresh_sessions_aggregated AS fresh
        ON (la.project_class = fresh.project_class)
GROUP BY
    COALESCE(la.project_class, fresh.project_class),
    COALESCE(fresh.uniques_offset, 0);


project_class	uniques_underestimate	uniques_offset
wikidata	8086	36587

As experienced in previous experimentations:

  • the offset number is very similar betweern global uniques and the sum of per domain.
  • the underestimate number however is very smaller in global than the sum in per domain.

Some deeper view of the underestimate computation (since difference happens on that):

SUM(CASE
        -- Last access not set and client accept cookies --> first visit, count
        WHEN (la.project_class IS NOT NULL AND last_access_global IS NULL AND nocookies is NULL) THEN 1
        -- Last access set and date before today --> First visit today, count
        WHEN ((last_access_global IS NOT NULL)
            AND (last_access_global < unix_timestamp('2017-05-08', 'yyyy-MM-dd'))) THEN 1
        -- Other cases, don't
        ELSE 0
    END) AS uniques_underestimate,

The underestimate is computed out of 2 parts:

  • nocookies IS NULL (meaning cookies are present, preventing counting bots that don't accept cookies) and last_access_global IS NULL (since LAG is not set, it is the first visit for this user
  • last_access_global IS NOT NULL (it has a value), in that case the last-access-global date should be before today, if not, it's not the first time this user comes.

Let's split the underestimate computation based on nocookies and last_access being null:

  • For per-domain first
WITH last_access AS (
SELECT
    uri_host,
    unix_timestamp(last_access, 'dd-MMM-yyyy') AS last_access,
    nocookies
FROM joal.uniques_check
-- keeping only pageviews, not redirect to pageviews
WHERE is_pageview
)
SELECT
    uri_host,
    (nocookies IS NULL) as cookies_set,
    (last_access IS NOT NULL) as last_access_set,
    SUM(CASE
        -- Last access not set and client accept cookies --> first visit, count
        WHEN (last_access IS NULL AND nocookies is NULL) THEN 1
        -- Last access set and date before today --> First visit today, count
        WHEN ((last_access IS NOT NULL)
            AND (last_access < unix_timestamp('2017-05-08', 'yyyy-MM-dd'))) THEN 1
        -- Other cases, don't
        ELSE 0
    END) AS uniques_underestimate
FROM last_access
GROUP BY
    uri_host,
    (nocookies IS NULL),
    (last_access IS NOT NULL);

m.wikidata.org	false	false	0
m.wikidata.org	true	false	23600
m.wikidata.org	true	true	1954
www.wikidata.org	false	false	0
www.wikidata.org	true	false	2861
www.wikidata.org	true	true	4341
  • Global then:
WITH last_access_global AS (
SELECT
    project_class,
    unix_timestamp(last_access_global, 'dd-MMM-yyyy') AS last_access_global,
    nocookies
FROM joal.uniques_check
WHERE is_pageview or is_redirect_to_pageview
)
SELECT
    project_class,
    (nocookies IS NULL) as cookies_set,
    (last_access_global IS NOT NULL) as last_access_global_set,
    SUM(CASE
        -- Last access not set and client accept cookies --> first visit, count
        WHEN (last_access_global IS NULL AND nocookies is NULL) THEN 1
        -- Last access set and date before today --> First visit today, count
        WHEN ((last_access_global IS NOT NULL)
            AND (last_access_global < unix_timestamp('2017-05-08', 'yyyy-MM-dd'))) THEN 1
        -- Other cases, don't
        ELSE 0
    END) AS uniques_underestimate
FROM last_access_global
GROUP BY
    project_class,
    (nocookies IS NULL),
    (last_access_global IS NOT NULL);

wikidata	false	false	0
wikidata	true	false	3237
wikidata	true	true	4849
  • Splitting global by uri_host, in order to try to try to better understand the diff with per-domain:
WITH last_access_global AS (
SELECT
    uri_host,
    unix_timestamp(last_access_global, 'dd-MMM-yyyy') AS last_access_global,
    nocookies
FROM joal.uniques_check
WHERE is_pageview or is_redirect_to_pageview
)
SELECT
    uri_host,
    (nocookies IS NULL) as cookies_set,
    (last_access_global IS NOT NULL) as last_access_global_set,
    SUM(CASE
        -- Last access not set and client accept cookies --> first visit, count
        WHEN (last_access_global IS NULL AND nocookies is NULL) THEN 1
        -- Last access set and date before today --> First visit today, count
        WHEN ((last_access_global IS NOT NULL)
            AND (last_access_global < unix_timestamp('2017-05-08', 'yyyy-MM-dd'))) THEN 1
        -- Other cases, don't
        ELSE 0
    END) AS uniques_underestimate
FROM last_access_global
GROUP BY
    uri_host,
    (nocookies IS NULL),
    (last_access_global IS NOT NULL);

uri_host	cookies_set	last_access_global_set	uniques_underestimate
m.wikidata.org	false	false	0
m.wikidata.org	true	false	38
m.wikidata.org	true	true	622
wikidata.org	false	false	0
wikidata.org	true	false	2
wikidata.org	true	true	13
www.wikidata.org	false	false	0
www.wikidata.org	true	false	3197
www.wikidata.org	true	true	4214

(It's nice to see that the sum of per domain equals the overall :)

What we get out of that:

  • The redirect moves some of m.wikidata.org to www.wikidata.org
  • There is a big difference on m.wikidata.org for rows having cookies set (nocookie is null) and last-access not set (last-access is null): about 23k uniques for a day.
    • We suspected that this number was due to redirects from www. to m. - In this case, the number would have been a transfer from offset for global uniques, to underestimates for per-domain uniques (due to last-access-global cookie being set at redirect time) - But, those numbers don't match.
    • We investigated looking at fingerprinting-hashes, and something bizarre appears: It seems we can't find no raws for initialisation of the Last-Access-Global cookie for our day:
SELECT
  dt,
  http_status,
  x_analytics_map['WMF-Last-Access-Global'] as last_access_global,
  x_analytics_map['WMF-Last-Access'] as last_access
FROM wmf.webrequest
WHERE year = 2017
  AND month = 5
  AND ((day = 7 and hour = 23) OR (day = 8 and hour = 0))
  AND webrequest_source = 'text'
  AND x_analytics_map IS NOT NULL
  AND uri_host like '%wikidata%'
  AND hash(ip, user_agent) = 238533050
ORDER BY dt ASC, http_status DESC
LIMIT 1000;

dt	http_status	last_access_global	last_access
2017-05-07T23:42:46	302	07-May-2017	07-May-2017
2017-05-07T23:42:46	200	07-May-2017	NULL
2017-05-08T00:08:50	302	08-May-2017	08-May-2017
2017-05-08T00:08:50	200	08-May-2017	NULL
2017-05-08T00:17:40	302	08-May-2017	08-May-2017
2017-05-08T00:17:41	200	08-May-2017	NULL
2017-05-08T00:27:50	302	08-May-2017	08-May-2017
2017-05-08T00:27:51	200	08-May-2017	NULL

In the previous table we can see that the first raw for 2017-05-08 already has cookies set to 08-May-2017 ... WEIRD !!!

Change 353310 abandoned by Nuria:
Provide RedirectToPageview function and UDF

https://gerrit.wikimedia.org/r/353310

Nuria added a comment.May 30 2017, 6:32 PM

@joal: are there any updates we want to add here after our findings about content-type?

@joal: are there any updates we want to add here after our findings about content-type?

There are @Nuria!
No doc has been writen yet.
I will write a new comment for that task now, and let's also discuss how we want to document the differences between per-domain and project-wide on wikitech.

(Hopefully) last comment in that thread: We have found the problem.
Thanks a lot @Nuria for all the bike-shedding and data vetting!

TL;DR: The difference we experienced on wikidata between per-domain and project-wide is an artifact of the computation. It is due to redirects originally not being counted.

More precisely, 3 changes were needed for the artifact to disappear:

  • Add redirects-to-pageviews to the data used for project-wide computation, which means pageviews but with http response codes 301, 302 or 307.
  • Don't filter out redirects which response content-type is undefined (this one took me long to find...) - For a page to be considered pageview, we expect its response content-type to be set (html + optional charset). However it is very natural for a redirect to have undefined response content-type.
  • Finally update offset computation to count in fingerprinted sessions having a single event WITHOUT last-access cookie and multiple events WITH last-access cookie. This can be considered a bug of the current per-domain computation: for the moment we only count fingerprinted sessions having a single hit WITHOUT last-access cookie, and NO event WITH last-access cookie. However, session having a single event without last-access cookie and some events with last-access cookie are not counted in the underestimate since the cookie value set on the events with cookie is by construction the same day of the first hit without cookie. Therefore the offset we count in per-domain is currently under-counting (by ~10%), and this needs to be changed in the project-wide version for the redirect artifact to disappear. I suggest that while we are correcting this for global, we should also consider correcting the per-domain one.

Change 356125 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/refinery/source@master] Provide RedirectToPageview function and UDF

https://gerrit.wikimedia.org/r/356125

Nuria changed the point value for this task from 8 to 21.May 31 2017, 8:27 PM

Change 352181 merged by Mforns:
[analytics/refinery@master] Add unique devices project-wide oozie jobs

https://gerrit.wikimedia.org/r/352181

Nuria renamed this task from Count global unique devices per top domain (like *.wikipedia.org) to Count project-wide unique devices (like *.wikipedia.org).Jun 5 2017, 4:34 PM
JAllemandou updated the task description. (Show Details)Jun 9 2017, 1:10 PM
JAllemandou updated the task description. (Show Details)Jun 9 2017, 6:45 PM
JAllemandou updated the task description. (Show Details)Jun 12 2017, 7:07 PM
JAllemandou moved this task from Ready to Deploy to Done on the Analytics-Kanban board.
Nuria closed this task as Resolved.Jun 27 2017, 6:29 PM