## Tracking anonymous users
To track user retention we need to somehow identify which events come from the same user. We cannot use the client pageToken or sessionToken, as those stay in the browser only for the current pageview or session. When a user enables AMC, we could store some unique identifier in local storage and then send it with every opt-in/opt-out request (it would have to be passed to the server on the Special:MobileOptions page). But that value might be identifying, and we don't want to assign any identifiers to users.
Just to avoid confusion, this sentence refers to *anonymous* editors (we do of course assign identifiers to logged-in users, namely their public user name and ID).
Instead, we can store the date of the last AMC opt-in/opt-out in local storage. When a user opts in for the first time, we send the event with lastActionDate=null and store the current date in local storage. Then, on every subsequent opt-in/opt-out, we send lastActionDate=localStorage.get('amc.lastactiondate') with the event and overwrite amc.lastactiondate in local storage with the current date.
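A minimal sketch of this flow; `logEvent` is a hypothetical stand-in for the EventLogging call, the key name `amc.lastactiondate` follows the description above, and `storage` is passed in for testability (in the browser it would be `window.localStorage`):

```javascript
// Sketch of the lastActionDate scheme: send the previous action date
// with each event, then overwrite it with the current date.
const STORAGE_KEY = 'amc.lastactiondate';

function recordOptInOut( action, logEvent, storage ) {
	// null on the very first opt-in, otherwise the date of the previous action
	const lastActionDate = storage.getItem( STORAGE_KEY );
	logEvent( { action: action, lastActionDate: lastActionDate } );
	// Overwrite with the current date (YYYY-MM-DD) for the next event
	storage.setItem( STORAGE_KEY, new Date().toISOString().slice( 0, 10 ) );
}
```

No identifier is ever stored, only a date, which is the point of the scheme.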
Each event will then carry the current date and the date of the last action, which should allow us to track chains of events (when a given browser opted in/out). Checking the retention rate for anonymous users is going to be difficult for the analyst (it requires a query that takes these dates into consideration), but it's possible.
It's not terribly difficult per se, assuming that every opt-out event comes with the date of the preceding opt-in. But the resulting data is going to be more brittle than for logged-in users, for example because we have no way to distinguish between retained anonymous users and those who lost their cookie/amc.lastactiondate value and opted in again with lastActionDate=null.
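For illustration, chaining the events back together might look roughly like this (a sketch; the field names `date` and `lastActionDate` are assumptions, and two browsers acting on the same date would be conflated here, which is part of the brittleness described above):

```javascript
// Sketch: group opt-in/opt-out events into per-browser chains by matching
// each event's lastActionDate to the date of an earlier event.
function buildChains( events ) {
	const chains = [];            // each chain is an array of events
	const openChains = new Map(); // date of a chain's last event -> chain
	for ( const event of events ) {
		const chain = event.lastActionDate !== null &&
			openChains.get( event.lastActionDate );
		if ( chain ) {
			openChains.delete( event.lastActionDate );
			chain.push( event );
			openChains.set( event.date, chain );
		} else {
			// lastActionDate === null: either a genuine first opt-in or a
			// browser that lost its localStorage value - indistinguishable,
			// so both start a new chain.
			const newChain = [ event ];
			chains.push( newChain );
			openChains.set( event.date, newChain );
		}
	}
	return chains;
}
```

In practice the analyst would express this as a self-join on dates in SQL/Hive rather than in JS; the logic is the same.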
Can events logged on the server AND on the client be tied together? For example, if I logged on the server that a user visits the mobile options page, and the user then makes a change on the client, can we recognize in the logs that both events are from the same user?
Yes, events can be logged on both sides, but I'm not sure whether it's possible to identify that both events (server-side and JS) come from the same user, as we try to keep events non-identifying.
To clarify, the PrefUpdate schema does log the user ID (see documentation). (@Niedzielski , by "makes a change on the client", did you refer to making an edit to a page, or were you talking about a hypothetical new schema logging preference changes on the client side?)
Related (but still quite expandable) documentation: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines
Seems this is also affecting non-iOS EventLogging schemas, e.g. ReadingDepth and NavigationTiming.
Wed, Dec 12
@pmiazga and I discussed various aspects of this today, he is going to write up some things here, and I will follow up with other details. But to note one thing already as a direct followup on today's meeting:
Tue, Dec 11
Mon, Dec 10
Is the opt-in/out status going to be stored in the user preferences (for logged-in users)? In that case we could first look at what the existing PrefUpdate schema can give us.
Fri, Dec 7
Cool! I'll close this for now; might reopen it in case I get to look at 3. above later.
See also the observations in T195880 (note: "none" != "unknown")
See also T211077 (TLDR: it looks like a lot of formerly "unknown" referrers on Chrome Mobile are now, since around September 13, classified as "external (search engine)")
I added an entry about this to the log at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16
Thu, Dec 6
Wed, Dec 5
Here is a quick, partial answer for enwiki:
Thanks for the ping! I spent some time working on this a couple of weeks ago, but encountered an unexpected issue with the referer data, which gave rise to some questions about its validity in general (basically, an implausibly large number of referers are HTTP instead of HTTPS URLs), and I ran out of the allotted time while investigating this. I think I'll be able to get back to that and wrap this task up (possibly with somewhat less accurate results) by early next week.
For the record: decided with @ovasileva to remove the session IDs and keep the page IDs. I'll see to submitting the patch soon.
Tue, Dec 4
Thanks @Niedzielski and @GoranSMilovanovic! I ran a query based on that approach (the wikibase_item page property) for a few wikis, more out of curiosity (I guess @mpopov might incorporate a more thorough look at this in his analysis). It confirmed the assumption that the vast majority of Wikipedia articles have a Wikidata item.
What is the plan for measuring the impact of AMC on this metric?
Mon, Dec 3
Sat, Dec 1
Thu, Nov 29
For illustration: It might look like this chart (that I'm currently generating by hand in Google Sheets).
Wed, Nov 28
Tue, Nov 27
Testing again after Neil's update:
Import still fails for me, but with a new error message:
Mon, Nov 26
Thanks @GoranSMilovanovic! It is indeed about mainspace pages only, but about those that have an associated Wikidata item (i.e. appear in the sitelinks of said item), rather than making use of its properties.
I started drafting a query myself using wb_items_per_site, but the result for enwiki looks implausibly low: https://quarry.wmflabs.org/query/31482 Do you happen to see what might be wrong with the query?
Thu, Nov 22
And to record something here from our earlier offline discussions:
Besides determining whether there was a change, I think we should also try to assess its size (and sign ;)
Wed, Nov 21
Tue, Nov 20
@Nuria I thought the requirements from the user perspective were evident from the task, but to clarify it a bit more:
What are the length recommendations for this? Is https://developers.google.com/search/docs/data-types/article about this? It says "Headlines should not exceed 110 characters." Aren't page previews extracts normally much longer?
Page title and ID contain largely the same information, so if we whitelist one of them, the other should be fine too (and vice versa - if one of them needs to be purged, the other should too).
I'm not sure that I'm clear on what makes sessionToken PII and not IP address.
IP addresses are PII (actually they are more sensitive than session tokens), and indeed the corresponding field is not contained in the whitelist for this schema.
Would it be OK to replace sessionToken with an ID of the previous page token? We could then perform any analysis that doesn't involve joining on sessionToken.
If you mean the page token of the immediately preceding pageview in the session, that probably wouldn't make a big difference privacy-wise, because the session could still be reconstructed.
Could a reasonable option be to generate the statistics we need from the pages, aggregate or add noise to make them non-identifying and then remove the page_id column?
I think we will want to remove the session IDs instead, as (IIRC) less of our data questions depend on them. But there too we could think about calculating and storing some of the session-dependent data in aggregated form.
The same numbers broken down by project:
| wiki | all beta views /day | % beta | logged in beta views /day | logged in % beta |
Fri, Nov 16
And the same for logged-in views:
@alexhollender asked about the percentage of mobile beta pageviews, so I re-ran the calculation from above (T182235#3833702), correcting the queries a bit (in particular restricting it to webrequests that are pageviews):
Is this related to T206279 ?
Thu, Nov 15
Wed, Nov 14
Looks like there could be some synergy with the web team's work, see e.g. T198218 .
Just to double-check: The information in the documentation that "Sanitization happens right after events are generated (with a couple hours lag)" is still current, right? In that case I don't think this will be a concern (although we will need to update some queries - CCing @Groceryheist regarding ReadingDepth).
Nov 14 2018
Checked that the following sets of pages look quite uniformly distributed now:
enwiki: https://quarry.wmflabs.org/query/31221 (a version of  from the task description that actually completes on Quarry)
For reference, the implementation task for the underlying instrumentation: T126693
Nov 13 2018
This should be sorted out now. (Seems we still need to streamline and formalize the access granting process more.)
Discussed with @ovasileva today - we are going to remove the page IDs and keep the session IDs. I will submit a patch soon.
Discussed with @ovasileva today - we are going to remove the session IDs and keep the page names. I will submit a patch soon.
Nov 12 2018
Repeating query  from the task description, the distribution on ptwiki pages created on or before Dec 8, 2005 looks plausible now: https://quarry.wmflabs.org/query/31152
Will do the other checks by tomorrow evening PST (note that the Hive query  can't be re-run directly right now as it depends on the monthly Data Lake snapshot, but of course we can run it in MySQL/MariaDB elsewhere).
See now also T198946: Add Schema property 'sameAs' pointing to Wikidata entries (which also adds a few other schema.org properties, see T198946#4672325 for details)
Cool, thanks for sorting this all out and vetting it!
Nov 11 2018
I nominated this for the 2019 community wishlist survey (as a volunteer), although it remains to be seen whether it fits the scope.
Nov 10 2018
Yes, I can take care of this.
Nov 9 2018
PS: and (in the name of the team) thanks for catching this!
@mforns: I assume "you" in the task description refers to me (since you assigned the task to me). I didn't have anything to do with the original creation of the schema or the field renames in question, and am not among the schema's maintainers.
We'll likely discuss this in our team meeting later today - it's probably best if the involved analysts determine the precise list of field names to be added, although I'll be happy to help submit the resulting whitelist patch, as I did earlier this week in the case of the Popups schema.
OK, here is the HiveQL expression determining GN I have been using for the past few years:
I'll look into this next week with @ovasileva .
Yes, this would be great to know - I have no idea myself either, but looking at the above mentioned case in T166733#4709400 , it seems that batched updating of 16.8 million rows (across 9 tables) took 11 hours there.
- Can you help us identify parties we should coordinate with based on your experience making similar changes? In your comment you mentioned #wikimedia-operations if the update was quick but per the previous bullet, we’re unsure. Does ~1.5 million rows seem like a long running change? Would it be necessary or advisable to divide this change into enwiki and non-enwiki updates? Lastly, are Anomie’s T166733 and T188132 scripts running constantly or intermittently? They seem to be scheduled for weeks! Would they interfere with a simultaneous update to page_random?
FWIW, it seems that @Anomie's updates are not operating on the page table that we are concerned with here, but on different tables (namely image for T188132, and revision, archive, logging, ipblocks, image, oldimage, filearchive, protected_titles and recentchanges for T166733, according to T166733#4709400 - @Anomie, can you confirm?).