
[EPIC] Reading List Sync service analytics
Closed, Declined (Public)

Description

Background

Reading List Syncing (RLS) is a really cool feature/service which required time & effort from multiple teams to make it happen. The biggest thing is that now articles saved for offline reading on the Android device(s) can be synced to the user's iOS device(s) too. So, naturally, there are some questions that would be great to answer:

  • How many people are actually syncing, versus just enabling syncing?
  • How many people are getting actual use out of it by syncing to multiple devices, or are they just sending data to their account in the cloud without syncing to another device?
  • How many users are syncing reading lists across their iOS and Android devices?

…as the answers to these questions would inform resourcing decisions regarding the future of this service and taking on similar cross-platform initiatives (like if we learn that I'm only one of 5 people total who sync across platforms).

Proposal

After several discussions about privacy and workload implications, we (@Fjalapeno @JMinor @mpopov) have arrived at the following proposed solution:

  • When the user enables RLS on their device, the client sends an event to EventLogging (EL) that registers the device.
  • The client then remembers when this event was sent and resends it after 60 days.
  • If RLS is already enabled when the user opens the app for the first time after updating to the version that has this funnel, that's when the app sends the first registration event (see the sketch after this list).
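
To illustrate, here is a minimal sketch of that client-side logic in Python; the helper names (send_event, the persisted state dict) are assumptions for illustration, not the apps' actual implementation:

```python
import time

RESEND_INTERVAL = 60 * 24 * 60 * 60  # 60 days, in seconds

def maybe_send_registration(state, rls_enabled, send_event):
    """Send (or resend) the RLS registration event when appropriate.

    `state` is a small persisted dict, e.g. {"last_registration_ts": 1523314500}.
    `rls_enabled` is whether Reading List Sync is currently on for this user.
    `send_event` posts the event to EventLogging (hypothetical callable).
    """
    if not rls_enabled:
        return state

    last_sent = state.get("last_registration_ts")
    now = int(time.time())

    # First launch after updating to the instrumented version, or 60 days elapsed.
    if last_sent is None or now - last_sent >= RESEND_INTERVAL:
        send_event("MobileWikiAppRLSRegistration")
        state["last_registration_ts"] = now

    return state
```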

EventLogging Schema

MobileWikiAppRLSRegistration schema has the following fields:

  • user_id, which holds the username or user ID (either works, as long as it's consistent between iOS & Android and lets us link multiple app_install_ids together)
  • app_install_id, which the Android & iOS apps already include in events anyway
  • client_ts, a field for client-side timestamps in case the device goes offline and the event is queued up for a future opportunity

Since the events include User-Agent strings from the apps, we can use them to figure out whether people are enabling RLS across platforms (sketched below).
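
For illustration only, a sketch of what one such event and a naive cross-platform tally over these events could look like; the field values and the simple User-Agent check are assumptions rather than the real pipeline:

```python
from datetime import datetime, timezone
from collections import defaultdict

# A hypothetical registration event matching the fields listed above.
event = {
    "user_id": "ExampleUser",                       # consistent across iOS & Android
    "app_install_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
    "client_ts": datetime.now(timezone.utc).isoformat(),
    # Captured by EL alongside the event; simplified here as a plain field.
    "user_agent": "WikipediaApp/2.7.x (Android 8.1; Phone)",
}

def cross_platform_users(events):
    """Count users whose registration events come from more than one platform."""
    platforms_by_user = defaultdict(set)
    for e in events:
        ua = e["user_agent"]
        platform = "ios" if "iOS" in ua else "android" if "Android" in ua else "other"
        platforms_by_user[e["user_id"]].add(platform)
    return sum(1 for p in platforms_by_user.values() if len(p) > 1)
```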

Prioritization

This work is low priority. These analytics would be nice to have at some point, but the stakeholders aren't itching to have those questions answered and there is more important work that needs to be done first.

Future Work

Note that those are basically the only questions we could answer. Any questions regarding users' actual usage of the feature/service, as well as reading lists in general, would require additional EL work on the client side.

Event Timeline

mpopov triaged this task as High priority. Apr 9 2018, 11:35 PM
mpopov created this task.

This would happen on the client side, right? (i.e. the header would be set on the request, not the response.)

This would happen on the client side, right? (i.e. the header would be set on the request, not the response.)

Yep, just the requests! Nothing changes for responses.

mpopov renamed this task from Enable linking Reading List sync-related requests by including user-identifying info in X-Analytics to Enable Reading List Syncing usage stats. Apr 17 2018, 11:41 PM
mpopov lowered the priority of this task from High to Medium.
mpopov updated the task description. (Show Details)
mpopov added a subscriber: Fjalapeno.

Note that retention of this data is subject to a 90-day limit unless an exception is granted by Legal. Is the reading infrastructure ready to drop the data after 90 days of storing it? Currently there are no data-dropping routines in Cassandra, which I think is the backend for this data.

This is our retention per our privacy policy:

https://meta.wikimedia.org/wiki/Data_retention_guidelines#How_long_do_we_retain_non-public_data

I think storing metadata about usage of the feature together with the feature data might not be the best solution.

What are the concerns with using EL and events for behavioral analytics such as these? I know iOS cannot use EL, but Android, where usage is really widespread, can. The questions posed on the ticket have different levels of relevance, and seeing usage among users that use both iOS and Android seems a fringe case rather than a core use of the feature. Also, if this feature only applies to logged-in users, it would be well worth running some numbers on the usage cap we would expect to see in, say, Android. I would do that first, to gauge against instrumentation efforts.

fdans raised the priority of this task from Medium to High. Apr 23 2018, 3:54 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Note that retention of this data is subject to a 90-day limit unless an exception is granted by Legal. Is the reading infrastructure ready to drop the data after 90 days of storing it? Currently there are no data-dropping routines in Cassandra, which I think is the backend for this data.

What is 90 days of data for a table of registered devices? Do we go by the last sync timestamp? I guess that's fine, since if someone hasn't synced a device in more than 90 days then that device might as well not exist and can be dropped. (This was actually the reason for having a last sync timestamp in the proposal, so we can filter out devices that haven't been synced in X days; dropping past 90 is just an explicit version of that.)

What are the concerns with using EL and events for behavioral analytics such as these? I know iOS cannot use EL, but Android, where usage is really widespread, can. The questions posed on the ticket have different levels of relevance, and seeing usage among users that use both iOS and Android seems a fringe case rather than a core use of the feature. Also, if this feature only applies to logged-in users, it would be well worth running some numbers on the usage cap we would expect to see in, say, Android. I would do that first, to gauge against instrumentation efforts.

Namely that it would be a lot of unnecessary work for both teams on the client side to instrument something that would be much easier and simpler to do on the backend side. Also, if we used EL client-side, we still wouldn't have an accurate picture of the service's usage because, and I'll repeat myself from proposal number 1 here:

I expect the # of people who are opted-in [to sharing usage data] on every device they own to be incredibly small if not downright zero.

Keeping track of service usage on the backend would yield actual numbers, not estimates with extremely wide confidence intervals.

mpopov updated the task description. (Show Details)

Update: @chelsyx @Nuria and I met to discuss this and have come up with Proposal 4 (which is really just an EventLogging-based version of Proposal 2).

All these new proposals sound a bit overcomplicated. Why not just use X-Analytics? There is already a purge mechanism for raw webrequest data, right?

All these new proposals sound a bit overcomplicated. Why not just use X-Analytics? There is already a purge mechanism for raw webrequest data, right?

Meeting with Legal later this week and hoping to get approval for the X-Analytics approach :)

All these new proposals sound a bit overcomplicated. Why not just use X-Analytics?

We do not use X-Analytics for structured data at all; we have EventLogging for that purpose. As we have mentioned before, in the context of the page previews discussion, we need to think about events for analytics rather than thinking that parsing the whole webrequest firehose is efficient in any way; that amounts to parsing terabytes of data per day. We already discussed this at length when we made a plan to measure page previews.

I strongly recommend measuring what we want to measure by emitting events; that would allow you to ingest those events into stores like Druid and view them with Superset.

In light of the new, modern event data platform we are working on for next year, we will be moving away from solutions like X-Analytics, which, by the way, do not work automagically.

would be much easier and simpler to do on the backend side.

We're trying to avoid centralizing metric definition and emission of events. Webrequest is huge, but dumb. (See also the Event Data Platform program for next FY.) We want teams to be able to emit events of their design (within some guidelines), and then be able to query those events in different systems.

What is 90 days of data for a table of registered devices?

The events themselves and anything with PII is subject to this. If you are trying to keep the current state of something, rather than historical events, you'd likely create a job that derives the current state from the latest events and inserts or updates some other table. I don't have the context as to what you are trying to track, but likely if the state for some user (or device) hasn't changed in 90 days (and you are keeping PII for that user/devices), then you should probably purge that record. If it has changed, then you can keep it! :)
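
To illustrate the idea (not Analytics' actual tooling), a small sketch that derives a latest-state table from the events and purges records untouched for 90 days; the field names and in-memory shape are assumptions:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)

def derive_current_state(events):
    """Reduce an event stream to the latest record per device.

    `events` is assumed to be an iterable of dicts with hypothetical
    `app_install_id` and `event_dt` (a datetime) fields.
    """
    latest = {}
    for e in sorted(events, key=lambda e: e["event_dt"]):
        latest[e["app_install_id"]] = e  # last event wins
    return latest

def purge_stale(state, now=None):
    """Drop records whose state hasn't changed within the retention window."""
    now = now or datetime.now(timezone.utc)
    return {k: v for k, v in state.items() if now - v["event_dt"] <= RETENTION}
```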

BTW, just checking that you are synced up with the iOS team on this. It sounds like you are both trying to do similar things! See T192819.

BTW, just checking that you are synced up with the iOS team on this. It sounds like you are both trying to do similar things! See T192819.

@Ottomata Yes, we have been synced up about this. T192819 is only collecting data on the client side (the iOS app), while this ticket is to track logged-in users across device/platform, which requires a different solution.

mpopov renamed this task from Enable Reading List Syncing usage stats to [EPIC] Reading List Sync service analytics. Apr 24 2018, 10:45 PM
mpopov updated the task description. (Show Details)

@APalmer_WMF @Fjalapeno @Jhernandez: can y'all please take a look at the updated description and let us know if you have any questions or concerns.

I'd like to get a thumbs up from both Legal and Reading Infrastructure before creating sub-tasks :) Also! I will be scheduling a separate meeting to discuss the two crossDeviceID methods so we can decide on which one to go with.

@mpopov FYI that adding things to X-Analytics does not work automagically; we strongly recommend teams track interactions using events.
Want to clarify with @chelsyx and @Fjalapeno that events do not have to be sent from the client side; the server can also send them. EL has a MediaWiki server-side client too.

@mpopov FYI that adding things to X-Analytics does not work automagically; we strongly recommend teams track interactions using events.
Want to clarify with @chelsyx and @Fjalapeno that events do not have to be sent from the client side; the server can also send them. EL has a MediaWiki server-side client too.

Noted re: X-Analytics. FYI as you can see in the updated description, the server-side EL-based solution is the one we decided to go with.

Method 1 has the disadvantage that we would be able to find out username given crossDeviceID, which is not the case for Method 2.

How is that not the case for Method 2?

One nit! Remember that your JSON field names are going to be directly mapped to caseless SQL column names, so please avoid using camelCase when you can. snake_case is much better. E.g. app_install_id, cross_device_id, etc. :)

Method 1 has the disadvantage that we would be able to find out username given crossDeviceID, which is not the case for Method 2.

How is that not the case for Method 2?

Good question! I'm no expert in cryptography but as far as I've been able to tell it is impossible to reverse a good hash function. Any attacker would basically need to make their own mapping table of usernames to hashes even if they knew exactly which hashing function was used.

One nit! Remember that your JSON field names are going to be directly mapped to caseless SQL column names, so please avoid using camelCase when you can. snake_case is much better. E.g. app_install_id, cross_device_id, etc. :)

Will do! Thanks for the reminder and suggestion!

Method 1 has the disadvantage that we would be able to find out username given crossDeviceID, which is not the case for Method 2.

How is that not the case for Method 2?

Good question! I'm no expert in cryptography but as far as I've been able to tell it is impossible to reverse a good hash function. Any attacker would basically need to make their own mapping table of usernames to hashes

Yes, it is considered impossible for practical purposes to come up with a source value when given the hash value alone (assuming that we choose a well-established hash function whose security has been widely vetted).
But in situations where one has the additional information that the hash can only come from a fairly limited set of source values, this is no longer true. That is well known and, for example, is the reason why password hashes are always stored with a salt. The situation here is even worse: the list of existing users is public and fairly small (<200 million accounts across all WMF wikis, much less when applying some easy heuristics, e.g. limiting to recently active users).

even if they knew exactly which hashing function was used.

Which they would, considering that our code is open source ;)

What's more, reconstructing the user name corresponding to some logged data is only one possible threat model. Another one, arguably more important, is finding the logged data corresponding to a given user (e.g. someone with access to the data wants to know how I have used reading lists synchronization recently). Method 2 offers basically zero protection against that.

In summary, I don't see how Method 2 offers a meaningful privacy protection.
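
To make that concrete, a toy sketch of the enumeration attack against an unsalted hash of the username (assuming, purely for illustration, that Method 2 is something like SHA-256 of the bare username; the actual Method 2 details aren't spelled out here):

```python
import hashlib

def cross_device_id(username):
    """Hypothetical Method 2: hash of the username, no salt."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()

# An attacker with the (public) list of usernames just hashes them all...
public_usernames = ["ExampleUserA", "ExampleUserB", "ExampleUserC"]
reverse_map = {cross_device_id(u): u for u in public_usernames}

# ...and any logged crossDeviceID can then be looked up directly.
observed_id = cross_device_id("ExampleUserB")
print(reverse_map[observed_id])  # -> "ExampleUserB"
```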

The reason to hash app_install_id is because these events would end up somewhere where we would be able to join with behavioral data sent by mobile apps, which we DON'T want

To clarify just in case, it's fine to log app_install_id in connection with user actions; it has been done in many different schemas for years. And "behavioral data" would seem to describe this data here too.
So I guess the "don't want" here refers to connecting user IDs with those other schemas via the app install ID, right? (In which case, fully agreed, although it seems we had been trying to prevent that with Method 1 or Method 2 anyway.)

I see the quarterly check-in slides had some data on reading list sizes. One thing that would be nice to have is reading list "churn" (i.e., is the median list size small because most people just don't use the feature much, or do they use it as bookmarks and remove articles once they have read them?).

mpopov lowered the priority of this task from High to Low. May 30 2018, 10:31 PM
mpopov updated the task description. (Show Details)

The reason to hash app_install_id is because these events would end up somewhere where we would be able to join with behavioral data sent by mobile apps, which we DON'T want

To clarify just in case, it's fine to log app_install_id in connection with user actions; it has been done in many different schemas for years. And "behavioral data" would seem to describe this data here too.
So I guess the "don't want" here refers to connecting user IDs with those other schemas via the app install ID, right? (In which case, fully agreed, although it seems we had been trying to prevent that with Method 1 or Method 2 anyway.)

We have moved away from the proposal requiring either method. I'm sorry I didn't respond to your earlier comments; I was waiting until I updated the description before updating you.

One thing that would be nice to have is reading list "churn" (i.e., is the median list size small because most people just don't use the feature much, or do they use it as bookmarks and remove articles once they have read them?).

Sounds interesting! Can you please explain this idea in more detail?

ts field for client-side timestamps in case the device goes offline and the event is queued up for a future opportunity

Orrrrr maybe dt in ISO-8601? :D

ts field for client-side timestamps in case the device goes offline and the event is queued up for a future opportunity

Orrrrr maybe dt in ISO-8601? :D

Updated schema to say client_ts and specified ISO-8601 to avoid ambiguity. Client-side timestamps in the mobile apps' EL events use ISO-8601, but you're right, it's good to be specific about that.

client_dt? We are trying to use the convention that fields named after dt are ISO-8601, and "timestamp" or ts fields are Unix epoch timestamps.
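
For example (Python here purely for illustration of the two conventions):

```python
import time
from datetime import datetime, timezone

ts = int(time.time())                               # ts: Unix epoch seconds, e.g. 1538000000
client_dt = datetime.now(timezone.utc).isoformat()  # client_dt: ISO-8601, e.g. "2018-09-26T19:25:00+00:00"
```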

Sounds good, @Ottomata! Updated and I'll keep this in mind going forward.

Nuria raised the priority of this task from Low to Needs Triage. Sep 26 2018, 7:25 PM
Nuria moved this task from Operational Excellence to Radar on the Analytics board.
Jhernandez changed the task status from Open to Stalled. Feb 27 2019, 5:02 PM
Jhernandez triaged this task as Lowest priority.

Reflecting reality: no one seems to be interested in these. Please comment and move this to our "Needs triage" column if you are.

LGoto closed this task as Declined. Oct 9 2020, 4:50 PM