We would like to learn about the actual usage of the Synced Reading Lists feature by users, not just whether they have it enabled or not (which we can get with EventLogging albeit only on Android, not iOS). For example, are people actually syncing or just enabling syncing? Are people getting use out of it by syncing to multiple devices or are they just sending data to the cloud without syncing to another device? Are users of the beta version who have both iOS & Android apps syncing across platforms?
To answer these questions we can look at request logs in [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest | `wmf.webrequest` ]], but unfortunately there's no information that we can use to link requests from the same user across apps.
# Proposal 1 (Original)
To that end, I propose that API calls made by the apps to /api/rest_v1/data/lists/ include a user-identifier (e.g. `wmfsyncid`) in the [[ https://wikitech.wikimedia.org/wiki/X-Analytics | X-Analytics ]] header. Almost like a cross-platform `wmfuuid`! :) It's fine if the identifier is a hashed username (as long as both apps yield the same hash of the same username) since we're not interested in identifying specific users or looking at specific users' reading lists, just their usage of the syncing feature. **Again**, this would //only// be for sync-related requests, no other requests.
My strong preference is for the identifiers to be sent with //**all**// sync-related requests and not just when the user has opted-in to in-app analytics (like how `wmfuuid` behaves) so that we can get actual usage numbers, not biased estimates. But more importantly, I expect the # of people who are opted-in on every device they own to be incredibly small if not downright zero.
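As a rough illustration of Proposal 1, the sketch below shows how an app might derive the identifier and attach it to the header. It assumes SHA-256 as the hash and assumes the X-Analytics value is a list of `key=value` pairs separated by `;`. The function names `sync_id` and `with_sync_id` are hypothetical, and both apps would have to normalize usernames identically for the hashes to match.

```python
import hashlib


def sync_id(username: str) -> str:
    """Hypothetical helper: derive a stable identifier from a username.

    Assumption: SHA-256 over the UTF-8 bytes of the username. Any hash works
    as long as the iOS and Android apps use exactly the same one.
    """
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


def with_sync_id(x_analytics: str, username: str) -> str:
    """Return an X-Analytics value with `wmfsyncid` added (or replaced).

    Assumption: the header value is `key=value` pairs joined by `;`.
    """
    pairs = [p for p in x_analytics.split(";") if p]
    # Drop any previous wmfsyncid so the key appears only once.
    pairs = [p for p in pairs if not p.startswith("wmfsyncid=")]
    pairs.append(f"wmfsyncid={sync_id(username)}")
    return ";".join(pairs)


if __name__ == "__main__":
    # Only requests to /api/rest_v1/data/lists/ would carry this header value.
    print(with_sync_id("wmfuuid=abc123", "ExampleUser"))
```

Because the hash is deterministic, requests from the same account on different devices would carry the same `wmfsyncid` without the apps exchanging anything.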
# Proposal 2 (New alternative)
> Alternatively, the Reading Infrastructure team could generate their own backend logging with user-identifying info but I'm pretty sure this is the simplest approach. I cannot think of any other way to assess the success or usage of this feature in a meaningful way without doing this, but I'm open to ideas and suggestions.
In a meeting with @Fjalapeno, he suggested registering devices on the backend and not having the apps send any additional user-identifying info. The backend would generate a unique ID for the user that can be used to link multiple app install IDs. For each registered device, the backend would insert/update:
- 🔑 Cross-device ID (stays the same)
- 🔑 App install ID (stays the same)
- OS family (stays the same)
- App version (will change as user updates app)
- OS major.minor version (will change as user updates OS)
- Timestamp of the last sync (will change with usage)
The following diagram illustrates this:
Adding another proposal per my conversation with @Fjalapeno
# Proposal 3 (new new alternative)
We use EventLogging to send data like (user_id, reading_list_id, platform, reading_list_transaction_id) that quantifies (per user_id) access to a reading list across platforms (platform being different installs of the Wikipedia app). This assumes that logging in is needed to access your reading list across platforms (which I think is how this feature works); it also assumes the service backend can provide transaction_ids per reading_list.
This requires instrumentation in Android but leaves out the iOS case. Note that with this scheme you can infer whether users use more than one Android platform (phone and tablet) and whether they use reading lists across Android/iOS.
Using "reading_list_transaction_id" we can learn about iOS/Android cross-device usage using instrumentation on Android alone, without additional IDs. For a given reading_list, a sequence of transactions in EventLogging like 1, 2, 3, 4 would tell us the user is only using an instrumented (Android) platform.
A sequence like 1, 3, 5 of reading_list_transaction_id would tell us that the user has done some transactions (2 and 4) on the reading list on a platform we have not instrumented (iOS).
Example:
Events sent to EL would look like
(user_id, reading_list_id, platform, reading_list_transaction_id)
001, 001, Android-phone-1, 01
001, 001, Android-tablet-2, 02
001, 002, Android-phone-1, 03
002, 001, Android-phone-1, 01
002, 001, Android-phone-1, 03
003, 001, Android-phone-1, 01
003, 001, Android-tablet-2, 02
User 001 has an Android phone and a tablet and syncs reading lists between both.
User 002 has an Android phone and some other device to which they sync the list.
User 003 uses only one device to interact with reading lists.
This schema requires no additional identifiers (other than distinct, somewhat sequential transaction IDs generated server-side) and I think provides all the information required.
# Proposal 4 (New³ alternative)
Instead of maintaining a table (updating records, dropping data older than 90 days), the backend just sends RLS usage updates as events to EventLogging with the following schema:
| Property | Type | Required |
| crossDeviceID | string | true |
| appInstallID | string | true |
| wmf_app_version† | string | true |
| os_family† | string | true |
| os_version† (major.minor format) | string | true |
†: these can be omitted if the backend sets the UA (of the events it sends) to the UserAgent info it received from the apps
Note that we don't need a last sync timestamp since we can just refer to the timestamp in EventLogging.
The event can be sent by the backend every time the backend responds to a request from an app. Devices that haven't been synced in more than 90 days would disappear from the table automatically.
# Proposal
After several discussions about privacy and workload implications, we've arrived at the following solution (formerly Proposal 4, for those keeping track). The goal is to learn about the actual usage of the Reading Lists Syncing (RLS) feature by users, not just whether they have it enabled or not (which we can get with EventLogging (EL), albeit only on Android, not iOS). Specifically, the people working on this feature and the people making resourcing decisions have questions like:
- How many people are actually syncing or just enabling syncing?
- How many people are getting actual use out of it by syncing to multiple devices, or are they just sending data to their account in the cloud without syncing to another device?
- How many users are syncing reading lists across their iOS and Android devices?
## Cross-device identifier
On the RLS Backend, we would generate a unique identifier `crossDeviceID` for each user and that ID would be used to associate app installs together. It will be up to the Reading Infrastructure team to decide between these two methods:
1. For every user who syncs, generate a random ID and put it into a mapping table that would be used to look up the randomly generated ID by username
2. Use a deterministic hashing function that takes a username and returns the hashed version
In either case, we should be able to find out `crossDeviceID` given username. Method 1 has the disadvantage that we would be able to find out username given `crossDeviceID`, which is not the case for Method 2. On the other hand, looking up the pre-computed ID via Method 1 is probably way faster (and involves less computation) than hashing the username every time on-demand via Method 2. Although since `appInstallID`s (see notes below) will need to be hashed anyway, we might as well go with Method 2.
## EventLogging
When a user syncs their reading lists, RLS Backend would send an event to EL with the following information:
1. The `crossDeviceID` (see section above)
2. Salted & hashed `appInstallID` (see notes below)
3. If possible, RLS Backend should set the UserAgent (UA) to the UA it received from the app (see notes below)
**Some notes:**
- Due to the 90 day data retention policy and the auto-purging put in place by Analytics Engineering, devices that haven't been synced in more than 90 days would just disappear from the table
- We hash `appInstallID` because these events would end up somewhere where we would otherwise be able to join them with behavioral data sent by mobile apps, which we DON'T want
- We salt `appInstallID` on the RLS Backend side before hashing it to prevent someone (i.e. a data analyst) from just applying the same hashing function to the `appInstallID` in those other tables and then joining by hashed `appInstallID`
- Ideally the RLS Backend would forward the UAs from the apps because Analytics Engineering parses the UAs and puts them into nicely query-able structured data. Otherwise, RLS Backend would need to parse the UA itself and send: app version, OS family, and OS major.minor version
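A minimal sketch of the backend side, assuming Method 2 (a deterministic hash of the username) for `crossDeviceID` and SHA-256 over a secret salt plus the install ID for `appInstallID`. The salt value, the function names, and the `installs_per_user` analysis helper are all hypothetical and only illustrate how the resulting events could answer the multi-device question.

```python
import hashlib
from collections import defaultdict

# Hypothetical salt; in practice it would be known only to the RLS Backend.
SALT = b"example-secret-salt"


def cross_device_id(username: str) -> str:
    """Method 2: deterministic hash of the username (SHA-256 is an assumption)."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


def hashed_install_id(app_install_id: str, salt: bytes = SALT) -> str:
    """Salt, then hash, so the value cannot be joined to other tables by
    re-applying the plain hash function to their appInstallID column."""
    return hashlib.sha256(salt + app_install_id.encode("utf-8")).hexdigest()


def build_event(username: str, app_install_id: str) -> dict:
    """Event payload with the two identifiers; UA details are assumed to
    travel in the forwarded UserAgent rather than in the payload."""
    return {
        "crossDeviceID": cross_device_id(username),
        "appInstallID": hashed_install_id(app_install_id),
    }


def installs_per_user(events: list) -> dict:
    """Count distinct app installs seen per crossDeviceID."""
    installs = defaultdict(set)
    for event in events:
        installs[event["crossDeviceID"]].add(event["appInstallID"])
    return {user: len(ids) for user, ids in installs.items()}


if __name__ == "__main__":
    events = [
        build_event("UserA", "install-android-1"),
        build_event("UserA", "install-ios-1"),
        build_event("UserB", "install-android-2"),
    ]
    counts = installs_per_user(events)
    multi_device = sum(1 for n in counts.values() if n > 1)
    print(f"{multi_device} of {len(counts)} users synced from more than one install")
```

Users whose `crossDeviceID` maps to more than one hashed `appInstallID` are the ones syncing across devices. Splitting that by OS family (taken from the parsed UA) would address the iOS/Android question.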