As you all know, the iOS team has been using Piwik to track user behavior, but it doesn’t work very well because Piwik can’t handle the volume of events from the iOS app. We’re also sending data to [[ https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/iOS/Analysis#Event_logging_schema | some event logging schemas ]], but many of them haven’t been maintained for a long while (e.g. T192520) or don’t collect the data we want. As a first step toward sunsetting Piwik, we have decided to implement event logging (EL) for the [[ https://www.mediawiki.org/wiki/Wikimedia_Apps/Synced_Reading_Lists | synced reading list feature ]] on the iOS app and adopt the event format used by [[ https://matomo.org/docs/event-tracking/ | Piwik ]] and [[ https://support.google.com/analytics/answer/1033068 | Google Analytics ]]. We will gradually implement EL for other features using the same format, and we will stop using Piwik and clean up unused EL schemas once we finish. Before proceeding, we want to reach out to interested and affected parties for feedback and suggestions.
== The schema
We will implement several event tables and one user properties table.
The event tables will record users’ interactions with the app using 4 fields: **Category** (on which screen), **Label** (optional, on which element of that screen), **Action** (what action the user performed), **Value** (optional, if there is an associated value). In addition to the [[ https://meta.wikimedia.org/wiki/Schema:EventCapsule | standard event capsule ]], all app events will share a "meta schema" that provides app-specific context. This meta schema will include: **appInstallID**, **primaryLanguage**, **isAnon** (whether the user is anonymous, i.e. not logged in), **client_ts** (client-side timestamp) and **sessionID**.
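As a rough illustration of the record shape described above (a sketch, not the actual schema definition), an event could be modeled as below. The field names come from the description; the Python types, the integer type for **Value**, and the sample values are assumptions made only for this example.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AppEvent:
    # The four event fields described above.
    category: str                 # on which screen
    action: str                   # what action the user performed
    label: Optional[str] = None   # optional: which element of that screen
    value: Optional[int] = None   # optional: associated value (int is an assumption)
    # App-specific context fields from the "meta schema".
    appInstallID: str = ""
    primaryLanguage: str = ""
    isAnon: bool = True
    client_ts: str = ""           # client-side timestamp (string format is an assumption)
    sessionID: str = ""


def to_payload(event: AppEvent) -> str:
    """Serialize an event to JSON, leaving out optional fields that are unset."""
    fields = {k: v for k, v in asdict(event).items() if v is not None}
    return json.dumps(fields, sort_keys=True)


# Hypothetical values, for illustration only.
example = AppEvent(
    category="example_screen",
    action="example_action",
    appInstallID="install-123",
    primaryLanguage="en",
    isAnon=False,
    client_ts="2018-05-01T10:00:00Z",
    sessionID="session-1",
)
payload = to_payload(example)
```

Since **Label** and **Value** are optional, this sketch omits them from the payload when they are not set, rather than sending nulls.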
Because recording all the events in one table would hurt query efficiency, we will split them by function into separate tables ([[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSReadingLists | MobileWikiAppiOSReadingLists ]], [[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSLoginAction | MobileWikiAppiOSLoginAction ]], [[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSSettingAction | MobileWikiAppiOSSettingAction ]], [[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSSessions | MobileWikiAppiOSSessions ]]), although all of them will have the same fields. Like other EL tables, the event tables will be purged after 90 days.
The user properties table ([[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSUserHistory | MobileWikiAppiOSUserHistory ]]) records all the historical states of user properties. These properties include how many articles the user has saved, how many reading lists they have created, whether they have turned on reading list sync, primary language, text size choice, theme choice, etc. When users first open the app after an install or update, we record these property values locally and remotely. At the beginning of each subsequent session, we will check whether any of these property values have changed. If so, we update the value locally and send the new value to the server, along with the last action timestamp and the previous sessionID. As with the event tables, we will send a capsule with every user properties record, including **appInstallID**, **client_ts** and **sessionID**.
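The session-start check described above could look roughly like the following sketch. It is not the app's actual implementation: the function names, the record layout, and the property names in the usage example are hypothetical, and converting every value to a string with `str()` is just one assumed way to fit a single string-typed value field.

```python
from typing import Any, Dict, List


def changed_properties(stored: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Any]:
    """Return the properties that are new or whose value differs from the stored one."""
    return {
        name: value
        for name, value in current.items()
        if name not in stored or stored[name] != value
    }


def records_for_session_start(
    stored: Dict[str, Any],
    current: Dict[str, Any],
    app_install_id: str,
    last_action_ts: str,
    previous_session_id: str,
) -> List[Dict[str, str]]:
    """Update the local copy and build one record per changed property."""
    records = []
    for name, value in changed_properties(stored, current).items():
        stored[name] = value  # update the value locally
        records.append({
            "property": name,
            "value": str(value),              # assumed string encoding
            "appInstallID": app_install_id,
            "client_ts": last_action_ts,      # last action timestamp
            "sessionID": previous_session_id, # previous session
        })
    return records


# Hypothetical property names and values, for illustration only.
stored_state = {"theme": "light", "saved_articles": 3}
current_state = {"theme": "dark", "saved_articles": 3}
to_send = records_for_session_start(
    stored_state, current_state, "install-123", "2018-05-01T10:00:00Z", "session-1"
)
```

Only properties that actually changed produce a record, so an unchanged session sends nothing.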
Unlike the event tables and other EL tables, the user properties table will **NOT** be purged, except for the IP address and user agent fields. After discussing with Legal, our initial plan (not finalized yet) is not to track any users whose primary language and/or country has a very small population.
See the [[ https://docs.google.com/presentation/d/1drI6nN3xQ5CMZrfQJkiMAYVI_5uQ1NypkYxg_qDXojk/edit?usp=sharing | spec slide ]] for more details and examples.
== Why don’t we use Android team’s EL schemas?
In short, Android’s EL schemas are tailored to Android's flow and are not immediately usable for iOS. Using the same schemas would require adjusting the implementation in both apps. Even after the adjustment, we still couldn't use the same logic to consume the data, which leaves almost no benefit to sharing schemas.
Take the reading list EL schema as an example. At first, we wanted to use [[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppReadingLists | MobileWikiAppReadingLists ]] as Android does. But after reviewing the reading list flow in both apps, we found that we would have to add an 'addtodefault' event (see T190748#4098226 for more details). Even after this adjustment, if we want to count the number of articles added after the release, for iOS we need to count the 'addtodefault' events; for Android we need to count the 'addtodefault', 'addtoexisting' and 'addtonew' events and then sum them up.
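To make the difference in consuming logic concrete, here is a small sketch of the count described above. The event names come from the text; representing events as dictionaries with an `action` key, and the function and variable names, are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

# Which events count as "article added" on each platform, per the text above.
ADD_ACTIONS = {
    "ios": {"addtodefault"},
    "android": {"addtodefault", "addtoexisting", "addtonew"},
}


def count_articles_added(events: Iterable[Dict[str, str]], platform: str) -> int:
    """Sum the platform-specific 'add' events."""
    counts = Counter(event["action"] for event in events)
    return sum(counts[action] for action in ADD_ACTIONS[platform])


# Made-up sample events, for illustration only.
android_events = [
    {"action": "addtodefault"},
    {"action": "addtoexisting"},
    {"action": "addtonew"},
    {"action": "addtonew"},
]
ios_events = [{"action": "addtodefault"}, {"action": "addtodefault"}]
```

Even with a shared schema, the per-platform mapping still has to live somewhere in the analysis code.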
== Why choose this format?
Using this format to store users’ events and properties can benefit us in the following ways:
* It fulfills [[ https://www.mediawiki.org/wiki/Wikimedia_Apps/App_Analytics | our needs ]] and conforms to the [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines | Analytics Engineering team’s guidelines ]] (although not 100%, see the question section below), which means it can be piped into [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid | Druid ]] easily so that we can use [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset | Superset ]] to build a dashboard.
* Since all the event tables have the same fields, we can union them easily and then analyze the conversion funnel.
* This format is flexible enough to allow adding events, moving certain events from one table to another, and supporting new features in the future.
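As a sketch of the union-then-funnel idea, assuming each table is available as a list of rows with the shared fields, the analysis could look like this. The function names, the funnel steps, and the sample rows are hypothetical, and `client_ts` is assumed to be a string that sorts chronologically (e.g. ISO 8601 in a single time zone).

```python
from itertools import chain
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def union_tables(*tables: Sequence[Row]) -> List[Row]:
    """Combine rows from several event tables, ordered by session and time."""
    return sorted(
        chain.from_iterable(tables),
        key=lambda row: (row["sessionID"], row["client_ts"]),
    )


def funnel_counts(events: Sequence[Row], steps: Sequence[Tuple[str, str]]) -> List[int]:
    """Count the sessions that reached each (category, action) step, in order."""
    progress: Dict[str, int] = {}  # sessionID -> number of steps completed
    for row in events:
        done = progress.setdefault(row["sessionID"], 0)
        if done < len(steps) and (row["category"], row["action"]) == steps[done]:
            progress[row["sessionID"]] = done + 1
    return [sum(1 for d in progress.values() if d > i) for i in range(len(steps))]


# Made-up rows from two tables, for illustration only.
login_rows = [
    {"sessionID": "s1", "client_ts": "2018-05-01T10:00:00Z", "category": "login", "action": "success"},
    {"sessionID": "s2", "client_ts": "2018-05-01T11:00:00Z", "category": "login", "action": "success"},
]
reading_list_rows = [
    {"sessionID": "s1", "client_ts": "2018-05-01T10:05:00Z", "category": "article", "action": "save"},
]
steps = [("login", "success"), ("article", "save")]
result = funnel_counts(union_tables(login_rows, reading_list_rows), steps)
```

Because every table has the same fields, the union needs no per-table column mapping.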
== Questions for AE
* Does this schema conform to Druid’s rules? Could it be piped into Druid and then used by Superset easily?
** The value field of MobileWikiAppiOSUserHistory is a string type because the property values have different types (integer, boolean, string). This doesn’t conform to [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines | these guidelines ]] and cannot be used directly by Druid. Is there any way to solve this problem?
** Superset has a [[ https://superset.incubator.apache.org/sqllab.html | SQL Lab ]] function which would be useful for more complex analysis. Are there any plans to enable SQL Lab?
* Can we send all the data to the Hadoop cluster (and not to MySQL at all) without sampling? Based on [[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppDailyStats | MobileWikiAppDailyStats ]], we only have ~65k daily active users on the iOS app who agree to share their usage data with us.
* As mentioned before, all the event tables share the same fields. If we are not going to send any data to MySQL, can we, for simplicity, send all the events to one big table and partition it by function? (i.e. convert [[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSReadingLists | MobileWikiAppiOSReadingLists ]], [[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSLoginAction | MobileWikiAppiOSLoginAction ]], [[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSSettingAction | MobileWikiAppiOSSettingAction ]], [[ https://meta.wikimedia.org/wiki/Schema:MobileWikiAppiOSSessions | MobileWikiAppiOSSessions ]] into partitions in Hadoop)
* For the user properties table, we want to NULL the IP and userAgent fields after 90 days, but keep the country names, OS versions and app versions. How can we do that?