
Event Logging schemas for Wikipedia iOS app
Closed, Resolved · Public

Description

Tl;dr: The iOS team is planning to implement event logging (spec slides) for the reading list feature, and we need your suggestions.

As you all know, the iOS team has been using Piwik to track user behavior, but it doesn't work very well because Piwik can't handle the volume of events from the iOS app. We're also sending data to some event logging schemas, but a lot of them haven't been maintained for a long while (e.g. T192520) or aren't collecting the data we want. As a first step toward sunsetting Piwik, we decided to implement event logging for the synced reading list feature on the iOS app and adopt a format used by Piwik and Google Analytics. We will gradually implement EL on other features using the same format, and once we finish we will stop using Piwik and clean up the unused EL schemas. Before proceeding, we want to reach out to the interested and affected parties for feedback and suggestions.

The schema

We will implement several event tables and one user properties table.

Event Table

The event tables will record users' interactions with the app using four fields: Category (on which screen), Label (optional: which element of that screen), Action (what action the user performed), and Measure (optional: an associated number, if there is one). In addition to the standard event capsule, all app events will share a "meta schema" that provides app-specific context. This capsule will include: app_install_id, primary_language, is_anon (whether the user is anonymous, i.e. not logged in), event_dt (client-side timestamp) and session_id.
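For illustration, a query against one of these event tables in Hive might look like the sketch below (the event database name and year/month partition columns are assumptions based on the usual EventLogging refinement layout):

-- A sketch, not a tested query
SELECT
  event.category,           -- which screen, e.g. 'reading_lists'
  event.label,              -- optional: element on that screen
  event.action,             -- what the user did, e.g. 'save'
  event.measure,            -- optional: associated number
  event.app_install_id,
  event.primary_language,
  event.is_anon,
  event.event_dt,
  event.session_id
FROM event.MobileWikiAppiOSReadingLists
WHERE year = 2018 AND month = 5
LIMIT 10;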

Because recording all the events in one table is not good for query efficiency, we will break it down by function (MobileWikiAppiOSReadingLists, MobileWikiAppiOSLoginAction, MobileWikiAppiOSSettingAction, MobileWikiAppiOSSessions), although all of them will have the same fields. Like other EL tables, the event tables will be purged after 90 days.

User Properties History Table

The user properties table (MobileWikiAppiOSUserHistory) records the historical states of user properties. These properties include how many articles they have saved, how many reading lists they have created, whether they have turned on reading list sync, primary language, text size choice, theme choice, etc. When users first open the app after an install or update, we record these property values (the initial state) locally and send them to the event logging server. At the end of each session, we take a snapshot of these properties: if ANY of the values has changed compared with the user's last snapshot, we send the new snapshot to the server with ALL property values, the session_end timestamp and the session_id. If NONE of the values has changed, we don't send the snapshot. Like the event tables, we will send a capsule with every user state, including app_install_id, event_dt and session_id.

Unlike the event tables and other EL tables, the user properties table will NOT be purged, except for the IP address and user agent fields (we will set IP and userAgent to NULL but keep OS version, app version and country). After discussing with legal, in order to keep users from being identified through this data, we are not going to track users in countries with small numbers of active users. Specifically, we will only collect data from users in the top 50 countries in this list (from the US to Egypt), ranked by average daily unique visitors per country from January 1 to March 31, 2018. We will keep monitoring these numbers; if some of these countries end up with a very small number of active users over a period of time, we will adjust the list of countries we collect data from accordingly.
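As a sketch of what that retention rule would mean in practice (table and column names here are illustrative, not the actual purge job):

-- Hypothetical sanitization pass: copy rows older than 90 days with
-- ip and useragent nulled, keeping the coarse whitelisted fields
INSERT OVERWRITE TABLE mobile_wiki_app_ios_user_history_sanitized
SELECT
  event_dt,
  app_install_id,
  CAST(NULL AS STRING) AS ip,         -- purged
  CAST(NULL AS STRING) AS useragent,  -- purged
  country,                            -- kept
  os_version,                         -- kept
  app_version                         -- kept
FROM mobile_wiki_app_ios_user_history
WHERE substr(event_dt, 1, 10) < CAST(date_sub(current_date(), 90) AS STRING);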

Sampling

Because the volume of users who agree to share their usage data with us on the iOS app is not very big, we will send all the data, unsampled, to the Hadoop cluster (and not to MySQL at all).

See the spec slides for more details and examples.

Why don’t we use Android team’s EL schemas?

In short, Android's EL schema is tailored to Android's flow and is not immediately usable for iOS. Using the same schema would require adjustments to the implementation of both apps. Even after those adjustments, we still couldn't use the same logic to consume the data, which leaves almost no benefit to using the same schema.

Take the reading list EL schema as an example. At first we wanted to use MobileWikiAppReadingLists as Android did, but after reviewing the reading list flow in both apps, we found that we would have to add an 'addtodefault' event (see T190748#4098226 for more details). Even after this adjustment, if we want to count the number of articles added after the release, for iOS we need to count the 'addtodefault' events, while for Android we need to count the 'addtodefault', 'addtoexisting' and 'addtonew' events and then sum them up.
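To make the cost concrete, counting "articles added" under a shared schema would need per-platform logic along these lines (a sketch; the platform filter via the capsule's user agent is omitted):

-- iOS would count a single event type...
SELECT COUNT(*) AS ios_adds
FROM event.MobileWikiAppReadingLists
WHERE event.action = 'addtodefault';

-- ...while Android would need three event types summed
SELECT COUNT(*) AS android_adds
FROM event.MobileWikiAppReadingLists
WHERE event.action IN ('addtodefault', 'addtoexisting', 'addtonew');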

Why choose this format?

Using this format to store users’ events and properties can benefit us in the following ways:

  • It fulfills our needs and conforms to the Analytics Engineering team's guidelines (although not 100%; see the question section below), which means it can be piped into Druid easily so that we can use Superset to build a dashboard.
  • Since all the event tables have the same fields, we can union them easily and then analyze the conversion funnel (see the sketch after this list).
  • This format is flexible enough for adding events, moving certain events from one table to another, and supporting new features in the future.
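The union mentioned above might look like this in Hive (a sketch, assuming the tables really do share identical fields):

-- Combine the per-feature event tables for funnel analysis
SELECT 'reading_lists' AS source, event.* FROM event.MobileWikiAppiOSReadingLists
UNION ALL
SELECT 'login' AS source, event.* FROM event.MobileWikiAppiOSLoginAction
UNION ALL
SELECT 'settings' AS source, event.* FROM event.MobileWikiAppiOSSettingAction
UNION ALL
SELECT 'sessions' AS source, event.* FROM event.MobileWikiAppiOSSessions;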

Event Timeline


I think A-team is drafting a more comprehensive response, but here are a few quick ones:

  • Always consider that your schema and field names are going to be mapped to a caseless SQL environment. Please use snake_case instead of camelCase. E.g. mobile_wiki_app_ios_reading_lists and app_install_id are much more readable.
  • If possible, please use the ISO-8601 UTC dt convention we've adopted for timestamps. Instead of client_ts, perhaps event_dt would be better?

  • Re. multiple schemas with the same fields: Sounds fine to me. In the future Event Data Platform blabla, ideally we'd allow schemas to be more easily shared between different event types / topics / tables.
  • Re. single value field with overloaded type. This sounds like a bad idea to me. It doesn't technically break the "don't change types" rule, but I think it'll cause problems down the road. What if you want to sum over the field?
  • Re no sampling & no MySQL. I think this will be fine :)

As mentioned before, all the event tables share the same fields. If we are not going to send any data to MySQL, for simplicity, can we send all the events to one big table partitioned by function?

We can't do this now, as it would require special casing in the EventLogging jobs for your schemas. But, something like this would be nice to support in the future.
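For illustration only, the layout being asked about might correspond to a hypothetical Hive DDL like this (feature stands in for "function", which is a reserved word):

CREATE TABLE mobile_wiki_app_ios_events (
  category         STRING,
  label            STRING,
  action           STRING,
  measure          DOUBLE,
  app_install_id   STRING,
  primary_language STRING,
  is_anon          BOOLEAN,
  event_dt         STRING,
  session_id       STRING
)
PARTITIONED BY (feature STRING);  -- 'reading_lists', 'login', 'settings', 'sessions'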

Thanks @Ottomata !

  • Always consider that your schema and field names are going to be mapped to a caseless SQL environment. Please use snake_case instead of camelCase. E.g. mobile_wiki_app_ios_reading_lists and app_install_id are much more readable.

Agree. We chose camelCase because we wanted to use consistent vocabulary with other schemas (e.g. isAnon, appInstallID have been used in multiple places). But we can change them to snake_case.

  • If possible, please use the ISO-8601 UTC dt convention we've adopted for timestamps. Instead of client_ts, perhaps event_dt would be better?

Yes, we will follow ISO 8601 with a minor deviation as Android did (T186768#4044717), is that ok?

  • Re. single value field with overloaded type. This sounds like a bad idea to me. It doesn't technically break the "don't change types" rule, but I think it'll cause problems down the road. What if you want to sum over the field?

Yes, I understand this is a bad idea :( But we want to make the schema flexible so that it will be easy to add new properties for new features in the future. I was thinking about:

  1. converting booleans to 0 and 1, and converting strings to IDs (e.g. 1 = English, 2 = French, ...). But to make it readable in Superset, we would have to join the IDs with a lookup table. OR
  2. something like:
| property | integer_value | boolean_value | string_value |
| --- | --- | --- | --- |
| readinglist_listcount | 3 | NULL | NULL |
| isAnon | NULL | true | NULL |
| primaryLanguage | NULL | NULL | English |

But none of them looks like a very good idea... :( Any suggestions?

Yes, we will follow ISO 8601 with a minor deviation

That deviation (using the timezone) is fine, but let's call the field something with dt in the name then. We try to use 'ts' when it is an integer timestamp. event_dt makes the most sense to me, and is consistent with the concept of 'event time' in general, not just 'client time'.

  2. something like:

Hm, readinglist_listcount here seems different than primaryLanguage or isAnon. Is listcount really a property the user can change? Or is it something you are trying to measure? You really want to model state changes in user properties, right?

We've done this in 2 different ways for mediawiki/event-schemas. The first, which I don't love, just has schema-less removed and added properties objects. This allows for flexibility (and should work ok with EventLogging Hive Refinement), but might not be very future proof. We may need strict schemas for the future Event Data Platform work. The second is a little better, but less flexible. It explicitly declares the current and prior state of each property anytime there is a property change.

If you do either of the above, you'll be able to emit the prior and current state of all properties for a user when any property changes. That would allow you to know the current state of all properties of a user given the latest event.
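Concretely, the second option could be declared along these lines (a sketch with hypothetical names; the actual mediawiki/event-schemas definitions differ):

CREATE TABLE user_property_change (
  event_dt        STRING,
  app_install_id  STRING,
  prior_state     STRUCT<readinglist_listcount: INT,
                         is_anon: BOOLEAN,
                         primary_language: STRING>,
  current_state   STRUCT<readinglist_listcount: INT,
                         is_anon: BOOLEAN,
                         primary_language: STRING>
);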

That deviation (using the timezone) is fine, but let's call the field something with dt in the name then. We try to use 'ts' when it is an integer timestamp. event_dt makes the most sense to me, and is consistent with the concept of 'event time' in general, not just 'client time'.

Will do! Thanks @Ottomata !

Hm, readinglist_listcount here seems different than primaryLanguage or isAnon. Is listcount really a property the user can change? Or is it something you are trying to measure?

readinglist_listcount is something I'm trying to measure at the beginning of each session on the app. I couldn't measure it by counting events because this number can be changed on other devices and then synced to the app.

You really want to model state changes in user properties, right?

Yes, I want to record all the historical states of user properties. (Actually, as we talked about before, I only want the most recent state of user properties for now. But considering that appending a record is easier than updating a record in a table, and that we can potentially use those past states for some fun analysis, we decided to keep all the historical states.)

The second is a little better, but less flexible. It explicitly declares the current and prior state of each property anytime there is a property change.

The second one looks good to me! Just want to make sure I understand it correctly: with the second option, I will see a table like this:

| event_dt | app_install_id | properties | note |
| --- | --- | --- | --- |
| 2018-03-10T14:03:56+0000 | 12345 | {"readinglist_listcount":3, "isAnon":true, "primaryLanguage":"English"} | Initial state |
| 2018-03-15T14:03:56+0000 | 12345 | {"readinglist_listcount":3, "isAnon":true, "primaryLanguage":"Chinese"} | User changes their primary language |
| 2018-04-10T14:03:56+0000 | 12345 | {"readinglist_listcount":5, "isAnon":true, "primaryLanguage":"Chinese"} | Number of reading lists changes from 3 to 5 |

Then by querying MAX(event_dt) I can know the current state of all properties of a user.
A few questions about this solution:

  • If I add a property (e.g. from the one in the above table to {"readinglist_listcount":5, "isAnon":true, "primaryLanguage":"Chinese", "font_size": 3}), is it backwards compatible?
  • What if I have many properties to record in the future? Is there a size limit on the properties field?
  • What do I need to do to convert the nested fields into druid columns (like ua_os_family)? An oozie job with a query to select the individual nested field?

The properties field is a nested data structure, so in a Hive table, you'll get a struct field, and be able to access the fields like:
SELECT event.properties.readinglist_listcount or SELECT event.properties.*, etc. We recommend flat structures because there are many systems (like Druid) where nested fields aren't supported. (We actually have to flatten any EventLogging data for Druid anyway, because technically all schema fields are nested, since they are enclosed in the capsule as the event field.) We can support nested fields just fine, but they might not be very future proof, so we recommend in the schema guidelines that you usually avoid them.
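For illustration, the kind of flattening a Druid ingestion needs can be done with a plain projection (a sketch, using the field names from the example above):

-- Promote nested fields to top-level columns before Druid ingestion
SELECT
  event.event_dt                       AS event_dt,
  event.app_install_id                 AS app_install_id,
  event.properties.readinglist_listcount AS readinglist_listcount,
  event.properties.isAnon              AS is_anon,
  event.properties.primaryLanguage     AS primary_language
FROM MobileWikiAppiOSUserHistory;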

If I add a property

Adding an optional property is supported :) Pretty much any other change is not.

Is there a size limit on the properties field?

Not that I know of!

What do I need to do to convert the nested fields into druid columns

Any Druid ingestion job is going to have to flatten anyway, so just be sure that you don't have any flattened column name collisions. E.g. having both event_dt and event.dt fields would collide during flattening.

Thanks @Ottomata !

We can support nested fields just fine, but they might not be very future proof, so we recommend in the schema guidelines that you usually avoid them.

I think we'd best flatten the properties field, i.e. the table in T192819#4152042 would become

| event_dt | app_install_id | readinglist_listcount | isAnon | primaryLanguage | note |
| --- | --- | --- | --- | --- | --- |
| 2018-03-10T14:03:56+0000 | 12345 | 3 | true | "English" | Initial state |
| 2018-03-15T14:03:56+0000 | 12345 | 3 | true | "Chinese" | User changes their primary language |
| 2018-04-10T14:03:56+0000 | 12345 | 5 | true | "Chinese" | Number of reading lists changes from 3 to 5 |

As long as we only add new fields, and don't change the data types of or delete existing fields in the future, we should be fine, right?
(I was trying to avoid this much simpler design just because I didn't want to send all the properties every time when only one of them has changed, without thinking about all these downsides :P)
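For illustration, that additive-only evolution corresponds to something like the following in Hive (font_size being the hypothetical new property mentioned earlier):

-- Additive, backward-compatible change: old rows read the new
-- column as NULL; nothing existing changes type or disappears
ALTER TABLE mobile_wiki_app_ios_user_history
  ADD COLUMNS (font_size INT COMMENT 'hypothetical new property');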

Also in the guidelines:

All fields are "dimensions" by default unless they are prefixed with measure_

Is this a must? Does it mean there will be extra steps in druid to define which one is a "number" and can be summed up or averaged?

I was trying to avoid this much simpler design just because I don't want to send all the properties every time when only one of them has changed

You could still do this, especially if you include the prior value of the property that has changed in the event. You'd then have to parse the entire history to rebuild the current state though.
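Rebuilding the current state from such a sparse change history might look like the sketch below (assuming Hive's optional skip-nulls argument to last_value):

-- Carry the most recent non-NULL value of each property forward,
-- then keep one row per install
SELECT DISTINCT
  app_install_id,
  last_value(readinglist_listcount, true) OVER w AS readinglist_listcount,
  last_value(is_anon, true)               OVER w AS is_anon,
  last_value(primary_language, true)      OVER w AS primary_language
FROM mobile_wiki_app_ios_user_history
WINDOW w AS (
  PARTITION BY app_install_id
  ORDER BY event_dt
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);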

All fields are "dimensions" by default unless they are prefixed with measure_

We don't actually have a good automated and generic eventlogging -> druid process in place yet, but in order to build one, we'd need an explicit way to know which fields are dimensions and which aren't. Do you want this user state table to actually go to Druid? Likely you'd have a job (Hive or Spark + Oozie, or whatever) that joins together your datasets into some kind of aggregate, and then that would be sent to Druid.

BTW, don't forget you probably want is_anon and primary_language, etc.! I know you are trying to be consistent, but we should fix the old bad usage asap. Mixed case doesn't work well in SQL systems.

Re. single value field with overloaded type. This sounds like a bad idea to me. It doesn't technically break the "don't change types" rule, but I think it'll cause problems down the road. What if you want to sum over the field?

To expand on this a bit, overloaded fields do not play well with Druid indexation and precomputation, as they cannot be aggregated. You cannot, for example, define a sum on a field that holds integers (3, 5, 6) and booleans (true, false). The field would need to be ingested as a string, and fields with a bazillion distinct values are not Druid's strong suit. Druid will not be useful for data organized that way.

Note that the Google documentation you linked also assumes typing on "columns" (in the "category, label, action, value" scheme). All fields are strings except value, which is an integer. This is what, on our end, we call a "measure".

Also, in order to purge data, having a container field that can host data of any type - which might be subject to different retention schemes - is inadvisable; there is no way to know how to retain or purge the field.

Thanks so much @Ottomata and @Nuria !

I have

  1. changed all the camelCase to snake_case
  2. renamed the integer fields to measure or gave them a measure_ prefix
  3. flattened MobileWikiAppiOSUserHistory as described in T192819#4154725

in this task description, the schemas and the spec slides.

Do you want this user state table to actually go to druid? likely you'd have a job (hive or spark + oozie, or whatever) that joins together your datasets into some kind of aggregate, and then that would be sent to Druid.

Yes, I'd like to have a job to get the most recent state from this user state table:

-- most_recent_state: for each install, keep only the row with the
-- latest event_dt
SELECT p.*
FROM (
  -- latest snapshot time per install
  SELECT event.app_install_id AS id, MAX(event.event_dt) AS max_dt
  FROM MobileWikiAppiOSUserHistory
  GROUP BY event.app_install_id
) l
INNER JOIN MobileWikiAppiOSUserHistory p
  ON l.max_dt = p.event.event_dt
 AND l.id = p.event.app_install_id;

Then pipe this most_recent_state table into Druid. I suppose Druid can support calculations like SELECT COUNT(DISTINCT event.app_install_id) or SELECT AVG(event.measure_readinglist_itemcount), correct?

@Nuria, for the user state table that we are talking with legal about white-listing (MobileWikiAppiOSUserHistory), we want to NULL the IP and userAgent after 90 days, but keep the country names, OS versions and app versions. Is that doable?

We might need some input from another analytics team member (@JAllemandou ?) about how this would work in Druid. Druid is usually (but maybe not always) inherently time series, so the idea of putting an updated state into it seems a little weird. I might be wrong though.

@Nuria, for the user state table that we are talking with legal about white-listing (MobileWikiAppiOSUserHistory), we want to NULL the IP and userAgent after 90 days, but keep the country names

I think that sounds fine; in the absence of identifiers, or of tiny two-person countries, that data should not be identifying.

Druid is usually (but maybe not always) inherently time series,

I think Druid works best with timeseries data, but its authors avoid calling it a timeseries database, so you truly could use it w/o a time dimension, but I doubt that is a well documented case that works as well as the timeseries-based one.

I am not sure what the most_recent_state table would contain. Notice that Druid is also not good with high cardinality, so fields like app_install_id are not a good fit either; per your statement above, cardinality on that field is 65,000 per day. Thus, to sum up, the best use case for Druid is a time series of, say, m dimensions with n measures, where none of the dimensions has high cardinality.

Druid is usually (but maybe not always) inherently time series

@Ottomata @Nuria I think you are right. This druid documentation says "Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter."

Notice that druid is also not good with high cardinality

Got it. Thanks!

Just a thought on the overloaded field...

I'm on board with generic slots, as they increase flexibility; however, I see the issue with putting unknown types in them.

I believe many out-of-the-box analytics solutions usually deal with "generic" slots in 2 ways:

  1. Send a companion type field to instruct the backend on how to interpret value.
  2. Use multiple generic fields each specific to a type (a number field, Boolean field, string field, date field, etc)

I'm on board with generic slots, as they increase flexibility; however, I see the issue with putting unknown types in them.

I think the current proposal ( @chelsyx can correct me) does not include fields that can have multiple types, and abides by the guidelines for how schemas need to be set up so that they are ingestable into Druid.

Thanks @Fjalapeno for the suggestion!

I'm on board with generic slots, as they increase flexibility; however, I see the issue with putting unknown types in them.

As @Nuria mentioned, the current proposal does not include fields that can have multiple types.

... and abides by the guidelines for how schemas need to be set up so that they are ingestable into Druid

@Nuria, as we discussed in T192819#4159439, since Druid is not good at dealing with high cardinality, and the most likely usage of this user history table is to count distinct app_install_id at a given timestamp (with some filter), we are not going to pipe this table into Druid directly. We will pipe an aggregation of it (the result of counting distinct app_install_id) into Druid if needed.
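That pre-aggregation might be as simple as the sketch below (column names follow the flattened design above):

-- Daily rollup small enough for Druid: one row per day, no
-- high-cardinality dimensions
SELECT
  substr(event_dt, 1, 10)              AS dt,
  COUNT(DISTINCT app_install_id)       AS unique_installs,
  AVG(measure_readinglist_itemcount)   AS avg_readinglist_items
FROM mobile_wiki_app_ios_user_history
GROUP BY substr(event_dt, 1, 10);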

  1. Send a companion type field to instruct the backend on how to interpret value.

Hmm, I'm wondering whether this is something doable with our event logging backend. @Ottomata any idea?

  2. Use multiple generic fields, each specific to a type (a number field, a Boolean field, a string field, a date field, etc.)

Just to be clear, you're talking about a table that looks like:

| event_dt | app_install_id | property | measure_integer | measure_boolean | measure_string |
| --- | --- | --- | --- | --- | --- |
| 2018-03-10T14:03:56+0000 | 12345 | readinglist_listcount | 3 | NULL | NULL |
| 2018-03-10T14:03:56+0000 | 12345 | isAnon | NULL | true | NULL |
| 2018-03-10T14:03:56+0000 | 12345 | primaryLanguage | NULL | NULL | English |

I agree that this is more flexible than our current design. But considering that we are going to send this table to Hadoop only (no MySQL), and that adding new fields to it would be backward compatible and wouldn't generate new tables, I don't see a huge increase in flexibility from adopting what you proposed. The other benefit of your proposal I can think of is that we wouldn't need to send all the properties every time when only one of them has changed.
@Ottomata @Nuria, what do you think of this proposal compared with the current design?

Such a scheme has many issues, processing being one but not the only one. Example: does NULL in all columns for, say, primaryLanguage signal a value, and if so, of what type? In general (and note this is shared advice with the Google guidelines that @chelsyx included), typed fields are strongly encouraged, and explicit type information is needed for aggregating values.

  1. Send a companion type field to instruct the backend on how to interpret value.

Not without extra code to figure it out. I think this would make things more inflexible, as we want to move towards open source generic solutions for event ingestion, and the more custom stuff we have to do the harder that is.

  2. Use multiple generic fields each specific to a type

I don't love this from an aesthetic standpoint (I like when variables, fields, columns, etc. have explicit semantic meaning), but there aren't any technical reasons it wouldn't work. :)

Just confirming that I have spoken with @chelsyx off-ticket, and have approved the team's plan for the country whitelist.

Change 434424 had a related patch set uploaded (by Chelsyx; owner: Chelsyx):
[operations/puppet@production] Blacklisting new iOS eventlogging schemas on MySQL

https://gerrit.wikimedia.org/r/434424

Change 434424 merged by Ottomata:
[operations/puppet@production] Blacklisting new iOS eventlogging schemas on MySQL

https://gerrit.wikimedia.org/r/434424


Hi @chelsyx

When working on T195269, I saw that a new field was added to MobileWikiAppiOSUserHistory: feed_enabled_list. This field is a "2-level" nested object with arrays at its leaves; while theoretically this is supported by the EL pipeline, we might see some issues. A couple of comments on it:

  • As MobileWikiAppiOSUserHistory is already blacklisted for MySQL insertion, there will not be problems inserting events for this schema to MySQL and/or sanitizing these events in MySQL.
  • However, as this field can potentially become very long, it might contribute to the whole event overflowing the max URL length of approx. 2000 chars, in which case the events will fail validation in the EL processors. I saw that the subfield names were shortened on purpose, so I assume you are already aware of this.
  • Fields with complex types are not supported in Druid, so this schema, as is, cannot be fully imported into Druid (or Turnilo).
  • I think the schema does not follow the json schema spec when defining the 'ena' and 'dis' sub-fields. I think the [ and ] are not supposed to be there, but I might be wrong.

Please let me know if they make sense.
Thanks!

Thanks @mforns! The latest revision 18126357 of MobileWikiAppiOSUserHistory, which includes the feed_enabled_list field, is still under development, and we haven't sent any data to it yet. Any comments and suggestions are welcome!

However, as this field can potentially become very long, it might contribute to the whole event overflowing the max URL length of approx. 2000 chars, in which case the events will fail validation in the EL processors. I saw that the subfield names were shortened on purpose, so I assume you are already aware of this.

Yes, I'm aware that the length of the URL may be more than 2000 chars in some extreme cases (e.g. the user selects many languages). But I don't have a solution other than putting it into another schema. Do you have any suggestions?

I think the schema does not follow the json schema spec when defining the 'ena' and 'dis' sub-fields. I think the [ and ] are not supposed to be there, but I might be wrong.

'ena' and 'dis' are arrays and should work for json schema. An example of the feed_enabled_list field: https://docs.google.com/presentation/d/1SEznC3gKIgzSR0knzfFNJ7LpmFKF3S_u0_HrDDAbYZY/edit#slide=id.g3c41ec4ac9_0_138

@chelsyx :]

Yes, I'm aware that the length of the URL may be more than 2000 chars in some extreme cases (e.g. the user selects many languages). But I don't have a solution other than putting it into another schema. Do you have any suggestions?

Not really... I assume the idea is to make it queryable, and this should be nicely queryable in Hive, e.g. WHERE array_contains(feed_enabled_list.fa.ena, 'pt'). If ease of querying weren't needed, there might be a shorter way to store those languages, but not a lot shorter, I'd say. Yea, the only thing I can think of, as you say, is putting it in another schema.

'ena' and 'dis' are arrays and should work for json schema. An example of the feed_enabled_list field: https://docs.google.com/presentation/d/1SEznC3gKIgzSR0knzfFNJ7LpmFKF3S_u0_HrDDAbYZY/edit#slide=id.g3c41ec4ac9_0_138

I meant that even though ena and dis are arrays, I think their JSON schema should be:

"dis": {
    "description": "Disabled languages among all user selected languages.",
    "type": "array",
    "required": false,
    "items": {
        "type": "string",
        "description": "Language code",
        "required": false
    }
}

as opposed to:

"dis": {
    "description": "Disabled languages among all user selected languages.",
    "type": "array",
    "required": false,
    "items": [
        {
            "type": "string",
            "description": "Language code",
            "required": false
        }
    ]
}

That's what I understand of the json schema spec, but I just read it once and might be totally wrong.

@chelsyx Oh, ok! Sorry for the confusion :P

@chelsyx: Take a look and see if what is being recorded for that field in that schema is what you want in yours (you can see it in the event database)

"l["|C1718_en6C_mob_FR|1514763460|2.1","|C1718_en6C_mob_FR|1514763496|2.1","|C1718_en6C_mob_FR|1514763517|2.1","|C1718_en6C_mob_FR|1514763561|2.1","|C1718_en6C_mob_FR|1514763618|2.1","|C1718_en6C_mob_FR|1514763654|2.1","|C1718_en6C_mob_FR|1514763616|2.1","|C1718_en6C_mob_FR|1514763718|2.1","|C1718_en6C_mob_FR|1514763698|2.1","|C1718_en6C_mob_FR|1514763723|2.1","|C1718_en6C_mob_FR|1514763845|2.1"]

@Nuria I've checked that the l field in CentralNoticeBannerHistory is what I want. Thanks!

Closing this ticket, as its main purpose -- instrumenting the reading list feature on the iOS app -- has already been accomplished.