Page MenuHomePhabricator

Update webperf EventLogging consumers for userAgent schema change
Closed, ResolvedPublic

Description

@Nuria informed us of the following:

We are changing the EL capsule to include the parsed user agent rather than the "raw" string. Field would look identical to what it is on hive, it will be a json object like:

  • {
  • "device_family":"Other",
  • "browser_major":"11",
  • "os_family":"Windows",
  • "os_major":"-",
  • "browser_family":"IE",
  • "os_minor":"-"
  • }

I think you parse this UA in one of your jobs that reports data to graphana so this is an FYI, let us know if these changes might be problematic.

We need to update NavigationTiming and SaveTiming code

Event Timeline

@Nuria what's the migration plan? Will the old field still be available during a transition period?

We need to update NavigationTiming and SaveTiming code

The client reporting code? no, that should not be needed we (eventlogging service) is parsing incoming data to dictionary format and that is how it would be stored on DB.

Will the old field still be available during a transition period?

No, it is a drop in replacement on db (it doesn't affect influx of data)

Let me know what changes are needed in your end and we can probably also take care of those.

We need to update NavigationTiming and SaveTiming code

The client reporting code? no, that should not be needed

No, the server-side code on hafnium that consumes EventLogging and reports to Graphite. In there we inspect the userAgent field to group metrics in a few stable buckets for major browsers.

https://github.com/wikimedia/operations-puppet/blob/51f1d151fa97ee89af28f093a574b90505e4a4f2/modules/webperf/files/navtiming.py#L277

This will need updating to instead perform (as closely as we can) the same matches using the ua-parser result object. We don't want to fragment by more browsers than before, so we can't use it as-is directly as that would create more metrics than we intend.

Either way, we'll likely skew our metrics because ua-parser and ours will come up with slightly different results, which means timing from the same browsers would now up in a different in a Graphite metric group. This affects monitoring and graphing continuity. We'll need to find a close-as-possible mapping based on the ua-parser result.

Will the old field still be available during a transition period?

No, it is a drop in replacement on db (it doesn't affect influx of data)

What will the property name be? Currently it's meta.userAgent. Ideally the new field would have a different name (e.g. meta.uaMap) so that it is easy for us to write and deploy the new code before the EventLogging change is deployed - by using a run-time duck-type check. Then, whenever EventLogging is upgraded, we'll automatically trigger the new code path without having to orchestrate any deploys or restarts from our side. We can then remove the old code at a later time.

I see. I think that consumes from eventlogging-valid-mixed and now those records would look like:

(this is beta, this request is a fake one)

{"event": {"apiMode": 0, "appInstallID": "612491f107e6-56ca-4c90-b721-1359058ffa3c", "fromBack": 0, "fromDisambig": 0, "fromExternal": 0, "fromHistory": 0, "fromInternal": 2, "fromLanglink": 0, "fromNearby": 0, "fromRandom": 0, "fromReadingList": 0, "fromSearch": 0, "leadLatency": 1139, "length": 633, "restLatency": 707, "totalPages": 2}, "recvFrom": "deployment-cache-text04.deployment-prep.eqiad.wmflabs", "revision": 15522505, "schema": "MobileWikiAppSessions", "seqId": 456135, "timestamp": 1486082642, "userAgent": "{\"os_minor\": \"2\", \"os_major\": \"4\", \"device_family\": \"Generic Tablet\", \"os_family\": \"Android\", \"browser_major\": \"4\", \"browser_family\": \"Android\"}", "uuid": "72f99ec206885ee289567998bffe76f5", "wiki": "enwiki"}

What will the property name be? Currently it's meta.userAgent. Ideally the new field would have a different name (e.g. meta.uaMap)

We are replacing the old field as we do not wish to retain raw user agents (see prior example) so you are very right that is s not backwards compatible to your code. Sorry that is not ideal but the userAgent is supposed to be a "string" type wise, which still is. We can do changes to your code to inspect string and "glue" old convention to new one if that makes things easier so you do not have to eat the cost of this change.

{"event": { .. }, "userAgent": "{\"os_minor\": \"2\", \"os_major\": \"4\", \"device_family\": \"Generic Tablet\", \"os_family\": \"Android\", \"browser_major\": \"4\", \"browser_family\": \"Android\"}", .. }

What will the property name be? Currently it's meta.userAgent. Ideally the new field would have a different name (e.g. meta.uaMap)

We are replacing the old field as we do not wish to retain raw user agents (see prior example) so you are very right that is s not backwards compatible to your code. Sorry that is not ideal but the userAgent is supposed to be a "string" type wise, which still is.

Why keep it as a string, though? It should be easy enough to bump the EventCapsule schema and declare this as an object instead. This is clearly a breaking change (much like the removal of the clientIp field for example). Doing so would help a lot:

  • Consumers already parse the event message. Requiring them to perform a second parse to obtain this field is a bit counter-intuitive (also wasteful), and would require adding error handling in user code (right now validation/parsing happens in the client library).
  • Sending it as part of the message would allow for stricter schema validation. We could just declare it as "object", but going further and declaring the properties would improve overall stability and ease of use for EventLogging; by providing consumers with a guarantee that e.g. brower_family will exist in all valid messages. (avoids another level of error handling).
  • Making it a proper breaking change avoids misinterpretation of this field by clients we may have forgotten about (and hopefully help us find it by causing an error instead of silent failure).

Why keep it as a string, though? It should be easy enough to bump the EventCapsule schema and declare this as an object instead. This is clearly a breaking change (much like the removal of the clientIp field for example). Doing so would help a lot:

It is a breaking change, not disputing that. Keeping it as a string makes things simple for the database consumer, we plan to do away with that code next quarter (most likely) and modifying it to change the capsule of EL is not a trivial task. Inserting an event with objects inside objects also requires EL modifications that we rather not do at this time.

We could just declare it as "object", but going further and declaring the properties would improve overall stability and ease of use for EventLogging; by providing consumers with a guarantee that e.g. brower_family will exist in all valid messages. (avoids another level of error handling).

We can do this once we no longer depend on the database consumer, agreed to it being a much better description of what the field is about.

Why keep it as a string, though?

It is a breaking change, not disputing that. Keeping it as a string makes things simple for the database consumer

Ah, because the database only fragments when the inner schema changes, not when the outer one does. We'd have two different field types within the same database column. I understand :)

However, wouldn't that problem go away if we name this field differently? (like uaMap, for example)

Agree that it might be preferable to declare it as object rather than as string, and to rename the field. But shouldn't this discussion be in T153207 instead?

Ah, because the database only fragments when the inner schema changes, not when the outer one does. We'd have two different field types within the same database column. I understand :)
However, wouldn't that problem go away if we name this field differently? (like uaMap, for example)

It would not. The capsule is (I think on purpose) not designed to be changed easily, an additional field would mean that all active schemas that are having data stored on DB (most of them, but not all) would need an ALTER TABLE statement to accommodate a new column with the new field. This is a breaking change, just one of a different sort . It affects the service not the clients.
Also, the current database code does not support inserting events with objects within objects. This is a restriction of the EL DB consumer, and, again, we want to remove in the near term That piece of code entirely.

Thus, in short, the replacement of the raw UA by the parsed one without it being an objet in the schema (yet) is the least invasive change from a practical perspective (have in mind we want to remove all our code that depends on MYSQL on the near term).
I understand that changes need to be done in your end to accommodate both formats and I am offering for our team to do those so it is not a burden on your end.

@Nuria Okay, makes sense! I assumed the DB consumer would automatically store those as separate columns (e.g. "obj_subkey" or "obj.subkey") or as JSON-string in a single column. I'm glad this restriction is on the way out, and I suppose we'll migrate to an object sometime after.

I appreciate the offer to do the migration work, but I'm not worried about that. Should be a fairly small change. I'm more worried about the potential slowdown of extra parsing. As well as added error handling for cases where the json may be invalid and/or missing properties. Right now the schema handler has no error scenario (the received packet enters the thread as a Python object, already json-parsed by zmq and EL-validated).

Krinkle renamed this task from Check if the EventLogging User Agent schema upgrade breaks any performance tool/metric to Update webperf EventLogging consumers for userAgent schema change.Feb 9 2017, 7:36 PM
Krinkle claimed this task.
Krinkle triaged this task as High priority.

I'm glad this restriction is on the way out, and I suppose we'll migrate to an object sometime after.

Yes, we would need to do small changes but nothing major, schema of capsule could evolve on its own pace.

where the json may be invalid and/or missing properties.

I think contract is such that properties should not be missing or json be invalid per code here (let me know if you see anything that needs improvement): https://gerrit.wikimedia.org/r/#/c/335145/3/eventlogging/utils.py

runtime errors could just happen as before, of course.

Change 337158 had a related patch set uploaded (by Nuria):
[WIP] Changes to perf consumer of event logging events

https://gerrit.wikimedia.org/r/337158

Krinkle moved this task from Doing (old) to Radar on the Performance-Team board.

Change 337158 merged by Ottomata:
[operations/puppet] webperf: Update event logging consumer for userAgent schema change

https://gerrit.wikimedia.org/r/337158