
Popups timestamp field contains multiple types
Closed, Resolved · Public · 13 Estimated Story Points

Description

Field timestamp is repeated with multiple non StructType types: StructField(timestamp,DoubleType,true), StructField(timestamp,LongType,true)

This is an error message I'm getting from the Hive refinement step. It seems that clients are emitting the new timestamp field sometimes as integers and sometimes as decimals. This passes JSONSchema validation, because number accepts both integers and decimals. However, this won't work for Hive: in a strongly typed system it is really important to know the exact type. The error means that refinement failed for the hours of data that contain both types. It looks like the first successfully refined hour contained only decimal values, so the timestamp field in Hive is of type double.

Can y'all make sure the client code always sends decimal numbers? Adding a .0 to the end of the number would be good enough.

NOTE: From T182000#3809923 onwards, we (Readers Web) decided to round the timestamp property of all events.

Event Timeline

This really should be addressed in the schema validation for event logging.
Is this possible?

Sort of! That would be best, but JSONSchema does not have a float type. JavaScript considers 1 and 1.0 to be equivalent numbers.

From my brief googling, the Python JSONSchema validator can do this. You'd need to declare your field as something like:

"type": "number",
"not": { "type": "integer" },

Although, EventLogging analytics schemas are JSONSchema draft 3, so this might actually be:

"type": "number",
 "disallow": [ "integer" ],

I haven't tested this, though, and I'm worried that it might not work with the EventLogging extension or other clients that produce the data, as I don't know whether their JSONSchema implementations support it (that is, whether those clients also validate before producing to the EventLogging beacon).
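For what it's worth, here's a quick, untested sketch of how the draft-4 "not": {"type": "integer"} approach behaves with the Python jsonschema package. This is illustrative only; EventLogging's own validators may behave differently:

from jsonschema import Draft4Validator

# A number that must not be an integer, per the draft-4 "not" keyword.
schema = {"type": "number", "not": {"type": "integer"}}
validator = Draft4Validator(schema)

print(validator.is_valid(1234.5))  # True: a decimal passes
print(validator.is_valid(1234))    # False: an integer is rejected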

We gotta figure this out - this error needs to fail sooner than hive!

While strengthening the validator is a good goal, the fix for this may be quick and easy: we round the number before sending it to the backend, i.e. we always send integer values.

The timestamp field is intended to hold the number of milliseconds from the page loading to the user action occurring, so that we can determine the average, median, p75, p95, and p99 values for the first link interaction on a page, which will then inform a lazy loading strategy for the Page Previews codebase. Millisecond accuracy seems sufficient here: browsers' internal timers are only accurate to ~2 ms at best and ~16 ms at worst, for example.
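As a toy illustration (Python, with made-up values) of the rounding being proposed: the millisecond offset is rounded before it is serialised, so the value on the wire is always integral.

import json

raw_offset_ms = 3032.6850000000004  # the sort of value a high-resolution timer returns
event = {"action": "pageLoaded", "timestamp": round(raw_offset_ms)}
print(json.dumps(event))  # {"action": "pageLoaded", "timestamp": 3033}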

i.e. we always send integer values.

That'll be fine, but it will mean that much of the data currently imported will be incompatible. We still have the raw (unrefined) data. To get it into Hive tables, we'll have to do some manual conversions from the raw data.

We gotta figure this out - this error needs to fail sooner than hive!

I agree, but such is the JSON world we live with :p

ovasileva raised the priority of this task from Medium to High. Dec 4 2017, 8:01 PM

Change 395083 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/Popups@master] Schema: convert timestamp into integer

https://gerrit.wikimedia.org/r/395083

What if this fix was in the EventLogging code itself?

EventLogging (server side) code isn't going to help much here, unless we do some very special casing for this field. The problem is JSON/JSONSchema itself not differentiating between decimals and integers.

Per https://phabricator.wikimedia.org/T182000#3810054 @Ottomata is it okay to merge the patch? I'm not sure how much of a pain manual conversions are but it seems we prefer to work with integers here - the float doesn't add anything useful.

i.e. we always send integer values.

That'll be fine, but it will mean that much of the data currently imported will be incompatible. We still have the raw (unrefined) data. To get it into Hive tables, we'll have to do some manual conversions from the raw data.

If it helps to save work, it should be OK to throw away the data received before https://gerrit.wikimedia.org/r/395083 goes live, and have the experiment run a little longer instead. [Edit for extra clarity: throw away the data from today only, of course, not the data received during the previous test that ended in T178500.]

We gotta figure this out - this error needs to fail sooner than hive!

I agree, but such is the JSON world we live with :p

So for context, would this problem have occurred in the same way with the existing MySQL EL setup (instead of the new Hive refinement process)?

Alright. Per T182000#3810935, I think it's fine to merge the change, deploy it, and then continue with T181493: Enable Page Previews EventLogging instrumentation.

I haven't tested this, though, and I'm worried that it might not work with the EventLogging extension or other clients that produce the data, as I don't know whether their JSONSchema implementations support it (that is, whether those clients also validate before producing to the EventLogging beacon).

Indeed. The EventLogging client's JSON Schema validator doesn't support draft 3 of JSON Schema. I've created T182094: Update client-side event validator to support (at least) draft 3 of JSON Schema to track this problem, which is a subtask of the (very old!) T46809: Use draft 4 of JSON Schema specification. I'll add a little more detail to the latter around the client and server being out of sync in the versions of JSON Schema that they support.

If it helps to save work, it should be OK to throw away the data received before https://gerrit.wikimedia.org/r/395083

That will definitely make it easier! :) Actually, I could even recreate the table with an integer field, and the opposite will happen: hours of existing data with only integer timestamps will be refined, and hours of data with floats will fail.

So for context, would this problem have occurred in the same way with the existing MySQL EL setup (instead of the new Hive refinement process)?

Just checked:

MariaDB [staging]> desc otto1;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| i     | int(11) | YES  |     | NULL    |       |
| f     | float   | YES  |     | NULL    |       |
+-------+---------+------+-----+---------+-------+

MariaDB [staging]> insert into otto1 values (1, 1.1);
Query OK, 1 row affected (0.00 sec)

MariaDB [staging]> insert into otto1 values (1.1, 1);
Query OK, 1 row affected (0.00 sec)

MariaDB [staging]> select * from otto1;
+------+------+
| i    | f    |
+------+------+
|    1 |  1.1 |
|    1 |    1 |
+------+------+
2 rows in set (0.00 sec)

In the case where you insert a float into an int, you just lose precision. In the case where you insert an int into a float, you just keep the int value (since I guess it is a valid 'float' in MySQL). It looks like MySQL does some transparent type casting during insertion.

MBinder_WMF set the point value for this task to 1. Dec 5 2017, 6:13 PM

Change 395083 merged by jenkins-bot:
[mediawiki/extensions/Popups@master] Schema: convert timestamp into integer

https://gerrit.wikimedia.org/r/395083

This LGTM. I'm only seeing integer values for the timestamp property for all events being sent by Page Previews on the Beta Cluster, both when previews are enabled and when they're disabled. I've only tested this in Chrome@62.0.3202.94, as I believe the change should work just fine across all browsers.

Got 2 events in the 2017-12-12T18:00:00 hour that have a decimal timestamp:

{"wiki": "enwiki", "uuid": "XXXX", "webHost": "en.wikipedia.org", "recvFrom": "cp3030.esams.wmnet", "seqId": 5183610, "userAgent": {XXXXX}, "dt": "2017-12-12T18:42:26", "schema": "Popups", "event": {"namespaceIdSource": -1, "isAnon": true, "timestamp": 3032.6850000000004, "sessionToken": "XXXX", "pageToken": "d89115ca4d58e395", "hovercardsSuppressedByGadget": false, "pageIdSource": 0, "popupEnabled": true, "action": "pageLoaded", "previewCountBucket": "unknown", "pageTitleSource": "Search"}, "revision": 17430287}

{"wiki": "enwiki", "uuid": "XXXXX", "webHost": "en.wikipedia.org", "recvFrom": "cp3030.esams.wmnet", "seqId": 5183835, "userAgent": {XXXXX}, "dt": "2017-12-12T18:42:36", "schema": "Popups", "event": {"namespaceIdSource": -1, "isAnon": true, "timestamp": 2187.52, "sessionToken": "XXXX", "pageToken": "773c5acec1ff4351", "hovercardsSuppressedByGadget": false, "pageIdSource": 0, "popupEnabled": true, "action": "pageLoaded", "previewCountBucket": "unknown", "pageTitleSource": "Search"}, "revision": 17430287}

I found these with pyspark2 by parsing the raw imported data for this hour like:

import json

# The raw EventLogging data is stored as sequence files of (key, JSON string) pairs.
data = sc.sequenceFile("/wmf/data/raw/eventlogging/eventlogging_Popups/hourly/2017/12/12/18/")
events = data.map(lambda x: json.loads(x[1]))
# Keep only events whose timestamp serialises with a decimal point.
bad_events = events.filter(lambda e: 'timestamp' in e['event'] and '.' in str(e['event']['timestamp']))
for bad_event in bad_events.collect():
    print(json.dumps(bad_event))

This caused the hour for 2017-12-12T18:00:00 not to be refined into the event.popups table.

This caused the hour for 2017-12-12T18:00:00 not to be refined into the event.popups table.

Seems like we could make the refining more resilient so compliant events are refined even if these two are at fault?

That would be nice. Sounds hard, but I'll look into it.

So for context, would this problem have occurred in the same way with the existing MySQL EL setup (instead of the new Hive refinement process)?

Just checked:

...

In the case where you insert a float into an int, you just lose precision. In the case where you insert an int into a float, you just keep the int value (since I guess it is a valid 'float' in MySQL). It looks like MySQL does some transparent type casting during insertion.

I see, thanks for the explanation! In that case it seems that the kind of error that gave rise to this task may be a common hazard when converting schemas and processes from the existing MySQL setup, right? Seems worth documenting somewhere.

Got 2 events in the 2017-12-12T18:00:00 hour that have a decimal timestamp:

FWIW, I think it's a fairly common occurrence that some isolated events from an old version of a web schema may arrive days or even weeks after that old version has been removed from production, due to caching or simply people keeping their browser tabs open for a long time. In other words, that added resilience could be quite useful.

Yeah, def. Usually (as Toby said) this stuff should be caught by the JSONSchema validation step; it's just a fluke of JSON that we can't easily do that here.

HEY WAIT! You guys switched to integer timestamps. Can you just change the Popups schema to specify the timestamp field type as integer? integer IS supported by JSONSchema, it's just that decimal types are not.

Done. However, I don't think this fixes the issue that there are some clients that haven't refreshed (or won't refresh) and downloaded the new instrumentation/schema.

^ Wait. Doing this means that the validator processes will put the erroneous events in a different queue, which isn't written to the event.* tables, right? This fixes the refinement step, right?

Doing this means that the validator processes will put the erroneous events in a different queue, which isn't written to the event.* tables, right? This fixes the refinement step, right?

Correct, events with decimal timestamps will be invalid.
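As a quick illustrative check with the Python jsonschema package (using draft 3 to match the EventLogging schemas; this is not the actual EventLogging validator code):

from jsonschema import Draft3Validator

field_schema = {"type": "integer"}
print(Draft3Validator(field_schema).is_valid(3033))     # True: rounded timestamps validate
print(Draft3Validator(field_schema).is_valid(2187.52))  # False: decimal timestamps are rejected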

Done. However, I don't think this fixes the issue that there are some clients that haven't refreshed (or won't refresh) and downloaded the new instrumentation/schema.

Right, I'm looking into it, and I have some ideas on how I can make the refinement process a little more lenient about longs vs. doubles. I think I can pretty easily have Spark auto-cast longs to doubles if the source Hive table has a double type, e.g. timestamp: 123 -> timestamp: 123.0. Going the other way is a little trickier, but I think I can make Spark just drop (or nullify) the input records with doubles where the source Hive table has longs. We will see...
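Roughly, the long-to-double direction could look something like the following pyspark sketch (pyspark 2 syntax, hypothetical code, not the actual Refine job):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Simulate an hour of data whose timestamp field was parsed as a long...
df = spark.createDataFrame([(123,), (456,)], ["timestamp"])

# ...but the existing Hive table declares the field as double, so cast before writing.
if dict(df.dtypes)["timestamp"] != "double":
    df = df.withColumn("timestamp", F.col("timestamp").cast(DoubleType()))

df.show()  # timestamp is now 123.0, 456.0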

Change 398293 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[mediawiki/extensions/Popups@master] instrumentation: Bump schema revision

https://gerrit.wikimedia.org/r/398293

Change 398293 merged by jenkins-bot:
[mediawiki/extensions/Popups@master] instrumentation: Bump schema revision

https://gerrit.wikimedia.org/r/398293

@Ottomata: Friendly ping!!1 🙂 Have you had any luck on your side?

I've learned a lot about Spark code, and I think something useful is possible. I haven't 100% fleshed it out yet, though. I'll be working on it today!

Ottomata changed the point value for this task from 1 to 13.

OK! Phew!

https://gerrit.wikimedia.org/r/#/c/399126/ seemed to do it!

I've just deployed this. Now IF there are invalid fields when Spark is refining, the entire partition won't just fail refinement. Instead, the records with bad fields will most likely just be nullified. For your data, I think this is mostly ok. If you ever find fields in your data that are all NULLs, then you know something went wrong.

In the future (probably after we upgrade these jobs to Spark 2, hopefully next quarter), we will look into generating stats/alerts about bad records during refinement.

Most of the time this shouldn't happen, as invalid records should be caught by the EventLogging JSONSchema validation step. But for edge cases (like this one) where they aren't, at least the valid data will still make it into Hive.

I just watched the refinement step, and the few bad days I was getting now succeeded refinement! :)

Summary:

  • The latest version of the Page Previews client will send integer values for the timestamp field (see 38a7978e and 7258be71)
  • Old versions of the Page Previews client, e.g. instances of the codebase running in old browser tabs/windows, may send floating point values for the timestamp field but those events will be rejected by the server at validation time (not refinement time!)
  • The refinement step has been made more resilient to field type mismatches (see 2cc7ba30):
    • Records with bad fields with types that can be coerced will be coerced, favouring the type of the field in the Hive table, and the record will be refined.
    • Records with bad fields with types that can't be coerced will have those fields set to NULL.

Thanks for your hard work and patience, @Ottomata!

Old versions of the Page Previews client, e.g. instances of the codebase running in old browser tabs/windows, may send floating point values for the timestamp field but those events will be rejected by the server at validation time (not refinement time!)

This isn't quite right. Old versions of the client will likely send events with the schema revision set to one that has the timestamp field as a number. Those will not be caught at validation time, because they validate against the given schema revision. One day we'd like to have some kind of 100% requirement that type changes aren't allowed by EventLogging at all. If we had this, then EventLogging could possibly always use the latest revision for a schema.

Records with bad fields with types that can be coerced will be coerced, favouring the type of the field in the Hive table, and the record will be refined.
Records with bad fields with types that can't be coerced will have those fields set to NULL.

Gotta clarify this a bit. IF the type can be coerced AND Spark is smart about it, the field will be cast to the Hive field's type. I *think* this will happen if the data contains integers but the schema specifies a decimal, e.g. 1 -> 1.0. (I have not totally confirmed this.) However, in our case, where the data contains decimals but the schema has integers, Spark will end up setting ALL fields for the record to NULL, not just the offending field. I don't totally know why it does this.
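A rough way to see this behavior (a pyspark 2 sketch; the exact semantics depend on the Spark version and the JSON parse mode, so treat it as illustrative only):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# The Hive-style schema declares timestamp as a long.
schema = StructType([
    StructField("action", StringType(), True),
    StructField("timestamp", LongType(), True),
])

good = '{"action": "pageLoaded", "timestamp": 123}'
bad = '{"action": "pageLoaded", "timestamp": 2187.52}'

df = spark.read.schema(schema).json(spark.sparkContext.parallelize([good, bad]))
df.show()
# The record with the decimal timestamp comes back with all of its columns
# set to null, not just the offending field.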

We'd like to be a little smarter about this too, but we believe that Spark 2 has some better built-in options for what to do when it encounters mismatched field types. It looks like Spark 1 only supports this null-out-the-whole-record behavior.