
Popups timestamp field contains multiple types
Closed, Resolved · Public · 13 Estimated Story Points

Description

Field timestamp is repeated with multiple non StructType types: StructField(timestamp,DoubleType,true), StructField(timestamp,LongType,true)

This is an error message I'm getting from the Hive refinement step. It seems that clients are emitting the new timestamp field sometimes as integers and sometimes as decimals. This passes JSONSchema validation, because number accepts both integers and decimals. However, this won't work for Hive: in a strongly typed system it is really important to know the exact type. The error means that refinement failed for the hours of data that contain both types. It looks like the first successfully refined hour contained only decimal values, so the timestamp field in Hive is of type double.

Can y'all make sure the client code always sends decimal numbers? Adding a .0 to the end of the number would be good enough.

NOTE: From T182000#3809923 onwards, we (Readers Web) decided to round the timestamp property of all events.

Event Timeline

This really should be addressed in the schema validation for event logging.
Is this possible?

Sort of! That would be best, but JSONSchema does not have a float type. JavaScript considers 1 and 1.0 to be equivalent numbers.

From my brief googling, the Python JSONSchema validator can do this. You'd need to declare your field as something like:

"type": "number",
"not": { "type": "integer" },

Although, EventLogging analytics schemas are JSONSchema draft 3, so this might actually be:

"type": "number",
 "disallow": [ "integer" ],

I haven't tested this, though, and I'm worried that it might not work with the EventLogging extension or other clients that produce the data, as I don't know whether their JSONSchema implementations support it (that is, whether those clients also validate before producing to the EventLogging beacon).
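For what it's worth, here's a quick, untested sketch of how the draft-4 "not": {"type": "integer"} approach behaves with the Python jsonschema package. This is illustrative only; EventLogging's own validators may behave differently:

from jsonschema import Draft4Validator

# A number that must not be an integer, per the draft-4 "not" keyword.
schema = {"type": "number", "not": {"type": "integer"}}
validator = Draft4Validator(schema)

print(validator.is_valid(1234.5))  # True: a decimal passes
print(validator.is_valid(1234))    # False: an integer is rejected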

We gotta figure this out - this error needs to fail sooner than hive!

While strengthening the validator is a good goal, the fix for this may be quick and easy: we round the number before sending it to the backend, i.e. we always send integer values.

The timestamp field is intended to hold the number of milliseconds from the page loading to the user action occurring, so that we can determine the average, median, p75, p95, and p99 values for the first link interaction on a page, which will then inform a lazy loading strategy for the Page Previews codebase. Millisecond accuracy seems sufficient here: browsers' internal timers are only accurate to ~2 ms at best and ~16 ms at worst, for example.
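As a toy illustration (Python, with made-up values) of the rounding being proposed: the millisecond offset is rounded before it is serialised, so the value on the wire is always integral.

import json

raw_offset_ms = 3032.6850000000004  # the sort of value a high-resolution timer returns
event = {"action": "pageLoaded", "timestamp": round(raw_offset_ms)}
print(json.dumps(event))  # {"action": "pageLoaded", "timestamp": 3033}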

i.e. we always send integer values.

That'll be fine, but it will mean that much of the data currently imported will be incompatible. We still have the raw (unrefined) data. To get it into Hive tables, we'll have to do some manual conversions from the raw data.

We gotta figure this out - this error needs to fail sooner than hive!

I agree, but such is the JSON world we live with :p

ovasileva raised the priority of this task from Medium to High. Dec 4 2017, 8:01 PM

Change 395083 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/Popups@master] Schema: convert timestamp into integer

https://gerrit.wikimedia.org/r/395083

What if this fix was in the EventLogging code itself?

EventLogging (server side) code isn't going to help much here, unless we do some very special casing for this field. The problem is JSON/JSONSchema itself not differentiating between decimals and integers.

Per https://phabricator.wikimedia.org/T182000#3810054 @Ottomata is it okay to merge the patch? I'm not sure how much of a pain manual conversions are but it seems we prefer to work with integers here - the float doesn't add anything useful.

i.e. we always send integer values.

That'll be fine, but it will mean that much of the data currently imported will be incompatible. We still have the raw (unrefined) data. To get it into Hive tables, we'll have to do some manual conversions from the raw data.

If it helps to save work, it should be OK to throw away the data received before https://gerrit.wikimedia.org/r/395083 goes live, and have the experiment run a little longer instead. [Edit for extra clarity: throw away the data from today only, of course, not the data received during the previous test that ended in T178500.]

We gotta figure this out - this error needs to fail sooner than hive!

I agree, but such is the JSON world we live with :p

So for context, would this problem have occurred in the same way with the existing MySQL EL setup (instead of the new Hive refinement process)?

Alright. Per T182000#3810935, I think it's fine to merge the change, deploy it, and then continue with T181493: Enable Page Previews EventLogging instrumentation.

I haven't tested this, though, and I'm worried that it might not work with the EventLogging extension or other clients that produce the data, as I don't know whether their JSONSchema implementations support it (that is, whether those clients also validate before producing to the EventLogging beacon).

Indeed. The EventLogging client's JSON Schema validator doesn't support draft 3 of JSON Schema. I've created T182094: Update client-side event validator to support (at least) draft 3 of JSON Schema to track this problem, which is a subtask of the (very old!) T46809: Use draft 4 of JSON Schema specification. I'll add a little more detail to the latter around the client and server being out of sync in the versions of JSON Schema that they support.

If it helps to save work, it should be OK to throw away the data received before https://gerrit.wikimedia.org/r/395083

That will definitely make it easier! :) Actually, I could even recreate the table with an integer field, and the opposite will happen: hours of existing data with only integer timestamps will be refined, and hours of data with floats will fail.

So for context, would this problem have occurred in the same way with the existing MySQL EL setup (instead of the new Hive refinement process)?

Just checked:

MariaDB [staging]> desc otto1;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| i     | int(11) | YES  |     | NULL    |       |
| f     | float   | YES  |     | NULL    |       |
+-------+---------+------+-----+---------+-------+

MariaDB [staging]> insert into otto1 values (1, 1.1);
Query OK, 1 row affected (0.00 sec)

MariaDB [staging]> insert into otto1 values (1.1, 1);
Query OK, 1 row affected (0.00 sec)

MariaDB [staging]> select * from otto1;
+------+------+
| i    | f    |
+------+------+
|    1 |  1.1 |
|    1 |    1 |
+------+------+
2 rows in set (0.00 sec)

In the case where you insert a float into an int, you just lose precision. In the case where you insert an int into a float, you just keep the int value (since I guess it is a valid 'float' in MySQL). It looks like MySQL does some transparent type casting during insertion.

MBinder_WMF set the point value for this task to 1. Dec 5 2017, 6:13 PM

Change 395083 merged by jenkins-bot:
[mediawiki/extensions/Popups@master] Schema: convert timestamp into integer

https://gerrit.wikimedia.org/r/395083

This LGTM. I'm only seeing integer values for the timestamp property for all events being sent by Page Previews on the Beta Cluster, both when previews are enabled and when they're disabled. I've only tested this in Chrome@62.0.3202.94, as I believe the change should work just fine across all browsers.

Got 2 events in the 2017-12-12T18:00:00 hour that have a decimal timestamp:

{"wiki": "enwiki", "uuid": "XXXX", "webHost": "en.wikipedia.org", "recvFrom": "cp3030.esams.wmnet", "seqId": 5183610, "userAgent": {XXXXX}, "dt": "2017-12-12T18:42:26", "schema": "Popups", "event": {"namespaceIdSource": -1, "isAnon": true, "timestamp": 3032.6850000000004, "sessionToken": "XXXX", "pageToken": "d89115ca4d58e395", "hovercardsSuppressedByGadget": false, "pageIdSource": 0, "popupEnabled": true, "action": "pageLoaded", "previewCountBucket": "unknown", "pageTitleSource": "Search"}, "revision": 17430287}

{"wiki": "enwiki", "uuid": "XXXXX", "webHost": "en.wikipedia.org", "recvFrom": "cp3030.esams.wmnet", "seqId": 5183835, "userAgent": {XXXXX}, "dt": "2017-12-12T18:42:36", "schema": "Popups", "event": {"namespaceIdSource": -1, "isAnon": true, "timestamp": 2187.52, "sessionToken": "XXXX", "pageToken": "773c5acec1ff4351", "hovercardsSuppressedByGadget": false, "pageIdSource": 0, "popupEnabled": true, "action": "pageLoaded", "previewCountBucket": "unknown", "pageTitleSource": "Search"}, "revision": 17430287}

I found these with pyspark2 by parsing the raw imported data for this hour like:

import json

# The raw EventLogging data is stored as sequence files of (key, JSON string) pairs.
data = sc.sequenceFile("/wmf/data/raw/eventlogging/eventlogging_Popups/hourly/2017/12/12/18/")
events = data.map(lambda x: json.loads(x[1]))
# Keep only events whose timestamp serialises with a decimal point.
bad_events = events.filter(lambda e: 'timestamp' in e['event'] and '.' in str(e['event']['timestamp']))
for bad_event in bad_events.collect():
    print(json.dumps(bad_event))

This caused the hour for 2017-12-12T18:00:00 not to be refined into the event.popups table.

This caused the hour for 2017-12-12T18:00:00 not to be refined into the event.popups table.

Seems like we could make the refining more resilient so compliant events are refined even if these two are at fault?

That would be nice. Sounds hard, but I'll look into it.

So for context, would this problem have occurred in the same way with the existing MySQL EL setup (instead of the new Hive refinement process)?

Just checked:

...

In the case where you insert a float into an int, you just lose precision. In the case where you insert an int into a float, you just keep the int value (since I guess it is a valid 'float' in MySQL). It looks like MySQL does some transparent type casting during insertion.

I see, thanks for the explanation! In that case it seems that the kind of error that gave rise to this task may be a common hazard when converting schemas and processes from the existing MySQL setup, right? Seems worth documenting somewhere.

Got 2 events in the 2017-12-12T18:00:00 hour that have a decimal timestamp:

FWIW, I think it's a fairly common occurrence that some isolated events from an old version of a web schema may arrive days or even weeks after that old version has been removed from production, due to caching or simply people keeping their browser tabs open for a long time. In other words, that added resilience could be quite useful.

Yeah, def. Usually (as Toby said) this stuff should be caught by the JSONSchema validation step; it's just a fluke of JSON that we can't easily do that here.

HEY WAIT! You guys switched to integer timestamps. Can you just change the Popups schema to specify the timestamp field type as integer? integer IS supported by JSONSchema, it's just that decimal types are not.

Done. However, I don't think this fixes the issue that there are some clients that haven't refreshed (or won't refresh) and downloaded the new instrumentation/schema.

^ Wait. Doing this means that the validator processes will put the erroneous events in a different queue, which isn't written to the event.* tables, right? This fixes the refinement step, right?

Doing this means that the validator processes will put the erroneous events in a different queue, which isn't written to the event.* tables, right? This fixes the refinement step, right?

Correct, events with decimal timestamps will be invalid.
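As a quick illustrative check with the Python jsonschema package (using draft 3 to match the EventLogging schemas; this is not the actual EventLogging validator code):

from jsonschema import Draft3Validator

field_schema = {"type": "integer"}
print(Draft3Validator(field_schema).is_valid(3033))     # True: rounded timestamps validate
print(Draft3Validator(field_schema).is_valid(2187.52))  # False: decimal timestamps are rejected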

Done. However, I don't think this fixes the issue that there are some clients that haven't refreshed (or won't refresh) and downloaded the new instrumentation/schema.

Right, I'm looking into it, and I have some ideas on how I can make the refinement process a little more lenient about longs vs. doubles. I think I can pretty easily have Spark auto-cast longs to doubles if the source Hive table has a double type, e.g. timestamp: 123 -> timestamp: 123.0. Going the other way is a little trickier, but I think I can make Spark just drop (or nullify) the input records with doubles where the source Hive table has longs. We will see...
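Roughly, the long-to-double direction could look something like the following pyspark sketch (pyspark 2 syntax, hypothetical code, not the actual Refine job):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Simulate an hour of data whose timestamp field was parsed as a long...
df = spark.createDataFrame([(123,), (456,)], ["timestamp"])

# ...but the existing Hive table declares the field as double, so cast before writing.
if dict(df.dtypes)["timestamp"] != "double":
    df = df.withColumn("timestamp", F.col("timestamp").cast(DoubleType()))

df.show()  # timestamp is now 123.0, 456.0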

Change 398293 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[mediawiki/extensions/Popups@master] instrumentation: Bump schema revision

https://gerrit.wikimedia.org/r/398293

Change 398293 merged by jenkins-bot:
[mediawiki/extensions/Popups@master] instrumentation: Bump schema revision

https://gerrit.wikimedia.org/r/398293

@Ottomata: Friendly ping!!1 🙂 Have you had any luck on your side?

I've learned a lot about Spark code, and I think something useful is possible. I haven't 100% fleshed it out yet, though. I'll be working on it today!

Ottomata changed the point value for this task from 1 to 13.

OK! Phew!

https://gerrit.wikimedia.org/r/#/c/399126/ seemed to do it!

I've just deployed this. Now IF there are invalid fields when Spark is refining, the entire partition won't just fail refinement. Instead, the records with bad fields will most likely just be nullified. For your data, I think this is mostly ok. If you ever find fields in your data that are all NULLs, then you know something went wrong.

In the future (probably after we upgrade these jobs to Spark 2, hopefully next quarter), we will look into generating stats/alerts about bad records during refinement.

Most of the time this shouldn't happen, as invalid records should be caught by the EventLogging JSONSchema validation step. But for edge cases (like this one) where they aren't, at least the valid data will still make it into Hive.

I just watched the refinement step, and the few bad days I was getting now succeeded refinement! :)

Summary:

  • The latest version of the Page Previews client will send integer values for the timestamp field (see 38a7978e and 7258be71)
  • Old versions of the Page Previews client, e.g. instances of the codebase running in old browser tabs/windows, may send floating point values for the timestamp field but those events will be rejected by the server at validation time (not refinement time!)
  • The refinement step has been made more resilient to field type mismatches (see 2cc7ba30):
    • Records with bad fields with types that can be coerced will be coerced, favouring the type of the field in the Hive table, and the record will be refined.
    • Records with bad fields with types that can't be coerced will have those fields set to NULL.

Thanks for your hard work and patience, @Ottomata!

Old versions of the Page Previews client, e.g. instances of the codebase running in old browser tabs/windows, may send floating point values for the timestamp field but those events will be rejected by the server at validation time (not refinement time!)

This isn't quite right. Old versions of the client will likely send events with the schema revision set to one that has the timestamp field as a number. Those will not be caught at validation time, because they validate against the given schema revision. One day we'd like to have some kind of 100% requirement that type changes aren't allowed by EventLogging at all. If we had this, then EventLogging could possibly always use the latest revision for a schema.

Records with bad fields with types that can be coerced will be coerced, favouring the type of the field in the Hive table, and the record will be refined.
Records with bad fields with types that can't be coerced will have those fields set to NULL.

Gotta clarify this a bit. IF the type can be coerced AND Spark is smart about it, the field will be cast to the Hive field's type. I *think* this will happen if the data contains integers but the schema specifies a decimal, e.g. 1 -> 1.0. (I have not totally confirmed this.) However, in our case, where the data contains decimals but the schema has integers, Spark will end up setting ALL fields for the record to NULL, not just the offending field. I don't totally know why it does this.
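A rough way to see this behavior (a pyspark 2 sketch; the exact semantics depend on the Spark version and the JSON parse mode, so treat it as illustrative only):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# The Hive-style schema declares timestamp as a long.
schema = StructType([
    StructField("action", StringType(), True),
    StructField("timestamp", LongType(), True),
])

good = '{"action": "pageLoaded", "timestamp": 123}'
bad = '{"action": "pageLoaded", "timestamp": 2187.52}'

df = spark.read.schema(schema).json(spark.sparkContext.parallelize([good, bad]))
df.show()
# The record with the decimal timestamp comes back with all of its columns
# set to null, not just the offending field.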

We'd like to be a little smarter about this too, but we believe that Spark 2 has some better built-in options for what to do when it encounters mismatched field types. It looks like Spark 1 only supports this null-out-the-whole-record behavior.