
[Bug] Type mismatch between NavigationTiming EL schema and Hive table schema
Closed, Resolved · Public

Description

[[ https://developer.mozilla.org/en-US/docs/Web/API/Navigator/deviceMemory | The Navigator.deviceMemory property ]] is used to populate the deviceMemory property of a NavigationTiming event. That property is a floating point number.

The NavigationTiming schema defines the deviceMemory field as having the number type, reflecting the above. Indeed, the field is set to the value of the above without any processing.

However, the corresponding column in the Hive table has the BIGINT type:

[0]
use event; -- The describe statement can't handle a complex column name with more than 2 separators...
describe navigationtiming.event.deviceMemory;

+---------------+------------+--------------------+
|   col_name    | data_type  |      comment       |
+---------------+------------+--------------------+
| deviceMemory  | bigint     | from deserializer  |
+---------------+------------+--------------------+

Expected results

[1]
use event;
describe navigationtiming.event.deviceMemory;

+---------------+------------+--------------------+
|   col_name    | data_type  |      comment       |
+---------------+------------+--------------------+
| deviceMemory  | double     | from deserializer  |
+---------------+------------+--------------------+

Event Timeline

@Nuria, @Krinkle: Sorry it took me a little while to file this since I stumbled across it.

phuedx renamed this task from [Bug] Mistmatch between NavigationTiming EL schema and Hive table schema to [Bug] Type mistmatch between NavigationTiming EL schema and Hive table schema.Jan 22 2019, 1:30 PM
phuedx edited projects, added Analytics; removed Web-Team-Backlog.
phuedx renamed this task from [Bug] Type mistmatch between NavigationTiming EL schema and Hive table schema to [Bug] Type mismatch between NavigationTiming EL schema and Hive table schema.Jan 22 2019, 1:39 PM

Types are inferred from incoming data. Something to be aware of is that if a field defined as number comes in with values like 1, 27, 300, it will be stored as an integer (regardless of the schema type, as schemas are not consulted when persisting data). For it not to be an integer, values need to be specified like 300.0, 50.0, 70.0. We will look at this, but please take a look at https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines#Schema_set_up, where this shortcoming is explained in more detail.
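A minimal PySpark sketch of the inference behaviour described above (illustrative only, not the actual Refine code): whole-number values are inferred as longs (BIGINT in Hive), and only a value with a decimal part makes the field a double.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Whole-number values: Spark infers long, which surfaces as BIGINT in Hive.
ints = spark.read.json(spark.sparkContext.parallelize(
    ['{"deviceMemory": 1}', '{"deviceMemory": 27}', '{"deviceMemory": 300}']))
ints.printSchema()
#> deviceMemory: long (nullable = true)

# A single fractional value is enough for the field to be inferred as double.
floats = spark.read.json(spark.sparkContext.parallelize(
    ['{"deviceMemory": 0.25}', '{"deviceMemory": 300}']))
floats.printSchema()
#> deviceMemory: double (nullable = true)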

[..] but you'll need to make sure that the values ALWAYS have a decimal point in them. 2.0 is a valid float number, 2 is not. You'll run into trouble if your data has both of these formats for the same field.

It's not clear to me what this instruction is referring to. Within JavaScript, and in JSON serialised from JavaScript, it is not possible to preserve a redundant decimal point.

x = 5;
//> 5

x = 5.0;
//> 5

( 5 === 5.0 );
//> true

JSON.stringify(5.0);
//> "5"

JSON.parse("5.0"); // manually encoded
//> 5

Yeah, that is a big problem we ran into (and I didn't realize when I wrote that part of the wikitech page).

Since whole numbers are always serialised as integers, and we currently infer Hive schemas from the imported JSON data (not from the JSONSchemas), the Hive table creation will sometimes be wrong. E.g. if the first hour of data only contains whole-number values such as "deviceMemory": 1, then the code will assume that deviceMemory is an int. This inference is only done during the first hour in which a field is encountered.

In the future, we'd like to solve this generally by creating Hive tables from the JSONSchemas themselves. This is part of the work for MEP and will likely be done using Kafka Connect, but we haven't gotten that far yet.

For this particular problem, I believe we can manually alter the table to change the field to a double. I'm not 100% sure what will happen to past data here, though; we'd have to find out. It's possible that queries selecting deviceMemory from past data would fail...but Parquet might be smart enough to cast the underlying data.

@Ottomata are you thinking that altering the table will fix this going forward, or will it just fix existing data?

Altering the table will fix this going forward, but it might break existing data...

fdans moved this task from Incoming to Operational Excellence on the Analytics board.

@Krinkle: refined data was down-cast and is going to remain incorrect unless we re-ingest. We have raw data going back 90 days [1], so if you need it, we can alter the table and re-ingest to fix existing records. Let us know either way.

Going forward, we thought of a possible general solution: just treat any numbers as Doubles. Ints, floats, whatever, just store them all as doubles. This may require consumers to cast to int in some rare cases, but in general it should be transparent and deal with the annoying JS/JSON limitation.

[1] https://github.com/wikimedia/puppet/blob/f8e40f287a222ec7bf178fd0205245e0d11a20cb/modules/profile/manifests/analytics/refinery/job/data_purge.pp#L53
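A rough sketch of what the treat-everything-as-a-double idea above could look like in a PySpark-style refine step (names and usage are assumptions for illustration, not the actual refinery code): widen every integral type in the inferred schema to a double before writing.

from pyspark.sql.types import (ArrayType, DoubleType, IntegerType,
                               LongType, StructField, StructType)

def promote_integers_to_doubles(dtype):
    # Recursively replace integral types with DoubleType in a Spark schema.
    if isinstance(dtype, (IntegerType, LongType)):
        return DoubleType()
    if isinstance(dtype, StructType):
        return StructType([
            StructField(f.name, promote_integers_to_doubles(f.dataType), f.nullable)
            for f in dtype.fields
        ])
    if isinstance(dtype, ArrayType):
        return ArrayType(promote_integers_to_doubles(dtype.elementType),
                         dtype.containsNull)
    return dtype

# Hypothetical usage: re-read the raw JSON with the widened schema so that
# whole-number values land as doubles instead of longs.
# widened = promote_integers_to_doubles(spark.read.json(raw_path).schema)
# df = spark.read.schema(widened).json(raw_path)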

Eeee I don't know if that is a good idea... to treat all data as doubles. The right solution is to create the table from the JSONSchema in the first place. We can eventually do this with Kafka Connect etc.

@Milimetric If I understand correctly, you're proposing to change the schema and replace records for which we still have the raw data. Older data will remain as-is, with the integer cast to double by the schema change, right? So for older data, values that should've been 0.25 but are 0 will remain 0 (or 0.0). I think doing that would be nice, and preferable to losing older data entirely.

@Krinkle, the schema wouldn't change; it's fine as it is. The rest is correct. The confusion might come from the fact that we don't use the schema to refine data. That's how we ran into this problem: not knowing what type a field is, we just infer it from the JSON, which doesn't work for whole numbers that need to be stored as Doubles. So, my proposed hack until we can make Refine aware of the schema:

  • when we read raw data and store it, we would always interpret any number as if the schema defined it as a Double
  • we would use this new logic to replace older data with whatever raw data is available

It sounds like you're ok with this, thanks for letting us know.

Thanks for hashing this out, y'all. FWIW I think the proposed interim solution seems reasonable.

Per standup conversation: let's look at how many records are affected by the bug in this case (we will know, as every record that has a zero in the column is potentially information we have lost). If there is only a handful of records (unlikely), backfilling the last 90 days is probably not worth it.

Before backfilling we need to change the column type to a double.
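For reference, a quick way to estimate how many records are affected (a sketch run from a pyspark shell, where spark is already defined; exact partition predicates depend on the table layout):

# Fractional values (e.g. 0.25) that were down-cast show up as 0.
spark.sql("""
    SELECT COUNT(*) AS affected
    FROM event.navigationtiming
    WHERE event.deviceMemory = 0
""").show()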

I still think this is a bad idea! BTW in T215442: Make Refine use JSONSchemas of event data to support Map types and proper types for integers vs decimals we determined that the only way we can really handle 'map' with Refine is to use the JSONSchemas directly. We want to do this with Kafka Connect instead, but don't have resources/timeline to get that deployed and running. So, we will probably need to modify Refine to be able to lookup the JSONSchemas and build a Spark Schema (and Hive Schema) from them. If we do this, it will solve this bug too (for new schemas/fields).

For this particular field, we should just alter the table to use a double.
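For what a schema-aware Refine could do, here is a minimal sketch of mapping JSONSchema types to Spark types (illustrative only, not the refinery implementation): number maps to a double and integer to a long, which is exactly the distinction this bug needs.

from pyspark.sql.types import (BooleanType, DoubleType, LongType,
                               StringType, StructField, StructType)

def jsonschema_to_spark(schema):
    # Translate a (simplified) JSONSchema fragment into a Spark type.
    t = schema.get("type")
    if t == "number":
        return DoubleType()   # decimals, e.g. deviceMemory
    if t == "integer":
        return LongType()     # true integers, e.g. an edit count
    if t == "boolean":
        return BooleanType()
    if t == "object":
        return StructType([
            StructField(name, jsonschema_to_spark(sub), nullable=True)
            for name, sub in schema.get("properties", {}).items()
        ])
    return StringType()

event_type = jsonschema_to_spark(
    {"type": "object", "properties": {"deviceMemory": {"type": "number"}}})
# StructType([StructField("deviceMemory", DoubleType(), True)])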

@Ottomata can you clarify what you think is a bad idea so we are all on the same page?

I just checked and there are records with deviceMemory=0 in the tens of thousands so backfilling is probably a good idea.

@Nuria sorry, I mean assuming that all integers are floats is a bad idea. We should use double for decimals and bigints for integers, just need to get the schemas right.

sorry, I mean assuming that all integers are floats is a bad idea.

Agreed. I do not think we want numbers such as a user's edit_count to show up as 123.0; that makes things quite confusing.

Just to make sure I have the correct sequence of actions in mind:

  • Update event.navigationtiming hive table so that deviceMemory field is a double
  • Backfill refine job for the NavigationTiming schema as much as possible
  • Update not-backfilled data to make deviceMemory a double even without recovering the lost information (preventing query failures)

Is that correct?

Change 490875 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Temporarily deactivate refine_eventlogging for maintenance

https://gerrit.wikimedia.org/r/490875

The plan is to:

  1. Stop refine (ensure => absent in puppet?)
  2. Move /wmf/data/event/NavigationTiming data to temporary location
  3. Drop event.navigationtiming table and recreate it with correct type for field event.deviceMemory
  4. Restart refine (ensure => present in puppet), prod data flows in with correct type
  5. Use custom Refine job to backfill NavigationTiming for last 3 months
  6. Use spark to insert historical data from temporary table into new event.NavigationTiming table (with correct type; see the sketch after this list)
  7. Delete temporary data
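For step 6, a hedged sketch of the cast-and-insert in PySpark (the backup path, partition handling, and the real refinery tooling are simplified away; the backup location is hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical temporary location from step 2.
old = spark.read.parquet("/wmf/data/event/NavigationTiming_backup")

# Rebuild the event struct, casting only deviceMemory to double.
event_fields = [
    (F.col("event.deviceMemory").cast("double") if name == "deviceMemory"
     else F.col("event." + name)).alias(name)
    for name in old.select("event.*").columns
]
fixed = old.withColumn("event", F.struct(*event_fields))

# Append into the recreated table, which now declares deviceMemory as double.
# (insertInto matches columns by position, so column order must line up.)
fixed.write.insertInto("event.navigationtiming")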

Change 490875 merged by Elukey:
[operations/puppet@production] Temporarily deactivate refine_eventlogging for maintenance

https://gerrit.wikimedia.org/r/490875

Change 490887 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Reactivate refine_eventlogging_analytics after maintenance

https://gerrit.wikimedia.org/r/490887

Change 490887 merged by Elukey:
[operations/puppet@production] Reactivate refine_eventlogging_analytics after maintenance

https://gerrit.wikimedia.org/r/490887

Hey all :]

Just finished applying the fix plan.
The current status of event.NavigationTiming is as follows.

  • The deviceMemory field is now a double, and data flowing in is populating it without precision loss.
  • Since 2019-01-01 up to now, the data has been re-refined from raw events to retroactively correct the precision loss.
  • Before 2019-01-01, the field deviceMemory is a double, but still has precision loss.
  • Note that before 2018-10-02 the field deviceMemory was not populated and is NULL.

Cheers!

I really wonder if the way to circumvent this bug (which will reappear in other schemas) is to send measures such as these as integers that include a known precision after the decimal point, similar to how you deal with payments. Example: the payment is $195.97, so the system sends/stores 19597. This is sometimes done to avoid amounts being cast as doubles when they are not such. See Martin Fowler talking about this in a now pretty old article: https://www.martinfowler.com/bliki/HiddenPrecision.html

In this case, if the API always reports memory with YY.xx precision, it could be sent as the integer YYxx. Readers of the value would, of course, need to know the convention, but this would circumvent the bug, and it really is a technique widely used in other systems (for different reasons, but I think we can probably take advantage of it).
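A tiny sketch of that convention (the scale factor of 100 is an assumption for illustration; each field would have to document its own shift):

SCALE = 100  # two decimal places, agreed on by sender and readers

def encode_fixed_point(value):
    # Client side: 195.97 -> 19597, sent as a plain integer.
    return int(round(value * SCALE))

def decode_fixed_point(stored):
    # Reader side: 19597 -> 195.97.
    return stored / SCALE

assert encode_fixed_point(195.97) == 19597
assert decode_fixed_point(19597) == 195.97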

Let me know if this idea sounds good (pinging @Krinkle , @phuedx , @Milimetric , @mforns and @Ottomata) and I will document it as such in our schema guidelines:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines

@Nuria

I think this could work, but I believe it has a couple drawbacks that we can avoid:

  • I think it adds some extra cognitive load, especially because different fields with different orders of magnitude will need different decimal shifts. We can store the length of the shift in the schema field description, but it's still potentially confusing, I think.
  • Also, it would need an explicit conversion on the client (1.2345 => 12345). And the shift length per field would have to be stored somewhere or hardcoded in the instrumentation, no?

I think @Ottomata's suggestion in the comments above (creating the table from the schema) would avoid all this.

Thanks y'all! The procedure seems great, but also especially cautious. This would also work with a simple ALTER TABLE followed by re-refining the data in question using the new table schema.

Looks like we can close this ticket. @Milimetric is going to take a look at other schemas to quantify the data affected by this issue.

@Ottomata I tried the simple ALTER TABLE, and it works, provided the field whose type you want to change is a top-level field. In our case, we want to change the type of a subfield of the event struct. When you do that, the whole event field becomes unreadable for all the data that was already in the table. I think the problem lies in the serde (Parquet serde?). That is why we needed to backfill the entire table since 2017-11. However, we only had raw events for the last 3 months, so part of the backfilling has been done with a temp copy of the old data.

Ah ok that makes sense. Thanks for doing that!