
Invalid navigation timing events
Closed, Resolved · Public

Description

We have seen a couple of instances of very large values for navtiming timestamps that break processing of those events (the values are out of range for a long). An example:

{"dt": "2020-06-04T18:13:46Z", "event": {"action": "view", "connectEnd": *18446742482415155000*, "connectStart": 18446742482415155000, "dnsLookup": 0, "domComplete": 1200, "domInteractive": 200, "fetchStart": 18446742482415155000, "gaps": 0, "hardwareConcurrency": 2, "isAnon": true, "isOversample": true, "loadEventEnd": 1200, "loadEventStart": 1200, "mediaWikiLoadEnd": 700, "mediaWikiVersion": "1.35.0-wmf.35", "namespaceId": 0, "originCountry": "ES", "oversampleReason": "[\"wiki:cawiki\"]", "pageviewToken": "75c245a891f8ed8dc994", "redirecting": 0, "requestStart": 18446742482415155000, "responseEnd": -1591294395400, "responseStart": -1591294395400, "revId": 23712760, "secureConnectionStart": 18446742482415155000, "unload": 0}, "ip": "x", "recvFrom": "cp3056.esams.wmnet", "revision": 19604530, "schema": "NavigationTiming", "seqId": 164837414, "userAgent": {"browser_family": "Firefox", "browser_major": "78", "browser_minor": "0", "device_family": "Other", "is_bot": false, "is_mediawiki": false, "os_family": "Windows", "os_major": "10", "os_minor": null, "wmf_app_version": "-"}, "uuid": "3115c6f484605e5d8aee2b61e8b135bd", "webHost": "ca.wikipedia.org", "wiki": "cawiki"}

We drop these events during refine, but this is just an FYI in case you want to filter them client-side before they're sent.

Event Timeline

Gilles triaged this task as Medium priority.
Gilles moved this task from Inbox to Doing (old) on the Performance-Team board.

@Nuria when did those last happen? I can't find recent examples in logstash. The validation errors I see in there come from bots or people using ancient browsers (Firefox from 2006 or Chrome from 2011).

These errors are from last week; it might be a total one-off that you can ignore. The ticket was just an FYI.

I can't find that particular error in logstash at that date, though:

https://logstash.wikimedia.org/goto/9dd45df06d68b8f928681dfdc89946e5

Did you pull it from elsewhere?

Logstash would only have errors for records that fail validation. This one does validate; the issues are with the values of the given keys (integers out of range).

So, it's valid as per the schema, but fails to be inserted into Hive because the values are out of range? Where can I find the list of these?

Where can I find the list of these?

There is no list I can point you to; these events are dropped when processed but exist in the raw data. To be clear, there are very few.

How can I inspect the raw data?

Looking at the raw files in HDFS as mentioned in the docs (e.g. /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_NavigationTiming/...), they're not human-readable. Is there a tool to inspect their content?

Since the example you gave is the current beta version of Firefox (78), it might be a genuine bug in that browser that we need to figure out before it becomes the next stable version. And since very few people use Firefox 78, we wouldn't be receiving that many hits (about one valid hit per hour, at a glance).

You can do:

hdfs dfs -text /wmf/data/raw/eventlogging/eventlogging_NavigationTiming/hourly/2020/06/04/18/eventlogging_NavigationTiming.1003.0.8930.777500533.1591293600000 > out2.txt

and later use grep.

I did so for the two files of that hour and I found a couple more odd records for FF.

Thanks. Looking at the last few hours of data, I don't see the huge values recurring, and the Firefox 78 records we receive look fine.

Sometimes browsers send us junk; it's a known phenomenon. I think that trying to plan for all the types of junk we may receive at the client-side level would just bloat the logic for little gain; it's fine for these events to get discarded where they are right now.

@Gilles as an FYI, it re-happened for hour 2020-06-14T08; I forced the refine for NavigationTiming to skip the problematic records :)

@elukey do you mean that requires manual work for you every time this happens?

@Gilles Refine stops when these kinds of events show up, to be on the safe side; it's just a line of bash to make it re-run skipping the problematic fields (we could automate it, but it happens sporadically, so it's better to double-check anyway). I pinged you only to alert you that other records were dropped, in case the assumption was that the event was a one-off. If everything is fine, np :)

Sure, but if this now happens on a weekly basis, it's disruptive. The users with the broken browsers may continue browsing the site and generate those records every time they get sampled by NavigationTiming.

I see that it was Windows Firefox 78 again, this time hitting Commons. The origin country was RU; previously it was ES. Both countries are oversampled for NavigationTiming. I guess we've learned from that that it happens to different users in exactly the same way.

Is there a way to get the unprocessed UA string for a particular record? It might be interesting to narrow it down to a specific subversion of 78.

As an aside, 78 is graduating to stable in 2 weeks. If this is a widespread issue affecting the latest 78, we should see a clear uptick in broken records when that gets released.

Is there a way to get the unprocessed UA string for a particular record?

You can look for this particular record (not at all formatted) in the webrequest table; it will appear as a regular GET request.

Yes, but you have to go back in the pipeline to the eventlogging-client-side topic, so you can use kafkacat. I don't know if there's a way to fetch a specific date range (I thought newer Kafka versions let you do that), but you can always guess and check with -o (offset from the current event) and -c (how many events to grab).

kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging-client-side -o -15800000 -c 6000000 | grep NavigationTiming

I put the output of that in stat1004:/home/milimetric/eventlogging.NavigationTiming.2020-06-14T08.raw and made it readable. There are only 11 records from FF 78 in those 44 minutes, and you can see their full agent strings. To find the one that broke this time, you can look for the out-of-bounds value: 18446742481584390000. It shows up 3 times, all in that same record, as fetchStart/connectStart/connectEnd.

It'd be great if we could establish that it's a FF 78 bug, but at a glance I didn't see abnormal values in the other 11 records. Some manual testing is probably required.

NOTE: let me know when you're done with it so I can erase it (update: I have erased it)

I dunno why clients would send this, but we could avoid it on our side in a few ways:

A. Set minimum and maximum values for this (and other?) numeric fields.
B. Make EventGate always add max and min long validation to integer types, even if the jsonschema field doesn't specify it.
C. Make Refine more resilient here, and truncate out of range values like this to max long 9223372036854775807.

C sounds like it might hide the problem; A would reject these values for this one schema field, and B would reject them more generally. B is probably the best, but perhaps it could also lead to unexpected behavior? I'm not sure; do clients ever need to set values outside of the long range?

I think that's expected behavior. If something is typed as an int, it should reject values larger than the max int. And if clients need to set something bigger, they should use a string, or we can augment JSONSchema with a bigger int type.

Since not all our clients are JS, we need to make sure min/max ranges match up (like for PHP or native apps).

If we get that right, values outside the range will show up as validation errors and go into the error topic, making the kind of investigation above (potential bugs in FF 78) even easier. We'll just need to make sure to monitor that error stream more closely, or set up alarms for weird validation failures.

Since JSONSchema's number and integer types come with no min or max values, we'll have to choose them ourselves. For integer types in Hive, we assume long, since assuming int might lose larger values. So, for integer types we can probably make EventGate augment all schemas it loads to inject max and min long values.

For numeric types, we assume double, since assuming float might lose larger values. We could do the same for numeric types and inject max and min double validation to all schemas.
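
For illustration, a minimal sketch of what that augmentation could look like, assuming EventGate (a Node.js service) recursively walks each schema it loads; the function and traversal here are made up for the example, not EventGate's actual code:

// Hypothetical sketch, not EventGate's implementation: walk a JSONSchema
// and inject bounds for integer and number types that don't specify any.
// Caveat: a JS number only represents integers precisely up to
// Number.MAX_SAFE_INTEGER (2^53 - 1), so enforcing true long bounds
// (2^63 - 1) would need BigInt or string comparison in a real implementation.
function injectNumericBounds( schema ) {
    if ( schema === null || typeof schema !== 'object' ) {
        return schema;
    }
    if ( schema.type === 'integer' ) {
        if ( schema.maximum === undefined ) { schema.maximum = Number.MAX_SAFE_INTEGER; }
        if ( schema.minimum === undefined ) { schema.minimum = -Number.MAX_SAFE_INTEGER; }
    } else if ( schema.type === 'number' ) {
        if ( schema.maximum === undefined ) { schema.maximum = Number.MAX_VALUE; }
        if ( schema.minimum === undefined ) { schema.minimum = -Number.MAX_VALUE; }
    }
    // Recurse into properties, items and any other nested sub-schemas.
    Object.values( schema ).forEach( injectNumericBounds );
    return schema;
}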

B. is probably the best, but perhaps that could also lead to unexpected behavior?

I think B is the best option, but the min/max values should not "match" the clients; rather (I agree with @Ottomata) they should match the persistence/transport layer.

I think this would just make us more opinionated about valid JSON data than JSON is, which is a good thing. Basically, numeric JSON values outside of the normal long or double ranges are not valid JSON.

numeric JSON values outside of the normal long or double ranges are not valid JSON

Hear! Hear!

From our perspective, it would be ideal if these were validation errors that ended up in logstash. That would allow us to track per-UA trends without manual work.

As for which limit to set for numbers in JSON, a lot is valid for UAs: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/MAX_VALUE

There is no integer type in JS. While integer values beyond MAX_SAFE_INTEGER are not guaranteed to be precisely stored, they're valid. Since JSON was born in JS, this is probably why it doesn't enforce any limits on numbers.
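
A quick illustration of that in any JS console:

// Integers beyond Number.MAX_SAFE_INTEGER (2^53 - 1) are valid JS/JSON
// numbers, but they silently lose precision when stored as doubles.
Number.MAX_SAFE_INTEGER;                // 9007199254740991
9007199254740992 === 9007199254740993;  // true: both round to the same double
JSON.parse( '{"x": 18446742482415155000}' ).x; // parses fine, as a lossy double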

Since we do have an integer type in EventLogging, I think it makes sense to enforce a reasonable integer limit for those, with validation errors when clients don't respect it. Same thing for the number type, with a different limit.

@Milimetric you can clear the data, thanks for extracting it for me. I was hoping that the UA string would tell us a more specific version of 78, but it does not. I guess we'll have to find out what happens when it's rolled out.

FYI: a ResourceTiming event just failed to refine, with encodedBodySize = 18446744073709552000. Chrome 84. I do think we need some general way of handling these errors; it seems these APIs are just reporting bad values. I'd call for handling it in one of the ways Andrew proposed (T254606#6227987):

A. Set minimum and maximum values for this (and other?) numeric fields.
B. Make EventGate always add max and min long validation to integer types, even if the jsonschema field doesn't specify it.
C. Make Refine more resilient here, and truncate out of range values like this to max long 9223372036854775807.

I vote for B, validation errors are more visible and they won't mess with statistics/make analysis more difficult the way C would.

@Gilles: this has become a more serious nuisance; there are malformed events once or twice a day. You're probably right about some sort of browser bug. The fix on our end will likely be to automatically add validation, so that large numbers throw errors that instrumentation users can see. But until we do that, is it possible to double-check integer values in NavigationTiming and ResourceTiming? Can I submit a patch to that effect around here (a rough sketch of the idea follows the links):

https://gerrit.wikimedia.org/g/mediawiki/extensions/NavigationTiming/+/40e70cd938a7da306d20e91d788cb43d43717a2d/modules/ext.navigationTiming.js#649
https://gerrit.wikimedia.org/g/mediawiki/extensions/NavigationTiming/+/40e70cd938a7da306d20e91d788cb43d43717a2d/modules/ext.navigationTiming.js#816
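
For illustration, a rough sketch of the kind of double check meant here; the 1-hour cap and helper names are assumptions for the example, not the actual extension code:

// Hypothetical sketch, not the actual ext.navigationTiming.js patch:
// drop numeric fields that can't plausibly be millisecond timings.
var MAX_SANE_MS = 60 * 60 * 1000; // assumed cap: no timing metric should exceed 1 hour

function isSaneTiming( value ) {
    return typeof value === 'number' && isFinite( value ) &&
        value >= 0 && value <= MAX_SANE_MS;
}

function sanitizeEvent( event ) {
    Object.keys( event ).forEach( function ( key ) {
        if ( typeof event[ key ] === 'number' && !isSaneTiming( event[ key ] ) ) {
            delete event[ key ];
        }
    } );
    return event; // then hand the event to mw.eventLog.logEvent() as usual
}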

@Milimetric sure, please do, I'll be happy to review it and get it merged quickly.

This is a legacy metawiki schema, so either it needs to be migrated to EventGate and fixed, or perhaps it can be fixed on metawiki. We could add the max int validation to the schema and bump the schema revision the code uses (not sure where this event is generated). I think that will cause eventlogging-processor to validate with the new version of the schema. We'll need a code deploy to get the change out there.

@Ottomata what's the syntax for adding min/max values to a schema defined on meta?

I'm pretty sure this hasn't changed between drafts of the JSONSchema spec. eventlogging-processor uses Draft-03, which uses a plain maximum (and minimum) keyword.
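
For illustration, adding bounds to a field on the meta schema page would then be a Draft-03-style fragment like this (the field name is from the NavigationTiming schema; the exact bounds shown are only illustrative):

"connectStart": {
    "type": "integer",
    "minimum": 0,
    "maximum": 9007199254740991
}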

Change 618095 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventLogging@master] Allow schemas to set maximum and minimum values

https://gerrit.wikimedia.org/r/618095

Change 618095 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] Allow schemas to set maximum and minimum values

https://gerrit.wikimedia.org/r/618095

Ok @Gilles maximum and minimum can now be added. I edited NavigationTiming and added them for the connectStart and connectEnd fields: https://meta.wikimedia.org/w/index.php?title=Schema%3ANavigationTiming&type=revision&diff=20358564&oldid=19604530

Not sure if there are other fields that need this too, please add if they do. Once done, can you update the Schema revision used by NavigationTiming? This can be done in the extension code, or in MW Config if we want to do it without a backport/train deploy.

I can take care of the extension update, but I'm off all of next week, so can't be around to keep an eye on it when the change gets deployed. If this can't wait for the week of August 24, @Krinkle can probably handle this.

@Gilles the sooner the better, because the refine jobs fail about twice a day due to the integer overflow problem (cc @Krinkle)

I noticed that there was another field (responseStart) that also had this problem, so I went ahead and made a schema edit to add max and min to all integer timing fields:

https://meta.wikimedia.org/w/index.php?title=Schema%3ANavigationTiming&type=revision&diff=20373802&oldid=19604530

This is happening a lot, so I'm going to deploy config today to update the EventLogging schema revision. The code should also be updated so we don't have to have a manual config override. @Gilles can do that when he gets back, thank you!

Change 620711 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] wgEventLoggingSchemas - update bugged schema revisions (group0)

https://gerrit.wikimedia.org/r/620711

Change 620711 merged by jenkins-bot:
[operations/mediawiki-config@master] wgEventLoggingSchemas - update bugged schema revisions (group0)

https://gerrit.wikimedia.org/r/620711

Mentioned in SAL (#wikimedia-operations) [2020-08-17T14:35:05Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - schema revision version bump for erroring schemas - group0 - T254606 (duration: 00m 56s)

Change 620713 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] wgEventLoggingSchemas - update bugged schema revisions - all wikis

https://gerrit.wikimedia.org/r/620713

Change 620713 merged by jenkins-bot:
[operations/mediawiki-config@master] wgEventLoggingSchemas - update bugged schema revisions - all wikis

https://gerrit.wikimedia.org/r/620713

Mentioned in SAL (#wikimedia-operations) [2020-08-17T14:44:09Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - schema revision version bump for erroring schemas - all wikis - T254606 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2020-08-17T14:57:33Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - schema revision version bump for erroring schemas - all wikis (take 2) - T254606 (duration: 00m 53s)


@Ottomata is there a way for us to track a metric of how often these schemas are now hitting their min/max values in Grafana?

Change 622380 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Point to new version of schemas

https://gerrit.wikimedia.org/r/622380

Change 622380 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Point to new version of schemas

https://gerrit.wikimedia.org/r/622380

...is there a way for us to track a metric of how often these schemas are now hitting their min/max values in Grafana?

Yes, these will now be validation errors. You can look at these in:

It looks like there are about 5000 of them right now.

Change 639600 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Remove no longer needed EventLoggingSchemas override

https://gerrit.wikimedia.org/r/639600

Change 639600 merged by Ottomata:
[operations/mediawiki-config@master] Remove no longer needed EventLoggingSchemas override

https://gerrit.wikimedia.org/r/639600

Mentioned in SAL (#wikimedia-operations) [2020-11-24T19:53:33Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings-labs.php: Remove no longer needed EventLoggingSchemas override for NavigationTiming and ResourceTiming - T254606 (duration: 01m 01s)