
Invalid EventLogging messages for NavigationTiming topic
Open, Needs TriagePublic

Description

This is currently the top offender, with lots of errors like:

-1 is less than the minimum of 0
-2 is less than the minimum of 0
...

I haven't been able to eyeball any patterns or find useful trace information. The error is coming from the fetchStart field.
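For reference, the failure mode is easy to reproduce with a minimal validation sketch (pure Python; the field name comes from the error above, but the exact schema wording is an assumption):

```python
# Minimal sketch of a minimum-bound check like the one EventLogging
# applies; the error string mirrors the Logstash entries above.
def validate_timing(event, fields=("fetchStart",), minimum=0):
    """Return validation error strings for out-of-range timing fields."""
    errors = []
    for field in fields:
        value = event.get(field)
        if value is not None and value < minimum:
            errors.append("%d is less than the minimum of %d" % (value, minimum))
    return errors

print(validate_timing({"fetchStart": -1}))   # ['-1 is less than the minimum of 0']
print(validate_timing({"fetchStart": 120}))  # []
```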

Event Timeline

Milimetric added subscribers: Gilles, Milimetric.

ping @Gilles, you were asking about this last week, here are some more examples of validation errors.

This is actually good news. This is a consequence of a fix for T254606: Invalid navigation timing events. Browsers are nasty and send bad data sometimes I guess ¯\_(ツ)_/¯

Oh yes, we've known for years that browsers send junk values for these APIs regularly. There's not much we can do about it, really. The novelty is that it's now being filtered out by EventLogging and surfaced in Logstash, but these values were previously filtered by the navtiming daemon.

The only time we've been able to establish a pattern, we reported it upstream: https://bugs.webkit.org/show_bug.cgi?id=186919

I don't think there's much interest among the browser vendors to fix those edge cases for values that don't respect the Navigation Timing spec, probably because it's also difficult for them to find the common cause.

For these -1, -2 ms values, I suspect they might come from the fuzzing browsers apply to these high-resolution timestamps in order to mitigate timing attacks (e.g. Meltdown/Spectre).
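To illustrate the suspicion (a hypothetical model, not how any particular browser implements it): if each absolute timestamp is jittered independently before the relative timing is computed, two nearly simultaneous marks can come out in the wrong order, yielding a small negative delta:

```python
import random

# Hypothetical model: the browser adds independent noise to each
# high-resolution timestamp to blunt timing attacks.
def fuzz(ts_ms, jitter=2.0):
    return ts_ms + random.uniform(-jitter, jitter)

# Made-up, nearly simultaneous absolute timestamps (ms).
navigation_start, fetch_start = 1000.0, 1000.5

# The reported fetchStart is relative: fuzzed fetch minus fuzzed navigationStart.
negatives = sum(
    1 for _ in range(1000)
    if fuzz(fetch_start) - fuzz(navigation_start) < 0
)
print(negatives)  # a sizeable fraction of the 1000 trials come out negative
```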

Sounds reasonable—so the fix is to relax validation, correct?

No, because those invalid integers were causing processing issues further down the EventLogging pipeline, killing a service. Things are now filtered appropriately (hence these errors) at the ingestion point, which is where the Analytics team prefers the filtering to happen.

The current situation is how things are supposed to be. While the errors are noisy, it's important to have a way to track counts, as a spike might be indicative of a bug, either in our client-side code or in a new browser version.

> invalid integers were causing processing issues further in the EventLogging pipeline, killing a service

This is a separate problem: the service should be more robust, logging warnings for bad input records but not dying.

> Things are now filtered appropriately (hence those errors) at the ingestion point where the Analytics team prefers to have it filtered.

Thanks for the confirmation! Maybe we should split the dashboards as is done for fatals, one report for the known EventLogging errors that we can't or won't fix, and another for unknown errors. That would be a separate task anyway, please feel free to close this one.

> This is a separate problem: the service should be more robust, logging warnings for bad input records but not dying.

The ingestion pipeline wasn't dying because of a -1 or a -2; it was failing because browsers were sending values larger than 64-bit integers. And strictly speaking the pipeline wasn't dying: it was failing to ingest the hour in which the bad data appeared. We had to manually rerun that hour with more lenient constraints. We could make those lenient constraints the default, but then we'd have no indication of when we are losing records to malformed data.

NavigationTiming was changed so that the minimum allowed integer value for these timing fields is 0. Previously a -1 or a -2 would have been ingested into Hive just fine. If we want to accept negative timing values to avoid the validation errors, the minimum could be changed. The fix in T254606 was mostly about making sure we didn't receive values that were too large to store in a Long or a Double.
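Put together, the bounds described above amount to something like this (a sketch; the function name and exact limits are assumptions, with Hive's BIGINT being a signed 64-bit Long):

```python
LONG_MAX = 2**63 - 1  # largest value a Hive BIGINT / Java Long can hold

def ingestible(value, minimum=0, maximum=LONG_MAX):
    """Sketch of the post-T254606 bounds: non-negative and Long-sized."""
    return isinstance(value, int) and minimum <= value <= maximum

print(ingestible(250))              # True: an ordinary fetchStart
print(ingestible(-1))               # False: rejected by the new minimum of 0
print(ingestible(2**80))            # False: the overflow case the fix targeted
print(ingestible(-1, minimum=-10))  # True: what relaxing the minimum would allow
```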

I'd rather have all filtering of out-of-bounds values happen in the same place. Neither huge nor negative values are allowed by the browser specs.