
The UI timestamp in Kibana should be based on source not Logstash intake (for PHP-FPM and MW)
Open, Needs Triage, Public

Description

There was a minor incident today causing the Kafka-to-Logstash intake to get clogged for a few hours. It confirmed a suspicion I've held for a while: the UI timestamp (used for bar charts and message feed sorting) appears to be based on Logstash intake time, not on when the message was dispatched from the source application.

For a while I assumed this was because the source timestamp is hard to obtain, or at least something we don't capture right now. However, I now understand that we do have it:

[Screenshot 2020-06-16 at 00.00.35.png]

It seems @timestamp is the field used for time slice queries, UI charts, sorting, etc., while timestamp (no @) is an unused internal field for debugging purposes, indicating when the message was dispatched in or to rsyslog, I think?

A similar issue exists with the Prometheus metrics we have for Logstash. However, once understood, I think those are much easier to work with as-is. It also makes logical sense that you can't (and shouldn't) backdate or mutate such metrics; they are simply about the intake.

For Kibana and developer UX, though, this intake information is imho the internal/debugging one, not the other way around. Could we flip this around for PHP/MW? Or perhaps for everything, if people don't mind?

Alternatively, if we want to ensure a consistent meaning for the non-@ internal timestamp as the original application dispatch time, perhaps we could make the two the same for PHP/MW, and, if we want to keep the Logstash intake time, store that in a separate field.
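
A minimal sketch of what that could look like as a Logstash filter, assuming the source-supplied field is named `timestamp` and using a hypothetical `logstash.intake_timestamp` field to keep the intake time; the actual field names and date formats would need to match our pipeline:

```
filter {
  # Hypothetical: preserve the value @timestamp holds on arrival
  # (the Logstash intake time) in a separate field before overwriting it.
  mutate {
    copy => { "@timestamp" => "[logstash][intake_timestamp]" }
  }
  # Backdate @timestamp to the source-supplied dispatch time, if it parses.
  date {
    match  => [ "timestamp", "ISO8601", "UNIX", "UNIX_MS" ]
    target => "@timestamp"
  }
}
```

If the source timestamp is missing or unparseable, @timestamp simply keeps the intake time, so malformed events would behave no worse than they do today.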

Event Timeline

Another example of where this leads to confusion:

[Screenshot 2020-06-16 at 00.00.37.png]

Based on this graph, I would suspect the error to be caused by some kind of automated job, given its regular interval and consistent-height spikes. In actuality, the spikes just reflect the processing cadence into Logstash, which makes the graph useless even after waiting out the intake delay.

It also wrongly gives the impression that data from 5 minutes ago is available, when in fact it is not (yet).

This sounds a lot like something we identified during the audit phase of T234565: a number of fields are created (and ultimately passed through transparently) that carry essentially the same data under different keys. IIRC, we want to consolidate on the source's timestamp and only provide our own if one is not available.
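
For what it's worth, a sketch of that consolidation rule as a Logstash conditional, again with the assumption that the source field is named `timestamp`; the real field names may differ per input:

```
filter {
  if [timestamp] {
    # Source supplied a timestamp: make it authoritative.
    date {
      match  => [ "timestamp", "ISO8601", "UNIX", "UNIX_MS" ]
      target => "@timestamp"
    }
  }
  # Otherwise @timestamp keeps Logstash's own intake time as the fallback.
}
```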