
Make webrequest_frontend being ingested using the in-data `dt` field
Closed, Resolved (Public)

Description

The timestamp set in Kafka by HAProxyKafka doesn't match the timestamp in the data's `dt` field. This is a problem because we want Gobblin to use the Kafka timestamp to batch rows into hourly folders.

As discussed below, the problem can't be fixed at the HAProxyKafka level for performance reasons, so we fix it in Gobblin by parsing the data and using its `dt` field (as is currently done for webrequest).
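
As a rough illustration of the Gobblin-side approach, the sketch below parses a JSON record, reads its `dt` field and derives an hourly folder from it. It is written in Python for brevity and is not the Gobblin implementation: the function name `hourly_folder`, the ISO-8601 format assumed for `dt`, and the `year=/month=/day=/hour=` layout are all assumptions made for the example.

```python
import json
from datetime import datetime, timezone


def hourly_folder(raw_record: bytes) -> str:
    """Return a hypothetical hourly folder for one JSON-encoded record."""
    record = json.loads(raw_record)
    # Assumption: `dt` is an ISO-8601 UTC timestamp such as "2025-03-20T16:46:28Z".
    event_time = datetime.strptime(record["dt"], "%Y-%m-%dT%H:%M:%SZ")
    event_time = event_time.replace(tzinfo=timezone.utc)
    # Assumption: folders are laid out as year=/month=/day=/hour=.
    return event_time.strftime("year=%Y/month=%m/day=%d/hour=%H")


if __name__ == "__main__":
    print(hourly_folder(b'{"dt": "2025-03-20T16:59:59Z", "uri_host": "example.org"}'))
```

The point of the sketch is only that the folder is chosen from the record's own `dt` value, so it does not depend on when the message reached Kafka.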

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Explicitly setting TimestampType as LogAppend | repos/sre/haproxykafka!81 | fabfur | ts-type | main
Proposal: set kafka message timestamp to the content of `dt` field | repos/sre/haproxykafka!80 | fabfur | message-timestamp | main

Event Timeline

Maybe I'm misreading the task description but from

If the problem can't be fixed at HAProxyKafka level, we'll fix it in Gobblin by parsing the data and use its dt field (same as currently done in webrequest).

I understand that for webrequest you're already using the dt field instead of the kafka message timestamp?

If that's right, why are you introducing this change in behavior?

You're reading it right. Let me give some context and explanation.
When we started using Kafka for webrequest, the Kafka timestamp feature did not exist, so we used (and still use) a parser to extract the timestamp from the data.
Now that the Kafka timestamp exists, we want to use it as much as we can, to avoid parsing the data just for the timestamp field: the data-movement tool we use doesn't need to parse the JSON for anything else.
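
To make the trade-off concrete, here is a minimal sketch (Python, not the actual tooling) of the two ways to pick an hour bucket. The function names, the `dt` format and the example values are assumptions: one function uses only the message's metadata timestamp (epoch milliseconds), the other has to parse the JSON payload. The example record is deliberately placed near an hour boundary to show how a mismatch between the two timestamps can change the bucket.

```python
import json
from datetime import datetime, timezone

HOUR_FORMAT = "%Y-%m-%dT%H"


def hour_from_kafka_timestamp(timestamp_ms: int) -> str:
    """Hour bucket from the message metadata; the payload is never parsed."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime(HOUR_FORMAT)


def hour_from_payload(payload: bytes) -> str:
    """Hour bucket from the in-data `dt` field; requires a JSON parse."""
    # Assumption: `dt` is an ISO-8601 UTC timestamp ending in "Z".
    dt = json.loads(payload)["dt"]
    moment = datetime.strptime(dt, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return moment.strftime(HOUR_FORMAT)


if __name__ == "__main__":
    # Hypothetical record: request logged at 16:59:59, message stamped at 17:00:01.
    payload = b'{"dt": "2025-03-20T16:59:59Z"}'
    stamped_at = datetime(2025, 3, 20, 17, 0, 1, tzinfo=timezone.utc)
    timestamp_ms = int(stamped_at.timestamp() * 1000)
    print(hour_from_kafka_timestamp(timestamp_ms), hour_from_payload(payload))
```

With these made-up values the metadata timestamp and the `dt` field fall in different hours, which is the kind of disagreement that makes the choice of timestamp source matter.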

I replied to a comment on the GitLab merge request (https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/80#note_131234); I'll repeat it here for consistency.
If getting the timestamp in HAProxyKafka causes a performance issue, let's keep doing it in Gobblin, where the performance gain/loss matters less.

BTW, there was a request to do this for varnishkafka, but it was declined when the plan became to do it from ATS instead:

T166833: Produce webrequests from varnishkafka to Kafka with Kafka message timestamp set to configurable content field

So, we should do it for haproxykafka too (if we can!)

Mentioned in SAL (#wikimedia-operations) [2025-03-20T16:46:28Z] <fabfur> imported haproxykafka 0.3.6 into apt repository (added TimestampType) (T388397)

JAllemandou renamed this task from Fix `webrequest_frontend` kafka timestamp mismatch with in-data `dt` field to Make webrequest_frontend being ingested using the in-data `dt` field. (Mar 20 2025, 5:48 PM)
JAllemandou updated the task description.

Change #1129900 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Update Gobblin webrequest_frontend timestamp

https://gerrit.wikimedia.org/r/1129900

Change #1129900 merged by Joal:

[analytics/refinery@master] Update Gobblin webrequest_frontend timestamp

https://gerrit.wikimedia.org/r/1129900