
Remove extra fields currently sent to Kafka
Open, Needs Triage · Public

Description

As briefly discussed in the Traffic Team, the current Benthos instance is configured to send extra data compared to what VarnishKafka currently sends to Kafka.

These are the fields sent by Benthos that aren't present in the current webrequest stream:

  • HTTP version (e.g. HTTP/1.1)
  • $schema (required by DE), with a "static" value of "/webrequest/1.0.0"
  • meta.id (a uuidv4 generated by Benthos)
  • meta.request_id (a different uuidv4, generated by HAProxy)
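For illustration, the extra fields sit on top of the classic webrequest record roughly like this (all values below are placeholders, not real data):

```python
# Illustrative sketch of the extra fields Benthos currently adds on top
# of the classic webrequest record; all values are placeholders.
extra_fields = {
    "http_version": "HTTP/1.1",        # not present in webrequest today
    "$schema": "/webrequest/1.0.0",    # static value, required by DE
    "meta": {
        "id": "<uuidv4 generated by Benthos>",
        "request_id": "<uuidv4 generated by HAProxy>",
    },
}

print(sorted(extra_fields))
```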

Regarding $schema, I don't think it poses much of a computational problem on the cp hosts.

We certainly don't need two UUIDs (generated by different parts of the processing pipeline): they are expensive to generate under heavy load and could waste bandwidth/space on Kafka.
Maybe the sequence field could serve the same purpose? Or could we generate it later in the pipeline instead of directly on the cp hosts?

As for the HTTP version, I suggest discarding it for now, since it isn't currently used in webrequest; we can add it later if needed.

We can use this ticket to discuss which fields should be kept and which can be safely discarded.

Event Timeline

Change #1013341 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] benthos/haproxy: delete some fields that aren't in curr webrequest

https://gerrit.wikimedia.org/r/1013341

> These are the fields sent by Benthos that aren't present in the current webrequest stream:

FWIW meta and $schema are not part of webrequest, but requirements for EP integration: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas#Required_event_data. Usually they are managed by our node/java producers.

> We certainly don't need two UUIDs (generated by different parts of the processing pipeline): they are expensive to generate under heavy load and could waste bandwidth/space on Kafka.

I need to investigate the historical reason behind both meta.id and meta.request_id, but if performance is a concern I think we can live without meta.id. meta itself is a required field in the webrequest schema, but the payload should still validate with a missing/empty id.

kafka-jumbo and hadoop should be fine (storage-wise), but I do appreciate that at webrequest scale every byte sent over the wire counts (and adds up quickly).
For my own education, do you have any datapoints showing how much CPU overhead uuidv4 generation in Benthos produces?
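As a rough baseline, the raw cost of uuidv4 generation can be micro-benchmarked like this (Python's uuid4, not Benthos' Go implementation, so treat the result as an order-of-magnitude indication only):

```python
import timeit
import uuid

# Micro-benchmark raw uuidv4 generation cost. This measures Python's
# uuid.uuid4, not Benthos' Go implementation, so the number is only an
# order-of-magnitude indication of what the cp hosts would pay per event.
n = 100_000
total_s = timeit.timeit(uuid.uuid4, number=n)
per_call_us = total_s / n * 1e6
print(f"uuid4: {per_call_us:.2f} us/call")
```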

> Maybe the sequence field could serve the same purpose? Or could we generate it later in the pipeline instead of directly on the cp hosts?

Is sequence globally unique? IIRC it was unique per host, but I might be missing something.

In any case, we can extract/generate unique event ids in post-processing.
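For example, a post-processing job could derive a deterministic event id from fields the record already carries (hostname, sequence, timestamp) instead of generating a random uuid4 on the cp hosts; the function and namespace below are hypothetical:

```python
import uuid

# Hypothetical fixed namespace for webrequest event ids (any constant
# UUID works; "webrequest" is just an illustrative name).
WEBREQUEST_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "webrequest")

def derive_event_id(hostname: str, sequence: int, dt: str) -> str:
    """Deterministically derive an event id from fields already present
    in the record. Reprocessing the same record yields the same id,
    which still supports deduplication downstream."""
    return str(uuid.uuid5(WEBREQUEST_NS, f"{hostname}/{sequence}/{dt}"))

a = derive_event_id("cp0001", 42, "2024-03-20T12:00:00Z")
b = derive_event_id("cp0001", 42, "2024-03-20T12:00:00Z")
c = derive_event_id("cp0002", 42, "2024-03-20T12:00:00Z")
assert a == b  # deterministic: same record, same id
assert a != c  # distinct hosts yield distinct ids
```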

Re bandwidth: did you enable message compression in the benthos producer? That should help. Unfortunately our version of kafka does not support zlib, but we generally recommend using snappy (EP producers enable it by default).
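For reference, in a Benthos kafka output this is a single setting (broker address and topic name below are placeholders):

```yaml
output:
  kafka:
    addresses: [ "<broker>:9092" ]  # placeholder broker address
    topic: webrequest_text          # assumed topic name
    compression: snappy             # per the EP recommendation above
```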

> These are the fields sent by Benthos that aren't present in the current webrequest stream:

> FWIW meta and $schema are not part of webrequest, but requirements for EP integration: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas#Required_event_data. Usually they are managed by our node/java producers.

> We certainly don't need two UUIDs (generated by different parts of the processing pipeline): they are expensive to generate under heavy load and could waste bandwidth/space on Kafka.

> I need to investigate the historical reason behind both meta.id and meta.request_id, but if performance is a concern I think we can live without meta.id. meta itself is a required field in the webrequest schema, but the payload should still validate with a missing/empty id.

> kafka-jumbo and hadoop should be fine (storage-wise), but I do appreciate that at webrequest scale every byte sent over the wire counts (and adds up quickly).
> For my own education, do you have any datapoints showing how much CPU overhead uuidv4 generation in Benthos produces?

To get this kind of data we would need to compare metrics before and after stopping UUID generation on the cp hosts. I'll check whether it's possible to gather them.

> Maybe the sequence field could serve the same purpose? Or could we generate it later in the pipeline instead of directly on the cp hosts?

> Is sequence globally unique? IIRC it was unique per host, but I might be missing something.

The request counter (sequence key) is generated in HAProxy as timestamp + request counter since process start. It's not guaranteed to be truly unique, but the chance that two hosts produce the same sequence should be fairly low, IIUC.
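In other words, the sequence alone can collide across hosts, but pairing it with the hostname makes it unique; a minimal sketch:

```python
# Sketch: a per-host counter is not globally unique -- two cp hosts can
# emit the same sequence value -- but pairing it with the hostname is.
# Hostnames below are placeholders.
records = [
    {"hostname": "cp0001", "sequence": 42},
    {"hostname": "cp0002", "sequence": 42},  # same sequence, different host
]

sequences = [r["sequence"] for r in records]
keys = [(r["hostname"], r["sequence"]) for r in records]

assert len(set(sequences)) < len(records)  # collision on sequence alone
assert len(set(keys)) == len(records)      # unique once hostname is included
```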

> In any case, we can extract/generate unique event ids in post-processing.

> Re bandwidth: did you enable message compression in the benthos producer? That should help. Unfortunately our version of kafka does not support zlib, but we generally recommend using snappy (EP producers enable it by default).

Yes, we use snappy as the default compression algorithm (AFAIK, as we do for everything that writes to Kafka).

Change #1013341 merged by Fabfur:

[operations/puppet@production] benthos/haproxy: delete some fields that aren't in curr webrequest

https://gerrit.wikimedia.org/r/1013341

meta.id and meta.request_id

meta.id is used to uniquely identify an event, and it is usually used for deduplication.

meta.request_id should be the same as the X-Request-Id header, and is used for request tracing.

If you already have meta.request_id, I don't think it would hurt to set meta.id = meta.request_id here. This is the frontend request, right? So this is the first event for which request_id would be set. So this should be fine, as long as this is only ever done for frontend webrequests? Thoughts?
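A Benthos mapping sketch of that idea (assuming the payload already carries meta.request_id populated from HAProxy's X-Request-Id):

```yaml
pipeline:
  processors:
    - mapping: |
        root = this
        # Reuse the HAProxy-generated request id as the event id instead
        # of generating a second uuid4 on the cp host.
        root.meta.id = this.meta.request_id
```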

> meta.id and meta.request_id

> meta.id is used to uniquely identify an event, and it is usually used for deduplication.

Do you know who sets these fields in the current webrequest flow? In what we currently "generate" on every host with VarnishKafka, I see only the sequence number as a unique identifier for the message, and it's not a UUID, but I'm probably missing something here...

> meta.request_id should be the same as the X-Request-Id header, and is used for request tracing.

> If you already have meta.request_id, I don't think it would hurt to set meta.id = meta.request_id here. This is the frontend request, right? So this is the first event for which request_id would be set. So this should be fine, as long as this is only ever done for frontend webrequests? Thoughts?

> meta.id

> Do you know who sets these fields in the current webrequest flow?

It isn't set in the current flow; this is a new field for use with Event Platform tooling and tracing. If it is burdensome to set, we can probably omit it, as @gmodena suggested.