
Move analytics log from Varnish to HAProxy
Open, In Progress, Needs Triage, Public

Description

Now that we have moved all traffic to HAProxy, we should replace the current pipeline

  • Varnish logs -> Varnishkafka -> Kafka

to

  • HAProxy -> RSyslog -> Kafka

The format and content of the messages sent to Kafka should remain the same, to avoid breaking existing pipelines and tools.

Roughly, the actions needed are:

  • Investigate whether the log format can be transformed to match the VarnishKafka one (check out the VarnishKafka configuration file for more information)
  • Configure RSyslog to send the data initially to a different Kafka topic(s)
  • Compare the content of the existing topic(s) to the new one(s), to spot any differences
  • Finalize configuring RSyslog to send data to the existing Kafka topic(s) and remove VarnishKafka

To test the actual feasibility we could use the deployment-prep environment, as there's already a kafka cluster there, and varnishkafka is configured on the cache hosts (NB. investigate why the deployment-prep jumbo cluster isn't receiving any messages from varnishkafka).

Another viable option (thanks to @brouberol ) could be to use the kafka-test cluster in production, configuring one production cp host to send from rsyslog to this cluster on a disposable topic, and comparing with the actual messages sent by VarnishKafka.
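For illustration, a rough Python sketch (not part of the proposed pipeline) of the kind of comparison meant above: given two JSON-lines dumps captured from the old and new topics (the file names are hypothetical placeholders), report field and type differences.

# Illustrative only: compare two JSON-lines dumps (e.g. captured with kafkacat
# from the existing topic and the disposable topic into local files).
import json
from collections import Counter

def load(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

old_msgs = load("webrequest_varnishkafka.sample.jsonl")  # existing topic sample
new_msgs = load("webrequest_rsyslog.sample.jsonl")       # disposable topic sample

old_keys = Counter(k for m in old_msgs for k in m)
new_keys = Counter(k for m in new_msgs for k in m)
print("fields only in the existing topic:", sorted(set(old_keys) - set(new_keys)))
print("fields only in the new topic:     ", sorted(set(new_keys) - set(old_keys)))

# Spot-check value types for shared fields, since type changes
# (e.g. string vs int) can break downstream consumers.
for key in sorted(set(old_keys) & set(new_keys)):
    old_types = {type(m[key]).__name__ for m in old_msgs if key in m}
    new_types = {type(m[key]).__name__ for m in new_msgs if key in m}
    if old_types != new_types:
        print(f"type mismatch for {key}: old={old_types} new={new_types}")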

Related Objects

Event Timeline


Hi @Milimetric sorry for the late reply, I'll try to answer your questions, but bear in mind that we're still investigating all the pros and cons of this "migration", and we'll certainly share our thoughts and our action plan before moving on with this...

no worries at all, do ping me (IRC or slack) if you want to talk more directly

  • We'll try to be consistent with timestamps in HAProxy, we have both the option to send the request timestamp (as logged by HAProxy) or the *captured* log timestamp (e.g. the one set by benthos). I wasn't aware of the issue with incomplete records for Varnish, but we can definitely work around this.

I believe Varnish could also send the start timestamp, and it has that consistently. The reason we get records without timestamps is because we get the response timestamp (see config, but I believe it uses this).

  • The sequence number is also set by HAProxy (and eventually benthos if we need another one).

Oh really? I thought it was varnish that controls that, so I'd need to learn more here. Like, we have a bunch of data quality code that checks for false positives based on what we know about varnish restarts and how those affect sequence numbers. When this work goes forward, it'd be good to zoom in on these sequence numbers.

SRE has been working on a nice design doc.

Of the solutions proposed there, I prefer the benthos one, if it works. Benthos is a really nice tool. I think it would be worth investing in it for use cases like this.

Hey all, whatever the chosen solution for producing logs, we need a nice migration plan. I just talked with @Ahoelzl, and we realized that it would probably be easier to produce to a new stream / topics in Kafka, than to try to migrate by producing to the existing topics (webrequest_source + webrequest_text) and handle the migration piecemeal.

So, I propose that we do the following

  • T314956: [Event Platform] Declare webrequest as an Event Platform stream, but with the new stream using new topics. Your haproxy log producer will produce to these topics, instead of the existing ones.
    • This will create new Hive tables in Hadoop.
  • DPE teams analyze differences in old webrequest tables and new ones. If all looks good, we can do the switchover.
  • DPE & SRE teams switch data pipelines to use new webrequest tables.
    • This includes in Hadoop as well as any other consumers, e.g. SRE sampled logs, perhaps FRTech has some consumers of webrequest still?
  • Once all pipelines are switched, SRE decommissions varnishkafka.
  • After 90 days, all webrequest data in Hadoop is purged. After this time we can remove old webrequest topics, tables and pipelines.

Whatchall think? cc @Milimetric @gmodena @JAllemandou @Antoine_Quhen.

I'm going to work on T314956 now, before I go on sabbatical + parent leave in January. (I'll be out Jan 8 - April 12 2024.) I hope to get the new stream/topics/Hive tables in place for you to produce to before I go.

Change 983898 had a related patch set uploaded (by Ottomata; author: Ottomata):

[schemas/event/primary@master] WIP - Add webrequest schema

https://gerrit.wikimedia.org/r/983898

Change 983905 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] WIP - add webrequest.frontend stream

https://gerrit.wikimedia.org/r/983905

Change 983926 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] WIP - Add gobblin job webrequest_frontend to pull new webrequest stream

https://gerrit.wikimedia.org/r/983926

To do this migration plan ^, we'd need Kafka jumbo to support 2x webrequest volume while we migrate. Let's check with Data Platform SREs, @brouberol @BTullis ? Whatcha think?

To do this migration plan ^, we'd need Kafka jumbo to support 2x webrequest volume while we migrate. Let's check with Data Platform SREs, @brouberol @BTullis ? Whatcha think?

I think that this would be OK.

I've checked a few resources on each broker to see what the current headroom looks like and they all seem like they could handle a significant jump from doubling up the webrequest topics.

We would certainly want to watch it keenly to see if anything needed tweaking after the producers come online, but I'm not sure that there is anything I would modify in advance of the change.

That said, I believe that @brouberol has been working on a tool that can analyse a kafka config and perhaps suggest optimisations. This might be a good time to try it out, unless I've got my wires crossed.

The cluster has enough capacity to accommodate 2x webrequest volume, so 👍 on my end. When we stop writing to the original webrequest_text and webrequest_uploads topics, I might have to perform a small rebalancing afterwards, if I see that the cluster isn't as nicely balanced as it is right now, but this is mainly to keep my OCDs in check.

Do you have an estimate of the duration for which we'd be dual-writing?

@BTullis I have indeed been working on https://github.com/brouberol/kafkacfg, but this is mostly for broker-level configuration. While I believe we could optimize some settings, IMHO we should do this when the workload is stable.

SRE has been working on a nice design doc.

Of the solutions proposed there, I prefer the benthos one, if it works. Benthos is a really nice tool. I think it would be worth investing in it for use cases like this.

I'm also keen on the benthos approach, although I have a couple of questions. I'll follow up with those in the doc.

Do you have an estimate of the duration for which we'd be dual-writing?

I think we hope to get this done in Q3. But YMMV ¯\_(ツ)_/¯

Hm, @Fabfur does the same cache host that serves the haproxy request serve the varnish request? I.e. does haproxy only route to localhost? If so, we can probably do a bulk of the comparison analysis by enabling the haproxy producer on only a few cache hosts, and then compare the varnishkafka vs haproxy logs generated by that host.

Do you have an estimate of the duration for which we'd be dual-writing?

I think we hope to get this done in Q3. But YMMV ¯\_(ツ)_/¯

Hm, @Fabfur does the same cache host that serves the haproxy request serve the varnish request? I.e. does haproxy only route to localhost? If so, we can probably do a bulk of the comparison analysis by enabling the haproxy producer on only a few cache hosts, and then compare the varnishkafka vs haproxy logs generated by that host.

Yes exactly, HAProxy and Varnish (and ATS) are co-located on the same host and each HAProxy instance routes only to the "local" varnish instance via unix socket.

  • The sequence number is also set by HAProxy (and eventually benthos if we need another one).

Oh really? I thought it was varnish that controls that, so I'd need to learn more here. Like, we have a bunch of data quality code that checks for false positives based on what we know about varnish restarts and how those affect sequence numbers. When this work goes forward, it'd be good to zoom in on these sequence numbers.

@Fabfur could you please point me to the HAProxy (and/or Benthos) documentation about the mechanism that we intend to use to generate these sequence numbers? From our side (Data Products) we'd like to start studying them to see whether we need to modify our data quality checks or not.

  • The sequence number is also set by HAProxy (and eventually benthos if we need another one).

Oh really? I thought it was varnish that controls that, so I'd need to learn more here. Like, we have a bunch of data quality code that checks for false positives based on what we know about varnish restarts and how those affect sequence numbers. When this work goes forward, it'd be good to zoom in on these sequence numbers.

@Fabfur could you please point me to the HAProxy (and/or Benthos) documentation about the mechanism that we intend to use to generate these sequence numbers? From our side (Data Products) we'd like to start studying them to see whether we need to modify our data quality checks or not.

Hi, we can generate unique IDs in HAProxy by combining all the variables we have on the request/response side (headers, IP address, and such). Not exactly a sequence number, but still a unique identifier. On the Benthos side there is a special method for that (e.g. a simple mapping like root.sequence = count("sequence") in the benthos configuration creates increasing entries in the json log like sequence: 1234).

Otherwise, as Varnish generates sequences (the ones we're using right now, I think), we can pass those to HAProxy and from there to Benthos.

To add some information about having sequences in logs, I'd like to elaborate a bit more on our current options:

  1. Let Varnish generate the sequence number (as we do now), catch it (with some custom header that will be deleted afterwards) in HAProxy and log it.
    • This IIUC keeps the same issues we currently have with Varnish sequences: they are reset every time varnish-frontend is restarted.
    • This also requires a small change in the Varnish VCL code to export the unique ID as a response header.
  2. The sequence number can be generated by Benthos: the expression root.sequence = count("sequence") (https://www.benthos.dev/docs/guides/bloblang/functions/#counter) produces a monotonic increase in the outputted json.
    • This approach suffers from the same issues as the previous one, since it relies on Benthos never being restarted.
  3. Generate a unique-id in the HAProxy configuration and include it in logs. This feature is well documented (https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4-unique-id-format). An example for this could be:
[...]
unique-id-format "%{+X}o %ci:%cp_%fi:%fp_%Ts_%rt:%pid"
log-format "%ci:%cp %ID [%tr] ..."
[...]

Produces a log like (truncated):

127.0.0.1:57628 7F000001:E11C_7F000001:01BB_65A69170_0000:B91ED [16/Jan/2024:14:23:44.137] ...

The 7F000001:E11C_7F000001:01BB_65A69170_0000:B91ED string is unique for each request and is a hexadecimal representation of client ip:client port_frontend ip:frontend port_timestamp_request counter_haproxy process id. If the hexadecimal format is not easy to use as a sequence number, this can be converted to the "regular" format like: 127.0.0.1:57630_127.0.0.1:443_1705415321_0:885939. In this case, swapping the haproxy process id with the request counter could be a (naive?) solution to have a unique, increasing request counter.

Note that the request-id can be included also as response header to the client, if needed.
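For illustration, a small Python sketch (assuming the example unique-id-format above and IPv4 addresses) that decodes the hex unique-id into the "regular" representation:

# Illustrative sketch: decode the hex unique-id from the example
# unique-id-format "%{+X}o %ci:%cp_%fi:%fp_%Ts_%rt:%pid" above.
import ipaddress

def decode_unique_id(uid: str) -> str:
    client, frontend, ts_hex, rt_pid = uid.split("_")
    cip, cport = client.split(":")
    fip, fport = frontend.split(":")
    rt, pid = rt_pid.split(":")
    return "_".join([
        f"{ipaddress.ip_address(int(cip, 16))}:{int(cport, 16)}",  # client ip:port
        f"{ipaddress.ip_address(int(fip, 16))}:{int(fport, 16)}",  # frontend ip:port
        str(int(ts_hex, 16)),                                      # request timestamp (unix seconds)
        f"{int(rt, 16)}:{int(pid, 16)}",                           # request counter : haproxy pid
    ])

print(decode_unique_id("7F000001:E11C_7F000001:01BB_65A69170_0000:B91ED"))
# -> 127.0.0.1:57628_127.0.0.1:443_1705415024_0:758253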

Naive question for @Fabfur: could Varnish generate a request ID that is a UUID v7? It's a unique, ever increasing UUID, that is also pretty human readable.

I must point out that the use case we have today that utilizes the Varnish sequence number does not require uniqueness, but it does require the number to always increase by 1 for every new request a particular host serves. We use these numbers to check whether we are missing requests in the data lake. If interested, you can see the SQL manipulations we do to infer data loss here. The SQL does take into account that particular hosts could restart at any time.

Thus, given the 3 choices from T351117#9462450, I would prefer we stick with option (1), given that there are no additional benefits to (2) (and as per the linked documentation that feature is experimental), and (3) would not comply with our current use case.
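For reference, a rough Python sketch of the idea behind that data-loss check; the production check is the linked SQL, which additionally handles host restarts (sequence resets). The sample records here are made up.

# Per host, compare how many requests were actually received against how many
# the sequence numbers say there should be.
from collections import defaultdict

records = [
    ("cp3067.esams.wmnet", 10558502960),
    ("cp3067.esams.wmnet", 10558502961),
    ("cp3067.esams.wmnet", 10558502963),  # 10558502962 never arrived -> 1 lost request
]

per_host = defaultdict(list)
for host, seq in records:
    per_host[host].append(seq)

for host, seqs in per_host.items():
    expected = max(seqs) - min(seqs) + 1
    actual = len(seqs)
    lost = expected - actual
    print(f"{host}: expected={expected} actual={actual} lost={lost} "
          f"({100.0 * lost / expected:.2f}%)")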

Some updates about the ongoing work:

Currently our Benthos configuration produces this output, when fed with HAProxy logs:

Example request:

curl -H 'Accept-Language: en' -H 'Referer: test.test' -H 'Range: -' 'https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page?foo=bar'

HAProxy log line (ignore the part before newlog: this is needed by mtail and it's entirely dropped by Benthos):

5 387 0 0 200 {en.wikipedia.beta.wmflabs.org} {miss} -- newlog 127.0.0.1:57954 7F000001:E262_7F000001:01BB_65C3885F_0005:2B60B5 [07/Feb/2024:13:40:47.567] tls~ tls/backend_server 0/0/0/387/387 200 72108 - - ---- 1/1/0/0/0 0/0 {en.wikipedia.beta.wmflabs.org|test.test|curl/7.74.0|en|-|*/*|vers=TLSv1.3;keyx=unknown;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new} {miss|text/html;charset=UTF-8ns=0;page_id=1;rev_id=587739;proxy=OperaMini;https=1;client_port=57954;nocookies=1|traffic-cache-bullseye miss, traffic-cache-bullseye miss|deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud} GET /wiki/Main_Page ?foo=bar HTTP/2.0 en.wikipedia.beta.wmflabs.org/TLSv1.3/TLS_AES_256_GCM_SHA384

Benthos produced json:

{
  "accept": "*/*",
  "accept_date": "07/Feb/2024:12:04:41.510",
  "accept_language": "en",
  "actconn": "1",
  "backend_name": "tls",
  "backend_queue": "0",
  "beconn": "0",
  "bytes_read": "72068",
  "cache_status": "hit-local",
  "captured_request_cookie": "-",
  "captured_response_cookie": "-",
  "client_ip": "127.0.0.1",
  "client_port": "57952",
  "content_type": "text/html; charset=UTF-8",
  "dt": "2024-02-07T12:04:41.512972244Z",
  "feconn": "1",
  "frontend_name": "tls~",
  "haproxy_hour": "12",
  "haproxy_milliseconds": 0,
  "haproxy_minute": "04",
  "haproxy_month": "Feb",
  "haproxy_monthday": "07",
  "haproxy_second": "41.5",
  "haproxy_time": "12:04:41.5",
  "haproxy_uniq_id": "7F000001:E260_7F000001:01BB_65C371D9_0004:2B60B5",
  "haproxy_year": "2024",
  "hostname": "traffic-cache-bullseye",
  "http_method": "GET",
  "http_status_code": "200",
  "http_version": "2.0",
  "priority": 134,
  "range": "-",
  "referer": "test.test",
  "retries": "0",
  "server": "ATS/9.1.4",
  "server_name": "backend_server",
  "severity": 6,
  "srv_queue": "0",
  "srvconn": "0",
  "termination_state": "----",
  "time_backend_connect": "0",
  "time_backend_response": "2",
  "time_duration": "2",
  "time_queue": "0",
  "time_request": "0",
  "timestamp": "2024-02-07T12:04:41Z",
  "tls": "vers=TLSv1.3;keyx=unknown;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new",
  "uri_host": "en.wikipedia.beta.wmflabs.org",
  "uri_path": "/wiki/Main_Page",
  "uri_query": "?foo=bar",
  "user_agent": "curl/7.74.0",
  "x_analytics": "ns=0;page_id=1;rev_id=587739;proxy=OperaMini;https=1;client_port=57952;nocookies=1",
  "x_cache": "traffic-cache-bullseye hit, traffic-cache-bullseye hit/2"
}

As you can see there are a lot of haproxy-specific fields that can be easily dropped.

For comparison this is a sample webrequest message (captured with kafkacat):

{
  "accept": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.2.0\"",
  "accept_language": "en",
  "backend": "ATS/9.1.4",
  "cache_status": "hit-front",
  "content_type": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.5.0\"",
  "dt": "2023-11-23T16:04:17Z",
  "hostname": "cp3067.esams.wmnet",
  "http_method": "GET",
  "http_status": "200",
  "ip": "<REDACTED>",
  "range": "-",
  "referer": "https://en.wikipedia.org/w/index.php?title=Category:Films_based_on_non-fiction_books&pagefrom=Power+Play+%281978+film%29%0APower+Play+%281978+film%29",
  "response_size": 987,
  "sequence": 10558502962,
  "time_firstbyte": 0.000201,
  "tls": "vers=TLSv1.3;keyx=UNKNOWN;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new",
  "uri_host": "en.wikipedia.org",
  "uri_path": "/api/rest_v1/page/summary/Secretariat_(film)",
  "uri_query": "",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0",
  "x_analytics": "WMF-Last-Access=23-Nov-2023;WMF-Last-Access-Global=23-Nov-2023;include_pv=0;https=1;client_port=33126",
  "x_cache": "cp3067 miss, cp3067 hit/5"
}

Some notes:

  • IIUC, what's called in Varnish time_firstbyte can be compared to the time_backend_response in Benthos output (a rename is possible if needed). From the Varnish documentation this is:
Varnish:time_firstbyte

    Time from when the request processing starts until the first byte is sent to the client, in seconds. For backend mode: Time from the request was sent to the backend to the entire header had been received.

While in HAProxy this is roughly equivalent to the %Tr control point:

- Tr: server response time (HTTP mode only). It's the time elapsed between
  the moment the TCP connection was established to the server and the moment
  the server sent its complete response headers. It purely shows its request
  processing time, without the network overhead due to the data transmission.
  It is worth noting that when the client has data to send to the server, for
  instance during a POST request, the time already runs, and this can distort
  apparent response time. For this reason, it's generally wise not to trust
  too much this field for POST requests initiated from clients behind an
  untrusted network. A value of "-1" here means that the last the response
  header (empty line) was never seen, most likely because the server timeout
  stroke before the server managed to process the request.

(more on that on https://docs.haproxy.org/2.6/configuration.html#8.4)

  • The hostname is not a FQDN. It seems that Benthos doesn't have this option (yet), but we can patch that, considering that we'll be using a Puppet template to generate the Benthos configuration.
  • The backend key is different: in Varnish this indicates the Server header, which in Benthos is mapped to the server key (this can be changed too, if needed)
  • The dt format is slightly different, we could consider truncating or adapting, if needed.
  • http_status in Varnish is equivalent to http_status_code in Benthos. Can be changed if needed.
  • ip in Varnish is equivalent to client_ip in Benthos. Can be changed if needed
  • response_size in Varnish is equivalent to bytes_read in Benthos. From the documentation of the %B option:
- "bytes_read" is the total number of bytes transmitted to the client when
    the log is emitted. This does include HTTP headers.
  • As for the sequence field: this indicates an ever-increasing sequence number in Varnish. We already discussed a bit about this in previous comments
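As a rough sketch of the renames discussed in the notes above (the real mapping lives in the Benthos configuration; the drop list is an illustrative subset and type coercions, e.g. response_size string to int, are left out):

RENAMES = {
    "server": "backend",              # the Server header
    "http_status_code": "http_status",
    "client_ip": "ip",
    "bytes_read": "response_size",
}

# HAProxy/Benthos internals with no counterpart in the varnishkafka schema
# (illustrative subset; the haproxy_* helper fields are dropped below as well).
DROP = {
    "accept_date", "actconn", "backend_name", "backend_queue", "beconn",
    "captured_request_cookie", "captured_response_cookie", "client_port",
    "feconn", "frontend_name", "retries", "server_name", "srv_queue",
    "srvconn", "termination_state", "time_backend_connect", "time_duration",
    "time_queue", "time_request",
}

def to_webrequest(benthos_msg: dict) -> dict:
    out = {}
    for key, value in benthos_msg.items():
        if key in DROP or key.startswith("haproxy_") or key == "time_backend_response":
            continue
        out[RENAMES.get(key, key)] = value
    # varnishkafka's time_firstbyte is in seconds, while HAProxy's %Tr
    # (time_backend_response) is in milliseconds, hence the conversion.
    if "time_backend_response" in benthos_msg:
        out["time_firstbyte"] = int(benthos_msg["time_backend_response"]) / 1000.0
    return out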

Some updates about the ongoing work:

Hey @Fabfur,

thanks for this!
Below is a summary of our chat earlier today. @Fabfur holler if I got something wrong :).

Currently our Benthos configuration produces this output, when fed with HAProxy logs:

[...]

This sample is super useful. The next step (for the EP side of things) is to map the json payload to an Event Platform schema. This would help stream management and tooling integration.

This means:

  • declaring a jsonschema for webrequests (wip in patch 983905). This is only for tooling compat. webrequest will not be routed through EventGate.
  • adding a meta field to the benthos/haproxy produced payload. It should set (at least) meta.stream to the stream name, we need to iron out details for other fields (meta.dt, meta.id). We can iterate on this async.
  • dt and all timestamps should be produced into kafka as ISO-8601 UTC datetime strings, e.g. '2020-07-01T00:00:00Z'.
  • @Fabfur to check internally which fields can be dropped from the current payload, @gmodena to do the same.
  • IIUC, what's called in Varnish time_firstbyte can be compared to the time_backend_response in Benthos output (a rename is possible if needed). From the Varnish documentation this is:

Re times: if I understand correctly, we might expect some shift (latency) if the timestamp is recorded by haproxy instead of varnish.
From my point of view, as long as we are consistent with populating fields (e.g. we don't mix benthos/haproxy times in the same field), we should be good. We use timestamps to partition data when ingesting into the datalake.

wrt how to store timestamps. In EP schema the convention is to use:

  • meta.dt -> event process time (set by benthos).
  • dt (top level field) -> event emission time (request timestamp recorded by haproxy).
  • The hostname is not a FQDN. It seems that Benthos doesn't have this option (yet), but we can patch that, considering that we'll be using a Puppet template to generate the Benthos configuration.
  • The backend key is different: in Varnish this indicates the Server header, which in Benthos is mapped to the server key (this can be changed too, if needed)
  • The dt format is slightly different, we could consider truncating or adapting, if needed.
  • http_status in Varnish is equivalent to http_status_code in Benthos. Can be changed if needed.
  • ip in Varnish is equivalent to client_ip in Benthos. Can be changed if needed
  • response_size in Varnish is equivalent to bytes_read in Benthos. From the documentation of the %B option:

Need to validate the impact of these changes internally. The only thing that I'd like to address early on in prototyping is the dt format.

  • As for the sequence field: this indicates an ever-increasing sequence number in Varnish. We already discussed a bit about this in previous comments

Ack. If I understand correctly, you suggested it would be possible to map the haproxy unique-id to a monotonically increasing (by 1) integer field that should keep this property across restarts. Sounds good to me, but pinging @xcollazo. Need to think about whether overflowing an unsigned int could be an issue.

Some updates:

  • The backend, dt, http_status, ip and response_size keys are now aligned with the current schema (regarding format and naming).
  • The hostname thing will be fixed with the puppetization
  • Meta keys: Benthos supports metadata for Kafka (https://www.benthos.dev/docs/configuration/metadata/), I think this is the way we should proceed for meta.dt, meta.domain and meta.stream. We should definitely check with a Kafka broker what's produced.
  • The sequence field now is determined as timestamp + request counter. Even if HAProxy restarts, the timestamp section would be ever increasing so this should be ok.

I'll post an updated version of the schema (Benthos output) as soon as I've worked out which fields should be dropped (also trying to find a way to do it elegantly in the configuration file).

The sequence field now is determined as timestamp + request counter. Even if HAProxy restarts, the timestamp section would be ever increasing so this should be ok.

Timestamp of the last restart? Or timestamp of the request? If the former, that could work; if the latter, we'd lose the monotonic increase-by-1 property, so we could not use that field anymore.

@xcollazo Unfortunately the timestamp refers to the request timestamp. I can check whether I can somehow use the last restart timestamp instead...
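A tiny sketch of why a request-timestamp-based sequence is unique but not increase-by-1. Judging from the sample above, the sequence 17074134866 appears to be the unix request timestamp (1707413486, i.e. 2024-02-08T17:31:26Z) concatenated with the request counter (6); that interpretation is an assumption.

# Sketch (assumption: sequence = str(unix request timestamp) + str(counter)).
import time

def make_sequence(unix_ts: int, counter: int) -> int:
    return int(f"{unix_ts}{counter}")

prev = None
for counter in range(1, 4):
    seq = make_sequence(int(time.time()), counter)
    if prev is not None and seq != prev + 1:
        # A gap larger than 1 is indistinguishable from lost requests for the
        # existing data-loss check, which expects consecutive numbers per host.
        print(f"not increase-by-1: {prev} -> {seq}")
    prev = seq
    time.sleep(1)  # cross a second boundary so the timestamp part changes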

I've adapted the Benthos configuration to produce an output similar to the current (webrequest) data:

{
  "accept": "*/*",
  "accept_language": "en",
  "backend": "ATS/9.1.4",
  "cache_status": "hit-local",
  "content_type": "text/html; charset=UTF-8",
  "dt": "2024-02-08T17:31:26Z",
  "hostname": "traffic-cache-bullseye",
  "http_method": "GET",
  "http_status": "200",
  "ip": "127.0.0.1",
  "priority": 134,
  "range": "-",
  "referer": "test.test",
  "response_size": 72068,
  "sequence": 17074134866,
  "severity": 6,
  "time_firstbyte": 0.003,
  "timestamp": "2024-02-08T17:31:26Z",
  "tls": "vers=TLSv1.3;keyx=unknown;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=;sess=new",
  "uri_host": "en.wikipedia.beta.wmflabs.org",
  "uri_path": "/wiki/Main_Page",
  "uri_query": "?foo=bar",
  "user_agent": "curl/7.74.0",
  "x_analytics": "ns=0;page_id=1;rev_id=587739;proxy=OperaMini;https=1;client_port=58056;nocookies=1",
  "x_cache": "traffic-cache-bullseye hit, traffic-cache-bullseye hit/6"
}

Compared with the captured one:

{
  "accept": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.2.0\"",
  "accept_language": "en",
  "backend": "ATS/9.1.4",
  "cache_status": "hit-front",
  "content_type": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.5.0\"",
  "dt": "2023-11-23T16:04:17Z",
  "hostname": "cp3067.esams.wmnet",
  "http_method": "GET",
  "http_status": "200",
  "ip": "<REDACTED>",
  "range": "-",
  "referer": "https://en.wikipedia.org/w/index.php?title=Category:Films_based_on_non-fiction_books&pagefrom=Power+Play+%281978+film%29%0APower+Play+%281978+film%29",
  "response_size": 987,
  "sequence": 10558502962,
  "time_firstbyte": 0.000201,
  "tls": "vers=TLSv1.3;keyx=UNKNOWN;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new",
  "uri_host": "en.wikipedia.org",
  "uri_path": "/api/rest_v1/page/summary/Secretariat_(film)",
  "uri_query": "",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0",
  "x_analytics": "WMF-Last-Access=23-Nov-2023;WMF-Last-Access-Global=23-Nov-2023;include_pv=0;https=1;client_port=33126",
  "x_cache": "cp3067 miss, cp3067 hit/5"
}

@Fabfur here is an example payload with added meta, as we'd expect to receive it according to the WIP webrequest event schema.

{
  "meta": {
      dt: "2023-11-23T16:04:17Z", # value set by Benthos
      stream: "webrequest_text", # value set by Benthos
      domain: "en.wikipedia.org", # can we get this from HAProxy?
      request_id: request-uuid # can we get this from HAProxy?
      id: "event-uuid" # value set by Benthos? 
   },
  "accept": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.2.0\"",
  "accept_language": "en",
  "backend": "ATS/9.1.4",
  "cache_status": "hit-front",
  "content_type": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.5.0\"",
  "dt": "2023-11-23T16:04:17Z", # value recorded by HAProxy
  "hostname": "cp3067.esams.wmnet",
  "http_method": "GET",
  "http_status": "200",
  "ip": "<REDACTED>",
  "range": "-",
  "referer": "https://en.wikipedia.org/w/index.php?title=Category:Films_based_on_non-fiction_books&pagefrom=Power+Play+%281978+film%29%0APower+Play+%281978+film%29",
  "response_size": 987,
  "sequence": 10558502962,
  "time_firstbyte": 0.000201,
  "tls": "vers=TLSv1.3;keyx=UNKNOWN;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new",
  "uri_host": "en.wikipedia.org",
  "uri_path": "/api/rest_v1/page/summary/Secretariat_(film)",
  "uri_query": "",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0",
  "x_analytics": "WMF-Last-Access=23-Nov-2023;WMF-Last-Access-Global=23-Nov-2023;include_pv=0;https=1;client_port=33126",
  "x_cache": "cp3067 miss, cp3067 hit/5"
}

In meta we'd need at least dt and stream. Those fields are required by etl tooling. The other fields would be very nice to have, but in a first iteration I think we can live without them if they're too cumbersome to define / collect.
This fragment contains the full declaration of common EP schema fields (meta + dt).

I am having a look at Benthos now; I wonder if we could implement event schema validation upstream with a plugin (pure speculation - not implying requirements). cc @tchin
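For what it's worth, a minimal sketch of what such a validation could look like downstream in Python, using the jsonschema package; the schema fragment below is a simplified, hand-written stand-in, not the actual WIP webrequest schema from schemas/event/primary.

import jsonschema

SCHEMA_FRAGMENT = {
    "type": "object",
    "required": ["meta", "dt", "http_status", "uri_host"],
    "properties": {
        "meta": {
            "type": "object",
            "required": ["dt", "stream"],
            "properties": {
                "dt": {"type": "string"},
                "stream": {"type": "string"},
            },
        },
        "dt": {"type": "string"},
        "http_status": {"type": "string"},
        "uri_host": {"type": "string"},
    },
}

event = {
    "meta": {"dt": "2023-11-23T16:04:17Z", "stream": "webrequest_text"},
    "dt": "2023-11-23T16:04:17Z",
    "http_status": "200",
    "uri_host": "en.wikipedia.org",
}

# Raises jsonschema.ValidationError if the event does not match the fragment.
jsonschema.validate(instance=event, schema=SCHEMA_FRAGMENT)
print("event validates against the fragment")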

With Benthos we have the ability to set actual metadata attached to the message (eg. with meta keyword) or simply define another meta key under the root one and set all needed fields under that.

Both approaches are feasible (also at the same time if we do accept to increase the payload a little)...

{
  "meta": {
      dt: "2023-11-23T16:04:17Z", # value set by Benthos
      stream: "webrequest_text", # value set by Benthos
      domain: "en.wikipedia.org", # can we get this from HAProxy?
      request_id: request-uuid # can we get this from HAProxy?
      id: "event-uuid" # value set by Benthos? 
   },
...

No problem with the two different timestamps generated by HAProxy and Benthos in different fields (meta and "regular"), I've already changed the Benthos configuration to reflect this. Same for the meta domain (that is a duplicate of the uri_host field).

The stream value, like the hostname one, can be set by Puppet in the configuration template (we use this approach with VarnishKafka), as it's not expected to change.

Both request_id and id can be generated by Benthos and HAProxy as UUIDv4 (HAProxy uses RFC4122 standard). If that's ok we can consider this settled...

Both approaches are feasible (also at the same time if we do accept to increase the payload a little)...

Nice. From my side, increasing payload size with meta should not be an issue (and AFAIK not significantly impact jumbo). It's a field we expect anyway. I'm implementation agnostic, but very much curious to see how Benthos handles things.

If that's ok we can consider this settled...

Sounds good to me. Modulo maybe fine tuning naming and id formats.

"dt": "2024-02-07T12:04:41.512972244Z",

I think that this more precise timestamp would be parseable by our ingestion system just fine, but we should verify. If we can get this precise, I suppose... why not? I see that the existing varnish dt has only second precision, which doesn't seem very precise, especially for webrequest. Perhaps we should take this opportunity to increase the precision a bit. If we can, we should strive for at least millisecond. Not a blocker for this task though.

Both request_id and id can be generated by Benthos

Nice! request_id is usually populated from the X-Request-Id header for tracing purposes. Can we set that from haproxy?

stream: "webrequest_text", # value set by Benthos

TBD on final stream name in T314956: [Event Platform] Declare webrequest as an Event Platform stream, but the currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics into the same stream, by setting topics manually in stream config. Gobblin will ingest the topics configured in stream config.

We might want to add a new field to indicate the cache cluster / webrequest source the request is from. The webrequest refine job will pull the Hive webrequest_source partition from the topic name; but I think it might be best to have this info in the event data too. (I suppose the Refine job could do that, but it's probably more future proof to let the haproxy producer set the value explicitly).

TBD on final stream name in T314956: [Event Platform] Declare webrequest as an Event Platform stream, but the currently suggested one is webrequest.frontend

Open question: do we want webrequest.frontend (or whatever we settle on) to be a versioned stream? https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning

That's a config setting that mostly concerns the EP side of tooling (kafka topics would not necessarily have to follow the same versioning scheme).

the currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics into the same stream, by setting topics manually in stream config. Gobblin will ingest the topics configured in stream config.

We discussed this OTR in slack, and this should be feasible with the current gobblin config.

We might want to add a new field to indicate the cache cluster / webrequest source the request is from. The webrequest refine job will pull the Hive webrequest_source partition from the topic name; but I think it might be best to have this info in the event data too. (I suppose the Refine job could do that, but it's probably more future proof to let the haproxy producer set the value explicitly).

Good point @Ottomata
I'd also prefer if this was set upstream by the haproxy producer. Would be useful if / when accessing the kafka topics (instead of refined data in HDFS).

cc / @Fabfur

Open question: do we want webrequest.frontend (or whatever we settle on) to be a versioned stream? https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning

xpost for visibility: https://phabricator.wikimedia.org/T314956#9539679

I tagged the current WIP stream as webrequest.frontend.rc0. The expected json payload would look something like this:

{
  "meta": {
      dt: "2023-11-23T16:04:17Z", 
      stream: "webrequest.frontentd.rc0", # set by prometheus 
      domain: "en.wikipedia.org",
      request_id: request-uuid ,
      id: "event-uuid",
   },
   ...
}

This versioning is useful for experimentation, and for (eventually) introducing breaking changes in GA event schemas. The drawback is that we need to keep track of version in multiple places:

  • mediawiki-config: where the stream config is declared
  • puppet: where the haproxy producer gets the stream value to set.

Since webrequest.frontend is not strictly an Event Platform (produced) stream, it might make more sense to drop the version suffix.

the currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics into the same stream, by setting topics manually in stream config. Gobblin will ingest the topics configured in stream config.

We discussed this OTR in slack, and this should be feasible with the current gobblin config.

We might want to add a new field to indicate the cache cluster / webrequest source the request is from. The webrequest refine job will pull the Hive webrequest_source partition from the topic name; but I think it might be best to have this info in the event data too. (I suppose the Refine job could do that, but it's probably more future proof to let the haproxy producer set the value explicitly).

Good point @Ottomata
I'd also prefer if this was set upstream by the haproxy producer. Would be useful if / when accessing the kafka topics (instead of refined data in HDFS).

cc / @Fabfur

Ok, I've added a webrequest_source field to the produced output that can take the values text|upload (set by puppet, in the same way we set the kafka topic)

Update: Benthos is installed on cp4037 and, after some minor fixes, is finally ready to ingest, process and send HAProxy logs to the temporary Kafka topic (webrequest_text_test) on the jumbo cluster. Considering that, when we switch on the new log destination in HAProxy and repool the host, a very large amount of data will be processed and sent, I prefer to start by "manually" sending single log lines to the Benthos socket and checking that everything is correct on the Kafka side.
This means that the first messages on the topic will be "artificial" and can be safely discarded.

I'll let you know the results of this test and when we can actually start processing "real" logs (a topic cleanup may be necessary to avoid ingesting the "fake" logs into the analytics pipeline).
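A minimal sketch of that manual test, assuming Benthos exposes a local UDP input; the address/port below are hypothetical placeholders, and the log line is a truncated copy of the earlier sample.

import socket

BENTHOS_UDP_ADDR = ("127.0.0.1", 10514)  # placeholder for the real Benthos input

log_line = (
    "5 387 0 0 200 {en.wikipedia.beta.wmflabs.org} {miss} -- newlog "
    "127.0.0.1:57954 ... GET /wiki/Main_Page ?foo=bar HTTP/2.0"  # truncated
)

# Send one line into the Benthos socket, then check the Kafka topic contents.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.sendto(log_line.encode("utf-8"), BENTHOS_UDP_ADDR)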

Update: yesterday we modified the HAProxy log destination to send logs to Benthos and repooled cp4037 for a very short time to capture "real" logs. The Benthos configuration sent them to two separate Kafka topics: webrequest_text_test for "regular" logs and webrequest_text_test_error for errored (unparsable) logs.

NOTE that cp4037.ulsfo.wmnet, the server we're using for testing purposes, is currently depooled and downtimed, and puppet is disabled on that specific host

While the vast majority of logs have been parsed correctly, we noticed some minor issues that need to be addressed, mainly:

  • The "metadata field" in benthos output must be renamed from debug_metadata to meta. This has been already fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009722 and is waiting for merge
  • Some message types cannot be parsed correctly, two examples:
    • <REDACTED IP>:52088 [07/Mar/2024:14:58:14.151] tls/2: SSL handshake failure
    • <REDACTED IP>:59665 - [07/Mar/2024:14:58:14.237] http http/<NOSRV> 0/-1/-1/-1/0 301 164 - - LR-- 2059/9/0/0/0 0/0 {www.wikipedia.org} {int-tls} GET / HTTP/1.1 cee4b720-1709-45f3-913a-ba0dc6604454

In the first case the reason is obvious: such errors cannot really match our parsing pattern.

The second case required more investigation, and after some trials I think I've nailed the reason: this kind of request (redirects) that hits the http frontend in HAProxy does not have the same log format as the ones that hit the tls frontend, simply because in the http frontend the captured headers are very different in number and type.

So IMHO we have two different ways to manage this:

  1. Ignore entries that hit the http frontend (used only to redirect to https), e.g. by specifying a different log target in HAProxy. Would we lose important information this way? Are these requests passed down and collected by Varnish?
  2. Edit the HAProxy configuration for the http frontend to capture the same headers (or use dummy placeholders where not applicable) and keep the log-format consistent across the two frontends. This would allow Benthos to parse logs from both frontends. Would there be any problem with capturing and logging headers different from those on the Varnish side?

I suggest considering the second approach; a possible CR would be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009724

Mentioned in SAL (#wikimedia-operations) [2024-03-08T15:32:09Z] <fabfur> repooling cp4037 for this weekend, all log-format changes are reverted (T351117)

Note: I've repooled cp4037 for the next few days, as I'll be busy at the SRE Summit and unable to work on it.

All modifications to HAProxy logging have been reverted; we'll re-enable them by reverting https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009769 and keep working on the fixes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009722 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009724

I've fixed some errors (now metadata field name should be correct) in the Benthos configuration. I've repooled cp4037 for about one minute (https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2024-03-18) to collect some logs.

ATM the logs sent to the deadletter queue (webrequest_text_test_error) are all messages that can be safely ignored from the Analytics perspective (we could still use some metrics about those, but let's see).

@gmodena you should have some more data to play with now, while I work on the performance optimization and on Benthos internal metrics...

Change 1012656 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery@master] Add webrequest_frontent raw schema.

https://gerrit.wikimedia.org/r/1012656

Change #983905 merged by jenkins-bot:

[operations/mediawiki-config@master] Add webrequest.frontend.rc0 stream

https://gerrit.wikimedia.org/r/983905

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:16:47Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]]

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:20:33Z] <hashar@deploy1002> otto and hashar: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:37:47Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] (duration: 20m 59s)

Change #983898 merged by jenkins-bot:

[schemas/event/primary@master] development: add webrequest schema

https://gerrit.wikimedia.org/r/983898

Change #1015260 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] webrequest: disable canary events.

https://gerrit.wikimedia.org/r/1015260

Change #1015260 merged by jenkins-bot:

[operations/mediawiki-config@master] webrequest: disable canary events.

https://gerrit.wikimedia.org/r/1015260

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:12:44Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:28:20Z] <hashar@deploy1002> gmodena and hashar: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:46:48Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]] (duration: 34m 03s)

Change #1017041 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/puppet@production] analytics: refinery: add webrequest_frontend timer

https://gerrit.wikimedia.org/r/1017041

@gmodena you should have some more data to play with now, while I work on the performance optimization and on Benthos internal metrics...

f/up to some threads we had in irc/slack.

The composite webrequest.frontend.rc0 stream is now declared in mw stream config.
All events currently being produced by benthos to kafka (upload/text) validate against the jsonschema.

I set up an end to end pipeline (gobblin + airflow) to load data from kafka, and generate the downstream (postprocessed) webrequest dataset (and the request error/loss data). It implements the same logic that currently processes the varnish feed. The new dataset will be available in superset as gmodena.webrequest (you can already play with a small sample). To ease analysis, it is exposed with the same schema as the current (varnish based) table.

I have some patches in flight that, once deployed, will automate ingestion and etl of the kafka topics.

Next steps: now that we are starting to collect more logs, we can start comparing current / new webrequest records.

Change #1017913 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: remove timestamp from unique-id-format

https://gerrit.wikimedia.org/r/1017913

Change #1017913 merged by Fabfur:

[operations/puppet@production] haproxy: remove timestamp from unique-id-format

https://gerrit.wikimedia.org/r/1017913

Change #1018188 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: increase capture header length for UA

https://gerrit.wikimedia.org/r/1018188

Change #1018188 merged by Fabfur:

[operations/puppet@production] haproxy: increase capture header length for UA

https://gerrit.wikimedia.org/r/1018188

Change #983926 merged by Gmodena:

[analytics/refinery@master] Add gobblin job webrequest_frontend_rc0

https://gerrit.wikimedia.org/r/983926

Change #1017041 merged by Btullis:

[operations/puppet@production] analytics: refinery: add webrequest_frontend timer

https://gerrit.wikimedia.org/r/1017041

Change #1019867 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery/source@master] refinery-job: add webrequest instrumentation.

https://gerrit.wikimedia.org/r/1019867

Next steps: now that we are starting to collect more logs, we can start comparing current / new webrequest records.

An update after extensive rounds of validation that @Fabfur and I did in the past two weeks. @Fabfur holler if I'm missing something.

The haproxy and varnishkafka generated logs seem to be in agreement both in terms of volume (requests by status code and hostname) and field (discrete values) overlap. All ulsfo traffic should now be served through haproxy.

A summary of our validation plan and findings can be found in this HAProxy log validation plan.

The only difference compared to the varnishkafka logs is the sequence number. With haproxy we observe the following behaviour:

  • upon haproxy (config) reloads, a new process is spawned that shares the request count. As a result we'll log a bunch of requests with duplicated sequence numbers.
  • upon haproxy restarts, we'll reset (and log) the counter to 0

There's an internal Slack thread wrt how this could impact ops, and how to address it (cc / @xcollazo @JAllemandou ).

Pipeline status
  • The new Kafka topics are ingested at 10-minute intervals by Gobblin. Raw data is available in Hive at wmf_raw.webrequest_frontend_rc0.
  • A refine pipeline is currently running on a dev environment and producing post-processed data in gmodena.webrequest. Data loss support tables are available in the gmodena namespace. Note that this pipeline is orchestrated from a dev Airflow instance, and data might lag.
  • Traffic volume metrics (requests by status code and host) are available in gmodena.haproxy_requests. Note that they are pre-computed in incremental batches and may lag a few hours behind live traffic.
  • We added instrumentation to webrequest (via the DQ framework) to try and spot any regressions post-migration.
Next steps

I believe the next steps would be agreeing on acceptance criteria (discussed with @Ahoelzl) and coming up with a rollout timeline.

From the data side, before switching off varnishkafka, we'll need to:

  • Plan the webrequest_raw sources switch in the production refine pipeline.
  • Bump the Event Platform stream name from webrequest_frontend_rc0 to webrequest_frontend, and promote its event schema to the primary registry (this should not impact the timeline).

I agree with @gmodena on all topics, more specifically:

  • About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee the monotonic increase (or the increase at all) in this case. I suggest using the current approach for the moment and eventually reworking it later.
  • The X-Analytics-TLS header keyx parameter should be now uppercase, like in Varnish with change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021412

About next steps, we have a couple of options here:

  • We should start sending Benthos logs to the actual "production" kafka topic(s), same as VarnishKafka
  • We should turn off VarnishKafka on those cp hosts.

Ideally the two things would be done at the same time, especially if we don't have a way to differentiate downstream which logs come from Benthos and which come from VarnishKafka. If we can differentiate, we could first send the logs to the production topic and then switch off VarnishKafka at a later time (during this interval we'd have "duplicate" logs on the topic).

We could append (or prepend) other information pieces to the sequence number (like the haproxy process id) to avoid duplicates

Instead of prepending this to the number, what about just adding this as a new field? We are currently using hostname + sequence for uniqueness, why not hostname + pid + sequence?
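A quick sketch of the proposed dedup key; the haproxy_pid field name follows the field later added to the schema in this task, and the sample records are made up.

# Dedup on (hostname, pid, sequence) instead of (hostname, sequence), so that
# duplicated sequence numbers from a reloaded haproxy process (new pid, shared
# counter) don't collide with each other.
seen = set()

def is_duplicate(msg: dict) -> bool:
    key = (msg["hostname"], msg["haproxy_pid"], msg["sequence"])
    if key in seen:
        return True
    seen.add(key)
    return False

msgs = [
    {"hostname": "cp4037.ulsfo.wmnet", "haproxy_pid": 1234, "sequence": 42},
    {"hostname": "cp4037.ulsfo.wmnet", "haproxy_pid": 5678, "sequence": 42},  # after a reload: kept
    {"hostname": "cp4037.ulsfo.wmnet", "haproxy_pid": 1234, "sequence": 42},  # true duplicate: dropped
]
print([is_duplicate(m) for m in msgs])  # [False, False, True]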

About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee the monotonic increase (or the increase at all) in this case. I suggest using the current approach for the moment and eventually reworking it later.

Ack and +1 to your proposal. IMHO it's easier to be resilient to reloads than to work around non-monotonicity. Do you maybe have a feel for how often haproxy reloads are expected to happen once in prod? Could we assume they are sporadic events?

The X-Analytics-TLS header keyx parameter should be now uppercase, like in Varnish with change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021412

I'm seeing uppercase keyx in both topics.

We should start sending Benthos logs to the actual "production" kafka topic(s), same as VarnishKafka

Do you mean reusing the current varnishkafka webrequest topics, or creating new production topics for webrequest_text_test and webrequest_upload_test (changing the _test suffix) ?

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have a different schema. We could reuse the current topics, but we'd have to drain them first.

About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee the monotonic increase (or the increase at all) in this case. I suggest using the current approach for the moment and eventually reworking it later.

Ack and +1 to your proposal. IMHO it's easier to be resilient to reloads than to work around non-monotonicity. Do you maybe have a feel for how often haproxy reloads are expected to happen once in prod? Could we assume they are sporadic events?

HAProxy reloads could be frequent, considering that every time someone edits any configuration bit it triggers a reload. Also, @Ottomata's idea of adding another (separate) field to use as a discriminator for uniqueness is perfectly doable on our side.

The X-Analytics-TLS header keyx parameter should be now uppercase, like in Varnish with change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021412

I'm seeing uppercase keyx in both topics.

We should start sending Benthos logs to the actual "production" kafka topic(s), same as VarnishKafka

Do you mean reusing the current varnishkafka webrequest topics, or creating new production topics for webrequest_text_test and webrequest_upload_test (changing the _test suffix) ?

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have a different schema. We could reuse the current topics, but we'd have to drain them first.

We can do both, for us it's just a matter of changing a string in puppet. I think the decision is more on your side: choose the easiest/best option for you and we'll implement it!

I think @Ottomata 's idea is good: having another column makes it easy to keep the "monotonic" values, while still having a de-duplication key with the new field.

I think @Ottomata 's idea is good: having another column makes it easy to keep the "monotonic" values, while still having a de-duplication key with the new field.

+1

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have a different schema. We could reuse the current topics, but we'd have to drain them first.

We can do both, for us it's just a matter of changing a string in puppet. I think the decision is more on your side: choose the easiest/best option for you and we'll implement it!

Terrific!
@brouberol would it be ok to 2x webrequest data in jumbo (all topics have 7 days retention)? This will be temporary (no ETA yet though), till we fully migrate to the new haproxy+benthos producers.

The haproxy_id field has been added to messages.

(PS. I'll keep this open as umbrella task)

Change #1021902 had a related patch set uploaded (by Gmodena; author: Gmodena):

[schemas/event/primary@master] development: webrequest: add haproxy_pid field

https://gerrit.wikimedia.org/r/1021902

Change #1021902 merged by Gmodena:

[schemas/event/primary@master] development: webrequest: add haproxy_pid field

https://gerrit.wikimedia.org/r/1021902

The haproxy_id field has been added to messages.

This seems to have helped a lot with dedup, at least on the samples we collected so far.
@xcollazo @JAllemandou there is a sample in gmodena.webrequest_sequence_stats if you'd like to take a look.

As for next steps;

Schema finalization.

  • there's a couple of CRs pending (linked to this phab) and I'd like to have a second pass at the event schema naming conventions (cc / @Fabfur). We might want to drop webrequest_source since we don't currently use it in ETL (it's inferred from the HDFS path, not the schema).
  • we'll need to change the stream name from webrequest_frontend_rc0 to webrequest_frontend in EventStreamConfig. Possibly in puppet too.
  • we need names for the new Kafka topics. @Fabfur would you have any preference? Historically Event Platform topics are prefixed with a datacenter id.

I was thinking about the following rollout steps:

  1. Deploy the HAProxy Benthos producer across the entire fleet.
  2. Begin shipping logs to the prod Kafka topics once ready. Update ingestion and analytics pipelines once topics are deemed ready for downstream users.
  3. Run the Benthos and VarnishKafka producers in parallel for 7 days.
  4. Refine HAProxy data for 7 days, allocating a maintenance window to update the webrequest feed (tables+dag). HAProxy logs become the source of truth for wmf.webrequest. Notify downstream consumers and await bug reports. Roll back to the varnishkafka feed if critical issues arise.
  5. If metrics stabilize after X days, disable varnishkafka.

@Fabfur thoughts on this high-level plan? How long can varnishkafka and benthos run together across the fleet?
@brouberol would it be ok to 2x webrequest data (varnishkafka + benthos producers) in kafka-jumbo (and keep 7 days retention)?

Additional data platform-specific implementation steps can be discussed separately.

  • there's a couple of CRs pending (linked to this phab) and I'd like to have a second pass at the event schema naming conventions (cc / @Fabfur). We might want to drop webrequest_source since we don't currently use it in ETL (it's inferred from the HDFS path, not the schema).

No problem here, for us it's just a matter of removing a line from the Benthos configuration. Let me know if I can proceed!

  • we need names for the new Kafka topics. @Fabfur would you have any preference? Historically Event Platform topics are prefixed with a datacenter id.

Currently IIUC we just use webrequest_text and webrequest_upload. Do you suggest to use something like ulsfo_webrequest_text instead? This shouldn't be an issue for us, anyway.
I also suggest to keep the DLQ as <topic_name>_errors, as it has been an invaluable help for debugging.

I was thinking about the following rollout steps:

  1. Deploy the HAProxy Benthos producer across the entire fleet.
  2. Begin shipping logs to the prod Kafka topics once ready. Update ingestion and analytics pipelines once topics are deemed ready for downstream users.
  3. Run the Benthos and VarnishKafka producers in parallel for 7 days.
  4. Refine HAProxy data for 7 days, allocating a maintenance window to update the webrequest feed (tables+dag). HAProxy logs become the source of truth for wmf.webrequest. Notify downstream consumers and await bug reports. Roll back to the varnishkafka feed if critical issues arise.
  5. If metrics stabilize after X days, disable varnishkafka.

@Fabfur thoughts on this high-level plan? How long can varnishkafka and benthos run together across the fleet?

I like the overall idea, but I'd prefer to proceed DC-by-DC, switching topics and shutting down VarnishKafka only once we're sure about the correctness of the data. I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a little expensive for us in terms of bandwidth...

An update: we recently noticed some messages being dropped at the operating system level during high-traffic spikes. This means that HAProxy tried to send logs to Benthos but the UDP buffer was full, and the network stack discarded those messages before they could be read by Benthos. Unfortunately this isn't exactly easy to monitor, but here are some considerations:

  • The ulsfo DC, where we are testing this, is the lowest-traffic one. It is reasonable to expect that when Benthos is deployed in other datacenters this kind of drop could be even more frequent.
  • We are considering different solutions here (thanks to @Vgutierrez who is helping me with this) to mitigate this issue.
  • I don't think this would block the overall activity, but it is something we need to address right now to avoid worse performance in the future.

Do you suggest to use something like ulsfo_webrequest_text instead?

The current naming 'convention' is to use '.' as the concept separator in topic names. So: ulsfo.webrequest_text would be fine.

also suggest to keep the DLQ as <topic_name>_errors

Can we do <topic_name>.error, to follow our other error topic naming convention? E.g. ulsfo.webrequest_text.error ?

Actually, the convention we use isn't <topic_name>.error, but <producer_name>.error, because in this case the subject of the data in the topic isn't about webrequests, but about a problem with the producer. But, naming is hard, and e.g. ulsfo.webrequest_text.error is fine. :)