
Move analytics log from Varnish to HAProxy
Open, In Progress, Needs Triage · Public

Description

Now that we have moved all traffic to HAProxy, we should replace the current pipeline

  • Varnish logs -> Varnishkafka -> Kafka

to

  • HAProxy -> RSyslog -> Kafka

The format and content of the messages sent to Kafka should remain the same, to avoid breaking existing pipelines and tools.
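
For reference, a varnishkafka webrequest message is a single flat JSON object per request. An illustrative (and deliberately incomplete) subset of its fields, shown here as a Python dict with made-up values, would look roughly like the block below; the authoritative field list is in the VarnishKafka configuration:

# Illustrative subset of a webrequest message as currently produced by
# varnishkafka. Field names follow the existing webrequest pipeline; the
# values here are made up for the example.
example_webrequest_message = {
    "hostname": "cp4037.ulsfo.wmnet",  # cache host that served the request
    "sequence": 123456789,             # per-host message counter
    "dt": "2024-03-07T14:58:14Z",      # request timestamp
    "ip": "203.0.113.42",              # client IP (example value)
    "http_status": "301",
    "http_method": "GET",
    "uri_host": "www.wikipedia.org",
    "uri_path": "/",
    "user_agent": "Mozilla/5.0 ...",
    "x_analytics": "https=1;nocookies=1",
}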

Roughly, the actions needed are:

  • Investigate whether the log format can be transformed to match the VarnishKafka one (check out the VarnishKafka configuration file for more information)
  • Configure RSyslog to initially send the data to a different Kafka topic(s)
  • Compare the content of the existing topic(s) to the new one(s), to spot any differences (see the sketch after this list)
  • Finalize configuring RSyslog to send data to the existing Kafka topic(s) and remove VarnishKafka
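
A minimal sketch of that comparison step, assuming kafka-python is available; broker and topic names below are placeholders:

# Sketch: sample messages from the existing (varnishkafka) topic and from the
# new test topic, then report field-level differences.
import json
from kafka import KafkaConsumer

def sample(topic, brokers, n=1000):
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=brokers,
        auto_offset_reset="latest",
        consumer_timeout_ms=30000,
    )
    messages = []
    for record in consumer:
        messages.append(json.loads(record.value))
        if len(messages) >= n:
            break
    consumer.close()
    return messages

brokers = ["localhost:9092"]  # placeholder: point at the test cluster brokers
old = sample("webrequest_text", brokers)
new = sample("webrequest_text_test", brokers)

old_keys = set().union(*(m.keys() for m in old)) if old else set()
new_keys = set().union(*(m.keys() for m in new)) if new else set()
print("fields only in varnishkafka messages:", old_keys - new_keys)
print("fields only in rsyslog/haproxy messages:", new_keys - old_keys)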

To test the actual feasibility we could use the deployment-prep environment, as there's already a Kafka cluster there, and varnishkafka is configured on the cache hosts (NB: investigate why the deployment-prep jumbo cluster isn't receiving any messages from varnishkafka).

Another viable option (thanks to @brouberol) could be to use the kafka-test cluster in production, configuring one production cp host to send from rsyslog to this cluster on a disposable topic, and comparing the result with the actual messages sent by VarnishKafka.


Event Timeline


Open question: do we want webrequest.frontend (or whatever we settle on) to be a versioned stream? https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning

xpost for visibility: https://phabricator.wikimedia.org/T314956#9539679

I tagged the current WIP stream as webrequest.frontend.rc0. The expected JSON payload would look something like this:

{
  "meta": {
      "dt": "2023-11-23T16:04:17Z",
      "stream": "webrequest.frontend.rc0",  # set by prometheus
      "domain": "en.wikipedia.org",
      "request_id": "request-uuid",
      "id": "event-uuid"
   },
   ...
}

This versioning is useful for experimentation, and for (eventually) introducing breaking changes in GA event schemas. The drawback is that we need to keep track of version in multiple places:

  • mediawiki-config: where the stream config is declared
  • puppet: where the haproxy producer gets the stream value to set.

Since webrequest.frontend is not strictly an Event Platform (produced) stream, it might make more sense to drop the version suffix.

the currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics into the same stream, by setting topics manually in stream config. Gobblin will ingest the topics configured in stream config.

We discussed this OTR in slack, and this should be feasible with the current gobblin config.

We might want to add a new field to indicate the cache cluster / webrequest source the request is from. The webrequest refine job will pull the Hive webrequest_source partition from the topic name; but I think it might be best to have this info in the event data too. (I suppose the Refine job could do that, but it's probably more future-proof to let the haproxy producer set the value explicitly.)
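
For illustration only, this is the kind of mapping the Refine job would otherwise have to do, written as a hypothetical helper (not actual Refine code):

# Hypothetical helper: infer webrequest_source from the Kafka topic name,
# which is what Refine would have to do if the field is not set in the event.
def webrequest_source_from_topic(topic: str) -> str:
    for source in ("text", "upload"):
        if f"webrequest_{source}" in topic:
            return source
    raise ValueError(f"cannot infer webrequest_source from topic {topic!r}")

assert webrequest_source_from_topic("webrequest_text_test") == "text"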

Good point @Ottomata
I'd also prefer if this was set upstream by the haproxy producer. Would be useful if / when accessing the kafka topics (instead of refined data in HDFS).

cc / @Fabfur

Ok, I've added a webrequest_source field to the produced output that can take the values text or upload (set by puppet, in the same way we set the kafka topic).

Update: Benthos is installed on cp4037 and, after some minor fixes, is finally ready to ingest, process and send HAProxy logs to the temporary Kafka topic (webrequest_text_test) on the jumbo cluster. Considering that a very large amount of data will be processed and sent once we switch on the new log destination in HAProxy and repool the host, I prefer to start by "manually" sending single log lines to the Benthos socket and checking that everything is correct on the Kafka side.
This means that the first messages on the topic will be "artificial" and can be safely discarded.

I'll let you know the results of this test and when we can actually start processing "real" logs (a topic cleanup may be necessary to avoid ingesting "fake" logs into the analytics pipeline).
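
For the record, "manually sending single log lines" could look roughly like the sketch below, assuming a UDP input as discussed later in this task; address, port and the log line are placeholders, and the real Benthos input socket is whatever puppet configures:

# Sketch: push one hand-crafted HAProxy-style log line into a Benthos UDP
# input socket. Address, port and the sample line are placeholders.
import socket

BENTHOS_ADDR = ("127.0.0.1", 10514)  # placeholder address/port
sample_line = "198.51.100.7:59665 - [07/Mar/2024:14:58:14.237] ..."  # placeholder log line
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.sendto(sample_line.encode("utf-8"), BENTHOS_ADDR)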

Update: yesterday we modified the HAProxy log destination to send logs into Benthos and repooled cp4037 for a very short time to capture "real" logs. The Benthos configuration sent them to two separate Kafka topics: webrequest_text_test for "regular" logs and webrequest_text_test_error for errored (unparsable) logs.

NOTE that cp4037.ulsfo.wmnet, the server we're using for testing purposes, is currently depooled and downtimed, and puppet is disabled on that specific host

While most of the logs have been parsed correctly, we noticed some minor issues that need to be addressed, mainly:

  • The "metadata field" in benthos output must be renamed from debug_metadata to meta. This has been already fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009722 and is waiting for merge
  • Some message types cannot be parsed correctly, two examples:
    • <REDACTED IP>:52088 [07/Mar/2024:14:58:14.151] tls/2: SSL handshake failure
    • <REDACTED IP>:59665 - [07/Mar/2024:14:58:14.237] http http/<NOSRV> 0/-1/-1/-1/0 301 164 - - LR-- 2059/9/0/0/0 0/0 {www.wikipedia.org} {int-tls} GET / HTTP/1.1 cee4b720-1709-45f3-913a-ba0dc6604454

In the first case the reason is obvious: such errors cannot really match our parsing pattern.

The second case required more investigation, and after some trials I think I've nailed the reason: these requests (redirects) hit the http frontend in HAProxy, which does not have the same log format as the tls frontend, simply because the two frontends capture headers that are very different in number and type.
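
To make the difference concrete, here is a rough sketch of the parsing problem in Python. The regex is a simplified stand-in, not the actual Benthos pattern, and the IP is a placeholder for the redacted one in the examples above:

import re

# Rough stand-in for the parsing pattern (NOT the real Benthos one): a
# simplified regex for an HAProxy HTTP log line with two captured-header blocks.
simplified = re.compile(
    r"(?P<client>\S+) \S+ \[(?P<ts>[^\]]+)\] (?P<frontend>\S+) (?P<backend>\S+) "
    r"(?P<timers>\S+) (?P<status>\d+) (?P<bytes>\d+) \S+ \S+ (?P<termination>\S+) "
    r"(?P<conn>\S+) (?P<queue>\S+) \{(?P<capture1>[^}]*)\} \{(?P<capture2>[^}]*)\} "
    r"(?P<request>.+) (?P<request_id>\S+)$"
)

# The http-frontend example from above (placeholder IP instead of the redacted one).
http_line = (
    "198.51.100.7:59665 - [07/Mar/2024:14:58:14.237] http http/<NOSRV> "
    "0/-1/-1/-1/0 301 164 - - LR-- 2059/9/0/0/0 0/0 {www.wikipedia.org} "
    "{int-tls} GET / HTTP/1.1 cee4b720-1709-45f3-913a-ba0dc6604454"
)
# The SSL handshake failure example from above.
ssl_error = "198.51.100.7:52088 [07/Mar/2024:14:58:14.151] tls/2: SSL handshake failure"

assert simplified.match(http_line)          # parses, but with different captured headers
assert simplified.match(ssl_error) is None  # no request-style pattern can apply here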

So IMHO we have two different ways to manage this:

  1. Ignore entries that hit the http frontend (used only to redirect to https), e.g. by specifying a different log target in HAProxy. Would we lose important information this way? Are those requests passed down and collected by Varnish?
  2. Edit the HAProxy configuration for the http frontend to capture the same headers (or use dummy placeholders where not applicable) and keep the log-format consistent across the two frontends. This would allow Benthos to parse logs from both frontends. What problems could there be if we capture and log different headers than on the Varnish side?

I suggest considering the second approach; a possible CR would be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009724

Mentioned in SAL (#wikimedia-operations) [2024-03-08T15:32:09Z] <fabfur> repooling cp4037 for this weekend, all log-format changes are reverted (T351117)

Note: I've repooled cp4037 for the next few days as I'll be busy with the SRE Summit and unable to work on it.

All modifications to HAProxy logging have been reverted; we'll re-enable them by reverting https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009769 and keep working on the fixes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009722 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009724

I've fixed some errors in the Benthos configuration (the metadata field name should now be correct). I've repooled cp4037 for about one minute (https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2024-03-18) to collect some logs.

ATM the logs sent to the dead-letter queue (webrequest_text_test_error) are all messages that can be safely ignored from the Analytics perspective (though we could use some metrics about them; let's see).

@gmodena you should have some more data to play with now, while I work on the performance optimization and on Benthos internal metrics...

Change 1012656 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery@master] Add webrequest_frontent raw schema.

https://gerrit.wikimedia.org/r/1012656

Change #983905 merged by jenkins-bot:

[operations/mediawiki-config@master] Add webrequest.frontend.rc0 stream

https://gerrit.wikimedia.org/r/983905

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:16:47Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]]

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:20:33Z] <hashar@deploy1002> otto and hashar: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-03-27T08:37:47Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] (duration: 20m 59s)

Change #983898 merged by jenkins-bot:

[schemas/event/primary@master] development: add webrequest schema

https://gerrit.wikimedia.org/r/983898

Change #1015260 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] webrequest: disable canary events.

https://gerrit.wikimedia.org/r/1015260

Change #1015260 merged by jenkins-bot:

[operations/mediawiki-config@master] webrequest: disable canary events.

https://gerrit.wikimedia.org/r/1015260

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:12:44Z] <hashar@deploy1002> Started scap: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:28:20Z] <hashar@deploy1002> gmodena and hashar: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-02T07:46:48Z] <hashar@deploy1002> Finished scap: Backport for [[gerrit:1015260|webrequest: disable canary events. (T314956 T351117)]] (duration: 34m 03s)

Change #1017041 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/puppet@production] analytics: refinery: add webrequest_frontend timer

https://gerrit.wikimedia.org/r/1017041

@gmodena you should have some more data to play with now, while I work on the performance optimization and on Benthos internal metrics...

f/up to some threads we had in irc/slack.

The composite webrequest.frontend.rc0 stream is now declared in mw stream config.
All events currently being produced by benthos to kafka (upload/text) validate against the jsonschema.

I set up an end-to-end pipeline (gobblin + airflow) to load data from kafka and generate the downstream (post-processed) webrequest dataset (plus error/data loss reports). It implements the same logic that currently processes the varnish feed. The new dataset will be available in superset as gmodena.webrequest (you can already play with a small sample). To ease analysis, it is exposed with the same schema as the current (varnish-based) table.

I have some patches in flight that, once deployed, will automate ingestion and ETL of the kafka topics.

Next steps: now that we are starting to collect more logs, we can start comparing current / new webrequest records.

Change #1017913 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: remove timestamp from unique-id-format

https://gerrit.wikimedia.org/r/1017913

Change #1017913 merged by Fabfur:

[operations/puppet@production] haproxy: remove timestamp from unique-id-format

https://gerrit.wikimedia.org/r/1017913

Change #1018188 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] haproxy: increase capture header length for UA

https://gerrit.wikimedia.org/r/1018188

Change #1018188 merged by Fabfur:

[operations/puppet@production] haproxy: increase capture header length for UA

https://gerrit.wikimedia.org/r/1018188

Change #983926 merged by Gmodena:

[analytics/refinery@master] Add gobblin job webrequest_frontend_rc0

https://gerrit.wikimedia.org/r/983926

Change #1017041 merged by Btullis:

[operations/puppet@production] analytics: refinery: add webrequest_frontend timer

https://gerrit.wikimedia.org/r/1017041

Change #1019867 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery/source@master] refinery-job: add webrequest instrumentation.

https://gerrit.wikimedia.org/r/1019867

Next steps: now that we are starting to collect more logs, we can start comparing current / new webrequest records.

An update after extensive rounds of validation that @Fabfur and I did in the past two weeks. @Fabfur holler if I'm missing something.

The haproxy and varnishkafka generated logs seem to be in agreement both in terms of volume (requests by status code and hostname)
and in terms of field (discrete values) overlap. All ulsfo traffic should now be served through haproxy.

A summary of our validation plan and findings can be found in this HAProxy log validation plan.

The only difference compared to varnishkafka logs is the sequence number. With haproxy we observe the following behaviour:

  • upon haproxy (config) reloads, a new process is spawned that shares the request count. As a result we'll log a bunch of requests with duplicated sequence numbers.
  • upon haproxy restarts, we'll reset (and log) the counter to 0

There's an internal Slack thread wrt how this could impact ops, and how to address it (cc / @xcollazo @JAllemandou ).

Pipeline status
  • The new Kafka topics are ingested at 10-minute intervals by Gobblin. Raw data is available in Hive at wmf_raw.webrequest_frontend_rc0.
  • A refine pipeline is currently running on a dev environment and producing post-processed data in gmodena.webrequest. Data loss support tables are available in the gmodena namespace. Note that this pipeline is orchestrated from a dev Airflow instance, and data might lag.
  • Traffic volume metrics (requests by status code and host) are available in gmodena.haproxy_requests. Note that they are pre-computed in incremental batches and may lag a few hours behind live traffic.
  • We added instrumentation to webrequest (via the DQ framework) to try to spot any regressions post-migration.

Next steps

I believe the next steps would be agreeing on acceptance criteria (discussed with @Ahoelzl) and coming up with a rollout timeline.

From the data side, before switching off varnishkafka, we'll need to:

  • Plan the webrequest_raw sources switch in the production refine pipeline.
  • Bump the Event Platform stream name from webrequest_frontend_rc0 to webrequest_frontend, and promote its event schema to the primary registry (this should not impact the timeline).

I agree with @gmodena on all topics, more specifically:

  • About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee the monotonic increase (or even the increase) in that case. I suggest keeping the current approach for the moment and eventually reworking it later.
  • The X-Analytics-TLS header keyx parameter should now be uppercase, like in Varnish, with change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021412

About next steps, we have a couple of options here:

  • We should start sending Benthos logs to the actual "production" kafka topic(s), same as VarnishKafka
  • We should turn off VarnishKafka on those cp hosts.

Ideally the two things could be done at the same time, especially if we don't have a way to differentiate downstream which logs come from Benthos and which come from VarnishKafka. If we can differentiate, we could first send the logs to the production topic and switch off VarnishKafka at a later time (in that interval we'd have "duplicate" logs on the topic).

We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates

Instead of prepending this to the number, what about just adding this as a new field? We are currently using hostname + sequence for uniqueness, why not hostname + pid + sequence?
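
A sketch of what that uniqueness key could look like downstream (field names as discussed in this task; haproxy_pid stands for the proposed new field):

# Sketch: de-duplicate webrequest messages on (hostname, haproxy_pid, sequence)
# instead of (hostname, sequence), so duplicated sequence numbers emitted by a
# reloaded haproxy process no longer collide.
def dedup(messages):
    seen = set()
    unique = []
    for m in messages:
        key = (m["hostname"], m.get("haproxy_pid"), m["sequence"])
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique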

About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee the monotonic increase (or even the increase) in that case. I suggest keeping the current approach for the moment and eventually reworking it later.

Ack and +1 to your proposal. IMHO it's easier to be resilient to reloads than to work around non-monotonicity. Do you maybe have a feel for how often haproxy reloads are expected to happen once in prod? Could we assume they are sporadic events?

The X-Analytics-TLS header keyx parameter should now be uppercase, like in Varnish, with change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021412

I'm seeing uppercase keyx in both topics.

We should start sending Benthos logs to the actual "production" kafka topic(s), same as VarnishKafka

Do you mean reusing the current varnishkafka webrequest topics, or creating new production topics for webrequest_text_test and webrequest_upload_test (changing the _test suffix) ?

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have a different schema. We could reuse the current topics, but we'd have to drain them first.

About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee the monotonic increase (or even the increase) in that case. I suggest keeping the current approach for the moment and eventually reworking it later.

Ack and +1 to your proposal. IMHO it's easier to be resilient to reloads than to work around non-monotonicity. Do you maybe have a feel for how often haproxy reloads are expected to happen once in prod? Could we assume they are sporadic events?

HAProxy reloads could be frequent, considering that every time someone edits any configuration bit a reload is triggered. Also, @Ottomata's idea of adding another (separate) field to use as a discriminator for uniqueness is perfectly doable on our side.

The X-Analytics-TLS header keyx parameter should now be uppercase, like in Varnish, with change https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021412

I'm seeing uppercase keyx in both topics.

We should start sending Benthos logs to the actual "production" kafka topic(s), same as VarnishKafka

Do you mean reusing the current varnishkafka webrequest topics, or creating new production topics for webrequest_text_test and webrequest_upload_test (changing the _test suffix) ?

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have a different schema. We could reuse the current topics, but we'd have to drain them first.

We can do both; for us it's just a matter of changing a string in puppet. I think the decision is more on your side: choose the easiest/best option for you and we'll implement it!

I think @Ottomata 's idea is good: having another column makes it easy to keep the "monotonic" values, while still having a de-duplication key with the new field.

I think @Ottomata 's idea is good: having another column makes it easy to keep the "monotonic" values, while still having a de-duplication key with the new field.

+1

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have a different schema. We could reuse the current topics, but we'd have to drain them first.

We can do both; for us it's just a matter of changing a string in puppet. I think the decision is more on your side: choose the easiest/best option for you and we'll implement it!

Terrific!
@brouberol would it be ok to 2x the webrequest data in jumbo (all topics have 7 days retention)? This would be temporary (no ETA yet though), until we fully migrate to the new haproxy+benthos producers.

The haproxy_pid field has been added to messages.

(PS. I'll keep this open as an umbrella task)

Change #1021902 had a related patch set uploaded (by Gmodena; author: Gmodena):

[schemas/event/primary@master] development: webrequest: add haproxy_pid field

https://gerrit.wikimedia.org/r/1021902

Change #1021902 merged by Gmodena:

[schemas/event/primary@master] development: webrequest: add haproxy_pid field

https://gerrit.wikimedia.org/r/1021902

The haproxy_pid field has been added to messages.

This seems to have helped a lot with dedup, at least on the samples we collected so far.
@xcollazo @JAllemandou there is a sample in gmodena.webrequest_sequence_stats if you'd like to take a look.

As for next steps:

Schema finalization.

  • there are a couple of CRs pending (linked to this phab) and I'd like to have a second pass on the event schema naming conventions (cc / @Fabfur). We might want to drop webrequest_source since we don't currently use it in ETL (it's inferred from the HDFS path, not the schema).
  • we'll need to change the stream name from webrequest_frontend_rc0 to webrequest_frontend in EventStreamConfig. Possibly in puppet too.
  • we need names for the new Kafka topics. @Fabfur would you have any preference? Historically Event Platform topics are prefixed with a datacenter id.

I was thinking about the following rollout steps:

  1. Deploy the HAProxy Benthos producer across the entire fleet.
  2. Begin shipping logs to prod Kafka topics once ready. Update ingestion and analytics pipelines once topics are deemed ready for downstream users.
  3. Run the Benthos and VarnishKafka producers in parallel for 7 days.
  4. Refine HAProxy data for 7 days, allocating a maintenance window to update the webrequest feed (tables + dag). HAProxy logs become the source of truth for wmf.webrequest. Notify downstream consumers and await bug reports. Roll back to the varnishkafka feed if critical issues arise.
  5. If metrics stabilize after X days, disable varnishkafka.

@Fabfur thoughts on this high-level plan? How long can varnishkafka and benthos run together across the fleet?
@brouberol would it be ok to 2x the webrequest data (varnishkafka + benthos producers) in kafka-jumbo (and keep 7 days retention)?

Additional data platform-specific implementation steps can be discussed separately.

  • there are a couple of CRs pending (linked to this phab) and I'd like to have a second pass on the event schema naming conventions (cc / @Fabfur). We might want to drop webrequest_source since we don't currently use it in ETL (it's inferred from the HDFS path, not the schema).

No problem here; for us it's just a matter of removing a line from the Benthos configuration. Let me know if I can proceed!

  • we need names for the new Kafka topics. @Fabfur would you have any preference? Historically Event Platform topics are prefixed with a datacenter id.

Currently IIUC we just use webrequest_text and webrequest_upload. Do you suggest using something like ulsfo_webrequest_text instead? This shouldn't be an issue for us, anyway.
I also suggest keeping the DLQ as <topic_name>_errors, as it has been an invaluable help for debugging.

I was thinking about the following rollout steps:

  1. Deploy the HAProxy Benthos producer across the entire fleet.
  2. Begin shipping logs to prod Kafka topics once ready. Update ingestion and analytics pipelines once topics are deemed ready for downstream users.
  3. Run the Benthos and VarnishKafka producers in parallel for 7 days.
  4. Refine HAProxy data for 7 days, allocating a maintenance window to update the webrequest feed (tables + dag). HAProxy logs become the source of truth for wmf.webrequest. Notify downstream consumers and await bug reports. Roll back to the varnishkafka feed if critical issues arise.
  5. If metrics stabilize after X days, disable varnishkafka.

@Fabfur thoughts on this high-level plan? How long can varnishkafka and benthos run together across the fleet?

I like the overall idea, but I'd prefer to proceed DC-by-DC, switching topics and shutting down VarnishKafka only once we are sure about the correctness of the data. I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a little bit expensive for us in terms of bandwidth...

An update: we noticed some message drops at the operating system level during high-traffic spikes. This means that HAProxy tried to send logs to Benthos, but the UDP buffer was full and the network stack discarded those messages before Benthos could read them. Unfortunately this isn't exactly easy to monitor (a rough approach is sketched after the list below), but here are some considerations:

  • The ulsfo DC where we are testing this is the one with the lowest traffic. It is reasonable to expect that, when Benthos is deployed in the other datacenters, this kind of drop could become even more frequent.
  • We are considering different solutions (thanks to @Vgutierrez, who is helping me with this) to mitigate this issue.
  • I don't think this would block the overall activity, but it is something we need to address right now to avoid worse performance in the future.
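
For completeness, a rough way to watch for this kind of drop at the OS level (a sketch only; it reads the host-wide UDP counters from /proc/net/snmp, so it is not specific to the Benthos socket):

# Sketch: report UDP receive-buffer errors from /proc/net/snmp. RcvbufErrors
# increments when datagrams are discarded because a socket receive buffer is
# full, which is the failure mode suspected between HAProxy and Benthos.
def udp_rcvbuf_errors():
    with open("/proc/net/snmp") as f:
        rows = [line.split() for line in f if line.startswith("Udp:")]
    header, values = rows[0], rows[1]
    counters = dict(zip(header[1:], (int(v) for v in values[1:])))
    return counters.get("RcvbufErrors", 0)

print("UDP RcvbufErrors:", udp_rcvbuf_errors())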

Do you suggest using something like ulsfo_webrequest_text instead?

The current naming 'convention' is to use '.' as the concept separator in topic names. So: ulsfo.webrequest_text would be fine.

also suggest keeping the DLQ as <topic_name>_errors

Can we do <topic_name>.error, to follow our other error topic naming convention? E.g. ulsfo.webrequest_text.error ?

Actually, the convention we use isn't <topic_name>.error, but <producer_name>.error, because in this case the subject of the data in the topic isn't about webrequests, but about a problem with the producer. But, naming is hard, and e.g. ulsfo.webrequest_text.error is fine. :)

  • there are a couple of CRs pending (linked to this phab) and I'd like to have a second pass on the event schema naming conventions (cc / @Fabfur). We might want to drop webrequest_source since we don't currently use it in ETL (it's inferred from the HDFS path, not the schema).

No problem here; for us it's just a matter of removing a line from the Benthos configuration. Let me know if I can proceed!

Ok, I'll f/up here on phab with a wishlist of changes, so that they could be addressed in one go before moving to prod topics.

  • we need names for the new Kafka topics. @Fabfur would you have any preference? Historically Event Platform topics are prefixed with a datacenter id.

Currently IIUC we just use webrequest_text and webrequest_upload. Do you suggest using something like ulsfo_webrequest_text instead? This shouldn't be an issue for us, anyway.
I also suggest keeping the DLQ as <topic_name>_errors, as it has been an invaluable help for debugging.

+1 for error topics. See comment from @Ottomata below for naming.

I like the overall idea, but I'd prefer to proceed DC-by-DC, switching topics and shutting down VarnishKafka only once we are sure about the correctness of the data. I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a little bit expensive for us in terms of bandwidth...

Makes sense. This would require some work on our end to generate webrequest data from two "raw" sources at once, but I think as long as we can filter on dc / hostnames, we should manage. Let me take a better look at how this ETL is set up.

In HDFS, data is partitioned at the hourly level. Could we assume that all varnishkafka instances in a given DC will be turned off within the same hour?

An update: we noticed some message drops at the operating system level during high-traffic spikes. This means that HAProxy tried to send logs to Benthos, but the UDP buffer was full and the network stack discarded those messages before Benthos could read them. Unfortunately this isn't exactly easy to monitor, but here are some considerations:

Ack. Thanks for the heads up.

I like the overall idea, but I'd prefer to proceed DC-by-DC, switching topics and shutting down VarnishKafka only once we are sure about the correctness of the data. I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a little bit expensive for us in terms of bandwidth...

Makes sense. This would require some work on our end to generate webrequest data from two "raw" sources at once, but I think as long as we can filter on dc / hostnames, we should manage. Let me take a better look at how this ETL is set up.

One thought: one thing that we would lose if we turn off varnishkafka DC-by-DC is the capability of "easy" rollbacks, for example pointing the ETL to the old varnishkafka logs in HDFS and regenerating webrequest data.

Right now, we don't partition wmf.webrequest data by DC; it's only partitioned by hour and request source (upload, text). This means that varnishkafka and haproxy logs will end up in the same wmf.webrequest partition. So rollbacks would have to be a bit more surgical. If we turn off varnishkafka in one DC and later on detect issues in another, we won't be able to recover or analyze past data. I'm not sure about the likelihood of this being an issue, though.

I'm afraid that having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a little bit expensive for us in terms of bandwidth...

Storing (Kafka, HDFS) twice the data should be okay (within reason). I do appreciate that bandwidth volumes (web -> Kafka -> HDFS) are not trivial. But how much of a blocker would it be to keep the varnishkafka lights on until the whole migration is completed? Could we compress timelines and keep varnishkafka and the haproxy/benthos producers running in parallel?

@Fabfur f/up from our chat earlier; these would be the pending config bits that we'll need to finalize when moving to prod topics:

  1. adopt topic names that follow EP conventions: <dc>.<topic_name> and <dc>.<topic_name>.error. E.g. ulsfo.webrequest_text and ulsfo.webrequest_text.error.
  2. remove the webrequest_source field from the event schema / payload.
  3. rename haproxy_pid to server_pid
  4. Change stream name (ESC, benthos producer) from webrequest_frontend.rc0 to webrequest_frontend.v1 (suffix versions are an EP convention).

These changes will require puppet changes, schema repo and Hive DDL updates, and a new deployment of EventStreamConfig. I'll start prepping work-in-progress CRs to the EP and DE code bases.

Change #1026498 had a related patch set uploaded (by Gmodena; author: Gmodena):

[schemas/event/primary@master] primary: add webrequest schema

https://gerrit.wikimedia.org/r/1026498

Change #1026506 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] EventStreamConfig: Add webrequest.frontend.v1.

https://gerrit.wikimedia.org/r/1026506

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of the data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, the ever-expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non-main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Discussion about this here in slack.

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of the data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, the ever-expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non-main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Thanks for clarifying @Ottomata.

@Ottomata @Fabfur If we remove prefixing, there is a potential clash between varnishkafka and benthos topics.
How about we name the production Haproxy/benthos topics as follows?

  • webrequest_frontend_text
  • webrequest_frontend_text.error
  • webrequest_frontend_upload
  • webrequest_frontend_upload.error

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of the data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, the ever-expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non-main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Thanks for clarifying @Ottomata.

@Ottomata @Fabfur If we remove prefixing, there is a potential clash between varnishkafka and benthos topics.
How about we name the production Haproxy/benthos topics as follows?

  • webrequest_frontent_text
  • webrequest_frontent_text.error
  • webrequest_frontent_upload
  • webrequest_frontent_upload.error

No problem for us to rename these topics, either with or without the "variable" part...

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of the data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, the ever-expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non-main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Thanks for clarifying @Ottomata.

@Ottomata @Fabfur If we remove prefixing, there is a potential clash between varnishkafka and benthos topics.
How about we name the production Haproxy/benthos topics as follows?

  • webrequest_frontent_text
  • webrequest_frontent_text.error
  • webrequest_frontent_upload
  • webrequest_frontent_upload.error

No problem for us to rename these topics, either with or without the "variable" part...

ack - and thanks!
Oof s/frontent/frontend. Spelling that word is not my forte.

Change #1031762 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache:benthos: switch to production topic names

https://gerrit.wikimedia.org/r/1031762

Change #1036272 had a related patch set uploaded (by Gmodena; author: Gmodena):

[schemas/event/primary@master] webrequest: add error schema.

https://gerrit.wikimedia.org/r/1036272

Change #1036299 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] EventStreamConfig: Add webrequest.frontend.error

https://gerrit.wikimedia.org/r/1036299

Change #1037599 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery@master] gobblin: add webrequest_error pull job.

https://gerrit.wikimedia.org/r/1037599

The haproxy / benthos feed is now available in raw form under wmf_staging.webrequest_frontend_rc0 and post-processed in wmf_staging.webrequest.

Ingestion and processing are coordinated by Gobblin and Airflow jobs deployed to prod systems. However, the datasets themselves are still considered experimental / WIP.

Change #1019867 merged by jenkins-bot:

[analytics/refinery/source@master] refinery-job: add webrequest instrumentation.

https://gerrit.wikimedia.org/r/1019867

Change #1031762 abandoned by Fabfur:

[operations/puppet@production] cache:benthos: switch to production topic names

Reason:

Superseded by T370668

https://gerrit.wikimedia.org/r/1031762

Change #1026506 abandoned by Gmodena:

[operations/mediawiki-config@master] EventStreamConfig: Add webrequest.frontend.v1.

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1026506

Change #1012656 abandoned by Gmodena:

[analytics/refinery@master] hql: webrequest: add webrequest_frontend.

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1012656

Change #1026498 abandoned by Gmodena:

[schemas/event/primary@master] primary: add webrequest schema

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1026498

Change #1036299 abandoned by Gmodena:

[operations/mediawiki-config@master] EventStreamConfig: Add webrequest.frontend.error

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1036299

Change #1036272 abandoned by Gmodena:

[schemas/event/primary@master] webrequest: add error schema.

Reason:

analytics feed for haproxy development has halted

https://gerrit.wikimedia.org/r/1036272

Change #1012656 restored by Gmodena:

[analytics/refinery@master] hql: webrequest: add webrequest_frontend.

https://gerrit.wikimedia.org/r/1012656