
navtiming: Loss of Kafka connection fills multiple log files with identical stack traces
Closed, Resolved · Public · BUG REPORT

Description

Caused T391272: deployment-webperf21 puppet runs crashing with `Error: No space left on device`

root@deployment-webperf21:/var/log# grep 'kafka.errors.NoBrokersAvailable: NoBrokersAvailable' messages | wc -l
1447089
root@deployment-webperf21:/var/log# grep 'kafka.errors.NoBrokersAvailable: NoBrokersAvailable' syslog | wc -l
1448219
root@deployment-webperf21:/var/log# grep 'kafka.errors.NoBrokersAvailable: NoBrokersAvailable' user.log | wc -l
1448965
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]: 2025-03-29 00:01:21,006 [ERROR] (get_kafka_iterator:1040) Unhandled exception in main loop, restarting consumer
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]: Traceback (most recent call last):
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]:  File "/srv/deployment/performance/navtiming-cache/revs/c868e79beea45232e51b305a0bad0949848d0191/navtiming/__init__.py", line 988, in get_kafka_iterator
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]:    consumer = KafkaConsumer(
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]:  File "/usr/lib/python3/dist-packages/kafka/consumer/group.py", line 356, in __init__
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]:    self._client = KafkaClient(metrics=self._metrics, **self.config)
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]:  File "/usr/lib/python3/dist-packages/kafka/client_async.py", line 244, in __init__
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]:    self.config['api_version'] = self.check_version(timeout=check_timeout)
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]:  File "/usr/lib/python3/dist-packages/kafka/client_async.py", line 900, in check_version
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]:    raise Errors.NoBrokersAvailable()
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]: kafka.errors.NoBrokersAvailable: NoBrokersAvailable
Mar 29 00:01:21 deployment-webperf21 navtiming[2564521]: 2025-03-29 00:01:21,006 [INFO] (get_kafka_iterator:986) Starting Kafka connection to brokers (('deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud:9093',))
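For a sense of how the log volume gets this large: get_kafka_iterator catches the unhandled exception, logs the full traceback, and immediately restarts the consumer, so while no broker is reachable the same ~11-line trace is appended on every iteration, across syslog, messages, and user.log. A minimal stdlib-only stand-in for that loop (the `connect` stub and the loop bound are hypothetical; the real service retries forever):

```python
import logging


class CountingHandler(logging.Handler):
    """Counts emitted records, standing in for syslog/messages/user.log."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def emit(self, record):
        self.count += 1


def connect():
    # Stand-in for KafkaConsumer(); always fails, like NoBrokersAvailable
    # when --brokers points at a host with nothing listening on port 9093.
    raise ConnectionError("NoBrokersAvailable")


def consumer_loop(logger, max_attempts):
    # Mimics the get_kafka_iterator pattern: log the full traceback,
    # then restart. The real loop is unbounded, so a permanently dead
    # broker writes one stack trace per attempt until the disk fills.
    for _ in range(max_attempts):
        try:
            connect()
        except Exception:
            logger.exception("Unhandled exception in main loop, restarting consumer")


logger = logging.getLogger("navtiming-sketch")
logger.propagate = False
handler = CountingHandler()
logger.addHandler(handler)
consumer_loop(logger, 3)
print(handler.count)  # → 3: one full stack trace per failed attempt
```

At roughly one attempt per second, an unbounded loop like this produces on the order of 1.4 million traces in a couple of weeks, which matches the grep counts above.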

Event Timeline

bd808 renamed this task from navtiming: Loss of Kafka connection to kafka fills multiple log files with identical stack traces to navtiming: Loss of Kafka connection fills multiple log files with identical stack traces. Apr 7 2025, 5:11 PM

The navtiming service on deployment-webperf21 seems to be pointed at itself for the Kafka queue

root@deployment-webperf21:/srv/deployment/performance/navtiming# cat /lib/systemd/system/navtiming.service
[Unit]
Description=navtiming
After=network-online.target

[Service]
WorkingDirectory=/srv/deployment/performance/navtiming
ExecStart=/usr/bin/python3 /srv/deployment/performance/navtiming/run_navtiming.py \
    --brokers "deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud:9093" \
    --security-protocol "SSL" \
    --ssl-cafile "/etc/ssl/certs/wmf-ca-certificates.crt" \
    --consumer-group navtiming \
    --statsd-host "localhost" \
    --statsd-port "8125"
User=nobody
Restart=always
SyslogIdentifier=navtiming

[Install]
WantedBy=multi-user.target

I can't quite figure out where in the twisty maze of passages that is our Puppet configuration this value comes from. I also don't immediately see anything in role::webperf that would provision a local Kafka broker.

bd808 triaged this task as High priority. Apr 7 2025, 5:47 PM

The navtiming service is currently shutdown on deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud so that the disk doesn't just fill back up as happened in T391272: deployment-webperf21 puppet runs crashing with `Error: No space left on device`. Somebody with a better grasp of role::webperf should try to figure out what changed and triggered the error storm.

Possibly related to ...

Oh oops sorry I missed that this was in beta. Carry on!

https://www.mediawiki.org/wiki/Developers/Maintainers#Services_and_administration lists the primary steward for navtiming as "unassigned", but maybe someone in SRE Observability can take a look or think of someone else who could help get this working in the Beta Cluster again.

I think the root issue is in T346402

There may be other Kafka issues in deployment-prep as well, but --brokers "deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud:9093" is pointing at a non-existent Kafka server.

bd808@deployment-webperf21:~$ sudo netstat -tunlp | grep 9093
bd808@deployment-webperf21:~$
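The same check netstat makes here can be done from any host without root, with a plain TCP probe. A small sketch (this only proves a listener exists on the port, not that it speaks the Kafka protocol, but it is enough to confirm the NoBrokersAvailable cause):

```python
import socket


def broker_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unresolvable
        return False
```

Calling `broker_reachable("deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud", 9093)` returns False here, matching the empty netstat output above.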

For what it's worth, there never has been a local Kafka there and it wouldn't help to add one as nothing would write there.

I'm guessing someone changed a Hiera config file to remove some Kafka broker (instead of replacing), and then something in Puppet started unhelpfully defaulting to localhost instead of producing an error.

It's supposed to point to a Kafka host that will provide the NavigationTiming events, as produced by the MediaWiki NavigationTiming extension in the Beta Cluster. This worked in 2023 when we last had a Perf team, as we deployed and tested in Beta before prod.

https://wikitech.wikimedia.org/wiki/Performance#/media/File:WMF_Performance_Team_infrastructure_2023.png

MediaWiki JS sends EventLogging events to an intake URL (EventGate?), the same service that e.g. MediaWiki PHP submits jobqueue jobs to, which writes them to Kafka (in prod these are separate instances of essentially the same service), and they are then consumed from the Kafka topic by services like navtiming.py.

Perhaps a Kafka host was removed, or firewalled, or mixed up with a similar-but-different instance of the same service?

For what it's worth, there never has been a local Kafka there and it wouldn't help to add one as nothing would write there.

I'm guessing someone changed a Hiera config file to remove some Kafka broker (instead of replacing), and then something in Puppet started unhelpfully defaulting to localhost instead of producing an error.

This sounds like a reasonable guess. We do have Kafka in Beta Cluster that we should be able to point this at. The issue right now is that I don't know where in the sea of Puppet and Hiera things are getting pointed to the --brokers "deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud:9093" endpoint.

Neither of these references in cloud/instance-puppet.git are the culprit:

bd808@mbp03:~/projects/wmf/cloud/instance-puppet/deployment-prep$ git grep 9093
_.yaml:- deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093
deployment-mediawiki.yaml:- deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093
bd808@mbp03:~/projects/wmf/cloud/instance-puppet/deployment-prep$ git grep webperf21
deployment-cache-text.yaml:- replacement: http://deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud

In operations/puppet.git the only hit for webperf21 is in the scap::dsh::groups hiera tree:

$ git grep webperf21
hieradata/cloud/eqiad1/deployment-prep/common.yaml:            - deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud

Time to crawl further into the rabbit hole...

In ::profile::webperf::processors:

    # statsv is on main kafka, not analytics or jumbo kafka.
    # Note that at any given time, all statsv varnishkafka producers are
    # configured to send to only one kafka cluster (usually main-eqiad).
    # statsv in an inactive datacenter will not process any messages, as
    # varnishkafka will not produce any messages to that DC's kafka cluster.
    # This is configured by the value of the hiera param
    # profile::cache::kafka::statsv::kafka_cluster_name when the statsv varnishkafka
    # profile is included (as of this writing on text caches).
    $kafka_main_config = kafka_config("main-${::site}")
    $kafka_main_brokers = $kafka_main_config['brokers']['ssl_string']

# ... skipping some stuff ...

    # EventLogging is on the jumbo kafka. Unlike the main one, this
    # is not yet mirrored to other data centers, so for prod,
    # assume eqiad.
    $kafka_config  = kafka_config('jumbo-eqiad')
    $kafka_brokers = $kafka_config['brokers']['ssl_string']

That $kafka_brokers value is passed to ::webperf::navtiming which is our misconfigured service.

I think the ultimate issue here is that profile::webperf::processors expects a Kafka cluster named "main-${::site}" to be declared in the kafka_clusters Hiera variable (which has main-deployment-prep but not main-eqiad).

modules/role/lib/puppet/parser/functions/kafka_config.rb defaults to a cluster consisting of the current host only when the configuration is not found.

::webperf::navtiming leads to modules/webperf/templates/navtiming.systemd.erb where --brokers "<%= @kafka_brokers %>" is output. So kafka_config('jumbo-eqiad')['brokers']['ssl_string'] is the thing that needs fixing.
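The fallback behavior described above can be sketched in Python (this paraphrases the Ruby function from its description here, not its source; the dict shape and the 9093 port are assumptions): when the requested cluster name is not declared, the function quietly builds a one-broker "cluster" out of the current host, which is exactly how --brokers ended up pointing at deployment-webperf21 itself.

```python
import socket


def kafka_config(cluster_name, kafka_clusters, ssl_port=9093):
    # Sketch of modules/role/lib/puppet/parser/functions/kafka_config.rb:
    # if the named cluster is not declared in the kafka_clusters Hiera
    # variable, fall back to a "cluster" containing only the current host
    # instead of raising an error.
    cluster = kafka_clusters.get(cluster_name)
    if cluster is None:
        brokers = {socket.getfqdn(): {}}  # the silent localhost fallback
    else:
        brokers = cluster["brokers"]
    return {
        "brokers": {
            "ssl_string": ",".join(f"{h}:{ssl_port}" for h in sorted(brokers)),
        }
    }


clusters = {
    "main-deployment-prep": {
        "brokers": {
            "deployment-kafka-main-5.deployment-prep.eqiad1.wikimedia.cloud": {},
            "deployment-kafka-main-6.deployment-prep.eqiad1.wikimedia.cloud": {},
        },
    },
}

# Declared cluster: the expected comma-joined broker list.
print(kafka_config("main-deployment-prep", clusters)["brokers"]["ssl_string"])
# Undeclared cluster (e.g. "jumbo-eqiad"): silently degrades to this host.
print(kafka_config("jumbo-eqiad", clusters)["brokers"]["ssl_string"])
```

The second call is the failure mode: no error, just a plausible-looking ssl_string pointing at the wrong machine.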

I think the ultimate issue here is that profile::webperf::processors expects a Kafka cluster named "main-${::site}" to be declared in the kafka_clusters Hiera variable (which has main-deployment-prep but not main-eqiad).

modules/role/lib/puppet/parser/functions/kafka_config.rb defaults to a cluster consisting of the current host only when the configuration is not found.

This tracks with the other investigation above. It sounds like we need to have both main-eqiad and jumbo-eqiad config in Hiera for the Beta Kafka clusters.

Let's try this and see what happens:

kafka_clusters:
  jumbo-deployment-prep: &kafka-jumbo
    brokers:
      deployment-kafka-jumbo-5.deployment-prep.eqiad1.wikimedia.cloud:
        id: 1
        rack: A
      deployment-kafka-jumbo-8.deployment-prep.eqiad1.wikimedia.cloud:
        id: 2
        rack: B
      deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud:
        id: 3
        rack: C
    ipv6: false
    zookeeper_cluster_name: main-deployment-prep
  jumbo-eqiad: *kafka-jumbo
  logging-codfw:
    brokers:
      deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:
        id: 1001
        rack: B
    ipv6: false
    zookeeper_cluster_name: main-deployment-prep
  logging-eqiad:
    brokers:
      deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:
        id: 1001
        rack: B
    ipv6: false
    zookeeper_cluster_name: main-deployment-prep
  main-deployment-prep: &kafka-main
    brokers:
      deployment-kafka-main-5.deployment-prep.eqiad1.wikimedia.cloud:
        id: 1
        rack: A
      deployment-kafka-main-6.deployment-prep.eqiad1.wikimedia.cloud:
        id: 2
        rack: B
    ipv6: false
    zookeeper_cluster_name: main-deployment-prep
  main-eqiad: *kafka-main

Mentioned in SAL (#wikimedia-releng) [2025-06-03T17:04:24Z] <bd808> Added jumbo-eqiad and main-eqiad aliases to kafka_clusters hiera config (T391273)

Bah. Things don't survive the round trips from one system to another.

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key 'lookup_options' failed: cloudlib::httpyaml failed Unknown alias: id001 (file: /srv/puppet_code/environments/production/modules/role/manifests/wmcs/instance.pp, line: 3, column: 5) on node deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud

Mentioned in SAL (#wikimedia-releng) [2025-06-03T17:08:40Z] <bd808> Manually expanded (duplicated) jumbo-eqiad and main-eqiad aliases in kafka_clusters hiera config (T391273)
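The manual expansion does exactly what the &kafka-main / *kafka-main anchor-alias pair would have, just spelled out so consumers that cannot resolve aliases (like cloudlib::httpyaml above) still parse the file. A stdlib sketch of the equivalent operation on the data (the dict mirrors the hiera snippet above):

```python
import copy

kafka_clusters = {
    "main-deployment-prep": {
        "brokers": {
            "deployment-kafka-main-5.deployment-prep.eqiad1.wikimedia.cloud": {"id": 1, "rack": "A"},
            "deployment-kafka-main-6.deployment-prep.eqiad1.wikimedia.cloud": {"id": 2, "rack": "B"},
        },
        "ipv6": False,
        "zookeeper_cluster_name": "main-deployment-prep",
    },
}

# Equivalent of "main-eqiad: *kafka-main", but as a literal deep copy:
# the two keys hold identical, independent structures.
kafka_clusters["main-eqiad"] = copy.deepcopy(kafka_clusters["main-deployment-prep"])

print(kafka_clusters["main-eqiad"] == kafka_clusters["main-deployment-prep"])  # → True
```

The tradeoff is the usual one with duplication: future broker changes now have to be made in both places.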

Mentioned in SAL (#wikimedia-releng) [2025-06-03T17:09:43Z] <bd808> Forced puppet run on deployment-webperf21 to pick up Kafka config changes (T391273)

Notice: /Stage[main]/Webperf::Navtiming/Systemd::Service[navtiming]/Systemd::Unit[navtiming]/File[/lib/systemd/system/navtiming.service]/content:
--- /lib/systemd/system/navtiming.service       2025-01-06 19:11:37.519389317 +0000
+++ /tmp/puppet-file20250603-3075651-u1krqv     2025-06-03 17:09:01.554898515 +0000
@@ -5,7 +5,7 @@
 [Service]
 WorkingDirectory=/srv/deployment/performance/navtiming
 ExecStart=/usr/bin/python3 /srv/deployment/performance/navtiming/run_navtiming.py \
-    --brokers "deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud:9093" \
+    --brokers "deployment-kafka-jumbo-5.deployment-prep.eqiad1.wikimedia.cloud:9093,deployment-kafka-jumbo-8.deployment-prep.eqiad1.wikimedia.cloud:9093,deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud:9093" \
     --security-protocol "SSL" \
     --ssl-cafile "/etc/ssl/certs/wmf-ca-certificates.crt" \
     --consumer-group navtiming \

Bah. Things don't survive the round trips from one system to another.

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key 'lookup_options' failed: cloudlib::httpyaml failed Unknown alias: id001 (file: /srv/puppet_code/environments/production/modules/role/manifests/wmcs/instance.pp, line: 3, column: 5) on node deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud

Can you file a separate Cloud-VPS bug for this?

bd808 claimed this task.

I think this is fixed. There is a different log spam now that is dominating /var/log/messages but I will split that out into a separate task.

I think the ultimate issue here is that profile::webperf::processors expects a Kafka cluster named "main-${::site}" to be declared in the kafka_clusters Hiera variable (which has main-deployment-prep but not main-eqiad).

modules/role/lib/puppet/parser/functions/kafka_config.rb defaults to a cluster consisting of the current host only when the configuration is not found.

[…] It sounds like we need to have both main-eqiad and jumbo-eqiad config in Hiera for the Beta Kafka clusters.

Ack. The following might (not) be related: There has been a beta-cherry-pick for 5 years (uploaded at https://gerrit.wikimedia.org/r/c/operations/puppet/+/941475) which modifies how kafka_config() works in Beta.