
beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'
Closed, ResolvedPublic

Description

beta-scap-sync-world has been failing since at least March 21st 00:46:17 UTC (it last worked at 00:36 UTC). The failure is:

08:10:13 Waiting 20 seconds for canary traffic...
08:10:33 Executing check 'Logstash canary error rate'
08:10:33 Check 'Logstash canary error rate' failed: Traceback (most recent call last):
  File "/usr/local/bin/logstash_checker.py", line 421, in <module>
    main()
  File "/usr/local/bin/logstash_checker.py", line 412, in main
    sys.exit(checker.run())
  File "/usr/local/bin/logstash_checker.py", line 309, in run
    entries = r['aggregations']['2']['buckets']
KeyError: 'aggregations'

08:10:33 Failed to complete canary checks for some reason.
08:10:33 Scap cancelled

That is the same kind of error as T260667.

Event Timeline

hashar triaged this task as Unbreak Now! priority. Mar 21 2024, 8:33 AM

Since I could not find how scap retrieves the logstash_host from configuration (notably, all the bits I could find referred to some deployment-logstash02 machines which got deleted two or three years ago), I went to edit /usr/local/bin/logstash_checker.py and made it print the arguments it is passed using:

from pprint import pprint
pprint(sys.argv)

That let me craft the full command line to reproduce the issue:

/usr/local/bin/logstash_checker.py \
  --service-name mwdeploy \
  --host canaries \
  --fail-threshold 10.0 \
  --delay 20 \
  --logstash-host logging-logstash-02.logging.eqiad1.wikimedia.cloud:9200

Passing it --verbose shows the logstash query and the response, and we get:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 0,
    "successful": 0,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": 0,
    "hits": []
  }
}

The code checks whether there is an error key; otherwise it tries to retrieve the aggregations key, and since that key does not exist in the above response, the checker fails.

The query was:

{'size': 0, 'aggs': {'2': {'date_histogram': {'interval': '10s', 'field': '@timestamp'}}}, 'query': {'bool': {'filter': [{'range': {'@timestamp': {'lte': 'now', 'gte': 'now-60m'}}}, {'query_string': {'query': '(kubernetes.labels.release:canary AND kubernetes.labels.deployment:() OR host:(deployment-mediawiki12 OR \n)) AND type:mediawiki AND channel:(exception OR error)'}}], 'must_not': [{'terms': {'level': ['DEBUG']}}]}}}
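For anyone wanting to replay this outside of scap, the same kind of request can be sent straight to the Elasticsearch API with curl and the missing key pulled out with jq. This is only a rough sketch: the logstash-* index pattern and the trimmed-down query body are assumptions, not the exact request the checker builds.

# Hypothetical reproduction of the checker's request; the index pattern and the
# simplified body are assumptions, not copied from logstash_checker.py.
curl -s -H 'Content-Type: application/json' \
  'http://logging-logstash-02.logging.eqiad1.wikimedia.cloud:9200/logstash-*/_search' \
  -d '{"size": 0,
       "aggs": {"2": {"date_histogram": {"interval": "10s", "field": "@timestamp"}}},
       "query": {"bool": {"filter": [{"range": {"@timestamp": {"lte": "now", "gte": "now-60m"}}}]}}}' \
  | jq '.aggregations."2".buckets'
# On a healthy cluster this prints the histogram buckets; against the empty
# response above it prints null, because there is no aggregations key at all,
# which is exactly what turns into the KeyError in the checker.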

I don't know where the logs are actually sent or how to check them. The link I had bookmarked, https://logstash-beta.wmflabs.org/, redirects to https://beta-logs.wmcloud.org/ and there is nothing logged there. I could not find any relevant change in operations/puppet, cloud/instance-puppet, operations/mediawiki-config, or in the Puppet logs on the deployment server. So that is essentially a mystery to me.

My best guess is something has changed in the logging WMCS project (https://openstack-browser.toolforge.org/project/logging), and that seems to be maintained by observability.

I eventually found the source of the logstash host checked by logstash_checker.py; it is hardcoded in the scap configuration:

modules/scap/templates/scap.cfg.erb
[eqiad1.wikimedia.cloud]
logstash_host: logging-logstash-02.logging.eqiad1.wikimedia.cloud:9200

Puppet and WMCS are full of references to deployment-logstash03, an instance that was phased out more than two years ago by T2830123. I am giving up for now; others are more knowledgeable about the logging stack than I am.

@colewhite stated a disk is full and logs aren't being ingested, suspecting ssl on the kafka hosts.

deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud has a single 20G partition and it is almost full. Looking at the journal, it is being spammed with:

Mar 21 12:36:45 deployment-kafka-logging01 kafka-server-start[9710]: [2024-03-21 12:36:45,778] WARN Failed to send SSL Close message  (org.apache.kafka.common.network.SslTransportLayer)
Mar 21 12:36:45 deployment-kafka-logging01 kafka-server-start[9710]: [2024-03-21 12:36:45,882] ERROR [Controller id=1001, targetBrokerId=1001] Connection to node 1001 failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)
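For reference, a quick way to see what is actually eating the 20G partition is with standard tools (nothing here is specific to this host):

# Overall usage of the root filesystem
df -h /
# Largest top-level directories (where the Kafka data actually lives is not assumed here)
du -xsh /* 2>/dev/null | sort -h | tail -n 10
# How much space the systemd journal itself takes
journalctl --disk-usage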

Change #1013598 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: record beta-scap-sync-world is disabled

https://gerrit.wikimedia.org/r/1013598

Change #1013598 merged by jenkins-bot:

[integration/config@master] jjb: record beta-scap-sync-world is disabled

https://gerrit.wikimedia.org/r/1013598

@colewhite stated a disk is full and logs aren't being ingested, suspecting ssl on the kafka hosts.

deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud has a single 20G partition and it is almost full. Looking at the journal, it is being spammed with:

Mar 21 12:36:45 deployment-kafka-logging01 kafka-server-start[9710]: [2024-03-21 12:36:45,778] WARN Failed to send SSL Close message  (org.apache.kafka.common.network.SslTransportLayer)
Mar 21 12:36:45 deployment-kafka-logging01 kafka-server-start[9710]: [2024-03-21 12:36:45,882] ERROR [Controller id=1001, targetBrokerId=1001] Connection to node 1001 failed authentication due to: SSL handshake failed (org.apache.kafka.clients.NetworkClient)

We could use some help on this one. The disk is full on deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud but it's unclear what we can (if anything) drop to save space there.

Does anyone from SRE Observability have ideas there?

In the interim, I wonder if we can skip the check in scap so we can keep deploying and lower this from UBN

Change #1014089 had a related patch set uploaded (by Thcipriani; author: Thcipriani):

[integration/config@master] beta-scap-sync-world: temporarily skip scap canary checks

https://gerrit.wikimedia.org/r/1014089

Change #1014089 merged by jenkins-bot:

[integration/config@master] beta-scap-sync-world: temporarily skip scap canary checks

https://gerrit.wikimedia.org/r/1014089

thcipriani lowered the priority of this task from Unbreak Now! to High. Mar 25 2024, 6:41 PM
18:18:57     <wmf-insecte> Yippee, build fixed!
18:18:57     <wmf-insecte> Project beta-scap-sync-world build #147700: FIXED in 11 min: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/147700/

Lowering priority away from UBN since scap is once again deploying to beta (without checking for any increased error rates or catastrophic bugs)—not an ideal place to be, but at least code is winding through beta.

an instance that was phased out more than two years ago by T2830123

Just linking T283013: Migrate beta cluster to ELK7 here, per the above.

Just linking T283013: Migrate beta cluster to ELK7 here, per the above.

We no longer use ELK: T288618.

We could use some help on this one. The disk is full on deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud but it's unclear what we can (if anything) drop to save space there.

Does anyone from SRE Observability have ideas there?

The full disks on the kafka and logstash collector hosts are a symptom of the underlying issue: the server certificate for Kafka expired on March 6th in deployment-prep:

(...)
        Issuer: C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = kafka
        Validity
            Not Before: Mar  7 06:47:00 2023 GMT
            Not After : Mar  6 06:47:00 2024 GMT
(...)
            Authority Information Access: 
                OCSP - URI:http://pki-intermediate.pki.eqiad1.wikimedia.cloud/ocsp/kafka

            X509v3 Subject Alternative Name: 
                DNS:deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud
            X509v3 CRL Distribution Points: 

                Full Name:
                  URI:http://pki-intermediate.pki.eqiad1.wikimedia.cloud/crl/kafka
(...)
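For the record, the validity window can also be checked straight off the broker's TLS listener with openssl (host and port are from this task; the invocation itself is standard):

# Pull the server certificate from the Kafka TLS listener and print its validity dates
echo | openssl s_client -connect deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 2>/dev/null \
  | openssl x509 -noout -dates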

We need a new certificate issued for Kafka, but I have no visibility into the pki project in horizon. Maybe Infrastructure-Foundations can help here?

Once the new certificate is installed, we can reconnect rsyslog and logstash to kafka and get logs flowing again.

I issued a new certificate from the puppetmaster and the existing PKI. Let me see if I need to do more or whether that fixed the issue.

That was one hell of a rabbit hole.

So as I said above, I reissued the server certificate via the new puppetserver, but that didn't fix the issue, because SSL in kafka (on port 9093) has its own chained certificates stored in /etc/kafka/ssl and managed by keytool (Java's key and certificate management tool). I tried to reissue a new self-signed cert via keytool and I think it ended up recreating the same old expired cert (that cert probably exists somewhere in the private repo of the new beta cluster's puppetserver and needs updating).
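Since those certificates live in a keytool-managed keystore, the quickest way to see which cert the broker is actually loading and when it expires is to list the keystore. A rough sketch; the keystore path is the one used later in this task, and the store password placeholder is an assumption:

# List aliases and validity dates in the broker keystore (replace <storepass> with
# the keystore password from the broker's SSL config)
keytool -list -v \
  -keystore /etc/kafka/ssl/kafka_logging-eqiad_broker.keystore.p12 \
  -storetype PKCS12 -storepass '<storepass>' \
  | grep -E 'Alias name|Valid from'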

I figured it's probably easier to use the general PKI we have for this. Indeed, a lot of this seems to come from the refactor in puppet that made kafka nodes in production use the general PKI instead of cergen, with a comment saying "we probably should fix beta" (T319372#8779624). When I tried to fix beta and use prod-like hiera values by adding "profile::kafka::broker::use_pki_settings: true", things progressed a lot and new certs started showing up, but it was still failing with stuff like this:

Error: '/usr/bin/cfssl sign -config /etc/cfssl/client-cfssl.conf -tls-remote-ca /etc/ssl/localcerts/pki_api_CA.pem -mutual-tls-client-cert /etc/cfssl/mutual_tls_client_cert.pem -mutual-tls-client-key /var/lib/puppet/ssl/private_keys/deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud.pem -label kafka -profile kafka_11 /etc/kafka/ssl/kafka__deployment-kafka-logging01_deployment-prep_eqiad1_wikimedia_cloud_kafka_11.csr | /usr/bin/cfssljson -bare /etc/kafka/ssl/kafka__deployment-kafka-logging01_deployment-prep_eqiad1_wikimedia_cloud_kafka_11
' returned 1 instead of one of [0]

When I ran the command directly, this was the response:

{"code":7400,"message":"failed POST to https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign: Post \"https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign\": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"}

And the internet said you can set the env variable GODEBUG="x509ignoreCN=0" to make this work, and this was the result:

root@deployment-kafka-logging01:/etc/kafka/ssl# GODEBUG="x509ignoreCN=0" /usr/bin/cfssl sign -config /etc/cfssl/client-cfssl.conf -tls-remote-ca /etc/ssl/localcerts/pki_api_CA.pem -mutual-tls-client-cert /etc/cfssl/mutual_tls_client_cert.pem -mutual-tls-client-key /var/lib/puppet/ssl/private_keys/deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud.pem -label kafka -profile kafka_11 /etc/kafka/ssl/kafka__deployment-kafka-logging01_deployment-prep_eqiad1_wikimedia_cloud_kafka_11.csr
2024/03/27 00:50:58 [INFO] Using client auth with mutual-tls-cert: /etc/cfssl/mutual_tls_client_cert.pem and mutual-tls-key: /var/lib/puppet/ssl/private_keys/deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud.pem
2024/03/27 00:50:58 [INFO] Using trusted CA from tls-remote-ca: /etc/ssl/localcerts/pki_api_CA.pem
2024/03/27 00:50:58 [ERROR] http error with https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign
{"code":7400,"message":"{\"success\":false,\"result\":null,\"errors\":[{\"code\":1,\"message\":\"invalid token\"}],\"messages\":[]}\n"}

(invalid token)

At that point, I gave up and just disabled SSL on kafka in the beta cluster (it doesn't make much sense anyway; the password for it is qwerty). After that, things seem to be "normal": the kafka service seems to be working fine and the journal logs look happy.

logstash_checker.py is still failing, but it might be that no logs have been produced since kafka came back up? I'll check later.

Change #1014715 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] Revert "jjb: record beta-scap-sync-world is disabled"

https://gerrit.wikimedia.org/r/1014715

At that point, I gave up and just disabled SSL on kafka in the beta cluster (it doesn't make much sense anyway; the password for it is qwerty). After that, things seem to be "normal": the kafka service seems to be working fine and the journal logs look happy.

Thanks for digging in!

Unfortunately, rsyslog and logstash both expect Kafka to listen with SSL on port 9093. Disabling SSL for these tools is not a small task. In addition, removing SSL would increase the configuration gap between beta and production and further degrade its usefulness as a testing environment.

Current status is logs are not flowing into deployment-kafka-logging01.

At that point, I gave up and just disabled SSL on kafka in the beta cluster (it doesn't make much sense anyway; the password for it is qwerty). After that, things seem to be "normal": the kafka service seems to be working fine and the journal logs look happy.

Thanks for digging in!

Unfortunately, rsyslog and logstash both expect Kafka to listen with SSL on port 9093. Disabling SSL for these tools is not a small task. In addition, removing SSL would increase the configuration gap between beta and production and further degrade its usefulness as a testing environment.

Current status is logs are not flowing into deployment-kafka-logging01.

It also seems varnishkafka is expecting that, too. Seeing this on deployment-cache-upload:

Mar 28 17:19:07 deployment-cache-upload08 varnishkafka[625222]: %3|1711646347.450|FAIL|varnishkafka#producer-1| [thrd:ssl://deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia]: ssl://deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud:9093/bootstrap: Connect to ipv4#172.16.3.226:9093 failed: Connection timed out (after 130097ms in state CONNECT, 1 identical error(s) suppressed)

@elukey since you handled PKI for Kafka in production (ref: T319372), do you have any insights as to the knobs that need to be turned for this?

At that point, I gave up and just disabled SSL on kafka in the beta cluster (it doesn't make much sense anyway; the password for it is qwerty). After that, things seem to be "normal": the kafka service seems to be working fine and the journal logs look happy.

Thanks for digging in!

Unfortunately, rsyslog and logstash both expect Kafka to listen with SSL on port 9093. Disabling SSL for these tools is not a small task. In addition, removing SSL would increase the configuration gap between beta and production and further degrade its usefulness as a testing environment.

Current status is logs are not flowing into deployment-kafka-logging01.

Then I continue pretending to know what I'm doing until this is fixed. I have some ideas actually.

I made progress, will write down more once I'm done.

So let's start with what the error was:

[2024-03-26 23:17:10,985] WARN [RequestSendThread controllerId=1001] Controller 1001's connection to broker deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 (id: 1001 rack: B) was unsuccessful (kafka.controller.RequestSendThread)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
        at sun.security.ssl.Alert.createSSLException(Alert.java:131)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:331)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:274)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:269)
        at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:654)
        at sun.security.ssl.CertificateMessage$T12CertificateConsumer.onCertificate(CertificateMessage.java:473)
        at sun.security.ssl.CertificateMessage$T12CertificateConsumer.consume(CertificateMessage.java:369)
        at sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:377)
        at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444)
        at sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:981)
        at sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:968)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.security.ssl.SSLEngineImpl$DelegatedTask.run(SSLEngineImpl.java:915)
        at org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:388)
        at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:468)
        at org.apache.kafka.common.network.SslTransportLayer.doHandshake(SslTransportLayer.java:326)
        at org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:253)
        at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:88)
        at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:486)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:424)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:460)
        at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:73)
        at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:271)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:225)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

That was because we didn't have a valid SSL certificate for kafka. While trying to fix that, I found some issues:

1- This is, I think, the biggest one: the auth token is not being set correctly. The auth token for the cloud's PKI is different from what shows up in /etc/cfssl/client-cfssl.conf (my assumption is that profile::pki::client::auth_key was not added to the cloud's actual private repo when migrating to the new puppetserver). You can look up the actual token in the file /etc/cfssl/signers/kafka/cfssl.conf on the PKI VMs (or in many other cfssl config files); see the comparison sketch after this list.

2- The pki-intermediate host has not gotten any puppet update for a month now. When I logged in, I was greeted with this:

The last Puppet run was at Wed Feb 28 10:02:56 UTC 2024 (42433 minutes ago). 
Last Puppet commit:

The error is:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Duplicate declaration: Package[python3-cryptography] is already declared at (file: /etc/puppet/modules/cfssl/manifests/ocsp.pp, line: 34); cannot redeclare (file: /etc/puppet/modules/prometheus/manifests/node_textfile.pp, line: 34) (file: /etc/puppet/modules/prometheus/manifests/node_textfile.pp, line: 34, column: 9) (file: /etc/puppet/modules/profile/manifests/pki/multirootca/monitoring.pp, line: 26) on node pki-intermediate.pki.eqiad1.wikimedia.cloud

Thankfully it still had the certificates I needed to do the signing, so I didn't need to touch anything there.

3- Kafka's OCSP process in pki-intermediate is also broken:

Mar 27 05:57:30 pki-intermediate cfssl-ocsprefresh[2327]: Traceback (most recent call last):
Mar 27 05:57:30 pki-intermediate cfssl-ocsprefresh[2327]:   File "/usr/local/sbin/cfssl-ocsprefresh", line 176, in <module>
Mar 27 05:57:30 pki-intermediate cfssl-ocsprefresh[2327]:     raise SystemExit(main())
Mar 27 05:57:30 pki-intermediate cfssl-ocsprefresh[2327]:   File "/usr/local/sbin/cfssl-ocsprefresh", line 152, in main
Mar 27 05:57:30 pki-intermediate cfssl-ocsprefresh[2327]:     if not update_required(responses_file, Path(f'{args.dbconfig}.json'), args.update):
Mar 27 05:57:30 pki-intermediate cfssl-ocsprefresh[2327]:   File "/usr/local/sbin/cfssl-ocsprefresh", line 134, in update_required
Mar 27 05:57:30 pki-intermediate cfssl-ocsprefresh[2327]:     return last_update < get_db_update_time(db_conn, config['db'], table_name)
Mar 27 05:57:30 pki-intermediate cfssl-ocsprefresh[2327]: TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
Mar 27 05:57:30 pki-intermediate systemd[1]: cfssl-ocsprefresh-kafka.service: Main process exited, code=exited, status=1/FAILURE
Mar 27 05:57:30 pki-intermediate systemd[1]: cfssl-ocsprefresh-kafka.service: Failed with result 'exit-code'.
Mar 27 05:57:30 pki-intermediate systemd[1]: Failed to start OCSP Refresh job - kafka.
Mar 27 06:57:32 pki-intermediate systemd[1]: Starting OCSP Refresh job - kafka...

4- The path to the CSR file in the kafka SSL signing command is wrong. This is what puppet issues:

/usr/bin/cfssl sign -config /etc/cfssl/client-cfssl.conf -tls-remote-ca /etc/ssl/localcerts/pki_api_CA.pem -mutual-tls-client-cert /etc/cfssl/mutual_tls_client_cert.pem -mutual-tls-client-key /var/lib/puppet/ssl/private_keys/deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud.pem -label kafka -profile kafka_11 /etc/kafka/ssl/kafka__deployment-kafka-logging01_deployment-prep_eqiad1_wikimedia_cloud_kafka_11.csr

/etc/kafka/ssl/kafka__deployment-kafka-logging01_deployment-prep_eqiad1_wikimedia_cloud_kafka_11.csr doesn't exist until you actually generate the certificate (and it is definitely not a cfssl request file). The actual file is at this path:
/etc/cfssl/csr/kafka__deployment-kafka-logging01_deployment-prep_eqiad1_wikimedia_cloud_kafka_11.csr
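As mentioned in point 1, a rough way to confirm the token mismatch is to compare what the client sends with what the signer expects (file paths are the ones named above; the exact field name inside those JSON configs is an assumption, hence the loose grep):

# On the kafka broker (client side): the token cfssl will present
sudo grep -i key /etc/cfssl/client-cfssl.conf
# On the PKI VM (signer side): the token the kafka signer expects
sudo grep -i key /etc/cfssl/signers/kafka/cfssl.conf
# If the two values differ, authsign fails with "invalid token", as seen above.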

With all of these "fixed" by manually changing them (hoping to puppetize them later), I issued this command and it produced the certificates from PKI:

GODEBUG=x509ignoreCN=0 /usr/bin/cfssl gencert   -config /etc/cfssl/client-cfssl.conf   -tls-remote-ca /etc/ssl/localcerts/pki_api_CA.pem   -mutual-tls-client-cert /etc/cfssl/mutual_tls_client_cert.pem   -mutual-tls-client-key /var/lib/puppet/ssl/private_keys/deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud.pem  -loglevel 0 -label kafka -profile server /etc/cfssl/csr/kafka__deployment-kafka-logging01_deployment-prep_eqiad1_wikimedia_cloud_kafka_11.csr | /usr/bin/cfssljson -bare /etc/kafka/ssl/kafka__deployment-kafka-logging01_deployment-prep_eqiad1_wikimedia_cloud_kafka_11

But when I restarted kafka, I got a lot of:

ERROR Error while accepting connection (kafka.network.Acceptor)
java.lang.ArithmeticException: / by zero

Which implies zookeeper is not happy? But it went away and got replaced with this error:

Mar 28 23:18:53 deployment-kafka-logging01 kafka-server-start[17572]: [2024-03-28 23:18:53,452] ERROR [Controller id=1001, targetBrokerId=1001] Connection to node 1001 failed authentication due to: SSL handshake
Mar 28 23:18:53 deployment-kafka-logging01 kafka-server-start[17572]: [2024-03-28 23:18:53,452] WARN Failed to send SSL Close message  (org.apache.kafka.common.network.SslTransportLayer)
Mar 28 23:18:53 deployment-kafka-logging01 kafka-server-start[17572]: java.io.IOException: Broken pipe

Controller log said this:

Caused by: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
	at sun.security.ssl.Alert.createSSLException(Alert.java:131)
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:331)
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:274)
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:269)

I did this:

root@deployment-kafka-logging01:/etc/kafka# keytool -delete -noprompt -alias kafka_logging-eqiad_broker -keystore /etc/kafka/ssl/kafka_logging-eqiad_broker.keystore.p12
root@deployment-kafka-logging01:/etc/kafka# keytool -import -alias kafka_logging-eqiad_broker -file /etc/kafka/ssl/kafka__deployment-kafka-logging01_deployment-prep_eqiad1_wikimedia_cloud_kafka_11.pem -keystore /etc/kafka/ssl/kafka_logging-eqiad_broker.keystore.p12

to remove the old cert and add the new one.

Now I'm getting:

[2024-03-29 00:07:47,832] WARN [RequestSendThread controllerId=1001] Controller 1001's connection to broker deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 (id: 1001 rack: B) was unsuccessful (kafka.controller.RequestSendThread)
org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?

Which is weird because the broker is set up correctly?

[2024-03-29 00:06:51,163] INFO Registered broker 1001 at path /brokers/ids/1001 with addresses: ArrayBuffer(EndPoint(deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud,9092,ListenerName(PLAINTEXT),PLAINTEXT), EndPoint(deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud,9093,ListenerName(SSL),SSL)) (kafka.zk.KafkaZkClient)
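A quick sanity check for the "plaintext connection?" error is to look at what the 9093 listener actually answers with (standard tools; host and port are from the logs above):

# Is anything listening on 9093?
ss -tlnp | grep 9093
# If the handshake completes, this prints the certificate subject; if it fails
# immediately, whatever answers on 9093 is likely speaking plaintext.
echo | openssl s_client -connect deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 2>/dev/null \
  | openssl x509 -noout -subject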

I tried setting security.inter.broker.protocol=PLAINTEXT, since this is between brokers inside the VM, and it still didn't fix the issue. At this point, I'm running out of ideas.

Change #1014715 abandoned by Hashar:

[integration/config@master] Revert "jjb: record beta-scap-sync-world is disabled"

Reason:

Dupe of I5e05d683bcb7f10c52f90d0bd0dafeba0f66ba0f and https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ is active.

https://gerrit.wikimedia.org/r/1014715

Ok the fact that the PKI node with the intermediate CA is broken is not great, let's try to fix that first.

As Amir pointed out:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Duplicate declaration: Package[python3-cryptography] is already declared at (file: /etc/puppet/modules/cfssl/manifests/ocsp.pp, line: 34); cannot redeclare (file: /etc/puppet/modules/prometheus/manifests/node_textfile.pp, line: 34) (file: /etc/puppet/modules/prometheus/manifests/node_textfile.pp, line: 34, column: 9) (file: /etc/puppet/modules/profile/manifests/pki/multirootca/monitoring.pp, line: 26) on node pki-intermediate.pki.eqiad1.wikimedia.cloud

Afaics both classes use ensure_packages, which should, IIRC, prevent duplicate declarations. I am wondering though if the error is due to the fact that in cloud we get this by default:

profile::pki::multirootca::enable_monitoring: false

That in turn causes profile::pki::multirootca::monitoring in multirootca.pp to set ensure=absent on prometheus::node_textfile (for the certificate expiry check), which then tries to remove the python3-cryptography package, while cfssl::ocsp tries to install it. So the duplicate declaration comes, in theory, from two competing install/remove actions.

We can try to find a solution and then run puppet on the kafka nodes to see if they are able to get the right certs, and restart from there. Not sure what is the right fix for this use case though.

Change #1015541 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pki::multirootca::monitoring: add workaround for python3-crypto

https://gerrit.wikimedia.org/r/1015541

Change #1015541 merged by Elukey:

[operations/puppet@production] profile::pki::multirootca::monitoring: add workaround for python3-crypto

https://gerrit.wikimedia.org/r/1015541

Change #1016289 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pki::multirootca::monitoring: rework prometheus-client pkg

https://gerrit.wikimedia.org/r/1016289

Change #1016289 merged by Elukey:

[operations/puppet@production] profile::pki::multirootca::monitoring: rework prometheus-client pkg

https://gerrit.wikimedia.org/r/1016289

PKI intermediate cloud node fixed. Now I think we need to fix the second biggest issue pointed out by Amir, namely that the auth token is probably misconfigured on a lot of cloud nodes.

In deployment-prep I see that the auth token, set by profile::pki::client::auth_key, is the one from the fake private repository. I have no idea if there is a proper private repo for cloud in which to set the right auth key. Once that is solved we should be able to unblock this.

I see "local" private commits on the new puppetmaster in the private repo

There are: [ACTUALLY PRIVATE] commits in root@deployment-puppetserver-1:/srv/git/labs/private#

But I can't add more commits because there is a pre-commit hook

#!/bin/sh

echo "Local commits are not allowed in this repository. Please go to frontend puppetmaster and commit"
exit 1

What "a frontend puppetmaster" means, I don't know.

I think that by default any puppetmaster that pulls data from another repository (so it is not the canonical source of truth) has this protection to avoid mistakes. I also see some actually-private/local tags on some commits in the git history; probably we are missing some configs.

The puppetmaster04 node was replaced recently in T327742; tagging @Andrew to see if anything can be done to fix this. TL;DR: we are trying to fix a PKI config mess and we'd need to add a local/private commit to puppet-private in deployment-prep, using the deployment-puppetserver-1 node. From the git history there seems to be a way to add actually-private commits, and since the host was added recently we are wondering if any config is missing.

TLDR
rm /srv/git/labs/private/.git/hooks/pre-commit /srv/git/operations/puppet/.git/hooks/pre-commit

The pre-commit hook comes from puppetmaster::gitclone, but it is only installed when the Git::Clone resource is realized, which only happens when the repository is first cloned. Given the old instance was created before that change got introduced, it never received the pre-commit hook, and we only find out it is broken after a new Puppet master gets created.

That is all assuming that no new Puppet master got created since June 2021 and a9b84f7fbb27e3da2cc96db68546bb5788a054cd

I have tried on integration-puppetserver-01.integration.eqiad1.wikimedia.cloud:

  • Delete /srv/git/labs/private/.git/hooks/pre-commit
  • Ran puppet agent -tv

The file is not recreated (since the git directory exists, the git::clone is not realized again). Ideally the git clone should be fixed to set/delete/manage the git hooks directory, and puppetmaster::gitclone should not install the hook on WMCS, since its whole purpose is to be able to apply local commits :-]

Anyway, those are a couple side tasks really.

Mentioned in SAL (#wikimedia-releng) [2024-04-02T13:45:49Z] <elukey> rm /srv/git/labs/private/.git/hooks/pre-commit in deployment-puppetserver-1 - T360595

Mentioned in SAL (#wikimedia-releng) [2024-04-02T13:55:42Z] <elukey> soft reboot deployment-puppetserver-1 - T360595

Mentioned in SAL (#wikimedia-releng) [2024-04-02T14:39:53Z] <elukey> restart puppetserver on deployment-puppetserver-1 with 5g of Xmx (rather than 7g) - T360595

Update: after a lot of config changes, we are now able to add local commits to the private repo on deployment-puppetserver-1. I tried various hacks to override the value that we set in profile::pki::client::auth_key, but it does not seem to be picked up.

Next steps:

  • Figure out how to override profile::pki::client::auth_key for all deployment-prep's nodes (using the real key from PKI nodes)
  • Resolve other puppet issues for deployment-kafka-logging01 until kafka starts with the proper TLS certs etc..

Change #1016364 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Remove profile::pki::client::auth_key from common.yaml

https://gerrit.wikimedia.org/r/1016364

Change #1016364 merged by Elukey:

[labs/private@master] Remove profile::pki::client::auth_key from common.yaml

https://gerrit.wikimedia.org/r/1016364

Change #1016386 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Remove profile::pki::client's specific hiera config

https://gerrit.wikimedia.org/r/1016386

Ok, so the issue was that the profile::pki::client::auth_key value set under hiera's profile settings was not picked up (thanks Taavi for the help), so I was trying the right config in the wrong place.

I added another auth_key value to common.yaml and that overrode the profile::pki::client::auth_key value from the fake private repo.
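To double-check that the override actually resolves for a given node, puppet's own lookup tool can be run on the puppetserver (standard puppet lookup invocation; whether it needs extra flags in this environment is an assumption):

# Resolve the hiera key the same way catalog compilation would, for the kafka broker node
puppet lookup profile::pki::client::auth_key \
  --node deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud \
  --render-as s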

I removed all the old content of the kafka ssl dir, as well as the wmf-ca-certs bundle, let puppet recreate everything and restarted kafka.

deployment-kafka-logging01 seems to work now, can anybody confirm?

@thcipriani I haven't forgotten about varnishkafka; if the above works we can apply the same fix to the other kafka nodes and we should be good.

deployment-kafka-logging01 seems to work now, can anybody confirm?

The symptom that sparked this task seems to be resolved; logstash_checker.py works again:

[thcipriani@deployment-deploy03 ~]$ logstash_checker.py --service mediawiki --host deployment-mediawiki12 --fail-threshold 10.0 --delay 20 --logstash-host logging-logstash-02.logging.eqiad1.wikimedia.cloud:9200
INFO: OK (Avg. errors per 10 seconds: Before: 0.23, After: 1.00, Threshold: 2.32)

🎉

@thcipriani I haven't forgotten about varnishkafka; if the above works we can apply the same fix to the other kafka nodes and we should be good.

That'd be great! Thanks so much @elukey !

@thcipriani I haven't forgotten about varnishkafka; if the above works we can apply the same fix to the other kafka nodes and we should be good.

That'd be great! Thanks so much @elukey !

Kafka Jumbo is on PKI/SSL as well now; it didn't have any TLS configured, so I just added it. I restarted varnishkafka on the cache node, and it should work now (afaics it does, but lemme know otherwise).

deployment-kafka-logging01 seems to work now, can anybody confirm?

beta-logs shows logs again - logstash has reconnected and is consuming events. Thank you!! 🎉

Change #1016386 merged by Elukey:

[labs/private@master] Remove profile::pki::client's specific hiera config

https://gerrit.wikimedia.org/r/1016386

Change #1016764 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] profile::pki::client: re-introduce fake auth token

https://gerrit.wikimedia.org/r/1016764

Change #1016764 merged by Elukey:

[labs/private@master] profile::pki::client: re-introduce fake auth token

https://gerrit.wikimedia.org/r/1016764

I think all the work is done; varnishkafka seems to be working fine now and the logs are flowing nicely. Shall we close?

lmata moved this task from Inbox to Radar on the observability board.

I think all the work is done; varnishkafka seems to be working fine now and the logs are flowing nicely. Shall we close?

Almost: we need to reactivate the logstash check whose failure prompted me to file this task. It looks like it is working:

hashar@deployment-deploy03:/srv/mediawiki-staging$ /usr/local/bin/logstash_checker.py \
>   --service-name mwdeploy \
>   --host canaries \
>   --fail-threshold 10.0 \
>   --delay 20 \
>   --logstash-host logging-logstash-02.logging.eqiad1.wikimedia.cloud:9200
INFO: OK (Avg. errors per 10 seconds: Before: 0.00, After: 0.00, Threshold: 1.00)

Change #1016817 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] beta-scap-sync-world: restore scap canary checks

https://gerrit.wikimedia.org/r/1016817

Change #1016817 merged by jenkins-bot:

[integration/config@master] beta-scap-sync-world: restore scap canary checks

https://gerrit.wikimedia.org/r/1016817

The beta-scap-sync-world job managed to push the update, including a successful check against logstash, and sure enough it passed:

17:31:43 17:31:43 Waiting 20 seconds for canary traffic...
17:32:03 17:32:03 Executing check 'Logstash canary error rate'
17:32:04 17:32:04 Finished sync-check-canaries (duration: 00m 27s)

I am assigning the task to @elukey since he restored the SSL cert / validated Kafka.

I have no idea what the root cause is, though. It seems the certificates expired early in March, and the Puppet master being rebuilt last week caused some Hiera settings to no longer be applied?

The old (cergen-based) certificates for Kafka Logging in deployment-prep expired, and we worked to add a PKI-based TLS cert instead of the old one (self-managed via puppet etc., no manual periodic renewal needed). All the puppet work was about getting PKI to work in deployment-prep, since several issues were causing failures. The puppet master rebuild caused some trouble when trying to apply private commits in deployment-prep's private repo. While at it, deployment-prep's kafka jumbo cluster was moved to PKI as well, to fix an issue with varnishkafka running on deployment-prep's cache node (a TLS cert expiry as well).

Lemme know if anything is missing, otherwise we can close! Thanks all :)

The old (cergen-based) certificates for Kafka Logging in deployment-prep expired, and we worked to add a PKI-based TLS cert instead of the old one (self-managed via puppet etc., no manual periodic renewal needed). All the puppet work was about getting PKI to work in deployment-prep, since several issues were causing failures. The puppet master rebuild caused some trouble when trying to apply private commits in deployment-prep's private repo. While at it, deployment-prep's kafka jumbo cluster was moved to PKI as well, to fix an issue with varnishkafka running on deployment-prep's cache node (a TLS cert expiry as well).

Lemme know if anything is missing, otherwise we can close! Thanks all :)

Thanks for all the digging in here and for this explanation! <3