
Migrate beta cluster to ELK7
Open, Needs Triage, Public

Description

Replace deployment-logstash03 (stretch, elk5) with a buster+elk7 setup

Event Timeline

Restricted Application added a subscriber: Aklapper.

I took the opportunity to split logstash and the kafka broker to different hosts, so I created deployment-logstash04 and deployment-kafka-logging01. Now having some issues with Kafka certs:

No name matching deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud found

the certs (openssl s_client -connect deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 -showcerts) come from deployment-puppetmaster04:/var/lib/git/labs/private/modules/secret/secrets/certificates/kafka_logging-eqiad_broker and don't have names for the old host either.

I'm not finding any info on how the certs were created; https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_Certificates isn't helpful outside production. So how is this supposed to work?

IIRC: The broker certs don't have hostnames in them; they all use the same cert signed by the same CA.

In labs-private repo, I see modules/secret/secrets/certificates/certificates.manifests.d/deployment_prep.certs.yaml, which declares certs for kafka_logging-eqiad_broker. You should be able to use the same key/cert for the new broker.
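For what it's worth, the "no hostnames in the cert" claim is easy to sanity-check with openssl. This sketch generates a throwaway self-signed cert the same shape as the shared broker cert (a CN only, no subjectAltName), then inspects it; the CN value here just mirrors the secret's name and is otherwise an assumption:

```shell
# Sketch: throwaway cert with a CN but no SAN, mimicking a shared broker cert
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/broker.key -out /tmp/broker.crt \
  -subj "/CN=kafka_logging-eqiad_broker" 2>/dev/null

# Print the subject and any subjectAltName extension; with no SAN, only the CN shows
openssl x509 -in /tmp/broker.crt -noout -subject -ext subjectAltName
```

Against a live broker, the same inspection works on the output of `openssl s_client ... -showcerts` piped into `openssl x509 -noout -subject -ext subjectAltName`.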

having some issues with Kafka certs

Where is that error message coming from, Kafka broker itself?

IIRC: The broker certs don't have hostnames in them; they all use the same cert signed by the same CA.

That's what the certs look like, yeah. Logstash doesn't accept them because it somehow wants them to have the hostname.

Where is that error message coming from, Kafka broker itself?

Logstash itself (/var/log/logstash/logstash-json.log) when trying to connect to Kafka.

Huh, in prod that is not needed.

Perhaps this is a CA issue? IIRC (and it has been a while since I thought about this) The CA used to sign the Kafka broker certs has to be the same CA used for the kafka client certs.

I think I found the reason: the ssl_endpoint_identification_algorithm setting looks like it could have something to do with it. According to https://github.com/logstash-plugins/logstash-integration-kafka/pull/8, setting it to an empty string disables hostname verification. The Puppet manifests are trying to do exactly that, but the line is not present on deployment-logstash04 for some reason. Is it there in prod?

Live-hacking the if case out of the template like this adds the needed lines to the config files.

root@deployment-puppetmaster04:/var/lib/git/operations/puppet# git diff
diff --git a/modules/logstash/templates/input/kafka.erb b/modules/logstash/templates/input/kafka.erb
index 35171be32e..d9330b897e 100644
--- a/modules/logstash/templates/input/kafka.erb
+++ b/modules/logstash/templates/input/kafka.erb
@@ -27,8 +27,6 @@ input {
     ssl_truststore_location => "<%= @ssl_truststore_location %>"
     ssl_truststore_password => "<%= @ssl_truststore_password %>"
 <% end -%>
-<% if @ssl_endpoint_identification_algorithm -%>
     ssl_endpoint_identification_algorithm => "<%=@ssl_endpoint_identification_algorithm %>"
-<% end -%>
   }
 }
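With the hack applied, the rendered input config on the host ends up containing the previously missing line; roughly like this (truststore path and password are placeholders, not the real values):

```
input {
  kafka {
    ...
    ssl_truststore_location => "/etc/logstash/kafka_logging-eqiad.truststore.jks"
    ssl_truststore_password => "<redacted>"
    ssl_endpoint_identification_algorithm => ""
  }
}
```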

Reverted the live hacks since I'd like to know what's causing this in the first place; unfortunately I don't have any answers yet.

Note that when testing while the hack was in place, we got errors about another truststore:

org.apache.kafka.common.KafkaException: Failed to load SSL keystore /etc/logstash/kafka_logstash-eqiad.truststore.jks of type JKS

That's fairly easily explained, but fixing it is a different matter:

root@deployment-logstash04:/etc/logstash# cat /etc/logstash/kafka_logstash-eqiad.truststore.jks
#FAKE

Why are there two different truststores (kafka_logging-eqiad and kafka_logstash-eqiad)? Why does only one of them have the required fakes? How did this work previously on ELK5?
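The load failure itself is unsurprising given the file contents: a real JKS truststore is a binary file that begins with the magic bytes 0xFEEDFEED, while the puppet "fake" placeholder is the literal text #FAKE, which Kafka's SSL code cannot parse as a keystore. A small sketch of that distinction:

```python
# Sketch: tell a real JKS keystore apart from a puppet "#FAKE" placeholder.
# A JKS file starts with the magic bytes 0xFEEDFEED; the placeholder is
# plain text, so loading it as a keystore fails with a KafkaException.
JKS_MAGIC = bytes.fromhex("feedfeed")

def looks_like_jks(data: bytes) -> bool:
    """Return True if the data plausibly begins a JKS keystore."""
    return data[:4] == JKS_MAGIC

print(looks_like_jks(b"#FAKE\n"))  # prints: False
```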

Puppet manifests are trying to do that but that line is not present on deployment-logstash04 for some reason. Is that there on prod?

Yes, the config currently looks like this in prod:

ssl_endpoint_identification_algorithm => ""

MediaWiki event delivery is still failing, rsyslogd spams this all over the logs:

omkafka: kafka delivery FAIL on Topic 'udp_localhost-info', msg [full mediawiki log message contents redacted]

Any ideas on what might be causing these? I tried to look at all the other logs about this (kafka/logstash/elasticsearch) and found nothing related.
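One way to check whether the broker accepts TLS connections independently of rsyslog is to probe the cluster metadata with kafkacat (assuming it is installed; the broker address is the beta one from above, and the CA path is a guess, not verified against this setup):

```
# Sketch: fetch broker metadata over TLS with kafkacat
kafkacat -L -b deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 \
  -X security.protocol=ssl \
  -X ssl.ca.location=/etc/ssl/certs/ca-certificates.crt
```

If this lists the brokers and topics, the TLS path is fine and the delivery failures are more likely on the rsyslog/omkafka side.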

and now elasticsearch is sad:

taavi@deployment-logstash04:/etc/elasticsearch$ curl localhost:9200/_cluster/allocation/explain|jq
{
  "index": "logstash-2021.05.18",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2021-05-18T12:45:04.255Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "EH-2hwQAQoCu_tYbRiFFQw",
      "node_name": "deployment-logstash04-deployment-prep-logstash-eqiad",
      "transport_address": "172.16.6.174:9300",
      "node_attributes": {
        "hostname": "deployment-logstash04",
        "fqdn": "deployment-logstash04.deployment-prep.eqiad1.wikimedia.cloud"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[logstash-2021.05.18][0], node[EH-2hwQAQoCu_tYbRiFFQw], [P], s[STARTED], a[id=_clEZGzdQkqdBwXYLNvvoQ]]"
        }
      ]
    }
  ]
}

:(

I took a look at this again, and elasticsearch is still complaining that "a copy of this shard is already allocated to this node". I didn't find any configuration to tell it that there will only be one node and that assigning all data to it is fine. Do we have any options other than creating a full cluster with multiple replicas (and if so, would that require two or three nodes total)?
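For what it's worth, a common workaround on single-node Elasticsearch clusters (not verified against this setup) is to drop the replica count to zero, so the unassignable replica shards simply go away; new daily indices would need the same setting via an index template:

```
# Sketch: set number_of_replicas to 0 for the existing logstash-* indices
curl -XPUT 'localhost:9200/logstash-*/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'
```

The allocation explain output above is exactly the symptom this addresses: the `same_shard` decider refuses to put a replica on the node that already holds the primary.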

Mentioned in SAL (#wikimedia-releng) [2021-06-19T08:05:05Z] <majavah> creating deployment-logstash05 and configure it like 04, looks like elasticsearch does not like clusters with only one host T283013

https://logstash-beta.wmcloud.org is now receiving events from mediawiki and others. Next steps are importing dashboards from somewhere (not sure where; beta's ELK5 is fully broken) and writing some documentation.

New issue: /var/log/logstash/logstash-json.log keeps filling up rather quickly with various events, which frequently runs the logstash nodes out of disk space on the root partition and freezes everything. At least some of the issues are related to the "dead letter queue"; wikitech says it's normally not enabled, but puppet says it is.
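If puppet really is enabling the dead letter queue, the setting that controls it lives in logstash.yml (setting name per the Logstash docs; whether the beta manifests actually set it is unverified):

```
# /etc/logstash/logstash.yml fragment (the actual file is puppet-managed)
dead_letter_queue.enable: false
```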

This has two (three) issues remaining: