Replace deployment-logstash03 (stretch, elk5) with a buster+elk7 setup
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | colewhite | T211984 Logstash in beta fails periodically
Resolved | | colewhite | T233134 logstash-beta.wmflabs.org does not receive any mediawiki events
Resolved | | taavi | T283013 Migrate beta cluster to ELK7
Event Timeline
I took the opportunity to split logstash and the kafka broker to different hosts, so I created deployment-logstash04 and deployment-kafka-logging01. Now having some issues with Kafka certs:
No name matching deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud found
the certs (openssl s_client -connect deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 -showcerts) come from deployment-puppetmaster04:/var/lib/git/labs/private/modules/secret/secrets/certificates/kafka_logging-eqiad_broker and don't have names for the old host either.
I'm not finding any info on how the certs were created, https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_Certificates isn't helpful outside production, so how is this supposed to work?
IIRC: The broker certs don't have hostnames in them; they all use the same cert signed by the same CA.
In labs-private repo, I see modules/secret/secrets/certificates/certificates.manifests.d/deployment_prep.certs.yaml, which declares certs for kafka_logging-eqiad_broker. You should be able to use the same key/cert for the new broker.
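To confirm a broker really is presenting the cert from the private repo, comparing fingerprints should be enough; a sketch (hostname, port, and repo path are the ones mentioned in this task, but the exact .crt filename inside that directory is an assumption):

```
# Print subject and SHA-256 fingerprint of the certificate the broker presents.
openssl s_client -connect deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -fingerprint -sha256

# Same for the cert file in the private repo, to check that they match.
openssl x509 -noout -subject -fingerprint -sha256 \
  -in /var/lib/git/labs/private/modules/secret/secrets/certificates/kafka_logging-eqiad_broker/kafka_logging-eqiad_broker.crt
```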
having some issues with Kafka certs
Where is that error message coming from, Kafka broker itself?
IIRC: The broker certs don't have hostnames in them; they all use the same cert signed by the same CA.
That's what the certs look like, yeah. Logstash doesn't accept them because it somehow wants them to have the hostname.
Where is that error message coming from, Kafka broker itself?
Logstash itself (/var/log/logstash/logstash-json.log) when trying to connect to Kafka.
Huh, in prod that is not needed.
Perhaps this is a CA issue? IIRC (and it has been a while since I thought about this) The CA used to sign the Kafka broker certs has to be the same CA used for the kafka client certs.
I think I found the cause: the ssl_endpoint_identification_algorithm setting looks like it could have something to do with it. According to https://github.com/logstash-plugins/logstash-integration-kafka/pull/8, setting it to an empty string disables hostname verification. The Puppet manifests are trying to do that, but that line is not present on deployment-logstash04 for some reason. Is it there in prod?
Live-hacking the if case out of the template like this adds the needed line to the generated config files.
```diff
root@deployment-puppetmaster04:/var/lib/git/operations/puppet# git diff
diff --git a/modules/logstash/templates/input/kafka.erb b/modules/logstash/templates/input/kafka.erb
index 35171be32e..d9330b897e 100644
--- a/modules/logstash/templates/input/kafka.erb
+++ b/modules/logstash/templates/input/kafka.erb
@@ -27,8 +27,6 @@ input {
         ssl_truststore_location => "<%= @ssl_truststore_location %>"
         ssl_truststore_password => "<%= @ssl_truststore_password %>"
 <% end -%>
-<% if @ssl_endpoint_identification_algorithm -%>
         ssl_endpoint_identification_algorithm => "<%=@ssl_endpoint_identification_algorithm %>"
-<% end -%>
     }
 }
```
Reverted the live hacks, since I'd like to know what's causing this in the first place; unfortunately I don't have any answers yet.
Note that while testing with the hack in place, we got errors about another truststore:
org.apache.kafka.common.KafkaException: Failed to load SSL keystore /etc/logstash/kafka_logstash-eqiad.truststore.jks of type JKS
That's fairly easily explained, but fixing it is a different matter:
```
root@deployment-logstash04:/etc/logstash# cat /etc/logstash/kafka_logstash-eqiad.truststore.jks
#FAKE
```
Why are there two different truststores (kafka_logging-eqiad and kafka_logstash-eqiad)? Why does only one of them have the required fakes? How did this work previously on ELK5?
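One way to compare what the two truststores actually contain (a sketch: keytool ships with the JDK, and the store password would come from the private repo, so the value here is a placeholder):

```
# List the CA entries of both truststores side by side (placeholder password).
keytool -list -keystore /etc/logstash/kafka_logging-eqiad.truststore.jks -storepass PLACEHOLDER
keytool -list -keystore /etc/logstash/kafka_logstash-eqiad.truststore.jks -storepass PLACEHOLDER

# A real JKS file is binary (magic bytes 0xFEEDFEED); the "#FAKE" one is plain text.
file /etc/logstash/kafka_*-eqiad.truststore.jks
```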
Yes, the config currently looks like this in prod
ssl_endpoint_identification_algorithm => ""
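In fuller context, the rendered kafka input block should end up looking roughly like this (a sketch: the broker name and truststore path are taken from elsewhere in this task, the password is a placeholder, and other options are omitted):

```
input {
  kafka {
    bootstrap_servers => "deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093"
    security_protocol => "SSL"
    ssl_truststore_location => "/etc/logstash/kafka_logstash-eqiad.truststore.jks"
    ssl_truststore_password => "PLACEHOLDER"
    # an empty string disables broker hostname verification
    ssl_endpoint_identification_algorithm => ""
  }
}
```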
https://gerrit.wikimedia.org/r/c/operations/puppet/+/683695 was cherry picked and overriding the value. Mystery solved.
MediaWiki event delivery is still failing, rsyslogd spams this all over the logs:
omkafka: kafka delivery FAIL on Topic 'udp_localhost-info', msg [full mediawiki log message contents redacted]
Any ideas on what might be causing these? I looked through all the other logs about this (kafka/logstash/elasticsearch) and found nothing related.
and now elasticsearch is sad:
```
taavi@deployment-logstash04:/etc/elasticsearch$ curl localhost:9200/_cluster/allocation/explain | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   871  100   871    0     0  72583      0 --:--:-- --:--:-- --:--:-- 72583
{
  "index": "logstash-2021.05.18",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2021-05-18T12:45:04.255Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "EH-2hwQAQoCu_tYbRiFFQw",
      "node_name": "deployment-logstash04-deployment-prep-logstash-eqiad",
      "transport_address": "172.16.6.174:9300",
      "node_attributes": {
        "hostname": "deployment-logstash04",
        "fqdn": "deployment-logstash04.deployment-prep.eqiad1.wikimedia.cloud"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[logstash-2021.05.18][0], node[EH-2hwQAQoCu_tYbRiFFQw], [P], s[STARTED], a[id=_clEZGzdQkqdBwXYLNvvoQ]]"
        }
      ]
    }
  ]
}
```
:(
I took a look at this again, and elasticsearch is still complaining that "a copy of this shard is already allocated to this node". I didn't find any configuration option to tell it that there will only ever be one node and that assigning all data to it is fine. Do we have any options other than creating a full cluster with multiple replicas (and if so, would that require two or three nodes in total)?
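If staying on a single node were acceptable, the other route would be telling elasticsearch not to expect replica copies at all; a sketch run against the local node (the index pattern matches the logstash-2021.05.18 index above, the template name is made up):

```
# Drop replica copies on existing indices so the unassigned shards go away...
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/logstash-*/_settings' \
  -d '{"index": {"number_of_replicas": 0}}'

# ...and default new daily indices to zero replicas via a (legacy) index template.
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_template/logstash-no-replicas' \
  -d '{"index_patterns": ["logstash-*"], "order": 99, "settings": {"number_of_replicas": 0}}'
```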
Mentioned in SAL (#wikimedia-releng) [2021-06-19T08:05:05Z] <majavah> creating deployment-logstash05 and configure it like 04, looks like elasticsearch does not like clusters with only one host T283013
https://logstash-beta.wmcloud.org is now receiving events from MediaWiki and others. Next steps are importing dashboards from somewhere (not sure where; beta's ELK5 is fully broken) and writing some documentation.
New issue: /var/log/logstash/logstash-json.log keeps filling up rather quickly with various events, which frequently runs the logstash nodes out of disk space on the root partition and freezes everything. At least some of the noise is related to the "dead letter queue": wikitech says it's normally not enabled, but puppet says it is.
This has two (three) issues remaining:
- T283013#7163536, which is currently worked around with a cronjob on deployment-cumin that deletes those log files every other hour and restarts the logstash service
- MediaWiki logstash indexing errors ("Field [timestamp] of type [text] does not support custom formats")
- Things that try to ingest via GELF
- Docker/Kubernetes services have something like this in production: https://github.com/wikimedia/puppet/blob/be4419c236fc9a2ebca8fc275518e055863e6759/modules/profile/manifests/kubernetes/node.pp#L27; hopefully we can reuse that, but with the path changed to something like /var/lib/docker/containers/*/*-json.log
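For the first remaining item, a logrotate rule on the logstash hosts themselves might eventually replace the deployment-cumin cronjob; a minimal sketch (only the log path comes from this task, the rotation parameters are assumptions):

```
# /etc/logrotate.d/logstash-json (hypothetical)
/var/log/logstash/logstash-json.log {
    # requires logrotate to be invoked hourly (e.g. from cron.hourly)
    hourly
    rotate 2
    maxsize 500M
    compress
    missingok
    notifempty
    # truncate in place so the logstash service doesn't need a restart
    copytruncate
}
```

copytruncate avoids having to restart logstash on every rotation, at the cost of possibly losing a few lines written during the truncate.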