Replace deployment-logstash03 (stretch, elk5) with a buster+elk7 setup
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | colewhite | T211984 Logstash in beta fails periodically
Resolved | | colewhite | T233134 logstash-beta.wmflabs.org does not receive any mediawiki events
Resolved | | taavi | T283013 Migrate beta cluster to ELK7
Event Timeline
I took the opportunity to split logstash and the kafka broker to different hosts, so I created deployment-logstash04 and deployment-kafka-logging01. Now having some issues with Kafka certs:
No name matching deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud found
the certs (openssl s_client -connect deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 -showcerts) come from deployment-puppetmaster04:/var/lib/git/labs/private/modules/secret/secrets/certificates/kafka_logging-eqiad_broker and don't have names for the old host either.
I'm not finding any info on how the certs were created, https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_Certificates isn't helpful outside production, so how is this supposed to work?
IIRC: The broker certs don't have hostnames in them; they all use the same cert signed by the same CA.
In labs-private repo, I see modules/secret/secrets/certificates/certificates.manifests.d/deployment_prep.certs.yaml, which declares certs for kafka_logging-eqiad_broker. You should be able to use the same key/cert for the new broker.
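To confirm a broker really is presenting the cert from the private repo, comparing fingerprints should be enough; a sketch (hostname, port, and repo path are the ones mentioned in this task, but the exact .crt filename inside that directory is an assumption):

```
# Print subject and SHA-256 fingerprint of the certificate the broker presents.
openssl s_client -connect deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093 </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -fingerprint -sha256

# Same for the cert file in the private repo, to check that they match.
openssl x509 -noout -subject -fingerprint -sha256 \
  -in /var/lib/git/labs/private/modules/secret/secrets/certificates/kafka_logging-eqiad_broker/kafka_logging-eqiad_broker.crt
```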
having some issues with Kafka certs
Where is that error message coming from, Kafka broker itself?
IIRC: The broker certs don't have hostnames in them; they all use the same cert signed by the same CA.
That's what the certs look like, yeah. Logstash doesn't accept them because it somehow wants them to have the hostname.
Where is that error message coming from, Kafka broker itself?
Logstash itself (/var/log/logstash/logstash-json.log) when trying to connect to Kafka.
Huh, in prod that is not needed.
Perhaps this is a CA issue? IIRC (and it has been a while since I thought about this) The CA used to sign the Kafka broker certs has to be the same CA used for the kafka client certs.
I think I found the cause: the ssl_endpoint_identification_algorithm setting looks like it could have something to do with it. According to https://github.com/logstash-plugins/logstash-integration-kafka/pull/8, setting it to an empty string disables hostname verification. The Puppet manifests are trying to do that, but that line is not present on deployment-logstash04 for some reason. Is it there in prod?
Live-hacking the if case out of the template like this adds the needed line to the generated config files.
```diff
root@deployment-puppetmaster04:/var/lib/git/operations/puppet# git diff
diff --git a/modules/logstash/templates/input/kafka.erb b/modules/logstash/templates/input/kafka.erb
index 35171be32e..d9330b897e 100644
--- a/modules/logstash/templates/input/kafka.erb
+++ b/modules/logstash/templates/input/kafka.erb
@@ -27,8 +27,6 @@ input {
         ssl_truststore_location => "<%= @ssl_truststore_location %>"
         ssl_truststore_password => "<%= @ssl_truststore_password %>"
 <% end -%>
-<% if @ssl_endpoint_identification_algorithm -%>
         ssl_endpoint_identification_algorithm => "<%=@ssl_endpoint_identification_algorithm %>"
-<% end -%>
     }
 }
```
Reverted the live hacks, since I'd like to know what's causing this in the first place; unfortunately I don't have any answers yet.
Note that while testing with the hack in place, we got errors about another truststore:
org.apache.kafka.common.KafkaException: Failed to load SSL keystore /etc/logstash/kafka_logstash-eqiad.truststore.jks of type JKS
That's fairly easily explained, but fixing it is a different matter:
```
root@deployment-logstash04:/etc/logstash# cat /etc/logstash/kafka_logstash-eqiad.truststore.jks
#FAKE
```
Why are there two different truststores (kafka_logging-eqiad and kafka_logstash-eqiad)? Why does only one of them have the required fakes? How did this work previously on ELK5?
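One way to compare what the two truststores actually contain (a sketch: keytool ships with the JDK, and the store password would come from the private repo, so the value here is a placeholder):

```
# List the CA entries of both truststores side by side (placeholder password).
keytool -list -keystore /etc/logstash/kafka_logging-eqiad.truststore.jks -storepass PLACEHOLDER
keytool -list -keystore /etc/logstash/kafka_logstash-eqiad.truststore.jks -storepass PLACEHOLDER

# A real JKS file is binary (magic bytes 0xFEEDFEED); the "#FAKE" one is plain text.
file /etc/logstash/kafka_*-eqiad.truststore.jks
```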
Yes, the config currently looks like this in prod
ssl_endpoint_identification_algorithm => ""
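In fuller context, the rendered kafka input block should end up looking roughly like this (a sketch: the broker name and truststore path are taken from elsewhere in this task, the password is a placeholder, and other options are omitted):

```
input {
  kafka {
    bootstrap_servers => "deployment-kafka-logging01.deployment-prep.eqiad1.wikimedia.cloud:9093"
    security_protocol => "SSL"
    ssl_truststore_location => "/etc/logstash/kafka_logstash-eqiad.truststore.jks"
    ssl_truststore_password => "PLACEHOLDER"
    # an empty string disables broker hostname verification
    ssl_endpoint_identification_algorithm => ""
  }
}
```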
https://gerrit.wikimedia.org/r/c/operations/puppet/+/683695 was cherry picked and overriding the value. Mystery solved.
MediaWiki event delivery is still failing, rsyslogd spams this all over the logs:
omkafka: kafka delivery FAIL on Topic 'udp_localhost-info', msg [full mediawiki log message contents redacted]
Any ideas on what might be causing these? I looked through all the other logs about this (kafka/logstash/elasticsearch) and found nothing related.
and now elasticsearch is sad:
```
taavi@deployment-logstash04:/etc/elasticsearch$ curl localhost:9200/_cluster/allocation/explain | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   871  100   871    0     0  72583      0 --:--:-- --:--:-- --:--:-- 72583
{
  "index": "logstash-2021.05.18",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2021-05-18T12:45:04.255Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "EH-2hwQAQoCu_tYbRiFFQw",
      "node_name": "deployment-logstash04-deployment-prep-logstash-eqiad",
      "transport_address": "172.16.6.174:9300",
      "node_attributes": {
        "hostname": "deployment-logstash04",
        "fqdn": "deployment-logstash04.deployment-prep.eqiad1.wikimedia.cloud"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[logstash-2021.05.18][0], node[EH-2hwQAQoCu_tYbRiFFQw], [P], s[STARTED], a[id=_clEZGzdQkqdBwXYLNvvoQ]]"
        }
      ]
    }
  ]
}
```
:(
I took a look at this again, and elasticsearch is still complaining that "a copy of this shard is already allocated to this node". I didn't find any configuration option to tell it that there will only ever be one node and that assigning all data to it is fine. Do we have any options other than creating a full cluster with multiple replicas (and if so, would that require two or three nodes in total)?
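If staying on a single node were acceptable, the other route would be telling elasticsearch not to expect replica copies at all; a sketch run against the local node (the index pattern matches the logstash-2021.05.18 index above, the template name is made up):

```
# Drop replica copies on existing indices so the unassigned shards go away...
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/logstash-*/_settings' \
  -d '{"index": {"number_of_replicas": 0}}'

# ...and default new daily indices to zero replicas via a (legacy) index template.
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_template/logstash-no-replicas' \
  -d '{"index_patterns": ["logstash-*"], "order": 99, "settings": {"number_of_replicas": 0}}'
```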
Mentioned in SAL (#wikimedia-releng) [2021-06-19T08:05:05Z] <majavah> creating deployment-logstash05 and configure it like 04, looks like elasticsearch does not like clusters with only one host T283013
https://logstash-beta.wmcloud.org is now receiving events from MediaWiki and others. Next steps are importing dashboards from somewhere (not sure where; beta's ELK5 is fully broken) and writing some documentation.
New issue: /var/log/logstash/logstash-json.log keeps filling up rather quickly with various events, which frequently runs the logstash nodes out of disk space on the root partition and freezes everything. At least some of the noise is related to the "dead letter queue": wikitech says it's normally not enabled, but puppet says it is.
This has two (three) issues remaining:
- T283013#7163536, which is currently worked around with a cronjob on deployment-cumin that deletes those log files every other hour and restarts the logstash service
- MediaWiki logstash indexing errors ("Field [timestamp] of type [text] does not support custom formats")
- Things that try to ingest via GELF
- Docker/Kubernetes services have something like this in production: https://github.com/wikimedia/puppet/blob/be4419c236fc9a2ebca8fc275518e055863e6759/modules/profile/manifests/kubernetes/node.pp#L27; hopefully we can reuse that, but with the path changed to something like /var/lib/docker/containers/*/*-json.log
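For the first remaining item, a logrotate rule on the logstash hosts themselves might eventually replace the deployment-cumin cronjob; a minimal sketch (only the log path comes from this task, the rotation parameters are assumptions):

```
# /etc/logrotate.d/logstash-json (hypothetical)
/var/log/logstash/logstash-json.log {
    # requires logrotate to be invoked hourly (e.g. from cron.hourly)
    hourly
    rotate 2
    maxsize 500M
    compress
    missingok
    notifempty
    # truncate in place so the logstash service doesn't need a restart
    copytruncate
}
```

copytruncate avoids having to restart logstash on every rotation, at the cost of possibly losing a few lines written during the truncate.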