mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error
Closed, Resolved · Public

Description

This is causing the cirrusCheckerJob instances to fail. It looks like envoy rejects the cloudelastic TLS cert, but the exact cause is unclear.

Reproduction:

deploy2002 $ sudo mw-debug-repl testwiki
Finding a mw-debug pod in codfw...
Now running shell.php for testwiki inside pod/mw-debug.codfw.pinkunicorn-65fffb7476-dpqwm...
Psy Shell v0.11.21 (PHP 7.4.33 — cli) by Justin Hileman
> $ch = curl_init('http://localhost:6105')
= curl resource #1577

> curl_exec($ch);
upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268435581:SSLroutines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
= true

> $ch = curl_init('https://cloudelastic.wikimedia.org:9243')
= curl resource #1578

> curl_exec($ch);
{
  "name" : "cloudelastic1001-cloudelastic-chi-eqiad",
  "cluster_name" : "cloudelastic-chi-eqiad",
  "cluster_uuid" : "xwljiBZrQkyMdQmu9NDxcA",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "oss",
    "build_type" : "deb",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
= true
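
Note that curl_exec() returns true in both cases above, so the transport-level result alone cannot tell the two responses apart. A minimal Python sketch of telling them apart from the body follows. The function names are hypothetical, and the regular expression assumes the Envoy error text always has the shape quoted above, which is not guaranteed.

```python
import json
import re
from typing import Optional

# Shape of the TLS detail in the Envoy error body quoted above (assumed stable):
# "TLS error: <code>:<library>:<function>:<reason>"
_TLS_ERROR = re.compile(r"TLS error: (\d+):([^:]+):([^:]+):(\w+)")


def tls_failure_reason(body: str) -> Optional[str]:
    """Return the TLS reason string if body is an Envoy TLS error, else None."""
    match = _TLS_ERROR.search(body)
    return match.group(4) if match else None


def is_elasticsearch_banner(body: str) -> bool:
    """True only if body parses as a JSON object carrying a cluster_name field."""
    try:
        doc = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(doc, dict) and "cluster_name" in doc
```

Fed the body returned via localhost:6105 from the pod, tls_failure_reason() yields the reason string after the last colon, while is_elasticsearch_banner() accepts only the JSON banner returned by the direct request.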

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 6 2023, 7:23 PM

For comparison, envoy works fine from deploy2002 itself:

deploy2002 $ mwscript shell.php testwiki
Psy Shell v0.11.21 (PHP 7.4.33 — cli) by Justin Hileman
> $ch = curl_init('http://localhost:6105')
= curl resource #1575

> curl_exec($ch);
{
  "name" : "cloudelastic1002-cloudelastic-chi-eqiad",
  "cluster_name" : "cloudelastic-chi-eqiad",
  "cluster_uuid" : "xwljiBZrQkyMdQmu9NDxcA",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "oss",
    "build_type" : "deb",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
= true

This, combined with another bug that doesn't correctly recognize these failures, has resulted in an increase of cirrusSearchLinksUpdate from 300-500/s to around 800/s.

Cloudelastic uses acmechief for its TLS certificates, whereas most prod services probably (?) have an internally signed certificate. It seems plausible that the problem has something to do with the certs coming from acmechief (not the certs themselves, but how envoy validates them).

In my estimation an appropriate solution here is to move the cirrusCheckerJob back to the old job runners, and move it to k8s again after solving the TLS issue.

dcausse added subscribers: hnowlan, dcausse.

@hnowlan do you think we could move this job back to the old job runners as Erik suggests while this issue is getting fixed? Thanks!

Change 981282 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] cirrusSearchCheckerJob: Revert to baremetal

https://gerrit.wikimedia.org/r/981282

@dcausse, since no cirrusCheckerJob exists, I assume we are talking about cirrusSearchCheckerJob. I've uploaded a change to revert, but I'll admit I currently lack a good way to see the errors and thus to know whether the rollback worked. Do we have any?

Change 981282 merged by jenkins-bot:

[operations/deployment-charts@master] cirrusSearchCheckerJob: Revert to baremetal

https://gerrit.wikimedia.org/r/981282

@akosiaris thanks for the quick revert. The impact should be visible in the cloudelastic.fixed series in https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&viewPanel=35, which should drop to something close to 0.

I can confirm that the last deploy worked, the fixed rate for cloudelastic is back to 0, thanks!

Edit: this is just plain wrong, I was looking at the wrong pod+image

Diving into the certificate thing now. The very first theory, that ca-certificates isn't in the image, was quickly disproved:

$ docker run --entrypoint dpkg -it docker-registry.wikimedia.org/wikimedia/mediawiki-services-change-propagation:2023-10-31-075121-production -l ca-certificates
ii  ca-certificates 20230311     all          Common CA certificates

I think you need wmf-certificates updated to 0~20231120; according to debmonitor it currently has 0~20211129-1. John updated it in https://phabricator.wikimedia.org/T351653 and https://gerrit.wikimedia.org/r/c/operations/debs/wmf-certificates/+/975869/

According to debmonitor it was rolled out fleet-wide, but it seems some images were missed.
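
As a rough illustration of the comparison being made here, the two version strings can be ordered by their embedded date stamp. This Python sketch is not a Debian version comparison (it ignores epochs, revisions and the tilde ordering rules); it only assumes the wmf-certificates versions keep the `0~YYYYMMDD` shape seen above. The function names are hypothetical.

```python
import re
from datetime import date, datetime


def package_datestamp(version: str) -> date:
    """Extract the YYYYMMDD stamp from a version such as '0~20231120'."""
    match = re.search(r"~(\d{8})", version)
    if match is None:
        raise ValueError(f"no date stamp found in {version!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def is_older(installed: str, wanted: str) -> bool:
    """True if the installed version carries an earlier date stamp than wanted."""
    return package_datestamp(installed) < package_datestamp(wanted)
```

With the versions quoted above, is_older("0~20211129-1", "0~20231120") is true, i.e. the image predates the updated package.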

I don't think that's it. The validation that fails is for a cert from Let's Encrypt, not an internal WMF CA. Of course, it won't hurt to actually update the package, and it's bound to bite us in the future if we don't, so I'll upload a change for that.

However here's what I found:

Puppet:

https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/templates/services_proxy/envoy_service_cluster.yaml.erb#42

vs

deployment-charts:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/modules/mesh/configuration_1.6.0.tpl#440

That change was introduced in https://gerrit.wikimedia.org/r/#/q/Ia7bb061a0ef531b33fc0f4f254c92ea562261602 but arguably what should have been used was ca-certificates.crt and not wmf-ca-certificates.crt

I'll upload a change for that as well.
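
The difference between the two bundles can be checked outside envoy with a small probe that trusts only one CA file at a time. This is a Python sketch under stated assumptions, not what envoy itself does: probe_tls is a hypothetical helper, and the bundle paths in the usage note are assumptions (the comment above names only the file names).

```python
import socket
import ssl
from typing import Optional, Tuple


def probe_tls(host: str, port: int, cafile: str,
              timeout: float = 5.0) -> Tuple[bool, Optional[str]]:
    """Handshake with host:port trusting only the CAs found in cafile.

    Returns (True, None) when the certificate chain verifies, or
    (False, <verification message>) when verification fails. Other
    errors (missing bundle, refused connection, timeout) propagate.
    """
    # Passing cafile means the system default trust store is NOT loaded.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True, None
    except ssl.SSLCertVerificationError as exc:
        return False, exc.verify_message
```

Run once per bundle against cloudelastic.wikimedia.org:9243, for example with /etc/ssl/certs/ca-certificates.crt and with a wmf-ca-certificates.crt in the same directory (both paths assumed). If the theory above holds, only the bundle that includes the public CAs would be expected to verify the Let's Encrypt chain.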

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/8b8d9902f2bacd7589b816d929435c6a0225d898 but arguably, using

Change 981309 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] service_proxy/mesh: Bump to newer version globally

https://gerrit.wikimedia.org/r/981309

Change 981331 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] citoid: Set service_mesh version to 1.23.10-2-s4-20231203

https://gerrit.wikimedia.org/r/981331

Change 981340 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mesh: Ship new configuration templates

https://gerrit.wikimedia.org/r/981340

Change 981341 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mesh: Use ca-certificates instead of wmf-ca-certificates

https://gerrit.wikimedia.org/r/981341

Change 981331 merged by jenkins-bot:

[operations/deployment-charts@master] citoid: Set service_mesh version to 1.23.10-2-s4-20231203

https://gerrit.wikimedia.org/r/981331

Kappakayala triaged this task as Medium priority.

Change 981309 merged by Alexandros Kosiaris:

[operations/puppet@production] service_proxy/mesh: Bump to newer version globally

https://gerrit.wikimedia.org/r/981309

Change 982820 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] apertium/blubberoid: Bump mesh.configuration to latest patch level

https://gerrit.wikimedia.org/r/982820

Change 982821 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mobileapps: mesh.configuration:1.5.x to latest patch level

https://gerrit.wikimedia.org/r/982821

Change 982822 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] function-orchestrator: Bump mesh.configuration:1.4.x to latest patch level

https://gerrit.wikimedia.org/r/982822

Change 982823 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] Bump mesh.configuration:1.4.x to latest patch level

https://gerrit.wikimedia.org/r/982823

Change 981340 merged by jenkins-bot:

[operations/deployment-charts@master] mesh: Ship new configuration templates

https://gerrit.wikimedia.org/r/981340

Change 981341 merged by jenkins-bot:

[operations/deployment-charts@master] mesh: Use ca-certificates instead of wmf-ca-certificates

https://gerrit.wikimedia.org/r/981341

Change 982820 merged by jenkins-bot:

[operations/deployment-charts@master] apertium/blubberoid: Bump mesh.configuration to latest patch level

https://gerrit.wikimedia.org/r/982820

Mentioned in SAL (#wikimedia-operations) [2023-12-13T15:59:58Z] <akosiaris> upgrade apertium, bluebberoid everywhere to use the latest service_proxy image, 1.23.10-2-s4-20231203 T352906

Change 982821 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: mesh.configuration:1.5.x to latest patch level

https://gerrit.wikimedia.org/r/982821

As an update, this is proceeding well. A newer image has been rolled out to various services for testing, followed by the services related to this task. This wasn't expected to solve the issue described in this task, but it's good hygiene anyway.

The actual change that is expected to fix this has been uploaded and merged and has been deployed today for a few services without any issues.

Change 982822 merged by jenkins-bot:

[operations/deployment-charts@master] function-orchestrator: Bump mesh.configuration:1.6.x to latest patch level

https://gerrit.wikimedia.org/r/982822

Change 982823 merged by jenkins-bot:

[operations/deployment-charts@master] Bump mesh.configuration:1.4.x to latest patch level

https://gerrit.wikimedia.org/r/982823

Mentioned in SAL (#wikimedia-operations) [2023-12-14T09:24:47Z] <akosiaris> update all the other services. T352906

Change 982843 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] Revert "cirrusSearchCheckerJob: Revert to baremetal"

https://gerrit.wikimedia.org/r/982843

Mentioned in SAL (#wikimedia-operations) [2023-12-14T16:24:15Z] <akosiaris> updates of all wikikube services done T352906

Change 982843 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "cirrusSearchCheckerJob: Revert to baremetal"

https://gerrit.wikimedia.org/r/982843

Final update: I've just reverted the "stop the bleeding" patch, and cirrusSearchCheckerJob jobs are now sent to MW-on-K8s again. I'll be monitoring the grafana panel provided above and then resolve this.