
Half a million CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02
Closed, ResolvedPublicBUG REPORT

Description

I've noticed that since around 16:02-16:06 yesterday there has been a large volume of extra error logs (a 50% increase in logged messages compared to the baseline).

Screenshot from 2021-10-01 14-24-53.png

They appear to be coming from failed CirrusSearch jobs: "ElasticaWrite job failed" and "Failed executing job: cirrusSearchElasticaWrite".

I first reported it at T48643#7394374, but as the affected wikis didn't match, I was told it was unrelated to Wikidata. The fact that the errors are almost exclusively from CirrusSearch jobs confirms that.

This is the jobqueue channel at that time:
https://logstash.wikimedia.org/goto/263ff9c2e9bdb0668b0864f9cebdb0d5

Screenshot from 2021-10-01 14-01-33.png

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse triaged this task as High priority. (Edited) Oct 1 2021, 12:43 PM
dcausse subscribed.

From an MW app server I can't connect to any of the envoy listeners (ports 6105, 6106, 6107) that are used to connect to the cloudelastic elasticsearch cluster.
The connectivity problem seems to have started at 2021-09-30T16:00:00. The target cluster appears sane and can be reached from global-search, the main thing it's used for (https://global-search.toolforge.org/?q=test&namespaces=&title=).
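For reference, a quick manual probe of those listeners from an app server could look something like the sketch below. The ports come from the comment above; the request path and the script itself are illustrative, not the actual check that was run:

```
# Illustrative probe of the local envoy listeners for cloudelastic.
# Ports are taken from the comment above; everything else is a sketch.
import http.client

ENVOY_PORTS = (6105, 6106, 6107)

for port in ENVOY_PORTS:
    conn = http.client.HTTPConnection("localhost", port, timeout=5)
    try:
        # Any request will do; we only care whether the listener answers at all.
        conn.request("GET", "/")
        resp = conn.getresponse()
        print(f"port {port}: HTTP {resp.status}")
    except OSError as exc:
        print(f"port {port}: connection failed ({exc})")
    finally:
        conn.close()
```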

Mentioned in SAL (#wikimedia-operations) [2021-10-01T14:04:11Z] <bblack> C:envoyproxy (appservers and others): ca-certificates updated via cumin to workaround T292291 issues

Errors seem to have receded a lot since 14:05:

Screenshot from 2021-10-01 16-30-45.png

Recapping from an IRC conversation: this was fallout from the great Let's Encrypt "DST Root CA X3" expiry event yesterday. cloudelastic.wikimedia.org uses an LE cert, and the envoyproxy on the appservers uses a bundled copy of BoringSSL combined with the system's CA trust store to verify it; that verification fails when the LE-using server serves the legacy-compatibility (default) chain and the expired DST root is still present in the root store.
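To make the failure mode concrete, the sketch below attempts a TLS handshake against the server using the system trust store. Whether it fails depends on the verifier: the BoringSSL build bundled with envoy rejected the default LE chain while the expired root was still in the store, whereas other TLS stacks may build an alternate path and succeed. The host name comes from the comment above; the port and the script itself are placeholders:

```
# Illustrative only: try verifying cloudelastic's certificate chain against the
# system trust store. The envoy/BoringSSL verifier on the appservers failed this
# while the expired "DST Root CA X3" was present; Python's OpenSSL-based verifier
# may still succeed by building an alternate chain, so treat this as a sketch of
# the failure mode, not a reliable reproducer.
import socket
import ssl

HOST = "cloudelastic.wikimedia.org"
PORT = 443  # placeholder; the actual service port may differ

ctx = ssl.create_default_context()  # system CA bundle on Debian
try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            print("handshake OK:", tls.version())
except ssl.SSLCertVerificationError as exc:
    print("verification failed:", exc)
```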

The workaround via cumin was to deselect the expired root certificate in the system root store's config (/etc/ca-certificates.conf), run update-ca-certificates, and then reload the envoyproxy service. This workaround was applied to all envoyproxy hosts and should hold on these hosts for now (it seems to even survive updates of the upstream ca-certificates package), but it is currently unpuppetized.
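For illustration, the per-host effect of that workaround is roughly the sketch below. The real change was pushed fleet-wide with cumin rather than a script, the entry name follows Debian's usual naming for that root, and the "envoyproxy" systemd unit name is an assumption:

```
# Rough sketch of the per-host workaround described above; the actual change
# was applied with cumin. The systemd unit name "envoyproxy" is assumed.
import subprocess
from pathlib import Path

CONF = Path("/etc/ca-certificates.conf")
EXPIRED_ROOT = "mozilla/DST_Root_CA_X3.crt"

# In /etc/ca-certificates.conf a leading "!" marks a certificate as deselected.
lines = CONF.read_text().splitlines()
lines = ["!" + line if line.strip() == EXPIRED_ROOT else line for line in lines]
CONF.write_text("\n".join(lines) + "\n")

# Rebuild the system trust store without the expired root, then reload envoy
# so it re-reads the updated bundle.
subprocess.run(["update-ca-certificates"], check=True)
subprocess.run(["systemctl", "reload", "envoyproxy"], check=True)
```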

We may or may not want to puppetize this, for this case or even more broadly for all of production, but we can have that debate on Monday. For now I'll upload a prototype puppet patch that would persist this change going forward, for just the envoyproxy case so far.

Change 725331 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] sslcert::ca_deselect_dstx3 for envoyproxy

https://gerrit.wikimedia.org/r/725331

Looking further ahead, I also wonder whether there is something we could add to monitoring that would have forced this issue to surface earlier, as it seemed to be a hard-failure situation, not a degradation (even if concentrated on a specific point, with probably low to unnoticeable user impact).

The TLS issue seems specific enough that it is unlikely to happen again (I could be wrong; experts can say), but maybe something related to the error rate of job execution or a connectivity check would help, e.g. monitoring the error rate graph from the description, or this (mean backlog time):

https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?viewPanel=5&orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=1632929081576&to=1633101881576
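As a sketch of what such a check might look like, the snippet below polls a Prometheus-style API for the failure rate of cirrusSearchElasticaWrite jobs. The Prometheus URL, metric name, labels, and threshold are hypothetical placeholders, not the real jobqueue metrics:

```
# Illustrative only: alert on a sustained jump in failed cirrusSearchElasticaWrite
# jobs via the Prometheus HTTP API. URL, metric, and labels are placeholders.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.org:9090"  # placeholder
QUERY = (
    'sum(rate(mediawiki_job_insertions_failed_total'   # hypothetical metric
    '{job_type="cirrusSearchElasticaWrite"}[15m]))'
)

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)["data"]["result"]

failure_rate = float(result[0]["value"][1]) if result else 0.0
if failure_rate > 10:  # threshold would need tuning against the baseline
    print(f"ALERT: cirrusSearchElasticaWrite failures at {failure_rate:.1f}/s")
```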

Update on the ca-certificates end of this: Debian has a patch that will correct this at their own level at https://salsa.debian.org/debian/ca-certificates/-/commit/5b83fd984706ea03101dbb011846e60364c3a149 - but we don't yet know if this will be released for buster and/or bullseye updates. Stalling out a little on this before we move forward with the puppet-based solution.

Would it make sense for us to cherry-pick that and upload our own ca-certificates to apt.wm.o in the meantime?

Change 735599 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] sslcert: introduce ca_deselect_dstx3

https://gerrit.wikimedia.org/r/735599

Change 735599 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] sslcert: introduce ca_deselect_dstx3

https://gerrit.wikimedia.org/r/735599

I've rebased https://gerrit.wikimedia.org/r/c/operations/puppet/+/725331 onto @aborrero's patch, so now it's just the envoy part. Upstream Debian has applied this same blacklist in a new package version, but only in unstable, and I'm not sure it will ever trickle down to stable due to policy issues.

I think the envoy part of the patch is still pending on someone (@jbond ?) sorting out the best way to factor the dependency issues already noted in the comments. We needed this fix to make the elastic problem go away, and it was applied manually to existing envoy hosts. The manual fixup seems like it will survive any package updates from upstream, but it won't be present on new/re-imaged hosts (and other important cases e.g. in containers) unless we merge some variant of this patch, so we should do that soon, probably!

Change 725331 merged by Giuseppe Lavagetto:

[operations/puppet@production] sslcert::ca_deselect_dstx3 for envoyproxy

https://gerrit.wikimedia.org/r/725331

All patches merged. Is this still an issue? Should this still remain open?

No replies by anyone, boldly closing - shrug