
Half a million CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02
Open, High · Public · BUG REPORT

Description

I've noticed that since around 16:02-16:06 yesterday there has been a large amount of extra error logs (a 50% increase in logged messages compared to the baseline).

Screenshot from 2021-10-01 14-24-53.png (143 KB)

They appear to be coming from failed CirrusSearch jobs: "ElasticaWrite job failed" and "Failed executing job: cirrusSearchElasticaWrite".

I first reported this at T48643#7394374, but since it didn't match the affected wikis I was told it was unrelated to Wikidata. The fact that it is almost exclusively CirrusSearch jobs confirms that.

This is the jobqueue channel at that time:
https://logstash.wikimedia.org/goto/263ff9c2e9bdb0668b0864f9cebdb0d5

Screenshot from 2021-10-01 14-01-33.png (166 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse triaged this task as High priority. (Edited · Fri, Oct 1, 12:43 PM)
dcausse added a subscriber: dcausse.

From an MW app server I can't connect to any of the envoy listeners (ports 6105, 6106, 6107) that are used to connect to the cloudelastic elasticsearch cluster.
The connectivity problem seems to have started around 2021-09-30T16:00:00. The target cluster itself appears sane and can be reached from global-search, the main tool it's used for (https://global-search.toolforge.org/?q=test&namespaces=&title=).
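
For reference, a probe along these lines is enough to confirm the symptom from an app server. This is only a sketch: it assumes the listeners bind on localhost and that a plain TCP connect is enough to distinguish a working listener from a broken one.

```python
#!/usr/bin/env python3
# Connectivity probe sketch for the local envoy listeners mentioned above.
# Assumptions (not stated in the task): the listeners bind on localhost and a
# plain TCP connect is enough to tell a working listener from a broken one.
import socket

ENVOY_PORTS = [6105, 6106, 6107]  # cloudelastic listeners per the comment above

def probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"{host}:{port} unreachable: {exc}")
        return False

if __name__ == "__main__":
    for port in ENVOY_PORTS:
        status = "OK" if probe("localhost", port) else "FAILED"
        print(f"localhost:{port} -> {status}")
```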

Mentioned in SAL (#wikimedia-operations) [2021-10-01T14:04:11Z] <bblack> C:envoyproxy (appservers and others): ca-certificates updated via cumin to workaround T292291 issues

Errors seem to have receded a lot since 14:05:

Screenshot from 2021-10-01 16-30-45.png (57 KB)

Recapping from an IRC conversation: this was fallout of the great Let's Encrypt "DST Root CA X3" expiry event yesterday. cloudelastic.wikimedia.org uses an LE cert, and the envoyproxy on the appservers uses a bundled copy of BoringSSL combined with the system's CA trust store to check the cert on it; that combination doesn't work when the LE-using server serves the legacy-compatibility (default) chain and the expired DST root still exists in the root store.
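
For illustration, a check of this kind reproduces the class of failure against the system trust store. Sketch only: the port is an assumption, and Python's OpenSSL typically builds the short ISRG Root X1 path and succeeds even where the BoringSSL build described above fails, so a success here does not rule out the envoy-side problem.

```python
#!/usr/bin/env python3
# Hedged reproduction sketch: verify the certificate chain served by the
# LE-using host against the system CA store. The port is an assumption.
# Python's OpenSSL usually builds the short "ISRG Root X1" path and succeeds,
# so success here does not rule out the BoringSSL-side failure described above.
import socket
import ssl

HOST = "cloudelastic.wikimedia.org"  # host named in the recap above
PORT = 443                           # assumption; the real service port may differ

context = ssl.create_default_context()  # uses the system trust store
try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            subject = dict(item[0] for item in tls.getpeercert()["subject"])
            print("verification OK, subject:", subject)
except ssl.SSLCertVerificationError as exc:
    print("verification FAILED:", exc)
```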

The workaround via cumin was to deselect the expired root certificate in the config of the system root store (/etc/ca-certificates.conf), run update-ca-certificates, and then reload the envoyproxy service. This workaround was applied to all envoyproxy hosts and should hold for now (it seems to even survive updates of the upstream ca-certificates package), but it is currently un-puppetized.
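
The steps of that workaround, roughly sketched below; the store entry name and the service unit name are assumptions and may differ from what cumin actually ran.

```python
#!/usr/bin/env python3
# Sketch of the manual workaround described above (normally pushed via cumin).
# Assumptions: the Debian store entry is named "mozilla/DST_Root_CA_X3.crt" and
# the envoy service unit is called "envoyproxy"; adjust both if they differ.
import subprocess

CONF = "/etc/ca-certificates.conf"
EXPIRED_ENTRY = "mozilla/DST_Root_CA_X3.crt"  # assumed entry name

def deselect_expired_root() -> None:
    """Prefix the expired root's line with '!' (Debian's deselect marker)."""
    with open(CONF) as f:
        lines = f.readlines()
    updated = ["!" + line if line.strip() == EXPIRED_ENTRY else line for line in lines]
    if updated != lines:
        with open(CONF, "w") as f:
            f.writelines(updated)

if __name__ == "__main__":
    deselect_expired_root()
    subprocess.run(["update-ca-certificates"], check=True)             # rebuild the store
    subprocess.run(["systemctl", "reload", "envoyproxy"], check=True)  # pick up the new store
```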

We may or may not want to puppetize this, for this case or even more broadly for all of production, but we can have that debate on Monday. For now I'll upload a prototype puppet patch that would persist this change going forward, for just the envoyproxy case so far.

Change 725331 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] sslcert::ca_deselect_dstx3 for envoyproxy

https://gerrit.wikimedia.org/r/725331

For the longer term, I also wonder whether there is something we could add to monitoring that would have forced this issue to surface earlier, as it seemed to be a hard-failure situation rather than a degradation (even if concentrated on a specific point, with probably low to unnoticeable user impact).

The TLS issue seems concrete enough that it is unlikely to happen again (I could be wrong; experts should say), but maybe we could monitor something related to the error rate of job execution or a connectivity check. E.g. monitoring the error-rate graph from the description, or this (mean backlog time):

https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?viewPanel=5&orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=1632929081576&to=1633101881576
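
Roughly the kind of threshold check that could have surfaced this earlier, sketched against the metric behind that panel. The Prometheus endpoint and the metric name below are placeholders, not the real ones used in production.

```python
#!/usr/bin/env python3
# Hedged sketch of a threshold check on the metric behind the Grafana panel
# linked above. The Prometheus endpoint and the query below are placeholders,
# not the real names used in production.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.org"  # placeholder endpoint
QUERY = 'cirrus_backlog_seconds{job="cirrusSearchElasticaWrite"}'  # hypothetical metric
THRESHOLD_SECONDS = 600  # alert if mean backlog time exceeds 10 minutes

def query_prometheus(expr: str) -> list:
    """Run an instant query against the standard Prometheus HTTP API."""
    url = f"{PROMETHEUS}/api/v1/query?query={urllib.parse.quote(expr)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return payload.get("data", {}).get("result", [])

if __name__ == "__main__":
    for series in query_prometheus(QUERY):
        value = float(series["value"][1])  # instant vector sample: [timestamp, value]
        if value > THRESHOLD_SECONDS:
            print(f"ALERT: backlog {value:.0f}s exceeds {THRESHOLD_SECONDS}s")
```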

Update on the ca-certificates end of this: Debian has a patch that will correct this at their own level at https://salsa.debian.org/debian/ca-certificates/-/commit/5b83fd984706ea03101dbb011846e60364c3a149 - but we don't yet know whether it will be released as buster and/or bullseye updates. We're stalling a little on this before moving forward with the puppet-based solution.

Would it make sense for us to cherry-pick that and upload our own ca-certificates to apt.wm.o in the meantime?