
High level of backend errors for CirrusSearch jobs in jobrunners
Closed, ResolvedPublic

Description

Since late on January 30th, there has been a significantly elevated rate of failures for CirrusSearch-related jobs on the jobrunners.

The errors vary, but all seem to revolve around being unable to connect to the Elasticsearch backend via the service proxy.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as High priority. Feb 5 2024, 12:30 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
EBernhardson subscribed.

This is a bit of a non-error. What happened is:

  • A patch rode the train that now runs the Saneitizer on all clusters, including those that are not declared writable
  • CirrusSearch defines a default cluster that points at localhost
  • Apparently the prod configuration was being merged with the default list of clusters rather than overriding it, which means the default cluster still exists

This is being fixed in the next train, which correctly overrides the default configuration with the prod configuration.
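For illustration, a minimal sketch of the merge-vs-override distinction, assuming a cluster list shaped like CirrusSearch's $wgCirrusSearchClusters (the hostnames and keys here are simplified placeholders, not the actual prod config):

    <?php
    // Extension default: a single cluster pointing at localhost.
    $wgCirrusSearchClusters = [
        'default' => [ 'localhost' ],
    ];

    // Prod configuration names the real clusters.
    $prodClusters = [
        'eqiad' => [ 'search.svc.eqiad.wmnet' ],
        'codfw' => [ 'search.svc.codfw.wmnet' ],
    ];

    // Buggy behavior: merging keeps the localhost 'default' entry around,
    // so jobs could be dispatched to a backend that doesn't exist on the
    // jobrunners.
    $wgCirrusSearchClusters = array_merge( $wgCirrusSearchClusters, $prodClusters );

    // Fixed behavior: the prod configuration replaces the default list outright.
    $wgCirrusSearchClusters = $prodClusters;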

The train rollout hasn't fixed this issue, and we're getting alerts for error spikes on the jobrunners every hour. This is beginning to hide other prod issues with the jobrunners, so it's a pretty serious concern for us.

Apologies, the errors in the ticket *have* been fixed; however, we're still seeing spikes of the other errors reported, and they are causing alerts to fire every two hours:

Received cirrusSearchElasticaWrite job for unwritable cluster cloudelastic

I haven't managed to track down where the Received cirrusSearchElasticaWrite job for unwritable cluster cloudelastic error comes from. We recently turned off writes to this cluster from mediawiki on select wikis, but something in the codebase is still trying to create writes even though it shouldn't. Needs more investigation on our side.

This is verging on a UBN (Unbreak Now!) for us as we go into the weekend: it's causing a lot of spam, and it'll hide other prod error states for the jobqueues.

I see that cloudelastic hosts are still defined in the partitioner config, but I doubt that would cause this issue if the jobs themselves are trying to use cloudelastic. Was anything rolled out around/before 1800 or 2200 UTC yesterday that could be rolled back? Errors start around 1800 but only begin happening on the 2-hour cycle around 2200. Could we re-enable writes, or would that cause wider breakage?

If we need them silenced, the best bet is probably to re-enable the writes for these wikis. That can be done with a mediawiki-config patch.
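A minimal sketch of what such a mediawiki-config patch could look like, assuming writes are gated by a per-cluster write list (CirrusSearch's $wgCirrusSearchWriteClusters; the per-wiki gating in operations/mediawiki-config is simplified away here):

    <?php
    // Hypothetical sketch: add cloudelastic back to the set of clusters
    // that receive writes, so cirrusSearchElasticaWrite jobs for it are
    // accepted again instead of being rejected as unwritable.
    $wgCirrusSearchWriteClusters = [ 'eqiad', 'codfw', 'cloudelastic' ];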

Change 999962 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Re-enable cloudelastic writes for non-testwikis

https://gerrit.wikimedia.org/r/999962

Change 999974 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/999974
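For context, a hedged sketch of the kind of check "Correct read-only detection" implies: decide writability from the configured write list rather than assuming every configured cluster is writable. The function and variable names below are illustrative, not the actual CirrusSearch Connection API:

    <?php
    // Illustrative only: is $cluster allowed to receive writes?
    function isWritableCluster( string $cluster, ?array $writeClusters ): bool {
        // A null write list conventionally means "all clusters are writable".
        if ( $writeClusters === null ) {
            return true;
        }
        return in_array( $cluster, $writeClusters, true );
    }

    // Callers skip unwritable clusters instead of enqueueing jobs that the
    // jobrunner will later reject with "unwritable cluster" errors.
    $writeClusters = [ 'eqiad', 'codfw' ]; // cloudelastic writes disabled
    foreach ( [ 'eqiad', 'codfw', 'cloudelastic' ] as $cluster ) {
        if ( !isWritableCluster( $cluster, $writeClusters ) ) {
            continue;
        }
        // ... enqueue the cirrusSearchElasticaWrite job for $cluster ...
    }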

Change 1002434 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.17] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/1002434

Change 999962 abandoned by Ebernhardson:

[operations/mediawiki-config@master] cirrus: Re-enable cloudelastic writes for non-testwikis

Reason:

root cause fixed in I3d6282f6, this patch is not necessary.

https://gerrit.wikimedia.org/r/999962

Change 1002434 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.17] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/1002434

Mentioned in SAL (#wikimedia-operations) [2024-02-12T22:40:37Z] <ebernhardson@deploy2002> Started scap: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-12T22:41:58Z] <ebernhardson@deploy2002> ebernhardson: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-12T22:49:12Z] <ebernhardson@deploy2002> Finished scap: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]] (duration: 08m 35s)

This looks resolved now; the bi-hourly spikes have gone away since the Monday deployment.

It seems the patch didn't actually make it into wmf.18 as expected; jenkins-bot never finished the merge, so this was only deployed to wmf.17. I'll get it shipped there too.

Change 1003827 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.18] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/1003827

Change 999974 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/999974

This was supposed to go out in the backport window today, but train problems blocked that. It's a pretty safe patch though, so I'll ship it a little later.

Change 1003827 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.18] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/1003827

Mentioned in SAL (#wikimedia-operations) [2024-02-15T23:52:25Z] <thcipriani@deploy2002> Started scap: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-15T23:53:50Z] <thcipriani@deploy2002> ebernhardson and thcipriani: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-16T00:02:53Z] <thcipriani@deploy2002> Finished scap: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]] (duration: 10m 28s)