Since late January 30th, there has been a significantly elevated level of failures for CirrusSearch-related jobs on jobrunners.
Errors vary but seem to all revolve around being unable to connect to the elasticsearch backend on the service proxy.
This is a bit of a non-error. What happened is:
This is being fixed in the next train, which appropriately overrides the default configuration with the prod configuration.
The train rollout hasn't fixed this issue and we're getting alerts every hour for error spikes on jobrunners. This is beginning to hide other prod issues with the jobrunners, so it is a pretty serious concern for us. Apologies, the errors in the ticket *have* been fixed; however, we're still seeing spikes of the other errors reported, and they are causing alerts to fire every two hours:
Received cirrusSearchElasticaWrite job for unwritable cluster cloudelastic
I haven't managed to track down where the "Received cirrusSearchElasticaWrite job for unwritable cluster cloudelastic" error comes from. We recently turned off writes to this cluster from mediawiki on select wikis, but something in the codebase is still trying to create writes even though it shouldn't. Needs more investigation on our side.
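For readers unfamiliar with this class of failure, here is a minimal sketch of how such an error can arise when the code that enqueues write jobs and the code that runs them disagree about whether a cluster is writable. This is not the CirrusSearch implementation: `ClusterConfig`, `enqueue_writes`, and `run_job` are hypothetical names, and the cluster list is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterConfig:
    # Hypothetical model: clusters listed as write targets, plus a read-only subset.
    write_clusters: set
    read_only: set = field(default_factory=set)

    def is_writable(self, cluster):
        return cluster in self.write_clusters and cluster not in self.read_only


def enqueue_writes(clusters, producer_cfg):
    """Producer side: one write job per cluster the producer believes is writable."""
    return [
        {"type": "cirrusSearchElasticaWrite", "cluster": c}
        for c in clusters
        if producer_cfg.is_writable(c)
    ]


def run_job(job, runner_cfg):
    """Consumer side: reject jobs aimed at a cluster the runner sees as unwritable."""
    if not runner_cfg.is_writable(job["cluster"]):
        return f"Received {job['type']} job for unwritable cluster {job['cluster']}"
    return "ok"


# Assumed scenario: the producer's view misses the read-only flag for cloudelastic,
# while the runner's view has it set.
targets = {"eqiad", "codfw", "cloudelastic"}
producer_cfg = ClusterConfig(write_clusters=targets)
runner_cfg = ClusterConfig(write_clusters=targets, read_only={"cloudelastic"})

jobs = enqueue_writes(["eqiad", "codfw", "cloudelastic"], producer_cfg)
results = [run_job(job, runner_cfg) for job in jobs]
for line in results:
    print(line)
```

In this sketch the two sides only agree once both consult the same writability check, which is the general shape of fix one would look for.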
This is kinda verging on a UBN for us as we go into the weekend because it's causing a lot of spam and it'll hide other prod error states for jobqueues.
I see that cloudelastic hosts are still defined in the partitioner config but I doubt that would cause this issue if the jobs themselves are trying to use cloudelastic. Was anything rolled out around/before 1800 or 2200 UTC yesterday that could be rolled back? Errors start around 1800 but only start happening on the 2-hour cycle around 2200. Could we reenable writes or would that cause wider breakage?
If we need them silenced, best bet is probably to re-enable the writes for these wikis. Can be done with a mediawiki-config patch.
Change 999962 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[operations/mediawiki-config@master] cirrus: Re-enable cloudelastic writes for non-testwikis
Change 999974 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[mediawiki/extensions/CirrusSearch@master] Connection: Correct read-only detection
Change 1002434 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.17] Connection: Correct read-only detection
Change 999962 abandoned by Ebernhardson:
[operations/mediawiki-config@master] cirrus: Re-enable cloudelastic writes for non-testwikis
Reason:
root cause fixed in I3d6282f6, this patch is not necessary.
Change 1002434 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.17] Connection: Correct read-only detection
Mentioned in SAL (#wikimedia-operations) [2024-02-12T22:40:37Z] <ebernhardson@deploy2002> Started scap: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]]
Mentioned in SAL (#wikimedia-operations) [2024-02-12T22:41:58Z] <ebernhardson@deploy2002> ebernhardson: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
Mentioned in SAL (#wikimedia-operations) [2024-02-12T22:49:12Z] <ebernhardson@deploy2002> Finished scap: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]] (duration: 08m 35s)
This looks resolved now; the two-hourly spikes have gone away since the Monday deployment.
It seems the patch didn't actually make it into wmf.18 as expected. jenkins-bot never finished the merge, so this was only deployed in wmf.17. I'll get it shipped there too.
Change 1003827 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.18] Connection: Correct read-only detection
Change 999974 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Connection: Correct read-only detection
This was supposed to be in the backport window today, but train problems blocked that. This is a pretty safe patch though, so I'll ship it a little later.
Change 1003827 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.18] Connection: Correct read-only detection
Mentioned in SAL (#wikimedia-operations) [2024-02-15T23:52:25Z] <thcipriani@deploy2002> Started scap: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]]
Mentioned in SAL (#wikimedia-operations) [2024-02-15T23:53:50Z] <thcipriani@deploy2002> ebernhardson and thcipriani: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
Mentioned in SAL (#wikimedia-operations) [2024-02-16T00:02:53Z] <thcipriani@deploy2002> Finished scap: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]] (duration: 10m 28s)