
High level of backend errors for CirrusSearch jobs in jobrunners
Closed, ResolvedPublic

Description

Since late on January 30th, there has been a significantly elevated rate of failures for CirrusSearch-related jobs on the jobrunners.

The errors vary, but all seem to revolve around being unable to connect to the Elasticsearch backend via the service proxy.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as High priority. Feb 5 2024, 12:30 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
EBernhardson subscribed.

This is a bit of a non-error. What happened is:

  • A patch rode the train that now runs the Saneitizer on all clusters, including those that are not declared writable
  • CirrusSearch defines a default cluster that points at localhost
  • Apparently the prod configuration was being merged with the default list of clusters rather than overriding it, which means the default cluster still exists

This is being fixed in the next train, which correctly overrides the default configuration with the prod configuration.
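For illustration, a minimal sketch of the merge-vs-override distinction, assuming a cluster list shaped like CirrusSearch's $wgCirrusSearchClusters (the hostnames and keys here are simplified placeholders, not the actual prod config):

    <?php
    // Extension default: a single cluster pointing at localhost.
    $wgCirrusSearchClusters = [
        'default' => [ 'localhost' ],
    ];

    // Prod configuration names the real clusters.
    $prodClusters = [
        'eqiad' => [ 'search.svc.eqiad.wmnet' ],
        'codfw' => [ 'search.svc.codfw.wmnet' ],
    ];

    // Buggy behavior: merging keeps the localhost 'default' entry around,
    // so jobs could be dispatched to a backend that doesn't exist on the
    // jobrunners.
    $wgCirrusSearchClusters = array_merge( $wgCirrusSearchClusters, $prodClusters );

    // Fixed behavior: the prod configuration replaces the default list outright.
    $wgCirrusSearchClusters = $prodClusters;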

The train rollout hasn't fixed this issue, and we're getting alerts for error spikes on the jobrunners every hour. This is beginning to hide other prod issues with the jobrunners, so it's a pretty serious concern for us.

Apologies, the errors in the ticket *have* been fixed; however, we're still seeing spikes of the other errors reported, and they are causing alerts to fire every two hours:

Received cirrusSearchElasticaWrite job for unwritable cluster cloudelastic

I haven't managed to track down where the Received cirrusSearchElasticaWrite job for unwritable cluster cloudelastic error comes from. We recently turned off writes to this cluster from mediawiki on select wikis, but something in the codebase is still trying to create writes even though it shouldn't. Needs more investigation on our side.

This is verging on a UBN (Unbreak Now!) for us as we go into the weekend: it's causing a lot of spam, and it'll hide other prod error states for the jobqueues.

I see that cloudelastic hosts are still defined in the partitioner config, but I doubt that would cause this issue if the jobs themselves are trying to use cloudelastic. Was anything rolled out around/before 1800 or 2200 UTC yesterday that could be rolled back? Errors start around 1800 but only begin happening on the 2-hour cycle around 2200. Could we re-enable writes, or would that cause wider breakage?

If we need them silenced, the best bet is probably to re-enable the writes for these wikis. That can be done with a mediawiki-config patch.
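A minimal sketch of what such a mediawiki-config patch could look like, assuming writes are gated by a per-cluster write list (CirrusSearch's $wgCirrusSearchWriteClusters; the per-wiki gating in operations/mediawiki-config is simplified away here):

    <?php
    // Hypothetical sketch: add cloudelastic back to the set of clusters
    // that receive writes, so cirrusSearchElasticaWrite jobs for it are
    // accepted again instead of being rejected as unwritable.
    $wgCirrusSearchWriteClusters = [ 'eqiad', 'codfw', 'cloudelastic' ];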

Change 999962 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Re-enable cloudelastic writes for non-testwikis

https://gerrit.wikimedia.org/r/999962

Change 999974 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/999974
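For context, a hedged sketch of the kind of check "Correct read-only detection" implies: decide writability from the configured write list rather than assuming every configured cluster is writable. The function and variable names below are illustrative, not the actual CirrusSearch Connection API:

    <?php
    // Illustrative only: is $cluster allowed to receive writes?
    function isWritableCluster( string $cluster, ?array $writeClusters ): bool {
        // A null write list conventionally means "all clusters are writable".
        if ( $writeClusters === null ) {
            return true;
        }
        return in_array( $cluster, $writeClusters, true );
    }

    // Callers skip unwritable clusters instead of enqueueing jobs that the
    // jobrunner will later reject with "unwritable cluster" errors.
    $writeClusters = [ 'eqiad', 'codfw' ]; // cloudelastic writes disabled
    foreach ( [ 'eqiad', 'codfw', 'cloudelastic' ] as $cluster ) {
        if ( !isWritableCluster( $cluster, $writeClusters ) ) {
            continue;
        }
        // ... enqueue the cirrusSearchElasticaWrite job for $cluster ...
    }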

Change 1002434 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.17] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/1002434

Change 999962 abandoned by Ebernhardson:

[operations/mediawiki-config@master] cirrus: Re-enable cloudelastic writes for non-testwikis

Reason:

root cause fixed in I3d6282f6, this patch is not necessary.

https://gerrit.wikimedia.org/r/999962

Change 1002434 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.17] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/1002434

Mentioned in SAL (#wikimedia-operations) [2024-02-12T22:40:37Z] <ebernhardson@deploy2002> Started scap: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-12T22:41:58Z] <ebernhardson@deploy2002> ebernhardson: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-12T22:49:12Z] <ebernhardson@deploy2002> Finished scap: Backport for [[gerrit:1002434|Connection: Correct read-only detection (T354793 T356526)]] (duration: 08m 35s)

This looks resolved now; the bi-hourly spikes have gone away since the Monday deployment.

It seems the patch didn't actually make it into wmf.18 as expected; jenkins-bot never finished the merge, so this was only deployed to wmf.17. I'll get it shipped there too.

Change 1003827 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.18] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/1003827

Change 999974 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/999974

This was supposed to go out in the backport window today, but train problems blocked that. It's a pretty safe patch though, so I'll ship it a little later.

Change 1003827 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.42.0-wmf.18] Connection: Correct read-only detection

https://gerrit.wikimedia.org/r/1003827

Mentioned in SAL (#wikimedia-operations) [2024-02-15T23:52:25Z] <thcipriani@deploy2002> Started scap: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-15T23:53:50Z] <thcipriani@deploy2002> ebernhardson and thcipriani: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-16T00:02:53Z] <thcipriani@deploy2002> Finished scap: Backport for [[gerrit:1003827|Connection: Correct read-only detection (T354793 T356526)]] (duration: 10m 28s)