CirrusSearch: Should WMF run the saneitizer at all times?
Closed, Resolved (Public)

Description

See T93040 for some discussion around this. Essentially, weird stuff happens on the cluster. It's life. The saneitizer's job is to clean up weird stuff. Should we continue to run it manually when we think that something weird happened, or just run it all the time?

Stakeholders: Cirrus operators
Benefits: This could keep the cluster in a more sane configuration. Or it could hide errors.
Estimate: A week. Probably less, but it's a cron change and those can get funky.

Event Timeline

Manybubbles raised the priority of this task from to Needs Triage.
Manybubbles updated the task description.
Manybubbles moved this task to Search on the Discovery-ARCHIVED board.
Manybubbles subscribed.

Besides whatever work is necessary to set up the sanitizer to run automatically, are there any significant negatives to the proposed change?

If we run it all the time, we might start accidentally using it to work around real bugs that should be fixed. If we don't run it all the time, we won't have automatic recovery for things like Redis outages.

Thinking about it now, I think the right thing to do is to run it all the time and graph the log file, with some alerting on the rates.
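
To make the "graph the log and alert on the rates" idea concrete, here is a minimal sketch. It assumes a hypothetical log format of `<ISO timestamp> fixed <page id>` per line, which is not the saneitizer's actual output, and the function names and threshold are illustrative only. It counts fixes per hour and reports the hours that exceed a threshold, which is the number one would graph and alert on.

```python
from collections import Counter
from datetime import datetime


# Hypothetical log format (an assumption, not CirrusSearch's real output):
#   2015-06-01T12:34:56 fixed <page id>
def fixes_per_hour(lines):
    """Count 'fixed' log entries per hour bucket, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or parts[1] != "fixed":
            continue
        try:
            ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue
        counts[ts.strftime("%Y-%m-%dT%H")] += 1
    return counts


def hours_over_threshold(counts, threshold):
    """Return the hour buckets whose fix count exceeds the threshold, sorted."""
    return sorted(hour for hour, n in counts.items() if n > threshold)
```

A sustained high fix rate would suggest a real bug being papered over, while a short spike would be consistent with recovery from something like an outage.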

Given my experience, I am in favor of running it all the time, barring any large negative consequences. (Since I reported the problem, I've run into a few instances when the search index has taken longer than it should to update, sometimes more than a day, though not recently.)

dcausse claimed this task.
dcausse subscribed.

The Saneitizer is now running at all times. See https://grafana.wikimedia.org/dashboard/db/elasticsearch?panelId=35&fullscreen for details on the fixing rate.