
Address false negatives in Elasticsearch cross-cluster monitoring checks
Closed, Resolved (Public)

Description

The Elasticsearch cluster monitoring Python script always reports success, regardless of the state of the cluster.

We discovered this while replacing servers.

Creating this ticket to debug the script, add tests, etc.

Event Timeline

bking renamed this task from Address false negatives in cross-cluster monitoring checks to Address false negatives in Elasticsearch cross-cluster monitoring checks. Feb 10 2022, 9:03 PM
bking updated the task description.
bking updated the task description.

Brian and I paired on this and poked around with some debug output.

Ultimately, it turns out this check has never been able to detect drift, because Python's list.sort() sorts the list in place and returns None rather than the sorted list. So this line was really just evaluating None == None: https://github.com/wikimedia/puppet/blob/f3920513905975e7bce93b21c59f5e30fc3eb5b3/modules/icinga/files/elasticsearch/check_cirrus_settings.py#L63
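For anyone skimming, here is a minimal sketch of that failure mode. The seed values and variable names are made up for illustration and are not taken from check_cirrus_settings.py, and using sorted() is just one way to fix it, not necessarily what the patch does.

```python
# Hypothetical seed lists; they deliberately differ.
expected = ["b.example:9300", "a.example:9300"]
actual = ["c.example:9300", "a.example:9300"]

# Buggy comparison: list.sort() sorts in place and returns None,
# so this evaluates None == None regardless of the list contents.
buggy_match = expected.sort() == actual.sort()

# Order-insensitive comparison that actually looks at the contents:
# sorted() returns a new sorted list.
fixed_match = sorted(expected) == sorted(actual)

print(buggy_match)  # True, even though the lists differ
print(fixed_match)  # False
```

The buggy form can never report a mismatch, which matches the "always reports success" behaviour in the task description.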

Change 761721 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: fix cirrus settings check false negative

https://gerrit.wikimedia.org/r/761721

Change 761721 merged by Ryan Kemper:

[operations/puppet@production] elastic: fix cirrus settings check false negative

https://gerrit.wikimedia.org/r/761721

Reverted https://gerrit.wikimedia.org/r/761721 last week. The fix was good but revealed other underlying issues with the script that are generating noise. Interestingly [at least some of] the failing checks haven't resolved since the revert.

Also, here are the two failure types I was seeing last week that we need to address before reverting the revert:

CRITICAL - $.(cluster|search).remote.psi.seeds not found

CRITICAL - ['elastic2038.codfw.wmnet:9500', 'elastic2042.codfw.wmnet:9500', 'elastic2047.codfw.wmnet:9500'] does not match ['elastic2042.codfw.wmnet:9500', 'elastic2047.codfw.wmnet:9500', 'elastic2052.codfw.wmnet:9500'] for $.(cluster|search).remote.omega.seeds,['elastic2027.codfw.wmnet:9700', 'elastic2029.codfw.wmnet:9700', 'elastic2048.codfw.wmnet:9700'] does not match ['elastic2027.codfw.wmnet:9700', 'elastic2029.codfw.wmnet:9700', 'elastic2049.codfw.wmnet:9700'] for $.(cluster|search).remote.psi.seeds

python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9243/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst
python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9443/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst
python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9643/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst

Fixed the eqiad alerts

I'd messed up the omega vs psi stuff in https://phabricator.wikimedia.org/T294805#7701855

Change 772893 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "elastic: fix cirrus settings check false negative"

https://gerrit.wikimedia.org/r/772893

Change 772893 merged by Ryan Kemper:

[operations/puppet@production] elastic: fix cirrus settings check false negative

https://gerrit.wikimedia.org/r/772893

Change 772894 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: fix cirrus settings check false negative

https://gerrit.wikimedia.org/r/772894

Change 772894 merged by Ryan Kemper:

[operations/puppet@production] elastic: fix cirrus settings check false negative

https://gerrit.wikimedia.org/r/772894

Mentioned in SAL (#wikimedia-operations) [2022-03-22T21:35:27Z] <ryankemper> T301511 Fixed elastic* eqiad cross-cluster search settings (see https://phabricator.wikimedia.org/T301511#7798267) to resolve the ElasticSearch setting check alerts in eqiad

Mentioned in SAL (#wikimedia-operations) [2022-03-22T22:09:23Z] <ryankemper> T301511 Forcing recheck of codfw cirrus setting check

Fixing codfw settings:

python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9243/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9443/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9643/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst

with the file contents like so:

ryankemper@mwmaint1002:~/elastic$ cat chi_codfw_masters.lst
elastic2025.codfw.wmnet:9300
elastic2031.codfw.wmnet:9300
elastic2042.codfw.wmnet:9300

ryankemper@mwmaint1002:~/elastic$ cat omega_codfw_masters.lst
elastic2042.codfw.wmnet:9500
elastic2047.codfw.wmnet:9500
elastic2052.codfw.wmnet:9500

ryankemper@mwmaint1002:~/elastic$ cat psi_codfw_masters.lst
elastic2027.codfw.wmnet:9700
elastic2029.codfw.wmnet:9700
elastic2049.codfw.wmnet:9700
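
As a rough sanity check before pushing, a seeds file can be compared against the list reported in the alert. This is only a sketch: the helper names (parse_seed_file, seed_drift, remote_seeds_body) are hypothetical and not part of push_cross_cluster_conf.py, the "live" list is assumed to be the first list in the omega alert quoted earlier, and the payload assumes Elasticsearch's `cluster.remote.<alias>.seeds` setting, which may not be the key the real script writes.

```python
import json


def parse_seed_file(text):
    """Return the non-empty, stripped lines of a seeds file as a list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def seed_drift(live, wanted):
    """Compare two seed lists ignoring order; return (missing, unexpected)."""
    live_set, wanted_set = set(live), set(wanted)
    return sorted(wanted_set - live_set), sorted(live_set - wanted_set)


def remote_seeds_body(alias, seeds):
    """Build a persistent cluster-settings body for one remote alias (assumed key)."""
    return {"persistent": {f"cluster.remote.{alias}.seeds": seeds}}


# Contents of omega_codfw_masters.lst as shown above.
omega_file = """\
elastic2042.codfw.wmnet:9500
elastic2047.codfw.wmnet:9500
elastic2052.codfw.wmnet:9500
"""
wanted = parse_seed_file(omega_file)

# First list from the omega alert earlier in this task, assumed to be the live value.
live = [
    "elastic2038.codfw.wmnet:9500",
    "elastic2042.codfw.wmnet:9500",
    "elastic2047.codfw.wmnet:9500",
]

missing, unexpected = seed_drift(live, wanted)
print("missing from live:", missing)
print("unexpected in live:", unexpected)
print(json.dumps(remote_seeds_body("omega", wanted), indent=2))
```

Comparing as sets sidesteps ordering entirely, so a reordered but otherwise identical seed list would not be flagged.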

Mentioned in SAL (#wikimedia-operations) [2022-03-22T22:20:56Z] <ryankemper> T301511 Mutated cirrus codfw cluster settings to what [I think] they should be, see https://phabricator.wikimedia.org/T301511#7798415; forcing re-check

RKemper changed the task status from Open to In Progress. Mar 22 2022, 10:24 PM
RKemper claimed this task.
RKemper triaged this task as Medium priority.
RKemper moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board.