The Elasticsearch cluster monitoring Python script always reports success, regardless of the state of the cluster.
We discovered this while replacing servers.
Creating this ticket to debug the script, add tests, etc.
Brian and I paired on this and poked around with some debug output.
Ultimately, it turns out this check has never detected drift, because Python's list.sort() sorts the list in place and returns None rather than the sorted list. So this line was really just comparing None == None: https://github.com/wikimedia/puppet/blob/f3920513905975e7bce93b21c59f5e30fc3eb5b3/modules/icinga/files/elasticsearch/check_cirrus_settings.py#L63
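For anyone following along, here is a minimal sketch of the bug pattern and one way to avoid it. The function names are hypothetical and this is not the actual code from check_cirrus_settings.py; it only illustrates the difference between list.sort() and sorted().

```python
def seeds_match_buggy(expected, actual):
    # list.sort() sorts in place and returns None,
    # so this always evaluates None == None, i.e. True.
    return expected.sort() == actual.sort()


def seeds_match(expected, actual):
    # sorted() returns a new sorted list and leaves the inputs untouched,
    # so this is a real order-insensitive comparison.
    return sorted(expected) == sorted(actual)


expected = ["elastic2042.codfw.wmnet:9500", "elastic2047.codfw.wmnet:9500"]
actual = ["elastic2038.codfw.wmnet:9500", "elastic2042.codfw.wmnet:9500"]

print(seeds_match_buggy(list(expected), list(actual)))  # True, even though the lists differ
print(seeds_match(expected, actual))                    # False
```

A unit test that feeds the check two deliberately different lists would have caught this, which is probably the first test worth adding.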
Change 761721 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] elastic: fix cirrus settings check false negative
Change 761721 merged by Ryan Kemper:
[operations/puppet@production] elastic: fix cirrus settings check false negative
Reverted https://gerrit.wikimedia.org/r/761721 last week. The fix was good but revealed other underlying issues with the script that are generating noise. Interestingly [at least some of] the failing checks haven't resolved since the revert.
Also, here are the two failure types I was seeing last week that we need to address before reverting the revert:
CRITICAL - $.(cluster|search).remote.psi.seeds not found
CRITICAL - ['elastic2038.codfw.wmnet:9500', 'elastic2042.codfw.wmnet:9500', 'elastic2047.codfw.wmnet:9500'] does not match ['elastic2042.codfw.wmnet:9500', 'elastic2047.codfw.wmnet:9500', 'elastic2052.codfw.wmnet:9500'] for $.(cluster|search).remote.omega.seeds,['elastic2027.codfw.wmnet:9700', 'elastic2029.codfw.wmnet:9700', 'elastic2048.codfw.wmnet:9700'] does not match ['elastic2027.codfw.wmnet:9700', 'elastic2029.codfw.wmnet:9700', 'elastic2049.codfw.wmnet:9700'] for $.(cluster|search).remote.psi.seeds
python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9243/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst
python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9443/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst
python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9643/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst
Fixed the eqiad alerts
I'd messed up the omega vs psi stuff in https://phabricator.wikimedia.org/T294805#7701855
Change 772893 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] Revert "elastic: fix cirrus settings check false negative"
Change 772893 merged by Ryan Kemper:
[operations/puppet@production] elastic: fix cirrus settings check false negative
Change 772894 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] elastic: fix cirrus settings check false negative
Change 772894 merged by Ryan Kemper:
[operations/puppet@production] elastic: fix cirrus settings check false negative
Mentioned in SAL (#wikimedia-operations) [2022-03-22T21:35:27Z] <ryankemper> T301511 Fixed elastic* eqiad cross-cluster search settings (see https://phabricator.wikimedia.org/T301511#7798267) to resolve the ElasticSearch setting check alerts in eqiad
Mentioned in SAL (#wikimedia-operations) [2022-03-22T22:09:23Z] <ryankemper> T301511 Forcing recheck of codfw cirrus setting check
Fixing codfw settings:
python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9243/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9443/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9643/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
with the file contents like so:
ryankemper@mwmaint1002:~/elastic$ cat chi_codfw_masters.lst
elastic2025.codfw.wmnet:9300
elastic2031.codfw.wmnet:9300
elastic2042.codfw.wmnet:9300
ryankemper@mwmaint1002:~/elastic$ cat omega_codfw_masters.lst
elastic2042.codfw.wmnet:9500
elastic2047.codfw.wmnet:9500
elastic2052.codfw.wmnet:9500
ryankemper@mwmaint1002:~/elastic$ cat psi_codfw_masters.lst
elastic2027.codfw.wmnet:9700
elastic2029.codfw.wmnet:9700
elastic2049.codfw.wmnet:9700
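Since an omega/psi mix-up is easy to make by hand, a small sanity check on the seed files before pushing might help. The sketch below is only an illustration: the port-per-cluster mapping is inferred from the file contents above (chi on 9300, omega on 9500, psi on 9700), and the function names are hypothetical, not part of push_cross_cluster_conf.py.

```python
# Assumed mapping, inferred from the seed files above; adjust if it is wrong.
EXPECTED_PORTS = {"chi": 9300, "omega": 9500, "psi": 9700}


def parse_seed_list(text):
    # Seed files contain whitespace-separated host:port entries.
    return text.split()


def misplaced_seeds(cluster, seeds, expected_ports=EXPECTED_PORTS):
    # Return every seed whose port does not match the port expected for this cluster.
    expected = str(expected_ports[cluster])
    bad = []
    for seed in seeds:
        _host, _sep, port = seed.rpartition(":")
        if port != expected:
            bad.append(seed)
    return bad


omega = parse_seed_list(
    "elastic2042.codfw.wmnet:9500\n"
    "elastic2047.codfw.wmnet:9500\n"
    "elastic2052.codfw.wmnet:9500\n"
)
print(misplaced_seeds("omega", omega))  # [] when every seed uses the omega port
print(misplaced_seeds("psi", omega))    # all three seeds, since they use the omega port
```

This only catches a file assigned to the wrong cluster name; it does not verify that the listed hosts are actually master-eligible.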
Mentioned in SAL (#wikimedia-operations) [2022-03-22T22:20:56Z] <ryankemper> T301511 Mutated cirrus codfw cluster settings to what [I think] they should be, see https://phabricator.wikimedia.org/T301511#7798415; forcing re-check