https://dbtree.wikimedia.org has 2 backends, dbmonitor1001 and dbmonitor2001.
Both share the same Puppet role/regex in site.pp, so they should be identical.
In the Apache Traffic Server config, dbmonitor1001 is currently the active backend and dbmonitor2001 is commented out as a standby (as we do for a number of misc services).
There is monitoring for http/https that Puppet only adds if do_acme is set to true in Hiera, which it is for dbmonitor1001.
This ticket is about 2 things.
a) do_acme has lost its meaning. It was a parameter of letsencrypt::cert::integrated, which is no longer used here. Since we now use acme_chief, HTTPS should simply work on both backends, so we can drop that distinction and monitor HTTPS on both servers. Alternatively, if we want to keep designating one of them as the "primary" for whatever reason, let's rename that Hiera key to something that matches the profile name. Dropping the key would also fix a style issue: there is currently a Hiera lookup inside a role class.
b) While looking at a), I wanted to check whether HTTPS monitoring really works on the codfw server as well, since there should be no cert issue anymore. So I ran the Icinga plugin command manually against both backends and found:
/usr/lib/nagios/plugins/check_http -H dbtree.wikimedia.org --ssl --sni -I dbmonitor1001.wikimedia.org -u https://dbtree.wikimedia.org
HTTP OK: HTTP/1.1 200 OK - 91709 bytes in 2.272 second response time |time=2.271994s;;;0.000000;10.000000 size=91709B;;;0

/usr/lib/nagios/plugins/check_http -H dbtree.wikimedia.org --ssl --sni -I dbmonitor2001.wikimedia.org -u https://dbtree.wikimedia.org
HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 286 bytes in 0.144 second response time |time=0.143718s;;;0.000000;10.000000 size=286B;;;0
I was surprised by that 500 error; I had expected either an OK or a cert issue, but not a 500.
I checked the logs on dbmonitor2001 and found:
PHP Fatal error: Uncaught Error: Call to undefined function mysql_connect()
So this host is apparently missing the php-mysql package. But since both hosts share the same Puppet role, this should not happen.
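One way to confirm the package drift between the two hosts is to diff their PHP package selections. The following is only a sketch: the helper name missing_on_second and the input file names are made up for illustration, and it assumes each host's list has already been saved to a file, one entry per line.

```shell
#!/bin/sh
# Sketch: print lines that are present in the first package list but absent
# from the second. Both arguments are files with one entry per line, e.g. the
# saved output of `dpkg --get-selections 'php*'` from each host.
# The function name is hypothetical, not an existing tool.
missing_on_second() {
    first_sorted=$(mktemp)
    second_sorted=$(mktemp)
    # comm requires sorted input
    sort "$1" > "$first_sorted"
    sort "$2" > "$second_sorted"
    # -23 suppresses lines unique to the second file and lines common to both
    comm -23 "$first_sorted" "$second_sorted"
    rm -f "$first_sorted" "$second_sorted"
}
```

Assuming SSH access to both hosts, the input files could be produced with `ssh dbmonitor1001.wikimedia.org "dpkg --get-selections 'php*'" > dbmonitor1001.txt` (and likewise for dbmonitor2001), followed by `missing_on_second dbmonitor1001.txt dbmonitor2001.txt`.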
The 500 error on dbmonitor2001 should be fixed, because as things stand failing over dbtree/tendril to codfw would not work.
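To re-verify after a fix, the two manual check_http runs from b) could be wrapped in a small loop. This is a minimal sketch, not the actual Icinga configuration: the function names and the CHECK_HTTP override variable are hypothetical, while the plugin path and flags are the ones used in the manual runs.

```shell
#!/bin/sh
# Sketch: run the same HTTPS check from this ticket against every backend.
SITE="dbtree.wikimedia.org"
BACKENDS="dbmonitor1001.wikimedia.org dbmonitor2001.wikimedia.org"
# CHECK_HTTP is a hypothetical override, handy for pointing at another path
CHECK_HTTP="${CHECK_HTTP:-/usr/lib/nagios/plugins/check_http}"

# Check a single backend ($1) while requesting the public site name via SNI
check_backend() {
    "$CHECK_HTTP" -H "$SITE" --ssl --sni -I "$1" -u "https://$SITE"
}

# Check all backends; return non-zero if any of them fails
check_all() {
    failed=0
    for backend in $BACKENDS; do
        printf '%s: ' "$backend"
        check_backend "$backend" || failed=1
    done
    return "$failed"
}
```

Calling `check_all` prints one plugin status line per backend, prefixed with the hostname, and exits non-zero as long as either backend still fails.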