
dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring
Closed, Duplicate, Public

Description

https://dbtree.wikimedia.org has 2 backends, dbmonitor1001 and dbmonitor2001.

Both share the same Puppet role/regex in site.pp, so they should be identical.

In the Apache Traffic Server config, dbmonitor1001 is currently the active backend and dbmonitor2001 is commented out as a standby (as we do for a number of misc services).

Puppet only adds the http/https monitoring if do_acme is set to true in Hiera, which it is for dbmonitor1001.

This ticket is about two things:

a) do_acme has lost its meaning. It was a parameter of letsencrypt::cert::integrated, which is not used here anymore. Since we now use acme_chief, https should simply work on both backends. We can drop that distinction and monitor https on both servers. Or, if we want to keep defining one of them as a "primary" for some reason, let's rename that Hiera key to something that matches the profile name. Dropping it would also fix a style issue: there is a Hiera lookup inside a role class.

b) While looking at a), I wanted to check whether HTTPS monitoring really works on the codfw server as well, since there should be no cert issue anymore. So I ran the Icinga plugin command manually and found:

/usr/lib/nagios/plugins/check_http -H dbtree.wikimedia.org --ssl --sni -I dbmonitor1001.wikimedia.org -u https://dbtree.wikimedia.org
HTTP OK: HTTP/1.1 200 OK - 91709 bytes in 2.272 second response time |time=2.271994s;;;0.000000;10.000000 size=91709B;;;0

vs

/usr/lib/nagios/plugins/check_http -H dbtree.wikimedia.org --ssl --sni -I dbmonitor2001.wikimedia.org -u https://dbtree.wikimedia.org
HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 286 bytes in 0.144 second response time |time=0.143718s;;;0.000000;10.000000 size=286B;;;0

I was surprised by that 500 error. I had expected either an OK or a cert issue, but not a 500.
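The manual comparison above can be repeated for both backends with a small wrapper. This is only a minimal sketch: the names CHECK_HTTP, SITE, BACKENDS, check_backend and check_all are hypothetical and not part of any existing tooling, while the check_http flags are copied from the two commands above. It relies on the usual Nagios plugin convention that a non-zero exit status means the check is not OK.

```shell
#!/bin/sh
# Hypothetical helper: run the same check_http invocation against each backend.
# CHECK_HTTP can be overridden; the default path is the one used in the ticket.
CHECK_HTTP="${CHECK_HTTP:-/usr/lib/nagios/plugins/check_http}"
SITE="dbtree.wikimedia.org"
BACKENDS="dbmonitor1001.wikimedia.org dbmonitor2001.wikimedia.org"

check_backend() {
    # $1 = backend FQDN; connect to it directly but request the public site name
    "$CHECK_HTTP" -H "$SITE" --ssl --sni -I "$1" -u "https://$SITE"
}

check_all() {
    # Returns 1 if any backend check exits non-zero, 0 otherwise
    failed=0
    for backend in $BACKENDS; do
        printf '== %s ==\n' "$backend"
        if ! check_backend "$backend"; then
            failed=1
        fi
    done
    return "$failed"
}

check_all || echo "at least one backend is not OK"
```

Running this on a host that has the plugin installed should reproduce the two results above, one per backend, without having to edit the -I argument by hand.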

I checked logs on dbmonitor2001 and found:

PHP Fatal error:  Uncaught Error: Call to undefined function mysql_connect()

So this host is apparently missing the php-mysql package. With both hosts sharing the same Puppet role, this should not happen.

This should be fixed, because currently a failover of dbtree/tendril to codfw would not work.

Event Timeline

Dzahn, this ticket is 100% accurate, but you may not be aware of why this happens, which is explained on T224589. I would suggest adding your comments there; you may not have seen that task because it is stalled. This is a real issue, but it may be a duplicate of that one. Feel free to update that task to reflect your thoughts.

tl;dr: If we want to make tendril work, we need to revert dbmonitor2001 back to jessie to have the php-mysql extension, which would be a huge security concern.

Was about to paste the relevant part and ask more questions about this when I saw your comment.

Ack, merged it in as a duplicate.

class role::tendril
...
    # Make tendril active-passive cross-datacenter until a local db backend is
    # available on codfw to avoid cross-dc queries or TLS is used to connect
    if hiera('do_acme', true) {
        ferm::service { 'tendril-http-https':
            proto => 'tcp',
            port  => '(http https)',
        }
    }

tl;dr: If we want to make tendril work, we need to revert dbmonitor2001 back to jessie to have the php-mysql extension, which would be a huge security concern.

ah, gotcha! thanks