Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*)
Closed, Resolved, Public

Description

Labs Project Tested: -
Site/Location: EQIAD and CODFW
Number of systems: 2 (1 per DC)
Service: tendril
Networking Requirements: external IP. It's a monitoring tool and as such must not be behind Varnish
Processor Requirements: 1
Memory: 2G
Disks: 10G
Other Requirements:

Event Timeline

akosiaris renamed this task from Site: (1) VM request for tendril to Site: 2 VM request for tendril. Oct 31 2016, 10:58 AM
akosiaris updated the task description.

Change 327266 had a related patch set uploaded (by Dzahn):
add tendril[12]001, v4 and v6 IPs

https://gerrit.wikimedia.org/r/327266

Change 327266 merged by Alexandros Kosiaris:
introduce dbmonitor, add dbmonitor[12]001, v4 and v6

https://gerrit.wikimedia.org/r/327266

VMs created. MACs for DHCP/PXE are

sudo gnt-instance list -o name,nic.mac/0 dbmonitor1001.wikimedia.org
Instance                    NicMAC/0
dbmonitor1001.wikimedia.org aa:00:00:d6:5c:05
sudo gnt-instance list -o name,nic.mac/0 dbmonitor2001.wikimedia.org
Instance                    NicMAC/0
dbmonitor2001.wikimedia.org aa:00:00:de:d7:57

Change 328509 had a related patch set uploaded (by Alexandros Kosiaris):
Introduce dbmonitor1001, dbmonitor2001

https://gerrit.wikimedia.org/r/328509

Change 328509 merged by Alexandros Kosiaris:
Introduce dbmonitor1001, dbmonitor2001

https://gerrit.wikimedia.org/r/328509

akosiaris claimed this task.
akosiaris added a parent task: Restricted Task.

VMs are up and running, tendril runs on them (with LDAP auth on, over HTTPS), resolving this.

Change 331430 had a related patch set uploaded (by Dzahn):
disable Letsencrypt cert (do_acme: false) on dbmonitor*

https://gerrit.wikimedia.org/r/331430

Change 331430 merged by Dzahn:
disable Letsencrypt cert (do_acme: false) on dbmonitor*

https://gerrit.wikimedia.org/r/331430

re-opening.

tendril in DNS is still an alias for einsteinium

tendril.wikimedia.org is an alias for einsteinium.wikimedia.org.
einsteinium.wikimedia.org has address 208.80.155.119

So the Icinga server is still used when you go to https://tendril.wikimedia.org/ and it's not switched to dbmonitor1001/2001 yet (and it's not behind misc-web, since it's a monitoring tool).
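A minimal sketch of how the alias check above could be scripted, assuming the text output of the `host` command has the same shape as the two lines quoted in this comment. The function name `resolve_alias` and the `SAMPLE` constant are hypothetical and only for illustration.

```python
def resolve_alias(host_output: str, name: str) -> str:
    """Follow 'X is an alias for Y.' lines from `host` output, starting at name."""
    marker = " is an alias for "
    aliases = {}
    for line in host_output.splitlines():
        if marker in line:
            source, target = line.split(marker, 1)
            # `host` prints the target as a fully qualified name with a trailing dot.
            aliases[source.strip()] = target.strip().rstrip(".")
    seen = set()
    # Walk the chain, guarding against alias loops.
    while name in aliases and name not in seen:
        seen.add(name)
        name = aliases[name]
    return name


# Output copied from the comment above.
SAMPLE = """tendril.wikimedia.org is an alias for einsteinium.wikimedia.org.
einsteinium.wikimedia.org has address 208.80.155.119
"""

target = resolve_alias(SAMPLE, "tendril.wikimedia.org")
```

With the quoted output, `target` names einsteinium rather than a dbmonitor host, which is the state this comment describes.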

In puppet the role::tendril is applied on einsteinium AND dbmonitor1001/2001 which made me think this migration is simply ongoing.

Since the role adds Letsencrypt certificates, puppet errors happened on dbmonitor*, which I fixed by adding the "do_acme: false" override for LE in Hiera. (gerrit:331430)

It also adds the Icinga monitoring for the LE cert, so we had alerts about the cert expiring on dbmonitor. That led to T162183, for which I merged https://gerrit.wikimedia.org/r/#/c/348172 as a temporary fix to remove the Icinga cruft.

After that I found this ticket again and saw it resolved, so yeah... Can the role be removed from einsteinium? How about the DNS CNAME?

Dzahn renamed this task from Site: 2 VM request for tendril to Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*). Apr 14 2017, 2:23 AM

I think the only thing left to do is assess that dbmonitors are OK and then proceed with switching the DNS CNAME, so yeah it should be good to go.

akosiaris changed the task status from Open to Stalled. Apr 21 2017, 10:00 AM

Stalling for 3 weeks. Let's revisit this after the DC switchover fallback

@akosiaris should we get back to it and unstall?

Dzahn changed the task status from Stalled to Open. Jun 14 2017, 6:59 PM

Fine by me. @jcrespo I think we can get back to this finally if you are ok with it.

Things to do (in that order)

  • Make sure dbmonitor1001, dbmonitor2001 run the same tendril version as tegmen/einsteinium (we discovered yesterday that einsteinium was running an older version than tegmen)
  • Switch over the CNAME to point to dbmonitor1001
  • Test
  • Remove the role from tegmen/einsteinium and clean up tendril stuff from them
  • Remove tegmen/einsteinium from any tendril associated mysql ACLs.
  • Add dbmonitor1001, dbmonitor2001 to mysql ACLs so that tendril db can be contacted from it
-- Grants for 'tendril'@'10.%' (tendril)

GRANT PROCESS, REPLICATION CLIENT, SELECT, SHOW DATABASES
    ON *.* TO '<%= @tendril_user %>'@'10.%'
    IDENTIFIED BY '<%= @tendril_pass %>';

^ I was about to upload changes for the grants, but that looks like we don't need a change for it to work. We allow from @10.%

Or should we make it more specific now for additional security?

P.S. Eh, OK, the dbmonitors are not in 10.% but in 208.80.155.%, but neither are einsteinium and tegmen in 10.%... so why does it work? :)
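To make the question concrete, here is a simplified sketch of how a MySQL account host pattern is compared against a client address, where `%` matches any run of characters and `_` matches exactly one. It deliberately ignores netmask notation and reverse-DNS hostname matching, and the helper name `host_matches` is hypothetical. The address 208.80.155.119 is einsteinium's, as quoted earlier in this task; 10.64.0.15 is a made-up internal address used only for contrast.

```python
import re


def host_matches(pattern: str, client_host: str) -> bool:
    """Simplified MySQL-style host match: '%' = any run of chars, '_' = one char."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    regex = "".join(parts)
    return re.fullmatch(regex, client_host, flags=re.IGNORECASE) is not None


# Hypothetical internal address: covered by the '10.%' grant.
internal_ok = host_matches("10.%", "10.64.0.15")
# einsteinium's public address: not covered by '10.%'.
public_ok = host_matches("10.%", "208.80.155.119")
```

Under this simplified model the `'10.%'` grant does not cover a 208.80.155.x client, so access from those hosts would have to come from some other grant than the one pasted above.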

  • Make sure dbmonitor1001, dbmonitor2001 run the same tendril version as tegmen/einsteinium (we discovered yesterday that einsteinium was running an older version than tegmen)

dbmonitor1001/2001 were both at commit a3e37457a77d28b755c

einsteinium and tegmen were both at commit f42955f67093257d

(was just 1 change between them https://gerrit.wikimedia.org/r/#/c/351194/)

i git pulled on both dbmonitors, so now all of them are at f42955f67093257d

Change 359372 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch tendril from einsteinium to dbmonitor1001

https://gerrit.wikimedia.org/r/359372

Change 359373 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mariadb: add GRANT for tendril@dbmonitor1001

https://gerrit.wikimedia.org/r/359373

Do you guys know anything about the latest 2 comments on https://gerrit.wikimedia.org/r/#/c/359373/ ? It seems those GRANTs already exist but are not in the file where they would normally be?

Change 359373 abandoned by Dzahn:
mariadb: add GRANTs for tendril@dbmonitor1001, tendriL@dbmonitor2001

https://gerrit.wikimedia.org/r/359373

Dzahn removed Dzahn as the assignee of this task. Jun 23 2017, 6:18 AM

Change 359372 abandoned by Dzahn:
switch tendril from einsteinium to dbmonitor1001

https://gerrit.wikimedia.org/r/359372

Change 359372 restored by Jcrespo:
switch tendril from einsteinium to dbmonitor1001

https://gerrit.wikimedia.org/r/359372

Change 359372 merged by Jcrespo:
[operations/dns@master] switch tendril from einsteinium to dbmonitor1001

https://gerrit.wikimedia.org/r/359372

The dns change doesn't work either because it causes a TLS error. This is because LE has been disabled there: https://gerrit.wikimedia.org/r/#/c/331430/1

Change 361046 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbmonitor1001: Reenable let's encript generation script

https://gerrit.wikimedia.org/r/361046

Change 361047 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbmonitor: Remove tendril role from einstenium and tegmen

https://gerrit.wikimedia.org/r/361047

Change 361046 merged by Jcrespo:
[operations/puppet@production] dbmonitor1001: Reenable let's encript generation script

https://gerrit.wikimedia.org/r/361046

Change 361047 merged by Jcrespo:
[operations/puppet@production] dbmonitor: Remove tendril role from einstenium and tegmen

https://gerrit.wikimedia.org/r/361047

So the switchover has happened. For future reference: 1) fail over the DNS, 2) enable LE in puppet, 3) run puppet. Because of DNS caching and the time the process takes, this means a good 15 minutes of outage, which I would not be a huge fan of for an important no-downtime host, but it was not a huge deal here once I coordinated with Manuel. I double-checked: access.log confirms the failover, and authentication is still in place.

I will leave the old content on einsteinium for now, at least over the weekend (just in case a revert is needed), but after that we should delete:

  1. the tendril repo
  2. the LE private stuff (cert/key)
  3. what else? packages?

packages: It installed php5-mysql

Apache modules: several, but I checked that all of them are ALSO needed by Icinga, so nothing to remove here

cron:

cron { 'tendril-queries':
    user => 'tendril',
    command => '/usr/local/bin/tendril-queries.pl /etc/mysql/tendril.cnf > /var/log/tendril-queries.log 2> /var/log/tendril-queries.err',

tendril has been using dbmonitor1001 for the last week, so I am guessing we are OK. Resolving; feel free to reopen.

Almost finished; we still need to delete the garbage left on einsteinium and the codfw hosts.

I did all of the above on tegmen except the cron, which I think was handled automatically by the user-deletion process (no crons were found for the tendril user, which no longer exists, nor any leftovers of its crontab. Also no related cron executions by puppet).

I also deleted /etc/apache2/sites-available/50-tendril-wikimedia-org.conf .

for the keys/certs I deleted:

  • /etc/ssl/private/tendril.wikimedia.org.key
  • /etc/acme/cert/tendril.*
  • /etc/ssl/localcerts/tendril.wikimedia.org.*
  • /etc/acme/key/tendril.key
  • /etc/acme/csr/tendril.pem

Not sure if I am missing something.
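As a rough aid for the same cleanup on the remaining hosts, the paths named in this comment could be checked with a read-only sketch like the one below. It only lists matches and deletes nothing. The names `LEFTOVER_PATTERNS` and `find_leftovers` and the `root` parameter are hypothetical; the patterns are exactly the paths listed above, and other leftovers may exist that this task does not mention.

```python
import glob
import os

# Paths taken from the comment above; nothing beyond that is assumed to exist.
LEFTOVER_PATTERNS = [
    "/etc/apache2/sites-available/50-tendril-wikimedia-org.conf",
    "/etc/ssl/private/tendril.wikimedia.org.key",
    "/etc/acme/cert/tendril.*",
    "/etc/ssl/localcerts/tendril.wikimedia.org.*",
    "/etc/acme/key/tendril.key",
    "/etc/acme/csr/tendril.pem",
]


def find_leftovers(patterns=LEFTOVER_PATTERNS, root="/"):
    """Return the sorted existing paths under root that match any pattern."""
    found = []
    for pattern in patterns:
        # Re-anchor each absolute pattern under root so a test tree can be used.
        found.extend(glob.glob(os.path.join(root, pattern.lstrip("/"))))
    return sorted(found)
```

Calling `find_leftovers()` with the default `root` inspects the real filesystem; passing a scratch directory as `root` allows a dry run against a copy of the tree.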

Waiting for Monday for potential cron spam.

can be closed as resolved now?

Not yet, I have not deleted the files on einsteinium.

Mentioned in SAL (#wikimedia-operations) [2017-07-25T16:16:39Z] <jynus> about to delete orfphan files on einstenium T149557