Shutdown Tendril and dbtree
Closed, Resolved, Public

Description

Tracking task for the shutdown and deprecation of Tendril:

Draft timeline:

  • Second week of January:

Details

Repo                              Branch      Lines +/-
operations/dns                    master      +0 -1
operations/puppet                 production  +0 -4
operations/puppet                 production  +2 -116
operations/puppet                 production  +0 -428
operations/puppet                 production  +0 -16
operations/puppet                 production  +0 -1
operations/puppet                 production  +0 -11
operations/puppet                 production  +1 -18
operations/puppet                 production  +1 -0
operations/puppet                 production  +0 -69
operations/puppet                 production  +1 -4
operations/puppet                 production  +3 -0
operations/software               master      +1 -1
operations/puppet                 production  +0 -6
operations/puppet                 production  +4 -23
operations/puppet                 production  +2 -2
operations/puppet                 production  +0 -4
operations/puppet                 production  +1 -1
operations/puppet                 production  +1 -2
operations/puppet                 production  +2 -2
operations/puppet                 production  +1 -1
operations/puppet                 production  +2 -1
operations/dns                    master      +0 -2
operations/puppet                 production  +0 -1
operations/puppet                 production  +0 -24
operations/puppet                 production  +2 -2
operations/puppet                 production  +1 -0
operations/software               master      +0 -1
operations/cookbooks              master      +0 -34
operations/software/wmfmariadbpy  master      +7 -1
operations/software/wmfmariadbpy  master      +8 -1
operations/software/wmfmariadbpy  master      +0 -41
operations/mediawiki-config       master      +2 -5
operations/puppet                 production  +2 -0
operations/puppet                 production  +8 -9
operations/dns                    master      +3 -1
operations/puppet                 production  +4 -0
operations/puppet                 production  +1 -1
operations/puppet                 production  +4 -1
operations/puppet                 production  +1 -1
operations/puppet                 production  +12 -2
operations/puppet                 production  +1 -1
operations/puppet                 production  +2 -1
operations/puppet                 production  +16 -16
operations/puppet                 production  +21 -20
operations/puppet                 production  +21 -20
operations/puppet                 production  +12 -11
operations/puppet                 production  +2 -2
operations/puppet                 production  +2 -5
operations/puppet                 production  +1 -1
operations/puppet                 production  +6 -3
operations/puppet                 production  +5 -0
operations/puppet                 production  +2 -2
operations/puppet                 production  +160 -1
operations/puppet                 production  +2 -0
operations/software               master      +0 -6

Event Timeline

Change 757617 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] switchover-tmpl.sh: Tendril is no more

https://gerrit.wikimedia.org/r/757617

Change 757617 merged by jenkins-bot:

[operations/software@master] switchover-tmpl.sh: Tendril is no more

https://gerrit.wikimedia.org/r/757617

Removed it from switchover-tmpl.sh

Change 760883 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1115: Disable notifications

https://gerrit.wikimedia.org/r/760883

Change 760883 merged by Marostegui:

[operations/puppet@production] db1115: Disable notifications

https://gerrit.wikimedia.org/r/760883

Mentioned in SAL (#wikimedia-operations) [2022-02-08T08:20:09Z] <marostegui> Stop MySQL on db1115 to backup tendril T297605

Change 760884 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2093: Disable notifications

https://gerrit.wikimedia.org/r/760884

Change 760884 merged by Marostegui:

[operations/puppet@production] db2093: Disable notifications

https://gerrit.wikimedia.org/r/760884

Change 760885 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] tendril: Disable systemd jobs

https://gerrit.wikimedia.org/r/760885

Change 760885 merged by Marostegui:

[operations/puppet@production] tendril: Disable systemd jobs

https://gerrit.wikimedia.org/r/760885

Change 760888 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] tendril: Remove systemd jobs

https://gerrit.wikimedia.org/r/760888

Change 760893 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] microsites: Remove link to tendril-legacy

https://gerrit.wikimedia.org/r/760893

Change 760888 merged by Marostegui:

[operations/puppet@production] tendril: Remove systemd jobs

https://gerrit.wikimedia.org/r/760888

Change 760894 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/dns@master] wikimedia.org: Drop tendril-legacy

https://gerrit.wikimedia.org/r/760894

Change 760893 merged by Ladsgroup:

[operations/puppet@production] microsites: Remove link to tendril-legacy

https://gerrit.wikimedia.org/r/760893

Change 760894 merged by Ladsgroup:

[operations/dns@master] wikimedia.org: Drop tendril-legacy

https://gerrit.wikimedia.org/r/760894

The binary backup is done. It is at: dbprov1002:/srv/backups/snapshots/latest/tendril.db1115.8feb2002
For now I am going to leave the database stopped and on Thursday I will:

  • Start mysql again
  • Drop tendril database (so we can start mysql without TokuDB later)
  • Change the role to db_inventory
  • Reimage db1115 with Bullseye
  • Start mysql without TokuDB options

Yesterday there was a question from @Kormat about the Zarcillo DB (on db1115) and what would happen to tools that have its location hardcoded during the reimage. Do we need a "no-maintenance" window?

> For now I am going to leave the database stopped and on Thursday I will:

This has broken all of the generate-mysqld-exporter-config jobs on the prometheus hosts.
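One way a tool could tolerate the inventory host being stopped is to try a list of candidate hosts in order instead of hardcoding a single one. The following is only a minimal sketch of that idea, not what the prometheus jobs or any patch in this task actually do: `pick_inventory_host` and the `is_reachable` callback are hypothetical names, and it assumes db2093 holds a usable copy of the inventory data.

```python
from typing import Callable, Iterable, Optional


def pick_inventory_host(
    candidates: Iterable[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate host that answers, or None if none do."""
    for host in candidates:
        if is_reachable(host):
            return host
    return None


# Hypothetical usage: db1115 is stopped for the backup, db2093 still answers.
# A real is_reachable() would attempt a connection; here it is faked.
stopped = {"db1115"}
chosen = pick_inventory_host(["db1115", "db2093"], lambda h: h not in stopped)
print(chosen)  # db2093
```

Passing the reachability check in as a callable keeps the selection logic testable without a database connection.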

Change 760911 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] prometheus: Temporarily switch to db2093

https://gerrit.wikimedia.org/r/760911

Change 760911 merged by Kormat:

[operations/puppet@production] prometheus: Temporarily switch to db2093

https://gerrit.wikimedia.org/r/760911

Mentioned in SAL (#wikimedia-operations) [2022-02-08T12:42:17Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on dbmonitor1002.wikimedia.org with reason: Host will be shutdown in a week (T297605)

Mentioned in SAL (#wikimedia-operations) [2022-02-08T12:42:21Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dbmonitor1002.wikimedia.org with reason: Host will be shutdown in a week (T297605)

Marostegui changed the status of subtask T301314: Cleanup db2093 from Open to Stalled. (Feb 9 2022, 6:45 AM)
Marostegui changed the status of subtask T301315: Move orchestrator from db2093 to db1115 from Open to Stalled.

Mentioned in SAL (#wikimedia-operations) [2022-02-10T06:01:30Z] <marostegui> Drop tendril database from db1115 T297605

Dropped the following databases in db1115:

  • Tendril
  • dsns
  • officewiki

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1115.eqiad.wmnet with OS bullseye

Change 761527 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] netboot.cfg: Add db1115 to RAID1 partman

https://gerrit.wikimedia.org/r/761527

Change 761527 merged by Marostegui:

[operations/puppet@production] netboot.cfg: Add db1115 to RAID1 partman

https://gerrit.wikimedia.org/r/761527

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1115.eqiad.wmnet with OS bullseye executed with errors:

  • db1115 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1115.eqiad.wmnet with OS bullseye executed with errors:

  • db1115 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Change 761528 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] netboot.cfg: Use reuse-raid1-2dev.cfg

https://gerrit.wikimedia.org/r/761528

Change 761528 merged by Marostegui:

[operations/puppet@production] netboot.cfg: Use reuse-raid1-2dev.cfg

https://gerrit.wikimedia.org/r/761528

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1115.eqiad.wmnet with OS bullseye

For the record, I had to change db1115's partman recipe from manual-setup to RAID1 so the installation could proceed. Once in the installer menu, and after the RAID1 was created, I manually erased / and kept /srv.

Change 761529 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Remove tendrilr references

https://gerrit.wikimedia.org/r/761529

Change 761529 merged by Marostegui:

[operations/puppet@production] site.pp: Remove tendril references

https://gerrit.wikimedia.org/r/761529

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1115.eqiad.wmnet with OS bullseye completed:

  • db1115 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202100628_marostegui_29857_db1115.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

So, db1115 has been reimaged to Bullseye with MariaDB 10.4. I have done the following clean up:

  • Removed all pending tokudb leftover files
  • Removed tendril DB
  • Removed officedb
  • Removed dsns
  • Cleaned up tendril@10.% grants from all hosts. (s*, es*, pc*, x*, m*)
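The grant cleanup in the last bullet could be prepared roughly as follows. This is a sketch, not the procedure that was actually run: the function names and the example host names are hypothetical, and the code only builds SQL strings for review rather than connecting to any server. It assumes a MariaDB version that supports `DROP USER IF EXISTS`.

```python
from typing import Iterable, List, Tuple


def drop_user_statement(user: str, host_pattern: str) -> str:
    """Build a DROP USER statement for one account (user@host_pattern)."""
    for value in (user, host_pattern):
        # Refuse values that would break out of the quoted identifiers.
        if "'" in value or "\\" in value:
            raise ValueError(f"unsafe value: {value!r}")
    return f"DROP USER IF EXISTS '{user}'@'{host_pattern}';"


def cleanup_plan(
    hosts: Iterable[str],
    user: str = "tendril",
    host_pattern: str = "10.%",
) -> List[Tuple[str, str]]:
    """Pair every database host with the statement to run on it."""
    statement = drop_user_statement(user, host_pattern)
    return [(host, statement) for host in hosts]


# Placeholder host names; the real list would cover s*, es*, pc*, x* and m*.
for host, statement in cleanup_plan(["db-example-1", "db-example-2"]):
    print(f"{host}: {statement}")
```

Generating the plan first and printing it makes it possible to review exactly what would run on each host before anything is executed.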

What's next: T301315

  • Move orchestrator db to db1115
  • Erase db2093 and reclone it from db1115 once it is no longer the active orchestrator DB (this is the only way to get rid of the old TokuDB references)

Grants for watchdog user: T301442

Change 761534 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] netboot.cfg: db1115 back to non format

https://gerrit.wikimedia.org/r/761534

Change 761534 merged by Marostegui:

[operations/puppet@production] netboot.cfg: db1115 back to non format

https://gerrit.wikimedia.org/r/761534

Reverted the partman recipe used for the reimage, so the host does not get wiped by mistake.

Reverted the prometheus patch, so it connects to db1115 again.

Change 761535 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] valid_section.pp: Remove tendril

https://gerrit.wikimedia.org/r/761535

Change 761535 merged by Marostegui:

[operations/puppet@production] valid_section.pp: Remove tendril

https://gerrit.wikimedia.org/r/761535

Change 761572 merged by Marostegui:

[operations/puppet@production] tendril.sql.erb: Remove file

https://gerrit.wikimedia.org/r/761572

Change 761573 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] tendril/maintenance.pp: Files ensure to be absent

https://gerrit.wikimedia.org/r/761573

Change 761575 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db_inventory.my.cnf: Remove tendril specific options

https://gerrit.wikimedia.org/r/761575

Change 761575 merged by Marostegui:

[operations/puppet@production] db_inventory.my.cnf: Remove tendril specific options

https://gerrit.wikimedia.org/r/761575

Change 761577 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] production.sql.erb: Remove tendril user

https://gerrit.wikimedia.org/r/761577

Change 761577 merged by Marostegui:

[operations/puppet@production] production.sql.erb: Remove tendril user

https://gerrit.wikimedia.org/r/761577

Change 761578 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] change_mw_mysql_pass.sh: Tendril is gone

https://gerrit.wikimedia.org/r/761578

Change 761578 merged by jenkins-bot:

[operations/software@master] change_mw_mysql_pass.sh: Tendril is gone

https://gerrit.wikimedia.org/r/761578

Change 761585 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] production.pp: Remove tendril grants

https://gerrit.wikimedia.org/r/761585

Change 761573 merged by Marostegui:

[operations/puppet@production] tendril/maintenance.pp: Files ensure to be absent

https://gerrit.wikimedia.org/r/761573

Change 761585 merged by Marostegui:

[operations/puppet@production] production.pp: Remove tendril grants

https://gerrit.wikimedia.org/r/761585

Change 761597 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] maintenance.pp: Remove file

https://gerrit.wikimedia.org/r/761597

Change 761597 merged by Marostegui:

[operations/puppet@production] maintenance.pp: Remove file

https://gerrit.wikimedia.org/r/761597

cookbooks.sre.hosts.decommission executed by ladsgroup@cumin1001 for hosts: dbmonitor1002.wikimedia.org

  • dbmonitor1002.wikimedia.org (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

Now we can nuke the tendril puppet rules.

Change 761671 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] First sweep of clean up of tendril

https://gerrit.wikimedia.org/r/761671

I think wmfmariadbpy is clean already:

$ egrep -iR "tendril|db1115" *
debian/changelog:  * db-switchover: Drop tendril support, switch to db-mysql, download events
wmfmariadbpy/cli_admin/switchover.py:ZARCILLO_INSTANCE = "db1115"  # instance_name:port format
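The same leftover check can be scripted so it is easy to repeat across repositories. Below is a minimal sketch that mirrors the `egrep -iR` call above; `find_references` is a hypothetical helper, not part of wmfmariadbpy, and it simply reads every file as text.

```python
import os
import re
from typing import Iterator, Pattern, Tuple

# Same expression as the egrep call: case-insensitive "tendril" or "db1115".
PATTERN = re.compile(r"tendril|db1115", re.IGNORECASE)


def find_references(
    root: str, pattern: Pattern[str] = PATTERN
) -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line number, line) for every matching line under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps binary files from raising on decode.
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
```

Calling `list(find_references("."))` from a repository root returns the matches; an empty list means no references are left in that tree.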

Change 761793 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] tendril.cnf.erb: Remove file

https://gerrit.wikimedia.org/r/761793

Change 761671 merged by Ladsgroup:

[operations/puppet@production] First sweep of clean up of tendril

https://gerrit.wikimedia.org/r/761671

Change 761793 merged by Marostegui:

[operations/puppet@production] tendril.cnf.erb: Remove file

https://gerrit.wikimedia.org/r/761793

Change 761918 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] install_server: Drop dbmonitor

https://gerrit.wikimedia.org/r/761918

Change 761919 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] Second sweep of tendril clean up

https://gerrit.wikimedia.org/r/761919

Change 761920 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] tendril/maintenance.pp: Remove file

https://gerrit.wikimedia.org/r/761920

Change 761918 merged by Ladsgroup:

[operations/puppet@production] install_server: Drop dbmonitor

https://gerrit.wikimedia.org/r/761918

Change 761920 merged by Marostegui:

[operations/puppet@production] tendril/maintenance.pp: Remove file

https://gerrit.wikimedia.org/r/761920

Change 761919 merged by Ladsgroup:

[operations/puppet@production] Second sweep of tendril clean up

https://gerrit.wikimedia.org/r/761919

Change 761921 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] Third sweep of tendril clean up

https://gerrit.wikimedia.org/r/761921

Change 761921 merged by Ladsgroup:

[operations/puppet@production] Third sweep of tendril clean up

https://gerrit.wikimedia.org/r/761921

Change 761927 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Remove support for stretch

https://gerrit.wikimedia.org/r/761927

There are no tendril entries left for MW. As the remaining pending items have their own tasks, I am closing this, as agreed in yesterday's team meeting.

Getting tendril deprecated has been a huge effort, congratulations to everyone involved in making this happen.

Change 907808 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Remove old tendril reference

https://gerrit.wikimedia.org/r/907808

Change 907808 merged by Marostegui:

[operations/dns@master] wmnet: Remove old tendril reference

https://gerrit.wikimedia.org/r/907808

fyi: The placeholder page on https://dbtree.wikimedia.org and https://tendril.wikimedia.org has moved to k8s in T340182 and can just stay there for basically ever.

(even has monitoring ;P)