
Site: 2 VM %request for etherpad
Closed, Resolved, Public

Description

Cloud VPS Project Tested: n/a, tested in production
Site/Location: eqiad/codfw
Number of systems: 2
Service: etherpad
Networking Requirements: internal
Processor Requirements: 1
Memory: 2GB
Disks: 15GB
Other Requirements: none

Same as T300568, and before that T243475, but with a little more RAM, which we have increased in the meantime.

etherpad1004 replacing etherpad1003 to upgrade to bookworm (parent task T316421)

Additionally we are creating the equivalent in codfw since this was the only service left with a SPOF.

Event Timeline

Dzahn changed the task status from Open to In Progress. Feb 9 2024, 5:20 PM
Dzahn claimed this task.
Dzahn updated the task description.
Dzahn added a project: collaboration-services.

Change 999957 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add etherpad1004 with insetup-role

https://gerrit.wikimedia.org/r/999957

Change 999957 merged by Dzahn:

[operations/puppet@production] site: add etherpad1004 with insetup-role

https://gerrit.wikimedia.org/r/999957

dzahn@cumin1002:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 2 --disk 15 --cluster eqiad -t T357159 --group B --os bookworm etherpad1004
Ready to create Ganeti VM etherpad1004.eqiad.wmnet in the eqiad cluster on group B with 1 vCPUs, 2.0GB of RAM, 15GB of disk in the private network.
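
The mapping from the request fields in the description to the command above can be sketched as a small shell helper. This is only an illustration: `build_makevm_cmd` is a hypothetical name, it prints the command rather than running it, and it uses only the flags that appear in the invocation above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cookbooks): print, but do not run,
# the makevm invocation for a VM request. Only flags seen in the command
# pasted in this task are used.
build_makevm_cmd() {
    vcpus=$1; memory_gb=$2; disk_gb=$3; cluster=$4
    group=$5; os=$6; task=$7; host=$8
    printf 'sudo cookbook sre.ganeti.makevm --vcpus %s --memory %s --disk %s --cluster %s -t %s --group %s --os %s %s\n' \
        "$vcpus" "$memory_gb" "$disk_gb" "$cluster" "$task" "$group" "$os" "$host"
}

# Request fields: 1 processor, 2GB memory, 15GB disk, eqiad, group B.
build_makevm_cmd 1 2 15 eqiad B bookworm T357159 etherpad1004
```

The codfw VM would take its own cluster and group values, which this task does not record.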

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host etherpad1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host etherpad1004.eqiad.wmnet with OS bookworm executed with errors:

  • etherpad1004 (FAIL)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" etherpad1004.eqiad.wmnet to get a root shell but depending on the failure this may not work.
Dzahn triaged this task as Medium priority.

The reimage failed because the puppetmaster had an issue at the time.

Reimaged again after that was fixed; the host is ready now.

sitting with "insetup" role:

Codename: bookworm
[etherpad1004:~] $
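
A check like the one above could also be scripted. The sketch below is an assumption, not what was run here: it reads `VERSION_CODENAME` from `/etc/os-release` instead of using whatever command produced the `Codename:` line, and `codename_from_os_release` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: extract VERSION_CODENAME from an os-release style file.
# Defaults to /etc/os-release; prints nothing if the key or file is missing.
codename_from_os_release() {
    file=${1:-/etc/os-release}
    sed -n 's/^VERSION_CODENAME=//p' "$file" 2>/dev/null | tr -d '"'
}

expected=bookworm
actual=$(codename_from_os_release)
if [ "$actual" = "$expected" ]; then
    echo "OK: host reports $actual"
else
    echo "Unexpected codename: ${actual:-unknown} (wanted $expected)"
fi
```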

Dzahn renamed this task from Site: 1 VM %request for etherpad to Site: 2 VM %request for etherpad. Feb 12 2024, 4:49 PM
Dzahn reopened this task as Open.
Dzahn updated the task description.

Change 1002591 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add etherpad2001 with insetup role

https://gerrit.wikimedia.org/r/1002591

Change 1002591 merged by Dzahn:

[operations/puppet@production] site: add etherpad2001 with insetup role

https://gerrit.wikimedia.org/r/1002591

Mentioned in SAL (#wikimedia-operations) [2024-02-12T18:35:21Z] <mutante> attempting decom cookbook on "unverified" host etherpad2001, followed by makevm cookbook to create it again to get out of the cycle of adding and removing DNS records - fails with "is already in the cluster" even after decom finished T357159

Change 1002596 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: update etherpad VM 2001 to 2002

https://gerrit.wikimedia.org/r/1002596

Change 1002596 merged by Dzahn:

[operations/puppet@production] site: update etherpad VM 2001 to 2002

https://gerrit.wikimedia.org/r/1002596

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host etherpad2002.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host etherpad2002.codfw.wmnet with OS bookworm completed:

  • etherpad2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402121928_dzahn_3747695_etherpad2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-03-07T16:02:25Z] <mutante> deleting etherpad2001 VM -replaced by etherpad2002 - T357159

cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: etherpad2001.codfw.wmnet

  • etherpad2001.codfw.wmnet (WARN)
    • Missing DNSName in Netbox for etherpad2001, unable to verify it.
    • Missing DNS record for etherpad2001.codfw.wmnet, the steps requiring DNS will fail.
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox