Migrate URL downloaders to Buster
Closed, Resolved, Public

Description

URL downloader hosts running on Ganeti instances:

actinium.wikimedia.org
alcyone.wikimedia.org
alsafi.wikimedia.org
aluminium.wikimedia.org

When we migrate them to a new OS, it seems like a good idea to also move to a naming scheme that indicates the DC, such as urldownloader1001, to make the DC more visible.
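As a rough illustration of what such a naming scheme makes possible, here is a minimal sketch that reads the DC out of a host name. It assumes that a leading digit of 1 means eqiad and 2 means codfw, which is inferred from the CNAME changes recorded later in this task; the function name `parse_hostname` and the `DC_BY_DIGIT` table are hypothetical and not part of any existing tooling.

```python
import re

# Assumption: first digit of the numeric suffix identifies the DC.
# 1 -> eqiad and 2 -> codfw are inferred from this task's CNAME switches.
DC_BY_DIGIT = {"1": "eqiad", "2": "codfw"}

# Role name (letters and hyphens), one DC digit, three-digit index.
HOSTNAME = re.compile(r"^([a-z-]+?)(\d)(\d{3})$")


def parse_hostname(name):
    """Return (role, dc, index) for a DC-indicated name, else None."""
    short = name.split(".", 1)[0]  # drop the domain part, if any
    match = HOSTNAME.match(short)
    if match is None:
        return None  # e.g. the old element-style names such as "actinium"
    role, dc_digit, index = match.groups()
    return role, DC_BY_DIGIT.get(dc_digit, "unknown"), int(index)


if __name__ == "__main__":
    for host in ("urldownloader1001.wikimedia.org", "urldownloader2002", "actinium"):
        print(host, "->", parse_hostname(host))
```

The old names return None because they carry no DC information at all, which is the visibility problem the rename is meant to address.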

Event Timeline

ArielGlenn triaged this task as Medium priority.Jun 11 2019, 7:59 AM

Change 562283 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/dns@master] Add DNS entries for urldownloader*

https://gerrit.wikimedia.org/r/562283

Change 562283 merged by Dzahn:
[operations/dns@master] Add DNS entries for urldownloader*

https://gerrit.wikimedia.org/r/562283

[authdns1001:~] $ host urldownloader1001
urldownloader1001.wikimedia.org has address 208.80.154.29
urldownloader1001.wikimedia.org has IPv6 address 2620:0:861:1:208:80:154:29
[authdns1001:~] $ host urldownloader1002
urldownloader1002.wikimedia.org has address 208.80.154.81
urldownloader1002.wikimedia.org has IPv6 address 2620:0:861:3:208:80:154:81
[authdns1001:~] $ host urldownloader2001
urldownloader2001.wikimedia.org has address 208.80.153.24
urldownloader2001.wikimedia.org has IPv6 address 2620:0:860:1:208:80:153:24
[authdns1001:~] $ host urldownloader2002
urldownloader2002.wikimedia.org has address 208.80.153.61
urldownloader2002.wikimedia.org has IPv6 address 2620:0:860:2:208:80:153:61
[authdns1001:~] $ host 208.80.154.29
29.154.80.208.in-addr.arpa domain name pointer urldownloader1001.wikimedia.org.
[authdns1001:~] $ host 208.80.154.81
81.154.80.208.in-addr.arpa domain name pointer urldownloader1002.wikimedia.org.
[authdns1001:~] $ host 208.80.153.24
24.153.80.208.in-addr.arpa domain name pointer urldownloader2001.wikimedia.org.
[authdns1001:~] $ host 208.80.153.61
61.153.80.208.in-addr.arpa domain name pointer urldownloader2002.wikimedia.org.
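The manual check above compares each A record with the matching PTR record by eye. A minimal sketch of the same comparison, assuming the text format printed by `host` in the transcript above, could look like this. The function name `check_forward_reverse` is hypothetical, and the sketch only parses pasted output; it does not query DNS itself and it ignores the IPv6 lines.

```python
import ipaddress
import re

# Line formats as they appear in the `host` output pasted in this task.
A_RECORD = re.compile(r"^(\S+) has address (\S+)$")
PTR_RECORD = re.compile(r"^(\S+) domain name pointer (\S+?)\.?$")


def check_forward_reverse(host_output):
    """Return a list of problems where an A record has no matching PTR."""
    forward = {}   # host name -> IPv4 address
    pointers = {}  # reverse name (x.x.x.x.in-addr.arpa) -> host name
    for raw in host_output.splitlines():
        line = raw.strip()
        match = A_RECORD.match(line)
        if match:
            forward[match.group(1)] = match.group(2)
            continue
        match = PTR_RECORD.match(line)
        if match:
            pointers[match.group(1)] = match.group(2)

    problems = []
    for name, address in sorted(forward.items()):
        # reverse_pointer yields e.g. "29.154.80.208.in-addr.arpa"
        expected = ipaddress.ip_address(address).reverse_pointer
        if pointers.get(expected) != name:
            problems.append(f"{name} ({address}) has no matching PTR record")
    return problems


if __name__ == "__main__":
    sample = """\
urldownloader1001.wikimedia.org has address 208.80.154.29
29.154.80.208.in-addr.arpa domain name pointer urldownloader1001.wikimedia.org.
"""
    print(check_forward_reverse(sample))
```

An empty list means every forward record in the pasted output has a PTR record pointing back at the same name.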

Four VMs have been created and their MAC addresses are in https://gerrit.wikimedia.org/r/c/operations/puppet/+/562394, but the OS has not been installed yet. Buster or stretch?

MoritzMuehlenhoff renamed this task from Migrate URL downloaders to Stretch/Buster to Migrate URL downloaders to Buster.Jan 7 2020, 9:33 AM

Change 563148 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Fix coredump_dir for stretch/buster

https://gerrit.wikimedia.org/r/563148

Change 563148 merged by Muehlenhoff:
[operations/puppet@production] Fix coredump_dir for stretch/buster

https://gerrit.wikimedia.org/r/563148

Change 563154 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Apply url downloader role to new hosts

https://gerrit.wikimedia.org/r/563154

Change 563154 merged by Muehlenhoff:
[operations/puppet@production] Apply url downloader role to new hosts

https://gerrit.wikimedia.org/r/563154

urldownloader* have been installed and are working fine in my tests; the only remaining step (which I will do on Monday) is to switch the CNAMEs and later remove the old jessie instances.

Change 564588 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/dns@master] Switch url-downloader.codfw to urldownloader2001

https://gerrit.wikimedia.org/r/564588

Change 564588 merged by Muehlenhoff:
[operations/dns@master] Switch url-downloader.codfw to urldownloader2001

https://gerrit.wikimedia.org/r/564588

Change 565015 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] calico: Add new urldownloader

https://gerrit.wikimedia.org/r/565015

Change 565016 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] calico: Remove all urldownloader IPs

https://gerrit.wikimedia.org/r/565016

Change 565015 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] calico: Add new urldownloaders

https://gerrit.wikimedia.org/r/565015

Mentioned in SAL (#wikimedia-operations) [2020-01-15T13:53:22Z] <akosiaris> update calico policy on eqiad/codfw/staging. Add new urldownloaders. T224551

Change 565033 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/dns@master] Switch url-downloader.eqiad to urldownloader1001

https://gerrit.wikimedia.org/r/565033

Change 565033 merged by Muehlenhoff:
[operations/dns@master] Switch url-downloader.eqiad to urldownloader1001

https://gerrit.wikimedia.org/r/565033

Change 565271 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove actinium|alcyone|alsafi|aluminium

https://gerrit.wikimedia.org/r/565271

Change 565282 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/dns@master] Remove DNS records for actinium|alcyone|alsafi|aluminium

https://gerrit.wikimedia.org/r/565282

Change 565288 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] profile::url_downloader: Remove support for jessie

https://gerrit.wikimedia.org/r/565288

Today these alerts happened:

19:03 <+icinga-wm> PROBLEM - Check systemd state on urldownloader2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
                   https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
19:03 <+icinga-wm> PROBLEM - Check systemd state on urldownloader1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
                   https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
19:04 <+icinga-wm> PROBLEM - Check systemd state on urldownloader2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
                   https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
19:05 <+icinga-wm> PROBLEM - Check systemd state on urldownloader1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
                   https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

I connected to one of them and saw the failed unit was logrotate.

ExecStart=/usr/sbin/logrotate /etc/logrotate.conf (code=exited, status=1/FAILURE)

I tried to start logrotate manually and found out it was not starting because of:

error: squid3:7 duplicate log entry for /var/log/squid/access.log
error: found error in file squid3, skipping

Then I saw that /etc/logrotate.d/ contains two files, "squid" and "squid3". Both cover the squid access.log, and because of this duplicate entry logrotate refuses to run.

Since "squid3" is the old, pre-buster package name and the package is now just called "squid", I deleted the squid3 file and started the logrotate service.

This made the alerts recover on all 4 servers.

19:45 < mutante> !log urldownloaders - rm /etc/logrotate.d/squid3 ; systemctl start logrotate (this fixes failed logrotate because of squid3 vs squid file = duplicate entry, but puppet will recreate it)

<+icinga-wm> RECOVERY - Check systemd state on urldownloader1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

...

But running puppet recreates the squid3 file. So this will happen again next time it gets restarted and needs a follow-up fix.
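The duplicate-entry condition described in this comment can be detected before logrotate trips over it. The following is a minimal sketch under simplifying assumptions: it compares literal path strings only (no glob expansion, no `include` handling), and the function name `find_duplicate_log_paths` is hypothetical. It is not how logrotate itself parses its configuration.

```python
from collections import defaultdict


def find_duplicate_log_paths(configs):
    """Map each log path declared in more than one file to those files.

    `configs` maps a config file name to its text content.
    """
    seen = defaultdict(list)
    for name, text in configs.items():
        depth = 0  # brace nesting; 0 means we are outside any stanza
        for raw in text.splitlines():
            line = raw.split("#", 1)[0].strip()  # drop comments
            if not line:
                continue
            if depth == 0:
                # Outside a stanza, a line starting with "/" lists log paths
                # (optionally followed by "{"). Directive lines are skipped.
                tokens = line.split("{", 1)[0].split()
                if tokens and tokens[0].startswith("/"):
                    for token in tokens:
                        if name not in seen[token]:
                            seen[token].append(name)
            # Assumes braces inside scripts are balanced on each line.
            depth += line.count("{") - line.count("}")
    return {path: files for path, files in seen.items() if len(files) > 1}


if __name__ == "__main__":
    example = {
        "squid": "/var/log/squid/access.log {\n    daily\n}\n",
        "squid3": "/var/log/squid/access.log {\n    weekly\n}\n",
    }
    print(find_duplicate_log_paths(example))
```

The example file contents are made up for illustration; only the file names and the access.log path come from the error message above.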

Good catch! I'll review the differences between the logrotate config shipped in the Debian package and our Puppet one; maybe we can simply stick with the Debian default entirely.

Change 565522 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Fix the name of the logrotate config file for Squid 4

https://gerrit.wikimedia.org/r/565522

Change 565522 merged by Muehlenhoff:
[operations/puppet@production] Fix the name of the logrotate config file for Squid 4

https://gerrit.wikimedia.org/r/565522

Mentioned in SAL (#wikimedia-operations) [2020-01-17T13:38:42Z] <moritzm> masking squid3 on old URL downloaders T224551

cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: aluminium.wikimedia.org

  • aluminium.wikimedia.org (FAIL)
    • Downtimed host on Icinga
    • No management interface found (likely a VM)
    • Wiped bootloaders
    • Shutdown issued. Verify it manually, verification not yet supported
    • Set Netbox status on VM not yet supported: manual intervention required
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: alcyone.wikimedia.org

  • alcyone.wikimedia.org (FAIL)
    • Downtimed host on Icinga
    • No management interface found (likely a VM)
    • Wiped bootloaders
    • Shutdown issued. Verify it manually, verification not yet supported
    • Set Netbox status on VM not yet supported: manual intervention required
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Mentioned in SAL (#wikimedia-operations) [2020-01-20T10:06:55Z] <moritzm> removing alcyone/aluminium in Ganeti T224551

cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: alsafi.wikimedia.org

  • alsafi.wikimedia.org (FAIL)
    • Downtimed host on Icinga
    • No management interface found (likely a VM)
    • Wiped bootloaders
    • Shutdown issued. Verify it manually, verification not yet supported
    • Set Netbox status on VM not yet supported: manual intervention required
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: actinium.wikimedia.org

  • actinium.wikimedia.org (FAIL)
    • Downtimed host on Icinga
    • No management interface found (likely a VM)
    • Wiped bootloaders
    • Shutdown issued. Verify it manually, verification not yet supported
    • Set Netbox status on VM not yet supported: manual intervention required
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Mentioned in SAL (#wikimedia-operations) [2020-01-20T12:09:14Z] <moritzm> removing actinium in Ganeti T224551

Change 565271 merged by Muehlenhoff:
[operations/puppet@production] Remove actinium|alcyone|alsafi|aluminium

https://gerrit.wikimedia.org/r/565271

Change 565282 merged by Muehlenhoff:
[operations/dns@master] Remove DNS records for actinium|alcyone|alsafi|aluminium

https://gerrit.wikimedia.org/r/565282

Change 565288 merged by Muehlenhoff:
[operations/puppet@production] profile::url_downloader: Remove support for jessie

https://gerrit.wikimedia.org/r/565288

But running puppet recreates the squid3 file. So this will happen again next time it gets restarted and needs a follow-up fix.

This was an error in the Squid Puppet class. I merged the fix and cleaned up the old squid3 configs on Friday. Log rotation is working fine again.

This is complete. The new Buster instances are urldownloader[12]00[12] and the old jessie systems have been removed.
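For readers unfamiliar with the shorthand, urldownloader[12]00[12] stands for four host names. A minimal sketch that expands such a pattern, assuming each bracket is a plain character class with no ranges such as [1-2] (the function name `expand_char_classes` is hypothetical):

```python
import itertools
import re


def expand_char_classes(pattern):
    """Expand simple [abc] character classes into every matching name."""
    # re.split with a capture group alternates literal text (even indexes)
    # and class contents (odd indexes).
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [
        [part] if index % 2 == 0 else list(part)
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for host in expand_char_classes("urldownloader[12]00[12]"):
        print(host)
    # prints urldownloader1001, urldownloader1002,
    # urldownloader2001 and urldownloader2002, one per line
```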

Change 565016 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] calico: Remove all urldownloader IPs

https://gerrit.wikimedia.org/r/565016