
try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye
Closed, ResolvedPublic

Description

  • Test creating a VM with bullseye
  • Put the planet role on it as a test candidate, webserver, python, timers, etc
  • Identify and fix puppet issues
  • Identify missing packages

  • create people1003 with bullseye
  • rsync home dirs from people1002 to people1003
  • ensure people1003 data gets backed up in Bacula
  • adjust/create wikitech pages for user docs and fingerprints
  • switch people.wikimedia.org backend in DNS
  • add logic to add a warning MOTD for users on all backends that are NOT the current backend and source of rsync
  • send mail to list to inform users
  • create people2002 with bullseye
  • other steps above but for codfw while it's passive
  • decom people1002
  • decom people2001
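
The MOTD step in the list above (warn users on every backend that is not the current backend and rsync source) can be sketched roughly as follows. This is a minimal illustration and not the Puppet logic that was actually merged; the function name and the message wording are made up for the example.

```python
def motd_warning(hostname, active_backend):
    """Return a warning string for hosts that are not the active backend.

    Hypothetical helper: the active backend is assumed to be the host that
    serves people.wikimedia.org and acts as the rsync source.
    """
    if hostname == active_backend:
        return None  # users on the active backend get no warning
    return (
        f"WARNING: {hostname} is NOT the active backend. "
        f"Please use {active_backend} instead."
    )


if __name__ == "__main__":
    active = "people1003.eqiad.wmnet"
    for host in ("people1003.eqiad.wmnet", "people2002.codfw.wmnet"):
        print(host, "->", motd_warning(host, active))
```

Keeping the decision a pure function of (hostname, active backend) means switching the backend only requires changing one value.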

https://gerrit.wikimedia.org/r/q/topic:%22peopleweb%22+(status:open%20OR%20status:merged)

Details

Subject  Repo               Branch      Lines +/-
         operations/puppet  production  +2 K -2
         operations/puppet  production  +35 -44
         operations/puppet  production  +5 -1
         operations/puppet  production  +2 -7
         operations/puppet  production  +2 -1
         operations/puppet  production  +0 -10
         operations/puppet  production  +1 -5
         operations/puppet  production  +5 -0
         operations/puppet  production  +1 -0
         operations/puppet  production  +7 -0
         operations/puppet  production  +9 -0
         operations/puppet  production  +1 -0
         operations/puppet  production  +1 -1
         operations/puppet  production  +1 -0
         operations/puppet  production  +23 -0
         operations/puppet  production  +3 -3
         operations/puppet  production  +3 -1
         operations/puppet  production  +24 -22
         operations/puppet  production  +0 -10
         operations/puppet  production  +8 -2

Event Timeline

Dzahn triaged this task as Low priority.Apr 23 2021, 4:57 PM
  • planet1003.eqiad.wmnet created with bullseye

https://gerrit.wikimedia.org/r/c/operations/puppet/+/681774
https://gerrit.wikimedia.org/r/c/operations/puppet/+/681779

  • applied planet role

https://gerrit.wikimedia.org/r/c/operations/puppet/+/682171

missing packages:

  • rawdog
  • python-tidylib -> python3-tidylib
  • python-libxml2 -> python3-libxml2
  • envoyproxy
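
The two renames in the list above amount to picking package names by distribution codename. A minimal sketch of that idea, assuming a hypothetical helper and mapping (only the two renames come from this task; the real change was made in Puppet):

```python
# Renames noted in this task; everything else passes through unchanged.
PY3_RENAMES = {
    "python-tidylib": "python3-tidylib",
    "python-libxml2": "python3-libxml2",
}


def packages_for(codename, packages):
    """Return the package names to install on the given Debian codename."""
    if codename != "bullseye":
        return list(packages)
    return [PY3_RENAMES.get(name, name) for name in packages]


if __name__ == "__main__":
    wanted = ["python-tidylib", "python-libxml2", "envoyproxy"]
    print(packages_for("buster", wanted))
    print(packages_for("bullseye", wanted))
```

Packages with no entry in the mapping, such as rawdog, still have to be provided some other way or replaced.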

Change 682181 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] planet: use python3 tidylib and libxml2 versions on bullseye

https://gerrit.wikimedia.org/r/682181

Change 682181 merged by Dzahn:

[operations/puppet@production] planet: use python3 tidylib and libxml2 versions on bullseye

https://gerrit.wikimedia.org/r/682181

Mentioned in SAL (#wikimedia-operations) [2021-04-23T19:41:35Z] <mutante> [apt1001:~] $ sudo -i reprepro copy bullseye-wikimedia buster-wikimedia envoyproxy - copy envoy package from buster to bullseye T280989

Mentioned in SAL (#wikimedia-operations) [2021-04-23T20:15:21Z] <mutante> [apt1001:~] $ sudo -i reprepro -C main includedeb bullseye-wikimedia /home/dzahn/rawdog_2.23-2_all.deb (T280989)

We'll need to find a different aggregator, then: rawdog was removed from Bullseye since it was never ported to Python 3, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=938333

planet.debian.org runs on planet-venus, but that one was also removed in bullseye because it is dead upstream and written in Python 2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940982

https://github.com/rubys/venus/issues/37 points to https://github.com/feedreader/pluto which is written in Ruby and still seems to be actively maintained.

Mentioned in SAL (#wikimedia-operations) [2021-04-26T07:09:36Z] <moritzm> removed rawdog from bullseye-wikimedia, needs Py2 T280989

Change 682739 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/DHCP: remove planet1003

https://gerrit.wikimedia.org/r/682739

Change 682739 merged by Dzahn:

[operations/puppet@production] site/DHCP: remove planet1003

https://gerrit.wikimedia.org/r/682739

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: planet1003.eqiad.wmnet

  • planet1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

Change 682755 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/DHCP: add people1003 with bullseye

https://gerrit.wikimedia.org/r/682755

Dzahn renamed this task from try planet on bullseye to try planet/people on bullseye.Apr 26 2021, 9:29 PM

Change 682755 merged by Dzahn:

[operations/puppet@production] site/DHCP: add people1003 with bullseye

https://gerrit.wikimedia.org/r/682755

Change 682766 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add peopleweb role to people1003

https://gerrit.wikimedia.org/r/682766

Change 682766 merged by Dzahn:

[operations/puppet@production] site: add peopleweb role to people1003

https://gerrit.wikimedia.org/r/682766

Change 682771 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: let people1003 use bullseye installer

https://gerrit.wikimedia.org/r/682771

Change 682771 merged by Dzahn:

[operations/puppet@production] DHCP: let people1003 use bullseye installer

https://gerrit.wikimedia.org/r/682771

Change 682776 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ssl: update cert for peopleweb.discovery.wmnet

https://gerrit.wikimedia.org/r/682776

Change 682776 merged by Dzahn:

[operations/puppet@production] ssl: update cert for peopleweb.discovery.wmnet

https://gerrit.wikimedia.org/r/682776

FYI: people1003 is failing to be backed up.

https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home&from=1619492511586&to=1619508775373

Monitoring will indicate a failure if a job backs up no files or is unable to run (e.g. because the host is under maintenance, down, etc.).

In this case, it sent out a fatal error:

27-Apr 04:54 backup1001.eqiad.wmnet JobId 329473: Start Backup JobId 329473, Job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-27_04.05.01_22
27-Apr 04:54 backup1001.eqiad.wmnet JobId 329473: Using Device "FileStorageProduction" to write.
27-Apr 04:55 backup1001.eqiad.wmnet JobId 329473: Warning: bsockcore.c:203 Could not connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Connection timed out
Retrying ...
27-Apr 04:57 backup1001.eqiad.wmnet JobId 329473: Fatal error: bsockcore.c:209 Unable to connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Interrupted system call
27-Apr 04:57 backup1001.eqiad.wmnet JobId 329473: Fatal error: No Job status returned from FD.

If this is temporary, no problem; if it is long term, it should be added to the ignore list for backup monitoring at: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/backup/job_monitoring_ignorelist

I filed https://phabricator.wikimedia.org/T281219 for finding a replacement for rawdog.

> If this is temporary, no problem; if it is long term, it should be added to the ignore list for backup monitoring

It's definitely temporary and a fresh install that has some issue on VM level. I will clean it up today.

No problem. Sadly, it is my job to bother people from time to time, making sure backups are working 0:-).

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: people1003.eqiad.wmnet

  • people1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

> No problem. Sadly, it is my job to bother people from time to time, making sure backups are working 0:-).

@jcrespo fixed. from CRIT to WARN. remaining WARNs are unrelated hosts

I had removed the role with the backup::set and today I reverted and added it back. That made the Icinga alert trigger again.

Then I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/683732

After deploying that I refreshed next service check in Icinga and it went from CRIT to WARN and the people1003 part of it is gone.

Change 683741 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] rsync::quickdatacopy: ensure a 'passive' host gets an rsync client

https://gerrit.wikimedia.org/r/683741

Change 683742 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: ensure rsync is installed

https://gerrit.wikimedia.org/r/683742

We can try to force a manual run of a backup. Backups can fail for many reasons: they are attempted while the host is rebooting or without network, or they simply return no files. Let me know when people1003 is in a good state so I can check it and revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/683732 (no rush).

@jcrespo It looked to me like it hadn't actually failed but just had not run yet, i.e. it was scheduled but waiting for the start time. I saw the job scheduled for May 1st and it was the 29th. I assumed the issue is that the Icinga check can't distinguish between failed and "has not tried yet" because both mean there is no proof of a successful run, and that it would simply be a matter of downtiming new hosts until the job has run for the first time?

Change 683742 merged by Dzahn:

[operations/puppet@production] peopleweb: ensure rsync is installed

https://gerrit.wikimedia.org/r/683742

Change 683741 merged by Dzahn:

[operations/puppet@production] rsync::quickdatacopy: ensure a destination host gets an rsync client

https://gerrit.wikimedia.org/r/683741

> I assumed the issue is that the Icinga check can't distinguish between failed and "has not tried yet" because both mean there is no proof of a successful run.

It actually does distinguish them:

  • No run generates a warning, and says so
  • Tried and failed generates a critical

There are in fact 5 categories, with different meanings and alerting levels, from "Fresh" to "All failures", as seen at: https://wikitech.wikimedia.org/wiki/Bacula#Monitoring

Downtiming the host will not affect backups, as the monitoring and scheduling happen from the bacula hosts. That is why the list of ignored jobs exists: backups can continue being executed, but monitoring ignores them until things are stable.
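
To make the distinction concrete, here is a deliberately simplified sketch of the two cases named in this comment plus the ignore list. It is not the real check, which has five categories (see the wikitech link); the function name, the "IGNORED" and "OK" return values, and collapsing everything else into "OK" are assumptions for illustration.

```python
def alert_level(job, statuses, ignorelist):
    """Map a job's recorded attempt statuses to an alert level.

    statuses: Bacula job status codes, oldest first
              ('T' = terminated successfully, 'f' = fatal error).
    """
    if job in ignorelist:
        return "IGNORED"   # job still runs, monitoring just skips it
    if not statuses:
        return "WARNING"   # no run recorded yet
    if all(s == "f" for s in statuses):
        return "CRITICAL"  # every recorded attempt failed
    return "OK"            # simplification of the remaining categories


if __name__ == "__main__":
    job = "people1003.eqiad.wmnet-Monthly-1st-Sun-production-home"
    print(alert_level(job, [], set()))
    print(alert_level(job, ["f", "f"], set()))
    print(alert_level(job, ["f", "f"], {job}))
```

Because the ignore list is consulted inside the check rather than by downtiming the host, the job keeps being scheduled while its result is hidden from the alert.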

> There are in fact 5 categories, with different meanings and alerting levels, from "Fresh" to "All failures", as seen at: https://wikitech.wikimedia.org/wiki/Bacula#Monitoring

Thanks for the pointer! This case seems to match the "No backups: there were no backups or attempts to backing up recorded (successful or not). This could be just a new, recently configured to be backed up host". But I will look closer.

I reverted the addition to the ignore list. Setup is done, there is no reason why it should fail. Let's see what happens. I am running puppet, refreshing Icinga etc.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/684463

== jobs_with_all_failures (1) ==

people1003.eqiad.wmnet-Monthly-1st-Sun-production-home
[backup1001:~] $ sudo check_bacula.py people1003.eqiad.wmnet-Monthly-1st-Sun-production-home
2021-04-27 04:54:36: type: F, status: f, bytes: 0
2021-04-28 05:35:13: type: F, status: f, bytes: 0

@jcrespo This ^ is the current status after removing it from the ignore list. It is in the "with_all_failures" category and I can see the "type: F" and "status: f".

Since the job name includes "Monthly-1st-Sun", does it mean we need to wait until the next 1st of the month? And if I hadn't put it on the ignore list on May 1st, would it be done by now?

I am answering from mail- apologies for any formatting errors.

I can have a deeper look tomorrow.

But first, one important thing I forgot to communicate: please do not ack/downtime the bacula check for just 1 job (except for a good reason, such as a software bug, or with a very short expiration date, until I am online).

Doing so will hide errors on the other 100 alerts on the same entry, and could make backups unusable. We will be able to have better individual alerts (and individual acks) once we are no longer on icinga. Also, that prevents you acking it and me un-acking it the next day O:-).

If you have to ack, use the individual job ignore method if possible. It will not delete the job, just ignore its errors; it was designed as an alternative to ignoring all jobs.

OK, it seems none of the 3 methods (keep on ignore list, ACK, don't ACK) is ideal, so I am not sure how to handle it correctly. I will just put it back on the ignore list then.

  • host back on ignore list
  • icinga alert cleared
  • removed ACK from icinga check

This can wait, there is no urgency to it. I am also not doing anything special here, so maybe it's all bullseye related.

These are the references to people1003 in the backup logs. There are no recent failures.

root@backup1001:~$ grep people1003.eqiad.wmnet /var/log/bacula/log.1
27-Apr 04:54 backup1001.eqiad.wmnet JobId 329473: Start Backup JobId 329473, Job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-27_04.05.01_22
27-Apr 04:55 backup1001.eqiad.wmnet JobId 329473: Warning: bsockcore.c:203 Could not connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Connection timed out
27-Apr 04:57 backup1001.eqiad.wmnet JobId 329473: Fatal error: bsockcore.c:209 Unable to connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Interrupted system call
  Job:                    people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-27_04.05.01_22
  Client:                 "people1003.eqiad.wmnet-fd" 
28-Apr 05:35 backup1001.eqiad.wmnet JobId 329595: Start Backup JobId 329595, Job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-28_04.05.02_08
28-Apr 05:37 backup1001.eqiad.wmnet JobId 329595: Warning: bsockcore.c:203 Could not connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Connection timed out
28-Apr 05:38 backup1001.eqiad.wmnet JobId 329595: Fatal error: bsockcore.c:209 Unable to connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Interrupted system call
  Job:                    people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-28_04.05.02_08
  Client:                 "people1003.eqiad.wmnet-fd" 
28-Apr 21:08 backup1001.eqiad.wmnet JobId 0: Client=people1003.eqiad.wmnet-fd not found. Assuming it was removed!!!
28-Apr 21:08 backup1001.eqiad.wmnet JobId 0: Job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home not found. Assuming it was removed!!!

The errors could be just that the host was unavailable/daemon stopped when backups were attempted.
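
For picking out such connection failures, a small sketch along these lines could help. The regular expression is written against the log lines quoted above and is an assumption about their format, not something taken from Bacula's documentation; the function name is made up.

```python
import re

# Matches lines like:
#   "... JobId 329473: Fatal error: bsockcore.c:209 Unable to connect to
#    Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ..."
FATAL_RE = re.compile(
    r"JobId (\d+): Fatal error: .*Unable to connect to Client: (\S+)-fd "
)


def failed_connects(log_text, client):
    """Return the JobIds that logged a fatal connect error for `client`."""
    job_ids = []
    for line in log_text.splitlines():
        match = FATAL_RE.search(line)
        if match and match.group(2) == client:
            job_ids.append(int(match.group(1)))
    return job_ids


SAMPLE_LOG = """\
27-Apr 04:57 backup1001.eqiad.wmnet JobId 329473: Fatal error: bsockcore.c:209 Unable to connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Interrupted system call
28-Apr 05:38 backup1001.eqiad.wmnet JobId 329595: Fatal error: bsockcore.c:209 Unable to connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Interrupted system call
28-Apr 21:08 backup1001.eqiad.wmnet JobId 0: Client=people1003.eqiad.wmnet-fd not found. Assuming it was removed!!!
"""

if __name__ == "__main__":
    print(failed_connects(SAMPLE_LOG, "people1003.eqiad.wmnet"))
```

On the sample lines this reports the two failed jobs, 329473 and 329595, and skips the "not found" line.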

As with recoveries, you can trigger a backup run manually with the "run" command. I just did that and will report when it starts executing to see what, if anything, goes wrong.

Change 685149 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] httpbb: add tests and test_suite for people.wm.org

https://gerrit.wikimedia.org/r/685149

new httpbb tests show webserver working as expected, including URLs behind auth:

[deploy1002:~] $ httpbb --hosts people[1002,1003].eqiad.wmnet /home/dzahn/test_people.yaml 
Sending to 2 hosts...
PASS: 5 requests sent to each of 2 hosts. All assertions passed.

Latest status:

root@backup1001:~$ check_bacula.py people1003.eqiad.wmnet-Monthly-1st-Sun-production-home
None: type: I, status: C, bytes: 0
2021-04-27 04:54:36: type: F, status: f, bytes: 0
2021-04-28 05:35:13: type: F, status: f, bytes: 0
2021-05-05 05:51:58: type: F, status: T, bytes: 56831708224
2021-05-05 08:27:20: type: F, status: R, bytes: 0

After 2 failures (f), it terminated successfully (T), so we can remove it from the ignore list (I will do it). It is normal for there to be some delay between a job being scheduled and it running; meanwhile, the alert shows the latest completed result, which for people1003 was "all failures". Once removed from the list, it will be placed in the group "jobs_with_fresh_backups". Full backups are scheduled in the first week of the month, so some delay there is normal.
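
The "latest completed result" rule can be illustrated by parsing the output shown above. This is a sketch that assumes the line format printed by check_bacula.py in this task and treats only T and f as completed; C (created) and R (running) are skipped. The helper names are made up.

```python
import re

LINE_RE = re.compile(
    r"^(?P<when>.+?): type: (?P<type>\w), status: (?P<status>\w), "
    r"bytes: (?P<bytes>\d+)$"
)


def parse_history(text):
    """Parse check_bacula.py style lines into a list of dicts, in order."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append({
                "when": match["when"],
                "type": match["type"],
                "status": match["status"],
                "bytes": int(match["bytes"]),
            })
    return rows


def latest_completed(rows):
    """Return the last row with a completed status (T or f), or None."""
    done = [row for row in rows if row["status"] in ("T", "f")]
    return done[-1] if done else None


SAMPLE = """\
None: type: I, status: C, bytes: 0
2021-04-27 04:54:36: type: F, status: f, bytes: 0
2021-04-28 05:35:13: type: F, status: f, bytes: 0
2021-05-05 05:51:58: type: F, status: T, bytes: 56831708224
2021-05-05 08:27:20: type: F, status: R, bytes: 0
"""

if __name__ == "__main__":
    print(latest_completed(parse_history(SAMPLE)))
```

For the sample, the latest completed row is the successful full backup from 2021-05-05 05:51:58; the job still running at 08:27:20 is ignored.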

Change 685149 merged by Dzahn:

[operations/puppet@production] httpbb: add tests and test_suite for people.wm.org

https://gerrit.wikimedia.org/r/685149

Dzahn renamed this task from try planet/people on bullseye to try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye.May 11 2021, 6:29 PM
Dzahn updated the task description. (Show Details)
Dzahn updated the task description. (Show Details)

Change 689197 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: fix logic for changing MOTD based on rsync source

https://gerrit.wikimedia.org/r/689197

Change 689197 merged by Dzahn:

[operations/puppet@production] peopleweb: fix logic for changing MOTD based on rsync source

https://gerrit.wikimedia.org/r/689197

Change 689375 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: add people2002 MAC address and use bullseye installer

https://gerrit.wikimedia.org/r/689375

Change 689375 merged by Dzahn:

[operations/puppet@production] DHCP: add people2002 MAC address and use bullseye installer

https://gerrit.wikimedia.org/r/689375

00:51 < mutante> !log [people1002:/home] $ sudo find . -type d -name public_html -exec chmod 555 {} \;
00:54 < mutante> !log made public_html dirs on people1002 readonly to make it obvious it is not the active backend anymore

rsync one last time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/689380

[people1003:/home] $ sudo find . -type d -name public_html -exec chmod 755 {} \;
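
For reference, the two find/chmod commands above (555 on the old backend, 755 on the new one) correspond roughly to the following sketch. It is an illustrative equivalent rather than what was run; the function name is made up, and symlinks are skipped to mirror find's -type d.

```python
import os


def set_public_html_mode(root, mode):
    """chmod every real directory named public_html under root.

    Returns the list of paths that were changed.
    """
    changed = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            # os.walk lists symlinks to directories too; skip them.
            if name == "public_html" and not os.path.islink(path):
                os.chmod(path, mode)
                changed.append(path)
    return changed


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as home:
        os.makedirs(os.path.join(home, "alice", "public_html"))
        print(set_public_html_mode(home, 0o555))  # read-only, like people1002
        print(set_public_html_mode(home, 0o755))  # writable again, like people1003
```

Mode 555 still allows reading and traversal, so existing content stays visible while users can no longer add or remove files in the directory.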

Change 689259 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] backups: add people2002 to ignore file to avoid false positive monitoring alert

https://gerrit.wikimedia.org/r/689259

Change 689259 merged by Dzahn:

[operations/puppet@production] backups: add people2002 to ignore file to avoid false positive monitoring alert

https://gerrit.wikimedia.org/r/689259

Mentioned in SAL (#wikimedia-operations) [2021-05-12T01:35:48Z] <mutante> people2002 - created new VM resembling people2001, signed puppet cert request, initial puppet run T280989

Change 689407 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add people2002 with peopleweb role

https://gerrit.wikimedia.org/r/689407

Change 689412 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add people2002 with insetup role

https://gerrit.wikimedia.org/r/689412

Change 689412 merged by Dzahn:

[operations/puppet@production] site: add people2002 with insetup role

https://gerrit.wikimedia.org/r/689412

Change 689407 merged by Dzahn:

[operations/puppet@production] site: add peoplweb role to people2002

https://gerrit.wikimedia.org/r/689407

Mentioned in SAL (#wikimedia-operations) [2021-05-12T18:48:12Z] <mutante> rsyncing home dirs of people1003 over to people2002 as well (T280989)

Change 690021 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: remove people1002 and people2001

https://gerrit.wikimedia.org/r/690021

Change 690021 merged by Dzahn:

[operations/puppet@production] DHCP: remove people1002 and people2001

https://gerrit.wikimedia.org/r/690021

Change 690329 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] bacula: Do not ignore people2002 and ignore cloudmetrics1002

https://gerrit.wikimedia.org/r/690329

Change 690329 merged by Jcrespo:

[operations/puppet@production] bacula: Do not ignore people2002 and ignore cloudmetrics1002

https://gerrit.wikimedia.org/r/690329

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: people2001.codfw.wmnet

  • people2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

Change 690666 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove people1002 and people2001, update comments

https://gerrit.wikimedia.org/r/690666

Change 690787 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: put a public_html into /etc/skel to ensure all users get one

https://gerrit.wikimedia.org/r/690787

Change 691131 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] O:admin: add ability to manage home

https://gerrit.wikimedia.org/r/691131

Change 690666 merged by Dzahn:

[operations/puppet@production] site: remove people1002 and people2001, update comments

https://gerrit.wikimedia.org/r/690666

Change 690787 merged by Dzahn:

[operations/puppet@production] peopleweb: put a public_html into /etc/skel to ensure all users get one

https://gerrit.wikimedia.org/r/690787

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: people1002.eqiad.wmnet

  • people1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Dzahn raised the priority of this task from Low to Medium.
Dzahn removed a project: Patch-For-Review.
Dzahn updated the task description. (Show Details)

This is done.

people1003 and people2002 on bullseye have completely replaced people1002 and people2001 on buster

the buster VMs have been decom'ed fully

[people1003:~] $ host people1002.eqiad.wmnet
Host people1002.eqiad.wmnet not found: 3(NXDOMAIN)
[people1003:~] $ host people2001.codfw.wmnet
Host people2001.codfw.wmnet not found: 3(NXDOMAIN)

[people1003:~] $ lsb_release -c
Codename: bullseye

[people2002:~] $ lsb_release -c
Codename: bullseye

Change 691131 merged by Jbond:

[operations/puppet@production] C:admin: add ability to manage home

https://gerrit.wikimedia.org/r/691131

Change 898982 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] planet: if on bullseye, install package contents via puppet

https://gerrit.wikimedia.org/r/898982

Change 898982 abandoned by Dzahn:

[operations/puppet@production] planet: if on bullseye, install package contents via puppet

Reason:

https://gerrit.wikimedia.org/r/898982