
try planet/people on bullseye
Open, Low, Public

Description

  • Test creating a VM with bullseye
  • Apply the planet role to it as a test candidate: webserver, python, timers, etc.
  • Identify and fix puppet issues
  • Identify missing packages

Event Timeline

Dzahn triaged this task as Low priority. Fri, Apr 23, 4:57 PM
  • planet1003.eqiad.wmnet created with bullseye

https://gerrit.wikimedia.org/r/c/operations/puppet/+/681774
https://gerrit.wikimedia.org/r/c/operations/puppet/+/681779

  • applied planet role

https://gerrit.wikimedia.org/r/c/operations/puppet/+/682171

missing packages:

  • rawdog
  • python-tidylib -> python3-tidylib
  • python-libxml2 -> python3-libxml2
  • envoyproxy
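The two python-* → python3-* renames above keep the same Python import names. As a hedged sanity check (the import names "tidylib" and "libxml2" are assumed from the package names), the modules can be probed for without importing them:

```python
import importlib.util

def importable(module_name):
    """True if the module can be located on this host (without importing it)."""
    return importlib.util.find_spec(module_name) is not None

# On a bullseye host with python3-tidylib and python3-libxml2 installed,
# both of these should be found.
for module in ("tidylib", "libxml2"):
    print(module, "found" if importable(module) else "missing")
```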

Change 682181 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] planet: use python3 tidylib and libxml2 versions on bullseye

https://gerrit.wikimedia.org/r/682181

Change 682181 merged by Dzahn:

[operations/puppet@production] planet: use python3 tidylib and libxml2 versions on bullseye

https://gerrit.wikimedia.org/r/682181

Mentioned in SAL (#wikimedia-operations) [2021-04-23T19:41:35Z] <mutante> [apt1001:~] $ sudo -i reprepro copy bullseye-wikimedia buster-wikimedia envoyproxy - copy envoy package from buster to bullseye T280989

Mentioned in SAL (#wikimedia-operations) [2021-04-23T20:15:21Z] <mutante> [apt1001:~] $ sudo -i reprepro -C main includedeb bullseye-wikimedia /home/dzahn/rawdog_2.23-2_all.deb (T280989)

We'll need to find a different aggregator, then: rawdog was removed from Bullseye since it was never ported to Python 2, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=938333


planet.debian.org runs on planet-venus, but that was also removed in bullseye because it is dead upstream and written in Python 2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940982

https://github.com/rubys/venus/issues/37 points to https://github.com/feedreader/pluto which is written in Ruby and still seems to be actively maintained.

Mentioned in SAL (#wikimedia-operations) [2021-04-26T07:09:36Z] <moritzm> removed rawdog from bullseye-wikimedia, needs Py2 T280989

Change 682739 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/DHCP: remove planet1003

https://gerrit.wikimedia.org/r/682739

Change 682739 merged by Dzahn:

[operations/puppet@production] site/DHCP: remove planet1003

https://gerrit.wikimedia.org/r/682739

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: planet1003.eqiad.wmnet

  • planet1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

Change 682755 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/DHCP: add people1003 with bullseye

https://gerrit.wikimedia.org/r/682755

Dzahn renamed this task from try planet on bullseye to try planet/people on bullseye. Mon, Apr 26, 9:29 PM

Change 682755 merged by Dzahn:

[operations/puppet@production] site/DHCP: add people1003 with bullseye

https://gerrit.wikimedia.org/r/682755

Change 682766 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add peopleweb role to people1003

https://gerrit.wikimedia.org/r/682766

Change 682766 merged by Dzahn:

[operations/puppet@production] site: add peopleweb role to people1003

https://gerrit.wikimedia.org/r/682766

Change 682771 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: let people1003 use bullseye installer

https://gerrit.wikimedia.org/r/682771

Change 682771 merged by Dzahn:

[operations/puppet@production] DHCP: let people1003 use bullseye installer

https://gerrit.wikimedia.org/r/682771

Change 682776 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ssl: update cert for peopleweb.discovery.wmnet

https://gerrit.wikimedia.org/r/682776

Change 682776 merged by Dzahn:

[operations/puppet@production] ssl: update cert for peopleweb.discovery.wmnet

https://gerrit.wikimedia.org/r/682776

FYI: people1003 is failing to be backed up.

https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home&from=1619492511586&to=1619508775373

Monitoring will indicate a failure if a job backs up no files or is unable to run (e.g. because the host is under maintenance, down, etc.).

In this case, it sent out a fatal error:

27-Apr 04:54 backup1001.eqiad.wmnet JobId 329473: Start Backup JobId 329473, Job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-27_04.05.01_22
27-Apr 04:54 backup1001.eqiad.wmnet JobId 329473: Using Device "FileStorageProduction" to write.
27-Apr 04:55 backup1001.eqiad.wmnet JobId 329473: Warning: bsockcore.c:203 Could not connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Connection timed out
Retrying ...
27-Apr 04:57 backup1001.eqiad.wmnet JobId 329473: Fatal error: bsockcore.c:209 Unable to connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Interrupted system call
27-Apr 04:57 backup1001.eqiad.wmnet JobId 329473: Fatal error: No Job status returned from FD.

If this is temporary, no problem; if it is long term, it should be added to the backup monitoring ignore list at: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/backup/job_monitoring_ignorelist
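For reference, that ignore list is a plain file in the puppet tree. Based on the job names used elsewhere in this task, an entry would presumably be one Bacula job name per line (the exact format is an assumption; check the file itself):

```
# modules/profile/files/backup/job_monitoring_ignorelist
people1003.eqiad.wmnet-Monthly-1st-Sun-production-home
```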

I filed https://phabricator.wikimedia.org/T281219 for finding a replacement for rawdog.

If this is temporary, no problem; if it is long term, it should be added to the backup monitoring ignore list

It's definitely temporary: a fresh install that has some issue at the VM level. I will clean it up today.

No problem. Sadly, it is my job to bother people from time to time, making sure backups are working 0:-).

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: people1003.eqiad.wmnet

  • people1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

@jcrespo Fixed: from CRIT to WARN. The remaining WARNs are unrelated hosts.

I had removed the role that includes the backup::set, and today I reverted that and added it back. That made the Icinga alert trigger again.

Then I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/683732

After deploying that, I refreshed the next service check in Icinga and it went from CRIT to WARN; the people1003 part of it is gone.

Change 683741 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] rsync::quickdatacopy: ensure a 'passive' host gets an rsync client

https://gerrit.wikimedia.org/r/683741

Change 683742 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: ensure rsync is installed

https://gerrit.wikimedia.org/r/683742

We can try to force a manual run of a backup. Backups can fail for many reasons: they are attempted while the host is rebooting or without network, or they simply return no files. Let me know when things at people1003 are in a good state to check it and revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/683732 (no rush).

@jcrespo It looked to me like it hadn't actually failed but had simply never run yet, as if it were scheduled and waiting for its start time. I saw the job scheduled for May 1st and it was the 29th. I assumed the issue is that the Icinga check can't distinguish between "failed" and "has not tried yet", because both mean there is no proof of a successful run, and that it would simply be a matter of downtiming new hosts until the job has run for the first time?

Change 683742 merged by Dzahn:

[operations/puppet@production] peopleweb: ensure rsync is installed

https://gerrit.wikimedia.org/r/683742

Change 683741 merged by Dzahn:

[operations/puppet@production] rsync::quickdatacopy: ensure a destination host gets an rsync client

https://gerrit.wikimedia.org/r/683741

I assumed the issue is the Icinga check can't distinguish between failed and "has not tried yet" because both mean there is no proof of a successful run.

It actually does distinguish them:

  • No run generates a warning, and says so
  • Tried and failed generates a critical

There are in fact five categories, with different meanings and alerting levels, from "Fresh" to "All failures", as seen at: https://wikitech.wikimedia.org/wiki/Bacula#Monitoring

Downtiming the host will not affect backups, as the monitoring and scheduling happen from the bacula hosts. That is why that list of ignored jobs exists: so that backups can continue being executed while monitoring ignores them until things are stable.
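As a hedged sketch of the bucketing described above (two category names are taken from this task's check_bacula.py output; "jobs_with_stale_backups" and the freshness threshold are invented placeholders; the real logic lives in check_bacula.py):

```python
def categorize(runs, freshness_days=35):
    """Bucket a backup job from its run history.

    runs: list of (status, age_days) tuples, where status "T" means
    terminated OK and "f" means fatal error (Bacula job status codes).
    Category names follow those seen in this task; the freshness
    threshold is an illustrative assumption, not the real value.
    """
    if not runs:
        return "no_backups"
    if all(status == "f" for status, _ in runs):
        return "jobs_with_all_failures"
    newest_ok = min(age for status, age in runs if status == "T")
    if newest_ok <= freshness_days:
        return "jobs_with_fresh_backups"
    return "jobs_with_stale_backups"

print(categorize([("f", 8), ("f", 7), ("T", 1)]))  # jobs_with_fresh_backups
```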

There are in fact five categories, with different meanings and alerting levels, from "Fresh" to "All failures", as seen at: https://wikitech.wikimedia.org/wiki/Bacula#Monitoring

Thanks for the pointer! This case seems to match the "No backups: there were no backups or attempts to backing up recorded (successful or not). This could be just a new, recently configured to be backed up host". But I will look closer.

I reverted the addition to the ignore list. Setup is done, there is no reason why it should fail. Let's see what happens. I am running puppet, refreshing Icinga etc.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/684463

== jobs_with_all_failures (1) ==

people1003.eqiad.wmnet-Monthly-1st-Sun-production-home
[backup1001:~] $ sudo check_bacula.py people1003.eqiad.wmnet-Monthly-1st-Sun-production-home
2021-04-27 04:54:36: type: F, status: f, bytes: 0
2021-04-28 05:35:13: type: F, status: f, bytes: 0

@jcrespo This ^ is the current status after removing it from the ignore list. It is in the "with_all_failures" category and I can see the "type: F" and "status: f".

Since the job name includes "Monthly-1st-Sun", does that mean we need to wait until the next 1st of the month? And if I hadn't put it on the ignore list over May 1st, would it be done by now?
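The status letters in the check_bacula.py output above are standard Bacula job status codes (T = terminated OK, f = fatal error, R = running, C = created but not yet running). A small sketch that parses such a line (the output format is inferred from the excerpts in this task):

```python
import re

# Standard Bacula job status codes, as they appear in the output above.
STATUS = {
    "T": "terminated OK",
    "f": "fatal error",
    "R": "running",
    "C": "created, not yet running",
}

LINE = re.compile(r"type: (\w), status: (\w), bytes: (\d+)")

def parse(line):
    """Return (job_type, human_status, bytes) for one check_bacula.py line."""
    m = LINE.search(line)
    if m is None:
        return None
    job_type, status, nbytes = m.groups()
    return job_type, STATUS.get(status, "unknown"), int(nbytes)

print(parse("2021-04-28 05:35:13: type: F, status: f, bytes: 0"))
# ('F', 'fatal error', 0)
```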

I am answering from mail; apologies for any formatting errors.

I can have a deeper look tomorrow.

But first, one important thing I forgot to communicate: please do not ack/downtime the bacula alert for just one job (except for a good reason, such as a software bug, or with a very short expiration date, until I am online).

Doing so will hide errors from the other ~100 alerts on the same entry, and could make backups unusable. We will be able to have better individual alerts (and individual acks) once we are no longer on Icinga. Also, that prevents you acking it and me un-acking it the next day O:-).

If you have to ack, use the individual job ignore method if possible: it will not delete the job, just ignore its errors. It was conceived as a workaround to avoid ignoring all jobs.

Ok, it seems none of the 3 methods (keep on ignore list, ACK, don't ACK) is ideal, so I am not sure how to handle it correctly. I will just put it back on the ignore list then.

  • host back on ignore list
  • icinga alert cleared
  • removed ACK from icinga check

This can wait; there is no urgency to it. I am also not doing anything special here, so maybe it's all bullseye-related.

These are the references to people1003 in the backup logs. There are no recent failures:

root@backup1001:~$ grep people1003.eqiad.wmnet /var/log/bacula/log.1
27-Apr 04:54 backup1001.eqiad.wmnet JobId 329473: Start Backup JobId 329473, Job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-27_04.05.01_22
27-Apr 04:55 backup1001.eqiad.wmnet JobId 329473: Warning: bsockcore.c:203 Could not connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Connection timed out
27-Apr 04:57 backup1001.eqiad.wmnet JobId 329473: Fatal error: bsockcore.c:209 Unable to connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Interrupted system call
  Job:                    people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-27_04.05.01_22
  Client:                 "people1003.eqiad.wmnet-fd" 
28-Apr 05:35 backup1001.eqiad.wmnet JobId 329595: Start Backup JobId 329595, Job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-28_04.05.02_08
28-Apr 05:37 backup1001.eqiad.wmnet JobId 329595: Warning: bsockcore.c:203 Could not connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Connection timed out
28-Apr 05:38 backup1001.eqiad.wmnet JobId 329595: Fatal error: bsockcore.c:209 Unable to connect to Client: people1003.eqiad.wmnet-fd on people1003.eqiad.wmnet:9102. ERR=Interrupted system call
  Job:                    people1003.eqiad.wmnet-Monthly-1st-Sun-production-home.2021-04-28_04.05.02_08
  Client:                 "people1003.eqiad.wmnet-fd" 
28-Apr 21:08 backup1001.eqiad.wmnet JobId 0: Client=people1003.eqiad.wmnet-fd not found. Assuming it was removed!!!
28-Apr 21:08 backup1001.eqiad.wmnet JobId 0: Job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home not found. Assuming it was removed!!!

The errors could be just that the host was unavailable/daemon stopped when backups were attempted.

Like recoveries, you can trigger a backup run manually with the command "run". I just did that and will report when it starts executing, to see what, if anything, goes wrong.

Change 685149 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] httpbb: add tests and test_suite for people.wm.org

https://gerrit.wikimedia.org/r/685149

new httpbb tests show the webserver working as expected, including URLs behind auth:

[deploy1002:~] $ httpbb --hosts people[1002,1003].eqiad.wmnet /home/dzahn/test_people.yaml 
Sending to 2 hosts...
PASS: 5 requests sent to each of 2 hosts. All assertions passed.
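The contents of test_people.yaml aren't shown here; as a hedged sketch, an httpbb suite for this host might look like the following (paths and assertions are illustrative, not the actual file):

```yaml
https://people.wikimedia.org:
- path: /
  assert_status: 200
- path: /this-does-not-exist
  assert_status: 404
```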

Latest status:

root@backup1001:~$ check_bacula.py people1003.eqiad.wmnet-Monthly-1st-Sun-production-home
None: type: I, status: C, bytes: 0
2021-04-27 04:54:36: type: F, status: f, bytes: 0
2021-04-28 05:35:13: type: F, status: f, bytes: 0
2021-05-05 05:51:58: type: F, status: T, bytes: 56831708224
2021-05-05 08:27:20: type: F, status: R, bytes: 0

After 2 failures (f), it terminated successfully (T), so we can remove it from the ignore list (I will do it). It is normal for there to be some delay between when a job is scheduled and when it runs; meanwhile, the alert shows the latest completed result, which for people was "all failures". Once removed from the list, it will be put into the group "jobs_with_fresh_backups". Full backups are scheduled in the first week of the month, so some delay there is normal.

Change 685149 merged by Dzahn:

[operations/puppet@production] httpbb: add tests and test_suite for people.wm.org

https://gerrit.wikimedia.org/r/685149