cloudcontrol: review connectivity with backup system
Closed, Resolved, Public

Description

Following the network rework in the parent task, cloudcontrols aren't getting backups:

04-Jun 02:08 backup1001.eqiad.wmnet JobId 513205: Fatal error: bsockcore.c:208 Unable to connect to Client: cloudcontrol2004-dev.codfw.wmnet-fd on cloudcontrol2004-dev.codfw.wmnet:9102. ERR=Interrupted system call
04-Jun 02:08 backup1001.eqiad.wmnet JobId 513205: Fatal error: No Job status returned from FD.
04-Jun 02:08 backup1001.eqiad.wmnet JobId 513205: Error: Bacula backup1001.eqiad.wmnet 9.6.7 (10Dec20)

As a result, the hosts have been removed from the backup system monitoring for now; see https://gerrit.wikimedia.org/r/c/operations/puppet/+/927119 for an example.

There are likely firewalling / routing considerations here.

Event Timeline

aborrero created this task.
aborrero added a project: User-aborrero.

@aborrero: I would start with some sanity checks: confirm that it is not just the puppet profile being absent or the systemd daemon failing before looking at the network config (those have been common causes and are very easy to check).
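
One way to tell those cases apart is a quick TCP probe of the file daemon port, run from the backup director side. The following is only a minimal sketch: `probe_tcp` is a hypothetical helper, not part of Bacula or of any tooling mentioned in this task, and the reading of each result is a heuristic. A refusal usually means the host is reachable but nothing is listening (daemon down or profile absent), while a timeout usually means packets are being dropped along the way (firewall or routing). A firewall that rejects instead of dropping would also show up as a refusal.

```python
import socket


def probe_tcp(host: str, port: int = 9102, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt to host:port.

    Hypothetical helper for illustration only.
    Returns one of: "open", "refused", "timeout", "unresolved", "error".
    """
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # Name resolution failed, so no connection was attempted.
        return "unresolved"
    except ConnectionRefusedError:
        # The connection was actively rejected: likely no listener on the port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: packets are probably being dropped.
        return "timeout"
    except OSError:
        # Anything else (no route to host, network unreachable, ...).
        return "error"
```

For example, `probe_tcp("cloudcontrol2004-dev.codfw.wmnet")` run on the backup host would exercise the same direction and port (9102) as the failing job in the description.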

Change 931576 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] cr-labs: Permit bacula backup traffic

https://gerrit.wikimedia.org/r/931576

Change 931576 merged by Arturo Borrero Gonzalez:

[operations/homer/public@master] cr-labs: Permit bacula backup traffic

https://gerrit.wikimedia.org/r/931576

Mentioned in SAL (#wikimedia-operations) [2023-06-20T14:36:19Z] <arturo> homer run for CR eqiad/codfw to allow bacula traffic in from cloud-hosts (T338132, T339894)

It works now:

Terminated Jobs:
 JobId  Level  Files  Bytes    Status  Finished         Name
515898  Full      16  54.00 M  OK      20-Jun-23 15:34  cloudcontrol2001-dev.codfw.wmnet-Monthly-1st-Wed-productionEqiad-mysql-srv-backups
515899  Full       3  6.316 M  OK      20-Jun-23 15:35  cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap

I will revert the change that hid these hosts from the backup monitoring.

Change 931634 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] backup: Revert the addition of codfw-dev cloud hosts to the ignore list

https://gerrit.wikimedia.org/r/931634

Change 931634 merged by Jcrespo:

[operations/puppet@production] backup: Revert the addition of codfw-dev cloud hosts to the ignore list

https://gerrit.wikimedia.org/r/931634