deployment-ms-fe03 puppet failure
Open, Needs Triage, Public

Description

zabe@deployment-ms-fe03:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::swift::cluster_label' (file: /etc/puppet/modules/profile/manifests/swift/proxy.pp, line: 1) on node deployment-ms-fe03.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
zabe@deployment-ms-fe03:~$

Event Timeline

Change 828664 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/puppet@production] Fix deployment-prep swift cluster label

https://gerrit.wikimedia.org/r/828664

@MatthewVernon

Hey, I tried migrating deployment-prep to https://gerrit.wikimedia.org/r/c/operations/puppet/+/769941, but I failed. Could you help me out with my patch?

I'm a bit stuck, I'm afraid. From the error message "Function lookup() did not find a value for the name 'profile::swift::cluster_label'", I would expect the problem to be that you've not defined profile::swift::cluster_label correctly, but your patch seems to be defining profile::swift::cluster_label: deployment-prep. I don't know why puppet isn't seeing that :(

Sorry, my patch does fix that failure, but I am now seeing the following.

zabe@deployment-ms-fe03:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for deployment-ms-fe03.deployment-prep.eqiad.wmflabs
Info: Applying configuration version '(6e232377337) root - deploy swift_ring_manager to deployment-prep'
Notice: The LDAP client stack for this host is: classic/sudoldap
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: classic/sudoldap'
Error: /Stage[main]/Swift::Ring_manager/File[/etc/swift/hosts.yaml]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/swift/deployment-prep_hosts.yaml
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 8.07 seconds
zabe@deployment-ms-fe03:~$

But I don't really know how to properly set up that deployment-prep_hosts.yaml file.

[sorry for the delay - it's been the SRE Summit this week, which has eaten all of my time]

OK, the problem is you need to tell the swift ring management machinery about your swift backends. So you need a deployment-prep_hosts.yaml which gets deployed to /etc/swift/hosts.yaml on the ring_manager host (cf modules/swift/files/*hosts.yaml in puppet) that describes your backends. This is described at
https://wikitech.wikimedia.org/wiki/Swift/Ring_Management (particularly the first section and the one on storage schemas); in theory that should be enough documentation, but if it needs improving do shout!

Change 836953 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/puppet@production] swift: Add deployment-prep_hosts.yaml

https://gerrit.wikimedia.org/r/836953

Cherry-picked 836953 to deployment-prep and puppet now runs successfully

samtar@deployment-ms-fe03:~$ sudo swift_ring_manager
/usr/bin/python: can't open file '/srv/deployment/swift_ring_manager/swift_ring_manager.py': [Errno 2] No such file or directory

samtar@deployment-ms-fe03:~$ sudo systemctl status swift_ring_manager.service
● swift_ring_manager.service
   Loaded: not-found (Reason: No such file or directory)
   Active: failed (Result: exit-code) since Thu 2022-09-08 22:10:10 UTC; 3 weeks 0 days ago
 Main PID: 24913 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Even though puppet runs successfully now given 836953's cherry-pick, it seems ms-fe03 isn't the ring manager? (apologies if I'm misunderstanding things, I'm learning about Swift as I go here...)

swift_ring_manager gets deployed to the ring_manager host, which is typically set in the cluster data in swift_clusters in hieradata (cf hieradata/common.yaml for prod). If that's not set for beta yet, then it'll need setting so that s_r_m gets deployed to that host.

Thank you! :) Have now added that to cloud/instance-puppet, ran puppet on deployment-ms-fe03 and confirmed swift_ring_manager installs & is available.

On running swift_ring_manager (or restarting the service), an ECONNREFUSED error occurs:

-> http://172.16.7.115:6000/recon/replication/object: <urlopen error [Errno 111] ECONNREFUSED>
'/usr/bin/swift-dispersion-report -j -P standard' returned 1.
stdout: Using storage policy: standard

stderr: Traceback (most recent call last):
  File "/usr/bin/swift-dispersion-report", line 392, in <module>
    insecure=insecure)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 636, in get_auth
    timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 495, in get_auth_1_0
    conn.request(method, parsed.path, '', headers)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 437, in request
    files=files, **self.requests_args)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 420, in _request
    return self.request_session.request(*arg, **kwarg)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /auth/v1.0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f0900dcf410>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

Traceback (most recent call last):
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 1083, in <module>
    main()
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 1039, in main
    disp_ok = check_all_dispersions()
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 152, in check_all_dispersions
    if not check_dispersion(p.name):
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 160, in check_dispersion
    "-j","-P",policyname])
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 581, in run_and_check
    raise subprocess.CalledProcessError(s.returncode, args[0])
subprocess.CalledProcessError: Command '/usr/bin/swift-dispersion-report' returned non-zero exit status 1

Mentioned in SAL (#wikimedia-releng) [2022-10-03T14:22:26Z] <TheresNoTime> set ring_manager host to deployment-ms-fe03 in deployment-prep's _.yaml. T316845

So that's /usr/bin/swift-dispersion-report failing; if you run it by hand (as root) on that node, does it work? If not, I'd hazard a guess at firewalling or similar (or missing credentials).

Progress - running sudo /usr/bin/swift-dispersion-report gives:

samtar@deployment-ms-fe03:~$ sudo /usr/bin/swift-dispersion-report
Using storage policy: standard
Traceback (most recent call last):
  File "/usr/bin/swift-dispersion-report", line 392, in <module>
    insecure=insecure)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 636, in get_auth
    timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 495, in get_auth_1_0
    conn.request(method, parsed.path, '', headers)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 437, in request
    files=files, **self.requests_args)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 420, in _request
    return self.request_session.request(*arg, **kwarg)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /auth/v1.0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f6c1da60410>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

Checking /etc/swift/dispersion.conf shows what appear to be default credentials. Replacing them with the swift:dispersion credentials seems to progress a little further, to:

samtar@deployment-ms-fe03:~$ sudo /usr/bin/swift-dispersion-report
Using storage policy: standard
Queried 656 containers for dispersion reporting, 1s, 0 retries
100.00% of container copies found (1312 of 1312)
Sample represents 1.00% of the container partition space
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
[...]

OK, so it's worth checking that that host has something listening on that port (probably swift :) ); if so, there are probably firewall rules in the way. On prod, puppet manages this all for you (I think by opening the relevant ports on the storage hosts and frontends)...
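The "is anything listening on that port" check can be sketched in a few lines of Python. This is only an illustration, not part of swift or swift_ring_manager; the helper name port_is_open is made up here, and the address and port are the ones from the ECONNREFUSED lines above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration. A False result means the
    connection was refused, timed out, or was otherwise unreachable;
    it does not say whether the cause is a stopped service or a
    firewall rule.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Run from the frontend, port_is_open("172.16.7.115", 6000) would tell you whether the object server port is reachable from there. Running the same check on the backend itself against its own address helps separate "nothing is listening" from "something in between is blocking it".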

I apologise for all the back and forth @MatthewVernon — on 172.16.7.115 (deployment-ms-be06), swift appears to be running on 6001/6002


Restarted the swift-object service, now running on port 6000

Got to be almost there now... running swift_ring_manager -v:

samtar@deployment-ms-fe03:~$ sudo swift_ring_manager -v
Checking dispersion for policy standard
Checking dispersion for policy lowlatency
Traceback (most recent call last):
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 1083, in <module>
    main()
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 1080, in main
    immediate_only,args.ring_dir,True)
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 919, in compare_states
    immediate_only,swiftdir,dry_run)
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 829, in compare_weights
    dw = desired[device].weight
KeyError: 'lv-a1'

So, the device(?) (lv-a1, is this a volume/partition?) doesn't exist?


Is this because I copied the example in https://wikitech.wikimedia.org/wiki/Swift/Ring_Management#Storage_Schemes without modifying the ssd(?) values?

I think this is likely a mismatch between the devices present on the host and the devices declared in the hosts.yaml file - the code can add or remove hosts, but not add devices to or remove them from an existing host (because working out the port mapping and so on is too hard), so I think the failure is that there is a device present in the ring but absent from the state you're configuring in hosts.yaml.

Devices are configured in the ring by name (e.g. /dev/sda3 appears in the ring as sda3), so I guess you have /dev/lv-a1 on the host but don't have lv-a1 in the storage schema?
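To make the suspected mismatch concrete, here is a minimal Python sketch that compares the device names present in the ring with those declared in the storage schema. It is not taken from swift_ring_manager; the function name device_mismatch and its inputs are assumptions for illustration.

```python
def device_mismatch(ring_devices, schema_devices):
    """Compare device names in the ring with those in the storage schema.

    Hypothetical helper: both arguments are plain iterables of device
    names such as "lv-a1". Returns the names found on only one side.
    """
    ring = set(ring_devices)
    schema = set(schema_devices)
    return {
        # Present in the ring but not declared: a lookup of the desired
        # state by device name would fail for these.
        "in_ring_only": sorted(ring - schema),
        # Declared but not present in the ring.
        "in_schema_only": sorted(schema - ring),
    }
```

Any name reported under in_ring_only is a candidate for the KeyError above, since the traceback shows the desired state being indexed by device name.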

This sounds right: we don't have lv-a1 in the storage schema. I'm guessing even more at this point, but given

samtar@deployment-ms-be05:~$ df -h
Filesystem             Size  Used Avail Use% Mounted on
udev                   7.9G     0  7.9G   0% /dev
tmpfs                  1.6G  168M  1.5G  11% /run
/dev/vda3               19G  5.3G   13G  30% /
tmpfs                  7.9G  4.0K  7.9G   1% /dev/shm
tmpfs                  5.0M     0  5.0M   0% /run/lock
tmpfs                  7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/vd-lv--a1  113G   95G   18G  85% /srv/swift-storage/lv-a1

I could replace objects: [sdc1, sdd1, sde1] with objects: [lv-a1] in my patch?

That sounds right, but it'd be worth eyeballing what's actually in the ring - if you do

sudo swift-ring-builder /etc/swift/object.builder | less

and check that the name column says lv-a1 then that should confirm you're right here.

Looks good:

Devices:   id region zone   ip address:port replication ip:port  name weight partitions balance flags meta
            2      1    3 172.16.7.114:6000   172.16.7.114:6000 lv-a1 1000.00      65536    0.00
            3      1    4 172.16.7.115:6000   172.16.7.115:6000 lv-a1 1000.00      65536    0.00
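If eyeballing the table gets tedious, the name column can be pulled out with a short Python sketch. It assumes the column layout shown above (id, region, zone, ip:port, replication ip:port, name, weight, ...), which may differ between Swift versions; device_names is a made-up helper, not part of swift-ring-builder.

```python
def device_names(builder_output):
    """Extract (ip:port, device name) pairs from swift-ring-builder output.

    Assumes device rows look like the ones pasted above: whitespace
    separated, starting with a numeric id, with the name in the sixth
    column. Other layouts would need a different parser.
    """
    rows = []
    in_table = False
    for line in builder_output.splitlines():
        if line.startswith("Devices:"):
            # Header row; device rows follow it.
            in_table = True
            continue
        if not in_table:
            continue
        fields = line.split()
        if len(fields) >= 7 and fields[0].isdigit():
            rows.append((fields[3], fields[5]))
    return rows
```

Feeding it the text printed by sudo swift-ring-builder /etc/swift/object.builder gives one pair per device, which can then be compared against the names in hosts.yaml.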

I'll make that change and see what happens..

Mentioned in SAL (#wikimedia-releng) [2022-10-10T12:04:36Z] <TheresNoTime> cherry 836953 picking for T316845 to deployment-prep/Swift

Does this seem promising?

samtar@deployment-ms-fe03:~$ sudo swift_ring_manager -v
Checking dispersion for policy standard
Checking dispersion for policy lowlatency
Would set weight deployment-ms-be06/lv-a1 in account to 1004.0
Would set weight deployment-ms-be06/lv-a1 in object to 1160.0
Would set weight deployment-ms-be06/lv-a1 in container to 1004.0
Would set weight deployment-ms-be06/lv-a1 in object-1 to 1012.0
Would set weight deployment-ms-be05/lv-a1 in account to 1004.0
Would set weight deployment-ms-be05/lv-a1 in object to 1160.0
Would set weight deployment-ms-be05/lv-a1 in container to 1004.0
Would set weight deployment-ms-be05/lv-a1 in object-1 to 1012.0
Would add 0 host(s), remove 0, change 8 weights

had to wrangle the config a little to:

schemes:
  prod:
    objects: [lv-a1]
    accounts: [lv-a1]
    containers: [lv-a1]
    object-1: [lv-a1]
    weight:
      objects: 4000
      accounts: &acw 100
      containers: *acw
      ssds: 300
hosts:
  prod:
    - deployment-ms-be05
    - deployment-ms-be06

Hi, that looks like the devices are correct, but the weights are different - I think you should set all the weights in the scheme to 1000 (which I infer is what they are at the moment) - in any case you need the weights for a particular device to be uniform.
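As a rough way to spot non-uniform weights, the dry-run output can be grouped by device in Python. This is a sketch only: it assumes lines of the form "Would set weight HOST/DEVICE in RING to WEIGHT" as pasted above, and the names weights_by_device and non_uniform are invented for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Would set weight deployment-ms-be06/lv-a1 in object to 1160.0
WEIGHT_LINE = re.compile(r"Would set weight (\S+)/(\S+) in (\S+) to ([\d.]+)")


def weights_by_device(dry_run_output):
    """Map (host, device) to a dict of ring name -> proposed weight."""
    weights = defaultdict(dict)
    for match in WEIGHT_LINE.finditer(dry_run_output):
        host, device, ring, weight = match.groups()
        weights[(host, device)][ring] = float(weight)
    return dict(weights)


def non_uniform(weights):
    """Return only the devices whose proposed weights differ between rings."""
    return {
        key: rings
        for key, rings in weights.items()
        if len(set(rings.values())) > 1
    }
```

Applied to the output above, both lv-a1 devices would be reported, because the proposed weights differ between the account, object, container, and object-1 rings. An empty result would mean each device has a single weight across all rings.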

Done, everything appears to be running successfully now — thank you so much for your help @MatthewVernon!

If I may trouble you with one, hopefully last, question potentially relating to Swift (T317417) — what would be the cause of a file (https://upload.wikimedia.beta.wmflabs.org/phonos/0/h/0hp7eif2wwbuhif94n42bzm95o71z9i.mp3) correctly saved in Swift (i.e. I can wget it from the swift frontend server) returning "Unauthorized. This server could not verify that you are authorized to access the document you requested." when directly accessed?

That's a good question, and I'm honestly not sure (not helped by not being able to log in to deployment-ms-fe03.deployment-prep.eqiad.wmflabs); if you can access it from the frontend, though, that makes me think the refusal is coming from further up the stack rather than from swift itself?

Mentioned in SAL (#wikimedia-releng) [2022-10-11T10:53:57Z] <TheresNoTime> add MVernon to deployment-prep, T316845#8307183

I have no idea.. I'll keep looking (and for what it's worth, I've just added you to deployment-prep, so if you did want to log in and take a look........)

So I had a quick look in the logs (/var/log/swift/proxy-access.log), and I see this:

Oct 11 10:35:13 deployment-ms-fe03 proxy-server: 81.2.100.161 172.16.1.160 11/Oct/2022/10/35/13 GET /v1/AUTH_mw/global-data-phonos-render./0/h/0hp7eif2wwbuhif94n42bzm95o71z9i.mp3 HTTP/1.0 401 - Mozilla/5.0%20%28X11%3B%20Linux%20x86_64%3B%20rv:91.0%29%20Gecko/20100101%20Firefox/91.0 - - 131 - txbda2d11a9adf492a8b460-00634546e1 - 0.0192 - - 1665484513.898401976 1665484513.917581081 0

I can't help but notice the stray . in the middle of that URL (/global-data-phonos-render./0). But that is a bit of a wild guess...

This issue resolved (but patches not yet merged proper), other discussion at T317417

@Zabe you should be good to un-WIP https://gerrit.wikimedia.org/r/c/operations/puppet/+/828664 now?

Done

Change 828664 merged by MVernon:

[operations/puppet@production] deploy swift_ring_manager to deployment-prep

https://gerrit.wikimedia.org/r/828664

Change 836953 merged by MVernon:

[operations/puppet@production] swift: Add deployment-prep_hosts.yaml

https://gerrit.wikimedia.org/r/836953