zabe@deployment-ms-fe03:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::swift::cluster_label' (file: /etc/puppet/modules/profile/manifests/swift/proxy.pp, line: 1) on node deployment-ms-fe03.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
zabe@deployment-ms-fe03:~$
Related Objects
- Mentioned In
- T317417: Phonos links to an unauthorized URL
- Mentioned Here
- T317417: Phonos links to an unauthorized URL
Event Timeline
Change 828664 had a related patch set uploaded (by Zabe; author: Zabe):
[operations/puppet@production] Fix deployment-prep swift cluster label
Hey, I tried migrating deployment-prep to the setup from https://gerrit.wikimedia.org/r/c/operations/puppet/+/769941, but I failed. Could you help me out with my patch?
I'm a bit stuck, I'm afraid - from the error message Function lookup() did not find a value for the name 'profile::swift::cluster_label' I would expect the problem to be that you've not defined profile::swift::cluster_label correctly but your patch seems to be defining profile::swift::cluster_label: deployment-prep. I don't know why puppet isn't seeing that :(
Sorry, my patch does fix that failure, but I am now seeing the following.
zabe@deployment-ms-fe03:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for deployment-ms-fe03.deployment-prep.eqiad.wmflabs
Info: Applying configuration version '(6e232377337) root - deploy swift_ring_manager to deployment-prep'
Notice: The LDAP client stack for this host is: classic/sudoldap
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: classic/sudoldap'
Error: /Stage[main]/Swift::Ring_manager/File[/etc/swift/hosts.yaml]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/swift/deployment-prep_hosts.yaml
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 8.07 seconds
zabe@deployment-ms-fe03:~$
But I don't really know how to properly set up that deployment-prep_hosts.yaml file.
[sorry for the delay - it's been the SRE Summit this week, which has eaten all of my time]
OK, the problem is you need to tell the swift ring management machinery about your swift backends. So you need a deployment-prep_hosts.yaml which gets deployed to /etc/swift/hosts.yaml on the ring_manager host (cf modules/swift/files/*hosts.yaml in puppet) that describes your backends. This is described at
https://wikitech.wikimedia.org/wiki/Swift/Ring_Management (particularly the first section and the one on storage schemas); in theory that should be enough documentation, but if it needs improving do shout!
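For illustration, a minimal hosts.yaml along those lines might look something like this sketch - the layout follows the prod examples in modules/swift/files/*hosts.yaml as far as I understand them, the host names are the beta backends, and the device names are placeholders that have to match whatever devices are actually present on those hosts:

# Sketch only - check against the prod examples and the wikitech page above.
schemes:
  prod:
    objects: [sdc1, sdd1, sde1]   # placeholder device names
    accounts: [sda3, sdb3]        # placeholders
    containers: [sda3, sdb3]
    weight:
      objects: 4000
      accounts: &acw 100
      containers: *acw
      ssds: 300
hosts:
  prod:
    - deployment-ms-be05
    - deployment-ms-be06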
Change 836953 had a related patch set uploaded (by Samtar; author: Samtar):
[operations/puppet@production] swift: Add deployment-prep_hosts.yaml
samtar@deployment-ms-fe03:~$ sudo swift_ring_manager
/usr/bin/python: can't open file '/srv/deployment/swift_ring_manager/swift_ring_manager.py': [Errno 2] No such file or directory
samtar@deployment-ms-fe03:~$ sudo systemctl status swift_ring_manager.service
● swift_ring_manager.service
   Loaded: not-found (Reason: No such file or directory)
   Active: failed (Result: exit-code) since Thu 2022-09-08 22:10:10 UTC; 3 weeks 0 days ago
 Main PID: 24913 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Even though puppet runs successfully now given 836953's cherry-pick, it seems ms-fe03 isn't the ring manager? (apologies if I'm misunderstanding things, I'm learning about Swift as I go here...)
swift_ring_manager gets deployed to the ring_manager host, which is typically set in the cluster data in swift_clusters in hieradata (cf hieradata/common.yaml for prod). If that's not set for beta yet, then it'll need doing so that the host gets s_r_m deployed there.
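For beta that would mean something along these lines in hiera (a sketch only - the ring_manager key name and the shape of the cluster entry should be checked against the prod swift_clusters block in hieradata/common.yaml, and the host name here is the one picked later in this task):

# Hypothetical hiera sketch; key names need verifying against prod.
swift_clusters:
  deployment-prep:
    ring_manager: deployment-ms-fe03.deployment-prep.eqiad.wmflabs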
Thank you! :) Have now added that to cloud/instance-puppet, ran puppet on deployment-ms-fe03 and confirmed swift_ring_manager installs & is available.
On running swift_ring_manager (or restarting the service), an ECONNREFUSED error occurs:
-> http://172.16.7.115:6000/recon/replication/object: <urlopen error [Errno 111] ECONNREFUSED>
'/usr/bin/swift-dispersion-report -j -P standard' returned 1.
stdout:
Using storage policy: standard
stderr:
Traceback (most recent call last):
  File "/usr/bin/swift-dispersion-report", line 392, in <module>
    insecure=insecure)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 636, in get_auth
    timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 495, in get_auth_1_0
    conn.request(method, parsed.path, '', headers)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 437, in request
    files=files, **self.requests_args)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 420, in _request
    return self.request_session.request(*arg, **kwarg)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /auth/v1.0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f0900dcf410>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
Traceback (most recent call last):
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 1083, in <module>
    main()
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 1039, in main
    disp_ok = check_all_dispersions()
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 152, in check_all_dispersions
    if not check_dispersion(p.name):
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 160, in check_dispersion
    "-j","-P",policyname])
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 581, in run_and_check
    raise subprocess.CalledProcessError(s.returncode, args[0])
subprocess.CalledProcessError: Command '/usr/bin/swift-dispersion-report' returned non-zero exit status 1
Mentioned in SAL (#wikimedia-releng) [2022-10-03T14:22:26Z] <TheresNoTime> set ring_manager host to deployment-ms-fe03 in deployment-prep's _.yaml. T316845
So that's /usr/bin/swift-dispersion-report failing; if you run it by hand (as root) on that node, does it work? If not, I'd hazard a guess at firewalling or similar (or missing credentials).
Progress - running sudo /usr/bin/swift-dispersion-report gives:
samtar@deployment-ms-fe03:~$ sudo /usr/bin/swift-dispersion-report
Using storage policy: standard
Traceback (most recent call last):
  File "/usr/bin/swift-dispersion-report", line 392, in <module>
    insecure=insecure)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 636, in get_auth
    timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 495, in get_auth_1_0
    conn.request(method, parsed.path, '', headers)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 437, in request
    files=files, **self.requests_args)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 420, in _request
    return self.request_session.request(*arg, **kwarg)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /auth/v1.0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f6c1da60410>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
Checking /etc/swift/dispersion.conf shows what appear to be default credentials. Replacing them with the swift:dispersion credentials seems to progress a little further:
samtar@deployment-ms-fe03:~$ sudo /usr/bin/swift-dispersion-report
Using storage policy: standard
Queried 656 containers for dispersion reporting, 1s, 0 retries
100.00% of container copies found (1312 of 1312)
Sample represents 1.00% of the container partition space
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
ERROR: 172.16.7.115:6000/lv-a1: [Errno 111] ECONNREFUSED
[...]
OK, so it's worth checking that that host has something listening on that port (probably swift :) ); if it does, there are probably firewall rules in the way - in prod, puppet manages this all for you (I think by opening the relevant ports on the storage hosts and frontends)...
I apologise for all the back and forth @MatthewVernon — on 172.16.7.115 (deployment-ms-be06), swift appears to be running on 6001/6002
Restarted the swift-object service, now running on port 6000
Got to be almost there now... running swift_ring_manager -v:
samtar@deployment-ms-fe03:~$ sudo swift_ring_manager -v
Checking dispersion for policy standard
Checking dispersion for policy lowlatency
Traceback (most recent call last):
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 1083, in <module>
    main()
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 1080, in main
    immediate_only,args.ring_dir,True)
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 919, in compare_states
    immediate_only,swiftdir,dry_run)
  File "/srv/deployment/swift_ring_manager/swift_ring_manager.py", line 829, in compare_weights
    dw = desired[device].weight
KeyError: 'lv-a1'
So, the device(?) (lv-a1, is this a volume/partition?) doesn't exist?
Is this because I copied the example in https://wikitech.wikimedia.org/wiki/Swift/Ring_Management#Storage_Schemes without modifying the ssd(?) values?
I think this is likely a mismatch between the devices present on the host and the devices declared in the hosts.yaml file - the code can add or remove hosts, but not add devices to or remove them from an existing host (because working out the port mapping and so on is too hard), so I think the failure is that there is a device present in the ring but absent from the state you're configuring in hosts.yaml.
Devices are configured in the ring by name (e.g. /dev/sda3 appears in the ring as sda3), so I guess you have /dev/lv-a1 on the host but don't have lv-a1 in the storage schema?
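Put another way (a sketch using the /dev/sda3 example above; the scheme keys are assumed to follow the hosts.yaml layout discussed earlier):

# The block device /dev/sda3 appears in the ring - and so must be listed in
# the storage scheme - under its bare name, not its /dev path:
schemes:
  prod:
    objects: [sda3]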
This sounds right — we don't have lv-a1 in the storage schema. I'm guessing even more at this point, but given
samtar@deployment-ms-be05:~$ df -h
Filesystem             Size  Used Avail Use% Mounted on
udev                   7.9G     0  7.9G   0% /dev
tmpfs                  1.6G  168M  1.5G  11% /run
/dev/vda3               19G  5.3G   13G  30% /
tmpfs                  7.9G  4.0K  7.9G   1% /dev/shm
tmpfs                  5.0M     0  5.0M   0% /run/lock
tmpfs                  7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/vd-lv--a1  113G   95G   18G  85% /srv/swift-storage/lv-a1
I could replace objects: [sdc1, sdd1, sde1] with objects: [lv-a1] in my patch?
That sounds right, but it'd be worth eyeballing what's actually in the ring - if you do
sudo swift-ring-builder /etc/swift/object.builder | less
and check that the name column says lv-a1 then that should confirm you're right here.
Looks good:
Devices:   id  region  zone    ip address:port  replication ip:port   name   weight  partitions  balance  flags  meta
            2       1     3  172.16.7.114:6000    172.16.7.114:6000  lv-a1  1000.00       65536     0.00
            3       1     4  172.16.7.115:6000    172.16.7.115:6000  lv-a1  1000.00       65536     0.00
I'll make that change and see what happens..
Mentioned in SAL (#wikimedia-releng) [2022-10-10T12:04:36Z] <TheresNoTime> cherry 836953 picking for T316845 to deployment-prep/Swift
Does this seem promising?
samtar@deployment-ms-fe03:~$ sudo swift_ring_manager -v
Checking dispersion for policy standard
Checking dispersion for policy lowlatency
Would set weight deployment-ms-be06/lv-a1 in account to 1004.0
Would set weight deployment-ms-be06/lv-a1 in object to 1160.0
Would set weight deployment-ms-be06/lv-a1 in container to 1004.0
Would set weight deployment-ms-be06/lv-a1 in object-1 to 1012.0
Would set weight deployment-ms-be05/lv-a1 in account to 1004.0
Would set weight deployment-ms-be05/lv-a1 in object to 1160.0
Would set weight deployment-ms-be05/lv-a1 in container to 1004.0
Would set weight deployment-ms-be05/lv-a1 in object-1 to 1012.0
Would add 0 host(s), remove 0, change 8 weights
I had to wrangle the config a little to:
schemes:
  prod:
    objects: [lv-a1]
    accounts: [lv-a1]
    containers: [lv-a1]
    object-1: [lv-a1]
    weight:
      objects: 4000
      accounts: &acw 100
      containers: *acw
      ssds: 300
hosts:
  prod:
    - deployment-ms-be05
    - deployment-ms-be06
Hi, that looks like the devices are correct, but the weights are different - I think you should set all the weights in the scheme to 1000 (which I infer is what they are at the moment) - in any case you need the weights for a particular device to be uniform.
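Concretely, against the snippet quoted above, that would look roughly like this (a sketch - only the uniform 1000 values come from this suggestion, the rest is copied from the earlier config, and whether ssds wants the same value is an assumption):

schemes:
  prod:
    objects: [lv-a1]
    accounts: [lv-a1]
    containers: [lv-a1]
    object-1: [lv-a1]
    weight:
      objects: 1000        # uniform, matching the 1000.00 already in the ring
      accounts: &acw 1000
      containers: *acw
      ssds: 1000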
Done, everything appears to be running successfully now — thank you so much for your help @MatthewVernon!
If I may trouble you with one, hopefully last, question potentially relating to Swift (T317417) — what would be the cause of a file (https://upload.wikimedia.beta.wmflabs.org/phonos/0/h/0hp7eif2wwbuhif94n42bzm95o71z9i.mp3) correctly saved in Swift (i.e. I can wget it from the swift frontend server) returning "Unauthorized. This server could not verify that you are authorized to access the document you requested." when directly accessed?
That's a good question, and I'm honestly not sure (not helped by not being able to log in to deployment-ms-fe03.deployment-prep.eqiad.wmflabs); if you can access it from the frontend, though, that makes me think the refusal is coming from further up the stack rather than from swift itself?
Mentioned in SAL (#wikimedia-releng) [2022-10-11T10:53:57Z] <TheresNoTime> add MVernon to deployment-prep, T316845#8307183
I have no idea.. I'll keep looking (and for what it's worth, I've just added you to deployment-prep, so if you did want to log in and take a look........)
So I had a quick look in the logs (/var/log/swift/proxy-access.log), and I see this:
Oct 11 10:35:13 deployment-ms-fe03 proxy-server: 81.2.100.161 172.16.1.160 11/Oct/2022/10/35/13 GET /v1/AUTH_mw/global-data-phonos-render./0/h/0hp7eif2wwbuhif94n42bzm95o71z9i.mp3 HTTP/1.0 401 - Mozilla/5.0%20%28X11%3B%20Linux%20x86_64%3B%20rv:91.0%29%20Gecko/20100101%20Firefox/91.0 - - 131 - txbda2d11a9adf492a8b460-00634546e1 - 0.0192 - - 1665484513.898401976 1665484513.917581081 0
I can't help but notice the stray . in the middle of that URL (/global-data-phonos-render./0). But that is a bit of a wild guess...
This issue is resolved (but the patches are not yet properly merged); other discussion is at T317417
@Zabe you should be good to un-WIP https://gerrit.wikimedia.org/r/c/operations/puppet/+/828664 now?
Change 828664 merged by MVernon:
[operations/puppet@production] deploy swift_ring_manager to deployment-prep
Change 836953 merged by MVernon:
[operations/puppet@production] swift: Add deployment-prep_hosts.yaml