
'swift' user/group IDs should be consistent across the fleet
Open, Stalled, Medium, Public

Description

At the moment the swift user and group IDs are not fixed in puppet, which means the assigned IDs depend on package installation order. This in turn makes it clunky to reimage/reinstall a host while keeping the data disks intact: after a reinstall, the uid/gid in passwd are not guaranteed to match what's on the data disk filesystems.

The plan is thus to first fix the swift user/group uid/gid before puppet runs, then once the fleet is all at the same uid/gid we can let the admin module create the user/group as needed.

The UID/GID reserved for swift is 902 (previously 130), see also https://gerrit.wikimedia.org/r/c/operations/puppet/+/575217.
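For what it's worth, the passwd-vs-disk mismatch described above can be checked per host with something like the following. This is only a sketch: `ownership_matches` is a made-up helper name, it assumes GNU `stat` (as on Debian), and the path to pass in is whichever directory is expected to be owned by swift.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not part of puppet): compare an account's
# numeric uid/gid with the numeric owner recorded on a path.
# Usage: ownership_matches USER PATH
# Exit status: 0 = match, 1 = mismatch, 2 = lookup failed.
ownership_matches() {
    user=$1
    path=$2
    # Numeric uid:gid as resolved from passwd/group.
    account_ids="$(id -u "$user"):$(id -g "$user")" || return 2
    # Numeric uid:gid stored in the filesystem (GNU stat syntax).
    disk_ids=$(stat -c '%u:%g' "$path") || return 2
    [ "$account_ids" = "$disk_ids" ]
}
```

For example, `ownership_matches swift /var/cache/swift` exits 0 when the IDs line up and 1 when a reinstall has left them out of sync.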

UID status as of Dec 2020.

===== NODE GROUP =====
(6) ms-be[2016-2021].codfw.wmnet
----- OUTPUT of 'id swift' -----
uid=111(swift) gid=116(swift) groups=116(swift)
===== NODE GROUP =====
(9) ms-be[2057-2061].codfw.wmnet,ms-be[1060-1063].eqiad.wmnet
----- OUTPUT of 'id swift' -----
uid=902(swift) gid=902(swift) groups=902(swift)
===== NODE GROUP =====
(80) ms-be[2022-2056].codfw.wmnet,ms-be[1022-1026,1028-1059].eqiad.wmnet,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet
----- OUTPUT of 'id swift' -----
uid=130(swift) gid=130(swift) groups=130(swift)
===== NODE GROUP =====
(3) ms-be[1019-1021].eqiad.wmnet
----- OUTPUT of 'id swift' -----
uid=112(swift) gid=117(swift) groups=117(swift)

Event Timeline

fgiunchedi raised the priority of this task from to Medium.
fgiunchedi updated the task description.
fgiunchedi added projects: SRE, SRE-swift-storage.
fgiunchedi subscribed.

I've started provisioning the swift user before puppet runs on the new swift hardware, since a few packages might already be installed (either post-provisioning or after puppet runs) that will claim user/group IDs. I've set swift to be 130 for both user and group.

This is also doable post-puppet, but before machines are in service (i.e. before many files are owned by swift):

swift-init all stop
userdel swift
groupdel swift
groupadd -g 902 --system swift
useradd -g 902 -u 902 --system --home-dir /var/lib/swift --shell /bin/false swift
chown -R swift:swift /var/cache/swift
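On a host that has already been in service, more than /var/cache/swift may carry the old numeric IDs. A rough sketch of re-owning a whole tree by numeric ID follows; `reown_tree` is a hypothetical helper, it assumes GNU find (`-uid`/`-gid`), and which trees to run it on is left to the operator.

```shell
#!/bin/sh
# Sketch (hypothetical helper): re-own everything under ROOT that still
# carries the old numeric IDs. Matching is numeric, so it keeps working
# after the old user/group entries have been deleted.
# Usage: reown_tree ROOT OLD_UID OLD_GID NEW_UID NEW_GID
reown_tree() {
    root=$1
    old_uid=$2
    old_gid=$3
    new_uid=$4
    new_gid=$5
    # -xdev keeps find on a single filesystem; -h changes symlinks
    # themselves instead of following them.
    find "$root" -xdev -uid "$old_uid" -exec chown -h "$new_uid" {} + || return 1
    find "$root" -xdev -gid "$old_gid" -exec chgrp -h "$new_gid" {} +
}
```

With the IDs from this task that would be e.g. `reown_tree /var/cache/swift 130 130 902 902`, run as root while swift is stopped.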

Change 297242 had a related patch set uploaded (by Filippo Giunchedi):
install_server: pre-provision swift uid/gid

https://gerrit.wikimedia.org/r/297242

Change 297242 merged by Filippo Giunchedi:
install_server: pre-provision swift uid/gid

https://gerrit.wikimedia.org/r/297242

@fgiunchedi: Hi, the patch in Gerrit has been merged. Can this task be resolved (via Add Action...Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!

fgiunchedi changed the task status from Open to Stalled. Feb 19 2020, 4:20 PM

> @fgiunchedi: Hi, the patch in Gerrit has been merged. Can this task be resolved (via Add Action...Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!

Thanks for the heads up, I'm stalling the task since it'll likely be resolvable once we've decom'd all the old swift backends that still use old IDs

Change 575217 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: use fleetwide uid/gid

https://gerrit.wikimedia.org/r/575217

Change 575217 merged by Filippo Giunchedi:
[operations/puppet@production] swift: use fleetwide uid/gid

https://gerrit.wikimedia.org/r/575217

Change 599693 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: add thanos-fe/thanos-be to late_command swift uid preprovision

https://gerrit.wikimedia.org/r/599693

Change 599693 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: add thanos-fe/thanos-be to late_command swift uid preprovision

https://gerrit.wikimedia.org/r/599693

Mentioned in SAL (#wikimedia-operations) [2020-05-29T08:30:26Z] <godog> update swift uid/gid on thanos hosts - T123918

It occurred to me that as part of uid/gid preprovision we should detect if swift was previously on the host (i.e. there are labeled filesystems) and reuse _that_ uid/gid instead to have fully hands off reimages.
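As a rough illustration of that idea, the preprovision step could look at the owner of a directory on an already-populated data filesystem and fall back to the reserved ID otherwise. Everything here is an assumption: `detect_swift_ids` is a made-up name, the caller has to find and mount the labeled filesystem and pick the directory, and treating a root-owned directory as "no information" is just one possible heuristic. GNU `stat` is assumed.

```shell
#!/bin/sh
# Sketch (hypothetical helper): choose the uid:gid to provision for swift.
# Usage: detect_swift_ids DIR [DEFAULT_ID]
# Prints the numeric "uid:gid" owning DIR if DIR exists and is not owned
# by root, otherwise DEFAULT_ID:DEFAULT_ID (902 unless given).
detect_swift_ids() {
    dir=$1
    default_id=${2:-902}
    if [ -d "$dir" ]; then
        ids=$(stat -c '%u:%g' "$dir") || return 1
        # A root-owned directory (e.g. a bare mount point) says nothing
        # about which IDs swift used before.
        if [ "${ids%%:*}" != 0 ]; then
            printf '%s\n' "$ids"
            return 0
        fi
    fi
    printf '%s:%s\n' "$default_id" "$default_id"
}
```

The output can then be split on `:` and fed to `groupadd -g` / `useradd -u` in place of the hardcoded value.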

Aklapper changed the task status from Stalled to Open. Mar 22 2022, 1:09 AM

> I'm stalling the task since it'll likely be resolvable once we've decom'd all the old swift backends that still use old IDs

That seems to have happened two months ago; resetting task status.

@Aklapper I don't think that's right:

mvernon@cumin1001:~$ sudo cumin O:swift::storage 'id swift'
#[...]
===== NODE GROUP =====
(59) ms-be[2028-2044,2046-2056].codfw.wmnet,ms-be[1028-1033,1035-1059].eqiad.wmnet
----- OUTPUT of 'id swift' -----
uid=130(swift) gid=130(swift) groups=130(swift)
===== NODE GROUP =====
(26) ms-be[2045,2057-2069].codfw.wmnet,ms-be[1060-1071].eqiad.wmnet
----- OUTPUT of 'id swift' -----
uid=902(swift) gid=902(swift) groups=902(swift)
================

Eh, thanks (and sorry). In that case, this task should depend on whatever task is about decommissioning all the old swift backends that still use old IDs.

I think the newest host with the old ID is ms-be2056, which arrived on 2019-09-18, so we won't be decommissioning the last of these nodes until late 2024. I think it makes sense to leave this stalled in the meantime?

Aklapper changed the task status from Open to Stalled. Mar 22 2022, 9:40 AM

Let's do that