toolsbeta.automated-toolforge-tests membership causes "groups: cannot find name for group ID 54872" error message
Open, Stalled, Lowest | Public | BUG REPORT

Description

Running groups $USER for any user in the toolsbeta.automated-toolforge-tests service group prints "groups: cannot find name for group ID 54872" and exits with code 1. toolsbeta.test3 (and the other toolsbeta service groups I am in) do not trigger this error. The LDAP entries for both groups, for comparison:

$ ldap '(&(cn=toolsbeta.test3)(objectClass=groupOfNames))' + \*
dn: cn=toolsbeta.test3,ou=servicegroups,dc=wikimedia,dc=org
gidNumber: 54309
cn: toolsbeta.test3
member: uid=bd808,ou=people,dc=wikimedia,dc=org
member: uid=bstorm,ou=people,dc=wikimedia,dc=org
objectClass: posixGroup
objectClass: groupOfNames
structuralObjectClass: groupOfNames
entryUUID: 03efe61c-f387-1039-949f-1fdeebad5742
creatorsName: uid=novaadmin,ou=people,dc=wikimedia,dc=org
createTimestamp: 20200305234435Z
entryCSN: 20200305234435.969510Z#000000#001#000000
modifiersName: uid=novaadmin,ou=people,dc=wikimedia,dc=org
modifyTimestamp: 20200305234435Z
entryDN: cn=toolsbeta.test3,ou=servicegroups,dc=wikimedia,dc=org
subschemaSubentry: cn=Subschema
hasSubordinates: FALSE

# pagedresults: cookie=

$ ldap '(&(cn=toolsbeta.automated-toolforge-tests)(objectClass=groupOfNames))' + \*
dn: cn=toolsbeta.automated-toolforge-tests,ou=servicegroups,dc=wikimedia,dc=org
cn: toolsbeta.automated-toolforge-tests
member: uid=aborrero,ou=people,dc=wikimedia,dc=org
member: uid=andrew,ou=people,dc=wikimedia,dc=org
member: uid=bd808,ou=people,dc=wikimedia,dc=org
member: uid=dcaro,ou=people,dc=wikimedia,dc=org
member: uid=mdipietro,ou=people,dc=wikimedia,dc=org
structuralObjectClass: groupOfNames
entryUUID: 1c52b37a-0d6a-103c-84f9-bd1263684bec
creatorsName: cn=admin,dc=wikimedia,dc=org
createTimestamp: 20220119115307Z
gidNumber: 54872
objectClass: groupOfNames
objectClass: posixGroup
entryCSN: 20220119121902.526753Z#000000#001#000000
modifiersName: cn=scriptuser,ou=profile,dc=wikimedia,dc=org
modifyTimestamp: 20220119121902Z
entryDN: cn=toolsbeta.automated-toolforge-tests,ou=servicegroups,dc=wikimedia,dc=org
subschemaSubentry: cn=Subschema
hasSubordinates: FALSE

# pagedresults: cookie=
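
To narrow down which group IDs fail to resolve on a given host, the check that groups appears to perform can be approximated in a few lines of Python. This is a rough sketch, not the actual coreutils logic: it assumes the error corresponds to a gid-to-name lookup failing in NSS, and the helper names (unresolvable_gids, gids_for_user) are made up for illustration.

```python
import getpass
import grp
import os
import pwd


def unresolvable_gids(gids, lookup=grp.getgrgid):
    """Return the GIDs in `gids` that `lookup` cannot map to a group entry.

    `lookup` defaults to grp.getgrgid, which raises KeyError when the
    name service (files, sssd, ...) has no entry for the GID.
    """
    missing = []
    for gid in gids:
        try:
            lookup(gid)
        except KeyError:
            missing.append(gid)
    return missing


def gids_for_user(username):
    """Return every group ID the user belongs to, primary group included."""
    primary_gid = pwd.getpwnam(username).pw_gid
    return os.getgrouplist(username, primary_gid)


if __name__ == "__main__":
    user = getpass.getuser()
    for gid in unresolvable_gids(gids_for_user(user)):
        print(f"cannot find name for group ID {gid}")
```

Run on an affected bastion, this should print only the GIDs whose reverse lookup fails, which makes it easier to compare hosts than reading the groups output by eye.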

Event Timeline

aborrero changed the task status from Open to In Progress. (Feb 15 2022, 10:49 AM)
aborrero claimed this task.
aborrero triaged this task as Low priority.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Another data point, this only happens in Debian Stretch bastions:

arturo@nostromo:~$ ssh dev.toolforge.org
Linux tools-sgebastion-08 4.19.0-0.bpo.14-amd64 #1 SMP Debian 4.19.171-2~deb9u1 (2021-02-08) x86_64
Debian GNU/Linux 9.13 (stretch)
[..]
groups: cannot find name for group ID 54872
aborrero@tools-sgebastion-08:~$
arturo@nostromo:~$ ssh login.toolforge.org
Linux tools-sgebastion-07 4.19.0-0.bpo.14-amd64 #1 SMP Debian 4.19.171-2~deb9u1 (2021-02-08) x86_64
Debian GNU/Linux 9.13 (stretch)
[..]
groups: cannot find name for group ID 54872
aborrero@tools-sgebastion-07:~$
arturo@nostromo:~$ ssh login-buster.toolforge.org
Linux tools-sgebastion-10 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
Debian GNU/Linux 10 (buster)
[..]
aborrero@tools-sgebastion-10:~$
arturo@nostromo:~$ ssh dev-buster.toolforge.org
Linux tools-sgebastion-11 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
Debian GNU/Linux 10 (buster)
[..]
aborrero@tools-sgebastion-11:~$

Both the stretch and the buster bastions use sssd, so I am trying to figure out what else could be different.

For whatever reason the buster bastions have unscd installed (and running):

aborrero@cloud-cumin-03:~$ sudo cumin --force -x 'project:tools name:.*bastion.*' "dpkg -l unscd && systemctl is-active unscd"
IGNORE EXIT CODES mode enabled, all commands executed will be considered successful
4 hosts will be targeted:
tools-sgebastion-[07-08,10-11].tools.eqiad1.wikimedia.cloud
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====                                                                                                                                                                                             
(2) tools-sgebastion-[07-08].tools.eqiad1.wikimedia.cloud                                                                                                                                                          
----- OUTPUT of 'dpkg -l unscd &&... is-active unscd' -----                                                                                                                                                        
dpkg-query: no packages found matching unscd                                                                                                                                                                       
===== NODE GROUP =====                                                                                                                                                                                             
(2) tools-sgebastion-[10-11].tools.eqiad1.wikimedia.cloud                                                                                                                                                          
----- OUTPUT of 'dpkg -l unscd &&... is-active unscd' -----                                                                                                                                                        
Desired=Unknown/Install/Remove/Purge/Hold                                                                                                                                                                          
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend                                                                                                                                     
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
ii  unscd          0.53-1+b1    amd64        Micro Name Service Caching Daemon
active
================                   

unscd somehow runs in parallel with sssd, which is installed and active on all four bastions:

aborrero@cloud-cumin-03:~$ sudo cumin --force -x 'project:tools name:.*bastion.*' "dpkg -l sssd && systemctl is-active sssd"
IGNORE EXIT CODES mode enabled, all commands executed will be considered successful
4 hosts will be targeted:
tools-sgebastion-[07-08,10-11].tools.eqiad1.wikimedia.cloud
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====                                                                                                                                                                                             
(2) tools-sgebastion-[07-08].tools.eqiad1.wikimedia.cloud                                                                                                                                                          
----- OUTPUT of 'dpkg -l sssd && ...l is-active sssd' -----                                                                                                                                                        
Desired=Unknown/Install/Remove/Purge/Hold                                                                                                                                                                          
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend                                                                                                                                     
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version         Architecture Description
+++-==============-===============-============-==============================================
ii  sssd           1.15.0-3+deb9u2 amd64        System Security Services Daemon -- metapackage
active
===== NODE GROUP =====                                                                                                                                                                                             
(2) tools-sgebastion-[10-11].tools.eqiad1.wikimedia.cloud                                                                                                                                                          
----- OUTPUT of 'dpkg -l sssd && ...l is-active sssd' -----                                                                                                                                                        
Desired=Unknown/Install/Remove/Purge/Hold                                                                                                                                                                          
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend                                                                                                                                     
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-==============================================
ii  sssd           1.16.3-3.2   amd64        System Security Services Daemon -- metapackage
active
================

Mentioned in SAL (#wikimedia-cloud) [2022-02-15T11:14:58Z] <arturo> reboot tools-sgebastion-10 for T301736

Mentioned in SAL (#wikimedia-cloud) [2022-02-15T11:16:07Z] <arturo> purge debian package unscd on tools-sgebastion-10/11 for T301736

Dropped the LDAP entries and created them again with this LDIF:

# toolsbeta.automated-toolforge-tests, servicegroups, wikimedia.org
dn: cn=toolsbeta.automated-toolforge-tests,ou=servicegroups,dc=wikimedia,dc=org
cn: toolsbeta.automated-toolforge-tests
member: uid=aborrero,ou=people,dc=wikimedia,dc=org
member: uid=andrew,ou=people,dc=wikimedia,dc=org
member: uid=bd808,ou=people,dc=wikimedia,dc=org
member: uid=dcaro,ou=people,dc=wikimedia,dc=org
member: uid=mdipietro,ou=people,dc=wikimedia,dc=org
gidNumber: 54872
objectClass: groupOfNames
objectClass: posixGroup
objectClass: top

# toolsbeta.automated-toolforge-tests, people, servicegroups, wikimedia.org
dn: uid=toolsbeta.automated-toolforge-tests,ou=people,ou=servicegroups,dc=wikimedia,dc=org
cn: toolsbeta.automated-toolforge-tests
uidNumber: 54872
gidNumber: 54872
homeDirectory: /data/project/automated-toolforge-tests
loginShell: /bin/bash
sn: toolsbeta.automated-toolforge-tests
uid: toolsbeta.automated-toolforge-tests
objectClass: person
objectClass: posixAccount
objectClass: shadowAccount
objectClass: top

Note that I added a potentially missing objectClass: top to cn=toolsbeta.automated-toolforge-tests,ou=servicegroups,dc=wikimedia,dc=org.
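
A quick way to compare object classes across entries like the ones above is to parse the LDIF and list what each DN is missing. The following is a minimal sketch with hypothetical helper names (parse_ldif, missing_object_classes); it assumes simple LDIF only, with no folded lines and no base64 ("::") values, and it does not talk to the LDAP server.

```python
def parse_ldif(text):
    """Parse simple LDIF text into {dn: {attribute: [values]}}.

    Assumptions: records are separated by blank lines, comment lines
    start with "#", and every other line has the form "attr: value".
    """
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the current record
            continue
        if line.startswith("#"):
            continue
        attr, _, value = line.partition(": ")
        if attr == "dn":
            current = entries.setdefault(value, {})
        elif current is not None:
            current.setdefault(attr, []).append(value)
    return entries


def missing_object_classes(entry, required):
    """Return the names in `required` absent from the entry's objectClass."""
    present = {value.lower() for value in entry.get("objectClass", [])}
    return [name for name in required if name.lower() not in present]
```

For example, missing_object_classes(entry, ["top", "posixGroup"]) returns ["top"] for an entry that lists only groupOfNames and posixGroup. Note that the working toolsbeta.test3 entry in the description also shows no objectClass: top, so this attribute alone may not explain the difference.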

Mentioned in SAL (#wikimedia-cloud) [2022-02-15T11:49:22Z] <arturo> invalidate sssd cache in all bastions to debug T301736

The sssd version difference theory is hard to ignore, since only the bastions running 1.15.0 print the error:

aborrero@cloud-cumin-03:~$ sudo cumin --force -x 'project:tools name:.*bastion.*' "dpkg -s sssd | grep Version && groups aborrero | grep cannot"
IGNORE EXIT CODES mode enabled, all commands executed will be considered successful
4 hosts will be targeted:
tools-sgebastion-[07-08,10-11].tools.eqiad1.wikimedia.cloud
FORCE mode enabled, continuing without confirmation

===== NODE GROUP =====                                                                                                                                                                                             
(2) tools-sgebastion-[07-08].tools.eqiad1.wikimedia.cloud                                                                                                                                                          
----- OUTPUT of 'dpkg -s sssd | g...ro | grep cannot' -----                                                                                                                                                        
Version: 1.15.0-3+deb9u2                                                                                                                                                                                           
groups: cannot find name for group ID 54872                                                                                                                                                                        

===== NODE GROUP =====                                                                                                                                                                                             
(2) tools-sgebastion-[10-11].tools.eqiad1.wikimedia.cloud                                                                                                                                                          
----- OUTPUT of 'dpkg -s sssd | g...ro | grep cannot' -----                                                                                                                                                        
Version: 1.16.3-3.2

aborrero changed the task status from In Progress to Stalled. (Feb 15 2022, 12:28 PM)
aborrero lowered the priority of this task from Low to Lowest.
aborrero moved this task from Doing to Graveyard on the cloud-services-team (Kanban) board.

The stretch bastions are going away soon (see T277653: Toolforge: add Debian Buster to the grid and eliminate Debian Stretch), so I'm not sure we should keep debugging this.

I dropped all members of the group except me.