There is currently a race condition when reinstalling puppetmaster servers: the gitpuppet group specifies a GID of 998, but sometimes this GID is already taken, which causes Puppet failures.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | jbond | T228657 Upgrade Puppet Masters and Puppet DB servers
Resolved | | MoritzMuehlenhoff | T235067 reimage of puppet servers can fail
Event Timeline
I looked into this and it's quite a mess!
The debmonitor group (whose GID was the one clashing) is created in the postinst of debmonitor-client. Users/groups are added with the default Debian tool for this (adduser). We use the default Debian config for adduser; due to a dpkg limitation dating back to 2001 (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=541620), it gets created in adduser.postinst instead of being shipped as a conffile. The default config specifies FIRST_SYSTEM_GID=100 and LAST_SYSTEM_GID=999, i.e. an ID from that pool is used when creating a user with the --system flag.
That's not a very good default for us to begin with: after all, we're specifying various GIDs in the range >= 500 in data.yaml.
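The overlap described above can be illustrated with a small Python sketch. The helper name `pinned_gids_in_dynamic_range` is hypothetical; the two pinned GIDs (gitpuppet at 998, wikidev at 500) and both ranges are the ones mentioned in this task. It only reports which fixed GIDs sit inside the pool that adduser draws from for --system users, it does not model how adduser picks an ID.

```python
def pinned_gids_in_dynamic_range(pinned, first_system_gid, last_system_gid):
    """Return the pinned groups whose GID lies inside the dynamic system range.

    Hypothetical helper for illustration: any GID returned here could be
    handed out to a package-created system group before Puppet claims it.
    """
    return {
        name: gid
        for name, gid in pinned.items()
        if first_system_gid <= gid <= last_system_gid
    }


# GIDs named in this task: gitpuppet is pinned to 998, wikidev owns 500.
PINNED = {"gitpuppet": 998, "wikidev": 500}

# Debian default: FIRST_SYSTEM_GID=100, LAST_SYSTEM_GID=999.
print(pinned_gids_in_dynamic_range(PINNED, 100, 999))
# -> {'gitpuppet': 998, 'wikidev': 500}

# Range capped at 499: no pinned GID is exposed any more.
print(pinned_gids_in_dynamic_range(PINNED, 100, 499))
# -> {}
```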
But there's also a couple more issues here:
(1) We also have some odd behaviour in the way the GIDs for system users are allocated within the configured range. I did a fleet-wide Cumin run for the GID of debmonitor and there are some outliers independent of the distro in use (so no changes in adduser per se):
jessie:
(10) labpuppetmaster1001.wikimedia.org,labtestpuppetmaster2001.wikimedia.org,scb[2001-2004].codfw.wmnet,scb[1001-1004].eqiad.wmnet
debmonitor:x:497:
(2) contint[1001,2001].wikimedia.org
debmonitor:x:496:
(90) actinium.wikimedia.org,alcyone.wikimedia.org,alsafi.wikimedia.org,aluminium.wikimedia.org,bast3002.wikimedia.org,cobalt.wikimedia.org,conf[2001-2003].codfw.wmnet,dbmonitor[1001,2001].wikimedia.org,dubnium.wikimedia.org,dumpsdata[1001-1002].eqiad.wmnet,etcd[1001-1006].eqiad.wmnet,etherpad1001.eqiad.wmnet,fermium.wikimedia.org,hassaleh.codfw.wmnet,hassium.eqiad.wmnet,helium.eqiad.wmnet,heze.codfw.wmnet,install[1002,2002].wikimedia.org,kraz.wikimedia.org,kubestagetcd[1001-1003].eqiad.wmnet,kubetcd[2001-2003].codfw.wmnet,labmon[1001-1002].eqiad.wmnet,labstore[2001-2004].codfw.wmnet,labstore[1004-1005].eqiad.wmnet,labstore[1006-1007].wikimedia.org,mc[2019-2036].codfw.wmnet,mc[1019-1020,1022-1036].eqiad.wmnet,mendelevium.eqiad.wmnet,oresrdb[2001-2002].codfw.wmnet,oresrdb[1001-1002].eqiad.wmnet,pollux.wikimedia.org,pybal-test2001.codfw.wmnet,tungsten.eqiad.wmnet,ununpentium.wikimedia.org,wezen.codfw.wmnet
debmonitor:x:499:
(3) labpuppetmaster1002.wikimedia.org,mwlog2001.codfw.wmnet,mwlog1001.eqiad.wmnet
debmonitor:x:498:
(2) eeden.wikimedia.org,mc1021.eqiad.wmnet
debmonitor:x:999:
(2) scb[2005-2006].codfw.wmnet
debmonitor:x:487:

stretch:
(2) deploy2001.codfw.wmnet,deploy1001.eqiad.wmnet
debmonitor:x:496:
(42) boron.eqiad.wmnet,db[1107-1108].eqiad.wmnet,dbstore1001.eqiad.wmnet,es2001.codfw.wmnet,labweb[1001-1002].wikimedia.org,proton[2001-2002].codfw.wmnet,proton[1001-1002].eqiad.wmnet,stat1004.eqiad.wmnet,wdqs[2004-2006].codfw.wmnet,wdqs[1006-1009].eqiad.wmnet,wtp[1025-1042,1044-1048].eqiad.wmnet
debmonitor:x:497:
(205) acrab.codfw.wmnet,acrux.codfw.wmnet,analytics[1042-1077].eqiad.wmnet,argon.eqiad.wmnet,bast[4002,5001].wikimedia.org,bromine.eqiad.wmnet,chlorine.eqiad.wmnet,conf[1004-1006].eqiad.wmnet,db[2072,2078,2084,2086,2088-2089,2091].codfw.wmnet,db[1096-1099,1101,1103,1105,1116].eqiad.wmnet,dbproxy[1002-1003,1006-1008,1010-1011].eqiad.wmnet,dbstore2002.codfw.wmnet,dns[4001-4002,5001-5002].wikimedia.org,druid[1001-1006].eqiad.wmnet,elnath.codfw.wmnet,eventlog1002.eqiad.wmnet,flerovium.eqiad.wmnet,francium.eqiad.wmnet,furud.codfw.wmnet,ganeti[2001-2008].codfw.wmnet,ganeti[1001-1008].eqiad.wmnet,gerrit2001.wikimedia.org,kafkamon2001.codfw.wmnet,kafkamon1001.eqiad.wmnet,labsdb[1009-1011].eqiad.wmnet,lvs[2001-2006].codfw.wmnet,lvs1016.eqiad.wmnet,lvs[5001-5003].eqsin.wmnet,lvs[3001-3004].esams.wmnet,lvs[4005-4007].ulsfo.wmnet,maerlant.wikimedia.org,ms-be[2016-2021,2023-2043].codfw.wmnet,ms-be[1016-1026,1028-1043].eqiad.wmnet,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mx2001.wikimedia.org,neon.eqiad.wmnet,nescio.wikimedia.org,notebook[1003-1004].eqiad.wmnet,ping2001.codfw.wmnet,ping1001.eqiad.wmnet,puppetdb2001.codfw.wmnet,puppetdb1001.eqiad.wmnet,puppetmaster1001.eqiad.wmnet,pybaltest[2002-2003].codfw.wmnet,rhodium.eqiad.wmnet,seaborgium.wikimedia.org,serpens.wikimedia.org,sodium.wikimedia.org,vega.codfw.wmnet,webperf2002.codfw.wmnet,webperf1002.eqiad.wmnet
debmonitor:x:499:
(42) boron.eqiad.wmnet,db[1107-1108].eqiad.wmnet,dbstore1001.eqiad.wmnet,es2001.codfw.wmnet,labweb[1001-1002].wikimedia.org,proton[2001-2002].codfw.wmnet,proton[1001-1002].eqiad.wmnet,stat1004.eqiad.wmnet,wdqs[2004-2006].codfw.wmnet,wdqs[1006-1009].eqiad.wmnet,wtp[1025-1042,1044-1048].eqiad.wmnet
debmonitor:x:497:
(592) an-coord1001.eqiad.wmnet,an-master[1001-1002].eqiad.wmnet,an-tool1006.eqiad.wmnet,an-worker[1078-1095].eqiad.wmnet,analytics[1028-1031,1033-1041].eqiad.wmnet,analytics-tool1001.eqiad.wmnet,aqs[1004-1009].eqiad.wmnet,archiva1001.wikimedia.org,authdns[1001,2001].wikimedia.org,bast[1002,2002].wikimedia.org,cloudcontrol[2001,2003]-dev.wikimedia.org,cloudcontrol[1003-1004].wikimedia.org,clouddb2001-dev.codfw.wmnet,cloudelastic[1001-1004].wikimedia.org,cloudnet[2002-2003]-dev.codfw.wmnet,cloudnet[1003-1004].eqiad.wmnet,cloudservices2002-dev.wikimedia.org,cloudservices[1003-1004].wikimedia.org,cloudstore[1008-1009].wikimedia.org,cloudvirt[2001-2003]-dev.codfw.wmnet,cloudvirt[1001-1009,1012-1014,1016-1030].eqiad.wmnet,cloudweb2001-dev.wikimedia.org,cp[2001-2026].codfw.wmnet,cp[1071-1090,1099].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,cp[3007-3008,3010,3030,3032-3036,3038-3047,3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,cumin2001.codfw.wmnet,db[2051-2052,2062,2065,2067,2070,2096-2131].codfw.wmnet,db[1061-1062,1067,1070,1075,1077-1079,1086-1087,1092,1094-1095,1100,1102,1104,1112,1118,1120,1126-1140].eqiad.wmnet,dbprov[2001-2002].codfw.wmnet,dbprov[1001-1002].eqiad.wmnet,dbproxy[2001-2004].codfw.wmnet,dbproxy[1012-1021].eqiad.wmnet,dbstore[1003-1005].eqiad.wmnet,dns[1001-1002,2001-2002].wikimedia.org,doc1001.eqiad.wmnet,elastic[2025-2054].codfw.wmnet,elastic[1018-1020,1022-1045,1047-1052].eqiad.wmnet,es[1011,1014-1016,1018-1019].eqiad.wmnet,ganeti[4001-4003].ulsfo.wmnet,grafana1001.eqiad.wmnet,graphite2003.codfw.wmnet,graphite1004.eqiad.wmnet,icinga[1001,2001].wikimedia.org,kafka-main[2001-2005].codfw.wmnet,kafka-main[1001-1005].eqiad.wmnet,kubernetes[2001-2006].codfw.wmnet,kubernetes[1001-1006].eqiad.wmnet,kubestage[1001-1002].eqiad.wmnet,labsdb1012.eqiad.wmnet,labtestservices2003.wikimedia.org,labtestvirt2003.codfw.wmnet,ldap-eqiad-replica[01-02].wikimedia.org,ldap replica[2001-2002].wikimedia.org,logstash[2001-2006,2020-2022].codfw.wmnet,logstash[1007-1012,1020-1022].eqiad.wmnet,lvs[2009-2010].codfw.wmnet,lvs[1013-1015].eqiad.wmnet,maps[2001-2004].codfw.wmnet,maps[1001-1004].eqiad.wmnet,matomo1001.eqiad.wmnet,miscweb2001.codfw.wmnet,miscweb1001.eqiad.wmnet,ms-be[2022,2044-2056].codfw.wmnet,msbe[1044-1056].eqiad.wmnet,multatuli.wikimedia.org,mw[2150-2151,2224-2226,2231,2244-2245,2250].codfw.wmnet,mw[1297-1298,1348].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwmaint2001.codfw.wmnet,mwmaint1002.eqiad.wmnet,mx1001.wikimedia.org,ncredir[2001-2002].codfw.wmnet,ncredir[1001-1002].eqiad.wmnet,pc[2007-2010].codfw.wmnet,pc[1007-1010].eqiad.wmnet,people1001.eqiad.wmnet,phab2001.codfw.wmnet,phab1003.eqiad.wmnet,prometheus[2003-2004].codfw.wmnet,prometheus[1003-1004].eqiad.wmnet,rdb[2003-2006].codfw.wmnet,rdb[1005-1006,1009-1010].eqiad.wmnet,registry[2001-2002].codfw.wmnet,registry[1001-1002].eqiad.wmnet,relforge[1001-1002].eqiad.wmnet,restbase[2009-2020].codfw.wmnet,restbase[1016-1027].eqiad.wmnet,restbase-dev[1004-1006].eqiad.wmnet,rpki2001.codfw.wmnet,rpki1001.eqiad.wmnet,scandium.eqiad.wmnet,schema[2001-2002].codfw.wmnet,schema[1001-1002].eqiad.wmnet,sessionstore[2001-2003].codfw.wmnet,sessionstore[1001-1003].eqiad.wmnet,snapshot1009.eqiad.wmnet,stat1007.eqiad.wmnet,theemin.codfw.wmnet,thumbor[2001-2004].codfw.wmnet,thumbor[1001-1004].eqiad.wmnet,torrelay1001.wikimedia.org,wdqs[2001-2003].codfw.wmnet,wdqs[1003-1005,1010].eqiad.wmnet,weblog1001.eqiad.wmnet,wtp[2001-2020].codfw.wmnet,wtp1043.eqiad.wmnet
debmonitor:x:999:
(2) cumin1001.eqiad.wmnet,thorium.eqiad.wmnet
debmonitor:x:998:
(2) debmonitor2001.codfw.wmnet,debmonitor1001.eqiad.wmnet
deploy-debmonitor:x:499:
debmonitor:x:498:

buster:
(4) an-tool1007.eqiad.wmnet,puppetmaster[2001-2002].codfw.wmnet,puppetmaster1003.eqiad.wmnet
debmonitor:x:997:
(2) db1114.eqiad.wmnet,stat1005.eqiad.wmnet
debmonitor:x:999:
(57) acmechief2001.codfw.wmnet,acmechief1001.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet,an-conf[1001-1003].eqiad.wmnet,an-presto[1001-1005].eqiad.wmnet,an-tool1005.eqiad.wmnet,analytics-tool1004.eqiad.wmnet,auth2001.codfw.wmnet,auth1002.eqiad.wmnet,backup2001.codfw.wmnet,backup1001.eqiad.wmnet,centrallog1001.eqiad.wmnet,cloudbackup2001.codfw.wmnet,failoid2001.codfw.wmnet,failoid1001.eqiad.wmnet,ganeti[2009-2018].codfw.wmnet,gerrit1001.wikimedia.org,grafana1002.eqiad.wmnet,idp1001.wikimedia.org,krb2001.codfw.wmnet,krb1001.eqiad.wmnet,ldapcorp[1001,2001].wikimedia.org,netbox[1001,2001].wikimedia.org,netboxdb2001.codfw.wmnet,netboxdb1001.eqiad.wmnet,netflow2001.codfw.wmnet,netflow1001.eqiad.wmnet,orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet,phab1001.eqiad.wmnet,poolcounter[2003-2004].codfw.wmnet,poolcounter[1004-1005].eqiad.wmnet,puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet,puppetmaster1002.eqiad.wmnet
debmonitor:x:998:
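For anyone repeating this survey, the per-GID grouping in the output above can be rebuilt from raw group entries with a few lines of Python. This is a sketch only: `group_hosts_by_gid` is a hypothetical helper, and it assumes the input is a list of (host, line) pairs where each line has the usual `name:password:gid:members` layout of a group database entry. The sample hosts and GIDs are taken from the jessie section.

```python
from collections import defaultdict


def group_hosts_by_gid(entries, group="debmonitor"):
    """Map each GID to the sorted list of hosts reporting it for `group`.

    `entries` is an iterable of (host, line) pairs, where `line` is a group
    database entry such as "debmonitor:x:497:".
    """
    by_gid = defaultdict(list)
    for host, line in entries:
        # Fields are name:password:gid:members; members may be empty.
        name, _password, gid, _members = line.strip().split(":", 3)
        if name == group:
            by_gid[int(gid)].append(host)
    return {gid: sorted(hosts) for gid, hosts in sorted(by_gid.items())}


# Sample taken from the jessie hosts listed in this task.
sample = [
    ("eeden.wikimedia.org", "debmonitor:x:999:"),
    ("mc1021.eqiad.wmnet", "debmonitor:x:999:"),
    ("contint1001.wikimedia.org", "debmonitor:x:496:"),
    ("contint2001.wikimedia.org", "debmonitor:x:496:"),
]

for gid, hosts in group_hosts_by_gid(sample).items():
    print(gid, len(hosts), ",".join(hosts))
```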
Looking at the jessie hosts, there's an interesting time correlation:
The last systems with a 4xx GID for debmonitor were installed in March 2018. The two systems with a 9xx GID (eeden and mc1021) were installed in Sep and Nov 2018, so something changed in between (most probably in Puppet) which influences the allocation within the specified range.
(2) The allocation is also changing within a single installation. When looking at db1114 (a buster system) the prometheus system user has GID 116 and debmonitor has 998 (and both are created with equivalent adduser calls in postinst). Also something that needs to be investigated further.
There's a number of actionables here, for which I'll create separate tasks:
- Puppetise adduser.conf and limit the range of system GIDs to 499 as the upper boundary for system users.
- Investigate (1) and (2)
I think the initial quick fix for the gitpuppet install race is to move it from 998 to 501. We can declare that the range between 500 (currently owned by wikidev) and 700 (where the user groups start to be managed with "ops") is reserved for system level meta groups whose main purpose is to provide a stable GID for cross-system replication of files.
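A quick way to check on a given host whether a fixed GID is safe to claim is to look it up in the group database first. The sketch below uses Python's standard `grp` module (Unix only); the function name `gid_conflict` and the fake lookup table are hypothetical and only there so the example runs the same on any machine. The table mirrors the clash described in this task, with debmonitor already sitting on 998.

```python
import grp
from types import SimpleNamespace


def gid_conflict(name, gid, lookup=grp.getgrgid):
    """Return the name of a different group already holding `gid`, else None.

    `lookup` defaults to the real group database; it must raise KeyError
    for an unknown GID, as grp.getgrgid does.
    """
    try:
        owner = lookup(gid).gr_name
    except KeyError:
        return None  # GID is free
    return None if owner == name else owner


# Fake group database for illustration (assumption, not real host data).
_FAKE_GROUPS = {998: "debmonitor", 500: "wikidev"}


def fake_lookup(gid):
    # Raises KeyError for unknown GIDs, like grp.getgrgid.
    return SimpleNamespace(gr_name=_FAKE_GROUPS[gid])


print(gid_conflict("gitpuppet", 998, lookup=fake_lookup))  # debmonitor
print(gid_conflict("gitpuppet", 501, lookup=fake_lookup))  # None
```

Calling `gid_conflict("gitpuppet", 501)` without the `lookup` argument queries the local group database instead.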
Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.
We can close this task; no Puppet server specific change is needed. Various other changes have fixed this, most notably that we now have puppetised defaults for adduser.conf and systemd-sysusers to allocate local system users between 100 and 499.
Change 818450 had a related patch set uploaded (by Jbond; author: jbond):
[operations/puppet@production] P:base: add stages
Change 818450 merged by Jbond:
[operations/puppet@production] P:adduser: apply adduser before any packages are installed
Change 819541 had a related patch set uploaded (by Jbond; author: Jbond):
[operations/puppet@production] P:adduser: apply adduser before any packages are installed
Change 819580 had a related patch set uploaded (by Jbond; author: jbond):
[operations/puppet@production] P:adduser: apply adduser before any packages are installed
Change 819581 had a related patch set uploaded (by Jbond; author: jbond):
[operations/puppet@production] P:apt: apply apt before any packages are installed
Change 819580 merged by Jbond:
[operations/puppet@production] P:adduser: apply adduser before any packages are installed
Change 819541 abandoned by Jbond:
[operations/puppet@production] P:adduser: apply adduser before any packages are installed
Reason:
Change 819581 abandoned by Jbond:
[operations/puppet@production] P:apt: apply apt before any packages are installed
Reason:
we have fixed this issue with a different patch