reimage of puppet servers can fail
Closed, Resolved (Public)

Description

There is currently a race condition when reinstalling puppetmaster servers: the gitpuppet group specifies a GID of 998, but that GID is sometimes already taken, which causes Puppet failures.

Event Timeline

I looked into this and it's quite a mess!

The debmonitor GID (which was the one clashing) is created in the postinst of debmonitor-client. Users and groups are added with adduser, the default tool in Debian for this. We use the default Debian config for adduser (due to a dpkg limitation back in 2001 (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=541620), it gets created in the adduser.postinst instead of being shipped as a conffile). The default config specifies FIRST_SYSTEM_GID=100 and LAST_SYSTEM_GID=999, i.e. an ID from that pool is used when creating a system user or group with the --system flag.

That's not a very good default for us to begin with: after all, we specify various GIDs in the range >= 500 in data.yaml, which overlaps with that pool.
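To make the overlap concrete, here is a minimal Python sketch, not the tooling actually used for this investigation. It reads the two range settings from adduser.conf-style text and reports which statically assigned GIDs fall inside the dynamic pool. The sample config text and the helper names (system_gid_range, gids_in_dynamic_pool) are hypothetical; the gitpuppet (998) and wikidev (500) values are the ones mentioned in this task.

```python
import re

# Hypothetical excerpt of adduser.conf carrying the Debian defaults quoted above.
ADDUSER_CONF = """\
# dynamically allocated system GIDs come from this range
FIRST_SYSTEM_GID=100
LAST_SYSTEM_GID=999
"""

# Statically assigned GIDs mentioned in this task (not the full data.yaml).
STATIC_GIDS = {"gitpuppet": 998, "wikidev": 500}


def system_gid_range(conf_text):
    """Return (first, last) system GID parsed from adduser.conf-style text."""
    values = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        match = re.fullmatch(r"(FIRST|LAST)_SYSTEM_GID\s*=\s*(\d+)", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values["FIRST"], values["LAST"]


def gids_in_dynamic_pool(static_gids, first, last):
    """Return the static GIDs that adduser --system could also hand out."""
    return {name: gid for name, gid in static_gids.items() if first <= gid <= last}
```

With the default range both sample GIDs are at risk of a clash; calling gids_in_dynamic_pool(STATIC_GIDS, 100, 499) instead returns an empty dict, which is the situation a narrower pool would create.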

But there are also a couple more issues here:

(1) We also have some odd behaviour in the way the GIDs for system users are allocated within the configured range. I ran a fleet-wide Cumin query for the GID of debmonitor, and there are outliers independent of the distro in use (so this is not caused by changes in adduser per se):

jessie:

(10) labpuppetmaster1001.wikimedia.org,labtestpuppetmaster2001.wikimedia.org,scb[2001-2004].codfw.wmnet,scb[1001-1004].eqiad.wmnet
debmonitor:x:497:

(2) contint[1001,2001].wikimedia.org
debmonitor:x:496:

(90) actinium.wikimedia.org,alcyone.wikimedia.org,alsafi.wikimedia.org,aluminium.wikimedia.org,bast3002.wikimedia.org,cobalt.wikimedia.org,conf[2001-2003].codfw.wmnet,dbmonitor[1001,2001].wikimedia.org,
dubnium.wikimedia.org,dumpsdata[1001-1002].eqiad.wmnet,etcd[1001-1006].eqiad.wmnet,etherpad1001.eqiad.wmnet,fermium.wikimedia.org,hassaleh.codfw.wmnet,hassium.eqiad.wmnet,helium.eqiad.wmnet,
heze.codfw.wmnet,install[1002,2002].wikimedia.org,kraz.wikimedia.org,kubestagetcd[1001-1003].eqiad.wmnet,kubetcd[2001-2003].codfw.wmnet,labmon[1001-1002].eqiad.wmnet,labstore[2001-2004].codfw.wmnet,
labstore[1004-1005].eqiad.wmnet,labstore[1006-1007].wikimedia.org,mc[2019-2036].codfw.wmnet,mc[1019-1020,1022-1036].eqiad.wmnet,mendelevium.eqiad.wmnet,oresrdb[2001-2002].codfw.wmnet,
oresrdb[1001-1002].eqiad.wmnet,pollux.wikimedia.org,pybal-test2001.codfw.wmnet,tungsten.eqiad.wmnet,ununpentium.wikimedia.org,wezen.codfw.wmnet
debmonitor:x:499:

(3) labpuppetmaster1002.wikimedia.org,mwlog2001.codfw.wmnet,mwlog1001.eqiad.wmnet
debmonitor:x:498:

(2) eeden.wikimedia.org,mc1021.eqiad.wmnet
debmonitor:x:999:

(2) scb[2005-2006].codfw.wmnet
debmonitor:x:487:



stretch:

(2) deploy2001.codfw.wmnet,deploy1001.eqiad.wmnet
debmonitor:x:496:

(42) boron.eqiad.wmnet,db[1107-1108].eqiad.wmnet,dbstore1001.eqiad.wmnet,es2001.codfw.wmnet,labweb[1001-1002].wikimedia.org,proton[2001-2002].codfw.wmnet,proton[1001-1002].eqiad.wmnet,stat1004.eqiad.wmnet,wdqs[2004-2006].codfw.wmnet,wdqs[1006-1009].eqiad.wmnet,wtp[1025-1042,1044-1048].eqiad.wmnet
debmonitor:x:497:

(205) acrab.codfw.wmnet,acrux.codfw.wmnet,analytics[1042-1077].eqiad.wmnet,argon.eqiad.wmnet,bast[4002,5001].wikimedia.org,bromine.eqiad.wmnet,chlorine.eqiad.wmnet,conf[1004-1006].eqiad.wmnet,
db[2072,2078,2084,2086,2088-2089,2091].codfw.wmnet,db[1096-1099,1101,1103,1105,1116].eqiad.wmnet,dbproxy[1002-1003,1006-1008,1010-1011].eqiad.wmnet,dbstore2002.codfw.wmnet,
dns[4001-4002,5001-5002].wikimedia.org,druid[1001-1006].eqiad.wmnet,elnath.codfw.wmnet,eventlog1002.eqiad.wmnet,flerovium.eqiad.wmnet,francium.eqiad.wmnet,furud.codfw.wmnet,
ganeti[2001-2008].codfw.wmnet,ganeti[1001-1008].eqiad.wmnet,gerrit2001.wikimedia.org,kafkamon2001.codfw.wmnet,kafkamon1001.eqiad.wmnet,labsdb[1009-1011].eqiad.wmnet,lvs[2001-2006].codfw.wmnet,
lvs1016.eqiad.wmnet,lvs[5001-5003].eqsin.wmnet,lvs[3001-3004].esams.wmnet,lvs[4005-4007].ulsfo.wmnet,maerlant.wikimedia.org,ms-be[2016-2021,2023-2043].codfw.wmnet,ms-be[1016-1026,1028-1043].eqiad.wmnet,
ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mx2001.wikimedia.org,neon.eqiad.wmnet,nescio.wikimedia.org,notebook[1003-1004].eqiad.wmnet,ping2001.codfw.wmnet,ping1001.eqiad.wmnet,
puppetdb2001.codfw.wmnet,puppetdb1001.eqiad.wmnet,puppetmaster1001.eqiad.wmnet,pybaltest[2002-2003].codfw.wmnet,rhodium.eqiad.wmnet,seaborgium.wikimedia.org,serpens.wikimedia.org,
sodium.wikimedia.org,vega.codfw.wmnet,webperf2002.codfw.wmnet,webperf1002.eqiad.wmnet
debmonitor:x:499:

(592) an-coord1001.eqiad.wmnet,an-master[1001-1002].eqiad.wmnet,an-tool1006.eqiad.wmnet,an-worker[1078-1095].eqiad.wmnet,analytics[1028-1031,1033-1041].eqiad.wmnet,analytics-tool1001.eqiad.wmnet,aqs[1004-1009].eqiad.wmnet,archiva1001.wikimedia.org,authdns[1001,2001].wikimedia.org,bast[1002,2002].wikimedia.org,cloudcontrol[2001,2003]-dev.wikimedia.org,
cloudcontrol[1003-1004].wikimedia.org,clouddb2001-dev.codfw.wmnet,cloudelastic[1001-1004].wikimedia.org,cloudnet[2002-2003]-dev.codfw.wmnet,cloudnet[1003-1004].eqiad.wmnet,
cloudservices2002-dev.wikimedia.org,cloudservices[1003-1004].wikimedia.org,cloudstore[1008-1009].wikimedia.org,cloudvirt[2001-2003]-dev.codfw.wmnet,cloudvirt[1001-1009,1012-1014,1016-1030].eqiad.wmnet,
cloudweb2001-dev.wikimedia.org,cp[2001-2026].codfw.wmnet,cp[1071-1090,1099].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,cp[3007-3008,3010,3030,3032-3036,3038-3047,3049].esams.wmnet,
cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,cumin2001.codfw.wmnet,db[2051-2052,2062,2065,2067,2070,2096-2131].codfw.wmnet,
db[1061-1062,1067,1070,1075,1077-1079,1086-1087,1092,1094-1095,1100,1102,1104,1112,1118,1120,1126-1140].eqiad.wmnet,dbprov[2001-2002].codfw.wmnet,dbprov[1001-1002].eqiad.wmnet,
dbproxy[2001-2004].codfw.wmnet,dbproxy[1012-1021].eqiad.wmnet,dbstore[1003-1005].eqiad.wmnet,dns[1001-1002,2001-2002].wikimedia.org,doc1001.eqiad.wmnet,elastic[2025-2054].codfw.wmnet,
elastic[1018-1020,1022-1045,1047-1052].eqiad.wmnet,es[1011,1014-1016,1018-1019].eqiad.wmnet,ganeti[4001-4003].ulsfo.wmnet,grafana1001.eqiad.wmnet,graphite2003.codfw.wmnet,graphite1004.eqiad.wmnet,
icinga[1001,2001].wikimedia.org,kafka-main[2001-2005].codfw.wmnet,kafka-main[1001-1005].eqiad.wmnet,kubernetes[2001-2006].codfw.wmnet,kubernetes[1001-1006].eqiad.wmnet,kubestage[1001-1002].eqiad.wmnet,
labsdb1012.eqiad.wmnet,labtestservices2003.wikimedia.org,labtestvirt2003.codfw.wmnet,ldap-eqiad-replica[01-02].wikimedia.org,ldap-replica[2001-2002].wikimedia.org,logstash[2001-2006,2020-2022].codfw.wmnet,
logstash[1007-1012,1020-1022].eqiad.wmnet,lvs[2009-2010].codfw.wmnet,lvs[1013-1015].eqiad.wmnet,maps[2001-2004].codfw.wmnet,maps[1001-1004].eqiad.wmnet,matomo1001.eqiad.wmnet,
miscweb2001.codfw.wmnet,miscweb1001.eqiad.wmnet,ms-be[2022,2044-2056].codfw.wmnet,ms-be[1044-1056].eqiad.wmnet,multatuli.wikimedia.org,mw[2150-2151,2224-2226,2231,2244-2245,2250].codfw.wmnet,
mw[1297-1298,1348].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwmaint2001.codfw.wmnet,mwmaint1002.eqiad.wmnet,mx1001.wikimedia.org,ncredir[2001-2002].codfw.wmnet,ncredir[1001-1002].eqiad.wmnet,
pc[2007-2010].codfw.wmnet,pc[1007-1010].eqiad.wmnet,people1001.eqiad.wmnet,phab2001.codfw.wmnet,phab1003.eqiad.wmnet,prometheus[2003-2004].codfw.wmnet,prometheus[1003-1004].eqiad.wmnet,
rdb[2003-2006].codfw.wmnet,rdb[1005-1006,1009-1010].eqiad.wmnet,registry[2001-2002].codfw.wmnet,registry[1001-1002].eqiad.wmnet,relforge[1001-1002].eqiad.wmnet,restbase[2009-2020].codfw.wmnet,
restbase[1016-1027].eqiad.wmnet,restbase-dev[1004-1006].eqiad.wmnet,rpki2001.codfw.wmnet,rpki1001.eqiad.wmnet,scandium.eqiad.wmnet,schema[2001-2002].codfw.wmnet,schema[1001-1002].eqiad.wmnet,
sessionstore[2001-2003].codfw.wmnet,sessionstore[1001-1003].eqiad.wmnet,snapshot1009.eqiad.wmnet,stat1007.eqiad.wmnet,theemin.codfw.wmnet,thumbor[2001-2004].codfw.wmnet,
thumbor[1001-1004].eqiad.wmnet,torrelay1001.wikimedia.org,wdqs[2001-2003].codfw.wmnet,wdqs[1003-1005,1010].eqiad.wmnet,weblog1001.eqiad.wmnet,wtp[2001-2020].codfw.wmnet,wtp1043.eqiad.wmnet
debmonitor:x:999:

(2) cumin1001.eqiad.wmnet,thorium.eqiad.wmnet
debmonitor:x:998:

(2) debmonitor2001.codfw.wmnet,debmonitor1001.eqiad.wmnet
deploy-debmonitor:x:499:
debmonitor:x:498:


buster:

(4) an-tool1007.eqiad.wmnet,puppetmaster[2001-2002].codfw.wmnet,puppetmaster1003.eqiad.wmnet
debmonitor:x:997:

(2) db1114.eqiad.wmnet,stat1005.eqiad.wmnet
debmonitor:x:999:

(57) acmechief2001.codfw.wmnet,acmechief1001.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet,an-conf[1001-1003].eqiad.wmnet,an-presto[1001-1005].eqiad.wmnet,an-tool1005.eqiad.wmnet,
analytics-tool1004.eqiad.wmnet,auth2001.codfw.wmnet,auth1002.eqiad.wmnet,backup2001.codfw.wmnet,backup1001.eqiad.wmnet,centrallog1001.eqiad.wmnet,cloudbackup2001.codfw.wmnet,
failoid2001.codfw.wmnet,failoid1001.eqiad.wmnet,ganeti[2009-2018].codfw.wmnet,gerrit1001.wikimedia.org,grafana1002.eqiad.wmnet,idp1001.wikimedia.org,krb2001.codfw.wmnet,krb1001.eqiad.wmnet,
ldapcorp[1001,2001].wikimedia.org,netbox[1001,2001].wikimedia.org,netboxdb2001.codfw.wmnet,netboxdb1001.eqiad.wmnet,netflow2001.codfw.wmnet,netflow1001.eqiad.wmnet,orespoolcounter[2003-2004].codfw.wmnet,
orespoolcounter[1003-1004].eqiad.wmnet,phab1001.eqiad.wmnet,poolcounter[2003-2004].codfw.wmnet,poolcounter[1004-1005].eqiad.wmnet,puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet,puppetmaster1002.eqiad.wmnet
debmonitor:x:998:

Looking at the jessie hosts there's an interesting time correlation:
The last systems with a 4xx GID for debmonitor were installed in March 2018. The two systems with a 9xx GID (eeden and mc1021) were installed in Sep and Nov 2018, so something changed in between (most probably in Puppet) which influences the allocation within the specified range.
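For reference, the per-distro listings above boil down to grouping hosts by the GID found in their group entry. The following is a rough Python sketch of that grouping step only; it assumes the per-host output has already been collected into a dict (how it is collected is out of scope here), and the function name and the SAMPLE data structure are hypothetical. The host names and GIDs in SAMPLE are taken from the jessie and stretch listings.

```python
from collections import defaultdict

# Hypothetical input: host name -> matching /etc/group lines for that host.
SAMPLE = {
    "eeden.wikimedia.org": "debmonitor:x:999:",
    "mc1021.eqiad.wmnet": "debmonitor:x:999:",
    "contint1001.wikimedia.org": "debmonitor:x:496:",
    # This host also has a similarly named group, which must not be counted.
    "debmonitor1001.eqiad.wmnet": "deploy-debmonitor:x:499:\ndebmonitor:x:498:",
}


def group_hosts_by_gid(lines_by_host, group="debmonitor"):
    """Map each GID of `group` to the sorted list of hosts using it."""
    by_gid = defaultdict(list)
    for host, output in lines_by_host.items():
        for line in output.splitlines():
            fields = line.split(":")  # name:password:GID:members
            if len(fields) >= 3 and fields[0] == group:
                by_gid[int(fields[2])].append(host)
    return {gid: sorted(hosts) for gid, hosts in sorted(by_gid.items())}
```

Matching on the exact group name in the first field keeps entries such as deploy-debmonitor from being mixed into the result.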

(2) The allocation also varies within a single installation. On db1114 (a buster system), the prometheus system user has GID 116 while debmonitor has 998, even though both are created with equivalent adduser calls in postinst. This also needs to be investigated further.

There are a number of action items here, for which I'll create separate tasks:

  • Puppetise adduser.conf and limit the range of system GIDs to 499 as the upper boundary for system users.
  • Investigate (1) and (2)

I think the initial quick fix for the gitpuppet install race is to move it from 998 to 501. We can declare that the range between 500 (currently owned by wikidev) and 700 (where the user groups managed under "ops" start) is reserved for system-level meta groups whose main purpose is to provide a stable GID for cross-system replication of files.
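The proposed layout can be written down as a small classifier, shown here as a Python sketch only. The exact boundaries are assumptions: the dynamic system pool is taken as 100-499 (the proposed adduser limit), the reserved block as 500 up to but not including 700, and everything from 700 upwards as user groups. The function and constant names are hypothetical.

```python
# Assumed boundaries derived from the proposal above; adjust if the
# actual policy draws the lines differently.
SYSTEM_DYNAMIC = range(100, 500)  # pool for adduser --system (100-499)
RESERVED_META = range(500, 700)   # stable GIDs for meta groups (500-699)
USER_GROUPS_START = 700           # user groups are managed from here


def classify_gid(gid):
    """Return which block of the proposed layout a GID falls into."""
    if gid in SYSTEM_DYNAMIC:
        return "system-dynamic"
    if gid in RESERVED_META:
        return "reserved-meta"
    if gid >= USER_GROUPS_START:
        return "user-group"
    return "other"
```

Under these assumptions the proposed gitpuppet GID of 501 and the wikidev GID of 500 both land in the reserved block, while a dynamically allocated GID such as 116 stays in the system pool.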

jbond triaged this task as Medium priority.Oct 14 2019, 12:46 PM

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.

MoritzMuehlenhoff claimed this task.

We can close this task; no Puppet server-specific change is needed. Various other changes have fixed this, most notably that we now have puppetised defaults for adduser.conf and systemd-sysusers to allocate local system users in the range 100-499.

Change 818450 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:base: add stages

https://gerrit.wikimedia.org/r/818450

Change 818450 merged by Jbond:

[operations/puppet@production] P:adduser: apply adduser before any packages are installed

https://gerrit.wikimedia.org/r/818450

Change 819541 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/puppet@production] P:adduser: apply adduser before any packages are installed

https://gerrit.wikimedia.org/r/819541

Change 819580 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:adduser: apply adduser before any packages are installed

https://gerrit.wikimedia.org/r/819580

Change 819581 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:apt: apply apt before any packages are installed

https://gerrit.wikimedia.org/r/819581

Change 819580 merged by Jbond:

[operations/puppet@production] P:adduser: apply adduser before any packages are installed

https://gerrit.wikimedia.org/r/819580

Change 819541 abandoned by Jbond:

[operations/puppet@production] P:adduser: apply adduser before any packages are installed

Reason:

https://gerrit.wikimedia.org/r/819541

Change 819581 abandoned by Jbond:

[operations/puppet@production] P:apt: apply apt before any packages are installed

Reason:

we have fixed this issue with a different patch

https://gerrit.wikimedia.org/r/819581