
Blocked /etc/passwd on sca100[1234] hosts
Closed, Resolved · Public


Applying rOPUP528824ead815 via puppet failed on sca* hosts with:

/usr/sbin/useradd -c "Manuel Arostegui" -g 500 -s /bin/bash -u 15343 marostegui
useradd: failure while writing changes to /etc/passwd

Event Timeline

jcrespo created this task. Sep 1 2016, 2:34 PM
Restricted Application added a subscriber: Aklapper. Sep 1 2016, 2:34 PM

Debugging on IRC:

<akosiaris> so this time around it's /etc/passwd that's locked
<akosiaris> not /etc/passwd+
<akosiaris> ?
<akosiaris> moritzm: ^ might be this ?

<moritzm> akosiaris: good catch, sca are the only trustys with 3.13

<akosiaris> ok that was it 
<akosiaris> well kind of
<moritzm> and there's been the firejail update to 0.9.40 after Madhu joined (for the bug in exitcode handling)
<akosiaris> now I get Error: /usr/bin/gpasswd ops -M ... returned 1 instead of one of [0]

<akosiaris> rename("/etc/group+", "/etc/group")     = -1 EBUSY (Device or resource busy)
<akosiaris> so that's the one I had met last time
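
For context, the failing rename() is the last step of the usual update pattern for these files: the new contents are written to a `+` sibling, which is then renamed over the original. A minimal sketch of that pattern on a scratch file follows. The function name `update_via_rename` and the sample group lines are made up for illustration, and the real tools do more than this (locking, validation).

```shell
#!/bin/sh
# Sketch of the "write FILE+, then rename() over FILE" pattern seen in the
# strace output. Operates on a scratch copy, never on the real /etc/group.

# update_via_rename FILE LINE: append LINE to FILE via FILE+ and a rename.
update_via_rename() {
    target=$1
    new_line=$2
    tmp="${target}+"

    # Build the complete new contents next to the target.
    cp -- "$target" "$tmp" || return 1
    printf '%s\n' "$new_line" >> "$tmp" || return 1

    # On the same filesystem, mv is a rename(2). This is the call that
    # came back with EBUSY in the log above.
    if ! mv -- "$tmp" "$target"; then
        rm -f -- "$tmp"
        return 1
    fi
}

# Hypothetical sample data in a temporary directory.
scratch=$(mktemp -d)
printf 'ops:x:500:\n' > "$scratch/group"
update_via_rename "$scratch/group" 'demo:x:501:'
cat "$scratch/group"
```

When the rename fails, the `+` file is the only thing that changed, which matches what the strace shows: the new contents were ready, only the final swap was refused.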

I've done a

service zotero stop
puppet agent -t -v
service zotero stop 
/usr/bin/gpasswd ops -M filippo,jgreen,bblack,andrew,faidon,rush,oblivian,laner,yuvipanda,dzahn,akosiaris,springle,mark,ariel,cmjohnson,otto,robh,tstarling,ori,midom,jmm,jynus,aaron,ema,elukey,gehel,volans,madhuvishy,marostegui
puppet agent -t -v

dance to get the user applied on sca1001, sca1002 and sca2001. I've left sca2002 as is so we can debug this further. sca2002 has been ACKed in Icinga by @jcrespo.
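
For the debugging on sca2002, one possible way to see which processes could be pinning the files is to scan every process's mount table for a mount sitting on /etc/passwd or /etc/group. This is only a sketch: it assumes the sandboxed processes hold mounts on exactly those paths, and the function names are made up here. It relies on the fifth field of /proc/PID/mountinfo being the mount point.

```shell
#!/bin/sh
# list_pinning_mounts MOUNTINFO_FILE
# Print the mount point of every entry mounted on /etc/passwd or /etc/group.
# Field 5 of a mountinfo line is the mount point.
list_pinning_mounts() {
    awk '$5 == "/etc/passwd" || $5 == "/etc/group" { print $5 }' "$1"
}

# scan_all_processes
# Print "PID <tab> command <tab> mount points" for each process whose mount
# namespace contains such a mount. Processes that exit mid-scan are skipped.
scan_all_processes() {
    for info in /proc/[0-9]*/mountinfo; do
        [ -r "$info" ] || continue
        hits=$(list_pinning_mounts "$info" 2>/dev/null | sort -u | tr '\n' ' ')
        if [ -n "$hits" ]; then
            pid=${info#/proc/}
            pid=${pid%/mountinfo}
            printf '%s\t%s\t%s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)" "$hits"
        fi
    done
}
```

Running `scan_all_processes` as root gives the full picture; as an unprivileged user only one's own processes are readable. An empty result would suggest the assumption about the mount paths is wrong rather than that nothing holds the files.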

The locking limitation is fixed in Linux 3.18:

The sca cluster is currently the only cluster with long-running firejail processes using a kernel < 3.18; trusty uses 3.13. scb and the image scalers are running on jessie with Linux 4.4.
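
To check a given host against the 3.18 threshold, something like the following could be used. It is a rough sketch that assumes GNU `sort -V` is available; `version_lt` is a helper name made up here.

```shell
#!/bin/sh
# version_lt A B: succeed when version A sorts strictly before version B.
# Relies on GNU sort's -V (version sort) option.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Strip the ABI/flavour suffix, e.g. 3.13.0-NN-generic -> 3.13.0
running=$(uname -r | cut -d- -f1)

if version_lt "$running" 3.18; then
    echo "kernel $running: older than 3.18, affected by the locking limitation"
else
    echo "kernel $running: 3.18 or newer"
fi
```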

(There's one corner case where this also applies to the standard app servers: the Score extension for creating musical typesheets has a code path which triggers a scaling operation using imagemagick, and that conversion is also guarded by firejail. However, such invocations are fairly rare to begin with (IIRC about 100 per week for the entire cluster) and this only hits when making changes to privileged files. Since the mw* cluster is being reimaged to jessie anyway, I'll ignore this.)

Alex mentioned that he intends to migrate the sca cluster to jessie in the foreseeable future anyway, so my suggestion is to move the sca* cluster to the current HWE kernel: Ubuntu provides backports of the xenial/16.04 kernel to trusty which are officially supported. This solves the problem without making sca-specific tweaks to the firejail config or downgrading to an older firejail version (which is affected by the exitcode passthrough bug anyway).

Any objections?

None on my part

+1. Thank you Moritz!

sca in codfw has been upgraded to the 4.4 kernel series and all seems fine so far. I'll upgrade eqiad on Monday.

Mentioned in SAL [2016-09-05T10:37:16Z] <moritzm> depooling/rebooting/repooling sca1001 for upgrade to Linux 4.4 (T144492)

Mentioned in SAL [2016-09-05T12:47:30Z] <moritzm> depooling/rebooting/repooling sca1002 for upgrade to Linux 4.4 (T144492)

MoritzMuehlenhoff closed this task as Resolved. Sep 5 2016, 1:07 PM

All systems in the sca cluster are now running the 4.4 HWE kernel on trusty, and puppet runs are fine again.