Page MenuHomePhabricator

PHD ensuring umask goodness
Closed, ResolvedPublic

Description

Upstream: https://secure.phabricator.com/T7475

Clicking https://phabricator.wikimedia.org/rEFLW573567f4461c199a77136effe8b1fdd1771c1709 from https://phabricator.wikimedia.org/T69192#1091750 results into:

Unhandled Exception ("CommandException")
Command failed with error #128! COMMAND git log -n 1 --format='%P' '573567f4461c199a77136effe8b1fdd1771c1709' STDOUT (empty) STDERR fatal: loose object 573567f4461c199a77136effe8b1fdd1771c1709 (stored in ./objects/57/3567f4461c199a77136effe8b1fdd1771c1709) is corrupt

Revisions and Commits

Event Timeline

Nemo_bis created this task.Mar 5 2015, 3:33 PM
Nemo_bis raised the priority of this task from to Needs Triage.
Nemo_bis updated the task description. (Show Details)
Nemo_bis added a project: Gitblit-Deprecate.
Nemo_bis added a subscriber: Nemo_bis.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 5 2015, 3:33 PM
demon added a subscriber: demon.

Known, we spotted on IRC a bit ago. Something's up with umasks.

I am the one who responded to this. After some quick checks like verifying the md5sum of the above object I noticed the object had permissions 0400 and readable only by phd. A quick find showed multiple objects in multiple repos having that behaviour. Seems like the first ones got populated around 10:12 UTC today (March 5). New ones with the same permissions continued to be created in various repos as people commited changes in gerrit. After some thought, I restarted phd and watched the new objects that were created. The new objects seemed indeed to have the correct permissions 0444. At this point I used the same find to set go+r permissions on the offending git objects and most repos are back to normal(ish)

The repos I witnessed having problems are:

https://phabricator.wikimedia.org/diffusion/ANZD
https://phabricator.wikimedia.org/diffusion/ECTX
https://phabricator.wikimedia.org/diffusion/EFLW
https://phabricator.wikimedia.org/diffusion/EIME
https://phabricator.wikimedia.org/diffusion/EMFR
https://phabricator.wikimedia.org/diffusion/EPOP
https://phabricator.wikimedia.org/diffusion/EVED
https://phabricator.wikimedia.org/diffusion/EWBA
https://phabricator.wikimedia.org/diffusion/GOJU
https://phabricator.wikimedia.org/diffusion/GTWN
https://phabricator.wikimedia.org/diffusion/MWVA
https://phabricator.wikimedia.org/diffusion/ODNS
https://phabricator.wikimedia.org/diffusion/OMWC
https://phabricator.wikimedia.org/diffusion/OPUP
https://phabricator.wikimedia.org/diffusion/PWBO

Most are OK but other are whining with a different error:

https://phabricator.wikimedia.org/diffusion/EPOP/

Command failed with error #128! COMMAND git ls-tree -z -l '3eaf2829e8f7ea0dd5967ea537f4fcd8ccf37679':'' STDOUT (empty) STDERR fatal: Not a valid object name

I think I just fixed all of them. Turns out there were also directories created with the wrong permissions during. My commands to fix it were

cd /srv/phab/repos

sudo find . -perm 0400 -exec chmod go+r {} \; # Fix the objects themselves
sudo find . -perm 0600 -exec chmod go+r {} \; # Fix the refs (like refs/heads/master)
sudo find . -perm 0700 -exec chmod go+r {} \; # Fix directories like EIME/objects/cf

I did check extensively before messing with any of those with a -ls

I do consider the issue itself resolved but the question is what caused phd to create the file with the wrong permissions

@chasemp, @mmodell any ideas ?

Note: I merged https://gerrit.wikimedia.org/r/#/c/194126/ about one hour before all the trouble started but I fail to see how it might have triggered phd's misbehavior

Krenair added a subscriber: Krenair.Mar 5 2015, 4:16 PM

Thanks @akosiaris!

I have no explanation for this. Updating the json config file does trigger a restart of PHD but I'm having trouble linking up this cause with that effect. Going to keep an eye on it for awhile I guess.

coren added a subscriber: coren.Mar 5 2015, 4:36 PM

Should this be closed as resolved or renamed to something more indicative of "keeping an eye on it?"

chasemp changed the task status from Open to Stalled.Mar 5 2015, 4:57 PM
chasemp claimed this task.
chasemp triaged this task as Low priority.
chasemp set Security to None.

Should this be closed as resolved or renamed to something more indicative of "keeping an eye on it?"

assigned to me / changed priority / stalled for now

cool?

coren added a comment.Mar 5 2015, 4:58 PM

Cool. :-)

I think I actually figured this out. It wasn't until late last night that I put 2+2 together. So this task and T91525 are related.

T91525 is that latest (and last I'd say) in line of similar problems releng has met and needed to solve. puppet would create the files honouring its umask as setup by the puppet-run shell script

What has happened according to my theory is:

  • Chase would always shepherd the change in production after merging by issuing a sudo puppet agent -t -v on iridum.
  • I did NOT issue a sudo puppet agent -t -v on iridium upon merging https://gerrit.wikimedia.org/r/#/c/194126/ as I considered it a very innocuous and difficult to go wrong change. I was wrong.
  • Some 20 minutes later puppet-run kicked in. It did set umask to 0077 and the puppet agent call also had that umask set.
  • phd was restarted due to the change file and it obtained the umask of the process re-starting it, namely puppet agent's, e.g 0077.
  • The first commits started arrived some time later and the corresponding objects were creating with the umask 0077 making them unreadable by anyone else that the owner of the file (phd)
  • phabricator UI started complaining because of the above
  • I restarted phd manually and it got my umask. So now phd started creating the new objects with the correct permissions.
  • I fixed the permissions on the various files and we got back in business

All in all, winds of change started a few months ago in https://gerrit.wikimedia.org/r/#/c/179082/

Then yesterday a butterfly moved its wings in https://gerrit.wikimedia.org/r/#/c/194126/ and chaos theory kicked in.

So actionables:

  • We fixed the puppet agent umask issue so it won't happen again.
  • Actually fix phd to do what a nice daemon should do.

@chasemp It is not nice for a daemon to not set its umask. According to [stevens] a daemon must:

  • Close all open file descriptors.
  • Change current working directory.
  • Reset the file access creation mask
  • Run in the background.
  • Disassociate from process group.
  • Ignore terminal I/O signals.
  • Disassociate from control terminal.
  • Don't reacquire a control terminal.
  • Correctly handle the following circumstances:
    1. Started by System V init process.
    2. Daemon termination by SIGTERM signal.
    3. Children generate SIGCLD signal.

Granted, systemd, no backgrounding and all that, things change and we are a long way from 1974 but the umask issue still stands. So upstream should fix it

[stevens] ( 1 , 2 , 3 , 4 ) Unix Network Programming , W. Richard Stevens, 1994 Prentice Hall.

Qgil added a subscriber: Qgil.Mar 6 2015, 3:14 PM

[stevens] ( 1 , 2 , 3 , 4 ) Unix Network Programming , W. Richard Stevens, 1994 Prentice Hall.

Ok, call me impressed! :)

...

  • Reset the file access creation mask

Nice dude.

Lots of work upstream on phd lately so I confirmed on another install this behavior at HEAD.

Import new repo:

root@phab-01:~# umask
0022
root@phab-01:/var/repo# /srv/phab/phabricator/bin/phd restart
drwxr-xr-x  7 root     root     4096 Mar  6 15:44 TFR/

Import a new repo:

root@phab-01:/var/repo# umask 077
root@phab-01:/var/repo# /srv/phab/phabricator/bin/phd restart
drwx------  7 root     root     4096 Mar  6 15:50 TPYPH/
epriestley: chasemp: good enough for me, I'll file something to get us setting the umask

We're in agreement about phd being wrong here. Upstream task:

https://secure.phabricator.com/T7475

chasemp renamed this task from Unhandled Exception ("CommandException") in diffusion to PHD ensuring umask goodness.Mar 6 2015, 4:46 PM
chasemp updated the task description. (Show Details)
chasemp moved this task from Backlog to Ready To Go on the Phabricator (Upstream) board.
chasemp removed chasemp as the assignee of this task.Mar 9 2015, 3:17 PM
Restricted Application added a subscriber: scfc. · View Herald TranscriptJun 16 2015, 12:51 PM
greg added a subscriber: greg.Dec 2 2015, 6:47 PM

So, just to summarize, what needs to happen here now for us (WMF) going forward? is this blocking anything (this task is in the Gitblit-Deprecate project right now, presumably because people thought it blocked the use of Diffusion?). Are we just in a "wait for upstream to fix the root issue but we're ok for now going forward with creating/importing new repos" situation?

AFA what can be done here, we could wrap phd in our own init script or something. But for Gitblit-Deprecate I don't think this blocks?

Anyone knows if this is actually still an issue or if T128009 fixed this?
@akosiaris, @chasemp or anyone else?

Asking as this task hasn't seen an update since Dec 2015, and https://secure.phabricator.com/T7475 ("Reset umask explicitly in daemons") got resolved on Apr 15, 2016.

Click the example URIs given in T91648#1092283 do not trigger any issue here (note that ANZD does not exist anymore).

Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptMay 22 2016, 3:24 PM

Anyone knows if this is actually still an issue or if T128009 fixed this?
@akosiaris, @chasemp or anyone else?
Asking as this task hasn't seen an update since Dec 2015, and https://secure.phabricator.com/T7475 ("Reset umask explicitly in daemons") got resolved on Apr 15, 2016.
Click the example URIs given in T91648#1092283 do not trigger any issue here (note that ANZD does not exist anymore).

I am not familiar with the phabricator testing/deployment process in our infrastructure. From what you point out, it probably has been fixed. But some testing would not hurt. If we got testing grounds around, I can help test.

Danny_B moved this task from To Triage to Upgrades on the Phabricator board.Jul 9 2016, 8:17 PM
Aklapper closed this task as Resolved.Feb 22 2017, 7:47 PM

Two years later, I'll just assume this has been fixed by the upstream patch 10 months ago.

If not, feel free to reopen.