
Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments)
Closed, Declined · Public

Description

@yuvipanda emailed engineering list: "On Monday, 15th June 2015, we'll be merging https://gerrit.wikimedia.org/r/#/c/199936/, disabling insecure SSH Agent Forwarding for production access."

This breaks dsh-based restarts of the Parsoid service from bast1001.wikimedia.org. Previously we did this from tin, but dsh was disabled there, so we moved to bast1001. If dsh were available on tin, we could still use the proxy-command setup in our local ssh config to keep dsh working (I think).

In any case, if the above patch merges on Monday, our deployment workflow breaks. Please help us with a workaround or alternative solution, and/or delay merging that patch until this restart issue is resolved.

Related Objects

Event Timeline

ssastry created this task. · Jun 10 2015, 8:09 PM
ssastry raised the priority of this task from to Needs Triage.
ssastry updated the task description. (Show Details)
ssastry added subscribers: ssastry, yuvipanda.
Restricted Application added a subscriber: Aklapper. · Jun 10 2015, 8:09 PM
yuvipanda triaged this task as Unbreak Now! priority. · Jun 10 2015, 8:10 PM
bd808 added a subscriber: bd808. · Jun 10 2015, 11:10 PM

Can we let the mwdeploy user run the command on the parsoid server side? If so, the shared ssh agent (/run/keyholder/proxy.sock) that is used for the scap commands can be used to get from tin to the target servers.
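As a rough sketch of how that shared agent would be used from tin (the target host is one of the wtp servers from this task; the sudo rule allowing mwdeploy to restart parsoid is an assumption):

# Sketch: point ssh at the keyholder proxy socket instead of a forwarded agent
SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@wtp1001.eqiad.wmnet 'sudo service parsoid restart'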

ssastry lowered the priority of this task from Unbreak Now! to Needs Triage. · Jun 13 2015, 4:45 PM
ssastry moved this task from Backlog to Non-Parsoid Tasks on the Parsoid board.

Temporary fix:

  1. Have ProxyCommand set up properly in your local ~/.ssh/config (a sketch follows this list)
  2. Have a clone of operations/puppet.git
  3. cd to 'modules/dsh/files/group'
  4. Use pssh (or dsh?) to run commands: pssh -p 8 -O 'StrictHostKeyChecking=no' -h parsoid 'sudo service parsoid restart'
  5. Stop doing the above once the patch gets merged
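A minimal sketch of the ProxyCommand setup from step 1, assuming key-based access through the bastion (the username placeholder and exact host patterns are assumptions):

# ~/.ssh/config (sketch): route production hosts through bast1001
Host bast1001.wikimedia.org
    User <your-shell-user>
Host *.eqiad.wmnet
    User <your-shell-user>
    ProxyCommand ssh -W %h:%p bast1001.wikimedia.org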

@ssastry is this a workable workaround for your deploys atm?

On IRC, @yuvipanda said that the above patch has been merged. @cscott, @Arlolra, we should test 'git deploy service restart' as part of today's deploy.

We can still do a canary restart of a single node by logging into that server (via a proxycommand) and restarting parsoid there, so that part of the deploy workflow is unaffected.

FWIW, we have been using ansible from a local host (via proxycommand) with good success: https://github.com/gwicke/ansible-playground

We'd like to move this to a production deploy host to make sure that all deployers can use the same tools, but that will need rolling back some of the firewalling that was recently added, and ideally a more up-to-date deploy host. We'll also need a solution for per-group or per-user private keys to enable access to specific hosts. We probably don't want to give any mwdeploy user sudo rights on the cassandra cluster, for example.

faidon added a subscriber: faidon. · Jun 15 2015, 2:33 PM

> FWIW, we have been using ansible from a local host (via proxycommand) with good success: https://github.com/gwicke/ansible-playground
> We'd like to move this to a production deploy host to make sure that all deployers can use the same tools, but that will need rolling back some of the firewalling that was recently added, and ideally a more up-to-date deploy host. We'll also need a solution for per-group or per-user private keys to enable access to specific hosts. We probably don't want to give any mwdeploy user sudo rights on the cassandra cluster, for example.

Well, this workflow is also based on SSH, so there is no functional difference from dsh in this sense (if Ansible works, dsh will work and vice versa). In other words, this is completely off-topic for this task, sorry :)

Ottomata triaged this task as Normal priority. · Jun 15 2015, 2:46 PM
Ottomata set Security to None.

> In other words, this is completely off-topic for this task

As you note, both cases share the same underlying problem. I think it's useful to mention other use cases affected by this, as it can inform the search for better short- and longer-term solutions.

Attempt to deploy today:

cscott@tin:/srv/deployment/parsoid/deploy$ git deploy service restart
Error received from salt; raw output:

Failed to authenticate!  This is most likely because this user is not permitted to execute commands, but there is a small possibility that a disk error occurred (check disk/inode usage).
cscott@tin:/srv/deployment/parsoid/deploy$
akosiaris@tin:/srv/deployment/parsoid/deploy$ git deploy service restart
wtp2009.codfw.wmnet: True
wtp2020.codfw.wmnet: True
wtp2016.codfw.wmnet: No status available
wtp2003.codfw.wmnet: True
wtp2013.codfw.wmnet: True
wtp1004.eqiad.wmnet: True
wtp2010.codfw.wmnet: True
wtp1019.eqiad.wmnet: True
wtp1020.eqiad.wmnet: True
wtp1024.eqiad.wmnet: True
wtp1010.eqiad.wmnet: True
wtp1012.eqiad.wmnet: True
wtp1014.eqiad.wmnet: No status available
wtp1023.eqiad.wmnet: True
wtp2017.codfw.wmnet: True
wtp2006.codfw.wmnet: True
wtp1006.eqiad.wmnet: True
wtp2002.codfw.wmnet: True
wtp1017.eqiad.wmnet: True
wtp2012.codfw.wmnet: True
wtp2018.codfw.wmnet: True
wtp1002.eqiad.wmnet: True
wtp2011.codfw.wmnet: No status available
wtp2014.codfw.wmnet: True
wtp2001.codfw.wmnet: True
wtp2015.codfw.wmnet: True
wtp1005.eqiad.wmnet: True
wtp1013.eqiad.wmnet: True
wtp2004.codfw.wmnet: True
wtp1003.eqiad.wmnet: True
wtp1008.eqiad.wmnet: True
wtp1001.eqiad.wmnet: True
wtp1021.eqiad.wmnet: True
wtp1016.eqiad.wmnet: True
wtp2008.codfw.wmnet: No status available
wtp1007.eqiad.wmnet: True
wtp1015.eqiad.wmnet: True
wtp1022.eqiad.wmnet: True
wtp2019.codfw.wmnet: True
wtp1009.eqiad.wmnet: True
wtp2007.codfw.wmnet: True
wtp1018.eqiad.wmnet: True

So, I can't reproduce it with my account, but I can reproduce it with cscott's:

sudo -u cscott -i
cscott@tin:~$ cd /srv/deployment/parsoid/deploy/
cscott@tin:/srv/deployment/parsoid/deploy$ git deploy service restart
Error received from salt; raw output:

Failed to authenticate!  This is most likely because this user is not permitted to execute commands, but there is a small possibility that a disk error occurred (check disk/inode usage).

Working around the issue for now with:

for wtp in `ssh bast1001.wikimedia.org cat /etc/dsh/group/parsoid` ; do echo $wtp ; ssh $wtp sudo service parsoid restart ; done

from my localhost, with ssh proxying configured. (I had to add *.codfw.wmnet to my local ssh proxy config.)
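For reference, the *.codfw.wmnet addition is just one more stanza in the local ssh config, mirroring the existing eqiad one (the exact ProxyCommand used is an assumption):

# Added to ~/.ssh/config so the wtp2*.codfw.wmnet hosts also go through the bastion
Host *.codfw.wmnet
    ProxyCommand ssh -W %h:%p bast1001.wikimedia.org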

ArielGlenn added a subscriber: ArielGlenn.

claiming this ticket now that we're back in trebuchet-land

git deploy restart batches its job in groups of 10% of the total matching hosts. One batch ran fine to completion with parsoid restarting on those hosts; I don't see any indication of the other jobs even being queued up. This is referring to cscott's job run at 20:27 last night. I very much doubt that this is account-specific; cscott was able to run the other git deploy commands just before the git deploy restart with no problems. I'll look at the code again and do some more poking around.
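For reference, a hand-run equivalent from the salt master would look roughly like the following (the deployment_target grain value is an assumption about how Trebuchet targets the Parsoid hosts):

# Sketch: restart parsoid in batches of 10% of the matching minions
salt -b '10%' -G 'deployment_target:parsoid/deploy' service.restart parsoid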

> This is referring to cscott's job run at 20:27 last night. I very much doubt that this is account-specific; cscott was able to run the other git deploy commands just before the git deploy restart with no problems.

Today, I am doubting this as well. Sorry for chasing a red herring yesterday.

This is still broken:

$ git deploy service restart
Error received from salt; raw output:

Failed to authenticate!  This is most likely because this user is not permitted to execute commands, but there is a small possibility that a disk error occurred (check disk/inode usage).
cscott@tin:/srv/deployment/parsoid/deploy$ 

But git deploy service restart worked fine when I was doing my OCG deploys today. So it's not totally broken. Just broken for Parsoid.

> But git deploy service restart worked fine when I was doing my OCG deploys today. So it's not totally broken. Just broken for Parsoid.

Could be related to the fact that the parsoid upstart config waits 60 seconds for the jobs to complete before killing them. The other factor to consider is whether a stuck process (occasionally we have one of those on some node) interferes with this.
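For context, the 60-second wait mentioned above would come from something like the following stanza in the parsoid upstart job (a sketch; the real /etc/init/parsoid.conf may differ):

# upstart waits this long after SIGTERM before sending SIGKILL
kill timeout 60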

Oh, as soon as I typed that, I realized that this doesn't happen when ariel / akosiaris restart. So maybe it is not exactly that. But anyway, maybe that helps with the investigation.

From IRC, @ssastry confirmed that git deploy service restart doesn't work for him, either:

(04:13:47 PM) subbu: akosiaris, apergos git deploy restart for parsoid deploy failed for me as well /cc cscott 
(04:13:52 PM) subbu: will use the workaround documented in that ticket.
Restricted Application added a subscriber: Matanya. · Jun 29 2015, 8:27 PM

*bump* Any progress on this? We continue to use the shell script loop, but on occasion, because of transient internet connection flakiness, I've had DNS failures and have had to go manually restart parsoid on those failed nodes.
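In the meantime, a variant of that loop with a simple retry (a sketch only, not what we currently run) would paper over the transient failures:

# Sketch: retry each host a few times before giving up and moving on
for wtp in `ssh bast1001.wikimedia.org cat /etc/dsh/group/parsoid`; do
  echo $wtp
  for try in 1 2 3; do
    ssh $wtp sudo service parsoid restart && break
    sleep 5
  done
done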

@ArielGlenn, is there anything we can do about this? I am stumped, to be honest, given my lack of salt knowledge.

ArielGlenn moved this task from Backlog to Up Next on the Salt board.

I've been looking at this and seeing a couple of behaviors: one where I indeed get the 'Failed to authenticate!' warning, though rarely, and one where I get responses but not all hosts successfully report restarting. In the failed-authentication case, I see the error after the initial test.ping has gone out to collect the names of the hosts on which the batches will be run. After this error the deploy job itself is not run; when I dig into the results of the test.ping, I see that they have all come back correctly and that the find_job follow-up also got results back from all hosts in a short period of time (2-3 seconds), but the results aren't returned to the client. This may be master load unhappiness, as the error message suggests; I'd like to see what the behavior is once we move the master off to its own box, away from the puppet master. That move is pending.
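For anyone retracing these steps on the salt master, the checks above map roughly onto manual commands like the following (the grain used for targeting the Parsoid minions is an assumption):

# Sketch: verify the minions answer, then look for the stuck deploy job by its jid
salt -G 'deployment_target:parsoid/deploy' test.ping
salt -G 'deployment_target:parsoid/deploy' saltutil.find_job <jid>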

ArielGlenn moved this task from Up Next to active on the Salt board. · Oct 27 2015, 9:51 AM

This is indeed a failure between the client and the master; this behavior goes away when we are on neodymium without puppet. Making T115287 a blocking ticket for this one, which will likely be resolved at the same time.

Who can I coordinate with on this for testing, now that we are on the nice happy new salt master?

ArielGlenn moved this task from Blocked/Stalled to active on the Salt board. · Feb 3 2016, 3:09 PM

We have a deploy today, so I can give this a spin.

Great, let me know if you see this or any other errors so I can track them down. (Of course let me know if it's good news, too!)

ArielGlenn moved this task from active to Blocked/Stalled on the Salt board. · Feb 29 2016, 12:16 PM
ArielGlenn moved this task from Blocked/Stalled to active on the Salt board. · Mar 9 2016, 11:19 PM
ArielGlenn moved this task from active to testing needed on the Salt board.
ssastry closed this task as Declined. · Sep 6 2016, 6:05 PM

Parsoid has moved to scap3, so this is no longer an issue.