
Investigate scap cluster_ssh idling until pressing ENTER repeatedly
Open, Low, Public

Description

Took more than an hour to do the cdb rebuild:

2019-05-14 - production
14:47 	<hashar@deploy1001> 	Finished scap: testwiki to 1.34.0-wmf.5 and rebuild l10n cache (duration: 62m 47s)

When running a preliminary scap sync to rebuild the l10n cache, I noticed scap output was stuck on cdb rebuilds at 80%. I kept pressing enter, which each time made the progress advance, and eventually that flushed whatever queue was stalled and the script progressed again:

2019-05-14T14:15:35 mwdebug1002 merge_cdb_updates Updated 417 CDB files(s) in /srv/mediawiki/php-1.34.0-wmf.5/cache/l10n

Half an hour without any log, then I guess I pressed ENTER and:

deploy1001
2019-05-14T14:47:23  Finished scap-cdb-rebuild (duration: 45m 10s)
2019-05-14T14:47:23 Started sync_wikiversions
2019-05-14T14:47:23 Compiled /srv/mediawiki-staging/wikiversions.json to /srv/mediawiki-staging/wikiversions.php

Event Timeline

hashar created this task. May 14 2019, 3:29 PM

The last messages were:

2019-05-14T15:14:48 mw1263 INFO sync_common: Copying from deploy1001.eqiad.wmnet to mw1263.eqiad.wmnet


The host eventually got rebooted afterwards, but that might be a red herring of some sort.

14:50 <moritzm> rebooting mw1263 for kernel update

hashar triaged this task as Low priority. May 14 2019, 3:45 PM

We will see next train / next week whether that still happens.

Reedy added a subscriber: Reedy. May 14 2019, 3:46 PM

I've had something like this before with scap when syncing to the servers: the progress updates seem to get stuck and the process doesn't finish until I sit there holding enter.

I always presumed I'd done something weird to upset my console...

Ah thank you Sam! So I am not alone ;]

From Tyler, next time I should try with scap --verbose sync "foobar" to get debug messages directly to the console and see what happens.

Reedy added a comment. May 14 2019, 4:04 PM

> Ah thank you Sam! So I am not alone ;]
>
> From Tyler, next time I should try with scap --verbose sync "foobar" to get debug messages directly to the console and see what happens.

From personal experience, it doesn't seem to happen every time, so it's hard to replicate on demand.

It's like some buffer gets stuck/upset.
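
For reference, the textbook way a buffer "gets stuck" with a child process looks like this minimal Python sketch (a generic illustration of the failure mode, not scap's code): the parent waits on the child without draining the pipe, the kernel pipe buffer fills, and both sides block until something drains it.

import subprocess

# The child writes ~1 MiB to stdout while the parent only wait()s.
# Once the ~64 KiB kernel pipe buffer fills, the child blocks in write()
# and the parent blocks in wait(): both sit there "idling doing nothing".
child = subprocess.Popen(
    ['python', '-c', 'import sys; sys.stdout.write("x" * (1 << 20))'],
    stdout=subprocess.PIPE)
child.wait()  # deadlocks; child.communicate() would drain the pipe and avoid it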

thcipriani moved this task from Needs triage to Debt on the Scap board. May 14 2019, 4:57 PM

Happened again. I eventually had a lot of processes idling and pressed enter several times. The output of ps is in my home directory on deploy1001.

Eventually I reached:

sync-apaches: 99% (ok: 266; fail: 0; left: 1)

The last process was:

28768 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet

lsof gives me:

ssh     28768 hashar    0u   CHR              136,5      0t0         8 /dev/pts/5
ssh     28768 hashar    1w  FIFO               0,10      0t0 779730380 pipe
ssh     28768 hashar    2w  FIFO               0,10      0t0 779730380 pipe
ssh     28768 hashar    3u  IPv4          779733818      0t0       TCP deploy1001.eqiad.wmnet:56952->mw1338.eqiad.wmnet:ssh (ESTABLISHED)
ssh     28768 hashar    4u  unix 0xffff9d66973e3800      0t0 779733819 type=STREAM
ssh     28768 hashar    5u   CHR              136,5      0t0         8 /dev/pts/5
ssh     28768 hashar    6w  FIFO               0,10      0t0 779730380 pipe
ssh     28768 hashar    7w  FIFO               0,10      0t0 779730380 pipe

On mw1338.eqiad.wmnet:

root     27283  0.0  0.0  69952  5576 ?        Ss   May13   0:10 /usr/sbin/sshd -D
root     29736  0.0  0.0  95180  6772 ?        Ss   13:56   0:00  \_ sshd: mwdeploy [priv]
mwdeploy 29765  0.0  0.0  95180  3736 ?        S    13:56   0:00  |   \_ sshd: mwdeploy@notty

So it is apparently idling, doing nothing. And pressing enter one last time unblocks the scap command.

I did a ps after pressing enter for each of the last few hosts:

left: 3
16471 pts/5    Sl+    0:16          \_ /usr/bin/python2 /usr/bin/scap sync testwiki to php-1.344.0-wmf.6 and rebuild l10n cache # T220731
28768 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
30521 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw2244.codfw.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
30522 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw2156.codfw.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
left: 2
16471 pts/5    Sl+    0:16          \_ /usr/bin/python2 /usr/bin/scap sync testwiki to php-1.344.0-wmf.6 and rebuild l10n cache # T220731
28768 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
30522 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw2156.codfw.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
22363 ?        S      0:00 sshd: hashar@pts/6
left: 1
16471 pts/5    Sl+    0:16          \_ /usr/bin/python2 /usr/bin/scap sync testwiki to php-1.344.0-wmf.6 and rebuild l10n cache # T220731
28768 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet

So somehow the ssh command fails to notice that the remote scap pull has completed :-(

Most probably related to the imported scap/sh.py, which should probably be removed: T222372

Mentioned in SAL (#wikimedia-operations) [2020-05-05T19:02:27Z] <brennen> train status: 1.35.0-wmf.31: presently pressing enter through scap-cdb-rebuild; at 8% (T249963, T223287)

mmodell added a subscriber: mmodell. May 5 2020, 7:18 PM

> Most probably related to the imported scap/sh.py, which should probably be removed: T222372

What makes you think it's sh.py? This worked before, and for some reason it has started failing. I don't really see any indication that sh.py is to blame.

It seems like this isn't just happening in cdb-rebuild but in all of the cases where scap does a cluster_ssh loop.

mmodell renamed this task from Investigate scap-cdb-rebuild idling until pressing ENTER repeatedly to Investigate scap cluster_ssh idling until pressing ENTER repeatedly. May 5 2020, 7:27 PM

I did some more reading of the scap code and it doesn't look like sh.py is in the code paths where this is happening, so I really don't think sh.py is the problem.

> It seems like this isn't just happening in cdb-rebuild but in all of the cases where scap does a cluster_ssh loop.

One interesting part of the pasted output above is:

> On mw1338.eqiad.wmnet:
>
> root     27283  0.0  0.0  69952  5576 ?        Ss   May13   0:10 /usr/sbin/sshd -D
> root     29736  0.0  0.0  95180  6772 ?        Ss   13:56   0:00  \_ sshd: mwdeploy [priv]
> mwdeploy 29765  0.0  0.0  95180  3736 ?        S    13:56   0:00  |   \_ sshd: mwdeploy@notty
>
> So it is apparently idling, doing nothing. And pressing enter one last time unblocks the scap command.

which makes it look like it's the ssh connection idling after the scap process on the target has already ended. That is, is there some ssh configuration setting that we need to tweak here?

@thcipriani yeah, I strongly suspect ssh rather than scap proper. One thing that might be related is this:

We pass these ssh_config options on the command line, apparently:

-oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null

The man page for ssh_config says this about BatchMode:

If set to yes, passphrase/password querying will be disabled. In addition, the ServerAliveInterval option will be set to 300 seconds by default (Debian-specific). This option is useful in scripts and other batch jobs where no user is present to supply the password, and where it is desirable to detect a broken network swiftly.

And apparently SetupTimeout is an alias for ServerAliveInterval.

Have we always been using these Debian-specific options?

> Have we always been using these Debian-specific options?

IdentitiesOnly is the newest of those. I think -oBatchMode=yes and -oSetupTimeout have been around as long as scap/ssh.py:
https://gerrit.wikimedia.org/r/#/c/mediawiki/tools/scap/+/113660/
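
If the connection itself is what's idling after the remote command finishes, one hedged experiment for a next occurrence would be to make the keepalive behaviour explicit instead of relying on the 300-second default that BatchMode implies. A minimal sketch of the adjusted flag list (illustrative values, not a tested scap change):

# ServerAliveInterval/ServerAliveCountMax are standard ssh_config options:
# probe the server every 10s and give up after 3 missed replies, so a dead
# or wedged connection is torn down in ~30s rather than idling indefinitely.
SSH_ARGS = [
    '/usr/bin/ssh',
    '-oBatchMode=yes',
    '-oServerAliveInterval=10',  # overrides the Debian BatchMode default of 300
    '-oServerAliveCountMax=3',
    '-oIdentitiesOnly=yes',
    '-F/dev/null',
]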

Put together a couple of test scripts:

#!/usr/bin/env bash

# Run a trivial remote command over ssh with exactly the flags scap uses,
# routing ssh-agent requests through keyholder's proxy socket.
SSH_AUTH_SOCK=/run/keyholder/proxy.sock /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -v -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap say 'hello'

#!/usr/bin/python

import os

from scap.ssh import (cluster_ssh, OutputHandler)

if __name__ == '__main__':
    # Same keyholder proxy socket as the shell version above.
    os.environ['SSH_AUTH_SOCK'] = '/run/keyholder/proxy.sock'
    # Drive a single host through scap's own cluster_ssh code path.
    for host, status, ohandle in cluster_ssh(hosts=['mw1338.eqiad.wmnet'], command='/usr/bin/scap say hello', user='mwdeploy', key='/etc/keyholder.d/mwdeploy.pub', output_handler=OutputHandler, verbose=True):
        print ohandle.output

And had @brennen run them since he had hit this problem previously. They ran fine. Either the problem is intermittent or it's triggered by something outside of this experiment.

brennen moved this task from Backlog to Watching on the User-brennen board. May 7 2020, 11:53 PM

@mmodell noted that he ran into this yesterday and was able to unstick the process by backgrounding and then foregrounding it (i.e., Ctrl-Z; fg). He suggested that there might be a buffer that's getting filled and not emptied somewhere -- or maybe that the tty is changing buffering modes somewhere (this stackoverflow question looks superficially similar: https://stackoverflow.com/questions/50010997/python-subprocess-popen-pipe-io-blocking-unexpectedly). We couldn't recreate this problem yesterday :\
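
For what it's worth, the pattern in that stackoverflow question boils down to mixing select() with a buffered file object: select() reports the fd readable, but a buffered read (e.g. proc.stdout.readline()) can try to consume more than is available and block anyway. A minimal sketch of draining such a pipe through the raw fd instead (a generic illustration, not scap's actual loop; host and command are placeholders):

import os
import select
import subprocess

proc = subprocess.Popen(
    ['ssh', 'somehost', 'some-command'],  # placeholders, not real targets
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
fd = proc.stdout.fileno()
while True:
    # Wait on the raw fd; don't read through the buffered proc.stdout object.
    ready, _, _ = select.select([fd], [], [], 300)
    if not ready:
        continue  # five minutes with no output: where a hang would show up
    chunk = os.read(fd, 65536)  # returns only the bytes actually available
    if not chunk:
        break  # EOF: the remote command exited and the pipe closed
proc.wait()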

Thanks @thcipriani, I intended to post that here. It definitely seems to have something to do with buffering of (one of) the pipes or the pty. It seems not to be in the select + epoll code or the ProgressReporter.

So the problem is most likely one of the following:

  • subprocess (Python lib code)
  • ssh/sshd config
  • sudo?
  • keyholder?
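
For the next occurrence, a small diagnostic along these lines might help narrow that list down: point it at the pid of a stuck ssh (like 28768 above) and it prints the kernel function the process is sleeping in plus where each of its fds points, straight from /proc. (A hedged sketch; the helper name and output format are mine, not an existing scap tool.)

import os
import sys

def inspect(pid):
    # The kernel function the process is blocked in: a pipe/socket wait and
    # a tty read would point at rather different suspects from the list above.
    with open('/proc/%d/wchan' % pid) as f:
        print('blocked in: %s' % f.read())
    # Where each open fd points: the pty, the pipes, and the TCP socket,
    # matching what lsof showed earlier.
    fd_dir = '/proc/%d/fd' % pid
    for fd in sorted(os.listdir(fd_dir), key=int):
        print('fd %s -> %s' % (fd, os.readlink(os.path.join(fd_dir, fd))))

if __name__ == '__main__':
    inspect(int(sys.argv[1]))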