
Investigate scap cluster_ssh idling until pressing ENTER repeatedly
Closed, Resolved (Public)

Description

Took more than an hour to do the cdb rebuild:

2019-05-14 - production
14:47 	<hashar@deploy1001> 	Finished scap: testwiki to 1.34.0-wmf.5 and rebuild l10n cache (duration: 62m 47s)

When running a preliminary scap sync to rebuild the l10n cache, I noticed the scap output was stuck on the cdb rebuilds at 80%. I eventually kept pressing ENTER; each press made the progress advance, and that eventually flushed whatever queue was stalled and the script progressed again:

2019-05-14T14:15:35 mwdebug1002 merge_cdb_updates Updated 417 CDB files(s) in /srv/mediawiki/php-1.34.0-wmf.5/cache/l10n

Half an hour without any log output, then I guess I pressed ENTER and:

deploy1001
2019-05-14T14:47:23  Finished scap-cdb-rebuild (duration: 45m 10s)
2019-05-14T14:47:23 Started sync_wikiversions
2019-05-14T14:47:23 Compiled /srv/mediawiki-staging/wikiversions.json to /srv/mediawiki-staging/wikiversions.php

Event Timeline

The last messages were:

2019-05-14T15:14:48mw1263
INFO
sync_common
Copying from deploy1001.eqiad.wmnet to mw1263.eqiad.wmnet

2019-05-14T15:14:48mw1263
INFO
sync_common
Copying from deploy1001.eqiad.wmnet to mw1263.eqiad.wmnet

That host eventually got rebooted afterwards. But that might be a red herring of some sort.

14:50 <moritzm> rebooting mw1263 for kernel update

We will see next train / week whether that still happens.

I've had something like this before with scap when syncing to the servers: the progress updates seem to get stuck and the process doesn't finish until I sit there holding enter.

I always presumed I'd done something weird to upset my console...

Ah thank you Sam! So I am not alone ;]

From Tyler, next time I should try with scap --verbose sync "foobar" to get debug messages directly to the console and see what happens.


From personal experience, it doesn't seem to happen every time, so it's hard to replicate on demand.

It's like some buffer gets stuck/upset

Happened again. I eventually had a lot of processes idling and pressed enter several times. The output of ps is in my home dir on deploy1001.

Eventually I reached:

sync-apaches: 99% (ok: 266; fail: 0; left: 1)

The last process was:

28768 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet

lsof gives me:

ssh     28768 hashar    0u   CHR              136,5      0t0         8 /dev/pts/5
ssh     28768 hashar    1w  FIFO               0,10      0t0 779730380 pipe
ssh     28768 hashar    2w  FIFO               0,10      0t0 779730380 pipe
ssh     28768 hashar    3u  IPv4          779733818      0t0       TCP deploy1001.eqiad.wmnet:56952->mw1338.eqiad.wmnet:ssh (ESTABLISHED)
ssh     28768 hashar    4u  unix 0xffff9d66973e3800      0t0 779733819 type=STREAM
ssh     28768 hashar    5u   CHR              136,5      0t0         8 /dev/pts/5
ssh     28768 hashar    6w  FIFO               0,10      0t0 779730380 pipe
ssh     28768 hashar    7w  FIFO               0,10      0t0 779730380 pipe

On mw1338.eqiad.wmnet:

root     27283  0.0  0.0  69952  5576 ?        Ss   May13   0:10 /usr/sbin/sshd -D
root     29736  0.0  0.0  95180  6772 ?        Ss   13:56   0:00  \_ sshd: mwdeploy [priv]
mwdeploy 29765  0.0  0.0  95180  3736 ?        S    13:56   0:00  |   \_ sshd: mwdeploy@notty

So it is apparently idling, doing nothing. And pressing enter one last time unblocks the scap command.

I did a ps after pressing enter for each of the last few hosts:

left: 3
16471 pts/5    Sl+    0:16          \_ /usr/bin/python2 /usr/bin/scap sync testwiki to php-1.344.0-wmf.6 and rebuild l10n cache # T220731
28768 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
30521 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw2244.codfw.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
30522 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw2156.codfw.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
left: 2
16471 pts/5    Sl+    0:16          \_ /usr/bin/python2 /usr/bin/scap sync testwiki to php-1.344.0-wmf.6 and rebuild l10n cache # T220731
28768 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
30522 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw2156.codfw.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet
22363 ?        S      0:00 sshd: hashar@pts/6
left: 1
16471 pts/5    Sl+    0:16          \_ /usr/bin/python2 /usr/bin/scap sync testwiki to php-1.344.0-wmf.6 and rebuild l10n cache # T220731
28768 ?        Ss     0:00              \_ /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap pull --no-update-l10n mw1268.eqiad.wmnet mw1314.eqiad.wmnet mw2255.codfw.wmnet mw2290.codfw.wmnet mw2216.codfw.wmnet mw1251.eqiad.wmnet mw2188.codfw.wmnet mw1320.eqiad.wmnet mw1285.eqiad.wmnet

So somehow, the ssh command fails to notice that the remote scap pull has completed :-(

Most probably related to the imported scap/sh.py, which should probably be removed: T222372

Mentioned in SAL (#wikimedia-operations) [2020-05-05T19:02:27Z] <brennen> train status: 1.35.0-wmf.31: presently pressing enter through scap-cdb-rebuild; at 8% (T249963, T223287)


What makes you think it's sh.py? This worked before and for some reason it's started failing. I don't really see any indication that sh.py is to blame.

It seems like this isn't just happening in cdb-rebuild but in all of the cases where scap does a cluster_ssh loop.

mmodell renamed this task from Investigate scap-cdb-rebuild idling until pressing ENTER repeatedly to Investigate scap cluster_ssh idling until pressing ENTER repeatedly. May 5 2020, 7:27 PM

I did some more reading of the scap code and it doesn't look like sh.py is in the code paths where this is happening, so I really don't think sh.py is the problem.


One interesting part of the pasted output above is

On mw1338.eqiad.wmnet:

root     27283  0.0  0.0  69952  5576 ?        Ss   May13   0:10 /usr/sbin/sshd -D
root     29736  0.0  0.0  95180  6772 ?        Ss   13:56   0:00  \_ sshd: mwdeploy [priv]
mwdeploy 29765  0.0  0.0  95180  3736 ?        S    13:56   0:00  |   \_ sshd: mwdeploy@notty

So it is apparently idling, doing nothing. And pressing enter one last time unblocks the scap command.

which makes it look like it's the ssh connection idling after the scap process on the target has already ended. That is, is there some ssh configuration setting that we need to tweak here?

@thcipriani yeah, I strongly suspect ssh over scap proper. One thing that might be related is this:

We pass these ssh_config options on the command line, apparently:

-oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null

the man page for ssh_config says this about BatchMode:

If set to yes, passphrase/password querying will be disabled. In addition, the ServerAliveInterval option
will be set to 300 seconds by default (Debian-specific). This option is useful in scripts and other batch
jobs where no user is present to supply the password, and where it is desirable to detect a broken network
swiftly.

And apparently SetupTimeout is an alias for ServerAliveInterval.

Have we always been using these Debian-specific options?


IdentitiesOnly is the newest of those. I think -oBatchMode=yes and -oSetupTimeout have been around as long as scap/ssh.py:
https://gerrit.wikimedia.org/r/#/c/mediawiki/tools/scap/+/113660/

Put together a couple of test scripts:

#!/usr/bin/env bash

SSH_AUTH_SOCK=/run/keyholder/proxy.sock /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -oIdentitiesOnly=yes -F/dev/null -v -oUser=mwdeploy -oIdentityFile=/etc/keyholder.d/mwdeploy.pub mw1338.eqiad.wmnet /usr/bin/scap say 'hello'
#!/usr/bin/python

import os

from scap.ssh import (cluster_ssh, OutputHandler)

if __name__ == '__main__':
    os.environ['SSH_AUTH_SOCK'] = '/run/keyholder/proxy.sock'
    for host, status, ohandle in cluster_ssh(hosts=['mw1338.eqiad.wmnet'], command='/usr/bin/scap say hello', user='mwdeploy', key='/etc/keyholder.d/mwdeploy.pub', output_handler=OutputHandler, verbose=True):
        print ohandle.output

And had @brennen run them since he had this problem previously. They ran fine. Either the problem is intermittent, or it is triggered by something outside of this experiment.

@mmodell noted that he ran into this yesterday and was able to unstick the process by backgrounding and unbackgrounding the process (i.e., Ctrl-Z; fg) and suggested that there might be a buffer that's getting filled and not emptied somewhere -- or maybe that the tty is changing buffering modes somewhere (this stackoverflow question looks superficially similar: https://stackoverflow.com/questions/50010997/python-subprocess-popen-pipe-io-blocking-unexpectedly). We couldn't recreate this problem yesterday :\
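To make that buffering theory concrete, here is a minimal sketch of the suspected failure mode. This is not scap code: the command and sizes are purely illustrative, and it only shows how a child process stalls once an undrained pipe buffer fills.

#!/usr/bin/env python3
# Sketch: a child writing more than the pipe buffer (~64 KiB on Linux) to an
# undrained stdout pipe blocks in write(), which from the outside looks exactly
# like an "idle" process.
import subprocess

child = subprocess.Popen(
    ["dd", "if=/dev/zero", "bs=1k", "count=1024"],  # ~1 MiB of output, far more than a pipe buffer
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

# child.wait()                  # deadlock-prone: the child stalls once the pipe is full
out, err = child.communicate()  # communicate() drains both pipes and avoids the stall
print(len(out))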

Thanks @thcipriani, I intended to post that here. It definitely seems to have something to do with buffering of (one of) the pipes or the pty. It seems to not be in the select + epoll code, or the ProgressReporter.

So the problem is most likely one of the following:

  • Subprocess (python lib code)
  • ssh/sshd config
  • sudo?
  • keyholder?

Played with this a bit just now. We don't have permission to lsof or strace on the remote, and the sshd process is all that's left hanging on the remote side. The process state is interruptible sleep: it's hanging, waiting on something. All this is to say: we can reproduce (just have brennen or antoine run scap sync-world), but we will probably need an opsen to help us with strace permissions to dig any further in production (which is the only place we can get this to happen so far).

Yeah it is messy :\

scap uses SetupTimeout while the ssh_config spelling is SetupTimeOut (upper case O in "Out"), but at least on Buster the case does not matter. If no data is received for SetupTimeout seconds, ssh sends a keepalive packet (type 80) and receives an acknowledgement (type 82). The exchanges can be seen at the debug3 log level, enabled via ssh -vvv.

Example output with microsecond timestamps:

ssh -oBatchMode=yes -oSetupTimeout=2 -vvv bastion.wmflabs.org sleep 10 2>&1|ts -s '%.s'
...
3.416860 debug2: exec request accepted on channel 0
5.418773 debug3: send packet: type 80
5.628245 debug3: receive packet: type 82
7.630684 debug3: send packet: type 80
7.839935 debug3: receive packet: type 82
9.842147 debug3: send packet: type 80
9.970037 debug3: receive packet: type 82
11.972525 debug3: send packet: type 80
12.141017 debug3: receive packet: type 82
13.498014 debug3: receive packet: type 96
13.498151 debug2: channel 0: rcvd eof
...

Maybe increasing ssh verbosity for some targets would give us some more evidence?


We might also have a logical glitch in our scap/ssh.py, such as in cluster_ssh.

A Python thread dump might come in handy as well; unfortunately I don't think Python 2.7 has that built in. In Python 3 one can set PYTHONFAULTHANDLER or pass -X faulthandler, then send SIGABRT for example.
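As an illustration (assuming a Python 3 scap; the trigger signal here is an arbitrary choice, not something scap does today), a non-fatal thread dump can be wired up like this:

#!/usr/bin/env python3
# Sketch: dump the tracebacks of all threads of a running process without killing it.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)  # on SIGUSR1, write all thread tracebacks to stderr

# ... long-running work (e.g. the cluster_ssh loop) ...
# From another shell:  kill -USR1 <pid>
# The tracebacks show where each thread is blocked, e.g. in select() or a pipe read.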

Mentioned in SAL (#wikimedia-operations) [2020-11-10T21:23:31Z] <brennen> testing some scap operations, modified to use ssh -n for debugging T223287

Mentioned in SAL (#wikimedia-operations) [2020-11-10T21:24:48Z] <brennen@deploy1001> sync-file aborted: Testing: README.md sync-file with ssh -n for T223287 (duration: 00m 37s)

The last fringe theory I have is that the progress bar writes to stderr with \r and buffering kicks in, which holds the actual write. If sys.stderr is opened in blocking mode, does the write() block until the buffer is filled? I don't really know; I could not reproduce it locally. Maybe running Python unbuffered (python -u) or adding an explicit flush would magically solve it:

scap/log.py
  class ProgressReporter(object):
...
      def _progress(self):
          if sys.stdout.isatty():
              fmt = "%-80s\r"
          else:
              fmt = "%-80s\n"
          self._fd.write(fmt % self._output())
          self._fd.flush()  # Explicit flush so the \r-terminated update is not held in the buffer

I also found out about FancyProgressReporter, which can be enabled via fancy_progress = true in the scap configuration. It uses some terminal codes, but I don't quite understand whether it explicitly flushes. Its _progress method has calls to TERM.clear_eol() and TERM.restore().flush(); the latter sounds like an explicit flush. Might be worth trying since that would be "just" a config change in our scap config.

When scap ran sync-apaches I shrank my terminal, which kind of broke the progress indicator and showed me each of the lines. At the end there are TWO lines with left: 0:

sync-apaches:  99% (ok: 344; fail: 0; left: 3)
sync-apaches:  99% (ok: 345; fail: 0; left: 2)
sync-apaches:  99% (ok: 346; fail: 0; left: 1)
sync-apaches: 100% (ok: 347; fail: 0; left: 0)
sync-apaches: 100% (ok: 347; fail: 0; left: 0)  # <--- extra left 0

So maybe it is an off by one error?


Not an error. The progress indicator is updated any time a sub-operation completes, and it is updated again when the finish method is called. Normally the indicator overwrites the same line of the terminal, but the code assumes your terminal is at least 80 characters wide, so there is probably unexpected advancement to a new line.
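A tiny standalone demo of that wrap behaviour (not scap code; the counts and label are made up):

#!/usr/bin/env python3
# Each update is padded to 80 columns and ends with \r; on a terminal narrower than
# 80 columns the padded line wraps, so every update lands on a fresh line instead of
# overwriting the previous one. The final duplicate comes from the extra finish() update.
import sys
import time

for left in (3, 2, 1, 0, 0):
    sys.stderr.write("%-80s\r" % ("sync-apaches: (left: %d)" % left))
    sys.stderr.flush()
    time.sleep(0.2)
sys.stderr.write("\n")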

hashar closed this task as Declined. Edited Nov 3 2021, 9:32 AM

After talking about it again with others, there is no proof that pressing ENTER actually makes the thing progress any faster :)

I'm not sure I ever said it made the process any faster... But it did at least make the output continue...

The deployment today has been exceptionally slow. We pressed ENTER repeatedly at 10:48 and 11:12. That seemed to unleash a bunch of Started rsync common messages, and I am thus reopening this task.

Here is the view for the Kibana scap dashboard with the filter message: Started rsync common:

scap-rsync-common.png (482×1 px, 37 KB)

It might be helpful to have the progress report show the number of processes currently running. I have no idea why pressing ENTER starts new tasks. Maybe some output is held until it receives a newline.

@hashar Interesting. Do you have a full transcript of the operation saved somewhere?

I'd like to know what scap version reports for you. As of today I find scap 4.4.2 on deploy1002. This version:

We ran from deploy1002 so that is scap 4.4.2. I got the Started rsync common messages from Logstash.

@jnuche was the train conductor this morning. We did it with the stage-train command, but I don't think we wrapped it with script or a similar terminal capture tool. Maybe stage-train does that for us and stores the session somewhere? Otherwise we don't have a replay.

Change 772487 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] ssh.py: Pass -n to ssh commands

https://gerrit.wikimedia.org/r/772487

Change 772487 merged by jenkins-bot:

[mediawiki/tools/scap@master] ssh.py: Pass -n to ssh commands

https://gerrit.wikimedia.org/r/772487
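For context, and this is my reading rather than anything stated in the patch: ssh -n redirects the client's stdin from /dev/null, so the many spawned ssh processes no longer share (and occasionally read from) scap's controlling terminal, which would explain why pressing ENTER fed them. A rough Python equivalent of the new invocation, with the host name and options copied from the logs above:

#!/usr/bin/env python3
# Sketch only: the explicit stdin=DEVNULL mirrors what ssh's -n flag does on the client side.
import subprocess

ssh_cmd = [
    "/usr/bin/ssh", "-n",                 # -n: do not read from stdin (redirect it from /dev/null)
    "-oBatchMode=yes", "-oSetupTimeout=10", "-oIdentitiesOnly=yes", "-F/dev/null",
    "-oUser=mwdeploy", "-oIdentityFile=/etc/keyholder.d/mwdeploy.pub",
    "mw1338.eqiad.wmnet",
    "/usr/bin/scap", "pull", "--no-update-l10n",
]
proc = subprocess.Popen(
    ssh_cmd,
    stdin=subprocess.DEVNULL,             # belt and braces: same effect as -n, done by the parent
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()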

brennen claimed this task.
brennen moved this task from Radar to Done or Declined on the User-brennen board.

[mediawiki/tools/scap@master] ssh.py: Pass -n to ssh commands

OH MY GOD IT WORKS.

full-steam-ahead

I had to press ENTER again when deploying 1.39.0-wmf.5 today (March 29th) during the sync-apaches phase. I ran stage-train under screen and pressed at:

08:29 UTC
08:54 UTC
09:00 UTC
09:05 UTC
09:13 UTC

The Logstash view for message: Started rsync common shows spikes of messages when I pressed enter:

scap_press_enter.png (173×867 px, 13 KB)

And at the end, pressing ENTER unblocked scap-cdb-rebuild.

So it looks like we still have the issue, at least with sync-apaches. What I have noticed is that its progress output does not emit any newline, while scap-cdb-rebuild does output a new progress line on each state change (the in-flight count changing).

I unfortunately did not run the train with script, nor with script --flush (which disables write buffering).

And another oddity was php-fpm-restarts taking 6 minutes:

11:56:47 Started php-fpm-restarts
11:56:47 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 347 host(s)
...
12:02:54 Finished php-fpm-restarts (duration: 06m 06s)

I pressed ENTER during that as well, but it did not speed things up.

@hashar Did you use the scap that's installed on the deploy server, or scap from an up-to-date checkout of mediawiki/tools/scap? If the former, that would explain why the problem still manifested. The fixed scap has not been deployed yet.

I used scap from the PATH. I missed the fact that Brennen used a local checkout last week :-] Marking this as resolved again, thank you!