
Investigate duration of scap sync for 1.35.0-wmf.23 train deploy
Closed, Resolved · Public

Description

This took roughly 2h40m yesterday for T233871, which feels like a serious outlier. Anecdotally, times have been trending higher. Will flesh this ticket out as I investigate.

Event Timeline

Dug into logstash with @thcipriani a bit last Friday; the scap-cdb-rebuild duration was 91m 00s.

Time range for scap-cdb-rebuild events: 2020-03-10 19:39:00.000 - 2020-03-10 20:40:00.000.

Screenshot-2020-04-03-10:36:42.png (screenshot of the scap-cdb-rebuild events in logstash)

  • 2020-03-10 20:09:03 - All the rebuilds were done
  • 2020-03-10 20:39:06 - Started compile_wikiversions
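
A quick sanity check on the size of that gap, using the two timestamps above:

from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S'
rebuilds_done = datetime.strptime('2020-03-10 20:09:03', fmt)
wikiversions_start = datetime.strptime('2020-03-10 20:39:06', fmt)
print(wikiversions_start - rebuilds_done)  # 0:30:03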

So why the 30-minute gap?

I have memories of a "hit enter to make it go faster" situation here, as described by Antoine in a conversation since, and also in T223287: Investigate scap cluster_ssh idling until pressing ENTER repeatedly. At the time I thought it was just a display artifact, but it seems like something's actually hanging. OK, so what and why?

Tyler points at scap/main.py +679, _after_cluster_sync(). The relevant-seeming jumping-off point:

# Ask apaches to rebuild l10n CDB files
with log.Timer('scap-cdb-rebuild', self.get_stats()):
    rebuild_cdbs = ssh.Job(
        target_hosts,
        user=self.config['ssh_user'],
        key=self.get_keyholder_key())
    rebuild_cdbs.shuffle()
    rebuild_cdbs.command(
        'sudo -u mwdeploy -n -- %s cdb-rebuild' %
        self.get_script_path())
    rebuild_cdbs.progress(
        log.reporter(
            'scap-cdb-rebuild',
            self.config['fancy_progress']))
    succeeded, failed = rebuild_cdbs.run()
    if failed:
        self.get_logger().warning(
            '%d hosts had scap-cdb-rebuild errors', failed)
        self.soft_errors = True

# Update and sync wikiversions.php
succeeded, failed = tasks.sync_wikiversions(
    target_hosts, self.config, key=self.get_keyholder_key())
if failed:
    self.get_logger().warning(
        '%d hosts had sync_wikiversions errors', failed)
    self.soft_errors = True

Questions:

  • Were SSH connections still open?
  • Either scap cdb-rebuild was hanging on target machines...
    • We ssh into machines in batches of 80, which supports the idea that the parent process is hanging (see the sketch after this list)
    • 3x in the whole run we do something across all hosts
      • pull down code
      • l10n
      • update wikiversions
  • ...or it was hanging because something went wrong with the scap process on deploy1001.
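
To make the "batches of 80" point concrete, here is a rough, hypothetical sketch of the fan-out shape (not scap's actual ssh.Job implementation): the controlling process runs the command on one slice of hosts, waits for that whole slice to finish, then moves on. If the parent stalls in that loop, or while waiting on a child that never exits, every remaining host just sits idle.

import subprocess

BATCH_SIZE = 80  # matches the batch size mentioned above

def run_on_hosts(hosts, remote_cmd):
    """Run remote_cmd on each host, in batches; illustrative only."""
    failed = []
    for i in range(0, len(hosts), BATCH_SIZE):
        batch = hosts[i:i + BATCH_SIZE]
        procs = {
            host: subprocess.Popen(
                ['ssh', host, remote_cmd],
                stdin=subprocess.DEVNULL,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE)
            for host in batch
        }
        # The parent blocks here until every child in the batch exits; one
        # hung ssh (or a hung parent) stalls the rest of the run.
        for host, proc in procs.items():
            proc.communicate()
            if proc.returncode != 0:
                failed.append(host)
    return failed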

It would be interesting to see the child processes of scap at the time. That is, was scap sync waiting on children to close and the child processes were hanging (possibly waiting on input)? Or did scap somehow lose track of child processes? My bet is on the former, but some digging with strace during one of these incidents would probably say for certain.
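
Something along these lines could answer the child-process question the next time a sync stalls. This is a hypothetical helper, assuming psutil is available on the deploy host; the same information can be pulled by hand with ps and strace.

import psutil

def dump_scap_children(scap_pid):
    """Print the process tree under a running scap to see what it's waiting on."""
    parent = psutil.Process(scap_pid)
    for child in parent.children(recursive=True):
        print(child.pid, child.status(), ' '.join(child.cmdline()))
    # If ssh children are still listed here (typically in 'sleeping' state),
    # scap is almost certainly waiting on them; `strace -p <child pid>` from
    # another terminal would show whether they're blocked in read()/select().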

Further digging brings us to scap/cli.py +257 and invocations of TERM.reset_colors() and TERM.close().

Suspicion now centers on scap/terminal.py.

This is the last thing that scap cdb-rebuild does before exiting. This code is a library I'm not terribly familiar with, but it seems like the kind of thing that's hard to get right for every terminal emulator in the world.
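
I haven't dug into what TERM.reset_colors() and TERM.close() actually do, but as an assumption-laden sketch, terminal teardown of this sort usually amounts to writing a few escape sequences and restoring saved termios state, and the drain step is one plausible place to block if the tty is in a strange state. This is not scap/terminal.py's real code:

import sys
import termios

def close_terminal(saved_attrs):
    """Hypothetical teardown; saved_attrs is whatever termios.tcgetattr() captured at startup."""
    fd = sys.stdout.fileno()
    sys.stdout.write('\x1b[0m')    # reset colors/attributes
    sys.stdout.write('\x1b[?25h')  # make the cursor visible again
    sys.stdout.flush()
    # TCSADRAIN waits for pending output to be transmitted before restoring
    # the saved settings -- a blocking point if the terminal isn't draining.
    termios.tcsetattr(fd, termios.TCSADRAIN, saved_attrs)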

...

T223287#5200989 has some info showing sshd: mwdeploy@notty on the remote host, while deploy1001 has an ssh client with an established TCP connection. The culprit might be -oSetupTimeout=10 or scap/sh.py. Hitting Enter is the workaround, maybe because that sends a packet that lets the client terminate the session properly. Anyway, we should follow up on the original task, T223287.
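
As a hedged illustration of the knobs being discussed here (not what scap currently passes): standard OpenSSH client options can make the client notice a dead or stuck session on its own, and closing the child's stdin rules out the "waiting for terminal input" failure mode entirely. A hypothetical invocation:

import subprocess

def run_remote(host, remote_cmd):
    # ServerAlive* lets the client detect a dead session and exit by itself
    # instead of lingering until someone presses Enter; DEVNULL stdin means
    # the remote command can never sit waiting for input from the terminal.
    return subprocess.run(
        ['ssh',
         '-oBatchMode=yes',
         '-oConnectTimeout=10',
         '-oServerAliveInterval=15',
         '-oServerAliveCountMax=2',
         host, remote_cmd],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        timeout=600)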

brennen moved this task from Next to Done or Declined on the User-brennen board.

Closing this, as the main actionable thing here is to eventually solve T223287.