Page MenuHomePhabricator

Delay repooling trending service after a restart
Closed, DeclinedPublic

Description

When the trending service is restarted upon deploy it takes some time for it to rebuild the history. It's an asynchronous process and we can't get notified when it's done due to the nature of the service, but we can wait for some time before retooling the service and moving to the next host, so that varnishes don't get (and cache) intermittent results.

Event Timeline

mobrovac added subscribers: thcipriani, mobrovac.

We need to establish if that would be possible with Scap3. I figure we could do a sleep 30 check script that is run before repooling the servers. @thcipriani does this sound like a viable work-around?

FYI, I disabled the endpoints check today for the very same reason. If we can delay this process after the service restart, we could kill two birds with one stone.

We need to establish if that would be possible with Scap3. I figure we could do a sleep 30 check script that is run before repooling the servers. @thcipriani does this sound like a viable work-around?

hrm. In looking through the code we're currently running checks per stage in concurently (with an arbitrary concurrency of 2). Using something like /bin/sh -c "sleep 30 && run-command" is hacky, but would work currently. Sleep could also be part of whatever script you use to depool.

The concurrency of checks seems arbitrary right now. Seems like there are a couple things we should/could do in scap:

  1. Use configuration to define concurrency per stage/check
  2. Allow the use of a delay for a particular check

I'm not sure if the ability specify that certain checks should run in serial makes the ability to delay a particular check superfluous.

hrm. In looking through the code we're currently running checks per stage in concurently (with an arbitrary concurrency of 2).

Euh? Aren't checks run in serial for each stage on the target(s)? If I have 5 checks for the promote stage, I would expect them to run in serial, one after the other, in the order specified in checks.yaml. Is that not the case?

Using something like /bin/sh -c "sleep 30 && run-command" is hacky, but would work currently. Sleep could also be part of whatever script you use to depool.

Yup, that's the hack I had in mind. Making it part of the script is not really an option because it's a script shared amongst all services, and adding a command-line argument (which doesn't really relate to the functionality of the script itself) smells like an outcome creating tech debt :P

The concurrency of checks seems arbitrary right now. Seems like there are a couple things we should/could do in scap:

  1. Use configuration to define concurrency per stage/check
  2. Allow the use of a delay for a particular check

I'm not sure if the ability specify that certain checks should run in serial makes the ability to delay a particular check superfluous.

Option 2 would be awesome to have. Something like:

checks.yaml
checks:
  endpoints:
    type: nrpe
    stage: restart_service
    command: check_endpoints_<service>
    delay: 30
  depool:
    type: command
    stage: promote
    command: depool-<service>
  repool:
    type: command
    stage: restart_service
    command: pool-<service>
    delay: 30
thcipriani triaged this task as Medium priority.Feb 4 2017, 6:56 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.

Euh? Aren't checks run in serial for each stage on the target(s)? If I have 5 checks for the promote stage, I would expect them to run in serial, one after the other, in the order specified in checks.yaml. Is that not the case?

That checks run serially per-stage was my assumption as well before looking into the checks code. I see now that we run them 2 at a time. The number 2 is not tied to cores on the deployment server or any configuration afaict.

Ideally, we should be running them serially as that's the most intuitive thing, unless there is some indication in some config value (as yet to be made) that indicates whether a certain stage should run in parallel. A configuration value like: parallel_checks (and the corresponding [stage]_parallel_checks) seems appropriate.

The concurrency of checks seems arbitrary right now. Seems like there are a couple things we should/could do in scap:

  1. Use configuration to define concurrency per stage/check
  2. Allow the use of a delay for a particular check

I'm not sure if the ability specify that certain checks should run in serial makes the ability to delay a particular check superfluous.

Option 2 would be awesome to have. Something like:

checks.yaml
checks:
  endpoints:
    type: nrpe
    stage: restart_service
    command: check_endpoints_<service>
    delay: 30
  depool:
    type: command
    stage: promote
    command: depool-<service>
  repool:
    type: command
    stage: restart_service
    command: pool-<service>
    delay: 30

I think this is doable.

mobrovac raised the priority of this task from Medium to High.Feb 10 2017, 12:27 AM

Raising the priority since this is blocking the deployment of new features.

@mobrovac what is the ETA on this? Just curious for planing

@mobrovac what is the ETA on this? Just curious for planing

FWIW, the ability to use the work-around is now live in the current version of scap (that is, checks now run in serial).

I filed T159867: Add a delay configuration option to checks to track scap work in integrating the ability to add a delay in checks.yaml.

Change 341675 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/trending-edits/deploy] Delay the endpoints check for 28 seconds

https://gerrit.wikimedia.org/r/341675

Change 341675 merged by Mobrovac:
[mediawiki/services/trending-edits/deploy] Delay the endpoints check for 28 seconds

https://gerrit.wikimedia.org/r/341675

mobrovac changed the task status from Open to Stalled.Mar 7 2017, 11:57 PM

Deployed and tested, setting as stalled until T159867: Add a delay configuration option to checks is resolved so that we can remove the work-around.

Change 342158 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/trending-edits/deploy] Delay post-restart checks for 45 secs

https://gerrit.wikimedia.org/r/342158

Change 342158 merged by Ppchelko:
[mediawiki/services/trending-edits/deploy] Delay post-restart checks for 45 secs

https://gerrit.wikimedia.org/r/342158

Another, cleaner but more invasive option would be to delay listening on the service socket until startup has completed. This would avoid the need to guesstimate the normal startup time.

The trending service has been undeployed, so this task is not needed any more.