When the trending service is restarted upon deploy, it takes some time for it to rebuild the history. It's an asynchronous process and we can't get notified when it's done due to the nature of the service, but we can wait for some time before repooling the service and moving to the next host, so that varnishes don't get (and cache) intermittent results.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Jdlrobson | T156680 Allow API consumer to express a timeframe in hours
Resolved | | mobrovac | T156411 Compute the trending articles over a period of 24h rather than 1h
Declined | | None | T156687 Delay repooling trending service after a restart
Open | | None | T159867 Add a delay configuration option to checks
Event Timeline
We need to establish if that would be possible with Scap3. I figure we could do a sleep 30 check script that is run before repooling the servers. @thcipriani does this sound like a viable work-around?
FYI, I disabled the endpoints check today for the very same reason. If we can delay this process after the service restart, we could kill two birds with one stone.
hrm. Looking through the code, we're currently running the checks for each stage concurrently (with an arbitrary concurrency of 2). Using something like /bin/sh -c "sleep 30 && run-command" is hacky, but would work currently. Sleep could also be part of whatever script you use to depool.
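For illustration, the sleep-then-run hack can be written as a small reusable wrapper. This is only a sketch: the function name delayed_check is made up for this example and is not part of scap or of any shared pooling script.

```shell
#!/bin/sh
# Hypothetical helper (not part of scap): wait DELAY seconds, then run
# the real command and return its exit status.
# Usage: delayed_check DELAY COMMAND [ARGS...]
delayed_check() {
    delay="$1"
    shift
    # Give the freshly restarted service time to rebuild its state.
    sleep "$delay" || return 1
    # Run the actual check; its exit status becomes ours.
    "$@"
}
```

Calling delayed_check 30 run-command is then equivalent to /bin/sh -c "sleep 30 && run-command", with the delay kept outside the wrapped command itself.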
The concurrency of checks seems arbitrary right now. Seems like there are a couple things we should/could do in scap:
- Use configuration to define concurrency per stage/check
- Allow the use of a delay for a particular check
I'm not sure if the ability to specify that certain checks should run in serial makes the ability to delay a particular check superfluous.
Euh? Aren't checks run in serial for each stage on the target(s)? If I have 5 checks for the promote stage, I would expect them to run in serial, one after the other, in the order specified in checks.yaml. Is that not the case?
Using something like /bin/sh -c "sleep 30 && run-command" is hacky, but would work currently. Sleep could also be part of whatever script you use to depool.
Yup, that's the hack I had in mind. Making it part of the script is not really an option because it's a script shared amongst all services, and adding a command-line argument (which doesn't really relate to the functionality of the script itself) smells like an outcome creating tech debt :P
The concurrency of checks seems arbitrary right now. Seems like there are a couple things we should/could do in scap:
- Use configuration to define concurrency per stage/check
- Allow the use of a delay for a particular check
I'm not sure if the ability to specify that certain checks should run in serial makes the ability to delay a particular check superfluous.
Option 2 would be awesome to have. Something like:
checks:
  endpoints:
    type: nrpe
    stage: restart_service
    command: check_endpoints_<service>
    delay: 30
  depool:
    type: command
    stage: promote
    command: depool-<service>
  repool:
    type: command
    stage: restart_service
    command: pool-<service>
    delay: 30
That checks run serially per-stage was my assumption as well before looking into the checks code. I see now that we run them 2 at a time. The number 2 is not tied to cores on the deployment server or any configuration afaict.
Ideally, we should run them serially, as that's the most intuitive behaviour, unless a config value (yet to be added) says that the checks for a given stage may run in parallel. A configuration value like parallel_checks (and the corresponding [stage]_parallel_checks) seems appropriate.
Option 2 would be awesome to have. Something like:
checks.yaml
checks:
  endpoints:
    type: nrpe
    stage: restart_service
    command: check_endpoints_<service>
    delay: 30
  depool:
    type: command
    stage: promote
    command: depool-<service>
  repool:
    type: command
    stage: restart_service
    command: pool-<service>
    delay: 30
I think this is doable.
FWIW, the ability to use the work-around is now live in the current version of scap (that is, checks now run in serial).
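To make it clearer why serial execution matters for the work-around, here is a rough sketch of running checks one after the other. It is not scap's actual implementation: the function name run_checks_serially is hypothetical, and stopping at the first failing check is an assumption made for this example.

```shell
#!/bin/sh
# Hypothetical sketch (not scap code): run each argument as one check
# command, in the order given, and stop at the first failure.
run_checks_serially() {
    for check in "$@"; do
        # Each check is a command string, as it would appear in a
        # "command:" entry; a non-zero exit status aborts the run.
        sh -c "$check" || return 1
    done
    return 0
}
```

With this ordering guarantee, something like run_checks_serially 'sleep 30' 'run-command' only starts run-command once the sleep has finished. With two checks running at a time, the sleep would not hold back the check next to it.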
I filed T159867: Add a delay configuration option to checks to track scap work in integrating the ability to add a delay in checks.yaml.
Change 341675 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/trending-edits/deploy] Delay the endpoints check for 28 seconds
Change 341675 merged by Mobrovac:
[mediawiki/services/trending-edits/deploy] Delay the endpoints check for 28 seconds
Deployed and tested, setting as stalled until T159867: Add a delay configuration option to checks is resolved so that we can remove the work-around.
Change 342158 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/trending-edits/deploy] Delay post-restart checks for 45 secs
Change 342158 merged by Ppchelko:
[mediawiki/services/trending-edits/deploy] Delay post-restart checks for 45 secs
Another, cleaner but more invasive option would be to delay listening on the service socket until startup has completed. This would avoid the need to guesstimate the normal startup time.
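If the service only accepted connections once startup had completed, the check side could poll for readiness instead of sleeping for a fixed time. A minimal sketch of such a poll loop follows; the helper name wait_until is hypothetical, and it assumes that some probe command exists which fails until the service is ready.

```shell
#!/bin/sh
# Hypothetical helper: retry COMMAND every INTERVAL seconds until it
# succeeds, giving up once more than TIMEOUT seconds of waiting have
# accumulated. Time spent inside COMMAND itself is not counted.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
    timeout="$1"
    interval="$2"
    shift 2
    elapsed=0
    while ! "$@"; do
        elapsed=$((elapsed + interval))
        if [ "$elapsed" -gt "$timeout" ]; then
            return 1    # still not ready after the allowed wait
        fi
        sleep "$interval"
    done
    return 0
}
```

For example, wait_until 60 2 curl -fsS -o /dev/null "$SERVICE_URL" would retry every 2 seconds for up to a minute, where SERVICE_URL is a placeholder for whatever endpoint the service exposes. Unlike a fixed sleep, this returns as soon as the probe succeeds.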