Delay repooling trending service after a restart
Closed, DeclinedPublic
Actions

Assigned To

None

Authored By

	• Pchelolo
	Jan 30 2017, 10:01 PM

Description

When the trending service is restarted upon deploy it takes some time for it to rebuild the history. It's an asynchronous process and we can't get notified when it's done due to the nature of the service, but we can wait for some time before retooling the service and moving to the next host, so that varnishes don't get (and cache) intermittent results.

Details

	Subject	Repo	Branch	Lines +/-
	Delay post-restart checks for 45 secs	mediawiki/services/trending-edits/deploy	master	+2 -1
	Delay the endpoints check for 28 seconds	mediawiki/services/trending-edits/deploy	master	+8 -4

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	Jdlrobson	T156680 Allow API consumer to express a timeframe in hours
Resolved	• mobrovac	T156411 Compute the trending articles over a period of 24h rather than 1h
Declined	None	T156687 Delay repooling trending service after a restart
Open	None	T159867 Add a delay configuration option to checks

Event Timeline

• Pchelolo created this task.Jan 30 2017, 10:01 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 30 2017, 10:01 PM

We need to establish if that would be possible with Scap3. I figure we could do a sleep 30 check script that is run before repooling the servers. @thcipriani does this sound like a viable work-around?

FYI, I disabled the endpoints check today for the very same reason. If we can delay this process after the service restart, we could kill two birds with one stone.

In T156687#2984052, @mobrovac wrote:

We need to establish if that would be possible with Scap3. I figure we could do a sleep 30 check script that is run before repooling the servers. @thcipriani does this sound like a viable work-around?

hrm. In looking through the code we're currently running checks per stage in concurently (with an arbitrary concurrency of 2). Using something like /bin/sh -c "sleep 30 && run-command" is hacky, but would work currently. Sleep could also be part of whatever script you use to depool.

The concurrency of checks seems arbitrary right now. Seems like there are a couple things we should/could do in scap:

Use configuration to define concurrency per stage/check
Allow the use of a delay for a particular check

I'm not sure if the ability specify that certain checks should run in serial makes the ability to delay a particular check superfluous.

In T156687#2987059, @thcipriani wrote:

hrm. In looking through the code we're currently running checks per stage in concurently (with an arbitrary concurrency of 2).

Euh? Aren't checks run in serial for each stage on the target(s)? If I have 5 checks for the promote stage, I would expect them to run in serial, one after the other, in the order specified in checks.yaml. Is that not the case?

Using something like /bin/sh -c "sleep 30 && run-command" is hacky, but would work currently. Sleep could also be part of whatever script you use to depool.

Yup, that's the hack I had in mind. Making it part of the script is not really an option because it's a script shared amongst all services, and adding a command-line argument (which doesn't really relate to the functionality of the script itself) smells like an outcome creating tech debt :P

The concurrency of checks seems arbitrary right now. Seems like there are a couple things we should/could do in scap:

Use configuration to define concurrency per stage/check

Allow the use of a delay for a particular check

I'm not sure if the ability specify that certain checks should run in serial makes the ability to delay a particular check superfluous.

Option 2 would be awesome to have. Something like:

checks.yaml

checks:
  endpoints:
    type: nrpe
    stage: restart_service
    command: check_endpoints_<service>
    delay: 30
  depool:
    type: command
    stage: promote
    command: depool-<service>
  repool:
    type: command
    stage: restart_service
    command: pool-<service>
    delay: 30

In T156687#2988590, @mobrovac wrote:

Euh? Aren't checks run in serial for each stage on the target(s)? If I have 5 checks for the promote stage, I would expect them to run in serial, one after the other, in the order specified in checks.yaml. Is that not the case?

That checks run serially per-stage was my assumption as well before looking into the checks code. I see now that we run them 2 at a time. The number 2 is not tied to cores on the deployment server or any configuration afaict.

Ideally, we should be running them serially as that's the most intuitive thing, unless there is some indication in some config value (as yet to be made) that indicates whether a certain stage should run in parallel. A configuration value like: parallel_checks (and the corresponding [stage]_parallel_checks) seems appropriate.

The concurrency of checks seems arbitrary right now. Seems like there are a couple things we should/could do in scap:

Use configuration to define concurrency per stage/check

Allow the use of a delay for a particular check

I'm not sure if the ability specify that certain checks should run in serial makes the ability to delay a particular check superfluous.

Option 2 would be awesome to have. Something like:
checks.yaml
checks:
  endpoints:
    type: nrpe
    stage: restart_service
    command: check_endpoints_<service>
    delay: 30
  depool:
    type: command
    stage: promote
    command: depool-<service>
  repool:
    type: command
    stage: restart_service
    command: pool-<service>
    delay: 30

I think this is doable.

• mobrovac mentioned this in T156680: Allow API consumer to express a timeframe in hours.Feb 9 2017, 8:46 PM

Raising the priority since this is blocking the deployment of new features.

• mobrovac added a parent task: T156411: Compute the trending articles over a period of 24h rather than 1h.Feb 10 2017, 12:28 AM

thcipriani mentioned this in rMSCAa1ae82075ab3: Make checks serial by default.Feb 13 2017, 10:27 PM

@mobrovac what is the ETA on this? Just curious for planing

thcipriani created subtask T159867: Add a delay configuration option to checks.Mar 7 2017, 6:31 PM

In T156687#3071520, @Fjalapeno wrote:

@mobrovac what is the ETA on this? Just curious for planing

FWIW, the ability to use the work-around is now live in the current version of scap (that is, checks now run in serial).

I filed T159867: Add a delay configuration option to checks to track scap work in integrating the ability to add a delay in checks.yaml.

@thcipriani thanks for the update!

Change 341675 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/trending-edits/deploy] Delay the endpoints check for 28 seconds

https://gerrit.wikimedia.org/r/341675

Change 341675 merged by Mobrovac:
[mediawiki/services/trending-edits/deploy] Delay the endpoints check for 28 seconds

https://gerrit.wikimedia.org/r/341675

Deployed and tested, setting as stalled until T159867: Add a delay configuration option to checks is resolved so that we can remove the work-around.

• mobrovac mentioned this in rMSTEDf855460290cf: Delay the endpoints check for 28 seconds.Mar 8 2017, 12:15 AM

Change 342158 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/trending-edits/deploy] Delay post-restart checks for 45 secs

https://gerrit.wikimedia.org/r/342158

• mobrovac mentioned this in rMSTED16730685daa6: Delay post-restart checks for 45 secs.Mar 10 2017, 12:47 AM

Change 342158 merged by Ppchelko:
[mediawiki/services/trending-edits/deploy] Delay post-restart checks for 45 secs

https://gerrit.wikimedia.org/r/342158

Another, cleaner but more invasive option would be to delay listening on the service socket until startup has completed. This would avoid the need to guesstimate the normal startup time.

The trending service has been undeployed, so this task is not needed any more.

Delay repooling trending service after a restartClosed, DeclinedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Delay repooling trending service after a restart
Closed, DeclinedPublic
Actions

Related Objects
Search...