Scap's php-fpm restart step "left" counter may be counter-intuitive
Closed, Resolved · Public

Description

This is roughly what the output looks like:

Frame 0
00:17:53 Finished Canaries Synced (duration: 00m 01s)
00:17:53 Started php-fpm-restarts
00:17:53 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 9 host(s)
php-fpm-restart:  33% (in-flight: 1; ok: 3; fail: 0; left: 5)
# multiple seconds pass..
php-fpm-restart:  66% (in-flight: 1; ok: 6; fail: 0; left: 2)
# multiple seconds pass..
php-fpm-restart:  77% (in-flight: 1; ok: 7; fail: 0; left: 1)
# multiple seconds pass..
php-fpm-restart:  88% (in-flight: 1; ok: 8; fail: 0; left: 0)
# multiple seconds pass..
Frame 1
00:18:32 Started php-fpm-restarts
00:18:32 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 307 host(s)
php-fpm-restart:   3% (in-flight: 13; ok: 4; fail: 0; left: 108)
Frame 2
00:18:32 Started php-fpm-restarts
00:18:32 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 307 host(s)
php-fpm-restart: 100% (in-flight: 0; ok: 4; fail: 0; left: 0)                   

php-fpm-restart:  23% (in-flight: 14; ok: 32; fail: 0; left: 88)                
php-fpm-restart:  36% (in-flight: 5; ok: 16; fail: 0; left: 23)                 
php-fpm-restart:  32% (in-flight: 14; ok: 43; fail: 0; left: 77)                
php-fpm-restart:  27% (in-flight: 13; ok: 34; fail: 0; left: 78)                
php-fpm-restart:  37% (in-flight: 14; ok: 50; fail: 0; left: 70)                
php-fpm-restart:  50% (in-flight: 5; ok: 22; fail: 0; left: 17)                 
php-fpm-restart:  32% (in-flight: 13; ok: 41; fail: 0; left: 71)                
php-fpm-restart:  34% (in-flight: 13; ok: 43; fail: 0; left: 69)                
php-fpm-restart:  37% (in-flight: 13; ok: 47; fail: 0; left: 65)                
php-fpm-restart:  42% (in-flight: 13; ok: 53; fail: 0; left: 59)                
php-fpm-restart:  43% (in-flight: 13; ok: 54; fail: 0; left: 58)                
php-fpm-restart:  45% (in-flight: 13; ok: 57; fail: 0; left: 55)                
php-fpm-restart:  48% (in-flight: 13; ok: 61; fail: 0; left: 51)                
php-fpm-restart:  52% (in-flight: 12; ok: 65; fail: 0; left: 48)
Frame 3
00:18:32 Started php-fpm-restarts
00:18:32 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 307 host(s)
php-fpm-restart: 100% (in-flight: 0; ok: 4; fail: 0; left: 0)                   
php-fpm-restart: 100% (in-flight: 0; ok: 44; fail: 0; left: 0)                  
php-fpm-restart: 100% (in-flight: 12; ok: 102; fail: 0; left: 11)
Frame 4
00:18:32 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 307 host(s)
php-fpm-restart: 100% (in-flight: 0; ok: 4; fail: 0; left: 0)                   
php-fpm-restart: 100% (in-flight: 0; ok: 44; fail: 0; left: 0)                  
php-fpm-restart: 100% (in-flight: 0; ok: 125; fail: 0; left: 0)                 
php-fpm-restart: 100% (in-flight: 0; ok: 134; fail: 0; left: 0)                 
00:21:03 Finished php-fpm-restarts (duration: 02m 30s)

Observations:

  • Frame 0: For the canary restarts, it appears to be doing only 1 at a time, whereas later we do 10 at a time (in-flight never goes above 1). Is this on purpose?
  • All frames: The magic number 9223372036854775807 has no clear purpose or meaning in its current form. I vaguely recall from T266055 that it is needed to make the restarts actually happen unconditionally.
  • Frame 2: The left counter and the percentage appear out of whack; both go up as well as down, which is rather confusing.
  • In addition to the counters within a single line going backwards, the fact that there are 4 separate php-fpm-restart groups is also somewhat confusing, since the operator doesn't know how many groups there will be or what each group is for. I'm guessing these are distinct DSH groups (appserver, api, canary, jobrunner+other, or something like that?).
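
On the magic number: 9223372036854775807 is 2^63 - 1, the largest value a signed 64-bit integer can hold. The sketch below checks that and illustrates why such a value could force a restart, assuming (this is not confirmed in this task) that the second argument to check-and-restart-php is a threshold that some measured value is compared against. The function `should_restart` and its parameter names are hypothetical.

```python
# 2**63 - 1 is the largest value a signed 64-bit integer can hold.
INT64_MAX = 2**63 - 1
assert INT64_MAX == 9223372036854775807


def should_restart(measured: int, threshold: int) -> bool:
    """Hypothetical check: restart when the measured value is below the threshold.

    This is an assumption about how the second argument is used, not the
    actual logic of check-and-restart-php.
    """
    return measured < threshold


# With the maximum threshold, any realistic measurement triggers a restart.
print(should_restart(measured=0, threshold=INT64_MAX))
print(should_restart(measured=10**12, threshold=INT64_MAX))
```

If that reading is right, a named constant or an explicit "always restart" option would make the intent clearer to the operator than the raw number.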

If these could be merged into one progress meter that'd be best, but the abstraction might not lend itself to that so easily. Perhaps an adequate solution to this bug could be to not go backward, and possibly label or number the distinct restart groups in some way :)
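
One way to picture this suggestion: keep per-group counters, print each line with a group label, and add a combined line whose completed count can only grow. The sketch below is a rough illustration, not scap's actual code; the `GroupProgress` class, the group names, and the labelled output format are hypothetical. The percentage and `left` formulas are inferred from the log excerpt (for example, 3 ok out of 9 hosts with 1 in flight is shown as 33% with left: 5).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroupProgress:
    """Counters for one restart group (hypothetical structure)."""
    name: str
    total: int
    in_flight: int = 0
    ok: int = 0
    fail: int = 0

    @property
    def done(self) -> int:
        return self.ok + self.fail

    @property
    def left(self) -> int:
        # Hosts that have not been started yet.
        return self.total - self.done - self.in_flight

    @property
    def percent(self) -> int:
        # Integer floor, matching e.g. 3 of 9 -> 33% in the log excerpt.
        return 100 * self.done // self.total if self.total else 100

    def line(self) -> str:
        return (f"php-fpm-restart [{self.name}]: {self.percent:3d}% "
                f"(in-flight: {self.in_flight}; ok: {self.ok}; "
                f"fail: {self.fail}; left: {self.left})")


def combined(groups: List[GroupProgress]) -> GroupProgress:
    """Sum all groups into a single meter."""
    return GroupProgress(
        name="all",
        total=sum(g.total for g in groups),
        in_flight=sum(g.in_flight for g in groups),
        ok=sum(g.ok for g in groups),
        fail=sum(g.fail for g in groups),
    )


# Group sizes 4, 44, 125 and 134 are read off the final "ok" counts in the
# log excerpt (they add up to 307); the group names are made up.
groups = [
    GroupProgress("group-1", total=4, ok=4),
    GroupProgress("group-2", total=44, in_flight=5, ok=16),
    GroupProgress("group-3", total=125, in_flight=13, ok=34),
    GroupProgress("group-4", total=134, in_flight=14, ok=32),
]
for g in groups:
    print(g.line())
print(combined(groups).line())
```

Because each group's ok and fail counts never decrease, the combined percentage cannot go backward, while the labels tell the operator which group a given line belongs to.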

Event Timeline

Change 804349 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Perform php-fpm restart as a single job

https://gerrit.wikimedia.org/r/804349

Frame 0: For the canary restarts, it appears to be doing only 1 at a time, whereas later we do 10 at a time (in-flight never goes above 1). Is this on purpose?

@Krinkle The php-fpm restart code is configured to allow 10% of the hosts in a group to restart at the same time. Since there are only 9 canaries, this works out to processing one host at a time. I don't know whether anything bad would happen if all canaries were allowed to restart at the same time, so I didn't change this area of the code in https://gerrit.wikimedia.org/r/804349
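
The arithmetic behind this can be sketched as follows. The helper `max_in_flight` is hypothetical, and the rounding rule is an assumption: rounding up happens to match the peak in-flight values visible in the log excerpt (1 of 9, 5 of 44, 13 of 125, 14 of 134), but the actual scap code may compute the limit differently.

```python
def max_in_flight(host_count: int, percent: int = 10) -> int:
    """Hypothetical helper: hosts allowed to restart concurrently.

    Rounds up using integer arithmetic (an assumption, not confirmed scap
    behavior) and never returns less than 1 so progress is always possible.
    """
    return max(1, (host_count * percent + 99) // 100)


# Group sizes taken from the log excerpt, plus the 307-host total.
for hosts in (9, 44, 125, 134, 307):
    print(hosts, max_in_flight(hosts))
```

Under the same assumption, treating all 307 hosts as one job would allow 31 concurrent restarts with no guarantee that they are spread evenly across groups.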

Change 804349 abandoned by Ahmon Dancy:

[mediawiki/tools/scap@master] Perform php-fpm restart as a single job

Reason:

no good. Groups need to have the 10% rule applied to them individually.

https://gerrit.wikimedia.org/r/804349