On September 14th and September 4th, restbase deploys completed successfully via scap, but in the process of deploying they failed to repool any of the nodes, leading to a public-facing outage of restbase as no hosts were healthy. This issue wasn't noticed during the issue on the 4th as an attempted fix of running a manual loop of depool; sleep 60; service restbase restart; sleep 60; repool caused the hosts to be repooled but we didn't notice at the time. On the 14th it was observed that all hosts in restbase-https were in state disabled after a scap deploy.
Log context for both deploys can be seen in deploy1002:/srv/deployment/restbase/deploy/scap/log/ files scap-sync-2023-09-04-0001-1-ge8a6ae47.log and scap-sync-2023-08-01-0001-1-g26bc1a5b.log.
According to the scap logs, the process does get to the restart_service step, and the services themselves are seen to be restarted. However, Scap appears to get to the step Registering scripts in directory /srv/deployment/restbase/deploy-cache/revs/e8a6ae47618e560c6ed77c680cbcdfc786176de5/scap/scripts, run none of the scripts and then proceed to move on to the next node. This means that the host is not repooled before scap moves on to the next host.
Successful repooling from an old log:
{"name": "target.restbase1016.eqiad.wmnet.service_restart", "msg": "Restarting service 'restbase'", "args": [], "levelno": 20, "filename": "tasks.py", "exc_text": null, "lineno": 877, "funcName": "restart_service", "created": 1690885892.9757485, "msecs": 975.7485389709473, "relativeCreated": 1196.2718963623047, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885894.1073978, "msecs": 107.39779472351074, "relativeCreated": 2327.921152114868, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885897.1091292, "msecs": 109.12919044494629, "relativeCreated": 5329.652547836304, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885900.1126294, "msecs": 112.62941360473633, "relativeCreated": 8333.152770996094, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885903.1130977, "msecs": 113.0976676940918, "relativeCreated": 11333.62102508545, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885906.1165996, "msecs": 116.59955978393555, "relativeCreated": 14337.122917175293, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885909.1172671, "msecs": 117.26713180541992, "relativeCreated": 17337.790489196777, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 up in 18.01s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 918, "funcName": "check_port", "created": 1690885912.1178045, "msecs": 117.80452728271484, "relativeCreated": 20338.327884674072, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Registering scripts in directory '/srv/deployment/restbase/deploy-cache/revs/8eb62f2c0db87d9aa2d8b62fab32d4010a077380/scap/scripts'", "args": [], "levelno": 10, "filename": "deploy.py", "exc_text": null, "lineno": 523, "funcName": "_execute_checks", "created": 1690885912.120181, "msecs": 120.18108367919922, "relativeCreated": 20340.704441070557, "host": "restbase1016 .eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Executing check 'delay'", "args": [], "levelno": 20, "filename": "checks.py", "exc_text": null, "lineno": 105, "funcName": "execute", "created": 1690885912.1206133, "msecs": 120.61333656311035, "relativeCreated": 20341.136693954468, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Check 'delay' completed, output: ", "args": [], "levelno": 10, "filename": "checks.py", "exc_text": null, "lineno": 131, "funcName": "execute", "created": 1690885964.125759, "msecs": 125.75888633728027, "relativeCreated": 72346.28224372864, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Executing check 'endpoints'", "args": [], "levelno": 20, "filename": "checks.py", "exc_text": null, "lineno": 105, "funcName": "execute", "created": 1690885964.1269462, "msecs": 126.94621086120605, "relativeCreated": 72347.46956825256, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Check 'endpoints' completed, output: All endpoints are healthy\n", "args": [], "levelno": 10, "filename": "checks.py", "exc_text": null, "lineno": 131, "funcName": "execute", "created": 1690885965.7845993, "msecs": 784.5993041992188, "relativeCreated": 74005.12266159058, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Executing check 'repool'", "args": [], "levelno": 20, "filename": "checks.py", "exc_text": null, "lineno": 105, "funcName": "execute", "created": 1690885965.7851493, "msecs": 785.149335861206, "relativeCreated": 74005.67269325256, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Check 'repool' completed, output: 2023-08-01 10:32:46,120 [INFO] Pooling currently depooled services\n", "args": [], "levelno": 10, "filename": "checks.py", "exc_text": null, "lineno": 131, "funcName": "execute", "created": 1690885966.187623, "msecs": 187.6230239868164, "relativeCreated": 74408.14638137817, "host": "restbase1016.eqiad.wmnet"} {"name": "deploy", "msg": "canary1 deploy successful", "args": [], "levelno": 20, "filename": "deploy.py", "exc_text": null, "lineno": 928, "funcName": "_execute_for_group", "created": 1690885966.2745767, "msecs": 274.57666397094727, "relativeCreated": 83899.03807640076, "message": "canary1 deploy successful", "asctime": "10:32:46", "host": "deploy1002"}
Failed/absent repooling from the latest log:
"name": "target.restbase1016.eqiad.wmnet.service_restart", "msg": "Restarting service 'restbase'", "args": [], "levelno": 20, "filename": "tasks.py", "exc_text": null, "lineno": 897, "funcName": "restart_service", "created": 1694690731.9299133, "msecs": 929.9132823944092, "relativeCreated": 1336.0555171966553, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690733.1923115, "msecs": 192.31152534484863, "relativeCreated": 2598.4537601470947, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690736.1951158, "msecs": 195.1158046722412, "relativeCreated": 5601.258039474487, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690739.1988988, "msecs": 198.8987922668457, "relativeCreated": 8605.041027069092, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690742.2025967, "msecs": 202.59666442871094, "relativeCreated": 11608.738899230957, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690745.2044866, "msecs": 204.48660850524902, "relativeCreated": 14610.628843307495, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 up in 15.01s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 948, "funcName": "check_port", "created": 1694690748.2071803, "msecs": 207.18026161193848, "relativeCreated": 17613.322496414185, "host": "restbase1016.eqiad.wmnet"} {"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Registering scripts in directory '/srv/deployment/restbase/deploy-cache/revs/e8a6ae47618e560c6ed77c680cbcdfc786176de5/scap/scripts'", "args": [], "levelno": 10, "filename": "deploy.py", "exc_text": null, "lineno": 528, "funcName": "_execute_checks", "created": 1694690748.211421, "msecs": 211.42101287841797, "relativeCreated": 17617.563247680664, "host": "restbase1016.eqiad.wmnet"} {"name": "deploy", "msg": "canary1 deploy successful", "args": [], "levelno": 20, "filename": "deploy.py", "exc_text": null, "lineno": 934, "funcName": "_execute_for_group", "created": 1694690748.2692113, "msecs": 269.2112922668457, "relativeCreated": 26463.566780090332, "message": "canary1 deploy successful", "asctime": "11:25:48", "host": "deploy1002"}
This issue is fairly critical as we cannot safely deploy restbase and it has lead to two outages so far.