Page MenuHomePhabricator

restbase deploys via scap lead to all hosts being disabled in conftool
Closed, ResolvedPublic

Description

On September 14th and September 4th, restbase deploys completed successfully via scap, but in the process of deploying they failed to repool any of the nodes, leading to a public-facing outage of restbase as no hosts were healthy. This issue wasn't noticed during the issue on the 4th as an attempted fix of running a manual loop of depool; sleep 60; service restbase restart; sleep 60; repool caused the hosts to be repooled but we didn't notice at the time. On the 14th it was observed that all hosts in restbase-https were in state disabled after a scap deploy.

Log context for both deploys can be seen in deploy1002:/srv/deployment/restbase/deploy/scap/log/ files scap-sync-2023-09-04-0001-1-ge8a6ae47.log and scap-sync-2023-08-01-0001-1-g26bc1a5b.log.

According to the scap logs, the process does get to the restart_service step, and the services themselves are seen to be restarted. However, Scap appears to get to the step Registering scripts in directory /srv/deployment/restbase/deploy-cache/revs/e8a6ae47618e560c6ed77c680cbcdfc786176de5/scap/scripts, run none of the scripts and then proceed to move on to the next node. This means that the host is not repooled before scap moves on to the next host.

Successful repooling from an old log:

{"name": "target.restbase1016.eqiad.wmnet.service_restart", "msg": "Restarting service 'restbase'", "args": [], "levelno": 20, "filename": "tasks.py", "exc_text": null, "lineno": 877, "funcName": "restart_service", "created": 1690885892.9757485, "msecs": 975.7485389709473, "relativeCreated": 1196.2718963623047, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885894.1073978, "msecs": 107.39779472351074, "relativeCreated": 2327.921152114868, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885897.1091292, "msecs": 109.12919044494629, "relativeCreated": 5329.652547836304, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885900.1126294, "msecs": 112.62941360473633, "relativeCreated": 8333.152770996094, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885903.1130977, "msecs": 113.0976676940918, "relativeCreated": 11333.62102508545, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885906.1165996, "msecs": 116.59955978393555, "relativeCreated": 14337.122917175293, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 921, "funcName": "check_port", "created": 1690885909.1172671, "msecs": 117.26713180541992, "relativeCreated": 17337.790489196777, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 up in 18.01s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 918, "funcName": "check_port", "created": 1690885912.1178045, "msecs": 117.80452728271484, "relativeCreated": 20338.327884674072, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Registering scripts in directory '/srv/deployment/restbase/deploy-cache/revs/8eb62f2c0db87d9aa2d8b62fab32d4010a077380/scap/scripts'", "args": [], "levelno": 10, "filename": "deploy.py", "exc_text": null, "lineno": 523, "funcName": "_execute_checks", "created": 1690885912.120181, "msecs": 120.18108367919922, "relativeCreated": 20340.704441070557, "host": "restbase1016
.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Executing check 'delay'", "args": [], "levelno": 20, "filename": "checks.py", "exc_text": null, "lineno": 105, "funcName": "execute", "created": 1690885912.1206133, "msecs": 120.61333656311035, "relativeCreated": 20341.136693954468, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Check 'delay' completed, output: ", "args": [], "levelno": 10, "filename": "checks.py", "exc_text": null, "lineno": 131, "funcName": "execute", "created": 1690885964.125759, "msecs": 125.75888633728027, "relativeCreated": 72346.28224372864, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Executing check 'endpoints'", "args": [], "levelno": 20, "filename": "checks.py", "exc_text": null, "lineno": 105, "funcName": "execute", "created": 1690885964.1269462, "msecs": 126.94621086120605, "relativeCreated": 72347.46956825256, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Check 'endpoints' completed, output: All endpoints are healthy\n", "args": [], "levelno": 10, "filename": "checks.py", "exc_text": null, "lineno": 131, "funcName": "execute", "created": 1690885965.7845993, "msecs": 784.5993041992188, "relativeCreated": 74005.12266159058, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Executing check 'repool'", "args": [], "levelno": 20, "filename": "checks.py", "exc_text": null, "lineno": 105, "funcName": "execute", "created": 1690885965.7851493, "msecs": 785.149335861206, "relativeCreated": 74005.67269325256, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Check 'repool' completed, output: 2023-08-01 10:32:46,120 [INFO] Pooling currently depooled services\n", "args": [], "levelno": 10, "filename": "checks.py", "exc_text": null, "lineno": 131, "funcName": "execute", "created": 1690885966.187623, "msecs": 187.6230239868164, "relativeCreated": 74408.14638137817, "host": "restbase1016.eqiad.wmnet"}
{"name": "deploy", "msg": "canary1 deploy successful", "args": [], "levelno": 20, "filename": "deploy.py", "exc_text": null, "lineno": 928, "funcName": "_execute_for_group", "created": 1690885966.2745767, "msecs": 274.57666397094727, "relativeCreated": 83899.03807640076, "message": "canary1 deploy successful", "asctime": "10:32:46", "host": "deploy1002"}

Failed/absent repooling from the latest log:

"name": "target.restbase1016.eqiad.wmnet.service_restart", "msg": "Restarting service 'restbase'", "args": [], "levelno": 20, "filename": "tasks.py", "exc_text": null, "lineno": 897, "funcName": "restart_service", "created": 1694690731.9299133, "msecs": 929.9132823944092, "relativeCreated": 1336.0555171966553, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690733.1923115, "msecs": 192.31152534484863, "relativeCreated": 2598.4537601470947, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690736.1951158, "msecs": 195.1158046722412, "relativeCreated": 5601.258039474487, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690739.1988988, "msecs": 198.8987922668457, "relativeCreated": 8605.041027069092, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690742.2025967, "msecs": 202.59666442871094, "relativeCreated": 11608.738899230957, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 not up. Waiting 3.00s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 951, "funcName": "check_port", "created": 1694690745.2044866, "msecs": 204.48660850524902, "relativeCreated": 14610.628843307495, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.port_check", "msg": "Port 7231 up in 15.01s", "args": [], "levelno": 10, "filename": "tasks.py", "exc_text": null, "lineno": 948, "funcName": "check_port", "created": 1694690748.2071803, "msecs": 207.18026161193848, "relativeCreated": 17613.322496414185, "host": "restbase1016.eqiad.wmnet"}
{"name": "target.restbase1016.eqiad.wmnet.checks", "msg": "Registering scripts in directory '/srv/deployment/restbase/deploy-cache/revs/e8a6ae47618e560c6ed77c680cbcdfc786176de5/scap/scripts'", "args": [], "levelno": 10, "filename": "deploy.py", "exc_text": null, "lineno": 528, "funcName": "_execute_checks", "created": 1694690748.211421, "msecs": 211.42101287841797, "relativeCreated": 17617.563247680664, "host": "restbase1016.eqiad.wmnet"}
{"name": "deploy", "msg": "canary1 deploy successful", "args": [], "levelno": 20, "filename": "deploy.py", "exc_text": null, "lineno": 934, "funcName": "_execute_for_group", "created": 1694690748.2692113, "msecs": 269.2112922668457, "relativeCreated": 26463.566780090332, "message": "canary1 deploy successful", "asctime": "11:25:48", "host": "deploy1002"}

This issue is fairly critical as we cannot safely deploy restbase and it has lead to two outages so far.

Event Timeline

hnowlan triaged this task as High priority.Sep 14 2023, 3:24 PM

@jnuche The last successful restbase deployment was before the 9e49f86 deploy: disable configured service_name` on secondary target hosts` commit was added to scap.

Really sorry about this issue, I have just deployed a fix to production.

@hnowlan, would it be possible for you to do another deployment to confirm the issue is solved? or does restbase need a specific window for deployments?

jnuche claimed this task.

While working on the solution, I reproduced the bug and tested the fix locally as thoroughly as possible.

Closing this tentatively. Please reopen if necessary.

We 'll schedule a scap deploy for RESTBase, thanks @jnuche

Mentioned in SAL (#wikimedia-operations) [2023-09-20T12:41:18Z] <akosiaris> T346354 deploy RESTBase after bug is fixed