
debmonitor-client.service stays in failed state in case of server errors
Closed, Resolved (Public)

Description

We currently have systemd reporting failed units on a few hosts, namely an-worker1106, cp4030, and dns5002. The issue is due to debmonitor-client.service failing to post an update: the server answers with a 500 and the unit stays in the failed state. Simply restarting the service with systemctl restart debmonitor-client.service fixes things. We should probably:

  1. make the unit automatically retry on failure
  2. improve logging, at least by not including the whole HTML response
Apr 19 05:05:37 cp4030 debmonitor-client[12721]: ERROR:debmonitor:Failed to execute DebMonitor CLI: Failed to send the update to the DebMonitor server: 500 <!doctype html>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]: <html lang="en">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:   <head>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <title> - DebMonitor</title>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <meta charset="utf-8">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <link rel="icon" type="image/vnd.microsoft.icon" href="/static/icons/favicon.ico" sizes="16x16 32x32">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <link rel="stylesheet" href="/static/css/bootstrap-4.1.0.min.css" integrity="sha384-Ie8SKXi3tR8k3gY3MDTn3jLKnvMKULtJWgJKyr/+R4XLM1Y70OOmQH1u+C4lTVeu">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <link rel="stylesheet" href="/static/css/datatables-bs4_dt-1.10.16_fh-3.1.3.min.css" integrity="sha384-2wlnvqj+X8vJH8sXHZ97GUlX9cEaVKAlCyfJLKAOrlJeGB+Dm2kqEFzKGrQ8BiqL"/>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <link rel="stylesheet" href="/static/css/main-1.0.0.css" integrity="sha384-LhuZqXC/yqwnXyEG+KZWmMBLDxI4TWklVMe0b19aFeLV9nY3TCscTtT7E6biZIAR"/>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <noscript><style nonce="">table.hide-loading { display: table; }</style></noscript>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:   </head>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:   <body>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <nav class="navbar navbar-expand-lg navbar-dark bg-dark mb-3">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       <button class="navbar-toggler navbar-toggler-right" type="button" data-toggle="collapse" data-target="#navbarSupportedContent" aria-controls="navbarSupportedContent" aria-expanded="false" aria-label="Toggle navigation">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         <span class="navbar-toggler-icon"></span>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       </button>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       <a class="navbar-brand" href="/">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         <img src="/static/icons/debmonitor.svg" width="30" height="30" class="d-inline-block align-top" alt="">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         DebMonitor
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       </a>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       <div class="collapse navbar-collapse" id="navbarSupportedContent">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         <ul class="navbar-nav mr-auto">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <li class="nav-item">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:             <a class="nav-link" href="/hosts/">Hosts</a>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           </li>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <li class="nav-item">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:             <a class="nav-link" href="/kernels/">Kernels</a>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           </li>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <li class="nav-item">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:             <a class="nav-link" href="/packages/">Packages</a>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           </li>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <li class="nav-item">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:             <a class="nav-link" href="/source-packages/">Source Packages</a>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           </li>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <li class="nav-item">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:             <a class="nav-link" href="/images/">Images</a>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           </li>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         </ul>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         <form class="form-inline my-2 my-lg-0" action="/search" method="get">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <input name="q" pattern=".{,}" class="form-control mr-sm-2" type="search" placeholder="Search" aria-label="Search" title="At least  characters required." required>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <button class="btn btn-outline-success my-2 my-sm-0" type="submit">Search</button>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         </form>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       </div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     </nav>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <main role="main" class="container">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       <h2 class="mb-3"></small></h2>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       <div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         
Apr 19 05:05:37 cp4030 debmonitor-client[12721]: <h4 class="alert alert-danger" role="alert">Looks like something went wrong!</h4>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       </div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       <div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       </div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     </main>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <footer>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       <div class="container debmonitor-footer-container">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         <div class="row text-muted">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <div class="col-sm-4">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           </div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <div class="col-sm-4 text-center">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:             <a href="/">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:               <img src="/static/icons/debmonitor.svg" width="30" height="30" alt="DebMonitor">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:             </a>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           </div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           <div class="col-sm-4 text-right">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:             
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:             &copy; 2017 - 2021 Wikimedia Foundation, Inc.
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:           </div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         </div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:       </div>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     </footer>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <script src="/static/js/jquery-3.3.1.slim.min.js" integrity="sha384-0CUP4PMyLco7vy5FVHeXN7iNiGXGshL29yW+6fBxn0IrDzeYVeH0d9LRH/k0/2Bc"></script>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <script src="/static/js/popper-1.14.3.min.js" integrity="sha384-a+wju1euElK2a6YoDwmr7oPq0SvtfwyRMcjzbQa3Nu75pGRN85w79FudvP9M8lHF"></script>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <script src="/static/js/bootstrap-4.1.0.min.js" integrity="sha384-SpHMgSw4L+pOHsqetoHcGuioJ2tAqu5NmpaFUqB3QkINtUezFBNKmxuXEVdiFIwM"></script>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <script src="/static/js/datatables-bs4_dt-1.10.16_fh-3.1.3.min.js" integrity="sha384-tSrtEp0dBw4LpBYZgKH+eo6RvZI63MrdUk4pSdTe7aZBIp9GdmW9kZWEhlnbwQW8"></script>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     <script nonce="">
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     $(function () {
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:         $('[data-toggle="tooltip"]').tooltip();
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     })
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:     </script>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]:   </body>
Apr 19 05:05:37 cp4030 debmonitor-client[12721]: </html>
Apr 19 05:05:37 cp4030 systemd[1]: debmonitor-client.service: Main process exited, code=exited, status=1/FAILURE
Apr 19 05:05:37 cp4030 systemd[1]: debmonitor-client.service: Failed with result 'exit-code'.
Apr 19 05:05:37 cp4030 systemd[1]: debmonitor-client.service: Consumed 1.321s CPU time.
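
For the second suggestion, one way to keep the whole HTML page out of the journal is to reduce the response body to a short single-line summary before logging it. The following is only a minimal sketch under that assumption, not the debmonitor client's actual code; `summarize_error_response` and `max_len` are hypothetical names introduced for illustration.

```python
import re


def summarize_error_response(status_code, body, max_len=200):
    """Return a one-line summary of an HTTP error response for logging.

    Hypothetical helper, not part of the debmonitor client.
    """
    # Drop <script> and <style> blocks entirely, then strip the remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", body)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Collapse runs of whitespace so the summary fits on one log line.
    text = " ".join(text.split())
    if len(text) > max_len:
        text = text[:max_len] + "..."
    return f"{status_code} {text}".strip()
```

Applied to a response like the one logged above, this would leave the status code plus the visible page text (including "Looks like something went wrong!") on a single line, truncated to `max_len` characters.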

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-04-19T08:35:19Z] <ema> restart debmonitor-client.service on cp4030, dns5002, an-worker1106 T280484

Volans added a subscriber: jbond.

Assigning to @jbond, who was working on this.

@ema thanks for the task. FWIW some retry logic has already been added to the debmonitor client (see https://gerrit.wikimedia.org/r/c/operations/software/debmonitor/+/679275) but has not been deployed yet.
There is also a pending change (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/679293 and related changes in the dependency chain) that should change this job so that it never fails the Icinga systemd check and sends an email instead.
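
The retry logic referenced above lives in the linked Gerrit change and is not reproduced here. As a generic illustration of the idea only, a client could wrap the update call in a loop that retries with an increasing delay. `call_with_retries` and all of its parameters are hypothetical names and defaults, not taken from the debmonitor code.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts are exhausted.

    Hypothetical helper; names and defaults are assumptions for illustration.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            # Re-raise the last failure so the caller still sees the error
            # and the process can exit non-zero.
            if attempt == attempts:
                raise
            sleep(delay)
            # Wait longer before each subsequent attempt.
            delay *= backoff
```

With a scheme like this, a transient server error is absorbed by a later attempt, while a persistent one still surfaces as a failure after the final try.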

Mentioned in SAL (#wikimedia-operations) [2021-04-19T14:53:03Z] <jbond42> update debmonitor-client - T280484