
[tbs.harbor] Harbor core being down is not noticed by simple http request
Closed, Resolved | Public

Description

Currently the core service is down, but the UI still shows up.

Puppet does not seem to bring it up.

This task is to find a way to make sure that we either bring it up and alert if that fails, or at least create an alert that properly detects when it's down (see T325165).
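A check along these lines could look at Harbor's health API instead of the UI, since the portal can keep serving pages while core is down. This is only a minimal sketch: it assumes the Harbor v2 `/api/v2.0/health` endpoint and its `status`/`components` JSON fields, and the function names and the base URL in the usage note are hypothetical.

```python
from __future__ import annotations

import json
import urllib.request


def unhealthy_components(payload: dict) -> list[str]:
    """Return the names of components not reported as "healthy".

    Assumes the payload shape {"status": ..., "components": [{"name", "status"}]}.
    """
    return [
        component.get("name", "<unnamed>")
        for component in payload.get("components", [])
        if component.get("status") != "healthy"
    ]


def check_harbor(base_url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Query the (assumed) health endpoint and return (ok, message)."""
    url = base_url.rstrip("/") + "/api/v2.0/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError) as exc:
        # Connection errors, HTTP errors (e.g. a 502 from the proxy when core
        # is down) and invalid JSON all count as a failed check.
        return False, f"health API unreachable or invalid: {exc}"

    if not isinstance(payload, dict):
        return False, "unexpected health payload"

    bad = unhealthy_components(payload)
    if payload.get("status") != "healthy" or bad:
        return False, f"unhealthy components: {bad or 'unknown'}"
    return True, "all components healthy"
```

For example, `check_harbor("https://harbor.example.org")` (a placeholder URL) would return `(False, ...)` when the API does not answer, even if the UI page itself still loads.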

Related Objects

Status | Subtype | Assigned | Task
Resolved | LucasWerkmeister
Resolved | matmarex
Resolved | Legoktm
Resolved | Legoktm
In Progress | dcaro
Resolved | dcaro
Resolved | dcaro
Resolved | dcaro
Resolved | None
Resolved | dcaro
Resolved | dcaro
Resolved | Raymond_Ndibe
Resolved | Raymond_Ndibe
Resolved | Raymond_Ndibe
Resolved | dcaro

Event Timeline

dcaro changed the task status from Open to In Progress.
dcaro triaged this task as High priority.
dcaro moved this task from To refine to Doing on the User-dcaro board.

It turns out that puppet is starting it up on every run xd, but it eventually fails to connect to the trove DB and dies, and stays down until the next puppet run starts it again:

2023-02-28T14:38:39Z [ERROR] [/common/utils/utils.go:106]: failed to connect to tcp://ttg4ncgzifw.svc.trove.eqiad1.wikimedia.cloud:5432, retry after 2 seconds :dial tcp 172.16.5.95:5432: connect: connection refused

So yep, the problem was the DB being down. Once the DB was up, puppet started harbor again and it self-healed correctly. Closing this for now.
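To tell "harbor is down" apart from "its DB is unreachable", a plain TCP probe against the DB port is often enough. The following is a rough sketch, not what harbor itself does internally; the function name and the default retry values are assumptions (the 2-second delay just mirrors the log line above).

```python
import socket
import time


def wait_for_tcp(host: str, port: int, attempts: int = 5,
                 delay: float = 2.0, timeout: float = 3.0) -> bool:
    """Try to open a TCP connection, retrying a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            if attempt < attempts:
                time.sleep(delay)
    return False
```

Calling it with the DB hostname from the log and port 5432 would return `False` while the connection is being refused, which points at the DB rather than at harbor.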

Note that the DB was in trove, and had an old version of the oslo.messaging package.

To ssh to the trove VM follow:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Accessing_Trove_guest_VMs

Once there, there's a guest-agent service that runs in a virtualenv (see ps aux | grep guest-agent); you can source /path/to/venv/bin/activate and check the library versions from there.
I compared them with a working instance to see which versions were different.
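The version comparison can be scripted by saving the output of `pip freeze` from both virtualenvs and diffing the pinned versions. This is a small sketch under that assumption; the helper names are hypothetical, and it only handles plain `name==version` lines.

```python
from __future__ import annotations


def parse_freeze(text: str) -> dict[str, str]:
    """Parse `pip freeze` style output into {package: version}."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not a plain pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, version = line.partition("==")
        versions[name.strip().lower()] = version.strip()
    return versions


def diff_versions(broken: str, working: str) -> dict[str, tuple[str | None, str | None]]:
    """Return packages whose versions differ, as {name: (broken, working)}.

    A value of None means the package is missing on that side.
    """
    a, b = parse_freeze(broken), parse_freeze(working)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }
```

Feeding it the two saved outputs gives a short list of packages to look at first, such as oslo.messaging in this case.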