Maniphest T320696

Reduce the count of Netbox devices with incorrect status
Open, Low, Public

Assigned To

None

Authored By

	• ayounsi
	Oct 13 2022, 8:19 AM

Description

This issue comes back regularly, especially through Netbox reports https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/

For example:

Device with failed status but reachable
Device is XXX in Netbox but is missing from PuppetDB (should be ('inventory', 'offline', 'planned', 'decommissioning', 'failed'))
Device is In PuppetDB but with status staged
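
A minimal sketch of the kind of cross-check behind these messages, assuming a hypothetical `check_device` helper that is handed the Netbox status and the PuppetDB presence directly (the real report queries both systems itself). Treating any status other than `active` as an error for a host present in PuppetDB is an assumption made for illustration, and the first example, which needs a reachability probe, is left out.

```python
# Statuses a device is expected to have when it is absent from PuppetDB,
# taken from the report message quoted above.
EXPECTED_WHEN_MISSING = ("inventory", "offline", "planned", "decommissioning", "failed")


def check_device(name, netbox_status, in_puppetdb):
    """Return a report message for an inconsistent device, or None if it is coherent.

    Hypothetical helper, not the code of the actual Netbox report.
    """
    if not in_puppetdb and netbox_status not in EXPECTED_WHEN_MISSING:
        return (
            f"{name}: Device is {netbox_status} in Netbox but is missing "
            f"from PuppetDB (should be {EXPECTED_WHEN_MISSING})"
        )
    # Assumption: a host known to PuppetDB is expected to be active.
    if in_puppetdb and netbox_status != "active":
        return f"{name}: Device is in PuppetDB but with status {netbox_status}"
    return None
```

Keeping the check a pure function of (status, presence) makes it easy to test without access to Netbox or PuppetDB.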

There are two main reasons why this happens.
The main one is the necessity of manual status changes, as defined in https://wikitech.wikimedia.org/wiki/Server_Lifecycle (e.g. "The service owner changes Netbox's to ACTIVE.")

This doesn't work, as people forget to make such changes. Most of those oversights get caught by the Netbox reports, but there is an asymmetry between who looks at the reports and who causes the error (e.g. someone who doesn't think about changing the status won't think about checking the report either). This is exacerbated by the report being virtually always triggered (and thus not triggering an alert on IRC, such alerts being mostly ignored as well).

The ideal fix is to abstract all those status changes through automation but it's not straightforward as some of those states are subjective and depend on the service owners (eg. active vs. staged).

As a first step I suggest that we identify on https://wikitech.wikimedia.org/wiki/File:Server_Lifecycle_Statuses.png which transitions are manual vs. automated.

Then as defined in Netbox (and source of truth) principles
"All data manually entered will go stale" -> "Refrain from adding data that will not drive the infrastructure"
Currently the status is mostly informative; we could make it more compelling by driving production from it. For example, if a server has a FAILED status, use Ferm to block all ports except SSH (just a suggestion, other ideas welcome). Or, if a server doesn't have an ACTIVE status, don't allow it to be pooled by Pybal.
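
To make the two suggestions concrete, here is a minimal sketch of status-driven decisions. Both function names are hypothetical, SSH is assumed to listen on port 22, and nothing here reflects how Ferm or Pybal are actually configured.

```python
SSH_PORT = 22  # assumption: SSH listens on the default port


def allowed_ports(netbox_status, service_ports):
    """Return the set of ports a host may expose given its Netbox status."""
    if netbox_status.upper() == "FAILED":
        # A FAILED host keeps only SSH open so it can still be debugged.
        return {SSH_PORT}
    return set(service_ports) | {SSH_PORT}


def may_be_pooled(netbox_status):
    """Only hosts with an ACTIVE status are allowed to receive traffic."""
    return netbox_status.upper() == "ACTIVE"
```

For example, `allowed_ports("failed", [80, 443])` yields only the SSH port, while `may_be_pooled("planned")` is false.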

Last, if the previous 2 points are not possible (or in addition to them) we should improve alerting and user notifications.
One idea is to use the new export from T229397: Puppet: get data (row, rack, site, and other information) from Netbox to add a loud/clear MOTD when the server is not in ACTIVE state. Or have a per server alert in AlertManager instead of the current global report alert.

The other reason I identified is servers "offline" for long enough that they are being evicted from PuppetDB (and becoming ghost hosts on the network). For example with T306835#8211485.

A possibility here is to have the re-image cookbook automatically set the host status to FAILED if the re-image failed and the status was ACTIVE/STAGED
And then automatically set the status to STAGED if the re-image finally works (from a previous FAILED status).
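
The proposed reimage rule can be sketched as a small pure function. `status_after_reimage` is a hypothetical name, not part of the existing cookbook, and any case the proposal does not mention is assumed to leave the status unchanged.

```python
def status_after_reimage(previous_status, reimage_succeeded):
    """Return the Netbox status a host should have once a reimage attempt ends."""
    previous = previous_status.upper()
    if not reimage_succeeded and previous in ("ACTIVE", "STAGED"):
        # A failed reimage of a host that was in use marks it as FAILED.
        return "FAILED"
    if reimage_succeeded and previous == "FAILED":
        # A later successful reimage brings a FAILED host back to STAGED.
        return "STAGED"
    # Assumption: every other combination keeps the current status.
    return previous
```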

Thoughts?

Details

Subject	Repo	Branch	Lines +/-
R:system::role: colour system role based on its name	operations/puppet	production	+7 -2
motd::script: update define to all interpreted strings	operations/puppet	production	+7 -4
P:netbox::host: create a motd for the status	operations/puppet	production	+10 -2
netbox: update allowed state transitions	operations/software/spicerack	master	+7 -8
sre.hosts.reimage: set Netbox to active	operations/cookbooks	master	+4 -3
doc: removed STAGED status from Netbox diagram	operations/software/netbox-deploy	master	+4 -6
Netbox statuses: no more servers in staged	operations/software/netbox-extras	master	+15 -4
P:netbox::host: create a motd for the status	operations/puppet	production	+11 -2

Related Objects

Mentioned In: T347375: Netbox device location information not available on the first Puppet run of a device
rCCKB5627da180e73: sre.hosts.reimage: set Netbox to active
T314303: Q1:rack/setup/install ganeti103[34]
rOSNEad330acaa581: Netbox statuses: no more servers in staged
T322642: Expose servers production status
Mentioned Here: T310594: Netbox: investigate custom status
T229397: Puppet: get data (row, rack, site, and other information) from Netbox
T306835: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet

Event Timeline

• ayounsi created this task. Oct 13 2022, 8:19 AM

Restricted Application added a project: Infrastructure-Foundations. Oct 13 2022, 8:19 AM

Restricted Application added a subscriber: Aklapper.

• ayounsi updated the task description. Oct 13 2022, 8:23 AM

As a first step I suggest that we identify on https://wikitech.wikimedia.org/wiki/File:Server_Lifecycle_Statuses.png which transitions are manual vs. automated.

I can easily do that

The MOTD idea is not bad, but take any medium-to-large cluster: most of the time people don't SSH to those hosts individually, at most to a random one. So it might not help that much.

A possibility here is to have the re-image cookbook automatically set the host status to FAILED if the re-image failed and the status was ACTIVE/STAGED
And then automatically set the status to STAGED if the re-image finally works (from a previous FAILED status).

I agree with the last part, I don't think the first part will be useful as usually failed hosts have some HW failure that is independent of any reimage operation.

I agree on the MOTD: it's not a significant change and it's the last of the 3 options in terms of priority/usefulness, but maybe also a low-hanging fruit.
Maybe a bit more modern (we can call it the "MOTD of 2022") would be the ability to add it in the Spicerack library. For example if we fetch a host from Netbox using any cookbook and its status is XXX display it.
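
A rough sketch of what that could look like, assuming a hypothetical `status_notice` helper that a cookbook would call with the status it has already fetched from Netbox. It does not use any real Spicerack API.

```python
def status_notice(hostname, netbox_status):
    """Return a warning line for a host that is not ACTIVE, or None otherwise."""
    status = netbox_status.upper()
    if status == "ACTIVE":
        return None
    return f"WARNING: {hostname} is {status} in Netbox, not ACTIVE"


# Example with a made-up hostname: print the notice only when there is one.
notice = status_notice("example1001", "failed")
if notice is not None:
    print(notice)
```

Unlike the MOTD, this would surface the status to whoever runs a cookbook against the host, without requiring an SSH login.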

I agree with the last part, I don't think the first part will be useful as usually failed hosts have some HW failure that is independent of any reimage operation.

It would not cover 100% of the use-cases, but at least cover T306835#8211485

I'm open to suggestions for other use-cases and transitions.

Change 842497 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] motd::script: update redfine to all interpreted strings

https://gerrit.wikimedia.org/r/842497

Change 842498 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:netbox::host: create a motd for the status

https://gerrit.wikimedia.org/r/842498

The main one is the necessity of Manual status changes, as defined in https://wikitech.wikimedia.org/wiki/Server_Lifecycle (Eg. "The service owner changes Netbox's to ACTIVE.")

this is probably too blunt a check but i wonder if we could infer this based on the puppet role. e.g. `if role not in ['insetup', 'spare::system']: status = 'ACTIVE'`
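
Spelled out as runnable Python, the check could look like the sketch below. `infer_status` is a hypothetical name, and leaving the status untouched for the two listed roles is an assumption, since the one-liner only covers the ACTIVE case.

```python
# Roles that do not imply a host is in use, taken from the one-liner above.
NON_ACTIVE_ROLES = frozenset({"insetup", "spare::system"})


def infer_status(role, current_status):
    """Infer the Netbox status of a host from its Puppet role."""
    if role not in NON_ACTIVE_ROLES:
        return "ACTIVE"
    # Assumption: hosts still in setup or kept as spares keep their status.
    return current_status
```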

Currently the status is mostly informative, we could make it more compelling by driving production from it. For example if a server has a FAILED status, use Ferm to block all ports except SSH (just a suggestion, other ideas welcome). Or if a server doesn't have an ACTIVE state, don't allow it to be pooled by Pybal.

I like both of these ideas

One idea is to use the new export from T229397: Puppet: get data (row, rack, site, and other information) from Netbox to add a loud/clear MOTD when the server is not in ACTIVE state. Or have a per server alert in AlertManager instead of the current global report alert.

perhaps adding some colour to the motd ... https://gerrit.wikimedia.org/r/c/operations/puppet/+/842498

this is probably too blunt a check but i wonder if we could infer this based on the puppet role. e.g. `if role not in ['insetup', 'spare::system']: status = 'ACTIVE'`

We first need to define what ACTIVE means.

`if role not in ['insetup', 'spare::system']: status = 'ACTIVE'` makes sense for DCops, but not from a service owner point of view (e.g. a server can be tested before being pooled into service). Which can be fine too: maybe we don't need to go into the "service owner" realm if they don't need this piece of data, and it would simplify the server's lifecycle.
Or maybe it would be useful to have an authoritative source to know if it's safe to mess with a server or if it's receiving user traffic.
In that case, another example would be to have the pool/depool command line tools to automatically update the Netbox status, but it would require a Netbox API key on most of the servers (or an abstraction endpoint).
We could also expand the conversation to SREs, each service could have a programmatic way of advertising if it's in production or not.

In T320696#8313511, @Volans wrote:

As a first step I suggest that we identify on https://wikitech.wikimedia.org/wiki/File:Server_Lifecycle_Statuses.png which transitions are manual vs. automated.

I can easily do that

This is now done, see https://wikitech.wikimedia.org/wiki/Server_Lifecycle#/media/File:Server_Lifecycle_Statuses.png
I've also committed the source code at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-deploy/%2B/refs/heads/master/doc/server_lifecycle.dot

`if role not in ['insetup', 'spare::system']: status = 'ACTIVE'` makes sense for DCops, but not from a service owner point of view (e.g. a server can be tested before being pooled into service). Which can be fine too: maybe we don't need to go into the "service owner" realm if they don't need this piece of data, and it would simplify the server's lifecycle.

I think that would be enough. A service being active is potentially a grey area (e.g. "puppet ready with a disabled flag", "failover but passive", "temporarily depooled for maintenance", "not a SPOF", ...). I think that is the right level for something automated/DCops/netops controlled. If you want, add the ability to make exceptions with a hiera key in case someone requires it, but otherwise I think automating that is good enough.

Further classification probably should be at puppet/service layer "this is a test host" "this is a development host" etc.

Maybe if people require a more fine-grained status, e.g. to know if a host can be rebooted, something standard should be agreed among service owners, but I'm not sure Netbox is the right location, given that I think people were reluctant to convert it into a very general source of truth.

In T320696#8314827, @jbond wrote:

The main one is the necessity of Manual status changes, as defined in https://wikitech.wikimedia.org/wiki/Server_Lifecycle (Eg. "The service owner changes Netbox's to ACTIVE.")

this is probably too blunt a check but i wonder if we could infer this based on the puppet role. e.g. `if role not in ['insetup', 'spare::system']: status = 'ACTIVE'`

+1 came here to chime in with this approach as well. From the perspective of server lifecycle I think it's pretty close to a sweet spot for automatically tracking if a host is being actively used for something.

In T320696#8316012, @ayounsi wrote:

this is probably too blunt a check but i wonder if we could infer this based on the puppet role. e.g. `if role not in ['insetup', 'spare::system']: status = 'ACTIVE'`

We first need to define what ACTIVE means.

`if role not in ['insetup', 'spare::system']: status = 'ACTIVE'` makes sense for DCops, but not from a service owner point of view (e.g. a server can be tested before being pooled into service). Which can be fine too: maybe we don't need to go into the "service owner" realm if they don't need this piece of data, and it would simplify the server's lifecycle.

I think as a first approximation the role-based status is good enough! And for sure better / more realistic than what we have now. Thank you for taking this on and kickstarting the discussion!

From the feedback here and in the email thread, the trend is to go with the simplest option and not track the service status, only the server status.

A bit bold, but at this point there is an opportunity to fully get rid of the STAGED status, as I'm not sure there is much value in updating the Netbox status of a server based on its Puppet role. Instead, consider that the insetup role means ACTIVE (from a physical server point of view).
Getting rid of a status helps reduce the number of possible transitions (see https://wikitech.wikimedia.org/wiki/Server_Lifecycle#/media/File:Server_Lifecycle_Statuses.png ) and makes the whole lifecycle much easier to automate.
@Volans @jbond what do you think?

That said, the STAGED status could also be re-purposed if there are any steps in a server lifecycle that could benefit from it (slightly similar to T310594). @wiki_willy would there be any use case for the STAGED status if we were to decommission it from its current usage?

Change 842498 merged by Jbond:

[operations/puppet@production] P:netbox::host: create a motd for the status

https://gerrit.wikimedia.org/r/842498

Change 849508 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/puppet@production] P:netbox::host: create a motd for the status

https://gerrit.wikimedia.org/r/849508

Change 849497 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] R:system::role: colour system role based on its name

https://gerrit.wikimedia.org/r/849497

A bit bold, but at this point there is an opportunity to fully get rid of the STAGED status, as I'm not sure there is much value in updating the Netbox status of a server based on its Puppet role.

I was going to merge the coloured motd change; however, considering this, I'm not sure it's needed now. We could instead just update system::role to do the colouring e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/849497

Summary of the meeting that I/F tooling and automation had with @Papaul today:

There are no objections from DC-Ops to the proposed solution of dismissing the STAGED status in Netbox and basically having the automation move a host directly from PLANNED to ACTIVE, meaning that the HW side of things is now active, the host is powered on and the service owner can use the server.
We could at that point evaluate if we might want to re-introduce the STAGED status for other HW provisioning transitions, but for now it doesn't seem it would add any benefit.

Let's wait a few more days in case there are any objections, and then we can proceed IMHO.

• ayounsi mentioned this in T322642: Expose servers production status. Nov 8 2022, 3:14 PM

Change 854961 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.reimage: set Netbox to active

https://gerrit.wikimedia.org/r/854961

Mentioned in SAL (#wikimedia-operations) [2022-11-09T10:02:09Z] <volans> set Netbox status to Active for 299 devices with role=server, tenant=none, status=staged - T320696

Netbox changelog is: https://netbox.wikimedia.org/extras/changelog/?request_id=1b14b4c3-9374-4d4b-b7af-291a0c62aa13

I've also manually updated frav1002, which has been there for a long time so it's clearly active. It was the only one left out of the previous mass-edit.
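
The selection behind the mass edit (role=server, tenant=none, status=staged) can be illustrated with plain dictionaries. This is a sketch only: the field names mirror the filter in the log line, the sample inventory is made up, and the function names are hypothetical rather than anything from Netbox or Spicerack.

```python
def select_staged_servers(devices):
    """Return the devices matching role=server, tenant=none, status=staged."""
    return [
        d for d in devices
        if d["role"] == "server" and d["tenant"] is None and d["status"] == "staged"
    ]


def promote_to_active(devices):
    """Set every selected device to active and return the names that changed."""
    selected = select_staged_servers(devices)
    for device in selected:
        device["status"] = "active"
    return [device["name"] for device in selected]


# Made-up inventory, for illustration only.
inventory = [
    {"name": "server1", "role": "server", "tenant": None, "status": "staged"},
    {"name": "server2", "role": "server", "tenant": None, "status": "active"},
    {"name": "switch1", "role": "switch", "tenant": None, "status": "staged"},
    {"name": "server3", "role": "server", "tenant": "example-tenant", "status": "staged"},
]
changed = promote_to_active(inventory)
```

Devices with a tenant or a non-server role are deliberately left alone, which matches the narrow scope of the filter.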

Change 854970 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/netbox-extras@master] Netbox statuses: no more servers in staged

https://gerrit.wikimedia.org/r/854970

Change 855026 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/netbox-deploy@master] doc: removed STAGED status from Netbox diagram

https://gerrit.wikimedia.org/r/855026

Change 854970 merged by jenkins-bot:

[operations/software/netbox-extras@master] Netbox statuses: no more servers in staged

https://gerrit.wikimedia.org/r/854970

Change 855026 merged by Volans:

[operations/software/netbox-deploy@master] doc: removed STAGED status from Netbox diagram

https://gerrit.wikimedia.org/r/855026

Volans mentioned this in rOSNEad330acaa581: Netbox statuses: no more servers in staged. Nov 9 2022, 4:45 PM

Change 854961 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reimage: set Netbox to active

https://gerrit.wikimedia.org/r/854961

With the above changes and patches this is now implemented.

Netbox:

no server device has the STAGED status
all reports/customscripts/tools have been updated accordingly
there is a new check in the coherence report to ensure that no server is in the STAGED status

Spicerack

The Netbox module's allowed transitions between states have been updated to reflect the new diagram

Cookbooks

The reimage cookbook will automatically set the Netbox status to ACTIVE if the previous status is PLANNED or FAILED.

Wikitech

The documentation for the Server Lifecycle has been updated accordingly (see the diff).
The diagram of the transitions has been updated.
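
The reimage behaviour described under Cookbooks reduces to the rule below. This is a sketch with a hypothetical function name, not the cookbook's code, and statuses other than PLANNED and FAILED are assumed to be left as they are.

```python
def status_after_successful_reimage(previous_status):
    """Return the Netbox status to set once a reimage has completed successfully."""
    previous = previous_status.upper()
    if previous in ("PLANNED", "FAILED"):
        # With STAGED gone, a freshly imaged host goes straight to ACTIVE.
        return "ACTIVE"
    # Assumption: any other status is not touched by the reimage.
    return previous
```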

jcrespo awarded a token. Nov 9 2022, 5:26 PM

Mentioned in SAL (#wikimedia-operations) [2022-11-09T17:37:32Z] <volans@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Converted existing STAGED hosts to ACTIVE - volans@cumin1001 - T320696"

Mentioned in SAL (#wikimedia-operations) [2022-11-09T17:40:13Z] <volans@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Converted existing STAGED hosts to ACTIVE - volans@cumin1001 - T320696"

Volans mentioned this in T314303: Q1:rack/setup/install ganeti103[34]. Nov 10 2022, 3:18 PM

Change 855610 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] netbox: update allowed state transitions

https://gerrit.wikimedia.org/r/855610

Change 855610 merged by jenkins-bot:

[operations/software/spicerack@master] netbox: update allowed state transitions

https://gerrit.wikimedia.org/r/855610

Change 849508 merged by Jbond:

[operations/puppet@production] P:netbox::host: create a motd for the status

https://gerrit.wikimedia.org/r/849508

Volans mentioned this in rCCKB5627da180e73: sre.hosts.reimage: set Netbox to active. Dec 14 2022, 3:31 PM

Volans mentioned this in T347375: Netbox device location information not available on the first Puppet run of a device. Sep 26 2023, 1:06 PM