
Netbox: tracking of hardware errors / grouping servers in order/batches
Closed, Invalid · Public

Description

We had a DB master crash caused by a BBU failure [1], and it turned out that three of the six servers ordered of that type had BBU issues. Phab isn't well suited to detecting that kind of error pattern: one has to make the connection manually and dig through older Phab tasks, which can be very noisy.

I think Netbox would also be a good place to log what hardware maintenance had to be done to a server across its lifetime.

In addition, it would be useful to match servers to "order batches". We already have the model type (e.g. HP ProLiant DL380e Gen8), but it's very coarse; ultimately, the combination of hardware components in a server can cause different errors.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190923-s3_primary_db_master_crash

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 25 2019, 7:21 AM
herron triaged this task as Medium priority. · Sep 25 2019, 3:43 PM

In Netbox we can already filter devices by purchase date, support expiry date and procurement ticket. As far as I can tell, that should be enough to pinpoint the batch.
In this specific case, for example, picking the procurement ticket and the purchase date from Netbox for db1075, you could:

Those filters are available in the UI, on the right-hand side of the device list page ( https://netbox.wikimedia.org/dcim/devices/ ).
@MoritzMuehlenhoff let me know if you think anything is missing that would be useful.
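
For scripted lookups, the same filtering could probably be expressed as a query against the NetBox REST API, which exposes custom fields as `cf_<name>` query parameters. The following is only a rough sketch: the API base URL is inferred from the UI URL above, and the custom field names (`ticket`, `purchase_date`) as well as the helper `batch_filter_url` are hypothetical and would need to be checked against the actual Netbox configuration. It only builds the URL and performs no request.

```python
from urllib.parse import urlencode

# Assumption: REST API root inferred from the UI URL mentioned in this task.
NETBOX_API = "https://netbox.wikimedia.org/api/dcim/devices/"


def batch_filter_url(ticket=None, purchase_date=None, base=NETBOX_API,
                     ticket_field="ticket", date_field="purchase_date"):
    """Build a device-list URL filtered by procurement ticket and/or purchase date.

    The custom field names are hypothetical defaults; NetBox filters on
    custom fields through query parameters of the form cf_<field name>.
    """
    params = {}
    if ticket is not None:
        params[f"cf_{ticket_field}"] = ticket
    if purchase_date is not None:
        params[f"cf_{date_field}"] = purchase_date
    if not params:
        raise ValueError("give at least one of ticket or purchase_date")
    # urlencode escapes the values and joins the pairs with '&'.
    return f"{base}?{urlencode(params)}"
```

Calling `batch_filter_url(ticket=...)` with the procurement task ID of a server would then give a URL listing every device attached to that task, i.e. the whole batch.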

Indeed, and in fact the procurement task alone would be enough to identify the batch. Is that what you were looking for, @MoritzMuehlenhoff? How could we make this more visible?

Ack, searching by procurement task properly addresses the batching aspect; I missed that before.

MoritzMuehlenhoff closed this task as Invalid. · Nov 21 2019, 1:53 PM