
Track down and replace very old HW
Closed, Resolved · Public

Description

The list below matches the criteria:

  • Status: Active in Netbox
  • Replacement date (= 5 years from purchase for servers, 8 years for most other equipment) is before 2019-07-01.

Effectively, this means this is equipment that should have already been replaced in fiscal years FY15-16, FY16-17, FY17-18, and FY18-19.
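The selection criteria above can be sketched in a few lines. This is an illustration of the stated policy (5-year lifetime for servers, 8 years for most other equipment, cutoff 2019-07-01), not the actual report generator; the function names and the device-kind mapping are assumptions.

```python
# Sketch of the replacement-date policy from the task description.
# Lifetimes and the cutoff date come from the criteria above; everything
# else (names, sample data) is illustrative.
from datetime import date

LIFETIME_YEARS = {"server": 5, "other": 8}
CUTOFF = date(2019, 7, 1)

def replacement_date(purchase: date, kind: str = "server") -> date:
    years = LIFETIME_YEARS[kind]
    try:
        return purchase.replace(year=purchase.year + years)
    except ValueError:  # purchased on Feb 29 of a leap year
        return purchase.replace(year=purchase.year + years, month=2, day=28)

def is_overdue(purchase: date, kind: str = "server") -> bool:
    """True if the asset belongs on the list above."""
    return replacement_date(purchase, kind) < CUTOFF

# Example from the table: bismuth, a server purchased 2014-05-01
print(replacement_date(date(2014, 5, 1)))  # 2019-05-01
print(is_overdue(date(2014, 5, 1)))        # True
```

By this rule, cloudmetrics1001 (purchased 2015-02-05, discussed further down) falls just outside the list: its replacement date of 2020-02-05 is after the cutoff.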

The asks here are:

  • Track down existing decom/replacement tasks and link them here
  • If a replacement is underway (e.g. msw1-eqiad, scs-a8-eqiad), execute it with priority
  • Reach out to service owners to get those tasks unstuck if they exist or...
  • Reach out to service owners to ask them to start replacing their hardware ASAP and file procurement and/or decom tasks when tasks do not exist.

This is old to very old (2011-era!) hardware and is affecting our ability to plan and maintain. In some cases it may just be forgotten decoms (e.g. labsdb1002-array1) or wrong statuses in Netbox.

Due to the FY20-21 planning being already in motion, I'd like to ask for this to be fully fleshed out (replacements planned and underway, most if not all equipment on this list decommissioned) by mid-April at the latest. That way we can use FY19-20 budget for some of these if necessary. @wiki_willy, do you think that's feasible?

Hostname | Purchase date | Replacement date | Suggested refresh | Decom/replacement task | Notes
analytics1028-1041 | 2014-06-26 | 2019-06-26 | FY18-19 | T227485 | Currently used for Kerberos migration, decom at end of quarter, will be replaced by T242148
bismuth | 2014-05-01 | 2019-05-01 | FY18-19 | T248516 |
dbproxy1002-1003 | 2011-01-27 | 2016-01-27 | FY15-16 | T245384 | @Marostegui
dbproxy1007-1011 | 2011-01-27 | 2016-01-27 | FY15-16 | T228768, T245385 | @Marostegui
es2001-es2004 | 2011-10-31 | 2016-10-31 | FY16-17 | T222592 | @Marostegui
fmsw-c8-codfw | 2009-01-01 | 2017-01-01 | FY16-17 | T253154 | Upgrading in Q1, DC-Ops
francium | 2014-05-01 | 2019-05-01 | FY18-19 | T249903 | Decom blocked by T242009
ganeti1001-1004 | 2014-03-24 | 2019-03-24 | FY18-19 | T255553 | @akosiaris
helium | 2011-01-27 | 2016-01-27 | FY15-16 | T260717 | No refresh; already superseded by backup1001. @jcrespo/@akosiaris
heze | 2012-12-13 | 2017-12-13 | FY17-18 | T260717 | No refresh; superseded by backup2001. @jcrespo/@akosiaris
labsdb1002-array1 | 2013-02-27 | 2018-02-27 | FY17-18 | T146455 |
msw1-eqiad | 2010-11-23 | 2018-11-23 | FY18-19 | T261449 | Upgrading via T225121
msw-a1-eqiad - msw-a8-eqiad | 2011-02-18 | 2019-02-18 | FY18-19 | T259758 | Upgrading in Q2, DC-Ops
msw-b1-eqiad - msw-b8-eqiad | 2011-02-18 | 2019-02-18 | FY18-19 | T259758 | Upgrading in Q2, DC-Ops
msw-a1-codfw - msw-a8-codfw | 2009-01-01 | 2017-01-01 | FY16-17 | T259758 | Upgrading in Q2, DC-Ops
msw-d4-codfw | 2009-01-01 | 2017-01-01 | FY16-17 | T259758 | Upgrading in Q2, DC-Ops
oresrdb1001-1002 | 2014-05-01 | 2019-05-01 | FY18-19 | T254238, T254240 | No refresh. @akosiaris
samarium | 2013-01-22 | 2018-01-22 | FY17-18 | T197630 |
scb1001-scb1004 | 2013-01-11 | 2018-01-11 | FY17-18 | | No refresh; the cluster is to be decommissioned, services move to Kubernetes. @akosiaris
scs-a8-eqiad | 2011-02-02 | 2019-02-02 | FY18-19 | T228919 | DC-Ops upgrading via T228919
tungsten | 2011-01-27 | 2016-01-27 | FY15-16 | T260395 | Currently running XHGui, being migrated to Ganeti instances in T180761; xhgui* now on buster, just waiting for data migration from MongoDB. @Dzahn

Related Objects

Event Timeline

akosiaris added subscribers: jcrespo, akosiaris.

labstore-array0-codfw - labstore-array3-codfw mentions madhuvishy, who left the foundation some time ago. I'd defer to bd808/bstorm.

Thanks @akosiaris, I'll check with Bryan or Brooke on the labstore-array* ones. @Marostegui and @akosiaris - much appreciated for hopping on these so quickly. Thanks, Willy

tungsten is currently running XHGui, once https://phabricator.wikimedia.org/T180761 is resolved it can be decommissioned.

mw1221 through mw1226 (6 servers) have been shut down today.

I just discovered that cloudmetrics1001 is old (2015) and needs replacement: https://netbox.wikimedia.org/dcim/devices/182/

This is true, and we should totally do that :) Please file a procurement task to that effect to make sure this happens and we won't forget.

That said, this is separate from this task, and it's not an accident/typo that it's not part of the list above: cloudmetrics1001's purchase date is 2015-02-05, which puts its replacement date at 2020-02-05, i.e. this FY, FY19-20. The list above only includes assets with a replacement date < 2019-07-01 (see the task description above), i.e. from past FYs.

On labstore-array0-3, doing a bit of archaeology says that these were connected to labstore2001 and 2002 in codfw T93215: rack and connect labstore-array4-codfw in codfw.

Those servers were decommissioned on T243329: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet, but maybe the shelves were missed? On a quick check, it doesn't look like the arrays are actually in racks at this time in Netbox; none of them list a location. Those arrays are all superseded by cloudbackup2001/2 and their arrays.

Am I reading something wrong @wiki_willy? It doesn't look like there's anything to do on those arrays.

Hi @Bstorm - thanks for digging into that one yesterday. Papaul just confirmed that he removed the arrays along with labstore2001 and labstore2002, so I'm going to remove the arrays from the list on this task. Much appreciated for your help on them. Thanks, Willy

faidon added a subscriber: Jgreen.
Jgreen removed a subscriber: Jgreen.

added the decom task for bismuth

Checked with Alex last week on the remaining devices missing decom tasks; he said he'd try to get to them when/if possible. @akosiaris - feel free to update this task when things free up a bit for you. Thanks, Willy

Bump! What's the latest here?

@wiki_willy, what's the latest here? What's blocking us from having decom tasks for all of the items above?

@faidon - DMs left in IRC for owners of the remaining items

"helium" and "heze" say above they were already replaced but at the same time they are still in site.pp with the production role. @jcrespo Can they be fully decom'ed now?

I need to ask @akosiaris a question regarding buster; once it is answered, they can be decommissioned.

Change 621038 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] decom helium and heze

https://gerrit.wikimedia.org/r/621038

Dzahn updated the task description.

What's blocking us from having decom tasks for all of the items above?

Added one for helium and heze per above.

tungsten has been permanently shut down today. One jessie off the list.

Located samarium, removed it from the racks, and corrected the Netbox error.

I need to ask @akosiaris a question regarding buster; once it is answered, they can be decommissioned.

@jcrespo, how may I help?

Hi @akosiaris - there's one more set of hosts on the table missing a decom task for scb1001-scb1004. Can you provide an update on when these can be decommissioned? Thanks, Willy

Hi @wiki_willy. They can't yet, that's why. We are still migrating 2 services off them. I'll update this task when that changes.

Thanks @akosiaris

I need to ask @akosiaris a question regarding buster; once it is answered, they can be decommissioned.

@jcrespo, how may I help?

So all tasks (except cleanup) were accomplished in T238048 to ensure recovery was possible on the new servers. However, the archive repository, even using the old puppet keys active at the time (see the documented method used on wiki), doesn't allow decryption of the old backups. It is my understanding, based on the error messages, that the cipher/library used may not be compatible with the cipher available on buster (but I may be wrong). Could you try to confirm whether this is the case (I am guessing you were the person that ran the backups at the time)? What would be the right procedure if it is (the archival data would become unrecoverable)?


Error: openssl.c:78 TLS read/write failure.: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
Error: openssl.c:78 TLS read/write failure.: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
Fatal error: restore.c:473 Data record error. ERR=Resource temporarily unavailable

These ^ right? Those point out that the wrong key was used for the restoration. The archive is old enough (I see we haven't written anything into it in >5 years) that it's quite possible we require the puppet CA key from palladium. I am not sure if we have it anymore. I would expect to find it in puppetmaster1001's /root but I don't seem to be able to.
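One generic way to test whether a candidate private key is the one a certificate was issued for is to compare RSA moduli; a mismatch produces exactly the kind of "padding check failed" errors quoted above. This is a hedged sketch, not the procedure used here: the file paths are illustrative, and a throwaway keypair is generated so the commands are self-contained.

```shell
# Generate a throwaway keypair to demonstrate the check
# (in practice you'd point the -in paths at the candidate CA key/cert).
openssl genrsa -out /tmp/candidate-key.pem 2048 2>/dev/null
openssl req -new -x509 -key /tmp/candidate-key.pem -subj "/CN=demo-ca" \
    -days 1 -out /tmp/candidate-cert.pem

# A private key matches a certificate iff their RSA moduli are identical.
cert_mod=$(openssl x509 -noout -modulus -in /tmp/candidate-cert.pem | openssl md5)
key_mod=$(openssl rsa -noout -modulus -in /tmp/candidate-key.pem | openssl md5)
[ "$cert_mod" = "$key_mod" ] && echo "key matches certificate" || echo "wrong key"
```

If every archived candidate key fails this check against the cert the archive was encrypted for, the data is effectively unrecoverable, as discussed below.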

What would be the right procedure if that is the case (archival data would become unrecoverable)?

I don't think there is one that makes sense. The entire idea of encryption is to not be able to get the data without the key. If we can't get the Puppet CA's key from back then, we are out of luck. On the plus side, I don't have any recollection of that restore being asked for and being critical (I remember it being asked, but the requester eventually got the data another way and cancelled the request).


Following the last clue you gave me, I used the puppet CA key from around the time the backup was taken, which is archived in the private puppet repo (not the current one, nor the one just before it). Was that the wrong one? I cannot find them right now, but they were archived as puppet CA certs 2004-2013 or something like that (maybe they are in the git history?).

But if it is one that lived on palladium, I think the keys are in the backups themselves!!

@akosiaris I think the real question here is: Knowing this, do I have your blessing to decom old backup hardware?

@jcrespo & @akosiaris may I ask you to figure this out in a different task? This is a generic task about dozens of servers, so by discussing details about a couple of them we're going to lose the bigger picture :)

What's required here is for you to provide us with an ETA for when we can expect to decommission this very old piece of hardware, if now is not the right time. These two are 2011/2012-era, so we really need to get this going.


Sure. Feel free to come up with the ETA that @faidon is requesting in the comment above, consider it a +1 on my side for whenever.

Resolving this task, as we're going to start keeping track of active EOL servers via a different spreadsheet with the team managers going forward. @jcrespo and @akosiaris - for the remaining ones still pending, just keep your manager posted with their decom status, so we can make sure they're accounted for. Thanks, Willy

nskaggs mentioned this in Unknown Object (Task). May 4 2021, 3:22 PM