
Track down and replace very old HW
Closed, Resolved · Public

Description

The list below matches the criteria:

  • Status: Active in Netbox
  • Replacement date (= 5 years from purchase for servers, 8 years for most other equipment) is before 2019-07-01.

Effectively, this means this is equipment that should have already been replaced in fiscal years FY15-16, FY16-17, FY17-18, and FY18-19.
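The selection criteria above can be sketched in a few lines. This is an illustration of the stated policy (5-year lifetime for servers, 8 years for most other equipment, cutoff 2019-07-01), not the actual report generator; the function names and the device-kind mapping are assumptions.

```python
# Sketch of the replacement-date policy from the task description.
# Lifetimes and the cutoff date come from the criteria above; everything
# else (names, sample data) is illustrative.
from datetime import date

LIFETIME_YEARS = {"server": 5, "other": 8}
CUTOFF = date(2019, 7, 1)

def replacement_date(purchase: date, kind: str = "server") -> date:
    years = LIFETIME_YEARS[kind]
    try:
        return purchase.replace(year=purchase.year + years)
    except ValueError:  # purchased on Feb 29 of a leap year
        return purchase.replace(year=purchase.year + years, month=2, day=28)

def is_overdue(purchase: date, kind: str = "server") -> bool:
    """True if the asset belongs on the list above."""
    return replacement_date(purchase, kind) < CUTOFF

# Example from the table: bismuth, a server purchased 2014-05-01
print(replacement_date(date(2014, 5, 1)))  # 2019-05-01
print(is_overdue(date(2014, 5, 1)))        # True
```

By this rule, cloudmetrics1001 (purchased 2015-02-05, discussed further down) falls just outside the list: its replacement date of 2020-02-05 is after the cutoff.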

The asks here are:

  • Track down existing decom/replacement tasks and link them here
  • If a replacement is underway (e.g. msw1-eqiad, scs-a8-eqiad), execute it with priority
  • Reach out to service owners to get those tasks unstuck if they exist or...
  • Reach out to service owners to ask them to start replacing their hardware ASAP and file procurement and/or decom tasks when tasks do not exist.

This is old to very old (2011-era!) hardware and is affecting our ability to plan and maintain. In some cases it may just be forgotten decoms (e.g. labsdb1002-array1) or wrong statuses in Netbox.

Due to the FY20-21 planning being already in motion, I'd like to ask for this to be fully fleshed out (replacements planned and underway, most if not all equipment on this list decommissioned) by mid-April at the latest. That way we can use FY19-20 budget for some of these if necessary. @wiki_willy, do you think that's feasible?

Hostname | Purchase date | Replacement date | Suggested refresh | Decom/replacement task | Notes
analytics1028-1041 | 2014-06-26 | 2019-06-26 | FY18-19 | T227485 | Currently used for Kerberos migration, decom at end of quarter, will be replaced by T242148
bismuth | 2014-05-01 | 2019-05-01 | FY18-19 | T248516 |
dbproxy1002-1003 | 2011-01-27 | 2016-01-27 | FY15-16 | T245384 | @Marostegui
dbproxy1007-1011 | 2011-01-27 | 2016-01-27 | FY15-16 | T228768, T245385 | @Marostegui
es2001-es2004 | 2011-10-31 | 2016-10-31 | FY16-17 | T222592 | @Marostegui
fmsw-c8-codfw | 2009-01-01 | 2017-01-01 | FY16-17 | T253154 | Upgrading in Q1, DC-Ops
francium | 2014-05-01 | 2019-05-01 | FY18-19 | T249903 | Decom blocked by T242009
ganeti1001-1004 | 2014-03-24 | 2019-03-24 | FY18-19 | T255553 | @akosiaris
helium | 2011-01-27 | 2016-01-27 | FY15-16 | T260717 | No refresh; already superseded by backup1001. @jcrespo/@akosiaris
heze | 2012-12-13 | 2017-12-13 | FY17-18 | T260717 | No refresh; superseded by backup2001. @jcrespo/@akosiaris
labsdb1002-array1 | 2013-02-27 | 2018-02-27 | FY17-18 | T146455 |
msw1-eqiad | 2010-11-23 | 2018-11-23 | FY18-19 | T261449 | Upgrading via T225121
msw-a1-eqiad - msw-a8-eqiad | 2011-02-18 | 2019-02-18 | FY18-19 | T259758 | Upgrading in Q2, DC-Ops
msw-b1-eqiad - msw-b8-eqiad | 2011-02-18 | 2019-02-18 | FY18-19 | T259758 | Upgrading in Q2, DC-Ops
msw-a1-codfw - msw-a8-codfw | 2009-01-01 | 2017-01-01 | FY16-17 | T259758 | Upgrading in Q2, DC-Ops
msw-d4-codfw | 2009-01-01 | 2017-01-01 | FY16-17 | T259758 | Upgrading in Q2, DC-Ops
oresrdb1001-1002 | 2014-05-01 | 2019-05-01 | FY18-19 | T254238, T254240 | No refresh. @akosiaris
samarium | 2013-01-22 | 2018-01-22 | FY17-18 | T197630 |
scb1001-scb1004 | 2013-01-11 | 2018-01-11 | FY17-18 | | No refresh; the cluster is to be decommissioned, services move to Kubernetes. @akosiaris
scs-a8-eqiad | 2011-02-02 | 2019-02-02 | FY18-19 | T228919 | DC-Ops upgrading via T228919
tungsten | 2011-01-27 | 2016-01-27 | FY15-16 | T260395 | Currently running XHGui, being migrated to Ganeti instances in T180761; xhgui* now on buster, just waiting for data migration from MongoDB. @Dzahn

Related Objects

Event Timeline

akosiaris added subscribers: jcrespo, akosiaris.

labstore-array0-codfw - labstore-array3-codfw mentions madhuvishy, who left the foundation some time ago. I'd defer to bd808/bstorm.

Thanks @akosiaris, I'll check with Bryan or Brooke on the labstore-array* ones. @Marostegui and @akosiaris - much appreciated for hopping on these so quickly. Thanks, Willy

tungsten is currently running XHGui, once https://phabricator.wikimedia.org/T180761 is resolved it can be decommissioned.

mw1221 through mw1226 (6 servers) have been shut down today.

I just discovered that cloudmetrics1001 is old (2015) and needs replacement: https://netbox.wikimedia.org/dcim/devices/182/

This is true, and we should totally do that :) Please file a procurement task to that effect to make sure this happens and we won't forget.

That said, this is separate from this task, and it's not an accident/typo that it's not part of the list above: cloudmetrics1001's purchase date is 2015-02-05, which puts its replacement date at 2020-02-05, i.e. this FY, FY19-20. The list above only includes assets with a replacement date < 2019-07-01 (see the task description above), i.e. from past FYs.

On labstore-array0-3, doing a bit of archaeology says that these were connected to labstore2001 and 2002 in codfw T93215: rack and connect labstore-array4-codfw in codfw.

Those servers were decommissioned on T243329: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet, but maybe the shelves were missed? On a quick check, it doesn't look like the arrays are actually in racks at this time in Netbox; none of them list a location. Those arrays are all superseded by cloudbackup2001/2 and their arrays.

Am I reading something wrong @wiki_willy? It doesn't look like there's anything to do on those arrays.

Hi @Bstorm - thanks for digging into that one yesterday. Papaul just confirmed that he removed the arrays along with labstore2001 and labstore2002, so I'm going to remove the arrays from the list on this task. Much appreciated for your help on them. Thanks, Willy

faidon added a subscriber: Jgreen.
Jgreen removed a subscriber: Jgreen.

added the decom task for bismuth

Checked with Alex last week on the remaining devices missing decom tasks; he said he'd try to get to them when/if possible. @akosiaris - feel free to update this task when things free up a bit for you. Thanks, Willy

Bump! What's the latest here?

@wiki_willy, what's the latest here? What's blocking us from having decom tasks for all of the items above?

@faidon - DMs left in IRC for owners of the remaining items

"helium" and "heze" say above they were already replaced but at the same time they are still in site.pp with the production role. @jcrespo Can they be fully decom'ed now?

I need to ask @akosiaris a question regarding buster; once it is answered, they can be decommissioned.

Change 621038 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] decom helium and heze

https://gerrit.wikimedia.org/r/621038

Dzahn updated the task description.

What's blocking us from having decom tasks for all of the items above?

Added one for helium and heze per above.

tungsten has been permanently shut down today. One jessie off the list.

Located samarium, removed it from the racks, and corrected the Netbox error.

I need to ask @akosiaris a question regarding buster; once it is answered, they can be decommissioned.

@jcrespo, how may I help?

Hi @akosiaris - there's one more set of hosts on the table missing a decom task for scb1001-scb1004. Can you provide an update on when these can be decommissioned? Thanks, Willy

Hi @wiki_willy. They can't yet, that's why. We are still migrating 2 services off them. I'll update this task when that changes.

Thanks @akosiaris

I need to ask @akosiaris a question regarding buster; once it is answered, they can be decommissioned.

@jcrespo, how may I help?

So all tasks (except cleanup) were accomplished in T238048 to ensure recovery was possible on the new servers. However, the archive repository, even using the old puppet keys active at the time (see the documented method used on wiki), doesn't allow decryption of the old backups. It is my understanding, based on the error messages, that the cipher/library used may not be compatible with the cipher available on buster (but I may be wrong). Could you try to confirm whether this is the case (I am guessing you were the person that ran the backups at the time)? What would be the right procedure if it is (the archival data would become unrecoverable)?


Error: openssl.c:78 TLS read/write failure.: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
Error: openssl.c:78 TLS read/write failure.: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
Fatal error: restore.c:473 Data record error. ERR=Resource temporarily unavailable

These ^ right? Those point out that the wrong key was used for the restoration. The archive is old enough (I see we haven't written anything into it in >5 years) that it's quite possible we require the puppet CA key from palladium. I am not sure if we have it anymore. I would expect to find it in puppetmaster1001's /root but I don't seem to be able to.
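One generic way to test whether a candidate private key is the one a certificate was issued for is to compare RSA moduli; a mismatch produces exactly the kind of "padding check failed" errors quoted above. This is a hedged sketch, not the procedure used here: the file paths are illustrative, and a throwaway keypair is generated so the commands are self-contained.

```shell
# Generate a throwaway keypair to demonstrate the check
# (in practice you'd point the -in paths at the candidate CA key/cert).
openssl genrsa -out /tmp/candidate-key.pem 2048 2>/dev/null
openssl req -new -x509 -key /tmp/candidate-key.pem -subj "/CN=demo-ca" \
    -days 1 -out /tmp/candidate-cert.pem

# A private key matches a certificate iff their RSA moduli are identical.
cert_mod=$(openssl x509 -noout -modulus -in /tmp/candidate-cert.pem | openssl md5)
key_mod=$(openssl rsa -noout -modulus -in /tmp/candidate-key.pem | openssl md5)
[ "$cert_mod" = "$key_mod" ] && echo "key matches certificate" || echo "wrong key"
```

If every archived candidate key fails this check against the cert the archive was encrypted for, the data is effectively unrecoverable, as discussed below.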

What would be the right procedure if that is the case (archival data would become unrecoverable)?

I don't think there is one that makes sense. The entire idea of encryption is to not be able to get the data without the key. If we can't get the Puppet CA's key from back then, we are out of luck. On the plus side, I don't have any recollection of that restore being asked for and being critical (I remember it being asked, but the requester eventually got the data another way and cancelled the request).


Following the last clue you gave me, I used the puppet CA key from around the time the backup was taken, which is archived in the private puppet repo (not the current one, nor the one just before it). Was that the wrong one? I cannot find them right now, but they were archived as puppet CA certs 2004-2013 or something like that (maybe they are in the git history?).

But if it is one that lived on palladium, I think the keys are in the backups themselves!!

@akosiaris I think the real question here is: Knowing this, do I have your blessing to decom old backup hardware?

@jcrespo & @akosiaris may I ask you to figure this out in a different task? This is a generic task about dozens of servers, so by discussing details about a couple of them we're going to lose the bigger picture :)

What's required here is for you to provide us with an ETA for when we can expect to decommission this very old piece of hardware, if now is not the right time. These two are 2011/2012-era, so we really need to get this going.


Sure. Feel free to come up with the ETA that @faidon is requesting in the comment above, consider it a +1 on my side for whenever.

Resolving this task, as we're going to start keeping track of active EOL servers via a different spreadsheet with the team managers going forward. @jcrespo and @akosiaris - for the remaining ones still pending, just keep your manager posted with their decom status, so we can make sure they're accounted for. Thanks, Willy

nskaggs mentioned this in Unknown Object (Task). May 4 2021, 3:22 PM