cloudstore100{8,9} - Upgrade to 10GbE
Closed, Resolved, Public

Description

cloudstore1008 and cloudstore1009 are NFS file servers (with 65+ TB capacity).

They are currently on 1GbE which will be problematic for their use case.

They won't use DRBD; they will most likely have hourly syncs instead. Finally, they aren't in use yet, so any maintenance is best done before we start using them.
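
As a rough illustration of why 1GbE is limiting for servers of this size, the sketch below estimates how long a full copy of 65 TB would take at each link speed. It assumes decimal terabytes, a link running at full line rate, and no protocol or disk overhead, so real transfers would be slower. `transfer_hours` is a hypothetical helper, not something from this task.

```python
def transfer_hours(capacity_tb: float, link_gbit_per_s: float) -> float:
    """Hours needed to move capacity_tb (decimal TB) over a link at full line rate.

    Assumption: no protocol, filesystem, or disk overhead is modelled.
    """
    bits = capacity_tb * 1e12 * 8                 # TB -> bits
    seconds = bits / (link_gbit_per_s * 1e9)      # bits / (bits per second)
    return seconds / 3600


for link in (1, 10):
    print(f"{link:>2} Gbit/s: {transfer_hours(65, link):.1f} hours")
# Prints:
#  1 Gbit/s: 144.4 hours
# 10 Gbit/s: 14.4 hours
```

Under these assumptions a full resync takes about six days on 1GbE versus well under a day on 10GbE.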

Event Timeline

GTirloni created this task. Jan 17 2019, 8:13 PM

cloudstore1008 is in a5
cloudstore1009 and its array are in a6

@Cmjohnson attempted to address the 10G options during the racking and setup on T193655#4264714.

So, it seems in prior conversations on that task, there was a requirement for 10G and also to be in the same row or adjacent racks? Can we clarify exactly what is needed? If it is just 10G, and they don't have to be in the same row, it makes this much easier to accomplish.

It seems the requirement to be side-by-side might have been because we were planning to use DRBD and needed a direct host-to-host connection but we're not pursuing DRBD anymore.

In discussing this with the WMCS team, the requirement to be side-by-side can be dropped and the servers can be relocated as necessary. Thanks in advance.

CDanis triaged this task as Normal priority. Jan 18 2019, 12:34 AM

@GTirloni I do not have room in row A. These can go into row D, racks D2 and D7. Doing this will require a DNS (IP) change, and I will have to reconfigure the servers to use the 10G NIC. A re-install of the OS will be needed. I would like to do at least one on February 7th at 16:00 UTC (11am EST). If we can do both that would be great, but we can stagger them so that I move the second server the following week. Please confirm that this will work for you.

@Cmjohnson that works for me. We can do both if time allows.

RobH added a comment. Feb 7 2019, 5:28 PM

Ok, assisting in this I've done the following:

  • removed cloudstore100[89] from asw2-a-eqiad(ge-5/0/14 & ge-6/0/17) and cloudstore1009 from asw-a-eqiad:ge-6/0/17.
    • removed the descriptions, removed from public vlan, added to disabled group.
  • added both hosts to row D: cloudstore1008 xe-2/0/13 & cloudstore1009 xe-7/0/19 to the public vlan and enabled them

now working on the dns change to follow

Change 488973 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] migrate cloudstore100[89] to row d dns change

https://gerrit.wikimedia.org/r/488973

Change 488973 merged by RobH:
[operations/dns@master] migrate cloudstore100[89] to row d dns change

https://gerrit.wikimedia.org/r/488973

Cmjohnson assigned this task to RobH. Feb 7 2019, 6:40 PM

@RobH Can you do a re-install and hand off to cloud, please?

I moved the servers to row D, racks D2 and D7
I connected them to the 10G switch
I changed the BIOS boot config to PXE off the 10G NIC
I updated to the latest iDRAC firmware
I updated to the latest BIOS (urgent update)
I updated Netbox to show the new location

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudstore1008.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201902071843_robh_49723_cloudstore1008_wikimedia_org.log.

Completed auto-reimage of hosts:

['cloudstore1008.wikimedia.org']

Of which those FAILED:

['cloudstore1008.wikimedia.org']

Change 488992 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] update cloudstore100[89] mac addresses

https://gerrit.wikimedia.org/r/488992

Change 488992 merged by RobH:
[operations/puppet@production] update cloudstore100[89] mac addresses

https://gerrit.wikimedia.org/r/488992

wmf-decommission-host was executed by robh for cloudstore1008.wikimedia.org and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Skipped downtime host on Icinga (likely already removed)
  • Skipped downtime mgmt interface on Icinga (likely already removed)
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for cloudstore1009.wikimedia.org and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

RobH reassigned this task from RobH to GTirloni. Feb 7 2019, 7:23 PM

Ok, these are both reinstalled and ready for use/takeover.

@RobH @Cmjohnson thanks a lot for this, really appreciate the effort.

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 208.80.155.125 port 5001 connected with 208.80.155.126 port 56322
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  11.0 GBytes  9.41 Gbits/sec
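
For context on the 9.41 Gbits/sec figure, here is a minimal sketch of the theoretical TCP goodput ceiling on a 10GbE link. It assumes a standard 1500-byte MTU, IPv4, and the TCP timestamp option enabled; none of these settings are stated in this task. `max_tcp_goodput_gbits` is a hypothetical helper name.

```python
# Assumptions (not confirmed by the task): 1500-byte MTU, IPv4, TCP timestamps on.
MTU = 1500
ETH_OVERHEAD = 14 + 4 + 8 + 12   # Ethernet header, FCS, preamble/SFD, inter-frame gap
IP_HEADER = 20
TCP_HEADER = 20
TCP_TIMESTAMP_OPTION = 12        # 10-byte option padded to 12 bytes


def max_tcp_goodput_gbits(line_rate_gbits: float) -> float:
    """Upper bound on TCP payload throughput for a given Ethernet line rate."""
    payload = MTU - IP_HEADER - TCP_HEADER - TCP_TIMESTAMP_OPTION  # 1448 bytes
    on_wire = MTU + ETH_OVERHEAD                                   # 1538 bytes
    return line_rate_gbits * payload / on_wire


print(f"{max_tcp_goodput_gbits(10):.2f} Gbit/s")  # prints "9.41 Gbit/s"
```

Under those assumptions the measured rate is at the ceiling for this frame size, which suggests the new 10G links are running at full speed.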

GTirloni closed this task as Resolved. Feb 8 2019, 6:22 PM