
Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN.
Closed, Resolved · Public

Description

In T204951 (Presto cluster online and usable with test data pushed from analytics prod infrastructure accessible by Cloud (labs) users), we attempted to set up a CloudVPS based Hadoop+Hive and Presto cluster on which we could host public data queryable from Cloud VPS. We tried to set up this cluster in Cloud VPS itself, which failed due to the lack of first-class infrastructure support for Cloud VPS based hosting, e.g. monitoring, alerting, etc.

We'd like to move this hardware back into production in the Analytics VLAN. We'll use the puppetization work already done to set up Presto there. At first we'll set it up to use the existing data in Hive and analytics-hadoop. If we again decide to pursue a queryable interface from CloudVPS, we'd reinstall these with a dedicated ('public data') Hadoop cluster.

Not sure who can do the network changes for this. I believe we could handle reimaging.

Please rename these nodes to an-presto100[1-5].

Please reimage these as stretch! :) We are not ready for Java 11.

Details

Related Gerrit Patches:
operations/dns (master): Remove cloudvirtan100X references
operations/puppet (production): Set Debian Buster fro an-presto100* nodes
operations/puppet (production): partman: add more preseed configs for an-presto's d-i
operations/dns (master): Add DNS entries for an-presto nodes
operations/puppet (production): Prep for installing an-presto nodes
operations/puppet (production): Set cloudvirtan* to role::spare::system

Event Timeline

Restricted Application added a subscriber: Aklapper. Jun 5 2019, 5:50 PM
ayounsi added a subscriber: ayounsi. Jun 5 2019, 5:55 PM

What are the needed network changes?

The usual two are:
1/ switch port config (usually for DCops), for that we need to know which hosts are going to which vlan
2/ Firewall config (usually for me or @elukey), for that we need to know what flows are getting in/out of the systems (usually a diagram helps)

Ottomata added a comment. Edited Jun 5 2019, 6:10 PM

1/ switch port config (usually for DCops), for that we need to know which hosts are going to which vlan

cloudvirtan100[1-5] should be moved into the Analytics VLAN

2/ Firewall config (usually for me or @elukey), for that we need to know what flows are getting in/out of the systems (usually a diagram helps)

Since it is in the Analytics VLAN, there shouldn't be a need for any special rules (that I know of atm). We might need it to talk to kafka jumbo eventually, but let's deal with that when it happens.

fdans triaged this task as High priority. Jun 6 2019, 4:45 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
ayounsi added a subscriber: Cmjohnson.

Great, over to @Cmjohnson then!

Cmjohnson moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Jun 27 2019, 4:28 PM
Ottomata assigned this task to Cmjohnson. Jul 2 2019, 4:09 PM

Feel free to reassign

Ottomata moved this task from Next Up to Paused on the Analytics-Kanban board.

@Ottomata - can you reach out to Chris on IRC and schedule a time with him on this one? Sounds pretty straight-forward, so I think he just needs to work with you via chat to make sure everything's working properly after the change. Thanks, Willy

@Ottomata
Please decommission the current servers to spare role
Please provide the new hostnames you want to use
These are all located in row B...will that be okay or do you need them spread out across the rows?

Please decommission the current servers to spare role

Ok, will do. I'll downtime the hostnames in icinga when I do.

Please provide the new hostnames you want to use

Oh good point, will talk to @elukey. an-presto100x?

These are all located in row B...will that be okay or do you need them spread out across the rows?

They should be spread out as evenly as possible. Thanks.

These nodes do still need 10G.

Also, Andrew Bogott says:

Those boxes currently have two networks hooked up, for control plane and virt plane. You probably only need one if you're going to manage them as bare metal.

Change 523935 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set cloudvirtan* to role::spare::system

https://gerrit.wikimedia.org/r/523935

Change 523935 merged by Ottomata:
[operations/puppet@production] Set cloudvirtan* to role::spare::system

https://gerrit.wikimedia.org/r/523935

cookbooks.sre.hosts.decommission executed by otto@cumin1001 for hosts: cloudvirtan[1001-1005].eqiad.wmnet

  • cloudvirtan1002.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • cloudvirtan1001.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • cloudvirtan1003.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • cloudvirtan1005.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
  • cloudvirtan1004.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

Alright, nodes are role spare::system and decommed/downtimed in icinga.
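For reference, the host range used in the cookbook run above, and the rename requested in the task description, can be expanded with a few lines of Python. This is only a minimal sketch: `expand_hosts` is a hypothetical helper, not cumin's own range parser, and it handles a single `[START-END]` range only. It also assumes each node keeps its number across the rename.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one 'prefix[START-END]suffix' range into a list of hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracket range present: treat the pattern as a single hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


# Old names as passed to the decommission cookbook, new names as requested
# in the task description.
old_hosts = expand_hosts("cloudvirtan[1001-1005].eqiad.wmnet")
new_hosts = expand_hosts("an-presto100[1-5].eqiad.wmnet")

# Assumption: hosts keep their number, so pairing the two lists in order works.
rename_plan = dict(zip(old_hosts, new_hosts))
for old, new in rename_plan.items():
    print(f"{old} -> {new}")
```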

Ottomata updated the task description. Jul 17 2019, 3:04 PM

@Cmjohnson back atcha :)

Ottomata moved this task from Paused to In Progress on the Analytics-Kanban board. Jul 17 2019, 3:06 PM
Ottomata updated the task description. Jul 22 2019, 3:08 PM

OO, when we reimage these, let's use Buster! :)

OO, when we reimage these, let's use Buster! :)

I take it back, use Stretch. Buster ships with Java 11, which we are not ready for.

Ottomata updated the task description. Aug 27 2019, 4:07 PM
Cmjohnson reassigned this task from Cmjohnson to Jclark-ctr. Aug 27 2019, 8:00 PM
Cmjohnson added a subscriber: Jclark-ctr.

@Jclark-ctr Can you please move these servers, spread as evenly as you can, into rows B2/B4 and B7, cable them with 10G DAC cables and the mgmt cable, and update netbox and this task with their locations and the port numbers you connected the servers to?

Hosts moved. cmjohnson advised moving them out of row B into 10G racks, leaving one in B.

host             row  unit  port
cloudvirtan1001  d2   19    16/17
cloudvirtan1002  a2   29    24/25
cloudvirtan1003  d7   32    20/21
cloudvirtan1004  b4   21    5/17
cloudvirtan1005  a4   20    37/38

@Ottomata Do you still need the 2nd port now that you're not doing the cloud thing? If so which vlan?

We don't!

Just Analytics VLAN for now please.

@Ottomata All the servers are moved and all of them but cloudvirtan1003 are connected to the switch in the correct vlan. @Jclark-ctr if you are still around, can you please verify that cloudvirtan1003 is connected to the switch in rack d7, port xe-7/0/20?

Also @Jclark-ctr please remove the DAC cables from the 2nd port. They are not needed.

@Cmjohnson Removed the 2nd DAC cable. Yes, it is plugged into d7 xe-7/0/2

Next steps?

@Ottomata the on-site work is done. They will need updated production DNS, but all are moved and connected.

Change 535209 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Prep for installing an-presto nodes

https://gerrit.wikimedia.org/r/535209

Change 535221 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/dns@master] Add DNS entries for an-presto nodes

https://gerrit.wikimedia.org/r/535221

@Cmjohnson / @Jclark-ctr https://gerrit.wikimedia.org/r/535221 adds DNS for non-mgmt entries. Should I modify the mgmt ones too? Also, it seems netbox needs to be updated.

@Cmjohnson should the following descriptions be updated as well with their an-presto equivalents?

elukey@asw2-a-eqiad> show interfaces descriptions | match cloudvirtan
xe-2/0/24       up    up   cloudvirtan1002
xe-4/0/37       up    up   cloudvirtan1005

elukey@asw2-b-eqiad> show interfaces descriptions | match cloudvirtan
xe-4/0/5        up    down cloudvirtan1004

elukey@asw2-d-eqiad> show interfaces descriptions | match cloudvirtan
xe-2/0/16       up    up   cloudvirtan1001
xe-7/0/20       up    up   cloudvirtan1003

Netbox also seems to be listing the old names: https://netbox.wikimedia.org/search/?q=cloudvirtan&obj_type=

We can do it as well, but it's better to triple-check with you to avoid missteps and/or stepping on each other's feet :)
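The description rename asked about above is mechanical enough to generate from the pasted switch output. The sketch below is only an illustration: `rename_commands` is a hypothetical helper, it assumes each row has exactly four whitespace-separated fields as in the output above, and the resulting `set interfaces ... description ...` lines would still have to be reviewed and committed by hand on each switch.

```python
# Output of "show interfaces descriptions | match cloudvirtan" on asw2-a-eqiad,
# copied from the comment above.
ASW2_A_OUTPUT = """\
xe-2/0/24       up    up   cloudvirtan1002
xe-4/0/37       up    up   cloudvirtan1005
"""


def rename_commands(
    cli_output: str,
    old_prefix: str = "cloudvirtan",
    new_prefix: str = "an-presto",
) -> list[str]:
    """Build Junos 'set' commands that swap the hostname prefix in descriptions."""
    commands = []
    for line in cli_output.splitlines():
        fields = line.split()
        # Expected columns: interface, admin state, link state, description.
        if len(fields) != 4 or not fields[3].startswith(old_prefix):
            continue
        interface, description = fields[0], fields[3]
        new_description = new_prefix + description[len(old_prefix):]
        commands.append(f"set interfaces {interface} description {new_description}")
    return commands


for command in rename_commands(ASW2_A_OUTPUT):
    print(command)
```

Generating the commands from the live output rather than typing them keeps the interface names and host numbers in sync with what the switch actually reports.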

The old host definitions for cloudvirtan are still around in debmonitor, puppetdb and site.pp. Those need to be dropped as well; currently this is throwing errors for every cluster-wide cumin run.

Change 535209 merged by Ottomata:
[operations/puppet@production] Prep for installing an-presto nodes

https://gerrit.wikimedia.org/r/535209

Change 535221 merged by Ottomata:
[operations/dns@master] Add DNS entries for an-presto nodes

https://gerrit.wikimedia.org/r/535221

Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts:

['an-presto1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909101525_otto_153478.log.

Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts:

cloudvirtan1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201909101534_otto_155583_cloudvirtan1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirtan1002.eqiad.wmnet']

Of which those FAILED:

['cloudvirtan1002.eqiad.wmnet']

Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts:

cloudvirtan1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201909101535_otto_156624_cloudvirtan1002_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts:

cloudvirtan1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201909101544_otto_157468_cloudvirtan1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-presto1002.eqiad.wmnet']

Of which those FAILED:

['an-presto1002.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-presto1001.eqiad.wmnet']

Of which those FAILED:

['an-presto1001.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

cloudvirtan1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201909160613_elukey_54732_cloudvirtan1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirtan1001.eqiad.wmnet']

Of which those FAILED:

['cloudvirtan1001.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

cloudvirtan1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201909160615_elukey_55008_cloudvirtan1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirtan1001.eqiad.wmnet']

Of which those FAILED:

['cloudvirtan1001.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

cloudvirtan1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201909160615_elukey_55100_cloudvirtan1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-presto1001.eqiad.wmnet']

Of which those FAILED:

['an-presto1001.eqiad.wmnet']

Change 536804 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Debian Buster fro an-presto100* nodes

https://gerrit.wikimedia.org/r/536804

Change 536804 merged by Elukey:
[operations/puppet@production] Set Debian Buster fro an-presto100* nodes

https://gerrit.wikimedia.org/r/536804

Change 536835 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] partman: add more preseed configs for an-presto's d-i

https://gerrit.wikimedia.org/r/536835

Change 536835 merged by Elukey:
[operations/puppet@production] partman: add more preseed configs for an-presto's d-i

https://gerrit.wikimedia.org/r/536835

elukey added a comment. Edited Sep 16 2019, 9:30 AM
  • an-presto1001 - PXE boot fails (didn't catch the media failure alert for this one, but when I try to force PXE it ends up booting the previously installed OS)
NIC in Slot 4 Port 1: QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F4:E9:D4:DB:AF:B0
NIC in Slot 4 Port 2: QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F4:E9:D4:DB:AF:B2

Both show up in "connected" state, which seems strange. It shouldn't cause trouble, but it's probably worth triple-checking.

  • an-presto1002 - all good, reimaged/renamed

Checked the BIOS and only one NIC (the one configured in puppet) has state "connected"

  • an-presto1003 - PXE boot fails with "media failure"
NIC in Slot 4 Port 1: QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F4:E9:D4:DB:E0:B0
NIC in Slot 4 Port 2: QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F4:E9:D4:DB:E0:B2

:B0 shows up as "disconnected" but in puppet it is set for DHCP; meanwhile :B2 shows as "connected". They are both 10G interfaces, but I am wondering if any cable plug changed?

  • an-presto1004 - PXE boot fails (didn't catch the media failure alert for this one, but when I try to force PXE it ends up booting the previously installed OS)
NIC in Slot 4 Port 1: QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F4:E9:D4:DB:7E:30
NIC in Slot 4 Port 2: QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F4:E9:D4:DB:7E:32

Both show up in "connected" state, which seems strange. It shouldn't cause trouble, but it's probably worth triple-checking.

  • an-presto1005 - all good, reimaged/renamed

To summarize:

  • an-presto1001 and 1004 still have two NICs configured and fail to PXE boot correctly
  • an-presto1003 has the wrong NIC showing up as connected (i.e. not the one stated in puppet) and fails to PXE boot correctly
  • an-presto1002 and 1005 have only one NIC (the correct one) showing up as connected, and they PXE boot correctly.
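The three cases in this summary can be written down as a small check. This is a rough sketch, not an existing tool: `Nic` and `pxe_nic_status` are hypothetical names, the link states are the ones read from the BIOS above, and for an-presto1001 the report does not say which MAC is set for DHCP in puppet, so Port 1 is assumed purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    mac: str
    connected: bool  # link state as shown in the BIOS


def pxe_nic_status(nics: list[Nic], dhcp_mac: str) -> str:
    """Classify a host by comparing linked NICs with the MAC set for DHCP."""
    connected = {nic.mac.upper() for nic in nics if nic.connected}
    if dhcp_mac.upper() not in connected:
        return "wrong-nic"   # the NIC configured for DHCP has no link
    if len(connected) > 1:
        return "extra-link"  # a second NIC is still cabled
    return "ok"


# an-presto1003: :B0 is set for DHCP in puppet but shows as disconnected.
presto1003 = [
    Nic("F4:E9:D4:DB:E0:B0", connected=False),
    Nic("F4:E9:D4:DB:E0:B2", connected=True),
]
print(pxe_nic_status(presto1003, dhcp_mac="F4:E9:D4:DB:E0:B0"))

# an-presto1001: both NICs connected. ASSUMPTION: Port 1 is the DHCP NIC.
presto1001 = [
    Nic("F4:E9:D4:DB:AF:B0", connected=True),
    Nic("F4:E9:D4:DB:AF:B2", connected=True),
]
print(pxe_nic_status(presto1001, dhcp_mac="F4:E9:D4:DB:AF:B0"))
```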

@Cmjohnson @Jclark-ctr Can you check my report when you have a minute and let me know your thoughts? Thanks :)

Thanks for doing this Luca! BTW we don't want buster...I mean I guess we could try it but I'd expect it not to work with Java 11.

Ah, Luca just noted that we will have a Java 8 package for buster. I'm ok with Buster then!

Thanks to the awesome work of @Jclark-ctr an-presto1001 and an-presto1003 are now reimaged, but an-presto1004 is still not working. I think the problem is:

elukey@asw2-b-eqiad> show ethernet-switching interface xe-4/0/5
Routing Instance Name : default-switch
Logical Interface flags (DL - disable learning, AD - packet action drop,
                         LH - MAC limit hit, DN - interface down,
                         SCTL - shutdown by Storm-control,
                         MMAS - Mac-move action shutdown, AS - Autostate-exclude enabled)

Logical          Vlan          TAG     MAC         STP         Logical           Tagging
interface        members               limit       state       interface flags
xe-4/0/5.0                             294912                                     untagged
                 cloud-hosts1-b-eqiad 1118 294912  Forwarding                     untagged

The port is configured with the wrong VLAN :(
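The same check can be done programmatically on the pasted output. The following is a minimal sketch under stated assumptions: `vlan_members` and `check_vlan` are hypothetical helpers, the regular expression only covers the column layout shown above, and the expected VLAN name `analytics1-b-eqiad` is inferred from the interface-range name rather than confirmed in this task.

```python
import re

# Table part of "show ethernet-switching interface xe-4/0/5", copied from above.
SWITCH_OUTPUT = """\
Logical          Vlan          TAG     MAC         STP         Logical           Tagging
interface        members               limit       state       interface flags
xe-4/0/5.0                             294912                                     untagged
                 cloud-hosts1-b-eqiad 1118 294912  Forwarding                     untagged
"""

# Indented member rows: VLAN name, tag, MAC limit, STP state, tagging.
MEMBER_ROW = re.compile(r"^\s+(\S+)\s+(\d+)\s+\d+\s+\S+\s+\S+\s*$")


def vlan_members(cli_output: str) -> list[tuple[str, int]]:
    """Return (vlan_name, tag) pairs found in the member rows."""
    members = []
    for line in cli_output.splitlines():
        match = MEMBER_ROW.match(line)
        if match:
            members.append((match.group(1), int(match.group(2))))
    return members


def check_vlan(cli_output: str, expected_vlan: str) -> str:
    """Compare the VLANs on a port with the one it is expected to be in."""
    names = [name for name, _tag in vlan_members(cli_output)]
    if not names:
        return "no VLAN membership found"
    if expected_vlan in names:
        return "ok"
    return f"wrong VLAN: {', '.join(names)} (expected {expected_vlan})"


# ASSUMPTION: the analytics VLAN in row B is named "analytics1-b-eqiad".
print(check_vlan(SWITCH_OUTPUT, "analytics1-b-eqiad"))
```

With the output above this reports that the port is in cloud-hosts1-b-eqiad instead of the expected VLAN. An empty command output yields "no VLAN membership found".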

Proposed fix for asw2-b:

delete interfaces interface-range cloud-hosts1-b-eqiad member xe-4/0/5
set interfaces interface-range vlan-analytics1-b-eqiad xe-4/0/5

Proposed fix for asw2-b:

delete interfaces interface-range cloud-hosts1-b-eqiad member xe-4/0/5
set interfaces interface-range vlan-analytics1-b-eqiad xe-4/0/5

Aside from a typo, it's vlan-cloud-hosts1-b-eqiad, LGTM

Committed:

elukey@asw2-b-eqiad# show | compare
[edit interfaces interface-range vlan-cloud-hosts1-b-eqiad]
-    member xe-4/0/5;
[edit interfaces interface-range vlan-analytics1-b-eqiad]
     member ge-5/0/7 { ... }
+    member xe-4/0/5;

Cc: @ayounsi
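For the record, the committed change corresponds to a delete/set pair that uses the full interface-range names and the `member` keyword. A small sketch that produces such a pair is shown below; `move_port_commands` is a hypothetical helper, and the range names are the ones visible in the committed diff above.

```python
def move_port_commands(port: str, old_range: str, new_range: str) -> list[str]:
    """Build the Junos commands that move a port between interface-ranges."""
    return [
        f"delete interfaces interface-range {old_range} member {port}",
        f"set interfaces interface-range {new_range} member {port}",
    ]


# Range names taken from the "show | compare" output on asw2-b-eqiad.
for command in move_port_commands(
    port="xe-4/0/5",
    old_range="vlan-cloud-hosts1-b-eqiad",
    new_range="vlan-analytics1-b-eqiad",
):
    print(command)
```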

Change 537651 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Remove cloudvirtan100X references

https://gerrit.wikimedia.org/r/537651

elukey added a comment. Edited Sep 18 2019, 2:41 PM

Ok so current status:

  • All hosts reimaged to buster and working
  • Renamed hostnames in netbox

Todo:

AWESOME thank youuuu

Change 537651 merged by Elukey:
[operations/dns@master] Remove cloudvirtan100X references

https://gerrit.wikimedia.org/r/537651

elukey@asw2-a-eqiad# show | compare
[edit interfaces xe-2/0/24]
-   description cloudvirtan1002;
+   description an-presto1002;
[edit interfaces xe-4/0/37]
-   description cloudvirtan1005;
+   description an-presto1005;

After applying, I noticed this:

elukey@asw2-a-eqiad> show interfaces descriptions | match an-presto
xe-2/0/24       up    up   an-presto1002
xe-4/0/37                  an-presto1005

{master:7}
elukey@asw2-a-eqiad> show ethernet-switching interface xe-4/0/37

an-presto1005 has been reimaged, but I can't ssh to it. Does the empty output of show ethernet-switching mean that no VLAN setting has been applied?

elukey@asw2-b-eqiad# show | compare
[edit interfaces xe-4/0/5]
-   description cloudvirtan1004;
+   description an-presto1004;

elukey@asw2-d-eqiad# show | compare
[edit interfaces xe-2/0/16]
-   description cloudvirtan1001;
+   description an-presto1001;
[edit interfaces xe-7/0/20]
-   description cloudvirtan1003;
+   description an-presto1003;

The interface descriptions should be ok now.

@Cmjohnson @Jclark-ctr there is one last problem - an-presto1005:

  1. is not connected to any switch as far as I can see from BIOS
  2. on asw2-a the interface with description an-presto1005 (just renamed, it was cloudvirtan1005) seems to be pointing to a non-existent device (hence no VLAN config etc.)

Can you check when you have a minute?

elukey closed this task as Resolved. Edited Sep 27 2019, 2:17 PM

elukey@asw2-a-eqiad> show interfaces descriptions | match an-presto
xe-2/0/24       up    up   an-presto1002
xe-4/0/37       up    up   an-presto1005

elukey@asw2-a-eqiad> show ethernet-switching interface xe-4/0/37
Routing Instance Name : default-switch
Logical Interface flags (DL - disable learning, AD - packet action drop,
                         LH - MAC limit hit, DN - interface down,
                         SCTL - shutdown by Storm-control,
                         MMAS - Mac-move action shutdown, AS - Autostate-exclude enabled)

Logical          Vlan          TAG     MAC         STP         Logical           Tagging
interface        members               limit       state       interface flags
xe-4/0/37.0                            294912                                     untagged
                 analytics1-a-eqiad 1030 294912    Forwarding                     untagged

All good!