rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet
Closed, Resolved (Public)

Description

This task will track the racking, setup, and installation of the 2 new cloud backup hosts received in T210666.

Rack proposal: please advise
Wiring configuration: please advise
Name proposal: please advise

@Andrew or @aborrero Please provide me with the information needed; once finished, you can just assign the task back to me.

Thanks

cloudbackup2001.codfw.wmnet : Row A rack A7
cloudbackup2001-array1

  • - receive in system on procurement task T210666
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID : Please provide HW RAID configuration and partman recipe
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged & IMMEDIATELY RUN/SIGN PUPPET
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

cloudbackup2002.codfw.wmnet: Row C rack C7
cloudbackup2002-array1

  • - receive in system on procurement task T210666
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID Please provide HW RAID configuration and partman recipe
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged & IMMEDIATELY RUN/SIGN PUPPET
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

Event Timeline

aborrero triaged this task as Medium priority.

Rack proposal: anywhere in codfw, each server in a different rack, a rack with 10G support
Wiring configuration: single 10G connection each server if possible. The mgmt interface connected as in any other server (standard).
Name proposal: Per T210666#4916941, we agreed on calling them cloudbackup. They can be called cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org. I added the entry to https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers

aborrero renamed this task from rack/setup codfw: cloudstore (backups) to rack/setup codfw: cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org. May 29 2019, 9:23 AM
aborrero updated the task description.

On second thoughts, we would like to change the public VLAN for a private one, from .wikimedia.org to .wmnet.

@Papaul is it too late to suggest switching to cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet? The team has been discussing this, and it is probably better to have them on the private network.

aborrero renamed this task from rack/setup codfw: cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org to rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet. May 29 2019, 4:47 PM
aborrero updated the task description.

On IRC:

18:56 <papaul> arturo: i have RAID setup that is unknow for now please discuss with your team and provide me with the information i have only the MD on site and not the PE so once i received the PE i need to move fast on this task thanks

raid config:

  • raid1 for the two os volumes
  • one big raid6 w/lvm for the remaining internal drives
  • another big raid6 w/lvm for the shelf
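As a rough sanity check on this layout: RAID1 yields the capacity of one drive, and RAID6 yields the capacity of all drives minus two. The sketch below is illustrative only. The task does not state drive counts or sizes, so the numbers in `layout` are hypothetical placeholders, and the function ignores controller and filesystem overhead.

```python
def usable_capacity_tb(level: str, drives: int, drive_size_tb: float) -> float:
    """Return the usable capacity of one array, ignoring overhead."""
    if level == "raid1":
        if drives < 2:
            raise ValueError("RAID1 needs at least 2 drives")
        # Every drive holds a full copy, so capacity equals one drive.
        return drive_size_tb
    if level == "raid6":
        if drives < 4:
            raise ValueError("RAID6 needs at least 4 drives")
        # Two drives' worth of space is consumed by parity.
        return (drives - 2) * drive_size_tb
    raise ValueError(f"unsupported RAID level: {level}")


# Hypothetical drive counts and sizes (NOT taken from the task).
layout = {
    "os volumes": ("raid1", 2, 0.24),
    "internal drives": ("raid6", 12, 10.0),
    "shelf": ("raid6", 12, 10.0),
}

for name, (level, drives, size_tb) in layout.items():
    capacity = usable_capacity_tb(level, drives, size_tb)
    print(f"{name}: {level}, {drives} drives, about {capacity:.2f} TB usable")
```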

Both MD's are racked and Netbox updated.

PowerEdge

Virtual Disk 0: RAID1, 223GB, Ready                                           
Virtual Disk 1: RAID6, 106.918TB, Ready

Storage MD
Virtual Disk 0: RAID6, 106.918TB, Ready

@Bstorm when talking about the private network for those hosts, are you referring to the private1 network or the lab private network?

Thanks.

private1, I believe. I mean the internal network. I *think* we are moving away from the lab private network, right @bd808 ?

Yes, we have been slowly deprecating the cloud-support* VLANs on the basis that the general SRE team found them to be confusing. Things that would have once been placed in those VLANs have been going into the public* VLANs instead for the last 2 years or so.

These servers are 100% WMCS internal infrastructure and not expected to have direct interaction from Cloud VPS instance space. Basically we need these hosts to be networked such that we can push (and restore) backups to them from the {lab,cloud}store servers in eqiad. I think that means the private1-*-codfw VLANs would be an ok place for them, but the public1-*-codfw VLANs may work too and would decrease the chance of any breach on the cloudbackup* hosts providing a jumping off point for attacking production MediaWiki or databases. I'm not sure what switch/router ACL changes would be needed to host them in public1-*-codfw and if that is considered problematic or not.

Change 520137 had a related patch set uploaded (by Papaul; owner: pt1979):
[operations/dns@master] DNS: Add mgmt and production DNS for cloudbackup200[1-2]

https://gerrit.wikimedia.org/r/520137

Change 520137 merged by Arturo Borrero Gonzalez:
[operations/dns@master] DNS: Add mgmt and production DNS for cloudbackup200[1-2]

https://gerrit.wikimedia.org/r/520137

Change 520257 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Install/partman for cloudbackup2001/2002

https://gerrit.wikimedia.org/r/520257

Change 520257 merged by Andrew Bogott:
[operations/puppet@production] Install/partman for cloudbackup2001/2002

https://gerrit.wikimedia.org/r/520257

@Bstorm @Andrew OS install and puppet run are done on cloudbackup2001. It is all yours. Run all your tests; if you are happy, please assign the task back to me so I can do the HW RAID configuration on cloudbackup2002.
Thanks

@Bstorm @Andrew These were installed with Stretch, but in the meantime Buster was released. Shall we reimage them before they are put to use?

Change 540668 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] move cloudbackup2001/2002 to Buster

https://gerrit.wikimedia.org/r/540668

Change 540668 merged by Andrew Bogott:
[operations/puppet@production] move cloudbackup2001/2002 to Buster

https://gerrit.wikimedia.org/r/540668

I am finally back looking at this! I'm not sure quite what I should expect regarding the raids here -- I reimaged (for buster) and saw the partitioner offer to create two volumes, but in the OS I only see one:

root@cloudbackup2001:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/sda5
  VG Name               cloudbackup2001-vg
  PV Size               <195.06 GiB / not usable 4.00 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              49934
  Free PE               9987
  Allocated PE          39947
  PV UUID               FHdy76-ugXD-fjxQ-PIws-B0CW-F94r-ZBumTY

@Papaul, do you know what I need to do to make the giant raid present to the OS?

(nevermind, I think I see what's happening)
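For reference, the extent counts in the pvdisplay output above can be converted to sizes with simple arithmetic. This is only a worked check of the numbers already shown. The resulting size of roughly 195 GiB is on the scale of the 223GB RAID1 virtual disk and nowhere near the 106.918TB RAID6, which suggests (as an inference, not something the task states) that only the OS disk had been added to LVM at this point.

```python
# Values copied from the pvdisplay output for /dev/sda5.
PE_SIZE_MIB = 4
TOTAL_PE = 49934
FREE_PE = 9987
ALLOCATED_PE = 39947


def pe_to_gib(extents: int, pe_size_mib: int = PE_SIZE_MIB) -> float:
    """Convert a count of physical extents to GiB."""
    return extents * pe_size_mib / 1024


# Free and allocated extents should add up to the total.
assert FREE_PE + ALLOCATED_PE == TOTAL_PE

print(f"total:     {pe_to_gib(TOTAL_PE):.2f} GiB")      # 195.05 GiB
print(f"allocated: {pe_to_gib(ALLOCATED_PE):.2f} GiB")  # 156.04 GiB
print(f"free:      {pe_to_gib(FREE_PE):.2f} GiB")       # 39.01 GiB
```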

Change 540898 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup2001: update raid config

https://gerrit.wikimedia.org/r/540898

Change 540898 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup2001: update raid config

https://gerrit.wikimedia.org/r/540898

Change 540899 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup partman: the second of many changes yet to come

https://gerrit.wikimedia.org/r/540899

Change 540899 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup partman: the second of many changes yet to come

https://gerrit.wikimedia.org/r/540899

Change 540906 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup: more partman tinkering

https://gerrit.wikimedia.org/r/540906

Change 540906 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup: more partman tinkering

https://gerrit.wikimedia.org/r/540906

Change 540917 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup: add some comments to partman recipe

https://gerrit.wikimedia.org/r/540917

Change 540917 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup: add some comments to partman recipe

https://gerrit.wikimedia.org/r/540917

Change 540921 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labstore backups: make backup interval configurable with hiera

https://gerrit.wikimedia.org/r/540921

Change 540923 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup2001: make a backup server

https://gerrit.wikimedia.org/r/540923

Change 540921 merged by Andrew Bogott:
[operations/puppet@production] labstore backups: make backup interval configurable with hiera

https://gerrit.wikimedia.org/r/540921

Change 540923 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup2001: make a backup server

https://gerrit.wikimedia.org/r/540923

@Papaul, we're happy with the raid setup for cloudbackup2001, so please set up cloudbackup2002 the same way when you have a chance. Thank you!

@Papaul, nevermind, it turns out I can do this from the mgmt console.

2001 is up and looks good. 2002 is blocked awaiting port setup.

*bump* @Papaul I'm still hoping to get cables/port setup for cloudbackup2002.

papaul@asw-c-codfw> show ethernet-switching interface xe-7/0/9    
Routing Instance Name : default-switch
Logical Interface flags (DL - disable learning, AD - packet action drop,
                         LH - MAC limit hit, DN - interface down,
                         SCTL - shutdown by Storm-control,
                         MMAS - Mac-move action shutdown, AS - Autostate-exclude enabled) 

Logical          Vlan          TAG     MAC         STP         Logical           Tagging 
interface        members               limit       state       interface flags  
xe-7/0/9.0                             294912                                     untagged   
                 private1-c-codfw 2019 294912      Forwarding                     untagged
papaul@asw-c-codfw> show interfaces descriptions | match xe-7/0/9 
xe-7/0/9        up    up   cloudbackup2002

I can't PXE boot, so something is broken somewhere. I haven't dug in much though.

Broadcom UNDI PXE-2.1 v214.0.170.0
Copyright (C) 2000-2018 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: B0 26 28 5A DE BC  GUID: 4C4C4544-0037-4A10-804D-B6C04F375832
PXE-E51: No DHCP or proxyDHCP offers were received.

PXE-M0F: Exiting Broadcom PXE ROM.

@Andrew the reason is that cloudbackup2002 is in the .16 network, but it is supposed to be in the .32 network since it is racked in row C. That was my mistake. You can move it from the .16 network to the .32 network.

Let me know if you have any questions.
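The mismatch described here can be checked mechanically: an address assigned to the host should fall inside the subnet of the VLAN its switch port is on (private1-c-codfw for xe-7/0/9). A minimal sketch follows, with assumptions: the /22 prefix length and the host parts of both addresses are hypothetical, since the task only mentions the prefixes 10.192.16 and 10.192.32.

```python
import ipaddress

# Assumed subnet for private1-c-codfw; the /22 prefix length is a guess.
ROW_C_SUBNET = ipaddress.ip_network("10.192.32.0/22")


def in_port_subnet(address: str, subnet=ROW_C_SUBNET) -> bool:
    """Return True if the address belongs to the subnet of the port's VLAN."""
    return ipaddress.ip_address(address) in subnet


# Hypothetical host addresses; only the first three octets come from the task.
old_address = "10.192.16.50"
new_address = "10.192.32.50"

for label, addr in (("old", old_address), ("new", new_address)):
    status = "matches" if in_port_subnet(addr) else "does NOT match"
    print(f"{label} address {addr} {status} {ROW_C_SUBNET}")
```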

Change 548004 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Move cloudbackup2002 from 10.192.16 to 10.192.32

https://gerrit.wikimedia.org/r/548004

Change 548004 merged by Andrew Bogott:
[operations/dns@master] Move cloudbackup2002 from 10.192.16 to 10.192.32

https://gerrit.wikimedia.org/r/548004

Attached patch doesn't seem to make a difference, but also IP address doesn't matter until after the debian installer begins, does it? I think I must be misunderstanding something :)

I can see that the installation is in progress

                                                                           
                                                                              
                                                                              
                                                                              
                                                                              
┌─────────────────────┤ Installing the base system ├──────────────────────┐   
│                                                                         │   
│                                   48%                                   │   
│                                                                         │   
│ Unpacking the base system...                                            │   
│                                                                         │   
└─────────────────────────────────────────────────────────────────────────┘

Note that cloudbackup2002.codfw.wmnet is still using the old mgmt password. Please update it to the new mgmt password.

Thanks

OS install complete, first puppet run complete.

Huh, it must've been something that caught up overnight. Thanks!

Andrew updated the task description.

@Andrew this task has been resolved, but both cloudbackup2001 and 2002 are showing "staged" in Netbox for status, and both arrays are showing "planned" too.

I marked both servers as active. I'm not sure I know where to find the arrays in netbox, can you direct me?