rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet
Open, Normal, Public

Description

This task will track the racking, setup, and installation of the 2 new cloud backup hosts received in T210666

Rack proposal: please advise
Wiring configuration: please advise
Name proposal: please advise

@Andrew or @aborrero: please provide me with the information needed. Once finished, you can just assign the task back to me.

Thanks

cloudbackup2001.codfw.wmnet : Row A rack A7
cloudbackup2001-array1

  • - receive in system on procurement task T210666
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID: Please provide HW RAID configuration and partman recipe
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged & IMMEDIATELY RUN/SIGN PUPPET
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

cloudbackup2002.codfw.wmnet: Row C rack C7
cloudbackup2002-array1

  • - receive in system on procurement task T210666
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - RAID: Please provide HW RAID configuration and partman recipe
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup
    • end on-site specific steps
  • - production dns entries
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch) - update netbox status to staged & IMMEDIATELY RUN/SIGN PUPPET
  • - puppet accept/initial run
  • - handoff for service implementation - service implementation team must change status from staged to active

Event Timeline

Papaul created this task. May 28 2019, 10:46 PM
Restricted Application added a project: Operations. May 28 2019, 10:46 PM
Restricted Application added a subscriber: Aklapper.
aborrero reassigned this task from Andrew to Papaul. May 29 2019, 9:16 AM
aborrero triaged this task as Normal priority.

Rack proposal: anywhere in codfw, each server in a different rack, a rack with 10G support
Wiring configuration: single 10G connection each server if possible. The mgmt interface connected as in any other server (standard).
Name proposal: Per T210666#4916941, we agreed on calling them cloudbackup. They can be called cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org. I added the entry to https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers

aborrero renamed this task from rack/setup codfw: cloudstore (backups) to rack/setup codfw: cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org. May 29 2019, 9:23 AM
aborrero updated the task description.
Papaul updated the task description. May 29 2019, 3:31 PM
Papaul updated the task description. May 29 2019, 3:35 PM

On second thought, we would like to switch from the public VLAN to a private one, i.e. from .wikimedia.org to .wmnet.

@Papaul is it too late to suggest switching to cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet? The team has been discussing this, and it is probably better to have them on the private network.

aborrero renamed this task from rack/setup codfw: cloudbackup2001.wikimedia.org and cloudbackup2002.wikimedia.org to rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet. May 29 2019, 4:47 PM
aborrero updated the task description.

On IRC:

18:56 <papaul> arturo: I have a RAID setup that is unknown for now. Please discuss with your team and provide me with the information. I have only the MD on site and not the PE, so once I receive the PE I need to move fast on this task. Thanks
Andrew added a comment. Jun 4 2019, 5:18 PM

raid config:

  • raid1 for the two os volumes
  • one big raid6 w/lvm for the remaining internal drives
  • another big raid6 w/lvm for the shelf
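As a rough aid for sizing a layout like the one above, the sketch below estimates usable capacity per array. It relies only on the general rules that RAID1 keeps one disk's worth of a mirror set and RAID6 gives up two disks to parity. The disk counts and sizes in the example layout are placeholders for illustration, not the actual hardware in these hosts.

```python
def usable_capacity(level: str, disks: int, disk_size: float) -> float:
    """Return usable capacity, in the same unit as disk_size."""
    if level == "raid1":
        if disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        # All members mirror each other, so only one disk's worth is usable.
        return disk_size
    if level == "raid6":
        if disks < 4:
            raise ValueError("RAID6 needs at least 4 disks")
        # Two disks' worth of space is consumed by parity.
        return (disks - 2) * disk_size
    raise ValueError(f"unsupported RAID level: {level}")


# Hypothetical layout: (level, number of disks, size per disk in TB).
# These figures are assumptions and do not describe cloudbackup200[1-2].
example_layout = {
    "os": ("raid1", 2, 0.24),
    "internal": ("raid6", 12, 10.0),
    "shelf": ("raid6", 12, 10.0),
}

for name, (level, disks, size) in example_layout.items():
    print(f"{name}: {usable_capacity(level, disks, size):.2f} TB usable")
```

Each RAID6 array would then be handed to LVM as a single physical volume, which matches the "raid6 w/lvm" wording in the list.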

Both MDs are racked and Netbox is updated.

Papaul updated the task description. Jun 6 2019, 4:04 PM
Papaul updated the task description. Jun 11 2019, 2:27 PM
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board. Jun 12 2019, 1:11 PM

PowerEdge
Virtual Disk 0: RAID1, 223GB, Ready
Virtual Disk 1: RAID6, 106.918TB, Ready

Storage MD
Virtual Disk 0: RAID6, 106.918TB, Ready

Papaul updated the task description. Jun 20 2019, 3:07 PM

@Bstorm when talking about the private network for those hosts, are you referring to the private1 network or the lab private network?

Thanks.

Bstorm added a subscriber: bd808. Jul 1 2019, 6:46 PM

private1, I believe. I mean the internal network. I *think* we are moving away from the lab private network, right @bd808 ?

bd808 added a comment. Jul 1 2019, 7:35 PM

> private1, I believe. I mean the internal network. I *think* we are moving away from the lab private network, right @bd808 ?

Yes, we have been slowly deprecating the cloud-support* VLANs on the basis that the general SRE team found them to be confusing. Things that would have once been placed in those VLANs have been going into the public* VLANs instead for the last 2 years or so.

These servers are 100% WMCS internal infrastructure and not expected to have direct interaction from Cloud VPS instance space. Basically we need these hosts to be networked such that we can push (and restore) backups to them from the {lab,cloud}store servers in eqiad. I think that means the private1-*-codfw VLANs would be an ok place for them, but the public1-*-codfw VLANs may work too and would decrease the chance of any breach on the cloudbackup* hosts providing a jumping off point for attacking production MediaWiki or databases. I'm not sure what switch/router ACL changes would be needed to host them in public1-*-codfw and if that is considered problematic or not.

Change 520137 had a related patch set uploaded (by Papaul; owner: pt1979):
[operations/dns@master] DNS: Add mgmt and production DNS for cloudbackup200[1-2]

https://gerrit.wikimedia.org/r/520137

Change 520137 merged by Arturo Borrero Gonzalez:
[operations/dns@master] DNS: Add mgmt and production DNS for cloudbackup200[1-2]

https://gerrit.wikimedia.org/r/520137

Papaul updated the task description. Jul 2 2019, 3:05 PM

Change 520257 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Install/partman for cloudbackup2001/2002

https://gerrit.wikimedia.org/r/520257

Change 520257 merged by Andrew Bogott:
[operations/puppet@production] Install/partman for cloudbackup2001/2002

https://gerrit.wikimedia.org/r/520257

Papaul updated the task description. Jul 3 2019, 4:02 PM
Papaul updated the task description. Jul 3 2019, 4:46 PM
Papaul reassigned this task from Papaul to Andrew. Jul 3 2019, 4:49 PM

@Bstorm @Andrew OS install and puppet run are done on cloudbackup2001. It is all yours. Run all your tests and, if you are happy, please assign the task back to me so I can do the HW RAID configuration on cloudbackup2002.
Thanks

@Bstorm @Andrew These were originally installed with Stretch, but in the meantime Buster was released. Shall we reimage them before they are put to use?

Change 540668 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] move cloudbackup2001/2002 to Buster

https://gerrit.wikimedia.org/r/540668

Change 540668 merged by Andrew Bogott:
[operations/puppet@production] move cloudbackup2001/2002 to Buster

https://gerrit.wikimedia.org/r/540668

Andrew added a comment. Edited Thu, Oct 3, 8:44 PM

I am finally back looking at this! I'm not sure quite what I should expect regarding the raids here -- I reimaged (for buster) and saw the partitioner offer to create two volumes, but in the OS I only see one:

root@cloudbackup2001:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/sda5
  VG Name               cloudbackup2001-vg
  PV Size               <195.06 GiB / not usable 4.00 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              49934
  Free PE               9987
  Allocated PE          39947
  PV UUID               FHdy76-ugXD-fjxQ-PIws-B0CW-F94r-ZBumTY

@Papaul, do you know what I need to do to make the giant raid present to the OS?
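The figures in that pvdisplay output can be cross-checked with a little arithmetic. The sketch below uses only values copied from the output above. The variable and function names are illustrative, and the closing interpretation is an inference from the numbers rather than something confirmed in this task.

```python
# Values copied from the pvdisplay output for /dev/sda5.
PE_SIZE_MIB = 4
TOTAL_PE = 49934
FREE_PE = 9987
ALLOCATED_PE = 39947


def extents_to_gib(extents: int, pe_size_mib: int = PE_SIZE_MIB) -> float:
    """Convert a count of LVM physical extents to GiB."""
    return extents * pe_size_mib / 1024


# Free and allocated extents should add up to the total.
assert FREE_PE + ALLOCATED_PE == TOTAL_PE

total_gib = extents_to_gib(TOTAL_PE)
free_gib = extents_to_gib(FREE_PE)
allocated_gib = extents_to_gib(ALLOCATED_PE)

print(f"total:     {total_gib:.2f} GiB")
print(f"allocated: {allocated_gib:.2f} GiB")
print(f"free:      {free_gib:.2f} GiB")
```

The total comes to roughly 195 GiB, in line with the "<195.06 GiB" PV size. That is close to what a partition on the 223GB RAID1 virtual disk would provide and far below the 106.918TB RAID6 virtual disks, which suggests the volume group covers only the OS disk.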

Andrew added a comment. Fri, Oct 4, 3:14 PM

(never mind, I think I see what's happening)

Change 540898 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup2001: update raid config

https://gerrit.wikimedia.org/r/540898

Change 540898 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup2001: update raid config

https://gerrit.wikimedia.org/r/540898

Change 540899 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup partman: the second of many changes yet to come

https://gerrit.wikimedia.org/r/540899

Change 540899 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup partman: the second of many changes yet to come

https://gerrit.wikimedia.org/r/540899

Change 540906 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup: more partman tinkering

https://gerrit.wikimedia.org/r/540906

Change 540906 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup: more partman tinkering

https://gerrit.wikimedia.org/r/540906

Change 540917 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup: add some comments to partman recipe

https://gerrit.wikimedia.org/r/540917

Change 540917 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup: add some comments to partman recipe

https://gerrit.wikimedia.org/r/540917

Change 540921 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labstore backups: make backup interval configurable with hiera

https://gerrit.wikimedia.org/r/540921

Change 540923 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudbackup2001: make a backup server

https://gerrit.wikimedia.org/r/540923

Change 540921 merged by Andrew Bogott:
[operations/puppet@production] labstore backups: make backup interval configurable with hiera

https://gerrit.wikimedia.org/r/540921

Change 540923 merged by Andrew Bogott:
[operations/puppet@production] cloudbackup2001: make a backup server

https://gerrit.wikimedia.org/r/540923