
rack/setup/deploy codfw dedicated backup recovery/provisioning hosts
Closed, ResolvedPublic

Description

This task will track the racking, setup, and installation of the dedicated backup recovery/provisioning hosts.

dbprov2001:
Rack location: Row A rack A4

  • - receive in system on procurement task T216138
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1 vlan for each row)
    • end on-site specific steps
  • - production dns entries added (private1-vlan for each row)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch)
  • - puppet accept/initial run
  • - handoff for service implementation

dbprov2002:
Rack location: Row B rack B4

  • - receive in system on procurement task T216138
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, private1 vlan for each row)
    • end on-site specific steps
  • - production dns entries added (private1-vlan for each row)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (stretch)
  • - puppet accept/initial run
  • - handoff for service implementation

Details

Related Gerrit Patches:

Event Timeline

Papaul created this task.Mar 14 2019, 6:27 PM
Restricted Application added a project: Operations. · View Herald TranscriptMar 14 2019, 6:27 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Papaul updated the task description. (Show Details)Mar 14 2019, 6:28 PM
Papaul added a comment.EditedMar 14 2019, 6:32 PM

@jcrespo @Marostegui

  • The last db server in codfw is db2096. Can you please replace the "name of the host" if we are going to use something other than db209[7-8]?
  • Can you please provide the HW RAID configuration?
  • Can you please provide the partman recipe?

A4 and B4 are 10GB racks; if these rack locations don't work for you, please let me know.

Thanks

@Papaul Please see my warnings at T216137#5002854 for Chris, which apply here. I had suggested using dbstore for these hosts, but @Marostegui didn't agree or disagree at the time, and I am no longer sure it is the right call, given there are dbstore hosts which are identical to other db* hosts. Maybe we should call them dbprovision2XXX or something similar, as they won't hold live databases? @Marostegui any thoughts?

In theory, you should set up the HDs in RAID6 and leave the SSDs unformatted and unconfigured for now. The partman recipe is not yet ready, as this is a new class of hw and usage, although if you keep only a single sda device as per above, it should be the same db.cfg as the other hosts. I will meet with Manuel and give you all the information you need soon. Sorry for the inconvenience.

Marostegui moved this task from Triage to In progress on the DBA board.Mar 14 2019, 6:49 PM

Thanks @Papaul!
The rack locations are fine I think.

The hostname: I think we still need to discuss it, as these hosts will not be normal databases (neither in config nor in hardware).
RAID configuration: Also still to be discussed, as these hosts have 2 SSDs (we don't know yet whether we will use RAID 0 or 1) + 8x2TB SATA disks (we don't know yet whether we will use RAID 5 or 6).

We will discuss those two things and we'll get back to you!
Thanks for creating this ticket!

@jcrespo I am still not sure if dbstore would be a good name, just because their hardware is completely different from the existing dbstoreXXXX, but on the other hand, having a different set of hostnames, puppet roles etc is also a mess. Let's keep thinking about it and sync up later.

For what it's worth: fine with RAID6 for the 2T SATA disks (I reckon that was the initial idea we discussed, so +1 to confirm it here).

Papaul triaged this task as Medium priority.Mar 14 2019, 7:03 PM

Thanks guys. Hopefully I will get the information needed before receiving the servers on 03/22/19.

Hello @Papaul

Jaime and myself discussed a few things.
Hostname: dbprov2001 dbprov2002
RAID0 for SSD
RAID6 for the SATA Disks
partman recipe: db.cfg but don't worry about it, we will probably install it ourselves as long as the RAIDs are done from your side.
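The capacity side of the RAID 5 vs RAID 6 question discussed above can be sketched with a few lines of arithmetic. This is a minimal illustration, not something from the hosts themselves: the function name is made up for the example, it only uses the 8x2TB figure mentioned in this task, and it ignores controller and filesystem overhead. The SSD size is not stated here, so the SSD array is left out.

```python
def usable_capacity_tb(level, disks, disk_tb):
    """Usable capacity in TB for a few standard RAID levels.

    Illustrative helper only: ignores controller metadata, hot spares
    and filesystem overhead.
    """
    minimum = {0: 1, 1: 2, 5: 3, 6: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} disks")
    if level == 0:
        return disks * disk_tb          # striping, no redundancy
    if level == 1:
        return disk_tb                  # mirror: capacity of one disk
    if level == 5:
        return (disks - 1) * disk_tb    # one disk's worth of parity
    return (disks - 2) * disk_tb        # RAID 6: two disks' worth of parity


# 8 x 2TB disks, as described in this task
print(usable_capacity_tb(6, 8, 2))  # 12
print(usable_capacity_tb(5, 8, 2))  # 14
```

RAID 6 gives up one more disk of capacity than RAID 5 in exchange for surviving two simultaneous disk failures.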

@Marostegui Thank you. What stripe size should we use?

Papaul added a comment.EditedMar 18 2019, 3:25 PM

switch port information

dbprov2001: asw-a4-codfw xe-4/0/18
dbprov2002: asw-b4-codfw xe-4/0/3

Papaul updated the task description. (Show Details)Mar 18 2019, 3:27 PM

Change 498119 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add mgmt and production DNS for dbprov200[1-2]

https://gerrit.wikimedia.org/r/498119

Change 498133 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2096

https://gerrit.wikimedia.org/r/498133

Mentioned in SAL (#wikimedia-operations) [2019-03-21T16:01:02Z] <marostegui> Poweroff db2096 for onsite maintenance T218336

Change 498133 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2096

https://gerrit.wikimedia.org/r/498133

jcrespo renamed this task from rack/setup/deploy dedicated backup recovery/provisioning hosts to rack/setup/deploy codfw dedicated backup recovery/provisioning hosts.Mar 21 2019, 4:45 PM

Allow me to edit the title to not confuse it with the same task that will be filed for eqiad :-D

Papaul updated the task description. (Show Details)Mar 21 2019, 4:51 PM

Change 498212 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address entries for dpprov200[1-2]

https://gerrit.wikimedia.org/r/498212

Papaul updated the task description. (Show Details)Mar 21 2019, 8:10 PM
papaul@asw-a-codfw# run show interfaces xe-4/0/18 descriptions       
Interface       Admin Link Description
xe-4/0/18       up    up   dbprov2001
papaul@asw-b-codfw> show interfaces xe-4/0/3 descriptions 
Interface       Admin Link Description
xe-4/0/3        up    up   dbprov2002
Papaul updated the task description. (Show Details)Mar 21 2019, 8:47 PM

@Marostegui @jcrespo all is set at my end: RAID 0 for the 2 SSDs and RAID 6 for the 8 other disks. I don't know who to assign the task to, so feel free to take it anytime. Let me know if you have any questions. Thanks.

Also please don't forget to merge the DHCP and DNS changes.

jcrespo removed Papaul as the assignee of this task.Mar 21 2019, 10:14 PM

Change 498119 merged by Marostegui:
[operations/dns@master] DNS: Fix mgmt DNS for dbprov2002

https://gerrit.wikimedia.org/r/498119

I think we need to decide how to install these from a partitioning point of view.
We can install them manually and not use the db.cfg but we need to define the mount points so the SSDs are mounted on a given directory and the rest on the SATA disks?

Cannot we try db.cfg, then mount and partition the ssds later?

Also, we should consider using buster.

SATA disks

All are SATA disks!

Cannot we try db.cfg, then mount and partition the ssds later?

Depending on which one the RAID controller picks as sda, the OS might get installed on the SSDs.

d-i partman-auto/disk   string  /dev/sda
d-i partman-auto/expert_recipe  string  es ::   \
        40000 40000 40000 ext4      \
            $primary{ }     \
            $bootable{ }        \
            method{ format }    \
            format{ }       \
            use_filesystem{ }   \
            filesystem{ ext4 }  \
            mountpoint{ / }     \

Also, we should consider using buster.

SATA disks

All are SATA disks!

You know exactly what I meant. Let me know how you prefer to call them to distinguish them

the OS might get installed on the SSDs.

And it will be faster to just try once than to do 4 manual installs :-)

You know exactly what I meant

Honestly, I don't, I know you call SATA either the SSDs or the HDs, but I cannot remember which ones. The old HDs were SAS and the new SSDs are SATA, so you probably mean SSD?

the OS might get installed on the SSDs.

And it will be faster to just try once than to do 4 manual installs :-)

ok - go for it.

You know exactly what I meant

Honestly, I don't, I know you call SATA either the SSDs or the HDs, but I cannot remember which ones. The old HDs were SAS and the new SSDs are SATA, so you probably mean SSD?

I call SATA the big ones and I call SSDs the SSD ones. Please suggest a different naming if that is confusing to you.

jcrespo claimed this task.Mar 22 2019, 6:41 PM
Marostegui added a subtask: Unknown Object (Task).Mar 25 2019, 6:13 AM

Change 498768 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Add db1139,db1140,dbprov200*

https://gerrit.wikimedia.org/r/498768

Change 499163 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Prepare dbprov2001/2 and future dbprov1001/2 for production

https://gerrit.wikimedia.org/r/499163

Change 499163 merged by Jcrespo:
[operations/puppet@production] mariadb: Prepare dbprov2001/2 and future dbprov1001/2 for production

https://gerrit.wikimedia.org/r/499163

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['dbprov2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201903270905_jynus_116352.log.

Change 498212 merged by Jcrespo:
[operations/puppet@production] DHCP: Add MAC address entries for dbprov200[1-2]

https://gerrit.wikimedia.org/r/498212

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['dbprov2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201903270928_jynus_121076.log.

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['dbprov2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201903271021_jynus_133092.log.

Change 499452 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] 10.in-addr.arpa: Fix typo

https://gerrit.wikimedia.org/r/499452

Change 499452 merged by Jcrespo:
[operations/dns@master] 10.in-addr.arpa: Fix typo

https://gerrit.wikimedia.org/r/499452

Change 499528 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Setup dbprov2001 as the backup server

https://gerrit.wikimedia.org/r/499528

Change 499528 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Setup dbprov2001 as the backup server

https://gerrit.wikimedia.org/r/499528

Mentioned in SAL (#wikimedia-operations) [2019-03-27T18:12:52Z] <jynus> update grants on db1115 for new provisioning hosts on codfw T218336

I've set up dbprov2001 and sent snapshots for codfw there.

We may have to think of a way to coordinate dbprov2001 and dbprov2002. I can see 2 options:

  • Because snapshots happen from cumin hosts, run a single cron there but modify the software to send each transfer to each db based on config, and control the concurrency in software (e.g. concurrency is per destination host, so 2 hosts with concurrency 2 means 2 processes, each running 2 simultaneous backups)
  • Set this up in puppet, and make 2 roles/profiles, each with its separate configuration, running on 2 separate crons (which can run at the same time)

I lean more towards #1 (single configuration backup for all sections); however, that may not work for dumps, which are initiated locally from the backup dumps. Or we can start to do dumps also from cumin. We need to give it a think.
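Option #1 above could look roughly like the following. This is only a sketch of the idea under several assumptions: the section names, the `BACKUP_PLAN` mapping and the `run_backup` stub are hypothetical and are not the actual backup software, and it uses threads where the comment above talks about processes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical single configuration: section -> destination host.
BACKUP_PLAN = {
    "section_a": "dbprov2001.codfw.wmnet",
    "section_b": "dbprov2001.codfw.wmnet",
    "section_c": "dbprov2001.codfw.wmnet",
    "section_d": "dbprov2002.codfw.wmnet",
    "section_e": "dbprov2002.codfw.wmnet",
}
CONCURRENCY_PER_HOST = 2


def group_by_destination(plan):
    """Group sections by the host that will receive their backup."""
    groups = defaultdict(list)
    for section, host in plan.items():
        groups[host].append(section)
    return dict(groups)


def run_backup(section, host):
    """Placeholder for the real transfer; just reports what it would do."""
    return (section, host)


def _run_host(host, sections, concurrency, backup):
    # At most `concurrency` simultaneous backups towards this host.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda s: backup(s, host), sections))


def run_all(plan, concurrency=CONCURRENCY_PER_HOST, backup=run_backup):
    """Run every backup in the plan, limiting concurrency per destination."""
    groups = group_by_destination(plan)
    results = []
    # One outer worker per destination host, so hosts proceed in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as outer:
        futures = [
            outer.submit(_run_host, host, sections, concurrency, backup)
            for host, sections in groups.items()
        ]
        for future in futures:
            results.extend(future.result())
    return results


if __name__ == "__main__":
    for section, host in run_all(BACKUP_PLAN):
        print(f"{section} -> {host}")
```

Keeping the section-to-host mapping in one place means adding a third provisioning host would only change the configuration, not the scheduling code.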

jcrespo updated the task description. (Show Details)Mar 27 2019, 6:20 PM

I would prefer option #1 because it scales better for the future if we need more hosts and it looks cleaner in general, a central place where everything is handled from.

Also, I think it makes sense to initialize dumps from cumin as well, rather than from the hosts themselves: again, it is easier to coordinate and easier to remember when everything is in the same place.

Change 499599 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Require wmf mariadb package present

https://gerrit.wikimedia.org/r/499599

Change 499599 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Require wmf mariadb package present

https://gerrit.wikimedia.org/r/499599

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['dbprov2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201903290903_jynus_220305.log.

jcrespo reassigned this task from jcrespo to Papaul.Mar 29 2019, 9:32 AM

@Papaul we need help from you.

We cannot network boot on dbprov2002 (we did on dbprov2001 already).

PXE-E51: No DHCP or proxyDHCP offers were received.

It tries to listen for those on B0 26 28 21 59 38 which is the one configured on the dhcp server.
DNS also seems deployed and correct (it was done at the same time as for dbprov2001, which installed successfully).
The right card/port has link up vs the other unconnected card/port which shows no link.

I wonder if there could be some confusion in terms of vlan or correct configuration of the port at ip level.

Thanks,

@jcrespo the problem was that ge-4/0/3, and not xe-4/0/3, was part of private1-b-codfw, so the install is now in progress. I will hand you the task once the install has finished.
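This kind of mismatch (the same port position configured under a different media prefix, ge- instead of xe-) is easy to miss by eye. A minimal sketch of a check for it follows; the function name and the VLAN member list passed to it are hypothetical, and it assumes the member interface names have already been collected from the switch by some other means.

```python
import re

# Interface names of the form <media>-<fpc>/<pic>/<port>, e.g. xe-4/0/3
_IFACE = re.compile(r"^(?P<media>[a-z]+)-(?P<fpc>\d+)/(?P<pic>\d+)/(?P<port>\d+)$")


def find_media_mismatch(expected, vlan_members):
    """Return VLAN members at the same fpc/pic/port as `expected`
    but with a different media prefix (e.g. ge- instead of xe-)."""
    want = _IFACE.match(expected)
    if want is None:
        raise ValueError(f"unrecognised interface name: {expected}")
    position = want.group("fpc", "pic", "port")
    mismatches = []
    for member in vlan_members:
        got = _IFACE.match(member)
        if got is None:
            continue  # skip names this sketch does not understand
        same_position = got.group("fpc", "pic", "port") == position
        if same_position and got.group("media") != want.group("media"):
            mismatches.append(member)
    return mismatches


# Hypothetical member list for illustration only
print(find_media_mismatch("xe-4/0/3", ["ge-4/0/3", "xe-4/0/18"]))  # ['ge-4/0/3']
```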

Papaul reassigned this task from Papaul to jcrespo.Mar 29 2019, 6:56 PM

@jcrespo all yours let me know if you have any questions

Thanks, it installed with no issues.

Change 500683 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Setup dbprov2002

https://gerrit.wikimedia.org/r/500683

Change 500683 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Setup dbprov2002

https://gerrit.wikimedia.org/r/500683

jcrespo closed this task as Resolved.Apr 3 2019, 3:09 PM
jcrespo reassigned this task from jcrespo to Papaul.
jcrespo updated the task description. (Show Details)

This is done, except for the problems with the mount point of the SSDs, which will be handled separately.

Change 508543 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Chmod /srv/backups/dumps o+x so the disk space check works

https://gerrit.wikimedia.org/r/508543

Change 508543 merged by Jcrespo:
[operations/puppet@production] mariadb: Chmod /srv/backups/dumps o+x so the disk space check works

https://gerrit.wikimedia.org/r/508543