(Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of dbprov2003.codfw.wmnet

Hostname / Racking / Installation Details

Hostnames: dbprov2003
Racking Proposal: Should not share rows with existing hosts on A4 & B4 for redundancy. 10G required. If not possible, it should not share the same racks.
Networking/Subnet/VLAN/IP: Production network (where production mysql lives), 10G required for high transfer/backup restore to mysql servers.
Partitioning/Raid: Raid setup: Raid6 on the HDDs (create first so it is the first virtual drive, sda), Raid0 on the SSDs (create second so it is the second virtual drive, sdb). Both use the same options as the dbs: write back with a 256K stripe.
OS Distro: Buster
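
The racking rule above (prefer a different row from the existing hosts, and at minimum a different rack) can be sketched as a small check. This is only an illustration: the function names and the three result labels are hypothetical, and it assumes rack labels of the form row letter plus number, such as A4 or D4.

```python
def rack_row(rack: str) -> str:
    """Return the row letter of a rack label such as 'D4'."""
    return rack[0].upper()


def placement_level(candidate: str, existing: list[str]) -> str:
    """Classify a candidate rack against racks already used by sibling hosts.

    'preferred'  : different row from every existing host
    'acceptable' : shares a row with an existing host, but not a rack
    'rejected'   : shares a rack with an existing host
    """
    candidate = candidate.upper()
    existing = [rack.upper() for rack in existing]
    if candidate in existing:
        return "rejected"
    if rack_row(candidate) in {rack_row(rack) for rack in existing}:
        return "acceptable"
    return "preferred"


# Existing hosts are on A4 and B4; the proposed location is D4.
print(placement_level("D4", ["A4", "B4"]))
```

With the existing hosts on A4 and B4, a rack in row D satisfies the stricter of the two conditions.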

Per host setup checklist

dbprov2003.codfw.wmnet: row D rack D4 xe-4/0/20

  • - receive in system on procurement task T257550 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

Restricted Application added a subscriber: Aklapper.
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH moved this task from Backlog to Acknowledged on the SRE board.
RobH added a parent task: Unknown Object (Task).
RobH removed a subscriber: RobH.
RobH renamed this task from (Need By: 2020-09-30) rack/setup/install dbprov2003.codfw.wmnet to (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet. (Jul 23 2020, 8:22 PM)
Interface       Admin Link Description
xe-4/0/20       up    up   dbprov2003

Logical          Vlan          TAG     MAC         STP         Logical           Tagging
interface        members               limit       state       interface flags
xe-4/0/20.0                            294912                                     untagged
                 private1-d-codfw 2020 294912      Forwarding                     untagged

Name            State   Layout  Size       Media Type  Read Policy  Write Policy  Stripe Size  Secured  Remaining Redundancy
Virtual Disk 0  Online  RAID-6  11175 GB   HDD         Read Ahead   Write Back    256K         No       2
Virtual Disk 1  Online  RAID-0  1787.5 GB  SSD         Read Ahead   Write Back    256K         No       0
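
As a rough illustration, the virtual disk listing above can be checked against the requested setup (RAID-6 on HDDs first, RAID-0 on SSDs second, both write back with a 256K stripe). The `VirtualDisk` class and `check_layout` function are hypothetical names for this sketch; the values are typed in by hand from the listing, not read from the controller.

```python
from dataclasses import dataclass


@dataclass
class VirtualDisk:
    name: str
    layout: str
    media: str
    write_policy: str
    stripe: str


# Expected (layout, media) per virtual disk, in creation order: sda, then sdb.
EXPECTED = [("RAID-6", "HDD"), ("RAID-0", "SSD")]


def check_layout(disks: list[VirtualDisk]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(disks) != len(EXPECTED):
        problems.append(f"expected {len(EXPECTED)} virtual disks, found {len(disks)}")
    for disk, (layout, media) in zip(disks, EXPECTED):
        if (disk.layout, disk.media) != (layout, media):
            problems.append(f"{disk.name}: expected {layout} on {media}")
        if disk.write_policy != "Write Back":
            problems.append(f"{disk.name}: write policy is {disk.write_policy}")
        if disk.stripe != "256K":
            problems.append(f"{disk.name}: stripe size is {disk.stripe}")
    return problems


disks = [
    VirtualDisk("Virtual Disk 0", "RAID-6", "HDD", "Write Back", "256K"),
    VirtualDisk("Virtual Disk 1", "RAID-0", "SSD", "Write Back", "256K"),
]
print(check_layout(disks) or "layout matches the request")
```

The order matters here because the first virtual disk becomes sda and the second becomes sdb.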

Change 618546 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add production DNS for dbprov2003

https://gerrit.wikimedia.org/r/618546

Change 618546 merged by Papaul:
[operations/dns@master] DNS: Add production DNS for dbprov2003

https://gerrit.wikimedia.org/r/618546

Change 618548 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address for dbprov2003

https://gerrit.wikimedia.org/r/618548

Change 618548 merged by Papaul:
[operations/puppet@production] DHCP: Add MAC address for dbprov2003

https://gerrit.wikimedia.org/r/618548

They use the custom/db.cfg recipe, but only on first install; after that they are moved to custom/reuse-dbprov.cfg. That is why it is important to set up the root filesystem on the HDDs (sda).

Change 618549 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] Add dbprov2003 to site.pp

https://gerrit.wikimedia.org/r/618549

Change 618549 merged by Papaul:
[operations/puppet@production] Add dbprov2003 to site.pp

https://gerrit.wikimedia.org/r/618549

Change 618555 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Add dbprov2003 to the db.cfg partman recipe list

https://gerrit.wikimedia.org/r/618555

Change 618555 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Add dbprov[12]003 to the db.cfg partman recipe list

https://gerrit.wikimedia.org/r/618555

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

dbprov2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202008051821_pt1979_4853_dbprov2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['dbprov2003.codfw.wmnet']

Of which those FAILED:

['dbprov2003.codfw.wmnet']

This was the only issue we had the last time with the same hw and recipe: T218336#5068836

Change 618593 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Fix MAC address for dbprov2003

https://gerrit.wikimedia.org/r/618593

Change 618593 merged by Papaul:
[operations/puppet@production] DHCP: Fix MAC address for dbprov2003

https://gerrit.wikimedia.org/r/618593

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

dbprov2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202008051920_pt1979_12866_dbprov2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['dbprov2003.codfw.wmnet']

and were ALL successful.

Papaul updated the task description.

This is done

Thank you very much for your work, Papaul. I really appreciate how fast this was completed!

Hey, Papaul, just to rule out intruders in the DC or other major hardware issues: could you have accidentally pressed the power button of this dbprov2003 host (a host you recently set up correctly) around 16 UTC? This is not an issue if it was you; I just want to rule out that it was something else (e.g. something half-pressing the power button after installation, like a cable).

Aug 20 16:00:47 dbprov2003 systemd-logind[1310]: Power key pressed
Aug 20 16:00:47 dbprov2003 systemd-logind[1310]: Powering Off...
Aug 20 16:00:47 dbprov2003 systemd-logind[1310]: System is powering down.

(We found it switched off)

It is a non-issue if it was pressed by accident while installing other hardware, or if it was otherwise a one-time thing.
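
For checking whether this recurs, here is a minimal sketch that scans syslog-style lines for the systemd-logind "Power key pressed" message quoted above. The regular expression and the `power_key_events` function are assumptions made for illustration, based only on the format of those three lines.

```python
import re

# Matches lines such as:
#   Aug 20 16:00:47 dbprov2003 systemd-logind[1310]: Power key pressed
POWER_KEY = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) systemd-logind\[\d+\]: Power key pressed$"
)


def power_key_events(lines):
    """Return (timestamp, host) for every power-key press found in lines."""
    events = []
    for line in lines:
        match = POWER_KEY.match(line.strip())
        if match:
            events.append((match["timestamp"], match["host"]))
    return events


sample = [
    "Aug 20 16:00:47 dbprov2003 systemd-logind[1310]: Power key pressed",
    "Aug 20 16:00:47 dbprov2003 systemd-logind[1310]: Powering Off...",
    "Aug 20 16:00:47 dbprov2003 systemd-logind[1310]: System is powering down.",
]
print(power_key_events(sample))
```

Only the first of the three lines is reported, since the other two are consequences of the key press rather than separate events.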

dbprov2003 is in D4, and I do not recall working in D4 yesterday when on site. I worked in D2 and C3. The only action taken in D4 yesterday was to close the cabinet door, since it was not closed all the way, so that may be the only possible cause.

"to close the cabinet door"

Is there any chance that could have pressed the power button? Do these servers even have a power button, other than the ones on the power supplies?

Closing again, as it has not happened again.