(Need By: 2021-03-31) rack/setup/install ms-backup100[12]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of ms-backup100[12]

Hostname / Racking / Installation Details

Hostnames: ms-backup1001, ms-backup1002
Racking Proposal: Redundant to each other; preferably in different rows, or if not, in different racks, so as to have power and network redundancy.
Networking/Subnet/VLAN/IP: 10G, single production link, single mgmt link.
Partitioning/Raid: Software RAID1 between disks is enough.
OS Distro: Buster
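The racking preference above (different rows first, different racks as a fallback) can be sketched as a small check. This is only an illustration: it assumes rack names of the form "A4", where the leading letter is the row, and the function names are hypothetical rather than part of any existing tooling.

```python
from typing import NamedTuple


class Location(NamedTuple):
    row: str
    rack: str


def parse_location(code: str) -> Location:
    # Assumption: a rack is named "<row letter><number>", e.g. "A4" or "C2".
    code = code.strip().upper()
    return Location(row=code[0], rack=code)


def redundancy_level(rack_a: str, rack_b: str) -> str:
    """Classify how well two hosts are separated, best case first."""
    a, b = parse_location(rack_a), parse_location(rack_b)
    if a.row != b.row:
        return "different rows"
    if a.rack != b.rack:
        return "different racks"
    return "same rack"
```

For example, `redundancy_level("A4", "C2")` gives "different rows", which is the preferred outcome, while two hosts in the same rack would be flagged as "same rack".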

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ms-backup1001:

  • - receive in system on procurement task T272018 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

ms-backup1002:

  • - receive in system on procurement task T272018 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - system incorrectly plugged into port 42 on switch, but reads 41 on this task and netbox, needs to be moved by onsite.
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH renamed this task from (Need By: TBD) rack/setup/install ms-backup100[12] to (Need By: 2021-03-31) rack/setup/install ms-backup100[12]. Feb 8 2021, 10:13 PM
RobH created this task.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).
RobH added subscribers: Jclark-ctr, jcrespo.

@jcrespo: Conversation between you and Arzhel on T272018 seems to indicate some kind of discussion is still pending for these to determine where they can be racked? They just need to be in different 10G racks than one another as far as I can tell, please advise on this task and assign to @Jclark-ctr once you have done so, as this is now pending order.

jcrespo added subscribers: ayounsi, RobH.

@RobH it got resolved. It will go in the regular production internal vlan (same as ms-fe hosts). In the future there is a chance it will move to the public vlan to get a public ip, but that is not the current plan, and @ayounsi mentioned if that is the case, it will not require physical location changes, just a network config change.

RobH removed a subscriber: RobH.
Jclark-ctr updated the task description.

Racked & cabled hosts; handing over to Chris for configuration.
ms-backup1001: A4 U4 P3 ID5322
ms-backup1002: C2 U41 P41 ID5587

Cmjohnson updated the task description.
Cmjohnson added subscribers: RobH, Cmjohnson.

on-site work completed, moving to @RobH to finish installs

Please note that ms-backup1002 shows port 41 in netbox and on this task, but the idrac interface shows it is plugged into port 42.

@Jclark-ctr: It seems like perhaps you plugged this into port 42 by mistake, can you please move it to port 41?

Please move the connection for ms-backup1002 from port 42 (shows 42 on idrac) to port 41, as listed on this task via T274206#6859101. Once that is done, please assign back to me for followup and bulk imaging of this task.

IRC update, Chris will be onsite next and can move this cable.

Summary: ms-backup1002 in C2:U41 has DAC cable 5887 plugged into switch port 42 when it should be plugged into switch port 41. Please move to port 41 and reassign this task back to me for imaging, thanks!

the ports were reversed, kafka-logging1002 was in 42 and ms-backup1002 was in 41. I swapped them and all should be good now.
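The kind of mismatch discussed above (the port recorded in netbox versus the port the host actually reports) could be caught with a simple comparison. The sketch below is hypothetical: it takes two plain dictionaries as input and does not call netbox or the idrac, and the observed value for ms-backup1001 in the usage note is an assumption made only for illustration.

```python
from typing import Dict, Tuple


def port_mismatches(
    recorded: Dict[str, int], observed: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return {host: (recorded_port, observed_port)} for hosts that disagree.

    Hosts missing from `observed` are skipped, not reported.
    """
    return {
        host: (recorded[host], observed[host])
        for host in recorded
        if host in observed and recorded[host] != observed[host]
    }
```

Called with `recorded = {"ms-backup1001": 3, "ms-backup1002": 41}` and `observed = {"ms-backup1001": 3, "ms-backup1002": 42}`, it would report only ms-backup1002, with the pair (41, 42).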

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['ms-backup1001.eqiad.wmnet', 'ms-backup1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103092308_robh_24019.log.

Change 670318 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] ms-backup100[12] updates

https://gerrit.wikimedia.org/r/670318

This comment was removed by Jclark-ctr.

Change 670318 merged by RobH:
[operations/puppet@production] ms-backup100[12] updates

https://gerrit.wikimedia.org/r/670318

RobH updated the task description.
RobH removed subscribers: Cmjohnson, Jclark-ctr.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['ms-backup1001.eqiad.wmnet', 'ms-backup1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103092325_robh_26266.log.

Change 670323 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] fixing mac for ms-backup100[12]

https://gerrit.wikimedia.org/r/670323

Change 670323 merged by RobH:
[operations/puppet@production] fixing mac for ms-backup100[12]

https://gerrit.wikimedia.org/r/670323

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['ms-backup1001.eqiad.wmnet', 'ms-backup1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103092344_robh_29801.log.

Completed auto-reimage of hosts:

['ms-backup1001.eqiad.wmnet', 'ms-backup1002.eqiad.wmnet']

and were ALL successful.

RobH updated the task description.

I just realized with jbond that the 2 servers (ms-backup1001.eqiad.wmnet and ms-backup1002.eqiad.wmnet) do not have IPv6 AAAA DNS records.

While this is not a production-breaking issue, it is now causing slowdowns (IPv6 timeouts), as I cannot open, through DNS, the firewall of another host for these clients, which slows down a potential media recovery.

Before making any configuration change on my own, I would like to double-check with dcops that the IPv6 assignment/network configuration/inventory on your records is correct (the IPv6 stack on both servers seems to work ok!), and that this was just a case of a pending step/script failure, etc., and not something else, like lost records or something weirder:

https://netbox.wikimedia.org/ipam/ip-addresses/7916/
https://netbox.wikimedia.org/ipam/ip-addresses/7919/

The DNS records for ms-backup200[12] on codfw seem to be ok.
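A quick way to list hosts without an IPv6 address is sketched below. It is a minimal illustration, not the procedure used here: `missing_aaaa` and `has_ipv6_address` are hypothetical helper names, and `socket.getaddrinfo` goes through the system resolver (which may also consult /etc/hosts), so it is only a rough proxy for "has an AAAA record".

```python
import socket
from typing import Callable, Iterable, List


def has_ipv6_address(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """True if the resolver returns at least one IPv6 result for hostname."""
    try:
        results = resolver(hostname, None, socket.AF_INET6)
    except socket.gaierror:
        # Resolution failed: no IPv6 address is known for this name.
        return False
    return len(results) > 0


def missing_aaaa(
    hostnames: Iterable[str], resolver: Callable = socket.getaddrinfo
) -> List[str]:
    """Return the hostnames for which no IPv6 address could be resolved."""
    return [h for h in hostnames if not has_ipv6_address(h, resolver)]
```

Running `missing_aaaa(["ms-backup1001.eqiad.wmnet", "ms-backup1002.eqiad.wmnet"])` from a host that can resolve internal names would list whichever of the two still lacks an IPv6 address. The `resolver` parameter exists only so the check can be exercised without network access.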

RobH removed a subscriber: RobH.

@jcrespo I believe this would've been a script failure/issue. I would only select the skip IPv6 button if the task called for it.

Thank you for the feedback! As we all think this was not deliberately disabled (I've seen the setup scripts fail on some steps sometimes), I will follow https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_can_I_add_the_IPv6_AAAA/PTR_records_to_a_host_that_doesn't_have_it? as documented, and that should take care of it.

It is likely super-easy and fast, but I am not going to change DNS minutes away from my Christmas vacation, so I will take care of this after I return in 2022.

Claiming the task to fix in a few weeks.

jcrespo updated Other Assignee, added: Cmjohnson.

Deployed the change as follows:

commit baeb288d4d9713814ac88e9537bbcf0ece5bb9e4                                                
Author: generate-dns-snippets <noc@wikimedia.org>                                              
Date:   Mon Jan 10 16:17:22 2022 +0000                                                         
                                                                                               
    root@cumin1001: Add IPv6 dns records for ms-backup1001/2                                   
                                                                                               
diff --git a/1.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa b/1.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa                                                                                              
index 6e27ad2..cabd95e 100644                                                                  
--- a/1.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa                                                 
+++ b/1.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa                                                 
@@ -49,6 +49,7 @@                                                                              
 7.0.1.0.0.0.0.0.4.6.0.0.0.1.0.0 1H IN PTR aqs1004.eqiad.wmnet.                                
 8.0.1.0.0.0.0.0.4.6.0.0.0.1.0.0 1H IN PTR netboxdb1001.eqiad.wmnet.                           
 0.1.1.0.0.0.0.0.4.6.0.0.0.1.0.0 1H IN PTR kafkamon1002.eqiad.wmnet.                           
+1.1.1.0.0.0.0.0.4.6.0.0.0.1.0.0 1H IN PTR ms-backup1001.eqiad.wmnet.                          
 2.1.1.0.0.0.0.0.4.6.0.0.0.1.0.0 1H IN PTR krb1001.eqiad.wmnet.                                
 7.1.1.0.0.0.0.0.4.6.0.0.0.1.0.0 1H IN PTR kubemaster1001.eqiad.wmnet.                         
 9.1.1.0.0.0.0.0.4.6.0.0.0.1.0.0 1H IN PTR grafana1002.eqiad.wmnet.                            
diff --git a/3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa b/3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa                                                                                              
index 711b906..bf82f76 100644                                                                  
--- a/3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa                                                 
+++ b/3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa                                                 
@@ -40,6 +40,7 @@                                                                              
 7.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0 1H IN PTR matomo1002.eqiad.wmnet.                             
 8.3.1.0.2.3.0.0.4.6.0.0.0.1.0.0 1H IN PTR aqs1005.eqiad.wmnet.                                
 3.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0 1H IN PTR registry1004.eqiad.wmnet.                           
+4.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0 1H IN PTR ms-backup1002.eqiad.wmnet.                          
 8.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0 1H IN PTR mc1046.eqiad.wmnet.                                 
 9.4.1.0.2.3.0.0.4.6.0.0.0.1.0.0 1H IN PTR thumbor1006.eqiad.wmnet.                            
 1.5.1.0.2.3.0.0.4.6.0.0.0.1.0.0 1H IN PTR mc1047.eqiad.wmnet.                                 
diff --git a/eqiad.wmnet b/eqiad.wmnet                                                         
index c35be41..cc14904 100644                                                                  
--- a/eqiad.wmnet                                                                              
+++ b/eqiad.wmnet                                                                              
@@ -1007,7 +1007,9 @@ moss-fe1001                              1H IN AAAA 2620:0:861:101:10:64:0:45                                                                                           
 moss-fe1002                              1H IN A 10.64.48.76                                  
 moss-fe1002                              1H IN AAAA 2620:0:861:107:10:64:48:76                
 ms-backup1001                            1H IN A 10.64.0.111                                  
+ms-backup1001                            1H IN AAAA 2620:0:861:101:10:64:0:111                
 ms-backup1002                            1H IN A 10.64.32.144                                 
+ms-backup1002                            1H IN AAAA 2620:0:861:103:10:64:32:144               
 ms-be1028                                1H IN A 10.64.0.21                                   
 ms-be1029                                1H IN A 10.64.0.25                                   
 ms-be1030                                1H IN A 10.64.0.26                                   
METADATA: {"path": "/tmp/dns-c25pcHBldHM-frtl99bc", "sha1": "baeb288d4d9713814ac88e9537bbcf0ece5bb9e4", "insertions": 4, "deletions": 0, "lines": 4, "files": 3}

DNS was a bit unreliable just after the deploy, but it seems fine now, and it solved my timeout issue.
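For reference, the records in the diff above follow a visible pattern: each AAAA address is the subnet's /64 prefix followed by the IPv4 octets written as four groups, and the PTR name is the reversed nibbles of that address split between the zone file name and the record. The sketch below reproduces that pattern with Python's standard `ipaddress` module. It is inferred from the diff only, not taken from the netbox or DNS tooling, and the function names are hypothetical.

```python
import ipaddress
from typing import Tuple


def embed_ipv4(prefix64: str, ipv4: str) -> ipaddress.IPv6Address:
    """Build an IPv6 address as <prefix64>:<a>:<b>:<c>:<d> from IPv4 a.b.c.d.

    The octets are written with their decimal digits, as in the diff.
    """
    octets = ipaddress.IPv4Address(ipv4).packed
    groups = ":".join(str(octet) for octet in octets)
    return ipaddress.IPv6Address(f"{prefix64}:{groups}")


def split_ptr(addr: ipaddress.IPv6Address, zone_nibbles: int = 16) -> Tuple[str, str]:
    """Split the reverse pointer into (record name, zone name).

    Assumes the zone covers the first `zone_nibbles` nibbles (a /64 here).
    """
    labels = addr.reverse_pointer.split(".")  # 32 nibbles, then "ip6", "arpa"
    host_nibbles = 32 - zone_nibbles
    return ".".join(labels[:host_nibbles]), ".".join(labels[host_nibbles:])
```

With the values from the diff, `embed_ipv4("2620:0:861:101", "10.64.0.111")` yields 2620:0:861:101:10:64:0:111, and `split_ptr` on that address returns the record 1.1.1.0.0.0.0.0.4.6.0.0.0.1.0.0 in the zone 1.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa, matching the line added for ms-backup1001.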