
rack/setup/install ms-be10[44-50].eqiad.wmnet
Closed, Resolved (Public)

Description

This task will track the racking, setup, and installation of ms-be10[44-50].eqiad.wmnet.

Racking Proposal: Please see the discussion in this task's comments; the racking proposal needs review/approval from @fgiunchedi for redundancy planning.

ms-be1044:

  • - receive in system on procurement task T204133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

ms-be1045:

  • - receive in system on procurement task T204133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

ms-be1046:

  • - receive in system on procurement task T204133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

ms-be1047:

  • - receive in system on procurement task T204133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

ms-be1048:

  • - receive in system on procurement task T204133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

ms-be1049:

  • - receive in system on procurement task T204133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

ms-be1050:

  • - receive in system on procurement task T204133
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

Event Timeline

RobH triaged this task as High priority. Nov 15 2018, 6:19 PM
RobH created this task.
RobH added a comment. Nov 15 2018, 6:22 PM

Well, netbox makes reviewing current rack placement easier, since it summarizes it:

https://netbox.wikimedia.org/dcim/devices/?q=ms-be&site=eqiad&mac_address=&has_primary_ip=&cf_owner=&cf_purchase_date=&cf_support_contract=&cf_support_until=&cf_ticket=

So for horizontal redundancy / rack planning, I have a few questions:

  • Are any of the existing ms-be systems slated to go away soon?
    • If so, these new ones can share racks with those without issue.
  • Is there a maximum number of ms-be hosts per rack we don't want to exceed?
  • Do we want these on 1G or 10G?
  • Otherwise, should we duplicate the settings of existing ms-be hosts (networking VLANs, OS, etc.)?

@fgiunchedi: If you can address the above, we'll be able to determine where to rack these 7 new hosts.

  • Are any of the existing ms-be systems slated to go away soon?
    • If so, these new ones can share racks with those without issue.

Yes, ms-be101[345] are 4.5 years old at this point; I was planning to decommission them once the replacement hardware is fully in service. Note that these systems have 3T disks whereas the new hardware has 4T; see below for row allocation.

  • Is there a maximum number of ms-be hosts per rack we don't want to exceed?

Ideally these are spread out by rack as much as possible too; that is a "nice to have" property for availability, not a strict requirement. AFAIK there are at least two 10G racks per row, so those will do.

  • Do we want these on 1G or 10G?

10G

  • Otherwise, should we duplicate the settings of existing ms-be hosts (networking VLANs, OS, etc.)?

Correct

This is my proposed row allocation to rebalance number of systems vs capacity:

ms-be1044 eqiad row A
ms-be1045 eqiad row A
ms-be1046 eqiad row A
ms-be1047 eqiad row C
ms-be1048 eqiad row C
ms-be1049 eqiad row C
ms-be1050 eqiad row D

If we can make those allocations rack-diverse as much as possible that'd be awesome too!

fgiunchedi reassigned this task from fgiunchedi to RobH. Nov 16 2018, 8:24 AM

@fgiunchedi For racking this is the space I have

I can do at least 3 in A without a problem.

I can only fit 2 in C, and they would go in the same rack (C2).

B can handle 3 or more, and D can handle 2 or more, with one rack having room for at least 2.

Ok! We can move one from C to B, resulting in this allocation, all on 10G ports:

ms-be1044 eqiad row A
ms-be1045 eqiad row A
ms-be1046 eqiad row A
ms-be1047 eqiad row B
ms-be1048 eqiad row C
ms-be1049 eqiad row C
ms-be1050 eqiad row D

Cmjohnson moved this task from Backlog to Racking Tasks on the ops-eqiad board. Dec 5 2018, 4:25 PM

Change 477829 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for ms-be1044-50

https://gerrit.wikimedia.org/r/477829

Change 477829 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for ms-be1044-50

https://gerrit.wikimedia.org/r/477829

Cmjohnson updated the task description. Dec 5 2018, 10:04 PM
Cmjohnson updated the task description. Dec 6 2018, 8:15 PM

@RobH this is already assigned to you but these are ready for you to take over

Change 478993 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] prod dns for ms-be10[44-50].eqiad.wmnet

https://gerrit.wikimedia.org/r/478993

Change 478993 merged by RobH:
[operations/dns@master] prod dns for ms-be10[44-50].eqiad.wmnet

https://gerrit.wikimedia.org/r/478993

Change 479000 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] updates for ms-be10[44-50].eqiad.wmnet

https://gerrit.wikimedia.org/r/479000

Change 479000 merged by RobH:
[operations/puppet@production] updates for ms-be10[44-50].eqiad.wmnet

https://gerrit.wikimedia.org/r/479000

RobH reassigned this task from RobH to fgiunchedi. Dec 11 2018, 7:01 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description.
fgiunchedi reassigned this task from fgiunchedi to RobH. Dec 12 2018, 9:15 AM

@RobH it looks like only ms-be1050 of these hosts is accessible from cumin atm? Ditto for logging in as my user via ssh.

root@cumin1001:~# cumin 'ms-be10[44-50].eqiad.wmnet' 'uptime'
1 hosts will be targeted:
ms-be1050.eqiad.wmnet
Confirm to continue [y/n]? y
----- OUTPUT of 'uptime' -----                                                                            
 09:13:22 up 14:55,  0 users,  load average: 0.01, 0.06, 0.07
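
For reference, the bracketed range in the cumin query above is shorthand for seven hostnames, of which only one was matched. A minimal Python sketch of that expansion follows; `expand_range` is a hypothetical helper for illustration, not cumin's own parser, and it handles only a single numeric `[start-end]` range.

```python
import re


def expand_range(pattern):
    """Expand one numeric [start-end] range in a hostname pattern.

    Hypothetical helper for illustration only; it is not cumin's parser.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracketed range present: the pattern is a single hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, if any
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_range("ms-be10[44-50].eqiad.wmnet")
for host in hosts:
    print(host)
```

Comparing this list with the hosts cumin actually targets shows which ones are not yet reachable.
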
RobH reassigned this task from RobH to fgiunchedi. Dec 12 2018, 5:36 PM

Turns out that when you enable puppet on a new install with the cert signed, you must still trigger the first run manually. Fixed.

Mentioned in SAL (#wikimedia-operations) [2018-12-13T08:50:19Z] <godog> stress-test ms-be10[44-50] - T209618

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Dec 13 2018, 9:37 AM
Dzahn added a subscriber: Dzahn.

<godog> stress-test ms-be10[44-50] - T209618

T211796 shows a new RAID failure on ms-be1045. It looks like that happened as a result of the stress test.

edit: false alarm, the RAID check just triggered because it could not connect via NRPE temporarily

Ok! We can move one from C to B, resulting in this allocation, all on 10G ports:
ms-be1044 eqiad row A
ms-be1045 eqiad row A
ms-be1046 eqiad row A
ms-be1047 eqiad row B
ms-be1048 eqiad row C
ms-be1049 eqiad row C
ms-be1050 eqiad row D

@Cmjohnson the IP addresses of the hosts as currently in production don't seem to reflect this allocation:

ms-be1044.eqiad.wmnet has address 10.64.0.138
ms-be1045.eqiad.wmnet has address 10.64.0.139
ms-be1046.eqiad.wmnet has address 10.64.0.140
ms-be1047.eqiad.wmnet has address 10.64.16.28
ms-be1048.eqiad.wmnet has address 10.64.16.30
ms-be1049.eqiad.wmnet has address 10.64.32.74
ms-be1050.eqiad.wmnet has address 10.64.32.75

Namely, 1048 is in row B but should be in C, and 1050 is in C but should be in D.
I don't mind the order so much, so moving 1048 from B to D would also work for me, as long as the count per row stays the same.
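
The mismatch described above can be checked mechanically by mapping each address to a row and comparing against the planned allocation. The sketch below is illustrative only: the /22 subnet boundaries and the row D subnet are assumptions (only the A, B and C prefixes can be inferred from the addresses in this comment), and `row_for` and `mismatches` are hypothetical helper names.

```python
import ipaddress

# ASSUMPTION: one /22 per row. Rows A-C are inferred from the addresses
# listed in this task; the row D subnet is not shown here and is assumed.
ROW_SUBNETS = {
    "A": ipaddress.ip_network("10.64.0.0/22"),
    "B": ipaddress.ip_network("10.64.16.0/22"),
    "C": ipaddress.ip_network("10.64.32.0/22"),
    "D": ipaddress.ip_network("10.64.48.0/22"),
}

# Agreed allocation after moving one host from row C to row B.
PLANNED = {
    "ms-be1044": "A", "ms-be1045": "A", "ms-be1046": "A",
    "ms-be1047": "B", "ms-be1048": "C", "ms-be1049": "C",
    "ms-be1050": "D",
}

# Production addresses as listed in this comment.
ACTUAL_IPS = {
    "ms-be1044": "10.64.0.138", "ms-be1045": "10.64.0.139",
    "ms-be1046": "10.64.0.140", "ms-be1047": "10.64.16.28",
    "ms-be1048": "10.64.16.30", "ms-be1049": "10.64.32.74",
    "ms-be1050": "10.64.32.75",
}


def row_for(ip):
    """Return the row whose subnet contains ip, or None if no subnet matches."""
    address = ipaddress.ip_address(ip)
    for row, subnet in ROW_SUBNETS.items():
        if address in subnet:
            return row
    return None


def mismatches(planned, ips):
    """Map host -> (planned row, actual row) for hosts in the wrong row."""
    return {
        host: (planned[host], row_for(ip))
        for host, ip in ips.items()
        if row_for(ip) != planned[host]
    }


for host, (want, got) in sorted(mismatches(PLANNED, ACTUAL_IPS).items()):
    print(f"{host}: planned row {want}, actual row {got}")
```

With the addresses above, this flags ms-be1048 (planned C, actual B) and ms-be1050 (planned D, actual C).
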

RobH added a comment. Edited Dec 17 2018, 4:59 PM

TL;DR / IRC Summary Update:

The new systems are not distributed as evenly across all 4 rows as needed for our existing horizontal redundancy.

@Cmjohnson is going to take ms-be1048 offline and move it from row B to row D. Once there, it can have its DNS updated (from the row B to the row D internal IP) and be reimaged.
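
A quick way to confirm that moving ms-be1048 from row B to row D yields the agreed count per row, sketched in Python. The `deployed` mapping is read off the rows discussed in this thread, and the variable and function names are illustrative.

```python
from collections import Counter

# Agreed allocation from the earlier comments.
planned = {
    "ms-be1044": "A", "ms-be1045": "A", "ms-be1046": "A",
    "ms-be1047": "B", "ms-be1048": "C", "ms-be1049": "C",
    "ms-be1050": "D",
}

# Rows the hosts actually ended up in, as discussed in this thread.
deployed = {
    "ms-be1044": "A", "ms-be1045": "A", "ms-be1046": "A",
    "ms-be1047": "B", "ms-be1048": "B", "ms-be1049": "C",
    "ms-be1050": "C",
}


def per_row(allocation):
    """Count how many hosts sit in each row."""
    return Counter(allocation.values())


# Apply the proposed move: ms-be1048 goes from row B to row D.
after_move = dict(deployed)
after_move["ms-be1048"] = "D"

print("before move:", dict(per_row(deployed)))
print("after move: ", dict(per_row(after_move)))
print("matches plan:", per_row(after_move) == per_row(planned))
```

Only the per-row counts are compared, not which host sits where, since the order of hosts within the plan was stated not to matter.
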

@RobH the server has been moved to d2/u31.

  • update netbox
  • remove from asw2-b-eqiad
  • update asw2-d-eqiad

Change 480145 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] updating ms-be1048 production dns

https://gerrit.wikimedia.org/r/480145

Change 480145 merged by RobH:
[operations/dns@master] updating ms-be1048 production dns

https://gerrit.wikimedia.org/r/480145

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['ms-be1048.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201812171900_robh_202179.log.

RobH reassigned this task from Cmjohnson to fgiunchedi. Dec 17 2018, 8:26 PM

Filippo,

ms-be1048 has been relocated to row D and reimaged with its row D IP address. It should be all set for you.

Mentioned in SAL (#wikimedia-operations) [2018-12-18T09:38:03Z] <godog> swift eqiad-prod: initial weights for ms-be10[44-50].eqiad.wmnet - T209618

Change 480446 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add ms-be10[44-50].eqiad.wmnet

https://gerrit.wikimedia.org/r/480446

Change 480446 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add ms-be10[44-50].eqiad.wmnet

https://gerrit.wikimedia.org/r/480446

RobH removed a subscriber: RobH.

Mentioned in SAL (#wikimedia-operations) [2018-12-19T08:16:20Z] <godog> swift eqiad-prod: more weight for ms-be10[44-50].eqiad.wmnet - T209618

Mentioned in SAL (#wikimedia-operations) [2018-12-26T09:13:40Z] <godog> swift eqiad-prod: more weight for ms-be10[44-50].eqiad.wmnet - T209618

Mentioned in SAL (#wikimedia-operations) [2019-01-02T09:06:03Z] <godog> eqiad-prod: final weight for ms-be10[44-50].eqiad.wmnet - T209618

@Cmjohnson @fgiunchedi
There are 2 new Icinga alerts saying that on ms-be1044 and ms-be1045 the power supplies are not redundant anymore:

Indeed! That's T212861: Rack A2's hosts alarm for PSU broken

fgiunchedi closed this task as Resolved. Jan 8 2019, 2:09 PM

All new hosts in service, resolving.