(Need By: TBD) rack/setup/install an-worker11[18-41]
Open, Medium, Public

Description

This task will track the racking, setup, and OS installation of an-worker11[18-41]

Hostname / Racking / Installation Details

Hostnames: an-worker11XX (an-worker11[18-41])
Racking Proposal: Ideally, the racking should be spread among the 4 rows that we have. Given what we discussed in T259071#6359706, I'd try, if possible, to keep the number of hadoop workers in each row as similar as possible. This is not a strict/hard requirement; we can live with some rows being unbalanced.
Networking/Subnet/VLAN/IP: 10G, analytics VLAN
Partitioning/Raid: same config as we have for the rest of the an-worker nodes
OS Distro: Stretch (we are working on Buster but it is complicated for us to use Buster nodes for the moment).
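
The bracket notation an-worker11[18-41] is inclusive on both ends. As a minimal sketch (the helper name `expand_range` is made up for illustration, and the eqiad.wmnet domain is taken from the reimage logs further down), the range can be expanded to see how many hosts are involved and what an even split over 4 rows would look like:

```python
def expand_range(prefix: str, start: int, end: int, domain: str = "") -> list[str]:
    """Expand prefix[start-end] into explicit hostnames (both ends inclusive)."""
    suffix = f".{domain}" if domain else ""
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


hosts = expand_range("an-worker11", 18, 41, domain="eqiad.wmnet")
per_row = len(hosts) // 4  # even split over the 4 rows, if racking allows it

print(len(hosts))            # 24
print(hosts[0], hosts[-1])   # an-worker1118.eqiad.wmnet an-worker1141.eqiad.wmnet
print(per_row)               # 6
```

So a perfectly balanced racking would put 6 of the 24 new workers in each row, which is only a target and not a hard requirement.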

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-worker1118:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1119:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1120:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1121:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1122:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1123:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1124:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1125:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1126:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1127:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1128:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1129:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - John to address comment T260445#6842372
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - setup hw raid (raid1 for dual ssds, single disk raid0s for all hdds)
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1130:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1131:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1132:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1133:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - setup hw raid (raid1 for dual ssds, single disk raid0s for all hdds)
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1134:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - setup hw raid (raid1 for dual ssds, single disk raid0s for all hdds)
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1135:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1136:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1137:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1138:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1139:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - John to address comment T260445#6842372
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1140:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - setup hw raid (raid1 for dual ssds, single disk raid0s for all hdds)
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

an-worker1141:

  • - receive in system on procurement task T258727 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - setup hw raid (raid1 for dual ssds, single disk raid0s for all hdds)
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

Completed auto-reimage of hosts:

['an-worker1135.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1138.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202101181618_elukey_12915.log.

Completed auto-reimage of hosts:

['an-worker1138.eqiad.wmnet']

Of which those FAILED:

['an-worker1138.eqiad.wmnet']

Completed auto-reimage of hosts:

['an-worker1137.eqiad.wmnet']

and were ALL successful.

elukey updated the task description. Jan 18 2021, 5:55 PM
elukey updated the task description. Jan 18 2021, 5:58 PM

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1136.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202101181804_elukey_27522.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1138.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202101181810_elukey_7545.log.

Completed auto-reimage of hosts:

['an-worker1136.eqiad.wmnet']

and were ALL successful.

elukey updated the task description. Jan 18 2021, 6:29 PM

Completed auto-reimage of hosts:

['an-worker1138.eqiad.wmnet']

and were ALL successful.

@Cmjohnson for an-worker1119 and an-worker1131 I don't have any network link, could you please check if anything is missing from the cabling/config point of view?

@RobH I completed most of the installs except the ones above (and the hosts not racked yet), we should be good! (I'll also do more extensive testing later on)

RobH reassigned this task from RobH to Cmjohnson. Jan 19 2021, 4:31 PM

I'm unsubscribing myself from this, as it's been taken over by the subteam, and it's causing a lot of noise in my phabricator notifications (so they are less useful)

AFAICT this should go back to Chris for the racking of the remaining hosts.

Please don't resub/add/assign to me until this is ready for me to work on, as the updates are polluting my notifications, thanks!

RobH removed a subscriber: RobH. Jan 19 2021, 4:32 PM

@Cmjohnson before racking the remaining 6 nodes (which we can do in another task), could you check an-worker1119 and an-worker1131 to see if they are connected to the switch? I am seeing the network link down on both, so I cannot proceed with the Debian install :(

Replaced the DAC cables for an-worker1119 and an-worker1131; @elukey confirmed both are seeing network.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1119.eqiad.wmnet', 'an-worker1131.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202101260812_elukey_31698.log.

Completed auto-reimage of hosts:

['an-worker1131.eqiad.wmnet', 'an-worker1119.eqiad.wmnet']

and were ALL successful.

elukey updated the task description. Jan 26 2021, 4:28 PM

Ok, all racked nodes are now working! We still have 6 nodes to rack, ideally in rows that are not already heavily used. For example, this is our current distribution in rows for the hosts that we have racked in this task:

===== NODE GROUP =====                                                                                                    
(4) an-worker[1135-1138].eqiad.wmnet                                                                                            
    SysName:      asw2-d-eqiad                                                                                            
===== NODE GROUP =====                                                                                                    
(2) an-worker[1131-1132].eqiad.wmnet                                                                                            
    SysName:      asw2-c-eqiad                                                                                            
===== NODE GROUP =====                                                                                                    
(6) an-worker[1124-1128,1130].eqiad.wmnet                                                                                  
    SysName:      asw2-b-eqiad                                                                                            
===== NODE GROUP =====                                                                                                    
(6) an-worker[1118-1123].eqiad.wmnet                                                                                                                                                     
    SysName:      asw2-a-eqiad
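
To turn the grouped output above into per-row counts, something like the following could be used. This is only a sketch: the function name `hosts_per_row` is invented here, and it assumes that the switch name asw2-<row>-eqiad identifies the row, as the listing suggests.

```python
import re

# Node-group output copied from the listing above (trailing spaces removed).
OUTPUT = """\
===== NODE GROUP =====
(4) an-worker[1135-1138].eqiad.wmnet
    SysName:      asw2-d-eqiad
===== NODE GROUP =====
(2) an-worker[1131-1132].eqiad.wmnet
    SysName:      asw2-c-eqiad
===== NODE GROUP =====
(6) an-worker[1124-1128,1130].eqiad.wmnet
    SysName:      asw2-b-eqiad
===== NODE GROUP =====
(6) an-worker[1118-1123].eqiad.wmnet
    SysName:      asw2-a-eqiad
"""


def hosts_per_row(text: str) -> dict[str, int]:
    """Sum the '(N)' host counts per row, keyed by the row letter in SysName."""
    counts: dict[str, int] = {}
    pending = None  # host count of the group currently being read
    for raw in text.splitlines():
        line = raw.strip()
        size = re.match(r"\((\d+)\)", line)
        if size:
            pending = int(size.group(1))
            continue
        # Assumption: asw2-<row>-eqiad names the row the hosts are racked in.
        switch = re.match(r"SysName:\s+asw2-([a-z])-eqiad$", line)
        if switch and pending is not None:
            row = switch.group(1).upper()
            counts[row] = counts.get(row, 0) + pending
            pending = None
    return counts


counts = hosts_per_row(OUTPUT)
print(counts)                # {'D': 4, 'C': 2, 'B': 6, 'A': 6}
print(sum(counts.values()))  # 18
```

That gives 18 hosts racked so far, with rows C and D holding fewer of the new workers than rows A and B.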

Ideally it would be nice to add the remaining 6 nodes in row C and row D, keeping the constraint that we want to avoid more than 5 hadoop workers in the same rack for availability/resiliency reasons. @wiki_willy do we have a timeline for when spots will be freed in the 10g racks? I can also help more in moving 1g nodes away from 10g racks if needed :)

Hi @elukey - in looking through Netbox and talking to Chris, this is what I'm thinking, but @Cmjohnson/@Jclark-ctr/@elukey - please call me out if I'm off with any part of this plan, and we can think of an alternative:

  • Decommission wmf3570 in C4 - is this a spare machine?
  • Move snapshot1006 or deploy1001 in C4 to different rack (they're not 10g)
  • Move francium in C7 to a different rack (it's not 10g)
  • Move an-presto1001 and analytics1076 in D2 to D4 - are we still ok w/ rack diversity?
  • Move an-presto1003 and analytics1077 in D7 to D4 - are we still ok w/ rack diversity?
  • Install 2x an-workers in C4, 1x an-worker in C7, 1-2x an-workers in D2, 1-2x an-workers in D7

elukey added a comment. (Edited) Jan 27 2021, 8:38 AM

Hi @wiki_willy thanks a lot for following up!

I re-did the calculations of the workers' distribution after the last racking and this is what I got:

# Hosts in rows
      19 A
      19 B
      21 C
      19 D

# Hosts in racks
      1 A/1
      5 A/2
      1 A/4
      2 A/3
      3 A/4
      2 A/5
      5 A/7
      5 B/2
      1 B/3
      5 B/4
      5 B/7
      3 B/8
      5 C/2
      4 C/3
      7 C/4
      4 C/7
      1 C/8
      6 D/2
      4 D/4
      2 D/5
      6 D/7
      1 D/8

That looks way more balanced than my calculations before the racking of the last 18 nodes, so this gives us more flexibility in racking the next 6 nodes. Ideally if we could spread them across rows it would be great, keeping the constraint that the target rack should contain max 5 nodes (there are racks already hosting 6 or 7 nodes for example, that is not ideal but ok for now).
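
The max-5-per-rack constraint can be checked mechanically. The sketch below uses the row C and row D counts copied from the per-rack tally above; the function name `classify` is made up for illustration, and the check covers only the worker limit, not switch speed or free rack space.

```python
MAX_WORKERS_PER_RACK = 5

# Rows C and D, copied from the "Hosts in racks" tally above.
rack_counts = {
    "C/2": 5, "C/3": 4, "C/4": 7, "C/7": 4, "C/8": 1,
    "D/2": 6, "D/4": 4, "D/5": 2, "D/7": 6, "D/8": 1,
}


def classify(counts: dict[str, int], limit: int = MAX_WORKERS_PER_RACK):
    """Return (racks above the limit, remaining headroom for racks below it)."""
    over = sorted(rack for rack, n in counts.items() if n > limit)
    room = {rack: limit - n for rack, n in counts.items() if n < limit}
    return over, room


over, room = classify(rack_counts)
print(over)  # ['C/4', 'D/2', 'D/7']
print(room)  # {'C/3': 1, 'C/7': 1, 'C/8': 4, 'D/4': 1, 'D/5': 3, 'D/8': 4}
```

Racks that are exactly at the limit (C/2 here) appear in neither result, since they are not over the limit but cannot take another worker either.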

About the proposal above:

Install 2x an-workers in C4, 1x an-worker in C7, 1-2x an-workers in D2, 1-2x an-workers in D7

  • C4 is already super crowded (7 hosts) so I'd avoid it :(
  • C7 looks ok (it can host one more worker)
  • D2 is already crowded (6 nodes) so I'd avoid it, same thing for D7

Let me know if there is another solution, I am available to help moving any node to other racks (not only the Analytics ones) to ease the process. Thanks for the patience!

Hi @elukey - thanks for the mapping. What makes it tough is that the remaining 6x hosts need to be on 10g switches, which really limits our options. Right now, it looks like you're maxed out in almost all our 10g racks (A2, A4, A7, B2, B4, B7, C2, C4, C7, D2, D4, D7). But based on your mapping, I think we can make this happen instead, if this works for you:

  • Rack A4: can fit 3x an-workers if we move logstash1020 and db1111 out (these are not 10g hosts), and rack one of the new an-workers on shelf 42
  • Rack C7: per the previous proposal, install 1x an-worker after moving francium to a different rack
  • Rack D4: install 2x an-workers here (there's rack space)

That would make a new final distribution of 4x in A4, 5x in C7, and 6x in D4.

Thanks,
Willy

@wiki_willy thanks a lot! Can we start racking 2 nodes out of 6? If I got it correctly we could:

  • add one node to D4 (rather than 2 I know, but we'd reach the max 5 workers per rack instead of 6)
  • add one node to A4 (without any move)

In the meantime I can work with Fundraising about francium, and possibly with others for logstash1020 and db1111 (the latter seems unlikely to be ready to be moved, but I'll ask). What do you think?

wiki_willy added a comment. (Edited) Thu, Jan 28, 6:00 PM

I think that should work, but let me defer to @Cmjohnson and @Jclark-ctr for any additional concerns. In summary, here's the game plan (slightly adjusted from my original proposal, so that the nodes don't go over 5x in a rack):

  • Step 1 --- Install 1x an-worker in Rack D4 (there's rack space)
  • Step 2 --- Install 1x an-worker in Rack A4, shelf 42 (if Chris/John are ok racking that high)
  • Step 3 --- Decom francium via T273142, then install 1x an-worker here in rack C7, shelf 22/23
  • Step 4 --- Work with Hugh Nowlan (or Luca) to move maps1001 from shelf 10 to shelf 13 in Rack A4, install 1x an-worker on shelf 9/10
  • Step 5 --- Work with Keith and Manuel to move logstash1020 and db1111 out of Rack A4 into 1g rack, then install 2x an-workers here in A4, shelf 35/36 and shelf 17/18

Thanks,
Willy
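
As a quick sanity check of the five steps above, the resulting per-rack worker counts can be tallied. This is a sketch under stated assumptions: the starting counts for C7 and D4 come from the per-rack tally earlier in the thread, while A4 appears twice in that tally (1 and 3), so the value 1 used here is an assumption chosen because it matches the "4x in A4" total quoted in the earlier proposal. The function name `apply_plan` is made up for illustration.

```python
MAX_WORKERS_PER_RACK = 5

# Hadoop workers already in each target rack.
# A4 = 1 is an ASSUMPTION (the tally lists A/4 twice); C7 and D4 are from the tally.
current = {"A4": 1, "C7": 4, "D4": 4}

# (rack, workers added) for steps 1 to 5 of the game plan.
plan = [("D4", 1), ("A4", 1), ("C7", 1), ("A4", 1), ("A4", 2)]


def apply_plan(start: dict[str, int], steps: list[tuple[str, int]]) -> dict[str, int]:
    """Return the per-rack worker counts after applying every step."""
    final = dict(start)
    for rack, added in steps:
        final[rack] = final.get(rack, 0) + added
    return final


final = apply_plan(current, plan)
print(sum(added for _, added in plan))                          # 6
print(final)                                                    # {'A4': 5, 'C7': 5, 'D4': 5}
print(all(n <= MAX_WORKERS_PER_RACK for n in final.values()))   # True
```

Under that assumption, all 6 remaining nodes are placed and no target rack ends up with more than 5 workers.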

@wiki_willy db1111 can be moved somewhere else if needed. From our side our needs would be:

  • Choose a day/time so DBAs can depool the host in advance
  • DCOps to provide the future IP in advance so we can change it on the host before powering it off; this way it will boot up with the new IP once racked in its new place.
  • DCOps to change dns

That's it :-)

elukey added a subscriber: hnowlan. (Edited) Mon, Feb 1, 3:22 PM

Had a chat with @hnowlan and maps1001 can be moved with some heads up time. I am available to work with John on moving the nodes with Manuel and Hugh if we agree on the plan.

Next steps: @Cmjohnson @Jclark-ctr to comment on the above proposal from Willy :)

If possible I would start with the hosts that have free rack space, that should be 2/6 remaining hosts IIUC.

Thanks @Marostegui, I appreciate it. We discussed this during my staff meeting a bit last week, and @Cmjohnson will work with you and the other server owners on the moves. It won't be during the next few days, as there's a big storm in Virginia....but I'll let @Cmjohnson chime in to propose which dates/times would work the best. Thanks, Willy

Thanks @wiki_willy - I will be off a few days next week but @LSobanski and @Kormat are on this task in case this needs to happen while I am away.

@wiki_willy, @hnowlan Move tickets have been created for db1111 (T273982), logstash1020 (T273984) and maps1001 (T273983). Francium has been decom'd and removed. @elukey once those are moved I will rack the remaining an-workers according to the plan you and Willy worked up.

elukey added a comment. Fri, Feb 5, 4:24 PM

@Cmjohnson wonderful news! I'll follow up in the task to help the owners of the hosts!

RobH updated the task description. Wed, Feb 17, 11:09 PM
Jclark-ctr updated the task description. Thu, Feb 18, 1:11 AM
Jclark-ctr reassigned this task from Cmjohnson to RobH. Thu, Feb 18, 1:15 AM

racked & cabled, bios configured, network configured. handing over to Rob for imaging

Nice work, thanks @Jclark-ctr

RobH added a comment. Thu, Feb 18, 2:55 AM

updated firmware for idrac for the remainder, will update bios and image tomorrow

Change 665120 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] install params for an-worker11(29|33|34|39|40|41)

https://gerrit.wikimedia.org/r/665120

Change 665120 merged by RobH:
[operations/puppet@production] install params for an-worker11(29|33|34|39|40|41)

https://gerrit.wikimedia.org/r/665120

RobH added a comment. Thu, Feb 18, 4:41 PM

an-worker11(29|33|34|39|40|41):

  • idrac firmware updated
  • bios firmware updated
  • idrac and bios settings & password updated
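The host shorthand used in this thread, an-worker11(29|33|34|39|40|41) and later an-worker11[23]9, reads as a regular expression. A minimal sketch of how it expands, assuming the candidate pool is the an-worker11[18-41] range from the task title; the helper name `matching_hosts` is hypothetical and not part of any WMF tooling:

```python
import re

# Shorthand copied from the thread, interpreted as regular expressions.
NEW_HOSTS = re.compile(r"an-worker11(29|33|34|39|40|41)")
REMAINING_HOSTS = re.compile(r"an-worker11[23]9")


def matching_hosts(pattern, candidates):
    """Return the candidates whose full name matches the pattern."""
    return [host for host in candidates if pattern.fullmatch(host)]


# Candidate pool: an-worker1118 through an-worker1141, per the task title.
candidates = [f"an-worker{n}" for n in range(1118, 1142)]

print(matching_hosts(NEW_HOSTS, candidates))
print(matching_hosts(REMAINING_HOSTS, candidates))
```

Using `fullmatch` rather than `search` keeps a pattern from matching a longer hostname by accident.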

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1129.eqiad.wmnet', 'an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker1139.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102181653_robh_9453.log.

Completed auto-reimage of hosts:

['an-worker1129.eqiad.wmnet', 'an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker1139.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqiad.wmnet']

Of which those FAILED:

['an-worker1129.eqiad.wmnet', 'an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker1139.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqiad.wmnet']

Change 665143 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: set custom recipe for new worker nodes

https://gerrit.wikimedia.org/r/665143

Change 665143 merged by Elukey:
[operations/puppet@production] install_server: set custom recipe for new worker nodes

https://gerrit.wikimedia.org/r/665143

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

an-worker1129.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102181826_elukey_29316_an-worker1129_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1129.eqiad.wmnet']

Of which those FAILED:

['an-worker1129.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

an-worker1129.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102181827_elukey_29343_an-worker1129_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1129.eqiad.wmnet']

Of which those FAILED:

['an-worker1129.eqiad.wmnet']

Change 665193 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] fixing macs for an-workers

https://gerrit.wikimedia.org/r/665193

Change 665193 merged by RobH:
[operations/puppet@production] fixing macs for an-workers

https://gerrit.wikimedia.org/r/665193

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102182137_robh_823.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102182235_robh_11380.log.

Completed auto-reimage of hosts:

['an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqiad.wmnet']

Of which those FAILED:

['an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqiad.wmnet']
RobH added a comment. Thu, Feb 18, 11:26 PM

John,

In reviewing the installations from the relocation of an-worker11(29|33|34|39|40|41), I ran into a couple issues:

an-worker1129 shows in idrac that its port1 NIC is attached to xe-4/0/3, but shows xe-4/0/12 in netbox. Can you please double-check and correct netbox for this host? If you aren't sure how, let me know, and just confirm that this is indeed in port 3 and nothing is in port 12 (perhaps it appears an unused SFP-T?)

an-worker1139 shows in idrac that its port1 NIC is attached to xe-4/0/24, but shows xe-4/0/25 in netbox. Can you please double-check and correct netbox for this host? If you aren't sure how, let me know, and just confirm that this is indeed in port 24 and nothing is in port 25?

The other hosts now all PXE boot into the installer, but something is failing during the installer or post install, as the script fails for them. I need to work on a procurement task, but will return to this later to investigate the other servers and their install issue.
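The idrac/netbox discrepancies described above can be listed mechanically. This is only a sketch: the port values are hard-coded from this comment, no real idrac or netbox API is queried, and the function name `port_mismatches` is hypothetical.

```python
# Switch port each host's port1 NIC reports in idrac (values from the comment).
IDRAC_PORTS = {"an-worker1129": "xe-4/0/3", "an-worker1139": "xe-4/0/24"}

# Switch port recorded for the same hosts in netbox (values from the comment).
NETBOX_PORTS = {"an-worker1129": "xe-4/0/12", "an-worker1139": "xe-4/0/25"}


def port_mismatches(observed, recorded):
    """Map each host to (observed, recorded) where the two sources disagree."""
    return {
        host: (observed[host], recorded.get(host))
        for host in sorted(observed)
        if observed[host] != recorded.get(host)
    }


for host, (seen, documented) in port_mismatches(IDRAC_PORTS, NETBOX_PORTS).items():
    print(f"{host}: idrac says {seen}, netbox says {documented}")
```

A mismatch only says the two sources disagree; which side is wrong still needs an on-site check, as requested above.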

RobH added a comment. Fri, Feb 19, 1:35 AM

Ok, so an-worker11[23]9 still needs the network stuff figured out by onsite, but the installer loop issue I was having is due to having mixed raid and non-raid disks on the raid controller. The JBOD disks have to be individual raid0s, or all disks have to be non-raid JBODs with sw raid set up for the OS mirror. Rather than change how it's done, I'll set these up the same as the rest of the hosts: the SSD raid1 mirror first so it shows as sda (already done), then individual-disk raid0s.
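The constraint described above (no mix of raid and non-raid disks on one controller) can be expressed as a small check. This is an illustrative sketch only: the disk names and the four-disk layout are hypothetical, and nothing here talks to a real controller.

```python
# Hypothetical layout: disk name -> mode. "non-raid" means the disk is passed
# through as JBOD; any other value is the RAID level of its virtual disk.
layout = {"ssd0": "raid1", "ssd1": "raid1", "hdd0": "non-raid", "hdd1": "non-raid"}


def is_mixed(disk_modes):
    """True when the controller has both raid and non-raid disks."""
    kinds = {mode == "non-raid" for mode in disk_modes.values()}
    return len(kinds) > 1


def as_single_disk_raid0(disk_modes):
    """Turn every non-raid disk into its own raid0, leaving raid disks as-is."""
    return {
        disk: "raid0" if mode == "non-raid" else mode
        for disk, mode in disk_modes.items()
    }


print(is_mixed(layout))                        # mixed layout, the failing case
print(is_mixed(as_single_disk_raid0(layout)))  # after conversion to raid0s
```

The conversion mirrors the option chosen in the comment (individual raid0s); the alternative, all disks non-raid with software raid for the OS mirror, would also pass the check.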

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1133.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102190211_robh_18088_an-worker1133_eqiad_wmnet.log.

RobH updated the task description. Fri, Feb 19, 2:13 AM
RobH updated the task description.

Completed auto-reimage of hosts:

['an-worker1133.eqiad.wmnet']

and were ALL successful.

RobH updated the task description. Fri, Feb 19, 2:44 AM

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

['an-worker1134.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102191638_robh_14836.log.

Completed auto-reimage of hosts:

['an-worker1134.eqiad.wmnet', 'an-worker1140.eqiad.wmnet', 'an-worker1141.eqiad.wmnet']

and were ALL successful.

RobH updated the task description. Fri, Feb 19, 5:05 PM
RobH reassigned this task from RobH to Jclark-ctr. Fri, Feb 19, 5:08 PM
RobH removed a project: Patch-For-Review.
RobH added a subscriber: RobH.

Ok, the only two remaining hosts are an-worker11[23]9.

John,

In reviewing the installations from the relocation of an-worker11(29|33|34|39|40|41), I ran into a couple issues:

an-worker1129 shows in idrac that its port1 NIC is attached to xe-4/0/3, but shows xe-4/0/12 in netbox. Can you please double-check and correct netbox for this host? If you aren't sure how, let me know, and just confirm that this is indeed in port 3 and nothing is in port 12 (perhaps it appears an unused SFP-T?)

an-worker1139 shows in idrac that its port1 NIC is attached to xe-4/0/24, but shows xe-4/0/25 in netbox. Can you please double-check and correct netbox for this host? If you aren't sure how, let me know, and just confirm that this is indeed in port 24 and nothing is in port 25?

I suspect idrac is correct and they are in xe-4/0/3 and xe-4/0/24, but I want to double-check before I update netbox. John: once these two network ports are confirmed, update this task with the confirmation and I'll handle the netbox updates. Please also confirm the cable IDs on this task via comment and assign back to me, thank you!

an-worker1139 corrected dac cable for host moved to port 25

Jclark-ctr added a comment. Edited Fri, Feb 19, 6:50 PM

an-worker1129 verified host cable is moved to port 12

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1129.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191908_robh_12122_an-worker1129_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

an-worker1139.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191919_robh_13531_an-worker1139_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-worker1129.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['an-worker1139.eqiad.wmnet']

and were ALL successful.

RobH closed this task as Resolved. Fri, Feb 19, 7:50 PM
RobH claimed this task.
RobH updated the task description.

All hosts installed and staged in netbox.

elukey reopened this task as Open. Fri, Feb 26, 7:14 AM

@wiki_willy I am terribly sorry to re-open this task, please be patient, but I discovered that I made an error (got fooled by the racking A4 in T260445#6779304, it is repeated two times, 1 and 3 hosts). The following 4 new nodes got added to A4:

an-worker[1129,1139-1141].eqiad.wmnet

And now we have a single rack with 8 hadoop nodes, which is not ideal: if a rack goes down (power failure, network, etc.) we lose ~10% of the capacity in one go. We should be able to survive if an event like that occurs, but since we haven't added the nodes to the cluster yet (so they can be moved anytime) I was wondering if there was space in other A racks for 10g nodes. We have

A1: 1
A2: 5
A3: 2
A4: 8
A5: 2
A7: 5

Even moving a couple of nodes would help a lot, and again apologies :(
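The rack-failure concern above can be put in numbers with a short sketch. The per-rack counts are the ones listed in the comment; the total cluster size is not stated in this task, so `ASSUMED_TOTAL_WORKERS = 80` is purely a hypothetical figure picked to be consistent with the "~10%" estimate, and the function names are illustrative.

```python
# Hadoop worker counts per rack in row A, as listed in the comment above.
ROW_A_WORKERS = {"A1": 1, "A2": 5, "A3": 2, "A4": 8, "A5": 2, "A7": 5}

# ASSUMPTION: not stated in the task; chosen only to match the "~10%" estimate.
ASSUMED_TOTAL_WORKERS = 80


def worst_rack(counts):
    """Return (rack, worker_count) for the rack holding the most workers."""
    rack = max(counts, key=counts.get)
    return rack, counts[rack]


def capacity_lost(workers_in_rack, total_workers):
    """Fraction of capacity lost if one rack fails, assuming equal-sized workers."""
    return workers_in_rack / total_workers


rack, workers = worst_rack(ROW_A_WORKERS)
share = capacity_lost(workers, ASSUMED_TOTAL_WORKERS)
print(f"{rack}: {workers} workers, {share:.0%} of capacity")
```

With a different real cluster size the percentage changes, but A4 remains the rack whose loss would hurt most.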

RobH reassigned this task from RobH to wiki_willy. Fri, Feb 26, 4:12 PM

I would recommend opening a new task rather than reopening a resolved racking task and adding to the 'racking' timeline for completed tasks. However, Willy would make the call on this so I'm assigning this to him.

RobH removed a subscriber: RobH. Fri, Feb 26, 4:12 PM

No worries @elukey, it looks like I missed the double count in rack A4 as well. If these hosts need to stay in row A though, the only other 10g options would be in racks A2 or A7. Both are pretty full, but I do see room to fit one server in each rack, near the very top. We typically don't use shelf 42, but it could be possible - @Jclark-ctr will probably need to confirm how tight the space on shelf 42 is in A2 and A7. Also, ms-be1019 in A2 is EOL, so hopefully the SREs will have a decom task submitted for that soon, which would also free up another spot in the future. Would this work for you?

  • Move an-worker1129 to A2
  • Move an-worker1139 to A7

That would net you 6x servers in A2, 6x servers in A4, and 6x servers in A7. If it does, let's track this via a new task for the server moves.
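The resulting counts can be checked by applying the two proposed moves to the per-rack numbers elukey listed. A minimal sketch; the `apply_moves` helper is hypothetical and only does the bookkeeping:

```python
# Hadoop workers per rack in row A before the moves (from elukey's comment).
before = {"A1": 1, "A2": 5, "A3": 2, "A4": 8, "A5": 2, "A7": 5}

# Proposed moves: (host, source rack, destination rack).
moves = [
    ("an-worker1129", "A4", "A2"),
    ("an-worker1139", "A4", "A7"),
]


def apply_moves(counts, planned_moves):
    """Return new per-rack counts after moving each host from source to destination."""
    result = dict(counts)
    for host, src, dst in planned_moves:
        if result.get(src, 0) <= 0:
            raise ValueError(f"no worker left in {src} to move {host}")
        result[src] -= 1
        result[dst] = result.get(dst, 0) + 1
    return result


after = apply_moves(before, moves)
print({rack: after[rack] for rack in ("A2", "A4", "A7")})
```

The total number of workers in the row is unchanged; only the largest rack shrinks from 8 to 6.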

Thanks,
Willy
