(Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of maps20[05-10].codfw.wmnet. 4 of these were purchased to replace maps2001-2004, and 2 as expansion.

Hostname / Racking / Installation Details

Server    Row  Rack/U  switch_port
maps2005  A    A5/11   ge-5/0/10
maps2006  B    B1/7    ge-1/0/2
maps2007  C    C3/12   ge-3/0/11
maps2008  D    D3/11   ge-3/0/10
maps2009  B    B6/21   ge-6/0/16
maps2010  D    D6/4    ge-6/0/3

Hostnames: maps20[05-10]
Racking Proposal: These are replacing (4) and expanding (2) the maps footprint in codfw. Please ensure maps200[5-8] have one server per row, and then place maps20[09-10] in any two different rows. Ensure none of the maps20[05-10] have more than 1 host per rack. (End result will be two rows with 1 maps host each, and 2 rows with 2 maps hosts each, no rack having more than one maps host.)
Networking/Subnet/VLAN/IP: 1G networking, single port connection, production VLAN
Partitioning/Raid: Hardware RAID - raid10
OS Distro: Stretch
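
The racking proposal above can be checked mechanically against the racking table. The following is a minimal sketch, not part of any existing tooling: the host, row, and rack values are copied from the table in this task, while the function names (`duplicates`, `check_plan`, `hosts_per_row`) are hypothetical.

```shell
#!/bin/sh
# Racking plan copied from the table in this task: host, row, rack.
plan='maps2005 A A5
maps2006 B B1
maps2007 C C3
maps2008 D D3
maps2009 B B6
maps2010 D D6'

# Print every value that appears more than once in the given column
# (column number passed as $1) of the lines read from stdin.
duplicates() {
    awk -v col="$1" '{ print $col }' | sort | uniq -d
}

# Return 0 if the plan satisfies the racking proposal, 1 otherwise.
check_plan() {
    # No rack may hold more than one maps host.
    dup_racks=$(printf '%s\n' "$1" | duplicates 3)
    # maps2005-2008 must each sit in a different row.
    dup_rows_core=$(printf '%s\n' "$1" | grep -E '^maps200[5-8] ' | duplicates 2)
    # maps2009 and maps2010 must be in two different rows.
    dup_rows_extra=$(printf '%s\n' "$1" | grep -E '^maps(2009|2010) ' | duplicates 2)
    [ -z "$dup_racks" ] && [ -z "$dup_rows_core" ] && [ -z "$dup_rows_extra" ]
}

# Print the number of maps hosts per row, one "ROW COUNT" pair per line.
hosts_per_row() {
    printf '%s\n' "$1" | awk '{ n[$2]++ } END { for (r in n) print r, n[r] }' | sort
}

check_plan "$plan" && echo "racking plan OK"
hosts_per_row "$plan"
```

With the table as given, rows A and C end up with one host each and rows B and D with two each, which matches the end result described in the proposal.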

Per host setup checklist

maps2005:

  • - receive in system on procurement task T257952 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

maps2006:

  • - receive in system on procurement task T257952 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

maps2007:

  • - receive in system on procurement task T257952 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

maps2008:

  • - receive in system on procurement task T257952 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

maps2009:

  • - receive in system on procurement task T257952 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

maps2010:

  • - receive in system on procurement task T257952 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

Restricted Application added a subscriber: Aklapper.
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Acknowledged on the SRE board.
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH removed a subscriber: RobH.
Papaul updated the task description.
Papaul added a subscriber: Gehel.

@Gehel are you the right person for those servers? If yes, I need to know what hardware RAID type we are going to use.

Thanks

@Papaul Feel free to reach out to me as the point of contact regarding these servers.

We'd like to use RAID10 as our hardware RAID.

(There's a bit of context here where we were advised to use RAID10)

@RKemper thank you for the info. What stripe size for the RAID 10?

@Papaul Sorry for the delayed response!

Thinking out loud: since we're using SSDs, I imagine the penalty for non-sequential reads is not there the way it would be with HDDs, so there's probably not as much benefit from large stripe sizes as there otherwise would be. (I imagine it still helps with overhead, though.)

Also, glancing at our existing maps servers, we have a lot of large files, but we also seem to have a lot of files that are tens of kilobytes or less.

I'll hop into DC-Ops irc tomorrow (Sep 10) and ask some questions, since I have a couple general questions that you guys might be able to help with.

TL;DR: I'll get you an answer on Sep 10 after thinking it over some more

Change 626707 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add production DNS for maps200[5-9] and maps2010

https://gerrit.wikimedia.org/r/626707

Change 626707 merged by Papaul:
[operations/dns@master] DNS: Add production DNS for maps200[5-9] and maps2010

https://gerrit.wikimedia.org/r/626707

Change 627540 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address for maps2005-maps2010

https://gerrit.wikimedia.org/r/627540

Change 627540 merged by Papaul:
[operations/puppet@production] DHCP: Add MAC address for maps2005-maps2010

https://gerrit.wikimedia.org/r/627540

Change 627553 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] Add maps20(0[5-9]|1[0]) to site.pp

https://gerrit.wikimedia.org/r/627553

Change 627553 merged by Papaul:
[operations/puppet@production] Add maps20(0[5-9]|1[0]) to site.pp

https://gerrit.wikimedia.org/r/627553

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009151634_pt1979_20492_maps2005_codfw_wmnet.log.

@RKemper the installer is not able to set up the RAID using the partman recipe below. These hosts use HW RAID 10, but the recipe is a SW RAID 10 with 4 disks, so this will not work. Please investigate and let me know what partman recipe you would like to use.

Thanks.

maps*) echo partman/standard.cfg partman/raid10-4dev.cfg ;; \

Completed auto-reimage of hosts:

['maps2005.codfw.wmnet']

Of which those FAILED:

['maps2005.codfw.wmnet']

Looking at existing configurations, it looks like maps should have:

maps*) echo partman/standard.cfg partman/hwraid-1dev.cfg ;; \

Change 628089 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps: add partman configuration for newer maps servers.

https://gerrit.wikimedia.org/r/628089

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009181415_pt1979_25814_maps2005_codfw_wmnet.log.

Change 628089 merged by Papaul:
[operations/puppet@production] maps: add partman configuration for newer maps servers.

https://gerrit.wikimedia.org/r/628089

Completed auto-reimage of hosts:

['maps2005.codfw.wmnet']

Of which those FAILED:

['maps2005.codfw.wmnet']

Still having a partman recipe problem, maybe because of this line:

maps[12]00[1-4]*) echo partman/standard.cfg partman/raid10-4dev.cfg ;; \

I think we have to remove the * at the end, after the ], otherwise it will not read

maps*) echo partman/standard.cfg partman/hwraid-1dev.cfg ;; \

@Gehel what do you think?
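
For context on the two patterns discussed above: a shell `case` statement runs only the first branch whose pattern matches, so a host-specific pattern has to appear before the catch-all `maps*)`. The sketch below is a hypothetical stand-in for the relevant part of netboot.cfg, not the actual file. The two patterns and recipe names are taken from this task; the function name `recipe_for`, the fallback branch, and the use of FQDNs as input are assumptions.

```shell
#!/bin/sh
# Hypothetical stand-in for the maps entries in netboot.cfg: print the
# partman recipe files that would be selected for a given hostname.
recipe_for() {
    case "$1" in
        # Older hosts (maps1001-1004, maps2001-2004): software RAID10 recipe.
        # This branch must come before the catch-all, or it is never reached.
        maps[12]00[1-4]*) echo "partman/standard.cfg partman/raid10-4dev.cfg" ;;
        # Any other maps host: hardware RAID recipe.
        maps*) echo "partman/standard.cfg partman/hwraid-1dev.cfg" ;;
        # Assumed fallback so the sketch always prints something.
        *) echo "no-recipe" ;;
    esac
}

for h in maps2001.codfw.wmnet maps2004.codfw.wmnet maps2005.codfw.wmnet maps2010.codfw.wmnet; do
    printf '%s -> %s\n' "$h" "$(recipe_for "$h")"
done
```

In this sketch the trailing `*` on the first pattern is what lets it match a full name such as maps2001.codfw.wmnet. Whether the installer passes the short hostname or the FQDN is not stated in this task.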

Change 628348 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps: fix typo in glob exrpession for maps netboot.cfg

https://gerrit.wikimedia.org/r/628348

Change 628348 merged by Gehel:
[operations/puppet@production] maps: fix typo in glob exrpession for maps netboot.cfg

https://gerrit.wikimedia.org/r/628348

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009181529_pt1979_9410_maps2005_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2005.codfw.wmnet']

Of which those FAILED:

['maps2005.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009181545_pt1979_12312_maps2005_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2005.codfw.wmnet']

Of which those FAILED:

['maps2005.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009181603_pt1979_14734_maps2005_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2005.codfw.wmnet']

Of which those FAILED:

['maps2005.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009181709_pt1979_27191_maps2005_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2005.codfw.wmnet']

and were ALL successful.

Hey guys, please look at the output below and see if it all looks good on maps2005, so I can resume the install on the other nodes on Monday. Thanks.

pt1979@cumin2001:~$ sudo cumin 'maps2005.codfw.wmnet' 'free -g ; df -hT '
1 hosts will be targeted:
maps2005.codfw.wmnet
Confirm to continue [y/n]? y
----- OUTPUT of 'free -g ; df -hT ' -----
              total        used        free      shared  buff/cache   available
Mem:            125           0         124           0           0         124
Swap:             0           0           0
Filesystem           Type      Size  Used Avail Use% Mounted on
udev                 devtmpfs   63G     0   63G   0% /dev
tmpfs                tmpfs      13G  9.6M   13G   1% /run
/dev/mapper/vg0-root ext4       73G  1.4G   68G   2% /
tmpfs                tmpfs      63G     0   63G   0% /dev/shm
tmpfs                tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                tmpfs      63G     0   63G   0% /sys/fs/cgroup
/dev/mapper/vg0-srv  ext4      3.4T   89M  3.2T   1% /srv
tmpfs                tmpfs      13G     0   13G   0% /run/user/0
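
The same check could be repeated on the remaining hosts without reading the output by eye. This is a rough sketch only: the function name `check_mounts` and the pass criteria (both / and /srv being ext4 on the vg0 volume group, as in the maps2005 output above) are assumptions, not an agreed acceptance test.

```shell
#!/bin/sh
# Read `df -hT` output on stdin and succeed only if / and /srv are ext4
# filesystems on the vg0 volume group, as seen on maps2005.
check_mounts() {
    awk '
        $NF == "/"    && $1 == "/dev/mapper/vg0-root" && $2 == "ext4" { root = 1 }
        $NF == "/srv" && $1 == "/dev/mapper/vg0-srv"  && $2 == "ext4" { srv = 1 }
        END { exit !(root && srv) }
    '
}

# Sample lines copied from the maps2005 output in this task.
sample='Filesystem           Type      Size  Used Avail Use% Mounted on
/dev/mapper/vg0-root ext4       73G  1.4G   68G   2% /
/dev/mapper/vg0-srv  ext4      3.4T   89M  3.2T   1% /srv'

printf '%s\n' "$sample" | check_mounts && echo "layout matches maps2005"
```

On a live host the function would be fed from `df -hT | check_mounts` instead of the sample text.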

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2006.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009221554_pt1979_24576_maps2006_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2006.codfw.wmnet']

Of which those FAILED:

['maps2006.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2006.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009221605_pt1979_26209_maps2006_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2006.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2007.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009221637_pt1979_315_maps2007_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2007.codfw.wmnet']

Of which those FAILED:

['maps2007.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2007.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009221654_pt1979_5446_maps2007_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2007.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2008.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009221907_pt1979_29719_maps2008_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2008.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2009.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009222029_pt1979_12848_maps2009_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2009.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009222116_pt1979_23003_maps2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2010.codfw.wmnet']

Of which those FAILED:

['maps2010.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

maps2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009222131_pt1979_25100_maps2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['maps2010.codfw.wmnet']

and were ALL successful.

Papaul updated the task description.

This is complete

maps2010 has been reported as down for about 3 days.

Is there a ticket for moving these into production?

Mentioned in SAL (#wikimedia-operations) [2020-10-09T23:13:49Z] <mutante> maps2010 is down since almost 3 days - unhandled crit alert but nothing in SAL and only related ticket says resolved - powercycling it - boots normal but doesn't have a prod role (T260271)

Change 644611 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] maps: remove no-longer-accurate insetup role

https://gerrit.wikimedia.org/r/644611

Change 644611 merged by Hnowlan:
[operations/puppet@production] maps: remove no-longer-accurate insetup role

https://gerrit.wikimedia.org/r/644611