
rack/setup/install replacement to stat1005 (stat1002 replacement)
Closed, Resolved (Public)

Description

This task will track the racking, setup, and installation of the new system ordered on T162700 to replace stat1002.

This host also has a GPU for offloading statistics crunching.

This system will run as a replacement for stat1002, with the statistics::private role.

  • receive the system in as normal, per parent task T162700
  • system is a replacement, so it can be racked in any available rack/row, as all rows have the analytics1 subnets
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • production dns entries added (analytics1 vlan for its row)
  • network port setup (description, enable, analytics1 vlan for its row)
  • operations/puppet update - https://gerrit.wikimedia.org/r/361079
  • OS installation - Debian Stretch installed, but can be reinstalled with Jessie if required. stat1002 is running Trusty, which is very old.
  • puppet/salt accept/initial run
  • handoff for service implementation

Details

Related Gerrit Patches:
operations/puppet : production : Set stat1005's pxe boot option to stretch
operations/puppet : production : stat1005 needs jessie
operations/puppet : production : stat1005 install module update

Event Timeline

RobH edited projects, added ops-eqiad; removed procurement. (May 15 2017, 7:18 PM)
RobH added projects: Analytics, Analytics-Cluster.
Ottomata added a comment. (Edited, May 16 2017, 8:05 AM)

stat1004 is already taken! stat1005 please

Ottomata renamed this task from rack/setup/install replacement to stat1002 (stat1004 or misc name?) to rack/setup/install replacement to stat1005 (stat1002 replacement?). (May 16 2017, 8:07 AM)
Ottomata renamed this task from rack/setup/install replacement to stat1005 (stat1002 replacement?) to rack/setup/install replacement to stat1005 (stat1002 replacement).
Ottomata updated the task description.
Ottomata updated the task description.

I modified the description for the host name, and I also moved the blurb about GPU from T165366 to this ticket, since the stat1002 replacement is the one with the GPU.

@Cmjohnson estimate on these? We'd like to get them up and running by the end of this quarter, so I'm going to need to start scheduling stuff soon.

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. (May 25 2017, 6:09 PM)

@Ottomata It is in the rack as of today. I am getting through several orders and will do my best. I don't have an exact day, but I would expect it in the next 2-3 weeks, if not sooner. Please monitor the ops-eqiad workboard.

Change 355785 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns entries for stat1005 and stat1006 T165366 T165368

https://gerrit.wikimedia.org/r/355785

Change 355785 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns entries for stat1005 and stat1006 T165366 T165368

https://gerrit.wikimedia.org/r/355785

Change 355786 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns entries for dumpsdata1001/2 T165368

https://gerrit.wikimedia.org/r/355786

Change 355786 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns entries for dumpsdata1001/2 T165368

https://gerrit.wikimedia.org/r/355786

Cmjohnson updated the task description. (May 30 2017, 6:01 PM)
Cmjohnson updated the task description.
Cmjohnson updated the task description. (Jun 8 2017, 3:29 PM)

Change 357860 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357860

Change 357860 merged by Cmjohnson:
[operations/puppet@production] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357860

Change 357870 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 and stat1005/6 T165366 T165368 T165173 T166264 T165531 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357870

Change 357870 merged by Cmjohnson:
[operations/dns@master] Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 and stat1005/6 T165366 T165368 T165173 T166264 T165531 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357870

Cmjohnson reassigned this task from Cmjohnson to RobH. (Jun 8 2017, 7:08 PM)
Cmjohnson updated the task description.

Change 357879 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076"

https://gerrit.wikimedia.org/r/357879

Change 357879 abandoned by RobH:
Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076"

https://gerrit.wikimedia.org/r/357879

Mentioned in SAL (#wikimedia-operations) [2017-06-15T00:15:27Z] <mutante> dumpsdata1001 - was reported in icinga as CRIT systemdstate - reason was puppet service was failed with "Invalid value '"no"' for boolean parameter: daemonize" (it was ok on other hosts??). commented the option, stopped puppet, systemctl reset-failed - which made it recover (T165368)
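The error quoted in the SAL entry shows the value as '"no"', that is, with literal double quotes around it, which suggests the parser received the quote characters as part of the value. As a rough illustration only, here is a minimal Python sketch of a strict boolean parser that accepts a bare no but rejects a quoted one. The function name parse_boolean_setting and the set of accepted words are assumptions made for this example; this is not Puppet's actual code.

```python
def parse_boolean_setting(name, raw):
    """Strictly parse a boolean config value (illustrative, not Puppet's code).

    The accepted words below are an assumption for this sketch.
    """
    truthy = {"true", "yes"}
    falsy = {"false", "no"}
    value = raw.strip().lower()
    if value in truthy:
        return True
    if value in falsy:
        return False
    # A value such as '"no"' still carries its quote characters, so it
    # matches neither set and is rejected.
    raise ValueError(
        "Invalid value '%s' for boolean parameter: %s" % (raw, name)
    )


if __name__ == "__main__":
    print(parse_boolean_setting("daemonize", "no"))
    try:
        parse_boolean_setting("daemonize", '"no"')
    except ValueError as exc:
        print(exc)
```

Under these assumptions, removing the quotes would have the same effect as the workaround recorded in the log (commenting the option out), in that the parser no longer sees an unparseable value.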

Change 361079 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] stat1005 install module update

https://gerrit.wikimedia.org/r/361079

Change 361079 merged by RobH:
[operations/puppet@production] stat1005 install module update

https://gerrit.wikimedia.org/r/361079

RobH updated the task description. (Jun 23 2017, 5:07 PM)
RobH reassigned this task from RobH to Ottomata. (Jun 23 2017, 5:33 PM)
RobH raised the priority of this task from Low to Medium.
RobH removed projects: Patch-For-Review, ops-eqiad.
RobH updated the task description.

System is installed, puppet/salt accepted, and ready to have software/services implemented. Assigning to @Ottomata for implementation.

RobH claimed this task. (Jun 23 2017, 5:35 PM)

Turns out it needs jessie; taking it back for a reimage.

Change 361082 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] stat1005 needs jessie

https://gerrit.wikimedia.org/r/361082

Change 361082 merged by RobH:
[operations/puppet@production] stat1005 needs jessie

https://gerrit.wikimedia.org/r/361082

Script wmf_auto_reimage was launched by robh on neodymium.eqiad.wmnet for hosts:

['stat1005.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706231750_robh_8687.log.

Completed auto-reimage of hosts:

['stat1005.eqiad.wmnet']

and were ALL successful.

RobH reassigned this task from RobH to Ottomata. (Jun 23 2017, 6:50 PM)

reimage complete, system ready for you to add to site.pp for specific roles!

Change 362975 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set stat1005's pxe boot option to stretch

https://gerrit.wikimedia.org/r/362975

Change 362975 merged by Elukey:
[operations/puppet@production] Set stat1005's pxe boot option to stretch

https://gerrit.wikimedia.org/r/362975

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['stat1005.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201707031225_elukey_24533.log.

Completed auto-reimage of hosts:

['stat1005.eqiad.wmnet']

Of which those FAILED:

set(['stat1005.eqiad.wmnet'])
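The two bot reports in this timeline differ only in their closing lines: the first states that all hosts were successful, the second lists the hosts that failed. A minimal Python sketch of how such a summary could be built from a host list and a collection of failures is given here. The helper summarize_reimage is hypothetical and is not taken from wmf_auto_reimage; only the message wording is copied from the comments above.

```python
def summarize_reimage(hosts, failed):
    """Build report lines for a reimage run (hypothetical helper)."""
    # Only count failures for hosts that were actually part of the run.
    failed_hosts = sorted(set(failed) & set(hosts))
    lines = ["Completed auto-reimage of hosts:", repr(list(hosts))]
    if failed_hosts:
        lines += ["Of which those FAILED:", repr(failed_hosts)]
    else:
        lines.append("and were ALL successful.")
    return lines


if __name__ == "__main__":
    for line in summarize_reimage(["stat1005.eqiad.wmnet"],
                                  ["stat1005.eqiad.wmnet"]):
        print(line)
```

The sketch prints the failed hosts as a sorted list rather than the set(...) form seen in the original log, so that the output order is stable.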

Nuria moved this task from Wikistats Production to Dashiki on the Analytics board. (Jul 3 2017, 3:36 PM)
RobH removed a subscriber: RobH. (Jul 3 2017, 3:39 PM)
Ottomata updated the task description.
Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.
Nuria closed this task as Resolved. (Aug 2 2017, 12:33 AM)