(Need by: ASAP) rack/setup/install stat1008
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of stat1008.eqiad.wmnet.

Please note this is a special setup, as it requires the server from T242149 and the GPU card from T238587 to move forward.

Hostname / Racking / Installation Details

Hostnames: stat1008
Racking Proposal: 1G rack, not shared with other stat hosts (so not in A4, B1, B3, D1, or D3).
Networking/Subnet/VLAN/IP: single 1g connection to asw into analytics vlan
Partitioning/Raid: match other stat hosts, buster install

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

stat1008:

  • - receive in system on procurement task T242149 and the GPU card on T238587
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - install GPU into system, document if there are any pain points, as this chassis + GPU card will be re-ordered 6-12 more times with this as the proof of concept.
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Please assign this task directly to @RobH once system is installed, so he can review and resolve (and move forward with other GPU based orders.)

Related Objects

Status: Resolved
Assigned: Cmjohnson

Event Timeline

RobH triaged this task as High priority. · Feb 28 2020, 6:21 PM
RobH created this task.
Restricted Application added a project: Operations. · Feb 28 2020, 6:21 PM
RobH added parent tasks: Unknown Object (Task), Unknown Object (Task). · Feb 28 2020, 6:21 PM
RobH updated the task description.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board. · Feb 28 2020, 6:23 PM

@elukey,

Can you provide the info in the 'Hostname / Racking / Installation Details' section in the task body, and then reassign this from yourself to @Jclark-ctr?

Thanks in advance,

RobH renamed this task from (Need by: ASAP) rack/setup/install new GPU host to (Need by: ASAP) rack/setup/install stat1008. · Feb 28 2020, 7:02 PM
RobH updated the task description.
RobH updated the task description. · Feb 28 2020, 7:06 PM
RobH moved this task from Backlog to Acknowledged on the Operations board.
RobH removed a subscriber: RobH. · Mar 3 2020, 6:00 PM
RobH added a comment. · Mar 3 2020, 10:07 PM

Both of these items have arrived at the DC site: T238587 arrived today (@Jclark-ctr mentioned it on IRC) and T242149 arrived on 2020-03-02.

Jclark-ctr updated the task description. · Mar 4 2020, 1:09 AM
Jclark-ctr added a subscriber: RobH.
elukey reassigned this task from elukey to Jclark-ctr. · Mar 4 2020, 10:03 AM

Assigning to @Jclark-ctr since the info has already been filled in (thanks!)

RobH removed a subscriber: RobH. · Mar 4 2020, 3:43 PM
RobH added a subscriber: RobH.
Jclark-ctr updated the task description. · Mar 7 2020, 5:40 PM

Installed the GPU with no problems. The server will only fit 1 more GPU.

Jclark-ctr reassigned this task from Jclark-ctr to Cmjohnson. · Mar 7 2020, 5:45 PM

Racked and cabled; handing over to Chris for configuration.

Rack A6, unit 17, switchport 22.

elukey awarded a token. · Mar 7 2020, 7:08 PM

Change 578518 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] update dns mgmt/production for stat1008

https://gerrit.wikimedia.org/r/578518

Change 578518 merged by Cmjohnson:
[operations/dns@master] update dns mgmt/production for stat1008

https://gerrit.wikimedia.org/r/578518

Cmjohnson updated the task description. · Mar 10 2020, 12:56 PM
Cmjohnson updated the task description.
RobH removed a subscriber: RobH. · Mar 10 2020, 3:14 PM
RobH added a subscriber: RobH.
Cmjohnson updated the task description. · Mar 10 2020, 6:06 PM

Change 578573 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add netboot.cfg and dhcpd file for stat1008

https://gerrit.wikimedia.org/r/578573

Change 578573 merged by Cmjohnson:
[operations/puppet@production] Add netboot.cfg and dhcpd file for stat1008

https://gerrit.wikimedia.org/r/578573

Change 578574 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add stat1008 to site.pp role spare

https://gerrit.wikimedia.org/r/578574

Change 578574 merged by Cmjohnson:
[operations/puppet@production] Add stat1008 to site.pp role spare

https://gerrit.wikimedia.org/r/578574

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

stat1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003101826_cmjohnson_10737_stat1008_eqiad_wmnet.log.

Cmjohnson updated the task description. · Mar 10 2020, 6:27 PM

Completed auto-reimage of hosts:

['stat1008.eqiad.wmnet']

Of which those FAILED:

['stat1008.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

stat1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003111131_cmjohnson_179724_stat1008_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

stat1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003111244_elukey_194481_stat1008_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['stat1008.eqiad.wmnet']

Of which those FAILED:

['stat1008.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

stat1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003111244_elukey_194513_stat1008_eqiad_wmnet.log.

This seems to be where d-i gets stuck:

    Error while setting up RAID
    An unexpected error occurred while setting up a preseeded RAID configuration

Completed auto-reimage of hosts:

['stat1008.eqiad.wmnet']

Of which those FAILED:

['stat1008.eqiad.wmnet']

From the d-i shell I can see:

~ # fdisk -l
Disk /dev/sda: 7.3 TiB, 7999376588800 bytes, 15623782400 sectors
Disk model: PERC H730P Adp
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 6BEC427B-F34C-4AD9-9144-2263F5D5B933

Device      Start         End     Sectors  Size Type
/dev/sda1    2048      585727      583680  285M BIOS boot
/dev/sda2  585728 15623780351 15623194624  7.3T Linux RAID

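But the partman recipe is raid10-4dev.cfg, which by its name expects four separate disks, while the installer only sees a single 7.3 TiB PERC H730P virtual disk. Is there a hardware RAID configuration already in place that might interfere?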
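The mismatch can be checked with a few lines of arithmetic. The following is only a sketch: the sector count and sector size are copied from the fdisk output above, while the helper names `disk_size_tib` and `recipe_matches` and the idea of comparing a device list against the recipe's device count are illustrative assumptions, not part of any installer tooling.

```python
# Sketch: sanity-check the fdisk figures and compare the number of visible
# disks with what a 4-device software RAID recipe would need.
# Helper names are hypothetical; the numbers come from the fdisk output.
SECTOR_BYTES = 512               # logical sector size reported by fdisk
TOTAL_SECTORS = 15_623_782_400   # sector count reported for /dev/sda
TIB = 2 ** 40


def disk_size_tib(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count to TiB."""
    return sectors * sector_bytes / TIB


def recipe_matches(visible_disks, expected_devices):
    """True only if the installer sees exactly as many disks as the recipe expects."""
    return len(visible_disks) == expected_devices


print(TOTAL_SECTORS * SECTOR_BYTES)             # byte size of the virtual disk
print(round(disk_size_tib(TOTAL_SECTORS), 1))   # size in TiB, one decimal
print(recipe_matches(["/dev/sda"], expected_devices=4))
```

With a single virtual disk exposed by the controller, the last check is false, which is consistent with the preseeded RAID step failing.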

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

stat1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003111357_cmjohnson_207409_stat1008_eqiad_wmnet.log.

From the cumin log I see the following:

2020-03-11 14:16:20 [INFO] (cmjohnson) wmf-auto-reimage::print_line: Started first puppet run (sit back, relax, and enjoy the wait)

It seems stuck there, so I went to the mgmt console and logged in as root. I then executed puppet agent -tv and everything worked, except for the following error during the first puppet run:

Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install megacli' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package megacli
Error: /Stage[main]/Packages::Megacli/Package[megacli]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install megacli' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package megacli
Notice: /Stage[main]/Raid::Megaraid/File[/usr/local/lib/nagios/plugins/get-raid-status-megacli]: Dependency Package[megacli] has failures: true
Warning: /Stage[main]/Raid::Megaraid/File[/usr/local/lib/nagios/plugins/get-raid-status-megacli]: Skipping because of failed dependencies

Not sure why wmf-reimage is stuck, though.
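For triage, the failing package can be pulled out of the Puppet output programmatically. This is a minimal sketch, assuming the output has been captured as text; the function names `missing_packages` and `failed_dependencies` are hypothetical, and the sample string is an excerpt of the log above.

```python
import re

# Excerpt of the puppet run output quoted above.
PUPPET_OUTPUT = """\
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install megacli' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package megacli
Notice: /Stage[main]/Raid::Megaraid/File[/usr/local/lib/nagios/plugins/get-raid-status-megacli]: Dependency Package[megacli] has failures: true
"""

# apt reports packages it cannot find on a line of its own.
MISSING_PKG = re.compile(r"^E: Unable to locate package (\S+)$", re.MULTILINE)
# Puppet names the failed dependency when it skips a dependent resource.
FAILED_DEP = re.compile(r"Dependency Package\[([^\]]+)\] has failures: true")


def missing_packages(text):
    """Return the sorted, de-duplicated package names apt could not locate."""
    return sorted(set(MISSING_PKG.findall(text)))


def failed_dependencies(text):
    """Return the sorted, de-duplicated Package resources Puppet flagged as failed."""
    return sorted(set(FAILED_DEP.findall(text)))


print(missing_packages(PUPPET_OUTPUT))
print(failed_dependencies(PUPPET_OUTPUT))
```

Both calls point at megacli here, which narrows the failure to package availability on the new install rather than to the RAID check itself.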

The partition layout doesn't look like what I expected:

elukey@stat1008:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   63G     0   63G   0% /dev
tmpfs                  13G   18M   13G   1% /run
/dev/mapper/vg0-root   73G  1.6G   68G   3% /
tmpfs                  63G     0   63G   0% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                  63G     0   63G   0% /sys/fs/cgroup
/dev/mapper/vg0-srv   2.8T   89M  2.7T   1% /srv
tmpfs                  13G     0   13G   0% /run/user/13926

There should be a 7TB+ /srv partition, or at least enough free space to expand the logical volume. But:

elukey@stat1008:~$ sudo lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root vg0 -wi-ao----  74.50g
  srv  vg0 -wi-ao----  <2.84t
  swap vg0 -wi-ao---- 976.00m
elukey@stat1008:~$ sudo pvs
  PV         VG  Fmt  Attr PSize  PFree
  /dev/md0   vg0 lvm2 a--  <3.64t <744.84g

Need to check if the partman scheme is correct for this use case, or if something else is causing this.
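The numbers above can be cross-checked with some quick arithmetic. This is a rough sketch under stated assumptions: the sizes are the rounded values printed by lvs and pvs, the per-disk size is inferred from the md0 size (it is not reported anywhere in this task), the array is assumed to keep two copies of each block, and `raid10_usable` is a hypothetical helper.

```python
# Sketch: account for the space on /dev/md0 and estimate what a RAID10
# over 4 or 8 equal disks would provide. All inputs are rounded values
# from the lvs/pvs output; the per-disk size is an inference.
GIB_PER_TIB = 1024

lv_sizes_gib = {
    "root": 74.50,
    "srv": 2.84 * GIB_PER_TIB,   # lvs shows "<2.84t"
    "swap": 976.00 / 1024,       # lvs shows 976.00m
}
pv_free_gib = 744.84             # pvs PFree
pv_size_tib = 3.64               # pvs PSize


def raid10_usable(devices, device_size, copies=2):
    """Usable capacity of a RAID10 array, assuming `copies` replicas of each block."""
    return devices * device_size / copies


allocated_gib = sum(lv_sizes_gib.values())
accounted_tib = (allocated_gib + pv_free_gib) / GIB_PER_TIB

# Assumption: md0 was built from 4 equal disks with 2 copies each.
per_disk_tib = pv_size_tib * 2 / 4

print(round(accounted_tib, 2))                    # allocated + free on md0
print(round(raid10_usable(4, per_disk_tib), 2))   # 4-device layout
print(round(raid10_usable(8, per_disk_tib), 2))   # 8-device layout
```

Under these assumptions the logical volumes plus free space add up to the 3.64 TiB physical volume, and an 8-device layout would roughly double that, in line with the expected 7TB+ of space.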

Change 579006 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Use raid10-8dev for stat100x hosts

https://gerrit.wikimedia.org/r/579006

Change 579017 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: let new ganeti nodes and logstash1008 use role(insetup)

https://gerrit.wikimedia.org/r/579017

Change 579006 merged by Elukey:
[operations/puppet@production] Use raid10-8dev for stat1008

https://gerrit.wikimedia.org/r/579006

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

stat1008.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202003111815_elukey_252772_stat1008_eqiad_wmnet.log.

Change 579017 merged by Dzahn:
[operations/puppet@production] site: let new ganeti nodes and logstash1008 use role(insetup)

https://gerrit.wikimedia.org/r/579017

Completed auto-reimage of hosts:

['stat1008.eqiad.wmnet']

and were ALL successful.

Cmjohnson closed this task as Resolved. · Mar 11 2020, 7:45 PM
Cmjohnson updated the task description.
elukey reopened this task as Open. · Mar 16 2020, 1:17 PM
Cmjohnson closed this task as Resolved. · Mar 16 2020, 2:32 PM

Thanks, @elukey fixed the issue in netbox