
rack/setup/install dumpsdata100[12]
Closed, Resolved, Public

Description

This task will track the racking/setup/installation of two new dumpsdata100[12] hosts for eqiad. These were ordered on T161344, and initially requested on T161311.

dumpsdata1001:

  • - receive in system on procurement task T161344
  • - system needs to be racked in a different row than dumpsdata1002, otherwise any 1Gb network capable rack with free space/power/network is fine.
  • - bios/drac/serial setup/testing
  • - hardware raid10 setup of all disks
  • - mgmt dns entries added for both asset tag and hostname
  • - production dns entries added (internal vlan)
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible) partition with small / and larger /srv in lvm, no swap.
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

dumpsdata1002:

  • - receive in system on procurement task T161344
  • - system needs to be racked in a different row than dumpsdata1001, otherwise any 1Gb network capable rack with free space/power/network is fine.
  • - bios/drac/serial setup/testing
  • - hardware raid10 setup of all disks
  • - mgmt dns entries added for both asset tag and hostname
  • - production dns entries added (internal vlan)
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible) partition with small / and larger /srv in lvm, no swap.
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

Event Timeline

@Cmjohnson: I know you have a LOT of incoming hardware right now, so once the on-site specific steps are done, you can push this to me for the puppet updates/os install/etc...

gerritbot subscribed.

Change 357860 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357860

Change 357860 merged by Cmjohnson:
[operations/puppet@production] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357860

Change 357870 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 T165173 T166264 T165531 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357870

Change 357870 merged by Cmjohnson:
[operations/dns@master] Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 and stat1005/6 T165366 T165368 T165173 T166264 T165531 T165520 T162216 T166076

https://gerrit.wikimedia.org/r/357870

Cmjohnson updated the task description.

MAC addresses have been added to the dhcpd file. Assigning to @RobH for install.

Change 357879 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076"

https://gerrit.wikimedia.org/r/357879

Change 357879 abandoned by RobH:
Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076"

https://gerrit.wikimedia.org/r/357879

Change 357949 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] adding in dumpsdata00[12] install params

https://gerrit.wikimedia.org/r/357949

Change 357949 merged by RobH:
[operations/puppet@production] adding in dumpsdata00[12] install params

https://gerrit.wikimedia.org/r/357949

Ok, this is having an installer issue of some sort. Both systems should be identical, but one shows no root filesystem when the partitioning menu launches, and the other doesn't have automatic partitioning take over.

The no root filesystem is the first issue to tackle. I've done a comparison between the two machines and I don't see any difference, but I'm escalating this back to @Cmjohnson for him to double check!

Both Chris and I have reviewed, and everything on these two systems appears identical. The fact we get two results from the same hardware is confusing. I've gone so far as to factory reset everything but DRAC (as I'd lose connectivity then) and one by one ensure each screen has identical settings for everything.

We've also ensured virtual media in idrac is disabled, and that the flexbay dual 1TB SATA disks are set as the bootable disk in the raid controller settings.

When booting dumpsdata1001, it loads the installer, and when it comes to partition disks, gives:

┌────────────┤ [!!] Partition disks ├─────────────┐
│                                                 │
│               No root file system               │
│ No root file system is defined.                 │
│                                                 │
│ Please correct this from the partitioning menu. │
│                                                 │
│                   <Continue>                    │
│                                                 │
└─────────────────────────────────────────────────┘

This is odd, since dumpsdata1002 will go to the manual partitioning menu just fine. Something is off about the detection/loading of the disks into the partitioning menu. Unfortunately, this screen only allows Continue, with no option to go back, which is where I could view and save the debug logs from the installation to see what is going on.

It is odd, since I've also pulled them out of netboot.cfg entirely to test. When they are out of it, it should ALWAYS show the manual partitioning screen. This will assist in the troubleshooting of these systems.
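A rough sketch of the behaviour expected here, assuming netboot.cfg selects a partman recipe by matching the hostname against glob patterns, and that a host matching no pattern falls through to manual partitioning. The pattern table and the recipe filename are hypothetical and not taken from the real file, which is an installer config rather than Python.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern -> recipe table; the real netboot.cfg entries
# are not shown in this task.
RECIPES = [
    ("dumpsdata100[12]", "dumpsdata.cfg"),
]


def recipe_for(hostname, recipes=RECIPES):
    """Return the first recipe whose glob matches the hostname.

    None means no entry matched, i.e. the installer would be expected
    to show the manual partitioning screen.
    """
    for pattern, recipe in recipes:
        if fnmatchcase(hostname, pattern):
            return recipe
    return None


print(recipe_for("dumpsdata1001"))      # matches the glob
print(recipe_for("dumpsdata1001", []))  # entry removed: no recipe
```

With the entry removed (the empty table in the second call), no recipe is selected, which is why both hosts should have shown the manual partitioning screen.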

Why they aren't acting identically to one another, I'm not certain. The disks themselves have been initialized in new arrays, so there shouldn't be any old cruft to cause them to detect otherwise. Since this particular error offers no way to go back and save logs, it is difficult to troubleshoot.

Ok, the error has now gone away. It may have been an odd condition where the raid was rebuilding, but I'm not certain.

Now the question is how to set up the disk mount points. These two systems have dual 1TB disks for the OS in a hardware raid1 array, presenting as a single 999.7GB disk. The 12 4TB disks are in a hardware raid10 virtual disk, presenting as a single 24TB disk to the OS.
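For reference, the nominal capacities above follow from simple mirror arithmetic. A minimal sketch, assuming raid1 exposes the capacity of one disk and raid10 exposes half of the total; the helper name is made up for illustration, and it ignores controller overhead, which is presumably why the OS array shows up as 999.7GB rather than a full 1TB.

```python
def raid_usable_tb(level, disks, disk_tb):
    """Nominal usable capacity in TB for a mirrored array (illustrative helper)."""
    if level == "raid1":
        if disks < 2:
            raise ValueError("raid1 needs at least 2 disks")
        # Every disk holds the same data, so usable space is one disk.
        return disk_tb
    if level == "raid10":
        if disks < 4 or disks % 2:
            raise ValueError("raid10 needs an even number of disks, at least 4")
        # Disks are mirrored in pairs and the pairs are striped together.
        return (disks // 2) * disk_tb
    raise ValueError("unsupported level: %s" % level)


print(raid_usable_tb("raid1", 2, 1))    # OS array: 2 x 1TB
print(raid_usable_tb("raid10", 12, 4))  # data array: 12 x 4TB
```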

The OS and all of the / filesystem will go on the first 999.7GB virtual disk. Should /srv be mounted on the 24TB disk? (If so, I'll need to write a new partman recipe for this, but that is perfectly acceptable!)

chatted with ariel via irc:

sda = hardware raid 1 of 2 1TB disks for a 999.7GB disk
sdb = hardware raid 10 of 12 4TB disks for a 24TB disk

There will be a small /boot and / within an lvm on sda
there will be a single lvm with /data on sdb

Change 358880 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] dumpsdata100[12] new partman recipe

https://gerrit.wikimedia.org/r/358880

Change 358880 merged by RobH:
[operations/puppet@production] dumpsdata100[12] new partman recipe

https://gerrit.wikimedia.org/r/358880

Change 358889 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] tweaking dumpsdata partman recipe

https://gerrit.wikimedia.org/r/358889

Change 358889 merged by RobH:
[operations/puppet@production] tweaking dumpsdata partman recipe

https://gerrit.wikimedia.org/r/358889

Change 358895 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] typo in dumpsdata1002 reverse file

https://gerrit.wikimedia.org/r/358895

Change 358895 merged by RobH:
[operations/dns@master] typo in dumpsdata1002 reverse file

https://gerrit.wikimedia.org/r/358895

So there was a typo in the reverse entry for dumpsdata1002, and it caused it not to detect from the entry in netboot.cfg and use the auto partition recipe. I've cleared the dns recursors of the IP, as well as the hostname and typo hostname entry, but it still fails to the manual partition menu.

I suspect something is cached elsewhere, so I'll just walk away from this for an hour or so and come back to it and see if it detects properly during installation.
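A typo like this can be caught before an install by checking that each forward record has a matching reverse record. A minimal sketch, assuming the records are available as plain dicts; the IPs are documentation addresses (192.0.2.0/24) and the misspelled name is invented for illustration, since the actual typo is not recorded in this task.

```python
def find_ptr_mismatches(forward, reverse):
    """Return (hostname, ip, ptr_name) for every host whose PTR is wrong.

    forward maps hostname -> IP, reverse maps IP -> hostname.
    ptr_name is None when there is no reverse entry at all.
    """
    mismatches = []
    for hostname, ip in sorted(forward.items()):
        ptr = reverse.get(ip)
        if ptr != hostname:
            mismatches.append((hostname, ip, ptr))
    return mismatches


# Made-up example data, not the real zone contents.
forward = {"dumpsdata1001": "192.0.2.11", "dumpsdata1002": "192.0.2.12"}
reverse = {"192.0.2.11": "dumpsdata1001", "192.0.2.12": "dumpsdata1020"}

for host, ip, ptr in find_ptr_mismatches(forward, reverse):
    print("%s: %s resolves back to %s" % (host, ip, ptr))
```

Against live DNS, the two dicts could be filled from `socket.gethostbyname` and `socket.gethostbyaddr`, though cached answers on the recursors would still hide a recent fix for a while.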

dumpsdata1001 has puppet running in a screen session, and once that's done I'll accept its salt key and it'll be ready for ariel to work on.

Yep, walked away and now it's working; the installer is running on dumpsdata1002.

RobH updated the task description.

Ready for service implementation, assigned to ariel for followup. (This task can either be used to track its implementation, or can be resolved.)

Removing ops-eqiad tag, since on-site work is no longer needed.

Closing this. There's more to be done as far as misc dump cron jobs running on these new hosts, but we're well past the basic setup and install.