rack/setup/install labstore100[67].wikimedia.org
Closed, Resolved (Public)

Description

This task will track the setup/installation of labstore100[67] in eqiad. These were ordered on procurement task T161345, and requested on hardware-requests task T161311.

Racking proposal: These will go in the public vlan, so they can be placed in any of the rows/racks with 1GBE connectivity. Please place them in two different rows from one another for horizontal redundancy.

labstore1006:

  • - receive in system on procurement task T161345
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, public vlan)
    • end on-site specific steps.
  • - production dns entries added (public vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

labstore1007:

  • - receive in system on procurement task T161345
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, public vlan)
    • end on-site specific steps.
  • - production dns entries added (public vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

Event Timeline

Restricted Application added a subscriber: Aklapper.
RobH updated the task description.

Racking proposal: existing labstore systems use the vlan-labs-support1, so these have to be racked in rows A and C, as those rows have that vlan supported/deployed. Please place these two systems in different racks/rows from one another, so one in a 1GBE rack in row A, and the other in a 1GBE rack in row C. This should be confirmed by @chasemp before racking.

Note from irc: these are closer in function to the old dataset boxes rather than existing labstores. They need to go in a public VLAN being externally accessible. :)

RobH updated the task description.

@Cmjohnson Do we have an estimate on when these will be racked? These servers being setup are part of our quarterly goal for Q1 - T168486, so would be awesome to have these up sooner than later! Thank you :)

Cmjohnson updated the task description.

@chasemp Which vlan are these going in? I racked them in rows A and D. I see the instructions say it's public, but I also see a comment that it's labs-support. Please confirm. Thanks

@Cmjohnson These two need to be in the public vlan.

Change 368445 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding dns entries (mgmt and production) for labstore1006/7 public vlan T167984

https://gerrit.wikimedia.org/r/368445

Change 368445 merged by Cmjohnson:
[operations/dns@master] Adding dns entries (mgmt and production) for labstore1006/7 public vlan T167984

https://gerrit.wikimedia.org/r/368445

@chasemp Do you know the raid cfg you want? The server has (12) 3.5" 6TB disks and (2) 2.5" disks; the disk shelf has (12) 3.5" 6TB disks. I would think the 2 smaller disks are raid 1 and then raid 10 for each of the other 2 sets. Please confirm.

Yup, T161311 seems to agree.

2 dedicated flexbay 1tb os disks

raid 1 for OS

12 * 6TB disks (~36TB in raid10)

Should result in 72TB usable non-OS partition storage, and 1 TB OS partition.
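The capacity figures above can be checked with a little arithmetic. This is a minimal sketch, assuming the 72TB total comes from two separate raid10 arrays of 12 x 6TB disks each (one internal, one in the disk shelf); the function name `usable_tb` is made up for illustration.

```python
def usable_tb(disk_count: int, disk_tb: float, level: str) -> float:
    """Return usable capacity in TB for a simple RAID layout."""
    if level == "raid1":
        # RAID 1 mirrors the disks, so usable capacity equals one disk.
        return disk_tb
    if level == "raid10":
        # RAID 10 stripes across mirrored pairs, so half the raw capacity is usable.
        if disk_count % 2:
            raise ValueError("raid10 needs an even number of disks")
        return disk_count * disk_tb / 2
    raise ValueError(f"unsupported level: {level}")


# Assumed layout: 2 x 1TB flexbay disks, 12 x 6TB internal, 12 x 6TB in the shelf.
os_tb = usable_tb(2, 1, "raid1")
internal_tb = usable_tb(12, 6, "raid10")
shelf_tb = usable_tb(12, 6, "raid10")

print("OS:", os_tb, "TB")
print("data:", internal_tb + shelf_tb, "TB")
```

Under that assumption each raid10 array yields about 36TB, and the two together give the 72TB of non-OS storage quoted above.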

@RobH @chasemp The servers are racked and all preliminary work is done. I connected the disk shelf to the server but it's not being seen by the controller. I verified all cables are connected (no link lights) and the disk shelf is powered on. I attached an image of the message I get in the raid UI. Rob, maybe try updating the f/w....I don't know, this is our first attempt at using HP's for disk shelves. I doubt it's cables because both labstores have this issue.

They are connected like this

On the P441, port 1E is connected to DP1 (I/O module A) of the array and port 2E goes to DP1 (I/O module B).

ping @madhuvishy hopefully will have some time to read up on the manuals :)

asw2-d-eqiad:ge-6/0/3 (Description: labstore1007, MAC: 30:e1:71:5f:9d:94 ) has been flapping at a rate of ~40 up/downs per hour.
As the server is not in production yet, I'm disabling the interface. Feel free to re-enable it or ping me when necessary.

@Cmjohnson I tried getting into the management interface for 1007, and powercycled it, booted from network and was looking at console:

It is stuck in a loop saying this:

Attempting Boot From NIC

QLogic UNDI PXE-2.1 v7.14.5
Copyright (C) 2016 QLogic Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.
PXE-E61: Media test failure, check cable
PXE-M0F: Exiting QLogic PXE ROM.

@Cmjohnson I also can't even seem to get into the management interface for labstore1006

☁  ~  ssh root@labstore1006.mgmt.eqiad.wmnet
channel 0: open failed: connect failed: Connection timed out
stdio forwarding failed
ssh_exchange_identification: Connection closed by remote host

@RobH can you do the installs....let's get them accessible and then I will deal with the disk shelf issue. Thanks

The interface flapping issue was because of a mis-connected cable, which @Cmjohnson's fixed now. Both management interfaces are now accessible.

Pasting here for context what the raid management interface claims when trying to obtain info about the disk shelves connected to the external controller (P441)

=> controller all show

Smart Array P441 in Slot 1                (sn: PDNMH0ARH690AP)
Smart Array P840 in Slot 3                (sn: PDNNF0ARH730OY)

=> controller serialnumber=PDNMH0ARH690AP show

Smart Array P441 in Slot 1
   Bus Interface: PCI
   Slot: 1
   Serial Number: PDNMH0ARH690AP
   Cache Serial Number: PEYFP0BRH693TW
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Hardware Revision: B
   Firmware Version: 4.52-0
   Wait for Cache Room: Disabled
   Surface Analysis Inconsistency Notification: Disabled
   Post Prompt Timeout: 0 secs
   Cache Board Present: True
   Cache Status: Not Configured
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.8 GB
   No-Battery Write Cache: Disabled
   SSD Caching RAID5 WriteBack Enabled: True
   SSD Caching Version: 2
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
   SATA NCQ Supported: True
   Spare Activation Mode: Activate on physical drive failure (default)
   Controller Temperature (C): 85
   Cache Module Temperature (C): 50
   Number of Ports: 2 External only
   Encryption: Disabled
   Express Local Encryption: False
   Driver Name: hpsa
   Driver Version: 3.4.16
   Driver Supports SSD Smart Path: True
   PCI Address (Domain:Bus:Device.Function): 0000:05:00.0
   Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
   Controller Mode: RAID
   Pending Controller Mode: RAID
   Controller Mode Reboot: Not Required
   Latency Scheduler Setting: Disabled
   Current Power Mode: MaxPerformance
   Survival Mode: Enabled
   Host Serial Number: MXQ72106XS
   Sanitize Erase Supported: True
   Primary Boot Volume: None
   Secondary Boot Volume: None

=> controller serialnumber=PDNMH0ARH690AP array all show

Error: Controller identified by "serialnumber=pdnmh0arh690ap" does not contain
       any arrays.

=> controller serialnumber=PDNMH0ARH690AP enclosure all show

Error: The specified device does not have any storage enclosures.
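For comparing the two hosts it can help to turn a paste like the one above into something searchable. The following is a rough sketch, not part of any HP tooling: it only assumes the output keeps the indented "Key: Value" shape shown here, and the function name `parse_controller_show` is hypothetical.

```python
def parse_controller_show(text: str) -> dict[str, str]:
    """Parse indented 'Key: Value' lines into a dict; other lines are skipped."""
    fields = {}
    for line in text.splitlines():
        # Detail lines are indented; the header ("Smart Array ... in Slot 1") is not.
        if not line.startswith(" "):
            continue
        # Split on the first ": " so keys that contain bare colons stay intact.
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value
    return fields


# A few lines copied from the paste above.
SAMPLE = """\
Smart Array P441 in Slot 1
   Bus Interface: PCI
   Slot: 1
   Controller Status: OK
   Firmware Version: 4.52-0
   Number of Ports: 2 External only
   PCI Address (Domain:Bus:Device.Function): 0000:05:00.0
"""

info = parse_controller_show(SAMPLE)
print(info["Firmware Version"], info["Controller Status"])
```

Running the same parser over the output from labstore1006 and labstore1007 and comparing the two dicts would show quickly whether firmware, slot or port settings differ between the hosts.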

Current status: We are not really sure why the disk shelves don't show up. As the next step, @Cmjohnson will try and call HP support and have them help troubleshoot, hopefully on Friday.

Update: We are still blocked on talking to HP Support about the disk shelves.

On labstore1006 when loading the HP raid utility via bios, it gives the following error:
error: no such device: EMBEDDED250.

It does this when either controller slot is picked from the bios selection screen, and then loading the hp raid utility is selected.

I disabled the embedded controller on labstore1006, doesn't eliminate this error.

It does NOT give this error on labstore1007.

I started a ticket with HP.
Your case was successfully submitted. Please note your Case ID: 5322481808 for future reference.

I also checked other systems with HP raid controllers. So these showing the raid1 as sdd is a problem; it should show as sda. Other HP systems (like the db systems) show the OS array on the hw raid controller as /dev/sda.

Perhaps they can be set to bootable in the GUI for raid setup?

HP requested the AHS log, uploaded the log to their system. Waiting on their response. Only working with 1006 at the moment.

Spoke with HP support, not very helpful. They will not send anyone to help unless we want to pay for it. Going to try and talk to HP storage ppl tomorrow. But so far nobody has an answer.

@madhuvishy I reseated the controller card, removed all cabling, and started over again. This time around, the controller card did not show any errors and I was able to configure the raid on both the internal disks and the enclosure. I set the two 1TB disks as raid 1 and as the bootable volume. @RobH could you please install both 1006 and 1007?

Change 375079 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] labstore100[67] install params

https://gerrit.wikimedia.org/r/375079

Change 375079 merged by RobH:
[operations/puppet@production] labstore100[67] install params

https://gerrit.wikimedia.org/r/375079

Change 375090 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] labstore100[67] partman recipe tweak

https://gerrit.wikimedia.org/r/375090

Change 375090 merged by RobH:
[operations/puppet@production] labstore100[67] partman recipe tweak

https://gerrit.wikimedia.org/r/375090

Ok, so these detect with the internal raid1 as sdb, the internal raid10 array as sdc, and the external disk array as sda.
d-i grub-installer/bootdev string /dev/sda had to be added, and is now live.

However, labstore1006 is simply rebooting, and refusing to see the disks as bootable, and then failing to PXE. Something isn't right, still trying to figure out what is causing it.

I can see the issue, both labstore1006 and labstore1007 have it.

When checking the boot order in bios, it lists the following:

BIOS/Platform Configuration (RBSU)
Boot Options > Legacy BIOS Boot Order
Press the '+' key to move an entry higher in the boot list and the '-' key to move an entry lower in the boot list. Use the arrow keys to navigate through the Boot Order list.

Standard Boot Order(IPL)
  CD ROM/DVD
  USB DriveKey
  Hard Drive C:  (see Boot Controller Order)
  Embedded FlexibleLOM 1 Port 1 : HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter - NIC
  Embedded LOM 1 Port 1 : HP Ethernet 1Gb 4-port 331i Adapter - NIC

Boot Controller Order
> Slot 1 : Smart Array P441 Controller - 33533.99 GiB, RAID 1+0 Logical Drive(Target:0, Lun:0)

Under the boot controller order, it ONLY lists the Smart Array P441, which is the external disk shelf. It does not list the Slot 3 : Smart Array P840 Controller. So there is an issue where it doesn't list both controllers for boot order.

This is odd, I'll spend some time googling and see if I cannot figure it out. If I don't have a solution later today, I'll open a ticket with HP support.

One solution described pulling the 840, booting with just the 441, and then after successful post, adding back in the 840. It fixed the issue for this person. However, this will require onsite work, and @Cmjohnson is out until next Tuesday, so I'll keep hunting for another solution. https://h30434.www3.hp.com/t5/Notebook-Hardware-and-Upgrade-Questions/ProLiant-DL380-g9-amp-Smart-Array-boot-controller-order/td-p/5339762

Additionally, the firmware on these controllers reads as 4.52, while the current release is 5.52. The issue may have been resolved in the newer release, but we cannot tell without flashing the firmware. Unfortunately, the firmware for the raid controller CANNOT be flashed from the ilom interface, but must be loaded via the OS (which we cannot boot to) or the HP SPP ISO image.

I have upgraded the system bios and ilom to the newest firmware revisions.

All of these solutions so far require onsite.

@Cmjohnson If you are back and onsite today, could you please take a look?

I ended up moving the cards to different pci slots and that fixed the issue. @RobH passing this to you (again)

This comment was removed by RobH.

Change 376152 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] updating labstore100[67] partman recipe

https://gerrit.wikimedia.org/r/376152

Change 376152 merged by RobH:
[operations/puppet@production] updating labstore100[67] partman recipe

https://gerrit.wikimedia.org/r/376152

RobH updated the task description.
RobH removed projects: Patch-For-Review, ops-eqiad.

Ok, after the cards were swapped, the disks now detect in the same order as other hosts, i.e. the flex bays set up as raid1 show as sda in the installer, the internal raid10 as sdb, and the external array as sdc. So these use the same partitioning auto recipe as dumpsdata. I removed the new and now invalid recipe from the git repo in the same patchset (linked above).
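To make the before/after device ordering easier to compare, here is a small illustrative sketch. The two mappings are taken from the comments in this task; the role labels and helper names are made up, and the check simply encodes the expectation stated earlier that the OS raid1 should show up as sda.

```python
# Device order as reported in this task; the role labels are illustrative.
BEFORE_SLOT_SWAP = {"sda": "external array", "sdb": "internal raid1", "sdc": "internal raid10"}
AFTER_SLOT_SWAP = {"sda": "internal raid1", "sdb": "internal raid10", "sdc": "external array"}


def os_array_is_first(layout: dict[str, str]) -> bool:
    """True when the raid1 OS array is the first disk the installer sees."""
    return layout.get("sda") == "internal raid1"


def device_for(layout: dict[str, str], role: str) -> str:
    """Return the /dev path that currently carries the given role."""
    for dev, dev_role in layout.items():
        if dev_role == role:
            return "/dev/" + dev
    raise KeyError(role)


for name, layout in (("before", BEFORE_SLOT_SWAP), ("after", AFTER_SLOT_SWAP)):
    print(name, os_array_is_first(layout), device_for(layout, "external array"))
```

Only the layout after the slot swap passes the check, which matches the observation that the shared recipe could be used once the cards were moved.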

End result, both systems are installed and ready for use. The external disk array presents as sdc, but is not currently formatted or mounted.

Assigning to @madhuvishy for followup (since she was the last cloud team person I chatted with about this.) You can use this task for further tracking, or resolve as you see fit!

@RobH @Cmjohnson I'm able to log in to both machines with their .wikimedia.org hostnames and run puppet fine.

However, when I hop into the serial console, they both spew a series of ACPI errors, like:

labstore1006 login: root
Password: [64842.802805] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[64842.859706] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a55bf04f000), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[64842.922221] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[64860.501079] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[64860.558410] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a55bf04f000), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[64860.621108] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)

Login timed out after 60 secon
Debian GNU/Linux 8 labstore1006 ttyS1

labstore1006 login: [64902.799581] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[64902.855581] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a55bf04f000), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[64902.916367] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[64920.500182] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[64920.557124] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a55bf04f000), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[64920.616598] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
labstore1007 login: root
Password:
Last login: Tue Sep  5 23:43:54 UTC 2017 from puppetmaster1001.eqiad.wmnet on pts/0
Linux labstore1007 4.9.0-0.bpo.3-amd64 #1 SMP Debian 4.9.30-2+deb9u2~bpo8+1 (2017-06-27) x86_64
Linux labstore1007 4.9.0-0.bpo.3-amd64 #1 SMP Debian 4.9.30-2+deb9u2~bpo8+1 (2017-06-27) x86_64
Debian GNU/Linux 8.9 (jessie)
labstore1007 is a Unused spare system (spare::system)
The last Puppet run was at Wed Sep  6 17:28:24 UTC 2017 (1 minutes ago).
Debian GNU/Linux 8 auto-installed on Tue Sep 5 23:37:31 UTC 2017.
root@labstore1007:~# [64185.160180] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[64185.217915] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a523f04f2f8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[64185.279822] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[64198.608388] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[64198.664816] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a523f04f2f8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[64198.726519] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[64245.158867] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[64245.217335] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a523f04f2f8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[64245.279083] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[64258.608589] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[64258.664826] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a523f04f2f8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[64258.724845] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)

I'm not sure what this is exactly.

Probably could use a bios update.

IIRC the bios is already the newest version. I flashed the bios and the ilom when I installed them.

Those messages are due to acpi power meter, which we blacklist as of https://gerrit.wikimedia.org/r/#/c/356422/. A reboot should make the message go away.
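To confirm after the reboot that the noise is really gone, one could scan a captured console log for the power meter exceptions. This is only a sketch: the regular expression is written against the lines pasted above, and the function name `power_meter_events` is hypothetical.

```python
import re

# Matches kernel lines such as "[64842.922221] ACPI Exception: ...".
LINE_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\] ACPI (?P<kind>Error|Exception): (?P<msg>.*)")


def power_meter_events(console_text: str) -> list[float]:
    """Return kernel timestamps of ACPI exceptions that mention power_meter."""
    stamps = []
    for match in LINE_RE.finditer(console_text):
        if match.group("kind") == "Exception" and "power_meter" in match.group("msg"):
            stamps.append(float(match.group("ts")))
    return stamps


# Lines copied from the labstore1006 console paste above.
SAMPLE = """\
Password: [64842.802805] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[64842.922221] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[64860.621108] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
"""

events = power_meter_events(SAMPLE)
print(len(events), "power meter exceptions")
```

An empty list for a console capture taken after the reboot would suggest the blacklist has taken effect.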

@fgiunchedi Thank you! That seems to have fixed it. Resolving this task. Thanks everyone :)

madhuvishy mentioned this in Unknown Object (Task). Mar 21 2018, 5:14 PM