
rack/setup prometheus200[3-4]
Closed, Resolved (Public)

Description

This task will track the racking and initial setup of prometheus200[3-4].
Please verify that the racking locations and hostnames are okay, and provide the partman recipe to use. Thanks.

prometheus2003

  • receive in normally on parent task T149339
  • rack location A5
  • create dns entries for internal production IP address, and mgmt entries for both asset tag and hostname
  • setup bios and drac
  • update task with port info
  • install_server module update
  • install OS
  • sign/accept puppet & salt keys
  • hand off to @Filippo for service implementation.

prometheus2004

  • receive in normally on parent task T149339
  • rack location B5
  • create dns entries for internal production IP address, and mgmt entries for both asset tag and hostname
  • setup bios and drac
  • update task with port info
  • install_server module update
  • install OS
  • sign/accept puppet & salt keys
  • hand off to @Filippo for service implementation.

Event Timeline

Papaul triaged this task as Medium priority. Nov 22 2016, 4:15 PM
Restricted Application added a subscriber: Southparkfan.

@Papaul thanks! Racking info looks good; the only requirement is that the two hosts be in different rows, which they already are.
Re: partman, I don't think there's a recipe ready. Unless there's a particular preference, I'd go with hardware RAID, namely RAID1 for the two SSDs and RAID10 for the spinning disks. Partman-wise we can then do sda/sdb.
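As a rough illustration of what the RAID1/RAID10 split means for usable space, here is a minimal sketch. The function name, the disk sizes, and the number of spinning disks are placeholders chosen for the example; none of them are stated in this task.

```python
def usable_capacity(level: str, disks: int, disk_size_tb: float) -> float:
    """Usable capacity (TB) of an array of equal-size disks.

    Only the two levels mentioned in the comment above are handled.
    """
    if level == "raid1":
        if disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        # Every disk holds a full copy, so capacity equals one disk.
        return disk_size_tb
    if level == "raid10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID10 needs an even number of disks, at least 4")
        # Disks are mirrored in pairs and the pairs are striped:
        # half of the raw capacity is usable.
        return (disks // 2) * disk_size_tb
    raise ValueError(f"unsupported level: {level}")


# Hypothetical inputs, for illustration only.
print(usable_capacity("raid1", 2, 1.6))
print(usable_capacity("raid10", 4, 2.0))
```

With hardware RAID the controller presents each array to the OS as a single block device, which is why the partman recipe only has to deal with sda and sdb.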

fgiunchedi renamed this task from rack/setup prometheus200[1-2] to rack/setup prometheus200[3-4]. Nov 22 2016, 7:27 PM
fgiunchedi updated the task description.

Change 322970 had a related patch set uploaded (by Filippo Giunchedi):
codfw: rename prometheus200[12] to prometheus200[34]

https://gerrit.wikimedia.org/r/322970

Change 322970 merged by Filippo Giunchedi:
codfw: rename prometheus200[12] to prometheus200[34]

https://gerrit.wikimedia.org/r/322970

There was a mistake in host naming at the beginning (prometheus200[12] already exist as VMs), so I've renamed prometheus2001 to prometheus2003 and prometheus2002 to prometheus2004.

Change 323056 had a related patch set uploaded (by Filippo Giunchedi):
install_server: add prometheus partman

https://gerrit.wikimedia.org/r/323056

@fgiunchedi for the partman, do you want us to use raid1-gpt.cfg?

Change 323056 merged by Filippo Giunchedi:
install_server: add prometheus partman

https://gerrit.wikimedia.org/r/323056

@Papaul yeah, prometheus.cfg is merged now, let's try that! It might not work on the first try though.

@fgiunchedi does this look good?

root@prometheus2003:~# fdisk -l /dev/sda

Disk /dev/sda: 1.5 TiB, 1599741100032 bytes, 3124494336 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x6995cbf9

Device     Boot      Start        End    Sectors  Size Id Type
/dev/sda1  *          2048     194559     192512   94M 83 Linux
/dev/sda2           194560 3046369279 3046174720  1.4T 83 Linux
/dev/sda3       3046369280 3124492287   78123008 37.3G 8e Linux LVM

root@prometheus2003:~# fdisk -l /dev/sdb

Disk /dev/sdb: 3.7 TiB, 3999688294400 bytes, 7811891200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 1730C887-184F-4B1A-A04D-B918257550DA

Device     Start        End    Sectors  Size Type
/dev/sdb1   2048 7811889151 7811887104  3.7T Linux LVM
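For anyone sanity-checking the fdisk output above, the numbers can be cross-checked with a short sketch. The sector counts and partition boundaries are copied from the output; the helper names are made up for this example. It also shows why sda can carry a dos disklabel while sdb needs gpt: a DOS/MBR table stores sector addresses in 32 bits, which caps it at 2 TiB with 512-byte sectors.

```python
SECTOR_BYTES = 512        # logical sector size reported by fdisk
MBR_MAX_SECTORS = 2**32   # 32-bit sector addressing in a DOS/MBR table


def disk_bytes(sectors: int) -> int:
    """Total size in bytes for a given number of logical sectors."""
    return sectors * SECTOR_BYTES


def fits_mbr(sectors: int) -> bool:
    """True if the whole disk is addressable from a DOS/MBR disklabel."""
    return sectors <= MBR_MAX_SECTORS


# (start, end, sectors) per partition, as printed by fdisk.
partitions = {
    "/dev/sda1": (2048, 194559, 192512),
    "/dev/sda2": (194560, 3046369279, 3046174720),
    "/dev/sda3": (3046369280, 3124492287, 78123008),
    "/dev/sdb1": (2048, 7811889151, 7811887104),
}

sda_sectors = 3124494336
sdb_sectors = 7811891200

print(disk_bytes(sda_sectors), fits_mbr(sda_sectors))
print(disk_bytes(sdb_sectors), fits_mbr(sdb_sectors))

for name, (start, end, sectors) in partitions.items():
    # The end sector is inclusive, hence the +1.
    assert end - start + 1 == sectors, name
```

The byte totals match the "Disk /dev/sdX" header lines, and the sda partitions are contiguous (each one starts on the sector after the previous one ends).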

Papaul updated the task description.

@fgiunchedi you can take over. Thanks

@Papaul I tried rebooting prometheus2003 today as a test, and since it wasn't coming back I checked the mgmt interface, which also doesn't seem to answer over SSH:

bast2001:~$ ssh -v prometheus2003.mgmt.codfw.wmnet
OpenSSH_6.7p1 Debian-5+deb8u2, OpenSSL 1.0.2j  26 Sep 2016
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to prometheus2003.mgmt.codfw.wmnet [10.193.1.156] port 22.
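The ssh -v output above stops at the TCP connect, which suggests the mgmt interface is not reachable at all rather than rejecting the login. A minimal sketch for checking that without waiting on ssh, assuming a plain TCP connect with a timeout is enough to tell the difference (the function name and the default timeout are arbitrary choices for this example):

```python
import socket


def port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False
```

From the bastion this could be called as port_open("prometheus2003.mgmt.codfw.wmnet"). A False result only says the port did not answer in time; it does not distinguish a dead iDRAC from a network or DNS problem.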

The iDRAC extension was loose from the main board. I opened the server and found that there were no screws attaching the iDRAC extension to the main board. Since it was loose, it might have been damaged after the server reboot.

unnamed.jpg (384×422 px, 16 KB)

I called Dell and they will send:
1. iDRAC extension
2. Main board
3. 2 screws
4. A tech

For now I changed the iDRAC from the dedicated port to LOM2 (NIC2) so that we can still access the server in case the tech can't make it tomorrow; Dell said they might not have all the parts in stock.

Good afternoon, Papaul,

I enjoyed working with you today regarding your Failed IDRAC card issue. I have included your service request and dispatch information below:

Service Tag: 94Q2ND2
Service Request: 940785835
Dispatch Number: 323150683

This email confirms your onsite service dispatch to replace your motherboard and iDRAC port. Please be sure to contact me immediately if you do not hear from the onsite technician within the allotted time, or you have any concerns involving the onsite technician.

I will continue to be your service request owner and primary point of contact until your support need with Dell is completely resolved. If you need anything at all please reach me directly by replying to this email.

Hello Papaul,

It looks like there was a back order on the iDRAC port so our dispatching team had to reissue our dispatch using an alternate part from a different warehouse. They sent the iDRAC port as a separate shipment coming directly to you. It looks like the service won’t be able to take place until Monday due to the delay on the part. I included a link to the tracking of the iDRAC port below. I apologize about the delay and will continue to monitor the dispatch and let you know if anything changes.

iDRAC card extension replacement complete. The system is back up and using the dedicated card.