(Need By: August 31) rack/setup/install (3) new zookeeper nodes
Closed, Resolved · Public · 0 Story Points

Description

This task will track the racking and setup of 3 new zookeeper nodes.

Hostname: an-conf100[123]

Racking Proposal: each host in a different row if possible

an-conf1001:

  • - receive in system on procurement task T220687
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, analytics vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

an-conf1002:

  • - receive in system on procurement task T220687
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, analytics vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

an-conf1003:

  • - receive in system on procurement task T220687
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, analytics vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

Event Timeline

RobH triaged this task as Medium priority. Jul 1 2019, 7:22 PM
RobH created this task.
Restricted Application added a project: Operations. Jul 1 2019, 7:22 PM
RobH added a parent task: Unknown Object (Task). Jul 1 2019, 7:23 PM
RobH mentioned this in Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH added a subscriber: Cmjohnson.

@elukey:

Please provide both hostnames and racking information for these nodes.

Also include: ip/subnet info (private or public subnet), OS distro, and partitioning scheme, then assign this to @Cmjohnson for followup.

elukey added a subscriber: Ottomata. Jul 2 2019, 4:00 PM

Thanks a lot for the task!

So the hostnames should follow the new an- convention, I'd say an-zoo100[1-3]? Or possibly an-conf100[1-3]? (maybe more consistent with the actual conf[12]* nomenclature..)

For the racking part, the hosts do need to be in separate rows if possible, but they can share the rack with any other current system.

@Ottomata thoughts?

I like an-conf. Also gives us the option to colocate something else on them if we need to one day.

Thanks!

elukey reassigned this task from elukey to Cmjohnson. Jul 3 2019, 5:45 AM
elukey added a subscriber: elukey.

All right so:

  • hostnames an-conf100[123]
  • ip subnet info: analytics VLAN
  • OS: stretch
  • partitioning scheme: probably a variant of conf-lvm; I will come up with a scheme once the hosts are racked and ready to be installed
  • racking: each host in a different row if possible
wiki_willy renamed this task from rack/setup/install (3) new zookeeper nodes to (Need By: August 31) rack/setup/install (3) new zookeeper nodes. Jul 3 2019, 7:12 AM
elukey mentioned this in Unknown Object (Task). Jul 22 2019, 10:19 AM
RobH updated the task description. Jul 24 2019, 7:38 PM
Cmjohnson updated the task description. Jul 24 2019, 8:00 PM

Change 530163 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns entries for an-conf100[1-3]

https://gerrit.wikimedia.org/r/530163

Change 530163 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns entries for an-conf100[1-3]

https://gerrit.wikimedia.org/r/530163

Cmjohnson updated the task description. Aug 14 2019, 2:55 PM

+an-conf1001 1H IN A 10.65.5.118
+an-conf1002 1H IN A 10.65.5.119
+an-conf1003 1H IN A 10.65.5.120

@Cmjohnson I can help with OS install/partman/etc.. if you want, so I'll free you from the last annoying steps :)

Cmjohnson updated the task description. Aug 20 2019, 3:13 PM

@elukey the site specific portion is complete if you want to take over from here

elukey claimed this task. Aug 20 2019, 3:16 PM
elukey added a project: User-Elukey.

Change 531277 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add base configuration for an-conf100[1-3]

https://gerrit.wikimedia.org/r/531277

Change 531277 merged by Elukey:
[operations/puppet@production] Add base configuration for an-conf100[1-3]

https://gerrit.wikimedia.org/r/531277

Change 531435 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add AAAA/A/PTR records for an-conf100[1-3]

https://gerrit.wikimedia.org/r/531435

Change 531435 merged by Elukey:
[operations/dns@master] Add AAAA/A/PTR records for an-conf100[1-3]

https://gerrit.wikimedia.org/r/531435

Change 531466 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add new partman recipe for an-conf100[1-3]

https://gerrit.wikimedia.org/r/531466

Change 531466 merged by Elukey:
[operations/puppet@production] Add new partman recipe for an-conf100[1-3]

https://gerrit.wikimedia.org/r/531466

Change 531469 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Use standard partman recipe for an-conf100[1-3]

https://gerrit.wikimedia.org/r/531469

Change 531469 merged by Elukey:
[operations/puppet@production] Use standard partman recipe for an-conf100[1-3]

https://gerrit.wikimedia.org/r/531469

@Cmjohnson I was able to install the OS on an-conf1001 via manual PXE install, but I had to change the following serial console setting in the BIOS: Serial Port Address set to Serial Device1=COM2,Serial Device2=COM1 (it was Serial Device1=COM1,Serial Device2=COM2). I didn't find anything in the platform-specific documentation explaining why this setting would be correct; can you give me some info? I'll wait before proceeding with the other two hosts. Without that setting I couldn't see anything after hitting F12 to start the PXE boot.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-conf1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201908221201_elukey_27190.log.

Completed auto-reimage of hosts:

['an-conf1001.eqiad.wmnet']

Of which those FAILED:

['an-conf1001.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-conf1002.eqiad.wmnet', 'an-conf1003.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201908221226_elukey_32888.log.

Change 531701 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Create the new Analytics Zookeeper cluster

https://gerrit.wikimedia.org/r/531701

Change 531701 merged by Elukey:
[operations/puppet@production] Create the new Analytics Zookeeper cluster

https://gerrit.wikimedia.org/r/531701

elukey updated the task description. Aug 22 2019, 2:57 PM

Did it for all the nodes and installed OS + service (zookeeper). @Cmjohnson is the above setting correct?

wiki_willy reassigned this task from elukey to Cmjohnson. Aug 30 2019, 6:17 PM
wiki_willy added a subscriber: wiki_willy.

Assigning over to @Cmjohnson for @elukey's question.

Making a summary after a bit of time of the current status:

  • While setting up the hosts, I noticed that console redirection didn't work, and applied Serial Port Address set to Serial Device1=COM2,Serial Device2=COM1 to all three hosts (it was Serial Device1=COM1,Serial Device2=COM2)
  • I had a chat with Chris on IRC some weeks ago, and IIRC he tried to revert my change on an-coord1001 and confirmed that it was working on his side.
  • Today I tested again an-conf1001, with the following:
    • ssh to cumin1001
    • ssh to root@an-conf1001.mgmt.eqiad.wmnet
    • racadm serveraction powercycle
    • console com2
  • I then followed the output on the screen and the issue that I had seems to be happening again: I don't see anything related to the host's boot after the first BIOS/System messages (including the Debian login prompt).

@Cmjohnson am I missing something? How did you test at the time (if you remember) console redirection? Sorry for the trouble, I just want to make sure that this works before putting everything in production :)

Today I tested Redirection after boot set to enabled in an-coord1001's BIOS but it didn't resolve the problem: the mgmt console is still not available. Given the workaround that I used to make it work, is it possible that the serial cabling for the an-conf nodes is inverted between com1/com2? Just trying to think about what could be the problem.

@Jclark-ctr - since Chris had to use a sick day, can one of you guys take a look at this for Luca? Thanks, Willy

@elukey I will be on site tomorrow morning at 7:30 ET. One question: is the host concerned an-coord1001 or an-conf1001? I sent a message on IRC to follow up.

I followed up with @Jclark-ctr and it seems that there is only one serial port available for the server (so we can't really change any cabling). @Cmjohnson the serial console on all hosts is still broken, let's sync up when you have time / are back :)

elukey added a comment. Oct 8 2019, 4:04 PM

ping again on this :)

I don't know what you need me to do...the servers were setup correctly.

elukey added a comment. Oct 9 2019, 6:48 AM

There seems to be an issue with the serial mgmt console, and I am not sure if it is something that I am not getting right or if for some reason these new hosts require special settings. As is clear from the conversation in this task, I have helped and tested a lot to avoid requiring one of you guys to spend a lot of time on this (I know that you have a big backlog), but now I need some help in figuring out how to fix this. The solution that I found is non-standard and not really supported by DCOps, and I agree that we should always run consistent configurations, but I am stuck in limbo now since I cannot use these hosts, which are otherwise ready, due to this issue.

To set expectations: is it the service owner's responsibility to get this configuration to work? If so I'll assign the task to myself and keep working on it, will stop bothering you and Rob.

I'll dig around a bit and check with Dell to see if we can figure out why Com1 and Com2 have to be flipped to get it working. I talked to Luca, and worst case, if we can't find any answers as to why it's happening, we'll just leave them as is. Thanks, Willy

Papaul added a subscriber: Papaul. Edited Oct 11 2019, 4:48 PM

@elukey after working 4 hours on this, the problem ended up not being the serial configuration in the BIOS but the GRUB settings. On these systems we have

linux /boot/vmlinuz-4.19.0-6-amd64 root=UUID=8773beee-195f-42b5-82ec-d17d8f2af41d ro console=ttyS0,115200n8 elevator=deadline

and on the other systems, also running Buster, we have

linux /vmlinuz-4.19.0-6-amd64 root=/dev/mapper/local-root ro console=ttyS1,115200n8 elevator=deadline

Changing the console to console=ttyS1 works with the default serial settings we have on all the other servers. When you reboot the server it goes back to using console=ttyS0 and the server gets stuck again.

I made the change again on an-conf1001, ran systemctl enable getty@ttyS1 and rebooted the system; now it is working, so you can do the same for the other systems or find a way to fix GRUB. Let me know if you have any questions.
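
The console change described above can be sketched as a small shell helper. This is only a sketch under assumptions, not the commands that were actually run: it assumes the console= parameter lives in a GRUB defaults file such as /etc/default/grub (the usual Debian location), the function name switch_console is hypothetical, and the sample line written to the scratch file is illustrative rather than copied from an-conf1001.

```shell
#!/bin/sh
# Hypothetical helper: switch console=ttyS0 to console=ttyS1 in a GRUB defaults file.
# Usage: switch_console /path/to/grub-defaults
switch_console() {
    file="$1"
    # Keep a backup next to the original before rewriting it.
    cp "$file" "$file.bak"
    # Read from the backup and write the edited content back to the original path.
    sed 's/console=ttyS0,/console=ttyS1,/g' "$file.bak" > "$file"
}

# Try it on a scratch file rather than the live /etc/default/grub.
# The line below is an illustrative sample, not the real file content.
tmp="$(mktemp)"
printf '%s\n' 'GRUB_CMDLINE_LINUX="console=ttyS0,115200n8 elevator=deadline"' > "$tmp"
switch_console "$tmp"
cat "$tmp"
```

On a real Debian host the target would be /etc/default/grub, and the boot configuration would then need to be regenerated (for example with update-grub) for the change to survive a reboot.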

Great job @Papaul in troubleshooting this and tracking it down to the root cause. Thanks! ~Willy

Papaul added a comment. Edited Oct 11 2019, 5:49 PM

@elukey so the issue is that you used

puppet/modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200
and not

puppet/modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200
for the dhcp MAC entries
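
A quick way to see which of the two files a host ended up in could look like the sketch below. It is an assumption-laden illustration: the helper name find_host_entry is hypothetical, the directory is passed in explicitly, and the demo files contain made-up placeholder content rather than real dhcpd entries. Only the two file names come from this task.

```shell
#!/bin/sh
# Hypothetical helper: print which linux-host-entries.* file mentions a host.
# Usage: find_host_entry DIR HOSTNAME
find_host_entry() {
    dir="$1"
    host="$2"
    # -l: print only the names of matching files
    # -w: match whole words, so an-conf1001 does not match an-conf10011
    # -F: treat the hostname as a fixed string, not a regular expression
    grep -lwF -- "$host" "$dir"/linux-host-entries.* 2>/dev/null
}

# Demo on a scratch directory with placeholder content (not real entries).
demo="$(mktemp -d)"
printf 'host an-conf1001 { }\n' > "$demo/linux-host-entries.ttyS0-115200"
printf 'host example1001 { }\n' > "$demo/linux-host-entries.ttyS1-115200"
find_host_entry "$demo" an-conf1001
```

In a checkout of the puppet repository the directory argument would be modules/install_server/files/dhcpd, per the paths quoted above.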

elukey added a comment. Edited Oct 11 2019, 5:59 PM

Thanks a lot for this work! I was not aware of that mistake, and I also have to admit my ignorance about that part of the dhcp configuration. I think it would be worth adding this to the wikitech documentation so that this long investigation leads to quicker fixes in the future. I am not sure where best to place it, or if anything like that already exists, but something in pages like https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN30 could be helpful (in a FAQ, or a list of things to check if the serial console is not showing anything, etc.). I can help in expanding the documentation if everybody likes this idea. @wiki_willy / @Papaul let me know!

The info is in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Preparation_2 so I have clearly missed it, but a reference in the FAQ of the platform docs could help :)

@elukey no need to feel bad about a mistake, we all make mistakes. Just glad that it is fixed.

Thanks! What I'd like to do is to avoid somebody else spending hours debugging this in the future. It is fine to make mistakes, but then it is better to let everybody know so the chance of repetition is minimal. This is why I am proposing the change in the docs :)

I have no problem with you expanding the documentation : )

Change 542784 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move an-conf* dchp configuration in the right linux-host-entries file

https://gerrit.wikimedia.org/r/542784

Change 542784 merged by Elukey:
[operations/puppet@production] Move an-conf* dchp configuration in the right linux-host-entries file

https://gerrit.wikimedia.org/r/542784

elukey closed this task as Resolved. Oct 14 2019, 6:28 AM

Fixed puppet, then manually edited /etc/default/grub on all hosts and finally ran sudo update-grub. I restored all the serial settings via the BIOS console, and I can confirm that everything works as expected now. Thanks!
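
For anyone repeating this check later, a minimal sketch of how the console setting could be read back from a kernel command line follows. The helper name list_consoles is hypothetical; the sample string is the ttyS1 command line quoted earlier in this task, and on a live host the input would be the content of /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: print each console= value found in a kernel command line.
# Usage: list_consoles "KERNEL COMMAND LINE"
list_consoles() {
    # Split the command line into one token per line, then keep only the
    # console=... tokens and strip the "console=" prefix.
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^console=//p'
}

# Sample input taken from the command line quoted in this task;
# on a running host one would pass "$(cat /proc/cmdline)" instead.
list_consoles 'root=/dev/mapper/local-root ro console=ttyS1,115200n8 elevator=deadline'
```

If the output shows ttyS0 where the other hosts show ttyS1, the host is likely still booting with the old setting.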