setup bast4002/WMF7218
Closed, Resolved (Public)

Description

This task will track the setup of bast4002/WMF7218. This system is 1 of 3 misc systems ordered on T160936 to replace ALL non-cp systems in ulsfo. @RobH will handle setup.

  • refactor the role/profile(s) of the bastion server, as it now applies multiple roles and fails our style check
  • apply the refactored bastion role to bast4002 (this will also have to ensure it works on other bastions)
  • replace bast4001

virtualization note

Some of these systems may end up running ganeti. Because it is not yet known which ones will, virtualization has been disabled on the CPU for ALL of the new ulsfo misc hosts ordered on T160936. It can easily be re-enabled via the following instructions:

  • system will need to reboot, please ensure that this won't interrupt any vital services
  • ssh root@bast4002.mgmt.ulsfo.wmnet
  • run the following on the mgmt console:
racadm set iDRAC.ServerBoot.FirstBootDevice BIOS
console com2
  • in another ssh session, log in to bast4002.wikimedia.org and reboot it. The racadm command above sets it to enter BIOS on the next boot.
  • swap back to your mgmt console session and watch it launch system bios.
  • in bios, go to System Bios > Processor Settings > Virtualization Technology and enable it. Then hit ESC to back out, confirming Yes when it asks to save and exit.
  • boot the system back up and it will have virtualization enabled on the processors.
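
After the reboot, a rough sanity check is to look for the hardware virtualization CPU flags (`vmx` for Intel VT-x, `svm` for AMD-V) in `/proc/cpuinfo`. The sketch below is illustrative only; the function name is made up for this example. It is a hint rather than proof: depending on the kernel version, the flag may be listed even while the firmware setting is still disabled.

```python
def virtualization_flags(cpuinfo_text):
    """Return the hardware virtualization flags found in /proc/cpuinfo text.

    Hypothetical helper for illustration: looks for 'vmx' (Intel VT-x)
    and 'svm' (AMD-V) on the 'flags' lines.
    """
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            found |= {"vmx", "svm"} & set(value.split())
    return found
```

On the host this could be called as `virtualization_flags(open("/proc/cpuinfo").read())`; an empty set suggests virtualization is unavailable to the OS.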

Event Timeline

Change 386643 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] set bast4002.wikimedia.org production dns entries

https://gerrit.wikimedia.org/r/386643

Change 386643 merged by RobH:
[operations/dns@master] set bast4002.wikimedia.org dns entries

https://gerrit.wikimedia.org/r/386643

Change 386652 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting bast4001 params

https://gerrit.wikimedia.org/r/386652

Change 386652 merged by RobH:
[operations/puppet@production] setting bast4001 params

https://gerrit.wikimedia.org/r/386652

Change 386672 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] splitting bast4002 to its own entry

https://gerrit.wikimedia.org/r/386672

Change 386672 merged by RobH:
[operations/puppet@production] splitting bast4002 to its own entry

https://gerrit.wikimedia.org/r/386672

This is currently installed with jessie, but if we set up a new box, let's use stretch from the start?

+1 We may as well move to stretch here. For the bastion/installserver role it should be pretty simple?

I wouldn't expect any problems. It also has role::prometheus::ops, but the prometheus version we're using on prometheus* is a backport of the prometheus version in stretch, so that should be fine as well.

Indeed, I don't expect any problems. We don't have any stretch + prometheus machine yet, but it should be fairly easy to test in labs if need be. While we're at it, we should also copy the prometheus data (rsync is fine) from bast4001 to bast4002; I can assist with that.

So this now reads 'bast4001' in the subject, but it is bast4002. Just making sure no one changed that intentionally? (I set up the task as bast4002, so checking.)

Ok, I'll reimage, I'm also doing the conversion to bastion profile. (Unless Brandon tells me otherwise, I'm also making all references on this new server be bast4002, since bast4001 will remain online until this system is fully deployed.)

RobH renamed this task from setup bast4001/WMF7218 to setup bast4002/WMF7218. Nov 2 2017, 5:07 PM

More details on the Prometheus part:

  • Allow bast4002 as an additional Prometheus host (https://gerrit.wikimedia.org/r/#/c/393943/)
  • rsync Prometheus data from bast4001 to bast4002
    • The initial rsync can run while Prometheus is still running on bast4001. A second rsync, run with Prometheus stopped on bast4001, makes sure all data is flushed and consistent
  • Start and verify Prometheus works on bast4002
  • Flip CNAME for prometheus.svc to bast4002
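
The two-pass copy above can be sketched as an ordered plan of commands. This is a minimal illustration, not the procedure actually used: the data directory, the systemd unit name, the rsync flags, and the choice to pull from the destination host are all assumptions made for the example, and the function only builds the plan without running anything.

```python
def migration_plan(src, dst, data_dir, unit):
    """Build the ordered (host, command) steps for a two-pass rsync migration.

    All arguments are placeholders supplied by the operator; nothing is
    executed here.
    """
    path = data_dir.rstrip("/") + "/"
    # Pull the data from the old host onto the new one (assumed direction).
    sync = ["rsync", "-a", "--delete", f"{src}:{path}", path]
    return [
        (dst, sync),                         # pass 1: source still running
        (src, ["systemctl", "stop", unit]),  # stop so on-disk data is flushed
        (dst, sync),                         # pass 2: pick up the final changes
        (dst, ["systemctl", "start", unit]), # start on the new host and verify
    ]
```

For example, `migration_plan("bast4001", "bast4002", "/srv/prometheus/ops", "prometheus@ops")` returns four steps, where the path and unit name are hypothetical values. The CNAME flip stays a separate, manual step after verification.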

I was just working on the bastion-related Wikitech pages due to bast1001 being replaced, and I noticed we have 2 bastions in ulsfo, bast4001 and bast4002. Is this stalled?

It was stalled on my lack of time to deal with the prometheus switchover and then with switching people's SSH configs; otherwise it's ready for switchover.

FWIW I'm happy to assist with the Prometheus part, e.g. next week

RobH added a parent task: Unknown Object (Task). Aug 21 2018, 6:08 PM

Change 464369 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] changing prometheus.svc.ulsfo.wmnet entry to bast4002

https://gerrit.wikimedia.org/r/464369

Change 393943 had a related patch set uploaded (by RobH; owner: BBlack):
[operations/puppet@production] bast4002: switch over prometheus

https://gerrit.wikimedia.org/r/393943

Change 393943 merged by RobH:
[operations/puppet@production] bast4002: switch over prometheus

https://gerrit.wikimedia.org/r/393943

Change 464371 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] remove bast4001 from prometheus firewall exceptions

https://gerrit.wikimedia.org/r/464371

Ok, updates from IRC sync up and followup actions:

Mentioned in SAL (#wikimedia-operations) [2018-10-05T07:20:29Z] <godog> temporarily stop prometheus on bast4001 to finalize data transfer - T179050

Change 464369 merged by Filippo Giunchedi:
[operations/dns@master] changing prometheus.svc.ulsfo.wmnet entry to bast4002

https://gerrit.wikimedia.org/r/464369

Data transfer and CNAME flip completed. I've documented the data transfer itself at https://wikitech.wikimedia.org/wiki/Prometheus#Sync_data_from_an_existing_Prometheus_host. Once the TTL expires, bast4001 should no longer receive queries; this can be verified by looking at /var/log/apache2/other_vhosts_access.log on bast4001.
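
One way to do that log check is to filter the access log for requests stamped after the time the old DNS record should have expired. The sketch below is only an illustration: it assumes each line carries the usual Apache bracketed timestamp (for example `[05/Oct/2018:07:20:29 +0000]`), and the function name and the cutoff value are supplied for the example rather than taken from this task.

```python
import re
from datetime import datetime, timezone

# Bracketed Apache-style timestamp, e.g. [05/Oct/2018:07:20:29 +0000]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def requests_after(lines, cutoff):
    """Return the log lines whose timestamp is later than `cutoff`.

    `cutoff` must be a timezone-aware datetime (assumed here to be the
    time of the DNS change plus the record's TTL).
    """
    late = []
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if stamp > cutoff:
            late.append(line)
    return late
```

Passing the open log file as `lines` and a cutoff such as `datetime(2018, 10, 5, 9, 0, tzinfo=timezone.utc)` (a made-up time) should return an empty list once clients have stopped querying the old host. Local health checks or monitoring probes may still show up and would need to be excluded by hand.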

This has happened, so we can proceed with decommissioning bast4001 (e.g. in puppet).

Can this task be resolved, given we have T178592 to track the bast4001 decom?

Change 464371 abandoned by RobH:
[operations/puppet@production] remove bast4001 from prometheus firewall exceptions

Reason:
old neglected patchset, no longer needed.

https://gerrit.wikimedia.org/r/464371