create aphlict2001 (Phabricator realtime notifications codfw)
Closed, ResolvedPublic

Description

aphlict1001 is a VM that only exists in eqiad.

It acts as a sidekick to the physical Phabricator host and runs the realtime notification service.

The physical hosts are about to be upgraded from phab1001 to phab1004 and from phab2001 to phab2002.

To have parity in codfw for the Phabricator setup, we also need the codfw equivalent of aphlict: aphlict2001.


When the aphlict service was moved from the Phabricator hosts to dedicated VMs, it also got its own discovery DNS name.

So now there is:

[phab1004:~] $ host aphlict.discovery.wmnet
aphlict.discovery.wmnet is an alias for aphlict1001.eqiad.wmnet.


[phab1004:~] $ host aphlict1001.eqiad.wmnet
aphlict1001.eqiad.wmnet has address 10.64.48.39
aphlict1001.eqiad.wmnet has IPv6 address 2620:0:861:107:10:64:48:39

[phab1004:~] $ host aphlict2001.codfw.wmnet
Host aphlict2001.codfw.wmnet not found: 3(NXDOMAIN)
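
The three lookups above could be turned into a scripted check. What follows is a minimal sketch that only assumes the `host` output format shown in this task; the function name classify_host_output is made up for illustration and is not part of any existing tooling.

```shell
# Classify one block of `host` output read from stdin.
# Prints "alias <target>", "address" or "nxdomain".
# Returns 1 if none of the known patterns matched.
classify_host_output() {
    awk '
        /is an alias for/          { sub(/\.$/, "", $NF); print "alias " $NF; found = 1; exit }
        /has (IPv6 )?address/      { print "address"; found = 1; exit }
        /not found: 3\(NXDOMAIN\)/ { print "nxdomain"; found = 1; exit }
        END { exit !found }
    '
}

# Intended use (needs access to the internal resolvers, not executed here):
#   host aphlict2001.codfw.wmnet | classify_host_output
```

Once aphlict2001 has a DNS record, the last lookup should be classified as "address" instead of "nxdomain".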

Ultimately this ticket should end with an edit to this in the DNS repo:

; misc services without multiple backends
aphlict               300 IN CNAME aphlict1001.eqiad.wmnet.

With a second backend in codfw, the entry can then be moved into the section "; misc web services with multiple backends but without geoip".

That's an upgrade in itself.

Event Timeline

Dzahn renamed this task from create aphlict2001 to create aphlict2001 (Phabricator realtime notifications codfw). Nov 3 2022, 8:06 PM
Dzahn updated the task description.

Change 853010 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] remove phab1001-aphlict.eqiad.wmnet

https://gerrit.wikimedia.org/r/853010

Change 853010 merged by Dzahn:

[operations/dns@master] remove phab1001-aphlict.eqiad.wmnet

https://gerrit.wikimedia.org/r/853010

Wikiakbar1 updated the task description.
Wikiakbar1 updated the task description.
Wikiakbar1 removed a project: Phabricator.
Wikiakbar1 updated the task description.
Wikiakbar1 edited subscribers, added: Wikiakbar1; removed: Aklapper, Dzahn.

Created VM with:

dzahn@cumin2002:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 2 --disk 20 --network private --cluster codfw --group D aphlict2001

LSobanski triaged this task as Medium priority. Jan 17 2023, 3:57 PM

Change 888690 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Add puppet role for aphlict vm in codfw

https://gerrit.wikimedia.org/r/888690

It seems that there are three things left to do here, since the VM has already been created and shut down:

  • Add puppet role (not many changes to make beyond adding the new host to site.pp)
  • Switch on VM, make sure it bootstraps correctly
  • Change DNS

  • add "insetup puppet role"
  • start VM
  • get console on VM
  • use "install_console" command from a puppetmaster to get initial root on VM (https://wikitech.wikimedia.org/wiki/Ganeti#Sign_Puppet_certs)
  • run puppet agent on VM for the first time; this creates a cert signing request
  • sign the request on the puppetmaster
  • run puppet agent on VM one more time; a lot of things get installed, all from the "base" module included by the "insetup" role
  • run puppet again until nothing new happens anymore
  • upload a change that flips the role from "insetup" to the production role and use the puppet compiler to see what it does. Does it hardcode the host name "1001" somewhere, for example? Does it start services by default, and should they be started on the second instance? Keep in mind this role has never had a second server before, be it active or passive.
  • apply the puppet change if it looks fine and no other changes are needed
  • change the status of the VM in netbox to "active"
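
The "run puppet again until nothing new happens anymore" step above can be scripted. This is only a sketch: the helper name run_until_converged and the retry limit are made up, and it relies on the detailed exit codes that `puppet agent --test` reports (0 = no changes, 2 = changes applied, anything else = failure).

```shell
# Re-run a command until it reports "no changes" (exit code 0),
# giving up after max_runs attempts.
# Usage: run_until_converged MAX_RUNS COMMAND [ARGS...]
run_until_converged() {
    local max_runs="$1" run=1 rc
    shift
    while [ "$run" -le "$max_runs" ]; do
        if "$@"; then rc=0; else rc=$?; fi
        case "$rc" in
            0) echo "converged after $run run(s)"; return 0 ;;
            2) echo "run $run applied changes, running again" ;;
            *) echo "run $run failed with exit code $rc" >&2; return "$rc" ;;
        esac
        run=$((run + 1))
    done
    echo "still applying changes after $max_runs runs" >&2
    return 1
}

# Intended use on the new VM (not executed here):
#   run_until_converged 5 sudo puppet agent --test
```

Stopping on any exit code other than 0 or 2 keeps a failing run from being retried blindly.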

Change 888690 merged by EoghanGaffney:

[operations/puppet@production] Add insetup puppet role for aphlict vm in codfw

https://gerrit.wikimedia.org/r/888690

This VM would not boot. After debugging, we found out that this is because it is not in the DHCP server config.

Until recently we had to add the MAC address of a freshly created VM to the DHCP config under the install_server module; without that, the VM would not boot.

But this has now changed and the files are gone from the puppet repo.

It seems the way this is handled changed between the time I created this VM and now, when @eoghan wants to install an OS on it.

Trying to use the "reimage" cookbook resulted in a "not found in netbox" error, but as Eoghan points out, the VM does exist in netbox. It's currently unclear why that happens.

It might be easiest to just delete this VM, create it again with the makevm cookbook, and check whether the MAC gets added to DHCP then.

Currently only aphlict1001, not aphlict2001, is in the DHCP config on the install server:

[install2004:/etc/dhcp] $ grep -r aphlict *
linux-host-entries.ttyS0-115200:host aphlict1001 {
linux-host-entries.ttyS0-115200:    fixed-address aphlict1001.eqiad.wmnet;
[install2004:/etc/dhcp] $

The MAC address of aphlict2001 is aa:00:00:10:b5:6f.
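
The grep above can be wrapped into a yes/no check to run again after the VM has been recreated. A minimal sketch, assuming host entries keep the `host NAME {` form seen in the output above; the function name dhcp_has_host is made up.

```shell
# Succeed if any file under DIR contains a DHCP host declaration for NAME.
# Usage: dhcp_has_host DIR NAME
dhcp_has_host() {
    local dir="$1" name="$2"
    grep -rqE "^[[:space:]]*host[[:space:]]+${name}[[:space:]]*\{" "$dir"
}

# Intended use on the install server (not executed here):
#   dhcp_has_host /etc/dhcp aphlict2001 && echo present || echo missing
```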

Cookbook cookbooks.sre.ganeti.reimage was started by eoghan@cumin2002 for host aphlict2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by eoghan@cumin2002 for host aphlict2001.codfw.wmnet with OS bullseye completed:

  • aphlict2001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302142210_eoghan_2528502_aphlict2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Change 889531 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Add puppet role to new aphlict VM

https://gerrit.wikimedia.org/r/889531

Change 889531 merged by EoghanGaffney:

[operations/puppet@production] Add puppet role to new aphlict VM

https://gerrit.wikimedia.org/r/889531

The production role has been applied, which is great!

Though it looks like next we need a scap deployment to aphlict2001. It would probably be best to add this as a topic to the next sync meeting with releng. Maybe we can do the deploy and also make sure the phab config only refers to the discovery DNS record and not a host name.

Once we have done these things, we know we can fail over aphlict by just changing a DNS record.

To be truly active-active would be another thing though.
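
One way to check that the config only refers to the discovery record is to search it for host-specific names. This is a sketch under assumptions: the function name is made up, the pattern only covers the eqiad and codfw naming seen in this task, and which files to scan is left to the caller.

```shell
# Print lines that name a specific aphlict host (e.g. aphlict1001.eqiad.wmnet)
# instead of aphlict.discovery.wmnet.
# Returns 0 if any such line was found, 1 if none.
# Usage: find_hardcoded_aphlict_hosts FILE_OR_DIR...
find_hardcoded_aphlict_hosts() {
    grep -rnE 'aphlict[0-9]+\.(eqiad|codfw)\.wmnet' "$@"
}
```

An empty result means failover should not need any config change beyond the DNS record.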

T330393 was created because puppet fails on this machine. That is a known issue, and the scap deploy in T329908 should fix it.

Mentioned in SAL (#wikimedia-operations) [2023-02-23T19:50:19Z] <mutante> aphlict2001 - manually created /etc/phabricator/config.yaml - empty file owned by root:phab-deploy to debug for T330393 T322369

19:41 < mutante> scap has defaults to use gerrit as origin: modules/scap/lib/puppet/provider/scap_source/default.rb: "https://gerrit.wikimedia.org/r/#{repo_name}.git"
19:42 < mutante> so you cant use gitlab repos as origin for something deployed with scap
19:42 < mutante> this currently is a problem for phabricator repos which got moved from phabricator itself to gitlab
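
To illustrate the point from the chat log: if the origin is built only from the repo name, as in the quoted default, there is no place to put a different host. The sketch below just mirrors that string pattern in shell; the function name and the example repo name are made up, and this is not scap's own code.

```shell
# Build the default origin URL the same way as the quoted pattern:
# the host part is fixed, only the repo name varies.
default_scap_origin() {
    printf 'https://gerrit.wikimedia.org/r/%s.git\n' "$1"
}

# Example with a hypothetical repo name:
#   default_scap_origin phabricator/deployment
```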

I split out the phabricator config into a separate module from the main phabricator module in https://gerrit.wikimedia.org/r/c/operations/puppet/+/891841 - next step is to apply this role to the aphlict machines. The catch is that because these machines are configured by hand, we might overwrite the configs that were originally deployed. We'll apply it to aphlict2001 to start with, see if we can make that work, and then apply it to aphlict1001.

Change 895240 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Add the aphlict config on aphlict2001.codfw

https://gerrit.wikimedia.org/r/895240

Change 895240 merged by EoghanGaffney:

[operations/puppet@production] Add the aphlict config on aphlict2001.codfw

https://gerrit.wikimedia.org/r/895240

I've tried deploying phabricator::config with an empty config to aphlict2001 and running scap after that, but scap needs some credentials in there. At a minimum, it needs the variables required to render these config files:

13:05:31 Rendering config_file: /srv/deployment/phabricator/deployment-cache/revs/3f2dd1bf5761597244976c8a39822b08c0843199/phabricator/conf/local/local.json using /etc/phabricator/config.yaml
13:05:31 Rendering config_file: /srv/deployment/phabricator/deployment-cache/revs/3f2dd1bf5761597244976c8a39822b08c0843199/phabricator/conf/local/mail.json using /etc/phabricator/config.yaml
13:05:31 Rendering config_file: /srv/deployment/phabricator/deployment-cache/revs/3f2dd1bf5761597244976c8a39822b08c0843199/phabricator/conf/local/phd.json using /etc/phabricator/config.yaml
13:05:31 Rendering config_file: /srv/deployment/phabricator/deployment-cache/revs/3f2dd1bf5761597244976c8a39822b08c0843199/phabricator/conf/local/vcs.json using /etc/phabricator/config.yaml
13:05:31 Rendering config_file: /srv/deployment/phabricator/deployment-cache/revs/3f2dd1bf5761597244976c8a39822b08c0843199/phabricator/conf/local/www.json using /etc/phabricator/config.yaml
13:05:31 Rendering config_file: /srv/deployment/phabricator/deployment-cache/revs/3f2dd1bf5761597244976c8a39822b08c0843199/phabricator/support/preamble.php using /etc/phabricator/config.yaml
13:05:31 Rendering config_file: /srv/deployment/phabricator/deployment-cache/revs/3f2dd1bf5761597244976c8a39822b08c0843199/phabricator/support/redirect_config.json using /etc/phabricator/config.yaml

Putting a useful phabricator config (put in by @Dzahn recently to debug scap with @brennen) in place on the host let scap finish a deploy successfully, so I think we're very close to completion here.

I'll try adding these credentials to the aphlict config and see if that allows it to continue. In the meantime, I'm leaving puppet disabled on that host for the rest of the day.
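
Before re-running scap, it may help to check which variables are still absent from /etc/phabricator/config.yaml. This is a rough sketch: the function name is made up, the variable names in the usage line are placeholders (the actual list scap needs is not stated here), and it only matches simple top-level `key:` lines rather than parsing YAML.

```shell
# Print every variable name that does not appear as a top-level key in FILE.
# Usage: missing_config_vars FILE VAR...
missing_config_vars() {
    local file="$1" var
    shift
    for var in "$@"; do
        grep -qE "^${var}[[:space:]]*:" "$file" || echo "$var"
    done
}

# Example with placeholder variable names (not executed here):
#   missing_config_vars /etc/phabricator/config.yaml example_user example_pass
```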

Change 897852 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Add dummy 'config_deploy_vars' for aphlict

https://gerrit.wikimedia.org/r/897852

Change 897852 merged by EoghanGaffney:

[operations/puppet@production] Add dummy 'config_deploy_vars' for aphlict

https://gerrit.wikimedia.org/r/897852

It looks like the dummy config variables for the phabricator config at least got us as far as a successful puppet deployment, including a scap deploy of phabricator.

I currently see aphlict running on aphlict2001.codfw.wmnet. Before we add it to the DNS template, I'd like to find a way to test whether this works correctly. I'll try to come up with a good way of doing this soon.
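
A first, local check could be whether the service is listening on its port at all. The sketch below filters `ss -ltn` output; the function name is made up, and port 22280 in the usage line is the upstream Phabricator default for the aphlict client port, which is an assumption about this deployment. It does not prove that notifications are actually delivered.

```shell
# Read `ss -ltn` output on stdin and succeed if a socket is listening on PORT.
# Usage: ss -ltn | port_is_listening PORT
port_is_listening() {
    local port="$1"
    awk -v port="$port" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Intended use on aphlict2001 (not executed here; port is an assumption):
#   ss -ltn | port_is_listening 22280 && echo listening || echo not listening
```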

Change 903264 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Assign insetup role to new aphlict vm

https://gerrit.wikimedia.org/r/903264

Change 903264 merged by EoghanGaffney:

[operations/puppet@production] Assign insetup role to new aphlict vm

https://gerrit.wikimedia.org/r/903264

Cookbook cookbooks.sre.ganeti.reimage was started by eoghan@cumin1001 for host aphlict1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by eoghan@cumin1001 for host aphlict1002.eqiad.wmnet with OS bullseye completed:

  • aphlict1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303281209_eoghan_145080_aphlict1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Change 903641 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Add aphlict role to new vm host

https://gerrit.wikimedia.org/r/903641

Change 903641 merged by EoghanGaffney:

[operations/puppet@production] Add aphlict role to new vm host

https://gerrit.wikimedia.org/r/903641

Cookbook cookbooks.sre.ganeti.reimage was started by eoghan@cumin1001 for host aphlict2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by eoghan@cumin1001 for host aphlict2001.codfw.wmnet with OS bullseye completed:

  • aphlict2001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202305041048_eoghan_1906453_aphlict2001.out
    • Unable to run puppet on puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

We now have aphlict2001 running in codfw. We haven't tested it, since we don't allow cross-region traffic, but it uses the same puppet modules as aphlict1002, which was brought from bare install to production traffic without intervention, so I'm confident it'll work.

I've also updated the DNS file to move aphlict into the "; misc web services with multiple backends but without geoip" section.