eqiad: 1 VM request for WMDE Airflow
Closed, Resolved (Public)

Description

Site/Location: eqiad
Number of systems: 1
Service: Airflow - WMDE
Networking Requirements: internal IP - analytics vlan
Processor Requirements: 4 vCPUs
Memory: 8 GB of RAM
Disks: 100 GB
Other Requirements: none

Event Timeline

Verifying cluster availability and resources with a dry run of the resource report:

stevemunene@cumin1001:~$ sudo cookbook -d sre.ganeti.resource-report eqiad
DRY-RUN: Executing cookbook sre.ganeti.resource-report with args: ['eqiad']
DRY-RUN: START - Cookbook sre.ganeti.resource-report
+-------+-------+-----------+----------+-----------+---------+-----------+
| Group | Nodes | Instances |  MFree   | MFree avg |  DFree  | DFree avg |
+-------+-------+-----------+----------+-----------+---------+-----------+
|   A   |   7   |     37    | 260.4GiB |  37.2GiB  | 13.4TiB |   1.9TiB  |
|   B   |   6   |     33    | 207.5GiB |  34.6GiB  |  9.8TiB |   1.6TiB  |
|   C   |   7   |     36    | 265.3GiB |  37.9GiB  | 13.1TiB |   1.9TiB  |
|   D   |   6   |     34    | 230.0GiB |  38.3GiB  | 10.4TiB |   1.7TiB  |
+-------+-------+-----------+----------+-----------+---------+-----------+
DRY-RUN: END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)

Using group B based on the results.
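
For anyone who wants to compare the groups programmatically, here is a minimal sketch that parses the table printed above and ranks the groups by a chosen column. The helper names (`to_gib`, `parse_report`) are hypothetical and not part of the cookbook, the table text is pasted in as a string rather than read from the cookbook itself, and the task does not state which metric drove the choice of group, so the ranking shown is only one possible view.

```python
# Sketch only: parses the pasted resource-report table; not part of spicerack/cookbooks.
REPORT = """\
+-------+-------+-----------+----------+-----------+---------+-----------+
| Group | Nodes | Instances |  MFree   | MFree avg |  DFree  | DFree avg |
+-------+-------+-----------+----------+-----------+---------+-----------+
|   A   |   7   |     37    | 260.4GiB |  37.2GiB  | 13.4TiB |   1.9TiB  |
|   B   |   6   |     33    | 207.5GiB |  34.6GiB  |  9.8TiB |   1.6TiB  |
|   C   |   7   |     36    | 265.3GiB |  37.9GiB  | 13.1TiB |   1.9TiB  |
|   D   |   6   |     34    | 230.0GiB |  38.3GiB  | 10.4TiB |   1.7TiB  |
+-------+-------+-----------+----------+-----------+---------+-----------+
"""

# Conversion factors to GiB for the two suffixes that appear in the report.
UNITS = {"GiB": 1, "TiB": 1024}


def to_gib(value: str) -> float:
    """Convert a string such as '13.4TiB' or '37.2GiB' to a number of GiB."""
    for suffix, factor in UNITS.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    raise ValueError(f"unrecognised size: {value!r}")


def parse_report(text: str) -> list[dict[str, str]]:
    """Return one dict per data row, keyed by the table's header cells."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


if __name__ == "__main__":
    rows = parse_report(REPORT)
    # Rank by instance count, fewest first; swap the key to rank by another column,
    # e.g. key=lambda r: to_gib(r["MFree avg"]).
    for row in sorted(rows, key=lambda r: int(r["Instances"])):
        print(row["Group"], row["Instances"], row["MFree avg"], row["DFree avg"])
```
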

Created the VM with:
sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 --network analytics --os buster --cluster eqiad --group B an-airflow1007
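
The flags in the command above line up with the fields in the task description. As a rough illustration of that mapping, the sketch below assembles the same command line from a small record. `VMRequest` and `makevm_argv` are hypothetical names invented for this example, the flag names are copied from the command in this task rather than from the cookbook's documentation, and the script only prints the command without running anything.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class VMRequest:
    """Hypothetical container for the request fields listed in the description."""
    hostname: str
    cluster: str      # Site/Location
    group: str        # Ganeti group picked from the resource report
    vcpus: int        # Processor Requirements
    memory_gb: int    # Memory, in GB as stated in the request
    disk_gb: int      # Disks, in GB as stated in the request
    network: str      # Networking Requirements
    os: str


def makevm_argv(req: VMRequest) -> list[str]:
    """Build the argument list in the same flag order as the command in this task."""
    return [
        "sudo", "cookbook", "sre.ganeti.makevm",
        "--vcpus", str(req.vcpus),
        "--memory", str(req.memory_gb),
        "--disk", str(req.disk_gb),
        "--network", req.network,
        "--os", req.os,
        "--cluster", req.cluster,
        "--group", req.group,
        req.hostname,
    ]


if __name__ == "__main__":
    request = VMRequest(
        hostname="an-airflow1007",
        cluster="eqiad",
        group="B",
        vcpus=4,
        memory_gb=8,
        disk_gb=100,
        network="analytics",
        os="buster",
    )
    # Print only; review the command before running it by hand.
    print(shlex.join(makevm_argv(request)))
```
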

Both makevm and the reimage succeeded:

Reimage completed:
- an-airflow1007 (**PASS**)
  - Removed from Puppet and PuppetDB if present
  - Deleted any existing Puppet certificate
  - Removed from Debmonitor if present
  - Forced PXE for next reboot
  - Host rebooted via gnt-instance
  - Host up (Debian installer)
  - Set boot media to disk
  - Host up (new fresh buster OS)
  - Generated Puppet certificate
  - Signed new Puppet certificate
  - Run Puppet in NOOP mode to populate exported resources in PuppetDB
  - Found Nagios_host resource for this host in PuppetDB
  - Downtimed the new host on Icinga/Alertmanager
  - First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308141054_stevemunene_2724910_an-airflow1007.out
  - configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
  - Rebooted
  - Automatic Puppet run was successful
  - Forced a re-check of all Icinga services for the host
  - Icinga status is optimal
  - Icinga downtime removed
  - Updated Netbox data from PuppetDB


END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-airflow1007.eqiad.wmnet with OS buster
END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1007.eqiad.wmnet

The VM is online and reachable; resolving this task.

Stevemunene moved this task from In Progress to Done on the Data-Platform-SRE board.