Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons)
Open, High, Public

Description

This is (almost) the final step!

Progressively forward the remaining 30% of external traffic to MW-on-K8s

What?

The Wikikube cluster will fully serve:

  • External traffic (API, web, mobile)
  • Internal traffic
  • MediaWiki jobs (former jobrunners)

In MW-on-K8s terms, this translates to the following deployments:

  • mw-web
  • mw-api-int
  • mw-api-ext
  • mw-jobrunner
  • mw-parsoid
  • mw-wikifunctions

What we are not migrating to Wikikube yet

External traffic

Jobs

Internal traffic

Progression

  • 75%
  • 80%
  • 85%
  • 90%
  • 95%
  • 100%

Notes

The above is per DC.
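
To make the steps concrete, here is a minimal sketch that expands the progression into per-step increases for each datacenter. It is an illustration only, not the tooling used for this task. The 70% starting point is an assumption inferred from "the remaining 30%", and the names rollout_plan and per_dc_plan are hypothetical.

```python
# Rollout targets from the task description: percentage of external
# traffic served by MW-on-K8s within a single datacenter.
PROGRESSION = [75, 80, 85, 90, 95, 100]

# Assumption: traffic already on MW-on-K8s before this task, inferred
# from "the remaining 30%" in the description.
START_PERCENT = 70


def rollout_plan(start, steps):
    """Return one (previous, target, increase) tuple per rollout step."""
    plan = []
    previous = start
    for target in steps:
        # Each step must move more traffic than the one before it.
        if target <= previous:
            raise ValueError(f"step {target}% does not increase on {previous}%")
        plan.append((previous, target, target - previous))
        previous = target
    return plan


def per_dc_plan(datacenters, start=START_PERCENT, steps=PROGRESSION):
    """Apply the same progression independently to each datacenter."""
    return {dc: rollout_plan(start, steps) for dc in datacenters}


if __name__ == "__main__":
    # "eqiad" appears in the event timeline; pass any other DC names the same way.
    for dc, plan in per_dc_plan(["eqiad"]).items():
        for previous, target, increase in plan:
            print(f"{dc}: {previous}% -> {target}% (+{increase} points)")
```

Because the progression is per DC, each datacenter walks through all six steps on its own schedule rather than sharing a single global percentage.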

Event Timeline

Clement_Goubert created this task.

Change #1021904 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 75% traffic

https://gerrit.wikimedia.org/r/1021904

Change #1021905 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 75% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/1021905

Change #1021904 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 75% traffic

https://gerrit.wikimedia.org/r/1021904

Change #1021905 merged by Clément Goubert:

[operations/puppet@production] trafficserver: move 75% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/1021905

Change #1023397 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: move 5 eqiad appservers to kubernetes

https://gerrit.wikimedia.org/r/1023397

Change #1023397 merged by Clément Goubert:

[operations/puppet@production] kubernetes: move 5 eqiad appservers to kubernetes

https://gerrit.wikimedia.org/r/1023397

Change #1023412 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 80% traffic

https://gerrit.wikimedia.org/r/1023412

Change #1023413 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 80% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/1023413

Checking after MoveComms-Support was added to this task: what kind of support do you need, if any?

Change #1023412 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 80% traffic

https://gerrit.wikimedia.org/r/1023412

Change #1023413 merged by Clément Goubert:

[operations/puppet@production] trafficserver: move 80% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/1023413

Mentioned in SAL (#wikimedia-operations) [2024-04-24T09:29:39Z] <claime> 80% of external traffix to mw-on-k8s - T362323

Change #1026159 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move 80% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1026159

Change #1026160 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mw-we, mw-api-ext: bump replicas

https://gerrit.wikimedia.org/r/1026160

Change #1026158 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] k8s: move 5 eqiad appservers to kubernetes

https://gerrit.wikimedia.org/r/1026158

Change #1026158 merged by Hnowlan:

[operations/puppet@production] k8s: move 5 eqiad appservers to kubernetes

https://gerrit.wikimedia.org/r/1026158

Change #1026520 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] scap: make mw1407 a scap proxy

https://gerrit.wikimedia.org/r/1026520

Change #1026520 merged by Hnowlan:

[operations/puppet@production] scap: make mw1407 a scap proxy

https://gerrit.wikimedia.org/r/1026520

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1371.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1409.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1435.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1399.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1405.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1371.eqiad.wmnet with OS bullseye completed:

  • mw1371 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021132_hnowlan_3953702_mw1371.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1409.eqiad.wmnet with OS bullseye completed:

  • mw1409 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021135_hnowlan_3953708_mw1409.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1405.eqiad.wmnet with OS bullseye completed:

  • mw1405 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021137_hnowlan_3953738_mw1405.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1435.eqiad.wmnet with OS bullseye completed:

  • mw1435 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021139_hnowlan_3953714_mw1435.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1399.eqiad.wmnet with OS bullseye completed:

  • mw1399 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021143_hnowlan_3953743_mw1399.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1026160 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: bump replicas

https://gerrit.wikimedia.org/r/1026160

Change #1026159 merged by Hnowlan:

[operations/puppet@production] trafficserver: move 85% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1026159

Change #1028840 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: make 5 eqiad api appservers k8s workers

https://gerrit.wikimedia.org/r/1028840

Change #1028842 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mw-web, mw-api-ext: bump replicas in advance of traffic shift

https://gerrit.wikimedia.org/r/1028842

Change #1028844 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move k8s traffic shift to 90%

https://gerrit.wikimedia.org/r/1028844