Toolforge: upgrade main proxy servers to Debian Buster
Closed, Resolved, Public

Description

We now have support in the puppet tree for building Debian Buster based proxy servers in Toolforge (related: T235059: Toolforge: refresh puppet code for proxy (dynamicproxy) to support Debian Buster).

Currently, tools-proxy-03 and tools-proxy-04 are running Debian Jessie, so they need to be rebuilt and switched from the old role (role::toollabs::proxy) to the new one (role::wmcs::toolforge::proxy).
The rebuild has been scheduled for 2019-10-28 14:30 UTC. An announcement for the operation has already been published: https://lists.wikimedia.org/pipermail/cloud-announce/2019-October/000226.html

A checklist and concrete operation steps will be added to this task before the operation window.

We agreed that both @Bstorm and @JHedden will supervise this operation.

Details

Related Gerrit Patches:
operations/puppet : production : Fix labsaliaser script to be executable
operations/puppet : production : Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py"
operations/puppet : production : toollabs: delete unused proxy code

Event Timeline

aborrero triaged this task as Normal priority. Oct 16 2019, 11:29 AM
aborrero created this task.
aborrero updated the task description.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Please @Bstorm and @JHedden confirm the scheduled operation window works for you both, thanks!

2019-10-28 14:30 UTC works for me. Thanks

If I set an alarm, I can be dressed and at a computer by then. I just have to set a reminder for the day before (am in PDT).

cool thanks! I will work on the operation steps soon for you to review.

aborrero updated the task description. Mon, Oct 21, 10:19 AM

Proposed operation steps:

  • downtime monitoring, etc
  • disable puppet in tools-proxy-03/tools-proxy-04
  • change role and hiera keys in the puppet section in horizon for the tools-proxy prefix. We need to delete the role role::toollabs::proxy and add the new one role::wmcs::toolforge::proxy.
  • create 2 new VMs: tools-proxy-05 and tools-proxy-06 using Debian Buster as base image.
  • wait for puppet to complete on the new VMs.
  • ensure Redis data is replicated into the new VMs (read only though)
  • refresh hiera keys that specify the active proxy (Hiera:tools and horizon)
  • reallocate the floating IP 185.15.56.5 from tools-proxy-03 to tools-proxy-05.
  • run puppet everywhere in toolforge
  • check kube2proxy is active in tools-proxy-05 (and happy).
  • check Redis is r/w in tools-proxy-05.
  • Check that everything else is working (webservices, etc)
  • Shut down or delete old VMs tools-proxy-03 and tools-proxy-04.
  • done.
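
The two Redis steps above (replicated and read-only on the new VMs before the switch, r/w on tools-proxy-05 afterwards) could be checked from the text that `redis-cli INFO replication` prints. This is only a minimal sketch, not the procedure used for this operation: the function names and the three state labels are made up for illustration, and the only Redis details relied on are the standard `role` and `master_link_status` fields of the INFO output.

```python
def parse_redis_info(text):
    """Parse INFO-style output ('key:value' lines, '#' section headers) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as '# Replication'
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def replication_state(text):
    """Classify a Redis instance from its INFO replication output.

    Returns one of three hypothetical labels:
      'active'      - role is master (expected on the active proxy, r/w)
      'replicating' - role is replica and the link to the master is up
      'not-ready'   - anything else
    """
    info = parse_redis_info(text)
    role = info.get("role")
    if role == "master":
        return "active"
    # 'slave' is the value used in INFO output; 'replica' is accepted as well
    # in case a version reports it that way (an assumption).
    if role in ("slave", "replica") and info.get("master_link_status") == "up":
        return "replicating"
    return "not-ready"
```

With this sketch, a new VM would be expected to report `replicating` before the hiera refresh, and tools-proxy-05 would be expected to report `active` once it is the active proxy.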

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:34:31Z] <arturo> icinga downtime toolschecker for 1h (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:42:08Z] <arturo> deleted role::toollabs::proxy from the tools-proxy puppet profile (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:43:03Z] <arturo> adding role::wmcs::toolforge::proxy to the tools-proxy puppet prefix (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:45:26Z] <arturo> created VMs tools-proxy-05 and tools-proxy-06 (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:58:55Z] <arturo> added webproxy security group to tools-proxy-05 and tools-proxy-06 (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T15:14:43Z] <arturo> refresh hiera to use tools-proxy-05 as active proxy T235627

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T15:16:47Z] <arturo> tools-proxy-05 has now the 185.15.56.5 floating IP as active proxy T235627

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T15:54:29Z] <arturo> shutting down tools-proxy-03 T235627

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T15:54:57Z] <arturo> tools-proxy-05 has now the 185.15.56.11 floating IP as active proxy. Old one 185.15.56.6 has been freed T235627

aborrero closed this task as Resolved. Mon, Oct 28, 4:03 PM

This has been done.

Change 546640 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: delete unused proxy code

https://gerrit.wikimedia.org/r/546640

Change 546640 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: delete unused proxy code

https://gerrit.wikimedia.org/r/546640

No idea how it used to work (T56052 is old), but zhuyifei1999@tools-sgebastion-08: ~$ curl tools.wmflabs.org now hangs.

Bstorm reopened this task as Open. Mon, Oct 28, 9:55 PM

I still see this on my laptop, which may not be a coincidence. I cannot reach any tools since we changed (even on a VPN so far). DNS resolution works correctly.

Bstorm added a comment. Edited Mon, Oct 28, 10:00 PM

To demonstrate something is weird:

[bstorm@icinga1001]:~ $ curl tools.wmflabs.org
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.14.2</center>
</body>

It works from prod hosts but not cloud (and my local network for some reason).
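
To compare vantage points (prod host, cloud VM, laptop) without waiting on a hanging curl, a bounded TCP connect attempt can tell "hangs" apart from "refused". The following is a rough sketch only: `probe_tcp` and its return labels are hypothetical names, and it checks nothing beyond whether a TCP connection to the given port can be opened within the timeout.

```python
import socket


def probe_tcp(host, port=80, timeout=5.0):
    """Try a TCP connect and return 'open', 'timeout', 'refused' or 'error'."""
    try:
        # create_connection resolves the name and connects, honouring the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"  # no answer at all, which is what a hanging curl looks like
    except ConnectionRefusedError:
        return "refused"  # host reachable, but nothing listening on the port
    except OSError:
        return "error"    # DNS failure, unreachable network, etc.
```

Running something like `probe_tcp("tools.wmflabs.org", 80)` on each host would show whether the failure is a silent drop (`timeout`) or something else.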

I don't know about your local network but it breaking specifically within cloud may indicate a labsaliaser problem - resolving tools.wmflabs.org should give the internal IP of tools-proxy-05?
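
A quick way to test that theory from a cloud VM is to look at what the name resolves to and whether the answer is an internal or a public address. This is a sketch under one stated assumption: that instance addresses such as the internal IP of tools-proxy-05 come from a private range, so Python's `is_private` can stand in for "internal". The function names are hypothetical.

```python
import ipaddress
import socket


def address_scope(addr):
    """Return 'internal' for private-range addresses, 'public' otherwise.

    Assumption: internal instance IPs are in a private range.
    """
    return "internal" if ipaddress.ip_address(addr).is_private else "public"


def resolve_scopes(hostname):
    """Resolve hostname with the system resolver and classify each IPv4 answer."""
    _name, _aliases, addrs = socket.gethostbyname_ex(hostname)
    return {addr: address_scope(addr) for addr in addrs}
```

If `resolve_scopes("tools.wmflabs.org")` run inside cloud returns only `public` entries (for example the floating IP), the internal alias is not being served, which would point at labsaliaser.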

Change 546755 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py"

https://gerrit.wikimedia.org/r/546755

As for my local laptop, it was a random /etc/hosts entry I had from some old troubleshooting.

Change 546756 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Fix labsaliaser script to be executable

https://gerrit.wikimedia.org/r/546756
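
The patch title suggests the script was missing its execute bit; the actual change is not shown in this task. As a hedged illustration of how that condition can be detected, the sketch below checks a file's mode bits directly. The function name is hypothetical, and the path of the real script is not known from this task, so none is given.

```python
import os
import stat


def is_executable(path):
    """Return True if path is a regular file with at least one execute bit set."""
    mode = os.stat(path).st_mode
    any_exec = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return stat.S_ISREG(mode) and bool(mode & any_exec)
```

Checking the mode bits rather than calling `os.access` keeps the result independent of which user runs the check.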

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T22:55:54Z] <jeh> run labs-ip-alias-dump on cloudservices1003 and cloudservices1004 T235627

Change 546756 merged by Andrew Bogott:
[operations/puppet@production] Fix labsaliaser script to be executable

https://gerrit.wikimedia.org/r/546756

Change 546755 merged by Andrew Bogott:
[operations/puppet@production] Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py"

https://gerrit.wikimedia.org/r/546755

Mentioned in SAL (#wikimedia-cloud) [2019-10-29T10:07:31Z] <arturo> deleting old jessie VMs tools-proxy-03/04 T235627

aborrero closed this task as Resolved. Tue, Oct 29, 10:08 AM

Thanks everyone for the followup with the split DNS situation. Closing task again now.