Toolforge: upgrade main proxy servers to Debian Buster
Closed, ResolvedPublic

Description

We now have support in the puppet tree for building Debian Buster based proxy servers in Toolforge (related: T235059: Toolforge: refresh puppet code for proxy (dynamicproxy) to support Debian Buster).

Currently, tools-proxy-03 and tools-proxy-04 are running Debian Jessie, so they need to be rebuilt and switched from the old role (role::toollabs::proxy) to the new one (role::wmcs::toolforge::proxy).
The rebuild has been scheduled for 2019-10-28 14:30 UTC. An announcement for the operation has already been published: https://lists.wikimedia.org/pipermail/cloud-announce/2019-October/000226.html

A checklist and concrete operation steps will be added to this task before the operation window.

We agreed that both @Bstorm and @JHedden will supervise this operation.

Event Timeline

aborrero triaged this task as Medium priority. Oct 16 2019, 11:29 AM
aborrero created this task.
aborrero updated the task description.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Please @Bstorm and @JHedden confirm the scheduled operation window works for you both, thanks!

2019-10-28 14:30 UTC works for me. Thanks

If I set an alarm, I can be dressed and at a computer by then. I just have to set a reminder for the day before (am in PDT).

cool thanks! I will work on the operation steps soon for you to review.

aborrero updated the task description. Oct 21 2019, 10:19 AM

Proposed operation steps:

  • downtime monitoring, etc
  • disable puppet in tools-proxy-03/tools-proxy-04
  • change role and hiera keys in the puppet section in horizon for the tools-proxy prefix. We need to delete the role role::toollabs::proxy and add the new one role::wmcs::toolforge::proxy.
  • create 2 new VMs: tools-proxy-05 and tools-proxy-06 using Debian Buster as base image.
  • wait for puppet to complete on the new VMs.
  • ensure Redis data is replicated into the new VMs (read only though)
  • refresh hiera keys that specify the active proxy (Hiera:tools and horizon)
  • reallocate the floating IP 185.15.56.5 from tools-proxy-03 to tools-proxy-05.
  • run puppet everywhere in toolforge
  • check kube2proxy is active in tools-proxy-05 (and happy).
  • check that Redis is r/w in tools-proxy-05.
  • Check that everything else is working (webservices, etc)
  • Shut down or delete old VMs tools-proxy-03 and tools-proxy-04.
  • done.
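
The two Redis steps above (data replicated to the new VMs before the switch, r/w on tools-proxy-05 afterwards) could be verified from the output of the Redis INFO replication command. The following is only a minimal sketch in Python, assuming that output has already been captured as text. The function names and the sample text are illustrative and are not part of the actual procedure used in this operation.

```python
def parse_redis_info(text):
    """Parse the key:value lines of Redis INFO output into a dict.

    Section headers (lines starting with '#') and blank lines are skipped.
    """
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def replication_state(text):
    """Summarise the replication role reported by INFO replication."""
    info = parse_redis_info(text)
    role = info.get("role")
    if role == "master":
        return "primary (accepts writes)"
    # Redis reports replicas as "slave"; "replica" is accepted as well,
    # as a defensive assumption.
    if role in ("slave", "replica"):
        if info.get("master_link_status") == "up":
            return "replica, link up"
        return "replica, link down"
    return "unknown"


# Illustrative sample only, not real output from tools-proxy-05/06.
SAMPLE_REPLICA = """\
# Replication
role:slave
master_link_status:up
"""

print(replication_state(SAMPLE_REPLICA))
```

Before the switch, the new VMs would be expected to report a replica with the link up; after the switch, the active proxy would be expected to report itself as primary.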

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:34:31Z] <arturo> icinga downtime toolschecker for 1h (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:42:08Z] <arturo> deleted role::toollabs::proxy from the tools-proxy puppet profile (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:43:03Z] <arturo> adding role::wmcs::toolforge::proxy to the tools-proxy puppet prefix (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:45:26Z] <arturo> created VMs tools-proxy-05 and tools-proxy-06 (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T14:58:55Z] <arturo> added webproxy security group to tools-proxy-05 and tools-proxy-06 (T235627)

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T15:14:43Z] <arturo> refresh hiera to use tools-proxy-05 as active proxy T235627

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T15:16:47Z] <arturo> tools-proxy-05 has now the 185.15.56.5 floating IP as active proxy T235627

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T15:54:29Z] <arturo> shutting down tools-proxy-03 T235627

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T15:54:57Z] <arturo> tools-proxy-05 has now the 185.15.56.11 floating IP as active proxy. Old one 185.15.56.6 has been freed T235627

aborrero closed this task as Resolved. Oct 28 2019, 4:03 PM

This has been done.

Change 546640 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: delete unused proxy code

https://gerrit.wikimedia.org/r/546640

Change 546640 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: delete unused proxy code

https://gerrit.wikimedia.org/r/546640

No idea how it used to work (T56052 is old), but zhuyifei1999@tools-sgebastion-08: ~$ curl tools.wmflabs.org now hangs.

Bstorm reopened this task as Open. Oct 28 2019, 9:55 PM

I still see this on my laptop, which may not be a coincidence. I cannot reach any tools since we changed (even on a VPN so far). DNS resolution works correctly.

Bstorm added a comment. Edited Oct 28 2019, 10:00 PM

To demonstrate something is weird:

[bstorm@icinga1001]:~ $ curl tools.wmflabs.org
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.14.2</center>
</body>

It works from prod hosts but not cloud (and my local network for some reason).

I don't know about your local network but it breaking specifically within cloud may indicate a labsaliaser problem - resolving tools.wmflabs.org should give the internal IP of tools-proxy-05?
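
For readers unfamiliar with the split DNS setup referred to here, the idea can be sketched roughly as follows: inside cloud, a name whose public record points at a floating IP is answered with the instance's internal IP instead. This is only an illustration in Python of that lookup, not the real labsaliaser code. The internal address and the alias tables are made up; only the floating IP is taken from the SAL entries in this task.

```python
FLOATING_IP = "185.15.56.11"              # from the SAL entries in this task
HYPOTHETICAL_INTERNAL_IP = "172.16.0.10"  # made up for illustration


def cloud_view(public_answer, aliases):
    """Return the address a client inside cloud would be given.

    `aliases` maps floating IPs to internal IPs. Answers without an
    entry are passed through unchanged.
    """
    return aliases.get(public_answer, public_answer)


# Alias table that knows about the new floating IP assignment.
fresh = {FLOATING_IP: HYPOTHETICAL_INTERNAL_IP}
# Alias table that was not refreshed after the floating IP moved.
stale = {}

print(cloud_view(FLOATING_IP, fresh))
print(cloud_view(FLOATING_IP, stale))
```

With a stale table the client inside cloud is handed the public address rather than the internal one, which might explain a difference between prod hosts and cloud hosts. The sketch does not establish that this is what happened here.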

Change 546755 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py"

https://gerrit.wikimedia.org/r/546755

As for my local laptop, it was a random /etc/hosts entry I had from some old troubleshooting.

Change 546756 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Fix labsaliaser script to be executable

https://gerrit.wikimedia.org/r/546756

Mentioned in SAL (#wikimedia-cloud) [2019-10-28T22:55:54Z] <jeh> run labs-ip-alias-dump on cloudservices1003 and cloudservices1004 T235627

Change 546756 merged by Andrew Bogott:
[operations/puppet@production] Fix labsaliaser script to be executable

https://gerrit.wikimedia.org/r/546756

Change 546755 merged by Andrew Bogott:
[operations/puppet@production] Revert "cloudvps: ignore stderr in labs-ip-alias-dump.py"

https://gerrit.wikimedia.org/r/546755

Mentioned in SAL (#wikimedia-cloud) [2019-10-29T10:07:31Z] <arturo> deleting old jessie VMs tools-proxy-03/04 T235627

aborrero closed this task as Resolved. Oct 29 2019, 10:08 AM

Thanks everyone for the follow-up on the split DNS situation. Closing the task again now.