TCP-proxy specific work for {T365259}
Active patches:
topic branch: https://gerrit.wikimedia.org/r/q/topic:%22tcp-proxy%22
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Restricted Task | | | |
| Resolved | | Dzahn | T411895 gerrit behind CDN |
| Resolved | | Dzahn | T408532 Deploy a TCP proxy across all DCs |
| Resolved | | Dzahn | T408064 Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) |
Change #1198281 had a related patch set uploaded (by Jelto; author: Jelto):
[operations/puppet@production] git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies
Change #1198397 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] site/role: create placeholder role/profile for tcpproxy
Change #1198397 merged by Dzahn:
[operations/puppet@production] site/role: create placeholder role/profile for tcpproxy
Change #1200188 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] site: fix regex for tcp-proxy to cover 1002
Change #1200188 merged by Dzahn:
[operations/puppet@production] site: fix regex for tcp-proxy to cover 1002
Change #1200189 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: set puppet7 and firewall provider to ferm for new role
Change #1200189 merged by Dzahn:
[operations/puppet@production] tcpproxy: set puppet7 and firewall provider to ferm for new role
Mentioned in SAL (#wikimedia-operations) [2025-10-30T23:48:48Z] <mutante> forward-fixing to puppet7 on tcp-proxy1001/1002 per T349619 T408532
Change #1200190 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: add config template
Change #1200190 merged by Dzahn:
[operations/puppet@production] tcpproxy: add config template and parameters
Change #1201299 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: add firewall rule to allow gerrit ssh port
Change #1201299 merged by Dzahn:
[operations/puppet@production] tcpproxy: add firewall rule to allow gerrit ssh port
Change #1201311 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: add basic logging config
Change #1201312 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] site: apply tcpproxy role on all VMs created for it
Change #1201311 merged by Dzahn:
[operations/puppet@production] tcpproxy: add basic logging config
Change #1201745 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: greatly reduce connection timeouts
Change #1201745 merged by Dzahn:
[operations/puppet@production] tcpproxy: greatly reduce connection timeouts
Change #1201312 merged by Dzahn:
[operations/puppet@production] site: apply tcpproxy role on all VMs created for it
Change #1201810 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] prometheus::ops: add tcpproxies scrape
Change #1201810 merged by Dzahn:
[operations/puppet@production] prometheus::ops: add tcpproxies scrape
Change #1201820 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] gerrit: allow production networks to connect to gerrit-ssh
Change #1201822 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: use notify to ensure service gets restarted on config changes
Change #1201820 merged by Dzahn:
[operations/puppet@production] gerrit: allow production networks to connect to gerrit-ssh
Change #1201822 merged by Dzahn:
[operations/puppet@production] tcpproxy: use notify to ensure service gets restarted on config changes
Change #1201828 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: add simple puppet service resource to manage haproxy
Change #1201828 merged by Dzahn:
[operations/puppet@production] tcpproxy: add simple puppet service resource to manage haproxy
haproxy configured as TCP-proxy for gerrit-ssh has been deployed on all 14 VMs, across POPs.
We can now connect to localhost 29418 and get gerrit-ssh on each of them.
[tcp-proxy5001:~] $ nc localhost 29418
SSH-2.0-GerritCodeReview_3.10.6 (APACHE-SSHD-2.12.0)
[tcp-proxy2001:~] $ nc localhost 29418
SSH-2.0-GerritCodeReview_3.10.6 (APACHE-SSHD-2.12.0)
..
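The same banner check could be scripted instead of typed by hand. The following is only a minimal sketch, not something that was run as part of this task: the function names are made up for illustration, and the only facts taken from above are port 29418 and the `SSH-2.0-GerritCodeReview` prefix of the banner.

```python
import socket

GERRIT_SSH_PORT = 29418  # port used in the nc check above


def first_line(data: bytes) -> str:
    """Return the first line of a byte stream, without the trailing CR/LF."""
    return data.split(b"\n", 1)[0].rstrip(b"\r").decode("ascii", errors="replace")


def read_ssh_banner(host: str, port: int = GERRIT_SSH_PORT, timeout: float = 3.0) -> str:
    """Connect to host:port and return the identification line the server sends."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = b""
        # Read until the first newline; an SSH identification line is short.
        while b"\n" not in data and len(data) < 255:
            chunk = sock.recv(255 - len(data))
            if not chunk:
                break  # peer closed the connection before sending a full line
            data += chunk
    return first_line(data)


def is_gerrit_ssh(banner: str) -> bool:
    """True if the banner looks like the Gerrit SSH daemon seen in the output above."""
    return banner.startswith("SSH-2.0-GerritCodeReview")
```

On one of the proxy VMs, `is_gerrit_ssh(read_ssh_banner("localhost"))` would be expected to return `True` if haproxy forwards to gerrit-ssh as described.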
Change #1202152 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] tcpproxy: haproxy: make stats work
Change #1202152 merged by CDanis:
[operations/puppet@production] tcpproxy: haproxy: make stats work
Change #1202163 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] tcpproxy: haproxy: listen on v4+v6 for both ports
Change #1202172 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] tcpproxy: haproxy: log level change to info
Change #1202163 merged by Dzahn:
[operations/puppet@production] tcpproxy: haproxy: listen on v4+v6 for both ports
Change #1202172 merged by Dzahn:
[operations/puppet@production] tcpproxy: haproxy: log level change to info
Change #1202261 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: allow PRODUCTION_NETWORKS to connect to 29418
Change #1202261 merged by Dzahn:
[operations/puppet@production] tcpproxy: allow PRODUCTION_NETWORKS to connect to 29418
We are debugging why connections (nc tcp-proxy* 29418) work from some of the VMs but not from others, in this pattern:
We had some strange results when trying to debug this together. So I ended up testing every combination between the proxy VMs.
Using nc -z -w3 in a loop produces this type of output below.
Here it becomes obvious that esams and magru are IPv4-only and that there are some other outliers. But connections do succeed one way or another for all of them when testing like this.
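For anyone repeating this test, the per-address-family check can be sketched in Python. This is an illustration under assumptions, not the loop that was actually used: the helper names are hypothetical, the host names are derived from the site/number scheme of the 14 VMs, and each probe runs from whatever machine executes the script, so a full matrix would need it to run on every proxy in turn.

```python
import itertools
import socket

# Site name -> leading digit of the host number, per the naming of the 14 VMs.
SITES = {"eqiad": 1, "codfw": 2, "esams": 3, "ulsfo": 4,
         "eqsin": 5, "drmrs": 6, "magru": 7}
PORT = 29418


def proxy_hosts() -> list:
    """Build the 14 tcp-proxy host names, two per site."""
    return [f"tcp-proxy{n}00{i}.{site}.wmnet"
            for site, n in SITES.items() for i in (1, 2)]


def source_target_pairs(hosts) -> list:
    """Every ordered (source, target) combination, excluding self-pairs."""
    return list(itertools.permutations(hosts, 2))


def port_open(host, port=PORT, family=socket.AF_INET, timeout=3.0) -> bool:
    """Rough equivalent of `nc -z -w3`, restricted to one address family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # name does not resolve for this address family
    for fam, socktype, proto, _canon, addr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return True
        except OSError:
            continue  # refused, unreachable, or timed out; try next address
    return False


def probe_all(hosts) -> dict:
    """Map each target host to (IPv4 reachable, IPv6 reachable)."""
    return {h: (port_open(h, family=socket.AF_INET),
                port_open(h, family=socket.AF_INET6)) for h in hosts}
```

Splitting the probe by address family is what makes an IPv4-only site stand out: its entry in `probe_all(proxy_hosts())` would read `(True, False)`.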
Change #1202782 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/dns@master] allocate eqiad VIP for load balanced tcp-proxy service
Change #1202782 merged by Dzahn:
[operations/dns@master] allocate eqiad VIP for load balanced tcp-proxy service
Change #1202835 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/dns@master] allocate codfw VIP for load-balanced tcp-proxy service
Change #1202842 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] service: add tcpproxy service to service catalog (WIP)
Change #1202835 merged by Dzahn:
[operations/dns@master] allocate codfw VIP for load-balanced tcp-proxy service
Change #1203157 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: include profile::lvs::realserver in role
A proxy is running on 14 VMs, 2 in each of the 7 POPs.
What is missing is the load-balancing part.
Change #1198281 abandoned by Jelto:
[operations/puppet@production] git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies
Reason:
not needed anymore in favor of haproxy
Change #1202842 merged by Dzahn:
[operations/puppet@production] service: add gerrit-https service to service catalog
Change #1203157 abandoned by Dzahn:
[operations/puppet@production] tcpproxy: include profile::lvs::realserver in role
Reason:
replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1215240
We have to switch these hosts from nftables back to ferm as firewall provider. Reason: liberica does not support nftables yet.
Change #1215284 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] tcpproxy: switch firewall provider from nftables to ferm
Mentioned in SAL (#wikimedia-operations) [2025-12-04T22:22:16Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: T408532
Change #1215284 merged by Dzahn:
[operations/puppet@production] tcpproxy: switch firewall provider from nftables to ferm
Downtimed, ran puppet, rebooted the 14 VMs, and verified via cumin/cookbook that the ferm service is running. They are all on ferm now.
This change should have been linked here. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1215240 (thanks cdanis!)
It added the LVS profiles to the tcp-proxy puppet role.
The 14 VMs now have the new gerrit-lb IP, v4 and v6, bound on their loopback interfaces, 7 pairs:
[cumin2002:~] $ sudo cumin 'tcp-*' "ip addr show dev lo | grep global"
14 hosts will be targeted:
tcp-proxy[2001-2002].codfw.wmnet,tcp-proxy[6001-6002].drmrs.wmnet,tcp-proxy[1001-1002].eqiad.wmnet,tcp-proxy[5001-5002].eqsin.wmnet,tcp-proxy[3001-3002].esams.wmnet,tcp-proxy[7001-7002].magru.wmnet,tcp-proxy[4001-4002].ulsfo.wmnet
OK to proceed on 14 hosts? Enter the number of affected hosts to confirm or "q" to quit: 14
===== NODE GROUP =====
(2) tcp-proxy[5001-5002].eqsin.wmnet
----- OUTPUT of 'ip addr show dev lo | grep global' -----
inet 103.102.166.225/32 scope global lo:LVS
inet6 2001:df2:e500:ed1a::2/128 scope global
===== NODE GROUP =====
(2) tcp-proxy[7001-7002].magru.wmnet
----- OUTPUT of 'ip addr show dev lo | grep global' -----
inet 195.200.68.225/32 scope global lo:LVS
inet6 2a02:ec80:700:ed1a::2/128 scope global
===== NODE GROUP =====
(2) tcp-proxy[6001-6002].drmrs.wmnet
----- OUTPUT of 'ip addr show dev lo | grep global' -----
inet 185.15.58.225/32 scope global lo:LVS
inet6 2a02:ec80:600:ed1a::2/128 scope global
===== NODE GROUP =====
(2) tcp-proxy[3001-3002].esams.wmnet
----- OUTPUT of 'ip addr show dev lo | grep global' -----
inet 185.15.59.225/32 scope global lo:LVS
inet6 2a02:ec80:300:ed1a::2/128 scope global
===== NODE GROUP =====
(2) tcp-proxy[4001-4002].ulsfo.wmnet
----- OUTPUT of 'ip addr show dev lo | grep global' -----
inet 198.35.26.97/32 scope global lo:LVS
inet6 2620:0:863:ed1a::2/128 scope global
===== NODE GROUP =====
(2) tcp-proxy[1001-1002].eqiad.wmnet
----- OUTPUT of 'ip addr show dev lo | grep global' -----
inet 208.80.154.225/32 scope global lo:LVS
inet6 2620:0:861:ed1a::2/128 scope global
===== NODE GROUP =====
(2) tcp-proxy[2001-2002].codfw.wmnet
----- OUTPUT of 'ip addr show dev lo | grep global' -----
inet 208.80.153.225/32 scope global lo:LVS
inet6 2620:0:860:ed1a::2/128 scope global
This should conclude the box:
Prepare tcpproxy VMs for accepting traffic on the new public IPs
on the parent task "Move Gerrit behind the CDN".
And this ticket should be resolved.
From here on, anything further would just mean updating 2 tickets at a time. This is done, and if there are small follow-ups they might happen over in T365259.
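For reference, the loopback check from the cumin run above (one v4 and one v6 address per site pair) could also be verified mechanically rather than by eye. The sketch below is an assumption about how one might do that: the parser and its names are hypothetical, and the sample is simply two of the node groups copied from the output above.

```python
import ipaddress
import re

GROUP_RE = re.compile(r"^\(\d+\)\s+(\S+)")            # e.g. "(2) tcp-proxy[...]"
ADDR_RE = re.compile(r"^\s*inet6?\s+(\S+)\s+scope global")

SAMPLE = """\
===== NODE GROUP =====
(2) tcp-proxy[1001-1002].eqiad.wmnet
----- OUTPUT of 'ip addr show dev lo | grep global' -----
inet 208.80.154.225/32 scope global lo:LVS
inet6 2620:0:861:ed1a::2/128 scope global
===== NODE GROUP =====
(2) tcp-proxy[2001-2002].codfw.wmnet
----- OUTPUT of 'ip addr show dev lo | grep global' -----
inet 208.80.153.225/32 scope global lo:LVS
inet6 2620:0:860:ed1a::2/128 scope global
"""


def parse_vips(cumin_output: str) -> dict:
    """Map each cumin node group to the global addresses bound on lo."""
    vips = {}
    group = None
    for line in cumin_output.splitlines():
        match = GROUP_RE.match(line)
        if match:
            group = match.group(1)
            vips[group] = []
            continue
        match = ADDR_RE.match(line)
        if match and group is not None:
            vips[group].append(ipaddress.ip_interface(match.group(1)))
    return vips


def has_v4_and_v6(addrs) -> bool:
    """True if the list contains at least one IPv4 and one IPv6 address."""
    return {a.version for a in addrs} == {4, 6}
```

Feeding the full cumin output to `parse_vips` and checking `has_v4_and_v6` for every group would confirm the 7 pairs in one step.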
Change #1224057 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] Add Cumin alias for tcpproxy hosts
Change #1224057 merged by Muehlenhoff:
[operations/puppet@production] Add Cumin alias for tcpproxy hosts
Change #1228515 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: change healthcheck URL for service catalog
Change #1228515 merged by Arnaudb:
[operations/puppet@production] gerrit: change healthcheck URL for service catalog