Opt-in testing of Gerrit-via-CDN
Closed, Resolved (Public)

Description

Setup instructions for the opt-in trial of Gerrit via the CDN:

Install the latest version of tunnelencabulator:

curl -s 'https://gerrit.wikimedia.org/g/operations/debs/wmf-laptop/+/master/scripts/tunnelencabulator?format=TEXT' \
  | base64 -d \
  | install /dev/stdin ~/bin/tunnelencabulator

Run it.

💙cdanis@wmftop ~ 🕑☕ tunnelencabulator 
[sudo] password for cdanis: 
Traffic redirected via codfw.  Press Ctrl-C when you are done.

It has edited your /etc/hosts to point all of your WMF-bound traffic at your next-nearest CDN site.
This includes repointing gerrit.wikimedia.org 🚀

Until you exit it, both your https and ssh-29418 traffic towards Gerrit go via the edge CDN!

If you'd rather not leave it running, you can instead run tunnelencabulator -f to have it do its work and then exit. Later, you should run tunnelencabulator --undo (running with -f will also remind you to do this).

If Gerrit/CDN worked well for you, please leave a token 💜

And if it didn't, please leave a comment!

CDanis triaged this task as High priority.

FAQ

Q: How long should I leave this running?
A: Up to one whole workday at a time. Don't set it and totally forget it -- it's directing you towards one of our edge sites, and that will need to change if we need to depool that edge site.
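To guard against the "set it and forget it" case, a quick check for a lingering override can help. The following is only a sketch: `hosts_overrides` is a hypothetical helper, not part of tunnelencabulator, and it simply lists any active (non-comment) addresses a hosts file assigns to a hostname, however they got there.

```shell
#!/bin/sh
# Hypothetical helper (not part of tunnelencabulator): print the addresses
# that a hosts file assigns to a given hostname, ignoring comments.
#   $1 = hostname to look for
#   $2 = hosts file to read (defaults to /etc/hosts)
hosts_overrides() {
    awk -v host="$1" '
        { sub(/#.*/, "") }   # strip comments; fully commented lines become empty
        {
            # field 1 is the address, the remaining fields are hostnames
            for (i = 2; i <= NF; i++) {
                if ($i == host) { print $1; break }
            }
        }
    ' "${2:-/etc/hosts}"
}

# Usage:
#   hosts_overrides gerrit.wikimedia.org
```

If this prints anything at the end of your workday, an override is still in place and tunnelencabulator --undo is probably due.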

Thanks so much for this work!

Question:

When I do ssh gerrit.wikimedia.org -p 29418 gerrit show-connections --wide I'm seeing 4 "connections" per PoP without any associated user (confirmed with ss on the host). They seem to be short-lived connections. I guess this is some kind of tcp probe? Is this some kind of health-check behavior?

Yeah, those are end-to-end healthchecks from the edge. We're also churning a bunch of sockets on the gerrit host, and while that's not a huge deal, we can figure out how to turn down the interval there.

When it's an actual user session, it shows up with a source of those same edge hosts:

9b5796f4   cdanis          tcp-proxy2001.codfw.wmnet
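To check whether your own session arrives via an edge proxy, the connection list can be filtered by remote host. This is a rough sketch: `edge_sessions` is a hypothetical helper, and the `tcp-proxyNNNN.<site>.wmnet` naming pattern is an assumption inferred from the single sample line above.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines whose remote host looks like an
# edge proxy. The hostname pattern is an assumption based on the sample
# "tcp-proxy2001.codfw.wmnet" and may not cover every edge host.
edge_sessions() {
    # "|| true" keeps the exit status at 0 when nothing matches
    grep -E 'tcp-proxy[0-9]+\.[a-z0-9]+\.wmnet' || true
}

# Usage:
#   ssh -p 29418 gerrit.wikimedia.org gerrit show-connections --wide | edge_sessions
```

If your username does not appear in the filtered output, your SSH session is most likely not going through the edge.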

I think this may just be a quirk of tunnelencabulator, but when I run it and use ssh gerrit -- gerrit show-connections --wide I see myself entering via IPv6 and not an edge proxy. For me ssh -4 is necessary to route traffic over the tunnel.

Thanks for the explanation @CDanis !

Digging through JVM metrics: gerrit seems to be handling the tcp connections ok. I dug a bit here since we have a lower number of sshd.threads configured than the number of connections gerrit currently thinks there are, but I guess since they're not active threads...it's fine? ¯\_(ツ)_/¯

I think this may just be a quirk of tunnelencabulator, but when I run it and use ssh gerrit -- gerrit show-connections --wide I see myself entering via IPv6 and not an edge proxy. For me ssh -4 is necessary to route traffic over the tunnel.

This is a tunnelencabulator bug, thank you. I'll fix tomorrow. Sorry, my ISP doesn't offer IPv6 and I never think to test with it.

Change #1227846 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/debs/wmf-laptop@master] tunnelencabulator: simple IPv6 support

https://gerrit.wikimedia.org/r/1227846

Change #1227846 merged by CDanis:

[operations/debs/wmf-laptop@master] tunnelencabulator: simple IPv6 support

https://gerrit.wikimedia.org/r/1227846

I also opted-in for using Gerrit behind the CDN (after updating to the correct tunnelencabulator version).

Connecting over https works and I get the caching headers from the CDN:

curl -I -s https://gerrit.wikimedia.org/r | grep cache
x-cache: cp3070 miss, cp3070 pass
x-cache-status: pass
server-timing: cache;desc="pass", host;desc="cp3070"
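For scripting the same check, a single header can be pulled out of the response. A minimal sketch follows; `header_value` is a hypothetical helper name, and treating the presence of x-cache headers as the sign of a CDN-served response is an assumption based on the output above.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of the header named in $1 (matched case-insensitively).
header_value() {
    tr -d '\r' | awk -v name="$1" '
        BEGIN { name = tolower(name) }
        {
            i = index($0, ":")          # split at the first colon only
            if (i > 0 && tolower(substr($0, 1, i - 1)) == name) {
                v = substr($0, i + 1)
                sub(/^[ \t]+/, "", v)   # trim leading whitespace
                print v
            }
        }'
}

# Usage:
#   curl -sI https://gerrit.wikimedia.org/r | header_value x-cache-status
```

An empty result for x-cache would suggest the request did not pass through the CDN.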

Cloning the puppet repo over ssh also works and I was able to generate some network traffic on one of the tcp-proxy hosts.

I'll leave this enabled on my machine for the next week.

@CDanis do you think it makes sense to offer an /etc/hosts solution as well in the task description? So something like "use tunnelencabulator or change your /etc/hosts to the nearest address of gerrit-lb.<dc>.wikimedia.org" with a few examples. That might make it easier for volunteers to try it without installing tunnelencabulator. I'm happy to update the description.

For the record, I'm using:

cat /etc/hosts | grep gerrit
185.15.59.225          gerrit.wikimedia.org            # esams
2a02:ec80:300:ed1a::2  gerrit.wikimedia.org            # esams

I was hesitant to recommend this, because it seemed like the exact kind of thing that someone might add to their /etc/hosts manually, forget about for weeks/months ... and then we have to depool an edge for maintenance and it breaks and they can't figure out why.

That being said I'm not strongly opposed :) please go ahead if you'd like.

No, I think that makes sense; let's not advertise changing /etc/hosts then. Testing with tunnelencabulator is probably the cleaner approach here.

Just FYI, when running the commands in the task description on a Mac, I (personally) get:

$ curl -s 'https://gerrit.wikimedia.org/g/operations/debs/wmf-laptop/+/master/scripts/tunnelencabulator?format=TEXT' \
  | base64 -d \
  | install /dev/stdin ~/bin/tunnelencabulator
install: /dev/stdin: Inappropriate file type or format

The BSD-flavored install on macOS works differently from the GNU install that folks working on Debian platforms would typically have. In this particular case, replacing install could look something like curl ... | base64 -d > ~/bin/tunnelencabulator; chmod a+x ~/bin/tunnelencabulator
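The suggestion above can be wrapped into a small function that avoids `install /dev/stdin` altogether. This is a sketch under assumptions: `install_b64` is a hypothetical name, and it assumes your `base64` accepts `-d` (as it did in the report above). It writes to a temporary file first so that a failed download does not leave a truncated script behind.

```shell
#!/bin/sh
# Hypothetical helper: decode base64 from stdin into an executable file.
#   $1 = destination path, e.g. ~/bin/tunnelencabulator
# Sidesteps "install /dev/stdin", which BSD install on macOS rejects.
install_b64() {
    dest=$1
    mkdir -p "$(dirname "$dest")" || return 1
    tmp=$(mktemp "${dest}.XXXXXX") || return 1
    # Only move into place if decoding succeeded and produced some output
    if base64 -d > "$tmp" && [ -s "$tmp" ]; then
        chmod 755 "$tmp" && mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage:
#   curl -s 'https://gerrit.wikimedia.org/g/operations/debs/wmf-laptop/+/master/scripts/tunnelencabulator?format=TEXT' \
#     | install_b64 ~/bin/tunnelencabulator
```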

I edited my /etc/hosts directly to opt in, and it works fine for me so far.

checked the box "One full business day of testing with several volunteers" on T411895

How about we send a mail to ops-at-large about this now as the next step?

I just got a pop-up with this error message when expanding files in a changeset:

Error 502: <!DOCTYPE html> [standard error page HTML cut]
<code>Request from [redacted] via cp3073.esams.wmnet, ATS/9.2.11<br>Error: 502, Malformed Server Response Status at 2026-02-06 11:17:05 GMT</code>
[cut]
Endpoint: /changes/*~*/revisions/*/files/*/reviewed

Thank you for reporting this. I cannot reproduce this with some random changes. Do you have the link to the change with the error?

And what is the actual path to reproduce it?

1: You open a change, link https://gerrit.wikimedia.org/r/c/operations/dns/+/1215709
2: You click on one of the files like https://gerrit.wikimedia.org/r/c/operations/dns/+/1215709/2/templates/wikimedia.org
3: You click on "+510 common lines"?

I found the request:
https://logstash.wikimedia.org/goto/78f43ba4e27abb0fa03ad86fcae79dfc

It's one of three 5xx we've served for Gerrit via the CDN in the past month. The other two were healthchecks from Liberica.
(CDN 5xx are reported to Logstash with 100% sampling rate)

@taavi's request was specifically a PUT to mark a file as reviewed. So it might be tricky to reproduce?

Great, thanks for digging. Marking a file as reviewed works on my side on randomly selected changes. If you click on "MARK REVIEWED" when you hover over a file, this sends a PUT. Clicking again sends a DELETE. No 5xx on my side so far.
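For anyone who wants to exercise the same endpoint outside the web UI, the PUT and DELETE can be sent with curl against the Gerrit REST API. The sketch below makes several assumptions: `reviewed_url` is a hypothetical helper, it percent-encodes only "/" (enough for simple project and file names), and `GERRIT_USER` and `GERRIT_HTTP_PASSWORD` are placeholder variables for your own HTTP credentials.

```shell
#!/bin/sh
# Hypothetical helper: build the REST URL for a file's "reviewed" flag.
#   $1 = project, $2 = change number, $3 = patch set, $4 = file path
# Only "/" is percent-encoded, which is an assumption that holds for
# simple names but not for arbitrary ones.
reviewed_url() {
    project=$(printf '%s' "$1" | sed 's|/|%2F|g')
    file=$(printf '%s' "$4" | sed 's|/|%2F|g')
    # The /a/ prefix selects authenticated REST access
    printf 'https://gerrit.wikimedia.org/r/a/changes/%s~%s/revisions/%s/files/%s/reviewed\n' \
        "$project" "$2" "$3" "$file"
}

# Usage (prints only the HTTP status code):
#   url=$(reviewed_url operations/dns 1215709 2 templates/wikimedia.org)
#   curl -s -o /dev/null -w '%{http_code}\n' -u "$GERRIT_USER:$GERRIT_HTTP_PASSWORD" -X PUT "$url"
#   curl -s -o /dev/null -w '%{http_code}\n' -u "$GERRIT_USER:$GERRIT_HTTP_PASSWORD" -X DELETE "$url"
```

Repeating the PUT and DELETE pair in a loop and watching for a 502 status would be one way to look for the intermittent error, though nothing in this thread confirms it is reproducible that way.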

calling it enough testing based on "one of three 5xx we've served for Gerrit via the CDN in the past month. The other two were healthchecks from Liberica"

FWIW (on the subject of 5xxes) I got this a few minutes ago when trying to save a draft comment:

Error 502: <!DOCTYPE html> [standard error page HTML cut]
<code>Request from [redacted] via [redacted]<br>Error: 502, Malformed Server Response Status at 2026-02-12 21:22:55 GMT</code>
[cut]

Endpoint: /changes/*~*/revisions/*/drafts