Troubleshoot GitLab nftables throttling after switchover
Closed, Resolved, Public

Description

After the switchover in T400252: Gitlab switchover (gitlab2002 → gitlab1004), legitimate users reported being blocked by the nftables throttling; see T400252#11046794 for an example. The deny list on the new host also contains more IPs than the one on the old host.

This could be related to the new IP and DNS entry invalidating browser caches, so clients have to request more resources that would normally be cached. More troubleshooting and a more reasonable threshold are needed.

It could make sense to switch the throttling from blocking to monitoring/logging so that the size of the deny list can be reviewed. Throttling is currently disabled to unblock users (see this change).
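
To illustrate the difference between a blocking and a monitoring policy, here is a minimal Python sketch of a per-IP token bucket. It is a simplified model and not the actual nftables or Puppet configuration: the `Throttle` class, its `rate` and `burst` parameters, and the policy names `monitor` and `drop` are assumptions made for this example, and deny-list entries never expire here.

```python
from dataclasses import dataclass, field


@dataclass
class Throttle:
    """Hypothetical per-IP token bucket with a switchable policy."""
    rate: float               # tokens refilled per second (assumed threshold)
    burst: float              # bucket capacity (assumed threshold)
    policy: str = "monitor"   # "monitor" only records offenders, "drop" also blocks them
    buckets: dict = field(default_factory=dict)   # ip -> (tokens, last_seen)
    denylist: set = field(default_factory=set)    # IPs that exceeded the threshold

    def allow(self, ip: str, now: float) -> bool:
        """Account for one request from `ip` at time `now` (seconds)."""
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill according to elapsed time, capped at the bucket capacity.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            tokens -= 1.0
        else:
            # Over the threshold: record the IP in both modes.
            self.denylist.add(ip)
        self.buckets[ip] = (tokens, now)
        # Only the "drop" policy turns a deny-list entry into a rejection.
        return not (self.policy == "drop" and ip in self.denylist)


if __name__ == "__main__":
    for policy in ("monitor", "drop"):
        throttle = Throttle(rate=1.0, burst=2.0, policy=policy)
        results = [throttle.allow("192.0.2.1", 0.0) for _ in range(3)]
        print(policy, results, sorted(throttle.denylist))
```

In monitoring mode the deny list still fills up, so its size can be reviewed without any client being rejected.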

Event Timeline

Change #1175043 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable nftables throttling again in monitoring mode

https://gerrit.wikimedia.org/r/1175043

Change #1175043 merged by Jelto:

[operations/puppet@production] gitlab: enable nftables throttling again in monitoring mode

https://gerrit.wikimedia.org/r/1175043

Jelto triaged this task as Medium priority. Aug 4 2025, 1:10 PM

I spot-checked the IPs which would currently be blocked on gitlab1004 and all of them are legitimate users or runners as far as I can tell.
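
A spot check like this could be partly automated. The following is a rough Python sketch that splits deny-listed IPs into those inside known networks and the rest. The function name `split_known` and the example networks (documentation ranges) are hypothetical; the real runner and user ranges are not listed in this task.

```python
import ipaddress


def split_known(ips, known_networks):
    """Split `ips` into (known, unknown) based on membership in `known_networks`."""
    nets = [ipaddress.ip_network(n) for n in known_networks]
    known, unknown = [], []
    for raw in ips:
        addr = ipaddress.ip_address(raw)
        # Membership is False when address and network differ in IP version.
        if any(addr in net for net in nets):
            known.append(raw)
        else:
            unknown.append(raw)
    return known, unknown


if __name__ == "__main__":
    # Placeholder data: documentation ranges, not real runner networks.
    denylisted = ["192.0.2.10", "198.51.100.7", "2001:db8::1"]
    runner_networks = ["192.0.2.0/24", "2001:db8::/32"]
    known, unknown = split_known(denylisted, runner_networks)
    print("known:", known)
    print("needs manual review:", unknown)
```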

Also, with the upgrade from bullseye to bookworm, the version of nftables changed from 0.9 to 1.0, which might have changed how the metering works. So we have to find new thresholds for 1.0 on bookworm.

This is probably also relevant for Gerrit in T392464: Upgrade Gerrit hosts from Bullseye to Bookworm.

Change #1176246 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: adjust nftables throttling thresholds

https://gerrit.wikimedia.org/r/1176246

Change #1176246 merged by Jelto:

[operations/puppet@production] gitlab: adjust nftables throttling thresholds

https://gerrit.wikimedia.org/r/1176246

Unfortunately one of the Digital Ocean Runner IPs is throttled quite often. I'll test higher thresholds.
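
One way to pick a higher threshold is to measure how bursty legitimate clients actually are. The sketch below computes, per IP, the largest number of requests seen within any sliding window. It assumes request events are available as time-sorted `(timestamp, ip)` pairs, for example parsed from access logs; the function name and the input format are assumptions for illustration, not something provided by nftables.

```python
from collections import defaultdict, deque


def peak_per_window(events, window):
    """Return ip -> max requests within any span shorter than `window` seconds.

    `events` must be an iterable of (timestamp_seconds, ip) sorted by timestamp.
    """
    recent = defaultdict(deque)   # ip -> timestamps still inside the window
    peak = defaultdict(int)
    for ts, ip in events:
        q = recent[ip]
        q.append(ts)
        # Drop timestamps that are `window` seconds old or older.
        while ts - q[0] >= window:
            q.popleft()
        peak[ip] = max(peak[ip], len(q))
    return dict(peak)


if __name__ == "__main__":
    # Placeholder events, not real traffic.
    sample = [(0.0, "a"), (0.5, "a"), (0.9, "a"), (1.0, "b"), (2.0, "a")]
    print(peak_per_window(sample, window=1.0))
```

A threshold below the peak of a known-good client such as a runner would have throttled that client at least once in the sampled period.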

Change #1177363 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: raise throttling thresholds

https://gerrit.wikimedia.org/r/1177363

Change #1177363 merged by Jelto:

[operations/puppet@production] gitlab: raise throttling thresholds

https://gerrit.wikimedia.org/r/1177363

Change #1177405 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: disable nftables throttling temporarily

https://gerrit.wikimedia.org/r/1177405

Change #1177405 merged by Jelto:

[operations/puppet@production] gitlab: disable nftables throttling temporarily

https://gerrit.wikimedia.org/r/1177405

I know of a few examples of gitlab packaging jobs failing but here's one that I tried to run and know more about: https://gitlab.wikimedia.org/repos/product-analytics/experimentation-lab/experiment-analytics-jobs/-/jobs/583088

The error was

curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)

Thanks for reporting this! Similar issues have been reported in the wikimedia-gitlab IRC channel by @dcaro. The throttling was disabled a few minutes after the job failed. Could you retry and let me know if the error appears again?

@Jelto : sorry, should've mentioned, yes we retried and it worked with the throttling disabled.

Change #1177956 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] nftables: throttle debugging

https://gerrit.wikimedia.org/r/1177956

Change #1177956 merged by Arnaudb:

[operations/puppet@production] nftables: throttle debugging

https://gerrit.wikimedia.org/r/1177956

Change #1178880 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] nftables: throttle debugging

https://gerrit.wikimedia.org/r/1178880

I'll hand this task over to @ABran-WMF while I'm out. Arnaud has already started troubleshooting this and looking for a new set of rules and thresholds for the new bookworm hosts.

Change #1178880 merged by Arnaudb:

[operations/puppet@production] nftables: throttle debugging

https://gerrit.wikimedia.org/r/1178880

The last iteration does not seem to have a negative impact on Gerrit and has reduced the gitlab1004 DENYLIST to a normal level. Graph of the moment the new configuration was applied: https://grafana.wikimedia.org/goto/hVSdMVuNg?orgId=1
I'll monitor how effective GitLab's thresholds are, and then switch the policy back to drop.

Change #1180569 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gitlab: nftables monitoring new thresholds

https://gerrit.wikimedia.org/r/1180569

Change #1180569 merged by Arnaudb:

[operations/puppet@production] gitlab: nftables monitoring new thresholds

https://gerrit.wikimedia.org/r/1180569

Change #1180580 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gitlab: nftables monitoring new thresholds

https://gerrit.wikimedia.org/r/1180580

Change #1180580 merged by Arnaudb:

[operations/puppet@production] gitlab: nftables monitoring new thresholds

https://gerrit.wikimedia.org/r/1180580

Change #1180703 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gitlab: throttling policy toggle

https://gerrit.wikimedia.org/r/1180703

Change #1180703 merged by Arnaudb:

[operations/puppet@production] gitlab: throttling policy toggle

https://gerrit.wikimedia.org/r/1180703

This troubleshooting can be considered done. I'll keep monitoring tracking/denylist for the next few days and adjust thresholds as needed.

ABran-WMF closed subtask Restricted Task as Resolved. Sep 18 2025, 12:21 PM