implement anti-abuse features for GitLab (Move GitLab behind the CDN)
Closed, Resolved, Public

Description

In T365259 it was discussed to move Gerrit behind the CDN/loadbalancer for better anti-abuse handling. It was discussed that GitLab may be a better first candidate because the setup is quite similar (one web service and one ssh service) but GitLab is less production-critical and has more test instances available.

Most of the discussion already happened in T365259 and also applies to GitLab. This task is therefore mostly for discussing and tracking the actual technical implementation of anti-abuse handling for GitLab.

GitLab consists of multiple machines and services:

These services are not related or distributed in any way. The replicas are standby machines which can be used for emergency switchovers and testing. They run an actual GitLab instance with old (12h) data, but these instances are not used for the production GitLab.

Technical exploration

A first exploration was done with LVS. During this exploration, we considered placing GitLab behind the load balancing infrastructure. However, it was determined that this approach would involve significant refactoring and implementation work and overhead, especially since our primary interest is in throttling capabilities. Consequently, this idea was set aside.

We then explored alternative methods of throttling and traffic shaping. Utilizing local tools, such as a separate HAProxy or a firewall, seemed the most promising. Currently, we are testing and implementing throttling using firewall rules. The newer tool nftables offers built-in features that are particularly useful. Therefore, our initial step is to migrate to nftables and then verify potential configurations with it:

  • Upgrade GitLab hosts to nftables
  • Verify that throttling and dynamic IP sets for a denylist are possible in nftables
  • Puppetize a basic set of rules to throttle external HTTP traffic
  • Adjust thresholds
  • Enable rules on all instances
  • (Repeat for SSH traffic)?
  • Update docs and publish tech news article
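
The idea behind "throttling plus a dynamic denylist" can be sketched as a small Python model. This is only an illustration of the concept, not the nftables rules that were deployed: the class name, its parameters, and the sliding-window counting are assumptions made for the example.

```python
from collections import defaultdict, deque


class ConnectionThrottle:
    """Toy model: count new connections per source, denylist offenders for a while."""

    def __init__(self, limit, window_s, deny_s):
        self.limit = limit                # new connections allowed per window
        self.window_s = window_s          # length of the counting window (seconds)
        self.deny_s = deny_s              # how long an offender stays denylisted
        self.recent = defaultdict(deque)  # source IP -> timestamps of recent connections
        self.denylist = {}                # source IP -> time at which the entry expires

    def allow(self, src, now):
        """Return True if a new connection from src at time now is accepted."""
        expiry = self.denylist.get(src)
        if expiry is not None:
            if now < expiry:
                return False              # still on the denylist
            del self.denylist[src]        # entry timed out, source may try again

        stamps = self.recent[src]
        # Forget connections that have fallen out of the counting window.
        while stamps and now - stamps[0] >= self.window_s:
            stamps.popleft()
        stamps.append(now)

        if len(stamps) > self.limit:
            # Over the threshold: add the source to the denylist with a timeout.
            self.denylist[src] = now + self.deny_s
            stamps.clear()
            return False
        return True
```

With `ConnectionThrottle(limit=2, window_s=10, deny_s=300)`, a third connection from the same source within ten seconds is rejected and the source stays blocked until 300 seconds have passed, while other sources are unaffected.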

Details

Other Assignee
Dzahn
Related Changes in Gerrit:
Repo               Branch      Lines +/-
operations/puppet  production  +1 -1
operations/puppet  production  +1 -4
operations/puppet  production  +4 -1
operations/puppet  production  +1 -1
operations/puppet  production  +3 -3
operations/puppet  production  +1 -1
operations/puppet  production  +18 -2
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +15 -18
operations/puppet  production  +3 -1
operations/puppet  production  +1 -1
operations/puppet  production  +1 -0
operations/puppet  production  +81 -0
operations/puppet  production  +11 -0
operations/puppet  production  +7 -12
operations/dns     master      +8 -4
operations/puppet  production  +9 -9
operations/puppet  production  +1 -0

Event Timeline

Change #1040094 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] conftool-data: add gitlab and replicas

https://gerrit.wikimedia.org/r/1040094

Change #1040261 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] add LVS service IPs for gitlab and gitlab-ssh

https://gerrit.wikimedia.org/r/1040261

LSobanski moved this task from Incoming to Work in Progress on the collaboration-services board.

Change #1053306 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: switch gitlab-replica-b from iptables to nftables

https://gerrit.wikimedia.org/r/1053306

Change #1053306 merged by Jelto:

[operations/puppet@production] gitlab: switch gitlab-replica-b from iptables to nftables

https://gerrit.wikimedia.org/r/1053306

Change #1053877 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: replace ferm::service with firewall::service

https://gerrit.wikimedia.org/r/1053877

Change #1053879 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: switch gitlab-replica-b from iptables to nftables

https://gerrit.wikimedia.org/r/1053879

Change #1053877 merged by Jelto:

[operations/puppet@production] gitlab: replace ferm::service with firewall::service

https://gerrit.wikimedia.org/r/1053877

Change #1040261 abandoned by Dzahn:

[operations/dns@master] add LVS service IPs for gitlab and gitlab-ssh

Reason:

not likely anymore that we are going to do this

https://gerrit.wikimedia.org/r/1040261

Change #1053879 merged by Jelto:

[operations/puppet@production] gitlab: switch gitlab from iptables to nftables

https://gerrit.wikimedia.org/r/1053879

Jelto renamed this task from "Move GitLab behind the CDN" to "implement anti-abuse features for GitLAb (Move GitLab behind the CDN)". Jul 16 2024, 8:51 AM

I migrated the GitLab hosts to nftables, which unblocks us to use the built-in nftables features.

Jelto renamed this task from "implement anti-abuse features for GitLAb (Move GitLab behind the CDN)" to "implement anti-abuse features for GitLab (Move GitLab behind the CDN)". Jul 16 2024, 8:53 AM

Change #1040094 abandoned by Jelto:

[operations/puppet@production] conftool-data: add gitlab and replicas

Reason:

a different implementation will be tested

https://gerrit.wikimedia.org/r/1040094

Change #1055886 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: add option to throttle and drop traffic using nftables

https://gerrit.wikimedia.org/r/1055886

Change #1055886 merged by Dzahn:

[operations/puppet@production] firewall/gitlab: add option to throttle and drop traffic using nftables

https://gerrit.wikimedia.org/r/1055886

gitlab1003 and gitlab1004 are unchanged.

gitlab2002 now has the new file /etc/nftables/099_throttling-chain_puppet.nft created by puppet through the change above.

For now it is logging but still accepting and not dropping packets.

Next we can change the policy to DROP through a simple Hiera change.
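
The difference between the current logging-only mode and the planned DROP policy can be illustrated with a tiny Python function. The function and its arguments are hypothetical and only model the behaviour described above; the real switch is a Hiera setting whose key is not shown in this task.

```python
def verdict_for(src, over_limit, policy, log):
    """Return the verdict for a connection from src under the given policy.

    policy is "accept" (log offenders but let them through) or "drop"
    (log offenders and discard their packets). Both names are assumptions.
    """
    if policy not in ("accept", "drop"):
        raise ValueError(f"unknown policy: {policy!r}")
    if not over_limit:
        return "accept"  # traffic below the threshold is never touched
    # Offending traffic is logged in both modes; only the verdict differs.
    log.append(f"throttle hit from {src} (policy={policy})")
    return policy
```

Running with `policy="accept"` first makes it possible to review who would have been blocked before any packets are actually dropped.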

Change #1056581 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: set nft throttling policy to drop on replica

https://gerrit.wikimedia.org/r/1056581

Change #1056581 merged by Jelto:

[operations/puppet@production] gitlab: set nft throttling policy to drop on replica

https://gerrit.wikimedia.org/r/1056581

Thanks for merging and deploying this change! I enabled the more restrictive drop rule on gitlab-replica-b gitlab1003. I was able to trigger the throttling rule with the usual curl loop and the current limit of 1 RPS. Manual browsing of the website works as expected (after waiting 5 minutes until the droplist entry is removed). I'd suggest leaving this running over the weekend; we can then discuss the next steps on Tuesday (enabling it GitLab-wide, the RPS limit, and communication to RelEng and the community).
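
The curl loop itself is not shown in this task. A rough Python equivalent, assuming all that is needed is to send requests in quick succession until one fails, might look like the following. The function names are made up for this sketch, and the fetch function is injectable so the loop can be exercised without touching a real host.

```python
import urllib.error
import urllib.request


def http_fetch(url, timeout=3.0):
    """Return True if the server answered before the timeout, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, just with an error status
    except OSError:
        return False  # refused, timed out, or otherwise unreachable


def requests_until_blocked(url, attempts, fetch=http_fetch):
    """Send up to `attempts` requests; return how many succeeded before the
    first failure, or None if every request succeeded."""
    for i in range(attempts):
        if not fetch(url):
            return i
    return None
```

Against a host with throttling in drop mode, the first failure would be expected to show up as a timeout once the source address lands on the droplist.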

Change #1057190 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: fix port definition in firewall:service

https://gerrit.wikimedia.org/r/1057190

Change #1057190 merged by Dzahn:

[operations/puppet@production] gitlab: fix port definition in firewall:service

https://gerrit.wikimedia.org/r/1057190

Change #1058608 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable throttling for all GitLab instances

https://gerrit.wikimedia.org/r/1058608

Change #1060131 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable nft throttling on role level, but just log

https://gerrit.wikimedia.org/r/1060131

Jelto added a parent task: Restricted Task. Aug 8 2024, 2:40 PM

Change #1060131 merged by Jelto:

[operations/puppet@production] gitlab: enable nft throttling on role level, but just log

https://gerrit.wikimedia.org/r/1060131

I enabled the throttling rule for the production GitLab instance in logging-mode (so offending traffic will just be logged and not throttled). @Dzahn let's review the logs and the blocklist tomorrow in our office hours. After the office hours we can disable logging again to prevent issues like T371951.

ACK, sounds like a plan. Same for Gerrit, as you will have seen I upped the threshold 2 times because I was still seeing IPs in the DENYLIST that appeared like legit users. Now though with the higher value I only see googlebot/googleproxy IPs left, at least as of right now. Let's also review this one last time tomorrow and then consider switching it to DROP.

Change #1063004 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] profile::firewall::nftables_throttling: add option for burst packets

https://gerrit.wikimedia.org/r/1063004

Change #1063004 merged by Jelto:

[operations/puppet@production] profile::firewall::nftables_throttling: add option for burst packets

https://gerrit.wikimedia.org/r/1063004

Change #1064328 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: use burst parameter with nftables_throttling

https://gerrit.wikimedia.org/r/1064328

Change #1064328 merged by Jelto:

[operations/puppet@production] gitlab: use burst parameter with nftables_throttling

https://gerrit.wikimedia.org/r/1064328

After reviewing the DENYLIST and the nftables logs, we noticed that some GitLab runners outside our infrastructure were getting throttled occasionally (mostly the Digital Ocean Kubernetes Runners). To address this, we added a burst setting to the nftables throttling, similar to what we did with Gerrit. The current setting is up to 300 packets/minute with a burst of 1000.

Since applying the new burst setting, there are no more throttling logs, and the DENYLIST is empty. I'll check the logs again before our next office hours, and we can plan to enable throttling next Tuesday.
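
How a rate of 300 per minute combined with a burst of 1000 behaves can be approximated with a token bucket. The sketch below is a simplified model under that assumption; the exact accounting nftables uses internally may differ, and the class name is invented for the example.

```python
class TokenBucket:
    """Simplified token bucket: `burst` tokens of capacity, refilled at a steady rate."""

    def __init__(self, rate_per_minute, burst):
        self.rate_per_s = rate_per_minute / 60.0  # refill speed in tokens per second
        self.capacity = float(burst)              # maximum number of stored tokens
        self.tokens = float(burst)                # start with a full bucket
        self.last = 0.0                           # time of the previous packet

    def allow(self, now):
        """Consume one token for a packet arriving at time now (seconds)."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that passed, but never above the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In this model, a runner that suddenly sends 1000 packets is not throttled, which matches the goal of tolerating short spikes from CI runners, while a sustained rate above 300 per minute eventually empties the bucket.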

Change #1066782 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] profile::firewall::nftables_throttling: fix issue of global metering

https://gerrit.wikimedia.org/r/1066782

Change #1066782 merged by Jelto:

[operations/puppet@production] profile::firewall::nftables_throttling: fix issue of global metering

https://gerrit.wikimedia.org/r/1066782

Change #1067337 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: add profile::prometheus::nft_throttling_denylist

https://gerrit.wikimedia.org/r/1067337

Change #1067337 merged by Jelto:

[operations/puppet@production] gitlab: add profile::prometheus::nft_throttling_denylist

https://gerrit.wikimedia.org/r/1067337

Change #1070025 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: adjust throttling threshold for GitLab

https://gerrit.wikimedia.org/r/1070025

Change #1070025 merged by Jelto:

[operations/puppet@production] gitlab: adjust throttling threshold for GitLab

https://gerrit.wikimedia.org/r/1070025

Change #1058608 merged by Jelto:

[operations/puppet@production] gitlab: enable throttling for all GitLab instances

https://gerrit.wikimedia.org/r/1058608

Throttling on all GitLab instances was enabled today. We are dropping packets if there are too many new TCP connections from the same source IP. After 5 minutes they are allowed again. Our own production and cloud networks are excluded from it.

The current threshold is quite high and we have monitored this for a while, so we don't expect it to affect legit users. IPs which are doing more than 300 new TCP connections per minute are blocked for 5 minutes.
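
Excluding trusted networks from throttling amounts to a membership check before any counting happens. A minimal Python sketch follows; the networks listed are documentation ranges used as placeholders, not the actual production or cloud ranges.

```python
import ipaddress

# Placeholder ranges (RFC 5737 / RFC 3849 documentation prefixes), NOT the real networks.
EXEMPT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_exempt(addr, networks=EXEMPT_NETWORKS):
    """Return True if addr belongs to a network that is never throttled."""
    ip = ipaddress.ip_address(addr)
    # An IPv4 address is never "in" an IPv6 network and vice versa,
    # so mixed lists are safe to check in one pass.
    return any(ip in net for net in networks)
```

Only sources for which `is_exempt` returns False would then be counted against the threshold of 300 new TCP connections per minute.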

In case legitimate users are impacted the following change can be reverted: https://gerrit.wikimedia.org/r/1058608

Change #1071900 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable nftables throttling (drop)

https://gerrit.wikimedia.org/r/1071900

Change #1071900 merged by Jelto:

[operations/puppet@production] gitlab: enable nftables throttling (drop)

https://gerrit.wikimedia.org/r/1071900

Mentioned in SAL (#wikimedia-operations) [2024-09-10T15:39:00Z] <jelto> enabling throttling on GitLab hosts - T366882

In wikimedia-gitlab, there have been some reports of failing jobs (cc @dcaro), such as:

https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/jobs/363466

fatal: unable to access 'https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api.git/': Failed to connect to gitlab.wikimedia.org port 443 after 4 ms: Could not connect to server

https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/jobs/363488

ERROR: Uploading artifacts as "dotenv" to coordinator... error  error=couldn't execute POST against https://gitlab.wikimedia.org/api/v4/jobs/363488/artifacts?artifact_format=gzip&artifact_type=dotenv: Post "https://gitlab.wikimedia.org/api/v4/jobs/363488/artifacts?artifact_format=gzip&artifact_type=dotenv": dial tcp 208.80.153.8:443: connect: connection refused id=363488 token=glcbt-64
FATAL: invalid argument

https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/jobs/363560

fatal: unable to access 'https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api.git/': Failed to connect to gitlab.wikimedia.org port 443 after 4 ms: Could not connect to server

Interestingly, the metrics and logs don't indicate any throttling at that time:
https://grafana-rw.wikimedia.org/d/R_1IvBZnz/gitlab-omnibus-overview?forceLogin&from=1726531512350&orgId=1&to=1726617571233&viewPanel=51

There was only one instance of throttling, which occurred at 03:23 UTC, but it was for an IP unrelated to Digital Ocean.

The errors seem to occur immediately (failed to connect, connection refused), whereas rate limiting should usually drop packets and cause a timeout instead.

I'll dig a bit further, but we can try disabling throttling again and see if the errors stop happening. Maybe this is also a networking issue in the Digital Ocean Kubernetes cluster.
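
The distinction drawn above (an immediate "connection refused" versus a timeout caused by silently dropped packets) can be checked from a client with a short probe. This is a hedged sketch: the function names are invented for the example, and it only classifies what the client observes.

```python
import socket


def classify_connect_error(exc):
    """Map a connection error to 'refused', 'timeout', or 'other'."""
    if isinstance(exc, ConnectionRefusedError):
        return "refused"  # the peer (or something on the path) actively rejected the connection
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"  # no answer at all, as expected when packets are dropped
    return "other"


def probe(host, port, timeout=3.0):
    """Try one TCP connection and report the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:
        return classify_connect_error(exc)
```

If a failing runner reports "refused" within milliseconds, as in the job logs above, a drop-based throttle is an unlikely cause.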

Change #1073740 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: set throttling policy to accept again

https://gerrit.wikimedia.org/r/1073740

I reviewed the throttling in the past 7 days, and noticed that eight unique IPv4 addresses and one IPv6 address were blocked.

The IPs come from smaller hosting providers, VPNs, bots, and two are from Internet Archive. I think this is acceptable for now, even though I’d prefer to get archive.org unblocked. That said, we probably shouldn't invest more time on this until we identify thresholds in a more systematic way: see T374909. Having a few weeks or even months of gaps for archive.org should be fine, in my opinion.
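
A review like this boils down to counting unique blocked addresses per address family. A minimal Python sketch, assuming the blocked addresses have already been extracted from the logs as plain strings (the extraction step and the function name are not from this task):

```python
import ipaddress


def count_unique_by_family(addresses):
    """Count unique IPv4 and IPv6 addresses in an iterable of address strings."""
    # Parsing normalises the notation, so "2001:DB8::1" and "2001:db8::1" count once.
    unique = {ipaddress.ip_address(a.strip()) for a in addresses if a.strip()}
    v4 = sum(1 for ip in unique if ip.version == 4)
    return {"ipv4": v4, "ipv6": len(unique) - v4}
```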

I'll write a short piece for Tech News and make sure the Wikitech docs are up to date. Once that’s done, we should be good to resolve this task.

Change #1075504 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: test defs_from_etcd on the replica

https://gerrit.wikimedia.org/r/1075504

Change #1075504 merged by Jelto:

[operations/puppet@production] gitlab: test defs_from_etcd on the replica

https://gerrit.wikimedia.org/r/1075504

Change #1075533 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Revert "gitlab: test defs_from_etcd on the replica"

https://gerrit.wikimedia.org/r/1075533

Change #1075533 merged by Jelto:

[operations/puppet@production] Revert "gitlab: test defs_from_etcd on the replica"

https://gerrit.wikimedia.org/r/1075533

I added a summary of the rate limiting and abuse tooling (including nftables throttling) in Wikitech: https://wikitech.wikimedia.org/wiki/GitLab/Abuse_and_rate_limiting.

Unfortunately, the migration to nftables broke the blocked_nets requestctl feature for GitLab. T348734 can be used to track this issue; I documented it on the Wikitech page above as well.

The last thing is to write a small tech news article.

Throttling has been active for around one month on all GitLab instances and works as expected. The documentation is updated and announcements have been sent. If additional tweaking of thresholds is needed, this task can be reopened or the work can be tracked in other tasks. I'm resolving the task.

Change #1073740 abandoned by Jelto:

[operations/puppet@production] gitlab: set throttling policy to accept again

Reason:

issue did not happen again, change is not needed anymore

https://gerrit.wikimedia.org/r/1073740