
Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis
Open, MediumPublic

Description

There have been discussions in the past (like T154698: Prevent contributions attributed to private and WMF IP addresses) about keeping private IP addresses from Cloud-VPS instances from leaking into/being recorded by Wikimedia wikis and other service endpoints that are operated by the Wikimedia Foundation. This has come up again very recently when some continuous integration jobs failed because of the new eqiad1-r region's use of a new private IP range (T208986: WDQS tests can no longer edit test.wikidata.org).

One option that may or may not turn out to be better would be to force the Cloud VPS software-defined networking layer to route requests to Wikimedia wikis through the public address space for Cloud VPS. This would prevent internal Wikimedia servers from seeing the private addresses in use. A possible negative effect, however, would be that all traffic originating from Cloud VPS instances and Toolforge would appear to come from a small range of IPs (or a single IP?). This would in turn make finding a single misbehaving bot or script more difficult and could lead to the very negative outcome of all Cloud VPS/Toolforge actions being blocked in order to shut out one malicious or naive bot.

I am sure there are other pros and cons of using public IPs to communicate between Cloud VPS/Toolforge and the co-located & directly linked servers operated for Wikimedia's production network. These should be discussed (ideally here) before any major implementation change is undertaken. The opinions of the Wikimedia SRE, Analytics, and Security-Team as well as the cloud-services-team would be especially useful in coming to a near term decision.

Event Timeline

We could give each VPS project a virtual router and 1 public IP to route from, but that would probably require us to have a /20 of public IPv4 addresses.
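For a sense of scale, here is a rough sketch of the address arithmetic behind that estimate. It only counts the addresses a prefix of a given length covers; the `0.0.0.0` base is a placeholder, and nothing here reflects an actual WMCS allocation.

```python
import ipaddress


def addresses_in_prefix(prefix_len: int) -> int:
    """Number of IPv4 addresses covered by a prefix of the given length."""
    # 0.0.0.0 is only a placeholder base; the count depends on the length alone.
    return ipaddress.ip_network(f"0.0.0.0/{prefix_len}").num_addresses


print(addresses_in_prefix(20))  # 4096
```

So one public IP per project only fits in a /20 while the number of projects stays below roughly 4096.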

Another idea is to investigate the flow management capabilities in Neutron and see if that could be used to audit these connections somehow. That will probably be very expensive but I'm not sure.

A final solution would be to go full IPv6 and stop worrying about mapping private/public addresses. I prefer this option for the simplicity in routing.

Yup, T174596 very much overlaps with this, if it is not a duplicate. As that task indicates, it's not even consistent right now, and source NATing depends on whether one hits a main or edge PoP, which in turn depends on the GeoDNS config... So it's something that needs to be addressed one way or another soon.

What's the likelihood of something like this IP block happening on Wikis? Perhaps we can avoid that by setting the right reverse DNS and informing the editing community? Is it a risk that we can accept until, say, IPv6 is deployed in WMCS?

What's the likelihood of something like this IP block happening on Wikis?

It takes very few bad actor actions for an IP to come to the attention of admins & checkusers on enwiki or other large projects. We have not sampled in a while, but in 2016 we found that 24% of edits across all wikis (and 50% of wikidata edits) originated from Cloud VPS IPs. Anything we have in the stack that is rate limiting by IP (restbase? ores?) will probably also trigger more often with this kind of consolidation.
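To illustrate the rate limiting concern, here is a toy sketch of a naive per-IP limit applied to the same workload twice: once with each instance visible under its own address, and once with everything consolidated behind a single SNAT address. The workload size, the limit, and the addresses are all made up for illustration, and this is not how restbase, ores, or any other production limiter is actually implemented.

```python
from collections import Counter


def rejected_requests(source_ips, limit_per_ip):
    """Count requests rejected by a naive per-IP limit within one time window."""
    seen = Counter()
    rejected = 0
    for ip in source_ips:
        seen[ip] += 1
        if seen[ip] > limit_per_ip:
            rejected += 1
    return rejected


# Hypothetical workload: 10 instances, 5 requests each, limit of 20 per IP.
direct = [f"172.16.0.{i}" for i in range(1, 11) for _ in range(5)]
# Same requests, all seen from one address (documentation range as a stand-in).
behind_snat = ["198.51.100.1"] * len(direct)

print(rejected_requests(direct, 20))       # 0
print(rejected_requests(behind_snat, 20))  # 30
```

The traffic is identical in both runs; only the address the limiter sees changes.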

Perhaps we can avoid that by setting the right reverse DNS and informing the editing community?

A good PTR record and whois information for the IP(s) we use for SNAT should help. We really should already be concerned about that for the sake of external sites that may get a large amount of traffic from Cloud VPS/Toolforge hosts. We may also be able to mitigate some of this if hosts with public IPs (like the majority of the Toolforge job grid exec nodes) route directly instead of being consolidated with SNAT. The public IPs on Toolforge grid exec nodes today were added to help with Freenode connection limits which is a similar situation.

Advertising in Tech News plus enlisting the help of the Community Relations folks should help keep blocks from happening as well, or at least make them easier to reverse when they happen.

Is it a risk that we can accept until, say, IPv6 is deployed in WMCS?

We don't have a timeline on IPv6 at all at this point, so functionally I think we should assume that SNAT is a "forever" solution if we apply it now. I have faith that we will get to IPv6 before the heat death of the universe, but I'm not willing to put a more definite date on it than that.

Anything we have in the stack that is rate limiting by IP

Varnish comes to mind.

From analytics the only way we use the IP data from labs, as far as I can tell, is to decide whether a country will be assigned to the request, meaning that if this is an "internal" request no country gets assigned:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/maxmind/CountryDatabaseReader.java#L33

I was, ahem, wondering how we keep this list of internal IPs updated and, well, it does not look like it has had an update for almost a year, which probably means that some IPs are missing or incorrect.
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/IpUtil.java#L180
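As a rough illustration of why a stale list matters (this is a Python sketch, not the refinery Java code, and both lists below are assumptions made up for the example): an address from a range that was never added to the list is simply not recognised as internal, so it would be treated like any external request.

```python
import ipaddress


def is_internal(ip, internal_networks):
    """Return True if ip falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in internal_networks)


# Hypothetical lists: one written before a new private range came into use,
# one updated afterwards. Neither is copied from IpUtil.java.
stale_list = ["10.0.0.0/8"]
updated_list = ["10.0.0.0/8", "172.16.0.0/12"]

print(is_internal("172.16.1.5", stale_list))    # False
print(is_internal("172.16.1.5", updated_list))  # True
```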

I think this change will matter little, because if we want to quantify "lawful" edits from tools on labs we should be able to do so via user_agent, provided tools are setting proper UAs according to our policies. If we think this is not the case and we think quantifying edits from labs is crucial, we need a way to do that without IPs. This is the only use case I can think of that might be important.

We certainly want to keep track if this change happens as we will see an artificial "bump" from traffic from the cities in which the datacenter with labs instances is deployed.

A good PTR record and whois information for the IP(s) we use for SNAT should help. We really should already be concerned about that for the sake of external sites that may get a large amount of traffic from Cloud VPS/Toolforge hosts. We may also be able to mitigate some of this if hosts with public IPs (like the majority of the Toolforge job grid exec nodes) route directly instead of being consolidated with SNAT. The public IPs on Toolforge grid exec nodes today were added to help with Freenode connection limits which is a similar situation.

Right! I think that SNATing to dedicated IPs is possible already -- it seemed to be the case for e.g. the newly set up mx-outs. WMCS has much more public IP space than it used to pre-Neutron, so maybe public IPs for Toolforge + other projects that make a lot of edits would be feasible here and address the majority of the concerns?

I'm really not familiar at all with WMCS' figures -- is the bulk of the bot traffic coming from e.g. Toolforge or is it spread out across lots of different projects? The projects I'm more familiar with are more on the testing infrastructure side of things and don't really need their own dedicated IPs, so I don't really have a lot of experience to draw from :)

Also, aren't we co-hosting multiple bots behind each (10/8) IP right now in Toolforge already, and therefore already operating with this "bad actor" risk? Has this resulted in issues so far and, if so, how frequent have these been?

I'm really not familiar at all with WMCS' figures -- is the bulk of the bot traffic coming from e.g. Toolforge or is it spread out across lots of different projects? The projects I'm more familiar with are more on the testing infrastructure side of things and don't really need their own dedicated IPs, so I don't really have a lot of experience to draw from :)

We honestly do not have any data on this. The checkuser data audit we did in fall 2016 was the only time I know of that we tried to quantify the importance of Cloud VPS as a whole via on-wiki edit counts, and we did not attempt to distinguish between the various projects at that time. I would naively guess that more than half and maybe as much as 90% of the edit activity does come from Toolforge, but that is just a guess.

Also, aren't we co-hosting multiple bots behind each (10/8) IP right now in Toolforge already, and therefore already operating with this "bad actor" risk? Has this resulted in issues so far and, if so, how frequent have these been?

There is consolidation in Toolforge where we use both Grid Engine and Kubernetes to place multiple user controlled processes on a single VM and thus behind a single IP from the 10/8 space. There are currently 104 distinct instances in Toolforge (44 grid "exec" nodes, 32 grid "web" nodes, 28 k8s workers) that may show up in checkuser audits. The 10/8 range that they show up from itself would be a strong signal to the checkuser community of the "internal" origin of these edits.

The single/small range of IPs concern might all be nervous FUD on my part in reality. Maybe we can get someone like @MusikAnimal to provide input from the point of view of an enwiki admin + checkuser as to the potential impact or confusion of consolidating the edit traffic from Cloud VPS into a smaller number of IPs?

I see it as already having the "bad actor" risk. I ran a check on my bot the other day and it's all over the 10/8 range. For CUs, I think the common workflow would be to run the check on a problematic account, and see what other accounts are editing behind the IPs. Then you'd see a wealth of bots, so you would know it's not OK to hard block that range. You'd very likely check the WHOIS too, and see that it's internal and assume it's something we're hosting. However for your average admin who can't run a check, they might block the account with autoblock enabled, which could in theory bring down all of Toolforge/VPS. This is the standard block option for accounts (on enwiki anyway), so I'm surprised that after all this time I've never heard of bots being affected by collateral damage. Probably because the 10/8 range is so big, so if we narrowed it down, the risk is probably higher :/

Thanks @bd808 and @MusikAnimal :)

By the way, do we need a separate task to discuss all the public endpoints that are not text-lb? Right now it's a blanket all-of-production list, and I think we should start trimming it down ASAP, before all kinds of implicit assumptions/dependencies creep in :)

ayounsi added a parent task: Restricted Task.Nov 13 2018, 12:53 PM

If needed, note that I can add a (temporary) logging statement on the firewall to see all flows going from 172.16/12 to our public ranges, if that is of any help. I guess the same is possible on the OpenStack gateway.

aborrero changed the task status from Open to Stalled.Nov 22 2019, 10:21 AM
aborrero triaged this task as Low priority.
aborrero raised the priority of this task from Low to Medium.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

I will be working on this this quarter. The ultimate goal is to merge a patch like the one I will upload next.

Before the patch is merged we would need to ensure that wikis won't block or aggressively ratelimit CloudVPS/Toolforge in a way that would make the service useless.
We would like to communicate beforehand with affected stakeholders in order to raise awareness of the change and make it as smooth as possible.

Change 656883 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] [DONT MERGE] cloud: NAT egress connections to WMF wikis

https://gerrit.wikimedia.org/r/656883

aborrero changed the task status from Stalled to Open.Jan 18 2021, 1:10 PM

Change 656886 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/homer/public@master] [DONT MERGE] cloud-in4: NAT egress connections to WMF wikis

https://gerrit.wikimedia.org/r/656886

I created this wiki page with information to share with stakeholders: https://wikitech.wikimedia.org/wiki/News/CloudVPS_NAT_wikis

Change 657067 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/mediawiki-config@master] [DON'T MERGE] Allow Cloud VPS NAT address for $wmgAllowLabsAnonEdits wikis

https://gerrit.wikimedia.org/r/657067

aborrero removed a parent task: Restricted Task.Jan 19 2021, 3:55 PM

As documented here https://wikitech.wikimedia.org/wiki/News/CloudVPS_NAT_wikis#Timeline the timeline for this change is:

  • 2021-01-25: announce the change to the community. Ask for feedback.
  • 2021-02-01: evaluate collected feedback, address concerns, work with affected stakeholders towards a smooth change.
  • 2021-02-08: introduce change. Monitor services to discover bugs and other issues, and fix them.
  • 2021-02-19: change is considered done and completed.

Change 658890 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/mediawiki-config@master] Add WMCS to the exception of ratelimit

https://gerrit.wikimedia.org/r/658890

I'm adding this IP to the list of exceptions in the MediaWiki ratelimit. Does that make sense?

My opinion was requested via T273738. We can exempt the IP address from all rate limits and abuse control, but what happens when there is abuse? We often get accidental DoS attacks coming from WMCS, and it would be nice, at least in principle, to be able to trace unwanted traffic back to its source and block it there rather than just blocking the whole of WMCS. There's a risk that during incident response, downtime will be extended due to the need to find solutions that don't block the whole thing.

On the other hand, we don't want edits attributed to RFC 1918 addresses because they're not unique. But maybe that's not a big deal if the only thing that uses an RFC 1918 address is WMCS. The addresses are not private in the sense that we don't want people to know what they are. They're private in the sense of being unroutable, which is maybe no big deal. T208986 does not appear to justify the change; it apparently just prompted someone to notice the current situation. So I'm not really seeing a rationale strong enough to justify breaking abuse control.

A lot of things become a lot easier if every instance can have its own public IP address. I don't understand why dual stack IPv6 is difficult, so I'd like more information about that. If you need an excuse to spend time on it, this task seems like a good excuse.

If it's hard to make instances be dual stack, it should be possible to implement IPv4 to IPv6 prefix translation in the NAT server. If the cloud instance sends a packet destined for the IPv4 text-lb, convert it to an IPv6 packet destined for the IPv6 text-lb, with its source address statelessly mapped. Then statelessly translate the return traffic back to IPv4. I gather this is called SIIT and is implemented by the open source project Jool.
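A minimal sketch of the stateless address mapping idea, assuming a /96 translation prefix with the IPv4 address embedded in the low 32 bits (the scheme described in RFC 6052). The prefix below is from IPv6 documentation space and purely hypothetical, the function names are made up, and this only shows the address arithmetic, not packet header translation or how Jool is configured.

```python
import ipaddress

# Hypothetical translation prefix (documentation space), not a real allocation.
PREFIX = ipaddress.ip_network("2001:db8:64::/96")


def to_ipv6(ipv4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of the /96 prefix."""
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(PREFIX.network_address) | int(v4))


def to_ipv4(ipv6: str) -> ipaddress.IPv4Address:
    """Recover the embedded IPv4 address from a mapped IPv6 address."""
    v6 = ipaddress.IPv6Address(ipv6)
    if v6 not in PREFIX:
        raise ValueError(f"{v6} is not inside {PREFIX}")
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)


mapped = to_ipv6("172.16.1.5")
print(mapped)                # 2001:db8:64::ac10:105
print(to_ipv4(str(mapped)))  # 172.16.1.5
```

Because the mapping is a pure function of the address, no per-connection state is needed, and each instance stays individually identifiable on the IPv6 side.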

While writing T276615#6889195, I realized how good it is that we see the instance IPs. As said in the task, eswiki suffers from abuse originating from one of the WMCS-hosted tools. Right now, the community can ban one cloud project (as long as IP addresses are relatively stable), but once this change is done, the community will have only one big hammer: ban WMCS. We really need a really good anti-abuse solution if this is done, IMO.