Change routing so that traffic originating from Cloud VPS is seen by Wikimedia wikis as coming from non-private IPs
Open · Needs Triage · Public

Description

There have been discussions in the past (like T154698: Prevent contributions attributed to private and WMF IP addresses) about keeping private IP addresses from Cloud VPS instances from leaking into, and being recorded by, Wikimedia wikis and other service endpoints operated by the Wikimedia Foundation. This came up again very recently when some continuous integration jobs failed because of the new eqiad1-r region's use of a new private IP range (T208986: WDQS tests can no longer edit test.wikidata.org).

One option, which may or may not turn out to be better, would be to force the Cloud VPS software-defined networking layer to route requests to Wikimedia wikis through the public address space for Cloud VPS. This would prevent internal Wikimedia servers from seeing the private addresses in use. A possible negative effect, however, is that all traffic originating from Cloud VPS instances and Toolforge would appear to come from a small range of IPs (or a single IP?). That would in turn make finding a single misbehaving bot or script more difficult, and could lead to the very negative outcome of all Cloud VPS/Toolforge actions being blocked in order to shut out one malicious or naive bot.
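To make that trade-off concrete, here is a minimal Python sketch of what source NAT means for attribution. The private range is the 172.16/12 space discussed in this task; the public address and the helper function are purely illustrative placeholders (the real translation would happen in the Neutron/SDN layer, not in application code):

```python
import ipaddress

# Illustrative addresses only: the Cloud VPS private range mentioned in this
# task (172.16.0.0/12) and a made-up public SNAT address.
PRIVATE_NET = ipaddress.ip_network("172.16.0.0/12")
SNAT_ADDRESS = ipaddress.ip_address("185.15.56.1")  # placeholder

def source_ip_seen_by_wiki(instance_ip, snat_enabled):
    """Return the source address a Wikimedia wiki would record for a request."""
    ip = ipaddress.ip_address(instance_ip)
    if snat_enabled and ip in PRIVATE_NET:
        # With SNAT, every instance in the private range is rewritten to the
        # same public address before the packet leaves Cloud VPS.
        return str(SNAT_ADDRESS)
    return instance_ip

# Three different instances...
for instance in ("172.16.1.10", "172.16.2.20", "172.16.3.30"):
    print(instance, "->", source_ip_seen_by_wiki(instance, snat_enabled=True))
# ...all show up as 185.15.56.1, so per-instance attribution is lost on-wiki.
```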

I am sure there are other pros and cons of using public IPs to communicate between Cloud VPS/Toolforge and the co-located and directly linked servers operated for Wikimedia's production network. These should be discussed (ideally here) before any major implementation change is undertaken. The opinions of Wikimedia Operations, Analytics, and the Security-Team, as well as the cloud-services-team, would be especially useful in coming to a near-term decision.

See also:

bd808 created this task. · Nov 7 2018, 11:46 PM

We could give each VPS project a virtual router and one public IP to route from, but that would probably require us to have a /20 of public IPv4 addresses.
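Back-of-the-envelope arithmetic for that estimate (the project count below is a made-up figure for illustration; the task does not give an exact number):

```python
# Pure arithmetic: how many addresses a /20 of public IPv4 space provides.
prefix_length = 20
addresses_in_block = 2 ** (32 - prefix_length)   # 4096

projects = 800   # hypothetical Cloud VPS project count, illustration only
print(addresses_in_block, projects <= addresses_in_block)   # 4096 True
```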

Another idea is to investigate the flow-management capabilities in Neutron and see whether those could be used to audit these connections somehow. That will probably be very expensive, but I'm not sure.

A final solution would be to go full IPv6 and stop worrying about mapping private to public addresses. I prefer this option for its routing simplicity.

faidon added a comment. · Nov 8 2018, 3:16 PM

Yup, T174596 very much overlaps with, if not duplicates, this task. As that task indicates, the behavior is not even consistent right now: source NATing depends on whether one hits a main or edge PoP, which in turn depends on the GeoDNS config... So it's something that needs to be addressed one way or another soon.

What's the likelihood of something like this IP block happening on Wikis? Perhaps we can avoid that by setting the right reverse DNS and informing the editing community? Is it a risk that we can accept until, say, IPv6 is deployed in WMCS?

ayounsi added a subscriber: ayounsi. · Nov 8 2018, 8:46 PM
bd808 added a comment. · Nov 8 2018, 11:55 PM

What's the likelihood of something like this IP block happening on Wikis?

It takes very few bad-actor actions for an IP to come to the attention of admins and checkusers on enwiki or other large projects. We have not sampled in a while, but in 2016 we found that 24% of edits across all wikis (and 50% of Wikidata edits) originated from Cloud VPS IPs. Anything in the stack that rate limits by IP (restbase? ores?) will probably also trigger more often with this kind of consolidation.
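To illustrate the rate-limiting concern with a toy Python sketch (this is not how restbase, ores, or Varnish actually implement their limits; the addresses and thresholds are invented): when many clients are source-NATed behind one address, they share a single counter and the limit trips far sooner.

```python
from collections import defaultdict

class PerIpRateLimiter:
    """Toy fixed-window limiter keyed on the client IP the service sees."""

    def __init__(self, max_requests_per_window):
        self.max_requests = max_requests_per_window
        self.counts = defaultdict(int)

    def allow(self, client_ip):
        self.counts[client_ip] += 1
        return self.counts[client_ip] <= self.max_requests

# Without SNAT: 200 well-behaved tools on 200 distinct private IPs,
# one request each -- nobody hits the limit.
limiter = PerIpRateLimiter(max_requests_per_window=100)
print(all(limiter.allow(f"172.16.0.{i}") for i in range(200)))   # True

# With SNAT: the same 200 requests all arrive from one (hypothetical)
# public IP, so the shared counter blows through the limit halfway in.
limiter = PerIpRateLimiter(max_requests_per_window=100)
results = [limiter.allow("185.15.56.1") for _ in range(200)]
print(results.count(False))   # 100 requests rejected
```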

Perhaps we can avoid that by setting the right reverse DNS and informing the editing community?

A good PTR record and whois information for the IP(s) we use for SNAT should help. We really should already be concerned about that for the sake of external sites that may get a large amount of traffic from Cloud VPS/Toolforge hosts. We may also be able to mitigate some of this if hosts with public IPs (like the majority of the Toolforge job grid exec nodes) route directly instead of being consolidated with SNAT. The public IPs on Toolforge grid exec nodes today were added to help with Freenode connection limits, which is a similar situation.
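As a small illustration of the reverse-DNS point (the address is a placeholder; the real SNAT IP and its PTR record would be whatever gets provisioned):

```python
import socket

# Placeholder SNAT address -- substitute whatever address ends up being used.
SNAT_IP = "185.15.56.1"

try:
    hostname, _aliases, _addrs = socket.gethostbyaddr(SNAT_IP)
    # A descriptive PTR record (e.g. something under a Cloud VPS / Toolforge
    # domain) makes it obvious on-wiki and to external sites where the
    # traffic originates.
    print(f"{SNAT_IP} resolves to {hostname}")
except socket.herror:
    print(f"No PTR record for {SNAT_IP} -- admins only see a bare IP")
```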

Advertising in Tech News plus enlisting the help of the Community Relations folks should help keep blocks from happening as well, or at least make them easier to reverse when they happen.

Is it a risk that we can accept until, say, IPv6 is deployed in WMCS?

We don't have a timeline on IPv6 at all at this point, so functionally I think we should assume that SNAT is a "forever" solution if we apply it now. I have faith that we will get to IPv6 before the heat death of the universe, but I'm not willing to put a more definite date on it than that.

Nuria added a comment. · Nov 9 2018, 12:24 AM

Anything we have in the stack that is rate limiting by IP

Varnish comes to mind.

From the Analytics side, the only way we use the IP data from labs, as far as I can tell, is to decide whether a country gets assigned to the request, meaning that if it is an "internal" request, no country is assigned:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/maxmind/CountryDatabaseReader.java#L33

I was, ahem, wondering how we keep this list of internal IPs updated and, well, it does not look like it has been updated for almost a year, which probably means that some IPs are missing or incorrect.
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/IpUtil.java#L180
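For readers not following the Java link, the gist of that check can be sketched in a few lines of Python; this is an illustrative rendering, not the refinery implementation, and only the ranges mentioned in this task are listed:

```python
import ipaddress

# Hard-coded "internal" ranges, analogous to the list in IpUtil.java.
# Only the ranges discussed in this task are shown; the real list is longer
# and, as noted above, has to be kept up to date by hand.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
]

def is_internal(request_ip):
    """Return True if the request is treated as internal (no country assigned)."""
    ip = ipaddress.ip_address(request_ip)
    return any(ip in net for net in INTERNAL_NETWORKS)

print(is_internal("172.16.4.21"))   # True  -> no geolocation
print(is_internal("203.0.113.9"))   # False -> country lookup proceeds
```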

I think this change will matter little, because if we want to quantify "lawful" edits from tools on labs we should be able to do so via user_agent, provided tools are setting proper UAs according to our policies. If we think that is not the case, and we think quantifying edits from labs is crucial, we need a way to do that without IPs. This is the only use case I can think of that might be important.

We certainly want to keep track of when this change happens, as we will see an artificial "bump" in traffic from the cities where the datacenter hosting the labs instances is located.

A good PTR record and whois information for the IP(s) we use for SNAT should help. We really should already be concerned about that for the sake of external sites that may get a large amount of traffic from Cloud VPS/Toolforge hosts. We may also be able to mitigate some of this if hosts with public IPs (like the majority of the Toolforge job grid exec nodes) route directly instead of being consolidated with SNAT. The public IPs on Toolforge grid exec nodes today were added to help with Freenode connection limits, which is a similar situation.

Right! I think that SNATing to dedicated IPs is possible already -- it seemed to be the case for e.g. the newly set up mx-outs. WMCS has much more public IP space than it used to pre-Neutron, so maybe public IPs for Toolforge + other projects that make a lot of edits would be feasible here and address the majority of the concerns?

I'm really not familiar at all with WMCS' figures -- is the bulk of the bot traffic coming from e.g. Toolforge or is it spread out across lots of different projects? The projects I'm more familiar with are more on the testing infrastructure side of things and don't really need their own dedicated IPs, so I don't really have a lot of experience to draw from :)

Also, aren't we co-hosting multiple bots behind each (10/8) IP in Toolforge already, and therefore already operating with this "bad actor" risk? Has this resulted in issues so far, and if so, how frequent have these been?

bd808 updated the task description. · Nov 9 2018, 6:19 PM

I'm really not familiar at all with WMCS' figures -- is the bulk of the bot traffic coming from e.g. Toolforge or is it spread out across lots of different projects? The projects I'm more familiar with are more on the testing infrastructure side of things and don't really need their own dedicated IPs, so I don't really have a lot of experience to draw from :)

We honestly do not have any data on this. The checkuser data audit we did in fall 2016 was the only time I know of that we tried to quantify the importance of Cloud VPS as a whole via on-wiki edit counts, and we did not attempt to distinguish between the various projects at that time. I would naively guess that more than half, and maybe as much as 90%, of the edit activity does come from Toolforge, but that is just a guess.

Also, aren't we co-hosting multiple bots behind each (10/8) IP in Toolforge already, and therefore already operating with this "bad actor" risk? Has this resulted in issues so far, and if so, how frequent have these been?

There is consolidation in Toolforge, where we use both Grid Engine and Kubernetes to place multiple user-controlled processes on a single VM, and thus behind a single IP from the 10/8 space. There are currently 104 distinct instances in Toolforge (44 grid "exec" nodes, 32 grid "web" nodes, 28 k8s workers) that may show up in checkuser audits. The 10/8 range they show up from would itself be a strong signal to the checkuser community of the "internal" origin of these edits.

The concern about a single IP or a small range of IPs might, in reality, all be nervous FUD on my part. Maybe we can get someone like @MusikAnimal to provide input, from the point of view of an enwiki admin and checkuser, on the potential impact or confusion of consolidating the edit traffic from Cloud VPS into a smaller number of IPs?

I see it as already having the "bad actor" risk. I ran a check on my bot the other day and it's all over the 10/8 range. For CUs, I think the common workflow would be to run the check on a problematic account and see what other accounts are editing behind the IPs. Then you'd see a wealth of bots, so you would know it's not OK to hard-block that range. You'd very likely check the WHOIS too, see that it's internal, and assume it's something we're hosting. However, your average admin who can't run a check might block the account with autoblock enabled, which could in theory bring down all of Toolforge/VPS. This is the standard block option for accounts (on enwiki anyway), so I'm surprised that after all this time I've never heard of bots being affected by collateral damage. Probably because the 10/8 range is so big; if we narrowed it down, the risk would probably be higher :/

Thanks @bd808 and @MusikAnimal :)

By the way, do we need a separate task to discuss all the public endpoints that are not text-lb? Right now it's a blanket all-of-production list, and I think we should start trimming it down ASAP, before all kinds of implicit assumptions/dependencies creep in :)

ayounsi added a parent task: Restricted Task. · Nov 13 2018, 12:53 PM

If needed, I can add a (temporary) logging statement on the firewall to see all flows going from 172.16/12 to our public ranges, if that would be of any help. I guess the same is possible on the OpenStack gateway.
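If such logging gets enabled, filtering the resulting flows is straightforward; a Python sketch, where the flow record format and the destination prefixes are assumptions rather than the firewall's actual output:

```python
import ipaddress

CLOUD_PRIVATE = ipaddress.ip_network("172.16.0.0/12")
# Example prefixes standing in for "our public ranges"; substitute the real
# production prefixes from the network configuration.
PUBLIC_RANGES = [ipaddress.ip_network("198.35.26.0/23"),
                 ipaddress.ip_network("208.80.152.0/22")]

def interesting(flow):
    """Keep flows going from the Cloud VPS private range to production."""
    src, dst = (ipaddress.ip_address(a) for a in flow)
    return src in CLOUD_PRIVATE and any(dst in net for net in PUBLIC_RANGES)

flows = [("172.16.1.10", "208.80.154.224"),   # Cloud VPS -> production: kept
         ("192.168.1.5", "208.80.154.224")]   # unrelated source: dropped
print([f for f in flows if interesting(f)])
```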