
New ORES model relies on translatewiki.net API, which is not hosted on WMF production
Closed, Declined · Public

Description

Translatewiki.net isn't served on WMF hardware, and we cannot make direct requests to their API from WMF production. ORES architecture relies on these requests, so we'll need to make changes one way or another.

Alternatives suggested so far:

  • WMF production ORES makes API requests to translatewiki.net through the HTTPS proxy, with a whitelist ensuring no host other than translatewiki.net can be reached outside production via the proxy. Add strict concurrency limits to the number of requests we support for TWN.
  • Run a new ORES instance on WMCS, which would only be responsible for TWN scoring.
  • Offer to host TWN on WMF hardware (cf. original discussion, succession plan).
  • Require that ORES on TWN inject all dependency data in every score request, so that no external API call is necessary (see the sketch after this list).
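As a rough illustration of that last alternative, here is a hypothetical sketch of what a self-contained scoring request could look like if the caller injected the dependency data up front. The endpoint, payload shape, and field names below are illustrative assumptions, not the actual ORES injection API.

```python
# Hypothetical sketch of the "inject all dependency data" alternative:
# the client ships everything ORES would otherwise fetch from the
# translatewiki.net API, so scoring needs no outbound call.
# Endpoint and field names are placeholders, not the real ORES API.
import requests

payload = {
    "rev_id": 8570715,
    "injected": {
        # Data the feature-extraction step would normally pull from
        # the translatewiki.net MediaWiki API.
        "revision.text": "...",
        "revision.parent.text": "...",
        "revision.comment": "...",
    },
}

response = requests.post(
    "https://ores.example.org/v3/scores/translatewiki",  # placeholder host
    json=payload,
    timeout=10,
)
print(response.json())
```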

Event Timeline

Based on the API performance I noted during feature extraction, I didn't expect an issue. Maybe someone from translatewiki can comment on what kind of API usage would be reasonable. We're not pre-caching translatewiki because their changes don't show up in EventStreams, so the only activity should be coming via client requests. I can't actually find any Graphite logging of score requests for translatewiki, so I don't think we've seen an abnormally high traffic volume.

Note that the URL in the error above works from my desktop machine in Minnesota, yet scoring the same revision from production fails with the error that @awight pasted above. If I run the same score on our wmflabs service, however, it returns just fine. So I'm guessing that our prod IP(s) are being blocked somehow.

After some more testing, I think it's likely that we're blocking ourselves. Translatewiki.net is outside of prod. It is the only API ORES has ever tried to access outside of prod. It's probably not working because we don't want to allow ORES to speak to the open internet. We might need to make a rule to allow ORES requests to get out to translatewiki.

@akosiaris, what do you think? Would it be OK to make a rule for translatewiki?

@JBennett may have some thoughts on this from security's point of view.

Nikerabbit subscribed.

I can confirm translatewiki.net does not currently block any WMF IPs.

After some more testing, I think it's likely that we're blocking ourselves. Translatewiki.net is outside of prod.

I think that's a good guess. https://ores.wmflabs.org/v3/scores/translatewiki/8570715 for example will work consistently, but ores.wikimedia.org does not.

@JBennett may have some thoughts on this from security's point of view.

I think you should make ORES use http://url-downloader.eqiad.wikimedia.org:8080 (or http://url-downloader.codfw.wikimedia.org:8080 if you are in codfw) as a proxy for the requests to translatewiki (but obviously not for the requests to prod).
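A minimal sketch of what that could look like, assuming ORES issues its API calls with the Python `requests` library (an assumption about the implementation, not a description of the actual ORES code):

```python
# Route only the translatewiki.net call through url-downloader; requests
# to production hosts keep using direct connectivity. Proxy host shown
# for eqiad; codfw would use url-downloader.codfw.wikimedia.org instead.
import requests

TWN_PROXIES = {
    "http": "http://url-downloader.eqiad.wikimedia.org:8080",
    "https": "http://url-downloader.eqiad.wikimedia.org:8080",
}

# External request: goes via the proxy.
twn = requests.get(
    "https://translatewiki.net/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    proxies=TWN_PROXIES,
    timeout=10,
)

# Internal/production request: no proxy attached.
prod = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    timeout=10,
)
```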

No service running in production can reach out directly to services running on the internet. It needs to go via a proxy as @Bawolff pointed out above. That's a necessity since we don't want our machines open to the public internet.

But I have to say I don't agree with using the proxy for this.

Using WMF production infrastructure to keep and serve information about non-WMF production infrastructure is wrong from an architectural point of view. I know it makes sense from a cost-benefit point of view, but coupling a service that powers important production features to a service that does not reside in our infrastructure is not something I long to see happening.

I understand the request volume is expected to be low and I don't really worry about it, and I also understand the benefit to translatewiki, but doing so violates the separation of concerns principle [1].

Add to that that translatewiki.net is in fact an ocean away (it's hosted at netcup in Germany, per whois), so requests to it suffer hefty latency, potentially causing cascading issues for production.

That being said, I have no problem with translatewiki scoring being served from any other kind of infrastructure (say WMCS).

[1] https://en.wikipedia.org/wiki/Separation_of_concerns

@akosiaris, I'm confused by the source of your opposition. How exactly does this violate the separation of concerns principle? Surely the fact that we could enable support for translatewiki purely through configuration is evidence of clear separation of concerns. It sounds like "separation of concerns" is not the real issue.

This sounds more like Relying on Anything that isn't Ours is Bad(TM). It seems that the concern here is that communication with non-Prod systems that we don't manage and the high latency of interacting with translatewiki's API might affect the support we give to Wikimedia communities within WMF's production cluster.

For the most part, the only real service-level threat is contained within translatewiki.net. If the service (the MediaWiki API) becomes degraded, then ORES will struggle to serve translatewiki.net predictions quickly and consistently. This will largely affect only the translatewiki.net models, though errors may show up in our metrics and monitoring. This, in turn, will primarily affect translatewiki users and patrollers.

That said, there is potential for latency to cause an issue for ORES' service of other models. E.g., if the API latency of translatewiki becomes very high and scoring requests relevant to translatewiki are also high, then that might tie up uwsgi workers that could otherwise be used for querying models we need for production Wikimedia wikis. When uwsgi workers are tied up, incoming requests get queued rather than starting work right away. This is a very unlikely scenario because we have far more uwsgi workers than celery workers, and we have basic anti-DoS protection in place that limits parallel connections. In fact, we are living in this scenario *right now*: our inability to talk to the translatewiki API triggers the maximum number of retries and the full timeout before giving up on every single request to translatewiki, and we haven't been starved for uwsgi workers.
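To make the worker-occupancy argument concrete, here is a sketch of the kind of client-side bound that keeps a slow or unreachable external API from holding a worker for long. The retry and timeout values are illustrative, not the actual ORES settings.

```python
# Short connect/read timeouts and a small retry budget mean a slow or
# unreachable translatewiki.net API releases a worker quickly instead of
# holding it for the full retry-and-timeout cycle described above.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.mount(
    "https://",
    HTTPAdapter(max_retries=Retry(total=1, backoff_factor=0.5)),
)

def fetch_twn_revision(rev_id):
    """Fetch revision data with a tight deadline so a hung request
    cannot tie up a uwsgi/celery worker for long."""
    return session.get(
        "https://translatewiki.net/w/api.php",
        params={"action": "query", "format": "json",
                "revids": rev_id, "prop": "revisions"},
        timeout=(3, 5),  # (connect, read) seconds
    )
```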

Now, we're in a weird situation with translatewiki. A huge amount of production code relies on translatewiki. During the recent security incident, the unavailability of translatewiki wreaked havoc on our development and release workflows. Supporting translatewiki in ORES is intended to help limit the threat of similar incidents happening again. I feel like there should be a conversation about bringing translatewiki (or at least an instance of it) into the WMF production cluster, since it is a key part of our infrastructure. But in the meantime, when deciding whether or not to support translatewiki directly in production ORES, I think we should consider what leaving it unsupported with respect to quality-control work means for WMF production.

Finally, I would like to comment on the use of WMCS for supporting translatewiki. We have a WMCS installation of ORES and it largely mimics the production installation, so that might sound like a good option. Regrettably, we don't have an SLA for ores.wmflabs.org, and we do not guarantee any sort of uptime or maintenance cycle for it. We use ores.wmflabs.org to host experiments that require a cluster for testing, and to deploy experimental models that are not ready for production use. Asking a quasi-prod platform (at least something that prod relies heavily on and that our security/performance teams have spent hundreds of hours looking at) to rely on a service with no SLA seems like a bad idea when we have a prod service with dedicated maintenance that can relatively safely host a translatewiki model.

This sounds more like Relying on Anything that isn't Ours is Bad(TM). It seems that the concern here is that communication with non-Prod systems that we don't manage

That would not be new, by the way: since 2015 we have had at least one extension that depends on an external (proprietary) API service, https://www.mediawiki.org/wiki/Content_translation/Machine_Translation/Yandex . Then one can make a distinction between server-side and client-side execution, etc.

Nemo_bis renamed this task from Are we breaking translatewiki's API? to ORES instance can't connect to translatewiki.net API.Jan 9 2019, 4:24 PM
awight renamed this task from ORES instance can't connect to translatewiki.net API to New ORES model relies on translatewiki.net API, which is not hosted on WMF production.Jan 9 2019, 4:24 PM

Just a note: This case may be a bit unusual. With Yandex, we're relying on an external service in order to serve WMF production wikis. In this case, we're relying on an external service in order to serve the same external service.

@Nikerabbit You might be interested in this discussion—one of the alternatives being discussed is to finally host TWN on Foundation hardware. We'd appreciate your opinion on this and on the original issue.

I will bring this up in the weekly meeting for the security team, but I wanted to respond briefly now. I don't know that Security-Team is a primary stakeholder here, other than being generally supportive of the value-add of ORES+TWN.

Background: {T205563} led to discussion of what else could be explored, and to {T206564}. I believe @Imarlier and @Halfak went on to pursue the ORES integration. I spoke briefly today with @Joe about the concerns with having ores.wikimedia.org both serve and rely on translatewiki, and the (my words) administrative-domain and resourcing issues. It seems reasonable to me for SRE to be concerned, and I'm not sure what the right implementation details are. This may surface broader questions about the support and integration of translatewiki. Maybe it does make sense for this to be a one-off, allowed under the reasoning that TWN is very necessary for our platforms and in a weird place administratively. Practicality may win out over purity here, with some safeguards. I really don't know, but I can understand why @akosiaris expressed concern. Maybe this is a question for one or both of the SRE directors as to whether SRE is able to direct effort here as part of supporting ORES.

Here's a possible variation on the "new ORES cluster" proposal. If @Nikerabbit and translatewiki.net wished to host their own ORES, with the understanding (an MoU, even) that WMF can advise but not provide operational support, then we might start to look at our predicament as a good opportunity to build precedent for other so-called third-party installations.

If others agree, I'll add that to the list above.

Here's a possible variation on the "new ORES cluster" proposal. If @Nikerabbit and translatewiki.net wished to host their own ORES, with the understanding (an MoU, even) that WMF can advise but not provide operational support, then we might start to look at our predicament as a good opportunity to build precedent for other so-called third-party installations.

If others agree, I'll add that to the list above.

Given the planned move to using docker for releases, that should be even easier to support.

As I see it, we have three main options:

  1. A dedicated ORES installation is created to support translatewiki, possibly not an ocean away from where translatewiki is installed :). In the absence of privacy concerns (are there any? I am not aware of them), an installation in a cloud environment could work. It will need to be supported, either by WMF staff or by someone else, and that's to be decided.
  2. We determine that translatewiki.net is fundamental to the operations of our systems, and it gets managed by the same people that manage the WMF production. I want to underline this is purely a technical solution and I'm not advocating for a takeover of responsibilities - I'm just stating that such a solution would solve the issues we have here.
  3. We severely rate-limit scoring requests from translatewiki.net, so that it can't use more than X% (where X should probably be in the single digits) of our available workers.

Now, I don't really like the third option, but I understand it's the simplest thing to do and the one that requires the least resources. It's bound not to be a great solution and to create some level of disservice, given the severe restrictions we would set up. More importantly, it creates infrastructural technical debt for both the WMF and translatewiki.
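For illustration, a minimal sketch of what option 3 could look like inside a single ORES process, assuming a simple semaphore placed in front of translatewiki scoring. In practice the limit would need to be enforced across processes (for example with PoolCounter or Redis), so this is a sketch of the idea rather than a workable deployment.

```python
# Cap the fraction of workers that translatewiki requests may occupy at
# any one time, and reject the rest instead of queueing them.
import threading

TOTAL_WORKERS = 100
TWN_SHARE = 0.05  # "single digits" per cent of capacity
_twn_slots = threading.BoundedSemaphore(max(1, int(TOTAL_WORKERS * TWN_SHARE)))

def score_translatewiki(rev_id, score_fn):
    """Run a translatewiki score only if a capped slot is free."""
    if not _twn_slots.acquire(blocking=False):
        raise RuntimeError("translatewiki scoring capacity exhausted, try later")
    try:
        return score_fn(rev_id)
    finally:
        _twn_slots.release()
```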

Another detail I kinda assumed was a given, but it's better to reiterate it:

We'd need ORES to send to the proxy *only* requests for translatewiki.net, and nothing else.

So basically we need to be able to define a proxying whitelist, or else this is really a security nightmare (and will break internal requests too).
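A sketch of what such a whitelist could look like if enforced in the ORES request layer (an assumption about where it would live; it could equally be enforced at the proxy itself):

```python
# Attach the outbound proxy only for hosts on the allow-list; anything
# else is requested directly, so the proxy cannot be used to reach
# arbitrary external hosts from production.
from urllib.parse import urlparse
import requests

PROXIED_HOSTS = {"translatewiki.net"}
PROXY = {
    "http": "http://url-downloader.eqiad.wikimedia.org:8080",
    "https": "http://url-downloader.eqiad.wikimedia.org:8080",
}

def fetch(url, **kwargs):
    """GET a URL, using the proxy only for whitelisted external hosts."""
    host = urlparse(url).hostname
    proxies = PROXY if host in PROXIED_HOSTS else None
    return requests.get(url, proxies=proxies, **kwargs)
```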

In general I agree that we should not have a wildcard proxy. Given that ores uses pickle a lot, I'm worried about its security.

  1. A dedicated ORES installation is created to support translatewiki, possibly not an ocean away from where translatewiki is installed :). In the absence of privacy concerns (are there any? I am not aware of them), an installation in a cloud environment could work. It will need to be supported, either by WMF staff or by someone else, and that's to be decided.

We already have ores.wmflabs.org, which works with translatewiki.

  2. We determine that translatewiki.net is fundamental to the operations of our systems, and it gets managed by the same people that manage the WMF production. I want to underline this is purely a technical solution and I'm not advocating for a takeover of responsibilities - I'm just stating that such a solution would solve the issues we have here.

I like this idea. I'm pretty sure @Nikerabbit can elaborate more on why translatewiki is outside of prod. This service is vital to our environment; without it, anything non-English just wouldn't work. Keep in mind that not everyone speaks English.

One idea would be to whitelist TWN; I don't know how hard that would be to implement.

For what it's worth, let me say that both approaches, a dedicated ORES installation (one perhaps not two continents and an ocean away from the actual service) and bringing translatewiki.net into the fold, are way better than the alternative of using production ORES through the proxy.

Change 486728 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/services/ores/deploy@master] Disable translatewiki

https://gerrit.wikimedia.org/r/486728

Change 486728 merged by Halfak:
[mediawiki/services/ores/deploy@master] Disable translatewiki

https://gerrit.wikimedia.org/r/486728

As a security measure, I disabled translatewiki on prod. It's an obvious DDoS vector: you can exhaust all the workers just by requesting translatewiki scores.

chasemp added a project: Security-Team.

Can this task be closed? ORES is not in use for translatewiki.net.

Yes. ORES is being moved to new infrastructure, but that still doesn't fix the problem of ORES not being able to talk to the outside. Let's just call it declined.