Page MenuHomePhabricator

Evaluate opening the readonly Wiki Replicas to the WAN (since we already have user authentication)
Closed, DeclinedPublic

Description

Preamble

Similarly to Wikimedia Dumps, Wiki Replicas are designed to share sanitized databases available to a wide public.

Current situation

Wiki Replicas can be accessed only from Wikimedia Cloud internal LAN (local area network).

Strengths

The Toolforge's LAN is a fully-operational comfort zone where users can authenticate easily to whatever database.

The "complicated" MySQL connection parameters are auto-magically prefilled and it just works.

Source: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database

Weaknesses

The following use-cases are extremely discouraged in practice by Wiki Replicas' current design:

  • test/run/developing an app needing reading access to a Wiki Replica (but especially complicated for access to >=1 Wiki Replicas)…
    • …from your personal computer (→ should be encouraged to take advantage of local computational resources)
    • …from a dedicated personal server (→ should be encouraged to have more flexible development pipelines)
    • …from a dedicated server of a Wikimedia Chapter (→ many chapters have their own servers and we should not discourage this independence)

We say that it's "extremely discouraged" since we know that, at the moment, normal connections do not work outside Wikimedia Cloud LAN. Example:

$ mysql --host enwiki.analytics.db.svc.eqiad.wmflabs enwiki_p
ERROR 2005 (HY000): Unknown MySQL server host 'enwiki.analytics.db.svc.eqiad.wmflabs' (-2)

This LAN-only policy is a problem for most people. To workaround it, at the moment some people who need a "reliable working connection" to the Replicas, create a systemd unit on their computer, to keep an autossh daemon running, keeping a connection to Toolforge, to bind a specific custom local port to a specific remote MySQL Replica server. Example:

[Unit]
Description=AutoSSH to Toolforge for Wikimedia Replicas 

[Service]
User=me
Group=me
ExecStart=/usr/bin/autossh -N foo@dev.toolforge.org -L 1234:enwiki.analytics.db.svc.wikimedia.cloud:3306

[Install]
WantedBy=multi-user.target

To then play through the loopback interface over your custom local SSH tunnel:

mysql --user=... --password --host=127.0.0.1 --port=1234 enwiki_p
NOTE: Please note that this SSH tunnel way is suggested from the official Wikimedia Cloud reference. Also note that this is not against the guidelines about "keeping long opened connections": this is just a well-known example from several production environments outside of Toolforge, to be able to do open programmatically a MySQL connection to Toolforge, whatever the length of the MySQL connection, even few seconds.

Source: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connecting_to_the_database_replicas_from_your_own_computer

You may say: «Yeah it's not easy but still acceptable». Then, think about another very simple use-case: "run a query on itwiki, and then dewiki, and then ukwiki, from your computer". Then you understand that this is a frequent problem but with a workaround that is not acceptable anymore:

ssh -N foo@dev.toolforge.org -L 3306:enwiki.analytics.db.svc.wikimedia.cloud:3306 -L 3307:itwiki.analytics.db.svc.wikimedia.cloud:3306 -L 3308:dewiki.analytics.db.svc.wikimedia.cloud:3306 -L 3309:ukwiki.analytics.db.svc.wikimedia.cloud:3306

Really over-complicated and hacky: you need to remember that to query enwiki_p you need 127.0.0.1:3306, while to query itwiki_p you need localhost:3307, while you need to query dewiki_p you need to query localhost:3308, and so on with ukwiki_p that can be accessed only by localhost:3309.

You may say: «Yeah but you can hardode your connection to the precise MySQL slice server like s3.analytics.db.svc.wikimedia.cloud or s1 etc. so you need fewer SSH tunnels» but this is also not feasible since this is a deprecated practice: databases can be migrated to different slices anytime.

More comments

Most Replica users do not expect this over-complicated official workaround (one local SSH tunnel for each replica) since they already have knowledge to other enterprise+critical but very simple database-as-a-service providers, who has just 1 public hostname and you can access to it from whenever you are in the Internet, as long as you have correct username and password to your databases.

I think this situation is one of the main causes of the >75% of Toolforge users declaring that they work more than 50% of their time directly on Toolforge servers, instead that from their own computer (Source: https://meta.wikimedia.org/wiki/Research:Cloud_Services_Annual_Survey/2021).

Proposed solution

Having said the current design of hosts like enwiki.analytics.db.svc.wikimedia.cloud to reach the database enwiki_p is OK for most Replica users, I suggest to kept this design as-is (even if a single frontend entry-point, proxying connections to the correct backends, could improve a lot the situation. But again, this is outside the scope of this proposal).

Also, having said that:

  • these are sanitized replicas (without any private information)
  • users already need MySQL authentication in terms of personal username and password and it's easy to track abuses

It could be nice to:

→ Evaluate the assignation of a public IP address to each Wiki Replica (so, at the moment, allocating 8 IPs).

💟 Thank you for your thoughts!

-boz

NOTE: This Task is not about having everything, or nothing. It's about improving the current situation. For example: if necessary, we may consider publishing these nodes on the Internet but under firewall with an allow-list of trusted IPs, especially for an initial experimental phase. Thank you!

Event Timeline

Andrew subscribed.

Thank you for writing this thoughtful task!

Mysql authentication is a big, scary attack surface and we're not likely to ever want to expose it (or any db server) directly to the open internet.

That said, we all recognize that the current workflow to access these replicas is somewhat ugly, especially from e.g. a Windows laptop. I think we're generally receptive to suggestions about improving that workflow, but I don't have any immediate thoughts about what that would look like... an ssh tunnel may remain your best option (although no doubt that process could be better-documented).

valerio.bozzolan renamed this task from Evaluate opening the Wiki Replicas to the WAN (since we already have user authentication) to Evaluate opening the readonly Wiki Replicas to the WAN (since we already have user authentication).Oct 19 2022, 4:34 PM

Thank you. Good to evaluate all the risks. Having said that this task talks about exposing resources that are already supposed to be read-only and public, so a very wide range of bad things are fortunately avoided by design.

I agree that I'm not much worried about access to the data itself. Rather, I'm concerned about other security issues -- denial of service is the obvious one, but there's also the risk of any possible future exploits in our bespoke mariadb version.

I'm going to close this as 'wontfix' but feel free to re-open or make another proposal if you have other thoughts about how to streamline access to the replicas.