Preamble
Similarly to Wikimedia Dumps, Wiki Replicas are designed to share sanitized databases available to a wide public.
Current situation
Wiki Replicas can be accessed only from Wikimedia Cloud internal LAN (local area network).
Strengths
The Toolforge's LAN is a fully-operational comfort zone where users can authenticate easily to whatever database.
The "complicated" MySQL connection parameters are auto-magically prefilled and it just works.
Source: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database
Weaknesses
The following use-cases are extremely discouraged in practice by Wiki Replicas' current design:
- test/run/developing an app needing reading access to a Wiki Replica (but especially complicated for access to >=1 Wiki Replicas)…
- …from your personal computer (→ should be encouraged to take advantage of local computational resources)
- …from a dedicated personal server (→ should be encouraged to have more flexible development pipelines)
- …from a dedicated server of a Wikimedia Chapter (→ many chapters have their own servers and we should not discourage this independence)
We say that it's "extremely discouraged" since we know that, at the moment, normal connections do not work outside Wikimedia Cloud LAN. Example:
$ mysql --host enwiki.analytics.db.svc.eqiad.wmflabs enwiki_p ERROR 2005 (HY000): Unknown MySQL server host 'enwiki.analytics.db.svc.eqiad.wmflabs' (-2)
This LAN-only policy is a problem for most people. To workaround it, at the moment some people who need a "reliable working connection" to the Replicas, create a systemd unit on their computer, to keep an autossh daemon running, keeping a connection to Toolforge, to bind a specific custom local port to a specific remote MySQL Replica server. Example:
[Unit] Description=AutoSSH to Toolforge for Wikimedia Replicas [Service] User=me Group=me ExecStart=/usr/bin/autossh -N foo@dev.toolforge.org -L 1234:enwiki.analytics.db.svc.wikimedia.cloud:3306 [Install] WantedBy=multi-user.target
To then play through the loopback interface over your custom local SSH tunnel:
mysql --user=... --password --host=127.0.0.1 --port=1234 enwiki_p
You may say: «Yeah it's not easy but still acceptable». Then, think about another very simple use-case: "run a query on itwiki, and then dewiki, and then ukwiki, from your computer". Then you understand that this is a frequent problem but with a workaround that is not acceptable anymore:
ssh -N foo@dev.toolforge.org -L 3306:enwiki.analytics.db.svc.wikimedia.cloud:3306 -L 3307:itwiki.analytics.db.svc.wikimedia.cloud:3306 -L 3308:dewiki.analytics.db.svc.wikimedia.cloud:3306 -L 3309:ukwiki.analytics.db.svc.wikimedia.cloud:3306
Really over-complicated and hacky: you need to remember that to query enwiki_p you need 127.0.0.1:3306, while to query itwiki_p you need localhost:3307, while you need to query dewiki_p you need to query localhost:3308, and so on with ukwiki_p that can be accessed only by localhost:3309.
You may say: «Yeah but you can hardode your connection to the precise MySQL slice server like s3.analytics.db.svc.wikimedia.cloud or s1 etc. so you need fewer SSH tunnels» but this is also not feasible since this is a deprecated practice: databases can be migrated to different slices anytime.
More comments
Most Replica users do not expect this over-complicated official workaround (one local SSH tunnel for each replica) since they already have knowledge to other enterprise+critical but very simple database-as-a-service providers, who has just 1 public hostname and you can access to it from whenever you are in the Internet, as long as you have correct username and password to your databases.
I think this situation is one of the main causes of the >75% of Toolforge users declaring that they work more than 50% of their time directly on Toolforge servers, instead that from their own computer (Source: https://meta.wikimedia.org/wiki/Research:Cloud_Services_Annual_Survey/2021).
Proposed solution
Having said the current design of hosts like enwiki.analytics.db.svc.wikimedia.cloud to reach the database enwiki_p is OK for most Replica users, I suggest to kept this design as-is (even if a single frontend entry-point, proxying connections to the correct backends, could improve a lot the situation. But again, this is outside the scope of this proposal).
Also, having said that:
- these are sanitized replicas (without any private information)
- users already need MySQL authentication in terms of personal username and password and it's easy to track abuses
It could be nice to:
→ Evaluate the assignation of a public IP address to each Wiki Replica (so, at the moment, allocating 8 IPs).
💟 Thank you for your thoughts!
-boz