Paste P48401

[Session] Past, Present and Future of Wikimedia Cloud Services (Toolforge and friends)

Authored by aborrero on May 20 2023, 3:18 PM.
== Past, Present and Future of Wikimedia Cloud Services (Toolforge and friends) ==
Date & time: Saturday, May 20th at 16:00 EEST / 13:00 UTC
== Relevant links ==
* Phabricator task: https://phabricator.wikimedia.org/T333939
* Slides: https://commons.wikimedia.org/wiki/File:Past_Present_and_Future_of_WMCS.pdf
* https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_introduction
== Presenters ==
* Arturo Gonzalez
* Bryan Davis
== Participants ==
* Jorge Huaman
== Notes ==
Presentation & Notes
=== Past ===
2001: Wikipedia started (https://en.wikipedia.org/wiki/History_of_Wikipedia), and bots started doing things (https://en.wikipedia.org/wiki/Wikipedia:History_of_Wikipedia_bots) - bots were active only 9 months after launch.
...
2005: Toolserver created by WMDE (https://www.mediawiki.org/wiki/Toolserver:History) in support of the first Wikimania; Sun Microsystems sponsored the event and donated some hardware. Mark Bergsma took it to a datacenter that was open-knowledge related and hosting the first edge cache, and WMDE took control of running it. It slowly grew with bits of donated hardware.
2010-2011: Wikimedia Labs project started (https://www.mediawiki.org/w/index.php?title=Wikimedia_Labs&oldid=429621), using OpenStack, a virtualization cluster, to make it easier for technical volunteers to have root rights and contribute directly to various infrastructure projects, e.g. puppetizing how our servers were managed, using automation software instead of purely manual updates and config.
2012: Labs had 120+ projects and 600+ users (https://www.mediawiki.org/wiki/Wikimedia_Engineering/2012-13_Goals#Rationale/Background_2)
2013: Tool Labs project started with the intent to replace the aging Toolserver system. (https://www.mediawiki.org/w/index.php?title=Wikimedia_Labs/Toolforge&oldid=651862) Coren was hired to lead it and to work out what needed to be prioritized for keeping, and what needed to be added in the 2nd-generation system. Nowadays known as Toolforge.
2014: Yuvipanda said it should be easier to query the replica system, and hence Quarry was built.
2014: Toolserver was shut down (https://meta.wikimedia.org/w/index.php?title=Toolserver&diff=prev&oldid=9070101). A lot of coordination between WMF/WMDE/volunteers to move and migrate everything.
2015: PAWS introduced - another Yuvipanda initiative. Came out of the work he did to support the Research team. (originally named for "Pywikibot as a Service", but now extends beyond that).
2015: Yuvipanda built the original Kubernetes cluster (https://lists.wikimedia.org/pipermail/labs-l/2015-September/004033.html), aiming to replace the GridEngine system with a more maintainable setup.
2015: "Labs labs labs" naming problem. (https://wikitech.wikimedia.org/wiki/Help:Labs_labs_labs)
2017: To get more resources for all these systems, Chase Pettet and Bryan proposed the more understandable-to-outsiders "Cloud" project and team. (https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/H7ILCNLMHRRY4YTOCNQ45PST3ECP6JFR/)
2017: Labs and Tool Labs rebranded to Cloud VPS and Toolforge (https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Rebranding_Cloud_Services_products)
2017: Wiki-Replicas redesigned, 2nd gen. (https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/) Lost some things because of table-complexities. [?]
2018: Openstack Neutron network deployed. Up until that point we had run the same core network design (dating from 2011), which had been deprecated since 2014. Making changes to modernize OpenStack was a big project. (https://phabricator.wikimedia.org/phame/post/view/120/neutron_is_here/)
2019: HTTPS enforced for Toolforge (https://phabricator.wikimedia.org/phame/post/view/132/migrating_tools.wmflabs.org_to_https/)
2019: Ubuntu replaced with Debian in Toolforge (https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation) Largely driven by decisions by the SREs who run the core Wiki system. They'd made the same change, and the team followed along.
2020: Toolforge Kubernetes cluster rebuilt (https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration). We were early adopters and had to apply security patches directly; nowadays we have a supported configuration [?]
2020: New toolforge.org domain introduced (https://wikitech.wikimedia.org/wiki/News/Toolforge.org). Important for security: it prevents cross-tool scripting attacks, and more.
2020: Wiki-Replicas redesigned (https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign), again with some pluses and minuses. Went from 3 servers to 8, with more RAM and CPU, but each database server no longer hosts the databases for all the wikis, so the ability to do cross-database queries was lost (still looking for solutions to help people get around that). In exchange, gained a bunch of stability.
Celebrate! 20 years of staff (WMF/WMDE/More Affiliates) working hand in hand with volunteers as equals
=== Present ===
What "just" happened", is happening now, and is about to happen.
Numbers:
* ~30.5% of total wiki edits come from WMCS [April 2023]
* ~3200 Toolforge tools
* ~2500 Toolforge maintainers
* ~1000 Cloud VPS virtual machines
* ~200 Cloud VPS projects
* 4000 vCPU, 27 TB memory, 140 TB storage
Recent changes
* Toolhub (https://toolhub.wikimedia.org/)
* New domain names were introduced:
** wmcloud.org
** wikimediacloud.org
** wikimedia.cloud (https://wikitech.wikimedia.org/wiki/News/Phasing_out_the_.wmflabs_domain)
* HTTPS enforced at front proxy layer (https://wikitech.wikimedia.org/wiki/News/HTTPS_enforcement_at_shared_proxy)
* Openstack Trove: a database-as-a-service offering (https://wikitech.wikimedia.org/wiki/Help:Trove_database_user_guide) (a connection sketch follows this list)
* Terraform support: Taavi led this work; Terraform can be used to manage your Cloud VPS infrastructure (https://wikitech.wikimedia.org/wiki/Help:Using_Terraform_on_Cloud_VPS)
* Publicly reachable OpenStack API endpoints
* Ceph distributed network storage: a game-changer for Cloud VPS. [?]
* Cinder volumes (attachable disk space) (https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances)
* Toolforge jobs framework (https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework), a way to help Toolforge users migrate everything off of the old GridEngine framework and onto the Kubernetes backend
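As a minimal sketch of the Trove item above (not official documentation): from a tool's or VM's point of view, a Trove-provisioned database is just an ordinary MySQL endpoint. The hostname, database name, and credentials below are placeholders for the values you get when creating the instance.

```python
# Sketch: a Trove-provisioned database behaves like a normal MySQL endpoint.
# All names below are placeholders, not real Cloud VPS addresses.
import pymysql

conn = pymysql.connect(
    host="my-trove-instance.example",  # placeholder instance address
    user="my_user",                    # placeholder credentials
    password="my_password",
    database="my_database",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print(cur.fetchone()[0])
finally:
    conn.close()
```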
Now and soon
* ToolsDB improvements
* Toolforge builds service (https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Ongoing_Efforts/Toolforge_Build_Service/Overview) - generate Docker container images containing several different runtimes in an automated fashion (tomorrow Slavina will be presenting a session on this, 9:30 in the small hacking space)
* Openstack Magnum (k8s-as-a-service): currently usable (with some rough edges), and volunteers for testing are welcome!
* Cloud VPS object storage (swift) - coming soon™!
=== Future ===
Let's talk about the next 10 years. We really want your input.
Toolforge: possible futures
* Toolforge as a platform
* Toolforge push-to-deploy (deployed directly/automatically from a GitLab repo)
* No ssh interaction required (the current need for ssh acts as a barrier, keeping out less technical community members)
* Central log system (Taavi is working on this.)
* NFS? Currently in use, but outdated; they'd like to replace it
* ToolsDB?
Cloud VPS: many things they'd like to change/improve
* Multiple DataCenter support - would like to expand beyond a single datacenter in the USA
* Tenant Networks - Ability for cloud VPS services to define their own network inside
* Kubernetes as undercloud
* IPv6 - on the list for a while, but they never have time to explore it
* Beyond virtual machines?
* GPUs? For tools that require them, like large language models and machine learning models. Is there a desire in the community to have these available to use?
Data Services
* Wiki-Replicas improvements? (a sketch of the current cross-database workaround follows this list)
* Superset as a replacement for Quarry? Ongoing investigation by Taavi.
** Preview: http://superset.wmcloud.org/
* Database-as-a-Service beyond Trove? Maybe MongoDB (or an open-source alternative...) is wanted?
* Data pipelines?
* Analytics / Data lake?
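Following up on the Wiki-Replicas item above: since the 2020 redesign dropped cross-database queries, the usual workaround is client-side, opening one connection per wiki and combining results in your own code. A hedged sketch follows; host and database naming (`<wiki>.analytics.db.svc.wikimedia.cloud`, `<wiki>_p`) and the `~/replica.my.cnf` credentials file follow the Wikitech docs, while the two example wikis and the query are arbitrary.

```python
# Sketch: emulate a cross-database query client-side, since each wiki now
# lives on its own Wiki-Replicas section.
import configparser
import os

import pymysql

cfg = configparser.ConfigParser()
cfg.read(os.path.expanduser("~/replica.my.cnf"))  # tool account credentials
user = cfg["client"]["user"].strip("'")           # values are sometimes quoted
password = cfg["client"]["password"].strip("'")

def fetch_titles(wiki, limit=1000):
    """Return a sample of main-namespace page titles from one wiki's replica."""
    conn = pymysql.connect(
        host=f"{wiki}.analytics.db.svc.wikimedia.cloud",
        database=f"{wiki}_p",
        user=user,
        password=password,
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT page_title FROM page WHERE page_namespace = 0 LIMIT %s",
                (limit,),
            )
            return {row[0] for row in cur.fetchall()}
    finally:
        conn.close()

# The join happens in Python instead of SQL:
common = fetch_titles("enwiki") & fetch_titles("eswiki")
print(f"{len(common)} shared titles in the two samples")
```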
New computing abstractions?
* Helm charts as the deployment unit? We currently use the virtual machine as the unit, but what if we used Helm charts as the unit of computing?
* What does the Wiki community need?
[image] - https://commons.wikimedia.org/w/index.php?title=File:Past_Present_and_Future_of_WMCS.pdf&page=18
We have several layers of abstraction.
Hardware => Cloud VPS => Toolforge => ?? => PAWS & Quarry => No code???
At the hardware layer there are very few users who have access (basically just Taavi and WMF).
Are we missing any layers of abstractions that are wanted/needed?
MusikAnimal: a Symfony bundle for Toolforge is available, which enables a single query for [?]
I'm also excited about further abstraction, such as no-SSH; the current SSH requirement is a barrier for many potential users.
* We have ideas for a web-UI, like in PAWS. But still with a command-line.
* Heroku is another possibility - they popularized the idea of a push to a git repo kicking off a build pipeline (continuous deployment upon commit). That's another way to eliminate the SSH/command-line 'tax'.
An improvement upon the above: setting the deployment pipeline to be triggered by changes to a branch tagged as "release version". A good compromise between manual deployment and "deploy on commit." You only push to production when you really mean to.
Re: computing abstractions, functions and lambdas. Not sure what the jobs framework is capable of; perhaps a user only cares about resources, and doesn't care about scaling and nodes. Helm charts and Google Co-pilot [???].
* Is there a strong desire in the community for lambdas?
** Is that covered by Wikifunctions? -- There's lots of potential with that, but need to see how it works in practice, and how it integrates.
* Wonder how this differs between Kubernetes and serverless frameworks.
* The question is: Are you developing something that is currently hard to deploy on any of our offerings, and if so what would you need instead?
** Something that splits the backend/frontend services. Standard API backends [?]
* Currently it's possible to split backend/frontend by creating two different tools & deploying them independently. But a future solution could be deploying multiple containers within a single tool.
Request: Need a monitoring tool on Toolforge. Many of my tools are based on cron and k8s and others, but there's no way to understand from an error why they've failed. "Why did my script fail?" E.g. one of my scripts was failing for months, and I had to download massive logs and search through them.
Response: it's something the team has discussed, for improving logging and alerts. One problem right now is that they don't want to expose error messages in case they contain private information. There's a spectrum of things that can get us to a better place. Log aggregation is one: not having to ssh in to get access to the logs. Taavi is hoping to get logs integrated into the admin dashboard (??)
K8s does have health checks that we could explore how to expose.
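As a hedged illustration of that point (a sketch, not current Toolforge tooling): the Kubernetes API already records why a container died, which is the raw material such a view could expose. This uses the official Python client; the "tool-mytool" namespace and the local kubeconfig are assumptions.

```python
# Sketch: answer "why did my pod fail?" from what Kubernetes already records.
# "tool-mytool" is a placeholder namespace.
from kubernetes import client, config

config.load_kube_config()  # assumes a standard ~/.kube/config is present
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="tool-mytool").items:
    for status in pod.status.container_statuses or []:
        terminated = status.last_state.terminated
        if terminated is not None:
            # reason is e.g. "Error" or "OOMKilled"
            print(pod.metadata.name, status.name,
                  terminated.reason, terminated.exit_code)
```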
Request: resource-consumption metrics. Sometimes a script will freeze, and occasionally it's because of too many errors [?].
* We've got a bit of that, but aren't exposing it well. We get metrics per namespace, and the dashboard tool has some of it. Perhaps we can expose some of it with Prometheus? (a sketch follows)
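A hedged sketch of what self-service per-namespace metrics could look like, assuming a reachable Prometheus server scraping the cluster; the endpoint URL and namespace are placeholders, and container_memory_working_set_bytes is the standard kubelet/cAdvisor metric name.

```python
# Sketch: query per-namespace memory usage via the Prometheus HTTP API.
# PROM_URL and the namespace label are placeholders.
import requests

PROM_URL = "http://prometheus.example.org"

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": 'container_memory_working_set_bytes{namespace="tool-mytool"}'},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    _timestamp, value = result["value"]  # value comes back as a string
    print(result["metric"].get("container", "?"), f"{float(value) / 2**20:.1f} MiB")
```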
Also, need better documentation about what we already have!
On GPUs: they will be a non-trivial expense, so the team is trying to determine whether there's real need/demand.
Perhaps distinguish between hardware for training models vs running models? Might be cheaper.