
Determine basic hosting parameters for Toolhub
Closed, Resolved (Public)

Description

As currently envisioned, Toolhub's core will be a custom web application written in Python using the Django framework. It will need a database (probably MySQL/MariaDB based on current Wikimedia preferences) and a robust full text search engine (Elasticsearch). A job queue system of some sort would be useful for offloading tasks from the main request/response cycle (maybe Redis + Celery, maybe something lighter weight?). Web crawling to collect toolinfo.json records published external to Toolhub will need some scheduled job service (possibly just cron'd invocations of web endpoints exposed by the core application). Local development/testing environments should also be considered from the start of the project.
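To make the crawling side concrete, here is a minimal sketch of validating a fetched toolinfo.json document before ingesting it. The required field names are an assumption based on the common toolinfo.json convention; the real schema validation would be stricter.

```python
import json

# Fields we assume every toolinfo.json record must carry before Toolhub
# ingests it (based on the common toolinfo.json convention; placeholder
# list, not the final schema).
REQUIRED_FIELDS = ("name", "title", "description", "url")


def validate_toolinfo(raw):
    """Parse a raw toolinfo.json document and return the records that
    have all required fields. A document may hold a single record (a
    dict) or several (a list of dicts)."""
    data = json.loads(raw)
    records = data if isinstance(data, list) else [data]
    return [
        rec for rec in records
        if isinstance(rec, dict) and all(rec.get(f) for f in REQUIRED_FIELDS)
    ]


good = ('{"name": "demo", "title": "Demo", '
        '"description": "A demo tool", "url": "https://example.org"}')
bad = '{"name": "demo"}'
print(len(validate_toolinfo(good)))  # 1: all required fields present
print(len(validate_toolinfo(bad)))   # 0: record rejected
```

A scheduled job (cron hitting a web endpoint, or a Celery beat task) could run this over each registered URL and queue the surviving records for indexing.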

Based on @bd808's current assumptions, a full stack deployment of Toolhub is likely to need:

  • Python >=3.7
  • Numerous 3rd party Python libraries (Django, etc)
  • MySQL/MariaDB database
  • Memcached
  • Redis
  • Elasticsearch
  • Task runner (Celery or similar)

Using Docker containers to deliver code to production would be nice. This would allow deploying the application on a Kubernetes cluster for production and using Docker Compose or a local k8s cluster (minikube, k3s, etc.) for development and testing. It should also let us avoid the deployment preparation challenges of using scap3 to deploy Python applications.
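For local development, the stack listed above might be wired together with a Docker Compose file along these lines. All service names, images, and versions here are illustrative assumptions, not decisions:

```yaml
# Hypothetical docker-compose.yaml for local development only.
version: "3"
services:
  web:
    build: .
    command: python manage.py runserver 0.0.0.0:8000
    ports:
      - "8000:8000"
    depends_on: [db, cache, redis, search]
  worker:
    build: .                          # same image as web
    command: celery -A toolhub worker # "toolhub" app name is a placeholder
    depends_on: [db, redis]
  db:
    image: mariadb:10
    environment:
      MYSQL_ROOT_PASSWORD: dev-only-password
  cache:
    image: memcached:1
  redis:
    image: redis:5
  search:
    image: elasticsearch:6.8.13
    environment:
      discovery.type: single-node
```

The same images would not ship to production as-is, but keeping the dev topology close to the production one reduces surprises.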

Designing the application with an expectation of running from a container under Kubernetes management opens up one more question: what Kubernetes cluster will be used for the production deployment? @bd808 believes there are 3 viable options today:

  • Work with the ServiceOps team in SRE to bring Python support to the current production Kubernetes cluster. This will benefit other projects (ORES, Striker) in addition to Toolhub. It will not reduce the complexity of finding database, memcached, redis, elasticsearch, etc services to connect to in production.
  • Toolforge! With the 2020 Kubernetes cluster in Toolforge it has become possible to grant expanded quotas to a single tool. This, plus the existence of suitable database (toolsdb) and Elasticsearch services in Toolforge, makes it possible to imagine getting up and running in Toolforge relatively simply. Another benefit would be a lightweight process for granting non-WMF/WMDE staff access to the tool to deploy and troubleshoot. Downsides: the toolhub.toolforge.org hostname may make it harder to communicate that this service is scoped beyond Toolforge-hosted tools; deployment automation in Toolforge is currently a per-tool challenge; and log aggregation and monitoring solutions would need to be found.
  • Cloud VPS. All of the needed infrastructure could be provisioned in a dedicated Cloud VPS project. This seems like the worst choice as everything would be locally maintained. This is great for flexibility on "day 1" but a maintenance burden for "day 2+".

Event Timeline

bd808 triaged this task as High priority. Aug 21 2020, 8:23 PM

@Joe and @Reedy, I would like your perspectives on this topic. What pros/cons stand out the most for each of you about the idea of hosting in the main network versus Toolforge? Is there a thing either of you can point to that makes one or the other the most obvious choice?

In a discussion with @srishakatux, we ended up leaning pretty heavily towards the idea of using Toolforge as staging system for user acceptance testing and the Wikimedia Foundation's production Kubernetes cluster plus support services for the user facing site. The benefits of existing log aggregation and monitoring systems we can leverage in production feel like they outweigh the increased friction for getting deploy access and coordinating with other teams for using their shared services (mariadb, elasticsearch, etc). Having a staging environment in Toolforge would still give us an opportunity to get technical volunteers involved in deployments and maintenance and then possibly sponsor them for production deployment rights if there is interest and need later.

I would still like to hear from @Joe and @Reedy for any additional points they might be able to make for or against these options. Or even better to hear a 4th option that makes my 3 initial ideas look sad. :)

Apologies for taking a long time to respond, but you assigned the task to me at the start of my unforeseen leave of absence that ended yesterday :)

So, let's exclude option 3) without further discussion for the reason you cited.

Option 1) is the most sound from a technical POV, IMHO. Python support for the pipeline/Blubber already exists, so deploying this software should mostly be a matter of writing an appropriate .pipeline directory in the repository and then adding a suitable Helm chart (which should be mostly trivial as well).
This would not solve the social problems you cite here; specifically, it would require production deployment access to deploy or restart the software. It also would not remove the need to ask for storage resources (which should be taken up with the Data Persistence team and other teams).
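The .pipeline setup mentioned above might look roughly like the following Blubber configuration. This is approximate v4 syntax from memory; the exact keys, base image name, and entrypoint should be checked against the Blubber documentation rather than taken from here:

```yaml
# .pipeline/blubber.yaml — illustrative sketch, not a tested config.
version: v4
base: docker-registry.wikimedia.org/python3-buster
variants:
  production:
    python:
      version: python3
      requirements: [requirements.txt]
    # Placeholder entrypoint; a real deployment would run a WSGI
    # server rather than the Django dev server.
    entrypoint: [python3, manage.py, runserver]
```

With that in place, PipelineLib builds the image and the Helm chart describes how it runs on the cluster.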

Option 2) is OK technically, but in addition to the other issues you outlined it has the downside that the directory would be part of the very PaaS it is a directory for. I usually advise against that, as it would make the tool unusable for hosting something like an external Toolforge status page, which you might desire to do at some point.

If we want to go with option 1) we need to plan for it, specifically:

  • ServiceOps will need to allocate some time to assist with the deployment and with reserving a couple of Redis instances for the Celery backend
  • We will need to request some space on the Elasticsearch production clusters for this tool. Not sure how we would proceed there, so maybe we need to involve the Search Platform team early in the process
  • We will need to request some space on the MariaDB misc cluster from the Data Persistence team
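For concreteness, the service connections those three asks imply might look roughly like this in Django settings. Every hostname, port, and backend name below is a placeholder, not a real Wikimedia endpoint:

```python
# Sketch of the Django settings a production deployment implies.
# All hosts/ports are illustrative placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "toolhub",
        "HOST": "mariadb-misc.example.invalid",  # misc MariaDB cluster
        "PORT": 3306,
    }
}
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "memcached.example.invalid:11211",
    }
}
# Celery broker: one of the reserved Redis instances
CELERY_BROKER_URL = "redis://redis.example.invalid:6379/0"
# Elasticsearch cluster space allocated by the Search Platform team
ELASTICSEARCH_HOSTS = ["https://search.example.invalid:9200"]
```

Nothing here is exotic; each setting just needs a team on the other end willing to hand out the endpoint and credentials.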

None of the above asks is weird or unusual, and nothing (including running a python app on kubernetes) will be uncharted territory come next year. At this point, I'd summarize the above as follows:

Host in production
Pros:

  • More dedicated resources possible
  • CDN
  • Standard for production-grade services, meaning it will benefit from the evolution of the deployment pipeline.
  • Staging
  • Multi-dc
  • SRE support available
  • Independent of toolforge

Cons:

  • More intra-team dependencies for the setup
  • Harder to delegate trust
  • More work upfront

I have one major objection to option 2 (which is otherwise viable): it would mean hosting the tools directory on the same platform as the tools it lists, so any issue with the platform would reflect on the directory itself. Apart from that, the disadvantages would be fewer available resources, the absence of SRE support (though AIUI WMCS could provide SRE support on Toolforge?), and probably an overall less "standard" installation compared to the rest of our production stack. On the other hand, it would allow easier volunteer involvement in maintaining the tool, and a less steep upfront investment in the initial setup.

In the end, I think it's a choice between "more reliability" versus "easier setup/community involvement in maintenance", and I guess it's up to you to make the call. I have no major objections to going with either option :)

Thanks @Joe. I think the main downside I personally identified for production hosting was access for deployments. Now that I have done some work with the PipelineLib process I feel like this will ultimately be a problem that is solved by tooling and that Toolhub could be deployed in a nearly completely automated fashion in the future.

Let's call this {{Done}} with the recommendation that we make deployment in the production Kubernetes cluster the goal of the project.