
[Refactor] [ci-charts] environments share and link to a "repository pool" to save space
Closed, Resolved · Public · 3 Estimated Story Points

Description

In vhost-environments, patchdemo keeps a pool of repositories for core, extensions, skins, and modules. When an environment is created, it hard links the files instead of copying them, saving space.

Catalyst (or, more precisely, the mediawiki chart in ci-tools that Catalyst uses) clones the repositories each time it makes a new environment. This means a full MediaWiki environment consumes ~4GB of space.

The goal of this task is to copy or emulate vhost-environments' "repository pool" pattern in ci-charts. This may be done with shared persistent volumes.

Below is an example of the steps one solution might take. There may be other solutions, and the engineer should feel free to explore other options.

  1. add a persistent volume that is attached to patchdemo
  2. when k8s-patchdemo starts, it clones each repository that patchdemo supports into a "repository pool" on that persistent volume
  3. add the same persistent volume to each k8s-environment created through the mediawiki chart in ci-charts
  4. when starting an environment, instead of git clone-ing a new copy of each repository, make a new worktree from the repo in the repository pool (see the sketch below)
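
Below is a minimal sketch of steps 2 and 4, assuming the shared persistent volume is mounted at the same path (here the hypothetical /srv/repo-pool) in both the patchdemo pod and the environment pods; the repository list, paths, and branch are illustrative only:

```python
import subprocess
from pathlib import Path

# Hypothetical mount point of the shared persistent volume and an example
# subset of the repositories patchdemo supports.
POOL = Path("/srv/repo-pool")
REPOS = {
    "core": "https://gerrit.wikimedia.org/r/mediawiki/core",
    "skins/Vector": "https://gerrit.wikimedia.org/r/mediawiki/skins/Vector",
}


def ensure_pool() -> None:
    """Step 2: clone (or refresh) each supported repository into the pool."""
    for name, url in REPOS.items():
        repo = POOL / name
        if (repo / ".git").exists():
            subprocess.run(["git", "-C", str(repo), "fetch", "--all", "--prune"], check=True)
        else:
            repo.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(["git", "clone", url, str(repo)], check=True)


def checkout_environment(env_dir: Path, ref: str = "master") -> None:
    """Step 4: create worktrees from the pooled repos instead of fresh clones."""
    for name in REPOS:
        target = env_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        # --detach avoids "branch already checked out" errors when many
        # environments use the same ref. A worktree shares the pool's object
        # database, so the per-environment cost is a checkout, not a clone.
        subprocess.run(
            ["git", "-C", str(POOL / name), "worktree", "add", "--detach", str(target), ref],
            check=True,
        )


if __name__ == "__main__":
    ensure_pool()
    checkout_environment(Path("/srv/environments/example-env"))
```

Note that a worktree's .git file points back into the pooled repository, so the pool and the environment checkouts need to be visible at consistent paths from every pod that uses them, which is why both sides have to mount the shared volume.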

A/C

  • when creating a new environment, the increase in disk usage on the Kubernetes cluster is significantly smaller than it was before this effort

Details

Other Assignee
jnuche

Event Timeline

thcipriani set the point value for this task to 3. (Sep 23 2024, 4:16 PM)
thcipriani updated Other Assignee, added: jnuche.
thcipriani moved this task from Backlog to Ready on the Catalyst (Camp Muir) board.

The solution implemented as part of this task works for a single node only, so @SDunlap asked me to investigate options for a solution we can use once we start scaling our cluster horizontally.

After spending some time looking into it, I've found a few possible approaches for the repos cache (and any other shared data we may have in the future) for multiple nodes:

  1. Using DaemonSets: This is a purely local, non-distributed solution. We would deploy a DaemonSet whose pods run on every node and are responsible for creating the repo pool and keeping it updated (see the DaemonSet sketch after this list).
    • Pros:
      • Simple solution. No need to install new storage plugins in the cluster or manage extra configuration for the Cloud VPS hosts
      • For the repo cache use case, no need to attach disks to the Cloud VPS hosts
    • Cons:
      • Not an actual distributed solution; data wouldn't be shared across the nodes. It would need to be replaced with a different approach if we need real data sharing across nodes in the future
      • If we need to use larger sets of data in the future, we would need to make sure every Cloud VPS host has a disk of the appropriate size attached at the right location
  2. Longhorn: Longhorn is maintained by SUSE, which also created K3s. I tried out this approach by creating a two-node K3s cluster with Longhorn installed. It was relatively easy to deploy two pods, one on each node, and share data between them through a single volume created via the "longhorn" StorageClass (see the PVC sketch after this list). Currently I'm leaning towards using this solution in the future.
    • Pros:
      • Significantly simpler than other distributed solutions I investigated
      • Maintained by the creators of K3s, which will probably ensure stability and a good integration
      • Already tested, worked out-of-the-box for our current use case
      • Comes with snapshots and backups out-of-the-box
    • Cons:
      • Like other distributed solutions, we will need configuration management on the Cloud VPS instances (to e.g. install required packages)
      • I read on forums that performance is bad for heavy concurrent writing, but this will probably never be a required use case for us. It's also important to note that this seems to arise from the fact that for distributed writing, Longhorn relies on NFS under the hood. So I would expect a similar performance impact for other NFS-based solutions (see below)
      • Similarly, I saw people complaining Longhorn performs excessive data replication
  3. Rook+Ceph: Use the widely used Rook operator (a Kubernetes extension) to provide storage. People in forums seem to prefer this over Longhorn for reliable, heavy-duty production use. The impression I got, however, is that this setup is probably overkill for Catalyst's needs.
    • Pros:
      • Users consistently report a good experience with Rook and better performance than Longhorn
      • Rook comes with out-of-the-box tooling to help you create the Ceph cluster
    • Cons:
      • Like other distributed solutions, config management will be needed for the hosts
      • Seems like a much bigger, resource-heavier solution than what we need
      • My guess is this will be harder to maintain than, say, Longhorn. I'm a bit wary of what is essentially building our own Ceph cluster on top of Cloud VPS (but maybe not an issue if all Ceph traffic happens inside the cluster network?)
  4. Use the cluster to provide an NFS server and StorageClass: In this approach we build our own NFS-based solution inside K3s. An NFS server gets deployed to the cluster and a StorageClass is created to provision the persistent volumes. This old-ish post provides a template for how to do it. The provisioner mentioned there has long been deprecated, but this one should be able to replace it.
    • Pros:
      • It would be a specialized solution for our use case and in theory could end up being less resource-intensive than LH or Rook
    • Cons:
      • On the flip side, we would be reinventing a wheel already built by LH or Rook
      • This approach requires development time from our side, and consequently we can expect more maintenance than with the solutions mentioned above
      • Like other distributed solutions, config management is needed
  5. Manually create an NFS server: Create an NFS server on one of the Cloud VPS hosts and then use an NFS client provisioner in the K3s cluster to provision volumes. See this post for an example.
    • Pros:
      • Similarly to solution #4, this could be more resource-efficient
      • Overall the implementation looks simpler than solution #4, which also means less maintenance
    • Cons:
      • Like other distributed solutions, config management is needed
      • In fact, in this case a more significant amount of Puppet config would be needed to create the NFS server
      • As in the case of the Ceph cluster solution, I would expect surprises when trying to build an NFS server directly on top of Cloud VPS (as opposed to an NFS server embedded into K3s)
  6. Use NFS from Cloud VPS: Here, instead of creating an NFS server directly or indirectly via K3s, we would use the NFS volumes provided by Cloud VPS. The rest of the solution would be the same as in solution #5
    • Pros:
      • Same benefits as solution #5
      • Simpler Puppet config management compared to solution #5
    • Cons:
      • Like other distributed solutions, config management is needed
      • Judging by the docs, our cloud team doesn't seem especially confident in its own NFS shared-storage solution
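
For reference, here is a rough sketch of what option 1 (the DaemonSet approach) could look like, using the kubernetes Python client. The namespace, image, command, and hostPath location are all hypothetical placeholders; the point is only the shape of the solution: one pod per node, writing to node-local storage.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster

LABELS = {"app": "repo-pool-refresher"}

daemon_set = client.V1DaemonSet(
    metadata=client.V1ObjectMeta(name="repo-pool-refresher", namespace="catalyst"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels=LABELS),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=LABELS),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="refresher",
                        image="example-registry/repo-pool-refresher:latest",  # hypothetical image
                        command=["/usr/local/bin/refresh-repo-pool"],  # hypothetical script
                        volume_mounts=[
                            client.V1VolumeMount(name="repo-pool", mount_path="/srv/repo-pool")
                        ],
                    )
                ],
                volumes=[
                    # hostPath keeps the data strictly node-local, which is the
                    # main limitation of this option.
                    client.V1Volume(
                        name="repo-pool",
                        host_path=client.V1HostPathVolumeSource(
                            path="/srv/repo-pool", type="DirectoryOrCreate"
                        ),
                    )
                ],
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_daemon_set(namespace="catalyst", body=daemon_set)
```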
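
And a sketch of the Longhorn variant (option 2): a ReadWriteMany claim against the "longhorn" StorageClass that environment pods on different nodes could mount. Names and sizes are made up, and the exact resources type may differ between versions of the Python client:

```python
from kubernetes import client, config

config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="repo-pool", namespace="catalyst"),
    spec=client.V1PersistentVolumeClaimSpec(
        # ReadWriteMany lets pods on different nodes share the volume;
        # Longhorn serves RWX volumes via an internal NFS share-manager.
        access_modes=["ReadWriteMany"],
        storage_class_name="longhorn",
        resources=client.V1ResourceRequirements(requests={"storage": "20Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(namespace="catalyst", body=pvc)
```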
thcipriani subscribed.

@jnuche tested option #2; it was pretty straightforward, and we're leaning towards that solution.

For now, we're still on a single node. As a followup, we'll make a task to decide among the options @jnuche has enumerated. When we're ready to scale, we'll pick up that task.