
NFS backups on aptly aren't set up for tools-services-05.tools.eqiad1.wikimedia.cloud and jessie packages are gone
Open, Needs Triage, Public

Description

tools-services-05.tools.eqiad1.wikimedia.cloud does not appear to be set up to mount NFS, so syncing has been going to local disk (which doesn't really help us much).

Perhaps mount NFS after deleting the extra copy on local disk. I don't think we've done nearly enough to justify eliminating the jessie packages, since their absence blocks image building.

Event Timeline

Bstorm renamed this task from "NFS backups on aptly aren't set up for tools-services-05.tools.eqiad1.wikimedia.cloud" to "NFS backups on aptly aren't set up for tools-services-05.tools.eqiad1.wikimedia.cloud and jessie packages are gone". Tue, Jul 13, 9:09 PM
Bstorm updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2021-07-14T23:29:11Z] <bstorm> mounted nfs on tools-services-05 and backing up aptly to NFS dir T286003

Not entirely sure how to restore jessie packages, and worse, not entirely sure they will compile if attempted. I suppose we only strictly need webservice, but even that probably needs to be the older python 2 version.

bstorm@tools-services-05:~$ ls /data/project/.system/aptly/
tools-services-01.tools.eqiad.wmflabs  tools-services-05.tools.eqiad1.wikimedia.cloud
tools-services-02.tools.eqiad.wmflabs  tools-sge-services-03.tools.eqiad.wmflabs
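
For reference, a minimal sketch of a guard that would catch the original failure mode, where the sync silently lands on local disk because NFS is not mounted. The function names are hypothetical and this is not the script that is actually deployed; it assumes GNU `stat`, which reports `nfs` as the filesystem type for NFS mounts.

```shell
#!/usr/bin/env bash
# Sketch only: refuse to run the aptly backup unless the destination is on NFS.

# Print the filesystem type backing a path (GNU stat; "nfs" for NFS mounts).
fs_type() {
    stat -f -c %T "$1"
}

# Succeed only if the directory exists and is backed by an NFS filesystem.
is_on_nfs() {
    local dir="$1" fstype
    [ -d "$dir" ] || return 1
    fstype="$(fs_type "$dir")" || return 1
    case "$fstype" in
        nfs*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Copy the aptly tree to the backup directory, but only when the guard passes.
backup_aptly() {
    local src="$1" dest="$2"
    if ! is_on_nfs "$dest"; then
        echo "refusing to back up: $dest is not on NFS" >&2
        return 1
    fi
    rsync -a "$src"/ "$dest"/
}
```

Usage would look something like `backup_aptly /srv/packages "/data/project/.system/aptly/$(hostname -f)"`, where `/srv/packages` is only an assumed location for the local aptly data.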

I have more information on this.

Debian Jessie can be considered old-old-old-stable. Too many olds in there. I think jessie shouldn't be supported anymore anywhere. Including containers.

In my opinion we should focus instead on the problem of how to handle toolforge tools that use jessie-based docker images (which can be considered unmaintained, dead or a similar category). But I'm open to other options, and I'm willing to do the jessie aptly restoration if we decide so.

> I have more information on this.

The problem is that this places it on a single live disk. A live disk is vulnerable to corruption on a serious VM failure (ceph will just replicate corruption if it finds it, so there is zero protection there). Replicating it to NFS is cheap, not really a problem on its own, and makes existing processes continue to make sense. On NFS, we get two remote copies via the backup on separate machines that will not replicate corruption (in this case) from one disk to another, since the initial rsync is a logical copy. There is still significant value in keeping it on NFS, where most of the rest of the data related to Toolforge is, and nearly no real downside. When we have Swift backup, we could back it up to a different ceph "disk" that isn't a live OS-running disk.

In short, I don't think this reduces dependency on NFS without adding an alternative that does what NFS does (a second copy that is less vulnerable to at least one source of corruption). If this contradicts anything I said earlier, then I apologize for not thinking it through more then. If we care about data, it is better to back it up *somehow*, though I know I agreed that using a single host with a cinder vol makes sense since we can eat the downtime on aptly of moving it.

> In my opinion we should focus instead on the problem of how to handle toolforge tools that use jessie-based docker images (which can be considered unmaintained, dead or a similar category). But I'm open to other options, and I'm willing to do the jessie aptly restoration if we decide so.

I agree we should focus on deprecating those images (we've just not got around to it), but the first step shouldn't be to make things difficult for deployments on our end. Specifically, I don't want things like the work that folks are doing on webservice now to cause surprise and unannounced breakages when we are fully able to prevent that. Unintentional breakage is fine since it's deprecated.

We need to communicate an end to the containers and then remove them from user availability before we make it impossible for us to build them. There is a possibility that our current version of webservice won't run, but we should be able to run the most recent valid version in there. We only really need a repo to exist and for it to contain the toollabs-webservice package. It doesn't need any of the other cruft, I don't think. I'll make a task specifically for properly burying the jessie containers if you can restore that repo (though the repo only really needs some version of webservice initially).

I think we should put a finer point on "NFS removal" since I don't share the belief that "NFS is just bad" that was used here previously, nor do I believe that we can easily completely dump it or something like it. I think NFS should be quota'd and self-service (previous alternatives were), it should actually fail over now that manual failover works, and our services should not depend firmly on the existing design. I'll do a writeup on that or maybe use a "proposal task".

Mentioned in SAL (#wikimedia-cloud) [2021-07-16T11:52:36Z] <arturo> created jessie-tools aptly repository on tools-services-05 (T286003)

Mentioned in SAL (#wikimedia-cloud) [2021-07-16T11:57:38Z] <arturo> added toollabs-webservice_0.75_all to jessie-tools aptly repo (T286003)

I added the package back!

root@tools-services-05:~# aptly repo show --with-packages jessie-tools
Name: jessie-tools
Comment: Toolforge packages for Debian 8 (Jessie)
Default Distribution: jessie-tools
Default Component: main
Number of packages: 1
Packages:
  toollabs-webservice_0.75_all
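
For anyone repeating this later, the steps from the SAL entries above can be sketched roughly as follows. This is not a transcript of what was run: the function name is hypothetical, the `.deb` file name is assumed from the package key, and the comment, distribution and component values are copied from the `aptly repo show` output. Publishing the repo is not covered here.

```shell
#!/usr/bin/env bash
# Sketch only: recreate a minimal repo that holds a single package.
# APTLY can be overridden (e.g. APTLY=echo) to preview the commands.
APTLY="${APTLY:-aptly}"

restore_jessie_repo() {
    local repo="$1" deb="$2"
    # Create the repo with the same metadata shown by "aptly repo show".
    "$APTLY" repo create \
        -comment="Toolforge packages for Debian 8 (Jessie)" \
        -distribution="$repo" \
        -component="main" \
        "$repo" || return 1
    # Add the one package we strictly need.
    "$APTLY" repo add "$repo" "$deb" || return 1
    # Print the result so it can be checked by eye.
    "$APTLY" repo show -with-packages "$repo"
}
```

Running `APTLY=echo` followed by `restore_jessie_repo jessie-tools toollabs-webservice_0.75_all.deb` prints the three aptly invocations without touching any repository.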

Let me know if we need any other package (misctool, jobutils, etc).

Thanks! I don't think we need any other packages. Now I should probably check if webservice 0.75 works on jessie. 😐

I'll make that task if it doesn't exist and link it to this if it does. We need to not ignore those containers since they are really not great to have around forever.

I'll find a way to test the portion of webservice needed on a jessie container and then close this.