
NFS broken - Cannot build nor run my tool anymore
Closed, Resolved · Public · BUG REPORT

Description

I looked for an existing ticket about NFS being broken on Toolforge but couldn't find one.

Since Sunday, 7 January, NFS has been broken for my tool. I can no longer run it with logs enabled, and I can no longer build it from the bastion; see below.

Steps to replicate the issue (include links if applicable):

  • Log in to dev.tools.wmflabs.org
  • become spacemedia
  • ls -la /data/project/spacemedia/logs
  • webservice --cpu 1 --mem 4Gi jdk17 start /data/project/spacemedia/run.sh web
  • rebuild my tool with maven

What happens?:

  • The logs folder has a very recent modification date (drwxrwsr-x 2 root tools.spacemedia 20480 Jan 7 12:18), which is when I started having problems. I did not do anything with this folder.
  • webservice does not start if logs are configured to be written to NFS (i.e. a pod starts but with only 2 Mb of memory; I guess my application crashes immediately with an I/O error, but I cannot check for sure without logs). The application starts fine if I configure it to log only to standard output, but that is not viable.
  • build hangs when it tries the first write access to NFS:
[INFO] --- maven-clean-plugin:3.3.2:clean (default-clean) @ spacemedia ---
[INFO] Deleting /mnt/nfs/labstore-secondary-tools-project/spacemedia/spacemedia/target

What should have happened instead?:

  • logs folder should be much older
  • application should be able to start with NFS logs
  • I should be able to build it on bastion, as before 7th of January

Event Timeline

As far as I can tell, the logs directory can be written into:

tools.spacemedia@tools-sgebastion-10:~$ touch logs/test
tools.spacemedia@tools-sgebastion-10:~$ echo foo > logs/test
tools.spacemedia@tools-sgebastion-10:~$ cat logs/test
foo
tools.spacemedia@tools-sgebastion-10:~$ rm logs/test
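The manual check above can be sketched as a small bash function with a time limit, so that a hanging NFS operation is reported rather than blocking the shell. This is only a sketch: the function name probe_dir_write and the probe file name are hypothetical, and a process stuck in uninterruptible I/O on a hard-mounted NFS share may not exit even after timeout signals it.

```shell
# Hypothetical helper: create, read back, and delete a small file in a
# directory, giving up after a time limit (default 10 seconds).
probe_dir_write() {
    local dir="$1" limit="${2:-10}"
    # Probe file name is an assumption; $$ makes it unique per shell.
    local probe="$dir/.write-probe.$$"
    timeout "$limit" sh -c '
        echo probe > "$1" &&
        [ "$(cat "$1")" = probe ] &&
        rm -f -- "$1"
    ' sh "$probe"
}
```

For example, probe_dir_write /data/project/spacemedia/logs 5 returns 0 if all three steps succeed within five seconds. With GNU timeout, an exit status of 124 means the time limit was reached.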

webservice --cpu 1 --mem 4Gi jdk17 start /data/project/spacemedia/run.sh web

Is this currently configured to log to NFS or somewhere else? I tried this, and it started something, and based on toolforge webservice logs -f (docs) it seems to be doing something.

My logback configuration file is /data/project/spacemedia/conf/logback-spring-toolforge.xml

Originally the file was configured to log to the file system:

<include resource="org/springframework/boot/logging/logback/defaults.xml" />
<property name="LOG_FILE" value="/data/project/spacemedia/logs/spacemedia.log"/>
<include resource="org/springframework/boot/logging/logback/file-appender.xml" />

<root level="INFO">
    <appender-ref ref="FILE" />
</root>

I changed it to log to the console so that the webservice could start:

<include resource="org/springframework/boot/logging/logback/defaults.xml" />
<include resource="org/springframework/boot/logging/logback/console-appender.xml" />

<root level="INFO">
    <appender-ref ref="CONSOLE" />
</root>

But now I can't rebuild my tool with updates: the build hangs when trying to delete the /data/project/spacemedia/spacemedia/target directory.

That's odd. Indeed, I can't delete that directory from any Toolforge bastion node. Moving the directory is possible, however:

tools.spacemedia@tools-sgebastion-11:~$ mv /data/project/spacemedia/spacemedia/target /data/project/spacemedia/spacemedia/target-old
tools.spacemedia@tools-sgebastion-11:~$ ls -lah /data/project/spacemedia/spacemedia/target-old
total 12K
drwxr-sr-x 3 tools.spacemedia tools.spacemedia 4.0K Jan  9 10:18 .
drwxr-sr-x 7 tools.spacemedia tools.spacemedia 4.0K Jan  9 16:14 ..
drwxr-sr-x 2 tools.spacemedia tools.spacemedia 4.0K Jan  1 00:27 web
tools.spacemedia@tools-sgebastion-11:~$ mv /data/project/spacemedia/spacemedia/target-old /data/project/spacemedia/spacemedia/target
tools.spacemedia@tools-sgebastion-11:~$ ls -lah /data/project/spacemedia/spacemedia/target
total 12K
drwxr-sr-x 3 tools.spacemedia tools.spacemedia 4.0K Jan  9 10:18 .
drwxr-sr-x 7 tools.spacemedia tools.spacemedia 4.0K Jan  9 16:14 ..
drwxr-sr-x 2 tools.spacemedia tools.spacemedia 4.0K Jan  1 00:27 web

but it still can't be deleted.

It's that specific directory that's cursed, not the name:

tools.spacemedia@tools-sgebastion-11:~$ mv /data/project/spacemedia/spacemedia/target /data/project/spacemedia/spacemedia/target-old
tools.spacemedia@tools-sgebastion-11:~$ mkdir /data/project/spacemedia/spacemedia/target
tools.spacemedia@tools-sgebastion-11:~$ rm -rv /data/project/spacemedia/spacemedia/target
removed directory '/data/project/spacemedia/spacemedia/target'
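The manual mv and mkdir steps above can be wrapped in a small bash function. This is a minimal sketch, not a Toolforge tool: the name set_aside_dir and the "-old" suffix are assumptions taken from the commands in this thread, and the function only moves the stuck directory out of the way; it does not delete it.

```shell
# Hypothetical helper: move a directory aside as "<name>-old" and create
# a fresh, empty directory under the original name.
set_aside_dir() {
    local dir="${1%/}"          # strip one trailing slash, if any
    local aside="${dir}-old"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    if [ -e "$aside" ]; then
        # Never clobber a directory that was set aside earlier.
        echo "refusing to overwrite existing: $aside" >&2
        return 1
    fi
    mv -- "$dir" "$aside" && mkdir -- "$dir"
}
```

For example, set_aside_dir /data/project/spacemedia/spacemedia/target leaves the old contents in target-old. The new directory gets default permissions from mkdir, so any special mode on the original would need to be reapplied by hand.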

Thank you! I am now able to build my tool again.
I've done the same with the logs folder (mv logs logs-old && mkdir logs) and now my tool can start and log again, as it did before 7 January.

Feel free to close the ticket if you don't have the time or motivation to investigate further what happened on Sunday to make my folders cursed (I have no idea).

fnegri triaged this task as Medium priority. Jan 16 2024, 4:34 PM
fnegri added a project: cloud-services-team.
taavi claimed this task.

I was able to delete logs-old via the NFS server.