
Zoomviewer has ~450,000 files in NFS home directory
Closed, ResolvedPublic

Description

The zoomviewer tool currently has a very large number (459,570) of error and output files in its home directory. Even though these files are individually small, a directory holding hundreds of thousands of files is very inefficient to list and traverse, and causes long response delays with high resource utilization.

find project/zoomviewer -maxdepth 1 -name 'zoom_*.[eo]*' | wc -l
459570
Sample file list
-rw-rw----    1 51295 51295          0 Mar 20  2019 zoom_531fc23031c6654b73e6bf817cd19b5e.out
-rw-r--r--    1 51295 51295        594 Mar 20  2019 zoom_531fc23031c6654b73e6bf817cd19b5e.err
-rw-rw----    1 51295 51295          0 Mar 20  2019 zoom_d46ed4ffa8a368ef1c9dc8509d4fe00f.out
-rw-rw----    1 51295 51295          0 Mar 20  2019 zoom_29fde228cdf8bfa98188feeca6b22c96.out
-rw-rw----    1 51295 51295          0 Mar 20  2019 zoom_b66f9af8fc4f81e641ffb0b755816f42.out
-rw-rw----    1 51295 51295        412 Mar 20  2019 zoom_29fde228cdf8bfa98188feeca6b22c96.err
-rw-rw----    1 51295 51295          0 Mar 20  2019 zoom_e1f56fd3f32ec59051c2a7e08beaf38a.out

Please add a cleanup process for these files, possibly using the prune.sh cron job referenced in T136190.

Event Timeline

2 months later and no response on this ticket. :(

Today zoomviewer is the single largest consumer of NFS disk space in Toolforge. It has 395G of files on disk and 498,880 $HOME/zoom_*.{err,out} log files.

Mentioned in SAL (#wikimedia-cloud) [2020-05-21T00:34:17Z] <bd808> Deleting grid job log files older than 7 days directly on the NFS backend server (T248188)

Doing this from a Toolforge bastion or grid node would have put an extra horrible load on the NFS server, as there were 497,179 files matching this description.

root@labstore1004:/srv/tools/shared/tools/project/zoomviewer# find . -maxdepth 1 -name 'zoom_*.[eo]*' -mtime +2|wc -l
497179
root@labstore1004:/srv/tools/shared/tools/project/zoomviewer# find . -maxdepth 1 -name 'zoom_*.[eo]*' -mtime +7 -exec rm {} \+
root@labstore1004:/srv/tools/shared/tools/project/zoomviewer# find . -maxdepth 1 -name 'zoom_*.[eo]*' | wc -l
5401

This has helped clean up the inode consumption issues a bit, but there are still 384G of files in its $HOME/public_html/cache.

2 months later and no response on this ticket. :(

Sorry, I'm a horrible person. I deleted them all.

find . -maxdepth 1 -name 'zoom_*.[eo]*' -delete

Took surprisingly little time (I expected this to run for hours; it took maybe a minute). How do I prevent these files from getting created in the first place? Alternatively, I could delete them via cron (and keep only the most recent ones).

Zoomviewer is a high-use tool, and since it deals with very high-resolution images and generates multiresolution pyramids, it will use a lot of storage.

Took surprisingly little time (I expected this to run for hours; it took maybe a minute).

It probably would have taken hours if I had not already deleted the 497,179 oldest files from the NFS server side. That was reasonably quick as well, but would have been slooooow from an NFS client.

How do I prevent these files from getting created in the first place? Alternatively, I could delete them via cron (and keep only the most recent ones).

Adjusting your jsub command that creates the job to include -o /dev/null -e /dev/null should do the trick. -o is the path the job's STDOUT will be logged to (this defaults to $HOME/<job name>.out) and -e is the path its STDERR will be logged to (this defaults to $HOME/<job name>.err).
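For example, roughly (a sketch only; the multires.sh path and its arguments here just stand in for however index.php actually submits the zoom_* jobs):

jsub -o /dev/null -e /dev/null \
  ${HOME}/public_html/multires.sh <hash> <image-file>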

If having recent logs is useful for debugging problems that the jobs may have, adding a cleanup job that removes log files after a few days would be a reasonable solution as well. In that case I would still suggest using -o and -e to move the log files out of the tool's $HOME and into $HOME/logs, just to make it possible to run ls in $HOME reasonably. One simple way to do that would be to use -o $HOME/logs -e $HOME/logs. When a log destination is a directory ($HOME/logs in this case), Grid Engine makes a file named <job name>.{o,e}<job number> in the output directory for output and error respectively.
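A minimal sketch of that setup (the $HOME/logs directory, submit command, and 7-day retention are all illustrative, not the tool's actual configuration):

mkdir -p $HOME/logs
# submit as before, but send the job's log files to $HOME/logs
jsub -o $HOME/logs -e $HOME/logs ${HOME}/public_html/multires.sh <hash> <image-file>
# and remove old job logs after a week, e.g. from a daily cron job
find $HOME/logs -type f -mtime +7 -delete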

Zoomviewer is a high-use tool, and since it deals with very high-resolution images and generates multiresolution pyramids, it will use a lot of storage.

I agree that it is a high usage tool. In T136190: zoomviewer is using 837G out of 8T for Tools you added a find /data/project/zoomviewer/public_html/cache -mtime +90 -delete cleanup job to purge cached data after 90 days. I think it would be reasonable to revisit that cache duration to reduce your long term storage footprint. Here are some rough numbers of the total storage reductions that could be made at different cutoff points:

$ for p in 60 45 30 14 7; do
  echo -n "${p}: "
  find . -type f -mtime +${p} -exec du -ch {} + |
    grep total$ |
    cut -f1 |
    awk '{
      ex = index("KMG", substr($1, length($1)))
      val = substr($1, 0, length($1) - 1)
      prod = val * 10^(ex * 3)
      sum += prod }
    END {printf "%.2fG\n", sum/1024/1024/1024;}'
done
60: 68.92G
45: 84.75G
30: 123.87G
14: 241.21G
7: 286.85G

Lowering the cache duration to 60 days would move you from the #1 NFS consumer to #2. 30 days would make you the #3 consumer. 14 days would make you the #6 largest consumer. 7 days would still have you at the 6th spot.

$ sort -rh ~bd808/tools-disk-20200519.txt |head
395G    zoomviewer
342G    wikidata-exports
216G    templatetiger
214G    paws
207G    oar
162G    wikidata-analysis
134G    videoconvert
106G    archive-things-4
93G     wdumps
84G     phetools

Grafana is now reporting that this project is using 793 GB.

Doing an ls in the tool's $HOME is useless because of the number of job logs:

$ find . -maxdepth 1 -name 'zoom_*.[eo]*' -mtime +2|wc -l
1297522

This is the same issue I talked about in T248188#6154453. I am running find . -maxdepth 1 -name 'zoom_*.[eo]*' -delete on the NFS server to clean them up again.

Mentioned in SAL (#wikimedia-cloud) [2020-08-26T18:12:40Z] <wm-bot> <root> Added -o /dev/null -e /dev/null to jsub commands generated by index.php (T248188)

Disk savings with different retention windows for files in $HOME/public_html/cache today:

$ for p in 60 45 30 14 7; do
      echo -n "${p}: "
      find . -type f -mtime +${p} -exec du -ch {} + |
        grep total$ |
        cut -f1 |
        awk '{
          ex = index("KMG", substr($1, length($1)))
          val = substr($1, 0, length($1) - 1)
          prod = val * 10^(ex * 3)
          sum += prod }
        END {printf "%.2fG\n", sum/1024/1024/1024;}'
    done
60: 88.76G
45: 348.13G
30: 376.81G
14: 400.56G
7: 419.65G

I really think this should be changed from 90 days to 30 days at most.
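For reference, that would be a one-line change to the existing prune.sh (its current contents are quoted later in this task), replacing -mtime +90 with -mtime +30:

find /data/project/zoomviewer/public_html/cache -mtime +30 -delete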

@dschwen this tool is currently using 310 GB of space. Is that expected, or do we need to add more cleanup automation here?

Yes, this is expected. The tool generates multiresolution pyramids for very large (gigapixel) images on Commons. Generating the MRPs takes a while (job queue), so if we delete them the user experience will worsen.

(Side-rant: why can't the OAuth thingie bounce me back to the issue page when I log in? Instead it sends me to the Phabricator home page, which has horrible UX and does not help me get to where I need to be...)

Yeah, I guess we could go with that. What's a reasonable time? Also the retention would have to be based on atime rather than ctime/mtime. Do we have that available on that FS?
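If atime is usable there (that depends on how the NFS share is mounted, e.g. relatime vs. noatime, which would need checking), the prune would just switch the test; a sketch with a 30-day window:

find /data/project/zoomviewer/public_html/cache -atime +30 -delete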

Weird...

tools.zoomviewer@tools-sgebastion-07:~$ crontab -l
....
23   2  *   *   *      jsub ${HOME}/prune.sh >/dev/null 2>&1
tools.zoomviewer@tools-sgebastion-07:~$ cat prune.sh
#!/bin/bash

find /data/project/zoomviewer/public_html/cache -mtime +90 -delete

tools.zoomviewer@tools-sgebastion-07:~$

This should already prune everything older than 90 days.

But - also weird - this:

tools.zoomviewer@tools-sgebastion-07:~$ find /data/project/zoomviewer/public_html/cache -mtime +95
/data/project/zoomviewer/public_html/cache/070f21cdb8b1b00590ff07330e099b60.jpg
/data/project/zoomviewer/public_html/cache/ac356083d8fedd41536ad77190424301.jpg
/data/project/zoomviewer/public_html/cache/22fdf835b189f0ef7a31e30f9a2468ee.jpg
/data/project/zoomviewer/public_html/cache/3e813f5ca840bc5c2bd1b113a1a0d984.jpg
/data/project/zoomviewer/public_html/cache/fe4fb85e2cbf1db8ca10bac5d1f83b88.jpg
/data/project/zoomviewer/public_html/cache/080db89c7a28f3e7d51c6fb19d7e6426.jpg
/data/project/zoomviewer/public_html/cache/0548b48ae7c9d0874ca1d97f0ab9799a.jpg
/data/project/zoomviewer/public_html/cache/1d285299d8e60383ffca338e022bde1e.jpg
/data/project/zoomviewer/public_html/cache/4015e42e3567ea921a26fe2cda46e8cf.jpg
...

Looks like the prune script is not being run :-/

You have some jobs in error states. That may be related to old network issues and things. I can clear them up for you.

2111667 0.00000 zoom_f5aa1 tools.zoomvi Eqw   09/08/2020 23:20:47                                    1
2111789 0.00000 zoom_88080 tools.zoomvi Eqw   09/08/2020 23:29:39                                    1
2111790 0.00000 zoom_a4089 tools.zoomvi Eqw   09/08/2020 23:29:57                                    1
2011769 0.00000 zoom_52f2d tools.zoomvi qw    11/06/2020 00:30:27                                    1
2011770 0.00000 zoom_1f726 tools.zoomvi qw    11/06/2020 00:30:27                                    1

tools.zoomviewer@tools-sgebastion-08:~$ qstat -j 2111667
==============================================================
job_number:                 2111667
exec_file:                  job_scripts/2111667
submission_time:            Tue Sep  8 23:20:47 2020
owner:                      tools.zoomviewer
uid:                        51295
group:                      tools.zoomviewer
gid:                        51295
sge_o_path:                 /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /mnt/nfs/labstore-secondary-tools-project/zoomviewer/public_html
sge_o_host:                 tools-sgewebgrid-lighttpd-0911
account:                    sge
stderr_path_list:           NONE:NONE:/dev/null
hard resource_list:         h_vmem=2097152k
mail_list:                  tools.zoomviewer@tools.wmflabs.org
notify:                     FALSE
job_name:                   zoom_f5aa112f3022f0c3f11d679042ce764e
stdout_path_list:           NONE:NONE:/dev/null
jobshare:                   0
hard_queue_list:            task
env_list:                   TERM=NONE
job_args:                   f5aa112f3022f0c3f11d679042ce764e,Glaciar_Quelccalla.jpg
script_file:                /mnt/nfs/labstore-secondary-tools-project/zoomviewer/public_html/multires.sh
binding:                    NONE
job_type:                   binary
error reason          1:      can't get password entry for user "tools.zoomviewer". Either user does not exist or error with NIS/LDAP etc.
scheduling info:            Job is in error state
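Clearing jobs in this state typically means either clearing the error flag so the job can be scheduled again, or deleting it outright; standard Grid Engine commands, using the job ID from above:

qmod -cj 2111667   # clear the error state so the job can run again
qdel 2111667       # or delete the stuck job entirely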

Those are cleaned up. I'll see if I can find anything about that prune job.

Mentioned in SAL (#wikimedia-cloud) [2020-11-06T00:38:35Z] <bstorm> Running prune.sh job manually T248188

tools.zoomviewer@tools-sgebastion-08:~$ jsub ${HOME}/prune.sh
Your job 2011897 ("prune") has been submitted

tools.zoomviewer@tools-sgebastion-08:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1974458 0.25093 lighttpd-z tools.zoomvi r     11/05/2020 01:07:33 webgrid-lighttpd@tools-sgewebg     1
2011897 0.25000 prune      tools.zoomvi r     11/06/2020 00:38:03 task@tools-sgeexec-0940.tools.     1

So it is running, at least for now. I did not redirect stderr/stdout, so any issues will show up.

Thanks! Side question: do I need to change commonswiki.web.db.svc.eqiad to commonswiki.analytics.db.svc.eqiad.wmflabs? (I'm seeing some DB errors in the logs.)

Yes. commonswiki.web.db.svc.eqiad doesn't work. I'm not sure when that went away, though. commonswiki.analytics.db.svc.eqiad.wmflabs is correct.
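A quick way to confirm connectivity from a bastion, assuming the tool's standard replica credentials file and commonswiki_p as the replica database name:

mysql --defaults-file=$HOME/replica.my.cnf \
      -h commonswiki.analytics.db.svc.eqiad.wmflabs \
      commonswiki_p -e 'SELECT 1'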

There weren't enough stuck jobs to explain that job not running. Each user can launch something like 50 jobs.

It clearly was attempted by the cron server:
Nov 5 02:23:01 tools-sgecron-01 CRON[2759]: (tools.zoomviewer) CMD (jsub ${HOME}/prune.sh >/dev/null 2>&1)
However, since it redirects everything to /dev/null, I can't really tell what happened after that.
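One way to get more visibility next time would be to capture jsub's own output in a log file instead of discarding it, for example (assuming a $HOME/logs directory exists; the log path is just a suggestion):

23   2  *   *   *      jsub ${HOME}/prune.sh >> ${HOME}/logs/prune-cron.log 2>&1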

Aklapper edited subscribers, added: dschwen; removed: Phamhi.

Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!

nskaggs claimed this task.

Zoomviewer is presently using <300G, mostly in $HOME/public_html/cache. I don't see enormous numbers of log files anymore.