2021-01-17: tools NFS share cleanup
Closed, Duplicate
Public

Description

Today this paged:

NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1263267 MB (15% inode=81%):

Similar to T247315

Event Timeline

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2021-01-17T16:53:53Z] <arturo> icinga downtime labstore1004 /srv/tools space check for 3 days (T272247)

It's nice to see the alert being accurate these days.
/dev/drbd4 8.0T 6.3T 1.4T 83% /srv/tools

Got some clear heavy users here:

[Screenshot: Screen Shot 2021-01-19 at 9.41.49 AM.png]

I'll run a du as well for file level grabs.

Running:

ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%k KB %p\n" > tools_large_files_20210119.txt
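
find writes its results in traversal order, not size order, so the output file needs sorting before the biggest files stand out. A minimal sketch of one way to rank it, assuming the `%k KB %p` line format from the command above; `largest_files` is a hypothetical helper and the default of 30 entries is arbitrary:

```shell
# Hypothetical helper: print the N largest entries (ascending) from the find output.
# Each input line is expected to look like: "<size-in-KB> KB <path>"
largest_files() {
    # Numeric sort on the leading size column, then keep the last N lines.
    LC_ALL=C sort -n -- "$1" | tail -n "${2:-30}"
}
```

Usage would be something like `largest_files tools_large_files_20210119.txt 30`.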

The bigger files:

19749772 KB /srv/tools/shared/tools/project/request/error.log
21072788 KB /srv/tools/shared/tools/project/mediawiki-feeds/error.log
22473872 KB /srv/tools/shared/tools/project/wikidata-primary-sources/error.log
22900348 KB /srv/tools/shared/tools/project/khanamalumat/qaus.err
23343528 KB /srv/tools/shared/tools/project/cluebotng/logs/relay_irc.log
24260512 KB /srv/tools/shared/tools/project/fiwiki-tools/logs/seulojabot2.log
24343364 KB /srv/tools/shared/tools/project/ifttt/www/python/src/ifttt.log
26970304 KB /srv/tools/shared/tools/project/mix-n-match/error.log
27890700 KB /srv/tools/shared/tools/project/img-usage/public_html/wikidata-20170130-all.json
31437236 KB /srv/tools/shared/tools/project/freebase/freebase-rdf-latest.gz
31811904 KB /srv/tools/shared/tools/project/wdumps/dumpfiles/generated/wdump-1107.nt.gz
31811908 KB /srv/tools/shared/tools/project/wdumps/dumpfiles/generated/wdump-1104.nt.gz
32818048 KB /srv/tools/shared/tools/project/khanamalumat/purawiki.err
34621292 KB /srv/tools/shared/tools/project/verification-pages/verification-pages/log/production.log.1
34792852 KB /srv/tools/shared/tools/project/geohack/error.log
35880272 KB /srv/tools/shared/tools/project/wdumps/dumpfiles/generated/wdump-1097.nt.gz
36023964 KB /srv/tools/shared/tools/project/ping08bot/mybot.out
36285016 KB /srv/tools/shared/tools/project/wiki2prop/prediction_ranked_Wiki2PropDEPLOY_year2018_embedding300LG_DEPLOY.h5
49303704 KB /srv/tools/shared/tools/project/splinetools/dumps/enwiki-20141106-pages-articles.xml
64778744 KB /srv/tools/shared/tools/project/wikidata-analysis/public_html_tmp/dumpfiles/json-20191125/20191125.json.gz
78643272 KB /srv/tools/shared/tools/project/robokobot/virgule.err
89133980 KB /srv/tools/shared/tools/project/.shared/dumps/20201221.json.gz
89481636 KB /srv/tools/shared/tools/project/.shared/dumps/20210104.json.gz
101857128 KB /srv/tools/shared/tools/project/magnus-toolserver/error.log
107005676 KB /srv/tools/shared/tools/project/meetbot/meetbot.out
107035912 KB /srv/tools/shared/tools/project/meetbot/logs/messages.log
194101748 KB /srv/tools/shared/tools/project/mix-n-match/mnm-microsync.err
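
The list above is per file, and several tools (wdumps, meetbot, mix-n-match, khanamalumat) appear more than once. A rough per-tool total could be derived from the same output file. This is a sketch under assumptions, not something recorded in this task: `per_tool_totals` is a hypothetical helper, it takes the tool name to be the path component after `project`, and it only counts files above the 100M cutoff used by the find.

```shell
# Hypothetical helper: sum large-file sizes per tool, largest first.
# Each input line is expected to look like:
#   "<size-in-KB> KB /srv/tools/shared/tools/project/<tool>/<rest of path>"
per_tool_totals() {
    awk '
        {
            # Field 3 is the path (up to the first space, which is enough
            # to reach the tool name).
            n = split($3, parts, "/")
            tool = "(unknown)"
            for (i = 1; i < n; i++) {
                if (parts[i] == "project") { tool = parts[i + 1]; break }
            }
            total[tool] += $1
        }
        END {
            # find %k reports 1 KB blocks, so divide twice by 1024 for GiB.
            for (t in total) printf "%.1f GiB %s\n", total[t] / 1024 / 1024, t
        }
    ' "$1" | LC_ALL=C sort -rn
}
```

Usage would be something like `per_tool_totals tools_large_files_20210119.txt | head`.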

A few of those are easy enough to just clean up myself.
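
The SAL entries below record truncations but not the command used. Truncating rather than deleting keeps the path and inode in place, whereas a file deleted while a process still has it open does not release its space until that process closes it. A minimal sketch, assuming GNU coreutils `truncate` is available; `truncate_log` is a hypothetical helper, not the command that was actually run:

```shell
# Hypothetical helper: empty a log file in place, printing its size first.
truncate_log() {
    file=$1
    # Refuse anything that is not an existing regular file.
    [ -f "$file" ] || { echo "not a regular file: $file" >&2; return 1; }
    # Record the size before truncating, for the admin log entry.
    du -h -- "$file"
    # Set the file length to zero without removing or replacing the file.
    truncate -s 0 -- "$file"
}
```

Usage would be something like `truncate_log /data/project/mix-n-match/mnm-microsync.err`.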

Mentioned in SAL (#wikimedia-cloud) [2021-01-19T22:34:50Z] <bstorm> truncating 194 GB error log '/data/project/mix-n-match/mnm-microsync.err' T272247

Mentioned in SAL (#wikimedia-cloud) [2021-01-19T22:43:03Z] <bstorm> truncated 107GB log '/data/project/meetbot/logs/messages.log' T272247

Mentioned in SAL (#wikimedia-cloud) [2021-01-19T22:48:30Z] <bstorm> truncated 100GB error log /data/project/magnus-toolserver/error.log T272247

Mentioned in SAL (#wikimedia-cloud) [2021-01-19T22:57:43Z] <bstorm> truncated 75GB error log /data/project/robokobot/virgule.err T272247

That was enough to get a recovery. However, it still seems worth seeing what users can clean up, since several projects are taking up significant space.

Mentioned in SAL (#wikimedia-cloud) [2021-01-19T23:30:37Z] <bstorm> truncated 36GB mybot.out file T272247

Mentioned in SAL (#wikimedia-cloud) [2021-01-19T23:32:30Z] <bstorm> truncated 34GB error log file that was full of warnings like "Only variables should be passed by reference in /data/project/geohack/public_html/geohack.php on line 192" T272247

That brings us down to /dev/drbd4 8.0T 5.6T 2.1T 73% /srv/tools. The user tickets should bring things well into the safe zone when their cleanups are done (one is already done).

We got paged again today:

PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1259132 MB (15% inode=80%):

server check:

/dev/drbd4      8.0T  6.4T  1.3T  85% /srv/tools

Mentioned in SAL (#wikimedia-cloud) [2021-02-05T10:59:02Z] <arturo> icinga-downtime labstore1004 tools share space check for 1 week (T272247)