That brings us down to /dev/drbd4 8.0T 5.6T 2.1T 73% /srv/tools. The user tickets should bring things well into the safe zone when their cleanups are done.
That was enough to get a recovery. However, it seems like a good idea to see what users can clean up since there are projects taking up quite significant space.
The bigger files:
19749772 KB /srv/tools/shared/tools/project/request/error.log
21072788 KB /srv/tools/shared/tools/project/mediawiki-feeds/error.log
22473872 KB /srv/tools/shared/tools/project/wikidata-primary-sources/error.log
22900348 KB /srv/tools/shared/tools/project/khanamalumat/qaus.err
23343528 KB /srv/tools/shared/tools/project/cluebotng/logs/relay_irc.log
24260512 KB /srv/tools/shared/tools/project/fiwiki-tools/logs/seulojabot2.log
24343364 KB /srv/tools/shared/tools/project/ifttt/www/python/src/ifttt.log
26970304 KB /srv/tools/shared/tools/project/mix-n-match/error.log
27890700 KB /srv/tools/shared/tools/project/img-usage/public_html/wikidata-20170130-all.json
31437236 KB /srv/tools/shared/tools/project/freebase/freebase-rdf-latest.gz
31811904 KB /srv/tools/shared/tools/project/wdumps/dumpfiles/generated/wdump-1107.nt.gz
31811908 KB /srv/tools/shared/tools/project/wdumps/dumpfiles/generated/wdump-1104.nt.gz
32818048 KB /srv/tools/shared/tools/project/khanamalumat/purawiki.err
34621292 KB /srv/tools/shared/tools/project/verification-pages/verification-pages/log/production.log.1
34792852 KB /srv/tools/shared/tools/project/geohack/error.log
35880272 KB /srv/tools/shared/tools/project/wdumps/dumpfiles/generated/wdump-1097.nt.gz
36023964 KB /srv/tools/shared/tools/project/ping08bot/mybot.out
36285016 KB /srv/tools/shared/tools/project/wiki2prop/prediction_ranked_Wiki2PropDEPLOY_year2018_embedding300LG_DEPLOY.h5
49303704 KB /srv/tools/shared/tools/project/splinetools/dumps/enwiki-20141106-pages-articles.xml
64778744 KB /srv/tools/shared/tools/project/wikidata-analysis/public_html_tmp/dumpfiles/json-20191125/20191125.json.gz
78643272 KB /srv/tools/shared/tools/project/robokobot/virgule.err
89133980 KB /srv/tools/shared/tools/project/.shared/dumps/20201221.json.gz
89481636 KB /srv/tools/shared/tools/project/.shared/dumps/20210104.json.gz
101857128 KB /srv/tools/shared/tools/project/magnus-toolserver/error.log
107005676 KB /srv/tools/shared/tools/project/meetbot/meetbot.out
107035912 KB /srv/tools/shared/tools/project/meetbot/logs/messages.log
194101748 KB /srv/tools/shared/tools/project/mix-n-match/mnm-microsync.err
A few of those are easy enough to just clean up myself.
Running ionice -c 3 nice -19 find /srv/tools -type f -size +100M -printf "%k KB %p\n" > tools_large_files_20210119.txt
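For the ones I can clean up myself, the safe move on a shared NFS volume is to truncate logs in place rather than delete them, so any process still holding the file open keeps a valid descriptor. A minimal sketch of that step (the function name and the `*.log` filter are illustrative, not the exact commands used here):

```shell
# Truncate (not delete) oversized log files under a root directory, so
# writers holding the file open keep a valid fd and the space is reclaimed.
# Threshold defaults to +100M to match the find scan above.
cleanup_large_logs() {
  local root="$1" threshold="${2:-+100M}"
  find "$root" -type f -name '*.log' -size "$threshold" \
    -exec truncate -s 0 {} +
}
```

Deleting an open log file on NFS would instead leave the space allocated (or create `.nfs*` silly-rename files) until the writer closes it.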
It's nice to see the alert being accurate these days.
/dev/drbd4 8.0T 6.3T 1.4T 83% /srv/tools
Fri, Jan 15
This is done.
This is done.
This is done.
The project is now created with default quotas, which should be able to get you started.
This is done.
This one is all set.
@Missvain Please check now. This service doesn't always recover well after an outage and needs a kick. The network failed for a while during T261134: upgrade cloud-vps openstack to Openstack version 'Stein', which would quite likely cause this.
That fixed it:
Thu, Jan 14
Project is created with default quotas. You should have access via Horizon now.
Done. You should be good to go. Let me know if you don't have access to the IP or something.
This project is created with default quotas. Please try it out at https://horizon.wikimedia.org
The write throttle is unchanged, partly because we haven't upgraded the DRBD network yet. At the very least, NFS reads should no longer feel like you are mounting it over a cell phone network.
The read throttle on bastions is now much higher.
This task should be linked to this patch. Oops: https://gerrit.wikimedia.org/r/c/operations/puppet/+/655952
I haven't been able to come back around to this yet. It seems like it could be added to the extensions in the singleuser image in https://github.com/toolforge/paws/blob/master/images/singleuser/install-extensions
Wed, Jan 13
Approved in weekly meeting.
Approved in weekly meeting
Tue, Jan 12
$ kubectl -n maintain-kubeusers logs maintain-kubeusers-7f7b44754c-mgrjj
starting a run
Homedir already exists for /data/project/adhs-wde
Wrote config in /data/project/adhs-wde/.kube/config
Provisioned creds for user adhs-wde
finished run, wrote 1 new accounts
That fixed it. This was likely caused by a latency issue in etcd slowing down the cleanup of a failed request. Until we can make etcd more performant (T267966) we are going to see issues around that, so I think I need to teach this service how to clean up after itself (will create subtask).
maintain-kubeusers-7f7b44754c-kkm76 0/1 CrashLoopBackOff 1513 32d
Unfortunately, the problem is the latter.
This suggests your tool does not have authentication credentials created. That either means you beat the service that creates them, or the service is broken.
I'll check that out. The script starts up every minute, but that's clearly not right.
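Per the maintain-kubeusers log output above, provisioned credentials land in a kubeconfig under the tool's home directory, so a quick way to tell the two cases apart is to check for that file. A sketch (the helper name is mine; the path layout comes from the "Wrote config in .../.kube/config" log line):

```shell
# Return success if the given tool home directory has a non-empty
# kubeconfig, i.e. maintain-kubeusers has provisioned credentials for it.
has_kube_creds() {
  local tool_home="$1"
  [ -s "$tool_home/.kube/config" ]
}
```

If the file is missing while the service's pod is in CrashLoopBackOff, the service is the problem, not the tool.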
Mon, Jan 11
I believe all replicas pass puppet now (after creating that grant). @Marostegui if you can check that the software is doing what it should be doing now, I think this can be closed.
Yep, the user is not created. Creating the grant using the info in modules/role/templates/mariadb/grants/wiki-replicas.sql since that appears to be what is in the existing replicas.
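The shape of that fix is: feed the grants file to the server, then confirm the user exists. A sketch under stated assumptions (`MYSQL` is overridable for dry runs and is my addition; in practice this would be run as an admin, e.g. via `sudo mysql`, on the affected replica):

```shell
# Apply a grants SQL file and then verify the wmf-pt-kill user was created.
MYSQL="${MYSQL:-mysql}"
apply_wiki_replica_grants() {
  local grants_file="$1"
  "$MYSQL" < "$grants_file" &&
    "$MYSQL" -e "SELECT User, Host FROM mysql.user WHERE User = 'wmf-pt-kill';"
}
```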
That patch was safe on the old servers (no change). On the multi-instance I see the error: Access denied for user 'wmf-pt-kill'@'localhost'. That sounds like it is pretty close to working.
Sat, Jan 9
We confirmed this is the standby, so it won't impact the cloud during this nonsense (and thus isn't an "unbreak now" or a real outage).
I just checked the web console, and apparently the network adapter's status is "unknown"
That might just be this version of iLO being Helpful, though. On Monday, if this is under warranty, we could possibly parse the active health log (if it is enabled).
That last one is obviously from much earlier, but that's kinda weird.
Does it have broken hardware or something? This is from dmesg:
That error is happening a fair bit. Dunno if that is related.
I was about to make another ticket for it until I saw your comment.
Fri, Jan 8
So at this point, this has been in a failover state for a couple of months. The last time this happened we gave up and failed back (and it happened again). I believe the warranty expired in 2020, so the opportunity to fix this during the last round of sudden reboots is already gone. That might not leave us with much. The system is strained while in a failover state, and it has no automatic HA.
According to T268285: update RAID controller firmware on labstore1006, 1007, we are already on recent firmware with regard to this issue. I'd briefly discussed involving HPE to get a fix with @Jclark-ctr back on that ticket, but I'm not sure that was done or if we have a service agreement/warranty either way.
This is the same failure. Not a very useful suggestion though:
I suspect this is related to https://lists.wikimedia.org/pipermail/wikitech-l/2020-November/094044.html
Looks fine. Thanks for finding the bug.
Thu, Jan 7
@aborrero I need to sync up with you on the naming and IPVS stuff here when you have time. I'll suggest a scheduled time if I miss you tomorrow.
At this point, each proxy can route to the new replicas as well as the old ones, but it only routes to each instance's primary. I presume we want the analytics replica to be the standby for "web" and vice versa, right @Marostegui? That seems better than requiring manual intervention if something happens.