Fri, Feb 14
That last issue (the resolution failure) was a side-effect of work I was doing for T229441. That issue is resolved, but now the failure is
Thu, Feb 13
I'm not sure this is necessary. I'm currently experimenting with creating a recordset named '<project>.<instance>' in the codfw1dev.wikimedia.cloud zone (which is owned by cloudinfra-codfw1dev), and it works fine. Since it will only ever be the sink acting on the wikimedia.cloud zones, that's probably sufficient.
The patch to re-organize is:
This server is now drained and ready for whatever.
I think this is ready to go -- we can switch to upstream images once we have Cinder working.
@Petrb I'm going to remove the dumps and scratch mounts now, which shouldn't affect you. If at some point in the future you want to move to local storage just ping on this ticket and I can clean up the other mounts.
@Nemo_bis, @CristianCantoro, @akosiaris et al., can you confirm whether or not your project makes use of the Dumps mount on your VMs? If you do, that's fine; we're just trying to clean up unused mounts.
@DeltaQuad If you have things in the NFS home mount then it's probably fine to keep it; can you respond as to which of the other mounts (dumps/project/scratch) you are using? This will let me clean up unused mounts.
@Epantaleo Your project is mounting the 'dumps' NFS volume under /mnt/nfs/ -- we're just wondering if that's something that you use or if it can be cleaned up. Either way is fine.
@dschwen, can you respond as to which of (dumps, scratch, home, project) you're using in this project? Any of them is fine, we just want to clean up unused mounts. Thanks.
ok! Thanks for confirming.
Your description sounds like a pretty good use case for 'scratch' -- is that what you're using now, or are you doing your work in /data/project? (It may be that scratch is too slow for this purpose, but it might be worth a try.)
Tue, Feb 11
Hello Math project users! Can someone please respond here about your use of NFS (project, scratch, and dumps) and indicate if it's practical to stop using any of those three mounts?
It's useful to have it mounted -- one more thing to see break during VM creation.
@Maximilianklein using NFS is fine, we just wanted to eliminate the cases where it was mounted but unused.
@Addshore, can you please follow up on this? I see @LucasWerkmeister creating a fair number of files in /home but not much in /data/project.
@Hjfocs can you comment on whether or not the wikidata-primary-source-tool makes use of the nfs 'dumps' mount that's provided? We're trying to eliminate unnecessary mounts.
there are (at least) two remaining things here:
The patch to re-organize is:
Fri, Feb 7
This is quite a bit better now.
Thu, Feb 6
This is happening because yaml.safe_dump() (and yaml.dump()) does some weird arbitrary quoting of things:
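A minimal reproduction of that quoting behavior: PyYAML quotes any string that would otherwise parse as a different type (number, boolean) and leaves plain strings bare. (The key names here are made-up for illustration.)

```python
import yaml

# PyYAML quotes strings that would otherwise be read back as a
# different type, but leaves ordinary strings unquoted.
doc = {"plain": "hello", "looks_numeric": "42", "looks_boolean": "true"}
print(yaml.safe_dump(doc))
# looks_boolean: 'true'
# looks_numeric: '42'
# plain: hello
```

So the quoting isn't random, but it does look arbitrary when the values happen to be type-ambiguous strings.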
I've confirmed that the behavior with the yaml-based UI is correct. For the guided interface, strings are unquoted and non-string types (numbers, booleans, etc.) are quoted. Weird.
Wed, Feb 5
Hello! Can you give us a bit of info about what resources you expect to use? RAM, cores, disk space? Also, are you hoping to offload storage onto NFS or NFS/scratch? (If the latter, you may be disappointed by the performance.)
Tue, Feb 4
Ah, dammit, dc-ops missed this ticket and now 1013 is back in service on 1G. So it's no longer a good time to do this, there's real workload on that host.
As per https://phabricator.wikimedia.org/T171289, getting $project (and, possibly $deployment) from a fact may be problematic.
- email sent to wikitech-l and cloud-announce
I'll do it now.
For a few seconds' interruption I wouldn't expect this to be very disruptive. If you schedule it in my morning (e.g. 15:00 UTC) then I can send out notice to users &c. and be around in case unexpected things happen.
Mon, Feb 3
Yep, I can reproduce this and it's awful.
Tue, Jan 28
@Tpt sorry for the delay in responding to this -- lots of team travel lately. I've reset 2FA on wikitech for you, so you should be all set now.
Fri, Jan 24
I've re-read the code a bit, and it's not obvious to me that we need to specify a tenant to sink (outside of the implicit tenant associated with the zone id from the config). So for starters let's just try swapping in a tenant-owned zone and see if everything just works.
Wed, Jan 22
Tue, Jan 21
@Papaul, please decom associated disk shelves when you pull these servers. Thank you!
"This command group will be removed in 17.0.0 (Queens). The quota_usage_refresh subcommand has been deprecated and is now a no-op since quota usage is counted from resources instead of being tracked separately."
The decom task is now in Papaul's hands; everything else is done.
@Papaul, I'm not positive but I think these servers have disk shelves attached. If so those shelves can also be decom'd at the same time.
Added a clearer error handler
Sun, Jan 19
ran "sudo ufw allow ssh" on the instance
Jan 17 2020
With merged patches, this is still activated by a systemd timer, but has a retry loop. I think that gets us what we want: we'll get the same alert as before, but only if the script fails three times in a row over several minutes.
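A minimal sketch of that retry pattern (hypothetical: the actual patch's script name, attempt count, and interval may differ):

```shell
# Hypothetical retry helper: run the given command up to 3 times,
# sleeping between attempts; exit non-zero (so the systemd unit fails
# and the alert fires) only if every attempt fails.
retry() {
    max=3
    delay="${RETRY_DELAY:-120}"    # placeholder: seconds between attempts
    i=1
    while [ "$i" -le "$max" ]; do
        if "$@"; then
            return 0               # success on any attempt: no alert
        fi
        i=$((i + 1))
        [ "$i" -le "$max" ] && sleep "$delay"
    done
    return 1                       # three failures in a row: unit goes to "failed"
}
```

The effect is the same alert as before, but only for sustained failures rather than a single transient one.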
Note that this project is slated for deletion. I failed to notify the admins (or shut down VMs) in a timely fashion, so the deletion is scheduled for 2020-02-15. It's quite likely that this problem will go away then.
Jan 16 2020
Jan 15 2020
@Bstorm I'm still hoping for your confirmation that this is all working so we can shut down the old servers.
This looks better to me now:
The right fix for this is to build a new monitoring system, which I'm not going to dive into immediately.