It's similar to what we are doing with labstore1004/1005, if not the same model. I believe those are in neighboring racks.
Took about a day and a half to leak 7 instances
Wed, Mar 22
seems that way, I didn't see that SAL entry and texted @Marostegui to ask (sorry buddy!)
Tue, Mar 21
Let's hold this one, out of the pending 3, for last; I want to do some more review of the CPU specs, since the existing mix is such a hodgepodge and our model is in flux.
Thanks for the overview @Freddy2001. We'll get to this within the week.
Mon, Mar 20
ignores permissions and does not reproduce:
root@dumps-stats:/data/project# ls -lh wikistats
This is still super important, but the immediate issue of this task seems resolved, and it is followed up in T159721: labvirt1001 and 1002 cannot launch new VMs
Sat, Mar 18
Fri, Mar 17
This is a known issue and T158420 will resolve it, but at present there is no mechanism for per-maintainer (per-user) replica creds, only per-tool creds. It's in progress though.
I'm not sure if this is the right solution; it's almost certainly not a good one. How large are the backups expected to be?
closing due to age and inactivity (seems fixed?)
closing due to age and inactivity, I don't think this has been an issue
no longer accurate
closing due to age and inactivity
closing this due to age and inactivity
this is long since stale, I believe
this seems not to be an issue, and I'm not inclined to worry about it given current demand
Precise is no longer supported
Since these are production servers, it seems most appropriate that they appear in prod Graphite/Prometheus
I'm closing for age and lack of activity
as of efcac33f8a5d00427a0593e9e7b6e8a020c86f40 this is hopefully at least viable and further work should be tracked in specific issues
considering age and no activity I'm bouncing this task
Eventually this workload moves to k8s with all others but for now I'm marking this declined with https://phabricator.wikimedia.org/T156981#3077562
Since this is >1yr old and we haven't updated it at all I'm going to close in favor of resurfacing the issue if needed
We have the basics of this now:
We removed paramiko from the backup pipeline
We have this running as of 99ef86ae0e2b74370e543d3fe22a46e8b0928df3 and have found several issues from the normalized and ongoing testing
A note that the appointed time grows nigh, and this is quickly becoming the most mysterious item left on the list:
should be good to go, let me know if not
We want to experiment with enabling lookupcache=all everywhere. This is currently set on the k8s workers and the bastions, afaict. Passing to @madhuvishy from conversations yesterday. It appears the historical reasoning for disabling lookupcache is not well understood, and we should look at how changing this affects an active mount (remount?) and consider how to roll it out so things are consolidated. Not only are bastions different from trusty exec nodes atm, but also from k8s workers. That's the sort of inconsistency we'll spend no end of time fighting.
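A rough sketch of the check and the change, using /public/dumps as an example mount point; the exact options live in Puppet, and whether a plain remount even picks up the new value is exactly the open question above:

# show the options currently negotiated for the mount
nfsstat -m | grep -A1 /public/dumps

# attempt to apply lookupcache=all on the live mount; this may need a full
# umount/mount cycle rather than a remount, which is what we want to confirm
sudo mount -o remount,lookupcache=all /public/dumps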
labstore1003.eqiad.wmnet:/dumps nfs4 28T 18T 11T 64% /public/dumps
Thu, Mar 16
It should appear on a Puppet run sometime in the next hour or so.
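If you don't want to wait for the scheduled run, something like this on the instance should pick it up right away (assuming you have sudo there):

# force an immediate Puppet agent run instead of waiting for the ~30 minute cycle
sudo puppet agent --test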
this is no longer allowed
@dschwen is 2FA working for you via wikitech currently? Can you disable and re-enable 2FA and see if it works in both venues?
I'll attempt to run the view generation when I can in the next few days
My understanding is this has been put on hold until TBD. I don't want to leave this request for hardware open and create confusion, and I'm not sure what the specs and needs will be when things come back around.
I'm working from the assumption this issue is fine now.
@Sabas88 we could configure /public/dumps to be available on all instances in this project. Is that acceptable?
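For reference, the end state on each instance would be a read-only NFS mount roughly along these lines; the export matches what labstore1003 serves today, but the real mount options are managed by Puppet, so treat this as a sketch:

# illustrative only; Puppet manages the actual mount and options
sudo mkdir -p /public/dumps
sudo mount -t nfs4 -o ro labstore1003.eqiad.wmnet:/dumps /public/dumps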
marking resolved, as this may be an artifact of bad config that's since gone; let's track new happenings in new tickets
@Kanashimi can you speak to this?
I spoke with @yuvipanda and in his words 'this idea should die in a fire'
raw etherpad script for posterity
seems like this is sorted, we'll reopen if issues surface
Wed, Mar 15
This maintenance lasted just shy of an hour. All Kubernetes services
should have been back at around 50 minutes in. This was longer than the
expected 30 minutes due to extra time for initial depooling of existing Pods. At
the moment all Kubernetes services seem to be functioning as expected.
Non-Kubernetes Tool Labs functions were not impacted.