This might be related to this:
codfw1-dev is running Ocata, and I've scheduled an upgrade window for eqiad1. Approximate steps are...
This project has been deleted.
The new cloudbackup hosts are now up and running. The tools backup happens on Thursdays and the misc projects backup happens on Tuesdays.
All of these VMs have been deleted and the project is pending deletion at the end of the year.
Wed, Dec 4
As per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2019_Purge, this project is now a candidate for deletion since no one has claimed it on wiki or responded to my emails. It's not too late to indicate otherwise on that page if it's still of use to someone.
Mon, Dec 2
I see nothing at all in the syslog that would explain this crash -- just an empty spot
Wed, Nov 27
This is an actively used server. It hasn't been reporting to puppet because of:
this is now deployed on eqiad1-r
@Krenair when you have time can you retest this? I suspect that it was fixed by various upgrades.
closing until someone reports this happening again
I think we have as many of these as we need now :)
I think this is moot now since the migration is long since finished.
now using a new database local to codfw1dev
we don't have any near-term plans to support this.
resolved with instance-puppet git repo
I think this is largely resolved -- we have monitoring that keeps us from having too many eggs in one basket.
this can be closed, can't it?
I'm going to close this pending a re-appearance of the issue
this is resolved with the new git backend for horizon puppet config.
this hasn't been an issue lately.
now the yaml backend is the default.
I think this is still happening:
this seems to be fixed.
I have mournfully ripped out the code that manages class documentation :(
Because this requires coordination with DBAs and pooling/depooling it's not straightforward to automate.
this is just a symptom of T239347, so closing in favor of that
Also, can we maybe be more specific about that grant and use certain IPs instead of using %? For both nova and the new nova_cell0.
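For illustration, a narrower grant might look roughly like this -- a sketch only; the client IP 203.0.113.10 is a documentation placeholder, not a real cloud DB host, and the exact schema/user names would need checking against the live grants:

```shell
# Hypothetical sketch: per-schema grants from a specific client IP instead of '%'.
mysql -u root -e "
  GRANT ALL PRIVILEGES ON nova.*       TO 'nova'@'203.0.113.10';
  GRANT ALL PRIVILEGES ON nova_cell0.* TO 'nova'@'203.0.113.10';
"
# Once the narrow grants are in place and verified, the old wildcard
# grant could be dropped:
#   REVOKE ALL PRIVILEGES ON nova.* FROM 'nova'@'%';
```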
Tue, Nov 26
yep, tagged you by mistake.
Those steps sound right to me. Backups would be nice -- I think the nova db is backed up but I'm not positive.
I need to alter this so it doesn't make a commit if there was nothing to delete.
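The guard could look something like this -- a minimal sketch assuming the tool stages its deletions in a git checkout (the throwaway repo, user config, and commit message below are placeholders, not the real tooling):

```shell
# Sketch: only commit when the cleanup pass actually staged a change.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=ci -c user.email=ci@example.org commit -q --allow-empty -m init

git add -A .   # stage whatever the cleanup pass deleted (nothing, in this run)
if git diff --cached --quiet; then
    # Index matches HEAD: nothing was deleted, so skip the empty commit.
    echo "nothing to delete; skipping commit"
else
    git -c user.name=ci -c user.email=ci@example.org commit -q -m "prune deleted entries"
fi
```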
Mon, Nov 25
I sent the initial survey invitation this morning.
Fri, Nov 22
I just ran an experiment forcing my traffic from one labweb to the other, and my session persisted. So it's not a split-brain issue, or at least not an obvious one.
done -- logs are nice and quiet now.
That patch seems to quiet the alerts; I'll see about building and deploying
I wiped arturo's tokens from the keystone database.
Thu, Nov 21
I upgraded pdns to version 4 yesterday and now there's a lot more of this. I don't see the metrics being complained about defined in prometheus-pdns-exporter so I'm not sure how to address this -- @MoritzMuehlenhoff if you want to point me in the right direction I'm happy to do the coding.
eqiad1 (cloudservices1003/1004) now running Ocata Designate.
I reproduced what Arturo is seeing -- the session cookie is present /until/ I visit horizon, at which point it's cleared. So Horizon definitely thinks that we're not allowed. It also looks to me like the keystone tokens are created correctly (with 7-day lifespan) so I'm not sure who is making the decision that our access has expired.
Wed, Nov 20
I'm having trouble reproducing this reliably enough to debug. If this happens to someone else, please paste the contents of your sessionid cookie here before logging in again so I can try to track things down.
For the new project, please file a ticket here: https://phabricator.wikimedia.org/project/profile/2875/