Drives not being mounted at /srv is the right behavior. The LVMs aren't mounted by default because if they were, our bdsync-based backups would fail, complaining that the LVMs were mounted. Puppet defines /srv/backup as the canonical location for mounting our LVMs manually - https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/labs/nfs/secondary_backup/misc.pp#L5 - in case we need to look at or test things.
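For reference, a minimal sketch of what a manual mount for inspection could look like (the VG/LV names here are hypothetical - check with lvs first):
```
sudo lvs                                   # list the logical volumes on the host
sudo mount /dev/backup/misc /srv/backup    # mount one LV manually for inspection
# ... look at or test things ...
sudo umount /srv/backup                    # unmount again so bdsync backups don't fail
```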
Mon, Jan 15
Thu, Jan 11
Here's a draft of the failover plan for the dumps distribution servers:
Tue, Jan 9
@Discasto Please continue the conversation on the same ticket, thank you. Could you check now?
@Discasto Should be resolved now, could you check, and resolve this task if things are good? Thanks!
Looks like the flannel pod in tools-paws-worker-1007 was also stuck in a weird state.
@jcrespo Thanks for pointing that out! Will add that to our docs.
Sat, Jan 6
@bd808 FYI, I also cleaned up the DNS entries for these replicas; see the patch in the comments above.
PAWS cluster DNS broke too, and all the workers had switched the default policy for Chain FORWARD to DROP again. I fixed it by running sudo iptables -P FORWARD ACCEPT across the paws workers. So these two things seem related.
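For the record, a rough sketch of applying the fix across the workers (the hostname range is illustrative, and this assumes ssh access to each node):
```
# reset the default FORWARD policy on every paws worker
for host in tools-paws-worker-{1001..1007}; do
  ssh "$host" 'sudo iptables -P FORWARD ACCEPT'
done
# verify on one node that the policy took effect
ssh tools-paws-worker-1007 'sudo iptables -S FORWARD | head -1'
```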
Fri, Jan 5
Noting that I merged https://gerrit.wikimedia.org/r/353508 and applied profile::wmcs::nfs::ferm to the new dumps distribution servers labstore1006 and labstore1007, and the ferm rules seem to be working well.
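A quick sketch of how the rules can be spot-checked (this assumes the standard /etc/ferm/ferm.conf path; the flags are ferm's documented dry-run options):
```
sudo ferm --noexec --lines /etc/ferm/ferm.conf  # dry run: print the rules ferm would load
sudo iptables -L -n -v | head -20               # spot-check the live ruleset
```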
We'll do the upgrade to stretch for all labstore servers as a separate step after testing stretch for NFS. We would like to have parity in OS versions across all the labstores to keep operational overhead minimal. I'm opening a separate task - T184290 - for upgrading labstore* to stretch, and resolving this one for now.
Thu, Jan 4
@Cyberpower678 Any update on this? Thanks!
@jcrespo Sounds great! Let's puppetize and tweak later if needed. Thank you :)
With T184018, this should be resolved for now. We plan to upgrade to the newer k8s version soon to mitigate the iptables rule issue. Let's reopen this or make a new task if the issue happens again.
Tue, Jan 2
We are at high utilization by the dumps project again - 2T of 5T of storage available. Please clean up excess files and data soon, thank you!
Thu, Dec 28
I think this is fixed now, and PAWS is back up.
Dec 15 2017
The script is fixed now and runs fine. Leaving this open until we remove all the metadata for labsdb1001 and labsdb1003 from labsdbaccount.account_host post-decommission.
I dropped all the status=absent accounts from labsdbaccount.account_host for labsdb1001 and 1003.
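For the record, a hedged sketch of the kind of cleanup statement involved (the column holding the host name is an assumption - check the account_host schema first; only the status column is known from the above):
```
# drop account_host rows left over from the decommissioned hosts;
# 'hostname' and the exact matching are illustrative
sudo mysql labsdbaccount -e \
  "DELETE FROM account_host WHERE status='absent' AND hostname IN ('labsdb1001','labsdb1003');"
```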
Dec 6 2017
@elukey I recommend copying the home directories from notebook1002, backing them up somewhere on notebook1001, and sending a note to analytics and research-l asking folks to just use 1001. I don't think anyone uses 1002, but a few users have notebooks there, so notifying them would be good.
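A minimal sketch of the copy step (the source path and destination here are assumptions - adjust to wherever home directories actually live on these hosts):
```
# run on notebook1002; preserves permissions, ACLs, xattrs and hard links
sudo rsync -aHAX --numeric-ids /home/ \
  notebook1001.eqiad.wmnet:/srv/backup/notebook1002-home/
```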
Nov 27 2017
Nov 20 2017
With https://gerrit.wikimedia.org/r/#/c/391892/, this is all done now.
Nov 17 2017
Raw notes from the Etherpad on rolling this all out:
Nov 16 2017
This is all done now :)
Nov 13 2017
Everything seems to be good now, I'm resolving this task. Thanks a ton @Marostegui!
@jcrespo Understood, I wasn't aware of that. We are on the right track then :)
@jcrespo That makes sense, I didn't know the private data was the reason we didn't do the wildcard grants. Let's leave it as is then. @aborrero may soon be working on automating our flow for importing a new DB into the replicas and setting up access a little better, and we can explore giving the grant at a per-view-database level every time we do that, in an automated fashion.
Nov 10 2017
@aborrero and I caught up on this, and it looks like all the DNS records are created now:
Nov 9 2017
Noting here that proprietary software is not usually installed on WMCS environments per https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use#What_uses_of_Labs_do_we_not_like.3F (Proprietary Software).
@Marostegui right, okay. Thanks! Do we have a ticket for this issue?
@aborrero FYI, after Manuel's magic, I've run
@Marostegui It works now! What did you have to do?
Also running directly on labsdb1011,
Nov 8 2017
When the config is account required nologin.so, I've only been able to reproduce this behavior during the firstboot stage. I've tried applying auth required nologin.so post-boot to see how the behavior changes, and have been able to log in every time, despite that config being present.
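For context, a small sketch of how this can be exercised by hand (assumes sshd's PAM stack at /etc/pam.d/sshd; the stock Debian module name is pam_nologin.so, and the test user is hypothetical):
```
sudo touch /var/run/nologin     # the file whose presence triggers the nologin module
grep nologin /etc/pam.d/sshd    # see whether it's wired in as 'auth' or 'account'
ssh testuser@localhost          # expect login to be refused for non-root users
sudo rm /var/run/nologin        # clean up afterwards
```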
Nov 3 2017
I've fixed the grants for pawikisource_p now.
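For reference, a hedged sketch of the sort of grant involved (the grantee role name here is an assumption):
```
# re-grant read access on the _p views database to the shared replica role
sudo mysql -e "GRANT SELECT, SHOW VIEW ON \`pawikisource_p\`.* TO 'labsdbuser';"
```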
Nov 2 2017
Nov 1 2017
It looks like it may be time to say goodbye to this server. I've spent some time today looking at the state of the storage configuration, the damage, and whether anything at all can be done to recover the disk.
Disk setup for labsdb1001
Oct 30 2017
The labsdb1001 reboot is all done. Notes from my planning etherpad:
Oct 27 2017
I've now rolled this out to labsdb10[01|03|09|10|11]. @Marostegui Is there a file/config/log somewhere you'd like me to persist these grants in? Thanks for your help :)
Cool, I've run
@Marostegui Sounds good, thanks
@Marostegui Yeah that sounds right to me! Cool if I run that across the wiki replicas?
Oct 26 2017
Started a planning doc for the reboots here - https://etherpad.wikimedia.org/p/labsdb-reboots
Fixed! @MusikAnimal can you verify that your credentials work now and close this? Thank you :)
Oct 25 2017
+1 That sounds like the right thing to do
@bd808 I looked at the account setup we have now, and it looks like the labsdbadmin user is already set up with remote (specific IPs) permissions, but it only has Grant_priv and Create_user_priv, which we in turn use to create accounts and grant view privileges for Toolforge users/tool accounts.
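A quick way to confirm this from the shell (the columns are from the standard mysql.user table):
```
sudo mysql -e \
  "SELECT User, Host, Grant_priv, Create_user_priv FROM mysql.user WHERE User='labsdbadmin';"
```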
Awesome thanks @akosiaris!
Oct 24 2017
I've updated the lists, and our wiki here - https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown
Proposed timing for the 2 reboots:
Reopening since we are scheduling the labsdb1001 and labsdb1003 reboots over the next couple of weeks.