Fri, Nov 17
Raw notes from Etherpad in rolling this all out:
Thu, Nov 16
This is all done now :)
Mon, Nov 13
Everything seems to be good now, I'm resolving this task. Thanks a ton @Marostegui!
@jcrespo Understood, I wasn't aware of that. We are on the right track then :)
@jcrespo That makes sense, I didn't know the private data was the reason we didn't do the wildcard grants. Let's leave it as is then. @aborrero may soon be working on better automating our flow to import a new DB into the replicas and set up access, and we can explore giving the grant at a per-view-database level every time we do that in an automated fashion.
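A rough sketch of what generating per-view-database grants could look like in Python. This is illustrative only: the function name, the SELECT and SHOW VIEW privileges, and the example grantee are assumptions, not the actual replica setup. Underscores are escaped because MySQL and MariaDB treat _ and % as wildcards in database names for database-level grants.

```python
import re

# Only plain names are accepted, so nothing can break out of the backticks.
_VALID_DB_NAME = re.compile(r"[A-Za-z0-9_]+")


def per_database_grants(view_databases, grantee):
    """Build one explicit GRANT statement per view database (no wildcards).

    `grantee` is passed through as-is, e.g. "'example_user'@'%'" (hypothetical).
    """
    statements = []
    for db in view_databases:
        if not _VALID_DB_NAME.fullmatch(db):
            raise ValueError(f"unexpected database name: {db!r}")
        # Escape "_" so it matches literally instead of acting as a wildcard.
        escaped = db.replace("_", "\\_")
        statements.append(
            f"GRANT SELECT, SHOW VIEW ON `{escaped}`.* TO {grantee};"
        )
    return statements


if __name__ == "__main__":
    for stmt in per_database_grants(["pawikisource_p"], "'example_user'@'%'"):
        print(stmt)
```

The statements are only built as strings here; running them against a replica would be a separate, reviewed step.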
Fri, Nov 10
@aborrero and I caught up on this, and it looks like all the DNS records are created now:
Thu, Nov 9
Noting here that proprietary software is not usually installed on WMCS environments per https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use#What_uses_of_Labs_do_we_not_like.3F (Proprietary Software).
@Marostegui right, okay. Thanks! Do we have a ticket for this issue?
@aborrero FYI, after Manuel's magic, I've run
@Marostegui Worked now! What did you have to do?
Also running directly on labsdb1011,
Wed, Nov 8
When the config is account required nologin.so, I've only been able to reproduce this behavior during the firstboot stage. I've tried applying auth required nologin.so post-boot to see how the behavior changes, and I've been able to log in every time, despite that config existing.
Fri, Nov 3
I've fixed the grants for pawikisource_p now.
Thu, Nov 2
Wed, Nov 1
It looks like it may be time to say goodbye to this server. I've spent some time today looking at the state of the storage configuration and the damage, and at whether anything at all might be possible to recover the disk.
Disk setup for labsdb1001
Mon, Oct 30
The 1001 reboot is all done. Notes from my planning etherpad:
Fri, Oct 27
I've now rolled this out to labsdb10[01|03|09|10|11]. @Marostegui Is there a file/config/logs somewhere you'd like me to persist these grants? Thanks for your help :)
Cool, I've run
@Marostegui Sounds good, thanks
@Marostegui Yeah that sounds right to me! Cool if I run that across the wiki replicas?
Thu, Oct 26
Started a planning doc for the reboots here - https://etherpad.wikimedia.org/p/labsdb-reboots
Fixed! @MusikAnimal can you verify that your credentials work now and close this? Thank you :)
Wed, Oct 25
+1 That sounds like the right thing to do
@jcrespo @bd808 I looked at the account setup we have now, and it looks like the labsdbadmin user is already set up with remote (specific IPs) permissions, but it only has Grant_priv and Create_user_priv, which in turn we use to create and grant accounts for toolforge users/tool accounts.
Awesome thanks @akosiaris!
Tue, Oct 24
I've updated the lists, and our wiki here - https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown
Proposed timing for the 2 reboots:
Reopening since we are scheduling the labsdb1001 and 1003 reboots over the next couple weeks.
@akosiaris Hello, we have a package builder node in tools that seems to be running into some trouble with the new buster stuff in puppet (logs in task description)
Oct 16 2017
I've decided to look at alternatives because the /etc/nologin mechanism seems to be flaky.
Oct 11 2017
Hi @Sowjanyavemuri, please feel free to reach out if you need any help completing your proposal. We are available on #wikimedia-cloud or here on Phabricator for any questions, or if you'd like to start learning about our different systems. https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_Introduction is a good place to start reading :)
Oct 10 2017
Oct 2 2017
The webservice tests should be fixed too! I'll let @chasemp verify and resolve this.
Also fixed the labsdb1005 check with https://gerrit.wikimedia.org/r/381885
For the labsdb1001 & 1003 tests, the error was:
Oct 1 2017
These instances should be up now.
Sep 26 2017
Project https://wikitech.wikimedia.org/wiki/Nova_Resource:Webperf created with User phedenskog as projectadmin
Sep 25 2017
Thanks @MoritzMuehlenhoff, I tried that and came up with these two plots, don't really see much of a difference as far as auth related services go.
I went through the list here - https://tools.wmflabs.org/openstack-browser/puppetclass/role::puppetmaster::standalone and upgraded apache and ran puppet.
@Andrew do we have a list of instances? They just need an apt-get install --only-upgrade apache2 and a restart of apache. There was an unattended apache upgrade that rolled out last week, and I fixed tools and labs-puppetmaster before it rolled out.
Sep 20 2017
I was able to change the firstboot script, which runs when a new instance is created and booted for the first time, to create an /etc/nologin file at the beginning of its run and delete the file at the end, after ensuring that NFS is mounted. I tested this by building test images for Trusty and Jessie.
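A minimal sketch of that flow in Python, for illustration only. The real firstboot script is not reproduced here; the function name, the nfs_is_mounted callable, the polling parameters and the message text are all assumptions. The nologin path is a parameter so the logic can be tried against a scratch file rather than /etc/nologin.

```python
import os
import time


def run_firstboot(nologin_path, nfs_is_mounted, setup_steps,
                  poll_seconds=1.0, max_polls=60):
    """Refuse logins while first-boot setup runs; allow them once NFS is up.

    nfs_is_mounted: hypothetical callable returning True when NFS is mounted.
    setup_steps: iterable of callables standing in for the real setup work.
    Returns True if the nologin file was removed, False if NFS never came up.
    """
    # Create the nologin file first, so logins are blocked for the whole run.
    with open(nologin_path, "w") as handle:
        handle.write("Instance is still being set up, please try again soon.\n")

    for step in setup_steps:
        step()

    # Only re-enable logins after NFS is confirmed to be mounted.
    for _ in range(max_polls):
        if nfs_is_mounted():
            os.remove(nologin_path)
            return True
        time.sleep(poll_seconds)

    # NFS never appeared: leave the nologin file in place.
    return False
```

On a real instance the check might be something like os.path.ismount on the NFS mount point, but which path to check is an assumption that depends on the image.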
Sep 19 2017
Steps to set up lvms
Sep 18 2017
@HaeB I've enabled pdf exports for SWAP now, but it may error out if it runs into Unicode characters it can't parse. I'm not sure why, and it seems to be a longer task, so I'm tabling it here for now.
Sep 15 2017
Sep 8 2017
@fgiunchedi Thank you! That seems to have fixed it. Resolving this task. Thanks everyone :)
Sep 6 2017
Sep 5 2017
All of these solutions so far require onsite.
Sep 3 2017
Looking at auth.log on ws-web, I saw a bunch of:
Aug 31 2017
I'm closing this as resolved since running the maintain-views script for the new views went fine. Please reopen if there are any issues. Thanks!
Looks like it is talking to the old puppetmaster url and failing
This is all done, new private key committed in ops/private. New certs are showing up okay!
Aug 30 2017
Thank you so much, that all looks right. Closing this as resolved!
Aug 29 2017
@Vacio You are in the right place! If you can hop on to the #wikimedia-cloud IRC channel sometime, we can help you figure this out more easily in real time :)
@ArielGlenn Sounds good, I would push towards a larger window of at least 2 hours - 45 minutes to an hour for 3 rsyncs + some cleanup seems like cutting it close.
@Vacio Could you please elaborate on what the problem is? Did you try signing up to wikitech and did you run into an error? If so what? You can create a wikitech account at https://wikitech.wikimedia.org/w/index.php?title=Special:CreateAccount
1001/3 have not been rebooted because of the fear of catastrophic hardware failure and their impending decomm.
@Marostegui We talked about this today in our meeting, and think that since we don't have significant user traffic moved over from 1001/3 to the new WikiReplica servers yet, we should hold off on rebooting these servers for longer, given that Moritz mentioned during our last discussion that we can afford to hold off, and the immediate attack vectors have already been plugged.
@Papaul Yup that's perfect, thanks!
@Papaul, Hardware RAID 10 on both labstore2001 and 2002, with 6 or 8 disks per logical/virtual RAID drive would be great (12 still feels like a really big disk).
Aug 28 2017
@Cmjohnson Thank you!
@ArielGlenn Thanks for the summary! Looks right - one note is that I would prefer that the dumpsdata host (primary or secondary) is the pristine source for both labstore1006 and 7, rather than the labstores trying to sync between each other.
Update: We are still blocked on talking to HP Support about the disk shelves.
Aug 25 2017
2001 is done too.
@Papaul, thanks for splitting up the shelves! I've reimaged the servers, and that part looks right
Aug 23 2017
Current status: We are not really sure why the disk shelves don't show up. As the next step, @Cmjohnson will try and call HP support and have them help troubleshoot, hopefully on Friday.
The interface flapping issue was because of a mis-connected cable, which @Cmjohnson has fixed now. Both management interfaces are now accessible.
Aug 21 2017
@Cmjohnson I also can't seem to get into the management interface for labstore1006