Fri, Mar 16
Also pinging @Krinkle
Chatted with @Ottomata today in #wikimedia-analytics, and we decided to use a similar strategy for the stat/notebook mounts. We'll mount shares from labstore1006/7 in /mnt, and symlink the active NFS one to /mnt/data (which is the current access point for stat users).
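For reference, the symlink-flip idea can be sketched like this (a minimal demo in a scratch directory; the directory names are made up, and in production the mounts would live under /mnt with /mnt/data as the access point):

```shell
# Demo of pointing a stable access path at the active NFS share.
# Paths are illustrative only; real ones would be /mnt/nfs/<server> and /mnt/data.
root=$(mktemp -d)
mkdir -p "$root/nfs/labstore1006" "$root/nfs/labstore1007"

# Initially, labstore1006 is the active server.
ln -sfn "$root/nfs/labstore1006" "$root/data"

# Failover: flip the symlink to labstore1007.
# -n replaces the symlink itself instead of descending into its target.
ln -sfn "$root/nfs/labstore1007" "$root/data"
readlink "$root/data"
```

Because `ln -sfn` swaps the link in place, users of the stable path never need to change which directory they reference when the active server moves.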
Initial PoC patch for nfsclient.pp changes https://gerrit.wikimedia.org/r/#/c/403767/1
@Volans Indeed, I fixed up the script based on the comments, we can close this task when the patch is merged! Thank you
I think the firewall stuff already exists now; we can do this as part of the dumps NFS migration on April 2, and have the shares available from labstore1006|7 on notebook*. I added T188644 as a parent task.
@Ottomata Hey! Do you know anything about these logs? :) I'd like to make it so that we can fetch from the mwlog server when we move to the new dumps set up.
Wed, Mar 14
Tue, Mar 13
@Kolossos I see utilization has climbed again to over 600G. How can we ensure we don't have to keep filing these tickets to clean up? We are happy to help figure out long-term strategies!
Resolving this for now. This project still has high utilization, albeit less than before. We can discuss strategies to mitigate in T159930.
Sun, Mar 11
Fri, Mar 9
Wed, Mar 7
Draft timeline - T168486#4033572
Draft timeline for migration:
Tue, Mar 6
tools-worker-1011 was having issues allowing non-root logins. I rebooted it:
Mon, Mar 5
See T188726 for new task on datasets in other/
Thu, Mar 1
That seems like it would work yes :)
Hey @Tim-moody, Chase is on-call this week and will make the changes soon :) Thanks for your patience!
@WMDE-Fisch Argh sorry, should be fixed for real now!
Our current theory is that when snapshot-manager runs lvs to check whether a snapshot exists, lvs throws these read errors, potentially because the older snapshots are full or otherwise unreadable. Since those snapshots get deleted anyway, the errors are red herrings and don't affect the backups. We can either stop logging these errors, or remove the snapshots on the source server after the backup completes to avoid the problem.
Despite the error lines in both cron logs for reading misc-snap, both backups seem to have completed successfully. Pasting everything but the Error lines from the above logs:
On labstore1004 I see:
Some logs from nova-conductor corresponding to the time of incident, doesn't seem like the root cause but correlates with the db spike. https://phabricator.wikimedia.org/P6770
Things seem a lot better now since
Wed, Feb 28
The script is failing due to an existing user account clash issue that we had hoped would go away with the 1001|3 decommission: it looks like we still have older accounts on labsdb1005 that cause the same problem.
Fri, Feb 23
+1 to handling on labstore boxes. Puppet should be able to do it.
Thu, Feb 22
Tue, Feb 20
Yup looks good, backups have been running fine for the last 2 weeks.
The servers are moved and up and running! Thanks for your work @Cmjohnson.
Sun, Feb 18
I renamed the hack script in tools.paws to paws-userhomes-hack.bash, it now looks like:
Feb 14 2018
Thanks for this work @jcrespo!
I've dropped the metadata for labsdb1001 and 1003 from labsdbaccounts.account_host. It now looks like:
Feb 8 2018
Feb 7 2018
puppet seems to be the only other one, but no one in Cloud Services knows much about it or maintains it; we only found data in there from 2012, and it doesn't seem to be referenced anywhere in puppet.
+1 on moving only once!
@ayounsi No we can't lose both without service interruption. I am not sure how we can have row level redundancy in this case if there is only 10G availability in one row.
@srishakatux Perfect, thank you!
@Cmjohnson Can we move them to a row with 10G then? These are in public vlan so don't need labs-support. I believe they are currently in A and D.
@Cmjohnson When we racked labstore1006 & 7 we approved the proposal for racking in 1GbE racks (T167984). I did not know that we had specifically ordered 10G NICs on these boxes (Hardware request - T161311) because the public dumps servers need those enabled (discussed in T118154#3017229).
Feb 6 2018
@srishakatux Yes, we are willing to mentor this for GSoC 2018 or Outreachy Round 16. Let me know if there's anything I need to do on my side to have this up as a project. Thanks :)
@jcrespo I'd like to drop all the accounts metadata for labsdb1001 & 3 from labsdbaccounts.account_host on m5-master to close this task.