@Kolossos I see utilization has climbed back over 600G. How can we ensure we don't have to keep filing these cleanup tickets? We are happy to help figure out long-term strategies!
Mar 13 2018
Resolving this for now. This project still has high utilization, albeit less than before. We can discuss strategies to mitigate in T159930.
Mar 11 2018
Mar 9 2018
Mar 7 2018
Draft timeline - T168486#4033572
Draft timeline for migration:
Mar 6 2018
In T159930#4029145, @Nemo_bis wrote: If the higher usage is periodic, I want to encourage setting up automatic cleanup jobs after the dumps are processed.
I might have missed something, but this is not a problem that cleanup would help with. It's about transferring bigger datasets which are only produced occasionally. They could clearly be split into smaller pieces, but doing so might end up increasing resource usage (download something, write it, read it, process and split it, write it again, move it elsewhere, etc.); a streaming approach, sketched below, would sidestep some of that.
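For illustration only, a minimal sketch of chunking a dump while it is downloaded, so the data is written to disk once instead of going through the write/read/split/rewrite cycle; the URL, chunk size, and output path are invented placeholders:

```
#!/bin/bash
# Hypothetical streaming split: pipe the download straight into split so the
# dataset lands on disk only once, already in smaller pieces.
set -euo pipefail
DUMP_URL="https://dumps.example.org/big-dataset.xml.gz"  # placeholder URL
OUT_DIR="/srv/dumps/chunks"                              # placeholder path
mkdir -p "$OUT_DIR"
curl -sL "$DUMP_URL" | split -b 10G - "$OUT_DIR/big-dataset.part_"
```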
In T174468#3924301, @Hydriz wrote: I have managed to reduce the disk usage to less than 500G. However, the original problem still stands: the dumps project may have very high utilization of disk space during certain periods of time, which may negatively affect other CloudVPS projects. Is it possible for a separate labstore volume to be created just for the dumps project?
tools-worker-1011 was having issues allowing non-root logins. I rebooted it:
Mar 5 2018
See T188726 for new task on datasets in other/
In T171541#3889918, @ArielGlenn wrote: These have been running for a while now. The only things that don't get synced over on a regular basis are the various datasets pulled or pushed onto dataset1001 from kiwix, the mwlog hosts, etc. Instead of setting up an additional sync job for those, we ought to just enable those syncs to happen on labstore1006 and sync from there to 1007.
- profile::dumps::fetcher with appropriate hiera settings and permissions on stat1005 will take care of the incoming datasets
- profile/manifests/phabricator/main.pp has a stanza for the push to dataset1001, so it should either get a new stanza added or be converted to a pull
- role/manifests/logging/mediawiki/udp2log.pp has a stanza for the push to dumps.wikimedia.org, so it should either get a new stanza added or be converted to a pull
Then this task could be closed. (A rough sketch of the pull side follows.)
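For concreteness, a minimal sketch of a pull job on labstore1007, assuming a hypothetical rsync module, paths, and bandwidth cap (none of these names come from this task):

```
#!/bin/bash
# Hypothetical pull for labstore1007: fetch the occasional datasets from
# labstore1006 instead of having each source host push them out.
# Module name, paths, and bandwidth cap are illustrative assumptions.
set -euo pipefail
SRC="rsync://labstore1006.wikimedia.org/data/xmldatadumps/public/other/"
DEST="/srv/dumps/xmldatadumps/public/other/"
/usr/bin/rsync -a --delete --bwlimit=80000 "$SRC" "$DEST"
```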
Mar 1 2018
That seems like it would work, yes :)
Hey @Tim-moody, Chase is on-call this week and will make the changes soon :) Thanks for your patience!
@WMDE-Fisch Argh sorry, should be fixed for real now!
Our current theory: when snapshot-manager runs lvs to check whether a snapshot exists, lvs throws these read errors, potentially because the older snapshots are full or otherwise unreadable. Those snapshots get deleted anyway, so the errors are red herrings and don't affect the backups. We can either stop logging these errors, or remove the snapshots on the source server after the backup is done (sketched below) to avoid the problem altogether.
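A minimal sketch of that second option, assuming the snapshot names seen in the lvs output below and that removal is safe once the sync has finished (the wrapper and its argument are hypothetical):

```
#!/bin/bash
# Hypothetical post-backup cleanup: drop the backup snapshot once the sync
# finishes, so later lvs runs don't trip over a full, invalidated snapshot.
set -euo pipefail
SNAP="$1"   # e.g. misc/misc-snap or tools/tools-snap
if lvs "$SNAP" >/dev/null 2>&1; then
    lvremove -f "$SNAP"
fi
```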
Despite the error lines in both cron logs for reading misc-snap, both backups seem to have completed successfully. Pasting everything but the error lines from the above logs:
On labstore1004 I see:

```
root@labstore1004:~# lvs
  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558073344: Input/output error
  /dev/misc/misc-snap: read failed after 0 of 4096 at 5497558130688: Input/output error
  /dev/misc/misc-snap: read failed after 0 of 4096 at 0: Input/output error
  /dev/misc/misc-snap: read failed after 0 of 4096 at 4096: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 8796092956672: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 8796093014016: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 0: Input/output error
  /dev/tools/tools-snap: read failed after 0 of 4096 at 4096: Input/output error
  LV            VG    Attr       LSize  Pool Origin        Data%  Meta% Move Log Cpy%Sync Convert
  misc-project  misc  owi-aos---  5.00t
  misc-snap     misc  swi-I-s---  1.00t      misc-project  100.00
  test          misc  -wi-ao---- 10.00g
  tools-project tools owi-aos---  8.00t
  tools-snap    tools swi-I-s---  1.00t      tools-project 100.00
root@labstore1004:~#
```

Are snapshots not being created each time, or did 1T of content really fill up while the sync was in progress?
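Data% at 100.00 together with the "I" attribute flag means each snapshot's 1T copy-on-write space filled up and the snapshot was invalidated, which matches the read errors. One way to watch for that before it happens, using standard lvm2 reporting options:

```
# List snapshot LVs with their attributes and copy-on-write usage; 100% plus
# the "I" state flag means the snapshot has been invalidated.
lvs -S 'lv_attr=~"^s"' -o vg_name,lv_name,lv_attr,data_percent
```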
Some logs from nova-conductor corresponding to the time of the incident; this doesn't seem like the root cause, but it correlates with the DB spike: https://phabricator.wikimedia.org/P6770
Things seem a lot better now since
Feb 28 2018
The script is failing due to an existing user account clash that we hoped would go away with the labsdb1001|3 decommission - it looks like we still have older accounts on labsdb1005 that cause the same problem.
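A hedged way to look for the leftovers, assuming the clashing accounts live in the grant tables on labsdb1005 (the user pattern matched here is a guess, not something from this task):

```
# Hypothetical check for stale accounts on labsdb1005 that would clash when
# the account-creation script tries to recreate them.
mysql -h labsdb1005.eqiad.wmnet -e \
  "SELECT User, Host FROM mysql.user WHERE User LIKE 's5%' ORDER BY User;"
```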
Feb 23 2018
+1 to handling on labstore boxes. Puppet should be able to do it.
Feb 22 2018
Feb 20 2018
Yup looks good, backups have been running fine for the last 2 weeks.
The servers are moved and up and running! Thanks for your work @Cmjohnson.
Feb 18 2018
I renamed the hack script in tools.paws to paws-userhomes-hack.bash; it now looks like:
Feb 14 2018
Thanks for this work @jcrespo!
I've dropped the metadata for labsdb1001 and 1003 from labsdbaccounts.account_host. It now looks like:
In T183029#3971300, @jcrespo wrote: This is the new list of m5 databases being backed up - if you confirm that is as intended, you can proceed now we have proper[sic] backups.
```
root@dbstore2001:/srv/backups/m5.20180214102352$ ls *-schema-create.sql.gz
ceilometer-schema-create.sql.gz
designate_pool_manager-schema-create.sql.gz
designate-schema-create.sql.gz
glance-schema-create.sql.gz
keystone-schema-create.sql.gz
labsdbaccounts-schema-create.sql.gz
labspuppet-schema-create.sql.gz
neutron-schema-create.sql.gz
nodepooldb-schema-create.sql.gz
nova-schema-create.sql.gz
striker-schema-create.sql.gz
```
Feb 8 2018
Feb 7 2018
puppet seems to be the only other one, but no one in Cloud Services knows much about it or maintains it - we only found data in there from 2012, and it doesn't seem to be referenced anywhere in puppet.
+1 on moving only once!
@ayounsi No we can't lose both without service interruption. I am not sure how we can have row level redundancy in this case if there is only 10G availability in one row.
@srishakatux Perfect, thank you!
@Cmjohnson So to clarify: do both rows A and D (or the racks we have these servers in - D6 and A1) lack 10G?
@Cmjohnson Can we move them to a row with 10G then? These are in the public VLAN, so they don't need labs-support. I believe they are currently in A and D.
@Cmjohnson When we racked labstore1006 & 7, we approved the proposal for racking them in 1GbE racks (T167984). I did not know then that we had specifically ordered 10G NICs on these boxes (hardware request: T161311) because the public dumps servers need those enabled (discussed in T118154#3017229).
Feb 6 2018
@srishakatux Yes, we are willing to mentor this for GSoC 2018 or Outreachy Round 16. Let me know if there's anything I need to do on my side to have this up as a project. Thanks :)
Drop: test_labsdbaccounts
Backup: labsdbaccounts
@jcrespo I'd like to drop all the account metadata for labsdb1001 & 3 from labsdbaccounts.account_host on m5-master to close this task.
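For reference, a sketch of what that cleanup might look like; the column matched and the exact host strings are assumptions, so the real statement should be checked against the schema first:

```
# Hypothetical deletion on m5-master; the column name and host values are
# guesses and would need verifying against the actual table definition.
mysql -h m5-master.eqiad.wmnet labsdbaccounts -e \
  "DELETE FROM account_host
   WHERE hostname IN ('labsdb1001.eqiad.wmnet', 'labsdb1003.eqiad.wmnet');"
```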
Feb 5 2018
@chasemp The crons are scheduled for tomorrow and the day after :) My manual backups completed fine.
@MarkAHershberger I've applied the quota increase - let me know if it's all good. Thanks!
Feb 1 2018
Fixed with https://gerrit.wikimedia.org/r/407460. I'm running manual backup jobs for both shares now in a screen session. Will close after confirming that the scheduled crons run successfully next week.
Jan 31 2018
@notconfusing Great, thank you!
Jan 29 2018
Noting here that I added Brooke on Tue, Jan 23rd after Bryan's fix, and made sure Rush was still in the list after I did so.
@notconfusing Is this service still active? Are there ongoing cleanup jobs in place to delete the files that get generated? I see that usage has now grown to 160G, and I want to make sure we don't end up with really high utilization again. Thanks!
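As an illustration of the kind of cleanup job meant here - the path, retention window, and schedule are all invented for the example:

```
# Hypothetical crontab entry: prune generated files older than 30 days each
# night so project usage doesn't creep back up.
0 3 * * * find /data/project/generated-output -type f -mtime +30 -delete
```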