This is probably a bug where tables are not cleaned up correctly? As mentioned in T132431:
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | jcrespo | T132431 labsdb1001 and labsdb1003 short on available space
Resolved | | Matthewrbowker | T133321 `s51187__xtools_tmp` database using 272G on labsdb1001 and 118G on labsdb1003
Event Timeline
When we grepped for xtools_tmp the things that stood out were ./public_html/autoblock/core and ./public_html/pages/core which are both xtools, not Wikihistory. They are compiled binaries, but it's unclear how they are actually used.
I dropped a solid 8,000 tables or so and everything seems to be running fine, leaving this month's tables in place. This of course is no solution, as they'll just climb back up again. We need to figure out why they are being created, and at the very least automate removing them when we no longer need them.
Obligatory It's Over 9000!
These would be core dump files recorded by the Linux kernel to snapshot the state of some fatal error. Although they are valuable for debugging segmentation faults, we should probably tune the grid servers not to create core files by default. That's probably worthy of a separate phab task.
Thanks for the cleanup.
Thanks for the info :) Looks like it's the counter that's creating the tables: https://github.com/x-tools/xtools/blob/master/modules/Counter.php#L373-L394
The table is supposed to be dropped right after it's created and processed with L158-L161. Obviously it is not =P
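The "create, process, drop" pattern described above can be made robust against failures by putting the drop in a `finally` block. The following is only a sketch of that idea, not the xtools code: it is written in Python rather than PHP, uses an in-memory SQLite database as a stand-in for the labsdb server, and the names `scratch_table`, `list_tables`, and the table columns are hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def scratch_table(conn, name):
    """Create a scratch table and drop it on exit, even if processing fails."""
    # The name is interpolated into SQL, so accept only simple identifiers.
    if not name.isalnum():
        raise ValueError(f"unsafe table name: {name!r}")
    conn.execute(f'CREATE TABLE "{name}" (rev_id INTEGER, rev_len INTEGER)')
    try:
        yield name
    finally:
        # Runs on success and on error alike, so no table is left behind.
        conn.execute(f'DROP TABLE IF EXISTS "{name}"')


def list_tables(conn):
    """Return the names of all tables currently in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")  # stand-in for the real database server

try:
    with scratch_table(conn, "tmp20170125") as table:
        conn.execute(f'INSERT INTO "{table}" VALUES (1, 100)')
        raise RuntimeError("simulated failure during processing")
except RuntimeError:
    pass

remaining = list_tables(conn)
print(remaining)
```

If the process is killed outright (for example by a segmentation fault), no `finally` block runs, so a periodic cleanup job is still a useful safety net.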
I will investigate further!
I didn't even know the core did that. I guess I'm really outdated with the code, being familiar with the old original xTools that I moved from toolserver to labs.
This is creating large operational problems on labs due to lack of space.
I had to delete all 2016* tables. If this continues, the next step will be disabling the user or enforcing very strict quotas before it further degrades the service for other tools (disabling on the grounds that there is not really a maintainer, which could pose a security threat). Please note that problems with this tool were flagged almost 1 year ago, so I do not consider these actions drastic or unannounced. It is not OK for a tool to take more space than enwiki and wikidata combined; that is not what the labsdb hosts are supposed to hold. If the resources it takes are justified, maintainers should ask for dedicated resources. Replica DBs are supposed to hold only intermediate/static results for easy querying.
The modifications made to the database code appear to not be working then...
Taking this task. I'm just going to build a quick and dirty cleanup engine triggered by cron. Any table older than about a week will be removed.
Cleanup engine implemented. It's now running nightly at midnight via jsub and cron. Moving the task to normal priority, as we are just watching it now.
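For readers who want to see the "older than about a week" rule concretely, here is a minimal sketch. It is not the actual `database_cleanup.php` (whose source is not shown in this task) and is written in Python for illustration. It assumes that each temporary table name starts with a 14-digit `YYYYMMDDHHMMSS` creation stamp, as the names in the logs on this task appear to; `created_at`, `stale_tables`, and `MAX_AGE` are hypothetical names, and the `now` value is made up for the demonstration.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=7)  # "about a week"; the exact cut-off is an assumption


def created_at(table_name):
    """Parse the leading YYYYMMDDHHMMSS stamp; return None if it is absent."""
    try:
        return datetime.strptime(table_name[:14], "%Y%m%d%H%M%S")
    except ValueError:
        return None


def stale_tables(table_names, now, max_age=MAX_AGE):
    """Return the names whose embedded timestamp is older than max_age."""
    cutoff = now - max_age
    stale = []
    for name in table_names:
        stamp = created_at(name)
        # Names without a parseable stamp are skipped rather than dropped.
        if stamp is not None and stamp < cutoff:
            stale.append(name)
    return stale


names = [
    "201701241057464ae56789661468c6b5684e43604a95f1",
    "201701241057464ae56789661468c6b5684e43604a95f1_parent",
    "20170125012942468c9702f0342ebeff65d2ef5fdc53b9",
    "not_a_temp_table",
]

# Hypothetical "current time" for the demonstration.
stale = stale_tables(names, now=datetime(2017, 2, 1))

# MySQL-style statements a cleanup job could then execute.
statements = [f"DROP TABLE IF EXISTS `{name}`" for name in stale]
for statement in statements:
    print(statement)
```

Skipping unparseable names is a deliberately conservative choice: a cleanup job that drops tables should never guess about a name it does not recognise.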
Thank you very much for the quick response. This was indeed a crisis. I will monitor the evolution, too.
I am going to call this task resolved as of now. There are currently 95 tables, and only 6 are older than a week (I believe there's a slight regexp error, which I shall debug).
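The task does not say what the regexp error was, so the following is only an illustration of how the table names could be matched strictly. Judging from the names in the logs on this task, each one looks like a 14-digit timestamp followed by 32 lowercase hex characters and an optional `_parent` suffix; that format is an inference, and `TEMP_TABLE` and `parse_temp_table` are hypothetical names. The sketch is in Python, not the tool's PHP.

```python
import re

# Inferred format: 14-digit stamp + 32 hex characters + optional "_parent".
TEMP_TABLE = re.compile(
    r"(?P<stamp>\d{14})(?P<digest>[0-9a-f]{32})(?P<suffix>_parent)?"
)


def parse_temp_table(name):
    """Return (stamp, digest, is_parent), or None if the name does not match."""
    # fullmatch anchors both ends, so partial or padded names are rejected.
    match = TEMP_TABLE.fullmatch(name)
    if match is None:
        return None
    return (
        match.group("stamp"),
        match.group("digest"),
        match.group("suffix") is not None,
    )


plain = parse_temp_table("201701241057464ae56789661468c6b5684e43604a95f1")
parent = parse_temp_table("201701241057464ae56789661468c6b5684e43604a95f1_parent")
rejected = parse_temp_table("20170124105746")  # stamp only, no digest

print(plain, parent, rejected)
```

Using `fullmatch` rather than an unanchored search is the design point here: a pattern that forgets the optional suffix or the end anchor is an easy way for some tables to be missed or mis-parsed.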
From my tool logs, this is the latest run of the cleanup engine:
Attempting to clean up "201701241057464ae56789661468c6b5684e43604a95f1"... ... done!
Attempting to clean up "201701241057464ae56789661468c6b5684e43604a95f1_parent"... ... done!
Attempting to clean up "20170125012942468c9702f0342ebeff65d2ef5fdc53b9"... ... done!
Attempting to clean up "20170125012942468c9702f0342ebeff65d2ef5fdc53b9_parent"... ... done!
Attempting to clean up "20170125045439f2a58f573c45f7e34e624cb5f4bff162"... ... done!
Attempting to clean up "20170125045439f2a58f573c45f7e34e624cb5f4bff162_parent"... ... done!
Attempting to clean up "20170125083820bf97c050c800e4735e8b7c4d178273ce"... ... done!
Attempting to clean up "20170125083820bf97c050c800e4735e8b7c4d178273ce_parent"... ... done!
Attempting to clean up "20170125084552bf97c050c800e4735e8b7c4d178273ce"... ... done!
Attempting to clean up "20170125084552bf97c050c800e4735e8b7c4d178273ce_parent"... ... done!
Attempting to clean up "20170125090054a8de4d9de0ea7c31d93cd38f63f712f9"... ... done!
Attempting to clean up "20170125090054a8de4d9de0ea7c31d93cd38f63f712f9_parent"... ... done!
Attempting to clean up "20170125105657208f492b3cc14d0acbb2f09cec38e123"... ... done!
Attempting to clean up "20170125105657208f492b3cc14d0acbb2f09cec38e123_parent"... ... done!
Attempting to clean up "20170125110324208f492b3cc14d0acbb2f09cec38e123"... ... done!
Attempting to clean up "20170125110324208f492b3cc14d0acbb2f09cec38e123_parent"... ... done!
Attempting to clean up "20170125123247cd11942a4e0f40d9c235ac047b12c518"... ... done!
Attempting to clean up "20170125123247cd11942a4e0f40d9c235ac047b12c518_parent"... ... done!
Attempting to clean up "20170125123416cd11942a4e0f40d9c235ac047b12c518"... ... done!
Attempting to clean up "20170125123416cd11942a4e0f40d9c235ac047b12c518_parent"... ... done!
Attempting to clean up "201701251307409698757046dab036e11177d217612de0"... ... done!
Attempting to clean up "201701251307409698757046dab036e11177d217612de0_parent"... ... done!
Attempting to clean up "201701251658068f8fc46824184c8f274d887ff2f04284"... ... done!
Attempting to clean up "201701251658068f8fc46824184c8f274d887ff2f04284_parent"... ... done!
The cleanup job has been running successfully.
I ran it manually, here is the output.
16:54 [xtools]tools.xtools@tools-bastion-03:~🍺 php database_cleanup.php
2017-03-17 16:55:02
Attempting to clean up "20170103212637e542e0119d5a9ccad17ae7374de673d2"... ... done!
Attempting to clean up "20170103212637e542e0119d5a9ccad17ae7374de673d2_parent"... ... done!
Attempting to clean up "20170118043651e850e92689e08eb834937d0c220fd7e9"... ... done!
Attempting to clean up "20170118043651e850e92689e08eb834937d0c220fd7e9_parent"... ... done!
Attempting to clean up "20170120153955810e5005ad36bb2af88f887ae6ca2c4c"... ... done!
Attempting to clean up "20170120153955810e5005ad36bb2af88f887ae6ca2c4c_parent"... ... done!
Attempting to clean up "2017012710200943980e2db7a6bc8aea9ea05979b4594f"... ... done!
Attempting to clean up "2017012710200943980e2db7a6bc8aea9ea05979b4594f_parent"... ... done!
Attempting to clean up "20170128062903e542e0119d5a9ccad17ae7374de673d2"... ... done!
Attempting to clean up "20170128062903e542e0119d5a9ccad17ae7374de673d2_parent"... ... done!
Attempting to clean up "20170131034511e542e0119d5a9ccad17ae7374de673d2"... ... done!
Attempting to clean up "20170131034511e542e0119d5a9ccad17ae7374de673d2_parent"... ... done!
Attempting to clean up "20170212172229e542e0119d5a9ccad17ae7374de673d2"... ... done!
Attempting to clean up "20170212172229e542e0119d5a9ccad17ae7374de673d2_parent"... ... done!
Attempting to clean up "20170213025527810e5005ad36bb2af88f887ae6ca2c4c"... ... done!
Attempting to clean up "20170213025527810e5005ad36bb2af88f887ae6ca2c4c_parent"... ... done!
Attempting to clean up "20170213025557810e5005ad36bb2af88f887ae6ca2c4c"... ... done!
Attempting to clean up "20170213025557810e5005ad36bb2af88f887ae6ca2c4c_parent"... ... done!
Attempting to clean up "20170213025612810e5005ad36bb2af88f887ae6ca2c4c"... ... done!
Attempting to clean up "20170213025612810e5005ad36bb2af88f887ae6ca2c4c_parent"... ... done!
Attempting to clean up "20170215100931e542e0119d5a9ccad17ae7374de673d2"... ... done!
Attempting to clean up "20170215100931e542e0119d5a9ccad17ae7374de673d2_parent"... ... done!
Attempting to clean up "20170215162656810e5005ad36bb2af88f887ae6ca2c4c"... ... done!
Attempting to clean up "20170215162656810e5005ad36bb2af88f887ae6ca2c4c_parent"... ... done!
Attempting to clean up "20170217110256e542e0119d5a9ccad17ae7374de673d2"... ... done!
Attempting to clean up "20170217110256e542e0119d5a9ccad17ae7374de673d2_parent"... ... done!
Attempting to clean up "20170218061937772e4161748e4743a8f5aae904a4c064"... ... done!
Attempting to clean up "20170218061937772e4161748e4743a8f5aae904a4c064_parent"... ... done!
Attempting to clean up "20170301184135e17f8a423f2bc85ef1cf14b87db2a00b"... ... done!
Attempting to clean up "20170301184135e17f8a423f2bc85ef1cf14b87db2a00b_parent"... ... done!
Attempting to clean up "20170304025859e542e0119d5a9ccad17ae7374de673d2"... ... done!
Attempting to clean up "20170304025859e542e0119d5a9ccad17ae7374de673d2_parent"... ... done!
Attempting to clean up "20170306130027063e129c2ebd557537a4408d7dcd5f6f"... ... done!
Attempting to clean up "20170306130027063e129c2ebd557537a4408d7dcd5f6f_parent"... ... done!
Attempting to clean up "20170308172704810e5005ad36bb2af88f887ae6ca2c4c"... ... done!
Attempting to clean up "20170308172704810e5005ad36bb2af88f887ae6ca2c4c_parent"... ... done!
What is the difference between labsdb1001 and labsdb1003? Does labsdb1001 correlate to s1.labsdb?
We dynamically change the load for each server. To connect to user databases, I think the best way is to use the DNS names c1.labsdb or c3.labsdb to connect to a particular physical host. It is documented here: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Naming_conventions