
`s51187__xtools_tmp` database using 272G on labsdb1001 and 118G on labsdb1003
Closed, Resolved · Public

Description

This is probably a bug where tables are not cleaned up correctly? As mentioned in T132431:

Woah. Why is xtools on the list? It shouldn't be using that much DB space. It should be using almost nothing.

Event Timeline

RobH removed a subscriber: RobH. Apr 21 2016, 6:53 PM
Matthewrbowker added a comment. Edited · Apr 21 2016, 7:03 PM

Over 10,000 tables in that database right now.

See P2939

Cyberpower678 added a comment. Edited · Apr 21 2016, 7:36 PM

Over 10,000 tables in that database right now.
http://pastebin.com/TpLU4RqH

WTF??????????????????? That's like...what??? We need to disable Wikihistory now.

When we grepped for xtools_tmp the things that stood out were ./public_html/autoblock/core and ./public_html/pages/core which are both xtools, not Wikihistory. They are compiled binaries, but it's unclear how they are actually used.

I dropped a solid 8,000 tables or so and everything seems to be running fine, leaving this month's tables in place. This of course is no solution, as they'll just climb back up again. We need to figure out why they are being created, and at the very least automate removing them when we no longer need them.

bd808 added a comment. Apr 21 2016, 8:43 PM

Over 10,000 tables in that database right now.
http://pastebin.com/TpLU4RqH

Obligatory It's Over 9000!

When we grepped for xtools_tmp the things that stood out were ./public_html/autoblock/core and ./public_html/pages/core which are both xtools, not Wikihistory. They are compiled binaries, but it's unclear how they are actually used.

These would be core dump files recorded by the Linux kernel to snapshot the state of some fatal error. Although they are valuable for debugging segmentation faults, we should probably tune the grid servers to not create core files normally. That's probably worthy of a separate phab task.
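
For illustration only, here is a minimal Python sketch of turning core files off for a single process by lowering its soft `RLIMIT_CORE` to zero (the per-process equivalent of `ulimit -c 0` in a shell). The helper name `disable_core_dumps` is hypothetical, and this says nothing about how the grid servers themselves are or should be configured.

```python
import resource


def disable_core_dumps():
    """Set the soft core-file size limit to 0 for this process and its children.

    The hard limit is left unchanged so the setting can be raised again later.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

With a soft limit of 0 the kernel does not write a `core` file into the working directory when the process crashes.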

I dropped a solid 8,000 tables or so and everything seems to be running fine, leaving this month's tables in place. This of course is no solution, as they'll just climb back up again. We need to figure out why they are being created, and at the very least automate removing them when we no longer need them.

Thanks for the cleanup.

MusikAnimal added a comment. Edited · Apr 21 2016, 8:53 PM

Thanks for the info :) Looks like it's the counter that's creating the tables: https://github.com/x-tools/xtools/blob/master/modules/Counter.php#L373-L394

The table is supposed to be dropped right after it's created and processed with L158-L161. Obviously it is not =P

I will investigate further!
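
One way a drop-after-use step can be made robust is to tie the table's lifetime to a `try`/`finally` block, so the drop still runs when processing fails partway. The sketch below is a hedged illustration in Python with an in-memory SQLite database standing in for the real database; it is not the XTools PHP code, and `scratch_table` and `table_names` are hypothetical names.

```python
import sqlite3
import uuid
from contextlib import contextmanager


@contextmanager
def scratch_table(conn, columns_sql):
    """Create a uniquely named scratch table and drop it on exit, even on error."""
    name = "tmp_" + uuid.uuid4().hex
    conn.execute(f'CREATE TABLE "{name}" ({columns_sql})')
    try:
        yield name
    finally:
        # Runs whether the body finished normally or raised.
        conn.execute(f'DROP TABLE IF EXISTS "{name}"')


def table_names(conn):
    """List the tables that currently exist in the SQLite database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
try:
    with scratch_table(conn, "rev_id INTEGER") as table:
        conn.execute(f'INSERT INTO "{table}" VALUES (1)')
        raise RuntimeError("simulated failure mid-processing")
except RuntimeError:
    pass
print(table_names(conn))  # [] : the scratch table was dropped despite the error
```

This only covers failures the process can still react to. If the process is killed outright, no cleanup code runs and the table stays behind, which is why a separate periodic cleanup is still useful.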

I didn't even know the core did that. I guess I'm really outdated with the code, being familiar with the old original xTools that I moved from toolserver to labs.

Matthewrbowker triaged this task as High priority.

Assigning to MusikAnimal as he's investigating further.

Matthewrbowker moved this task from Inbox to Working on the XTools board. Nov 24 2016, 12:03 AM
scfc moved this task from Triage to Backlog on the Toolforge board. Dec 4 2016, 8:30 PM
jcrespo raised the priority of this task from High to Unbreak Now!. Jan 28 2017, 3:34 PM
jcrespo added a subscriber: jcrespo.

This is creating large operational problems on labs due to lack of space.

I had to delete all 2016* tables. If this continues, the next step will be disabling the user or enforcing very strict quotas before it continues degrading the service for other tools (disabling is justified because there is not really a maintainer, and that could pose a security threat). Please note that warnings about this tool's problems were raised almost 1 year ago, so I do not consider these actions drastic or without warning. It is not OK that a tool takes more space than enwiki and wikidata combined; that is not what labsdb is supposed to host. If the resources it takes are justified, maintainers should ask for dedicated resources. Replica dbs are supposed to hold only intermediate/static results for easy querying.

Restricted Application added subscribers: Jay8g, TerraCodes. Jan 28 2017, 3:34 PM

The modifications made to the database code appear to not be working then...

Taking this task. I'm just going to build a real quick and dirty cleanup engine triggered by cron. Any table older than about a week will be removed.

Matthewrbowker lowered the priority of this task from Unbreak Now! to Normal. Jan 28 2017, 6:58 PM

Cleanup engine implemented. It's now running nightly at midnight via jsub and cron. Moving task to Normal as we are just watching it now.
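
As a rough illustration of the age check such a cleanup pass needs, here is a Python sketch that picks out table names older than a week. The real `database_cleanup.php` is not shown in this task, so the name pattern (a 14-digit `YYYYMMDDHHMMSS` prefix, 32 hex characters, and an optional `_parent` suffix) is an assumption inferred from the table names in the cleanup logs, and `stale_tables` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

# Assumed naming scheme, inferred from the logged table names:
# 14-digit timestamp + 32 hex characters + optional "_parent".
TABLE_NAME = re.compile(r"(\d{14})[0-9a-f]{32}(?:_parent)?")


def stale_tables(names, now, max_age=timedelta(days=7)):
    """Return the names whose timestamp prefix is older than max_age."""
    stale = []
    for name in names:
        match = TABLE_NAME.fullmatch(name)
        if match is None:
            continue  # leave alone anything that does not look like a scratch table
        try:
            created = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
        except ValueError:
            continue  # 14 digits that do not form a valid date
        if now - created > max_age:
            stale.append(name)
    return stale
```

Each returned name would then be passed to a `DROP TABLE` statement. Matching the whole name with `fullmatch` keeps the job from touching tables it does not recognize.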

Thank you very much for the quick response. This was indeed a crisis. I will monitor how it evolves, too.

Matthewrbowker closed this task as Resolved. Feb 1 2017, 7:30 PM
Matthewrbowker moved this task from Working to Complete on the XTools board.

I am going to call this task resolved as of now. There are currently 95 tables, and only 6 are older than a week (I believe there's a slight regexp error, which I shall debug).

From my tool logs, this is the latest run of the cleanup engine:

Attempting to clean up "201701241057464ae56789661468c6b5684e43604a95f1"...
    ... done!
Attempting to clean up "201701241057464ae56789661468c6b5684e43604a95f1_parent"...
    ... done!
Attempting to clean up "20170125012942468c9702f0342ebeff65d2ef5fdc53b9"...
    ... done!
Attempting to clean up "20170125012942468c9702f0342ebeff65d2ef5fdc53b9_parent"...
    ... done!
Attempting to clean up "20170125045439f2a58f573c45f7e34e624cb5f4bff162"...
    ... done!
Attempting to clean up "20170125045439f2a58f573c45f7e34e624cb5f4bff162_parent"...
    ... done!
Attempting to clean up "20170125083820bf97c050c800e4735e8b7c4d178273ce"...
    ... done!
Attempting to clean up "20170125083820bf97c050c800e4735e8b7c4d178273ce_parent"...
    ... done!
Attempting to clean up "20170125084552bf97c050c800e4735e8b7c4d178273ce"...
    ... done!
Attempting to clean up "20170125084552bf97c050c800e4735e8b7c4d178273ce_parent"...
    ... done!
Attempting to clean up "20170125090054a8de4d9de0ea7c31d93cd38f63f712f9"...
    ... done!
Attempting to clean up "20170125090054a8de4d9de0ea7c31d93cd38f63f712f9_parent"...
    ... done!
Attempting to clean up "20170125105657208f492b3cc14d0acbb2f09cec38e123"...
    ... done!
Attempting to clean up "20170125105657208f492b3cc14d0acbb2f09cec38e123_parent"...
    ... done!
Attempting to clean up "20170125110324208f492b3cc14d0acbb2f09cec38e123"...
    ... done!
Attempting to clean up "20170125110324208f492b3cc14d0acbb2f09cec38e123_parent"...
    ... done!
Attempting to clean up "20170125123247cd11942a4e0f40d9c235ac047b12c518"...
    ... done!
Attempting to clean up "20170125123247cd11942a4e0f40d9c235ac047b12c518_parent"...
    ... done!
Attempting to clean up "20170125123416cd11942a4e0f40d9c235ac047b12c518"...
    ... done!
Attempting to clean up "20170125123416cd11942a4e0f40d9c235ac047b12c518_parent"...
    ... done!
Attempting to clean up "201701251307409698757046dab036e11177d217612de0"...
    ... done!
Attempting to clean up "201701251307409698757046dab036e11177d217612de0_parent"...
    ... done!
Attempting to clean up "201701251658068f8fc46824184c8f274d887ff2f04284"...
    ... done!
Attempting to clean up "201701251658068f8fc46824184c8f274d887ff2f04284_parent"...
    ... done!

Thank you, again!

jcrespo renamed this task from `s51187__xtools_tmp` database using 272G on labsdb1001 to `s51187__xtools_tmp` database using 272G on labsdb1001 and 118G on labsdb1003. Mar 17 2017, 4:53 PM
jcrespo reopened this task as Open.

The cleanup job has been running successfully.

I ran it manually, here is the output.

16:54 [xtools]tools.xtools@tools-bastion-03:~🍺  php database_cleanup.php
2017-03-17 16:55:02 Attempting to clean up "20170103212637e542e0119d5a9ccad17ae7374de673d2"...
    ... done!
Attempting to clean up "20170103212637e542e0119d5a9ccad17ae7374de673d2_parent"...
    ... done!
Attempting to clean up "20170118043651e850e92689e08eb834937d0c220fd7e9"...
    ... done!
Attempting to clean up "20170118043651e850e92689e08eb834937d0c220fd7e9_parent"...
    ... done!
Attempting to clean up "20170120153955810e5005ad36bb2af88f887ae6ca2c4c"...
    ... done!
Attempting to clean up "20170120153955810e5005ad36bb2af88f887ae6ca2c4c_parent"...
    ... done!
Attempting to clean up "2017012710200943980e2db7a6bc8aea9ea05979b4594f"...
    ... done!
Attempting to clean up "2017012710200943980e2db7a6bc8aea9ea05979b4594f_parent"...
    ... done!
Attempting to clean up "20170128062903e542e0119d5a9ccad17ae7374de673d2"...
    ... done!
Attempting to clean up "20170128062903e542e0119d5a9ccad17ae7374de673d2_parent"...
    ... done!
Attempting to clean up "20170131034511e542e0119d5a9ccad17ae7374de673d2"...
    ... done!
Attempting to clean up "20170131034511e542e0119d5a9ccad17ae7374de673d2_parent"...
    ... done!
Attempting to clean up "20170212172229e542e0119d5a9ccad17ae7374de673d2"...
    ... done!
Attempting to clean up "20170212172229e542e0119d5a9ccad17ae7374de673d2_parent"...
    ... done!
Attempting to clean up "20170213025527810e5005ad36bb2af88f887ae6ca2c4c"...
    ... done!
Attempting to clean up "20170213025527810e5005ad36bb2af88f887ae6ca2c4c_parent"...
    ... done!
Attempting to clean up "20170213025557810e5005ad36bb2af88f887ae6ca2c4c"...
    ... done!
Attempting to clean up "20170213025557810e5005ad36bb2af88f887ae6ca2c4c_parent"...
    ... done!
Attempting to clean up "20170213025612810e5005ad36bb2af88f887ae6ca2c4c"...
    ... done!
Attempting to clean up "20170213025612810e5005ad36bb2af88f887ae6ca2c4c_parent"...
    ... done!
Attempting to clean up "20170215100931e542e0119d5a9ccad17ae7374de673d2"...
    ... done!
Attempting to clean up "20170215100931e542e0119d5a9ccad17ae7374de673d2_parent"...
    ... done!
Attempting to clean up "20170215162656810e5005ad36bb2af88f887ae6ca2c4c"...
    ... done!
Attempting to clean up "20170215162656810e5005ad36bb2af88f887ae6ca2c4c_parent"...
    ... done!
Attempting to clean up "20170217110256e542e0119d5a9ccad17ae7374de673d2"...
    ... done!
Attempting to clean up "20170217110256e542e0119d5a9ccad17ae7374de673d2_parent"...
    ... done!
Attempting to clean up "20170218061937772e4161748e4743a8f5aae904a4c064"...
    ... done!
Attempting to clean up "20170218061937772e4161748e4743a8f5aae904a4c064_parent"...
    ... done!
Attempting to clean up "20170301184135e17f8a423f2bc85ef1cf14b87db2a00b"...
    ... done!
Attempting to clean up "20170301184135e17f8a423f2bc85ef1cf14b87db2a00b_parent"...
    ... done!
Attempting to clean up "20170304025859e542e0119d5a9ccad17ae7374de673d2"...
    ... done!
Attempting to clean up "20170304025859e542e0119d5a9ccad17ae7374de673d2_parent"...
    ... done!
Attempting to clean up "20170306130027063e129c2ebd557537a4408d7dcd5f6f"...
    ... done!
Attempting to clean up "20170306130027063e129c2ebd557537a4408d7dcd5f6f_parent"...
    ... done!
Attempting to clean up "20170308172704810e5005ad36bb2af88f887ae6ca2c4c"...
    ... done!
Attempting to clean up "20170308172704810e5005ad36bb2af88f887ae6ca2c4c_parent"...
    ... done!

What is the difference between labsdb1001 and labsdb1003? Does labsdb1001 correlate to s1.labsdb?

jcrespo closed this task as Resolved. Edited · Mar 17 2017, 5:18 PM

What is the difference between labsdb1001 and labsdb1003? Does labsdb1001 correlate to s1.labsdb?

We dynamically change the load for each server. To connect to user databases, I think the best way is to use the DNS names c1.labsdb or c3.labsdb to connect to a particular physical host. It is documented here: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Naming_conventions