tungsten is out of space on /srv
Closed, Resolved (Public)

Description

Apparently, mongodb has consumed all space available. Please tell us what to do next (delete all mongodb files?).

Details

Related Gerrit Patches:
operations/mediawiki-config : master StartProfiler: Disable production sampling for xhgui

Event Timeline

jcrespo created this task. Mar 23 2017, 9:19 AM
Krinkle triaged this task as High priority. Mar 23 2017, 8:33 PM

Change 344504 had a related patch set uploaded (by Krinkle):
[operations/mediawiki-config@master] StartProfiler: Disable production sampling for xhgui

https://gerrit.wikimedia.org/r/344504

Change 344504 merged by jenkins-bot:
[operations/mediawiki-config@master] StartProfiler: Disable production sampling for xhgui

https://gerrit.wikimedia.org/r/344504

Krinkle added a comment. (Edited) Mar 23 2017, 9:35 PM

"Apparently, mongodb has consumed all space available. Please tell us what to do next (delete all mongodb files?)."

We currently store both X-Wikimedia-Debug profiles and a 1:100,000 sample of production requests. We only use XHGui for the former, so I've disabled the latter with https://gerrit.wikimedia.org/r/344504.

Once merged, I'll figure out what XHGui's database schema looks like and delete all the profiles from production requests, while preserving the manually stored X-Wikimedia-Debug (WMD) profiles. That should free over 99% of the space occupied by MongoDB.
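
To illustrate the two capture paths described above, here is a minimal sketch in Python (not the PHP of the actual StartProfiler config, which is not shown in this task). The function name, its parameters, and the exact decision logic are assumptions for illustration; only the X-Wikimedia-Debug header and the 1:100,000 ratio come from the comment.

```python
import random

SAMPLE_RATE = 100_000  # the 1:100,000 ratio quoted in the comment above


def should_store_profile(headers, sampling_enabled, rng=random):
    """Hypothetical decision logic; not the real StartProfiler code."""
    # Simplification: headers is a plain dict with an exact-case key.
    if "X-Wikimedia-Debug" in headers:
        # Profiles requested via the debug header are always kept.
        return True
    if sampling_enabled:
        # Production sampling: on average one request in SAMPLE_RATE.
        return rng.randrange(SAMPLE_RATE) == 0
    return False
```

With sampling_enabled set to False, which is roughly what the patch amounts to under these assumptions, only requests carrying the debug header would still be stored.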

Mentioned in SAL (#wikimedia-operations) [2017-03-23T23:26:47Z] <Krinkle> Removing xhgui.results entries from before 1 June 2016 (T161196)

Krinkle added a comment. (Edited) Mar 23 2017, 11:33 PM

Instead of using the mongo CLI on tungsten directly, I worked around it for now by sending the delete query from PHP on terbium, since MongoDB is reachable from the app servers.

http://php.net/manual/en/mongocollection.count.php
http://php.net/manual/en/class.mongodate.php
https://docs.mongodb.com/manual/reference/command/find/

[22:20 UTC] krinkle at terbium.eqiad.wmnet in ~
$ mwscript eval.php --wiki aawiki
> $mongo = new MongoClient( 'mongodb://tungsten.eqiad.wmnet:27017', [] );
> $db = $mongo->xhprof;
> return $db->results->findOne();
array(2) {
  ["meta"]=>
  array(8) {
    ["url"]=>
    string(23) "/w/api.php?action=query"
    ["SERVER"]=> ..
    ["simple_url"]=>
    string(23) "/w/api.php?action=query"
    ["request_ts"]=>
    object(MongoDate)#239 (2) {
      ["sec"]=>
      int(1456789729)
      ["usec"]=>
      int(0)
    }
 ...
  ["profile"]=> ..

The first record is from February 29, 2016. By trial and error I found that there is probably no index on meta.request_ts in xhprof.results: any query on that field using $lt or $gt times out when comparing against a recent date. Queries for rows before or after February or March 2016 come back quickly, so the collection is presumably stored in that order and scanned from the top.

> return $db->results->findOne(['meta.request_ts' => [ '$lt' => new MongoDate(strtotime('1 March 2016'))]]);
array(2) {
  ["meta"]=>
  array(8) {
    ...
    ["request_ts"]=>
    object(MongoDate)#239 (2) {
      ["sec"]=>
      int(1456789729)
...
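
As a quick sanity check on the request_ts value in the output above, the epoch seconds can be converted to a date. This is a small sketch in Python rather than the PHP shell used here, and it assumes that strtotime('1 March 2016') was evaluated in UTC on terbium, which this task does not state.

```python
from datetime import datetime, timezone

# "sec" field of request_ts in the findOne() output above
first_record_sec = 1456789729

# Cutoff used in the query, assuming it was evaluated in UTC
cutoff = datetime(2016, 3, 1, tzinfo=timezone.utc)

first_record = datetime.fromtimestamp(first_record_sec, tz=timezone.utc)
print(first_record.isoformat())  # 2016-02-29T23:48:49+00:00
print(first_record < cutoff)     # True
```

So the oldest record falls about eleven minutes before the 1 March 2016 cutoff, which is consistent with the $lt query returning it.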

> return $db->results->count();
2533703

> return $db->results->remove(['meta.request_ts' => [ '$lt' => new MongoDate(strtotime('1 March 2016'))]], ['justOne' => true]);
 n: 1, err: null, ok: 1

> return $db->results->count();
2533702 // -1

> return $db->results->remove(['meta.request_ts' => [ '$lt' => new MongoDate(strtotime('1 June 2016'))]]);

Caught exception MongoCursorTimeoutException: Timed out waiting for data

Almost every query times out, so I'm not surprised this one did as well. As far as I can tell, however, it continues running on the server. Manually polling count() shows that the number has been dropping steadily for at least 30 minutes, and it is still going.

> return $db->results->count();
2240192 // -293,510

EDIT: It finally finished after 2 hours

> return $db->results->count();
1959640 // -280,552
> return $db->results->count();
1959640 // -0
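
The deltas noted next to the count() results can be re-derived from the raw numbers. This is only a small arithmetic sketch in Python over the figures quoted above; it does not talk to MongoDB.

```python
# Successive count() results, starting after the single-document removal
counts = [2533702, 2240192, 1959640, 1959640]

# Change between each pair of consecutive polls
deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
print(deltas)  # [-293510, -280552, 0]

# Total removed by the "before 1 June 2016" query, and its share of the collection
removed = counts[0] - counts[-1]
print(removed)                               # 574062
print(round(100 * removed / counts[0], 1))   # 22.7
```

By these numbers, the first bulk removal deleted a little under a quarter of the documents in the collection.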

Mentioned in SAL (#wikimedia-operations) [2017-03-24T02:11:57Z] <Krinkle> Removing xhgui.results entries from before 1 December 2016 in MongoDB on tungsten (T161196)

Mentioned in SAL (#wikimedia-operations) [2017-03-24T06:10:04Z] <Krinkle> Removing xhgui.results entries before 1-Dec-2016 finished. Running xhgui->command(compact=>results) now. T161196

jcrespo added a comment. (Edited) Mar 24 2017, 5:01 PM

The alert is still present. I wonder if it requires defragmenting? http://stackoverflow.com/a/2975296/342196

Edit: I didn't read the last comment. Maybe it is still running?

@jcrespo Given that read/write operations started working again after a few hours, I think it has completed now. I ran it again and this time it completed within a few seconds with an ok response. However, the space is still not being released, according to https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=tungsten.

> $mongo = new MongoClient( 'mongodb://tungsten.eqiad.wmnet:27017', [] );
> $db = $mongo->xhprof;
> return $db->command(['compact' => 'results']);
array(1) {
  ["ok"]=>
  float(1)
}
> return $db->command(['compact' => 'results', 'force' => true]);
array(1) {
  ["ok"]=>
  float(1)
}
Krinkle closed this task as Resolved. Mar 24 2017, 8:52 PM
Krinkle removed a project: Patch-For-Review.

Even after removing all entries, compact reclaimed no space. Also as expected, repairDatabase doesn't work due to a lack of space. I tried it anyway on the suspicion that maybe all it needs is the currently occupied space (1.6TB) plus enough space for the new, smaller copy (< 50GB). While that should suffice (since the result is going to be much smaller than 50 GB), it refuses to try, on the expectation that the repaired copy could end up the same size as the original.

> return $db->command(['repairDatabase' => 1]);
array(2) {
  ["ok"]=>
  float(0)
  ["errmsg"]=>
  string(112) "Cannot repair database xhprof having size: 1689195118592 (bytes) because free disk space is: 55165923328 (bytes)"
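
To put the two byte counts from the error message into readable units, here is a small sketch in Python. The can_repair() rule (free space must be at least the current database size) is an assumption inferred from the message text, not taken from MongoDB's source, and the function name is hypothetical.

```python
import re

errmsg = ("Cannot repair database xhprof having size: 1689195118592 (bytes) "
          "because free disk space is: 55165923328 (bytes)")

# Pull the two byte counts out of the message, in order of appearance
db_size, free = (int(n) for n in re.findall(r"(\d+) \(bytes\)", errmsg))


def can_repair(db_size_bytes, free_bytes):
    # Assumed rule: a repair needs room for a full-size copy of the database.
    return free_bytes >= db_size_bytes


print(f"{db_size / 1e12:.2f} TB used, {free / 1e9:.1f} GB free")  # 1.69 TB used, 55.2 GB free
print(can_repair(db_size, free))  # False
```

Under that assumed rule, the check only compares free space against the existing size, so it cannot take into account that the repaired copy would be far smaller.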

This is reasonable, and I didn't feel like bypassing that warning. Instead, given that XHGui/MongoDB will auto-create the database, I decided to just drop() the collection and the database.

This finally worked.

Mentioned in SAL (#wikimedia-operations) [2017-03-25T03:15:10Z] <Krinkle> Re-create optimised indexes for xhgui in mongodb on tungsten per https://github.com/perftools/xhgui/tree/v0.7.0#installation (lost after T161196)

$ mwscript eval.php --wiki aawiki
> $mongo = new MongoClient( 'mongodb://tungsten.eqiad.wmnet:27017', [] );
> $db = $mongo->xhprof;

> $db->results->ensureIndex(['meta.SERVER.REQUEST_TIME' => -1]);
> $db->results->ensureIndex(['profile.main().wt' => -1]);
> $db->results->ensureIndex(['profile.main().mu' => -1]);
> $db->results->ensureIndex(['profile.main().cpu' => -1]);
> $db->results->ensureIndex(['meta.url' => 1]);

Thank you very much for working on that. Sorry for any inconvenience this may have caused. Sorry I was a bit pushy about fixing it, but normally I get a much less receptive response, and in the end things stop working. I thank you deeply for the prompt and fast action!