
Cluster-wide major compactions: parsoid.data-parsoid table
Closed, Resolved (Public)

Description

As noted elsewhere, the droppable tombstone ratio is abnormally high. Major compactions are not a viable long-term strategy for this problem, but they do significantly reduce the droppable ratio and reclaim significant disk space.

The parsoid.html compactions are nearly complete, and since parsoid.data-parsoid is another high-utilization table, we should run a pass of major compactions there as well.

  • eqiad
    • a
      • restbase1007.eqiad.wmnet
        • a
        • b
        • c
      • restbase1010.eqiad.wmnet
        • a
        • b
        • c
      • restbase1011.eqiad.wmnet
        • a
        • b
        • c
    • b
      • restbase1008.eqiad.wmnet
        • a
        • b
        • c
      • restbase1012.eqiad.wmnet
        • a
        • b
        • c
      • restbase1013.eqiad.wmnet
        • a
        • b
        • c
    • d
      • restbase1009.eqiad.wmnet
        • a
        • b
        • c
      • restbase1014.eqiad.wmnet
        • a
        • b
        • c
      • restbase1015.eqiad.wmnet
        • a
        • b
        • c
  • codfw
    • b
      • restbase2001.codfw.wmnet
        • a (0.007)
        • b
        • c
      • restbase2002.codfw.wmnet
        • a (0.006)
        • b
        • c
      • restbase2007.codfw.wmnet
        • a (0.006)
        • b
        • c
    • c
      • restbase2003.codfw.wmnet
        • a (0.006)
        • b
        • c
      • restbase2004.codfw.wmnet
        • a (0.006)
        • b
        • c
      • restbase2008.codfw.wmnet
        • a (0.006)
        • b
        • c
    • d
      • restbase2005.codfw.wmnet
        • a (0.007)
        • b
        • c
      • restbase2006.codfw.wmnet
        • a (0.006)
        • b
        • c
      • restbase2009.codfw.wmnet
        • a (0.006)
        • b
        • c

Event Timeline

Eevans created this task. Sep 20 2016, 8:12 PM
Eevans moved this task from Backlog to Next on the Cassandra board.
Eevans moved this task from Next to In-Progress on the Cassandra board. Sep 22 2016, 8:44 PM
Eevans updated the task description.
Eevans updated the task description. Sep 23 2016, 5:49 PM
Eevans updated the task description.
Eevans updated the task description. Sep 23 2016, 5:53 PM
Eevans updated the task description. Sep 26 2016, 4:40 PM

Over in T143226: Cluster-wide major compactions: parsoid.html table, I removed repaired-at timestamps (where they existed), since they split the compaction pool and prevented major compactions from being effective at reducing the droppable tombstone ratio. The same is true for wikipedia parsoid.data-parsoid (in addition to others, I suspect). These timestamps will need to be removed as well before continuing here.
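
As a rough illustration of how affected SSTables could be identified, the sketch below parses the text that Cassandra's sstablemetadata tool prints for one SSTable and reports its repaired-at timestamp and estimated droppable tombstone ratio. The field labels ("Repaired at:", "Estimated droppable tombstones:") are assumed to match the tool's output on this cluster's Cassandra version, and the sample path is hypothetical. This is not the script used on this task.

```python
import re

# Field labels are an assumption about sstablemetadata's text output.
_REPAIRED_AT = re.compile(r"^Repaired at:[ \t]*(\d+)[ \t]*$", re.MULTILINE)
_DROPPABLE = re.compile(
    r"^Estimated droppable tombstones:[ \t]*([0-9.eE+-]+)[ \t]*$", re.MULTILINE
)


def parse_sstable_metadata(text):
    """Return (repaired_at, droppable_ratio); a missing field is None."""
    repaired = _REPAIRED_AT.search(text)
    droppable = _DROPPABLE.search(text)
    return (
        int(repaired.group(1)) if repaired else None,
        float(droppable.group(1)) if droppable else None,
    )


def needs_unrepair(text):
    """True when the SSTable carries a non-zero repaired-at timestamp."""
    repaired_at, _ = parse_sstable_metadata(text)
    return repaired_at is not None and repaired_at != 0


# Hypothetical sample; the path and the ratio are made up for illustration.
SAMPLE = """SSTable: /hypothetical/path/ka-61-Data.db
Estimated droppable tombstones: 0.25
Repaired at: 1426826797204
"""

if __name__ == "__main__":
    print(parse_sstable_metadata(SAMPLE), needs_unrepair(SAMPLE))
```

An SSTable for which needs_unrepair returns True would be a candidate for being marked unrepaired while its Cassandra instance is stopped.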

Mentioned in SAL (#wikimedia-operations) [2016-10-05T15:39:44Z] <urandom> T146211: Restarting Cassandra on restbase1007-a.eqiad.wmnet to mark parsoid.data-parsoid tables unrepaired

Mentioned in SAL (#wikimedia-operations) [2016-10-05T15:48:06Z] <urandom> T146211: Restarting Cassandra on restbase1007-b.eqiad.wmnet to mark parsoid.data-parsoid tables unrepaired

Mentioned in SAL (#wikimedia-operations) [2016-10-05T15:54:15Z] <urandom> T146211: Restarting Cassandra on restbase1007-c.eqiad.wmnet to mark parsoid.data-parsoid tables unrepaired

Mentioned in SAL (#wikimedia-operations) [2016-10-05T17:31:59Z] <urandom> T146211: Performing rolling restart of restbase1010.eqiad.wmnet Cassandra instances, and marking SSTables unrepaired.

Mentioned in SAL (#wikimedia-operations) [2016-10-05T17:58:33Z] <urandom> T146211: Performing rolling restart of restbase1011.eqiad.wmnet Cassandra instances, and marking SSTables unrepaired.

Eevans added a comment. Edited Oct 5 2016, 6:04 PM

Marking SSTables unrepaired with:

sudo c-foreach-restart --execute-post-shutdown "curl https://phab.wmfusercontent.org/file/data/uk5p7ehlbegir265rduu/PHID-FILE-52ba4hq35ljiymvcmxvg/Masterwork_From_Distant_Lands | bash -s {id}"
2016-10-05 17:59:48,236 INFO     [a] Disabling client ports...
2016-10-05 17:59:52,033 INFO     [a] Draining...
2016-10-05 18:01:19,562 INFO     [a] Stopping service cassandra-a
2016-10-05 18:01:22,275 INFO     [a] Executing post-shutdown command: curl https://phab.wmfusercontent.org/file/data/uk5p7ehlbegir265rduu/PHID-FILE-52ba4hq35ljiymvcmxvg/Masterwork_From_Distant_Lands | bash -s {id}
2016-10-05 18:01:54,919 INFO     [a] Found: local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ-data-ka-13342-Data.db (repaired at 1426826797204)
2016-10-05 18:01:54,920 INFO     [a] -- Setting unrepaired...done
2016-10-05 18:01:54,920 INFO     [a] Starting service cassandra-a
2016-10-05 18:01:54,951 WARNING  [a] CQL (10.64.0.117:9042) not listening (will retry)...
2016-10-05 18:02:06,959 WARNING  [a] CQL (10.64.0.117:9042) not listening (will retry)...
2016-10-05 18:02:18,971 WARNING  [a] CQL (10.64.0.117:9042) not listening (will retry)...
2016-10-05 18:02:30,977 WARNING  [a] CQL (10.64.0.117:9042) not listening (will retry)...
2016-10-05 18:02:42,985 INFO     [a] CQL (10.64.0.117:9042) is UP
2016-10-05 18:02:43,060 INFO     [b] Disabling client ports...
2016-10-05 18:02:50,444 INFO     [b] Draining...
2016-10-05 18:04:27,981 INFO     [b] Stopping service cassandra-b
2016-10-05 18:04:30,848 INFO     [b] Executing post-shutdown command: curl https://phab.wmfusercontent.org/file/data/uk5p7ehlbegir265rduu/PHID-FILE-52ba4hq35ljiymvcmxvg/Masterwork_From_Distant_Lands | bash -s {id}
2016-10-05 18:05:06,698 INFO     [b] Found: la-21913-big-Data.db (repaired at 1426826797204)
2016-10-05 18:05:06,698 INFO     [b] -- Setting unrepaired...done
2016-10-05 18:05:06,699 INFO     [b] Starting service cassandra-b
2016-10-05 18:05:06,747 WARNING  [b] CQL (10.64.0.118:9042) not listening (will retry)...
2016-10-05 18:05:18,761 WARNING  [b] CQL (10.64.0.118:9042) not listening (will retry)...
2016-10-05 18:05:30,773 WARNING  [b] CQL (10.64.0.118:9042) not listening (will retry)...
2016-10-05 18:05:42,782 WARNING  [b] CQL (10.64.0.118:9042) not listening (will retry)...
2016-10-05 18:05:54,798 INFO     [b] CQL (10.64.0.118:9042) is UP
2016-10-05 18:05:54,800 INFO     [c] Disabling client ports...
2016-10-05 18:06:03,235 INFO     [c] Draining...
2016-10-05 18:07:32,517 INFO     [c] Stopping service cassandra-c
2016-10-05 18:07:35,346 INFO     [c] Executing post-shutdown command: curl https://phab.wmfusercontent.org/file/data/uk5p7ehlbegir265rduu/PHID-FILE-52ba4hq35ljiymvcmxvg/Masterwork_From_Distant_Lands | bash -s {id}
2016-10-05 18:08:12,688 INFO     [c] Found: local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ-data-ka-61-Data.db (repaired at 1426826797204)
2016-10-05 18:08:12,689 INFO     [c] -- Setting unrepaired...done
2016-10-05 18:08:12,689 INFO     [c] Found: local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ-data-ka-299-Data.db (repaired at 1426826797204)
2016-10-05 18:08:12,689 INFO     [c] -- Setting unrepaired...done
2016-10-05 18:08:12,689 INFO     [c] Starting service cassandra-c
2016-10-05 18:08:12,733 WARNING  [c] CQL (10.64.0.119:9042) not listening (will retry)...
2016-10-05 18:08:24,743 WARNING  [c] CQL (10.64.0.119:9042) not listening (will retry)...
2016-10-05 18:08:36,755 WARNING  [c] CQL (10.64.0.119:9042) not listening (will retry)...
2016-10-05 18:08:48,767 WARNING  [c] CQL (10.64.0.119:9042) not listening (will retry)...
2016-10-05 18:09:00,777 INFO     [c] CQL (10.64.0.119:9042) is UP
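
To check a run like the one above, a small script can summarize the log per instance: which SSTables were marked unrepaired, and how long each instance took from "Disabling client ports" until CQL was back up. The following is a minimal sketch that assumes the log line layout shown above (timestamp, level, bracketed instance id, message); the function names are hypothetical and not part of c-foreach-restart.

```python
import re
from datetime import datetime

# Assumed layout: "YYYY-MM-DD HH:MM:SS,mmm LEVEL   [id] message"
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) (\w+)\s+\[(\w+)\] (.*)$"
)
FOUND = re.compile(r"^Found: (\S+) \(repaired at (\d+)\)$")


def summarize(log_text):
    """Map instance id -> {'start', 'up', 'sstables'} parsed from the log."""
    summary = {}
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip the command line and anything unrecognized
        stamp, millis, _level, instance, message = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(
            microsecond=int(millis) * 1000
        )
        entry = summary.setdefault(
            instance, {"start": None, "up": None, "sstables": []}
        )
        if message.startswith("Disabling client ports"):
            entry["start"] = when
        elif message.endswith("is UP"):
            entry["up"] = when
        else:
            found = FOUND.match(message)
            if found:
                entry["sstables"].append(found.group(1))
    return summary


def cycle_seconds(entry):
    """Seconds from disabling client ports to CQL being up, or None."""
    if entry["start"] is None or entry["up"] is None:
        return None
    return (entry["up"] - entry["start"]).total_seconds()


# Lines copied from the log above for instance 'a'.
SAMPLE_LOG = """\
2016-10-05 17:59:48,236 INFO     [a] Disabling client ports...
2016-10-05 18:01:54,919 INFO     [a] Found: local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ-data-ka-13342-Data.db (repaired at 1426826797204)
2016-10-05 18:01:54,920 INFO     [a] -- Setting unrepaired...done
2016-10-05 18:02:42,985 INFO     [a] CQL (10.64.0.117:9042) is UP
"""

if __name__ == "__main__":
    for instance, entry in summarize(SAMPLE_LOG).items():
        print(instance, cycle_seconds(entry), entry["sstables"])
```

Applied to the full log, this gives one entry each for instances a, b, and c; instance a took just under three minutes from disabling client ports to CQL being up.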

Mentioned in SAL (#wikimedia-operations) [2016-10-05T18:17:41Z] <urandom> T146211: Performing rolling restart of RESTBase rack 'b' Cassandra instances, and marking SSTables unrepaired.

Mentioned in SAL (#wikimedia-operations) [2016-10-05T18:46:37Z] <urandom> T146211: Performing rolling restart of RESTBase eqiad rack 'd' Cassandra instances, and marking SSTables unrepaired.

Eevans closed this task as Resolved. Nov 29 2016, 9:31 PM

These ad hoc manual compactions were completed (more than once, in fact); closing.