
Compress and defragment tables on labsdb hosts
Closed, ResolvedPublic

Description

Wikireplicas hosts are approaching 90% usage on /srv:

===== NODE GROUP =====
(1) labsdb1010.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T   10T  1.7T  87% /srv
===== NODE GROUP =====
(1) labsdb1009.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T   10T  1.7T  86% /srv
===== NODE GROUP =====
(1) labsdb1012.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    14T   11T  3.9T  73% /srv
===== NODE GROUP =====
(1) labsdb1011.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T   11T  1.6T  87% /srv

I just did a quick check on enwiki and wikidata, and there are indeed tables that need to be compressed (especially those that are temporary but also quite big), so I assume this is the case for most of the wikis:

+--------------+--------------------------------------------------+------------+
| table_schema | table_name | row_format |
+--------------+--------------------------------------------------+------------+
| enwiki | abuse_filter | Compressed |
| enwiki | abuse_filter_action | Compressed |
| enwiki | abuse_filter_history | Compressed |
| enwiki | abuse_filter_log | Compressed |
| enwiki | actor | Compact |
| enwiki | archive | Compressed |
| enwiki | babel | Compressed |
| enwiki | betafeatures_user_counts | Compressed |
| enwiki | bv2011_edits | Compressed |
| enwiki | bv2013_edits | Compressed |
| enwiki | bv2015_edits | Compressed |
| enwiki | bv2017_edits | Compact |
| enwiki | category | Compressed |
| enwiki | categorylinks | Compact |
| enwiki | change_tag | Compressed |
| enwiki | change_tag_def | Compact |
| enwiki | comment | Compact |
| enwiki | content | Compact |
| enwiki | content_models | Compact |
| enwiki | externallinks | Compressed |
| enwiki | filearchive | Compressed |
| enwiki | flaggedimages | Compressed |
| enwiki | flaggedpage_config | Compressed |
| enwiki | flaggedpage_pending | Compressed |
| enwiki | flaggedpages | Compressed |
| enwiki | flaggedrevs | Compressed |
| enwiki | flaggedrevs_promote | Compressed |
| enwiki | flaggedrevs_statistics | Compressed |
| enwiki | flaggedrevs_stats | Compressed |
| enwiki | flaggedrevs_stats2 | Compressed |
| enwiki | flaggedrevs_tracking | Compressed |
| enwiki | flaggedtemplates | Compressed |
| enwiki | geo_tags | Compressed |
| enwiki | global_block_whitelist | Compressed |
| enwiki | image | Compressed |
| enwiki | imagelinks | Compact |
| enwiki | interwiki | Compressed |
| enwiki | ip_changes | Compact |
| enwiki | ipblocks | Compressed |
| enwiki | ipblocks_restrictions | Compact |
| enwiki | iwlinks | Compact |
| enwiki | l10n_cache | Compressed |
| enwiki | langlinks | Compact |
| enwiki | linter | Compact |
| enwiki | logging | Compressed |
| enwiki | math | Compact |
| enwiki | mathoid | Compressed |
| enwiki | module_deps | Compact |
| enwiki | oldimage | Compressed |
| enwiki | ores_classification | Compact |
| enwiki | ores_model | Compressed |
| enwiki | page | Compressed |
| enwiki | page_assessments | Compressed |
| enwiki | page_assessments_projects | Compact |
| enwiki | page_props | Compressed |
| enwiki | page_restrictions | Compressed |
| enwiki | pagelinks | Compact |
| enwiki | pagetriage_log | Compressed |
| enwiki | pagetriage_page | Compressed |
| enwiki | pagetriage_page_tags | Compressed |
| enwiki | pagetriage_tags | Compressed |
| enwiki | pif_edits | Compressed |
| enwiki | protected_titles | Compressed |
| enwiki | recentchanges | Compressed |
| enwiki | redirect | Compressed |
| enwiki | revision | Compressed |
| enwiki | revision_actor_temp | Compact |
| enwiki | revision_comment_temp | Compact |
| enwiki | searchindex | Compact |
| enwiki | site_identifiers | Compressed |
| enwiki | site_stats | Compressed |
| enwiki | sites | Compressed |
| enwiki | slot_roles | Compact |
| enwiki | slots | Compact |
| enwiki | templatelinks | Compact |
| enwiki | transcode | Compressed |
| enwiki | updatelog | Compressed |
| enwiki | user | Compressed |
| enwiki | user_former_groups | Compact |
| enwiki | user_groups | Compressed |
| enwiki | user_properties | Compressed |
| enwiki | wbc_entity_usage | Compressed |
| enwiki | wikilove_log | Compressed |
| wikidatawiki | abuse_filter | Compressed |
| wikidatawiki | abuse_filter_action | Compressed |
| wikidatawiki | abuse_filter_history | Compressed |
| wikidatawiki | abuse_filter_log | Compressed |
| wikidatawiki | actor | Compact |
| wikidatawiki | archive | Compressed |
| wikidatawiki | babel | Compressed |
| wikidatawiki | betafeatures_user_counts | Compressed |
| wikidatawiki | bv2013_edits | Compressed |
| wikidatawiki | bv2015_edits | Compressed |
| wikidatawiki | bv2017_edits | Compressed |
| wikidatawiki | category | Compressed |
| wikidatawiki | categorylinks | Compressed |
| wikidatawiki | change_tag | Compressed |
| wikidatawiki | change_tag_def | Compact |
| wikidatawiki | comment | Compact |
| wikidatawiki | config | Compressed |
| wikidatawiki | content | Compact |
| wikidatawiki | content_models | Compact |
| wikidatawiki | externallinks | Compressed |
| wikidatawiki | filearchive | Compressed |
| wikidatawiki | geo_tags | Compressed |
| wikidatawiki | global_block_whitelist | Compressed |
| wikidatawiki | globalblocks | Compressed |
| wikidatawiki | image | Compressed |
| wikidatawiki | imagelinks | Compressed |
| wikidatawiki | interwiki | Compressed |
| wikidatawiki | ip_changes | Compact |
| wikidatawiki | ipblocks | Compressed |
| wikidatawiki | ipblocks_restrictions | Compact |
| wikidatawiki | iwlinks | Compressed |
| wikidatawiki | l10n_cache | Compressed |
| wikidatawiki | langlinks | Compressed |
| wikidatawiki | linter | Compressed |
| wikidatawiki | logging | Compressed |
| wikidatawiki | math | Compressed |
| wikidatawiki | mathoid | Compressed |
| wikidatawiki | module_deps | Compressed |
| wikidatawiki | oldimage | Compressed |
| wikidatawiki | ores_classification | Compressed |
| wikidatawiki | ores_model | Compressed |
| wikidatawiki | page | Compressed |
| wikidatawiki | page_props | Compressed |
| wikidatawiki | page_restrictions | Compressed |
| wikidatawiki | pagelinks | Compact |
| wikidatawiki | protected_titles | Compressed |
| wikidatawiki | recentchanges | Compressed |
| wikidatawiki | redirect | Compressed |
| wikidatawiki | revision | Compressed |
| wikidatawiki | revision_actor_temp | Compact |
| wikidatawiki | revision_comment_temp | Compact |
| wikidatawiki | revtag | Compressed |
| wikidatawiki | searchindex | Compressed |
| wikidatawiki | site_identifiers | Compressed |
| wikidatawiki | site_stats | Compressed |
| wikidatawiki | sites | Compressed |
| wikidatawiki | slot_roles | Compact |
| wikidatawiki | slots | Compact |
| wikidatawiki | templatelinks | Compact |
| wikidatawiki | transcode | Compressed |
| wikidatawiki | translate_groupreviews | Compressed |
| wikidatawiki | translate_groupstats | Compressed |
| wikidatawiki | translate_messageindex | Compressed |
| wikidatawiki | translate_metadata | Compressed |
| wikidatawiki | translate_reviews | Compressed |
| wikidatawiki | translate_sections | Compressed |
| wikidatawiki | updatelog | Compressed |
| wikidatawiki | user | Compressed |
| wikidatawiki | user_former_groups | Compressed |
| wikidatawiki | user_groups | Compressed |
| wikidatawiki | user_properties | Compressed |
| wikidatawiki | wb_changes | Compressed |
| wikidatawiki | wb_changes_dispatch | Compressed |
| wikidatawiki | wb_changes_subscription | Compressed |
| wikidatawiki | wb_id_counters | Compressed |
| wikidatawiki | wb_items_per_site | Compressed |
| wikidatawiki | wb_property_info | Compressed |
| wikidatawiki | wb_terms | Compressed |
| wikidatawiki | wbc_entity_usage | Compressed |
| wikidatawiki | wbqc_constraints | Compressed |
| wikidatawiki | wbs_propertypairs | Compressed |
| wikidatawiki | wikimedia_editor_tasks_entity_description_exists | Compact |
+--------------+--------------------------------------------------+------------+
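
A listing like the one above can be produced from `information_schema.tables`, which exposes a `row_format` column. The sketch below is illustrative only: the query text and the helper name `uncompressed_tables` are assumptions, not the commands actually run for this task. It reduces (schema, table, row_format) rows to the tables that still need compressing.

```python
# Hypothetical query; the exact statement used for the listing above is not
# recorded in this task.
ROW_FORMAT_QUERY = """
SELECT table_schema, table_name, row_format
FROM information_schema.tables
WHERE table_schema IN ('enwiki', 'wikidatawiki')
  AND table_type = 'BASE TABLE'
ORDER BY table_schema, table_name
"""


def uncompressed_tables(rows):
    """Return (schema, table) pairs whose row format is not Compressed."""
    return [
        (schema, table)
        for schema, table, row_format in rows
        if row_format.lower() != "compressed"
    ]


# A few rows copied from the listing above.
sample = [
    ("enwiki", "abuse_filter", "Compressed"),
    ("enwiki", "actor", "Compact"),
    ("enwiki", "slots", "Compact"),
    ("wikidatawiki", "wb_terms", "Compressed"),
]
print(uncompressed_tables(sample))
```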

I think we can defragment all the tables at the same time, as I believe they have not been defragmented since these hosts were set up around 2 years ago.
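
One possible way to combine both operations, sketched here as an assumption rather than as the procedure actually used: rebuilding an InnoDB table with `ALTER TABLE ... ENGINE=InnoDB` rewrites it and so reclaims fragmented space, while `ALTER TABLE ... ROW_FORMAT=COMPRESSED` rebuilds it in the compressed row format. The helper name `rebuild_statement` is hypothetical, and any `KEY_BLOCK_SIZE` choice is left to the server default.

```python
def rebuild_statement(schema, table, row_format):
    """Build an ALTER TABLE statement that rebuilds one table.

    Tables that are already compressed only get a plain rebuild
    (defragmentation); the others are converted to the compressed row
    format, which also rewrites the table.
    """
    name = f"`{schema}`.`{table}`"
    if row_format.lower() == "compressed":
        return f"ALTER TABLE {name} ENGINE=InnoDB;"
    return f"ALTER TABLE {name} ROW_FORMAT=COMPRESSED;"


print(rebuild_statement("enwiki", "revision", "Compressed"))
print(rebuild_statement("enwiki", "slots", "Compact"))
```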

Status:

Event Timeline


Compressing the following tables on all wikis will probably make a big difference already, as they are pretty big and they are not compressed:

slots
revision_actor_temp
revision_comment_temp
comment
categorylinks

I also realised that s2, s6 and s7 don't look compressed. So I am going to start compressing those on labsdb1012 (the analytics host, which is only used at the start of the month), so we can see how much difference it makes.

Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    14T   11T  3.9T  73% /srv

Mentioned in SAL (#wikimedia-operations) [2019-05-13T06:09:00Z] <marostegui> Compress s2, s6 and s7 on labsdb1012 - T222978

Change 510126 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbproxy: Switchover labsdb1009 to 11, reorganize weights

https://gerrit.wikimedia.org/r/510126

Change 510126 merged by Jcrespo:
[operations/puppet@production] dbproxy: Switchover labsdb1009 to 11, reorganize weights

https://gerrit.wikimedia.org/r/510126

Mentioned in SAL (#wikimedia-operations) [2019-05-14T16:30:30Z] <jynus> stop replication and start table recompression on labsdb1009 T222978

s6 finished compression on labsdb1012, so this host is ready to have all the other tables compressed, as is being done on labsdb1009 at the moment.

root@labsdb1012:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    14T  9.5T  4.5T  68% /srv

We got around 600GB back if compared against T222978#5175314

Mentioned in SAL (#wikimedia-operations) [2019-05-17T07:11:58Z] <marostegui> Compress s7 on labsdb1012 T222978

Mentioned in SAL (#wikimedia-operations) [2019-05-20T15:26:15Z] <marostegui> Stop replication on labsdb1011 to start compressing tables - T222978

I have updated the status of each host on the task description, but leaving this here for the record too:

===== NODE GROUP =====
(1) labsdb1012.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    14T  7.3T  6.7T  53% /srv
===== NODE GROUP =====
(1) labsdb1011.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T  9.0T  2.7T  77% /srv
===== NODE GROUP =====
(1) labsdb1010.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T   10T  1.7T  86% /srv
===== NODE GROUP =====
(1) labsdb1009.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T  8.5T  3.1T  74% /srv

@Bstorm Did your maintenance finish? Can we continue these tasks?

Yes, we talked the other day and maintenance is done for now. She might need to do some more next Monday

Change 514273 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Depool labsdb1011 for maintenance

https://gerrit.wikimedia.org/r/514273

Change 514273 merged by Jcrespo:
[operations/puppet@production] mariadb: Depool labsdb1011 for maintenance

https://gerrit.wikimedia.org/r/514273

Restarting compression on labsdb1011 (stopping replication, depooled). CC @Bstorm

Mentioned in SAL (#wikimedia-operations) [2019-06-04T16:32:45Z] <marostegui> Compress some more tables on labsdb1012 before upgrading the host tomorrow T222978

Mentioned in SAL (#wikimedia-operations) [2019-06-05T05:28:58Z] <marostegui> Keep compressing tables on labsdb1012 - T222978

Change 515018 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: Depool labsdb1010 to proceed with compression

https://gerrit.wikimedia.org/r/515018

Change 515018 merged by Jcrespo:
[operations/puppet@production] labsdb: Depool labsdb1010 to proceed with compression

https://gerrit.wikimedia.org/r/515018

Change 515036 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikireplica: Depool actually labsdb1010

https://gerrit.wikimedia.org/r/515036

Change 515036 merged by Jcrespo:
[operations/puppet@production] wikireplica: Depool actually labsdb1010

https://gerrit.wikimedia.org/r/515036

labsdb1012 finished the compression on all its tables

root@labsdb1012:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    14T  5.8T  8.2T  42% /srv

@jcrespo can we maybe stop maintenance on labsdb1010 so it can catch up before the s4 failover next week and resume the compression after it? (T224852)
I would prefer to have close to no lag on all the hosts if possible for the failover.

labsdb1010 current status:

root@labsdb1010:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T  8.4T  3.3T  72% /srv

As we discussed, I have stopped the compression.
The last compressed table was fawiki.revision, so when resumed it needs to continue from line 14054 of /home/jynus/labsdb1010_tables_to_compress.txt
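
Resuming a batch from a recorded line number can be sketched as follows. This is a minimal illustration, not the script used on the host: the format of the list file is not shown in this task, so one entry per line is an assumption, the entries below are placeholders, and `tables_from_line` is a hypothetical helper. Whether the recorded line is the last finished table or the next one to process is not stated either; the sketch treats it as the first line still to be processed.

```python
from itertools import islice


def tables_from_line(lines, resume_line):
    """Yield non-empty entries starting at the 1-indexed line `resume_line`."""
    for line in islice(lines, resume_line - 1, None):
        entry = line.strip()
        if entry:
            yield entry


# Placeholder stand-in for the real list file (assumed: one entry per line).
example = ["wiki_a.table1\n", "wiki_a.table2\n", "wiki_b.table1\n", "wiki_b.table2\n"]
print(list(tables_from_line(example, 3)))
```

With a real file, an open file object can be passed in place of `example`, since `islice` works on any iterable of lines.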

Going to start replication now on all threads.

Mentioned in SAL (#wikimedia-operations) [2019-06-12T14:38:16Z] <marostegui> Start replication on all threads on labsdb1010 - T222978

Marostegui lowered the priority of this task from High to Medium. Jun 13 2019, 7:05 AM

Decreasing to normal as the situation is better now (but let's keep working on it as we are doing now):

===== NODE GROUP =====
(1) labsdb1010.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T  8.4T  3.3T  72% /srv
===== NODE GROUP =====
(1) labsdb1009.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T  8.7T  3.0T  75% /srv
===== NODE GROUP =====
(1) labsdb1012.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    14T  5.9T  8.2T  42% /srv
===== NODE GROUP =====
(1) labsdb1011.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T  8.6T  3.1T  74% /srv

The failover was done, so we can probably keep compressing tables.
@jcrespo let me know if you would like to handle this yourself or if you want me to take over so you can focus on backups :)

Change 518029 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy: Depool labsdb1011

https://gerrit.wikimedia.org/r/518029

Change 518029 merged by Marostegui:
[operations/puppet@production] dbproxy: Depool labsdb1011

https://gerrit.wikimedia.org/r/518029

Mentioned in SAL (#wikimedia-operations) [2019-06-20T13:20:45Z] <marostegui> Reload haproxy on dbproxy1010 and dbproxy1011 to depool labsdb1011 - T222978

Mentioned in SAL (#wikimedia-operations) [2019-06-20T13:23:08Z] <marostegui> Stop replication on labsdb1011 to defragment tables T222978

Change 519948 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy: Depool labsdb1011

https://gerrit.wikimedia.org/r/519948

Change 519948 merged by Marostegui:
[operations/puppet@production] dbproxy: Depool labsdb1011

https://gerrit.wikimedia.org/r/519948

Mentioned in SAL (#wikimedia-operations) [2019-07-01T04:49:01Z] <marostegui> Change pt-kill value on labsdb1009 temporarily, from 300 to 14400 T222978

Mentioned in SAL (#wikimedia-operations) [2019-07-01T04:50:41Z] <marostegui> Reload haproxy on dbproxy1010 and dbproxy1011 to depool labsdb1011 - T222978

Mentioned in SAL (#wikimedia-operations) [2019-07-01T04:53:29Z] <marostegui> Keep compressing tables on labsdb1011 - T222978

bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.
bd808 subscribed.

labsdb1011 is fully done:

root@labsdb1011:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T  6.2T  5.5T  53% /srv

Mentioned in SAL (#wikimedia-operations) [2019-07-05T07:17:52Z] <marostegui> Compress small wikis on labsdb1009 T222978

In order to cause less disruption to the service I am trying a different approach with labsdb1009.
I am compressing around 50k tables from almost 700 wikis that are each smaller than 10GB, without depooling the host, so the load won't be as high as we've seen on the other hosts. We'll then need to depool the host only for the biggest wikis.
Those tables are small enough that replication isn't a problem, and they compress so fast that metadata locking isn't an issue either (also because those wikis aren't used that much).

From the math I have done this will be running for around 24-30h

Mentioned in SAL (#wikimedia-operations) [2019-07-08T05:31:33Z] <marostegui> Compress medium wikis on labsdb1009 - T222978

After this big batch of wiki compression only 3555 tables were left to be compressed. I am now trying to compress medium-size wikis, between 20GB and 100GB (a total of 3000 tables). If this goes fine, only the bigger wikis (just one table for enwiki, ruwiki, wikidata) would be left to compress, so we would depool only for them, which would reduce the number of days that 1009 needs to be depooled and degrade the service less.
Will report back once this new batch is done.
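
The size-based batching described in the comments above could be sketched like this. The 10GB and 100GB boundaries come from those comments; treating everything between them as "medium" is a simplification made here (the comment mentions 20GB to 100GB), and the wiki names, sizes, and helper names are placeholders, not real data.

```python
def batch_for(size_gb, small_max_gb=10, large_min_gb=100):
    """Classify a wiki by its on-disk size in GB."""
    if size_gb < small_max_gb:
        return "small"   # compressed while the host stays pooled
    if size_gb < large_min_gb:
        return "medium"  # tried next, still without depooling
    return "large"       # left for a depool window


def plan(sizes_gb):
    """Group wiki names into small/medium/large batches."""
    batches = {"small": [], "medium": [], "large": []}
    for wiki, size in sorted(sizes_gb.items()):
        batches[batch_for(size)].append(wiki)
    return batches


# Placeholder sizes for illustration only.
print(plan({"wiki_a": 2.5, "wiki_b": 45.0, "wiki_c": 900.0, "wiki_d": 0.3}))
```

Per-wiki sizes for such a plan could, for example, be derived by summing `data_length + index_length` per `table_schema` in `information_schema.tables`.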

Mentioned in SAL (#wikimedia-operations) [2019-07-22T05:24:34Z] <marostegui> Compress more tables on labsdb1009 - T222978

Mentioned in SAL (#wikimedia-operations) [2019-07-25T11:53:04Z] <marostegui> Compress s3 wikis on labsdb1010 - T222978

Mentioned in SAL (#wikimedia-operations) [2019-07-31T05:00:02Z] <marostegui> Compress s6 on labsdb1010 - T222978

Mentioned in SAL (#wikimedia-operations) [2019-08-02T09:22:05Z] <marostegui> Compress s7 on labsdb1010 - T222978

Change 528152 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1010: Depool labsdb1010

https://gerrit.wikimedia.org/r/528152

Change 528152 merged by Marostegui:
[operations/puppet@production] dbproxy1011: Depool labsdb1010

https://gerrit.wikimedia.org/r/528152

Mentioned in SAL (#wikimedia-operations) [2019-08-05T14:31:59Z] <marostegui> Reload haproxy on dbproxy1011 to depool labsdb1010 T222978

Mentioned in SAL (#wikimedia-operations) [2019-08-06T05:06:39Z] <marostegui> Reload haproxy on dbproxy1011 to repool labsdb1010 T222978

labsdb1010 finished its compression

root@labsdb1010:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T  6.2T  5.5T  53% /srv
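
For a rough sense of the overall saving, the "Used" figures quoted in this task (task description versus the final `df` output for each finished host) can be compared. This is only a back-of-the-envelope sketch: `df -h` prints rounded values and the hosts kept replicating new data in between, so the results are approximate. The helper names are hypothetical.

```python
# Multipliers to express `df -h` sizes in T (df -h uses powers of 1024).
UNITS_IN_T = {"G": 1 / 1024, "T": 1.0}


def to_t(text):
    """Convert a `df -h` size such as '9.5T' or '600G' to a float in T."""
    return float(text[:-1]) * UNITS_IN_T[text[-1]]


def saved(before, after):
    """Return (space reclaimed in T, fraction of the original usage reclaimed)."""
    b, a = to_t(before), to_t(after)
    return b - a, (b - a) / b


# "Used" column at the start of the task and after compression finished.
used = {
    "labsdb1010": ("10T", "6.2T"),
    "labsdb1011": ("11T", "6.2T"),
    "labsdb1012": ("11T", "5.8T"),
}
for host, (before, after) in sorted(used.items()):
    reclaimed, fraction = saved(before, after)
    print(f"{host}: about {reclaimed:.1f}T reclaimed ({fraction:.0%} of used space)")
```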
Marostegui updated the task description.