
labsdb1001 and labsdb1003 short on available space
Closed, Resolved, Public

Description

/dev/mapper/tank-data              xfs       3.0T  2.7T  335G  90% /srv
/dev/mapper/userdata_1001-userdata xfs       3.3T  1.1T  2.2T  33% /srvuserdata
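The two rows above are `df -hT`-style output with the header line omitted. A minimal sketch for turning such rows into numbers and flagging nearly full mounts could look like this. The 85% threshold and the function names are hypothetical, and sizes are assumed to use df's 1024-based `-h` suffixes.

```python
"""Parse `df -hT`-style rows (no header line) and flag nearly full mounts."""
from dataclasses import dataclass

# df -h uses powers of 1024; convert every suffix to GiB.
UNITS = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}

ROWS = [
    "/dev/mapper/tank-data              xfs       3.0T  2.7T  335G  90% /srv",
    "/dev/mapper/userdata_1001-userdata xfs       3.3T  1.1T  2.2T  33% /srvuserdata",
]


@dataclass
class Mount:
    device: str
    fstype: str
    size_gib: float
    used_gib: float
    avail_gib: float
    use_pct: int
    mountpoint: str


def to_gib(value: str) -> float:
    """Convert a human-readable size such as '3.0T' or '335G' to GiB."""
    return float(value[:-1]) * UNITS[value[-1]]


def parse_df_row(row: str) -> Mount:
    """Split one row into its seven whitespace-separated df columns."""
    device, fstype, size, used, avail, pct, mountpoint = row.split()
    return Mount(device, fstype, to_gib(size), to_gib(used), to_gib(avail),
                 int(pct.rstrip("%")), mountpoint)


def over_threshold(rows, max_use_pct=85):
    """Return mount points whose Use% is at or above the (hypothetical) limit."""
    return [m.mountpoint for m in map(parse_df_row, rows) if m.use_pct >= max_use_pct]


if __name__ == "__main__":
    print(over_threshold(ROWS))  # only /srv is at or above 85%
```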

A quick (though modest) space saving would be /srv/sqldatas2.sql.gz (42G, dated Dec 17 2014), which could be moved or deleted.

Of the user, group and tool databases, these are the largest:

272G	s51187__xtools_tmp
64G	u3532__
53G	p50380g50816__pop_stats
20G	u2815__old_p
13G	s51127__dewiki_lists
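To put the list in perspective against the 335G still free on /srv, a small sketch can total and rank the entries. It assumes the list is `du`-style "size<TAB>name" text and that every size is in G, as it is here; the function name is hypothetical.

```python
"""Rank the largest user/tool databases and show how much they add up to."""

SIZES_TEXT = """\
272G\ts51187__xtools_tmp
64G\tu3532__
53G\tp50380g50816__pop_stats
20G\tu2815__old_p
13G\ts51127__dewiki_lists
"""


def parse_sizes(text):
    """Return (name, size_in_G) pairs sorted from largest to smallest."""
    result = []
    for line in text.splitlines():
        if not line.strip():
            continue
        size, name = line.split("\t", 1)
        if not size.endswith("G"):
            raise ValueError(f"only G-suffixed sizes are handled: {size!r}")
        result.append((name, float(size[:-1])))
    return sorted(result, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    ranked = parse_sizes(SIZES_TEXT)
    total = sum(size for _, size in ranked)
    for name, size in ranked:
        print(f"{name:30} {size:6.0f}G {size / total:6.1%}")
    print(f"total: {total:.0f}G")
```

With these five entries the total is 422G, more than the free space left on /srv, and s51187__xtools_tmp alone accounts for roughly two thirds of it.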

Event Timeline

Acknowledged (non-sticky) the warning on Icinga.

This just paged again via Icinga/SMS/IRC:

PROBLEM - MariaDB disk space on labsdb1001 is CRITICAL: DISK CRITICAL - free space: /srv 179614 MB (5% inode=99%)
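The alert text carries the mount point, free space and free percentage. A minimal sketch for extracting those fields and mapping a free percentage to a state follows; the regular expression only covers the message format quoted here, and the warning/critical thresholds (10% and 5% free) are hypothetical, not the ones configured in Icinga.

```python
"""Extract fields from a disk-space alert and classify free space."""
import re

ALERT = ("PROBLEM - MariaDB disk space on labsdb1001 is CRITICAL: DISK CRITICAL - "
         "free space: /srv 179614 MB (5% inode=99%)")

# Matches the "free space: <mount> <N> MB (<P>% inode=<I>%)" part of the message.
PATTERN = re.compile(
    r"free space: (?P<mount>\S+) (?P<free_mb>\d+) MB "
    r"\((?P<free_pct>\d+)% inode=(?P<inode_pct>\d+)%\)"
)


def parse_alert(text):
    """Return (mount, free_mb, free_pct, inode_free_pct) from the alert text."""
    m = PATTERN.search(text)
    if m is None:
        raise ValueError("not a disk free-space message")
    return m["mount"], int(m["free_mb"]), int(m["free_pct"]), int(m["inode_pct"])


def classify(free_pct, warn_pct=10, crit_pct=5):
    """Map a free-space percentage to a state using hypothetical thresholds."""
    if free_pct <= crit_pct:
        return "CRITICAL"
    if free_pct <= warn_pct:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    mount, free_mb, free_pct, _ = parse_alert(ALERT)
    print(mount, free_mb, classify(free_pct))
```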

Just FYI.

RobH triaged this task as High priority. Apr 21 2016, 6:05 PM

Woah. Why is xtools on the list? It shouldn't be using that much DB space; it should be using almost nothing.

Where does s51187__xtools_tmp live? I can't seem to use it with sql local:

MariaDB [(none)]> USE s51187__xtools_tmp;
ERROR 1049 (42000): Unknown database 's51187__xtools_tmp'

@Cyberpower678, any insight on this? I think it's being written to by the continuous enwiki_update scripts, which apparently serve the Wikihistory tool. Surely we don't need all 272G of data and can probably drop a lot of rows. With the normal xtools-articleinfo tool back up and running, I don't think many people are using Wikihistory anyway.
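If rows are dropped, doing it in small batches with a commit after each batch avoids one huge transaction on a replica that is already short on space. The sketch below uses Python's built-in sqlite3 purely as a runnable stand-in (on MariaDB one would typically use `DELETE ... WHERE ... LIMIT n` directly); the table `wikihistory_cache`, its `fetched_at` column and the cutoff date are hypothetical. Note also that on InnoDB, deleting rows does not by itself shrink the table's file on disk; a table rebuild such as `OPTIMIZE TABLE` is usually needed to give the space back.

```python
"""Delete rows in fixed-size batches (sqlite3 stand-in for a MariaDB cleanup)."""
import sqlite3


def delete_in_batches(conn, table, where_sql, params=(), batch_size=10_000):
    """Delete matching rows batch by batch, committing after each batch.

    `table` and `where_sql` are interpolated into the SQL, so they must come
    from trusted code, never from user input. Returns the number of rows deleted.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE {where_sql} LIMIT ?)",
            (*params, batch_size),
        )
        conn.commit()  # keep each transaction small
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


def demo():
    """Build a hypothetical table with 20 old and 5 recent rows, then prune it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wikihistory_cache (page_id INTEGER, fetched_at TEXT)")
    rows = [(i, "2015-06-01" if i < 20 else "2016-04-01") for i in range(25)]
    conn.executemany("INSERT INTO wikihistory_cache VALUES (?, ?)", rows)
    conn.commit()
    deleted = delete_in_batches(conn, "wikihistory_cache", "fetched_at < ?",
                                ("2016-01-01",), batch_size=7)
    remaining = conn.execute("SELECT COUNT(*) FROM wikihistory_cache").fetchone()[0]
    return deleted, remaining


if __name__ == "__main__":
    print(demo())
```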

> Where does s51187__xtools_tmp live?

It is on the enwiki (s1) host.

It actually ran out of space during the spike:

Thu Apr 21 18:01:55 2016 TokuFT file system space is really low and access is restricted
160421 18:01:55 [ERROR] Master 's5': Slave SQL: Could not execute Write_rows_v1 event on table wikidatawiki.wb_entity_per_page; Disk full (wb_entity_per_page); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s5-bin.001938, end_log_pos 835811123, Gtid 0-171970704-3290249448, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's5': Slave: Disk full (wb_entity_per_page); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's5': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's5-bin.001938' position 835810855
160421 18:01:55 [ERROR] Master 's7': Slave SQL: Could not execute Write_rows_v1 event on table arwiki.pagelinks; Disk full (pagelinks); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s7-bin.001895, end_log_pos 848293313, Gtid 0-171970590-1811187008, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's7': Slave: Disk full (pagelinks); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's7-bin.001895' position 848293100
160421 18:01:55 [ERROR] Master 's4': Slave SQL: Could not execute Write_rows_v1 event on table commonswiki.globalimagelinks; Disk full (globalimagelinks); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s4-bin.001622, end_log_pos 80104532, Gtid 0-171970591-1795256580, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's4': Slave: Disk full (globalimagelinks); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's4': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's4-bin.001622' position 80104227
160421 18:01:55 [ERROR] Master 's2': Slave SQL: Could not execute Write_rows_v1 event on table itwiki.revision; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s2-bin.001902, end_log_pos 450115880, Gtid 0-171970567-2792438909, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's2': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [Warning] Master 's2': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's2': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's2-bin.001902' position 450115605
160421 18:01:55 [ERROR] Master 's3': Slave SQL: Could not execute Write_rows_v1 event on table shwiki.revision; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s3-bin.001750, end_log_pos 606067483, Gtid 0-171966669-2474745814, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's3': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [Warning] Master 's3': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's3': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's3-bin.001750' position 606067161
160421 18:01:56 [ERROR] Master 's6': Slave SQL: Could not execute Write_rows_v1 event on table frwiki.revision; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s6-bin.001371, end_log_pos 864552667, Gtid 0-171970705-1887782537, Internal MariaDB error code: 1021
160421 18:01:56 [Warning] Master 's6': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:56 [Warning] Master 's6': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:56 [ERROR] Master 's6': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's6-bin.001371' position 864552423
160421 18:02:01 [Warning] Event Scheduler: [root@208.80.154.151][ops.wmf_labs_sleepers_txn] Data truncated for column 'STATE' at row 1
160421 18:02:06 [ERROR] Master 's1': Slave SQL: Could not execute Write_rows_v1 event on table enwiki.abuse_filter_log; Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s1-bin.002098, end_log_pos 27645379, Gtid 0-171974683-3815619482, Internal MariaDB error code: 1021
160421 18:02:06 [Warning] Master 's1': Slave: Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:02:06 [Warning] Master 's1': Slave: Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:02:06 [ERROR] Master 's1': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's1-bin.002098' position 27645084
Thu Apr 21 18:20:40 2016 TokuFT file system space is low
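Every replication connection stopped at a different binlog position. Once space is freed, a small script can pull the connection name and stop position out of the error log and list the connections that need restarting. This is a sketch: the regular expression matches only the "Error running query, slave SQL thread aborted" lines quoted above, and the generated statements use MariaDB's multi-source form `START SLAVE 'connection_name'` (`START ALL SLAVES` is the all-at-once alternative).

```python
"""Find replication connections that stopped on 'Disk full' and where."""
import re

STOP_RE = re.compile(
    r"\[ERROR\] Master '(?P<conn>[^']+)': Error running query, "
    r"slave SQL thread aborted\..*"
    r"We stopped at log '(?P<log>[^']+)' position (?P<pos>\d+)"
)


def stopped_connections(lines):
    """Map connection name -> (binlog file, position) for each aborted SQL thread."""
    stops = {}
    for line in lines:
        m = STOP_RE.search(line)
        if m:
            stops[m["conn"]] = (m["log"], int(m["pos"]))
    return stops


def restart_statements(stops):
    """One START SLAVE statement per stopped connection, in name order."""
    return [f"START SLAVE '{conn}';" for conn in sorted(stops)]


if __name__ == "__main__":
    import sys
    for stmt in restart_statements(stopped_connections(sys.stdin)):
        print(stmt)
```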
jcrespo renamed this task from "labsdb1001 short on available space" to "labsdb1001 and labsdb1003 short on available space". Jun 16 2016, 8:58 AM

labsdb1003 is now also at 10% free space.

It's at 14% now; maybe that is enough until decommissioning?

jcrespo claimed this task.

For now.

We just had a spike in temporary table creation, which caused a service disruption for all users.

Do you think it is still worth compressing the InnoDB tables (the big wikis) that are still uncompressed?

37G of error log in 4 days is quite a lot (we are logging warnings):

[root@labsdb1001 09:01 /srv/sqldata]
# ls -lh labsdb1001.err
-rw-r----- 1 mysql mysql 37G Apr 17 09:01 labsdb1001.err

[root@labsdb1001 09:01 /srv/sqldata]
# head -n2 labsdb1001.err

170413 23:53:31 [Warning] Event Scheduler: [root@208.80.154.151][ops.wmf_labs_sleepers_txn] Data truncated for column 'STATE' at row 1

[root@labsdb1001 09:01 /srv/sqldata]
# tail -f -n1 labsdb1001.err
170417  9:01:05 [Warning] Event Scheduler: [root@208.80.154.151][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 1
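Both lines shown are Event Scheduler warnings from ops.* events. To confirm which events dominate the 37G, a sketch can stream the log line by line (so the file never has to fit in memory) and count warnings per event and message. The pattern is written against the two sample lines above and may miss other warning formats.

```python
"""Count Event Scheduler warnings per (event, message) in a MariaDB error log."""
import re
from collections import Counter

EVENT_WARNING_RE = re.compile(
    r"\[Warning\] Event Scheduler: \[[^\]]*\]\[(?P<event>[^\]]+)\] (?P<msg>.*)$"
)


def count_event_warnings(lines):
    """Tally matching warning lines; non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        m = EVENT_WARNING_RE.search(line)
        if m:
            counts[(m["event"], m["msg"].strip())] += 1
    return counts


def count_in_file(path):
    """Stream a (possibly huge) log file instead of reading it all at once."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return count_event_warnings(fh)


if __name__ == "__main__":
    import sys
    for (event, msg), n in count_in_file(sys.argv[1]).most_common(10):
        print(f"{n:>12} {event}: {msg}")
```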

Maybe we can move it to /, as there are 330G available there. That would permanently free 37-40G on /srv on this host, if we want to keep logging all those warnings.
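Moving the log only buys time if it keeps growing. Using the first and last timestamps shown above (which span about 3.4 days rather than a full 4) and the 37G size, a rough estimate of the growth rate and of how long 330G on / would last can be sketched as follows. It assumes the file was not rotated or truncated in between and that growth stays linear; the function names are hypothetical.

```python
"""Estimate error-log growth and days until a target filesystem would fill."""
from datetime import datetime


def parse_errlog_ts(stamp):
    """Parse error-log stamps like '170417  9:01:05' (YYMMDD H:MM:SS)."""
    return datetime.strptime(" ".join(stamp.split()), "%y%m%d %H:%M:%S")


def growth(first, last, size_gib, free_gib):
    """Return (elapsed days, GiB per day, days until free_gib is used up)."""
    elapsed = parse_errlog_ts(last) - parse_errlog_ts(first)
    elapsed_days = elapsed.total_seconds() / 86400
    rate = size_gib / elapsed_days
    return elapsed_days, rate, free_gib / rate


if __name__ == "__main__":
    days, rate, days_left = growth("170413 23:53:31", "170417  9:01:05", 37, 330)
    print(f"{days:.2f} days, {rate:.1f} GiB/day, ~{days_left:.0f} days to fill 330G")
```

At roughly 11 GiB/day, 330G would last about a month, so rotation or fixing the warnings would still be needed.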

We added 1 extra terabyte by deleting /srvuserdata on both hosts. This will likely hurt performance, but at least they can now receive schema changes.