
labsdb1001 and labsdb1003 short on available space
Closed, ResolvedPublic

Description

/dev/mapper/tank-data              xfs       3.0T  2.7T  335G  90% /srv
/dev/mapper/userdata_1001-userdata xfs       3.3T  1.1T  2.2T  33% /srvuserdata

A quick (but not that big) space saver could be /srv/sqldata/s2.sql.gz (42G, from Dec 17 2014), which could be moved or deleted.

Of the user/group/tool DBs, these are the biggest:

272G	s51187__xtools_tmp
64G	u3532__
53G	p50380g50816__pop_stats
20G	u2815__old_p
13G	s51127__dewiki_lists
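The per-database totals above can be reproduced with `du` against the datadir; a minimal sketch, assuming the datadir is `/srv/sqldata` as on labsdb1001 and that user databases match the `*__*` naming pattern:

```shell
# Sketch: list the five largest user databases by on-disk size.
# Assumes user DB directories live directly under the datadir
# and contain "__" in their names (s51187__..., u3532__, ...).
list_big_dbs() {
  du -sh "$1"/*__* 2>/dev/null | sort -rh | head -n 5
}

list_big_dbs /srv/sqldata
```

`sort -rh` (GNU sort's human-numeric mode) keeps the `du -h` suffixes comparable, so 272G sorts above 64G.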

Event Timeline

Volans created this task.Apr 12 2016, 10:11 AM
Restricted Application added a subscriber: Aklapper.Apr 12 2016, 10:11 AM

Acknowledged (non-sticky) the warning on icinga

RobH added a subscriber: RobH.Apr 21 2016, 6:05 PM

So this just paged again on icinga/sms/irc:

PROBLEM - MariaDB disk space on labsdb1001 is CRITICAL: DISK CRITICAL - free space: /srv 179614 MB (5% inode=99%)

Just FYI.

RobH triaged this task as High priority.Apr 21 2016, 6:05 PM

Woah. Why is xtools on the list? It shouldn't be using that much DB space. It should be using almost nothing.

Where does s51187__xtools_tmp live? I can't seem to use it with sql local:

MariaDB [(none)]> USE s51187__xtools_tmp;
ERROR 1049 (42000): Unknown database 's51187__xtools_tmp'
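Since `sql local` connects to only one replica, one quick way to find where the schema actually lives would be to ask each labsdb host directly. This is a hypothetical sketch (hostnames as used in this task; it assumes working credentials on each host):

```shell
# Hypothetical check: which replica actually hosts s51187__xtools_tmp?
for host in labsdb1001 labsdb1003; do
  printf '%s: ' "$host"
  mysql -h "$host" -N -e \
    "SELECT COUNT(*) FROM information_schema.schemata
     WHERE schema_name = 's51187__xtools_tmp';"
done
```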

@Cyberpower678 any insight on this? I think it's being written to by these continuous enwiki_update scripts, which apparently are for the Wikihistory tool. Surely we don't need all 272G of data and can probably drop a crap ton of rows. With the normal xtools-articleinfo tool back up and running, I don't think many people are using Wikihistory anyway.

Where does s51187__xtools_tmp live?

It is on the enwiki/s1 host.

RobH removed a subscriber: RobH.Apr 21 2016, 6:52 PM

It actually ran out of space during the spike:

Thu Apr 21 18:01:55 2016 TokuFT file system space is really low and access is restricted
160421 18:01:55 [ERROR] Master 's5': Slave SQL: Could not execute Write_rows_v1 event on table wikidatawiki.wb_entity_per_page; Disk full (wb_entity_per_page); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s5-bin.001938, end_log_pos 835811123, Gtid 0-171970704-3290249448, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's5': Slave: Disk full (wb_entity_per_page); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's5': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's5-bin.001938' position 835810855
160421 18:01:55 [ERROR] Master 's7': Slave SQL: Could not execute Write_rows_v1 event on table arwiki.pagelinks; Disk full (pagelinks); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s7-bin.001895, end_log_pos 848293313, Gtid 0-171970590-1811187008, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's7': Slave: Disk full (pagelinks); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's7-bin.001895' position 848293100
160421 18:01:55 [ERROR] Master 's4': Slave SQL: Could not execute Write_rows_v1 event on table commonswiki.globalimagelinks; Disk full (globalimagelinks); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s4-bin.001622, end_log_pos 80104532, Gtid 0-171970591-1795256580, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's4': Slave: Disk full (globalimagelinks); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's4': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's4-bin.001622' position 80104227
160421 18:01:55 [ERROR] Master 's2': Slave SQL: Could not execute Write_rows_v1 event on table itwiki.revision; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s2-bin.001902, end_log_pos 450115880, Gtid 0-171970567-2792438909, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's2': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [Warning] Master 's2': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's2': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's2-bin.001902' position 450115605
160421 18:01:55 [ERROR] Master 's3': Slave SQL: Could not execute Write_rows_v1 event on table shwiki.revision; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s3-bin.001750, end_log_pos 606067483, Gtid 0-171966669-2474745814, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's3': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [Warning] Master 's3': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's3': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's3-bin.001750' position 606067161
160421 18:01:56 [ERROR] Master 's6': Slave SQL: Could not execute Write_rows_v1 event on table frwiki.revision; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s6-bin.001371, end_log_pos 864552667, Gtid 0-171970705-1887782537, Internal MariaDB error code: 1021
160421 18:01:56 [Warning] Master 's6': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:56 [Warning] Master 's6': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:56 [ERROR] Master 's6': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's6-bin.001371' position 864552423
160421 18:02:01 [Warning] Event Scheduler: [root@208.80.154.151][ops.wmf_labs_sleepers_txn] Data truncated for column 'STATE' at row 1
160421 18:02:06 [ERROR] Master 's1': Slave SQL: Could not execute Write_rows_v1 event on table enwiki.abuse_filter_log; Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s1-bin.002098, end_log_pos 27645379, Gtid 0-171974683-3815619482, Internal MariaDB error code: 1021
160421 18:02:06 [Warning] Master 's1': Slave: Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:02:06 [Warning] Master 's1': Slave: Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:02:06 [ERROR] Master 's1': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's1-bin.002098' position 27645084
Thu Apr 21 18:20:40 2016 TokuFT file system space is low
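Once space is freed, each multi-source connection has to be restarted individually (MariaDB names them 's1' through 's7', so `START SLAVE 's5';`, or `START ALL SLAVES;` for everything at once). The restart coordinates are all in the log above; a small sketch (assuming GNU sed) to pull them out per connection:

```shell
# Sketch: pull per-connection restart coordinates out of a MariaDB
# multi-source error log, from lines of the form:
#   [ERROR] Master 'sN': ... We stopped at log 'sN-bin.NNNNNN' position NNN
extract_restart_positions() {
  sed -n "s/.*Master '\(s[0-9]*\)'.*We stopped at log '\([^']*\)' position \([0-9]*\).*/\1 \2 \3/p" "$1"
}
```

For the s5 line above this prints `s5 s5-bin.001938 835810855`.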
jcrespo moved this task from Triage to In progress on the DBA board.Apr 22 2016, 4:26 PM
jcrespo moved this task from In progress to Next on the DBA board.Apr 25 2016, 10:49 AM
jcrespo moved this task from Next to In progress on the DBA board.May 4 2016, 9:50 AM
jcrespo renamed this task from labsdb1001 short on available space to labsdb1001 and labsdb1003 short on available space.Jun 16 2016, 8:58 AM

labsdb1003 is now also at 10% free space.

scfc moved this task from Triage to Backlog on the Toolforge board.Dec 5 2016, 4:14 AM

14% free now; maybe that is enough until decommission?

jcrespo closed this task as Resolved.Mar 28 2017, 4:51 PM
jcrespo claimed this task.

For now.

jcrespo reopened this task as Open.Apr 8 2017, 11:21 AM

We just had a spike on temporary tables being created, causing service disruption to all users.

Do you think it is still worth compressing whatever InnoDB tables (the big wikis) are not yet compressed?
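For reference, per-table InnoDB compression is a long-running ALTER on tables this size; a hypothetical example (the table name is illustrative, and it assumes innodb_file_per_table with the Barracuda file format is enabled):

```shell
# Hypothetical: compress one large, currently uncompressed InnoDB table.
# This rebuilds the table, so expect it to run for a long time.
mysql -e "ALTER TABLE enwiki.pagelinks ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"
```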

Paladox added a subscriber: Paladox.Apr 8 2017, 1:25 PM

37G of logfile in 4 days is quite a bit (we are logging warnings):

[root@labsdb1001 09:01 /srv/sqldata]
# ls -lh labsdb1001.err
-rw-r----- 1 mysql mysql 37G Apr 17 09:01 labsdb1001.err

[root@labsdb1001 09:01 /srv/sqldata]
# head -n2 labsdb1001.err

170413 23:53:31 [Warning] Event Scheduler: [root@208.80.154.151][ops.wmf_labs_sleepers_txn] Data truncated for column 'STATE' at row 1

[root@labsdb1001 09:01 /srv/sqldata]
# tail -f -n1 labsdb1001.err
170417  9:01:05 [Warning] Event Scheduler: [root@208.80.154.151][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 1

Maybe we can try to move it to /, as there are 330G available there. And we could get 37-40G back permanently on this host if we want to keep logging all of that.
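If the warnings are worth keeping, the 37G file can also be moved aside without restarting mysqld: rename it, then have the server reopen its logs. A minimal sketch (the flush step is what actually releases the old file's space, since mysqld keeps writing to the old inode until then):

```shell
# Sketch: rotate a huge MariaDB error log in place.
# mysqld holds the old file open until its logs are flushed,
# so disk space is only freed after the flush-logs step.
rotate_error_log() {
  mv "$1" "$1.$(date +%Y%m%d)"
  : > "$1"   # recreate an empty file at the same path
}

# Intended usage on the host (commented out here):
# rotate_error_log /srv/sqldata/labsdb1001.err && mysqladmin flush-logs
```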

jcrespo closed this task as Resolved.Apr 27 2017, 9:42 AM

We added 1 extra terabyte by deleting /srvuserdata on both hosts; this will likely impact performance negatively, but at least they can now receive schema changes.