
labsdb1001 and labsdb1003 short on available space
Closed, ResolvedPublic

Description

/dev/mapper/tank-data              xfs       3.0T  2.7T  335G  90% /srv
/dev/mapper/userdata_1001-userdata xfs       3.3T  1.1T  2.2T  33% /srvuserdata

A quick (but not that big) space saver could be /srv/sqldata/s2.sql.gz (42G, from Dec 17 2014), which could be moved or deleted.

Of the user/group/tool DBs, these are the biggest:

272G	s51187__xtools_tmp
64G	u3532__
53G	p50380g50816__pop_stats
20G	u2815__old_p
13G	s51127__dewiki_lists
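The per-database totals above can be reproduced with `du` against the datadir; a minimal sketch, assuming the datadir is `/srv/sqldata` as on labsdb1001 and that user databases match the `*__*` naming pattern:

```shell
# Sketch: list the five largest user databases by on-disk size.
# Assumes user DB directories live directly under the datadir
# and contain "__" in their names (s51187__..., u3532__, ...).
list_big_dbs() {
  du -sh "$1"/*__* 2>/dev/null | sort -rh | head -n 5
}

list_big_dbs /srv/sqldata
```

`sort -rh` (GNU sort's human-numeric mode) keeps the `du -h` suffixes comparable, so 272G sorts above 64G.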

Event Timeline

Volans created this task.Apr 12 2016, 10:11 AM
Restricted Application added a subscriber: Aklapper.Apr 12 2016, 10:11 AM

Acknowledged (non-sticky) the warning on icinga

RobH added a subscriber: RobH.Apr 21 2016, 6:05 PM

So this just paged again on icinga/sms/irc:

PROBLEM - MariaDB disk space on labsdb1001 is CRITICAL: DISK CRITICAL - free space: /srv 179614 MB (5% inode=99%)

Just FYI.

RobH triaged this task as High priority.Apr 21 2016, 6:05 PM

Woah. Why is xtools on the list? It shouldn't be using that much DB space. It should be using almost nothing.

Where does s51187__xtools_tmp live? I can't seem to use it with sql local:

MariaDB [(none)]> USE s51187__xtools_tmp;
ERROR 1049 (42000): Unknown database 's51187__xtools_tmp'
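Since `sql local` connects to only one replica, one quick way to find where the schema actually lives would be to ask each labsdb host directly. This is a hypothetical sketch (hostnames as used in this task; it assumes working credentials on each host):

```shell
# Hypothetical check: which replica actually hosts s51187__xtools_tmp?
for host in labsdb1001 labsdb1003; do
  printf '%s: ' "$host"
  mysql -h "$host" -N -e \
    "SELECT COUNT(*) FROM information_schema.schemata
     WHERE schema_name = 's51187__xtools_tmp';"
done
```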

@Cyberpower678 any insight on this? I think it's being written to by these continuous enwiki_update scripts, which apparently are for the Wikihistory tool. Surely we don't need all 272G of data and can probably drop a crap ton of rows. With the normal xtools-articleinfo tool back up and running, I don't think many people are using Wikihistory anyway.

Where does s51187__xtools_tmp live?

It is on the enwiki/s1 host.

RobH removed a subscriber: RobH.Apr 21 2016, 6:52 PM

It actually ran out of space during the spike:

Thu Apr 21 18:01:55 2016 TokuFT file system space is really low and access is restricted
160421 18:01:55 [ERROR] Master 's5': Slave SQL: Could not execute Write_rows_v1 event on table wikidatawiki.wb_entity_per_page; Disk full (wb_entity_per_page); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s5-bin.001938, end_log_pos 835811123, Gtid 0-171970704-3290249448, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's5': Slave: Disk full (wb_entity_per_page); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's5': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's5-bin.001938' position 835810855
160421 18:01:55 [ERROR] Master 's7': Slave SQL: Could not execute Write_rows_v1 event on table arwiki.pagelinks; Disk full (pagelinks); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s7-bin.001895, end_log_pos 848293313, Gtid 0-171970590-1811187008, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's7': Slave: Disk full (pagelinks); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's7-bin.001895' position 848293100
160421 18:01:55 [ERROR] Master 's4': Slave SQL: Could not execute Write_rows_v1 event on table commonswiki.globalimagelinks; Disk full (globalimagelinks); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s4-bin.001622, end_log_pos 80104532, Gtid 0-171970591-1795256580, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's4': Slave: Disk full (globalimagelinks); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's4': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's4-bin.001622' position 80104227
160421 18:01:55 [ERROR] Master 's2': Slave SQL: Could not execute Write_rows_v1 event on table itwiki.revision; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s2-bin.001902, end_log_pos 450115880, Gtid 0-171970567-2792438909, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's2': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [Warning] Master 's2': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's2': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's2-bin.001902' position 450115605
160421 18:01:55 [ERROR] Master 's3': Slave SQL: Could not execute Write_rows_v1 event on table shwiki.revision; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s3-bin.001750, end_log_pos 606067483, Gtid 0-171966669-2474745814, Internal MariaDB error code: 1021
160421 18:01:55 [Warning] Master 's3': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [Warning] Master 's3': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:55 [ERROR] Master 's3': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's3-bin.001750' position 606067161
160421 18:01:56 [ERROR] Master 's6': Slave SQL: Could not execute Write_rows_v1 event on table frwiki.revision; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s6-bin.001371, end_log_pos 864552667, Gtid 0-171970705-1887782537, Internal MariaDB error code: 1021
160421 18:01:56 [Warning] Master 's6': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:56 [Warning] Master 's6': Slave: Disk full (revision); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:01:56 [ERROR] Master 's6': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's6-bin.001371' position 864552423
160421 18:02:01 [Warning] Event Scheduler: [root@208.80.154.151][ops.wmf_labs_sleepers_txn] Data truncated for column 'STATE' at row 1
160421 18:02:06 [ERROR] Master 's1': Slave SQL: Could not execute Write_rows_v1 event on table enwiki.abuse_filter_log; Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full"), Error_code: 1021; handler error No Error!; the event's master log s1-bin.002098, end_log_pos 27645379, Gtid 0-171974683-3815619482, Internal MariaDB error code: 1021
160421 18:02:06 [Warning] Master 's1': Slave: Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:02:06 [Warning] Master 's1': Slave: Disk full (abuse_filter_log); waiting for someone to free some space... (errno: 189 "Disk full") Error_code: 1021
160421 18:02:06 [ERROR] Master 's1': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 's1-bin.002098' position 27645084
Thu Apr 21 18:20:40 2016 TokuFT file system space is low
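Once space is freed, each multi-source connection has to be restarted individually (MariaDB names them 's1' through 's7', so `START SLAVE 's5';`, or `START ALL SLAVES;` for everything at once). The restart coordinates are all in the log above; a small sketch (assuming GNU sed) to pull them out per connection:

```shell
# Sketch: pull per-connection restart coordinates out of a MariaDB
# multi-source error log, from lines of the form:
#   [ERROR] Master 'sN': ... We stopped at log 'sN-bin.NNNNNN' position NNN
extract_restart_positions() {
  sed -n "s/.*Master '\(s[0-9]*\)'.*We stopped at log '\([^']*\)' position \([0-9]*\).*/\1 \2 \3/p" "$1"
}
```

For the s5 line above this prints `s5 s5-bin.001938 835810855`.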
jcrespo moved this task from Triage to In progress on the DBA board.Apr 22 2016, 4:26 PM
jcrespo moved this task from In progress to Next on the DBA board.Apr 25 2016, 10:49 AM
jcrespo moved this task from Next to In progress on the DBA board.May 4 2016, 9:50 AM
jcrespo renamed this task from labsdb1001 short on available space to labsdb1001 and labsdb1003 short on available space.Jun 16 2016, 8:58 AM

labsdb1003 is now also at 10% free space.

scfc moved this task from Triage to Backlog on the Toolforge board.Dec 5 2016, 4:14 AM

14% free now; maybe that is enough until decommission?

jcrespo closed this task as Resolved.Mar 28 2017, 4:51 PM
jcrespo claimed this task.

For now.

jcrespo reopened this task as Open.Apr 8 2017, 11:21 AM

We just had a spike on temporary tables being created, causing service disruption to all users.

Do you think it is still worth compressing whatever InnoDB tables (the big wikis) are not yet compressed?
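For reference, per-table InnoDB compression is a long-running ALTER on tables this size; a hypothetical example (the table name is illustrative, and it assumes innodb_file_per_table with the Barracuda file format is enabled):

```shell
# Hypothetical: compress one large, currently uncompressed InnoDB table.
# This rebuilds the table, so expect it to run for a long time.
mysql -e "ALTER TABLE enwiki.pagelinks ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"
```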

Paladox added a subscriber: Paladox.Apr 8 2017, 1:25 PM

37G of logfile in 4 days is quite a bit (we are logging warnings):

[root@labsdb1001 09:01 /srv/sqldata]
# ls -lh labsdb1001.err
-rw-r----- 1 mysql mysql 37G Apr 17 09:01 labsdb1001.err

[root@labsdb1001 09:01 /srv/sqldata]
# head -n2 labsdb1001.err

170413 23:53:31 [Warning] Event Scheduler: [root@208.80.154.151][ops.wmf_labs_sleepers_txn] Data truncated for column 'STATE' at row 1

[root@labsdb1001 09:01 /srv/sqldata]
# tail -f -n1 labsdb1001.err
170417  9:01:05 [Warning] Event Scheduler: [root@208.80.154.151][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 1

Maybe we can try to move it to /, as there are 330G available there. And we could get 37-40G back permanently on this host if we want to keep logging all of that.
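If the warnings are worth keeping, the 37G file can also be moved aside without restarting mysqld: rename it, then have the server reopen its logs. A minimal sketch (the flush step is what actually releases the old file's space, since mysqld keeps writing to the old inode until then):

```shell
# Sketch: rotate a huge MariaDB error log in place.
# mysqld holds the old file open until its logs are flushed,
# so disk space is only freed after the flush-logs step.
rotate_error_log() {
  mv "$1" "$1.$(date +%Y%m%d)"
  : > "$1"   # recreate an empty file at the same path
}

# Intended usage on the host (commented out here):
# rotate_error_log /srv/sqldata/labsdb1001.err && mysqladmin flush-logs
```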

jcrespo closed this task as Resolved.Apr 27 2017, 9:42 AM

We added 1 extra terabyte by deleting /srvuserdata on both hosts; this will likely impact performance negatively, but at least they can now receive schema changes.