
Delete old data and/or stop logging to stewardbots' SULWatcher SQL DB
Closed, ResolvedPublic

Description

SULWatcher uses a SQL database to store the usernames that trigger one of the configured regexes. However, the logging table is becoming very large, with more than 170K entries:

MariaDB [s51541_sulwatcher]> SELECT COUNT(*) FROM logging;
+----------+
| count(*) |
+----------+
|   174770 |
+----------+
1 row in set (0.55 sec)

Since the SULWatcher tool no longer has an interface for authorized users to see which usernames have triggered one of the regexes, we should perhaps stop logging those usernames and delete everything in the logging table, using the SQL database only to store the regexes.
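A minimal sketch of that cleanup, assuming the regexes live in a separate table and only the logging table needs emptying:

-- Sketch only: empty the logging table while leaving the regex
-- configuration table(s) untouched. TRUNCATE is faster than DELETE
-- for a full wipe and also resets the auto_increment counter.
TRUNCATE TABLE logging;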

Event Timeline

MarcoAurelio renamed this task from Delete old data and or stop logging to stewardbots' SULWatcher SQL DB to Delete old data and/or stop logging to stewardbots' SULWatcher SQL DB. Nov 19 2016, 3:41 PM
MarcoAurelio triaged this task as Medium priority.
MarcoAurelio removed a subscriber: Aklapper.

If no one objects, I'll clean the logging table in the coming days.

MarcoAurelio moved this task from General to Restricted Project Column on the stewardbots board. Jan 28 2017, 1:04 PM
MarcoAurelio edited projects, added Unknown Object (Project); removed stewardbots.
MarcoAurelio moved this task from Restricted Project Column to General on the stewardbots board. Jan 28 2017, 1:09 PM
MarcoAurelio edited projects, added stewardbots; removed Unknown Object (Project).

@jcrespo Any safe way to just clean data older than one month?

I'm going to re-ping @jcrespo and @Marostegui because I'm unsure about how to do this properly and without danger. Issuing DELETE or DROP commands is dangerous. If you could give me ideas on how to do this, I'd appreciate it. Thanks.

From what I can see, this is a user database (meaning it is not a replicated MediaWiki table, right?). If that is the case, there is very little risk. If you want to be sure you do not overload the server, using pt-archiver is a good way to split the work into smaller transactions, even if it takes more time.

Some context:

MariaDB [s51541_sulwatcher]> describe logging;
+-------------+------------------+------+-----+----------------+----------------+
| Field       | Type             | Null | Key | Default        | Extra          |
+-------------+------------------+------+-----+----------------+----------------+
| l_id        | int(10) unsigned | NO   | PRI | NULL           | auto_increment |
| l_regex     | varchar(255)     | NO   | MUL | NULL           |                |
| l_user      | varchar(255)     | NO   |     | NULL           |                |
| l_project   | varchar(255)     | NO   | MUL | NULL           |                |
| l_timestamp | binary(14)       | NO   | MUL | 19700101000000 |                |
+-------------+------------------+------+-----+----------------+----------------+
5 rows in set (0.00 sec)
MariaDB [s51541_sulwatcher]> SELECT COUNT(*) FROM logging;
+----------+
| COUNT(*) |
+----------+
|   216107 |
+----------+
1 row in set (0.44 sec)
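pt-archiver essentially automates a chunked loop like the sketch below; the cutoff value and batch size are illustrative, not prescribed:

-- Sketch only: purge in small batches so each transaction stays short.
-- Re-run (or loop) until zero rows are affected.
DELETE FROM logging
WHERE l_timestamp < '20170101000000'
LIMIT 1000;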

In addition to Jaime's comments, when issuing drops, it is generally a good idea to do them with IF EXISTS:

DROP TABLE IF EXISTS xx;
DROP DATABASE IF EXISTS xx;

This avoids replication issues (if there were replication) with tables or databases under a master-slave model that might not be in sync for whatever reason.

MarcoAurelio claimed this task.

I did this today:

MariaDB [s51541_sulwatcher]> select count(*) from logging where l_timestamp<20171201000000;
+----------+
| count(*) |
+----------+
|   233730 |
+----------+
1 row in set (0.15 sec)

MariaDB [s51541_sulwatcher]> delete from logging where l_timestamp<20171201000000;
Query OK, 233730 rows affected (34.70 sec)

MariaDB [s51541_sulwatcher]> select count(*) from logging;
+----------+
| count(*) |
+----------+
|      487 |
+----------+
1 row in set (0.00 sec)

So I kept only the records from 20171201000000 onwards, deleting old data that is no longer useful.

I think this part of the task, namely deleting the old data, can be closed.

For future tasks we should investigate whether automatic cleanup of old data is possible, and whether watching only the loginwiki feed would avoid duplicating data.
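One possible auto-cleanup approach, assuming the event scheduler is enabled on the host (which may not be the case on Toolforge user databases), is a scheduled purge; the event name and the 30-day window below are illustrative:

-- Sketch only: daily purge of rows older than 30 days. l_timestamp is a
-- binary(14) MediaWiki-style timestamp, so NOW() is formatted to match.
CREATE EVENT IF NOT EXISTS purge_old_sulwatcher_logging
ON SCHEDULE EVERY 1 DAY
DO
  DELETE FROM logging
  WHERE l_timestamp < DATE_FORMAT(NOW() - INTERVAL 30 DAY, '%Y%m%d%H%i%s');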

Mentioned in SAL (#wikimedia-cloud) [2017-12-04T10:24:04Z] <TabbyCat> Fixed T151113 by cleaning data older than 20171201000000.