
Trove for cluebotng-review?
Closed, ResolvedPublic

Description

The database on cluebotng-review is around 9GB (as reported by schema info) and will only grow. It also streams a large amount of data back from the database when training runs are being executed, which is load on tools-db.

In the spirit of T291782, this is likely a good candidate to be migrated onto Trove.

If you agree with this, then this is a request for the applicable quota/instance, with a migration to be done at some point in the future.

Event Timeline

Restricted Application added a subscriber: Aklapper.

@RichSmith thanks for suggesting this! 9GB is not a lot compared to the really "big ones", but I think it's a good idea to move it, especially if you routinely stream large amounts of data from it.

Do you think that 20GB could be enough as an initial quota, to fit the current data and some future increase? We can modify the disk size later, but it will be up to you to check the available space and request an increase before the space runs out. Unfortunately we don't have an automated monitoring yet, so if the disk gets completely full the db will just crash.

fnegri triaged this task as Medium priority.

> @RichSmith thanks for suggesting this! 9GB is not a lot compared to the really "big ones", but I think it's a good idea to move it, especially if you routinely stream large amounts of data from it.

I can't take credit, I just subscribed as it's a ClueBot issue... the thanks go to @DamianZaremba

But, I'll give my 2 UK pence... 20GB should be fine

The disk usage is now reported as lower because I split one table into two (re-creating them); the files were likely optimised in the process, since we haven't updated the binary columns.

We are reading around 5GB of data per day though (based on SUM(LENGTH(...)) and some assumptions about other columns), so it might still be nice to give this its own instance.

Regarding only the large tables, per edit we have the following (these are stored in binary fields, so I'm guessing LENGTH is more accurate than CHAR_LENGTH):

MariaDB [s54862__reviewer]> select avg(LENGTH(text)) from cbng_reviewer_currentrevision;
+-------------------+
| avg(LENGTH(text)) |
+-------------------+
|        21176.7700 |
+-------------------+
1 row in set (43.205 sec)

MariaDB [s54862__reviewer]> select avg(LENGTH(text)) from cbng_reviewer_previousrevision;
+-------------------+
| avg(LENGTH(text)) |
+-------------------+
|        22282.2879 |
+-------------------+
1 row in set (11.564 sec)

Each day there is 1 sampled + N reported edits added.

Reports are a bit more random, but for the sake of lies, damn lies, and statistics...

MariaDB [s52585__cb]> select avg(s.cnt) from (select count(*) as cnt, date(timestamp) as date from reports where status = 2 group by date) as s;
+------------+
| avg(s.cnt) |
+------------+
|    39.0000 |
+------------+
1 row in set (0.002 sec)

So (21176.7700 + 22282.2879) * 39 ≈ let's call it 1.7MB per day, plus some MySQL overhead.
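Spelling out that back-of-the-envelope estimate (a quick sketch using the averages from the queries above; MySQL row/page overhead and any growth in edit rate are ignored, so treat the projection as optimistic):

```python
# Average revision text sizes (bytes), from the AVG(LENGTH(text)) queries above.
current_rev = 21176.7700   # cbng_reviewer_currentrevision
previous_rev = 22282.2879  # cbng_reviewer_previousrevision
edits_per_day = 39         # average reviewed reports per day, from the reports query

daily_bytes = (current_rev + previous_rev) * edits_per_day
print(f"{daily_bytes / 1e6:.2f} MB/day")  # ~1.69 MB/day, i.e. "call it 1.7MB"

# Rough headroom projection against a 20GB quota, starting from the
# ~2GB currently reported by information_schema (figure quoted below).
quota_bytes = 20e9
used_bytes = 2019622912
years = (quota_bytes - used_bytes) / daily_bytes / 365
print(f"~{years:.0f} years until the quota fills at this rate")
```

At this rate the 20GB quota would last decades, which is consistent with "good for a few years" even after allowing generous overhead.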

Currently we are using (according to info schema):

MariaDB [s54862__reviewer]> select table_schema, SUM(data_length + index_length) as size from information_schema.tables where table_schema = 's54862__reviewer' group by table_schema;
+------------------+------------+
| table_schema     | size       |
+------------------+------------+
| s54862__reviewer | 2019622912 |
+------------------+------------+
1 row in set (0.005 sec)
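For reference, that information_schema byte count converts to GiB as follows (a quick sketch; the 9GB in the task description was the pre-split figure, and information_schema sizes exclude some on-disk overhead, which is why the du output below differs slightly):

```python
# Convert the data_length + index_length figure reported above to GiB.
size_bytes = 2019622912  # s54862__reviewer, per information_schema.tables
size_gib = size_bytes / 2**30
print(f"{size_gib:.2f} GiB")  # ~1.88 GiB
```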

We should be good for a few years with 20GB (given we've been using tools for 12 years, this statement may well come back to bite me).

Note: Reading above refers primarily to cbng-trainer which is running jobs under another account, so most of that data stays within the WMCS ecosystem.

> I can't take credit, I just subscribed as it's a ClueBot issue... the thanks go to @DamianZaremba

@RichSmith My fault, I don't know how I failed to read the Authored By field :)

@DamianZaremba thank you for the detailed analysis, I double checked at the filesystem level:

root@tools-db-4:/srv/labsdb/data# du -hs s54862*
261M    s54862__review
2.1G    s54862__reviewer

I will create a project with 20GB Trove quota and give you access.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-08-12T17:05:37Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for trove-only project cluebotng-review in eqiad1 (T401347)

@DamianZaremba you should have access to the new Cloud VPS project cluebotng-review, where you should be able to create new database instances up to a total of 20GB of disk space.

More info at https://wikitech.wikimedia.org/wiki/Help:Trove_database_user_guide

Please reopen this or contact us if anything is not working.

Note for admins: I added a new section Creating Trove-only projects to the Trove admin page, as this process was not documented. I also linked to that page from the Clinic Duties page, and modified links in Help:Trove database user guide to point to Cloud-VPS (Project-requests) rather than Toolforge (Quota-requests).

FYI I migrated the schema over to the trove instance yesterday.

The only issue I found was that the "root access" described doesn't work, because I don't have a VM in the same project; I don't need that right now, but it might be useful down the line if the instance breaks for some reason.

@DamianZaremba good point about root access, if you find yourself needing it, please open a Phab task about it and we'll find a way to make it work.