
OTRS database is "too large"
Closed, Duplicate · Public

Description

When looking at the backup sizes (logical backup, compressed) I get:

root@dbstore1001:/srv/backups$ ls -lhaSr
...
-rw-r----- 1 root  root    34G Jun 14 03:16 wikidatawiki-201606131835.sql.gz
-rw-r----- 1 root  root    34G Jun 15 10:54 wikidatawiki-201606150205.sql.gz
-rw-r----- 1 root  root    38G Jun 15 03:59 dewiki-201606150205.sql.gz
-rw-r----- 1 root  root    39G Jun 22 03:55 dewiki-201606220205.sql.gz
-rw-r----- 1 root  root    39G Jun 29 03:45 dewiki-201606290205.sql.gz
-rw-r----- 1 root  root    62G Jun 15 05:43 commonswiki-201606150205.sql.gz
-rw-r----- 1 root  root    62G Jun 22 05:43 commonswiki-201606220205.sql.gz
-rw-r----- 1 root  root    62G Jun 29 05:27 commonswiki-201606290205.sql.gz
-rw-r----- 1 root  root    83G Jun 15 06:50 enwiki-201606150205.sql.gz
-rw-r----- 1 root  root    83G Jun 22 06:47 enwiki-201606220205.sql.gz
-rw-r----- 1 root  root    83G Jun 29 06:31 enwiki-201606290205.sql.gz
-rw-r--r-- 1 root  root   257G Jun 15 04:24 m2-otrs-20160615010002.sql.gz
-rw-r--r-- 1 root  root   259G Jun 22 04:46 m2-otrs-20160622010001.sql.gz
-rw-r--r-- 1 root  root   259G Jun 29 04:41 m2-otrs-20160629010002.sql.gz

OTRS is 259 GB after compression! In comparison, a full enwiki copy is only 83 GB! And it can grow by 2 GB in a week.

After investigating, it seems the tables chiefly responsible for this are:

root@db1020:/srv/sqldata/otrs$ ls -lhaSr
...
-rw-rw----  1 mysql mysql 192M Jun 29 03:11 dynamic_field_value.ibd
-rw-rw----  1 mysql mysql 248M Jun 29 07:50 article_flag.ibd
-rw-rw----  1 mysql mysql 980M Jun 29 07:50 ticket.ibd
-rw-rw----  1 mysql mysql 4.3G Jun 29 07:50 article_search.ibd
-rw-rw----  1 mysql mysql 5.4G Jun 29 07:50 ticket_history.ibd
-rw-rw----  1 mysql mysql 9.2G Jun 29 07:50 article.ibd
-rw-rw----  1 mysql mysql 143G Jun 29 07:50 article_attachment.ibd
-rw-rw----  1 mysql mysql 151G Jun 29 07:50 article_plain.ibd

dbstore1001 is now at <5% available space, so this is causing real issues. Also, this host is only used for creating the tarballs; in the backup storage itself this may be wasting a lot of space as well.

Maybe a more efficient strategy can be devised for the OTRS structure and backups? In particular, storing attachments in the database seems like a bad idea, but if we try to change that we should be careful not to end up with millions of small files, which would make things worse.

The main problem, rather than space, is that recovering a single 150 GB table from a dump may take weeks, which would make the backups rather useless.
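To illustrate why: pulling just one table out of a logical dump still means decompressing and scanning the whole file. A minimal sketch, assuming a mysqldump-style dump with its standard section comments:

# Extract one table's section from the compressed dump (the trailing header
# of the next table would still need to be trimmed).
zcat m2-otrs-20160629010002.sql.gz \
  | sed -n '/^-- Table structure for table `article_plain`/,/^-- Table structure for table/p' \
  > article_plain.sql
# ...and the resulting ~150 GB of INSERTs still has to be replayed row by row.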

Event Timeline

Just noting here that OTRS started to include a script to move attachments from the database to the filesystem a couple of versions back, see https://otrs.github.io/doc/manual/admin/stable/en/html/performance-tuning.html#performance-tuning-otrs-storage (in case you want to go down that road).

@pajz Thank you for the tip. The OTRS owner already mentioned that as a possibility.

I am personally unsure if we will go that way; it will probably depend on how many of those files there are and how easy it is to handle them (we had issues with millions of files on other systems). There could be other solutions (maybe older files are not needed? Maybe partitioning?), so I just wanted to state the problem and investigate the best solution that is both efficient and easy to administrate.
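If we want a first idea of whether older data dominates (the "maybe older files are not needed?" angle), something like the following sketch could help; the create_time and content column names are assumptions based on the stock OTRS schema, and it does a full table scan, so it should only run on a replica:

# Attachment count and size per year (rough, LENGTH(content) forces a full scan)
mysql -h db1020 otrs -e "
    SELECT YEAR(create_time) AS yr,
           COUNT(*)          AS attachments,
           ROUND(SUM(LENGTH(content)) / POW(1024, 3), 1) AS size_gb
    FROM article_attachment
    GROUP BY yr
    ORDER BY yr;"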

I've been starting to look into storing articles/attachments in the filesystem. It is the recommended way of storing attachments in large setups (and/or with large attachment sizes). We don't control attachment size anyway, but we do have a large setup, as it seems; in fact, one that has been growing for quite a few years. While filesystem storage is the recommended way to go in cases like ours, I am still a bit skeptical and would like to investigate further before we take a decision. Things to look out for:

  • IOPS performance in general on mendelevium. mendelevium is currently on SATA disks, so that is probably a blocker, but it is actively being worked on in T138414.
  • FS hierarchy performance issues, as @jcrespo already pointed out. It seems the files are archived into year/month subfolders on the FS, which should avoid the subdirectory performance issues of having millions of files in a single directory. This needs some more investigation as to how the files are indexed and accessed.
  • Cost of backing up the FS hierarchy. In terms of disk space we should be better off, as we will rely on the differentials/incrementals of our backup system (Bacula), which will perform better than the current solution. However, listing all those millions of files will probably not be very performant (a rough file-count estimate is sketched after this comment).
  • Time cost of the migration. That is a one-off cost, but if it ends up lasting for weeks with the application unavailable, it is probably a no-go. It is, however, possible to have the application look into both file storage backends (DB and FS), so in theory it should be possible to run the migration in the background. Still investigating whether that is true.

So we have at least two bullets with a few unknowns to resolve before moving forward with the FS storage migration.
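As a first data point for the FS hierarchy and backup-listing concerns above, the row counts of the two big tables roughly equal the number of files the FS backend would create; a sketch, to be run against a replica:

# Rough number of files an FS-backed article store would hold
mysql -h db1020 otrs -e "
    SELECT (SELECT COUNT(*) FROM article_plain)      AS plain_bodies,
           (SELECT COUNT(*) FROM article_attachment) AS attachments;"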

Low priority because the main issue has been worked around (rOPUP3ae9c670916f10621771cc2fc8ca454bfa15d4a0), but the issues with the size (mainly, backup recovery) still exist, so let's keep this in mind for the future.

I'm guessing this got worse with the recent spam problem

Yearly state of otrs report (compressed):

12G     ./dump.m1.2018-03-07--13-22-38
329G    ./dump.m2.2018-03-07--13-47-42
29G     ./dump.m3.2018-03-07--21-28-09
11G     ./dump.m5.2018-03-07--17-52-13
99G     ./dump.s1.2018-03-07--11-38-30
88G     ./dump.s2.2018-03-07--07-45-47
97G     ./dump.s3.2018-03-06--20-10-05
89G     ./dump.s4.2018-03-06--23-22-49
47G     ./dump.s5.2018-03-07--09-40-46
62G     ./dump.s6.2018-03-07--10-22-47
76G     ./dump.s7.2018-03-07--03-49-55
73G     ./dump.s8.2018-03-07--18-06-22
17G     ./dump.x1.2018-03-06--19-00-01

otrs current compressed backup size:

417G	dump.m2.2020-04-14--04-34-33/

I'd like to list the benefits, most of which are already stated here, and add my two cents to the topic.

Switching storage from DB to FS will:
  • increase update speed (no altering of large tables)
  • increase database dump speed
  • enormously increase restore speed
  • make it easier to implement incremental backup plans

And just like in T187984, I am offering my expertise. I can list scenarios on how that can be achieved (there are a couple of ways) and what the pitfalls are.
I have done it multiple times in large environments and it's not a problem at all; OTRS offers all the needed scripts and settings.

I honestly don't see a reason why this shouldn't be tackled.
It's also a pretty good time to try it out, as you currently have a testing environment that can provide good benchmarks.

We have been using FS storage since the beginning, and it usually goes well without major problems. So far I remember only one problem, which was present in OTRS 4: attachments with bad encoding were sometimes saved using a name that could not be translated back to the original correctly.

I use XFS (which happily handles any number of directories or files), but other _modern_ filesystems may work well too.
Obviously you can contact me too, as eyazi mentioned, for know-how.

The future of this task is heavily dependent on the outcome of https://phabricator.wikimedia.org/T275294 and its follow up tasks.

Now that the OTRS migration to Znuny is complete, this task can potentially be revived. @Keegan: would you be able to help me find the right people on the Znuny side to talk to about where this work fits priority-wise?

I would be happy to also provide some help here; any work on decreasing the database backup size will translate into maintainability gains for me. While I have had some contact with blob/object storage recently for unrelated projects, the right person to own this on the infrastructure side would probably be whoever ends up maintaining MOSS: T279621

> Now that the OTRS migration to Znuny is complete, this task can potentially be revived. @Keegan: would you be able to help me find the right people on the Znuny side to talk to about where this work fits priority-wise?

Sure, there are a few options, with a caveat. Znuny, unlike OTRS AG, is only a couple of people. They formerly provided consulting services but are now shifting into managing the tech, so any formal/deep technical help we might need from them may be limited, aside from the emergency situations our support contract provides for.

That being said, there are three ways to work with Znuny outside of our emergency support contract:

  1. GitHub: for filing bugs, feature requests; you know, GitHub-y stuff.
  2. Discord: synchronous communication! I'm not in their Discord though, so I cannot vouch for its reliability as a communication resource.
  3. Community forum: moderately active, and may be a good place for asynchronous communication.

An issue in GitHub may be your best bet, but I'm not a developer so I'm unsure about my own recommendation :)

I thought of starting with the day-to-day admins, or is that also part of the support contract?

> I thought of starting with the day-to-day admins, or is that also part of the support contract?

Our admins are not traditional CRM/database admins; they fulfill a largely technocratic community function and have little to no experience with the underpinnings of the actual software.

If you'd like to communicate with experienced software admins, the Znuny community forum is an active place to reach other users.

FWIW:

root@db1195:/srv/sqldata/otrs# ls -Ssh 
total 633G
294G article_data_mime_plain.ibd
292G article_data_mime_attachment.ibd
...

The reason the tables are basically the same size is that almost all modern email clients send the HTML body as an attachment (multipart), so it's basically 300 GB (already massive) but duplicated :/
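A rough way to confirm how much of article_data_mime_attachment is just the HTML alternative of each mail; a sketch only, with the content_type and content column names assumed from the stock Znuny schema, and again a full scan, so replica only:

# Attachment bytes grouped by content type (text/html* should dominate if the
# duplication theory holds)
mysql -h db1195 otrs -e "
    SELECT content_type,
           COUNT(*) AS parts,
           ROUND(SUM(LENGTH(content)) / POW(1024, 3), 1) AS size_gb
    FROM article_data_mime_attachment
    GROUP BY content_type
    ORDER BY size_gb DESC
    LIMIT 10;"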

o/

We've started doing some tests on migrating these large objects to the filesystem, just to get a feel for what this would look like, but nothing solid as yet. I did this on a test installation on WMCS with very little data, but I'm hoping to get a clone of the production database (T341489) soon to get more meaningful results of what this will actually look like. The process seems fairly straightforward, but Znuny support warned us to ensure we have enough inodes on the filesystem we migrate the data to.

otrs@vrts-1002:/opt/otrs$ ./bin/otrs.Console.pl Admin::Article::StorageSwitch --target ArticleStorageFS
...
1/1 (TicketID:1)
Done.
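Regarding the inode warning, a quick sanity check before the real migration could look like the sketch below; the var/ path under /opt/otrs is an assumption for where the FS backend would write:

# Free inodes on the target filesystem
df -i /opt/otrs/var
# Compare IFree with the row counts of the article storage tables,
# since every attachment and every plain body becomes its own file.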

@Arnoldokoth Testing with the DB snapshot will likely require firewall adjustments, let's discuss the requirements when you're back.

Change 952865 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db1118 to m2

https://gerrit.wikimedia.org/r/952865

Change 952865 merged by Marostegui:

[operations/puppet@production] mariadb: Move db1118 to m2

https://gerrit.wikimedia.org/r/952865

LSobanski raised the priority of this task from Low to Medium. Aug 30 2023, 2:41 PM