Update OTRS to the latest stable version (6.0.x)
Open, MediumPublic

Description

The last update of OTRS was made through T74109. The upgrade was successful despite some bugs being found.

Yet again, it would be great to update to the latest version (after a test period, like it was done last time).

Some fixes and features can improve the work of agents and the security of OTRS (2FA with T122220, bug with links T126759, security issue T187893, etc.)

See latest release notes.

Event Timeline

akosiaris moved this task from Backlog to Pending patch / update on the OTRS board.Jul 6 2020, 2:40 PM

As this seems more related to snapshotting than to databases, I may take care of the needed db preparation myself.

If db1077 is freed because of T256120#6281102, we could maybe set up the staging otrs database there, to avoid the extra cross-dc latency.

@akosiaris How long do you estimate you will need the temporary db for? I know you said you don't know (that's ok), but depending on whether it is closer to 1 month or 6 months, we may go with a different hardware selection for the temporary db.

Stretch question. Have you considered working on the attachments issue at RE:T138915 at the same time? Many of the db issues/maintenance tasks would take far less time if those were separate.

db1077 can be used yep

As this seems more related to snapshotting than to databases, I may take care of the needed db preparation myself.

Thanks. I am fine with doing it on my own as well if it's going to take up valuable time. It's part of my goals for this quarter and I've got allocated time for it.

If db1077 is freed because of T256120#6281102, we could maybe set up the staging otrs database there, to avoid the extra cross-dc latency.

@akosiaris How long do you estimate you will need the temporary db for? I know you said you don't know (that's ok), but depending on whether it is closer to 1 month or 6 months, we may go with a different hardware selection for the temporary db.

Between 1 month and 2 months, I'd say. Hopefully less.

Stretch question. Have you considered working on the attachments issue at RE:T138915 at the same time? Many of the db issues/maintenance tasks would take far less time if those were separate.

Yes, but it's orthogonal after all. Moving the attachments out of the database requires writing code to move the attachments to swift, code that hasn't been written by anyone yet and that requires some OTRS expertise. As such, it's a far greater investment in time for the team than is currently possible.

ElHef added a subscriber: ElHef.Jul 6 2020, 7:13 PM
Jeff_G added a subscriber: Jeff_G.Jul 8 2020, 9:04 AM

Change 614746 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] otrs: Set otrs1001 as OTRS role

https://gerrit.wikimedia.org/r/614746

Change 614759 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] traffic: Add ticket-test.wikimedia.org

https://gerrit.wikimedia.org/r/614759

Change 614746 merged by Alexandros Kosiaris:
[operations/puppet@production] otrs: Set otrs1001 as OTRS role

https://gerrit.wikimedia.org/r/614746

Change 615170 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] otrs: Allow disabling daemon in profile

https://gerrit.wikimedia.org/r/615170

Change 615170 merged by Alexandros Kosiaris:
[operations/puppet@production] otrs: Allow disabling daemon in profile

https://gerrit.wikimedia.org/r/615170

Change 615176 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] otrs: vary mysql-client on debian distro version

https://gerrit.wikimedia.org/r/615176

Change 615210 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] otrs: Remove demime condition from exim

https://gerrit.wikimedia.org/r/615210

Change 615176 merged by Alexandros Kosiaris:
[operations/puppet@production] otrs: vary mysql-client on debian distro version

https://gerrit.wikimedia.org/r/615176

Change 615210 merged by Alexandros Kosiaris:
[operations/puppet@production] otrs: Remove demime condition from exim

https://gerrit.wikimedia.org/r/615210

Change 616531 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] otrs: Add OTRS 6.0.29 prereq packages

https://gerrit.wikimedia.org/r/616531

Change 616532 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Temporarily add ticket-test.wikimedia.org

https://gerrit.wikimedia.org/r/616532

Change 616531 merged by Alexandros Kosiaris:
[operations/puppet@production] otrs: Add OTRS 6.0.29 prereq packages

https://gerrit.wikimedia.org/r/616531

Change 617381 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/software/otrs@master] Release 1.0.16. First version that only support 6.0.x

https://gerrit.wikimedia.org/r/617381

Change 614759 merged by Alexandros Kosiaris:
[operations/puppet@production] traffic: Add ticket-test.wikimedia.org

https://gerrit.wikimedia.org/r/614759

Change 616532 merged by Alexandros Kosiaris:
[operations/dns@master] Temporarily add ticket-test.wikimedia.org

https://gerrit.wikimedia.org/r/616532

akosiaris added a comment.EditedJul 30 2020, 11:50 AM

An update:

The upgrade on the new node using a test database has progressed ok. I ran into a couple of issues:

The script ./DBUpdate-to-6.pl complained about the database itself not being utf8 (despite all the tables being utf8 - I double checked). Fixed with:

alter database otrs character set utf8;
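
The complaint above fits the fact that MySQL/MariaDB track the schema-level default character set separately from the per-table character sets. A minimal sketch for checking both before running the migration, assuming the schema is named otrs as in the statement above. The function name charset_check_sql is hypothetical, and the function only prints SQL so that it can be piped into whichever mysql client is in use:

```shell
#!/bin/sh
# Hypothetical helper: prints SQL that reports the schema default charset
# and lists any tables whose collation is not utf8-based. It only prints;
# pipe the output into a mysql client to actually run the queries.
charset_check_sql() {
    db="${1:-otrs}"   # schema name; "otrs" is taken from the ALTER above
    cat <<EOF
SELECT default_character_set_name
  FROM information_schema.SCHEMATA
 WHERE schema_name = '${db}';
SELECT table_name, table_collation
  FROM information_schema.TABLES
 WHERE table_schema = '${db}'
   AND table_collation NOT LIKE 'utf8%';
EOF
}

charset_check_sql otrs
```

If the first query returns something other than utf8 while the second returns no rows, the situation matches the one described here: utf8 tables inside a schema whose default is not utf8.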

Then

Found 39 orphaned entries in time_accounting table
Do you want to automatically delete the entries from the database now? [Y]es/[N]o:

Since we don't use the time accounting functionality (accounting how much time an agent spends on a task), I pressed yes and proceeded. Note that when I answered no, I was told to inspect the entries and remove them manually before the script would proceed. It turned out that the entire table consisted of very old entries anyway (from 2004 to 2013).

To be followed by:

Found 3138 orphaned entries in ticket_history table ...
Do you want to automatically delete the entries from the database now? [Y]es/[N]o:

Again, answering no results in the script terminating and asking that we take care of those manually, providing an SQL query. The script won't proceed without the entries being deleted. Again, the suggested SELECT query pointed to very old entries (from 2007 to 2013). I was forced to delete them.

The above actions all allowed the process to proceed. The next issue was that

Step 31 of 44: Post changes on article related tables ...

took close to 48h as it runs some very time-consuming ALTER TABLE statements. It seems we will have to schedule a rather long maintenance window for this migration.

After this was done we get to the following

Warning: Ticket::SearchIndexModule is not an entity value type, skipping...

This is actually NOT a warning and failing to fix it will cause OTRS to error out. The fix isn't clear and only a hint is provided by the upgrade script

Following settings were not fixed:
  - Ticket::SearchIndexModule

Please use console command (bin/otrs.Console.pl Admin::Config::Update --help) or GUI to fix them.

Turns out the correct commands to run are

bin/otrs.Console.pl Admin::Config::Update --setting-name Ticket::SearchIndexModule --value "Kernel::System::Ticket::ArticleSearchIndex::DB"
bin/otrs.Console.pl Maint::Ticket::FulltextIndex --rebuild
Marking all articles for reindexing...
Done.
bin/otrs.Console.pl Maint::Ticket::QueueIndexRebuild
Rebuilding ticket index...
Done.
bin/otrs.Console.pl Maint::Ticket::EscalationIndexRebuild

That last command takes another couple of hours.
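
Since the duration of each step feeds directly into the size of the maintenance window, it may help to record timings during the test run. A minimal sketch, assuming a POSIX shell and a date that supports +%s. The run_timed wrapper is hypothetical and not part of OTRS:

```shell
#!/bin/sh
# Hypothetical wrapper: runs a command and reports its wall-clock duration
# and exit status, so long steps can be budgeted for the real window.
run_timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "[$((end - start))s, exit ${status}] $*"
    return "$status"
}

# Harmless stand-in. On the OTRS host the argument would be one of the
# console commands listed above, for example:
#   run_timed bin/otrs.Console.pl Maint::Ticket::EscalationIndexRebuild
run_timed sleep 1
```

The wrapper passes the exit status of the wrapped command through, so it can be chained with && without hiding failures.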

Finally, the new version (1.0.16) of the WikimediaTemplates package needs to be installed. I've patched, built and installed it. Unfortunately it's not backwards compatible with 5.0.x, but that doesn't matter much.

There is 1 more warning, which is asking for a value and it's easy to fix:

Ticket::Type::Default is invalid, select one of the choices below:
    [1] default

Your choice:

DNS and edge cache changes have been merged; this is ready to be tested by agents. I'll ping on the OTRS wiki noticeboard asking for volunteers who want to test drive this. The URL is ticket-test.wikimedia.org. The instance receives no inbound email and the database is a point-in-time snapshot. Outgoing email won't work either.

It takes very little to load another snapshot if you think you need it.

Change 617700 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] otrs: Disallow outgoing emails from test instance

https://gerrit.wikimedia.org/r/617700

Change 617700 merged by Alexandros Kosiaris:
[operations/puppet@production] otrs: Disallow outgoing emails from test instance

https://gerrit.wikimedia.org/r/617700

eyazi added a comment.Jul 31 2020, 2:42 PM

Not sure if you did, but you should also reset the Ticket::SearchIndexModule setting. Can be done on the interface if you have access or via console:

bin/otrs.Console.pl Admin::Config::Update --setting-name Ticket::SearchIndexModule --reset

Not sure if you did, but you should also reset the Ticket::SearchIndexModule setting. Can be done on the interface if you have access or via console:

bin/otrs.Console.pl Admin::Config::Update --setting-name Ticket::SearchIndexModule --reset

@eyazi Thanks! I actually went with bin/otrs.Console.pl Admin::Config::Update --setting-name Ticket::SearchIndexModule --value "Kernel::System::Ticket::ArticleSearchIndex::DB" but the end result is the same. I did add it in my notes above, seems like I had forgotten it.

There don't seem to have been any actionable comments regarding the test installation, either in Phabricator or the OTRS Cafe. In that light, I'd like to schedule a maintenance window for this migration. It will have to be a pretty large one (~48h, aka 2 days) as the migration script takes a long time to perform a number of ALTER TABLE statements. That's unfortunate but doable.

During the maintenance window the system will be completely offline. That means:

  • No access over the web to the interface
  • No scheduled jobs of any kind will be run
  • Email will not be delivered but rather backlogged. It will not be lost as our MX systems will accept them and put them in the queue. Once the system is back to being fully functional, the emails will flow into the system.

We will have a rollback plan ready of course in case the migration goes south. The migration has been tested, but something might arise anyway.

Looking at https://otrsreports.toolforge.org/daily.html and https://grafana.wikimedia.org/d/000000371/otrs?viewPanel=1&orgId=1&from=now-90d&to=now-1m, there doesn't seem to be any timeperiod (e.g. a weekend) that is particularly more favorable than others.

So, I am thinking of starting this on Monday September 14th, EU morning (08:00 UTC). Barring any grave issues and a need to roll back, we should be operational again around Wednesday September 16th (~08:00 UTC again). If there are objections, now is the time.

I'll be informing OTRS Cafe and setting appropriate notices in OTRS for agents to be aware.

Looks good. Thank you Akosiaris for all your work.

It might be worth notifying the various OTRS mailing lists of the downtime - not everybody reads the Café.

Looks good. Thank you Akosiaris for all your work.

It might be worth notifying the various OTRS mailing lists of the downtime - not everybody reads the Café.

Good point, I'll send out an email to otrs-en-l and otrs-fr as I am able to communicate in those languages. We'll have to rely on the OTRS software notices, the Cafe, Tech news and the good will of OTRS agents to translate for the others, I guess.

Included in https://meta.wikimedia.org/wiki/Tech/News/2020/37 going out on Monday.

@Johan, I had not thought of that angle, many many thanks for that.

@akosiaris This is my proposal, db-wise:

  • Disable https://ticket-test.wikimedia.org so it no longer can query db1077 db
  • At some point before the maintenance, clone only otrs database into db1077 again and make it replicate from m2 primary (db1107)
  • Just before maintenance starts, stop/disable replication on db1077, keep it on the other dbs for maintaining redundancy of other databases.

This will require keeping a snapshot of m2 on our backup systems, but if there is any unexpected issue, we can just failover otrs db to db1077 quickly, rather than having to do a recovery, while the rest of the services are unaffected. If the issue is on hw/other dbs, we can switchover those to db1117.

@akosiaris This is my proposal, db-wise:

  • Disable https://ticket-test.wikimedia.org so it no longer can query db1077 db
  • At some point before the maintenance, clone only otrs database into db1077 again and make it replicate from m2 primary (db1107)
  • Just before maintenance starts, stop/disable replication on db1077, keep it on the other dbs for maintaining redundancy of other databases.

This will require keeping a snapshot of m2 on our backup systems, but if there is any unexpected issue, we can just failover otrs db to db1077 quickly, rather than having to do a recovery, while the rest of the services are unaffected. If the issue is on hw/other dbs, we can switchover those to db1117.

+1. Sounds pretty ok to me.

I'll keep ticket-test.wikimedia.org working for a couple more days and then disable it on Tuesday.

Change 626604 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove ticket-test.wikimedia.org

https://gerrit.wikimedia.org/r/626604

Change 626626 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Revert "Temporarily add ticket-test.wikimedia.org"

https://gerrit.wikimedia.org/r/626626

Change 626604 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove ticket-test.wikimedia.org

https://gerrit.wikimedia.org/r/626604

Change 626626 merged by Alexandros Kosiaris:
[operations/dns@master] Revert "Temporarily add ticket-test.wikimedia.org"

https://gerrit.wikimedia.org/r/626626

Change 626629 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Switch ticket.discovery.wmnet to otrs1001

https://gerrit.wikimedia.org/r/626629

Change 626630 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] exim: Switch OTRS exim to otrs1001

https://gerrit.wikimedia.org/r/626630

Change 626631 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Promote otrs1001 as the main otrs host

https://gerrit.wikimedia.org/r/626631

akosiaris added a comment.EditedFri, Sep 11, 10:14 AM

Layout of the process for quick reference on Monday:

  • Disable puppet on mendelevium. Command for that: sudo disable-puppet "Puppet disabled. Migration ongoing, services on purpose are disabled. Do not enable, you will cause mayhem. T187984"
  • Stop cron, otrs daemon, apache, exim on mendelevium. Mark them as scheduled downtime for 14 days in icinga.
  • Merge https://gerrit.wikimedia.org/r/626631
  • Run puppet on otrs1001
  • Disable puppet on otrs1001. Command for that: sudo disable-puppet "Puppet disabled. Migration ongoing, services on purpose are disabled. Do not enable, you will cause mayhem. T187984"
  • Stop cron, otrs daemon, apache, exim on otrs1001. Mark them as scheduled downtime for 2.5 days (60h) in icinga.
  • Copy /opt/otrs/Kernel/Config.pm, /opt/otrs/Kernel/Config/Files/ZZZAuto.pm and /opt/otrs/var/log/TicketCounter.log from mendelevium to otrs1001
  • Create a tmux session and start the migration script
  • Monitor, wait, answer questions, monitor, wait and so on for ~48h or so
  • Once the above is done, run the following commands
bin/otrs.Console.pl Admin::Config::Update --setting-name Ticket::SearchIndexModule --value "Kernel::System::Ticket::ArticleSearchIndex::DB"
bin/otrs.Console.pl Maint::Config::Rebuild
bin/otrs.Console.pl Maint::Cache::Delete
bin/otrs.Console.pl Admin::Package::ReinstallAll
bin/otrs.Console.pl Maint::Ticket::FulltextIndex --rebuild
bin/otrs.Console.pl Maint::Ticket::QueueIndexRebuild
bin/otrs.Console.pl Maint::Ticket::EscalationIndexRebuild

Up to this point, it's rather easy to rollback. Upload a change similar to https://gerrit.wikimedia.org/r/626631 to switch mendelevium to use db1077.

Rollback at this point now requires revert of the patch above and running puppet on the above mentioned hosts

Rollback now requires also to revert the above step in addition to the previous ones.

  • Validate that ticket.wikimedia.org works
  • Announce it.
  • Poweroff mendelevium. Mark it as purposefully powered off in icinga.
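
The console commands in the list above can be sketched as a small script that runs them in order and stops at the first failure. This is only an illustration under stated assumptions: the console, post_migration, OTRS_HOME and DRY_RUN names are hypothetical, the /opt/otrs path is taken from the copy step above, and running as the otrs user via sudo is an assumption. DRY_RUN defaults to 1, so by default the script only prints what it would run:

```shell
#!/bin/sh
# Sketch: run the post-migration console commands in order, stopping at the
# first failure. Nothing is executed unless DRY_RUN is explicitly set to 0.
OTRS_HOME="${OTRS_HOME:-/opt/otrs}"   # install path as used in the copy step
DRY_RUN="${DRY_RUN:-1}"

console() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: ${OTRS_HOME}/bin/otrs.Console.pl $*"
    else
        # Assumption: console commands are run as the "otrs" user.
        sudo -u otrs "${OTRS_HOME}/bin/otrs.Console.pl" "$@"
    fi
}

post_migration() {
    console Admin::Config::Update --setting-name Ticket::SearchIndexModule \
        --value "Kernel::System::Ticket::ArticleSearchIndex::DB" &&
    console Maint::Config::Rebuild &&
    console Maint::Cache::Delete &&
    console Admin::Package::ReinstallAll &&
    console Maint::Ticket::FulltextIndex --rebuild &&
    console Maint::Ticket::QueueIndexRebuild &&
    console Maint::Ticket::EscalationIndexRebuild
}

post_migration
```

Chaining with && means a failed config update does not silently proceed into multi-hour index rebuilds.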

@akosiaris This is my proposal, db-wise:

@jcrespo, done!

  • At some point before the maintenance, clone only otrs database into db1077 again and make it replicate from m2 primary (db1107)
  • Just before maintenance starts, stop/disable replication on db1077, keep it on the other dbs for maintaining redundancy of other databases.

This will require keeping a snapshot of m2 on our backup systems, but if there is any unexpected issue, we can just failover otrs db to db1077 quickly, rather than having to do a recovery, while the rest of the services are unaffected. If the issue is on hw/other dbs, we can switchover those to db1117.

I've laid out the migration process above, with hooks to allow for using the restoration process.

Change 626690 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Temporarilly disable logical backups of m2 on eqiad

https://gerrit.wikimedia.org/r/626690

Change 626690 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Temporarilly disable logical backups of m2 on eqiad

https://gerrit.wikimedia.org/r/626690

Change 626669 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Revert "mariadb-backups: Temporarilly disable logical backups of m2 on eqiad"

https://gerrit.wikimedia.org/r/626669

  • At some point before the maintenance, clone only otrs database into db1077 again and make it replicate from m2 primary (db1107)

This is done. A snapshot is also available on the provisioning hosts.

IMPORTANT: db1077 is online and replicating from m2 master. Replication should be stopped and disabled after otrs is offline but before maintenance starts for a fast switchover!

This is just a couple of commands that can be done before maintenance.
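
A sketch of what those commands could look like on MariaDB, written as a helper that only prints the statements. The detach_replica_sql name is hypothetical. Recording the replication status between stopping and resetting keeps the coordinates around in case the replica has to be re-attached:

```shell
#!/bin/sh
# Hypothetical helper: prints the statements for detaching a replica.
# It only prints; the output would be fed to a mysql client on the replica.
detach_replica_sql() {
    cat <<'EOF'
STOP SLAVE;
SHOW SLAVE STATUS\G
RESET SLAVE ALL;
EOF
}

detach_replica_sql
```

RESET SLAVE ALL also discards the stored connection parameters, which is why the status output should be saved first.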

Mentioned in SAL (#wikimedia-operations) [2020-09-14T08:49:38Z] <akosiaris> start the OTRS upgrade to 6.0.29 T187984

Mentioned in SAL (#wikimedia-operations) [2020-09-14T09:09:07Z] <akosiaris> db1077. stop slave ; show slave status > /home/akosiaris/show_slave_status; reset slave all T187984

Change 626631 merged by Alexandros Kosiaris:
[operations/puppet@production] Promote otrs1001 as the main otrs host

https://gerrit.wikimedia.org/r/626631

Mentioned in SAL (#wikimedia-operations) [2020-09-14T09:26:17Z] <akosiaris> T187984 migration script on otrs1001 now in step 8/41

Mentioned in SAL (#wikimedia-operations) [2020-09-14T09:27:36Z] <akosiaris> T187984 migration script on otrs1001 now in step 8/44 (correction)

Mentioned in SAL (#wikimedia-operations) [2020-09-14T12:06:54Z] <akosiaris> T187984 migration script on otrs1001 now in step 31/44

Mentioned in SAL (#wikimedia-operations) [2020-09-15T08:01:04Z] <akosiaris> T187984 migration script on otrs1001 proceeding as expected. Still in step 31/44, but that's what we saw in the test migration

Change 617381 merged by Alexandros Kosiaris:
[operations/software/otrs@master] Release 1.0.16. First version that only support 6.0.x

https://gerrit.wikimedia.org/r/617381

Mentioned in SAL (#wikimedia-operations) [2020-09-16T07:02:36Z] <akosiaris> T187984 migration script done. Config updates, rebuilds, package upgrades/reinstall and index rebuilds done

Mentioned in SAL (#wikimedia-operations) [2020-09-16T07:03:02Z] <akosiaris> T187984 validated that the OTRS installation is functional over SSH

Mentioned in SAL (#wikimedia-operations) [2020-09-16T07:12:37Z] <akosiaris> T187984 Disable gravatar in system configuration to avoid leaking agent PII through a 3rd party service

Change 626630 merged by Alexandros Kosiaris:
[operations/puppet@production] exim: Switch OTRS exim to otrs1001

https://gerrit.wikimedia.org/r/626630

Mentioned in SAL (#wikimedia-operations) [2020-09-16T07:26:22Z] <akosiaris> T187984 Tested outbound email, switching inbound email configuration and performing tests

I'll split this off into its own task, but it's worth pointing out here so that it isn't forgotten: Znuny's QuickClose package seems to be causing multiple Quick Close menus to appear in Ticket View.

Mentioned in SAL (#wikimedia-operations) [2020-09-16T07:37:19Z] <akosiaris> T187984 Tested inbound email successfully

Change 626669 merged by Jcrespo:
[operations/puppet@production] Revert "mariadb-backups: Temporarilly disable logical backups of m2 on eqiad"

https://gerrit.wikimedia.org/r/626669

Mentioned in SAL (#wikimedia-operations) [2020-09-16T07:49:56Z] <akosiaris> T187984 Switch over ticket.discovery.wmnet to otrs1001

Change 626629 merged by Alexandros Kosiaris:
[operations/dns@master] Switch ticket.discovery.wmnet to otrs1001

https://gerrit.wikimedia.org/r/626629

Change 627745 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Incrase max_allowed_packet to 64MB on all generic misc dbs

https://gerrit.wikimedia.org/r/627745

Mentioned in SAL (#wikimedia-operations) [2020-09-16T08:04:01Z] <akosiaris> T187984 Validated that ticket.wikimedia.org works, proceeding with a wider announcement

Change 627745 merged by Jcrespo:
[operations/puppet@production] mariadb: Increase max_allowed_packet to 64MB on all generic misc dbs

https://gerrit.wikimedia.org/r/627745

All m2 dbs are back to sync with primary server:
https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=6&orgId=1&var-server=db2078&var-port=13322&from=now-2d&to=now

No backups or failovers were needed, so far.

Mentioned in SAL (#wikimedia-operations) [2020-09-16T10:01:29Z] <akosiaris> T187984 Shutdown mendelevium.

Please restore the color highlighting as it was in the previous OTRS version. Its removal from OTRS 6.0 makes it more difficult to distinguish between sent and received emails, and internal memos, in lengthy and complex emails. Other OTRS agents besides myself have complained about this on the OTRS discussion thread (see "colours"). At least, why not have it available as a user-preference option?

JGHowes

@JGHowes please see T263243.

Note that the new version was in test for all before the upgrade and that this issue could have been pointed out before.

Also, a user-preference option would be a new feature that will probably not be developed, but a custom CSS rule is probably more doable.

Please restore the color highlighting as it was in the previous OTRS version. Its removal from OTRS 6.0 makes it more difficult to distinguish between sent and received emails, and internal memos, in lengthy and complex emails. Other OTRS agents besides myself have complained about this on the OTRS discussion thread (see "colours"). At least, why not have it available as a user-preference option?

I wish I could do that easily. That decision wasn't taken by me or anyone else in WMF, the community or the movement as a whole. It's from the upstream OTRS developers. You can view their rationale and decision-making process in https://bugs.otrs.org/show_bug.cgi?id=13155 (comment 6 is particularly interesting).

Which discussion thread are you referring to, btw? Got a link?

@JGHowes please see T263243.

Note that the new version was in test for all before the upgrade and that this issue could have been pointed out before.

That's true and it has been, albeit by a single person. See https://otrs-wiki.wikimedia.org/w/index.php?title=Caf%C3%A9&diff=next&oldid=100583 and following edits for the discussion.

that will probably not be developed

As a small correction, instead of "that will probably not be developed", something more like "that would have to be requested from the developers of the software", who are the ones that took those decisions (OTRS).

@jcrespo: From experience, those kinds of features will not see the light of day in our era for OTRS, since more urgent "features" (like T23579) have been stuck since 2013... Also, this "color" scheme was previously a feature (see https://otrsteam.ideascale.com/a/dtd/color-actions-in-ticket-history-view/90409-10369) that has now been discontinued, and it is unlikely that they are going to bring it back with a tickbox in the near future.

Anyway, it will be easier and quicker to fix it locally, feel free to support the related ticket and help :-)

Yeah, not disagreeing, in fact supporting that ticket. My emphasis was that it was not Alex's decision to remove it. :-) Cheers.