
VRT Logons are delayed
Closed, ResolvedPublicPRODUCTION ERROR

Description

Steps to reproduce:

  1. Go to https://ticket.wikimedia.org
  2. Enter logon credentials
  3. Click "Login"

Expected behavior: System logon, dashboard loads

Observed behavior: System freezes; after a long wait (over 60 seconds) the logon completes and the dashboard loads

Observed with multiple connections and browsers; recurring for about the last 24 hours

Event Timeline

Xaosflux changed the subtype of this task from "Task" to "Production Error".Mar 17 2025, 2:22 PM

The issue could be the enormous number of Junk tickets: normally around 20k, currently 800k.

It appears to be an infinite mail loop involving this ticket:
https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=14332822

I think this has to be checked in the mail transport logs.

Now getting complete failures, with error:

504 Gateway Time-out
The server didn't respond in time.

The system is too slow. I cannot even access the login page.

Mentioned in SAL (#wikimedia-operations) [2025-03-18T15:07:58Z] <arnaudb@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on vrts1003.eqiad.wmnet with reason: debugging T389079

ABran-WMF added a subscriber: phaultfinder.

Webservice is available again. Debugging is still in progress for the mail loop.

Change #1128888 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] vrts: denylist a sender

https://gerrit.wikimedia.org/r/1128888
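For readers unfamiliar with the approach in the patch above: denylisting a sender in Exim is typically done with a `deny` statement in the RCPT ACL. The following is a minimal sketch only; the file path and ACL name are assumptions, not the actual WMF Puppet configuration.

```
# Hypothetical Exim ACL fragment: reject mail from addresses listed in a
# denylist file (one sender address per line). Path and ACL name assumed.
acl_check_rcpt:
  deny
    senders = lsearch;/etc/exim4/deny_senders
    message = Sender address is denylisted
```

Rejecting at SMTP time (rather than accepting and discarding) is what breaks a mail loop: the looping sender gets a permanent failure instead of another deliverable message to respond to.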

Change #1128888 merged by Arnaudb:

[operations/puppet@production] vrts: denylist a sender

https://gerrit.wikimedia.org/r/1128888

It seems the system has recovered after the patch was deployed. Let me know if the situation degrades again.

Mentioned in SAL (#wikimedia-operations) [2025-03-18T16:10:32Z] <arnaudb@cumin1002> DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on phab1004.eqiad.wmnet with reason: debugging T389079

While the gateway timeout is no longer occurring, the original issue (logon taking over a minute to complete) is still present.

This is almost certainly because of the 1.4 million open Junk tickets. It will take approximately 30 hours to delete them. Please stand by.

The Junk queue is now shrinking instead of growing, but it is still very large (over 1.4 million tickets). Unless there is a way to purge the queue, let's wait until it is emptied by the standard process. As a side note, we may want to consider monitoring the Junk queue size going forward.
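A queue-size monitor like the one suggested above could be a simple threshold check. This is a hedged sketch: the function name is invented, the threshold is arbitrary, and in production the count would come from the OTRS database (the commented query assumes the standard OTRS `ticket`/`queue` schema).

```shell
#!/bin/sh
# Hypothetical Icinga-style check: alert when the Junk queue exceeds a threshold.
junk_queue_check() {
    count=$1
    threshold=$2
    if [ "$count" -gt "$threshold" ]; then
        # Non-zero return signals CRITICAL to the monitoring agent.
        echo "CRITICAL: Junk queue has $count tickets (threshold $threshold)"
        return 2
    fi
    echo "OK: Junk queue has $count tickets"
    return 0
}

# In production the count would come from the database, e.g. (assumed schema):
# count=$(mysql -N -e "SELECT COUNT(*) FROM ticket t JOIN queue q ON t.queue_id = q.id WHERE q.name = 'Junk'" otrs)
junk_queue_check 18000 50000
# prints: OK: Junk queue has 18000 tickets
```

A check like this, wired into the existing alerting, would have flagged the 20k-to-800k growth long before logons started timing out.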

This comment was removed by Krd.

Change #1129369 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] vrts: add parameters for exim_deny_senders from private repo

https://gerrit.wikimedia.org/r/1129369

Change #1129374 had a related patch set uploaded (by Dzahn; author: Dzahn):

[labs/private@master] vrts: add profile::vrts::exim_deny_senders with fake value

https://gerrit.wikimedia.org/r/1129374

Change #1129374 merged by Dzahn:

[labs/private@master] vrts: add profile::vrts::exim_deny_senders with fake value

https://gerrit.wikimedia.org/r/1129374

Change #1129369 merged by Dzahn:

[operations/puppet@production] vrts: add parameters for exim_deny_senders from private repo

https://gerrit.wikimedia.org/r/1129369

@Krd Znuny suggested temporarily raising the "GenericAgentRunLimit" from 4000 to 40000 and letting it run every 10 minutes to speed up the cleanup.

I'd support that if we are sure that the database can handle it. I will check later today whether the limit can be increased in the system config.

On second thought, considering it's Friday evening in EMEA, let's not take the risk.
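For reference, the change discussed above would amount to a one-line SysConfig override. This is a sketch only: the setting name `Ticket::GenericAgentRunLimit` matches the comment above, but the console command syntax should be verified against the deployed Znuny version before use.

```
# Hypothetical: raise the GenericAgent run limit via the Znuny console
# (command and option names assumed; verify before running).
./bin/otrs.Console.pl Admin::Config::Update \
    --setting-name Ticket::GenericAgentRunLimit --value 40000
```

Reverting to the default 4000 afterward would keep routine GenericAgent runs from holding long database transactions.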

We may have some database trouble already: the Probably-Spam queue shows 47 tickets in the overview but is actually empty.
{F58894214}

The problem described in the previous entry has disappeared. Good.

After the weekend I suggest we just leave it as it is, as the deletion will be done in a few days.

Junk is now down to 18k tickets, which is more or less the usual amount.

As seen in the email ticket:

otrs@vrts1003:/opt/otrs$ ./bin/otrs.Console.pl Maint::Ticket::QueueIndexCleanup
Error: Kernel::System::Ticket::IndexAccelerator::StaticDB is the active queue index, aborting.
Use Maint::Ticket::QueueIndexRebuild to regenerate the active index.
otrs@vrts1003:/opt/otrs$ ./bin/otrs.Console.pl Maint::Ticket::QueueIndexRebuild
Rebuilding ticket index...
Done.

We still have one more command to run if any index issue persists.

Logon time is back to normal (<5 seconds) now; is this resolved? Was the root cause simply that the system capacity was insufficient for the load?

@Xaosflux Basically, yes, that is the answer. (And overprovisioning for a truly unexpected event like this would not have been the optimal choice either.)

Closing this, as the issue seems to be fixed. Please feel free to reopen if needed!

Change #1140207 had a related patch set uploaded (by Dzahn; author: AOkoth):

[operations/puppet@production] vrts: add junk queue count and remove mobile queue

https://gerrit.wikimedia.org/r/1140207

Change #1140207 merged by AOkoth:

[operations/puppet@production] vrts: add junk queue count and remove mobile queue

https://gerrit.wikimedia.org/r/1140207