
Show "timed out" error to the user when an event update has been running for over an hour
Open, Low | Public | Bug Report

Description

Currently, if an update (aka job) has been running for over an hour, the system just assumes the worst and deletes it, without giving the user any indication of what happened. We now have different states for jobs (internally there's "queued", "started", "failed timeout" and "failed unknown"). Instead of deleting the job, we can set its state to "failed timeout". Then we can show the same timeout error that you see when individual queries time out.
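
A rough sketch of the proposed behavior, in Python purely for illustration (this is not the Event Metrics code). The four state names come from the description above. The `Job` class, `expire_stale_jobs`, `user_message`, the `limit` parameter and the message text are all hypothetical. The limit defaults to the one hour mentioned here and is left configurable, since the right interval is still under discussion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional


class JobState(Enum):
    # State names as listed in the task description.
    QUEUED = "queued"
    STARTED = "started"
    FAILED_TIMEOUT = "failed timeout"
    FAILED_UNKNOWN = "failed unknown"


@dataclass
class Job:
    # Hypothetical job record; the real schema may differ.
    event_id: int
    state: JobState
    started_at: Optional[datetime] = None


def expire_stale_jobs(
    jobs: List[Job],
    now: datetime,
    limit: timedelta = timedelta(hours=1),  # assumed default, see lead-in
) -> List[Job]:
    """Mark long-running jobs as timed out instead of deleting them."""
    expired = []
    for job in jobs:
        if (
            job.state is JobState.STARTED
            and job.started_at is not None
            and now - job.started_at > limit
        ):
            # Keep the job so the UI has something to report on.
            job.state = JobState.FAILED_TIMEOUT
            expired.append(job)
    return expired


def user_message(job: Job) -> Optional[str]:
    """Return an error to show the user, or None if there is nothing to report."""
    if job.state is JobState.FAILED_TIMEOUT:
        # Placeholder text; the real message would reuse the query-timeout error.
        return "The update timed out."
    if job.state is JobState.FAILED_UNKNOWN:
        return "The update failed for an unknown reason."
    return None
```

Because the job survives with a "failed timeout" state, the page can explain what happened on the next refresh instead of silently reverting to its previous state.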

Event Timeline

Are we sure that after an hour it's not going to work? I.e., is that the right interval to declare defeat?

Waiting for T220463 to be resolved to test this. Currently, we time out long before the 24-hour limit that this change introduces.

@MaxSem, to test this I started https://eventmetrics-dev.wmflabs.org/programs/150/events/364 this morning at 8:30. It was still crunching at 11:30 today. When I refreshed at that point, it reverted to its original state, with no message and no metrics. So is the standard that nothing should run for over an hour? I'm not sure what we're aiming at...

jmatazzoni changed the subtype of this task from "Task" to "Bug Report". Apr 15 2019, 6:16 PM

@MaxSem has tried what he was going to try, and it seems like this problem is intermittent. @aezell I'm going to pull this task off the board, and we'll monitor to see how things progress.

Thanks. Max did notice that the cron jobs on staging were not configured correctly. That could have contributed to some of the confusion around what was going on with this. I don't think we can consider it complete, though.

I assume we didn't mean to start working on EM again? :P

Niharika removed a project: Community-Tech.
Niharika subscribed.

Low priority.

MaxSem subscribed.