Currently, if an update (aka job) has been running for over an hour, the system just assumes the worst and deletes it, without giving the user any indication of what happened. We now have different states for jobs (internally there's "queued", "started", "failed timeout" and "failed unknown"). Instead of deleting the job, we can set its state to "failed timeout". Then we can show the same timeout error that you see when individual queries time out.
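To make the proposal concrete, here is a rough sketch of the cleanup step in Python. It is not the actual Event Metrics code: the `Job` class, the `expire_stale_jobs` function, the `started_at` field and the `JOB_TIMEOUT` constant are all hypothetical names used for illustration. Only the four state names and the one-hour limit come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class JobState(Enum):
    # State names as listed in the task description.
    QUEUED = "queued"
    STARTED = "started"
    FAILED_TIMEOUT = "failed timeout"
    FAILED_UNKNOWN = "failed unknown"


# Assumption: the one-hour limit mentioned in the task description.
JOB_TIMEOUT = timedelta(hours=1)


@dataclass
class Job:
    # Hypothetical job record; the real schema may differ.
    event_id: int
    state: JobState
    started_at: datetime


def expire_stale_jobs(jobs, now, timeout=JOB_TIMEOUT):
    """Mark long-running jobs as timed out instead of deleting them.

    Returns the jobs whose state was changed, so the caller can
    persist them and show the timeout error to the user.
    """
    expired = []
    for job in jobs:
        # Only jobs that are actually running can time out.
        if job.state is JobState.STARTED and now - job.started_at > timeout:
            job.state = JobState.FAILED_TIMEOUT
            expired.append(job)
    return expired
```

Because the job record is kept, the event page can check for the "failed timeout" state on load and display the existing timeout message rather than silently reverting.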
@MaxSem, to test, I started https://eventmetrics-dev.wmflabs.org/programs/150/events/364 this morning at 8:30. It was still crunching at 11:30 today. When I refreshed at that point, it reverted to its original state, with no message and no metrics. So is the standard that nothing should run over an hour? I'm not sure what we're aiming at...
Thanks. Max did notice that the cron jobs on staging were not configured correctly. That could have contributed to some of the confusion around what was going on here. I don't think we can consider this complete, though.