
[Spike: 3hrs] Event Metrics 'crunched numbers' for hours then stopped, but showed neither results nor an error msg.
Closed, Resolved · Public · Bug Report

Description

I tried to get numbers on an "event" (which amounts to the everyday editing of 94 people). The URL is https://eventmetrics.wmflabs.org/programs/150/events/364 ("First quarter WPMED edits" in the program "WikiProject Medicine test").

It told me that it was currently crunching the numbers. A couple of hours later, it stopped saying that, but nothing else happened. There are no numbers, no new buttons, etc.

This is a screenshot of what the browser window looked like in the end:

Screen Shot 2019-04-01 at 1.47.17 PM.png (706×534 px, 86 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.

I tested this event at the office. It did just what @Whatamidoing-WMF said: announced it was "crunching the numbers" for 3-4 hours, with the animation, then stopped and went back to square one. No error message and no data update.

I would be surprised if the process is actually still running. I think the backend systems would likely have killed this process but that isn't being reflected in the UI for some reason.

So, possibly there are two issues:

  1. Why is it taking so long?
  2. If it's timing out or failing, why is the UI not being updated?
jmatazzoni renamed this task from "Event Metrics stopped crunching the numbers, but nothing happened" to "Event Metrics 'crunched numbers' for hours then stopped, but showed neither results nor an error msg.". Apr 2 2019, 6:36 PM

This is the event that keeps failing on fetching pageviews (we get emails). That could be slow, but it should never time out. Investigation pending...

I tried updating this event again today. The results were the same in that I got no results, but this time it churned for only about 20 minutes, for whatever reason.

I tried testing this with just the 3 non-Latin-script usernames, since I'd never tested with those before. This time the event updated just fine.

@Samwilson points out that this might be related to T219897 and may benefit from the work we do there. Or it may require similar work to catch the error and keep the process from dying.

@jmatazzoni asks whether the problem is specific to this event, or whether this event is an exemplar of a quiet bug that we haven't seen before but which could affect other events.

MBinder_WMF renamed this task from "Event Metrics 'crunched numbers' for hours then stopped, but showed neither results nor an error msg." to "[Spike: 3hrs] Event Metrics 'crunched numbers' for hours then stopped, but showed neither results nor an error msg.". Apr 2 2019, 11:15 PM
MBinder_WMF added a project: Spike.
MBinder_WMF moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.
jmatazzoni changed the subtype of this task from "Task" to "Bug Report". Apr 3 2019, 12:06 AM

In T219897#5091678, @jmatazzoni wrote:

...So, is that it? Are a lot of pages bad, but a lot of users is fine?

I did a few tests and it's possible that that is the answer. I broke the "First quarter WPMED edits" event into smaller pieces: all the pieces have the same number of participants, but they took place over shorter durations, so they involve fewer contributions—specifically, fewer pages improved:

@dom_walden, what do you think? Is the limit someplace around 6500 pages? It still doesn't explain why this event doesn't show an error message when it times out, if that's what it's doing...

LATER COMMENT: See T219449#5091755. It doesn't appear to be as simple as number of pages. An event with 5.7k pages improved succeeded in processing, while two with significantly fewer pages did not.

See my comment in T219449#5092010. There seems to be some unreliability as to whether an event will process or not, perhaps related to how much load the server is under at the time.

I have been able to create events with 50k pages improved and have them process. E.g. this event of 50k Wikidata items improved, with 2 participants, just now took only 1-2 minutes to get data. On the other hand, it reports 0 pageviews.

I will investigate events with larger pageview counts to see if they have a problem. (Although AF 4 and AF 1 have large pageview counts, and they loaded in under 10 mins for me just now.)

The main problem I have is that the pages improved report tends to time out for large numbers of pages improved.

Dom's point about server load also applies to load across the entire environment, especially the database. When it's higher, we likely have fewer resources available and so things take longer. When they take longer, they're more likely to fail.

A way to get at this data is to send timing information for all jobs to Graphite and start graphing the length of these jobs against whatever other metrics we have about them, i.e. number of participants, number of pages, length of event, etc.
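
As a rough illustration of what that could look like (the Graphite host and metric names below are hypothetical, and this Python sketch only demonstrates Graphite's plaintext protocol, not the app's own code):

```python
import socket
import time

GRAPHITE_HOST = "graphite.example.org"  # hypothetical host
GRAPHITE_PORT = 2003                    # Graphite's plaintext protocol port


def send_job_timing(event_id, participants, pages, duration_seconds):
    """Send one data point per metric for a finished job to Graphite."""
    now = int(time.time())
    lines = [
        f"eventmetrics.jobs.duration.{event_id} {duration_seconds} {now}",
        f"eventmetrics.jobs.participants.{event_id} {participants} {now}",
        f"eventmetrics.jobs.pages.{event_id} {pages} {now}",
    ]
    payload = ("\n".join(lines) + "\n").encode("utf-8")
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(payload)


# Example: record that event 364 (94 participants, ~6,500 pages) took 45 minutes.
send_job_timing(364, participants=94, pages=6500, duration_seconds=45 * 60)
```

With something like this in place, we could graph job duration against participant and page counts and see which dimension actually correlates with the timeouts.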

@dom_walden, what do you think? Is the limit someplace around 6500 pages? It still doesn't explain why this event doesn't show an error message when it times out, if that's what it's doing...

Experimenting with this today, the limit for me between what will complete processing and what will time out (with a message) is around 8,000 pages improved or files uploaded (Wikidata items I have seen go up to 50k, but not beyond).

For example, an event with 8,001 files uploaded will complete (relatively quickly). If I increase the time of the event by just 1 minute (adding 6 more files uploaded) it will time out.

However, I was not taking database caching into account, so I will re-run some of these events at a later date once the caches have cleared.

I have been able to create events with 50k pages improved and have them process...

I think I got this wrong. I have not been able to get to 50k other than with wikidata. I thought they would be the same as regular pages improved...

I currently have two events processing data. With DevTools open, I see they are making GET requests to https://eventmetrics-dev.wmflabs.org/events/job-status/$jobid.

For both events, the response is {"id":$jobid,"status":"queued"}. They have remained like this for some time, even when I am not running any other processes. When I do run another process it completes, but the two events remain in the "queued" state.

EDIT: Examining the HTTP logs for one of the events that appears indefinitely queued, the first POST, which started the processing, returned a 500 response: "500: Internal Server Error; The server said: Database quota exceeded!" I did not see this indicated anywhere on the page.
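
For reference, a minimal sketch of polling that same job-status endpoint from a script instead of DevTools; the URL pattern and the "queued" status come from the observations above, and any other status values are assumptions:

```python
import time
import requests

BASE = "https://eventmetrics-dev.wmflabs.org"


def poll_job(job_id, interval=10, max_wait=3600):
    """Poll the job-status endpoint until the job leaves the "queued" state.

    Only the "queued" status is confirmed above; anything else returned here
    is simply whatever the server reports.
    """
    deadline = time.time() + max_wait
    while time.time() < deadline:
        resp = requests.get(f"{BASE}/events/job-status/{job_id}", timeout=30)
        resp.raise_for_status()
        status = resp.json().get("status")
        print(f"job {job_id}: {status}")
        if status != "queued":
            return status
        time.sleep(interval)
    return None  # still queued after max_wait seconds
```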

See also T219449#5093208. Keep in mind the replicas will cache recently scanned rows (or something like that), so some of those WPMED events may have finished or were fast only because you recently ran an update to a similar event that touched the same data (in whole or part). This is why sometimes if an update times out, it will finish on subsequent tries.

Other things to keep in mind:

  • If we are at our database quota, hitting "Update data" will queue a job. The cron will try to start it every 5 minutes until we have enough quota. When this happens, cron output is sent to developers, which I have not seen. So I don't think we've hit our quota at any point yet.
  • Jobs that have been running for an hour are automatically killed. When this happens, no error is shown, under the assumption this should never happen (and it shouldn't). This should at least be logged in some way, which it isn't currently :/ I think this is what happened for the event described in the task description.
  • The longest part of the update process, by a long shot, is fetching pageviews. I would recommend we tackle T217911 next. I have a strong suspicion that will solve a lot of our problems (such as the job being killed, at least some of the "mysql has gone away" errors, etc.). For some large events the update process will in theory be faster by orders of magnitude.

Jobs that have been running for an hour are automatically killed. When this happens, no error is shown

Hmm actually, I'm now remembering this behaviour was basically a hack I added in before we had the different job "states". Now we have a "timed out" state, so we can just set it to that, in which case the user will at least be shown an error. We can also increase the 1-hour threshold, but I don't think jobs should ever be running that long.
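
To illustrate the proposed change (names and structure here are hypothetical, not the app's actual code): rather than silently killing a job that has run for over an hour, the reaper would flip its state so the UI has something to show:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(hours=1)

# Hypothetical state names; the app's real job states may differ.
RUNNING, TIMED_OUT = "running", "timed-out"


def reap_stale_jobs(jobs, now=None):
    """Mark jobs running longer than TIMEOUT as timed out rather than silently killing them.

    `jobs` is assumed to be an iterable of objects with `status`, `started_at`
    and a `save()` method; this is purely illustrative of the proposed behaviour.
    """
    now = now or datetime.utcnow()
    for job in jobs:
        if job.status == RUNNING and now - job.started_at > TIMEOUT:
            job.status = TIMED_OUT  # surfaced to the user as a "timed out" error
            job.save()
```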

OK, so it sounds like we have a diagnosis of why some jobs just end. What is the best way to fix this problem, so that a) it happens less often (if we think a job might actually finish after an hour) and/or b) people get an illuminating error message?

OK, so it sounds like we have a diagnosis of why some jobs just end. What is the best way to fix this problem, so that
a) it happens less often (if we think a job might actually finish after an hour)

T217911: Performance improvement: Fetch pageviews asynchronously should make a big difference. To me this feels like the best next step. We wanted to do this eventually anyway. Fetching pageviews for each page, one by one, is crazy slow considering we have the capability to get pageviews for up to 100 pages at once.
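
For a sense of what asynchronous fetching buys us, here is a rough sketch that keeps many requests to the public Wikimedia pageviews REST API in flight at once; it is illustrative only, not the implementation planned in T217911:

```python
import asyncio
import aiohttp

# Per-article endpoint of the Wikimedia pageviews REST API.
PAGEVIEWS_API = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "{project}/all-access/user/{article}/daily/{start}/{end}"
)


async def fetch_one(session, sem, project, article, start, end):
    """Fetch daily pageviews for one article, limited by the semaphore."""
    url = PAGEVIEWS_API.format(project=project, article=article, start=start, end=end)
    async with sem:
        async with session.get(url) as resp:
            if resp.status != 200:
                return article, 0  # treat missing data as zero views
            data = await resp.json()
            return article, sum(item["views"] for item in data.get("items", []))


async def fetch_all(project, articles, start, end, concurrency=100):
    """Fetch pageviews for many articles with up to `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, sem, project, a, start, end) for a in articles]
        return dict(await asyncio.gather(*tasks))


# Example:
# totals = asyncio.run(fetch_all("en.wikipedia.org", ["Asthma", "Measles"], "20190101", "20190331"))
```

Even if per-request latency stays the same, running requests concurrently instead of one by one should cut the wall-clock time of the pageviews step roughly by the concurrency factor.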

b) people get an illuminating error message?

What I was saying above: the auto-killing of jobs shouldn't happen, and instead we change the state to "timed out". I'll create a ticket here soon. Implementation will be a little bit involved.

Tracked at T220423: Show "timed out" error to the user when an event update has been running for over an hour