
[Spike: 3hrs] Event Metrics 'crunched numbers' for hours then stopped, but showed neither results nor an error msg.
Closed, Resolved · Public · Bug Report

Description

I tried to get numbers on an "event" (which amounts to the everyday editing of 94 people). The URL is https://eventmetrics.wmflabs.org/programs/150/events/364 ("First quarter WPMED edits" in the program "WikiProject Medicine test").

It told me that it was currently crunching the numbers. A couple of hours later, it stopped saying that, but nothing else happened. There are no numbers, no new buttons, etc.

This is a screenshot of what the browser window looked like in the end:

Screen Shot 2019-04-01 at 1.47.17 PM.png (706×534 px, 86 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.

I tested this event at the office. It did just what @Whatamidoing-WMF said: announced it was "crunching the numbers" for 3-4 hours, with the animation, then stopped and went back to square one. No error message and no data update.

I would be surprised if the process is actually still running. I think the backend systems would likely have killed this process but that isn't being reflected in the UI for some reason.

So, possibly there are two issues:

  1. Why is it taking so long?
  2. If it's timing out or failing, why is the UI not being updated?
jmatazzoni renamed this task from "Event Metrics stopped crunching the numbers, but nothing happened" to "Event Metrics 'crunched numbers' for hours then stopped, but showed neither results nor an error msg.". Apr 2 2019, 6:36 PM

This is the event that keeps failing on fetching pageviews (we get emails). That could be slow, but it should never time out. Investigation pending...

I tried updating this event again today. The results were the same in that I got no results, but this time it churned for only about 20 minutes, for whatever reason.

I tried testing this with just the 3 non-Latin-script usernames, since I'd never tested with those before. This time the event updated just fine.

@Samwilson points out that this might be related to T219897 and may benefit from the work we do there. Or it may require similar work to catch the error and keep the process from dying.

@jmatazzoni asks whether the problem is specific to this event, or whether this event is an exemplar of a quiet bug that we haven't seen before but which could affect other events.

MBinder_WMF renamed this task from "Event Metrics 'crunched numbers' for hours then stopped, but showed neither results nor an error msg." to "[Spike: 3hrs] Event Metrics 'crunched numbers' for hours then stopped, but showed neither results nor an error msg.". Apr 2 2019, 11:15 PM
MBinder_WMF added a project: Spike.
MBinder_WMF moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.
jmatazzoni changed the subtype of this task from "Task" to "Bug Report". Apr 3 2019, 12:06 AM

In T219897#5091678, @jmatazzoni wrote:

...So, is that it? Are a lot of pages bad, but a lot of users is fine?

I did a few tests and it's possible that that is the answer. I broke the "First quarter WPMED edits" event into smaller pieces: all the pieces have the same number of participants, but they took place over shorter durations, so they involve fewer contributions—specifically, fewer pages improved:

@dom_walden, what do you think? Is the limit someplace around 6500 pages? It still doesn't explain why this event doesn't show an error message when it times out, if that's what it's doing...

LATER COMMENT: See T219449#5091755. It doesn't appear to be as simple as number of pages. An event with 5.7k pages improved succeeded in processing, while two with significantly fewer pages did not.

See my comment in T219449#5092010. There seems to be some unreliability as to whether an event will process or not, perhaps related to how much load the server is under at the time.

I have been able to create events with 50k pages improved and have them process. E.g. this event of 50k Wikidata items improved, with 2 participants, just now took only 1-2 minutes to get data. On the other hand, it reports 0 pageviews.

I will investigate events with larger pageview counts to see if they have a problem. (Although AF 4 and AF 1 have large pageview counts, and they loaded in under 10 mins for me just now.)

The main problem I have is that the pages improved report tends to time out for large numbers of pages improved.

Dom's point about server load also applies to load across the entire environment, especially the database. When it's higher, we likely have fewer resources available and so things take longer. When they take longer, they're more likely to fail.

A way to get at this data is to send timing information for all jobs to Graphite and start graphing the length of these jobs against whatever other metrics we have about them, i.e. number of participants, number of pages, length of event, etc.
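
As a rough illustration of what that could look like (the Graphite host and metric names below are hypothetical, and this Python sketch only demonstrates Graphite's plaintext protocol, not the app's own code):

```python
import socket
import time

GRAPHITE_HOST = "graphite.example.org"  # hypothetical host
GRAPHITE_PORT = 2003                    # Graphite's plaintext protocol port


def send_job_timing(event_id, participants, pages, duration_seconds):
    """Send one data point per metric for a finished job to Graphite."""
    now = int(time.time())
    lines = [
        f"eventmetrics.jobs.duration.{event_id} {duration_seconds} {now}",
        f"eventmetrics.jobs.participants.{event_id} {participants} {now}",
        f"eventmetrics.jobs.pages.{event_id} {pages} {now}",
    ]
    payload = ("\n".join(lines) + "\n").encode("utf-8")
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(payload)


# Example: record that event 364 (94 participants, ~6,500 pages) took 45 minutes.
send_job_timing(364, participants=94, pages=6500, duration_seconds=45 * 60)
```

With something like this in place, we could graph job duration against participant and page counts and see which dimension actually correlates with the timeouts.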

@dom_walden, what do you think? Is the limit someplace around 6500 pages? It still doesn't explain why this event doesn't show an error message when it times out, if that's what it's doing...

Experimenting with this today, the limit for me between what will complete processing and what will time out (with a message) is around 8,000 pages improved or files uploaded (Wikidata items I have seen go up to 50k, but not beyond).

For example, an event with 8,001 files uploaded will complete (relatively quickly). If I increase the time of the event by just 1 minute (adding 6 more files uploaded) it will time out.

However, I was not taking database caching into account, so I will re-run some of these events at a later date once the caches have cleared.

I have been able to create events with 50k pages improved and have them process...

I think I got this wrong. I have not been able to get to 50k other than with wikidata. I thought they would be the same as regular pages improved...

I currently have two events processing data. With DevTools open, I see they are making GET requests to https://eventmetrics-dev.wmflabs.org/events/job-status/$jobid.

For both events, the response is {"id":$jobid,"status":"queued"}. They have remained like this for some time, even when I am not running any other processes. When I do run another process it completes, but the two events remain in the "queued" state.

EDIT: Examining the HTTP logs for one of the events that appears indefinitely queued, the first POST, which started the processing, returned a 500 response: "500: Internal Server Error; The server said: Database quota exceeded!" I did not see this indicated anywhere on the page.
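
For reference, a minimal sketch of polling that same job-status endpoint from a script instead of DevTools; the URL pattern and the "queued" status come from the observations above, and any other status values are assumptions:

```python
import time
import requests

BASE = "https://eventmetrics-dev.wmflabs.org"


def poll_job(job_id, interval=10, max_wait=3600):
    """Poll the job-status endpoint until the job leaves the "queued" state.

    Only the "queued" status is confirmed above; anything else returned here
    is simply whatever the server reports.
    """
    deadline = time.time() + max_wait
    while time.time() < deadline:
        resp = requests.get(f"{BASE}/events/job-status/{job_id}", timeout=30)
        resp.raise_for_status()
        status = resp.json().get("status")
        print(f"job {job_id}: {status}")
        if status != "queued":
            return status
        time.sleep(interval)
    return None  # still queued after max_wait seconds
```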

See also T219449#5093208. Keep in mind the replicas will cache recently scanned rows (or something like that), so some of those WPMED events may have finished or were fast only because you recently ran an update to a similar event that touched the same data (in whole or part). This is why sometimes if an update times out, it will finish on subsequent tries.

Other things to keep in mind:

  • If we are at our database quota, hitting "Update data" will queue a job. The cron will try to start it every 5 minutes until we have enough quota. When this happens, cron output is sent to developers, which I have not seen. So I don't think we've hit our quota at any point yet.
  • Jobs that have been running for an hour are automatically killed. When this happens, no error is shown, under the assumption this should never happen (and it shouldn't). This should at least be logged in some way, which it isn't currently :/ I think this is what happened for the event described in the task description.
  • The longest part of the update process, by a long shot, is fetching pageviews. I would recommend we tackle T217911 next. I have a strong suspicion that will solve a lot of our problems (such as the job being killed, at least some of the "mysql has gone away" errors, etc.). For some large events the update process will in theory be faster by orders of magnitude.

Jobs that have been running for an hour are automatically killed. When this happens, no error is shown

Hmm actually, I'm now remembering this behaviour was basically a hack I added in before we had the different job "states". Now we have a "timed out" state, so we can just set it to that, in which case the user will at least be shown an error. We can also increase the 1-hour threshold, but I don't think jobs should ever be running that long.
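
To illustrate the proposed change (names and structure here are hypothetical, not the app's actual code): rather than silently killing a job that has run for over an hour, the reaper would flip its state so the UI has something to show:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(hours=1)

# Hypothetical state names; the app's real job states may differ.
RUNNING, TIMED_OUT = "running", "timed-out"


def reap_stale_jobs(jobs, now=None):
    """Mark jobs running longer than TIMEOUT as timed out rather than silently killing them.

    `jobs` is assumed to be an iterable of objects with `status`, `started_at`
    and a `save()` method; this is purely illustrative of the proposed behaviour.
    """
    now = now or datetime.utcnow()
    for job in jobs:
        if job.status == RUNNING and now - job.started_at > TIMEOUT:
            job.status = TIMED_OUT  # surfaced to the user as a "timed out" error
            job.save()
```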

OK, so it sounds like we have a diagnosis of why some jobs just end. What is the best way to fix this problem, so that a) it happens less often (if we think a job might actually finish after an hour) and/or b) people get an illuminating error message?

OK, so it sounds like we have a diagnosis of why some jobs just end. What is the best way to fix this problem, so that
a) it happens less often (if we think a job might actually finish after an hour)

T217911: Performance improvement: Fetch pageviews asynchronously should make a big difference. To me this feels like the best next step. We wanted to do this eventually anyway. Fetching pageviews for each page, one by one, is crazy slow considering we have the capability to get pageviews for up to 100 pages at once.
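
For a sense of what asynchronous fetching buys us, here is a rough sketch that keeps many requests to the public Wikimedia pageviews REST API in flight at once; it is illustrative only, not the implementation planned in T217911:

```python
import asyncio
import aiohttp

# Per-article endpoint of the Wikimedia pageviews REST API.
PAGEVIEWS_API = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "{project}/all-access/user/{article}/daily/{start}/{end}"
)


async def fetch_one(session, sem, project, article, start, end):
    """Fetch daily pageviews for one article, limited by the semaphore."""
    url = PAGEVIEWS_API.format(project=project, article=article, start=start, end=end)
    async with sem:
        async with session.get(url) as resp:
            if resp.status != 200:
                return article, 0  # treat missing data as zero views
            data = await resp.json()
            return article, sum(item["views"] for item in data.get("items", []))


async def fetch_all(project, articles, start, end, concurrency=100):
    """Fetch pageviews for many articles with up to `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, sem, project, a, start, end) for a in articles]
        return dict(await asyncio.gather(*tasks))


# Example:
# totals = asyncio.run(fetch_all("en.wikipedia.org", ["Asthma", "Measles"], "20190101", "20190331"))
```

Even if per-request latency stays the same, running requests concurrently instead of one by one should cut the wall-clock time of the pageviews step roughly by the concurrency factor.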

b) people get an illuminating error message?

What I was saying above: the auto-killing of jobs shouldn't happen, and instead we change the state to "timed out". I'll create a ticket here soon. Implementation will be a little bit involved.

Tracked at T220423: Show "timed out" error to the user when an event update has been running for over an hour