Page MenuHomePhabricator

[Timebox: 4h] Investigation: Updating data on a large (real) events causes endless "crunching"
Closed, ResolvedPublicBUG REPORT

Description

I tried to Update two large but actual events . In one case, the event timed out the first time but worked on a second try. In the other, the event showed the "crunching" message and animation for hours. It did this twice. The events don't seem that different Is there a bug causing the first one to neither time out nor succeed?

To reproduce:

AF 3.1, crunching for 2 hours

  • On test, go to the Program "AFtest" and open the event "AF 3.1"
  • (Size of event: Pages created, 1.9k, pages improved, 7.2k)
  • Click Update Data
  • Expected result: the metrics update
  • Actual result: After 2 hours 4 hours, the page is still "crunching the numbers," with the working animation showing

AF 4, timeout then success

  • On test, go to the Program "AFtest" and open the event "AF 4."
  • (Size of event: Pages created, 500, pages improved, 3.4k)
  • Click Update Data
  • Expected result: the metrics update
  • Actual result: on the first try, the event timeout after 5 minutes or so. On the second try, success.

Both of these are, as far as I can tell, real events from Art & Feminism (the organizer is not one of us). They are on the large side, with 769 and 561 editors, respectively. But the one that crunched endlessly appears to be not that different in principle from the other.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Mooeypoo renamed this task from Updating data on a large (real) events causes endless "crunching" to [Timebox: 4h] Investigation: Updating data on a large (real) events causes endless "crunching".Mar 27 2019, 11:00 PM
jmatazzoni changed the subtype of this task from "Task" to "Bug Report".Apr 4 2019, 10:23 PM

I went ahead and tested some other, similar events from this same program:

AF 2
(Size of event: Pages created, 871, pages improved, 5.7k)
Got the numbers!

AF 1
(Size of event: Pages created, 487, pages improved, 4k)
CRUNCHED FOR HOURS THEN FAILED

Retrying AF 4
(Size of event: Pages created, 498, pages improved, 3.4k)
CRUNCHED FOR HOURS THEN TIMED OUT

AF 1
(Size of event: Pages created, 487, pages improved, 4k)
CRUNCHED FOR HOURS THEN FAILED

@jmatazzoni Ran this morning between 07:30-07:45 UTC. Took a bit less than 10 mins.

Retrying AF 4
(Size of event: Pages created, 498, pages improved, 3.4k)
CRUNCHED FOR HOURS THEN TIMED OUT

Ran just after above event. Took about 4 mins.

Perhaps there is an issue of how much load the EM servers and/or replica DBs are under at the time. I don't know when peak usage is.

I noted in T219020#5073851 that:

On staging, I find that while waiting for [a large "pages improved" CSV/Wikitext] report I cannot do anything else on Event Metrics ([other] pages stop loading entirely). I don't know whether this is a sign of EM being overloaded or some sort of feature of wmflabs to prevent a particular user hogging all the resources.

I do not know if our VMs are reliable environments for performance testing.

This event, which is en.wiki category "Living people" for 10 minutes, is still crunching after well over an hour.

However, when I made a copy of that event and calculated the numbers it finished in less than 30 seconds. I have successfully calculated numbers for that event for up to 6 hours (so far), which only took about 5 minutes.

Again, performance figures don't seem to be all that reliable.

This event, which is en.wiki category "Living people" for 10 minutes, is still crunching after well over an hour.

However, when I made a copy of that event and calculated the numbers it finished in less than 30 seconds. I have successfully calculated numbers for that event for up to 6 hours (so far), which only took about 5 minutes.

Again, performance figures don't seem to be all that reliable.

Subsequent tries, whether copying the event or using the existing one that timed out, are more likely to be successful due to query caching on the replicas (not just our own query caching). I'm no DBA, but I think even if you extend the date range as you have done, so long as there is some overlap from previous queries, the relevant rows may still be in the MySQL buffer pool and hence fetches are much faster. From my experience, for testing purposes you'll need to give it some 10-15 minutes to let the MySQL-side caching expire naturally.

Retried some of these today. I was trying to see if i got an error message after an hour, but they all processed!

AF 3.1, started 4:47, completed, supposedly in 4 mins according to time stamp.

AF 1, started 4:47: completed, supposedly in 3 mins according to time stamp.

AF 4, started 8:18, processed by 8:20.

AF 2, started 8:18, processed by 8:20.

AF 3, started 8:48, processed in 3 mins.

@MaxSem, if the idea is that after one hour no event should still be "crunching," then we're not done here. I set this event to Update at 10:10 tonight:

https://eventmetrics-dev.wmflabs.org/programs/150/events/364

At 11:15 it's still crunching the numbers. Then, when I reloaded the page, it just went back to it's pre-update state. No metrics, no error message. Sooo, what next?

AF 3.1 now finishes in about 6 minutes, and AF 4 finished in about 5 minutes :)

https://eventmetrics-dev.wmflabs.org/programs/150/events/364 however timed out after about 10 mintues or so, due to issues we've been discussing at T219897.