
[4 hours] Event Metrics updates time out after 30 seconds
Closed, Resolved · Public · BUG REPORT

Description

I originally thought these timeouts were related to getting the background image from Commons.

However, it turns out the issue lies with our communication to the Pageviews API.

Error log elevated from a @MusikAnimal comment below:

[2019-04-02 16:33:32] event_processor.INFO: > Fetching pageviews for en.wikipedia... [] {"host":"eventmetrics.wmflabs.org","uri":"http://eventmetrics.wmflabs.org/events/process/364"}
[2019-04-02 16:35:38] php.CRITICAL: Fatal Error: Maximum execution time of 30 seconds exceeded {"exception":"[object] (Symfony\\Component\\Debug\\Exception\\FatalErrorException(code: 0): Error: Maximum execution time of 30 seconds exceeded at /var/www/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:538)"} {"host":"eventmetrics.wmflabs.org","uri":"http://eventmetrics.wmflabs.org/events/process/364"}
[2019-04-02 16:35:38] request.CRITICAL: Uncaught PHP Exception Symfony\Component\Debug\Exception\FatalErrorException: "Error: Maximum execution time of 30 seconds exceeded" at /var/www/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php line 538 {"exception":"[object] (Symfony\\Component\\Debug\\Exception\\FatalErrorException(code: 0): Error: Maximum execution time of 30 seconds exceeded at /var/www/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:538)"} {"host":"eventmetrics.wmflabs.org","uri":"http://eventmetrics.wmflabs.org/events/process/364"}

Event Timeline

We use Guzzle in PageviewsRepository too. Neither the Pageviews API nor the Commons API (where the background images come from) should ever be timing out, at least not after 30 seconds! It could also be that the stack trace misrepresents where the error actually happened; I'm going to safely assume the latter. We have many tools querying these APIs and I've never seen them time out. If it's not on our end, it's in the Pageviews API and/or MediaWiki. I'm close to 100% certain the Pageviews API will never time out.

Apart from today, on April 1 we got six different timeout errors, four of which were from SQL queries. That is expected on occasion, say for big events. At any rate, this is worthy of investigation.

Case-matching search fooled me again.

I agree that we should look into these timeouts.

[2019-04-02 16:33:32] event_processor.INFO: > Fetching pageviews for en.wikipedia... [] {"host":"eventmetrics.wmflabs.org","uri":"http://eventmetrics.wmflabs.org/events/process/364"}
[2019-04-02 16:35:38] php.CRITICAL: Fatal Error: Maximum execution time of 30 seconds exceeded {"exception":"[object] (Symfony\\Component\\Debug\\Exception\\FatalErrorException(code: 0): Error: Maximum execution time of 30 seconds exceeded at /var/www/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:538)"} {"host":"eventmetrics.wmflabs.org","uri":"http://eventmetrics.wmflabs.org/events/process/364"}
[2019-04-02 16:35:38] request.CRITICAL: Uncaught PHP Exception Symfony\Component\Debug\Exception\FatalErrorException: "Error: Maximum execution time of 30 seconds exceeded" at /var/www/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php line 538 {"exception":"[object] (Symfony\\Component\\Debug\\Exception\\FatalErrorException(code: 0): Error: Maximum execution time of 30 seconds exceeded at /var/www/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:538)"} {"host":"eventmetrics.wmflabs.org","uri":"http://eventmetrics.wmflabs.org/events/process/364"}

So apparently it is the pageviews! I trust the stack trace here. I don't understand it, though.

aezell renamed this task from [BUG] EventMetrics background image times out causing fatal error to [BUG] EventMetrics times out when talking to the Pageviews API. Apr 2 2019, 10:39 PM
aezell updated the task description.
aezell renamed this task from [BUG] EventMetrics times out when talking to the Pageviews API to [BUG] [4 hours] EventMetrics times out when talking to the Pageviews API. Apr 2 2019, 11:06 PM
aezell triaged this task as Medium priority.

We will spend ~4 hours or less investigating the issue. The possible outcomes of this spike are:

  • A new ticket describing the work to be done to avoid the timeouts
  • A new ticket describing the work to handle the Fatal Error so the entire job doesn't fail
  • A patch that fixes the problem (if the fix is easy)
jmatazzoni renamed this task from [BUG] [4 hours] EventMetrics times out when talking to the Pageviews API to [4 hours] EventMetrics times out when talking to the Pageviews API. Apr 2 2019, 11:58 PM
jmatazzoni moved this task from To be estimated to In Sprint on the Event Metrics board.
jmatazzoni added subscribers: MaxSem, Samwilson.

Hopefully the 4 hours doesn't include the time it will take to copy the event mentioned in T219835 to your local! 😛 I have created T219958: Add import/export functionality for events.

For whoever is working on this (might be myself), I think the 30-second timeout is the default for Guzzle. Why it's getting hung up, I've no idea, but from my experience it would be highly unusual for a single request to the Pageviews API to take anywhere near 30 seconds. The full round trip for the per-article endpoint, regardless of date range or article, is almost always under 200ms, though I have seen it get as high as 500ms. If you see any request taking over 1 second to respond, we should probably tell Analytics about it.
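For reference, a minimal sketch of what explicit Guzzle timeouts for a single per-article pageviews request could look like. The base URI is the public REST endpoint, but the timeout values, article, and date range are illustrative, not what Event Metrics actually configures:

<?php
// Illustrative sketch only: give the pageviews client explicit limits instead
// of relying on defaults. 'timeout' caps the whole request; 'connect_timeout'
// caps only the connection phase.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client([
    'base_uri'        => 'https://wikimedia.org/api/rest_v1/',
    'timeout'         => 5,   // abort any single pageviews request after 5 s
    'connect_timeout' => 2,   // fail fast if the API is unreachable
]);

// Per-article pageviews for an arbitrary article and date range.
$article  = 'PHP';
$response = $client->get(
    'metrics/pageviews/per-article/en.wikipedia/all-access/user/'
    . rawurlencode($article) . '/daily/20190101/20190331'
);
$data = json_decode((string) $response->getBody(), true);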

The answer here is really "what else did you expect from an event this big?" The event in question tracks the activity of 93 users over 3 months and has to crunch through tremendous amounts of data. Here's how it looks if you try to update it from command line:

www-data@eventmetrics-prod01:/var/www$ bin/console app:process-event 364

Event processor
===============

Processing event 364

==== Processing en.wikipedia ====
> Fetching page IDs and edit counts... Done (41965 ms)
>> Pages created: 186
>> Pages improved: 14037
>> Edits: 34411
> Fetching bytes changed... Done (64100 ms)
>> Bytes changed: 5284453
> Fetching files uploaded and global file usage... Done (1453 ms)
>> Files uploaded: 32
>> File usage: 29
>> Pages using files: 42
> Fetching user counts for en.wikipedia... Done (40 ms)
>> Participants: 93

Total edits: 34411
Total pages created: 186
Total pages improved: 14037
Total bytes added: 5284453
Total files uploaded: 32
Total files used: 29
Total pages using uploaded files: 42

> Fetching pageviews for en.wikipedia... Done (340595 ms)
>> Pageviews of pages created: 200711
>> Average daily pageviews of pages improved: 8608102
> Fetching pageviews of pages containing files uploaded...
In DBALException.php line 172:

  An exception occurred while executing 'SET STATEMENT max_statement_time = 900 FOR
  SELECT dbname, url FROM wiki WHERE (dbname = ?) OR (url LIKE ?) OR (url LIKE ?)'
  with params ["enwiki", "https:\/\/enwiki_p", "https:\/\/enwiki_p.org"]:

  Warning: PDOStatement::execute(): MySQL server has gone away

Conclusions, for discussion at the engineering meeting:

  • Have some heuristics to detect potentially slow events and always process them via cronjobs.
  • Have the DBAL automatically reconnect if the connection has timed out because retrieving pageviews took too long.
  • Currently, Guzzle doesn't have any timeout set for HTTP requests. Add timeouts to prevent jobs from hanging for a long time (pull request).

In T219897#5089655, @MaxSem wrote:

The answer here is really "what else did you expect from an event this big?" The event in question tracks the activity of 93 users over 3 months and has to crunch through tremendous amounts of data. Here's how it looks if you try to update it from command line:

I'm not ready to say pure size is the issue. Or, if it is, the question is what do we mean by size precisely? What is the factor that maxes the system out? Compare the following two events.

The event Max is looking into, which fails, has 93 users making 34,000 edits to 14,000 pages over 3 months. This other medical event has 5,500 participants making 53,800 edits to 1,200 pages over an entire year, yet it processes in under 5 minutes (that event is defined by a single category, RTT).

See the stats below. So, is that it? Are a lot of pages bad but a lot of users fine?

'First quarter WPMED edits' doesn't process

  • Pages created: 186
  • Pages improved: 14037
  • Edits: 34411
  • Bytes changed: 5284453
  • Files uploaded: 32
  • File usage: 29
  • Pages using files: 42
  • Fetching user counts for en.wikipedia... Done (40 ms)
  • Participants: 93

Event processes in under 5 mins

  • "Participants",5487
  • "Pages created",2
  • "Pages improved",1169
  • "Edits",53819
  • "Bytes changed",1945263
  • "Uploaded files",0
  • "Views to pages created",557164
  • "Avg. daily views to pages improved",1407012
  • "Unique pages with uploaded files",0
  • "Uploaded files in use",0
  • "Avg. daily views to uploaded files",0
  • "New editors",1735
  • "Retention after 7 days",278

In T219897#5091678, @jmatazzoni wrote:

...So, is that it? Are a lot of pages bad but a lot of users fine?

I did some further research on the event referenced above, First quarter WPMED edits. But since T219835 is devoted to that event particularly, I took the discussion there. See T219835#5091682

I won't make the upcoming engineering meeting, so to reply to Max's thoughts:

Have some heuristics to detect potentially slow events and always process them via cronjobs.

If you mean use the cron to process the event (à la app:process-event), as opposed to processing directly as with EventDataController.php#L201, I don't think it will make a difference. Going the latter route, it backgrounds the process, which is effectively the same as starting it with a cron. The client does not wait for a reply from the initial /events/process endpoint; it just continually pings for the job status.

Have the DBAL automatically reconnect if connection had timed out because retrieving pageviews took too long.

I brought this up in an engineering meeting a few months ago. This same problem exists for nearly all my tools, anything that has sequential, long-running queries. Here are our notes from that meeting:

Do we need this? https://github.com/facile-it/doctrine-mysql-come-back

I've toyed around with that library (doctrine-mysql-come-back), and it didn't seem to work as advertised. I've also tried using $connection->ping() where we grab the existing connection, and $connection->reconnect() as necessary, and that didn't do the trick either. The issue could also be due to the MySQL configuration on the replicas.
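For context, the general shape of that ping-and-reconnect attempt is roughly as follows. This is a sketch assuming Doctrine DBAL 2.x (where Connection::ping() exists); reconnect() is not a stock DBAL method, so it is expressed here as close() followed by connect(), and $conn stands in for whatever connection the repository already holds:

<?php
use Doctrine\DBAL\Connection;

// Sketch of the ping-then-reconnect pattern described above, not the code
// Event Metrics actually ships.
function ensureConnected(Connection $conn): void
{
    // ping() returns false once the server has dropped the idle connection
    // ("MySQL server has gone away"); close and reopen it in that case.
    if (!$conn->ping()) {
        $conn->close();
        $conn->connect();
    }
}

// ... long pause while pageviews are fetched over HTTP ...
ensureConnected($conn);
$rows = $conn->fetchAll(
    'SELECT dbname, url FROM wiki WHERE dbname = ?',
    ['enwiki']
);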

At any rate, I don't think it's related to pageviews (see below). In your app:process-event output, it errored when querying the replicas for the project URL and database name.

Currently, Guzzle doesn't have any timeout set for HTTP requests. Add timeouts to prevent jobs from hanging for a long time (pull request).

This seems like a useful (and well implemented!) solution for handling retries to the pageviews API, but I'm still not convinced that's our problem. Guzzle is only used for the Pageviews API, and from my experience this API rarely if ever "times out" (at least over 30 seconds, as our error reports are saying!). We don't set a timeout for PopularPagesBot, another PHP application using Guzzle for pageviews, and as far as I know it doesn't error out in the same way. Your PR will at least give us more information to go off of though, which is great.

See the stats below. So, is that it? Are a lot of pages bad but a lot of users fine?

For retrieving the information from the database, the number of potential revisions to go through is decisive, so both time range and number of users matter. For retrieving pageviews information, the speed depends on the number of pages affected (both created and edited).

If you mean use the cron to process the event (à la app:process-event), as opposed to processing directly as with EventDataController.php#L201, I don't think it will make a difference. Going the latter route, it backgrounds the process, which is effectively the same as starting it with a cron. The client does not wait for a reply from the initial /events/process endpoint; it just continually pings for the job status.

This apparently doesn't happen, since the request times out after 30 seconds. In any case, having long-running processes in Apache is an antipattern; that's why I want to change this.

At any rate, I don't think it's related to pageviews (see below). In your app:process-event output, it errored when querying the replicas for the project URL and database name.

It timed out because the connection was idle while waiting for pageviews to be processed. We can try making them the last thing we retrieve but then we'll still have to use the DB at least to store the results. Looks like we'll have to dig deeper into keeping the connections alive. In CommonsNotifier, I ended up just keeping the connection open in short bursts, closing and reopening multiple times. It's slower, but causes 0 timeouts now.
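A rough sketch of that short-burst pattern, with a hypothetical helper name and again assuming a plain DBAL connection:

<?php
use Doctrine\DBAL\Connection;

// Sketch of the "short bursts" approach described for CommonsNotifier: open the
// connection only around each batch of queries and close it right away, so it
// never sits idle while pageviews are being fetched. Names are illustrative.
function withConnection(Connection $conn, callable $work)
{
    $conn->connect();           // open (or reuse) the connection
    try {
        return $work($conn);    // run the query burst
    } finally {
        $conn->close();         // never leave it idle during slow API calls
    }
}

// $conn is assumed to be an existing DBAL connection.
$wikis = withConnection($conn, function (Connection $c) {
    return $c->fetchAll('SELECT dbname, url FROM wiki WHERE dbname = ?', ['enwiki']);
});
// ... slow pageviews retrieval happens here with no open DB connection ...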

from my experience this API rarely if ever "times out" (at least over 30 seconds, as our error reports are saying!)

Our report is a PHP timeout, not something related to Guzzle.

Oh, goodness gracious, yes it is a PHP timeout. I didn't even know that was a thing. The exceptions pointed to Guzzle, but this was only because that's where it was in the update process when it timed out. It all makes sense now :) So indeed, if we go by cron (where the time limit is apparently unlimited), we avoid this.

@MaxSem Does it matter if the session is closed by PHP, as with EventDataController.php#L198? I don't think we want to use [[ https://www.php.net/manual/en/function.set-time-limit.php | set_time_limit ]] from the beginning of the request, but would it be acceptable to set it to an hour, or what have you, once we know we're backgrounding the process? It wouldn't be tying up the client, after all.

Eh, as a quick fix it's probably OK.

Should we make a different task to make this run by cron jobs? Seems like we should, as this was a spike. I think that this cron job work plus the async pageview work are the two priority things we should work on right now.

Right. Since we made this a 4-hour spike, it seems like we should close this and open a new ticket about the Cron job. If you guys agree, can one of you who understands that better please write that?

It sounds like for this task, we could get away with a one-line change to use set_time_limit just after the update process is backgrounded. I don't really want to send everything to the cron (it only runs every 5 mins), and determining which events might be slow would mostly be a wild guess. There is some concern about piling on Apache processes, but this shouldn't be a big issue because the job queue system will only allow 5 concurrent updates anyway; anything more will be queued and picked up by the cron. In short, I think we'll be okay.
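A minimal sketch of that one-line change, assuming a point in the controller action where the response has already been sent and the update is running in the background; the placement is hypothetical, while set_time_limit() and session_write_close() are plain PHP:

<?php
// Hypothetical placement, right after the update has been backgrounded:
// the client is no longer waiting, so lifting PHP's 30-second limit only
// affects the background work.
session_write_close();   // release the session lock so other requests aren't blocked
set_time_limit(3600);    // allow the backgrounded update up to an hour
// ... continue processing the event ...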

We can create a new ticket for whatever we decide to do, but we don't really know for sure if our proposed solutions will work yet. I think this is one of those experiment-till-fixed kind of things.

At any rate, Pageviews aren't the problem, as we've uncovered. I'll retitle this task.

MusikAnimal renamed this task from [4 hours] EventMetrics times out when talking to the Pageviews API to [4 hours] Event Metrics updates time out after 30 seconds. Apr 8 2019, 10:40 PM

I tried https://eventmetrics-dev.wmflabs.org/programs/150/events/364 on staging, after https://github.com/wikimedia/eventmetrics/pull/284 was merged and deployed, and it timed out:

[2019-04-10 09:20:34] event_processor.INFO: Processing event 364  [] {"host":"eventmetrics-dev.wmflabs.org","uri":"http://eventmetrics-dev.wmflabs.org/events/process/364"}

...

[2019-04-10 09:28:58] event_processor.INFO: > Fetching pageviews of pages containing files uploaded... [] {"host":"eventmetrics-dev.wmflabs.org","uri":"http://eventmetrics-dev.wmflabs.org/events/process/364"}
[2019-04-10 09:28:58] php.WARNING: Warning: PDOStatement::execute(): MySQL server has gone away {"exception":"[object] (ErrorException(code: 0): Warning: PDOStatement::execute(): MySQL server has gone away at /var/www/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:117)"} {"host":"eventmetrics-dev.wmflabs.org","uri":"http://eventmetrics-dev.wmflabs.org/events/process/364"}
[2019-04-10 09:28:58] php.WARNING: Warning: PDOStatement::execute(): Error reading result set's header {"exception":"[object] (ErrorException(code: 0): Warning: PDOStatement::execute(): Error reading result set's header at /var/www/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:117)"} {"host":"eventmetrics-dev.wmflabs.org","uri":"http://eventmetrics-dev.wmflabs.org/events/process/364"}
[2019-04-10 09:28:58] request.CRITICAL: Uncaught PHP Exception Exception: "Job for event 364 timed out" at /var/www/src/AppBundle/Service/JobHandler.php line 187 {"exception":"[object] (Exception(code: 0): Job for event 364 timed out at /var/www/src/AppBundle/Service/JobHandler.php:187, Symfony\\Component\\HttpKernel\\Exception\\HttpException(code: 900): error-query-timeout at /var/www/src/AppBundle/Repository/Repository.php:425)"} {"host":"eventmetrics-dev.wmflabs.org","uri":"http://eventmetrics-dev.wmflabs.org/events/process/364"}

Same issue :(

jmatazzoni added a subscriber: dom_walden.

@dom_walden, I don't really know what we did in the end. And I'm not sure there's anything I can check—anything I can see. I'm bouncing this into your column. You can decide whether there is something to QA or whether to just Resolve. Thanks.

jmatazzoni moved this task from QA to Done on the Community-Tech-Sprint board.