
WebVideoTranscodeJob is keeping database connections open for several minutes on s4 master
Open, Low, Public

Description

Several times today transactions from videoscalers tmh100[12] have piled up on db1040. They apparently run this sort of query:

INSERT /* WebVideoTranscodeJob::run */ IGNORE INTO transcode (transcode_image_name, transcode_key, transcode_time_startwork, transcode_error, transcode_final_bitrate) VALUES ('Night_of_the_Living_Dead_(1968_film).webm', '160p.ogv', '20150430131138', '', '0')

Then they hang around blocking stuff without committing until Icinga alerts go off and they risk getting killed. Do we expect to hold locks for the duration of the transcoding? Is this a recent change? If not, why is this suddenly happening now?

EDIT: Transactions do not seem to stay open right now, but connections do, until they time out on the database.

Details

Related Gerrit Patches:
mediawiki/extensions/TimedMediaHandler (master): Shut down database connections during transcode shell-outs
mediawiki/core (master): Add LBFactory::closeAll() method

Event Timeline

Springle created this task.Apr 30 2015, 1:35 PM
Springle raised the priority of this task from to Needs Triage.
Springle updated the task description.
Springle added a project: Wikimedia-Rdbms.
Springle added a subscriber: Springle.
Restricted Application added a subscriber: Aklapper. Apr 30 2015, 1:35 PM
Springle triaged this task as High priority.May 5 2015, 1:28 PM

This is recurring.

Several long-running jobs on tmh100*, perhaps related:

www-data 23579  159  1.2 986544 202352 ?       RNl  13:21  36:06  |                       \_ /usr/bin/avconv -y -i /tmp/localcopy_bc66c506e711-1.webm -threads
www-data 18401  153  1.2 1000048 210064 ?      RNl  13:05  59:49  |                       \_ /usr/bin/avconv -y -i /tmp/localcopy_1b094e5bdce1-1.webm -threads
www-data 18366 78.7  0.5 219024 92480 ?        RN   13:05  30:58  |                       \_ /usr/bin/ffmpeg2theora /tmp/localcopy_0248b2397fe5-1.webm -V 512 -
www-data 18252  150  1.4 1041916 241184 ?      RNl  13:04  60:42  |                       \_ /usr/bin/avconv -y -i /tmp/localcopy_90771f9f5942-1.webm -threads
www-data 29231 94.1  0.5 210448 86572 ?        RN   13:40   4:02                          \_ /usr/bin/ffmpeg2theora /tmp/localcopy_b9fb9c780cd6-1.webm -V 160 -

This is recurring.

Seven weeks later: Still occurring? Still high priority? Any news?

Ten weeks later: Still occurring? Still high priority? Any news?

Three months later: Anyone who could check on mw1040 if this is still occurring? Still "high priority"?

Five months later: Anyone who could check on mw1040 if this is still occurring? Still "high priority"?

Reedy added a subscriber: Reedy. Edited Nov 14 2015, 5:18 PM

Five months later: Anyone who could check on mw1040 if this is still occurring? Still "high priority"?

I presume you mean db1040? :)

There's no sign of any long-running queries like that on db1040, but that's not to say the bug is fixed, etc.

Reedy set Security to None.

I would say this would merit investigation by people working with media: https://logstash.wikimedia.org/#dashboard/temp/AVEPl-AqptxhN1XagGNY

hashar added a subscriber: hashar.Feb 19 2016, 10:46 AM

Related is T127428. It seems the job runner RPC service wraps the whole transcoding session in a database transaction, so while the transcoding happens there is an ongoing idle transaction on the database.

We worked around the third-party connection issues by tweaking the connection pool handling. But the long-running transactions are still ongoing until they time out (default: 3600 seconds).

Note: no ongoing queries this time, only idle connections (open transactions?).
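For illustration, a rough sketch of the shape of that problem, assuming the MediaWiki environment; the variable names are hypothetical and this is not the actual job runner or WebVideoTranscodeJob code:

// Hypothetical sketch: one transaction spans the whole transcode.
$dbw = wfGetDB( DB_MASTER );
$dbw->begin( __METHOD__ );

// Mark the transcode as started on the s4 master.
$dbw->insert( 'transcode', [
	'transcode_image_name'     => $fileName,
	'transcode_key'            => $transcodeKey,
	'transcode_time_startwork' => wfTimestampNow(),
], __METHOD__, [ 'IGNORE' ] );

// Minutes to hours of avconv / ffmpeg2theora, during which the
// transaction stays open and the master connection sits idle.
wfShellExec( $transcodeCommand );

$dbw->commit( __METHOD__ );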

brion added a comment.May 1 2016, 5:11 PM

Two questions:

First, is the current code closing out the transactions correctly?

Second, what are we supposed to do, as a general rule, during long-running processes that do not touch the database? Can we close out the connections and reopen them later?

brion added a comment.May 1 2016, 5:24 PM

Couple notes:

https://commons.wikimedia.org/wiki/Special:TimedMediaHandler shows what TimedMediaHandler thinks is currently running. There are a lot of new files, including very long speech / talk / conference videos, which take forever to transcode. This is backing up the queue right now by over an hour.

However, the queues being full ought not to result in unreasonable load. If it does, we may have to tune how many processes run and how many servers are available.

We also recently added two new low-resolution WebM transcode targets, but there are no automatic jobs re-running them on old files, so that should not cause a major backup by itself.

Can we close out the connections and reopen them later?

No "can", must. For several reasons:

  • A long idle open connection takes resources (mostly memory) from other connections that may be doing something useful. It is particularly harmful with the thread pool we have deployed on our servers
  • Under certain cases, it can lead to max connection exhaustion
  • The server will automatically close long running connections anyway
  • The application must be prepared for connections to fail at any time (e.g. on failover)
  • The application must reread its database configuration regularly (e.g. if a server is depooled)

Not closing idle connections would mean committing to a server being up for hours. Individual servers being available is not a guarantee; only the service is guaranteed.

In the MySQL model, opening new connections is not costly. Do not do it on every single query, but unless very fast queries are going to be sent one after another, connections should be closed as soon as MySQL is no longer needed (even before the request is served). That increases the amount of work MySQL can do in parallel.

If at some point connections become expensive (e.g. with TLS deployment), that should be made transparent to the application by putting a higher-level proxy with persistent connections in front. The application should not try to optimize that at all.
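For illustration, a hedged sketch of the pattern described above, assuming the MediaWiki load balancer accessors of the time (wfGetDB(), wfGetLB()) and hypothetical variable names; this is roughly the direction the TimedMediaHandler patch further down takes:

// 1. Finish the pending database write (and make sure it is committed).
$dbw = wfGetDB( DB_MASTER );
$dbw->update( 'transcode',
	[ 'transcode_time_startwork' => wfTimestampNow() ],
	[ 'transcode_image_name' => $fileName, 'transcode_key' => $transcodeKey ],
	__METHOD__
);

// 2. Nothing more to do on the database for a long while: close the
// handles so MySQL gets its thread-pool slot and memory back.
wfGetLB()->closeAll();

// 3. Hours of transcoding with no connection held open at all.
wfShellExec( $transcodeCommand );

// 4. Asking for a handle again opens a fresh connection on demand.
$dbw = wfGetDB( DB_MASTER );
$dbw->update( 'transcode',
	[ 'transcode_error' => '', 'transcode_final_bitrate' => $finalBitrate ],
	[ 'transcode_image_name' => $fileName, 'transcode_key' => $transcodeKey ],
	__METHOD__
);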

As far as I can see, no ongoing transactions (open transactions), only open connections:

| 19982549 | wikiuser        | 10.64.16.146:46469 | commonswiki        | Sleep       |    1166 |
...
History list length 256
...
---TRANSACTION 6084148123, not started
MySQL thread id 19982549, OS thread handle 0x7faddbc83700, query id 1263078851 10.64.16.146 wikiuser cleaning up

So the bug was partially fixed (transactions seem to finish, but the connections do not disconnect).

brion added a comment.May 1 2016, 10:18 PM

Ok so follow-up: "in MediaWiki, how do we close out all DB connections in a reliable way that allows them to open again later?" I've never had this particular use case, so I will probably have to dive into the LB classes. :)

brion added a comment.May 1 2016, 10:22 PM

Found LoadBalancer::closeAll() which looks handy. When I'm back on the ground I'll test to confirm whether the connections reopen on demand.
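A minimal sketch of the behaviour to confirm, assuming the LoadBalancer::closeAll() mentioned above (the command variable is hypothetical); reconnection happens lazily, on the next request for a handle:

wfGetLB()->closeAll();        // drop every open handle on this wiki's load balancer
wfShellExec( $longCommand );  // long, DB-free work
$dbr = wfGetDB( DB_SLAVE );   // a fresh connection is opened on demand here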

Change 286381 had a related patch set uploaded (by Brion VIBBER):
Shut down database connections during transcode shell-outs

https://gerrit.wikimedia.org/r/286381

Joe added a subscriber: Joe.May 2 2016, 9:20 AM

So looking on the scalers, what I found out is what follows:

  • there are several jobs running in HHVM, more than the number that should be enqueued by the jobrunner process (which would be 5)
  • several jobs fail to be transcoded within the 6-hour limit we impose

Some more data:

So it seems like POSTs get responded to before the jobs are done, and that the number of running jobs can thus greatly exceed the configured number.

hashar removed a subscriber: hashar.May 2 2016, 9:39 AM
brion added a comment.May 2 2016, 11:27 AM

maxtime=60 seems like a problem. What enforces that?

The application must reread its database configuration regularly (e.g. if a server is depooled)

So for that:

Shut down database connections during transcode shell-outs
https://gerrit.wikimedia.org/r/286381

To be able to reconnect afterwards, would it need to make a new "request"?

jcrespo renamed this task from transcode transaction blocking for minutes on s4 master to transcode connections open for minutes on s4 master.May 3 2016, 6:59 AM
jcrespo updated the task description.

Change 290478 had a related patch set uploaded (by Brion VIBBER):
Add LBFactory::closeAll() method

https://gerrit.wikimedia.org/r/290478

Change 290478 merged by Aaron Schulz:
Add LBFactory::closeAll() method

https://gerrit.wikimedia.org/r/290478
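For reference, the added method plausibly just delegates to each load balancer's existing closeAll(); a sketch under that assumption (see the Gerrit change above for the actual code):

/**
 * Close all open database connections on all open load balancers.
 * (Sketch; the merged change may differ in detail.)
 */
public function closeAll() {
	$this->forEachLB( function ( $lb ) {
		$lb->closeAll();
	} );
}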

TheDJ added a subscriber: TheDJ.Oct 31 2016, 9:06 PM

So do we know if this is any better now? Can this be closed?

As far as I can tell, @TheDJ, only support for closeAll() has been added; now it has to be used by the client code, I assume?

This should reduce the number of long-running connections, mainly to the s4 master; purge lag and processlist:
https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=db1040&from=now-24h&to=now

Sorry, I can only say thank you for resurrecting this! @brion

brion added a comment.Nov 2 2016, 5:23 PM

No need for sorries! The patch resurrection is upon us, let us rejoice and merge. :D

brion added a comment.Nov 2 2016, 5:24 PM

[Note I had to make a slight update for backend changes that altered how one gets to the LBFactory singleton. Confirmed that it works in the Vagrant setup, at least in isolation.]

Change 286381 merged by jenkins-bot:
Shut down database connections during transcode shell-outs

https://gerrit.wikimedia.org/r/286381

I do not know if it is fixed; right now we just kill sleeping connections after 60 seconds.

jcrespo closed this task as Resolved.May 25 2017, 2:28 PM
jcrespo claimed this task.
jcrespo reassigned this task from jcrespo to brion.
jcrespo reopened this task as Open.May 25 2017, 3:10 PM
jcrespo lowered the priority of this task from High to Low.

I checked and this still happens (connections being idle for a long time). Not 100% sure, but I would suspect the same initial reason. It is not impacting production because we kill connections, as I said above.

jcrespo removed brion as the assignee of this task.May 25 2017, 3:10 PM
Krinkle renamed this task from transcode connections open for minutes on s4 master to WebVideoTranscodeJob is keeping database connections open for several minutes on s4 master.Jul 29 2018, 1:42 AM
Krinkle moved this task from Untriaged to Usage problem on the Wikimedia-Rdbms board.
Krinkle removed a project: MW-1.28-release-notes.