Page MenuHomePhabricator

Some API action=upload & filekey requests take >30 seconds to complete
Closed, ResolvedPublic

Description

Screenshot of the error message

Twice I received this error message. Please see attached .jpg screenshot. The message reports failed upload, but the upload is successful.


Version: unspecified
Severity: major

Attached:

Incorrect_upload_wizard_message.jpg (414×864 px, 25 KB)

Details

Reference
bz34785

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 12:10 AM
bzimport set Reference to bz34785.

Including actual error message in bug title ( UploadStashFileNotFoundException ) for easier searching and for people too lazy to look at image ;)

resetting milestone till we figure out that this is more dependably replicable

Roan has been working on this one.

Just completed the upload of 19 greyscale images, total size 7.1MB and two reported the same bug, although the images were in fact uploaded and not rejected. Unfortunately I can't provide any technical data.

Just encountered it, upping severity:

Request

POST /w/api.php HTTP/1.1
User-Agent: Opera/9.80 (Windows NT 6.1; U; ru) Presto/2.10.229 Version/11.61
Host: commons.wikimedia.org
Accept-Language: ru,en;q=0.9
Accept-Encoding: gzip, deflate
Referer: https://commons.wikimedia.org/wiki/Special:UploadWizard
Cookie: *
Connection: Keep-Alive
Content-Length: 955
Content-Type: application/x-www-form-urlencoded
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
Content-Transfer-Encoding: binary

POST variables

action upload
format json
filekey 10firwjkg56g.caqj15.20471.jpg
filename Vladislav_Kovalyov_execution_notice.jpg
text =={{int:filedesc}}== {{Information |description={{en|1=Notice sent to the mother of the excuted 2011 Minsk metro bombing perpetrator Vladislav Kovalyov}} |date=2012-03-16 |source=http://charter97.org/photos/20120317_prigovor.jpg |author=Belarussian Supreme Court |permission= |other_versions= |other_fields= }} =={{int:license-header}}== {{subst:Custom license marker added by UW}} {{PD-BY-exempt}} [[Category:Uploaded with UploadWizard]] [[Category:Death penalty]] [[Category:2012 in Belarus]] [[Category:2011 Minsk Metro bombing]]
summary User created page with UploadWizard (alpha)
token 9bf46413397017ca89f840746cf1dcc2+\

Response

HTTP/1.1 200 OK
Server: nginx/0.7.65
Date: Sat, 17 Mar 2012 10:48:26 GMT
Content-Type: application/json; charset=utf-8
Connection: keep-alive
X-Content-Type-Options: nosniff
Cache-Control: private
MediaWiki-API-Error: internal_api_error_UploadStashFileNotFoundException
Content-Encoding: gzip
Vary: Accept-Encoding
X-Vary-Options: Accept-Encoding;list-contains=gzip
Content-Length: 452
X-Cache: MISS from sq62.wikimedia.org
X-Cache-Lookup: MISS from sq62.wikimedia.org:3128
X-Cache: MISS from amssq38.esams.wikimedia.org
X-Cache-Lookup: MISS from amssq38.esams.wikimedia.org:3128
X-Cache: MISS from amssq33.esams.wikimedia.org
X-Cache-Lookup: MISS from amssq33.esams.wikimedia.org:80
Via: 1.1 sq62.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq38.esams.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq33.esams.wikimedia.org:80 (squid/2.7.STABLE9)

Body

{"servedby":"srv195","error":{"code":"internal_api_error_UploadStashFileNotFoundException","info":"Exception Caught: key '10firwjkg56g.caqj15.20471.jpg' not found in stash","*":"\n\n#0 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/upload\/UploadStash.php(135): UploadStash->getFile('10firwjkg56g.ca...')\n#1 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/upload\/UploadFromStash.php(79): UploadStash->getMetadata('10firwjkg56g.ca...')\n#2 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiUpload.php(308): UploadFromStash->initialize('10firwjkg56g.ca...', 'Vladislav_Koval...')\n#3 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiUpload.php(64): ApiUpload->selectUploadModule()\n#4 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiMain.php(706): ApiUpload->execute()\n#5 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiMain.php(360): ApiMain->executeAction()\n#6 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiMain.php(344): ApiMain->executeActionWithErrorHandling()\n#7 \/usr\/local\/apache\/common-local\/php-1.19\/api.php(117): ApiMain->execute()\n#8 \/usr\/local\/apache\/common-local\/live-1.5\/api.php(3): require('\/usr\/local\/apac...')\n#9 {main}\n\n"}}

Here's an example of this bug from user-testing:

https://accounts.usertesting.com/ViewVideo.aspx?file=t0uwDTLDpas%3d

Firefox 12 user. File first fails with "unknown error", then on retry with the stash error. What's noticeable is that the final step takes a long time before it errors out. These are just image files, albeit large ones.

This is definitely a high priority bug that needs some deeper investigation, probably by Aaron as well since it might be FileBackend/SWIFT related. I'll poke at it as well to see if I can produce it at all in local testing.

I should note that the above user actually successfully uploaded the balloon picture which reported the error. (I deleted the file, so you can no longer see it.) The timing was the same as the other uploads. So UW apparently correctly uploaded the file but reported an API error.

Here's what appears to be happening. The MediaWiki API JavaScript convenience functions in core (resources/mediawiki.api/mediawiki.api.js) have a default timeout of 30 seconds. This causes uploads to report an error if the final step runs longer than 30 seconds. The uploads should be successfully going through, regardless, which is why the "Retry" shows a stash error (at that point the file has been successfully removed from the stash).

We can "fix" this by overriding the default timeout, but we should determine why the action=upload for a file (even one of just a few hundred KB) in the stash takes upwards of 30 seconds. This is a severe performance regression that seems to have started around the 1.19 deployment.

We deployed the timeout increase; please report if you still see this problem in production. We're further investigating the reasons for the slow API response. site_stats contention issues are currently a prime suspect.

Sorry, the previous patch was broken and actually had no effect on the timeout. This one should do the trick once deployed: https://gerrit.wikimedia.org/r/#/c/3808/

Deployed. Confirmed that the timeout issue is fixed and users should no longer see the error message in the last step.

However, we still need to figure out why the last step is taking so long.

Updating bug description. This is not inherently an UploadWizard issue, although UploadWizard may exacerbate contention issues because it does batch uploading.

Latest status on this: we're currently testing a change that should fix this problem. Aaron has enabled this on mediawiki.org, and we'll likely make this change on commons tomorrow.

If our theory is correct, the problem is caused when page editing or image uploading requires an update to the site statistics. The code that fixes this is deployed, but needs to be configured.

Tech minutia follows:

The fix we're relying on is here: https://gerrit.wikimedia.org/r/#/c/7724

The configuration change needs to be made, changing $wgSiteStatsAsyncFactor.

You can see which wikis this has been deployed to by looking for wgSiteStatsAsyncFactor here:
http://wikitech.wikimedia.org/view/Server_admin_log

$wgSiteStatsAsyncFactor is now set to 1 for Wikimedia Commons.

I'm no longer able to get 15-30 second requests, although I did catch a 12 second call for a 400K filekey upload that was part of a large batch (most of which was processed very quickly). So it's possible this issue is not (fully) resolved yet.

Let's keep the bug open for now -- if you see the issue in the wild (manifests as last step of Upload Wizard uploads taking a very long time), please report it here.

Asher and I our still tracking down additional problems. I've made https://gerrit.wikimedia.org/r/8006 to add more profiling. The problem is tracked down to LocalFile::upload.

(In reply to comment #18)

Asher and I our still tracking down additional problems. I've made
https://gerrit.wikimedia.org/r/8006 to add more profiling. The problem is
tracked down to LocalFile::upload.

For upload profiling: The edit portions tends to be around <=1 sec at all times. p90 for purging from swift goes up to 11 sec for much of the day. As a workaround, perhaps the thumbnails purging could be skipped for brand-new uploads, since there should be nothing to purge.

The problem (same as bug 34717), is that the Swift container DBs have extremely poor performance and while likely need to be mapped to SSD devices instead of HDD.
SELECT queries go fast when the sqlite3 db file is paged in memory, but slow otherwise (much of the time). The slowness just comes from the listings, the actual concurrent DELETEs are very fast (<1 second even for p90).

(In reply to comment #19)

The problem (same as bug 34717), is that the Swift container DBs have extremely
poor performance and while likely need to be mapped to SSD devices instead of
HDD.

*will likely*

They're using SQLite with no other options? That's stupid cause even if you put the DB on the fastest drive on the Earth, it will still suck because SQLite had never been intended for any serious level of concurrency, and it's not a bug but a design decision.

Per the LocalFile graphs at https://graphite.wikimedia.org, this was fixed for brand new uploads. Re-uploads will still have potential slowness.

(In reply to comment #21)

They're using SQLite with no other options? That's stupid cause even if you put
the DB on the fastest drive on the Earth, it will still suck because SQLite had
never been intended for any serious level of concurrency, and it's not a bug
but a design decision.

The concurrency limitations are probably dealt with by the fact that we shard the containers for the large wikis. This also mitigates the B-tree index update slowness (proportional to the no. of items in the container object list table).

Since the container service and object service are running on the same boxes (and the later gets used way more and on way more data), no DB files can stay in the page cache. The random seeks can't be well handled by SATA drives.

Asher also floating trying some hacks with http://www.kernel.org/doc/man-pages/online/pages/man2/posix_fadvise.2.html for the sake of investigation.

Confirmed that new file uploads are now zippy, yay.

Closing this bug as fixed; additional improvements should be discussed in separate bugs.

Gilles raised the priority of this task from High to Unbreak Now!.Dec 4 2014, 10:21 AM
Gilles moved this task from Untriaged to Done on the Multimedia board.
Gilles lowered the priority of this task from Unbreak Now! to High.Dec 4 2014, 11:22 AM

No offense meant, but this is a 2012 report. I am happy that it's finally closed

It was actually closed in 2012, these recent updates to the ticket I've made are just housekeeping related to the migration to phabricator.