Page MenuHomePhabricator

Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad""
Closed, ResolvedPublic

Description

Hello,

I'm unable to server-side upload some images, see for instance

urbanecm@mwmaint1002:~$ mwscript importImages.php --wiki=commonswiki --user=Meisam upload/
Import Images

Importing Hurtigruten.06.01.1920x1080.NRK2.webm...done.
Importing Hurtigruten.05.15.1920x1080.NRK2.webm...failed. (An unknown error occurred in storage backend "local-swift-eqiad".)
Importing Hurtigruten.05.13.1920x1080.NRK2.webm...failed. (An unknown error occurred in storage backend "local-swift-eqiad".)
Importing Hurtigruten.05.12.1920x1080.NRK2.webm...done.
Importing Hurtigruten.06.02.1920x1080.NRK2.webm...failed. (An unknown error occurred in storage backend "local-swift-eqiad".)
Importing Hurtigruten.05.11.1920x1080.NRK2.webm...failed. (An unknown error occurred in storage backend "local-swift-eqiad".)
Importing Hurtigruten.05.14.1920x1080.NRK2.webm...failed. (An unknown error occurred in storage backend "local-swift-eqiad".)
Importing Hurtigruten.05.16.1920x1080.NRK2.webm...done.

Found: 8
Added: 3
Failed: 5
urbanecm@mwmaint1002:~$

(this is from Jun 20, however, I try it again from time to time and the error is still reproducable)

If you want to try yourself, the files are on mwmaint1002:/home/urbanecm/T223052-upload and mwmaint1002:/home/urbanecm/T223052-upload2 now.

One of the files I'm trying to upload is 3.9G, which should be possible to upload (IIRC, the limit is 2^32 bytes).

[urbanecm@mwmaint1002 T223052-upload2]$ ls -lh
total 3.9G
-rw-r--r-- 1 urbanecm wikidev 3.9G Jun 25 20:33 Hurtigruten.05.11.1920x1080.NRK2.webm
[urbanecm@mwmaint1002 T223052-upload2]$

Tagged T223052 with SRE and SRE-swift-storage, but since I got no response (even nothing like "acked") from an opsen on duty, I'm creating a new task, so it is clearly visible in the dashboards etc :).

Thanks for your help.

Martin Urbanec

Event Timeline

Tagging @Reedy, who had problems with this too in T226845.

[urbanecm@mwmaint1002 T223052-upload2]$ ls
Hurtigruten.05.11.1920x1080.NRK2.webm
[urbanecm@mwmaint1002 T223052-upload2]$ mwscript importImages.php --wiki=commonswiki --user=Meisam .
Import Images

Importing Hurtigruten.05.11.1920x1080.NRK2.webm...failed. (An unknown error occurred in storage backend "local-swift-eqiad".)

Found: 1
Failed: 1
[urbanecm@mwmaint1002 T223052-upload2]$ ls -lh
total 3.9G
-rw-r--r-- 1 urbanecm wikidev 3.9G Jun 25 20:33 Hurtigruten.05.11.1920x1080.NRK2.webm
[urbanecm@mwmaint1002 T223052-upload2]$

Just tried again, out of curiosity.

Aklapper renamed this task from Not possible to server-side upload certain images to Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"".Jun 30 2019, 9:08 PM
jbond triaged this task as Medium priority.Jul 1 2019, 10:35 AM

I took a look at this, and it seems we're hitting a curl timeout on the mw side ? I'm not sure how long is the (default?) timeout, swift thinks the client disconnected before the operation was completed (status code 499)

Jun 30 19:57:39 ms-be1046 object-server: 10.64.32.221 - - [30/Jun/2019:19:57:39 +0000] "PUT /sdi1/49471/AUTH_mw/wikipedia-commons-local-public.30/3/30/Hurtigruten.05.11.1920x1080.NRK2.webm" 499 - "PUT http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.30/3/30/Hurtigruten.05.11.1920x1080.NRK2.webm" "tx18c227e0922040c49af4c-005d191302" "proxy-server 8839" 305.1190 "-" 2166 0
Jun 30 19:57:39 ms-be1027 object-server: 10.64.32.221 - - [30/Jun/2019:19:57:39 +0000] "PUT /sdh1/49471/AUTH_mw/wikipedia-commons-local-public.30/3/30/Hurtigruten.05.11.1920x1080.NRK2.webm" 499 - "PUT http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.30/3/30/Hurtigruten.05.11.1920x1080.NRK2.webm" "tx18c227e0922040c49af4c-005d191302" "proxy-server 8839" 305.1753 "-" 1808 0
Jun 30 19:57:39 ms-be1022 object-server: 10.64.32.221 - - [30/Jun/2019:19:57:39 +0000] "PUT /sdl1/49471/AUTH_mw/wikipedia-commons-local-public.30/3/30/Hurtigruten.05.11.1920x1080.NRK2.webm" 499 - "PUT http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.30/3/30/Hurtigruten.05.11.1920x1080.NRK2.webm" "tx18c227e0922040c49af4c-005d191302" "proxy-server 8839" 305.7051 "-" 1796 0
Jun 30 19:57:39 ms-fe1008 proxy-server: 10.64.16.77 10.64.32.221 30/Jun/2019/19/57/39 PUT /v1/AUTH_mw/wikipedia-commons-local-public.30/3/30/Hurtigruten.05.11.1920x1080.NRK2.webm HTTP/1.0 499 - wikimedia/multi-http-client%20v1.0 AUTH_tkc4732bce3... 2860122112 89 fe6dcea4261da5ac78c912ebef159e98 tx18c227e0922040c49af4c-005d191302 - 305.0647 - - 1561924354.261385918 1561924659.326113939 0

from logstash (ignoring the fact that the message couldn't actually be indexed for unrelated reasons) (curl error: 28) Timeout was reached]

[2019-06-30T19:57:34,863][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2019.06.30", :_type=>"mediawiki", :_routing=>nil}, 2019-06-30T19:57:34.000Z mwmaint1002 HTTP 100 (Continue) in 'SwiftFileBackend::doStoreInternal' (given '{"async":false,"src":"./Hurtigruten.05.11.1920x1080.NRK2.webm","dst":"mwstore://local-swift-eqiad/local-public/3/30/Hurtigruten.05.11.1920x1080.NRK2.webm","overwrite":true,"headers":{"x-content-dimensions":""},"dstExists":false}'): (curl error: 28) Timeout was reached], :response=>{"index"=>{"_index"=>"logstash-2019.06.30", "_type"=>"mediawiki", "_id"=>"AWup9tYHABdYYWJYyYdA", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"[err] is defined as a field in mapping [mediawiki] but this name is already used for an object in other types"}}}}
reedy@deploy1001:~$ mwscript eval.php commonswiki
> var_dump( $wgHTTPTimeout, $wgHTTPImportTimeout, $wgAsyncHTTPTimeout, $wgHTTPConnectTimeout, $wgCopyUploadTimeout );
int(25)
int(50)
int(25)
float(5)
int(180)

I don't think it's any of those...

I do see in protected $timeout = 'default'; in MWHttpRequest, which is later used as $this->curlOptions[CURLOPT_TIMEOUT] = $this->timeout; in CurlHttpRequest....

No idea offhand what value "default" defaults to... but I guess we should find out, and see if we can increase it for importImages.php to something longer

Filing a subtask for it

Thanks @Reedy ! I'd be interested too in finding out what the timeout is, it is also possible the default timeout is reasonable and swift is taking a long time too. Maybe trying an import again with debug logging might shed some light between start/end of the upload requests ?

I honestly don't know how long it takes to upload a 4GB file or so, I've not paid enough attention. Easy enough to test though

Worth doing time to get a ballpark, as I don't think we have anything explicit we can do any other real debugging (no verbose mode on the script), so it's a mixture of that and possibly mw error logs as you surfaced before

Well, this is fun.. :) Closed T226845 as done now

reedy@mwmaint1002:/tmp/uploads$ date && time mwscript importImages.php --wiki=commonswiki --user=C.Suthorn --overwrite /tmp/uploads && date
Mon Jul  1 12:41:31 UTC 2019
Import Images

Politparade.webm exists, overwriting...done.

Found: 1
Overwritten: 1

real	6m35.303s
user	1m31.584s
sys	0m15.984s
Mon Jul  1 12:48:06 UTC 2019

Is there any way to check if swift was just "busier" around the times the uploads were done? To see whether that puts some weight to your suggestion that it might not be MW/curl timeout, but Swift itself causing issues. I dunno if we can work out easily which host an upload was routed to check that specifically

Tried to measure how long it takes to upload a 3 GB file. Well, importImages.php took ~5.5 mins and ended with (An unknown error occurred in storage backend "local-swift-eqiad"..

[urbanecm@mwmaint1002 T226953]$ ls -lh
total 3.0G
-rw-r--r-- 1 urbanecm wikidev 3.0G Jun 30 09:50 rally_Velenje_2019_-_Slovenija.webm
-rw-r--r-- 1 urbanecm wikidev  520 Jun 30 10:21 rally_Velenje_2019_-_Slovenija.webm.txt
[urbanecm@mwmaint1002 T226953]$ date && time mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Sporti . && date
Mon Jul  1 13:43:18 UTC 2019
Import Images

Importing rally_Velenje_2019_-_Slovenija.webm...failed. (An unknown error occurred in storage backend "local-swift-eqiad".)

Found: 1
Failed: 1

real    5m35.377s
user    0m33.536s
sys     0m5.320s
Mon Jul  1 13:48:53 UTC 2019
[urbanecm@mwmaint1002 T226953]$

Is there any way to check if swift was just "busier" around the times the uploads were done? To see whether that puts some weight to your suggestion that it might not be MW/curl timeout, but Swift itself causing issues. I dunno if we can work out easily which host an upload was routed to check that specifically

Maybe https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1 can help?

Is there any way to check if swift was just "busier" around the times the uploads were done? To see whether that puts some weight to your suggestion that it might not be MW/curl timeout, but Swift itself causing issues. I dunno if we can work out easily which host an upload was routed to check that specifically

Maybe https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1 can help?

Yes that'll show swift traffic, for a more "service oriented" swift dashboard please take a look at https://grafana.wikimedia.org/d/000000584/swift-4gs which shows relevant latencies. There will be spikes up that correspond to rebalances when putting back ms-be1033 in service in T223518: ms-be1033 not powering up

@fgiunchedi Hi, is Importing Radomlje_-_Nafta_1903_0-0_(2._polčas).webm...failed. (An unknown error occurred in storage backend "local-swift-eqiad".) timeout as well, or something else? I've temporarily raised timeout from 300 to 3000 seconds on mwmaint1002 to be able to do T227968, but I still got this unknown error.

FYI, whole importImages.php took 3 mins and 17 secs, which is below the timeout (even the before-change timeout, which is 300 seconds).

[urbanecm@mwmaint1002 ~/T227968]$ time mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Sporti .
Importing Files

Importing Radomlje_-_Nafta_1903_0-0_(2._polčas).webm...failed. (An unknown error occurred in storage backend "local-swift-eqiad".)

Found: 1
Failed: 1

real    3m17.017s
user    0m14.808s
sys     0m1.424s
[urbanecm@mwmaint1002 ~/T227968]$

@Urbanecm I'm sorry I won't have time to dig into this anytime soon, although T228086: Swift TCP retransmits increase might have a clue

The issue that was present all the time was resolved, the last file I wasn't able to upload suceeded this time. Let's close this for now, thank you everyone.