
Increase in failures caused by Saucelabs
Closed, Declined · Public

Description

Since Dec 4th, I'm seeing an increase in errors that seem to be caused by Saucelabs. Tests were fine before that date, or at least the failures were plausible. But recently more and more errors like "Sauce could not start your job." have started popping up. Strangely, there are days when this error does not happen, while on other days it happens several times per build.
E.g. December 10th seemed to be a bad day for Saucelabs.

Looks like the tests are failing randomly :(

| Build | Test | Error | Sauce Labs | Sauce Labs Error |
| --- | --- | --- | --- | --- |
| 258 | Edit sitelinks.Remove multiple sitelinks | Net::ReadTimeout | http://saucelabs.com/jobs/a31d28114680489485963f5d27bea5ae | Internal Server Error |
| 257 | Using url properties in statements.Check UI for invalid values (outline example : missing.http.org ) | Selenium::WebDriver::Error::UnknownError | http://saucelabs.com/jobs/a0692d83dcd848fc890217f3785f61ce https://saucelabs.com/jobs/429a4fa845de4a009a35b5f8bd856a6e | null, The Sauce VMs failed to start the browser or device. For more info, please check https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages |
| 256 | Setting snaktypes of statements.Change the snaktype and save (outline example : somevalue ) | Net::ReadTimeout | http://saucelabs.com/jobs/2e7df09a1b4f40e9b61e3e815e03b7d1, http://saucelabs.com/jobs/2455da0a2be84a3bb91069fea3881ded | null, The connection with your VM was lost and your job can't complete. You won't be charged for these minutes. For help, please check https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages |
| 256 | Using monolingual properties in statements.Adding a statement of type monolingual (outline example : English ) | Selenium::WebDriver::Error::UnknownError | https://saucelabs.com/jobs/65530e0b4f034a4291cc4b814f484b86 | The Sauce VMs failed to start the browser or device. For more info, please check https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages |
| 255 | Edit aliases.Type new alias | Selenium::WebDriver::Error::UnknownError | https://saucelabs.com/jobs/5781c8559fbf401f8f454396bc021a21 | The Sauce VMs failed to start the browser or device. For more info, please check https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages |
| 255 | Edit label.Modify the label | Selenium::WebDriver::Error::UnknownError | https://saucelabs.com/jobs/aa642a7c67f841e28421f60342c63f29 | The Sauce VMs failed to start the browser or device. For more info, please check https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages |
| 255 | Property smoke test.Click statement add button | Selenium::WebDriver::Error::UnknownError | https://saucelabs.com/jobs/914b045996e84b42b6d4b0212960abaa | The Sauce VMs failed to start the browser or device. For more info, please check https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages |
| 254 | - | | | |
| 253 | - | | | |

Documentation for the most common error message: https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages#CommonErrorMessages-TheSauceLabsVirtualMachineFailedtoStarttheBrowserorDevice

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 12 2016, 3:22 PM
Tobi_WMDE_SW updated the task description. · Dec 12 2016, 3:23 PM
Tobi_WMDE_SW moved this task from Incoming to Monitoring on the User-Tobi_WMDE_SW board.
Tobi_WMDE_SW triaged this task as High priority. · Jan 11 2017, 1:48 PM
Tobi_WMDE_SW added subscribers: zeljkofilipin, hashar.

This is getting worse it seems. Any idea why this is happening @zeljkofilipin @hashar?

zeljkofilipin moved this task from Inbox to Next on the Browser-Tests-Infrastructure board.
zeljkofilipin moved this task from Backlog 🔙 to Next 🔜 on the User-zeljkofilipin board.

Apologies for the late reply. I am looking into this.

Sauce Labs status says everything is fine on their end; there were no incidents in the last week. I took a quick look: the rest of the selenium jobs very rarely fail with Selenium::WebDriver::Error::UnknownError.

The selenium-Wikibase job is the only one that runs for hours. Most of the jobs run in a few minutes; two jobs run for about 30 minutes (selenium-MobileFrontend, selenium-MultimediaViewer). Both of the longer jobs also occasionally fail with Selenium::WebDriver::Error::UnknownError.

Rarely, something goes wrong when connecting to Sauce Labs, and when a job creates many connections, one (or more) of them fail.

I am not sure what to do other than contact Sauce Labs support and ask if they can investigate on their side.

Can it be that we sometimes exhaust the number of concurrent sessions? Though we run the tests serially, so unless we have more than X jobs running together we should not reach that limit.

My account on sauce labs does not let me access the interface (need email verification and I never receive it). But maybe the build logs have more details?

zeljkofilipin added a comment. · Edited · Feb 3 2017, 12:11 PM

@hashar Good idea about checking the limit, but we never reach it. Our limit is 10 concurrent sessions.

Do you have a Wikimedia Sauce Labs account? I do not see it in the list.

This comment was removed by zeljkofilipin.
This comment was removed by zeljkofilipin.
zeljkofilipin updated the task description. · Feb 3 2017, 4:02 PM
zeljkofilipin updated the task description.
zeljkofilipin updated the task description. · Feb 3 2017, 4:04 PM
zeljkofilipin updated the task description. · Feb 3 2017, 4:27 PM
zeljkofilipin updated the task description. · Feb 3 2017, 4:37 PM
zeljkofilipin updated the task description. · Feb 3 2017, 4:41 PM
zeljkofilipin added a comment. · Edited · Feb 3 2017, 4:47 PM

I have reported the problem to Sauce Labs support.

Feb 3, 8:47 AM PST

wikimedia-jenkins <jenkins@wikimedia.org> reported an issue on job https://saucelabs.com/tests/65530e0b4f034a4291cc4b814f484b86 :
Issue: Job url ?
https://saucelabs.com/beta/tests/aceb09defbe84f9db7b6d1da1fbe43e0

Problem details
Random tests sometimes (almost daily) fail with "The Sauce VMs failed to start the browser or device". Nothing changed in the configuration. The tests used to work. The next time a failed test runs, it works fine.

More information is available on our public tracker: https://phabricator.wikimedia.org/T152963

Are there other tests in your test suite with the same problem ?
Yes

Does this same problem happen in other browser/os/device combinations ?
No

Was the same test working previously ?
Yes

Albert Sison (Sauce Labs Help Center)

Feb 3, 1:20 PM PST
Hello,

Thanks for writing in to Sauce Labs support. Regarding this specific "Internal Server Error", this means the VM to run the test started and then became unresponsive. The VM appears to have hit a resource limit and crashed. It's hard to tell exactly what limit was hit. Later our services tried to restart the VM, but by that point the test was no longer registered as an active job (i.e. it had "gone stale"), and the job failed. This is a type of failure we've seen before, where a VM hits a resource limit and crashes. It's a known problem, but we don't know the exact cause well enough to prevent it from happening again. In the engineer's words, it is a "rare but expected scenario". More information about this error can be found here: https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages#CommonErrorMessages-InternalServerError

As for the "The connection with your VM was lost and your job can't complete." error, if you only get this message rarely and randomly, it is probably a fluke on our end caused by an infrastructure blip as mentioned on our error page: https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages#CommonErrorMessages-TheConnectionwithYourVirtualMachinewasLostandYourJobCan'tComplete

For "The Sauce VMs failed to start the browser or device." error, what percentage of your tests are affected by this specific error? I noticed all the Sauce job URLs provided were testing against Linux/Chrome 48. For diagnostic purposes, can you try testing against a different OS such as Windows 10 or OSX and let us know if you notice any improvements? Thanks for your patience and understanding.

Regards,
Albert

Date: Wed, 8 Feb 2017 15:55:55 +0100

Hi Albert,

This Jenkins job has 211 tests and runs once a day. Each test starts a
Sauce Labs job. I have investigated jobs 253-258, so 6 days, or 1266 tests.
"The Sauce VMs failed to start the browser or device" error has appeared 5
times. 5/1266 = 0.0039 or 0.4%.

I will run the same job on Windows and Mac and see if there are any
improvements.

Regards,

Željko
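
The failure-rate estimate in the email above can be reproduced with a few lines of Ruby. This is only a sketch of the arithmetic; the variable names are made up for illustration, and the figures (211 tests per run, 6 daily runs, 5 affected tests) are the ones quoted in the email.

```ruby
# Figures quoted in the email above.
tests_per_run = 211      # each test starts one Sauce Labs job
runs = 6                 # builds 253-258, one run per day
vm_start_failures = 5    # "The Sauce VMs failed to start the browser or device"

total_tests = tests_per_run * runs
failure_rate = vm_start_failures.to_f / total_tests

puts "total tests: #{total_tests}"
puts format('failure rate: %.4f (%.1f%%)', failure_rate, failure_rate * 100)
```

Because every test opens its own Sauce Labs session, even a rate this small can plausibly produce about one failed test per daily run (211 tests at roughly 0.4%).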

Change 336632 had a related patch set uploaded (by Zfilipin):
WIP Run selenium-Wikibase Jenkins job using Windows 10

https://gerrit.wikimedia.org/r/336632

This comment was removed by zeljkofilipin.
zeljkofilipin added a comment. · Edited · Feb 14 2017, 5:07 PM

| | Linux selenium-Wikibase | Windows selenium-Wikibase-336632-1 | Mac selenium-Wikibase-336632-2 |
| --- | --- | --- | --- |
| green in last 5 test runs | 5 | 5 | 3 |
| green in last 10 test runs | 5 | 6 | 3 |
| average run time, last 5 (hours:minutes) | 2:17 | 2:31 | 3:08 |
| average run time, last 10 (hours:minutes) | 2:09 | 2:13 | 2:46 |

zeljkofilipin added a comment. · Edited · Feb 14 2017, 5:27 PM

Number of failures in the last 10 runs

| Linux | Windows | Mac |
| --- | --- | --- |
| 0 | 0 | 1 Watir::Wait::TimeoutError |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 | 1 Watir::Wait::TimeoutError |
| 0 | 0 | 0 |
| 34 MediawikiApi::ApiError | 1 Watir::Wait::TimeoutError | 1 Watir::Exception::UnknownObjectException |
| 77 MediawikiApi::ApiError | 1 Net::ReadTimeout | 0 |
| 76 MediawikiApi::ApiError | 34 MediawikiApi::ApiError | 34 MediawikiApi::ApiError |
| 76 MediawikiApi::ApiError | 34 MediawikiApi::ApiError | 34 MediawikiApi::ApiError |
| 76 MediawikiApi::ApiError | 0 | 3 MediawikiApi::ApiError |

MediawikiApi::ApiError is an unrelated problem (T157665) and should be ignored, so only the last 5 runs are relevant. In the last 5 runs, neither Linux nor Windows had any problems. Linux runs are faster than Windows, and Mac is the slowest. Mac runs also failed more often.

I am not sure what to conclude. It seems to me that moving to Windows would not help. Maybe T158074: Update Ruby tests to Selenium 3 will help.

@zeljkofilipin is it possible to run the job for beta at a different time than the job for test?
At the moment they both run at 04:40 UTC but I would like one of those to start 2 hours later or earlier. Seems like the beta-job ran pretty stable during the last days while the test-job was disabled, so probably it's an issue to run them both in parallel.

zeljkofilipin added a comment. · Edited · Feb 15 2017, 3:13 PM

I did a bit of investigation, thinking that builds targeting test wiki fail more for some reason, but that does not seem to be the case. Builds failing with MediawikiApi::ApiError are a couple of problems unrelated to this one, and should be ignored.

| build | beta failures | beta error(s) | test failures | test error(s) |
| --- | --- | --- | --- | --- |
| 270 | 0 | | 1 | Selenium::WebDriver::Error::UnknownError |
| 269 | 0 | | | |
| 268 | 0 | | | |
| 267 | 0 | | | |
| 266 | 0 | | | |
| 265 | 0 | | | |
| 264 | 34 | MediawikiApi::ApiError | | |
| 263 | 1 | Net::ReadTimeout | 76 | MediawikiApi::ApiError |
| 262 | 0 | | 76 | MediawikiApi::ApiError |
| 261 | 0 | | 76 | MediawikiApi::ApiError |
| 260 | 0 | | 76 | MediawikiApi::ApiError |
| 259 | 0 | | 76 | MediawikiApi::ApiError |
| 258 | 1 | Net::ReadTimeout | 76 | MediawikiApi::ApiError |
| 257 | 1 | Selenium::WebDriver::Error::UnknownError | 76 | MediawikiApi::ApiError |
| 256 | 2 | Net::ReadTimeout Selenium::WebDriver::Error::UnknownError | | AttributeError |
| 255 | 3 | Selenium::WebDriver::Error::UnknownError | 76 | MediawikiApi::ApiError |
| 254 | 0 | | 76 | MediawikiApi::ApiError |
| 253 | 0 | | 76 | MediawikiApi::ApiError |
| 252 | 74 | MediawikiApi::ApiError | 76 | MediawikiApi::ApiError |
| 251 | 0 | | 76 | MediawikiApi::ApiError |
| 250 | 2 | Selenium::WebDriver::Error::UnknownError | 0 | |
| 249 | 0 | | 0 | |
| 248 | 0 | | 0 | |
| 247 | 6 | Selenium::WebDriver::Error::UnknownError | 4 | Selenium::WebDriver::Error::UnknownError |
| 246 | 4 | Selenium::WebDriver::Error::UnknownError | 1 | Selenium::WebDriver::Error::UnknownError |
| 245 | 0 | | 0 | |
| 244 | 1 | Net::ReadTimeout | 2 | Selenium::WebDriver::Error::UnknownError Watir::Wait::TimeoutError |
| 243 | 0 | | 0 | |
| 242 | 0 | | 0 | |
| 241 | 3 | Selenium::WebDriver::Error::UnknownError | 76 | MediawikiApi::ApiError |
| 240 | 0 | | 76 | MediawikiApi::ApiError |

The trouble with recent data is that we had a lot of trouble with unrelated MediawikiApi::ApiError problems. Now that that is resolved, I have created a new job selenium-Wikibase-336632-4 that will run every 4 hours (because it needs over 3 hours for a run). I will let it run for a day or two.

zeljkofilipin added a comment. · Edited · Feb 15 2017, 3:53 PM

@zeljkofilipin is it possible to run the job for beta at a different time than the job for test?

We could run the jobs sequentially, not in parallel:

https://docs.openstack.org/infra/jenkins-job-builder/project_matrix.html?highlight=sequential

sequential (bool): run builds sequentially (default false)

Example:

https://gerrit.wikimedia.org/r/#/c/333280/6/jjb/mediawiki.yaml

(Thanks to @hashar.)

Unfortunately, it looks like running sequentially is supported only for the classic execution strategy, but we use yaml :(

From http://stackoverflow.com/questions/12787032/handling-exceptions-on-cucumber-scenarios?rq=1

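require 'selenium-webdriver'

# Cucumber hook: wraps every scenario tagged @handle_alert_boxes.
Around('@handle_alert_boxes') do |scenario, block|
  begin
    block.call # run the wrapped scenario
  rescue Selenium::WebDriver::Error::UnhandledAlertError
    # Handle an unexpected alert raised while the scenario runs.
    puts "It's OK!"
  end
end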

So maybe just rescue the few timeout exceptions and block.call again? :}
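
The idea above (rescue a few known-flaky exceptions and run the block again) could be sketched roughly as follows. This is only an illustration: `with_retries` is a hypothetical helper, not part of Cucumber, Selenium or mediawiki_selenium, and `Net::ReadTimeout` stands in for whichever errors turn out to be worth retrying. Whether a Cucumber Around hook can safely call `block.call` a second time has not been verified in this task.

```ruby
require 'net/http' # loads Net::ReadTimeout

# Hypothetical helper: runs the block and retries it when one of the
# listed exception classes is raised, up to max_attempts in total.
# Any other exception, or the last failed attempt, is re-raised.
def with_retries(max_attempts: 3, on: [Net::ReadTimeout])
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *on
    retry if attempts < max_attempts
    raise
  end
end

# Example: the block fails twice with a timeout, then succeeds.
result = with_retries(max_attempts: 3) do |attempt|
  raise Net::ReadTimeout if attempt < 3
  "passed on attempt #{attempt}"
end
puts result
```

Capping the number of attempts keeps a genuinely broken test from looping forever, and listing the exception classes explicitly means real assertion failures are still reported on the first attempt.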

Sauce labs support ticket (not public): https://support.saucelabs.com/hc/en-us/requests/35513

wikimedia-jenkins
a few seconds ago

Running tests on Windows and Mac did not prove to be more stable. More information is available at our public bug tracker: https://phabricator.wikimedia.org/T152963

We will rerun failed tests and see if that helps.

Change 336632 abandoned by Zfilipin:
WIP Run selenium-Wikibase Jenkins job on Linux, Mac and Windows

https://gerrit.wikimedia.org/r/336632

Change 338368 had a related patch set uploaded (by Zfilipin):
WIP Increase in failures caused by Saucelabs

https://gerrit.wikimedia.org/r/338368

Change 338785 had a related patch set uploaded (by Zfilipin):
WIP Increase in failures caused by Saucelabs

https://gerrit.wikimedia.org/r/338785

Change 338967 had a related patch set uploaded (by Zfilipin):
WIP Increase in failures caused by Saucelabs

https://gerrit.wikimedia.org/r/338967

zeljkofilipin added a comment. · Edited · Feb 21 2017, 4:01 PM

Albert Sison (Sauce Labs Help Center)

Feb 17, 3:22 PM PST
Hello,

Thanks for the update and please let us know if re-running the failed tests works for you. In case you're wondering, I have attached screenshots of the "Internal Server Error" and "The Sauce VMs failed to start the browser or device." error rates for jobs run by all users for the past 30 days.

Regards,
Albert

The Sauce VMs failed to start the browser or device

Internal Server Error

Change 338785 abandoned by Zfilipin:
WIP Increase in failures caused by Saucelabs

https://gerrit.wikimedia.org/r/338785

Change 341523 had a related patch set uploaded (by zfilipin):
[mediawiki/selenium] WIP Problem: Can not use --retry option to retry failed tests as part of the same run

https://gerrit.wikimedia.org/r/341523

After several days of stability, the last 2 builds had several strange failures again:

One time:

unknown error: $.cookie is not a function

Several times:

Net::ReadTimeout (Net::ReadTimeout)

Several:

Network is unreachable - connect(2) for "test.wikidata.org" port 443 (Faraday::ConnectionFailed)

Some:

Sauce could not start your job. For more information on what happened, please visit https://saucelabs.com/jobs/67b6ddb2476a4fcf852c418a8de063a4 (Selenium::WebDriver::Error::UnknownError)

https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=test,PLATFORM=Linux,label=BrowserTests/
https://integration.wikimedia.org/ci/job/selenium-Wikibase/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/

@Tobi_WMDE_SW I am working on this. Looks like the only way to ensure stability is to rerun failed tests. I am investigating 2-3 ways on how to do that. No luck so far, but getting close.

@zeljkofilipin Ok! Thanks for your effort and for the update! If there's anything we can do from our side, please let me know.

Change 342636 had a related patch set uploaded (by Zfilipin):
[mediawiki/extensions/Wikibase] WIP: Increase in failures caused by Saucelabs

https://gerrit.wikimedia.org/r/342636

hashar removed a subscriber: hashar. · Mar 14 2017, 3:36 PM
zeljkofilipin removed zeljkofilipin as the assignee of this task. · May 8 2017, 11:28 AM

Change 338368 abandoned by Zfilipin:
WIP Increase in failures caused by Saucelabs

https://gerrit.wikimedia.org/r/338368

Change 342636 abandoned by Zfilipin:
WIP: Increase in failures caused by Saucelabs

https://gerrit.wikimedia.org/r/342636

@zeljkofilipin any updates? Looks like some MobileFrontend extension tests still fail from time to time: https://integration.wikimedia.org/ci/view/Reading-Web/job/selenium-MobileFrontend/451/

@pmiazga apologies for the delay, I have to finish some selenium+node tasks before I can get back to this.

Restricted Application added a subscriber: PokestarFan. · Jul 27 2017, 1:36 PM
zeljkofilipin closed this task as Declined. · Nov 7 2017, 3:04 PM

The only thing we can do is avoid Sauce Labs: T167432: Run Wikibase daily browser tests on Jenkins