Page MenuHomePhabricator

mwcore-phpunit-coverage-master times out after 5 hours
Closed, ResolvedPublic

Description

From Icinga:

<wmf-insecte> Project mwcore-phpunit-coverage-master build #168: STILL FAILING in 5 hr 0 min:
 https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/168/

From the build time history https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/buildTimeTrend it seems to take way more time whenever the build is scheduled on integration-slave-docker-1043.integration.eqiad.wmflabs

Build#DurationAgent
Failed#1685 hr 0 minintegration-slave-docker-1043
Failed#1675 hr 0 minintegration-slave-docker-1043
Success#1663 hr 40 minintegration-slave-docker-1052
Failed#1655 hr 0 minintegration-slave-docker-1043
Success#1643 hr 35 minintegration-slave-docker-1041
Success#1633 hr 36 minintegration-slave-docker-1051
Success#1624 hr 5 minintegration-slave-docker-1040
Success#1613 hr 49 minintegration-slave-docker-1054
Success#1603 hr 32 minintegration-slave-docker-1059
Success#1593 hr 32 minintegration-slave-docker-1041
Success#1583 hr 59 minintegration-slave-docker-1040
Success#1573 hr 40 minintegration-slave-docker-1051
Success#1563 hr 34 minintegration-slave-docker-1041
Success#1553 hr 57 minintegration-slave-docker-1048
Success#1543 hr 49 minintegration-slave-docker-1054
Success#1533 hr 41 minintegration-slave-docker-1050
Success#1523 hr 36 minintegration-slave-docker-1048
Success#1513 hr 44 minintegration-slave-docker-1050
Success#1503 hr 32 minintegration-slave-docker-1050
Success#1493 hr 38 minintegration-slave-docker-1041
Success#1483 hr 35 minintegration-slave-docker-1051
Success#1473 hr 35 minintegration-slave-docker-1041
Success#1463 hr 48 minintegration-slave-docker-1058
Success#1453 hr 32 minintegration-slave-docker-1041
Success#1443 hr 34 minintegration-slave-docker-1041
Success#1433 hr 50 minintegration-slave-docker-1050
Success#1423 hr 45 minintegration-slave-docker-1054
Failed#1415 hr 0 minintegration-slave-docker-1043
Success#1403 hr 27 minintegration-slave-docker-1040
Success#1382 hr 48 minintegration-slave-docker-1050
Success#1373 hr 15 minintegration-slave-docker-1041
Success#1362 hr 44 minintegration-slave-docker-1050
Success#1353 hr 52 minintegration-slave-docker-1050
Success#1343 hr 43 minintegration-slave-docker-1041
Success#1334 hr 18 minintegration-slave-docker-1040
Success#1323 hr 43 minintegration-slave-docker-1048
Success#1313 hr 44 minintegration-slave-docker-1052
Success#1303 hr 37 minintegration-slave-docker-1052
Success#1293 hr 51 minintegration-slave-docker-1058
Success#1283 hr 32 minintegration-slave-docker-1051
Success#1274 hr 19 minintegration-slave-docker-1043
Success#1262 hr 42 minintegration-slave-docker-1058
Success#1252 hr 58 minintegration-slave-docker-1041
Success#1242 hr 40 minintegration-slave-docker-1050
Success#1233 hr 3 minintegration-slave-docker-1041
Success#1222 hr 46 minintegration-slave-docker-1050
Success#1212 hr 45 minintegration-slave-docker-1052
Success#1202 hr 52 minintegration-slave-docker-1054
Success#1193 hr 4 minintegration-slave-docker-1040
Success#1182 hr 38 minintegration-slave-docker-1041
Success#1172 hr 38 minintegration-slave-docker-1050
Success#1162 hr 47 minintegration-slave-docker-1051
Success#1152 hr 45 minintegration-slave-docker-1051
Success#1144 hr 5 minintegration-slave-docker-1043
Success#1132 hr 47 minintegration-slave-docker-1054
Success#1122 hr 37 minintegration-slave-docker-1048
Success#1112 hr 28 minintegration-slave-docker-1050
Success#1102 hr 26 minintegration-slave-docker-1041
Success#1092 hr 35 minintegration-slave-docker-1054

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2019-09-12T08:14:08Z] <hashar> Marking integration-slave-docker-1043 offline , triggering mwcore-phpunit-coverage-master again # T232706

Mentioned in SAL (#wikimedia-releng) [2019-09-13T12:39:51Z] <hashar> integration-slave-docker-1043 : killed stall container # T232706

hashar claimed this task.

There was a stall container on integration-slave-docker-1043 apparently using too much CPU. I have killed it that should help.

https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/ still times out from time to time depending on the underlying host executing the job and its workload...

StatusBuildDurationInstance
Failed#1995 hr 0 minintegration-agent-docker-1009
Success#1984 hr 28 minintegration-agent-docker-1005
Failed#1975 hr 0 minintegration-agent-docker-1006
Success#1964 hr 39 minintegration-agent-docker-1006
Failed#1955 hr 0 minintegration-agent-docker-1002
Success#1944 hr 37 minintegration-agent-docker-1011
Success#1934 hr 42 minintegration-agent-docker-1006
Success#1924 hr 53 minintegration-agent-docker-1003
Success#1913 hr 29 minintegration-slave-docker-1059
Success#1903 hr 39 minintegration-slave-docker-1048
Success#1893 hr 44 minintegration-slave-docker-1054
Success#1883 hr 31 minintegration-slave-docker-1058
Success#1873 hr 28 minintegration-slave-docker-1058
Success#1863 hr 48 minintegration-slave-docker-1052
Failed#1855 hr 0 minintegration-slave-docker-1043

And the suite also list a bunch of very slow tests:

You should really fix these slow tests (>50 ms)

152277 msApiQuerySiteinfoTest:testContinuation (fixed T234016)
223437 msApiMoveTest:testMoveSubpages
323395 msRefreshLinksPartitionTest:testRefreshLinks with data set #0
422151 msApiQueryContinueTest:testGen2Prop2List1Meta
521176 msApiQueryContinueTest:testSameGenList
620822 msApiQueryContinueTest:testGen2Prop
720026 msMWDebugTest:testAppendDebugInfoToApiResultXmlFormat
815395 msDatabaseSQLTest:testInsertSelectBatching
915351 msMovePageTest:testMoveSubpagesIfAllowed
1015017 msMovePageTest:testMoveSubpages
1114036 msApiQueryContinueTest:testSameGenAndProp
1213151 msApiQueryPrefixSearchTest:testOffsetContinue with data set "no offset"
1312404 msMediaWiki\Auth\AuthManagerTest:testAutoAccountCreation
1412275 msSpecialPageFatalTest:testSpecialPageDoesNotFatal with data set "Version"
1510898 msApiMainTest:testApiNoParam
1610705 msSpecialPageFatalTest:testSpecialPageDoesNotFatal with data set "Preferences"
1710572 msDefaultPreferencesFactoryTest:testIntvalFilter
1810195 msApiQueryContinueTest:testGen1Prop1List
1910171 msApiQueryPrefixSearchTest:testOffsetContinue with data set "with offset"
2010007 msSpecialPreferencesTest:testT43337
219867 msSpecialPageFatalTest:testSpecialPageDoesNotFatal with data set "Recentchanges"
229848 msApiQueryPrefixSearchTest:testOffsetContinue with data set "past end, no offset"
239809 msMovePageTest:testTitleMoveCompleteIntegrationTest
249582 msApiQueryPrefixSearchTest:testOffsetContinue with data set "past end, with offset"
259111 msApiQueryPrefixSearchTest:testValidCovers
268953 msSpecialPageFatalTest:testSpecialPageDoesNotFatal with data set "PasswordPolicies"
278934 msLinksUpdateTest:testOnAddingAndRemovingCategoryToTemplates_embeddingPagesAreIgnored
288675 msSpecialPageFatalTest:testSpecialPageDoesNotFatal with data set "Listgrants"
298586 msApiMoveTest:testMoveTalk
308365 msArticleTablesTest:testTemplatelinksUsesContentLanguage
318351 msApiRevisionDeleteTest:testUnhidingOutput
328078 msApiRevisionDeleteTest:testPartiallyBlockedPage
338020 msWikiPageMcrWriteBothDbTest:testDoRollback
347955 msWikiPageMcrDbTest:testDoRollback
357841 msWikiPagePreMcrDbTest:testDoRollback
367829 msApiRevisionDeleteTest:testHidingRevisions
377551 msGlobalWithDBTest:testWfIsBadImage with data set "No context page"
387547 msImportLinkCacheIntegrationTest:testImportForImportSource
397480 msWikiPageMcrReadNewDbTest:testDoRollback
407475 msSpecialPageFatalTest:testSpecialPageDoesNotFatal with data set "Listgrouprights"
417250 msMessageCacheTest:testLoadFromDB_fetchLatestRevision
427193 msSearchEngineTest:testFiltersMissing
437188 msMovePageTest:testMove with data set "Move between invalid names"
447186 msApiQueryContinueTest:testGen1Prop
457161 msWikiPageNoContentModelDbTest:testDoRollback
467110 msMovePageTest:testMoveAbortedByTitleMoveHook
477056 msMovePageTest:testMove with data set "Aborted by hook"
486869 msApiMoveTest:testMoveTalkFailed
496854 msSpecialPageFatalTest:testSpecialPageDoesNotFatal with data set "Newpages"
506851 msMovePageTest:testMove with data set "Doubly aborted by hook"

There is another issue, the builds that run on Stretch based VMs are slower than the one running on Jessie VMs. That also affects other jobs so there is some infrastructure issue that has to be figured out.

Snapshot from https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/buildTimeTrend

StatusBuildDurationInstance
Success#2633 hr 55 minintegration-agent-docker-1016
Success#2624 hr 0 minintegration-agent-docker-1006
Failed#2615 hr 0 minintegration-agent-docker-1002
Success#2604 hr 47 minintegration-agent-docker-1010
Success#2594 hr 16 minintegration-agent-docker-1003
Success#2584 hr 49 minintegration-agent-docker-1007
Success#2574 hr 3 minintegration-agent-docker-1011
Success#2564 hr 8 minintegration-agent-docker-1012
Success#2553 hr 57 minintegration-agent-docker-1008
Success#2544 hr 24 minintegration-agent-docker-1003
Success#2533 hr 53 minintegration-agent-docker-1003
Success#2524 hr 4 minintegration-agent-docker-1005
Success#2513 hr 47 minintegration-agent-docker-1016
Success#2504 hr 11 minintegration-agent-docker-1005
Success#2493 hr 44 minintegration-agent-docker-1016
Success#2484 hr 28 minintegration-agent-docker-1007
Failed#2475 hr 0 minintegration-agent-docker-1013
Success#2463 hr 55 minintegration-agent-docker-1005
Success#2452 hr 58 minintegration-agent-jessie-docker-1001
Success#2443 hr 59 minintegration-agent-docker-1011
Failed#2435 hr 0 min integration-agent-docker-1002
Success#2422 hr 43 minintegration-agent-jessie-docker-1001
Success#2413 hr 56 minintegration-agent-docker-1011
Success#2404 hr 3 minintegration-agent-docker-1003
Success#2393 hr 45 minintegration-agent-docker-1005
Success#2384 hr 6 minintegration-agent-docker-1001
Success#2374 hr 34 minintegration-agent-docker-1007
Success#2363 hr 48 minintegration-agent-docker-1012
Success#2353 hr 39 minintegration-agent-docker-1012
Success#2342 hr 40 minintegration-agent-jessie-docker-1001
Success#2333 hr 39 minintegration-agent-docker-1008
Success#2324 hr 43 minintegration-agent-docker-1009
Success#2313 hr 36 minintegration-agent-docker-1011
Failed#2305 hr 0 minintegration-agent-docker-1007
Success#2293 hr 55 minintegration-agent-docker-1003
Failed#2285 hr 0 minintegration-agent-docker-1002
Success#2273 hr 38 minintegration-agent-docker-1012
Failed#2265 hr 0 minintegration-agent-docker-1002
Success#2254 hr 1 minintegration-agent-docker-1008
Failed#2245 hr 0 minintegration-agent-docker-1002
Success#2232 hr 50 minintegration-agent-jessie-docker-1001
Failed#2221 hr 25 minintegration-agent-docker-1003
Failed#2215 hr 0 minintegration-agent-docker-1013
Success#2204 hr 53 minintegration-agent-docker-1007
Success#2194 hr 4 minintegration-agent-docker-1011
Success#2184 hr 9 minintegration-agent-docker-1006
Failed#2175 hr 0 minintegration-agent-docker-1004
Success#2164 hr 1 minintegration-agent-docker-1011
Success#2154 hr 45 minintegration-agent-docker-1007
Success#2144 hr 14 minintegration-agent-docker-1005
Success#2134 hr 27 minintegration-agent-docker-1003
Success#2124 hr 21 minintegration-agent-docker-1006
Failed#2115 hr 0 minintegration-agent-docker-1010
Success#2102 hr 48 minintegration-agent-jessie-docker-1001
Success#2094 hr 3 minintegration-agent-jessie-docker-1001
Success#2084 hr 25 minintegration-agent-docker-1016
Success#2074 hr 20 minintegration-agent-docker-1011
Success#2064 hr 2 minintegration-agent-docker-1011
Failed#2055 hr 0 minintegration-agent-docker-1013
Success#2044 hr 44 minintegration-agent-docker-1010
Failed#2035 hr 0 minintegration-agent-docker-1013
Success#2024 hr 45 minintegration-agent-docker-1007

It is constantly faster on the Jessie instance (which is on cloudvirt1028):

Success#2452 hr 58 minintegration-agent-jessie-docker-1001
Success#2422 hr 43 minintegration-agent-jessie-docker-1001
Success#2342 hr 40 minintegration-agent-jessie-docker-1001
Success#2232 hr 50 minintegration-agent-jessie-docker-1001
Success#2102 hr 48 minintegration-agent-jessie-docker-1001
Success#2094 hr 3 minintegration-agent-jessie-docker-1001

Other agents being on cloudvirt1028:

Success#2553 hr 57 minintegration-agent-docker-1008
Success#2524 hr 4 minintegration-agent-docker-1005
Success#2504 hr 11 minintegration-agent-docker-1005
Success#2463 hr 55 minintegration-agent-docker-1005
Success#2393 hr 45 minintegration-agent-docker-1005
Success#2333 hr 39 minintegration-agent-docker-1008
Success#2254 hr 1 minintegration-agent-docker-1008
Success#2144 hr 14 minintegration-agent-docker-1005

Mentioned in SAL (#wikimedia-releng) [2019-10-30T21:48:49Z] <hashar> Triggering build of mwcore-phpunit-coverage-master , should use Docker 18.06.2 and be faster # T232706

hashar triaged this task as High priority.Oct 30 2019, 9:49 PM

I have rolled back the whole fleet to Docker 18.06.2 so it should be faster now. We will see tomorrow after the manual build and the 3am ones.

(17) integration-agent-docker-[1001-1014,1016].integration.eqiad.wmflabs,integration-agent-jessie-docker-1001.integration.eqiad.wmflabs,integration-agent-puppet-docker-1001.integration.eqiad.wmflabs

  • OUTPUT of 'docker --version' -----

Docker version 18.06.2-ce, build 6d37f41

Change 547550 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Update mwcore-phpunit-coverage-master for xdebug

https://gerrit.wikimedia.org/r/547550

Change 547550 merged by jenkins-bot:
[integration/config@master] Update mwcore-phpunit-coverage-master for xdebug

https://gerrit.wikimedia.org/r/547550

Change 547551 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Update coverage jobs for php-xdebug update

https://gerrit.wikimedia.org/r/547551

Change 547551 merged by jenkins-bot:
[integration/config@master] Update coverage jobs for php-xdebug update

https://gerrit.wikimedia.org/r/547551

Status

  • the CI instances are back to Docker 18.06.2
  • the job uses a container with an updated xdebug ( docker-registry.wikimedia.org/releng/quibble-coverage:0.0.39-s3 )

The 15:00UTC build benefit from those. We will see whether it is faster once it has completed: https://integration.wikimedia.org/ci/job/mwcore-phpunit-coverage-master/272/console

All the builds since last capture:

StatusBuildDurationInstanceNote
Success#2833 hr 33 minintegration-agent-docker-1004cloudvirt1017
Success#2822 hr 29 minintegration-agent-docker-1011cloudvirt1030
Success#2812 hr 18 minintegration-agent-docker-1012cloudvirt1030
Success#2802 hr 29 minintegration-agent-docker-1011cloudvirt1030
Success#2793 hr 29 minintegration-agent-docker-1002cloudvirt1017
Success#2783 hr 10 minintegration-agent-docker-1009cloudvirt1021
Success#2773 hr 28 minintegration-agent-docker-1004cloudvirt1017
Success#2763 hr 30 minintegration-agent-docker-1002cloudvirt1017
Success#2753 hr 19 minintegration-agent-docker-1013cloudvirt1017
Success#2742 hr 28 minintegration-agent-docker-1016cloudvirt1026
Success#2732 hr 24 minintegration-agent-docker-1008cloudvirt1028
Success#2722 hr 22 minintegration-agent-docker-1006cloudvirt1028, php-xdebug fix
Success#2712 hr 56 minintegration-agent-docker-1008cloudvirt1028,
Success#2702 hr 58 minintegration-agent-docker-1008cloudvirt1028, Rolled back to Docker 18.06.2
Success#2694 hr 3 minintegration-agent-docker-1013
Failed#2685 hr 0 minintegration-agent-docker-1007
Failed#2675 hr 0 minintegration-agent-docker-1013
Failed#2665 hr 0 minintegration-agent-docker-1013
Success#2654 hr 39 minintegration-agent-docker-1006
Success#2643 hr 59 minintegration-agent-docker-1005
Success#2633 hr 55 minintegration-agent-docker-1016

Build #270 took 3 hours, and is the first that benefited from the rollback to Docker 18.06.2

Build #272 is the fastest we had with 2h22min it benefits from the php-xdebug fix.

One can see some builds takes 2h20 - 2h30 which I think is the baseline, while others mysteriously take 3h30mins. That is to be investigated, but most probably related to the cloudvirt machine they are running on :-\

hashar removed hashar as the assignee of this task.Mar 5 2020, 2:09 PM

That is stalled by T236675: Investigate Docker slowness between 18.06.2 and 18.09.7. Gotta bisect / find the root cause of Docker slowness.