Page MenuHomePhabricator

UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests
Closed, ResolvedPublicPRODUCTION ERROR

Description

While doing a backport for https://gerrit.wikimedia.org/r/c/1127458/ , httpb test suite failed three times in a row, each time on the baremetal mwdebug servers.

The baremetal debug servers timing out is a regular issue (T360867, T371427), but I don't know whether that could be a more general issue and thus mark it as a train blocker.

Error
normalized_message
[{reqId}] {exception_url}   UnexpectedValueException: Invalid server index #
FrameLocationCall
from/srv/mediawiki/php-1.44.0-wmf.21/includes/objectcache/SqlBagOStuff.php(371)
#0/srv/mediawiki/php-1.44.0-wmf.21/includes/objectcache/SqlBagOStuff.php(535)SqlBagOStuff->getConnection(null)
#1/srv/mediawiki/php-1.44.0-wmf.21/includes/objectcache/SqlBagOStuff.php(192)SqlBagOStuff->fetchBlobs(array, bool)
#2/srv/mediawiki/php-1.44.0-wmf.21/includes/libs/objectcache/MediumSpecificBagOStuff.php(122)SqlBagOStuff->doGet(string, int)
#3/srv/mediawiki/php-1.44.0-wmf.21/includes/libs/objectcache/BagOStuff.php(783)Wikimedia\ObjectCache\MediumSpecificBagOStuff->get(string, int)
#4/srv/mediawiki/php-1.44.0-wmf.21/includes/libs/objectcache/MultiWriteBagOStuff.php(396)Wikimedia\ObjectCache\BagOStuff->proxyCall(string, int, int, array, Wikimedia\ObjectCache\MultiWriteBagOStuff)
#5/srv/mediawiki/php-1.44.0-wmf.21/includes/libs/objectcache/MultiWriteBagOStuff.php(187)Wikimedia\ObjectCache\MultiWriteBagOStuff->callKeyMethodOnTierCache(int, string, int, int, array)
#6/srv/mediawiki/php-1.44.0-wmf.21/includes/parser/ParserCache.php(278)Wikimedia\ObjectCache\MultiWriteBagOStuff->get(string, int)
#7/srv/mediawiki/php-1.44.0-wmf.21/includes/parser/ParserCache.php(377)MediaWiki\Parser\ParserCache->getMetadata(WikiPage, int)
#8/srv/mediawiki/php-1.44.0-wmf.21/includes/page/ParserOutputAccess.php(239)MediaWiki\Parser\ParserCache->get(WikiPage, MediaWiki\Parser\ParserOptions)
#9/srv/mediawiki/php-1.44.0-wmf.21/includes/page/Article.php(729)MediaWiki\Page\ParserOutputAccess->getCachedParserOutput(WikiPage, MediaWiki\Parser\ParserOptions, null, int)
#10/srv/mediawiki/php-1.44.0-wmf.21/includes/page/Article.php(545)Article->generateContentOutput(MediaWiki\User\User, MediaWiki\Parser\ParserOptions, int, MediaWiki\Output\OutputPage, array)
#11/srv/mediawiki/php-1.44.0-wmf.21/extensions/Wikibase/repo/includes/Actions/ViewEntityAction.php(80)Article->view()
#12/srv/mediawiki/php-1.44.0-wmf.21/extensions/Wikibase/repo/includes/Actions/ViewEntityAction.php(55)Wikibase\Repo\Actions\ViewEntityAction->showEntityPage()
#13/srv/mediawiki/php-1.44.0-wmf.21/includes/actions/ActionEntryPoint.php(730)Wikibase\Repo\Actions\ViewEntityAction->show()
#14/srv/mediawiki/php-1.44.0-wmf.21/includes/actions/ActionEntryPoint.php(507)MediaWiki\Actions\ActionEntryPoint->performAction(Article, MediaWiki\Title\Title)
#15/srv/mediawiki/php-1.44.0-wmf.21/includes/actions/ActionEntryPoint.php(145)MediaWiki\Actions\ActionEntryPoint->performRequest()
#16/srv/mediawiki/php-1.44.0-wmf.21/includes/MediaWikiEntryPoint.php(202)MediaWiki\Actions\ActionEntryPoint->execute()
#17/srv/mediawiki/php-1.44.0-wmf.21/index.php(58)MediaWiki\MediaWikiEntryPoint->run()
#18/srv/mediawiki/w/index.php(3)require(string)
#19{main}

Console outputs from this morning:

07:22:54 Executing check 'check_testservers_baremetal-1_of_1'
07:22:54 Executing check 'check_testservers_k8s-1_of_2'
07:22:54 Executing check 'check_testservers_k8s-2_of_2'
07:23:42 Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts...
https://test.wikidata.org/wiki/Property:P13344 (/srv/deployment/httpbb-tests/appserver/test_main.yaml:133)
  mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 500.
    Body: expected to contain 'rbkKZufs', got '<!DOCTYPE html>\n<html class="client-nojs vector-fe'... (34883 characters total).
https://test.wikidata.org/wiki/Wikidata:Main_Page (/srv/deployment/httpbb-tests/appserver/test_main.yaml:136)
  mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 500.
    Body: expected to contain 'test instance', got '<!DOCTYPE html>\n<html class="client-nojs vector-fe'... (37308 characters total).
===
FAIL: 147 requests sent to each of 4 hosts. 2 requests with failed assertions.
07:29:01 Executing check 'check_testservers_baremetal-1_of_1'
07:29:01 Executing check 'check_testservers_k8s-1_of_2'
07:29:01 Executing check 'check_testservers_k8s-2_of_2'
07:29:37 Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts...
https://test.wikidata.org/wiki/Property:P13344 (/srv/deployment/httpbb-tests/appserver/test_main.yaml:133)
  mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 500.
    Body: expected to contain 'rbkKZufs', got '<!DOCTYPE html>\n<html class="client-nojs vector-fe'... (34883 characters total).
https://test.wikidata.org/wiki/Wikidata:Main_Page (/srv/deployment/httpbb-tests/appserver/test_main.yaml:136)
  mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 500.
    Body: expected to contain 'test instance', got '<!DOCTYPE html>\n<html class="client-nojs vector-fe'... (37308 characters total).
===
FAIL: 147 requests sent to each of 4 hosts. 2 requests with failed assertions.
07:30:22 Executing check 'check_testservers_baremetal-1_of_1'
07:30:22 Executing check 'check_testservers_k8s-1_of_2'
07:30:22 Executing check 'check_testservers_k8s-2_of_2'
07:30:57 Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts...
https://test.wikidata.org/wiki/Property:P13344 (/srv/deployment/httpbb-tests/appserver/test_main.yaml:133)
  mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 500.
    Body: expected to contain 'rbkKZufs', got '<!DOCTYPE html>\n<html class="client-nojs vector-fe'... (34883 characters total).
https://test.wikidata.org/wiki/Wikidata:Main_Page (/srv/deployment/httpbb-tests/appserver/test_main.yaml:136)
  mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 500.
    Body: expected to contain 'test instance', got '<!DOCTYPE html>\n<html class="client-nojs vector-fe'... (37308 characters total).
https://wikimedia.org/research (/srv/deployment/httpbb-tests/appserver/test_redirects.yaml:3)
  mwdebug1002.eqiad.wmnet
    Status code: expected 301, got 503.
    Location header: expected 'https://research.wikimedia.org/', was missing.
===
FAIL: 147 requests sent to each of 4 hosts. 3 requests with failed assertions.

Event Timeline

hashar triaged this task as Unbreak Now! priority.Mar 18 2025, 7:47 AM

I am marking this an Unbreak Now! since the test failed repeatedly and I don't think it is a good idea to ignore failing smoke tests.

Noting this also happened last night during the train presync:

03:32:44 Executing check 'check_testservers_baremetal-1_of_1'
03:32:44 Executing check 'check_testservers_k8s-1_of_2'
03:32:44 Executing check 'check_testservers_k8s-2_of_2'
03:33:39 Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts...
https://test.wikidata.org/wiki/Property:P13344 (/srv/deployment/httpbb-tests/appserver/test_main.yaml:133)
  mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 500.
    Body: expected to contain 'rbkKZufs', got '<!DOCTYPE html>\n<html class="client-nojs vector-fe'... (34883 characters total).
https://test.wikidata.org/wiki/Wikidata:Main_Page (/srv/deployment/httpbb-tests/appserver/test_main.yaml:136)
  mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 500.
    Body: expected to contain 'test instance', got '<!DOCTYPE html>\n<html class="client-nojs vector-fe'... (37308 characters total).
===
FAIL: 147 requests sent to each of 4 hosts. 2 requests with failed assertions.

I have looked at logstash https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002 with message:500 . The one that failed overnight started the checks at 03:32:44 and emitted the error at 03:33:39.

In logstash I find:

Mar 18, 2025 @ 03:33:02.0002025-03-18T03:33:02 194441 2620:0:861:103:10:64:32:123 proxy:unix:/run/php/fpm-www-7.4.sockfcgi://www-7.4/500 7406 GET http://test.wikidata.org/wiki/Property:P13344 - text/html - - httpbb (https://wikitech.wikimedia.org/wiki/Httpbb) mwpresync@deploy2002 - - - - 2620:0:861:103:10:64:32:123 1ac78293-5f0b-992b-a2a4-a90c026ae66d -
Mar 18, 2025 @ 03:33:02.0002025-03-18T03:33:02 204784 2620:0:861:103:10:64:32:123 proxy:unix:/run/php/fpm-www-7.4.sockfcgi://www-7.4/500 7800 GET http://test.wikidata.org/wiki/Wikidata:Main_Page - text/html - - httpbb (https://wikitech.wikimedia.org/wiki/Httpbb) mwpresync@deploy2002 - - - - 2620:0:861:103:10:64:32:123 438e7826-7a64-99ae-a363-79929c6b5466 -

Which is for mwdebug1001. They fail / 500 when hitting http://test.wikidata.org/wiki/Property:P13344 and http://test.wikidata.org/wiki/Wikidata:Main_Page .

The first one is followed by errors logged in MediaWiki:

/wiki/Property:P13344   PHP Notice: Undefined offset: 0
/wiki/Property:P13344   PHP Notice: Trying to get property 'exptime' of non-object
/wiki/Property:P13344   PHP Notice: Trying to get property 'shardIndex' of non-object
/wiki/Property:P13344   UnexpectedValueException: Invalid server index #
hashar renamed this task from Deployment fails due to mwdebug servers timing out while running httpbb tests to UnexpectedValueException: Invalid server index # causes eployment to fail due to mwdebug servers timing out while running httpbb tests.Mar 18 2025, 8:29 AM
hashar updated the task description. (Show Details)
Restricted Application changed the subtype of this task from "Task" to "Production Error". · View Herald TranscriptMar 18 2025, 8:29 AM

I am pretty sure that is a regression in wmf.21 and the patch upgrading the testwikis (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128586 ) was only published on the mwdebug servers. Since httpbb rightfully failed when hitting the debug servers, the patch was never fully deployed.

When I went to run the backport for an unrelated config change, that is still deploying wmf.21 and hit the same bug. We gotta figure it out.

Aklapper renamed this task from UnexpectedValueException: Invalid server index # causes eployment to fail due to mwdebug servers timing out while running httpbb tests to UnexpectedValueException: Invalid server index # causes deployment to fail due to mwdebug servers timing out while running httpbb tests.Mar 18 2025, 9:07 AM

Change #1128802 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] objectcache: Re-number array keys in SqlBagOStuff

https://gerrit.wikimedia.org/r/1128802

Ladsgroup moved this task from Triage to In progress on the DBA board.

Change #1128803 had a related patch set uploaded (by Jaime Nuche; author: Amir Sarabadani):

[mediawiki/core@wmf/1.44.0-wmf.21] objectcache: Re-number array keys in SqlBagOStuff

https://gerrit.wikimedia.org/r/1128803

Change #1128803 merged by jenkins-bot:

[mediawiki/core@wmf/1.44.0-wmf.21] objectcache: Re-number array keys in SqlBagOStuff

https://gerrit.wikimedia.org/r/1128803

Mentioned in SAL (#wikimedia-operations) [2025-03-18T09:39:34Z] <jnuche@deploy2002> Started scap sync-world: Backport for [[gerrit:1128803|objectcache: Re-number array keys in SqlBagOStuff (T389169)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-18T09:44:51Z] <jnuche@deploy2002> jnuche: Backport for [[gerrit:1128803|objectcache: Re-number array keys in SqlBagOStuff (T389169)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Thank you @Ladsgroup ! That has unblocked the train :)

Sorry for breaking it! I will try to add some regression testing a bit later

Change #1128802 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Re-number array keys in SqlBagOStuff

https://gerrit.wikimedia.org/r/1128802