Missing Cirrussearch dump (enwiki and wikidata)
Closed, Resolved · Public · 3 Estimated Story Points

Description

Hello @EBernhardson, the script failed yesterday. Here are the email we got and the stack trace:

email
Systemd timer ran the following command:
    `/usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s1.dblist`
its return value was 1 and emitted the following output:
<13>Feb 27 19:03:59 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20230227/enwiki-20230227-cirrussearch-content.json.gz
stack.trace
Dumping 6624383 documents (6624383 in the index)
        6% done...
        8% done...
        ......
        58% done...
Elastica\Exception\ClientException from line 26 of /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Connection/Strategy/Simple.php: No enabled connection
#0 /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Connection/ConnectionPool.php(86): Elastica\Connection\Strategy\Simple->getConnection(Array)
#1 /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Client.php(396): Elastica\Connection\ConnectionPool->getConnection()
#2 /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Client.php(512): Elastica\Client->getConnection()
#3 /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Search.php(348): Elastica\Client->request('enwiki_content/...', 'POST', Array, Array)
#4 /srv/mediawiki/php-1.40.0-wmf.24/extensions/CirrusSearch/includes/Elastica/SearchAfter.php(90): Elastica\Search->search()
#5 /srv/mediawiki/php-1.40.0-wmf.24/extensions/CirrusSearch/includes/Elastica/SearchAfter.php(70): CirrusSearch\Elastica\SearchAfter->runSearch()
#6 /srv/mediawiki/php-1.40.0-wmf.24/extensions/CirrusSearch/maintenance/DumpIndex.php(163): CirrusSearch\Elastica\SearchAfter->next()
#7 /srv/mediawiki/php-1.40.0-wmf.24/maintenance/includes/MaintenanceRunner.php(609): CirrusSearch\Maintenance\DumpIndex->execute()
#8 /srv/mediawiki/php-1.40.0-wmf.24/maintenance/doMaintenance.php(99): MediaWiki\Maintenance\MaintenanceRunner->run()
#9 /srv/mediawiki/php-1.40.0-wmf.24/extensions/CirrusSearch/maintenance/DumpIndex.php(288): require_once('/srv/mediawiki/...')
#10 /srv/mediawiki/multiversion/MWScript.php(118): require_once('/srv/mediawiki/...')
#11 {main}

We checked to see if enwiki-20230227-cirrussearch-content.json.gz was later completed; sadly, the file is still missing. Additionally, the wikidata files for the last run are missing; we received no errors or any other indication of why the wikidata files weren't completed.

The cirrussearch wikidata log from that day (20230220) looks like this; it shows the dump didn't complete:

Dumping 101983781 documents (101983781 in the index)
        2% done...
        ......
        30% done...
        32% done...
        34% done...
        36% done...

Event Timeline

Not sure yet. enwiki failed with curl error 23 (Failed writing body). The client library disabled the connection after this failure, meaning no retries were attempted and we went straight to the failure condition.

wikidata is more curious: I don't see any logs generated by MediaWiki around the time that it stopped running. Will have to look closer to see what's going on there.

Took a closer look at the wikidata failure, but I've turned up nothing. The output ending without any failure messages suggests to me that the process died, perhaps from a force kill or a segfault. I couldn't find anything in the system logs that correlates. Sadly, syslog doesn't go back that far (the oldest syslog entry is Feb 27, this died on Feb 22), and there's no certainty it would have had useful information anyway.

Getting back to the enwiki failure with error 23 (CURLE_WRITE_ERROR): this is also a bit curious. Typically, per Stack Overflow, this error occurs when an attempt to write to disk is rejected for lack of permissions or available space. That's not the case here; the output goes to an in-memory buffer, which shouldn't be able to have such issues. I poked through the libcurl source a bit to see what other conditions can generate this error, and there are actually quite a few, too many to review for relevance here without spending significant time.

I'm undecided on the correct fix for the CURLE_WRITE_ERROR; will talk with the team in our Wednesday meeting and see what we think.

We talked a bit about this. The plan right now is to prevent the library we use from disabling the last available connection. That should allow the retries to work as we expect regardless of the error type. It seems this connection disabling is better suited to cases that have many instances in their pool, rather than a single DNS name backed by LVS.
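
To illustrate the idea, here is a minimal Python sketch of a connection pool that either disables a connection on any failure or refuses to disable the last enabled one. It is not the Elastica code and not the actual patch; `Pool`, `keep_last`, `request_with_retries`, `make_flaky_send`, and the `search.example` host are all hypothetical names invented for the example.

```python
class NoEnabledConnection(Exception):
    """Raised when every connection in the pool has been disabled."""


class Connection:
    def __init__(self, host):
        self.host = host
        self.enabled = True


class Pool:
    """Hypothetical pool; keep_last=True models the proposed behaviour."""

    def __init__(self, hosts, keep_last=False):
        self.connections = [Connection(h) for h in hosts]
        self.keep_last = keep_last

    def get_connection(self):
        # Return the first enabled connection, or fail if none is left.
        for conn in self.connections:
            if conn.enabled:
                return conn
        raise NoEnabledConnection("No enabled connection")

    def on_failure(self, conn):
        enabled = [c for c in self.connections if c.enabled]
        # Proposed guard: never disable the only connection still enabled.
        if self.keep_last and enabled == [conn]:
            return
        conn.enabled = False


def request_with_retries(pool, send, attempts=3):
    """Call send(conn) up to `attempts` times, reporting failures to the pool."""
    last_error = None
    for _ in range(attempts):
        conn = pool.get_connection()  # raises once the pool is exhausted
        try:
            return send(conn)
        except IOError as err:
            last_error = err
            pool.on_failure(conn)
    raise last_error


def make_flaky_send(failures):
    """Build a send() that fails `failures` times, then succeeds (test helper)."""
    state = {"remaining": failures}

    def send(conn):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise IOError("simulated transient transport error")
        return "ok from " + conn.host

    return send
```

With a single-host pool and `keep_last=False`, the first transient error disables the only connection, so the second attempt raises `NoEnabledConnection` instead of retrying. With `keep_last=True`, the same transient error is retried against the same host and the request can succeed.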

Change 904615 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/Elastica@master] Don't disable the last connection after http error

https://gerrit.wikimedia.org/r/904615

Change 904615 merged by jenkins-bot:

[mediawiki/extensions/Elastica@master] Don't disable the last connection after http error

https://gerrit.wikimedia.org/r/904615

This has been deployed for a month without issues; let's hope this is resolved.