getAllAbandonedJobs() does not work in production due to missing jobs within the <wiki>:jobqueue:<queue>:h-data key
Closed, Resolved (Public)

Description

I'm trying to debug an issue where several thousand CirrusSearch jobs have been marked as abandoned. Unfortunately, retrieving them via PHP to inspect them does not work. Some light debugging shows that the code does talk to Redis and does retrieve some data, but the end result is an empty iterator. I have not figured out exactly what causes this. It doesn't block my work, since I can extract some keys and query Redis manually, but it would be nice if this worked.

Test script:

$queue = JobQueueGroup::singleton()->get('cirrusSearchIncomingLinkCount');
$count = 0;
foreach ( $queue->getAllAbandonedJobs() as $job ) { $count++; }
var_dump( $queue->getAbandonedCount(), $count );

Expected output:

int(3103)
int(3103)

Actual output:

int(3103)
int(0)

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description. (Show Details)
EBernhardson added subscribers: EBernhardson, aaron.

My best guess for this is that the data about abandoned jobs has been lost.

redis 10.64.0.24:6379> zrange enwiki:jobqueue:cirrusSearchIncomingLinkCount:z-abandoned 0 10
 1) "ff20deb54d5c4a658bdf9197fa7d9dec"
 2) "4e87819a23684a2992d75164df67c728"
 3) "f923a043dbbd442485c4a4583231480c"
 4) "0a4eb9b77d1942d3a0d12d971d79a82b"
 5) "246fd21a6f6e448a98922dc09028061b"
 6) "29f80f374d2d49b2aa5206c96fdec8b1"
 7) "4b6872024c7349e3a5f8d48c8dac37fa"
 8) "5b2b95a3991a416e9b2dfdcff48c8160"
 9) "6509cbeb382f423ba97555e5e2978981"
10) "68a52a08271c49f3a790d5e6721c7a8f"
11) "7a48e06a77794a2da4732fb9d97de8f2"
redis 10.64.0.24:6379> hget enwiki:jobqueue:cirrusSearchIncomingLinkCount:h-data ff20deb54d5c4a658bdf9197fa7d9dec
(nil)
redis 10.64.0.24:6379> hget enwiki:jobqueue:cirrusSearchIncomingLinkCount:h-data 4e87819a23684a2992d75164df67c728
(nil)
redis 10.64.0.24:6379> hget enwiki:jobqueue:cirrusSearchIncomingLinkCount:h-data f923a043dbbd442485c4a4583231480c
(nil)

Running the script in the ticket description through strace confirms this is why getAllAbandonedJobs() returns no results.

sudo -u $MEDIAWIKI_WEB_USER strace -s 4096 -e trace=network php /srv/mediawiki-staging/multiversion/MWScript.php eval.php --wiki=enwiki < test.php > test.php.result 2>&1
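
The three hget lookups above are only a spot check. A rough way to see how many of the abandoned IDs have lost their payload is to walk the whole z-abandoned set and test each ID against h-data. The following is a minimal sketch, not part of MediaWiki: the function name and the REDIS_CLI variable are hypothetical, and it assumes redis-cli can reach the server without further authentication. It uses only the standard Redis commands ZRANGE and HEXISTS and the key layout shown in the transcript.

```shell
# Hypothetical helper: count abandoned job IDs whose payload is missing.
# Usage: count_orphaned_abandoned <wiki> <queue>
# REDIS_CLI is an assumed override for the redis-cli command line,
# for example REDIS_CLI="redis-cli -h 10.64.0.24 -p 6379".
count_orphaned_abandoned() (
    wiki=$1
    queue=$2
    cli=${REDIS_CLI:-redis-cli}
    zkey="${wiki}:jobqueue:${queue}:z-abandoned"
    hkey="${wiki}:jobqueue:${queue}:h-data"
    total=0
    missing=0

    # ZRANGE 0 -1 returns every member; when stdout is not a terminal,
    # redis-cli prints one bare member per line.
    ids=$($cli ZRANGE "$zkey" 0 -1)

    # Job IDs are hex strings, so plain word splitting is safe here.
    for id in $ids; do
        total=$((total + 1))
        # HEXISTS prints 1 if the field exists in the hash, 0 otherwise.
        if [ "$($cli HEXISTS "$hkey" "$id")" != "1" ]; then
            missing=$((missing + 1))
        fi
    done

    echo "abandoned=${total} missing_payload=${missing}"
)
```

Calling `count_orphaned_abandoned enwiki cirrusSearchIncomingLinkCount` prints one summary line. If missing_payload equals abandoned, every abandoned ID in that queue has lost its h-data entry, which would match the empty iterator seen in the test script. The function issues one HEXISTS round trip per ID, so it is slow on very large sets.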
EBernhardson renamed this task from getAllAbandonedJobs() does not work in production to getAllAbandonedJobs() does not work in production due to missing jobs within the <wiki>:jobqueue:<queue>:h-data key. Dec 15 2015, 8:11 PM
EBernhardson set Security to None.
Krinkle triaged this task as Medium priority. Dec 21 2015, 7:49 PM
Krinkle moved this task from Inbox, needs triage to Blocked (old) on the Performance-Team board.
Krinkle removed a project: Performance-Team.

Maybe you can debug this in vagrant by setting a low claimTTL in the jobchron config and by pushing and popping jobs but not ack'ing them.

mwscript on prod lists abandoned jobs for some queues, so this doesn't seem to affect all abandoned jobs.

aaron claimed this task.

https://gerrit.wikimedia.org/r/284488 should prevent this from happening.