Page MenuHomePhabricator

PoolCounter using Redis backend doesn't release locks
Closed, ResolvedPublicBUG REPORT

Description

Having a PoolCounter configuration like this:

$wgPoolCounterConf = [
        'ArticleView' => [
                'class' => 'PoolCounterRedis',
                'timeout' => 15, // wait timeout in seconds
                'workers' => 2, // maximum number of active threads in each pool
                'maxqueue' => 5, // maximum number of total threads in each pool
                'servers' => [ '127.0.0.1:6380' ],
                'redisConfig' => $redisConfig
        ],
];
  1. Visit a page that is explicitly not saved in the parser cache (this can be caused by an extension, etc).
  2. Reload the page 3 times in quick succession.
  3. You get an error message about too may users trying to access the page, and "Timeout waiting for the lock"

This can be reproduced consistently in MediaWiki 1.38 and 1.39.

The error clears itself after several seconds (15, apparently)

After adding code to debug the issue on the relevant file https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/poolcounter/PoolCounterRedis.php, my conclusion is that locks are acquired but not released:

  • Adding debug code in the waitForSlotOrNotif method to check which slot is acquired, I get: 1, 2 and QUEUE_WAIT.
  • Debugging values passed to initAndPopPoolSlotList and release I get values like:
    • ACQUIRE "poolcounter:l-slots-c11eb30fb49572ea76e23b357c5a7644c15c1104-2" "poolcounter:z-renewtime-c11eb30fb49572ea76e23b357c5a7644c15c1104-2" "poolcounter:z-wait-c11eb30fb49572ea76e23b357c5a7644c15c1104-2", "2" "5" "15" "60" "" "1673473413.7754"
    • RELEASE: "poolcounter:l-slots-c11eb30fb49572ea76e23b357c5a7644c15c1104-2" "poolcounter:z-renewtime-c11eb30fb49572ea76e23b357c5a7644c15c1104-2" "poolcounter:l-wakeup-c11eb30fb49572ea76e23b357c5a7644c15c1104-2" "poolcounter:z-wait-c11eb30fb49572ea76e23b357c5a7644c15c1104-2" , "2" "60" "1" "1673473413" "1" "1673473414.1703"
  • Debugging the lua code itself, at the release method, I see the execution always enters the branck of code with the comment -- Slot lock expired and was released already
    • This is caused by rSlotTime not matching the original rTime value in the lock acquisition. Compare this code:
      • ACQUIRE: redis.call('zAdd',kSlotsNextRelease,rTime + rExpiry,slot) -- rTime = "1673473413.7754"
      • RELEASE: redis.call('zScore',kSlotsNextRelease,rSlot) ~= (rSlotTime + rExpiry) -- rSlotTime = "1673473413"

The problem is, one value is a float/double, and the other value is an integer, and the comparison doesn't match.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 879656 had a related patch set uploaded (by Martineznovo; author: Martineznovo):

[mediawiki/core@master] PoolCounterRedis: Fix wrong cast, locks weren't being released

https://gerrit.wikimedia.org/r/879656

Change 879656 merged by jenkins-bot:

[mediawiki/core@master] PoolCounterRedis: Fix wrong cast, locks weren't being released

https://gerrit.wikimedia.org/r/879656

Change 885779 had a related patch set uploaded (by Martineznovo; author: Martineznovo):

[mediawiki/core@REL1_39] PoolCounterRedis: Fix wrong cast, locks weren't being released

https://gerrit.wikimedia.org/r/885779

Change 885779 merged by jenkins-bot:

[mediawiki/core@REL1_39] PoolCounterRedis: Fix wrong cast, locks weren't being released

https://gerrit.wikimedia.org/r/885779