Page MenuHomePhabricator

Ensure apcu incr/decr are atomic (Upgrade php-apcu)
Open, MediumPublic

Description

There is a known lack of atomicity in the apcu_incr and _decr functions as pointed out by @aaron at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/524078/.

This patch proposes a workaround that basically uses polls using CAS and non-blocking tokens and retries that up to 10 times. While that workaround might be necessary in master either way to support third parties on older versions, it'd be nice if we could avoid this in production at least.

Event Timeline

Krinkle created this task.Oct 29 2019, 3:03 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 29 2019, 3:03 PM

I asked upstream APCu maintainers what their recommendation is.

Upstream issue at https://github.com/krakjoe/apcu/issues/343

Their answer was: "Let us fix it properly instead."

A patch was written (https://github.com/krakjoe/apcu/pull/384) and has since been merged and released as php-apcu 5.1.18.

Krinkle moved this task from Untriaged to libs/objectcache on the MediaWiki-Cache board.
Krinkle moved this task from Limbo to Perf recommendation on the Performance-Team (Radar) board.
jijiki renamed this task from Ensure apcu incr/decr are atomic to Upgrade php-apcu to 5.1.18.Oct 30 2019, 5:40 AM
jijiki updated the task description. (Show Details)
Joe added a subscriber: Joe.Oct 30 2019, 6:02 AM

Instead of upgrading to a later version, we can backport that patch to the version we're using instead.

Joe added a comment.Oct 30 2019, 6:04 AM

We're on 5.1.17 so backporting the patch should be pretty simple.

jijiki renamed this task from Upgrade php-apcu to 5.1.18 to Ensure apcu incr/decr are atomic.Oct 30 2019, 6:14 AM
Krinkle renamed this task from Ensure apcu incr/decr are atomic to Ensure apcu incr/decr are atomic (Upgrade php-apcu).Dec 5 2019, 12:32 PM
Krinkle added a subscriber: jijiki.Sat, Jan 11, 1:16 AM

@Joe @jijiki Backporting works for me :). We're currently dealing with a lot of on-going regressions some of which are slowly getting worse and over a prolonged period of time (several months).

At this point I'm looking to hopefully get a few general fixes out such as this one that will help more generally given its proven near impossible to track exactly which cascading chain of small changes got us to where we are. Most likely it will require a similar multi-month effort of random small minor optimisations to get back which we don't really have the resources for (but are doing regardless).

Is this something we can pencil in for 2019-20 Q3 (Jan-Mar) still? Or Q4 (Apr-Jun)?

Krinkle triaged this task as Medium priority.Sat, Jan 11, 1:17 AM
Joe added a subscriber: RLazarus.Sat, Jan 11, 7:25 AM

@Krinkle are there other fixes we'd like to backport apart from https://github.com/krakjoe/apcu/pull/384 ?

I've reviewed the commits since 5.1.17 and nothing else stands out to me. Perhaps 998157ac57ffe768a4023 is useful for best practices, but I haven't reviewed the code in question and am not sure if it's something we are affected by currently.

jijiki added a comment.EditedFri, Jan 17, 12:47 PM

@Krinkle I backported [[ https://github.com/krakjoe/apcu/pull/384/commits/c8849b0894712d0ede63552fb99ca8dc4d3d884f | c8849b0894712d0ede63552fb99ca8dc4d3d884f]]and packaged it in php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2. You can find the package in your home directory on deployment-mediawiki-07 and install it. If we are happy with it, we can proceed with rolling it out to the canaries and then production.

I can additionally create another package and include 998157ac57ffe768a4023 aftrwards.