Page MenuHomePhabricator

Ensure apcu incr/decr are atomic (Upgrade php-apcu)
Closed, ResolvedPublic

Description

There is a known lack of atomicity in the apcu_incr and _decr functions as pointed out by @aaron at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/524078/.

This patch proposes a workaround that basically uses polls using CAS and non-blocking tokens and retries that up to 10 times. While that workaround might be necessary in master either way to support third parties on older versions, it'd be nice if we could avoid this in production at least.

Event Timeline

I asked upstream APCu maintainers what their recommendation is.

Upstream issue at https://github.com/krakjoe/apcu/issues/343

Their answer was: "Let us fix it properly instead."

A patch was written (https://github.com/krakjoe/apcu/pull/384) and has since been merged and released as php-apcu 5.1.18.

jijiki renamed this task from Ensure apcu incr/decr are atomic to Upgrade php-apcu to 5.1.18.Oct 30 2019, 5:40 AM
jijiki updated the task description. (Show Details)

Instead of upgrading to a later version, we can backport that patch to the version we're using instead.

We're on 5.1.17 so backporting the patch should be pretty simple.

jijiki renamed this task from Upgrade php-apcu to 5.1.18 to Ensure apcu incr/decr are atomic.Oct 30 2019, 6:14 AM
Krinkle renamed this task from Ensure apcu incr/decr are atomic to Ensure apcu incr/decr are atomic (Upgrade php-apcu).Dec 5 2019, 12:32 PM

@Joe @jijiki Backporting works for me :). We're currently dealing with a lot of on-going regressions some of which are slowly getting worse and over a prolonged period of time (several months).

At this point I'm looking to hopefully get a few general fixes out such as this one that will help more generally given its proven near impossible to track exactly which cascading chain of small changes got us to where we are. Most likely it will require a similar multi-month effort of random small minor optimisations to get back which we don't really have the resources for (but are doing regardless).

Is this something we can pencil in for 2019-20 Q3 (Jan-Mar) still? Or Q4 (Apr-Jun)?

Krinkle triaged this task as Medium priority.Jan 11 2020, 1:17 AM

I've reviewed the commits since 5.1.17 and nothing else stands out to me. Perhaps 998157ac57ffe768a4023 is useful for best practices, but I haven't reviewed the code in question and am not sure if it's something we are affected by currently.

@Krinkle I backported [[ https://github.com/krakjoe/apcu/pull/384/commits/c8849b0894712d0ede63552fb99ca8dc4d3d884f | c8849b0894712d0ede63552fb99ca8dc4d3d884f]]and packaged it in php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2. You can find the package in your home directory on deployment-mediawiki-07 and install it. If we are happy with it, we can proceed with rolling it out to the canaries and then production.

I can additionally create another package and include 998157ac57ffe768a4023 aftrwards.

Mentioned in SAL (#wikimedia-releng) [2020-01-22T19:09:18Z] <Krinkle> Installing newer version of php-apcu on deployment-mediawiki-07, ref T236800

Mentioned in SAL (#wikimedia-releng) [2020-01-22T19:37:10Z] <Krinkle> Restore php-apcu version on deployment-mediawiki-07 to …wmf1, ref T236800

@Krinkle I can roll that out to the canaries, but finish production rollout until after all hands, does that sound good ?

@Krinkle we can push the new package to the canaries this week if you are ok with it

@Krinkle we can push the new package to the canaries this week if you are ok with it

Seems OK to me.

Mentioned in SAL (#wikimedia-operations) [2020-02-05T10:24:08Z] <effie> Upload php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 - T236800

Mentioned in SAL (#wikimedia-operations) [2020-02-05T15:24:25Z] <effie> Rollout php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 to api, app and jobrunner canaries - T236800

Mentioned in SAL (#wikimedia-operations) [2020-02-05T15:29:42Z] <effie> restart php-fpm on canaries - T236800

done, please reopen if something is not working as expected