Page MenuHomePhabricator

Ensure apcu incr/decr are atomic (Upgrade php-apcu)
Closed, ResolvedPublic

Description

There is a known lack of atomicity in the apcu_incr and _decr functions as pointed out by @aaron at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/524078/.

This patch proposes a workaround that basically uses polls using CAS and non-blocking tokens and retries that up to 10 times. While that workaround might be necessary in master either way to support third parties on older versions, it'd be nice if we could avoid this in production at least.

Event Timeline

Krinkle created this task.Oct 29 2019, 3:03 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 29 2019, 3:03 PM

I asked upstream APCu maintainers what their recommendation is.

Upstream issue at https://github.com/krakjoe/apcu/issues/343

Their answer was: "Let us fix it properly instead."

A patch was written (https://github.com/krakjoe/apcu/pull/384) and has since been merged and released as php-apcu 5.1.18.

Krinkle moved this task from Untriaged to libs/objectcache on the MediaWiki-Cache board.
Krinkle moved this task from Limbo to Perf recommendation on the Performance-Team (Radar) board.
jijiki renamed this task from Ensure apcu incr/decr are atomic to Upgrade php-apcu to 5.1.18.Oct 30 2019, 5:40 AM
jijiki updated the task description. (Show Details)
Joe added a subscriber: Joe.Oct 30 2019, 6:02 AM

Instead of upgrading to a later version, we can backport that patch to the version we're using instead.

Joe added a comment.Oct 30 2019, 6:04 AM

We're on 5.1.17 so backporting the patch should be pretty simple.

jijiki renamed this task from Upgrade php-apcu to 5.1.18 to Ensure apcu incr/decr are atomic.Oct 30 2019, 6:14 AM
Krinkle renamed this task from Ensure apcu incr/decr are atomic to Ensure apcu incr/decr are atomic (Upgrade php-apcu).Dec 5 2019, 12:32 PM
Krinkle added a subscriber: jijiki.Jan 11 2020, 1:16 AM

@Joe @jijiki Backporting works for me :). We're currently dealing with a lot of on-going regressions some of which are slowly getting worse and over a prolonged period of time (several months).

At this point I'm looking to hopefully get a few general fixes out such as this one that will help more generally given its proven near impossible to track exactly which cascading chain of small changes got us to where we are. Most likely it will require a similar multi-month effort of random small minor optimisations to get back which we don't really have the resources for (but are doing regardless).

Is this something we can pencil in for 2019-20 Q3 (Jan-Mar) still? Or Q4 (Apr-Jun)?

Krinkle triaged this task as Medium priority.Jan 11 2020, 1:17 AM
Joe added a subscriber: RLazarus.Jan 11 2020, 7:25 AM

@Krinkle are there other fixes we'd like to backport apart from https://github.com/krakjoe/apcu/pull/384 ?

I've reviewed the commits since 5.1.17 and nothing else stands out to me. Perhaps 998157ac57ffe768a4023 is useful for best practices, but I haven't reviewed the code in question and am not sure if it's something we are affected by currently.

jijiki added a comment.EditedJan 17 2020, 12:47 PM

@Krinkle I backported [[ https://github.com/krakjoe/apcu/pull/384/commits/c8849b0894712d0ede63552fb99ca8dc4d3d884f | c8849b0894712d0ede63552fb99ca8dc4d3d884f]]and packaged it in php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2. You can find the package in your home directory on deployment-mediawiki-07 and install it. If we are happy with it, we can proceed with rolling it out to the canaries and then production.

I can additionally create another package and include 998157ac57ffe768a4023 aftrwards.

Mentioned in SAL (#wikimedia-releng) [2020-01-22T19:09:18Z] <Krinkle> Installing newer version of php-apcu on deployment-mediawiki-07, ref T236800

Mentioned in SAL (#wikimedia-releng) [2020-01-22T19:37:10Z] <Krinkle> Restore php-apcu version on deployment-mediawiki-07 to …wmf1, ref T236800

@jijiki All done. LGTM :)

Krinkle assigned this task to jijiki.Jan 22 2020, 10:33 PM

@Krinkle I can roll that out to the canaries, but finish production rollout until after all hands, does that sound good ?

jijiki added a comment.EditedFeb 4 2020, 11:49 AM

@Krinkle we can push the new package to the canaries this week if you are ok with it

aaron added a comment.Feb 4 2020, 8:09 PM

@Krinkle we can push the new package to the canaries this week if you are ok with it

Seems OK to me.

Mentioned in SAL (#wikimedia-operations) [2020-02-05T10:24:08Z] <effie> Upload php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 - T236800

Mentioned in SAL (#wikimedia-operations) [2020-02-05T15:24:25Z] <effie> Rollout php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 to api, app and jobrunner canaries - T236800

Mentioned in SAL (#wikimedia-operations) [2020-02-05T15:29:42Z] <effie> restart php-fpm on canaries - T236800

Mentioned in SAL (#wikimedia-operations) [2020-02-10T08:23:46Z] <effie> update php-apcu on codfw - T236800

Mentioned in SAL (#wikimedia-operations) [2020-02-10T08:32:52Z] <effie> update php-apcu on eqiad - T236800

jijiki closed this task as Resolved.Feb 10 2020, 9:34 AM

done, please reopen if something is not working as expected