
mwscript.php/mctest.php does not know about memcache in both datacenters
Closed, ResolvedPublic

Description

Author: lcarr

(from rt4790) hume does not know about the eqiad memcache cluster, at least when running mwscript. Thus, the two memcache clusters get out of sync when running maintenance scripts on hume.

The real problem is that memcache maintenance is not location-aware. As we move into a multi-datacenter model, we need the option to target one memcache cluster or the other, or (even better) an option to target all of them.

In my head it would be something like

mwscript mctest.php --eqiad
mwscript mctest.php --sdtpa
mwscript mctest.php --all
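
As a rough sketch of what that could look like, here is a hypothetical MediaWiki maintenance script with a --datacenter option; the option name, the server addresses, and the per-DC pool layout are all illustrative assumptions, not existing code:

```php
<?php
// Hypothetical sketch: a maintenance script that targets one or all
// memcached clusters by datacenter. Option name, server lists, and
// pool layout are illustrative assumptions, not existing code.
require_once __DIR__ . '/Maintenance.php';

class McTestMultiDc extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addOption( 'datacenter',
			'Which memcached cluster to hit: eqiad, sdtpa, or all',
			false, true );
	}

	public function execute() {
		// Hypothetical per-DC pools, mirroring mc-eqiad.php / mc-pmtpa.php;
		// the addresses below are placeholders.
		$pools = [
			'eqiad' => [ '10.64.0.1:11211' ],
			'sdtpa' => [ '10.0.2.1:11211' ],
		];
		$dc = $this->getOption( 'datacenter', 'all' );
		$targets = ( $dc === 'all' ) ? $pools : [ $dc => $pools[$dc] ];

		foreach ( $targets as $name => $servers ) {
			// Write a probe key to each selected cluster and report per DC.
			$mc = new MemcachedPhpBagOStuff( [ 'servers' => $servers ] );
			$ok = $mc->set( 'mctest:probe', 'hello', 60 );
			$this->output( "$name: " . ( $ok ? 'OK' : 'FAILED' ) . "\n" );
		}
	}
}

$maintClass = McTestMultiDc::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```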


Version: wmf-deployment
Severity: critical
Whiteboard: deploysprint-13

Details

Reference
bz46428

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:38 AM
bzimport set Reference to bz46428.
bzimport added a subscriber: Unknown Object (MLST).

This isn't just about mctest.php. Regular maintenance scripts that execute memcached insertions/deletions should also hit both locations.

How're the memcache clusters arranged exactly? I'm worried this could lead to split-brain problems.

lcarr wrote:

They're configured here:
mediawiki-config/wmf-config/mc-eqiad.php
mediawiki-config/wmf-config/mc-pmtpa.php
https://wikitech.wikimedia.org/wiki/Memcached is actually up to date :)
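
For reference, a per-datacenter file of that kind might look roughly like this minimal sketch (server addresses are placeholders; the real mc-eqiad.php is more involved):

```php
<?php
// Hypothetical sketch of a per-datacenter memcached server list,
// in the style of wmf-config/mc-eqiad.php. Addresses are placeholders.
$wgMemCachedServers = [
	'10.64.0.180:11211',
	'10.64.0.181:11211',
	'10.64.0.182:11211',
];
```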

lcarr wrote:

Oh, and this has also definitely affected things, with memcache going unpurged for some fundraising purposes (which is how I discovered this). So Tampa and eqiad behave differently, specifically for the key centralauth-user-05ef03b8cf2f9261df5f7f52c7ec7b65

pgehres wrote:

(In reply to comment #4)

Oh, and this has also definitely affected things, with memcache going unpurged for some fundraising purposes (which is how I discovered this).

AFAIK, this does not affect fundraising currently, but I do know that fundraising has some plans for memcache and I assume they would want the same functionality between the payments clusters.

mwalker wrote:

Not exactly -- we will be using memcache for session storage and will continue using it for fraud analytics. In both cases, I'm not seeing a serious need for us to sync across clusters at this time -- we only ever have one active cluster, and if we lose messages in flight my opinion is 'oh well; we can pick them up in the audit'.

Ok, sounds like the proper fix is:

  1. forbid running mwscript on fenari or elsewhere in pmtpa while eqiad is primary, because IT CAN CAUSE BREAKAGE TO DO SO (split-brain memcache, bad cached items possibly being reinserted into databases) -- see the guard sketch after this list
  2. set up someplace in eqiad where mwscript can be run, so that maintenance scripts can be run WITHOUT BREAKAGE
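
One possible shape for such a guard, sketched in PHP; reading /etc/wikimedia-site to detect the local datacenter and hard-coding the primary are both assumptions made for illustration, not existing mwscript behavior:

```php
<?php
// Hypothetical guard for the mwscript wrapper: refuse to run maintenance
// scripts outside the primary datacenter. The local-DC detection via
// /etc/wikimedia-site and the hard-coded primary DC are illustrative only.
$localDc   = trim( @file_get_contents( '/etc/wikimedia-site' ) ?: 'unknown' );
$primaryDc = 'eqiad'; // would come from configuration in practice

if ( $localDc !== $primaryDc ) {
	fwrite( STDERR, "Refusing to run: '$localDc' is not the primary datacenter" .
		" ('$primaryDc'). Run this script from a $primaryDc host instead.\n" );
	exit( 1 );
}
```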

lcarr wrote:

Brion - I have to disagree with you there.

I think that in a multi-datacenter environment (which, even if we don't really have one yet, all our code should pretend we do), the location where you run a script should be unimportant (since we all know someone will mess up somehow); the script should either know about each location, prompt for the location as input, or just clear from all locations.

(In reply to comment #9)

Brion - I have to disagree with you there.

I think that in a multi-datacenter environment (which, even if we don't really have one yet, all our code should pretend we do), the location where you run a script should be unimportant (since we all know someone will mess up somehow); the script should either know about each location, prompt for the location as input, or just clear from all locations.

As brion suggested, it could be disallowed (the wrapper scripts could actually check for this and error out).

Hi folks, we'd be grateful if this issue could be resolved soon, as it is currently blocking the limited re-deployment of Article Feedback 5 on en-wiki, where the tool has now been completely unavailable for over two weeks. Is there anything Matthias and our E2 team could do to help reach a prompt resolution? Thanks in advance for any help you can provide towards that goal. :)

tchay wrote:

(In reply to comment #5)

Is this also an issue for Redis?

I don't believe so. Redis should be multi-datacenter aware.

Was told that this should probably wait until the Redis switchover, which will happen either this week or on Wed 10th of this month.

Thanks for the update, André. Do you know when we can realistically expect this issue to be solved? Who would be responsible for making and deploying the necessary revision? Can we do anything to help make this happen sooner rather than later? I would like to have an update for the folks on English Wikipedia, who have been patiently waiting for the AFT5 tool to be re-enabled for nearly 3 weeks now, even though we promised it would be back up 2 weeks ago. :(

(In reply to comment #13)

Was told that this should probably wait until the Redis switchover, which will happen either this week or on Wed 10th of this month.

What switchover? The idea of using Redis for caching was abandoned last year due to its LRU strategy. If you are referring to the job queue, I don't see those two as being related.

I've been trying to ping people about getting Terbium into a usable state. The directory permissions are still broken; it shouldn't be hard to get that working. At that point, MWScript.php will be usable, though the more convenient mwscript wrapper will still need some work, since there is no /home directory.

MWScript.php and mwscript are working now after some permission fixes by Peter and after https://gerrit.wikimedia.org/r/#/c/58124/, so I think terbium is starting to become usable. I'll probably start running some scripts on it today myself.

Lowering the priority of this bug and unassigning it from Aaron. We're cooking up a plan for a general deployment-tools sprint in another quarter or two, but right now the urgent priority is migrating from hume to terbium, so that we stop trying to use hume for eqiad updates.

See https://www.mediawiki.org/wiki/Site_performance_and_architecture#Roadmap for more context.

I'm not sure why this is being lowered; terbium coming online doesn't fix the fact that the script isn't location-aware.

This still needs to be fixed, and I assume that will fall to development.

RobLa: Would anyone in dev over there handle making this location aware? We need this for proper monitoring.

I don't understand why you would want a concept of per-DC memcached clusters in MediaWiki, when there is no replication. We're not planning on solving the split-brain problem at the application level. As far as I'm concerned, every actual bug was fixed in I5d64cec2 and Ib327d713 and so this can be closed.

(In reply to comment #9)

I think that in a multi-datacenter environment (which, even if we don't really have one yet, all our code should pretend we do), the location where you run a script should be unimportant (since we all know someone will mess up somehow); the script should either know about each location, prompt for the location as input, or just clear from all locations.

Since I5d64cec2/Ib327d713, it doesn't matter where you run a script. So this is fixed, isn't it?

I will take your collective silence to mean yes.