Page MenuHomePhabricator

On HHVM, "CACHE_MEMCACHED" shortcut should select native client (default PHP implementation has issues)
Closed, DeclinedPublic

Description

People have been unable to login or make edits in translatewiki.net.

After some investigation it was found that there are memcached timeout errors that effectively make caching broken, and this is most visible for sessions. The fault seems to be with hhvm using the PHP memcached client. Issue does not happen with memcached-pecl client.

Memcached 1.4.25-2ubuntu1.2
HHVM 3.22.1~xenial

I see multiple possible actions:

  • CACHE_MEMCACHED should default to native memcached client under HHVM and not the PHP implementation
  • Investigate the root cause and fix it (possibly an upstream issue)
  • Issue a warning (runtime, release notes...) for this combination
  • Do nothing and drop official support for HHVM

Event Timeline

Saying "you are disconnected" while saving my translated text on translatewiki.net

Échec de l’enregistrement de la traduction : Assertion that the user is logged in failed.

Previous issue (similar behaviour):

There has been issues with sessions that caused edits or logging in to fail for some users. The problem started around 20.00 UTC, possible even earlier, on the previous day. Possible fix has been applied, but please notify project admins if you continue to encounter this issue. (Other news …)

Sessions are rejected:
Votre session de connexion semble avoir des problèmes ; cette action a été annulée en prévention d'un piratage de session. Veuillez cliquer sur « Précédent », rechargez la page d'où vous venez, puis réessayez.

Did you try logging out and in again?

... out and in, and in and out, and new tab ,and new browser , and empty cache , and change browse and .... finally quit all.

More fluent at this time [Paris time 17;14] cxion allowed and save done successfully on translatewiki.net.

... out and in, and in and out, and new tab ,and new browser , and empty cache , and change browse and .... finally quit all.

More fluent at this time [Paris time 17;14] cxion allowed and save done successfully on translatewiki.net.

I even tried clearing my cookie and ipconfig /flushdns, but still don't help, now CSRF errors blocked me from using Translate interface. So is this task UBN now?

I changed some configuration after my last comment here. Please try log out & log in once more.

Aklapper renamed this task from connexion issue while saving my translations to States "you are disconnected" while saving my translated text on translatewiki.net.Oct 15 2017, 5:33 PM

disconection is quasi systematic now after a few minutes of connection and when saving translated text ; its 21:50 Paris time
some translation saves succeed but are not taken into account.
Note that the screen is not refreshed and the indication 'connected' in the upper right corner still remains after the failure poppup

After a connection, free navigation is allowed on any link of the page excepted when you acknowledge a rereading or validate your translated text where it fails.

I changed some configuration after my last comment here. Please try log out & log in once more.

Still CSRF blocked when editing e.g. InternetArchiveBot and XTools, OK now for most others.

Hi all, this morning [8:30 Paris time] :

After connexion and from a remote, separate PC + cleaned cache + latest Firefox updated:

> first immediate translation is/are accepted; ex: https://translatewiki.net/wiki/iNaturalist:Unblock/fr

BUT a few seconds later the same disconnection process applies. This is very unstable and prevent from continuing the translations for now.

Nikerabbit triaged this task as Medium priority.Oct 17 2017, 7:55 AM

There is a constant churn of messages like:

2017-10-17 07:31:49 web1.translatewiki.net translatewiki_net-bw_: SessionBackend "tt1r12ucd8kgk6hg9hfgevl59hbbtp1q" is unsaved, marking dirty in constructor
2017-10-17 07:31:49 web1.translatewiki.net translatewiki_net-bw_: SessionBackend "5nr87v3l5nj1o2lsi10fu8n61ot4kmh4" metadata dirty due to ID reset (formerly "tt1r12ucd8kgk6hg9hfgevl59hbbtp1q")
2017-10-17 07:31:49 web1.translatewiki.net translatewiki_net-bw_: SessionBackend "5nr87v3l5nj1o2lsi10fu8n61ot4kmh4" save: dataDirty=1 metaDirty=1 forcePersist=0
2017-10-17 07:31:49 web1.translatewiki.net translatewiki_net-bw_: SessionBackend "5nr87v3l5nj1o2lsi10fu8n61ot4kmh4" data dirty due to addData(): call_user_func_array/Closure$MediaWiki\Session\SessionBackend::delaySave/MediaWiki\Session\SessionBackend->save/MediaWiki\Session\CookieSessionProvider->persistSession/MediaWiki\Session\SessionBackend->addData
2017-10-17 07:31:49 web1.translatewiki.net translatewiki_net-bw_: SessionBackend "5nr87v3l5nj1o2lsi10fu8n61ot4kmh4" data dirty due to addData(): call_user_func_array/Closure$MediaWiki\Session\SessionBackend::delaySave/MediaWiki\Session\SessionBackend->save/MediaWiki\Session\CookieSessionProvider->persistSession/MediaWiki\Session\SessionBackend->addData
2017-10-17 07:31:49 web1.translatewiki.net translatewiki_net-bw_: SessionBackend "5nr87v3l5nj1o2lsi10fu8n61ot4kmh4" data dirty due to addData(): call_user_func_array/Closure$MediaWiki\Session\SessionBackend::delaySave/MediaWiki\Session\SessionBackend->save/MediaWiki\Session\CookieSessionProvider->persistSession/MediaWiki\Session\SessionBackend->addData
2017-10-17 07:31:49 web1.translatewiki.net translatewiki_net-bw_: SessionBackend "5nr87v3l5nj1o2lsi10fu8n61ot4kmh4" save: dataDirty=1 metaDirty=1 forcePersist=0
2017-10-17 07:31:50 web1.translatewiki.net translatewiki_net-bw_: SessionBackend "5nr87v3l5nj1o2lsi10fu8n61ot4kmh4" Taking over PHP session
2017-10-17 07:31:50 web1.translatewiki.net translatewiki_net-bw_: Saving all sessions on shutdown

I tried switching the backend between CACHE_DB and CACHE_MEMCACHED to no effect.

After enabling debug logging for memcached I see a lot of:

Oct 17 09:49:49 web1.translatewiki.net systemd-memcached-wrapper[2179]: <33 gets translatewiki_net-bw_:MWSession:rjlfd5u4qj70uaqjs4tod83johiag325
Oct 17 09:49:49 web1.translatewiki.net systemd-memcached-wrapper[2179]: > FOUND KEY translatewiki_net-bw_:MWSession:rjlfd5u4qj70uaqjs4tod83johiag325
Oct 17 09:49:49 web1.translatewiki.net systemd-memcached-wrapper[2179]: >33 sending key translatewiki_net-bw_:MWSession:rjlfd5u4qj70uaqjs4tod83johiag325
Oct 17 09:49:49 web1.translatewiki.net systemd-memcached-wrapper[2179]: <33 gets translatewiki_net-bw_:MWSession:kfq64fljdp3gsbmm89hr2ac9b2mnco41
Oct 17 09:49:49 web1.translatewiki.net systemd-memcached-wrapper[2179]: > NOT FOUND translatewiki_net-bw_:MWSession:kfq64fljdp3gsbmm89hr2ac9b2mnco41
Oct 17 09:49:51 web1.translatewiki.net systemd-memcached-wrapper[2179]: <33 gets translatewiki_net-bw_:MWSession:v8c03uci37igu23m4r8c5tdglniqdnvd
Oct 17 09:49:51 web1.translatewiki.net systemd-memcached-wrapper[2179]: > NOT FOUND translatewiki_net-bw_:MWSession:v8c03uci37igu23m4r8c5tdglniqdnvd
Oct 17 09:49:51 web1.translatewiki.net systemd-memcached-wrapper[2179]: <34 gets translatewiki_net-bw_:MWSession:k9tmta5edc233m524vk4jcvc4apc1e6m
Oct 17 09:49:51 web1.translatewiki.net systemd-memcached-wrapper[2179]: > NOT FOUND translatewiki_net-bw_:MWSession:k9tmta5edc233m524vk4jcvc4apc1e6m

In other words, lots of misses, some hit and no set at all. How are these stored in the first place?

Nikerabbit raised the priority of this task from Medium to Unbreak Now!.Oct 17 2017, 8:12 AM

Not being able to make translations is pretty important. Some changes are going through, but given the numerous reports I suspect this is affecting a lot of users.

Meta comment: It would be nice if logging would have the following properties:

  • Output entries to the general error log when something is broken
  • Be clear enough to understand what the error messages mean: are they severe, unexceptional or expected to happen on normal use
  • Be clear enough to understand what would be the appropriate action to take: change setting, how to debug more, point out what is likely the cause

None of that is the case currently, and I am at loss how to debug this further.

[13:10 Paris time] more stable over 15minutes long; no disconnexion => better :-)

I had restarted memcached (for debugging) and increased maximum memory in case it was getting full. But I have doubts whether this is the cause because I still see similar log entries.

ok, fast responses for review ack and translations => stable for now. Thanks.

very stable; normal behaviour observed.

Nikerabbit lowered the priority of this task from Unbreak Now! to High.Oct 19 2017, 6:08 PM

Lowering priority due to lack of new reports, but I am not sure this is solved yet.

new disturbance observed 00.15 Paris time

Lowering priority due to lack of new reports, but I am not sure this is solved yet.

Still Invalid CSRF token in many messages on the new created Dissemin project:

pasted_file (917×1 px, 95 KB)

Maybe off topic, but this also blocked me from using Special:EditWatchlist (shows a white screen!)

OMFG! By re-logging recently:

pasted_file (917×1 px, 85 KB)

Nikerabbit raised the priority of this task from High to Unbreak Now!.Oct 25 2017, 4:25 AM

I think the issue is caused by [memcached] Memcached error: timeout writing to 127.0.0.1:11211 which seems to also disable reading from it.

More specifically, with manually added debugging (since get_multi requests do not show in the debug log!):

[memcached] Writing `gets WANCache:v:global:user:id:translatewiki_net-bw_:1
`;WebRequest-&gt;getSession/MediaWiki\Session\SessionManager-&gt;getSessionForRequest/MediaWiki\Session\SessionManager-&gt;getSessionInfoForRequest/MediaWiki\Session\CookieSessionProvider-&gt;provideSessionInfo/MediaWiki\Session\UserInfo::newFromId/User-&gt;load/User-&gt;loadFromId/User-&gt;loadFromCache/WANObjectCache-&gt;getWithSetCallback/WANObjectCache-&gt;doGetWithSetCallback/WANObjectCache-&gt;get/WANObjectCache-&gt;getMulti/MemcachedPhpBagOStuff-&gt;getMulti/MemcachedClient-&gt;get_multi/MemcachedClient-&gt;_fwrite
[memcached] Memcached error: timeout writing to 127.0.0.1:11211

Increasing the timeout doesn't help.

It's not a problem with memcached: restarting hhvm helps. But also other hhvm instance works, and telnet works. So I guess this is a bug with HHVM and/or with MemcachedClient.

Out of curiosity is there a reason you're using the PHP memcache implementation and not the built-in HHVM extension?

Good question. The config we have is simply $wgMainCacheType = CACHE_MEMCACHED;. I would expect that to pick the best implementation. Is memcached-pecl the setting to use the built-in HHVM extension?

instability observed again; Paris time 1:34 PM

Thanks for the report. I have restarted hhvm again as a workaround. It seems it has been going on for a while again:

2017-10-29 19:04:25 web1.translatewiki.net translatewiki_net-bw_: Memcached error: timeout reading line from 127.0.0.1:11211
2017-10-29 19:04:25 web1.translatewiki.net translatewiki_net-bw_: Memcached error: timeout writing to 127.0.0.1:11211

I'm considering adding daily restarts while this issue is being investigated.

I enabled memcached-pecl now for testing, as it took less than 24 hours to start seeing those errors again.

First observattions:

  1. Wow it has very noisy logging even with $wgMemcachedDebug = false;
  2. Notice: Unable to unserialize: [1]. Unexpected end of buffer during unserialization. in /srv/mediawiki/tags/2017-10-31_06:43:17/includes/libs/objectcache/MemcachedPeclBagOStuff.php on line 143
Nikerabbit renamed this task from States "you are disconnected" while saving my translated text on translatewiki.net to Memcached issues on translatewiki.net.Oct 31 2017, 7:20 AM
Nikerabbit claimed this task.
Nikerabbit updated the task description. (Show Details)
Nikerabbit moved this task from Backlog to In Progress on the Language-2017-Oct-Dec board.

Those PHP errors went away when I restarted memcached – I suppose those clients use a slightly different way to serialize data.

I haven't seen new errors since then, but this must run for couple of days more without errors before I declare that I have found a working configuration to work around the original issue.

Nikerabbit lowered the priority of this task from Unbreak Now! to High.Nov 3 2017, 8:54 AM

Memcached seems to be filling much slower with memcached-pecl, unless something else changes. Will consider this fix stable if it runs without any issues to next week.

I'm declaring my workaround stable for now, and re-purpose this task as general MediaWiki issue.

Nikerabbit renamed this task from Memcached issues on translatewiki.net to PHP memcached implementation stops working under HHVM.Nov 6 2017, 9:49 AM
Nikerabbit removed Nikerabbit as the assignee of this task.
Nikerabbit updated the task description. (Show Details)

Very stable during all this month long . This problem has no longer appeared. Satisfied. Thank you Nk.

so it this task still "high priority"?

That would be for the platform team to decide. My guess would be that they deprioritize HHVM related things.

Legoktm lowered the priority of this task from High to Low.Dec 10 2017, 10:12 PM
Legoktm added a project: HHVM.

CACHE_MEMCACHED should default to native memcached client under HHVM and not the PHP implementation

I think a patch for this would be good enough to resolve this.

Krinkle subscribed.

@Liuxinyu970226 I do not see any vulnerability mentioned in this task.

Krinkle renamed this task from PHP memcached implementation stops working under HHVM to On HHVM, "CACHE_MEMCACHED" shortcut should select native client (default PHP implementation has issues).Jul 28 2018, 8:01 PM

@Krinkle But the CSRF token errors really still shows to me, just when I'm trying log out and re-log in twn, frankly trying log in 25 times, 24 sessions are CSRF-blocked, and one "successed", however even this one success, I can only translate not more than 20 messages, and after that I got Invalid CSRF token warning when continue translating (sadly, this is why I'm generally retired on that site). I just wanna ask: where to fix this? (Do not answer me to ipconfig /flushdns, that's unhelpful for anything I met).

@Liuxinyu970226 Translatewiki.net no longer runs on HHVM, so if you are having sessions problems lately, it is something else (nobody has reported such problems, so might want to check your browser first).

@Krinkle But the CSRF token errors really still shows to me, [..]

An error that fails CSRF when it should pass, means we are rejecting more attempts than we should. That's okay from a security perspective.

It works internally by comparing the token your browser has with the token we store internally. When there is a problem with the cache, the token we store internally is "forgotten", which means for a short time, several "valid" attempts are rejected because we cannot compare the tokens when our internal one is "forgotten".

If you see the opposite - you try something you are not allowed to do, and it succeeds, then it is a security problem.