
get a snapshot instance running in beta with stretch, php7
Closed, Resolved · Public

Description

This will let us test all the pieces needed for dumps, at any rate, before merging a php7 mw manifest into production.

Event Timeline

ArielGlenn triaged this task as Medium priority. Jan 5 2018, 9:38 AM
ArielGlenn created this task.
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.

First attempt at applying https://gerrit.wikimedia.org/r/#/c/394977/ to deployment-snapshot01: php-wikidiff2 and php-luasandbox are both available for stretch, but they require php5. Hrm.
Also a bunch of hhvm stuff failed to install; I don't really want any of that anyway, and it's irrelevant for this testing.

Opened T184270 for rebuild of the two extensions. Although dumps can run without them, we'll want them soon enough.

With a whole lot of help from Thcipriani (deliberately not pinged on this ticket), got scap going for mediawiki and for the dumps repo. Docs to come soon. Next up: there are only 5GB left, so getting the 20GB /dev/vda4 formatted and mounted somewhere via puppet would be nice.
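
For the record, a minimal sketch of the manual equivalent of what puppet needs to do (the filesystem type is my assumption; the mountpoint matches where dumps output lands later in this ticket):

# one-off sketch; in beta this should really come from puppet
mkfs.ext4 /dev/vda4
mkdir -p /mnt/dumpsdata
mount /dev/vda4 /mnt/dumpsdata
echo '/dev/vda4 /mnt/dumpsdata ext4 defaults 0 2' >> /etc/fstab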

I have that volume mounted in the right place now, and have picked up php-luasandbox from stretch-backports (even though we won't use it); puppet still whines about php-wikidiff2, but again, we don't have to have it right now for testing. Next up: check all the php.ini settings and see that they make sense, since we just reused the ones from the php5 manifests. Then maybe try a test or two.
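
A quick way to sanity-check those settings, assuming the standard Debian ini locations (the paths here are my guess, not taken from the manifests):

# compare what the php5 manifest shipped against what php7.0 ships
diff /etc/php5/cli/php.ini /etc/php/7.0/cli/php.ini
# spot-check a few values from the running interpreter, for example:
php7.0 -i | grep -E 'memory_limit|max_execution_time|date.timezone'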

Had to do a horrible hack to work around etcd using the deployment-puppetmaster0 cert, while my snapshot instance is politely on a separate puppetmaster so I can cherry-pick things there without applying them to the rest of the cluster. Gross, but done.

Now running into Sentry logging problems as per T184359; that one is going to be hard to track down. Might try some of the other dump steps in the meantime and see if any of them get past it.

Legoktm is fast! Fixed the Sentry call, and it turns out the error was very easy to track down, because it's this: T184177

Completed a successful run of enwikinews on beta, one dump step at a time, except for abstracts, which is still broken.
Did the steps in order, using the syntax below and varying the job name accordingly:

dumpsgen@deployment-snapshot01:/srv/deployment/dumps/dumps/xmldumps-backup$ python ./worker.py --configfile /etc/dumps/confs/wikidump.conf.labs --job metahistory7zdump enwikinews

Next up: try one of the "big wikis", step by step so we can skip abstracts. Big wikis on beta are: simplewiki, enwiki, wikidatawiki.
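
Roughly like this, leaving out abstracts (a sketch; the job names here are from memory and may not match the config exactly):

cd /srv/deployment/dumps/dumps/xmldumps-backup
for job in xmlstubsdump articlesdump metacurrentdump metahistorybz2dump metahistory7zdump; do
    python ./worker.py --configfile /etc/dumps/confs/wikidump.conf.labs --job "$job" enwiki
done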

I randomly chose enwiki for the large run. Tables, stubs, and page content dumps all ran fine. Waiting for the flow dumps to finish up; I'm loath to shoot it, as it's been an hour and it's still producing output. Without prefetch it's painfully slow, and this is the first run (prefetch isn't possible yet; there are tasks pending for that).

Because I posted the above comment, flow finished! I've run everything else except abstracts, and am now on flowhistory. I expect that to take another hour; really I'm just running it so we'll have it for prefetch testing in the future.

Flow history run completed in just about exactly an hour. Run is complete except for abstracts. Starting to write up install/setup and testing instructions here: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Dumps and https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Dumps/Setup_notes

Flow dumps for enwiki produced this message for both the regular and the flow history dumps, once at the beginning of the run and once at the end. Hard to tell from the error message which redis service it's trying to reach; guess I'll go poke around.

Jan  7 17:52:48 deployment-snapshot01 php7.0: PHP Warning:  Redis::connect(): connect() failed: No such file or directory in /srv/mediawiki/php-master/includes/libs/redis/RedisConnectionPool.php on line 238

Found the redis issue.
Run:

/srv/mediawiki/multiversion/MWScript.php extensions/Flow/maintenance/dumpBackup.php --wiki=hewiki --current --report=1000 --output=bzip2:/mnt/dumpsdata/xmldatadumps/temp/cawiki-flow-crap.bz2

Get result in logstash:

Could not connect to server "/var/run/nutcracker/redis_eqiad.sock

So I guess that needs some checking.
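
A couple of quick things to check, assuming the socket path from the logstash message (note that nutcracker only proxies a subset of the redis protocol, so the ping may or may not be answered):

# does the socket even exist?
ls -l /var/run/nutcracker/
# if so, see whether anything answers on it
redis-cli -s /var/run/nutcracker/redis_eqiad.sock ping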

It seems that the /run/nutcracker directory won't get created until a reboot after the first complete puppet run (see https://phabricator.wikimedia.org/T178457 and https://gerrit.wikimedia.org/r/#/c/384980/ for discussion). I couldn't be bothered to do that, so instead I ran

root@deployment-snapshot01:/var/log/nutcracker# systemd-tmpfiles --create  nutcracker.conf

and the next puppet run forced a restart.
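
For anyone else hitting this, the tmpfiles entry and the result can be verified like so (the config may live under /etc/tmpfiles.d or /usr/lib/tmpfiles.d depending on how it was installed):

cat /etc/tmpfiles.d/nutcracker.conf 2>/dev/null || cat /usr/lib/tmpfiles.d/nutcracker.conf
ls -ld /run/nutcracker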

Next up: from that same flow command

dumpsgen@deployment-snapshot01:~$ /usr/bin/php7.0 /srv/mediawiki/multiversion/MWScript.php extensions/Flow/maintenance/dumpBackup.php --wiki=hewiki --current --report=1000 --output=bzip2:/mnt/dumpsdata/xmldatadumps/temp/cawiki-flow-crap.bz2

I see a ton of

Warning: Memcached::getMulti(): could not decompress value: unrecognised encryption type in /srv/mediawiki/php-master/includes/libs/objectcache/MemcachedPeclBagOStuff.php     on line 235
Warning: Memcached::get(): could not decompress value: unrecognised encryption type in /srv/mediawiki/php-master/includes/libs/objectcache/MemcachedPeclBagOStuff.php on line 143

Some CentralAuth-related lookup issues, apparently. Here's the stack trace:

#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-master/includes/libs/objectcache/MemcachedPeclBagOStuff.php(235): Memcached->getMulti(array)
#2 /srv/mediawiki/php-master/includes/libs/objectcache/WANObjectCache.php(333): MemcachedPeclBagOStuff->getMulti(array)
#3 /srv/mediawiki/php-master/includes/libs/objectcache/WANObjectCache.php(278): WANObjectCache->getMulti(array, array, array, array)
#4 /srv/mediawiki/php-master/includes/libs/objectcache/WANObjectCache.php(1121): WANObjectCache->get(string, NULL, array, NULL)
#5 /srv/mediawiki/php-master/includes/libs/objectcache/WANObjectCache.php(1062): WANObjectCache->doGetWithSetCallback(string, integer, Closure, array, NULL)
#6 /srv/mediawiki/php-master/extensions/CentralAuth/includes/CentralAuthUser.php(515): WANObjectCache->getWithSetCallback(string, integer, Closure, array)
#7 /srv/mediawiki/php-master/extensions/CentralAuth/includes/CentralAuthUser.php(371): CentralAuthUser->loadFromCache()
#8 /srv/mediawiki/php-master/extensions/CentralAuth/includes/CentralAuthUser.php(546): CentralAuthUser->loadState()
#9 /srv/mediawiki/php-master/extensions/CentralAuth/includes/CentralAuthIdLookup.php(83): CentralAuthUser->getId()
#10 /srv/mediawiki/php-master/extensions/CentralAuth/includes/CentralAuthIdLookup.php(89): CentralAuthIdLookup->isAttached(User)
#11 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(407): CentralAuthIdLookup->centralIdFromLocalUser(User, integer)
#12 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(368): Flow\Dump\Exporter->formatRevision(Flow\Model\Header)
#13 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(247): Flow\Dump\Exporter->formatRevisions(Flow\Model\Header)
#14 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(200): Flow\Dump\Exporter->formatHeader(Flow\Model\Header)
#15 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(182): Flow\Dump\Exporter->formatWorkflow(Flow\Model\Workflow, Flow\Search\Iterators\HeaderIterator, Flow\Search\Iterators\TopicIterator)
#16 /srv/mediawiki/php-master/extensions/Flow/maintenance/dumpBackup.php(86): Flow\Dump\Exporter->dump(BatchRowIterator)
#17 /srv/mediawiki/php-master/extensions/Flow/maintenance/dumpBackup.php(59): FlowDumpBackup->dump(integer)
#18 /srv/mediawiki/php-master/maintenance/doMaintenance.php(94): FlowDumpBackup->execute()
#19 /srv/mediawiki/php-master/extensions/Flow/maintenance/dumpBackup.php(130): require_once(string)
#20 /srv/mediawiki/multiversion/MWScript.php(100): require_once(string)
#21 {main}

Here are info-level log entries from CentralAuth right afterwards:

Loading CentralAuthUser for user Mooeypoo from cache object
Loading CentralAuthUser for user Mattflaschen from cache object

And finally:

Memcached error for key "WANCache:v:global:centralauth-user:<XXXX redacted by me>" on server "127.0.0.1:11212": SOME ERRORS WERE REPORTED

These are all for the same request id.

> Some CentralAuth-related lookup issues, apparently.

Looks more like something is blowing up when trying to access memcached, not necessarily related to CentralAuth. You might start by enabling more logging in memcached to see what's behind that unhelpful "SOME ERRORS WERE REPORTED".

> Looks more like something is blowing up when trying to access memcached, not necessarily related to CentralAuth. You might start by enabling more logging in memcached to see what's behind that unhelpful "SOME ERRORS WERE REPORTED".

You are correct:

Memcached error for key "WANCache:v:global:lag-times:1:deployment-db03:0-1" on server "127.0.0.1:11212": SOME ERRORS WERE REPORTED

also from snapshot01.

I thought I'd at least verify that memcached is accessible and working from snapshot01:

root@deployment-snapshot01:~# telnet 10.68.23.25 11211
Trying 10.68.23.25...
Connected to 10.68.23.25.
Escape character is '^]'.
stats items
STAT items:3:number 746
STAT items:3:age 5951356
STAT items:3:evicted 0
STAT items:3:evicted_nonzero 0
STAT items:3:evicted_time 0
...
STAT items:173:number 1
STAT items:173:age 13744444
STAT items:173:evicted 0
STAT items:173:evicted_nonzero 0
STAT items:173:evicted_time 0
STAT items:173:outofmemory 0
STAT items:173:tailrepairs 0
STAT items:173:reclaimed 0
STAT items:173:expired_unfetched 0
STAT items:173:evicted_unfetched 0
STAT items:173:crawler_reclaimed 0
STAT items:173:lrutail_reflocked 0
STAT items:174:number 1
STAT items:174:age 3655892
STAT items:174:evicted 0
...

OK, so let's see some key that is in there:

...
stats cachedump 173 2
ITEM enwiki:prepared-edit:6463829d079143101c3f30d8eb056cce:b4c18ed81fcd98cf6875a8e9ca016c29b796330c:80bbc50584c0c6886fab1ad282cfee51 [493921 b; 1501695739 s]
END

Great, so let's get the value:

...
get enwiki:prepared-edit:6463829d079143101c3f30d8eb056cce:b4c18ed81fcd98cf6875a8e9ca016c29b796330c:80bbc50584c0c6886fab1ad282cfee51
END

No value given? Why? I tried with a smaller object, to make sure it wasn't the size in bytes:

...
stats cachedump 6 2
ITEM enwiki:limiter:move:user:12414 [1 b; 1515412445 s]
END
get enwiki:limiter:move:user:12414
END

Same result; I'm stumped. To a memcached person it's probably something obvious.
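
One thing worth ruling out (my speculation, not verified here): stats cachedump will list items whose expiry time has already passed, while get checks the expiry and returns nothing for them. The second number in brackets on the ITEM line is the expiry as a unix timestamp, so it can be compared directly:

# expiry stamp from the ITEM line above: 1501695739 is early August 2017,
# well in the past by the time of this test, so END from get would be expected
date +%s
date -d @1501695739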

In unrelated news, I ran into problems with one text entry for the enwiki page content dumps on beta:

Spawning database subprocess: '/usr/bin/php7.0' '/srv/mediawiki/php-master/../multiversion/MWScript.php' 'fetchText.php' '--wiki' 'enwiki'
getting/checking text 20971 failed (Generic error while obtaining text for id 20971)

I checked on deployment-db04 and the revision record is there but the text entry is missing:

root@BETA[enwiki]> select * from text where old_id = 20971;
Empty set (0.00 sec)

root@BETA[enwiki]> select * from revision where rev_text_id = 20971\G
*************************** 1. row ***************************
       rev_id: 20971
     rev_page: 1442
  rev_text_id: 20971
  rev_comment: Closing small tag and removing nested small tag.  Please revert if nesting was intentional.
     rev_user: 0
rev_user_text: imported>Plastikspork
rev_timestamp: 20100824213917
          ...
1 row in set (0.49 sec)
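
To get an idea how widespread the missing-text problem is, something like this against the beta enwiki db should list other revisions whose text row is gone (a sketch, assuming direct mysql access as above):

mysql enwiki -e "SELECT rev_id, rev_page, rev_text_id FROM revision LEFT JOIN text ON rev_text_id = old_id WHERE old_id IS NULL LIMIT 10;"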

Sidetracked a bit by issues with the labs/private repo on deployment-puppetmaster02. These have been resolved by rebuilding the repo, rebasing all local changes on top, getting rid of extra merge commits, committing some changes that had been added but never committed, and, last but not least, committing some files not managed by the repo but sitting in the directory. The sync from the remote repo every ten minutes now works again. There's a chance there are a couple of extra files in our local copy, but that won't hurt anything.

Back to the memcached issue. With some tcpdump, some strace, and some telnet, I have the following for the first error encountered when running my dump command with files that don't even exist yet, so there are no stubs or prefetch involved; this is still in the setup phase.

  • The key being checked is WANCache:v:global:lag-times:1:deployment-db03:0-1
  • The key exists and data is sent back and received by the mediawiki client through nutcracker on snapshot01, via memc04
  • The data seems to be binary data of some sort; verified from a local telnet request for the key on deployment-memc04
  • The actual error is "could not decompress value: unrecognised encryption type" at /srv/mediawiki/php-master/includes/libs/objectcache/MemcachedPeclBagOStuff.php
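
For reference, roughly the sort of invocations used to establish the above (flags reconstructed from memory, not a transcript):

# watch traffic between nutcracker/snapshot01 and memc04 on the memcached port
tcpdump -A -s0 'tcp port 11211'
# capture what the php client actually sends and receives on its sockets
strace -f -s 4096 -e trace=network python ./worker.py --configfile /etc/dumps/confs/wikidump.conf.labs --job xmlstubsdump enwikinews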

This is out of the strace:

wfDebug 2018-01-11 20:51:28 [32c153505e074cc74879de76] deployment-snapshot01 enwikinews 1.31.0-alpha error WARNING: [32c153505e074cc74879de76] [no req]
ErrorException from line 235 of /srv/mediawiki/php-master/includes/libs/objectcache/MemcachedPeclBagOStuff.php:
PHP Warning: Memcached::getMulti(): could not decompress value: unrecognised encryption type
{"exception_id":"32c153505e074cc74879de76","exception_url":"[no req]","caught_by":"mwe_handler"} .wfDebug
wfDebug [Exception ErrorException] (/srv/mediawiki/php-master/includes/libs/objectcache/MemcachedPeclBagOStuff.php:235)
PHP Warning: Memcached::getMulti(): could not decompress value: unrecognised encryption type.wfDebug

 #0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array).wfDebug
 #1 /srv/mediawiki/php-master/includes/libs/objectcache/MemcachedPeclBagOStuff.php(235): Memcached->getMulti(array).wfDebug
 #2 /srv/mediawiki/php-master/includes/libs/objectcache/WANObjectCache.php(333): MemcachedPeclBagOStuff->getMulti(array).wfDebug
 #3 /srv/mediawiki/php-master/includes/libs/objectcache/WANObjectCache.php(278): WANObjectCache->getMulti(array, array, array, array).wfDebug

All the mediawiki instances in beta except this one are using hhvm, I suppose. That leads me to this bug report:

https://github.com/facebook/hhvm/issues/8028

In php-memcached_3.0.1+2.2.0, this change would probably do the trick. No, I have not built or tested it.

 [ariel@bigtrouble memcached-3.0.1]$ diff php_memcached.c~ php_memcached.c
3393a3394,3395
> 		// mimic the hhvm plugin, which falls back to zlib every time. yuck. see hhvm issue #8028
> 		is_zlib = 1;

In the version of the code here: https://github.com/php-memcached-dev/php-memcached/blob/REL3_0/php_memcached.c the stanza to be tweaked is at line 3390.

Could we try cherry-picking https://github.com/facebook/hhvm/pull/8031 instead? Alternatively, we could have PHP 7 and HHVM not share a cache (different servers, or different cache prefix, etc...)

I have built and tested with a more complicated patch, because the above is not sufficient: the fallback is actually to a third, older style of compression, as seen in the 2.2.0 php_memcached.c file. It's not working yet; I wanted this to confirm that this is the bug we're encountering. What I do see is that on some runs of my php script I get these errors and on some I don't, as though the key might be inserted by something using hhvm one time and php another.

> Could we try cherry-picking https://github.com/facebook/hhvm/pull/8031 instead? Alternatively, we could have PHP 7 and HHVM not share a cache (different servers, or different cache prefix, etc...)

For sure the right fix is to patch hhvm and rebuild, but I don't want to recommend that until it's certain that this is the bug.

Just fyi, it's clear we do get a message back without the compression flag set; I've checked this. But even trying fastlz, zlib, and the old uncompress method from the 2.2.0 code, I can't get decompression to work. Yet. At this point I would vote for that hhvm build, with installation on mediawiki04 through 07, and then waiting however long it takes for that cache key to expire and/or be overwritten out of the two memcached instances.
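
For anyone who wants to reproduce the flag check: the second numeric field on memcached's VALUE response line is the client flags word, where php-memcached stores its compression/serialization bits (compare against the MEMC_VAL_* defines in php_memcached.c), and nutcracker passes a plain get straight through:

# 127.0.0.1:11212 is the local nutcracker memcached listener seen in the errors above
printf 'get WANCache:v:global:lag-times:1:deployment-db03:0-1\r\nquit\r\n' | nc 127.0.0.1 11212
# reply: VALUE <key> <flags> <bytes> -- inspect <flags> for the compression bits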

Got a working patch at last, so that's verified to be the issue. Opened a separate ticket for this: T184854

The php-wikidiff2 extension was picked up from stretch-backports, so that's done.

During the SRE team meetings, we decided that a good approach for handling the php7 memcached issue would be to have a separate memcached pool for php7 mediawiki hosts. In accordance with that, I'm going to set up a memcached pool in deployment-prep that is only to be used by php7 mediawiki instances.

Questions for the folks who run the existing memcached instances in labs:

  • Do I need more than one memcached instance or is one enough? Load/memory isn't an issue for dumps in labs.
  • It looks like all I'll need to do is to override mediawiki_memcached_servers in hiera on the snapshots (see the sketch after this list). Anything I missed?
  • Spinning up a new memcached in labs seems trivial, m1.medium, no new settings, right?
  • In order to head off other folks from using the new memcached by mistake, I was considering calling it deployment-memc-php7-01, is that too awful? Will it pick up the deployment-memc prefix settings?
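
For the second point, a sketch of the override I have in mind, written as a prefix hiera entry (key name from above; the address is a placeholder, and the host:port:weight format is my assumption):

# hypothetical prefix hiera for the snapshots
mediawiki_memcached_servers:
  - '10.68.x.x:11211:1'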

Adding @elukey because you worked on the memcached profiles and the labs settings.

After talking with elukey and others on IRC, I have set up deployment-memc-php01, and it picks up the right puppet settings. One instance is enough for now.

> During the SRE team meetings, we decided that a good approach for handling the php7 memcached issue would be to have a separate memcached pool for php7 mediawiki hosts. In accordance with that, I'm going to set up a memcached pool in deployment-prep that is only to be used by php7 mediawiki instances.

Sadly, this is OK only in labs, as it would create an issue with WANCache purges not propagating, and people on php/hhvm seeing stale content.

Do this to unblock yourself, but I'll need to patch HHVM and make it compatible with php before we can move to production anyway.

After chatting with Joe on IRC: the new memcached instance was only intended to mirror what we expected to do in production. I already have dumps working in beta thanks to the patched php-memcached, so I'll wait for the new hhvm build and test against that when it shows up. In the meantime I've tossed that memcached instance. Ah well.

https://gerrit.wikimedia.org/r/#/c/394977/ is running fine on snapshot01 (latest patchset, i.e. 30). Besides any other changes reviewers want to propose there, we also need the ICU changes and the hhvm build with the upstream memcached patch.

Note that we provide the php.ini file in the mediawiki module because that's what the trusty clauses in the module do. We move from a mod_php path to a php-fpm path only because we expect to move away from mod_php on stretch. This can be shuffled around when there's a proper php module, corresponding to the hhvm module, that provides this ini file.

The patchset has been merged. Now waiting on the hhvm build with the memcached patch. Moritz plans to do this after the ICU work, maybe around Apr 9 or so.

The snapshot01 instance is now running with the vanilla php-memcached, and the various app servers in beta have a patched hhvm. Looking good!

I'm considering this done. Yay!