
Four different PHP/HHVM versions on the cluster
Closed, Resolved, Public


Most appservers run HHVM 3.12.14 (as do tin and mira), but there are 13 servers running 3.12.7 (8 "real" appservers, 4 snapshot servers, and terbium), 2 servers running 3.18.1 (both debug servers) and 6 servers running 3.18.2 (5 real appservers, and naos).

This means that, once naos replaces mira, the two deployment hosts will not run the same HHVM version. Terbium, where maintenance scripts are run, runs a version that almost no appservers run. The debug servers, where deployers go to test code, run a version of HHVM that runs nowhere else. As a result, bugs caused by behavior differences between HHVM versions are harder to discover.

catrope@tin:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -F 20 -M -g mediawiki-installation -r ssh -o -oUser=mwdeploy -- php --version | tee php-versions
catrope@tin:~$ cat php-versions  | grep Compiler | sort -k 3 | less
mira.codfw.wmnet: Compiler: 3.12.14+dfsg-1+wmf1
mw1161.eqiad.wmnet: Compiler: 3.12.14+dfsg-1+wmf1
mw1162.eqiad.wmnet: Compiler: 3.12.14+dfsg-1+wmf1
mw2259.codfw.wmnet: Compiler: 3.12.14+dfsg-1+wmf1
mw2260.codfw.wmnet: Compiler: 3.12.14+dfsg-1+wmf1
tin.eqiad.wmnet: Compiler: 3.12.14+dfsg-1+wmf1
wasat.codfw.wmnet: Compiler: 3.12.14+dfsg-1+wmf1
mw1168.eqiad.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
mw1169.eqiad.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
mw1259.eqiad.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
mw1260.eqiad.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
mw2118.codfw.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
mw2119.codfw.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
mw2152.codfw.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
mw2246.codfw.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
snapshot1001.eqiad.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
snapshot1005.eqiad.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
snapshot1006.eqiad.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
snapshot1007.eqiad.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
terbium.eqiad.wmnet: Compiler: 3.12.7+dfsg-1+wmf1~trusty1
mwdebug1001.eqiad.wmnet: Compiler: 3.18.1+dfsg-1+wmf1
mwdebug1002.eqiad.wmnet: Compiler: 3.18.1+dfsg-1+wmf1
mw1262.eqiad.wmnet: Compiler: 3.18.2+dfsg-1+wmf1
mw1263.eqiad.wmnet: Compiler: 3.18.2+dfsg-1+wmf1
mw1264.eqiad.wmnet: Compiler: 3.18.2+dfsg-1+wmf1
mw1265.eqiad.wmnet: Compiler: 3.18.2+dfsg-1+wmf1
naos.codfw.wmnet: Compiler: 3.18.2+dfsg-1+wmf1
mw1261.eqiad.wmnet: Compiler: 3.18.2+dfsg-1+wmf2
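
A per-version tally is easier to scan than the sorted host list. The following is a minimal sketch, assuming the `php-versions` file produced above and the `<host>: Compiler: <version>` line shape that `dsh -M` emits; the function name `count_hhvm_versions` is made up for illustration and is not part of any existing tooling.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): tally hosts per HHVM
# build from `dsh -M` output, where the relevant lines look like
#   <host>: Compiler: <version>
# Reads the files given as arguments, or stdin when none are given.
count_hhvm_versions() {
    # Count occurrences of each version string found in the third field,
    # then sort by version string so the output order is stable.
    awk '$2 == "Compiler:" { n[$3]++ }
         END { for (v in n) print n[v], v }' "$@" \
        | LC_ALL=C sort -k2
}
```

Running `count_hhvm_versions php-versions` on tin would print one `<count> <version>` line per distinct build, which makes outliers such as a single host on a different package revision stand out.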

Event Timeline

This seems to be partially expected. T158176: Build / migrate to HHVM 3.18 says "3.18.2 is running on the mediawiki canaries, but the wider rollout is held back until after the DC switchover", which seems sensible.

The easiest start here would be to upgrade mwdebug1001/1002 to 3.18.2. That seems to make sense when some real appservers are already on it. They should probably always be updated first before other mw servers anyways, right?

On any other day I would probably just do that since they are debug hosts. Though right now might be a bad moment. The "is running on the canaries" should cover mwdebug* though, shouldn't it?

> On any other day I would probably just do that since they are debug hosts. Though right now might be a bad moment.

Possibly, but it should also be pretty low impact. Not my call though.

> The "is running on the canaries" should cover mwdebug* though, shouldn't it?

You would think so (since we use the debug servers as canaries too).

Looking at the situation on naos, it looks like an accidental upgrade via hhvm-dbg.

The initial puppet run installs hhvm:

Start-Date: 2017-04-14  00:01:24
Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install hhvm
Install: libboost-system1.55.0:amd64 (1.55.0+dfsg-3, automatic), libunwind8:amd64 (1.1-3.2, automatic), libgflags2:amd64 (2.0-2.1, automatic), libzip2:amd64 (0.11.2-1.2, automatic), libboost-filesystem1.55.0:amd64 (1.55.0+dfsg-3, automatic), libvpx3:amd64 (1.5.0-2~wm1, automatic), libmcrypt4:amd64 (2.5.8-3.3, automatic), libboost-program-options1.55.0:amd64 (1.55.0+dfsg-3, automatic), hhvm:amd64 (3.12.14+dfsg-1+wmf1), libgoogle-glog0:amd64 (0.3.3-2, automatic), libboost-thread1.55.0:amd64 (1.55.0+dfsg-3, automatic), liblz4-1:amd64 (0.0~r131-2~wmf1, automatic), libodbc1:amd64 (2.3.1-3, automatic), libxslt1.1:amd64 (1.1.28-2+deb8u2, automatic), libdouble-conversion1:amd64 (2.0.1-1, automatic), libtbb2:amd64 (4.2~20140122-5, automatic)
End-Date: 2017-04-14  00:01:40

A little while later, puppet also installs hhvm-dbg, which upgrades hhvm and related packages:

Start-Date: 2017-04-14  00:21:51
Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install hhvm-dbg
Install: libboost-context1.55.0:amd64 (1.55.0+dfsg-3, automatic), libre2-1:amd64 (20140304+dfsg-2, automatic), hhvm-dbg:amd64 (3.18.2+dfsg-1+wmf1)
Upgrade: hhvm-tidy:amd64 (0.1.3~jessie1, 0.1.3~jessie2), hhvm-luasandbox:amd64 (2.0.12~jessie1, 2.0.12~jessie2), hhvm:amd64 (3.12.14+dfsg-1+wmf1, 3.18.2+dfsg-1+wmf1), hhvm-wikidiff2:amd64 (1.4.1, 1.4.1+wmf1)
End-Date: 2017-04-14  00:22:29
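
The same check can be repeated on other hosts by scanning apt's history log for runs that upgraded hhvm as a side effect. This is a rough sketch, assuming the stanza format shown in the excerpts above; `find_upgrades` is a hypothetical helper name, and package names containing regex metacharacters (for example `+`) are not handled.

```shell
#!/bin/sh
# Hypothetical helper: list apt runs that upgraded a given package.
# Reads apt history log text on stdin; $1 is the package name (default: hhvm).
# Output: <start date and time> | <pkg:arch (old, new)> | <apt command line>
find_upgrades() {
    pkg="${1:-hhvm}"
    awk -v pkg="$pkg" '
        /^Start-Date:/  { start = $2 " " $3; cmd = "" }
        /^Commandline:/ { cmd = substr($0, index($0, ":") + 2) }
        /^Upgrade:/ {
            # Find "pkg:arch (old, new)"; the name must be followed directly
            # by ":" so that hhvm does not also match hhvm-tidy.
            if (match($0, "(^|[ ,])" pkg ":[a-z0-9]+ \\([^)]*\\)")) {
                hit = substr($0, RSTART, RLENGTH)
                sub(/^[ ,]+/, "", hit)
                print start " | " hit " | " cmd
            }
        }
    '
}
```

On a Debian host this could be invoked as `find_upgrades hhvm < /var/log/apt/history.log`. For the two stanzas above, only the second run (the `install hhvm-dbg` one) has an `Upgrade:` line mentioning `hhvm:amd64`, so only that run would be reported.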

terbium will be upgraded to jessie as soon as we've switched over, for the record.

The only thing that we need to upgrade (and probably @MoritzMuehlenhoff has already scheduled it) are the mwdebug servers, since the rest is Trusty and I don't believe that we'll make any attempt to upgrade HHVM there.

The remaining appservers with 3.12.7+dfsg-1+wmf1~trusty1 are videoscalers running Trusty, which we will eventually migrate to Jessie.

So as far as I know, the next intermediate step will be to upgrade HHVM to 3.18 after the switchover (if no more surprises come up), and possibly to work on migrating the remaining Trusty hosts to Jessie.

It's unproblematic to also upgrade mwdebug* to 3.18.2; the only difference is a backported patch for an issue which only shows up under production load after 4-5 hours.

The deployment servers were not meant to use HHVM 3.18. The experimental archive section is enabled there because the deployment servers are running a backported git, as requested by RelEng. We can simply downgrade HHVM on naos to 3.12 for now; all mediawiki servers will be upgraded to 3.18 after the switchover anyway.

I've downgraded hhvm-related packages back to their non-experimental version.

It looks like the root cause is that the experimental and main components of jessie-wikimedia have the same apt priority (1002). On a reimage or a new machine, asking apt to install a package therefore picks it up from experimental whenever that component is enabled. I've updated T158583 with this issue too.
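
A toy model may help show why the equal priority matters. It assumes apt's documented selection rule: the version with the highest priority wins, and among versions of equal priority the most recent one wins. The `pick_candidate` function, its input format, and the priority 100 used in the usage note are made up for illustration; `sort -V` (GNU coreutils) only approximates Debian version ordering, so this is a sketch rather than a reimplementation of apt.

```shell
#!/bin/sh
# Toy model of candidate selection (hypothetical, not apt itself).
# Input on stdin, one candidate per line: "<priority> <component> <version>"
# Output: the line that would win under "highest priority, then newest version".
pick_candidate() {
    # Key 1: priority, numeric, descending.
    # Key 3: version, GNU version sort, descending (approximation only).
    sort -k1,1nr -k3,3Vr | head -n 1
}
```

Feeding it `1002 main 3.12.14+dfsg-1+wmf1` and `1002 experimental 3.18.2+dfsg-1+wmf1` selects the experimental 3.18.2 build, matching what happened on naos. If experimental were given a lower priority (100 here, purely as an example), the main 3.12.14 build would be selected instead.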

MoritzMuehlenhoff claimed this task.

I've upgraded mwdebug to also use 3.18.2. terbium will be reimaged to jessie next week (initially it will use HHVM 3.12 and it'll be upgraded to 3.18 along with the rest of the cluster). This only leaves the video scalers and the snapshot hosts still running trusty. I'm closing this ticket, since the migrations of the video scalers, the snapshot hosts and the general 3.12-3.18 upgrade are already covered by existing tickets. Thanks for pointing out the 3.18/naos discrepancy, that was an unfortunate side effect of our currently limited repository structure.