Beta thumbnails are broken
Closed, Resolved, Public

Description

Looking at https://commons.wikimedia.beta.wmflabs.org/wiki/Special:NewFiles - none of the new files are thumbnailing correctly. The production version of the same page looks fine, so this probably isn't an issue with the code on beta (or problems would have occurred on production by now, I think) - it likely has something to do with the beta configuration. T166013 may be related, but we aren't sure, and haven't had much luck in diagnosing.

Event Timeline

I added swift::proxy::shard_container_list to the hieradata config on horizon, as it was missing. This allowed puppet to run; however, the puppet run nuked the SWIFT_KEY value in the thumbor config. I manually added it back. @fgiunchedi I believe there were issues with this before, what's the fix?

Restarting the swift proxy (on deployment-ms-fe02) made it pick up the new configuration and the traffic started flowing to deployment-imagescaler01 again. However, the thumbor config on that host picked up the poolcounter config, which points to poolcounter1002.eqiad.wmnet and seems to fail (due to firewalling, I presume?):

Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: File "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/handler/images/images.py", line 432, in poolcounter_throttle_key
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: lock_acquired = yield self.pc.acq4me(key, cfg['workers'], cfg['maxqueue'], cfg['timeout'])
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: value = future.result()
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: raise_exc_info(self._exc_info)
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1021, in run
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: yielded = self.gen.throw(*exc_info)
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: File "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/poolcounter/__init__.py", line 43, in acq4me
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: yield self.stream.write('ACQ4ME %s %d %d %d\n' % (key, workers, maxqueue, timeout))
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: value = future.result()
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: raise_exc_info(self._exc_info)
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: File "<string>", line 3, in raise_exc_info
Jul  5 16:11:58 deployment-imagescaler01 thumbor@8801[5962]: StreamClosedError: Stream is closed

This is the same issue as in T169313 (Investigate poolcounter failure leading to thumbor failing to generate thumbs), where poolcounter being unavailable causes an error instead of being ignored.
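To illustrate what "ignored" could look like, here is a minimal sketch of a fail-open lock request. It is not the wikimedia_thumbor code: the function name `acquire_or_skip`, the `connect_timeout` parameter and the use of asyncio are assumptions made for the example. Only the `ACQ4ME` command line is taken from the traceback above.

```python
import asyncio


async def acquire_or_skip(host, port, key, workers, maxqueue, timeout,
                          connect_timeout=1.0):
    """Ask a poolcounter-style server for a lock.

    Returns the server's raw reply line, or None when the server cannot be
    reached or does not answer in time. The caller is expected to treat
    None as "no throttling" instead of failing the request.
    (Hypothetical helper, not part of wikimedia_thumbor.)
    """
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), connect_timeout
        )
    except (OSError, asyncio.TimeoutError):
        # Unreachable or firewalled: fail open.
        return None
    try:
        # Command format copied from the traceback in this task.
        command = "ACQ4ME %s %d %d %d\n" % (key, workers, maxqueue, timeout)
        writer.write(command.encode())
        await writer.drain()
        reply = await asyncio.wait_for(reader.readline(), connect_timeout)
        # An empty reply means the stream was closed on us: also fail open.
        return reply.decode().strip() or None
    except (OSError, asyncio.TimeoutError):
        return None
    finally:
        writer.close()
```

With something along these lines, a closed stream or a connection that hangs would degrade to unthrottled rendering rather than an exception in the request handler. Whether failing open is the right trade-off for production is a separate question.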

Verified that indeed, attempting to connect to poolcounter1002 just hangs:

gilles@deployment-imagescaler01:/srv/log/thumbor$ telnet poolcounter1002.eqiad.wmnet 7531
Trying 10.64.16.152...
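The same check can be scripted with a timeout so it does not hang the way telnet does. This is a small sketch; the helper name `can_connect` and the 3-second default are assumptions for the example.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    Connection refusals, DNS failures and timeouts are all subclasses of
    OSError, so they all end up reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port as used in the telnet attempt above.
    print(can_connect("poolcounter1002.eqiad.wmnet", 7531))
```

A firewalled host that silently drops packets would show up here as a False after the timeout expires, instead of an indefinite "Trying ...".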

I commented out the poolcounter server config line in /etc/thumbor.d/60-thumbor-server.conf on deployment-imagescaler01. Restarted thumbor. Now every original fetch it attempts 404s, and I think it's because of the empty override of the shard_container config I did, because it does expect the commons beta containers to be sharded.

I don't know why it stopped picking up the general config version, though, because the swift proxy seems to get that hiera value just fine.

I bet it's something to do with that hiera value being for the swift::proxy class? I'll attempt re-creating the same list manually in horizon for deployment-imagescaler01

Puppet runs are still nuking the SWIFT_KEY config value... I'm not sure that anything I try to do now will last, as any future puppet run on that host will break things.

Alright, that fixed it. I'll leave the permanent fixes to @fgiunchedi as I don't know how to fix the hiera issues.

I'll try putting my overrides in a local-only config file in /etc/thumbor.d, so they don't get nuked by puppet.

The hotfixes are in /etc/thumbor.d/99-T169114.conf
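For readers unfamiliar with this kind of conf.d layout, the sketch below shows why a high-numbered local file can act as an override. It assumes that the files in the directory are applied in sorted filename order with later assignments winning, which the 99- prefix suggests but this thread does not state. The loader `load_conf_dir`, the file name `50-managed.conf` and the values are hypothetical; only `99-T169114.conf` and `SWIFT_KEY` come from the discussion.

```python
import os
import tempfile


def load_conf_dir(path):
    """Apply every *.conf file in `path` in sorted filename order.

    Each file is treated as plain Python assignments; a later file
    overrides any setting made by an earlier one.
    (Illustrative loader, not thumbor's own.)
    """
    settings = {}
    for name in sorted(os.listdir(path)):
        if name.endswith(".conf"):
            with open(os.path.join(path, name)) as fh:
                exec(fh.read(), {}, settings)
    return settings


def demo():
    """Write a managed file and a local override, then load both."""
    with tempfile.TemporaryDirectory() as conf_dir:
        files = {
            # Hypothetical puppet-managed file with the value wiped out.
            "50-managed.conf": "SWIFT_KEY = ''\n",
            # Local-only file that puppet does not manage.
            "99-T169114.conf": "SWIFT_KEY = 'placeholder-key'\n",
        }
        for name, body in files.items():
            with open(os.path.join(conf_dir, name), "w") as fh:
                fh.write(body)
        return load_conf_dir(conf_dir)


if __name__ == "__main__":
    print(demo()["SWIFT_KEY"])
```

Under that assumption, a puppet run can rewrite the managed file as often as it likes and the value from the 99- file still takes effect, as long as puppet does not purge unmanaged files from the directory.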

Upgraded all packages to the latest versions, as I noticed python-thumbor-wikimedia was still on the previous version.

Thanks @Gilles for the debugging! I think it is due to me moving some swift settings from the wikitech Hiera: page to horizon. I've put the list of sharded containers and the account keys into "project hiera" on horizon for deployment-prep, and now puppet populates SWIFT_KEY again.

Change 363626 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: parametrize poolcounter

https://gerrit.wikimedia.org/r/363626

Change 363626 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: parametrize poolcounter

https://gerrit.wikimedia.org/r/363626

Looks like this is fixed. We don't have poolcounter in beta, I think? Anyway, if we do, we can add it to thumbor later.

There is a poolcounter!

deployment-poolcounter04    10.68.17.48
wmf-config/LabsServices.php
### Poolcounter
$wmfAllServices['eqiad']['poolcounter'] = [
    '10.68.17.48', # deployment-poolcounter04.deployment-prep.eqiad.wmflabs
];

:)

I've added the Beta poolcounter to deployment-imagescaler01's config in horizon. Seems to work fine!