Investigate Varnish 3.0.5plus~x-wm7 throwing on beta cluster "Could not mmap SILO"
Closed, Resolved, Public

Description

On the beta cluster, the upload Varnish backend refused to start because of a "Could not mmap SILO" error (see T75922: upload.beta.wmflabs.org is throwing 503s). From that task:

# /etc/init.d/varnish start
 * Starting HTTP accelerator    [fail]
sizeof(struct smp_ident) = 112 = 0x70
sizeof(struct smp_sign) = 40 = 0x28
sizeof(struct smp_segptr) = 32 = 0x20
sizeof(struct smp_object) = 56 = 0x38
WARNING: (-spersistent) file size reduced to 19770756300 (80% of available disk space)
min_nseg = 10, max_segl = 1976655455
max_nseg = 104850, min_segl = 188522
aim_nseg = 1023, aim_segl = 19322145
free_reserve = 193221450
sizeof(struct smp_ident) = 112 = 0x70
sizeof(struct smp_sign) = 40 = 0x28
sizeof(struct smp_segptr) = 32 = 0x20
sizeof(struct smp_object) = 56 = 0x38
WARNING: (-spersistent) file size reduced to 19770756300 (80% of available disk space)
Could not mmap SILO (/srv/vdb/varnish.main2) at target 0x7efcfa33c000, was mapped at 0x7f61d2649000 instead

Running rm /srv/vdb/varnish.* and then restarting Varnish restored the service.
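
The workaround above can be sketched as a small script. This is only an illustration, not the procedure that was actually run: the /srv/vdb path, the varnish.* pattern, and the /etc/init.d/varnish script come from this task, while the "stop" action of the init script, the DRY_RUN switch, and the function names are assumptions made for the example. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the workaround: stop Varnish, delete the persistent cache
# (silo) files, then start Varnish again so the files are recreated.
# CACHE_DIR and INIT_SCRIPT default to the paths quoted in this task.
# Assumptions: the init script accepts "stop"; DRY_RUN is a hypothetical
# safety switch (1 = only print the commands, 0 = really run them).

CACHE_DIR="${CACHE_DIR:-/srv/vdb}"
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/varnish}"
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

wipe_varnish_cache() {
    run "$INIT_SCRIPT" stop
    for f in "$CACHE_DIR"/varnish.*; do
        # Skip the unexpanded pattern when no cache file matches.
        [ -e "$f" ] || continue
        run rm -- "$f"
    done
    run "$INIT_SCRIPT" start
}

wipe_varnish_cache
```

Saved under a hypothetical name such as wipe-varnish-cache.sh, it would perform the deletion only when invoked as root with DRY_RUN=0. All cached objects are lost, so the cache starts cold afterwards.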

I remember talking to @BBlack about this a few months ago, and I am pretty sure he wrote a patch for Varnish. So either that patch has regressed, or this is a corner case the patch does not address. Hence this task for investigation.

The instance is deployment-cache-upload02.eqiad.wmflabs and it runs Varnish 3.0.5plus~x-wm7:

deployment-cache-upload02:~$ apt-cache policy varnish
varnish:
  Installed: 3.0.5plus~x-wm7
  Candidate: 3.0.5plus~x-wm7
  Version table:
 *** 3.0.5plus~x-wm7 0
       1001 http://apt.wikimedia.org/wikimedia/ precise-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
     3.0.2-1ubuntu0.1 0
        500 http://nova.clouds.archive.ubuntu.com/ubuntu/ precise-updates/universe amd64 Packages
     3.0.2-1 0
        500 http://nova.clouds.archive.ubuntu.com/ubuntu/ precise/universe amd64 Packages

Event Timeline

hashar set the priority of this task to Needs Triage.
hashar updated the task description.
hashar added a project: acl*sre-team.
hashar changed Security from none to None.
hashar added a project: Varnish.
hashar added subscribers: hashar, BBlack, Varnish.

3.0.5plus~x-wm7 does have all the related fixes (the mmap fix, fallocate, etc.). However, the parts of the fixes that prevent this issue don't take effect until the cache files have been wiped and recreated at least once after the fixed code is deployed. If these cache files predated the fixes, then this is just known behavior. Since they've already been deleted, I can't tell for sure. The filesystem was created back in March, though, so I suspect "old cache files" is what's going on here.

hashar claimed this task.

That seems reasonable. I guess we will delete the cache files from all beta cluster Varnish instances to play it safe. I have filed T76091 for that.

Thank you, Brandon, for confirming that this version should be working just fine and for the explanation about the old cache files.