Migrate to nginx-light
Open, Medium · Public

Description

We're currently using nginx-full, which also includes the image filter module. It links against a wide range of media libraries:

linux-vdso.so.1 (0x00007ffcc07e0000)
libgd.so.3 => /usr/lib/x86_64-linux-gnu/libgd.so.3 (0x00007f1ace787000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1ace3dc000)
libjpeg.so.62 => /usr/lib/x86_64-linux-gnu/libjpeg.so.62 (0x00007f1ace184000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f1acdf69000)
libpng12.so.0 => /lib/x86_64-linux-gnu/libpng12.so.0 (0x00007f1acdd42000)
libfreetype.so.6 => /usr/lib/x86_64-linux-gnu/libfreetype.so.6 (0x00007f1acda97000)
libfontconfig.so.1 => /usr/lib/x86_64-linux-gnu/libfontconfig.so.1 (0x00007f1acd85a000)
libXpm.so.4 => /usr/lib/x86_64-linux-gnu/libXpm.so.4 (0x00007f1acd648000)
libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6 (0x00007f1acd304000)
libvpx.so.1 => /usr/lib/x86_64-linux-gnu/libvpx.so.1 (0x00007f1accf0c000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1accc0b000)
libtiff.so.5 => /usr/lib/x86_64-linux-gnu/libtiff.so.5 (0x00007f1acc995000)
/lib64/ld-linux-x86-64.so.2 (0x0000562181c27000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f1acc76c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1acc54e000)
libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007f1acc32c000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1acc128000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f1acbf04000)
libjbig.so.0 => /usr/lib/x86_64-linux-gnu/libjbig.so.0 (0x00007f1acbcf5000)
libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007f1acbaf1000)
libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007f1acb8eb000)
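
A listing like the one above is what `ldd` prints for a binary. As a minimal sketch, the image-related entries can be filtered out of that output. The helper name `media_libs`, the binary path `/usr/sbin/nginx`, and the choice of which libraries count as "media" are all assumptions made for illustration, not taken from the task:

```shell
# Hypothetical helper: read `ldd` output on stdin and print the names of
# image/media libraries (the selection of library names is an assumption).
media_libs() {
    awk '$1 ~ /^lib(gd|jpeg|png[0-9]*|tiff|Xpm|vpx)\.so/ { print $1 }' | LC_ALL=C sort
}

# Usage (binary path is an assumption):
#   ldd /usr/sbin/nginx | media_libs
```

An empty result would suggest the installed nginx flavour no longer pulls in the image filter dependencies.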

We don't use the module in our nginx.conf, but it still triggers a lot of spurious restart warnings for nginx whenever one of those libs is updated.
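
The task doesn't say which tool raises these warnings. One common way such checks work (an assumption here, not something confirmed in this task) is to look for shared objects that a running process still has mapped after the file on disk was replaced; those appear with a `(deleted)` suffix in `/proc/<pid>/maps`. A rough sketch with a hypothetical helper name `stale_libs`:

```shell
# Hypothetical helper: read the content of /proc/<pid>/maps on stdin and
# print each shared object that is still mapped but marked "(deleted)".
stale_libs() {
    awk '$NF == "(deleted)" && $(NF-1) ~ /\.so/ { print $(NF-1) }' | LC_ALL=C sort -u
}

# Usage:
#   stale_libs < "/proc/$(systemctl show nginx -p MainPID | cut -d= -f2)/maps"
```

Every library nginx links against is one more file that can show up here after a routine update, which is why a smaller dependency set means fewer warnings.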

This isn't important enough to warrant a round of nginx builds/deployments by itself, but the next time we update nginx for other reasons (say TLS 1.3), I'd like to piggyback this change on it.

In addition to the systems using tlsproxy, there are also a few systems using nginx-extras or nginx-full, which could be reviewed/migrated:

  • thumbor
  • francium
  • labstore1006/1007
  • install*
  • sodium
  • archiva1001
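
When reviewing hosts like these, a first step could be to check which nginx flavour is actually installed. A minimal sketch, assuming a Debian system with `dpkg-query` available; the helper name `nginx_flavours` is hypothetical:

```shell
# Hypothetical helper: read "<package> <status-abbrev>" lines on stdin and
# print the nginx flavour packages that are fully installed ("ii").
nginx_flavours() {
    awk '$2 == "ii" && $1 ~ /^nginx-(light|full|extras)$/ { print $1 }'
}

# Usage:
#   dpkg-query -W -f='${Package} ${db:Status-Abbrev}\n' 'nginx-*' | nginx_flavours
```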

Details

Related Gerrit Patches:
operations/puppet (production): tlsproxy: use light variant

Event Timeline

Restricted Application added a subscriber: Aklapper. · May 4 2017, 9:00 AM
MoritzMuehlenhoff updated the task description.
BBlack added a subscriber: faidon.

Yeah, @faidon has brought up a similar argument before on a slightly different level: that we shouldn't be using nginx-full on most hosts anyways, since we use virtually none of the plugin modules. Somewhere there's an intersection of these ideas that makes life easier.

ema triaged this task as Medium priority. · May 9 2017, 8:37 AM
ema moved this task from Triage to TLS on the Traffic board.

This came up again recently. We really should make the switch to nginx-light (carefully, to avoid mass-restart!)

MoritzMuehlenhoff renamed this task from Build nginx without image filter support to Migrate to nginx-light. · Oct 23 2017, 3:48 PM
MoritzMuehlenhoff removed MoritzMuehlenhoff as the assignee of this task.

Seems the bot missed logging this here:

https://gerrit.wikimedia.org/r/#/c/386424/

Auditing production tlsproxy users for the switch to light in https://gerrit.wikimedia.org/r/#/c/386424/ shows no excess modules used on any of them, except for the expected lua+ndk on the cache hosts (which are installed explicitly and separately anyways, not part of full):

bblack@neodymium:~$ sudo cumin 'R:class = "tlsproxy::instance"' 'grep /usr/lib/nginx/modules/ /proc/$(systemctl show nginx -p MainPID|cut -d= -f2)/maps|cut -d/ -f6-|sort|uniq'
392 hosts will be targeted:
conf[2001-2003].codfw.wmnet,conf[1001-1003].eqiad.wmnet,cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mw[2017,2097,2099-2117,2120-2147,2150-2151,2153-2245,2247-2258].codfw.wmnet,mw[1180-1195,1197-1216,1218-1235,1238-1258,1261-1290,1293-1306,1308-1317,1319-1328].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                                         
(80) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4023,4025-4032].ulsfo.wmnet,cp1008.wikimedia.org                                                                                                                                                         
----- OUTPUT of 'grep /usr/lib/ng.../ -f6-|sort|uniq' -----                                                                                                                                    
ndk_http_module.so                                                                                                                                                                             
ngx_http_lua_module.so                                                                                                                                                                         
===== NODE GROUP =====                                                                                                                                                                         
(1) cp4024.ulsfo.wmnet                                                                                                                                                                         
----- OUTPUT of 'grep /usr/lib/ng.../ -f6-|sort|uniq' -----                                                                                                                                    
ssh: connect to host cp4024.ulsfo.wmnet port 22: Connection timed out                                                                                                                          
================
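
For readers unfamiliar with the one-liner passed to cumin above, here is the same pipeline unpacked for a single host. The function name `nginx_modules_from_maps` is hypothetical; the pipeline itself mirrors the audit command, with `sort -u` standing in for `sort|uniq`:

```shell
# Read the content of /proc/<pid>/maps on stdin and print the dynamic
# nginx modules that the process has mapped, one per line.
nginx_modules_from_maps() {
    # keep only mappings under the nginx module directory
    grep /usr/lib/nginx/modules/ |
        # split on "/" and keep field 6 onwards, i.e. the module file name
        cut -d/ -f6- |
        # each module is mapped several times, so de-duplicate
        LC_ALL=C sort -u
}

# Usage on a single host:
#   nginx_modules_from_maps < "/proc/$(systemctl show nginx -p MainPID | cut -d= -f2)/maps"
```

No output means no dynamic modules from that directory are loaded, which matches the hosts that cumin did not list in a node group.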

And the other side of this audit. Before we try to (carefully) switch to nginx-light, we need them all upgraded to the latest version so the dpkg-level replacement works sanely:

bblack@neodymium:~$ sudo cumin 'R:class = "tlsproxy::instance"' 'apt-cache policy nginx-full|egrep "Installed:|Candidate:"'
392 hosts will be targeted:
conf[2001-2003].codfw.wmnet,conf[1001-1003].eqiad.wmnet,cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mw[2017,2097,2099-2117,2120-2147,2150-2151,2153-2245,2247-2258].codfw.wmnet,mw[1180-1195,1197-1216,1218-1235,1238-1258,1261-1290,1293-1306,1308-1317,1319-1328].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet
Confirm to continue [y/n]? y                                                                                                                                                                         
===== NODE GROUP =====                                                                                                                                                                         
(1) ms-fe2005.codfw.wmnet                                                                                                                                                                      
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.13.5-1+wmf1                                                                                                                                                                     
  Candidate: 1.13.6-2+wmf1                                                                                                                                                                     
===== NODE GROUP =====                                                                                                                                                                         
(7) ms-fe[2006-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet                                                                                                                                  
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.11.10-1+wmf2~stretch1                                                                                                                                                           
  Candidate: 1.13.6-2+wmf1                                                                                                                                                                     
===== NODE GROUP =====                                                                                                                                                                         
(80) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4023,4025-4032].ulsfo.wmnet,cp1008.wikimedia.org                                                                                                                                                         
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
  Candidate: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
===== NODE GROUP =====                                                                                                                                                                         
(53) conf[1001-1003].eqiad.wmnet,mw[2100,2153-2162,2201,2243,2247-2250,2256].codfw.wmnet,mw[1228,1261-1265,1299-1306,1312-1317,1319-1328].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet           
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.11.10-1+wmf3                                                                                                                                                                    
  Candidate: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
===== NODE GROUP =====                                                                                                                                                                         
(4) mw[1308-1311].eqiad.wmnet                                                                                                                                                                  
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.13.5-1+wmf1~jessie1                                                                                                                                                             
  Candidate: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
===== NODE GROUP =====                                                                                                                                                                         
(246) conf[2001-2003].codfw.wmnet,mw[2017,2097,2099,2101-2117,2120-2147,2150-2151,2163-2200,2202-2242,2244-2245,2251-2255,2257-2258].codfw.wmnet,mw[1180-1195,1197-1216,1218-1227,1229-1235,1238-1258,1266-1290,1293-1298].eqiad.wmnet                                                                                                                                                        
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.11.4-1+wmf14                                                                                                                                                                    
  Candidate: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
================
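
The Installed/Candidate comparison above can also be turned into a simple per-host check. This is only a sketch; the helper name `needs_upgrade` is hypothetical, and it assumes the two-line `apt-cache policy` excerpt shown in the output above:

```shell
# Hypothetical helper: read `apt-cache policy <pkg>` output on stdin.
# Exit status 0 means an upgrade is pending (Installed differs from
# Candidate); non-zero means up to date or no Installed line was found.
needs_upgrade() {
    awk '$1 == "Installed:" { i = $2 }
         $1 == "Candidate:" { c = $2 }
         END { exit !(i != "" && (i "") != (c "")) }'
}

# Usage:
#   apt-cache policy nginx-full | needs_upgrade && echo "upgrade pending"
```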

The basic package version upgrades are generally seamless (nginx does a seamless restart without dropping requests when doing a normal apt-get -y install nginx-full), but we should still check in with the owners of the various services (ms-fe*, conf*, mw*) to make sure the timing and risk are OK.

Ping @Joe + @fgiunchedi ?

+1 for ms-fe. I'm assuming the rollout will happen with puppet disabled and then progressively re-enabled?

BBlack added a comment. · Nov 8 2017, 2:05 PM

Well, there are two different actions to get through here:

First is to upgrade tlsproxy hosts to 1.13.6-2+wmf1 (but still on the existing nginx-full packages) - seamless, shouldn't require any depooling. Should be safe on any host, but asking just in case of Whatever. This just gets the versions synced up to the latest, to make the next step simpler and more foolproof.

After that's complete everywhere, we can look at the switch to nginx-light. Shouldn't cause issues on any tlsproxy hosts (as I've audited the modules in use, etc), but will require puppet-disable, depool, and a full lossy nginx stop->start cycle.
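
The second step could look roughly like the sequence below. This is a dry-run sketch, not the procedure that was actually used: `depool` and `pool` stand in for whatever site-specific helpers take a host out of and back into rotation (hypothetical here), and it assumes that installing nginx-light replaces nginx-full through the packages' conflict relationship, which should be verified on a test host first.

```shell
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Each step only runs if the previous one succeeded.
switch_to_light() {
    run puppet agent --disable "nginx-light migration (T164456)" &&
    run depool &&                          # hypothetical depool helper
    run systemctl stop nginx &&            # full, lossy stop
    run apt-get -y install nginx-light &&  # assumed to replace nginx-full
    run systemctl start nginx &&
    run pool &&                            # hypothetical repool helper
    run puppet agent --enable
}

# Usage: switch_to_light            (prints the plan)
#        DRY_RUN=0 switch_to_light  (would execute it)
```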

Mentioned in SAL (#wikimedia-operations) [2018-02-01T15:39:46Z] <moritzm> upgrading nginx on mw1266-mw1299 (for T164456)

> First is upgrade tlsproxy hosts to 1.13.6-2+wmf1 (but still on existing nginx-full packages)

I've upgraded all of mw* to 1.13.6-2+wmf1~jessie1; this leaves only conf* to be upgraded (as far as the nginx usage in tlsproxy is concerned).

Mentioned in SAL (#wikimedia-operations) [2018-04-23T08:17:53Z] <_joe_> upgrading nginx on the config cluster in codfw (T164456)

Mentioned in SAL (#wikimedia-operations) [2018-04-23T08:48:51Z] <_joe_> upgrading nginx on the config cluster in eqiad (T164456)

Current picture:

vgutierrez@neodymium:~$ sudo cumin 'R:class = "tlsproxy::instance"' 'apt-cache policy nginx-full|egrep "Installed:|Candidate:"'
366 hosts will be targeted:
conf[2001-2003].codfw.wmnet,conf[1001-1006].eqiad.wmnet,cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[5001-5005,5007-5012].eqsin.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mw[2135-2147,2153-2243,2247-2258,2261-2290].codfw.wmnet,mw[1221-1235,1238-1258,1261-1269,1273-1282,1284,1287-1290,1299-1306,1308-1317,1319-1337,1339-1348].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(57) conf[1004-1006].eqiad.wmnet,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mw[2261-2290].codfw.wmnet,mw[1261-1268,1276-1282,1299].eqiad.wmnet
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----
  Installed: 1.13.6-2+wmf1
  Candidate: 1.13.6-2+wmf1
===== NODE GROUP =====
(309) conf[2001-2003].codfw.wmnet,conf[1001-1003].eqiad.wmnet,cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[5001-5005,5007-5012].eqsin.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,mw[2135-2147,2153-2243,2247-2258].codfw.wmnet,mw[1221-1235,1238-1258,1269,1273-1275,1284,1287-1290,1300-1306,1308-1317,1319-1337,1339-1348].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----
  Installed: 1.13.6-2+wmf1~jessie1
  Candidate: 1.13.6-2+wmf1~jessie1
================

Now we are in position for the next step: swapping nginx-full for nginx-light.

jbond added a subscriber: jbond. · Feb 6 2019, 1:48 PM