
Vgutierrez (Valentín Gutiérrez)
Senior Site Reliability Engineer, Traffic Team

User Details

User Since
Feb 12 2018, 9:51 AM (251 w, 3 d)
Availability
Available
IRC Nick
vgutierrez
LDAP User
Vgutierrez
MediaWiki User
Unknown

Recent Activity

Fri, Dec 2

Vgutierrez committed rOSAC47c9054d39e1: debian: Add release 0.36 to changelog (authored by Vgutierrez).
debian: Add release 0.36 to changelog
Fri, Dec 2, 9:57 AM
Vgutierrez committed rOSACbfef4492d4e9: setup.py: update dependencies for bullseye (authored by ssingh).
setup.py: update dependencies for bullseye
Fri, Dec 2, 9:42 AM
Vgutierrez committed rOSACe53926f42c9b: Release 0.36 (authored by Vgutierrez).
Release 0.36
Fri, Dec 2, 9:42 AM
Vgutierrez committed rOSAC6986477af600: Release 0.36 (authored by Vgutierrez).
Release 0.36
Fri, Dec 2, 9:18 AM

Thu, Dec 1

Vgutierrez committed rOSACe06af153436d: setup.py: update dependencies for bullseye (authored by ssingh).
setup.py: update dependencies for bullseye
Thu, Dec 1, 8:38 PM
Vgutierrez edited projects for T324200: Handle edge cache invalidation for the api gateway, added: Traffic; removed Traffic-Icebox.

Re-tagging the task, I'm assuming it got into traffic-icebox by mistake :)

Thu, Dec 1, 3:45 PM · SRE, Traffic, Platform Team Initiatives (API Gateway), serviceops
Vgutierrez added a comment to T188561: SSL cert for links.email.wikimedia.org.

I can confirm that they've added HSTS support and stopped serving content on port 80, which now redirects to port 443:

$ curl -I links.e.protectus.org -s |grep -i location
Location: https://links.e.protectus.org/
$ curl -I -s https://links.e.protectus.org |grep strict-transport-security
strict-transport-security: max-age=31536000; includeSubDomains; preload
Thu, Dec 1, 11:25 AM · Fundraising Sprint Wibbly Wobbly Timey Wimey, Fundraising Sprint Undefined, Fundraising Sprint Turtles that are robotic that destroy the whole world with their foot theory, Traffic-Icebox, fr-donorservices, FR-Email, Fundraising-Backlog, SRE, fundraising-tech-ops
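
The header returned by the curl check above can also be verified programmatically. The following is a minimal sketch, not part of the original comment: the function names are hypothetical, and the one-year minimum for max-age is an assumption taken from the value quoted in this thread.

```python
def parse_hsts(value: str) -> dict:
    """Parse a Strict-Transport-Security header value into its directives."""
    directives = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directive names are case-insensitive; valueless directives map to True.
        directives[name.strip().lower()] = arg.strip().strip('"') if arg else True
    return directives


def hsts_meets_policy(value: str, min_age: int = 31536000) -> bool:
    """Assumed policy: max-age >= min_age, includeSubDomains and preload set."""
    d = parse_hsts(value)
    try:
        max_age = int(d.get("max-age", -1))
    except (TypeError, ValueError):
        return False
    return (
        max_age >= min_age
        and d.get("includesubdomains") is True
        and d.get("preload") is True
    )


header = "max-age=31536000; includeSubDomains; preload"
print(parse_hsts(header))
print(hsts_meets_policy(header))
```

The check only looks at the header value; whether the header is sent exclusively over HTTPS still has to be confirmed with a plain-text request, as done with curl above.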

Mon, Nov 28

Vgutierrez updated the task description for T238720: Deprecate and disable port 80 for one-off sites under canonical domains.
Mon, Nov 28, 9:48 AM · Traffic, Patch-For-Review, SRE
Vgutierrez updated the task description for T238720: Deprecate and disable port 80 for one-off sites under canonical domains.
Mon, Nov 28, 9:48 AM · Traffic, Patch-For-Review, SRE
Vgutierrez updated the task description for T238720: Deprecate and disable port 80 for one-off sites under canonical domains.
Mon, Nov 28, 9:45 AM · Traffic, Patch-For-Review, SRE

Wed, Nov 23

Vgutierrez updated the task description for T238720: Deprecate and disable port 80 for one-off sites under canonical domains.
Wed, Nov 23, 3:17 PM · Traffic, Patch-For-Review, SRE
Vgutierrez added a comment to T286066: Put lists.wikimedia.org web interface behind LVS.

@Legoktm it looks like the easiest approach would be adding lists1001 as a backend server on ATS and setting the caching policy to pass. Under this scenario, the lists.wikimedia.org TLS certificate should be a private one handled by our PKI rather than an acme-chief/LE one. After that, we should drop the A/AAAA records and just add a DYNA record like this:

lists      600 IN DYNA geoip!text-addrs
Wed, Nov 23, 11:13 AM · SRE, Wikimedia-Mailing-lists
Vgutierrez added a comment to T323485: Transferpy: Enable PBKDF2 usage.

Thanks, I will give it a look, test it in the right context (piping content on a single thread), and explore implementing it for the next release.

Consider testing openssl dgst besides coreutils' /usr/bin/sha256sum

Wed, Nov 23, 10:29 AM · Data-Persistence, Data-Persistence-Backup, database-backups
Vgutierrez added a comment to T323485: Transferpy: Enable PBKDF2 usage.

According to openssl speed, SHA-256 isn't slower than MD5:

vgutierrez@cp5020:~$ openssl speed md5
[...]
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5             111276.09k   247604.25k   437966.59k   551410.01k   589116.76k   578333.35k
Wed, Nov 23, 9:37 AM · Data-Persistence, Data-Persistence-Backup, database-backups
Vgutierrez added a comment to T323485: Transferpy: Enable PBKDF2 usage.

Take into account that PBKDF2 is only used for key derivation. The number of iterations could slow that step down a little, but the actual encryption/decryption isn't affected by it.

Wed, Nov 23, 9:21 AM · Data-Persistence, Data-Persistence-Backup, database-backups
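
To illustrate the point about key derivation, here is a minimal sketch using Python's standard hashlib.pbkdf2_hmac. It is not how transfer.py or openssl enc derive their keys; the passphrase, salt size, digest, and iteration counts are all assumptions chosen for illustration.

```python
import hashlib
import os
import time


def derive_key(password: bytes, salt: bytes, iterations: int) -> bytes:
    """Derive a 32-byte key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=32)


if __name__ == "__main__":
    salt = os.urandom(16)
    # The iteration count only changes how long this one-off step takes.
    for iterations in (1_000, 100_000):
        start = time.perf_counter()
        key = derive_key(b"example passphrase", salt, iterations)
        elapsed = time.perf_counter() - start
        print(f"{iterations:>7} iterations: {len(key)}-byte key in {elapsed:.4f}s")
```

The derivation runs once per transfer. The resulting key has the same length whatever the iteration count, so the cipher that consumes it processes data at the same rate.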

Tue, Nov 22

Vgutierrez updated the task description for T238720: Deprecate and disable port 80 for one-off sites under canonical domains.
Tue, Nov 22, 2:15 PM · Traffic, Patch-For-Review, SRE
Vgutierrez added a comment to T316337: Phabricator was logging out users repeatedly (2022-08-26).

I am going to do it, but I am waiting for a 1 paragraph from @Vgutierrez to understand what actually happened to varnish (not just the effects and response).

Nothing happened to varnish. ATS was the culprit. https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 prevented phabricator session cookies from reaching the phabricator origin server. A more detailed explanation is included in the commit message for https://gerrit.wikimedia.org/r/c/operations/puppet/+/828002:

Tue, Nov 22, 2:06 PM · Wikimedia-Incident, SRE, Phabricator, Traffic
Vgutierrez updated the task description for T238720: Deprecate and disable port 80 for one-off sites under canonical domains.
Tue, Nov 22, 12:04 PM · Traffic, Patch-For-Review, SRE
Vgutierrez closed T320397: _etcd-client SRV record missing for conftool cluster as Resolved.
vgutierrez@lvs6001:~$ ./liberica etcd --config /home/vgutierrez/config.yaml 
Using config file: /home/vgutierrez/config.yaml
2022/11/22 11:52:15 Spawning Watchers...
2022/11/22 11:52:15 Watching /conftool/v1/pools/drmrs/cache_text/ats-tls
2022/11/22 11:52:15 Watching /conftool/v1/pools/drmrs/ncredir/nginx
2022/11/22 11:52:15 etcd endpoints discovered: [https://conf1009.eqiad.wmnet.:4001 https://conf1007.eqiad.wmnet.:4001 https://conf1008.eqiad.wmnet.:4001]

endpoints are now being discovered as expected. Thanks @Joe

Tue, Nov 22, 11:54 AM · Patch-For-Review, Traffic, serviceops, SRE
Vgutierrez updated the task description for T238720: Deprecate and disable port 80 for one-off sites under canonical domains.
Tue, Nov 22, 11:20 AM · Traffic, Patch-For-Review, SRE
Vgutierrez updated the task description for T238720: Deprecate and disable port 80 for one-off sites under canonical domains.
Tue, Nov 22, 11:18 AM · Traffic, Patch-For-Review, SRE
Vgutierrez edited projects for T238720: Deprecate and disable port 80 for one-off sites under canonical domains, added: Traffic; removed Traffic-Icebox.
Tue, Nov 22, 10:47 AM · Traffic, Patch-For-Review, SRE
Vgutierrez updated the task description for T323557: Let HAProxy handle port 80.
Tue, Nov 22, 10:11 AM · SRE, Traffic
Vgutierrez triaged T323557: Let HAProxy handle port 80 as Medium priority.
Tue, Nov 22, 10:11 AM · SRE, Traffic
Vgutierrez created T323557: Let HAProxy handle port 80.
Tue, Nov 22, 10:11 AM · SRE, Traffic
Vgutierrez closed T254235: Let ats-tls handle port 80 as Invalid.

ats-tls has been deprecated in favor of HAProxy

Tue, Nov 22, 10:10 AM · Traffic-Icebox, SRE

Mon, Nov 21

Vgutierrez closed T323365: Rename role::cache::(text|upload)_haproxy to role::cache::(text|upload), a subtask of T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic, as Resolved.
Mon, Nov 21, 4:05 PM · Performance-Team (Radar), Patch-For-Review, SRE, Traffic
Vgutierrez closed T323365: Rename role::cache::(text|upload)_haproxy to role::cache::(text|upload) as Resolved.
Mon, Nov 21, 4:05 PM · Patch-For-Review, SRE, Traffic
Vgutierrez created T323485: Transferpy: Enable PBKDF2 usage.
Mon, Nov 21, 11:19 AM · Data-Persistence, Data-Persistence-Backup, database-backups
Vgutierrez added a comment to T321605: Make WCQS/WDQS data transfer cookbook more reliable .

After a few tests with openssl enc, it looks like it doesn't support AEAD cipher suites, so chacha20 should be used instead of aes-256-cbc, taking into account that authenticity isn't guaranteed by openssl during the transfer process.

Mon, Nov 21, 11:15 AM · Discovery-Search (Current work)
Vgutierrez added a comment to T321605: Make WCQS/WDQS data transfer cookbook more reliable .

Regarding OpenSSL cipher suites: aes-256-cbc shouldn't be used anywhere in production nowadays, as CBC is vulnerable to padding oracle attacks. https://wikitech.wikimedia.org/wiki/Transfer.py currently uses chacha20 (and not chacha20-poly1305). chacha20 alone is faster than aes-256-gcm but not faster than aes-128-gcm, and it only ensures confidentiality, while aes-(128|256)-gcm provide both confidentiality and authenticity. If you consider chacha20-poly1305 (which ensures both confidentiality and authenticity), then aes-256-gcm should be chosen instead, as it's faster (thanks to the hardware implementation provided by AES-NI).

openssl single threaded benchmarks for aes-128|256-gcm, chacha20 and chacha20-poly1305
vgutierrez@cp5020:~$ openssl speed -evp aes-128-gcm
Doing aes-128-gcm for 3s on 16 size blocks: 98984079 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 63266501 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 34909080 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 1024 size blocks: 12490399 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 1871993 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 16384 size blocks: 940859 aes-128-gcm's in 3.00s
OpenSSL 1.1.1n  15 Mar 2022
built on: Fri Jun 24 20:07:00 2022 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-k6U0OK/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-gcm     527915.09k  1349685.35k  2978908.16k  4263389.53k  5111788.89k  5138344.62k
vgutierrez@cp5020:~$ openssl speed -evp aes-256-gcm
Doing aes-256-gcm for 3s on 16 size blocks: 88506987 aes-256-gcm's in 2.99s
Doing aes-256-gcm for 3s on 64 size blocks: 53360573 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 256 size blocks: 30021007 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 1024 size blocks: 10522239 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 8192 size blocks: 1574806 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 16384 size blocks: 805803 aes-256-gcm's in 3.00s
OpenSSL 1.1.1n  15 Mar 2022
built on: Fri Jun 24 20:07:00 2022 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-k6U0OK/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-gcm     473615.98k  1138358.89k  2561792.60k  3591590.91k  4300270.25k  4400758.78k
vgutierrez@cp5020:~$ openssl speed -evp chacha20
Doing chacha20 for 3s on 16 size blocks: 66353577 chacha20's in 3.00s
Doing chacha20 for 3s on 64 size blocks: 28290196 chacha20's in 3.00s
Doing chacha20 for 3s on 256 size blocks: 26906428 chacha20's in 3.00s
Doing chacha20 for 3s on 1024 size blocks: 12356852 chacha20's in 3.00s
Doing chacha20 for 3s on 8192 size blocks: 1663576 chacha20's in 3.00s
Doing chacha20 for 3s on 16384 size blocks: 840369 chacha20's in 3.00s
OpenSSL 1.1.1n  15 Mar 2022
built on: Fri Jun 24 20:07:00 2022 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-k6U0OK/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
chacha20        353885.74k   603524.18k  2296015.19k  4217805.48k  4542671.53k  4589535.23k
vgutierrez@cp5020:~$ openssl speed -evp chacha20-poly1305
Doing chacha20-poly1305 for 3s on 16 size blocks: 43895625 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 64 size blocks: 23138776 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 256 size blocks: 18327745 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 1024 size blocks: 8294664 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 8192 size blocks: 1135730 chacha20-poly1305's in 3.00s
Doing chacha20-poly1305 for 3s on 16384 size blocks: 572043 chacha20-poly1305's in 3.00s
OpenSSL 1.1.1n  15 Mar 2022
built on: Fri Jun 24 20:07:00 2022 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-k6U0OK/openssl-1.1.1n=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
chacha20-poly1305   234110.00k   493627.22k  1563967.57k  2831245.31k  3101300.05k  3124117.50k
Mon, Nov 21, 10:05 AM · Discovery-Search (Current work)
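
The summary rows of the four benchmarks above can be compared side by side. This is a small sketch that parses those rows (copied verbatim from the output) and ranks the ciphers by throughput. The helper names are hypothetical, and the parser assumes the six-column layout shown by this OpenSSL 1.1.1n build.

```python
SUMMARY_ROWS = """\
aes-128-gcm     527915.09k  1349685.35k  2978908.16k  4263389.53k  5111788.89k  5138344.62k
aes-256-gcm     473615.98k  1138358.89k  2561792.60k  3591590.91k  4300270.25k  4400758.78k
chacha20        353885.74k   603524.18k  2296015.19k  4217805.48k  4542671.53k  4589535.23k
chacha20-poly1305   234110.00k   493627.22k  1563967.57k  2831245.31k  3101300.05k  3124117.50k
"""

BLOCK_SIZES = (16, 64, 256, 1024, 8192, 16384)


def parse_speed_rows(text):
    """Map cipher name -> {block size: throughput in 1000s of bytes/s}."""
    results = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(BLOCK_SIZES) + 1:
            continue  # skip anything that is not a summary row
        name, values = fields[0], fields[1:]
        results[name] = {
            size: float(value.rstrip("k"))
            for size, value in zip(BLOCK_SIZES, values)
        }
    return results


def rank(results, block_size):
    """Cipher names sorted from fastest to slowest at one block size."""
    return sorted(results, key=lambda name: results[name][block_size], reverse=True)


speeds = parse_speed_rows(SUMMARY_ROWS)
for size in (1024, 16384):
    print(size, rank(speeds, size))
```

At both 1024-byte and 16384-byte blocks, the ordering matches the comment: aes-128-gcm first, then chacha20, aes-256-gcm, and chacha20-poly1305 last.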

Fri, Nov 18

Vgutierrez triaged T323365: Rename role::cache::(text|upload)_haproxy to role::cache::(text|upload) as Medium priority.
Fri, Nov 18, 11:19 AM · Patch-For-Review, SRE, Traffic
Vgutierrez created T323365: Rename role::cache::(text|upload)_haproxy to role::cache::(text|upload).
Fri, Nov 18, 11:18 AM · Patch-For-Review, SRE, Traffic
Vgutierrez closed T306236: Improve handling/logging of HAproxy emergency log messages as Resolved.
Fri, Nov 18, 11:14 AM · SRE, Traffic
Vgutierrez added a comment to T320397: _etcd-client SRV record missing for conftool cluster.

ping?

Fri, Nov 18, 10:33 AM · Patch-For-Review, Traffic, serviceops, SRE
Vgutierrez closed T323263: Wikipedia on flow with no http request, still responds with a Bad Request 400 as Resolved.

The fix has been merged and is being deployed; it should be available fleet-wide in ~30 minutes.

Fri, Nov 18, 10:03 AM · Upstream, SRE, Traffic
Vgutierrez changed the status of T323263: Wikipedia on flow with no http request, still responds with a Bad Request 400 from Stalled to In Progress.

We were missing one config option in our HAProxy setup: option http-ignore-probes. After enabling it, HAProxy behaves as expected:

with option http-ignore-probes
# tshark -r /root/tls.pcap -o tls.keylog_file:/home/vgutierrez/sslkeylog.log -z "follow,tls,hex,0"
Running as user "root" and group "root". This could be dangerous.
    1 0.000000000    127.0.0.1 → 127.0.0.1    TCP 74 38798 → 443 [SYN] Seq=0 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=2288492386 TSecr=0 WS=512
    2 0.000038703    127.0.0.1 → 127.0.0.1    TCP 74 443 → 38798 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=2288492386 TSecr=2288492386 WS=512
    3 0.000062069    127.0.0.1 → 127.0.0.1    TCP 66 38798 → 443 [ACK] Seq=1 Ack=1 Win=44032 Len=0 TSval=2288492386 TSecr=2288492386
    4 0.001980314    127.0.0.1 → 127.0.0.1    TLSv1 349 Client Hello
    5 0.003427271    127.0.0.1 → 127.0.0.1    TLSv1.3 4162 Server Hello, Change Cipher Spec, Encrypted Extensions
    6 0.003479768    127.0.0.1 → 127.0.0.1    TCP 66 38798 → 443 [ACK] Seq=284 Ack=4097 Win=41472 Len=0 TSval=2288492389 TSecr=2288492389
    7 0.007754045    127.0.0.1 → 127.0.0.1    TLSv1.3 1279 Certificate, Certificate Verify, Finished
    8 0.007769038    127.0.0.1 → 127.0.0.1    TCP 66 38798 → 443 [ACK] Seq=284 Ack=5310 Win=42496 Len=0 TSval=2288492393 TSecr=2288492393
    9 0.012263131    127.0.0.1 → 127.0.0.1    TLSv1.3 146 Change Cipher Spec, Finished
   10 0.012663505    127.0.0.1 → 127.0.0.1    TLSv1.3 321 New Session Ticket
   11 0.012683209    127.0.0.1 → 127.0.0.1    TCP 66 38798 → 443 [ACK] Seq=364 Ack=5565 Win=44032 Len=0 TSval=2288492398 TSecr=2288492398
   12 0.012804423    127.0.0.1 → 127.0.0.1    TLSv1.3 321 New Session Ticket
   13 0.012816096    127.0.0.1 → 127.0.0.1    TCP 66 38798 → 443 [ACK] Seq=364 Ack=5820 Win=44032 Len=0 TSval=2288492398 TSecr=2288492398
   14 1.989842483    127.0.0.1 → 127.0.0.1    TCP 66 38798 → 443 [FIN, ACK] Seq=364 Ack=5820 Win=44032 Len=0 TSval=2288494375 TSecr=2288492398
   15 1.990087001    127.0.0.1 → 127.0.0.1    TLSv1.3 90 Alert (Level: Warning, Description: Close Notify)
   16 1.990125094    127.0.0.1 → 127.0.0.1    TCP 54 38798 → 443 [RST] Seq=365 Win=0 Len=0
Fri, Nov 18, 9:42 AM · Upstream, SRE, Traffic

Thu, Nov 17

Vgutierrez changed the status of T323263: Wikipedia on flow with no http request, still responds with a Bad Request 400 from Open to Stalled.

reported to upstream in https://github.com/haproxy/haproxy/issues/1934

Thu, Nov 17, 4:11 PM · Upstream, SRE, Traffic
Vgutierrez added a comment to T323263: Wikipedia on flow with no http request, still responds with a Bad Request 400.

This seems to be triggered by HAProxy. I just logged the H1 trace on a cloud test instance using:

echo "trace h1 event +any; trace h1 level developer; trace h1 verbosity complete; trace h1 sink stdout; trace h1 start now"  | sudo socat stdio /run/haproxy/haproxy.sock
Thu, Nov 17, 3:35 PM · Upstream, SRE, Traffic
Vgutierrez triaged T323263: Wikipedia on flow with no http request, still responds with a Bad Request 400 as Lowest priority.

I'm curious about those firewalls, considering that we are talking about TLSv1.3 traffic that they shouldn't be able to inspect. We definitely won't support any kind of setup where a MiTM is happening.

Thu, Nov 17, 12:07 PM · Upstream, SRE, Traffic
Vgutierrez added a comment to T284304: Create dashboard showing aggregate data transfer rates per DC/cluster.

Nice work @BCornwall. The current version looks good. If you allow me a small nitpick, we have a small inconsistency between the labels on the Varnish panels and the HAProxy/ATS ones: both HAProxy and ATS refer to "cache_text" and "cache_upload", but the Varnish ones refer to "varnish-text" and "varnish-upload".

Thu, Nov 17, 5:28 AM · Traffic-Icebox, SRE
Vgutierrez closed T322903: oom killed varnish on cp4047 as Resolved.

THP has been disabled globally as a result of this task with https://gerrit.wikimedia.org/r/857686. A rolling restart has been performed to apply this change:

vgutierrez@cumin1001:~$ sudo cumin A:cp 'grep -c thp:never /proc/$(systemctl show --property MainPID --value varnish-frontend.service)/environ'
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5002-5016,5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet
Ok to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit 96
===== NODE GROUP =====
(96) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5002-5016,5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet
----- OUTPUT of 'grep -c thp:neve...service)/environ' -----
1
Thu, Nov 17, 4:43 AM · SRE, Traffic
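
The cumin check above greps the environment of the running varnish-frontend process for thp:never. A rough Python equivalent of that check, for a single environ blob, could look like the sketch below. The function names are hypothetical, and the sample blob, including the MALLOC_CONF variable name, is an assumption, since the original output only shows the match count.

```python
def parse_environ(blob: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE format used by /proc/<pid>/environ."""
    env = {}
    for entry in blob.split(b"\0"):
        if not entry:
            continue
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def has_setting(blob: bytes, needle: str = "thp:never") -> bool:
    """True if any environment value contains the given setting."""
    return any(needle in value for value in parse_environ(blob).values())


# Hypothetical environment blob; the variable name is an assumption.
sample = b"PATH=/usr/bin\0MALLOC_CONF=thp:never\0"
print(has_setting(sample))
```

On a real host the blob would come from reading /proc/<pid>/environ in binary mode for the service's main PID, which requires sufficient privileges.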

Wed, Nov 16

Vgutierrez committed rLPRI0271e85bd57e: secret: Add empty varnish/dp.master.key (authored by Vgutierrez).
secret: Add empty varnish/dp.master.key
Wed, Nov 16, 7:45 PM
Vgutierrez added a comment to T323208: lists apache config change should trigger an apache reload.

hmmm that would trigger a few seconds of downtime every time that Apache is restarted automatically by puppet

Wed, Nov 16, 11:11 AM · Wikimedia-Incident, SRE, Wikimedia-Mailing-lists

Mon, Nov 14

Vgutierrez added a comment to T322903: oom killed varnish on cp4047.

I've updated the description of the task to add the memory used by varnish after running the experiment for the whole weekend. Data gathered with systemctl show varnish-frontend -p MemoryCurrent --value | awk '{print $1/1024/1024/1024 " GB "}'

Mon, Nov 14, 2:22 PM · SRE, Traffic
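
The awk expression above divides the byte count reported by systemd three times by 1024. The same conversion in Python is sketched here; the function name is hypothetical, and returning None for a non-numeric value is an assumption about how one might want to handle a unit without memory accounting.

```python
from typing import Optional


def memory_current_to_gb(raw: str) -> Optional[float]:
    """Convert a MemoryCurrent byte count to units of 1024**3 bytes."""
    try:
        return int(raw.strip()) / 1024 / 1024 / 1024
    except ValueError:
        return None  # value was not a plain integer


print(memory_current_to_gb("1073741824\n"))
```

Note that the result is in binary gigabytes (GiB), even though the awk output labels it "GB".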
Vgutierrez updated the task description for T322903: oom killed varnish on cp4047.
Mon, Nov 14, 2:21 PM · SRE, Traffic

Fri, Nov 11

Vgutierrez lowered the priority of T322903: oom killed varnish on cp4047 from High to Medium.

Lowering the priority after deploying several experiments in upload@ulsfo that could mitigate the issue; see the task description for more details.

Fri, Nov 11, 4:42 PM · SRE, Traffic
Vgutierrez updated the task description for T322903: oom killed varnish on cp4047.
Fri, Nov 11, 3:50 PM · SRE, Traffic
Vgutierrez added a comment to T322903: oom killed varnish on cp4047.

In fact it seems like varnish is the one eating the extra memory... in cp4045 (upload) with the following malloc specific config: -s malloc,283G -s Transient=malloc,10G varnish is consuming 458G, in cp4041 (text) with -s malloc,283G -s Transient=malloc,5G varnish is consuming 318G

Fri, Nov 11, 10:56 AM · SRE, Traffic
Vgutierrez added a comment to T322903: oom killed varnish on cp4047.

After further inspection I don't think that ATS memory increase is enough to explain what we are seeing here, text nodes in ulsfo are using around 326G of RAM but upload ones are using ~470G... that 144G gap can't be explained with the extra 13G used by ATS in upload nodes.

Fri, Nov 11, 10:40 AM · SRE, Traffic
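
The arithmetic behind the two comments above can be spelled out as follows. All input figures are the approximate values quoted in the comments (in GB). The helper names are hypothetical, and adding the malloc and Transient limits to get a "configured" total is a simplification made for this sketch.

```python
def gap_unexplained(upload_total, text_total, ats_extra):
    """Return (gap between node types, part of the gap not explained by ATS)."""
    gap = upload_total - text_total
    return gap, gap - ats_extra


def overage(malloc, transient, observed):
    """Memory used by varnish beyond its configured storage limits."""
    return observed - (malloc + transient)


gap, unexplained = gap_unexplained(upload_total=470, text_total=326, ats_extra=13)
print(f"upload vs text gap: {gap}G, unexplained by ATS: {unexplained}G")

# -s malloc,283G -s Transient=malloc,10G on cp4045 (upload), 458G consumed
print(f"cp4045 over configured storage by {overage(283, 10, 458)}G")
# -s malloc,283G -s Transient=malloc,5G on cp4041 (text), 318G consumed
print(f"cp4041 over configured storage by {overage(283, 5, 318)}G")
```

With these figures, the upload node exceeds its configured storage by far more than the text node, which is consistent with varnish rather than ATS accounting for the difference.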
Vgutierrez updated the task description for T322903: oom killed varnish on cp4047.
Fri, Nov 11, 7:12 AM · SRE, Traffic
Vgutierrez added a comment to T322903: oom killed varnish on cp4047.

Free memory on NUMA Node 0 got below the min threshold (1028416 < 1041448):

Node 0 Normal free:1028416kB min:1041448kB low:1303560kB high:1565672kB reserved_highatomic:2048KB active_anon:1800292kB inactive_anon:257312200kB active_file:0kB inactive_file:408kB unevictable:24kB writepending:0kB present:266338304kB managed:262112564kB mlocked:24kB pagetables:556408kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Fri, Nov 11, 7:11 AM · SRE, Traffic
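
The comparison in the comment above (free memory below the min watermark) can be reproduced by parsing the zone line. This is a minimal sketch with hypothetical function names. Only the first four fields of the kernel line are included, and the regular expression assumes values suffixed with lowercase "kB".

```python
import re

# First fields of the zone line quoted above.
ZONE_LINE = (
    "Node 0 Normal free:1028416kB min:1041448kB "
    "low:1303560kB high:1565672kB"
)


def parse_zone(line):
    """Extract the kB-valued fields from a kernel zone report line."""
    return {key: int(value) for key, value in re.findall(r"(\w+):(\d+)kB", line)}


def below_min(line):
    """True when the zone's free memory is under its min watermark."""
    zone = parse_zone(line)
    return zone["free"] < zone["min"]


zone = parse_zone(ZONE_LINE)
print(below_min(ZONE_LINE), zone["min"] - zone["free"])
```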
Vgutierrez moved T322903: oom killed varnish on cp4047 from Triage to Active Issues on the Traffic board.
Fri, Nov 11, 6:44 AM · SRE, Traffic
Vgutierrez triaged T322903: oom killed varnish on cp4047 as High priority.
Fri, Nov 11, 6:21 AM · SRE, Traffic
Vgutierrez created T322903: oom killed varnish on cp4047.
Fri, Nov 11, 6:19 AM · SRE, Traffic

Thu, Nov 10

Vgutierrez closed T319324: Consider adding X-Analytics subfield for 'has a session cookie' as Resolved.

CR merged and https://wikitech.wikimedia.org/wiki/X-Analytics#Keys updated, thanks for choosing the traffic team CDN services ;P

Thu, Nov 10, 11:57 AM · Analytics-Radar, SRE, Traffic
Vgutierrez added a comment to T264021: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface.

Are we sure that this is a service-side issue? This sounds a lot like a FetchError triggered by the client going away or the connection being interrupted before varnish gets the whole POST body (which triggers a 503 issued by Varnish). ATS seems to believe that eventgate-logging-external.discovery.wmnet is rather healthy: https://grafana.wikimedia.org/goto/QKFq4aD4k?orgId=1

Thu, Nov 10, 9:42 AM · Analytics, Event-Platform Value Stream, Data-Engineering, SRE

Wed, Nov 9

Vgutierrez added a comment to T188561: SSL cert for links.email.wikimedia.org.

Thanks for the feedback and requirements documentation, @Vgutierrez. Acoustic, the vendor in this case, doesn't have specific documentation regarding this, but I can tell you that they recently moved to certs issued by Amazon AWS. I am going to open a support ticket with Acoustic to see if they can address the specific issues you've noted. First, I wanted to ensure I understand them correctly so I have a couple of clarifying questions:

  • They lack support of HSTS

For this item, we need them to add an HSTS header to the pages that respond at the proposed domain (links.email.wikimedia.org), correct? And this header specification is sufficient: Strict-Transport-Security: max-age=31536000; includeSubDomains; preload?

Yes, but the HSTS header must be set only on HTTPS requests, not for plain text ones

  • They don't use CAA, at least for the example provided (links.e.protectus.org)

I am not well-versed on DNS CAA implementation, but it looks like the key component is adding a CAA record to the domain's DNS that specifies the CAs that are allowed to issue certs for the domain. DNS in this case is controlled by Wikimedia. Is there anything Acoustic needs to implement to satisfy this requirement?

Just let us know which CAs need to be authorized in our CAA record for links.email.wikimedia.org

  • They serve requests in plain text (again based on links.e.protectus.org)

Clarifying that this is referring to the example link tracking domain (links.e.protectus.org) serving content over HTTP upon request rather than redirecting to HTTPS? If so, I can use the requirements provided to detail the implementation needs for Acoustic.

Yes, that's it.

Wed, Nov 9, 5:14 PM · Fundraising Sprint Wibbly Wobbly Timey Wimey, Fundraising Sprint Undefined, Fundraising Sprint Turtles that are robotic that destroy the whole world with their foot theory, Traffic-Icebox, fr-donorservices, FR-Email, Fundraising-Backlog, SRE, fundraising-tech-ops
Vgutierrez closed T321804: Enterprise redirect for wikimediaenterprise.com to enterprise.wikimedia.com as Resolved.
vgutierrez@ncredir6001:~$ curl -L -I http://wikimediaenterprise.com 
HTTP/1.1 301 Moved Permanently
Server: nginx/1.14.2
Date: Wed, 09 Nov 2022 17:08:50 GMT
Content-Type: text/html
Content-Length: 185
Connection: keep-alive
Location: https://wikimediaenterprise.com/
Wed, Nov 9, 5:10 PM · SRE, Traffic
Vgutierrez created P38789 (An Untitled Masterwork).
Wed, Nov 9, 10:18 AM · Traffic

Nov 8 2022

Vgutierrez committed rLPRI61df52ba8470: labs: add profile::swift::replication_keys data (authored by Vgutierrez).
labs: add profile::swift::replication_keys data
Nov 8 2022, 3:59 PM
Vgutierrez committed rLPRI1c0ebd7a8e41: labs: Add profile::swift::account_keys data (authored by Vgutierrez).
labs: Add profile::swift::account_keys data
Nov 8 2022, 3:58 PM
Vgutierrez updated the task description for T306068: Cloud VPS "deployment-prep" project Stretch deprecation.
Nov 8 2022, 3:28 PM · Beta-Cluster-Infrastructure, Cloud-VPS (Debian Stretch Deprecation)
Vgutierrez closed T322231: Create new deployment-ms-be instances running Debian Bullseye as Resolved.
Nov 8 2022, 3:27 PM · Patch-For-Review, Beta-Cluster-Infrastructure
Vgutierrez updated the task description for T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye.
Nov 8 2022, 3:27 PM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez closed T322231: Create new deployment-ms-be instances running Debian Bullseye, a subtask of T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye, as Resolved.
Nov 8 2022, 3:27 PM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez closed T322231: Create new deployment-ms-be instances running Debian Bullseye, a subtask of T321654: Thumbnails on beta cluster return 503 Service Unavailable, as Resolved.
Nov 8 2022, 3:27 PM · serviceops, Beta-Cluster-reproducible, Beta-Cluster-Infrastructure
Vgutierrez updated the task description for T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye.
Nov 8 2022, 6:37 AM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez updated the task description for T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye.
Nov 8 2022, 6:16 AM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez closed T322554: Create new deployment-ms-fe instance running Debian Bullseye as Resolved.
Nov 8 2022, 6:12 AM · Patch-For-Review, Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez closed T322554: Create new deployment-ms-fe instance running Debian Bullseye, a subtask of T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye, as Resolved.
Nov 8 2022, 6:12 AM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez updated the task description for T306068: Cloud VPS "deployment-prep" project Stretch deprecation.
Nov 8 2022, 6:12 AM · Beta-Cluster-Infrastructure, Cloud-VPS (Debian Stretch Deprecation)

Nov 7 2022

Vgutierrez added a project to T322575: ATS isn't caching documents in deployment-cache-upload07: Data-Persistence.

On Swift, the Cache-Control header is appended on the ms-fe instances by a filter called ensure_max_age with the following config:

[filter:ensure_max_age]
paste.filter_factory = wmf.ensure_max_age:filter_factory
methods_list = HEAD GET 
status_list = 200 
max_age = 86400
host_list = upload.wikimedia.org
Nov 7 2022, 7:22 PM · SRE, Data-Persistence, Traffic, Beta-Cluster-Infrastructure
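
To make the role of each config key concrete, here is a sketch of how a filter with these settings might decide whether to add the header. This is not the actual wmf.ensure_max_age code, which is not shown in the task. The function, the matching rules, and the choice to leave an existing Cache-Control header untouched are all assumptions.

```python
# Values copied from the [filter:ensure_max_age] config above.
CONFIG = {
    "methods_list": {"HEAD", "GET"},
    "status_list": {200},
    "max_age": 86400,
    "host_list": {"upload.wikimedia.org"},
}


def ensure_max_age(method, status, host, headers, config=CONFIG):
    """Return a copy of headers, adding Cache-Control when the request qualifies.

    Assumed behaviour: all three lists must match, and an existing
    Cache-Control header is never overwritten.
    """
    out = dict(headers)
    has_cache_control = any(name.lower() == "cache-control" for name in out)
    if (
        not has_cache_control
        and method in config["methods_list"]
        and status in config["status_list"]
        and host in config["host_list"]
    ):
        out["Cache-Control"] = f"max-age={config['max_age']}"
    return out


print(ensure_max_age("GET", 200, "upload.wikimedia.org", {}))
print(ensure_max_age("GET", 200, "upload.wikimedia.beta.wmflabs.org", {}))
```

Under this assumed matching rule, a request whose Host is not in host_list would be returned without a Cache-Control header.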
Vgutierrez added a comment to T322575: ATS isn't caching documents in deployment-cache-upload07.

For some reason swift isn't setting a Cache-Control header in deployment-prep:

vgutierrez@deployment-ms-fe04:~$ curl --connect-to upload.wikimedia.beta.wmflabs.org:80:127.0.0.1 http://upload.wikimedia.beta.wmflabs.org/wikipedia/commons/d/de/123_4.jpg -v -o /dev/null |grep -i cache-control
* Connecting to hostname: 127.0.0.1
*   Trying 127.0.0.1:80...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 127.0.0.1 (127.0.0.1) port 80 (#0)
> GET /wikipedia/commons/d/de/123_4.jpg HTTP/1.1
> Host: upload.wikimedia.beta.wmflabs.org
> User-Agent: curl/7.74.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: image/jpeg
< X-Object-Meta-Sha1Base36: l3g1j3ermq3z1om07dpn18k7wq6dtis
< Etag: 0d1cc1db7e0c7e206cd365a39bd14e12
< Last-Modified: Thu, 20 Jan 2022 19:42:40 GMT
< X-Timestamp: 1642707759.06735
< Accept-Ranges: bytes
< Content-Length: 200456
< Access-Control-Allow-Origin: *
< X-Trans-Id: tx8a5843561ed543fd8847d-0063695945
< X-Openstack-Request-Id: tx8a5843561ed543fd8847d-0063695945
< Date: Mon, 07 Nov 2022 19:15:18 GMT
< 
{ [22016 bytes data]
100  195k  100  195k    0     0  5019k      0 --:--:-- --:--:-- --:--:-- 5290k
* Connection #0 to host 127.0.0.1 left intact
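
Note that curl -v writes its verbose trace to stderr, so the grep in the command above only sees stdout and the whole trace is printed unfiltered. As a rough sketch of how the same check could be scripted, the helper below (a hypothetical name, not part of any tool used here) collects the response headers from a saved verbose trace, assuming response header lines are prefixed with "< " as in the transcript:

```python
def response_headers(verbose_output: str) -> dict:
    """Collect response headers from `curl -v` output, keyed by lowercase name."""
    headers = {}
    for line in verbose_output.splitlines():
        if not line.startswith("< "):
            continue  # request lines, progress meter and info lines
        name, sep, value = line[2:].partition(":")
        if sep:  # the status line has no colon and is skipped
            headers[name.strip().lower()] = value.strip()
    return headers


# A few lines copied from the transcript above.
sample = """> GET /wikipedia/commons/d/de/123_4.jpg HTTP/1.1
< HTTP/1.1 200 OK
< Content-Type: image/jpeg
< Content-Length: 200456
< Date: Mon, 07 Nov 2022 19:15:18 GMT
"""
headers = response_headers(sample)
print("cache-control" in headers)  # False
```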
Nov 7 2022, 7:16 PM · SRE, Data-Persistence, Traffic, Beta-Cluster-Infrastructure
Vgutierrez triaged T322575: ATS isn't caching documents in deployment-cache-upload07 as Medium priority.
Nov 7 2022, 7:12 PM · SRE, Data-Persistence, Traffic, Beta-Cluster-Infrastructure
Vgutierrez created T322575: ATS isn't caching documents in deployment-cache-upload07.
Nov 7 2022, 7:11 PM · SRE, Data-Persistence, Traffic, Beta-Cluster-Infrastructure
Vgutierrez committed rOSPUa2d3165c9949: Release 0.19 (authored by ssingh).
Release 0.19
Nov 7 2022, 5:43 PM
Vgutierrez updated the task description for T306068: Cloud VPS "deployment-prep" project Stretch deprecation.
Nov 7 2022, 4:40 PM · Beta-Cluster-Infrastructure, Cloud-VPS (Debian Stretch Deprecation)
Vgutierrez updated the task description for T306068: Cloud VPS "deployment-prep" project Stretch deprecation.
Nov 7 2022, 4:36 PM · Beta-Cluster-Infrastructure, Cloud-VPS (Debian Stretch Deprecation)
Vgutierrez updated the task description for T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye.
Nov 7 2022, 4:06 PM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez created T322554: Create new deployment-ms-fe instance running Debian Bullseye.
Nov 7 2022, 4:05 PM · Patch-For-Review, Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez updated the task description for T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye.
Nov 7 2022, 4:02 PM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez added a parent task for T322231: Create new deployment-ms-be instances running Debian Bullseye: T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye.
Nov 7 2022, 4:02 PM · Patch-For-Review, Beta-Cluster-Infrastructure
Vgutierrez added a subtask for T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye: T322231: Create new deployment-ms-be instances running Debian Bullseye.
Nov 7 2022, 4:02 PM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
Vgutierrez committed rOSAC4144adb51552: debian: Add release 0.35 to changelog (authored by Vgutierrez).
debian: Add release 0.35 to changelog
Nov 7 2022, 12:14 PM
Vgutierrez closed T322420: ATS flags origin servers as down during 60 seconds after a connect timeout as Resolved.
Nov 7 2022, 9:04 AM · Traffic, SRE

Nov 4 2022

Vgutierrez added a comment to T322420: ATS flags origin servers as down during 60 seconds after a connect timeout.

I think proxy.config.http.connect.dead.policy is also interesting for us:

Controls what origin server connection failures contribute to marking a server dead. When set to 2, any connection failure during the TCP and TLS handshakes will contribute to marking the server dead. When set to 1, only TCP handshake failures will contribute to marking a server dead. When set to 0, no connection failures will be used towards marking a server dead.
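
The three values described above can be summarised as a small decision function. This is only a model of the quoted documentation text, not ATS code; the function name and the "tcp"/"tls" stage labels are made up for illustration.

```python
def counts_toward_dead(policy: int, stage: str) -> bool:
    """Return True if a connection failure at `stage` contributes to
    marking the origin dead under the given dead.policy value."""
    if stage not in ("tcp", "tls"):
        raise ValueError(f"unknown handshake stage: {stage!r}")
    if policy == 2:
        return True            # TCP and TLS handshake failures both count
    if policy == 1:
        return stage == "tcp"  # only TCP handshake failures count
    if policy == 0:
        return False           # no connection failure counts
    raise ValueError(f"unsupported policy value: {policy!r}")
```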
Nov 4 2022, 4:00 PM · Traffic, SRE
Vgutierrez claimed T322420: ATS flags origin servers as down during 60 seconds after a connect timeout.
Nov 4 2022, 3:27 PM · Traffic, SRE
Vgutierrez created T322420: ATS flags origin servers as down during 60 seconds after a connect timeout.
Nov 4 2022, 3:26 PM · Traffic, SRE
Vgutierrez added a comment to T188561: SSL cert for links.email.wikimedia.org.

Their TLS termination has been improved over time, but they still don't meet the requirements listed on https://wikitech.wikimedia.org/wiki/HTTPS for a canonical domain (wikimedia.org):

  • They lack HSTS support
  • They don't use CAA, at least for the example provided (links.e.protectus.org)
  • They serve requests in plain text (again based on links.e.protectus.org)
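
As a rough sketch of how the HSTS point could be checked programmatically, the helper below parses a Strict-Transport-Security header value. The function name and the returned dictionary keys are hypothetical; the sample value is the one reported in the follow-up comment on this task.

```python
def parse_hsts(value: str) -> dict:
    """Parse a Strict-Transport-Security value into a small policy dict."""
    policy = {"max_age": None, "include_subdomains": False, "preload": False}
    for directive in value.split(";"):
        name, _, arg = directive.strip().partition("=")
        name = name.strip().lower()  # directive names are case-insensitive
        if name == "max-age":
            policy["max_age"] = int(arg.strip().strip('"'))
        elif name == "includesubdomains":
            policy["include_subdomains"] = True
        elif name == "preload":
            policy["preload"] = True
    return policy


policy = parse_hsts("max-age=31536000; includeSubDomains; preload")
print(policy["max_age"])  # 31536000
```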
Nov 4 2022, 2:10 PM · Fundraising Sprint Wibbly Wobbly Timey Wimey, Fundraising Sprint Undefined, Fundraising Sprint Turtles that are robotic that destroy the whole world with their foot theory, Traffic-Icebox, fr-donorservices, FR-Email, Fundraising-Backlog, SRE, fundraising-tech-ops
Vgutierrez added a comment to T321684: haproxy::site doesn't work as expected on the first puppet run.

However, if someone is introducing haproxy::site in their puppetization and haproxy isn't already installed, then yes, I think they'll run into this bug.

Nov 4 2022, 2:06 PM · Cloud-Services, Thumbor, Puppet, Infrastructure-Foundations, Data-Persistence
Vgutierrez added a comment to T321684: haproxy::site doesn't work as expected on the first puppet run.

It's my understanding that right now the haproxy class is flexible enough for the current use cases. What I'm not sure about is whether what is being described as a bug in this task is actually a bug or just the expected behavior for some of those use cases.

Nov 4 2022, 11:54 AM · Cloud-Services, Thumbor, Puppet, Infrastructure-Foundations, Data-Persistence
Vgutierrez added a comment to T321684: haproxy::site doesn't work as expected on the first puppet run.

cause we don't own the HAProxy puppetization,

@Vgutierrez do you know who does? The CP servers are the biggest user of this class, making up 75% of users, with cloud-control, dbproxy and thumbor being the other users.

Nov 4 2022, 11:45 AM · Cloud-Services, Thumbor, Puppet, Infrastructure-Foundations, Data-Persistence
Vgutierrez removed a project from T321684: haproxy::site doesn't work as expected on the first puppet run: Traffic.

I'm removing Traffic from this task cause we don't own the HAProxy puppetization, but we're happy to help here as one of the main users within SRE. Our custom bits for HAProxy are shipped in profile::cache::haproxy and haproxy::tls_terminator.

Nov 4 2022, 10:11 AM · Cloud-Services, Thumbor, Puppet, Infrastructure-Foundations, Data-Persistence
Vgutierrez updated subscribers of T321684: haproxy::site doesn't work as expected on the first puppet run.

I believe this is not affecting cp instances. In your log, systemd is complaining about several notifications:

systemd[1]: haproxy.service: Got notification message from PID 21899, but reception is disabled.
Nov 4 2022, 10:08 AM · Cloud-Services, Thumbor, Puppet, Infrastructure-Foundations, Data-Persistence

Nov 3 2022

Vgutierrez committed rOSAC004b2af4b223: Release 0.35 (authored by Vgutierrez).
Release 0.35
Nov 3 2022, 4:38 PM
Vgutierrez committed rOSAC86caeb0279c1: readme: Add general notes for testing deps (authored by BCornwall).
readme: Add general notes for testing deps
Nov 3 2022, 4:36 PM
Vgutierrez committed rOSAC8e358eefb6f8: api: Offer JSON for metadata if requested (authored by taavi).
api: Offer JSON for metadata if requested
Nov 3 2022, 4:36 PM
Vgutierrez committed rOSAC1f35a9ee2929: api: support sha256 checksums (authored by taavi).
api: support sha256 checksums
Nov 3 2022, 4:36 PM
Vgutierrez committed rOSAC571d39a078e9: acme-chief: Unlink certificate renewal and OCSP handling (authored by BCornwall).
acme-chief: Unlink certificate renewal and OCSP handling
Nov 3 2022, 4:36 PM