Vgutierrez (Valentín Gutiérrez)
Staff Site Reliability Engineer, Traffic Team

User Details

User Since
Feb 12 2018, 9:51 AM (331 w, 1 d)
Availability
Available
IRC Nick
vgutierrez
LDAP User
Vgutierrez
MediaWiki User
VGutiérrez (WMF)

Recent Activity

Today

Vgutierrez moved T367290: Consider using preconnect for https://phab.wmfusercontent.org CDN from Backlog to Radar/Not for service by Traffic on the Traffic board.

We already leverage preconnect in some cases, not as an HTTP header but via the HTML <link> tag:

$ curl -v -s https://en.wikipedia.org/wiki/Main_Page 2>&1 |grep -i preconnect
<link rel="preconnect" href="//upload.wikimedia.org">
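The same check can be done without grep. The following is a minimal sketch, assuming the page HTML has already been fetched into a string; `PreconnectFinder` and `find_preconnect_hints` are hypothetical helper names for illustration, not part of any Wikimedia tooling.

```python
from html.parser import HTMLParser


class PreconnectFinder(HTMLParser):
    """Collect href values of <link rel="preconnect"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens, e.g. "preconnect dns-prefetch"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "preconnect" in rel_tokens and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_preconnect_hints(html):
    """Return every preconnect target declared via <link> in the given HTML."""
    finder = PreconnectFinder()
    finder.feed(html)
    return finder.hrefs


# The line returned by the curl | grep command above
sample = '<link rel="preconnect" href="//upload.wikimedia.org">'
print(find_preconnect_hints(sample))
```

This only covers the HTML form of the hint; a preconnect sent as an HTTP `Link` response header would have to be checked in the response headers instead.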
Tue, Jun 18, 10:46 AM · Traffic, Upstream, Phabricator (Upstream), Release-Engineering-Team (Priority Backlog 📥)
Vgutierrez created T367861: Migrate ldap-ro and ldap-ro-ssl to IPIP encapsulation.
Tue, Jun 18, 9:32 AM · Infrastructure-Foundations, Traffic

Yesterday

Vgutierrez added a comment to T367731: drmrs/esams/magru LVS : remove cross-rack links.

Don't be too aggressive with this one; we might need to roll back at some point. Let's wait a few weeks at the very least.

Mon, Jun 17, 12:36 PM · netops, Traffic, Infrastructure-Foundations
Vgutierrez updated the task description for T367312: Migrate services behind high-traffic2 LVS to IPIP encapsulation.
Mon, Jun 17, 9:47 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Elasticsearch, Infrastructure-Foundations, Traffic
Vgutierrez added a subtask for T367312: Migrate services behind high-traffic2 LVS to IPIP encapsulation: T367511: Migrate Cloudelastic load balancing to IPIP encapsulation (LVS).
Mon, Jun 17, 9:46 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Elasticsearch, Infrastructure-Foundations, Traffic
Vgutierrez added a parent task for T367511: Migrate Cloudelastic load balancing to IPIP encapsulation (LVS): T367312: Migrate services behind high-traffic2 LVS to IPIP encapsulation.
Mon, Jun 17, 9:46 AM · Patch-For-Review, Data-Platform-SRE, Traffic

Wed, Jun 12

Vgutierrez triaged T367312: Migrate services behind high-traffic2 LVS to IPIP encapsulation as High priority.
Wed, Jun 12, 2:22 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Elasticsearch, Infrastructure-Foundations, Traffic
Vgutierrez created T367312: Migrate services behind high-traffic2 LVS to IPIP encapsulation.
Wed, Jun 12, 2:22 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Elasticsearch, Infrastructure-Foundations, Traffic
Vgutierrez closed T366466: Use IPIP encapsulation on lvs<-->text cluster as Resolved.
Wed, Jun 12, 2:04 PM · Patch-For-Review, Traffic
Vgutierrez closed T366466: Use IPIP encapsulation on lvs<-->text cluster, a subtask of T332027: Replace current L4LB with with Katran-based alternative, as Resolved.
Wed, Jun 12, 2:03 PM · Traffic
Vgutierrez reassigned T367204: LVSRealserverMSS alert is broken for ferm based hosts from Vgutierrez to CDobbins.
Wed, Jun 12, 1:31 PM · Traffic
Vgutierrez moved T365616: Consider migrating Search Platform-owned clusters to IPIP encapsulation (LVS) from Backlog to Radar/Not for service by Traffic on the Traffic board.
Wed, Jun 12, 8:46 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Patch-For-Review, Traffic

Tue, Jun 11

Vgutierrez created T367204: LVSRealserverMSS alert is broken for ferm based hosts.
Tue, Jun 11, 4:12 PM · Traffic
Vgutierrez added a comment to T365616: Consider migrating Search Platform-owned clusters to IPIP encapsulation (LVS).

Now that T365689 has been completed we can discuss tackling this one @bking.

Tue, Jun 11, 3:59 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Patch-For-Review, Traffic
Vgutierrez closed T365689: Provide a ferm based alternative to tcp-mss-clamper as Resolved.

ferm-based MSS clamping is live on the ncredir cluster.

Tue, Jun 11, 3:36 PM · Traffic
Vgutierrez closed T365689: Provide a ferm based alternative to tcp-mss-clamper, a subtask of T332027: Replace current L4LB with with Katran-based alternative, as Resolved.
Tue, Jun 11, 3:35 PM · Traffic
Vgutierrez updated the task description for T366466: Use IPIP encapsulation on lvs<-->text cluster.
Tue, Jun 11, 1:04 PM · Patch-For-Review, Traffic

Mon, Jun 10

Vgutierrez updated the task description for T366466: Use IPIP encapsulation on lvs<-->text cluster.
Mon, Jun 10, 1:05 PM · Patch-For-Review, Traffic

Thu, Jun 6

Vgutierrez updated the task description for T366466: Use IPIP encapsulation on lvs<-->text cluster.
Thu, Jun 6, 1:04 PM · Patch-For-Review, Traffic

Wed, Jun 5

Vgutierrez updated the task description for T366466: Use IPIP encapsulation on lvs<-->text cluster.
Wed, Jun 5, 3:06 PM · Patch-For-Review, Traffic

Mon, Jun 3

Vgutierrez updated the task description for T366466: Use IPIP encapsulation on lvs<-->text cluster.
Mon, Jun 3, 2:34 PM · Patch-For-Review, Traffic
Vgutierrez triaged T366466: Use IPIP encapsulation on lvs<-->text cluster as Medium priority.
Mon, Jun 3, 12:23 PM · Patch-For-Review, Traffic
Vgutierrez created T366466: Use IPIP encapsulation on lvs<-->text cluster.
Mon, Jun 3, 12:22 PM · Patch-For-Review, Traffic

Thu, May 23

Vgutierrez closed T365354: rp_filter should be disabled on puppet apply, a subtask of T357257: Use IPIP encapsulation on lvs<-->upload cluster, as Resolved.
Thu, May 23, 1:34 PM · Patch-For-Review, Traffic
Vgutierrez closed T365354: rp_filter should be disabled on puppet apply as Resolved.
Thu, May 23, 1:34 PM · Traffic
ssingh awarded T357257: Use IPIP encapsulation on lvs<-->upload cluster a Burninate token.
Thu, May 23, 1:01 PM · Patch-For-Review, Traffic
Vgutierrez closed T357257: Use IPIP encapsulation on lvs<-->upload cluster as Resolved.
Thu, May 23, 1:01 PM · Patch-For-Review, Traffic
Vgutierrez closed T357257: Use IPIP encapsulation on lvs<-->upload cluster, a subtask of T332027: Replace current L4LB with with Katran-based alternative, as Resolved.
Thu, May 23, 1:00 PM · Traffic
Vgutierrez created T365689: Provide a ferm based alternative to tcp-mss-clamper.
Thu, May 23, 9:42 AM · Traffic
Vgutierrez closed T350462: Provide a TCP MSS clamping mechanism for real servers as Resolved.

tcp-mss-clamper is already being used to perform MSS clamping on the ncredir and CDN upload clusters.

Thu, May 23, 9:38 AM · Traffic
Vgutierrez closed T350462: Provide a TCP MSS clamping mechanism for real servers, a subtask of T332027: Replace current L4LB with with Katran-based alternative, as Resolved.
Thu, May 23, 9:38 AM · Traffic

Wed, May 22

Vgutierrez updated the task description for T357257: Use IPIP encapsulation on lvs<-->upload cluster.
Wed, May 22, 3:18 PM · Patch-For-Review, Traffic
Vgutierrez updated the task description for T357257: Use IPIP encapsulation on lvs<-->upload cluster.
Wed, May 22, 1:54 PM · Patch-For-Review, Traffic
Vgutierrez updated the task description for T357257: Use IPIP encapsulation on lvs<-->upload cluster.
Wed, May 22, 12:50 PM · Patch-For-Review, Traffic

Tue, May 21

Vgutierrez updated the task description for T357257: Use IPIP encapsulation on lvs<-->upload cluster.
Tue, May 21, 1:22 PM · Patch-For-Review, Traffic
Vgutierrez closed T364589: acme-chief: add support for serving individual files over the puppet file system api as Resolved.

acme-chief 0.37 deployed, shipping https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/8

Tue, May 21, 1:14 PM · Acme-chief
Vgutierrez added a comment to T365456: Move HTTP/1.0 requests rejections at HAProxy level.

The scope of the task should be rejecting invalid HTTP requests (not only HTTP/1.0 ones) on HAProxy rather than Varnish, as soon as we have analytics moved to HAProxy.

Tue, May 21, 12:43 PM · Patch-For-Review, Traffic
Vgutierrez updated subscribers of T365327: Consider preferring TLS_AES_128_GCM_SHA256 over TLS_AES_256_GCM_SHA384.

Personally, I'm not sold on the idea of decreasing the key size. @BBlack, what are your thoughts?

Tue, May 21, 9:42 AM · Traffic

Mon, May 20

Vgutierrez updated the task description for T357257: Use IPIP encapsulation on lvs<-->upload cluster.
Mon, May 20, 3:39 PM · Patch-For-Review, Traffic
Vgutierrez updated the task description for T357257: Use IPIP encapsulation on lvs<-->upload cluster.
Mon, May 20, 2:58 PM · Patch-For-Review, Traffic
Vgutierrez updated the task description for T357257: Use IPIP encapsulation on lvs<-->upload cluster.
Mon, May 20, 2:14 PM · Patch-For-Review, Traffic
Vgutierrez triaged T365354: rp_filter should be disabled on puppet apply as Medium priority.
Mon, May 20, 12:05 PM · Traffic
Vgutierrez created T365354: rp_filter should be disabled on puppet apply.
Mon, May 20, 12:04 PM · Traffic

Sun, May 19

Vgutierrez added a comment to T365327: Consider preferring TLS_AES_128_GCM_SHA256 over TLS_AES_256_GCM_SHA384.

Stupid note: on my phone, it still prefers AES256. Maybe my client for whatever reason doesn't support ChaCha20, or it thinks it's strong enough to just go with AES instead. But I will try to confirm SSL_OP_PRIORITIZE_CHACHA is working.
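To make the expected behaviour concrete, here is a simplified model of server-side selection with SSL_OP_PRIORITIZE_CHACHA: when server preference is in effect, ChaCha20-Poly1305 suites are moved to the top of the server list only if the client lists a ChaCha20-Poly1305 suite first. This is a sketch of the selection logic, not OpenSSL code; `select_cipher` and the server ordering below are assumptions made for illustration and do not reflect the production configuration.

```python
def select_cipher(server_prefs, client_prefs, prioritize_chacha=True):
    """Pick a suite using server preference, optionally honouring a
    client that lists ChaCha20 first (simplified model)."""
    ordered = list(server_prefs)
    if prioritize_chacha and client_prefs and "CHACHA20" in client_prefs[0]:
        # Client signals ChaCha20 as its top choice: move ChaCha20 suites up
        chacha = [c for c in ordered if "CHACHA20" in c]
        ordered = chacha + [c for c in ordered if "CHACHA20" not in c]
    offered = set(client_prefs)
    for suite in ordered:
        if suite in offered:
            return suite
    return None


# Hypothetical server order, for illustration only
server = [
    "TLS_AES_256_GCM_SHA384",
    "TLS_CHACHA20_POLY1305_SHA256",
    "TLS_AES_128_GCM_SHA256",
]

# Client with AES hardware acceleration: AES suites listed first
aes_client = [
    "TLS_AES_128_GCM_SHA256",
    "TLS_AES_256_GCM_SHA384",
    "TLS_CHACHA20_POLY1305_SHA256",
]

# Client without AES acceleration: ChaCha20 listed first
chacha_client = [
    "TLS_CHACHA20_POLY1305_SHA256",
    "TLS_AES_128_GCM_SHA256",
    "TLS_AES_256_GCM_SHA384",
]

print(select_cipher(server, aes_client))
print(select_cipher(server, chacha_client))
```

Under this model, a client that lists AES first still negotiates the server's top AES suite, which would be consistent with a phone ending up on AES256 even when the option works as intended.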

Sun, May 19, 1:42 PM · Traffic
Vgutierrez added a comment to T365327: Consider preferring TLS_AES_128_GCM_SHA256 over TLS_AES_256_GCM_SHA384.

ChaCha20 is faster than AES when both are running without hardware acceleration. If AES-NI is present, AES is faster. Clients also take this into account when ordering the cipher suites they send to the server as part of the ClientHello.

Sun, May 19, 1:31 PM · Traffic
Vgutierrez added a comment to T365327: Consider preferring TLS_AES_128_GCM_SHA256 over TLS_AES_256_GCM_SHA384.

We can also consider preferring ChaCha20 everywhere, like Meta websites.

Sun, May 19, 1:21 PM · Traffic

May 16 2024

Vgutierrez closed T365101: MSS clamper check triggers false positives as Resolved.

Removed raw socket usage. We are now fetching MSS data via getsockopt().
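For readers unfamiliar with this approach, the following is a minimal sketch of reading the MSS of a connected TCP socket through getsockopt() with the TCP_MAXSEG option. It is not the code of the actual check; `connection_mss` and `loopback_mss` are hypothetical names, and the loopback connection exists only so the example can run on its own.

```python
import socket


def connection_mss(sock):
    """Return the MSS the kernel reports for a connected TCP socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)


def loopback_mss():
    """Open a throwaway loopback connection and read its MSS."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            server_side, _ = listener.accept()
            with server_side:
                return connection_mss(client)


print(loopback_mss())
```

Unlike sniffing SYN packets with a raw socket, this needs no elevated privileges and only reports the value negotiated for that specific connection.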

May 16 2024, 1:15 PM · Patch-For-Review, Traffic
Vgutierrez closed T365101: MSS clamper check triggers false positives, a subtask of T357257: Use IPIP encapsulation on lvs<-->upload cluster, as Resolved.
May 16 2024, 1:14 PM · Patch-For-Review, Traffic
Vgutierrez renamed T365101: MSS clamper check triggers false positives from MSS clamper clamping check false positives to MSS clamper check triggers false positives.
May 16 2024, 8:44 AM · Patch-For-Review, Traffic
Vgutierrez moved T365101: MSS clamper check triggers false positives from Backlog to Traffic team actively servicing on the Traffic board.
May 16 2024, 8:35 AM · Patch-For-Review, Traffic
Vgutierrez created T365101: MSS clamper check triggers false positives.
May 16 2024, 8:35 AM · Patch-For-Review, Traffic
Vgutierrez updated the task description for T357257: Use IPIP encapsulation on lvs<-->upload cluster.
May 16 2024, 8:20 AM · Patch-For-Review, Traffic

May 13 2024

Vgutierrez added a comment to T364691: Elevated 503 backend fetch failed reported by users.

We had a big spike of 503s on eqiad/drmrs/esams yesterday during the EU morning (https://grafana.wikimedia.org/goto/J4YqQuYIR?orgId=1):

image.png (699×1 px, 151 KB)

I saw that, but the timing doesn't match. What I'm hearing from users is that a constant 10% or so of all pageviews have been like this for days now.

May 13 2024, 1:37 PM · Traffic
Vgutierrez added a comment to T364691: Elevated 503 backend fetch failed reported by users.

We had a big spike of 503s on eqiad/drmrs/esams yesterday during the EU morning (https://grafana.wikimedia.org/goto/J4YqQuYIR?orgId=1):

image.png (699×1 px, 151 KB)

May 13 2024, 1:20 PM · Traffic
Vgutierrez updated the task description for T357257: Use IPIP encapsulation on lvs<-->upload cluster.
May 13 2024, 12:43 PM · Patch-For-Review, Traffic
Vgutierrez added a comment to T364589: acme-chief: add support for serving individual files over the puppet file system api.

file_metadata is already there and supports individual files. The only limitation is that it currently expects the parameters links=manage&source_permissions=ignore, which correspond to the file resource attributes of the same names (https://www.puppet.com/docs/puppet/7/types/file.html#file-attribute-source_permissions and https://www.puppet.com/docs/puppet/7/types/file.html#file-attribute-links).

May 13 2024, 9:46 AM · Acme-chief

May 10 2024

Vgutierrez changed the status of T357257: Use IPIP encapsulation on lvs<-->upload cluster from Open to In Progress.
May 10 2024, 10:04 AM · Patch-For-Review, Traffic
Vgutierrez changed the status of T357257: Use IPIP encapsulation on lvs<-->upload cluster, a subtask of T332027: Replace current L4LB with with Katran-based alternative, from Open to In Progress.
May 10 2024, 10:04 AM · Traffic
Vgutierrez added a comment to T364589: acme-chief: add support for serving individual files over the puppet file system api.

Your request is missing some required parameters for the file_metadata endpoint:

$ curl -H "Accept: application/json" --cert /var/lib/puppet/ssl/certs/mx-out1001.wikimedia.org.pem --key /var/lib/puppet/ssl/private_keys/mx-out1001.wikimedia.org.pem --cacert /var/lib/puppet/ssl/certs/ca.pem "https://acmechief2002.codfw.wmnet:8140/puppet/v3/file_metadata/acmedata/mx-out/live/ec-prime256v1.crt?links=manage&source_permissions=ignore" |jq
{
  "checksum": {
    "type": "md5",
    "value": "{md5}b86fb140f227639d70ad971db461c82c"
  },
  "destination": null,
  "group": 498,
  "links": "manage",
  "mode": 420,
  "owner": 498,
  "path": "/etc/acmecerts/live/ec-prime256v1.crt",
  "relative_path": null,
  "type": "file"
}
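Two details of the request and response above are easy to miss: the query string must carry both parameters, and "mode" is reported as a decimal integer (420 is octal 0644). A small sketch of both follows; `file_metadata_url` and `mode_to_octal` are hypothetical helpers written for illustration, not part of acme-chief or Puppet.

```python
from urllib.parse import urlencode

# Parameters the endpoint expects, as used in the curl command above
REQUIRED_PARAMS = {"links": "manage", "source_permissions": "ignore"}


def file_metadata_url(base_url, mount_path):
    """Build a file_metadata URL that includes the required query parameters."""
    query = urlencode(REQUIRED_PARAMS)
    return f"{base_url}/puppet/v3/file_metadata/{mount_path}?{query}"


def mode_to_octal(mode):
    """Render the decimal 'mode' field as a conventional octal permission string."""
    return format(mode, "04o")


print(file_metadata_url(
    "https://acmechief2002.codfw.wmnet:8140",
    "acmedata/mx-out/live/ec-prime256v1.crt",
))
print(mode_to_octal(420))
```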
May 10 2024, 6:15 AM · Acme-chief

May 8 2024

Vgutierrez closed T364385: Remove mtail leftovers on ncredir puppetization and instances, a subtask of T362776: replace mtail with benthos on ncredir instances, as Resolved.
May 8 2024, 1:07 PM · Traffic
Vgutierrez closed T364385: Remove mtail leftovers on ncredir puppetization and instances as Resolved.
May 8 2024, 1:07 PM · Traffic
Vgutierrez moved T364400: map the /api/ prefix to /w/rest.php from Backlog to Radar/Not for service by Traffic on the Traffic board.
May 8 2024, 9:40 AM · serviceops, Traffic, MW-Interfaces-Team

May 7 2024

Dzahn awarded T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy a Like token.
May 7 2024, 5:28 PM · Traffic
Vgutierrez created T364385: Remove mtail leftovers on ncredir puppetization and instances.
May 7 2024, 12:57 PM · Traffic
Vgutierrez closed T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy as Resolved.

https://gerrit.wikimedia.org/r/1028818 removed the Prometheus jobs; the alert should go away as soon as Puppet runs on the Prometheus hosts.

May 7 2024, 12:55 PM · Traffic
Vgutierrez claimed T364354: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy.

That's right, this is a leftover from the migration from mtail to benthos on ncredir. We will take care of it ASAP.

May 7 2024, 8:22 AM · Traffic

May 6 2024

Vgutierrez closed T362776: replace mtail with benthos on ncredir instances as Resolved.
May 6 2024, 2:25 PM · Traffic

May 3 2024

Vgutierrez closed T357258: Release tcp-mss-clamper for bullseye as Resolved.
May 3 2024, 3:55 PM · Traffic
Vgutierrez closed T357258: Release tcp-mss-clamper for bullseye, a subtask of T357257: Use IPIP encapsulation on lvs<-->upload cluster, as Resolved.
May 3 2024, 3:55 PM · Patch-For-Review, Traffic

Apr 30 2024

Vgutierrez triaged T362776: replace mtail with benthos on ncredir instances as Medium priority.
Apr 30 2024, 1:16 PM · Traffic

Apr 19 2024

Vgutierrez added a comment to T362776: replace mtail with benthos on ncredir instances.

Running another test, this time with 3x10k requests, it looks like the culprit is the socket_server UDP input, which drops packets:

processor_latency_ns_count{label="syslog_format",path="root.input.processors.0"} 29963
vgutierrez@ncredir2001:/var/log/nginx$ cat /proc/net/udp
   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops            
 1330: 0100007F:04C5 00000000:0000 07 00000000:00000000 00:00000000 00000000 18837        0 64597138 2 0000000000000000 37
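The benthos counter above reports 29963 messages for 30000 requests, and the last column of the /proc/net/udp row (drops) reads 37, which accounts for the difference. As a reading aid, here is a minimal sketch that decodes such a row; `parse_udp_line` is a hypothetical helper, and the address decoding assumes a little-endian host.

```python
import socket


def parse_udp_line(line):
    """Decode one data row of /proc/net/udp into readable fields."""
    fields = line.split()
    addr_hex, port_hex = fields[1].split(":")
    # The IPv4 address is printed as a host-order hex word; on a
    # little-endian machine the bytes must be reversed (assumption).
    address = socket.inet_ntoa(bytes.fromhex(addr_hex)[::-1])
    _tx_hex, rx_hex = fields[4].split(":")
    return {
        "local_address": address,
        "local_port": int(port_hex, 16),
        "rx_queue": int(rx_hex, 16),
        "drops": int(fields[-1]),  # last column: datagrams dropped on this socket
    }


row = (
    " 1330: 0100007F:04C5 00000000:0000 07 00000000:00000000 "
    "00:00000000 00000000 18837        0 64597138 2 0000000000000000 37"
)
print(parse_udp_line(row))
```

For the row pasted above this gives a socket bound to 127.0.0.1 port 1221 with an empty receive queue and 37 drops.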
Apr 19 2024, 2:31 PM · Traffic
Vgutierrez updated subscribers of T362776: replace mtail with benthos on ncredir instances.

Testing benthos on ncredir2001 shows some concerning results (TL;DR it looks like benthos drops some messages and metrics aren't as accurate as expected).

Apr 19 2024, 2:23 PM · Traffic

Apr 17 2024

Vgutierrez created T362776: replace mtail with benthos on ncredir instances.
Apr 17 2024, 1:26 PM · Traffic

Apr 15 2024

Vgutierrez closed T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 as Resolved.
Apr 15 2024, 1:41 PM · Upstream, Traffic
Vgutierrez closed T362063: Upgrade to HAProxy 2.6.17 as Resolved.
Apr 15 2024, 1:41 PM · Traffic
Vgutierrez closed T362063: Upgrade to HAProxy 2.6.17, a subtask of T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066, as Resolved.
Apr 15 2024, 1:40 PM · Upstream, Traffic
Vgutierrez updated the task description for T362063: Upgrade to HAProxy 2.6.17.
Apr 15 2024, 1:38 PM · Traffic
Vgutierrez updated the task description for T362063: Upgrade to HAProxy 2.6.17.
Apr 15 2024, 1:13 PM · Traffic

Apr 11 2024

Vgutierrez updated the task description for T362063: Upgrade to HAProxy 2.6.17.
Apr 11 2024, 3:58 PM · Traffic
Vgutierrez updated the task description for T362063: Upgrade to HAProxy 2.6.17.
Apr 11 2024, 2:55 PM · Traffic
Vgutierrez updated the task description for T362063: Upgrade to HAProxy 2.6.17.
Apr 11 2024, 1:49 PM · Traffic

Apr 9 2024

Vgutierrez updated the task description for T362063: Upgrade to HAProxy 2.6.17.
Apr 9 2024, 3:34 PM · Traffic

Apr 8 2024

Vgutierrez updated the task description for T362063: Upgrade to HAProxy 2.6.17.
Apr 8 2024, 2:22 PM · Traffic
Vgutierrez triaged T362063: Upgrade to HAProxy 2.6.17 as Medium priority.
Apr 8 2024, 1:20 PM · Traffic
Vgutierrez created T362063: Upgrade to HAProxy 2.6.17.
Apr 8 2024, 12:24 PM · Traffic

Feb 12 2024

Vgutierrez triaged T357257: Use IPIP encapsulation on lvs<-->upload cluster as Medium priority.
Feb 12 2024, 8:15 AM · Patch-For-Review, Traffic
Vgutierrez moved T357258: Release tcp-mss-clamper for bullseye from Backlog to Traffic team actively servicing on the Traffic board.
Feb 12 2024, 7:26 AM · Traffic
Vgutierrez triaged T357258: Release tcp-mss-clamper for bullseye as Medium priority.
Feb 12 2024, 7:25 AM · Traffic
Vgutierrez created T357258: Release tcp-mss-clamper for bullseye.
Feb 12 2024, 7:25 AM · Traffic
Vgutierrez created T357257: Use IPIP encapsulation on lvs<-->upload cluster.
Feb 12 2024, 7:24 AM · Patch-For-Review, Traffic

Feb 7 2024

Vgutierrez added a comment to T356025: A poor internet connection should not result in a HTTP 503 error.

Sadly, Varnish is not able to tell apart a client going away earlier than expected (due to poor Internet access), which triggers a backend fetch error, from an actual backend fetch error where the client connection is healthy but Varnish is unable to reach the backend server.

Feb 7 2024, 7:05 AM · SRE, Traffic

Feb 6 2024

Vgutierrez added a comment to T356792: Problem downloading large files from analytics.wikimedia.org.

https://github.com/wikimedia/operations-puppet/blob/1a6c9d13ee7a499ee7a28e47449774a6a6dcdccc/modules/envoyproxy/manifests/tls_terminator.pp#L147 could be the culprit.

Feb 6 2024, 6:23 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Vgutierrez moved T356792: Problem downloading large files from analytics.wikimedia.org from Backlog to Radar/Not for service by Traffic on the Traffic board.

@BTullis it's origin related:

Feb 6 2024, 6:02 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)
Vgutierrez added a comment to T356792: Problem downloading large files from analytics.wikimedia.org.

I can reproduce via text@drmrs, I'll take a look ASAP :)

Feb 6 2024, 5:57 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11)

Feb 4 2024

Vgutierrez triaged T356598: Track Linux key retention service as Medium priority.
Feb 4 2024, 2:43 PM · Traffic
Vgutierrez created T356598: Track Linux key retention service.
Feb 4 2024, 2:42 PM · Traffic

Jan 31 2024

Vgutierrez added a comment to T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066.

Fix already released on HAProxy 2.9: https://www.mail-archive.com/haproxy@formilux.org/msg44547.html

Jan 31 2024, 5:12 PM · Upstream, Traffic

Jan 30 2024

Vgutierrez added a comment to T355905: Restarting fifo-log-demux should not restart nginx.

IIRC that was done to smooth the reimage process and first puppet run on various roles using fifo-log-demux.

Jan 30 2024, 8:52 AM · Traffic

Jan 23 2024

Vgutierrez renamed T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066 from HAProxy 2.6.16 CPU spikes on cp3066 to HAProxy 2.6.16/2.8.5 CPU spikes on cp3066.
Jan 23 2024, 3:40 PM · Upstream, Traffic

Jan 22 2024

Vgutierrez added a comment to T354424: HAProxy 2.6.16/2.8.5 CPU spikes on cp3066.

As suggested by Willy Tarreau on https://github.com/haproxy/haproxy/issues/2403#issuecomment-1900111538, this issue could be easier to debug on HAProxy 2.8.

Jan 22 2024, 10:48 AM · Upstream, Traffic