Usually services use wrr (weighted round robin) to balance traffic across nodes
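For illustration, a naive Python sketch of how wrr schedules requests (node names and weights here are made up, and this is not the actual load balancer implementation):

import itertools

# Each node appears in the schedule as many times as its weight,
# so higher-weighted nodes receive proportionally more requests.
nodes = {"cp7001": 3, "cp7002": 2, "cp7003": 1}
schedule = itertools.cycle([n for n, w in nodes.items() for _ in range(w)])

for _ in range(6):
    print(next(schedule))  # cp7001 x3, then cp7002 x2, then cp7003 x1 per cycle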
Thu, Jun 27
Wed, Jun 26
as discussed in the meeting, you can rely on the X-Client-IP header being present to tell CDN requests and internal requests apart.
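A minimal Python sketch of that check (the function name and headers dict are hypothetical, purely for illustration):

def is_cdn_request(headers: dict) -> bool:
    # Assumption per the comment above: the CDN layer sets X-Client-IP,
    # while internal requests reach the service directly and lack it.
    return "X-Client-IP" in headers

print(is_cdn_request({"X-Client-IP": "203.0.113.7"}))  # True: came via the CDN
print(is_cdn_request({"Host": "en.wikipedia.org"}))    # False: internal request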
Tue, Jun 25
Mon, Jun 24
cache_haproxy.mtail failed to accept -1 as an HTTP status code, under-reporting CD and CR termination states.
this is caused by a bug in the mtail regex used to parse haproxy logs; on haproxy 2.6.17 the http status gets reported as -1:
2024-06-24T13:46:52.422444+00:00 cp7001 haproxy[2780368]: 180684 -1 0 0 -1 {es.wikipedia.org} {} CD
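To illustrate the failure mode in Python (the patterns below are simplified stand-ins, not the actual cache_haproxy.mtail program):

import re

line = "180684 -1 0 0 -1 {es.wikipedia.org} {} CD"

# A status field matched with \d+ can never match -1, so such lines are skipped.
broken = re.compile(r"^\d+ (\d+) ")
# Allowing an optional leading minus sign accepts -1 as well.
fixed = re.compile(r"^\d+ (-?\d+) ")

print(broken.match(line))          # None: the request is not counted
print(fixed.match(line).group(1))  # '-1'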
Thu, Jun 20
In T367756#9907435, @Fabfur wrote: After upgrading HAProxy to 2.8.10 on the whole of ulsfo we still see some errors in the kafka DLQ like:
this needs to be reported upstream
Wed, Jun 19
@Aklapper we briefly discussed this yesterday during the Traffic weekly meeting; you can proceed and enable preconnect
Tue, Jun 18
we already leverage preconnect in some cases, although not as an HTTP header but via the HTML <link> tag:
$ curl -v -s https://en.wikipedia.org/wiki/Main_Page 2>&1 | grep -i preconnect
<link rel="preconnect" href="//upload.wikimedia.org">
Mon, Jun 17
Don't be too aggressive with this one; we may need to roll back at some point. Let's wait a few weeks at the very least
Wed, Jun 12
Tue, Jun 11
ferm-based MSS clamping is live on the ncredir cluster
Mon, Jun 10
Thu, Jun 6
Jun 5 2024
Jun 3 2024
May 23 2024
tcp-mss-clamper is already being used to perform MSS clamping on the ncredir and CDN upload clusters
May 22 2024
May 21 2024
acme-chief 0.37 deployed, shipping https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/8
The scope of the task should be rejecting invalid HTTP requests (and not only HTTP/1.0 ones) on HAProxy rather than varnish, as soon as we have analytics moved to HAProxy
May 20 2024
May 16 2024
removed raw sockets usage. We are now fetching MSS data via getsockopt()
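For reference, a minimal Python sketch of fetching the MSS via getsockopt() on Linux (this only illustrates the socket option, it is not the actual tcp-mss-clamper code):

import socket

# TCP_MAXSEG returns the maximum segment size of a connected TCP socket,
# so no raw-socket packet sniffing is needed to observe the MSS.
sock = socket.create_connection(("en.wikipedia.org", 443))
mss = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
print(f"negotiated MSS: {mss} bytes")
sock.close()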
May 13 2024
In T364691#9790288, @Ladsgroup wrote: In T364691#9790207, @Vgutierrez wrote: we had a big spike of 503s on eqiad/drmrs/esams yesterday during EU morning: https://grafana.wikimedia.org/goto/J4YqQuYIR?orgId=1
I saw that, but the timing doesn't match. What I'm getting from users is a constant ~10% of all pageviews being like this for days now.
we had a big spike of 503s on eqiad/drmrs/esams yesterday during EU morning: https://grafana.wikimedia.org/goto/J4YqQuYIR?orgId=1
file_metadata is already there and supports individual files. The only limitation is that it currently expects the parameters links=manage&source_permissions=ignore, which correlate with the file resource attributes of the same names (https://www.puppet.com/docs/puppet/7/types/file.html#file-attribute-source_permissions and https://www.puppet.com/docs/puppet/7/types/file.html#file-attribute-links).
May 10 2024
your request is missing some required parameters for the file_metadata endpoint:
$ curl -H "Accept: application/json" \
    --cert /var/lib/puppet/ssl/certs/mx-out1001.wikimedia.org.pem \
    --key /var/lib/puppet/ssl/private_keys/mx-out1001.wikimedia.org.pem \
    --cacert /var/lib/puppet/ssl/certs/ca.pem \
    "https://acmechief2002.codfw.wmnet:8140/puppet/v3/file_metadata/acmedata/mx-out/live/ec-prime256v1.crt?links=manage&source_permissions=ignore" | jq
{
  "checksum": {
    "type": "md5",
    "value": "{md5}b86fb140f227639d70ad971db461c82c"
  },
  "destination": null,
  "group": 498,
  "links": "manage",
  "mode": 420,
  "owner": 498,
  "path": "/etc/acmecerts/live/ec-prime256v1.crt",
  "relative_path": null,
  "type": "file"
}
May 8 2024
May 7 2024
https://gerrit.wikimedia.org/r/1028818 removed the prometheus jobs; the alert should go away as soon as puppet runs on the prometheus hosts.
that's right, this is a leftover from the migration from mtail to benthos on ncredir. We will take care of it ASAP.
May 6 2024
May 3 2024
Apr 30 2024
Apr 19 2024
running another test, this time with 3x10k requests, it looks like the culprit is the socket_server UDP input, which drops packets:
processor_latency_ns_count{label="syslog_format",path="root.input.processors.0"} 29963

vgutierrez@ncredir2001:/var/log/nginx$ cat /proc/net/udp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
1330: 0100007F:04C5 00000000:0000 07 00000000:00000000 00:00000000 00000000 18837 0 64597138 2 0000000000000000 37
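A small Python sketch of how those drop counters can be read programmatically (assuming the standard /proc/net/udp layout shown above):

# The last column of each /proc/net/udp row is the per-socket drop counter;
# the local port is hex-encoded after the colon in local_address.
with open("/proc/net/udp") as f:
    next(f)  # skip the header row
    for row in f:
        fields = row.split()
        local_address, drops = fields[1], fields[-1]
        if drops != "0":
            port = int(local_address.split(":")[1], 16)
            print(f"udp port {port}: {drops} dropped datagrams")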
Testing benthos on ncredir2001 shows some concerning results (TL;DR it looks like benthos drops some messages and metrics aren't as accurate as expected).