Vgutierrez (Valentín Gutiérrez)
Senior Site Reliability Engineer, Traffic Team

Projects (6)

User Details

User Since
Feb 12 2018, 9:51 AM (293 w, 4 d)
Availability
Available
IRC Nick
vgutierrez
LDAP User
Vgutierrez
MediaWiki User
VGutiérrez (WMF)

Recent Activity

Yesterday

Vgutierrez created P52721 (An Untitled Masterwork).
Thu, Sep 28, 2:07 PM

Wed, Sep 20

Vgutierrez renamed T346874: Allow purged to specify buffer length from Allow purged to specify buffer lenght to Allow purged to specify buffer length.
Wed, Sep 20, 9:09 AM · SRE, Traffic
Vgutierrez removed a project from T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper: Patch-For-Review.
Wed, Sep 20, 8:31 AM · SRE-Access-Requests, SRE
Vgutierrez closed T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper as Resolved.

The patch has been merged; it should take effect in ~30 minutes, once puppet runs. @acooper should have received an email to change the password of his kerberos principal.

Wed, Sep 20, 8:30 AM · SRE-Access-Requests, SRE

Tue, Sep 19

Vgutierrez moved T346640: Traffic cache daemon restart scripts need some rework from Backlog to Ready for work on the Traffic board.
Tue, Sep 19, 8:12 AM · SRE, Traffic

Mon, Sep 18

Vgutierrez added a comment to T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper.

Thanks! Still blocked on @thcipriani for deployment group membership.

Mon, Sep 18, 12:00 PM · SRE-Access-Requests, SRE
Vgutierrez closed T346602: VarnishKafka logrotate fails on bookworm, a subtask of T342154: Upgrade Traffic hosts to bookworm, as Resolved.
Mon, Sep 18, 10:47 AM · SRE, Patch-For-Review, Traffic
Vgutierrez closed T346602: VarnishKafka logrotate fails on bookworm as Resolved.
Mon, Sep 18, 10:47 AM · SRE, Traffic
Vgutierrez added a subtask for T342154: Upgrade Traffic hosts to bookworm: T346602: VarnishKafka logrotate fails on bookworm.
Mon, Sep 18, 8:47 AM · SRE, Patch-For-Review, Traffic
Vgutierrez added a parent task for T346602: VarnishKafka logrotate fails on bookworm: T342154: Upgrade Traffic hosts to bookworm.
Mon, Sep 18, 8:47 AM · SRE, Traffic

Thu, Sep 14

Vgutierrez reopened T345853: Fail event on /dev/md/0:kubernetes2028 as "Open".

not sure why I've been pinged in this task but anyways, the new disk needs to be added to the RAID, as it's still degraded:

/dev/md/0:
           Version : 1.2
     Creation Time : Fri Sep  1 18:30:45 2023
        Raid Level : raid1
        Array Size : 937267200 (893.85 GiB 959.76 GB)
     Used Dev Size : 937267200 (893.85 GiB 959.76 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent
Thu, Sep 14, 2:38 PM · serviceops, ops-codfw, SRE

Wed, Sep 13

Vgutierrez added a comment to T345370: Cookbook should ask for confirmation at beginning of execution.

@Vgutierrez Is this something that should be addressed in the cookbook?

Your idea of automatically including it in cookbooks with depooling is an interesting one.

Wed, Sep 13, 8:40 AM · Infrastructure-Foundations, SRE-tools, Spicerack

Tue, Sep 12

Vgutierrez added a comment to T341780: Direct 5% of all traffic to mw-on-k8s.

Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everything fails with 502 or 503. Prominent example: https://admin.toolforge.org.

Tue, Sep 12, 9:20 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Fri, Sep 8

Vgutierrez added a comment to T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper.

Just to be clear, the RSA key is totally valid at this point; I just wanted to save @acooper more "pain" further down the line. The task is currently waiting for @thcipriani and @odimitrijevic / @Milimetric approvals :)

Fri, Sep 8, 1:53 PM · SRE-Access-Requests, SRE
Vgutierrez moved T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, Sep 8, 1:32 PM · SRE-Access-Requests, SRE
Vgutierrez added a comment to T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper.

I followed these instructions already which requested rsa type (maybe worth updating the instructions if ed25519 is preferred now?)
https://wikitech.wikimedia.org/wiki/Yubikey-SSH

My concern now is that I'm not 100% sure how to deal with the existing key (delete it?) and what the alternative steps would be. Do you know if it's documented anywhere for ed25519?

Fri, Sep 8, 1:32 PM · SRE-Access-Requests, SRE
Vgutierrez added a comment to T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper.

Thanks! @acooper, RSA keys are already being deprecated in some parts of our infrastructure (T336769), so I'm wondering if you could provide an ed25519 key rather than an rsa-4096 one. This should be totally feasible with a yubikey 5 (I'm guessing you're using one due to the cardno comment from your SSH key)

Fri, Sep 8, 1:15 PM · SRE-Access-Requests, SRE
Vgutierrez updated the task description for T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper.
Fri, Sep 8, 1:11 PM · SRE-Access-Requests, SRE
Vgutierrez added a comment to T339134: Package and deploy ATS 9.2.1.

A quick check with cumin shows several servers impacted:

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'journalctl -u fifo-log-demux@notpurge.service --since=-1h | grep Error'
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit: 96 
===== NODE GROUP =====                                                                                                                                                            
(1) cp5028.eqsin.wmnet                                                                                                                                                            
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----                                                                                                                       
Sep 08 09:18:14 cp5028 fifo-log-demux[1607]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer                  
===== NODE GROUP =====                                                                                                                                                            
(1) cp2033.codfw.wmnet                                                                                                                                                            
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----                                                                                                                       
Sep 08 09:27:36 cp2033 fifo-log-demux[1320]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer                  
===== NODE GROUP =====                                                                                                                                                            
(1) cp4042.ulsfo.wmnet                                                                                                                                                            
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----                                                                                                                       
Sep 08 09:03:56 cp4042 fifo-log-demux[1831]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer                  
================                                                                                                                                                                  
PASS |████                                                                                                                               |   3% (3/96) [00:04<02:14,  1.45s/hosts]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉    |  97% (93/96) [00:04<00:00, 21.44hosts/s]
96.9% (93/96) of nodes failed to execute command 'journalctl -u fi...-1h | grep Error': cp[2027-2032,2034-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5027,5029-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[4037-4041,4043-4052].ulsfo.wmnet
3.1% (3/96) success ratio (< 100.0% threshold) for command: 'journalctl -u fi...-1h | grep Error'. Aborting.: cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet
3.1% (3/96) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.: cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet
Fri, Sep 8, 9:52 AM · Traffic
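
A note on reading the cumin summary above: grep exits with status 1 when it selects no lines, and the summary counts every non-zero exit as a failed node. The 93 "failed" hosts are therefore most likely the ones with no matching journal line in the last hour, while the 3 "successful" hosts are the ones that actually logged the error. The sketch below illustrates that mapping; the function name and the sample exit codes are hypothetical and are not part of cumin.

```python
from typing import Dict, List


def classify_grep_results(exit_codes: Dict[str, int]) -> Dict[str, List[str]]:
    """Group hosts by the exit status of a trailing `grep` in a pipeline.

    grep convention: 0 = at least one line matched, 1 = no line matched,
    anything greater than 1 = a real error (bad pattern, unreadable input).
    """
    groups: Dict[str, List[str]] = {"matched": [], "no_match": [], "error": []}
    for host, code in sorted(exit_codes.items()):
        if code == 0:
            groups["matched"].append(host)
        elif code == 1:
            groups["no_match"].append(host)
        else:
            groups["error"].append(host)
    return groups


# Hypothetical sample: two hosts from the output above plus one quiet host.
sample = {"cp5028.eqsin.wmnet": 0, "cp2033.codfw.wmnet": 0, "cp2027.codfw.wmnet": 1}
result = classify_grep_results(sample)
```

Here `result["matched"]` holds the hosts that need attention, which is the inverse of what the PASS/FAIL bars suggest at first glance.
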
Vgutierrez added a comment to T339134: Package and deploy ATS 9.2.1.

That's not an issue with ATS 9.2.1; the problem was fixed by restarting fifo-log-demux@notpurge.service:

vgutierrez@cp4052:~$ journalctl -u fifo-log-demux@notpurge.service -f
-- Journal begins at Sat 2023-08-05 22:39:02 UTC. --
Sep 03 20:57:26 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 06 15:47:35 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 07 07:52:46 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 07 08:17:20 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 08 09:42:55 cp4052 systemd[1]: Stopping FIFO log demultiplexer (instance notpurge)...
Sep 08 09:42:55 cp4052 systemd[1]: fifo-log-demux@notpurge.service: Succeeded.
Sep 08 09:42:55 cp4052 systemd[1]: Stopped FIFO log demultiplexer (instance notpurge).
Sep 08 09:42:55 cp4052 systemd[1]: fifo-log-demux@notpurge.service: Consumed 9min 29.690s CPU time.
Sep 08 09:42:55 cp4052 systemd[1]: Started FIFO log demultiplexer (instance notpurge).
Sep 08 09:42:55 cp4052 fifo-log-demux[902329]: Waiting for connections on /run/trafficserver/notpurge.sock
Fri, Sep 8, 9:46 AM · Traffic
Vgutierrez added a comment to T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper.

I almost forgot, for analytics-privatedata-users I'm assuming @acooper needs a kerberos principal as well, details available on https://wikitech.wikimedia.org/wiki/Analytics/Data_access

Fri, Sep 8, 9:02 AM · SRE-Access-Requests, SRE
Vgutierrez moved T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper from Untriaged to Awaiting User Input on the SRE-Access-Requests board.

we are also pending on @acooper submitting their public SSH key

Fri, Sep 8, 7:34 AM · SRE-Access-Requests, SRE
Vgutierrez updated the task description for T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper.
Fri, Sep 8, 7:32 AM · SRE-Access-Requests, SRE
Vgutierrez changed the status of T345877: Requesting shell access, deployment and analytics-privatedata-users rights for acooper from Open to Stalled.

deployment membership requires the approval of @thcipriani and analytics-privatedata-users of @odimitrijevic / @Milimetric

Fri, Sep 8, 7:31 AM · SRE-Access-Requests, SRE

Thu, Sep 7

Vgutierrez closed T345633: Requesting access to analytics-privatedata-users and ops for brouberol as Resolved.
vgutierrez@mwmaint1002:~$ sudo -i ldapsearch -x cn=ops |grep bro
member: uid=brouberol,ou=people,dc=wikimedia,dc=org
vgutierrez@mwmaint1002:~$ sudo -i ldapsearch -x cn=wmf |grep bro
member: uid=brouberol,ou=people,dc=wikimedia,dc=org
Thu, Sep 7, 2:56 PM · SRE, SRE-Access-Requests
Vgutierrez changed the status of T345633: Requesting access to analytics-privatedata-users and ops for brouberol from Stalled to In Progress.
Thu, Sep 7, 2:38 PM · SRE, SRE-Access-Requests

Wed, Sep 6

Vgutierrez added a comment to T345370: Cookbook should ask for confirmation at beginning of execution.

Having this in place would already have prevented an ncredir-related page. I'm happy to have this opt-in per cookbook (personally I'd enable it on cookbooks that depool hosts automatically).

Wed, Sep 6, 2:56 PM · Infrastructure-Foundations, SRE-tools, Spicerack
Vgutierrez added a comment to T345633: Requesting access to analytics-privatedata-users and ops for brouberol.

Cheers, I've amended the patch to include the ops membership (already approved by @joanna_borun). CR still blocked till we get @odimitrijevic, @Milimetric and @Gehel approvals

Wed, Sep 6, 10:39 AM · SRE, SRE-Access-Requests
Vgutierrez changed the status of T345633: Requesting access to analytics-privatedata-users and ops for brouberol from In Progress to Stalled.

analytics_privatedata_users membership requires approval from @odimitrijevic or @Milimetric per https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/data/data.yaml#409

Wed, Sep 6, 9:22 AM · SRE, SRE-Access-Requests
Vgutierrez updated the task description for T345633: Requesting access to analytics-privatedata-users and ops for brouberol.
Wed, Sep 6, 9:08 AM · SRE, SRE-Access-Requests
Vgutierrez changed the status of T345633: Requesting access to analytics-privatedata-users and ops for brouberol from Stalled to In Progress.
vgutierrez@mwmaint1002:~$ sudo -i cross-validate-accounts --username brouberol --uid 45143 --email brouberol@wikimedia.org --real-name "Balthazar Rouberol" --ssh-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKunax7NU1Zx304QaTggTnIjXuY8rxgKTwReUMIffoIR brouberol@wikimedia.org" --kerberos
Wed, Sep 6, 9:08 AM · SRE, SRE-Access-Requests

Tue, Sep 5

Vgutierrez claimed T345633: Requesting access to analytics-privatedata-users and ops for brouberol.
Tue, Sep 5, 4:38 PM · SRE, SRE-Access-Requests
Vgutierrez updated the task description for T345633: Requesting access to analytics-privatedata-users and ops for brouberol.
Tue, Sep 5, 4:38 PM · SRE, SRE-Access-Requests
Vgutierrez changed the status of T345633: Requesting access to analytics-privatedata-users and ops for brouberol from Open to Stalled.

the provided ssh key is already used in WMCS, please provide a new one:

vgutierrez@mwmaint1002:~$ sudo -i cross-validate-accounts --username brouberol --uid 45143 --email brouberol@wikimedia.org --real-name "Balthazar Rouberol" --ssh-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPd+Ekept47K0yIJ91ByVo4q6TAbgVzzxIqfq6k1X0L8 brouberol@wikimedia.org" --kerberos
[...]
brouberol uses the same SSH key(s) in WMCS and production:
  {'AAAAC3NzaC1lZDI1NTE5AAAAIPd+Ekept47K0yIJ91ByVo4q6TAbgVzzxIqfq6k1X0L8'}
Tue, Sep 5, 4:38 PM · SRE, SRE-Access-Requests
Vgutierrez updated subscribers of T345542: DegradedArray event on /dev/md/0:wdqs2024.

I don't think so as it's still using role insetup::search_platform but @bking and @RKemper should have more context about it

Tue, Sep 5, 2:59 PM · SRE, ops-codfw, Data-Platform-SRE
Vgutierrez closed T345455: Requesting access to analytics-admins for cjming as Resolved.

change should be effective in ~30 minutes after puppet runs on the impacted hosts.

Tue, Sep 5, 2:05 PM · SRE, SRE-Access-Requests
Vgutierrez updated the task description for T345455: Requesting access to analytics-admins for cjming.
Tue, Sep 5, 1:28 PM · SRE, SRE-Access-Requests
Vgutierrez changed the status of T345455: Requesting access to analytics-admins for cjming from Stalled to In Progress.

key validated via Slack

Tue, Sep 5, 1:28 PM · SRE, SRE-Access-Requests
Vgutierrez added a comment to T345334: Cache thumbs in our caching infrastructure (e.g. ATS).

Cache revalidation can further extend this period. After the initial 24-hour limit has passed, ATS will issue a conditional request to the backend. If the backend supports it, a 304 response should be returned, eliminating the need to resend the object.

Tue, Sep 5, 1:12 PM · SRE, Thumbor, SRE-swift-storage, Traffic
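
The revalidation flow described in the comment above can be sketched as follows. This is a minimal illustration and not ATS code: the `CachedObject` class, the `fetch` function, and the `backend` callable are all hypothetical, and ETag-based validation via `If-None-Match` is assumed as the conditional mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# (status code, response headers, body): a hypothetical backend response shape.
Response = Tuple[int, Dict[str, str], bytes]


@dataclass
class CachedObject:
    body: bytes
    etag: str
    stored_at: float
    ttl_s: float

    def is_fresh(self, now: float) -> bool:
        return now - self.stored_at < self.ttl_s


def fetch(
    entry: Optional[CachedObject],
    backend: Callable[[Dict[str, str]], Response],
    now: float,
    ttl_s: float = 24 * 60 * 60,  # the 24-hour limit from the comment
) -> CachedObject:
    """Serve from cache while fresh; otherwise revalidate with the backend."""
    if entry is not None and entry.is_fresh(now):
        return entry  # still within the TTL, no backend request at all
    headers: Dict[str, str] = {}
    if entry is not None and entry.etag:
        headers["If-None-Match"] = entry.etag  # conditional request
    status, resp_headers, body = backend(headers)
    if status == 304 and entry is not None:
        # Not modified: keep the stored body and start a new freshness period.
        return CachedObject(entry.body, entry.etag, now, ttl_s)
    # Full response: store the new body and validator.
    return CachedObject(body, resp_headers.get("ETag", ""), now, ttl_s)
```

A 304 carries no body, so the object stays cached past the initial limit at the cost of one small round trip per TTL period.
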
Vgutierrez closed T268369: how to deal with cumin alias alerts as Declined.

yeah.. clearly I didn't phrase that properly, I was saying it from the PoV of Clinic Duty.

Tue, Sep 5, 12:55 PM · Cumin, Infrastructure-Foundations, observability, SRE
Vgutierrez added a comment to T268369: how to deal with cumin alias alerts.

@Volans any idea how we could reduce the "false positives" of this alert? We got 7 occurrences in the last 30 days that apparently weren't actionable.

Tue, Sep 5, 10:13 AM · Cumin, Infrastructure-Foundations, observability, SRE
Vgutierrez closed T345132: ppenloglou sharing wmcs and production ssh key as Resolved.

your new key should be deployed in the next ~30 minutes. Please do not upload it to gitlab/wikitech to prevent this from happening again.

Tue, Sep 5, 10:06 AM · SRE, SRE-Access-Requests
Vgutierrez updated the task description for T345455: Requesting access to analytics-admins for cjming.
Tue, Sep 5, 7:13 AM · SRE, SRE-Access-Requests
Vgutierrez changed the status of T345132: ppenloglou sharing wmcs and production ssh key from Open to Stalled.
Tue, Sep 5, 7:07 AM · SRE, SRE-Access-Requests

Mon, Sep 4

Vgutierrez added a comment to T345132: ppenloglou sharing wmcs and production ssh key.

the key needs to be uploaded to the puppet repo, you could use this CR as an example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/949839 or I could craft a new one for you.

Mon, Sep 4, 2:20 PM · SRE, SRE-Access-Requests
Vgutierrez changed the status of T345455: Requesting access to analytics-admins for cjming from Open to Stalled.

Waiting for OOB validation

Mon, Sep 4, 10:20 AM · SRE, SRE-Access-Requests
Vgutierrez added a comment to T345132: ppenloglou sharing wmcs and production ssh key.

@ppenloglou that's right. as stated in https://wikitech.wikimedia.org/wiki/People.wikimedia.org people.wm.o is part of the production environment and the SSH key can't be shared with other environments.

Mon, Sep 4, 10:13 AM · SRE, SRE-Access-Requests
Vgutierrez triaged T345132: ppenloglou sharing wmcs and production ssh key as Medium priority.
Mon, Sep 4, 10:06 AM · SRE, SRE-Access-Requests
Vgutierrez added a comment to T345132: ppenloglou sharing wmcs and production ssh key.

@ppenloglou please let us know if you need help submitting a new SSH key for the production environment. Otherwise we will close this task

Mon, Sep 4, 10:05 AM · SRE, SRE-Access-Requests
Vgutierrez created T345542: DegradedArray event on /dev/md/0:wdqs2024.
Mon, Sep 4, 9:30 AM · SRE, ops-codfw, Data-Platform-SRE

Thu, Aug 31

Vgutierrez moved T345334: Cache thumbs in our caching infrastructure (e.g. ATS) from Backlog to Radar/Not for service by Traffic on the Traffic board.

Happy to provide assistance and guidance if needed, but caching is technically controlled by the backend services and not by the CDN.
The CDN imposes some limits on what's cacheable and for how long (for example, it will cap the TTL to 24h and flag an object as uncacheable if it's bigger than 1Gb), but cacheability itself is managed by the Cache-Control header set by Thumbor.

Thu, Aug 31, 11:01 AM · SRE, Thumbor, SRE-swift-storage, Traffic
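
The split of responsibilities described above (the backend sets Cache-Control, the CDN only caps it) can be sketched as below. This is an illustration under stated assumptions, not the actual CDN configuration: the function names are hypothetical, "1Gb" is read as 1 GiB, and the directive handling is a simplified subset of HTTP caching rules in which `s-maxage` takes precedence over `max-age` for a shared cache.

```python
from typing import Dict, Optional

MAX_TTL_S = 24 * 60 * 60       # the 24h cap mentioned in the comment
MAX_OBJECT_BYTES = 1024 ** 3   # assumption: "1Gb" read as 1 GiB


def backend_ttl(cache_control: str) -> Optional[int]:
    """Return the TTL requested by the backend, or None if it is not cacheable."""
    directives: Dict[str, str] = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        directives[name.lower()] = value.strip('"')
    if "no-store" in directives or "private" in directives:
        return None
    # A shared cache prefers s-maxage over max-age when both are present.
    for key in ("s-maxage", "max-age"):
        value = directives.get(key, "")
        if value.isdigit():
            return int(value)
    return None


def cdn_ttl(cache_control: str, content_length: int) -> Optional[int]:
    """Apply the CDN-side limits on top of what the backend asked for."""
    if content_length > MAX_OBJECT_BYTES:
        return None  # too large: treated as uncacheable
    ttl = backend_ttl(cache_control)
    if ttl is None or ttl == 0:
        return None
    return min(ttl, MAX_TTL_S)  # never cache longer than 24h
```

For example, a hypothetical thumbnail served with `Cache-Control: max-age=604800` would be cached for 86400 seconds, not a week.
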
Vgutierrez closed T341755: Cannot download large (2GB) files with 10Mbps or slower network due to ATS timeout as Resolved.
vgutierrez@carrot:~$ curl -o /dev/null https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf?vgutierrez=1 --limit-rate 1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1953M  100 1953M    0     0  1024k      0  0:32:33  0:32:33 --:--:-- 1122k
Thu, Aug 31, 9:24 AM · Patch-For-Review, Thumbor, Traffic
Vgutierrez closed T341755: Cannot download large (2GB) files with 10Mbps or slower network due to ATS timeout, a subtask of T339134: Package and deploy ATS 9.2.1, as Resolved.
Thu, Aug 31, 9:23 AM · Traffic

Aug 28 2023

Vgutierrez added a comment to T341755: Cannot download large (2GB) files with 10Mbps or slower network due to ATS timeout.

The current task name is misleading: 100Mbps is definitely enough to download the file without triggering the ATS timeout. Your curl output shows an average speed of 1298 kbytes per second, which is consistent with a 10Mbps network, not a 100Mbps one. I just used --limit-rate 12M (12 megabytes per second, or roughly 100Mbps) to test it:

vgutierrez@carrot:~$ curl -o /dev/null https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf --limit-rate 12M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1953M  100 1953M    0     0  12.0M      0  0:02:42  0:02:42 --:--:-- 12.2M
Aug 28 2023, 10:24 AM · Patch-For-Review, Thumbor, Traffic
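
The unit conversions behind the comment above can be checked with a few lines of arithmetic. The sketch assumes curl's convention that k and M are powers of 1024 bytes and that Mbps means 10^6 bits per second; the helper names are made up for illustration.

```python
def to_mbps(kib_per_s: float) -> float:
    """Convert KiB/s (as curl reports it) to megabits per second."""
    return kib_per_s * 1024 * 8 / 1_000_000


def transfer_seconds(size_mib: float, rate_mib_per_s: float) -> float:
    """Time needed to move size_mib at a steady rate_mib_per_s."""
    return size_mib / rate_mib_per_s


def fmt_hms(seconds: float) -> str:
    """Format a duration as H:MM:SS, truncating fractions like curl does."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


reported_mbps = round(to_mbps(1298), 1)          # speed from the reporter's curl run
limit_12m_mbps = round(to_mbps(12 * 1024), 1)    # --limit-rate 12M
time_at_12m = fmt_hms(transfer_seconds(1953, 12))
time_at_1m = fmt_hms(transfer_seconds(1953, 1))
```

This gives about 10.6 Mbps for the reported 1298 kbytes per second and about 100.7 Mbps for the 12M limit. The 1953M file takes 0:02:42 at 12M, matching the curl output above, and 0:32:33 at a 1M limit.
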

Aug 25 2023

Vgutierrez added a comment to T341755: Cannot download large (2GB) files with 10Mbps or slower network due to ATS timeout.

it seems that the ATS issue could be addressed by https://github.com/apache/trafficserver/pull/8083

Aug 25 2023, 11:10 AM · Patch-For-Review, Thumbor, Traffic
Vgutierrez added a comment to T341755: Cannot download large (2GB) files with 10Mbps or slower network due to ATS timeout.

A quick check on cp3081 shows the following results:

  • HAProxy closes the connection after 245 seconds (327 seconds in a second test)
  • varnish closes the connection after 353 seconds (393 seconds in a second test)
  • ATS closes the connection after 908 seconds
  • swift allows slowly fetching the entire object (completed in 32 minutes using --limit-rate 1M)
Aug 25 2023, 11:02 AM · Patch-For-Review, Thumbor, Traffic
Vgutierrez added a comment to T317616: Revisit CDN<-->Swift communication.

As a side effect of moving to envoy we would be getting https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1 data for swift. As stated in the task description, the current TLS termination layer used by swift is the old TLS termination designed for untrusted clients at the CDN. Migrating to envoy would align the service with the vast majority of backend servers that we run nowadays, benefiting from wider internal support.

Aug 25 2023, 10:10 AM · SRE-swift-storage, SRE, Traffic

Aug 23 2023

Vgutierrez added a comment to T344831: confd seems to be leaking memory in cp hosts.

we are currently using the confd 0.16 from https://gerrit.wikimedia.org/g/operations/debs/confd:

vgutierrez@cp6016:~$ apt policy confd
confd:
  Installed: 0.16.0-1+deb11u0
  Candidate: 0.16.0-1+deb11u0
  Version table:
 *** 0.16.0-1+deb11u0 1001
       1001 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
Aug 23 2023, 3:50 PM · SRE, Traffic
Vgutierrez created T344831: confd seems to be leaking memory in cp hosts.
Aug 23 2023, 3:48 PM · SRE, Traffic

Aug 22 2023

Vgutierrez triaged T344674: ATS automatically restarted due to receiving SIGUSR2 on cp5024 as Medium priority.
Aug 22 2023, 8:45 AM · SRE, Traffic
Vgutierrez renamed T344674: ATS automatically restarted due to receiving SIGUSR2 on cp5024 from ATS automatically restarted due to receiving SIGUSR2 to ATS automatically restarted due to receiving SIGUSR2 on cp5024.
Aug 22 2023, 8:44 AM · SRE, Traffic
Vgutierrez created T344674: ATS automatically restarted due to receiving SIGUSR2 on cp5024.
Aug 22 2023, 8:42 AM · SRE, Traffic

Aug 17 2023

Vgutierrez added a project to T344330: acme-chief should support debian bookworm: Acme-chief.
Aug 17 2023, 2:14 PM · Patch-For-Review, Acme-chief, SRE, Traffic
Vgutierrez committed rOSAC065fe96e2d09: tests: fix CertificateState tests on python 3.10+ (authored by Vgutierrez).
tests: fix CertificateState tests on python 3.10+
Aug 17 2023, 1:57 PM

Aug 16 2023

Vgutierrez added a comment to T342019: Add Python 3.10 to Wikimedia CI.

We need to cover 3.11 as well, as it's the python version shipped with Debian bookworm: https://packages.debian.org/bookworm/python3

Aug 16 2023, 10:55 AM · Continuous-Integration-Infrastructure
Vgutierrez updated subscribers of T344330: acme-chief should support debian bookworm.

@hashar could you clarify if T342346 would trigger having python 3.11 on CI with some kind of backport for bullseye or do you have another task tracking python 3.11 support in the CI environment?

Aug 16 2023, 10:49 AM · Patch-For-Review, Acme-chief, SRE, Traffic
Vgutierrez triaged T344330: acme-chief should support debian bookworm as Medium priority.
Aug 16 2023, 10:35 AM · Patch-For-Review, Acme-chief, SRE, Traffic
Vgutierrez updated the task description for T344330: acme-chief should support debian bookworm.
Aug 16 2023, 10:29 AM · Patch-For-Review, Acme-chief, SRE, Traffic
Vgutierrez created T344330: acme-chief should support debian bookworm.
Aug 16 2023, 10:21 AM · Patch-For-Review, Acme-chief, SRE, Traffic

Aug 9 2023

Vgutierrez added a comment to T253732: Anycast: consistent ICMP packet too big routing.

@Vgutierrez do you know how the future L4LB will handle ICMP PTB packets? Can it route it to the proper source host?

Aug 9 2023, 2:17 PM · Traffic-Icebox, Infrastructure-Foundations, User-jbond, netops, SRE

Aug 3 2023

Vgutierrez moved T343440: mw-on-k8s responds 404 for Wikifunctions view pages from Backlog to Radar/Not for service by Traffic on the Traffic board.
Aug 3 2023, 1:11 PM · MW-on-K8s, serviceops, SRE, Traffic, Abstract Wikipedia team, WikiLambda
Vgutierrez updated subscribers of T343440: mw-on-k8s responds 404 for Wikifunctions view pages.
vgutierrez@carrot:~$ for i in {1..100}; do curl -s -v https://www.wikifunctions.org/view/en/Z10000 -o /dev/null 2>&1 |egrep "200|404";  done | sort |uniq -c
     83 < HTTP/2 200 
     17 < HTTP/2 404
$ curl -s -v -o /dev/null https://www.wikifunctions.org/view/en/Z10000
*   Trying 185.15.58.224:443...
* Connected to www.wikifunctions.org (185.15.58.224) port 443 (#0)
* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [19 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [3191 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [78 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Wikimedia Foundation, Inc.; CN=*.wikipedia.org
*  start date: Oct 27 00:00:00 2022 GMT
*  expire date: Nov 17 23:59:59 2023 GMT
*  subjectAltName: host "www.wikifunctions.org" matched cert's "*.wikifunctions.org"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert TLS Hybrid ECC SHA384 2020 CA1
*  SSL certificate verify ok.
} [5 bytes data]
* using HTTP/2
* h2h3 [:method: GET]
* h2h3 [:path: /view/en/Z10000]
* h2h3 [:scheme: https]
* h2h3 [:authority: www.wikifunctions.org]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x563b226fcc70)
} [5 bytes data]
> GET /view/en/Z10000 HTTP/2
> Host: www.wikifunctions.org
> user-agent: curl/7.88.1
> accept: */*
> 
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
< HTTP/2 404 
< date: Thu, 03 Aug 2023 12:41:55 GMT
< server: mw-web.eqiad.main-57cbd6c888-njqqm
< cache-control: s-maxage=600
< content-type: text/html; charset=utf-8
< vary: Accept-Encoding
< age: 273
< x-cache: cp6014 hit, cp6014 pass
< x-cache-status: hit-local
< server-timing: cache;desc="hit-local", host;desc="cp6014"
< strict-transport-security: max-age=106384710; includeSubDomains; preload
< report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
< nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
< set-cookie: WMF-Last-Access=03-Aug-2023;Path=/;HttpOnly;secure;Expires=Mon, 04 Sep 2023 12:00:00 GMT
< set-cookie: WMF-Last-Access-Global=03-Aug-2023;Path=/;Domain=.wikifunctions.org;HttpOnly;secure;Expires=Mon, 04 Sep 2023 12:00:00 GMT
< x-client-ip: 81.39.92.198
< set-cookie: GeoIP=ES:GA:Boiro:42.65:-8.90:v4; Path=/; secure; Domain=.wikifunctions.org
< set-cookie: NetworkProbeLimit=0.001;Path=/;Secure;Max-Age=3600
< 
{ [1248 bytes data]
* Connection #0 to host www.wikifunctions.org left intact
Aug 3 2023, 12:48 PM · MW-on-K8s, serviceops, SRE, Traffic, Abstract Wikipedia team, WikiLambda

Aug 2 2023

Vgutierrez created P49957 (An Untitled Masterwork).
Aug 2 2023, 11:07 AM
Vgutierrez created P49956 (An Untitled Masterwork).
Aug 2 2023, 10:49 AM
Vgutierrez created P49953 (An Untitled Masterwork).
Aug 2 2023, 8:28 AM
Vgutierrez created P49952 (An Untitled Masterwork).
Aug 2 2023, 8:18 AM
Vgutierrez added a comment to T343000: HAProxy metrics go down on config reload.

It looks like it's a matter of how we graph the data, please see: https://grafana.wikimedia.org/goto/7xCydjqVk?orgId=1

[attached image: image.png]

Aug 2 2023, 7:34 AM · SRE, observability, Traffic

Aug 1 2023

Vgutierrez changed the status of T343000: HAProxy metrics go down on config reload from Stalled to In Progress.

getting rid of KA didn't help a lot per https://grafana.wikimedia.org/goto/JcVQsuqVk?orgId=1:

[attached image: image.png]

Aug 1 2023, 4:42 PM · SRE, observability, Traffic
Vgutierrez created P49894 (An Untitled Masterwork).
Aug 1 2023, 1:40 PM
Vgutierrez created P49893 (An Untitled Masterwork).
Aug 1 2023, 1:39 PM
Vgutierrez created P49892 (An Untitled Masterwork).
Aug 1 2023, 1:38 PM

Jul 31 2023

Vgutierrez closed T341992: Relocate lvs1013-lvs1016 to rows E & F, a subtask of T332027: Replace current L4LB with Katran-based alternative, as Resolved.
Jul 31 2023, 1:08 PM · Traffic
Vgutierrez closed T341992: Relocate lvs1013-lvs1016 to rows E & F as Resolved.
Jul 31 2023, 1:08 PM · ops-eqiad, Traffic
Vgutierrez changed the status of T343000: HAProxy metrics go down on config reload from Open to Stalled.

After disabling KA, haproxy_frontend_connections_total{proxy="stats"} starts to increase as expected:

[attached image: image.png]

Jul 31 2023, 11:10 AM · SRE, observability, Traffic
Vgutierrez added a comment to T343000: HAProxy metrics go down on config reload.

Regarding the HAProxy reload process: HAProxy spawns a new process, started with the new configuration, and hands over all the file descriptors to it.

Jul 31 2023, 7:41 AM · SRE, observability, Traffic

Jul 28 2023

Vgutierrez added a comment to T343000: HAProxy metrics go down on config reload.

I'm wondering if reducing the hard-stop-after window from 5m to something smaller than the Prometheus scrape interval (once a minute) could get rid of this. What are your thoughts, @fgiunchedi?

Jul 28 2023, 3:03 PM · SRE, observability, Traffic
Vgutierrez triaged T343000: HAProxy metrics go down on config reload as Medium priority.
Jul 28 2023, 2:59 PM · SRE, observability, Traffic
Vgutierrez created T343000: HAProxy metrics go down on config reload.
Jul 28 2023, 2:59 PM · SRE, observability, Traffic

Jul 25 2023

Vgutierrez updated the task description for T342618: Perform katran load tests on lvs1013.
Jul 25 2023, 4:47 PM · SRE, Traffic
Vgutierrez updated the task description for T342618: Perform katran load tests on lvs1013.
Jul 25 2023, 4:27 PM · SRE, Traffic
Vgutierrez triaged T342618: Perform katran load tests on lvs1013 as Medium priority.
Jul 25 2023, 10:49 AM · SRE, Traffic
Vgutierrez created T342618: Perform katran load tests on lvs1013.
Jul 25 2023, 10:49 AM · SRE, Traffic
Vgutierrez created P49697 (An Untitled Masterwork).
Jul 25 2023, 8:23 AM

Jul 24 2023

Vgutierrez added a comment to T342566: varnish-frontend-hospital crash upon ATS restart.
0 Backend_health - vcl-84635598-fffa-4367-86af-05856c435a6e.be_cp3064_esams_wmnet Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - vcl-a35c116d-adeb-4b22-9d49-fe43a85ae5c6.be_cp3058_esams_wmnet Back healthy 4---X-RH 3 3 5 0.000395 0.000132 HTTP/1.1 200 OK
Jul 24 2023, 6:47 PM · SRE, Traffic
Vgutierrez triaged T342566: varnish-frontend-hospital crash upon ATS restart as Medium priority.
Jul 24 2023, 6:21 PM · SRE, Traffic
Vgutierrez created T342566: varnish-frontend-hospital crash upon ATS restart.
Jul 24 2023, 6:20 PM · SRE, Traffic
Vgutierrez added a comment to T339134: Package and deploy ATS 9.2.1.

Grafana shows a regression in Lua performance after the update to 9.2.1:

image.png (1×1 px, 47 KB)

Jul 24 2023, 3:19 PM · Traffic
Vgutierrez updated the task description for T339134: Package and deploy ATS 9.2.1.
Jul 24 2023, 2:46 PM · Traffic
Vgutierrez added a comment to T339134: Package and deploy ATS 9.2.1.

After checking https://github.com/apache/trafficserver/blob/9.2.x/CHANGELOG-9.2.0 I've noticed:

#8784 - Propagate proxy.config.net.sock_option_flag_in to newly accepted connections
Jul 24 2023, 2:37 PM · Traffic
Vgutierrez added a comment to T339134: Package and deploy ATS 9.2.1.

The healthcheck response is generated by our default.lua, specifically:

function do_global_read_request()
    if ts.client_request.header['Host'] == 'healthcheck.wikimedia.org' and ts.client_request.get_uri() == '/ats-be' then
        ts.http.intercept(function()
            ts.say('HTTP/1.1 200 OK\r\n' ..
                   'Content-Length: 0\r\n' ..
                   'Cache-Control: no-cache\r\n\r\n')
        end)
Jul 24 2023, 1:47 PM · Traffic
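
As a rough illustration of what a client of the healthcheck in the comment above would send and expect, here is a minimal sketch. The Host header, path, and response headers are taken from the Lua excerpt. The helper names are hypothetical, and the choice of a GET request with `Connection: close` is an assumption, not something the excerpt states.

```python
HEALTHCHECK_HOST = "healthcheck.wikimedia.org"
HEALTHCHECK_PATH = "/ats-be"


def build_request() -> bytes:
    """Build a raw HTTP/1.1 request matching the Host and URI checked by the Lua hook."""
    return (
        f"GET {HEALTHCHECK_PATH} HTTP/1.1\r\n"
        f"Host: {HEALTHCHECK_HOST}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")


def parse_response(raw: bytes) -> tuple[int, dict[str, str]]:
    """Split a raw HTTP response into its status code and lower-cased headers."""
    head, _sep, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status = int(lines[0].split(" ", 2)[1])
    headers: dict[str, str] = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def is_healthy(raw: bytes) -> bool:
    """True when the response looks like the intercepted reply: 200 with an empty body."""
    status, headers = parse_response(raw)
    return status == 200 and headers.get("content-length") == "0"
```

The request bytes could be written to a TCP connection to the backend under test; the address and port are deployment-specific and are deliberately left out.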