User Details
- User Since
- Feb 12 2018, 9:51 AM (407 w, 4 d)
- Availability
- Available
- IRC Nick
- vgutierrez
- LDAP User
- Vgutierrez
- MediaWiki User
- VGutiérrez (WMF)
Yesterday
the assessment is OK and the link can be removed safely
Wed, Dec 3
Tue, Dec 2
Wed, Nov 26
We are now rate-limiting non-thumbnail requests on cache misses when certain X-Is-Browser thresholds are met
Tue, Nov 25
From SREBatchRunnerBase __reboot_action():
Thu, Nov 13
Mon, Nov 10
a quick check using https://en.wikipedia.org/wiki/Main_Page?vgutierrez=tg showed the Telegram bot visiting https://en.m.wikipedia.org/wiki/Main_Page?vgutierrez=tg and retrying after getting a 301, instead of following the redirect
Nov 3 2025
It does, but acme-chief also respects the staging time, so it's probably under /new instead of /live, or still blocked because it's waiting out the staging time until the previous version gets deployed
Oct 21 2025
Downtiming for Prometheus Alertmanager seems broken to me. What we are seeing here looks like this:
- The metric was in an alerting state from 18:45:30 to 18:49:00 per https://grafana.wikimedia.org/goto/BShBVVgDg?orgId=1
- Downtime (silence) was present, so even if the alert condition was true for 3m, notifications were suppressed.
- At 18:49:00 the alert condition cleared and at 18:49:19 the downtime was removed.
- Prometheus's next evaluation (at 18:49:25) saw that the alert had been pending and had satisfied its `for: 3m` hold duration, so it fired the alert. This can be verified with this query: https://grafana.wikimedia.org/goto/uJUkS4Rvg?orgId=1, which shows the alert firing until 18:49:30
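For reference, the `for: 3m` mentioned above is the hold duration declared on the alerting rule itself; the rule in question would have roughly this shape (rule and metric names invented for illustration, this is not the actual WMF rule):

```
groups:
  - name: example
    rules:
      - alert: ExampleAlert
        expr: some_metric > 0
        for: 3m
        labels:
          severity: warning
```

An alert only transitions from pending to firing once `expr` has stayed true for the full `for` window; here that window was satisfied between roughly 18:48:30 and 18:49:00, while the silence was still active.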
This is pretty weird, according to RFC 9562 for UUID v4, the third block should always start with a 4.
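As a quick illustration of the version field (nothing here is specific to the service in question; it's just Python's stdlib uuid module):

```python
import uuid

# Generate a few random (version 4) UUIDs and inspect the third
# dash-separated block of the canonical text form.
samples = [uuid.uuid4() for _ in range(5)]
third_blocks = [str(u).split("-")[2] for u in samples]

# RFC 9562 places the 4-bit version field in the most significant nibble
# of time_hi_and_version, i.e. the first hex digit of the third block,
# so for UUIDv4 it is always "4".
print(third_blocks)
assert all(b[0] == "4" for b in third_blocks)
```

A UUID whose third block starts with anything other than 4 therefore cannot be a compliant UUIDv4.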
Oct 20 2025
wdqs-internal-main still has traffic on port 80 in codfw:
TCP  10.2.1.93:80 mh (mh-port)
  -> 10.192.0.85:80    Tunnel  10  67  73
  -> 10.192.32.155:80  Tunnel  10  46  87
  -> 10.192.32.156:80  Tunnel  10  52  86
Oct 15 2025
Oct 14 2025
we need to double check that HAProxy supports EdDSA for JWT verification purposes
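If it does, the configuration would presumably mirror the existing jwt_verify usage for RSA/ECDSA with the algorithm swapped. A sketch only: the key path and variable name are made up, and this assumes a HAProxy build whose jwt_verify converter accepts EdDSA:

```
# Hypothetical: extract the bearer token and verify it against an
# Ed25519 public key (requires jwt_verify to support the EdDSA alg)
http-request set-var(txn.bearer) http_auth_bearer
http-request deny unless { var(txn.bearer),jwt_verify("EdDSA","/etc/haproxy/ed25519-pub.pem") -m int 1 }
```

This is exactly the kind of thing to verify against the docs and a test instance before relying on it.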
Oct 13 2025
Oct 8 2025
following up on my last comment: the webm file size is 1660261 bytes, so a request asking for a range starting at 1660261 should probably trigger a 416 Range Not Satisfiable response instead of a 503, but it still doesn't look like a valid request to me. If I use a valid range like 1000-1024, the request gets a valid response every time, but inconsistently: the first one gets the full 1660261-byte response back, and the following ones get the 25-byte response requested in the Range header
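The 416 expectation above can be sketched from the range satisfiability rule in RFC 9110: for a representation of N bytes, valid byte offsets are 0..N-1, so a range starting exactly at N is unsatisfiable. A minimal illustration (the file size comes from the comment above; the helper name is made up):

```python
FILE_SIZE = 1660261  # size of the webm file in bytes

def range_satisfiable(first_byte: int, size: int) -> bool:
    """A range whose first-byte-pos lies beyond the last byte (size - 1)
    cannot be satisfied and should yield 416, not 503."""
    return 0 <= first_byte < size

print(range_satisfiable(1660260, FILE_SIZE))  # True: last valid byte
print(range_satisfiable(1660261, FILE_SIZE))  # False: expect 416
```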
Taking a second look at the curl reproducer, I'm seeing the following behavior:
Oct 7 2025
Do we know what the current behavior is for layers that set X-Request-ID?
Oct 3 2025
yes, it's still happening https://grafana.wikimedia.org/goto/SHdP6s3HR?orgId=1:
Sep 11 2025
you got a nice mix of use cases there @A_smart_kitten.
Sep 9 2025
Sep 8 2025
probably unrelated, but I've found what could be a HAProxy bug related to %rt being incremented twice per request: https://github.com/haproxy/haproxy/issues/3107
vgutierrez@cp6016:~$ curl -X TRACE -i https://en.wikipedia.org
HTTP/2 405
content-length: 146
cache-control: no-cache
content-type: text/html
server: HAProxy
x-cache: cp6016 int
x-cache-status: int-tls
allow: DELETE, GET, HEAD, OPTIONS, PATCH, POST, PUT
local tests show that HAProxy issued 46410639 but it never reached the Kafka cluster, probably because haproxykafka failed to parse it for some reason. If this happens systematically after a <BADREQ>, I think we could have a bug in HPK
Sep 5 2025
puppet is now happy on deployment-cache-text08.
puppet is happier on deployment-cache-text08 but not 100%:
Sep 5 10:22:01 deployment-cache-text08 puppet-agent[1978923]: (/Stage[main]/Profile::Cache::Haproxy/File[/usr/share/GeoIP/datacenter.mmdb]) Could not evaluate: Could not retrieve information from environment production source(s) puppet:///volatile/datacenter_vendors/datacenter.mmdb
FWIW HAProxy provides UUIDv4 out of the box so it should be as easy as http-request set-header X-Request-Id %[uuid()].
Sep 4 2025
assigning the task to @JMeybohm, he is the SRE on clinic duty this week
Sep 3 2025
That's great, thanks @brouberol
I think we have 3 upcoming DAGs:
- the one covered by this task
- probenet data for the GeoDNS pipeline (T380626)
- Curate lists of well-known JA3N hashes (part of T400270)
Sep 2 2025
@brouberol hey! it looks like we split Airflow instances by team and we don't currently have an instance for SRE, so I'm guessing we would need to create it as well?
Sep 1 2025
You're totally right, that's referring to library default UAs like python-requests
Aug 30 2025
Aug 29 2025
Aug 26 2025
I'm closing this since we've fixed the wrong behavior in HAProxy regarding sequence numbers; please feel free to re-open it if you're still seeing issues with sequence numbers on your side.
Aug 25 2025
Aug 22 2025
From Special:Version it appears Lingua Libre is running mediawiki/oauthclient 1.1.0. The ability to set a custom User-Agent (via Config::setUserAgent) was added in 1.2.0 (commit 81edea5f545ef551cb1fb3e8937fd81c549fa94b, task T293609).
@Yug thanks, no need to keep appending user reports till @mickeybarber reports back. This is a well-known and expected behavior of the CDN as announced on https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/APW6FQBGIVLCEN7WZ65D4NVZ6XQIWGCW/ back in 2025-07-30
sampling on webrequest_sampled has been fixed by merging https://gerrit.wikimedia.org/r/1181033
looks good, I took care of merging it, thanks for the patch @Dzahn
change has been merged, please allow 30 minutes to let puppet apply the changes on the required systems. Thanks
Aug 21 2025
flagging as high because this is already making the downsampling in benthos fail (nice catch by @CDanis):
root = if this.ip != "-" && this.sequence != "-" && this.sequence % env("SAMPLING").number() != 0 { deleted() }

Right now we get the sequence number from the haproxy %rt log format field; that's request_counter (HTTP request or TCP session) according to its documentation. In the early stages of the TCP connection the request counter isn't accessible, so it gets logged as 0.
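The sampling mapping above can be sketched in Python to show why sequence 0 skews the sample (SAMPLING=128 is an assumed example value, not the production setting): a record is *kept* whenever sequence % SAMPLING == 0, so every <BADREQ> logged with sequence 0 passes the filter.

```python
SAMPLING = 128  # assumed example rate; the real value comes from the env

def kept(sequence: int) -> bool:
    # Mirrors the Benthos mapping: records where
    # sequence % SAMPLING != 0 are deleted, the rest are kept.
    return sequence % SAMPLING == 0

print(kept(0))    # True: sequence 0 always lands in the sample
print(kept(1))    # False
print(kept(256))  # True
```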
I think I've identified the issue: right now haproxy always logs sequence: 0 for <BADREQ> requests
SSH key verified out of band via https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1180839
vgutierrez@ldap-maint1001:~$ ldapsearch -x cn=nda |grep sad
member: uid=sadiyamohammed13,ou=people,dc=wikimedia,dc=org
vgutierrez@ldap-maint1001:~$ ldapsearch -x cn=wmde |grep sad
member: uid=sadiyamohammed13,ou=people,dc=wikimedia,dc=org
Aug 20 2025
@Dima_Koushha_WMDE could you create a gerrit change that contains your public SSH key (it can be immediately abandoned)? we can use that as a way of verifying the SSH key out-of-band, thanks
The change granting access to the requested groups has been merged, please allow up to 30 minutes to let puppet apply the changes on the impacted servers. Thanks
Aug 19 2025
I'm already seeing an account (https://ldap.toolforge.org/user/dang) requested on T288355 with some privileges:
dang:
ensure: present
realname: Tien Dat Nguyen
email: dat.nguyen@wikimedia.de
uid: 32183
gid: 500
ssh_keys:
- ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICnlB6UtmPKPJZOXl/2fkAC88ccb9dn15upi0SsifFg5 dang@C353

I'm seeing you have 3 LDAP accounts at the moment:
@dang could you create a CR on gerrit with your public SSH key to confirm it? thanks!
Aug 18 2025
sorry for the delay @Mayakp.wiki.
Thanks, please feel free to re-open it if needed
Aug 14 2025
You have that available as part of the sre.loadbalancer.migrate-service-ipip cookbook at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/migrate-service-ipip.py#131:
def _ipip_traffic_accepted(self, *, outer_src_ip: str, outer_dst_ip: str,
                           inner_src_ip: str, inner_dst_ip: str, dport: int) -> bool:
    """Send a single SYN packet using IPIP encapsulation"""
    s = socket(AF_INET, SOCK_STREAM)
    s.bind((inner_src_ip, 0))
    sport = s.getsockname()[1]
    syn_packet = (
        IP(src=outer_src_ip, dst=outer_dst_ip)
        / IP(src=inner_src_ip, dst=inner_dst_ip)
        / TCP(sport=sport, dport=dport, flags="S", seq=1000)
    )
    response = sr1(syn_packet, timeout=3, verbose=self.dry_run)
    s.close()
    return response is not None
Do you know which specific hostname the volunteer is asking about?
Aug 13 2025
Aug 7 2025
https://phabricator.wikimedia.org/P80962 for future reference

