User Details
- User Since
- Feb 12 2018, 9:51 AM (293 w, 4 d)
- Availability
- Available
- IRC Nick
- vgutierrez
- LDAP User
- Vgutierrez
- MediaWiki User
- VGutiérrez (WMF)
Wed, Sep 20
The patch has been merged; it should be effective in ~30 minutes, when puppet runs. @acooper should have received an email to change the password of his kerberos principal.
Mon, Sep 18
Thanks! Still blocked on @thcipriani for the deployment group membership.
Thu, Sep 14
Not sure why I've been pinged on this task, but anyway: the new disk needs to be added to the RAID array, as it's still degraded:
/dev/md/0:
           Version : 1.2
     Creation Time : Fri Sep  1 18:30:45 2023
        Raid Level : raid1
        Array Size : 937267200 (893.85 GiB 959.76 GB)
     Used Dev Size : 937267200 (893.85 GiB 959.76 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent
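In case it's useful, a minimal sketch of re-adding the replacement disk (the device name /dev/sdb is an assumption, check lsblk/dmesg for the actual one; /dev/md/0 and /dev/md0 refer to the same array):

# device name is an assumption, verify with lsblk before running
sudo mdadm --manage /dev/md0 --add /dev/sdb
# then watch the resync progress
cat /proc/mdstat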
Fri, Sep 8
Just to be clear, the RSA key is totally valid at this point; I just wanted to save @acooper more "pain" further down the line. The task is currently waiting for @thcipriani and @odimitrijevic / @Milimetric approvals :)
Thanks! @acooper, RSA keys are already being deprecated in some parts of our infrastructure (T336769), so I'm wondering if you could provide an ed25519 key rather than an rsa-4096 one. This should be totally feasible with a YubiKey 5 (I'm guessing you're using one due to the cardno comment in your SSH key).
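For reference, generating a plain file-based ed25519 key is a one-liner (the comment and output path below are placeholders; if the key should live on the YubiKey itself, e.g. via its OpenPGP applet, the workflow is different):

# placeholders: adjust the comment and output path as needed
ssh-keygen -t ed25519 -C "acooper@wikimedia.org" -f ~/.ssh/id_ed25519_wmf_prod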
A quick check with cumin shows several servers impacted:
vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'journalctl -u fifo-log-demux@notpurge.service --since=-1h | grep Error'
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit: 96
===== NODE GROUP =====
(1) cp5028.eqsin.wmnet
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----
Sep 08 09:18:14 cp5028 fifo-log-demux[1607]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
===== NODE GROUP =====
(1) cp2033.codfw.wmnet
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----
Sep 08 09:27:36 cp2033 fifo-log-demux[1320]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
===== NODE GROUP =====
(1) cp4042.ulsfo.wmnet
----- OUTPUT of 'journalctl -u fi...-1h | grep Error' -----
Sep 08 09:03:56 cp4042 fifo-log-demux[1831]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
================
PASS |████                                                                          |   3% (3/96) [00:04<02:14, 1.45s/hosts]
FAIL |█████████████████████████████████████████████████████████████████████████████▉|  97% (93/96) [00:04<00:00, 21.44hosts/s]
96.9% (93/96) of nodes failed to execute command 'journalctl -u fi...-1h | grep Error': cp[2027-2032,2034-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5027,5029-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[4037-4041,4043-4052].ulsfo.wmnet
3.1% (3/96) success ratio (< 100.0% threshold) for command: 'journalctl -u fi...-1h | grep Error'. Aborting.: cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet
3.1% (3/96) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.: cp2033.codfw.wmnet,cp5028.eqsin.wmnet,cp4042.ulsfo.wmnet
That's not an issue with ATS 9.2.1; the problem got fixed by restarting fifo-log-demux@notpurge.service:
vgutierrez@cp4052:~$ journalctl -u fifo-log-demux@notpurge.service -f
-- Journal begins at Sat 2023-08-05 22:39:02 UTC. --
Sep 03 20:57:26 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 06 15:47:35 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 07 07:52:46 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 07 08:17:20 cp4052 fifo-log-demux[1619]: Error writing to client connection: write unix /run/trafficserver/notpurge.sock->@: write: connection reset by peer
Sep 08 09:42:55 cp4052 systemd[1]: Stopping FIFO log demultiplexer (instance notpurge)...
Sep 08 09:42:55 cp4052 systemd[1]: fifo-log-demux@notpurge.service: Succeeded.
Sep 08 09:42:55 cp4052 systemd[1]: Stopped FIFO log demultiplexer (instance notpurge).
Sep 08 09:42:55 cp4052 systemd[1]: fifo-log-demux@notpurge.service: Consumed 9min 29.690s CPU time.
Sep 08 09:42:55 cp4052 systemd[1]: Started FIFO log demultiplexer (instance notpurge).
Sep 08 09:42:55 cp4052 fifo-log-demux[902329]: Waiting for connections on /run/trafficserver/notpurge.sock
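If the remaining cp hosts end up needing the same treatment, something along these lines with cumin should do it (a sketch only, not yet run; batch size and sleep are illustrative):

sudo -i cumin -b 4 -s 30 'A:cp' 'systemctl restart fifo-log-demux@notpurge.service'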
I almost forgot: for analytics-privatedata-users I'm assuming @acooper needs a kerberos principal as well; details are available at https://wikitech.wikimedia.org/wiki/Analytics/Data_access
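Once the principal is created and the password set, a quick way to confirm it works (the host below is just an example of an analytics client host):

ssh stat1007.eqiad.wmnet   # example analytics client host
kinit                      # obtain a ticket with the new password
klist                      # verify the ticket was granted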
We are also waiting on @acooper to submit their public SSH key.
deployment group membership requires the approval of @thcipriani, and analytics-privatedata-users that of @odimitrijevic / @Milimetric
Thu, Sep 7
vgutierrez@mwmaint1002:~$ sudo -i ldapsearch -x cn=ops |grep bro
member: uid=brouberol,ou=people,dc=wikimedia,dc=org
vgutierrez@mwmaint1002:~$ sudo -i ldapsearch -x cn=wmf |grep bro
member: uid=brouberol,ou=people,dc=wikimedia,dc=org
Wed, Sep 6
Having this in place would have prevented an ncredir-related page already. I'm happy to have this opt-in per cookbook (personally I'd enable it on cookbooks that depool hosts automatically).
Cheers, I've amended the patch to include the ops membership (already approved by @joanna_borun). The CR is still blocked until we get approvals from @odimitrijevic, @Milimetric and @Gehel.
analytics_privatedata_users membership requires approval from @odimitrijevic or @Milimetric per https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/data/data.yaml#409
vgutierrez@mwmaint1002:~$ sudo -i cross-validate-accounts --username brouberol --uid 45143 --email brouberol@wikimedia.org --real-name "Balthazar Rouberol" --ssh-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKunax7NU1Zx304QaTggTnIjXuY8rxgKTwReUMIffoIR brouberol@wikimedia.org" --kerberos
Tue, Sep 5
The provided SSH key is already used in WMCS; please provide a new one:
vgutierrez@mwmaint1002:~$ sudo -i cross-validate-accounts --username brouberol --uid 45143 --email brouberol@wikimedia.org --real-name "Balthazar Rouberol" --ssh-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPd+Ekept47K0yIJ91ByVo4q6TAbgVzzxIqfq6k1X0L8 brouberol@wikimedia.org" --kerberos
[...]
brouberol uses the same SSH key(s) in WMCS and production: {'AAAAC3NzaC1lZDI1NTE5AAAAIPd+Ekept47K0yIJ91ByVo4q6TAbgVzzxIqfq6k1X0L8'}
The change should be effective within ~30 minutes, once puppet runs on the impacted hosts.
key validated via Slack
Cache revalidation can further extend this period. After the initial 24-hour limit has passed, ATS will issue a conditional request to the backend. If the backend supports it, a 304 response should be returned, eliminating the need to resend the object.
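For illustration, the revalidation behaviour can be checked from the client side with a conditional request (the URL is a placeholder, and the ETag value has to be taken from the first response):

# first request: capture the validators returned by the backend
curl -sI https://upload.wikimedia.org/path/to/object | grep -iE 'etag|last-modified'
# conditional request: a 304 means the object body did not need to be resent
curl -sI -H 'If-None-Match: "<etag-from-above>"' https://upload.wikimedia.org/path/to/object | head -1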
Yeah, clearly I didn't phrase that properly; I was saying it from the PoV of Clinic Duty.
@Volans any idea how we could reduce the "false positives" of this alert? We got 7 occurrences in the last 30 days that apparently weren't actionable.
Your new key should be deployed in the next ~30 minutes. Please do not upload it to GitLab/Wikitech, to prevent this from happening again.
Mon, Sep 4
The key needs to be uploaded to the puppet repo; you could use this CR as an example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/949839, or I could craft a new one for you.
Waiting for OOB validation
@ppenloglou that's right. As stated in https://wikitech.wikimedia.org/wiki/People.wikimedia.org, people.wm.o is part of the production environment and its SSH key can't be shared with other environments.
@ppenloglou please let us know if you need help submitting a new SSH key for the production environment; otherwise we will close this task.
Thu, Aug 31
Happy to provide assistance and guidance if needed, but caching is technically controlled by the backend services and not by the CDN.
The CDN imposes some limits on what's cacheable and for how long (for example, it caps the TTL at 24h and flags an object as uncacheable if it's bigger than 1 GB), but cacheability itself is managed by the Cache-Control header set by Thumbor.
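A quick way to see what Thumbor/swift are telling the CDN for a given thumbnail is to inspect the response headers (URL is a placeholder):

curl -sI 'https://upload.wikimedia.org/wikipedia/commons/thumb/<path-to-thumbnail>' | grep -iE 'cache-control|age|x-cache'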
vgutierrez@carrot:~$ curl -o /dev/null https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf?vgutierrez=1 --limit-rate 1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1953M  100 1953M    0     0  1024k      0  0:32:33  0:32:33 --:--:-- 1122k
Aug 28 2023
The current task name is misleading; 100Mbps is definitely enough to download the file without triggering the ATS timeout. Your curl output shows an average speed of 1298 kbytes per second, which is roughly 10.4 Mbps (1298 kB/s × 8), consistent with a 10Mbps network, not a 100Mbps one. I just used --limit-rate 12M (12 megabytes per second, i.e. 12 × 8 = 96 Mbps, roughly 100Mbps) to test it:
vgutierrez@carrot:~$ curl -o /dev/null https://upload.wikimedia.org/wikipedia/commons/9/9f/ZHSY000097_%E5%AE%8B%E6%9B%B8%E4%B8%80%E7%99%BE%E5%8D%B7_%28%E6%A2%81%29%E6%B2%88%E7%B4%84_%E6%92%B0_%E5%AE%8B%E5%88%BB%E5%AE%8B%E5%85%83%E6%98%8E%E9%81%9E%E4%BF%AE%E6%9C%AC.pdf --limit-rate 12M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1953M  100 1953M    0     0  12.0M      0  0:02:42  0:02:42 --:--:-- 12.2M
Aug 25 2023
it seems that the ATS issue could be addressed by https://github.com/apache/trafficserver/pull/8083
A quick check on cp3081 shows the following results:
- HAProxy closes the connection after 245 seconds (327 seconds in a second test)
- varnish closes the connection after 353 seconds (393 seconds in a second test)
- ATS closes the connection after 908 seconds
- swift allows slowly fetching the entire object (completed in 32 minutes using --limit-rate 1M)
As a side effect of moving to envoy we would get https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1 data for swift. As stated in the task description, the current TLS termination layer used by swift is the old TLS termination designed for untrusted clients at the CDN. Migrating to envoy would align the service with the vast majority of backend servers that we run nowadays, benefiting from wider internal support.
Aug 23 2023
we are currently using confd 0.16 from https://gerrit.wikimedia.org/g/operations/debs/confd:
vgutierrez@cp6016:~$ apt policy confd
confd:
  Installed: 0.16.0-1+deb11u0
  Candidate: 0.16.0-1+deb11u0
  Version table:
 *** 0.16.0-1+deb11u0 1001
       1001 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/main amd64 Packages
      100 /var/lib/dpkg/status
Aug 16 2023
we need to cover 3.11 as well, as it's the Python version shipped with Debian bookworm: https://packages.debian.org/bookworm/python3
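For reference, a quick check on any bookworm host:

python3 --version   # bookworm ships Python 3.11, e.g. "Python 3.11.2"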
Aug 3 2023
vgutierrez@carrot:~$ for i in {1..100}; do curl -s -v https://www.wikifunctions.org/view/en/Z10000 -o /dev/null 2>&1 |egrep "200|404"; done | sort |uniq -c
     83 < HTTP/2 200
     17 < HTTP/2 404
$ curl -s -v -o /dev/null https://www.wikifunctions.org/view/en/Z10000
*   Trying 185.15.58.224:443...
* Connected to www.wikifunctions.org (185.15.58.224) port 443 (#0)
* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [19 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [3191 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [78 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Wikimedia Foundation, Inc.; CN=*.wikipedia.org
*  start date: Oct 27 00:00:00 2022 GMT
*  expire date: Nov 17 23:59:59 2023 GMT
*  subjectAltName: host "www.wikifunctions.org" matched cert's "*.wikifunctions.org"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert TLS Hybrid ECC SHA384 2020 CA1
*  SSL certificate verify ok.
} [5 bytes data]
* using HTTP/2
* h2h3 [:method: GET]
* h2h3 [:path: /view/en/Z10000]
* h2h3 [:scheme: https]
* h2h3 [:authority: www.wikifunctions.org]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x563b226fcc70)
} [5 bytes data]
> GET /view/en/Z10000 HTTP/2
> Host: www.wikifunctions.org
> user-agent: curl/7.88.1
> accept: */*
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [265 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
< HTTP/2 404
< date: Thu, 03 Aug 2023 12:41:55 GMT
< server: mw-web.eqiad.main-57cbd6c888-njqqm
< cache-control: s-maxage=600
< content-type: text/html; charset=utf-8
< vary: Accept-Encoding
< age: 273
< x-cache: cp6014 hit, cp6014 pass
< x-cache-status: hit-local
< server-timing: cache;desc="hit-local", host;desc="cp6014"
< strict-transport-security: max-age=106384710; includeSubDomains; preload
< report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
< nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
< set-cookie: WMF-Last-Access=03-Aug-2023;Path=/;HttpOnly;secure;Expires=Mon, 04 Sep 2023 12:00:00 GMT
< set-cookie: WMF-Last-Access-Global=03-Aug-2023;Path=/;Domain=.wikifunctions.org;HttpOnly;secure;Expires=Mon, 04 Sep 2023 12:00:00 GMT
< x-client-ip: 81.39.92.198
< set-cookie: GeoIP=ES:GA:Boiro:42.65:-8.90:v4; Path=/; secure; Domain=.wikifunctions.org
< set-cookie: NetworkProbeLimit=0.001;Path=/;Secure;Max-Age=3600
<
{ [1248 bytes data]
* Connection #0 to host www.wikifunctions.org left intact
Aug 2 2023
It looks like it's a matter of how we graph the data; please see: https://grafana.wikimedia.org/goto/7xCydjqVk?orgId=1
Aug 1 2023
getting rid of KA (keep-alive) didn't help much, per https://grafana.wikimedia.org/goto/JcVQsuqVk?orgId=1:
Jul 31 2023
After disabling KA, haproxy_frontend_connections_total{proxy="stats"} starts to increase as expected:
Regarding the HAProxy reload process: basically HAProxy spawns a new process (started with the new configuration) and hands over all the listening file descriptors to it.
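A minimal sketch of how that hand-over is wired (paths are assumptions; on our hosts the reload is normally driven via systemd rather than by hand):

# haproxy.cfg, global section: allow listening FDs to be passed over the stats socket
#   stats socket /run/haproxy/haproxy.sock mode 600 level admin expose-fd listeners
# manual equivalent of a reload: the new process fetches the FDs from the old one
haproxy -W -f /etc/haproxy/haproxy.cfg -x /run/haproxy/haproxy.sock -sf <old-master-pid>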
Jul 28 2023
I'm wondering if reducing the hard-stop-after window from 5m to something smaller than the Prometheus scrape interval (once a minute) could get rid of this. What are your thoughts @fgiunchedi?
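Something like this in the global section would keep the old process from outliving a full Prometheus scrape interval (the exact value is just an example):

# haproxy.cfg, illustrative value below the 1-minute scrape interval
global
    hard-stop-after 50s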
Jul 24 2023
0 Backend_health - vcl-84635598-fffa-4367-86af-05856c435a6e.be_cp3064_esams_wmnet Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - (null) Went sick -------H 2 3 5 0.000000 0.000000
0 Backend_health - vcl-a35c116d-adeb-4b22-9d49-fe43a85ae5c6.be_cp3058_esams_wmnet Back healthy 4---X-RH 3 3 5 0.000395 0.000132 HTTP/1.1 200 OK
Grafana shows a regression in Lua performance after the update to 9.2.1:
After checking https://github.com/apache/trafficserver/blob/9.2.x/CHANGELOG-9.2.0 I've noticed:
#8784 - Propagate proxy.config.net.sock_option_flag_in to newly accepted connections
The healthcheck response gets generated by our default.lua, specifically:
function do_global_read_request()
    if ts.client_request.header['Host'] == 'healthcheck.wikimedia.org' and ts.client_request.get_uri() == '/ats-be' then
        ts.http.intercept(function()
            ts.say('HTTP/1.1 200 OK\r\n' ..
                   'Content-Length: 0\r\n' ..
                   'Cache-Control: no-cache\r\n\r\n')
        end)
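For the record, the intercept can be exercised directly against the local ats-be instance with the right Host header and URI (the port is an assumption about where ats-be listens on the cp host):

curl -i -H 'Host: healthcheck.wikimedia.org' http://localhost:3128/ats-be   # port is an assumption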