Usually services use wrr (weighted round robin) to balance traffic across nodes
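For illustration, a naive Python sketch of how wrr schedules requests (node names and weights here are made up, and this is not the actual load balancer implementation):

import itertools

# Each node appears in the schedule as many times as its weight,
# so higher-weighted nodes receive proportionally more requests.
nodes = {"cp7001": 3, "cp7002": 2, "cp7003": 1}
schedule = itertools.cycle([n for n, w in nodes.items() for _ in range(w)])

for _ in range(6):
    print(next(schedule))  # cp7001 x3, then cp7002 x2, then cp7003 x1 per cycle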
Thu, Jun 27
Wed, Jun 26
as discussed in the meeting, you can rely on the X-Client-IP header being present to tell CDN requests and internal requests apart.
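A minimal Python sketch of that check (the function name and headers dict are hypothetical, purely for illustration):

def is_cdn_request(headers: dict) -> bool:
    # Assumption per the comment above: the CDN layer sets X-Client-IP,
    # while internal requests reach the service directly and lack it.
    return "X-Client-IP" in headers

print(is_cdn_request({"X-Client-IP": "203.0.113.7"}))  # True: came via the CDN
print(is_cdn_request({"Host": "en.wikipedia.org"}))    # False: internal request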
Tue, Jun 25
Mon, Jun 24
cache_haproxy.mtail failed to accept -1 as an HTTP status code, under-reporting CD and CR termination states.
this is caused by a bug in the mtail regex used to parse haproxy logs; on haproxy 2.6.17 the http status gets reported as -1:
2024-06-24T13:46:52.422444+00:00 cp7001 haproxy[2780368]: 180684 -1 0 0 -1 {es.wikipedia.org} {} CD
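To illustrate the failure mode in Python (the patterns below are simplified stand-ins, not the actual cache_haproxy.mtail program):

import re

line = "180684 -1 0 0 -1 {es.wikipedia.org} {} CD"

# A status field matched with \d+ can never match -1, so such lines are skipped.
broken = re.compile(r"^\d+ (\d+) ")
# Allowing an optional leading minus sign accepts -1 as well.
fixed = re.compile(r"^\d+ (-?\d+) ")

print(broken.match(line))          # None: the request is not counted
print(fixed.match(line).group(1))  # '-1'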
Thu, Jun 20
In T367756#9907435, @Fabfur wrote: After upgrading HAProxy to 2.8.10 on the whole of ulsfo we still see some errors in the kafka DLQ like:
this needs to be reported upstream
Wed, Jun 19
@Aklapper we briefly discussed this yesterday during the Traffic weekly meeting; you can proceed and enable preconnect
Tue, Jun 18
we already leverage preconnect in some cases, although not as an HTTP header but via the HTML <link> tag:
$ curl -v -s https://en.wikipedia.org/wiki/Main_Page 2>&1 | grep -i preconnect
<link rel="preconnect" href="//upload.wikimedia.org">
Mon, Jun 17
Don't be too aggressive with this one; we may need to roll back at some point. Let's wait a few weeks at the very least
Wed, Jun 12
Tue, Jun 11
ferm-based MSS clamping is live on the ncredir cluster
Mon, Jun 10
Thu, Jun 6
Jun 5 2024
Jun 3 2024
May 23 2024
tcp-mss-clamper is already being used to perform MSS clamping on the ncredir and CDN upload clusters
May 22 2024
May 21 2024
acme-chief 0.37 deployed, shipping https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/8
The scope of the task should be rejecting invalid HTTP requests (and not only HTTP/1.0 ones) on HAProxy rather than varnish, as soon as we have analytics moved to HAProxy
May 20 2024
May 16 2024
removed raw sockets usage. We are now fetching MSS data via getsockopt()
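For reference, a minimal Python sketch of fetching the MSS via getsockopt() on Linux (this only illustrates the socket option, it is not the actual tcp-mss-clamper code):

import socket

# TCP_MAXSEG returns the maximum segment size of a connected TCP socket,
# so no raw-socket packet sniffing is needed to observe the MSS.
sock = socket.create_connection(("en.wikipedia.org", 443))
mss = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
print(f"negotiated MSS: {mss} bytes")
sock.close()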
May 13 2024
In T364691#9790288, @Ladsgroup wrote: In T364691#9790207, @Vgutierrez wrote: we had a big spike of 503s on eqiad/drmrs/esams yesterday during EU morning: https://grafana.wikimedia.org/goto/J4YqQuYIR?orgId=1
I saw that, but the timing doesn't match. What I'm getting from users is a constant ~10% of all pageviews being like this for days now.
we had a big spike of 503s on eqiad/drmrs/esams yesterday during EU morning: https://grafana.wikimedia.org/goto/J4YqQuYIR?orgId=1
file_metadata is already there and supports individual files. The only limitation is that it currently expects the parameters links=manage&source_permissions=ignore, which correlate with the file resource attributes of the same names (https://www.puppet.com/docs/puppet/7/types/file.html#file-attribute-source_permissions and https://www.puppet.com/docs/puppet/7/types/file.html#file-attribute-links).
May 10 2024
your request is missing some required parameters for the file_metadata endpoint:
$ curl -H "Accept: application/json" \
    --cert /var/lib/puppet/ssl/certs/mx-out1001.wikimedia.org.pem \
    --key /var/lib/puppet/ssl/private_keys/mx-out1001.wikimedia.org.pem \
    --cacert /var/lib/puppet/ssl/certs/ca.pem \
    "https://acmechief2002.codfw.wmnet:8140/puppet/v3/file_metadata/acmedata/mx-out/live/ec-prime256v1.crt?links=manage&source_permissions=ignore" | jq
{
  "checksum": {
    "type": "md5",
    "value": "{md5}b86fb140f227639d70ad971db461c82c"
  },
  "destination": null,
  "group": 498,
  "links": "manage",
  "mode": 420,
  "owner": 498,
  "path": "/etc/acmecerts/live/ec-prime256v1.crt",
  "relative_path": null,
  "type": "file"
}
May 8 2024
May 7 2024
https://gerrit.wikimedia.org/r/1028818 removed the prometheus jobs; the alert should go away as soon as puppet runs on the prometheus hosts.
that's right, this is a leftover from the migration from mtail to benthos on ncredir. We will take care of it ASAP.
May 6 2024
May 3 2024
Apr 30 2024
Apr 19 2024
running another test, this time with 3x10k requests, it looks like the culprit is the socket_server UDP input, which drops packets:
processor_latency_ns_count{label="syslog_format",path="root.input.processors.0"} 29963

vgutierrez@ncredir2001:/var/log/nginx$ cat /proc/net/udp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
1330: 0100007F:04C5 00000000:0000 07 00000000:00000000 00:00000000 00000000 18837 0 64597138 2 0000000000000000 37
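A small Python sketch of how those drop counters can be read programmatically (assuming the standard /proc/net/udp layout shown above):

# The last column of each /proc/net/udp row is the per-socket drop counter;
# the local port is hex-encoded after the colon in local_address.
with open("/proc/net/udp") as f:
    next(f)  # skip the header row
    for row in f:
        fields = row.split()
        local_address, drops = fields[1], fields[-1]
        if drops != "0":
            port = int(local_address.split(":")[1], 16)
            print(f"udp port {port}: {drops} dropped datagrams")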
Testing benthos on ncredir2001 shows some concerning results (TL;DR it looks like benthos drops some messages and metrics aren't as accurate as expected).