User Details
- User Since
- Aug 29 2023, 8:30 AM (76 w, 4 d)
- Availability
- Available
- IRC Nick
- arnaudb
- LDAP User
- Arnaudb
- MediaWiki User
- ABran-WMF [ Global Accounts ]
Thu, Feb 6
Good idea! T385777 has been created
Tue, Feb 4
- node_files_total was used to list all lists.* instances available on thanos, but the metric was apparently absent from lists2001. I replaced it with up, which properly fetches the complete list of instances (see the sketch below).
- a host overview dashboard link has also been added
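For reference, a sketch of the query shape now used to enumerate the instances (the label matcher and the query endpoint variable are assumptions; the dashboard variable may wrap this in a Grafana label_values() helper):
  # hypothetical sketch: enumerate lists* targets via `up` instead of node_files_total
  QUERY='up{instance=~"lists.*"}'
  curl -sG "${THANOS_QUERY_URL}/api/v1/query" --data-urlencode "query=${QUERY}"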
Mon, Feb 3
I've added more metrics to the original dashboard:
503 errors are visible here: https://w.wiki/Cvfc
Thu, Jan 30
I think it mostly depends on the use case. For CI, a simple job that only downloads resources from the Internet would only need the masquerade rule. GitLab runners create their own docker network (we can even ask for a per-job network if we want to get up to that isolation level on CI), which would also require a dedicated per-network masquerade rule. We can toggle the feature if needed. Managing containers that need to expose a port would be a bit tedious, as the port would need to be mapped via puppet config onto the container's IP:port. OTOH I'm not aware of docker containers in use outside of kubernetes and CI jobs. Another solution could be to wait for our runners to natively support nftables, to avoid the toil of manually working around compatibility while the feature is being developed.
- currently, the base nftables firewall (abstracted via the firewall puppet profile) installs a base table, which is mostly for filtering traffic. Maybe we can consider having all NAT stuff on a different table, which should be easier to understand, and cleaner to integrate via puppet.
- This is what we do with cloudgw anyway. We install the base nftables config from the firewall puppet profile AND a separate cloudgw table with the NAT-specific stuff (see the sketch below).
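A minimal sketch of what such a dedicated NAT table could look like, expressed as nft commands (the table name, interface and subnet below are placeholders, not the actual cloudgw or CI runner configuration):
  # hypothetical sketch: keep NAT rules in their own table, separate from the base filter table
  nft add table inet nat_docker
  nft add chain inet nat_docker postrouting '{ type nat hook postrouting priority srcnat ; policy accept ; }'
  # masquerade traffic leaving the docker network through the host's uplink
  nft add rule inet nat_docker postrouting oifname "eno1" ip saddr 172.17.0.0/16 masquerade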
I reused part of the way you mentioned preparing the configuration for CI runners in T370677; it duplicates a bit of code that we could merge into a more generic approach.
- we could even rename or refactor the current cloudgw implementation into a generic nftables-NAT profile, as the basis for further development anyway
I'd be happy to help on that effort!
Wed, Jan 29
Thanks for the review @Jelto! After discussing with you and @MoritzMuehlenhoff on IRC, it appeared that T385042 was needed to take the most generic approach to the situation.
We will aim to avoid using iptables-nft and stick to managing the rules ourselves. It should not be an issue for any CI job that doesn't need to expose a port to the Internet.
Mon, Jan 27
The panel was returning NaN because mailman_smtp_duration_seconds and mailman_smtp_total were both always at 0, so the initial query was dividing by zero; a clamp_min in the query now limits the denominator to a minimum of 1.
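The resulting query shape is roughly the following (a sketch only; the actual panel query may differ, e.g. it may wrap the metrics in rate()):
  # guard the denominator so an all-zero mailman_smtp_total no longer produces NaN
  QUERY='mailman_smtp_duration_seconds / clamp_min(mailman_smtp_total, 1)'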
I've started to update Mailman 3
Jan 16 2025
After merging the CR, running puppet and trying the download again, the issue was still not fixed. I've asked Traffic about Varnish and they told me it was probably not the culprit.
It seems the issue could come from Varnish terminating the transfer once a timeout is reached; maybe the transfer took less time in that previous deployment? I've asked Traffic on IRC.
I've temporarily configured the vhost to be more "download friendly":
  EnableSendfile Off
  EnableMMAP Off
  Timeout 600
  KeepAliveTimeout 15
  LimitRequestBody 0
Also tried using only LimitRequestBody 0, which seemed to be the culprit from my previous tests, but with no more success.
but had no success retrieving the file; debug logging told me nothing more than the http/200 that we previously identified. Will keep digging.
I think this is due to the file size; I've generated an empty 2GB test file and it has the same issue around the same offset (~1.80GB). The original file is dated Mar 6 2023, so I'm guessing something has changed in the server's configuration since then. I have found nothing yet, I'll keep on digging.
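For reference, one way to generate such an empty test file (the path is hypothetical):
  # hypothetical sketch: create a sparse 2GB file to reproduce the truncation
  truncate -s 2G ~/public_html/empty-2g.bin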
Oddly, compressing the file fixes the issue, while a plain copy of the file reproduces it consistently:
  $ wget --verbose --output-document /dev/null https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin
  HTTP response 200 [https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin]
  /dev/null 79% [=========================================>          ] 1.81G 28.22MB/s
  [Files: 1 Bytes: 1.81G [27.90MB/s] Redirects: 0 Todo: 0 Errors: 0]
multipart download retrieves the file properly: P72108
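The actual commands are in P72108; a generic ranged-download sketch along those lines might look like this (the byte offsets are illustrative):
  # hypothetical sketch: fetch the file in three ranges, then reassemble
  URL=https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin
  curl -s -r 0-999999999           "$URL" -o part0 &
  curl -s -r 1000000000-1999999999 "$URL" -o part1 &
  curl -s -r 2000000000-           "$URL" -o part2 &
  wait
  cat part0 part1 part2 > model.bin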
I tried to downgrade to http/1.1 and added a bit more verbosity: P72105
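The exact invocation is in P72105; the general shape of that test might be:
  # hypothetical sketch: force HTTP/1.1 and add verbosity, discarding the body
  curl -v --http1.1 --output /dev/null https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin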
curl gives a bit more information about the download termination; apache still sends http/200:
  curl: (92) HTTP/2 stream 1 was not closed cleanly: CANCEL (err 8)
  HTTP Response Code: 200
the download fails consistently indeed, and is logged as a 200:
2025-01-16T09:22:49 65001309 -/200 2464511302 GET http://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin - application/octet-stream - Wget/2.2.0 - - - - d66cd6e8-46cd-4e23-910f-5d8477edefa0
Jan 8 2025
I stumbled upon a conversion issue on this part of the template.
Dec 3 2024
rebased and updated with the latest spicerack release, will run it on the replicas first
Dec 2 2024
Thanks @Jclark-ctr! The puppet patch has been merged.
This will help, and will test T381086#10366983.
Nov 29 2024
Thanks! I already did some scripting during the DC switch: T375186#10160612 and T375186#10167177. Wdyt of the gathered data, and of the way it's shown?
- compare weights and groups (API/vslow/dumps/...) and pooling for core hosts
- compare weights and groups (API/vslow/dumps/...) and pooling for es hosts
This stems from my previous question. Comparison should be done, I think, across sections and between comparable hosts (i.e. vslow → vslow, api → api, etc.), but weight distribution must also be taken into account. This is where I think we should spend a bit of thinking time, to figure out a good angle to compare hosts properly. We could adopt a pattern stating that an instance with api has a maximum weight of xxx, vslow reduces that weight to yyy, etc. This approach would introduce a bias by ignoring hardware differences between the datacenters; I think that should be OK as long as the hardware between the DCs is not too different. We could also calculate the total weight of an instance, using a "sum of weights", but this would introduce a bias towards the api group etc., imho.
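As a rough illustration of the kind of comparison I have in mind (a sketch only: it assumes `dbctl config get` dumps the live configuration as JSON, and the key names below are placeholders):
  # hypothetical sketch: diff per-host weights for one section across DCs
  dbctl config get > dbconfig.json
  jq '.eqiad.sectionLoads.s1' dbconfig.json > s1-eqiad.json   # assumed key layout
  jq '.codfw.sectionLoads.s1' dbconfig.json > s1-codfw.json
  diff -u s1-eqiad.json s1-codfw.json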
- monitoring notifications enabled on all relevant hosts
- validate that the event/query killer and pt-heartbeat are properly set up everywhere
Having those in cookbooks will already save quite a bit of time.
In my mind those would be monitoring probes aggregated under an umbrella indicator; I've already prepared T375589 in that direction. This would also make that information available to cookbooks, by using the thanos module of spicerack to query for the alerts aggregated under that umbrella.
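As an illustration of the umbrella idea, queried directly via PromQL rather than through spicerack (the scope label and endpoint variable are assumptions):
  # hypothetical sketch: list firing alerts grouped under an assumed umbrella label
  QUERY='ALERTS{scope="db-readiness", alertstate="firing"}'
  curl -sG "${THANOS_QUERY_URL}/api/v1/query" --data-urlencode "query=${QUERY}"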
Nov 28 2024
Please let me know when I can act on this wiki, as it is supposed to be private; I would like to handle it quickly once it is ready :-)
Attaching metadata to the server would also help, e.g. tagging hosts that are currently running a schema update, to allow for automated exclusion of sections or automated alerting (i.e. host db9999 has been running a schema change for too long, send a warning alert, etc.). Those are 2 very basic examples; I think we can only benefit from "more programmatically accessible information".
Nov 25 2024
s4 is now done:
Result: {"already done in all dbs": ["db1150:3314", "db1190", "db1199", "db1221", "db1238", "db1241", "db1242", "db1243", "db1244", "db1245:3314", "db1247", "db1248", "db1249", "dbstore1007:3314", "db2136", "db2139:3314", "db2140", "db2147", "db2155", "db2172", "db2199:3314", "db2206", "db2210", "db2219", "db2236", "db2237", "db2240"]} Result: {"already done in all dbs": ["db1160", "db2179"]}
Nov 22 2024
ack, good to know thanks!
cookbooks have been merged, TODO: test & debug
Nov 21 2024
After a bit of thinking around how this script should be done:
@bvibber: the logfile is in your home directory on deploy2002: /home/bvibber/mw-script.codfw.6497ohz1.log
job has been deleted:
  kubectl --kubeconfig=admin-codfw.config -n mw-script delete job mw-script.codfw.6497ohz1
  job.batch "mw-script.codfw.6497ohz1" deleted
Will do, thanks for the info @bvibber !
Nov 20 2024
I'm uncertain about db2231, so I commented it out in P71104. @Marostegui, is misc the right section?