Blurb
User Details
- User Since
- Oct 3 2014, 8:40 AM (398 w, 16 h)
- Roles
- Administrator
- Availability
- Available
- IRC Nick
- akosiaris
- LDAP User
- Alexandros Kosiaris
- MediaWiki User
- AKosiaris (WMF) [ Global Accounts ]
Wed, May 18
deploy1002 will need to be scheduled well in advance and/or failed over to deploy2002, as it is the canonical deployment host.
Many thanks for this!
Tue, May 17
GitLab-wise, I can create a personal repo easily, but my guess is that's not ideal either. I guess I shouldn't just jump the gun and create a repo on my own under the non-personal hierarchy, right?
Mon, May 16
Ah, cool! Thanks for the update!
Regarding the "fake nodes": I think that could be done by adding the leaves as a GlobalNetworkSet in the K8s/Calico API. That should make them easily selectable via peerSelectors, without the confusion that fake nodes would create.
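As a rough sketch of the idea (the name, label and addresses below are made up, not the actual leaf IPs), a Calico GlobalNetworkSet for the leaves would look something like:

  apiVersion: projectcalico.org/v3
  kind: GlobalNetworkSet
  metadata:
    name: leaf-switches          # hypothetical name
    labels:
      role: leaf-switch          # label the set so it can be selected later
  spec:
    nets:
      - 192.0.2.10/32            # placeholder addresses, one entry per leaf
      - 192.0.2.11/32

Whether peerSelectors can actually match on a GlobalNetworkSet's labels (rather than only on node labels) is something we'd have to verify against the Calico version we run.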
Fri, May 13
The 50% bump in capacity didn't make any noticeable difference this time around. :-(
helm 2 did have a different structure regarding these things; it's definitely fine to revisit and re-evaluate the approach.
we should consider whether it makes sense to make an exception and renumber these hosts from 2037 so that they are on par with eqiad.
Note that when https://commons.wikimedia.org/wiki/Special:Upload is mentioned above, we are focusing on a specific subset of Special:Upload: the "Copy by URL" functionality, which is only triggered via the respective radio box. See the image for a visual explanation.
For what it's worth, I think we've peaked.
Wed, May 11
Thanks for the update. Load has somewhat increased on our side, albeit minimally.
Just to note a semantic thing here:
Tue, May 10
Mon, May 9
Looking at https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&viewPanel=37&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&editPanel=37 and temporarily adding a legend with max to the right:
@bd808, just for greater visibility, as I said in https://gerrit.wikimedia.org/r/c/773994, you can proceed and self-merge https://gerrit.wikimedia.org/r/c/773994 and https://gerrit.wikimedia.org/r/c/773995 and do the first deploy of it. We are unfortunately short-handed right now and can't properly review the chart. The rest of the changes have been merged, so we shouldn't be a blocker for getting developer-portal deployed. Don't hesitate however to reach out if you hit any roadblocks or need any help!
Fri, May 6
Thu, May 5
Just for greater visibility and awareness, there is T301505 for the "upstream connect error or disconnect/reset before headers. reset reason: overflow" error. As pointed out in that task, that error message is a symptom and not the cause.
Been reading some blogs and talks about this 'stretched' Kafka cluster idea, and it sure would make multi-DC apps much easier. The linked Kafka talk, though, recommends not doing it unless you can guarantee latency between DCs of <= 100ms. I have a feeling that our SREs would not like to guarantee that.
Wed, May 4
Tue, May 3
Fri, Apr 29
As a starting point: @jhathaway noted that we're running ffmpeg at niceness -19, which is quite aggressive (niceness ranges from -20, the highest priority, to 19, the lowest, so at -19 ffmpeg crowds out nearly everything else); raising that value might be an easy way to relieve the pressure. I don't have historical context for why it is that way, but if we can change it safely, it might be a good first step.
https://gerrit.wikimedia.org/r/787708 removes the puppet lvm module, which is GPL-2 and incompatible with Apache 2.0. So that removes an interesting blocker to adopting a cross-repo license.
Thu, Apr 28
Apr 20 2022
Apr 19 2022
So, Prometheus follows redirects by default. In https://github.com/prometheus/prometheus/commit/646556a2632700f7fca42cec51d0100294d43c52 support for disabling that functionality was added. I'd say that for production we want to default to not following redirects; it should help us avoid weird edge cases like this one.
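For reference, a rough sketch of what that could look like per scrape job (the job name and target below are placeholders):

  scrape_configs:
    - job_name: 'example'          # placeholder job
      follow_redirects: false      # do not follow HTTP 3xx responses from targets
      static_configs:
        - targets: ['host.example.org:9100']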
Apr 18 2022
I think that you are right. Those numbers don't look OK. And I do notice the following:
Apr 15 2022
I am reopening this. Due to https://wikitech.wikimedia.org/wiki/Incidents/2022-03-27_api we had to lower the concurrency by half in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/774462
Some more information regarding this: with the exception of the warnings stanza, the responses by mathoid for the queries `\land xcc` and `\and xcc` are identical.
@Wurgl, nailed it!
Apr 14 2022
- The monitoring: stanza can't be added, as having it without lvs: breaks Icinga. It can potentially be ignored (T291946), see above; a rough values sketch follows below.
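Purely to illustrate that relationship (the keys inside the stanzas are hypothetical; only the monitoring:/lvs: interplay comes from the point above), the values would look something like:

  # Hypothetical values sketch; field names are illustrative.
  monitoring:
    enabled: true      # adding only this stanza is what breaks Icinga...
  lvs:
    enabled: true      # ...unless the lvs: stanza is present as well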
Stalling until T306121 is done.
I had a quick look at the OpenStack browser. The only VM that is still on stretch is packager01.packaging.eqiad1.wikimedia.cloud. I only use that one for building the etherpad-lite Debian packages because they are unfortunately unbuildable in our production infrastructure. I just tested that I can build them on packager02.packaging.eqiad1.wikimedia.cloud, so feel free to delete packager01.
Apr 13 2022
@Papaul: mc2023 and kubestage2002 have been downtimed again (for 2 days) and I've just powered them off. They should be ready to be moved.
As pointed out in T305902, those metrics have already been scraped by Prometheus for a pretty long time now. What's left is to actually alter the dashboard to reference those metrics and not the service-runner ones.
Tentatively resolving this.
I don't see an alert since Apr 5 in Icinga for either codfw or eqiad. I am gonna re-enable paging.
Apr 12 2022
Thanks for the ping, wouldn't have seen it otherwise. Re-opening and I'll have a look.
Apr 11 2022
I marked rdb2008, kubestage2002 and mc2023 as YES in the table. rdb2008 is the secondary, not the primary; kubestage2002 is for the staging cluster anyway; and mc2023 will be handled by mcrouter's configuration, with shard05 being moved to mc-gp* hosts (gutter pool).
Apr 7 2022
akosiaris@deploy1002:~$ kubectl get events
LAST SEEN   TYPE      REASON      OBJECT                                  MESSAGE
2m49s       Warning   Unhealthy   pod/zotero-production-684574794-6n999   Readiness probe failed: Get http://10.64.64.81:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
38s         Warning   Unhealthy   pod/zotero-production-684574794-g5xr6   Readiness probe failed: Get http://10.64.65.14:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
6m17s       Warning   Unhealthy   pod/zotero-production-684574794-gv9nf   Readiness probe failed: Get http://10.64.68.96:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
9m40s       Warning   Unhealthy   pod/zotero-production-684574794-jglk4   Readiness probe failed: Get http://10.64.64.67:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
17m         Warning   Unhealthy   pod/zotero-production-684574794-jlk69   Readiness probe failed: Get http://10.64.68.239:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m35s       Warning   Unhealthy   pod/zotero-production-684574794-nlqz2   Readiness probe failed: Get http://10.64.64.60:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
17m         Warning   Unhealthy   pod/zotero-production-684574794-nr4ng   Readiness probe failed: Get http://10.64.69.222:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
31m         Warning   Unhealthy   pod/zotero-production-684574794-ps595   Readiness probe failed: Get http://10.64.68.138:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
27m         Warning   Unhealthy   pod/zotero-production-684574794-sk5zw   Readiness probe failed: Get http://10.64.67.19:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
39m         Warning   Unhealthy   pod/zotero-production-684574794-st8qr   Readiness probe failed: Get http://10.64.67.28:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m38s       Warning   Unhealthy   pod/zotero-production-684574794-tj2qh   Readiness probe failed: Get http://10.64.67.249:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
48m         Warning   Unhealthy   pod/zotero-production-684574794-wf5rq   Readiness probe failed: Get http://10.64.71.49:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
5m2s        Warning   Unhealthy   pod/zotero-production-684574794-wlbwg   Readiness probe failed: Get http://10.64.66.98:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
34m         Warning   Unhealthy   pod/zotero-production-684574794-xtdqm   Readiness probe failed: Get http://10.64.66.153:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
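For context, those messages come from zotero's HTTP readiness probe timing out. A sketch of what such a probe looks like in a pod spec (the timeout and period values here are illustrative, not the chart's actual settings):

  readinessProbe:
    httpGet:
      path: /?spec         # the endpoint kubelet polls, as seen in the events above
      port: 1969
    timeoutSeconds: 1      # illustrative; exceeding this produces the Client.Timeout errors above
    periodSeconds: 10      # illustrative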
Apr 6 2022
I am resolving this follow-up from https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-03-27_api. This will decrease errors from mobileapps in the future, and responses will be given to changeprop more promptly, lowering the number of retries.
And now that enough time has passed, the rate of errors is indeed lower (I've arbitrarily drawn a couple of lines at around the 90+th percentile to showcase it easily). There is also less variation. This isn't exactly scientific, but I'd say it's good enough.
Apr 5 2022
After the patch was merged and deployed we have happier graphs!
Apr 4 2022
Apr 2 2022
For posterity's sake, the alert histograms from Icinga for the 2 instances of zotero.
Apr 1 2022
Mar 29 2022
Mar 28 2022
Mar 21 2022
This has been done, resolving!