User Details
- User Since
- Nov 5 2018, 2:54 PM (293 w, 4 d)
- Availability
- Available
- IRC Nick
- cdanis
- LDAP User
- CDanis
- MediaWiki User
- CDanis (WMF)
Yesterday
Apologies @dcaro, but I had less time for this than I expected this week; I was only able to do some prep work and wasn't ready to touch anything in production before Friday.
@JMeybohm The best way to fix this is by adding a calico definition to the chart directory in a file whose name starts with wmf-, correct?
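For illustration, here's roughly the shape I'd imagine for such a wmf- file, assuming it holds a plain Calico NetworkPolicy manifest — the name, namespace, selector, and rule below are made up, not a proposal for the real definition:

```yaml
# Purely illustrative sketch of a Calico NetworkPolicy as it might live in a
# wmf-*.yaml file in the chart directory; all values here are placeholders.
apiVersion: crd.projectcalico.org/v1
kind: NetworkPolicy
metadata:
  name: example-egress          # placeholder name
  namespace: example-namespace  # placeholder namespace
spec:
  selector: app == 'example'
  types:
    - Egress
  egress:
    - action: Allow
      protocol: TCP
      destination:
        ports:
          - 443
```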
Thu, Jun 20
From discussion on IRC:
10:14:32 <@cdanis> _joe_ akosiaris claime: my plan is to smoke-test {jobrunner,wikifunctions,parsoid,misc} for a few more minutes, then roll out mw-api-int sampled tracing https://gerrit.wikimedia.org/r/1048011 , and then let that bake over the weekend before doing mw-api-ext and then mw-web
10:14:52 <@akosiaris> 👍
10:15:29 <@claime> ack
10:16:21 <@cdanis> I also gave o11y a heads up about the increased writes to elasticsearch
10:16:24 <@_joe_> seems sensible
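For a sense of what "sampled tracing" can mean on the collector side, a hedged sketch follows — whether the linked Gerrit change works this way is an assumption, and the percentage is made up; the real change is in the patch above.

```yaml
# Illustrative otelcol fragment only; not the contents of the linked patch.
processors:
  probabilistic_sampler:
    sampling_percentage: 1   # assumed value, not the production setting
service:
  pipelines:
    traces:
      processors: [probabilistic_sampler, batch]
```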
Tue, Jun 18
If we need to do a version of this for bare-metal hosts, we will, but for now let's not.
💙cdanis@alert1001.wikimedia.org ~ 🕜☕ sudo statograph -c /etc/statograph/config.yml list_metrics
Metric 'Wiki response time' (id lyfcttm2lhw4) with most recent data at Tue, 18 Jun 2024 17:30:00 +0000 (@1718731800.0)
I think the last step to do here is to validate that any rsync failures will get reported on IRC. Then we can consider all the immediate followups of this incident done, and more slowly continue on with the larger work at T367119: Install a default timeout for systemd::timer::jobs.
Mon, Jun 17
Suggestions from discussion at I/F meeting:
- It's probably not necessary or desirable to add this to all of the contextmanager usages of alertmanager silences, as those are expected to be very short-lived
- "Manual" invocations of sre.hosts.downtime should almost certainly do this. Or in general any process where we don't have a pretty deterministic estimated-time-to-completion.
- We don't have an equivalent of "check optimal" for alertmanager, only Icinga. We should probably have this.
- Would be very good to have a dedicated dashboard for silences that are suppressing active alerts but being auto-extended
Alternatives to consider:
- Make this a required field instead of adding a default [harder up-front but potentially safer]
- Make omitting this field a wmf puppet style guide violation [slower version of the above]
Fri, Jun 14
Mentioning T364280: Add jaeger-ui and other stuff to mwcli here.
Thu, Jun 13
Very helpful, thanks @dcaro and enjoy the pto!
Hi all. @joanna_borun asked me to do some looking into this. I promise I skimmed the above, but I'm sure I missed things, so please pardon me for the pretty basic questions.
Wed, Jun 12
Example trace as processed in codfw production:
https://trace.wikimedia.org/trace/06aabdeeb578a2663034270cf6d4accf
- Remove known PII (sessionstore URL)
- Filter out some very noisy spans (e.g. healthchecks, Special:BlankPage, etc.)
Both of these verified working in production :)
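For a sense of shape, the filtering is roughly along these lines in collector-config terms — a hedged sketch only; the attribute key, span names, and processor names below are assumptions rather than the actual patch contents:

```yaml
# Illustrative otelcol fragment, not the deployed config.
processors:
  attributes/strip-pii:
    actions:
      - key: http.url          # assumed key carrying the sessionstore URL
        action: delete
  filter/drop-noise:
    error_mode: ignore
    traces:
      span:
        - 'name == "healthcheck"'                   # assumed span name
        - 'IsMatch(name, ".*Special:BlankPage.*")'  # assumed name pattern
service:
  pipelines:
    traces:
      processors: [attributes/strip-pii, filter/drop-noise, batch]
```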
Patches as written depend upon otelcol v0.102.0, so upgrading again.
Tue, Jun 11
Mon, Jun 10
Wed, Jun 5
Tue, Jun 4
I discussed this with @Muehlenhoff in his evening/my morning.
I think you are right @Vgutierrez, thanks
Mon, Jun 3
Results after adding BR.ix are in.
Thu, May 30
18:58:42 <+jinxer-wm> RESOLVED: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
As we expected/hoped, the increase in eqiad TX bytes was only about 10-15%.
Wed, May 29
- A few times we triggered a rolling restart of mw-api-int pods in eqiad to deliberately cause the CPU/network TX spikes on the two original masters, so we could observe more about the behavior. We wanted to figure out "what had changed" such that a scap deploy was now creating a ProbeDown page.
- We captured pcaps on a few different k8s masters during such events, and then ran those through Wireshark's Statistics > Conversations feature. Here's one such breakdown. I sorted the list of streams by bytes and manually tagged the top dozen or so streams. I stopped tagging a few entries into a very long run of nearly-identical byte counts from local port 6443 (apiserver) to different node IPs; if you sum up the ones I tagged plus all of those, that makes up 97% of the bytes in the sample.
- Of that 97% portion I inspected, everything sending packets to and from the apiserver machines was reasonable-looking: reading from an etcd's port 2380 was about 5% of overall bytes, then after that, in order, the apiserver sending lots of data to all of: the istiod pod, a calico-kube-controllers pod, a k8s-controller-sidecars pod, various different node IPs (so one of kubelet, kube-proxy, or rsyslogd)... all of them expected, known usages of the API.
- This is 'just' absence of evidence but I'm gonna go ahead and call it evidence of absence here.
Tue, May 28
Posting a short comment now, before I start drafting a much longer comment (and possibly don't finish before my toddler ends my day):
Fri, May 24
Thu, May 23
I tested out simply enabling the k8s attributes processor in the chart's values. The diffs all looked quite reasonable, but of course it doesn't work:
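For reference, enabling it via chart values looks roughly like the fragment below, assuming the upstream chart's config passthrough — the keys and pipeline are illustrative, not the actual diff:

```yaml
# Illustrative values fragment only, not the real chart diff.
config:
  processors:
    k8sattributes:
      extract:
        metadata:
          - k8s.namespace.name
          - k8s.pod.name
          - k8s.node.name
  service:
    pipelines:
      traces:
        processors: [k8sattributes, batch]
```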
Traces are flowing again in eqiad.
helmfile apply went seamlessly, but unfortunately this broke trace collection: I realized only in retrospect that this effectively also changes the DNS name of the collector, and that's vendored into a lot of other charts with the full old name: main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local
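The vendored references look something like the fragment below (the key names and port vary per chart and are assumptions here):

```yaml
# Illustrative only: how the old collector service name is typically hardcoded
# into a consuming chart's values; the surrounding keys are placeholders.
tracing:
  endpoint: main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local:4317
```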
Hi Arzhel, for when I do have time to look at this, do you have a recommended way of reproducing this without breaking anything or actually affecting a network device?