User Details
- User Since: Nov 5 2018, 2:54 PM (290 w, 1 d)
- Availability: Available
- IRC Nick: cdanis
- LDAP User: CDanis
- MediaWiki User: CDanis (WMF)
Today
- A few times we triggered a rolling restart of mw-api-int pods in eqiad to deliberately cause the CPU/network TX spikes on the two original masters, so we could observe more about the behavior. We wanted to figure out "what had changed" such that a scap deploy was now creating a ProbeDown page.
- We captured pcaps on a few different k8s masters during such events, and then ran them through Wireshark's Statistics > Conversations feature (a rough script equivalent is sketched after this list). Here's one such breakdown. I sorted the list of streams by bytes and manually tagged the top dozen or so; I stopped a few entries into a very long run of nearly-identical byte counts from local port 6443 (apiserver) to different node IPs. If you sum up the ones I tagged plus all of those, that makes up 97% of the bytes in the sample.
- Of that 97% I inspected, everything sending packets to and from the apiserver machines looked reasonable: reads from an etcd's port 2380 were about 5% of overall bytes, and after that, in order, the apiserver was sending lots of data to the istiod pod, a calico-kube-controllers pod, a k8s-controller-sidecars pod, and various node IPs (so one of kubelet, kube-proxy, or rsyslogd)... all of them expected, known usages of the API.
- This is 'just' absence of evidence but I'm gonna go ahead and call it evidence of absence here.
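For anyone who wants to poke at a similar capture without Wireshark, here's a rough equivalent of the Statistics > Conversations breakdown using scapy; the capture filename and the top-20 cutoff are made up for illustration:

```python
# Approximate Wireshark's Statistics > Conversations: sum bytes per TCP
# conversation (both directions together) and print the largest ones
# along with their share of the capture.
from collections import Counter
from scapy.all import rdpcap, IP, TCP

bytes_per_conv = Counter()
for pkt in rdpcap("apiserver.pcap"):  # hypothetical capture file
    if IP in pkt and TCP in pkt:
        endpoints = ((pkt[IP].src, pkt[TCP].sport), (pkt[IP].dst, pkt[TCP].dport))
        bytes_per_conv[tuple(sorted(endpoints))] += len(pkt)

total = sum(bytes_per_conv.values())
for ((ip_a, port_a), (ip_b, port_b)), nbytes in bytes_per_conv.most_common(20):
    print(f"{ip_a}:{port_a} <-> {ip_b}:{port_b}  {nbytes} bytes  ({nbytes / total:.1%})")
```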
Yesterday
Posting a short comment now, before I start drafting a much longer one (which I may not finish before my toddler ends my day):
Fri, May 24
Thu, May 23
I tested simply enabling the k8s attributes processor in the chart's values. The diffs all looked quite reasonable, but of course it doesn't work:
Traces are flowing again in eqiad.
helmfile apply went seamlessly, but unfortunately this broke trace collection: I realized only in retrospect that this effectively also changes the DNS name of the collector, and the full old name, main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local, is vendored into a lot of other charts.
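As an aside, here's a quick sketch for finding every chart that still vendors the old FQDN; the repo path below is a placeholder, not necessarily where the charts actually live:

```python
# Walk a checkout of the charts repo and print every YAML file that still
# references the old collector FQDN, so each can be updated with the rename.
import pathlib

OLD_NAME = "main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local"

for path in pathlib.Path("deployment-charts").rglob("*.yaml"):  # hypothetical checkout path
    if OLD_NAME in path.read_text(errors="ignore"):
        print(path)
```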
Hi Arzhel, for when I do have time to look at this, do you have a recommended way of reproducing it without breaking anything or actually affecting a network device?
Wed, May 22
During work on T320563 we learned that we had made a very brittle assumption about naming in this task.
I think the docs at https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow#Skein need to be updated; there are still many mentions of "time bombs" there.
Mon, May 20
Fri, May 17
Latest results: magru is a clear win for BR, AR, CL, PY, UY, BO
The 3rd transit was also of great help to Chile, and probably Peru (although sample size there is a bit small).
Patches welcome :)
Thu, May 16
Adding the 3rd transit link in magru greatly improved the latency for many users in Argentina.
@GreenReaper thanks so much for the helpful contribution :) I'll see if I can reproduce your results.
Sure @mpopov! I was running over today's subset (about 228k events).
I can confirm that refinery-hive-0.2.31-shaded.jar does not show the issue on the same dataset.
Wed, May 15
At a glance, this seems to work.
05:08:49 <@claime> I think this is the same or a closely related issue https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32080#issuecomment-2057446307
05:09:30 <@claime> They're talking about logs, but that last comment seems to point to a larger point about how ottl treats scopes
Tue, May 14
Still seeing the bad data in new traces.
The good news is that this mostly works, and that I also have a patch pending to fix the issues I had with my first attempt at rolling it out.
To add some context:
Mon, May 13
Added you to all those Wikipedias plus their mobile variants
Tobi, please approve and reassign to me, thanks!
Hi @OSefu-WMF, is there a subset of wikiprojects that you think would be sufficient for measuring the impact here? Perhaps just a few of the most popular Wikipedias? I ask because adding access to Google Search Console is a manual, per-domain process.
Thu, May 9
I was thinking about this while falling asleep last night, and I think it's actually desirable to drop some messages under a persistently high incoming request rate, as a way to shed some load.
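To make the idea concrete, here's a minimal sketch (mine, not anything in the codebase) of rate-triggered shedding: count arrivals in a short window and probabilistically drop once the window exceeds a threshold; all names and numbers are placeholders.

```python
import random
import time

class LoadShedder:
    """Drop a fraction of messages once the recent arrival rate exceeds a limit."""

    def __init__(self, max_per_window=1000, window_s=1.0, drop_fraction=0.5):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.drop_fraction = drop_fraction
        self.count = 0
        self.window_start = time.monotonic()

    def should_drop(self):
        now = time.monotonic()
        if now - self.window_start >= self.window_s:
            # start a fresh accounting window
            self.count = 0
            self.window_start = now
        self.count += 1
        over_limit = self.count > self.max_per_window
        return over_limit and random.random() < self.drop_fraction

# usage: skip processing when the shedder says to drop
# shedder = LoadShedder()
# if not shedder.should_drop():
#     process(message)
```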
Wed, May 8
Mon, May 6
This is an amazing proof-of-concept, thanks so much @TK-999 !!!
I checked my shell history on deploy1002 and all I've done there recently is scap backport 1026628.
Fri, May 3
Unfortunately subdivision-level mapping didn't help in PE -- in many regions the results are mixed, with magru both better and worse than eqiad.
magru is a clear win for:
UY, CL, AR, BR, PY
Oh, and I think magru is a win for SV as well.
```python
import wmfdata
# create a Spark session on the analytics YARN cluster
spark = wmfdata.spark.create_session(type='yarn-regular')
```
Thu, May 2
That sounds good to me @elukey. I don't think a new intermediate is needed.
FYI this happened for me again, despite the above patch
19:48:44 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-02-194555-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out
+1, omit_replicas_in_mwconfig seems like the right way to begin implementing this.