User Details
- User Since: Nov 5 2018, 2:54 PM (290 w, 1 d)
- Availability: Available
- IRC Nick: cdanis
- LDAP User: CDanis
- MediaWiki User: CDanis (WMF)
Today
- A few times we triggered a rolling restart of mw-api-int pods in eqiad to deliberately cause the CPU/network TX spikes on the two original masters, so we could observe more about the behavior. We wanted to figure out "what had changed" such that a scap deploy was now creating a ProbeDown page.
- We captured pcaps on a few different k8s masters during such events, and then ran them through Wireshark's Statistics > Conversations feature (a rough script equivalent is sketched after this list). Here's one such breakdown. I sorted the list of streams by bytes and manually tagged the top dozen or so; I stopped a few entries into a very long run of nearly-identical byte counts from local port 6443 (apiserver) to different node IPs. If you sum up the ones I tagged plus all of those, that makes up 97% of the bytes in the sample.
- Of that 97% I inspected, everything sending packets to and from the apiserver machines looked reasonable: reads from an etcd's port 2380 were about 5% of overall bytes, and after that, in order, the apiserver was sending lots of data to the istiod pod, a calico-kube-controllers pod, a k8s-controller-sidecars pod, and various node IPs (so one of kubelet, kube-proxy, or rsyslogd)... all of them expected, known usages of the API.
- This is 'just' absence of evidence but I'm gonna go ahead and call it evidence of absence here.
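For anyone who wants to poke at a similar capture without Wireshark, here's a rough equivalent of the Statistics > Conversations breakdown using scapy; the capture filename and the top-20 cutoff are made up for illustration:

```python
# Approximate Wireshark's Statistics > Conversations: sum bytes per TCP
# conversation (both directions together) and print the largest ones
# along with their share of the capture.
from collections import Counter
from scapy.all import rdpcap, IP, TCP

bytes_per_conv = Counter()
for pkt in rdpcap("apiserver.pcap"):  # hypothetical capture file
    if IP in pkt and TCP in pkt:
        endpoints = ((pkt[IP].src, pkt[TCP].sport), (pkt[IP].dst, pkt[TCP].dport))
        bytes_per_conv[tuple(sorted(endpoints))] += len(pkt)

total = sum(bytes_per_conv.values())
for ((ip_a, port_a), (ip_b, port_b)), nbytes in bytes_per_conv.most_common(20):
    print(f"{ip_a}:{port_a} <-> {ip_b}:{port_b}  {nbytes} bytes  ({nbytes / total:.1%})")
```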
Yesterday
Posting a short comment now, before I start drafting a much longer one (which I may not finish before my toddler ends my day):
Fri, May 24
Thu, May 23
I tested simply enabling the k8s attributes processor in the chart's values. The diffs all looked quite reasonable, but of course it doesn't work:
Traces are flowing again in eqiad.
helmfile apply went seamlessly, but unfortunately this broke trace collection: I realized only in retrospect that this effectively also changes the DNS name of the collector, and the full old name, main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local, is vendored into a lot of other charts.
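As an aside, here's a quick sketch for finding every chart that still vendors the old FQDN; the repo path below is a placeholder, not necessarily where the charts actually live:

```python
# Walk a checkout of the charts repo and print every YAML file that still
# references the old collector FQDN, so each can be updated with the rename.
import pathlib

OLD_NAME = "main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local"

for path in pathlib.Path("deployment-charts").rglob("*.yaml"):  # hypothetical checkout path
    if OLD_NAME in path.read_text(errors="ignore"):
        print(path)
```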
Hi Arzhel, for when I do have time to look at this, do you have a recommended way of reproducing it without breaking anything or actually affecting a network device?
Wed, May 22
During work on T320563 we learned that we had made a very brittle assumption about naming in this task.
I think the docs at https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow#Skein need to be updated; there are still many mentions of "time bombs" there.
Mon, May 20
Fri, May 17
Latest results: magru is a clear win for BR, AR, CL, PY, UY, BO
The 3rd transit was also of great help to Chile, and probably Peru (although sample size there is a bit small).
Patches welcome :)
Thu, May 16
Adding the 3rd transit link in magru greatly improved the latency for many users in Argentina.
@GreenReaper thanks so much for the helpful contribution :) I'll see if I can reproduce your results.
Sure @mpopov! I was running over today's subset (about 228k events).
I can confirm that refinery-hive-0.2.31-shaded.jar does not show the issue on the same dataset.
Wed, May 15
At a glance, this seems to work.
05:08:49 <@claime> I think this is the same or a closely related issue https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32080#issuecomment-2057446307
05:09:30 <@claime> They're talking about logs, but that last comment seems to point to a larger point about how ottl treats scopes
Tue, May 14
Still seeing the bad data in new traces.
The good news is that this mostly works, and that I also have a patch pending to fix the issues I had with my first attempt at rolling it out.
To add some context:
Mon, May 13
Added you to all those Wikipedias plus their mobile variants
Tobi, please approve and reassign to me, thanks!
Hi @OSefu-WMF, is there a subset of wikiprojects that you think would be sufficient for measuring the impact here? Perhaps just a few of the most popular Wikipedias? I ask because adding access to Google Search Console is a manual, per-domain process.
Thu, May 9
I was thinking about this while falling asleep last night, and I think it's actually desirable to drop some messages under a persistently high incoming request rate, as a way to shed some load.
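To make the idea concrete, here's a minimal sketch (mine, not anything in the codebase) of rate-triggered shedding: count arrivals in a short window and probabilistically drop once the window exceeds a threshold; all names and numbers are placeholders.

```python
import random
import time

class LoadShedder:
    """Drop a fraction of messages once the recent arrival rate exceeds a limit."""

    def __init__(self, max_per_window=1000, window_s=1.0, drop_fraction=0.5):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.drop_fraction = drop_fraction
        self.count = 0
        self.window_start = time.monotonic()

    def should_drop(self):
        now = time.monotonic()
        if now - self.window_start >= self.window_s:
            # start a fresh accounting window
            self.count = 0
            self.window_start = now
        self.count += 1
        over_limit = self.count > self.max_per_window
        return over_limit and random.random() < self.drop_fraction

# usage: skip processing when the shedder says to drop
# shedder = LoadShedder()
# if not shedder.should_drop():
#     process(message)
```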
Wed, May 8
Mon, May 6
This is an amazing proof-of-concept, thanks so much @TK-999 !!!
I checked my shell history on deploy1002 and all I've done there recently is scap backport 1026628.
Fri, May 3
Unfortunately subdivision-level mapping didn't help in PE -- in many regions the results are mixed, with magru both better and worse than eqiad.
magru is a clear win for:
UY, CL, AR, BR, PY
Oh, and I think magru is a win for SV as well.
```python
import wmfdata
# create a Spark session on the analytics YARN cluster
spark = wmfdata.spark.create_session(type='yarn-regular')
```
Thu, May 2
That sounds good to me @elukey. I don't think a new intermediate is needed.
FYI this happened for me again, despite the above patch
19:48:44 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-02-194555-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out
+1, omit_replicas_in_mwconfig seems like the right way to begin implementing this.