Today

CDanis added a comment to T340552: MediaWiki imports OpenTelemetry client instrumentation library for enhanced trace metadata.

This is an amazing proof-of-concept, thanks so much @TK-999 !!!

Mon, May 6, 1:55 PM · Patch-For-Review, Wikimedia-Hackathon-2024, MediaWiki-Platform-Team (Radar), MediaWiki-libs-HTTP, Observability-Tracing
CDanis added a comment to T364309: deployment: fix-staging-perms fails to finish.

I checked my shell history on deploy1002 and all I've done there recently is scap backport 1026628.

Mon, May 6, 1:19 PM
CDanis updated the name of F50517022: 2024-05-06 user-measured magru latency as violin plots, per country, Latin/South America from "image.png" to "2024-05-06 user-measured magru latency as violin plots, per country, Latin/South America".
Mon, May 6, 1:13 PM

Fri, May 3

CDanis created T364166: Add isp_data to event_transforms refine.
Fri, May 3, 5:43 PM · probenet, Event-Platform, Data-Engineering
CDanis created T364164: Probenet support for per-ip-block mappings.
Fri, May 3, 5:31 PM · Epic, probenet
CDanis edited Description on probenet.
Fri, May 3, 5:08 PM
CDanis added a project to T362902: Add probenet configuration for magru: probenet.
Fri, May 3, 5:04 PM · probenet, MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), Patch-For-Review, netops, Infrastructure-Foundations, SRE
CDanis added a project to T337317: compare Probenet data w/ NEL data: probenet.
Fri, May 3, 5:04 PM · probenet, Infrastructure-Foundations, SRE
CDanis added a project to T338037: move use of Math.random() to mw.user.getPageviewToken() in probenet.js: probenet.
Fri, May 3, 5:04 PM · probenet, MediaWiki-extensions-WikimediaEvents, MW-1.41-notes (1.41.0-wmf.12; 2023-06-06)
CDanis added a project to T337318: decide on an aggregation function to combine multiple probes into a single measurement: probenet.
Fri, May 3, 5:04 PM · probenet, SRE, Traffic, Infrastructure-Foundations
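T337318 above asks which aggregation function should combine multiple probes into a single measurement. The feed does not record a decision, so the following is only a rough sketch comparing two common candidates, the median and the minimum, over a list of round-trip times. The function name `aggregate_probes` and its `method` parameter are hypothetical and not part of probenet.

```python
from statistics import median


def aggregate_probes(rtts_ms, method="median"):
    """Combine several probe RTTs (milliseconds) into one measurement.

    Hypothetical helper for illustration; the candidate methods are
    assumptions, not a decision recorded in the task.
    """
    if not rtts_ms:
        raise ValueError("need at least one probe measurement")
    if method == "median":
        # Robust to a single slow outlier probe.
        return median(rtts_ms)
    if method == "min":
        # Approximates the best-case path latency.
        return min(rtts_ms)
    raise ValueError(f"unknown aggregation method: {method}")
```

The median dampens one-off slow probes, while the minimum tracks the best observed path; which of these (if either) suits the probenet data is exactly the open question in the task.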
CDanis added a project to T334417: Receive network latency reports from the browsers: probenet.
Fri, May 3, 5:04 PM · probenet, MW-1.41-notes (1.41.0-wmf.19; 2023-07-25), Infrastructure-Foundations
CDanis added a project to T347114: NetworkProbeLimit cookie for Probenet overwritten on every link hover event: probenet.
Fri, May 3, 5:02 PM · probenet, MediaWiki-extensions-WikimediaEvents, Infrastructure-Foundations, Wikimedia-Performance-recommendation
CDanis added a comment to T363722: Craft geo-maps file to create lowest-latency routes from south america.

Unfortunately subdivision-level mapping didn't help in PE -- there are many regions where magru is both better and worse than eqiad.

Fri, May 3, 4:59 PM · Traffic
CDanis created T364155: Create project tag for probenet.
Fri, May 3, 4:54 PM · Project-Admins
CDanis added a comment to T363722: Craft geo-maps file to create lowest-latency routes from south america.

magru is a clear win for:
UY, CL, AR, BR, PY

Fri, May 3, 4:37 PM · Traffic
CDanis added a comment to T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps.

Oh, and I think magru is a win for SV as well.

Fri, May 3, 4:01 PM · Infrastructure-Foundations, SRE, Traffic
CDanis awarded T233681: compare-and-swap writes for confctl edit and for dbctl commit a Love token.
Fri, May 3, 3:53 PM · conftool
CDanis added a comment to T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends.

I think I have two questions:

  1. Where is it defined what should and shouldn't get its own intermediate? (e.g. I see cassandra has one)
Fri, May 3, 2:11 PM · Patch-For-Review, SRE, SRE-swift-storage
CDanis updated the name of F49977482: Brazil RTT broken down by subdivision from "image.png" to "Brazil RTT broken down by subdivision".
Fri, May 3, 1:23 PM
CDanis added a comment to F49974214: Initial user-measured magru latency as violin plots, per country, Latin/South America.
python
import wmfdata
spark = wmfdata.spark.create_session(type='yarn-regular')
Fri, May 3, 1:12 PM
CDanis updated the name of F49974214: Initial user-measured magru latency as violin plots, per country, Latin/South America from "Initial user-measured magru latency, per country, Latin/South America" to "Initial user-measured magru latency as violin plots, per country, Latin/South America".
Fri, May 3, 1:00 PM
CDanis added a comment to T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps.

magru is a clear win for:
UY, CL, AR, BR, PY

Fri, May 3, 12:59 PM · Infrastructure-Foundations, SRE, Traffic
CDanis updated the name of F49974214: Initial user-measured magru latency as violin plots, per country, Latin/South America from "image.png" to "Initial user-measured magru latency, per country, Latin/South America".
Fri, May 3, 12:55 PM

Thu, May 2

CDanis added a comment to T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends.

That sounds good to me @elukey . I don't think a new intermediate is needed.

Thu, May 2, 8:47 PM · Patch-For-Review, SRE, SRE-swift-storage
CDanis added a comment to T363971: scap should not run mediawiki-image-download on pooled=inactive servers.

FYI this happened for me again, despite the above patch:

19:48:44 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-02-194555-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out
Thu, May 2, 8:27 PM · Release-Engineering-Team, Scap
CDanis added parent tasks for T362902: Add probenet configuration for magru: T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps, T363722: Craft geo-maps file to create lowest-latency routes from south america.
Thu, May 2, 8:09 PM · probenet, MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), Patch-For-Review, netops, Infrastructure-Foundations, SRE
CDanis added a subtask for T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps: T362902: Add probenet configuration for magru.
Thu, May 2, 8:09 PM · Infrastructure-Foundations, SRE, Traffic
CDanis added a subtask for T363722: Craft geo-maps file to create lowest-latency routes from south america: T362902: Add probenet configuration for magru.
Thu, May 2, 8:09 PM · Traffic
CDanis closed T362902: Add probenet configuration for magru as Resolved.
Thu, May 2, 8:09 PM · probenet, MW-1.43-notes (1.43.0-wmf.3; 2024-04-30), Patch-For-Review, netops, Infrastructure-Foundations, SRE
CDanis closed T362902: Add probenet configuration for magru, a subtask of T362421: magru network setup, as Resolved.
Thu, May 2, 8:08 PM · Patch-For-Review, netops, SRE, Infrastructure-Foundations
CDanis created P61742 (An Untitled Masterwork).
Thu, May 2, 3:43 PM
CDanis assigned T362786: Enable dbctl for parsercache to Scott_French.

+1, omit_replicas_in_mwconfig seems like the right way to begin implementing this.

Thu, May 2, 1:19 PM · Infrastructure-Foundations, Data-Persistence, conftool

Wed, May 1

CDanis added a subtask for T350592: EPIC: migrate in use metrics and dashboards to statslib: T363914: Discrepancy between Graphite & Prometheus editResponseTime counts.
Wed, May 1, 3:09 PM · Epic, SRE Observability (FY2023/2024-Q4), MW-1.42-notes (1.42.0-wmf.15; 2024-01-23), MediaWiki-Platform-Team (Radar), Observability-Metrics
CDanis added a parent task for T363914: Discrepancy between Graphite & Prometheus editResponseTime counts: T350592: EPIC: migrate in use metrics and dashboards to statslib.
Wed, May 1, 3:09 PM · MediaWiki-Platform-Team, Observability-Metrics
CDanis triaged T363914: Discrepancy between Graphite & Prometheus editResponseTime counts as High priority.
Wed, May 1, 3:08 PM · MediaWiki-Platform-Team, Observability-Metrics
CDanis created T363914: Discrepancy between Graphite & Prometheus editResponseTime counts.
Wed, May 1, 3:08 PM · MediaWiki-Platform-Team, Observability-Metrics

Mon, Apr 29

CDanis added a comment to T363407: Proper service names in trace data.

OK, understood. The only thing I'm really worried about is that metrics change or get less intuitive with this. For example, in here it's pretty clear what the filter means (selecting "local_service"). I think we will lose clarity here if local_service changes to mw-web.eqiad.main. Maybe adding local as a suffix/prefix would help here (and you could strip that out again in OTTL)?

Mon, Apr 29, 2:36 PM · Observability-Tracing

Fri, Apr 26

CDanis added a comment to T363581: Build a machine-readable catalogue of mariadb tables in production.

I like the idea! A few questions

Fri, Apr 26, 2:26 PM · DBA

Thu, Apr 25

CDanis added a comment to T363407: Proper service names in trace data.

BTW in case it was not clear, my intentions here are basically:

  • deploy something ASAP (like next week) that everyone is reasonably happy with for the interim
  • don't do anything to get in the way of the badly-needed Envoy upgrade
  • don't break anything else
Thu, Apr 25, 4:23 PM · Observability-Tracing
CDanis added a comment to T363407: Proper service names in trace data.

Thanks for the write-up!
What is not very clear to me is which part of the work would need to be done anyway (in case we had an Envoy version >= 1.24). The reason I'm asking is that Envoy 1.23 has been EOL for a year or so, so we need to look at an upgrade anyway.

Thu, Apr 25, 2:19 PM · Observability-Tracing

Wed, Apr 24

Ladsgroup awarded T363407: Proper service names in trace data a Love token.
Wed, Apr 24, 8:06 PM · Observability-Tracing
CDanis added a subtask for T320549: distributed tracing v0 [minimum viable]: T363407: Proper service names in trace data.
Wed, Apr 24, 8:03 PM · Epic, Observability-Tracing
CDanis added a parent task for T363407: Proper service names in trace data: T320549: distributed tracing v0 [minimum viable].
Wed, Apr 24, 8:03 PM · Observability-Tracing
CDanis created T363407: Proper service names in trace data.
Wed, Apr 24, 8:03 PM · Observability-Tracing

Thu, Apr 18

CDanis reopened T360029: Integrate dbctl IP changes as part of VLAN changes. as "Open".
Thu, Apr 18, 2:38 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis reopened T360029: Integrate dbctl IP changes as part of VLAN changes. , a subtask of T354878: Re-IP db servers in codfw row A/B moving to per-rack subnets, as Open.
Thu, Apr 18, 2:37 PM · Data-Persistence, SRE, Infrastructure-Foundations
CDanis closed T360029: Integrate dbctl IP changes as part of VLAN changes. as Resolved.

Anyway I think that all that is needed to unblock VLAN migrations has been done or documented on this ticket? Optimistically closing but please re-open if you disagree.

Thu, Apr 18, 2:20 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis closed T360029: Integrate dbctl IP changes as part of VLAN changes. , a subtask of T354878: Re-IP db servers in codfw row A/B moving to per-rack subnets, as Resolved.
Thu, Apr 18, 2:19 PM · Data-Persistence, SRE, Infrastructure-Foundations
CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

As for the commit, I advocate adding dbctl support to Spicerack, but IIRC that requires changes in dbctl, as most of its logic is in its CLI part and not exposed as a library; this is to be checked.

Thu, Apr 18, 2:19 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis triaged T362893: Spicerack support for dbctl as Low priority.
Thu, Apr 18, 2:17 PM · Infrastructure-Foundations, conftool, SRE-tools, Spicerack
CDanis created T362893: Spicerack support for dbctl.
Thu, Apr 18, 2:16 PM · Infrastructure-Foundations, conftool, SRE-tools, Spicerack

Wed, Apr 17

CDanis added a comment to T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps.

I largely agree with Arzhel's assessment. At a cursory glance, Uruguay or Paraguay look ideal as first candidates.

Wed, Apr 17, 7:12 PM · Infrastructure-Foundations, SRE, Traffic
CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

I think you should be able to use the existing spicerack interface to confctl to do the set/host_ip=... action -- that should be equivalent to a ConftoolEntity.update call.

Wed, Apr 17, 5:01 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

@Marostegui As it turns out, plain old confctl can be used to do this already.

Wed, Apr 17, 4:07 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

Actually, the idea is that dbctl should not contain the IPs at all. It should look up the IP via DNS; we should store the FQDN instead.

Wed, Apr 17, 3:59 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
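The comment above proposes storing an FQDN and resolving the IP via DNS at lookup time. A minimal sketch of that idea, using only the Python standard library, might look like the following. It is not dbctl's actual code: the function name `resolve_host_ips` and the default port 3306 are assumptions made for illustration.

```python
import socket


def resolve_host_ips(fqdn, port=3306):
    """Return the sorted, de-duplicated IP addresses the resolver reports.

    Hypothetical helper: the name and the default port (3306, the usual
    MariaDB port) are assumptions, not taken from dbctl.
    """
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    infos = socket.getaddrinfo(fqdn, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

With this shape, a re-IP of a database host only requires a DNS change; the stored configuration keeps the same FQDN.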

Tue, Apr 16

CDanis updated the task description for T362719: Upgrade Jaeger to 1.56.0 (latest stable).
Tue, Apr 16, 9:16 PM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
CDanis removed a project from T362719: Upgrade Jaeger to 1.56.0 (latest stable): Epic.
Tue, Apr 16, 9:16 PM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
CDanis created T362719: Upgrade Jaeger to 1.56.0 (latest stable).
Tue, Apr 16, 9:16 PM · Patch-For-Review, User-fgiunchedi, Observability-Tracing
CDanis committed rOSCT2474362169a1: force enable etcd v2 proto.
force enable etcd v2 proto
Tue, Apr 16, 3:05 PM

Mon, Apr 15

CDanis committed rOSCTd2ad7ee548ae: add python 3.11.
add python 3.11
Mon, Apr 15, 10:40 PM
CDanis committed rOSCTf1dd336c7537: Fix nuisance black diffs.
Fix nuisance black diffs
Mon, Apr 15, 10:40 PM

Thu, Apr 11

CDanis updated the title for P60444 tzdump.py from untitled to tzdump.py.
Thu, Apr 11, 7:07 PM
CDanis created P60444 tzdump.py.
Thu, Apr 11, 6:54 PM

Wed, Apr 10

CDanis created P60266 (An Untitled Masterwork).
Wed, Apr 10, 4:15 PM

Mar 27 2024

CDanis closed T359413: Miniature images from og:image not loading in social media links as Resolved.
Mar 27 2024, 7:26 PM · Traffic, PageImages, Regression, WMF-General-or-Unknown
CDanis added a comment to T359413: Miniature images from og:image not loading in social media links.

This has been fixed with this patch, which I forgot to associate with this bug.

Mar 27 2024, 7:26 PM · Traffic, PageImages, Regression, WMF-General-or-Unknown

Mar 26 2024

CDanis added a member for WMF-NDA: fkaelin.
Mar 26 2024, 2:48 PM
CDanis added a member for WMF-NDA: Pablo.
Mar 26 2024, 2:48 PM

Mar 25 2024

CDanis added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

Just to make sure I understand, the request here is an easy-to-automate way for dbctl to change the instance IP address?

Mar 25 2024, 3:37 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations
CDanis added a project to T360029: Integrate dbctl IP changes as part of VLAN changes. : conftool.
Mar 25 2024, 3:34 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations

Mar 1 2024

DAlangi_WMF awarded T357050: editResponseTime's port to statslib is not actually backwards-compatible a Barnstar token.
Mar 1 2024, 5:47 PM · MediaWiki-libs-Stats, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13)

Feb 26 2024

CDanis added a comment to T357750: Phase out cergen.

Should this ticket really be "deprecate cergen"? :)

Feb 26 2024, 4:00 PM · Patch-For-Review, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
CDanis added a comment to T358455: Primary outbound port utilisation over 80% alert muted.

This would best be fixed by extending the haproxy bwlim work done in T317799 -- we've talked about having per-ASN limits in addition to the existing and partially-deployed per-file-URI limits.

Feb 26 2024, 3:45 PM · Traffic, Sustainability (Incident Followup), Infrastructure-Foundations, netops
CDanis claimed T358189: aux-k8s cluster prometheus setup is incomplete.
Feb 26 2024, 3:28 PM · Infrastructure-Foundations, Observability-Tracing

Feb 22 2024

CDanis added a comment to T358111: oauth2-proxy config changes don't cause any change in the helm Deployment.

Sent upstream as https://github.com/jaegertracing/helm-charts/pull/541

Feb 22 2024, 10:15 PM · Observability-Tracing, Patch-For-Review
CDanis closed T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow , a subtask of T320549: distributed tracing v0 [minimum viable], as Resolved.
Feb 22 2024, 9:23 PM · Epic, Observability-Tracing
CDanis closed T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow as Resolved.
Feb 22 2024, 9:23 PM · Observability-Tracing
CDanis added a comment to T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow .

As it turns out, this required a change to the upstream chart:

Feb 22 2024, 9:23 PM · Observability-Tracing
CDanis added a parent task for T358111: oauth2-proxy config changes don't cause any change in the helm Deployment: T321211: distributed tracing v1: tech debt blockers.
Feb 22 2024, 12:46 PM · Observability-Tracing, Patch-For-Review
CDanis added a subtask for T321211: distributed tracing v1: tech debt blockers: T358111: oauth2-proxy config changes don't cause any change in the helm Deployment.
Feb 22 2024, 12:46 PM · Observability-Tracing, Epic
CDanis added a subtask for T320549: distributed tracing v0 [minimum viable]: T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow .
Feb 22 2024, 12:45 PM · Epic, Observability-Tracing
CDanis added a parent task for T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow : T320549: distributed tracing v0 [minimum viable].
Feb 22 2024, 12:45 PM · Observability-Tracing

Feb 21 2024

CDanis created T358152: troubleshoot why initial pageloads of trace.wikimedia.org are so slow .
Feb 21 2024, 9:53 PM · Observability-Tracing
CDanis updated the task description for T358111: oauth2-proxy config changes don't cause any change in the helm Deployment.
Feb 21 2024, 3:01 PM · Observability-Tracing, Patch-For-Review
CDanis created T358111: oauth2-proxy config changes don't cause any change in the helm Deployment.
Feb 21 2024, 2:56 PM · Observability-Tracing, Patch-For-Review

Feb 16 2024

CDanis added a comment to T320555: cas-sso idp for jaeger-ui on k8s.

I've verified that oauth2-proxy will silently just serve plain HTTP if you specify https_address but don't provide it with TLS key material. So I think I've provided it with such in this patch?

Feb 16 2024, 7:49 PM · User-fgiunchedi, Observability-Tracing
CDanis committed rLPRI6635d0265938: Add faux secret for jaeger in idp.
Add faux secret for jaeger in idp
Feb 16 2024, 4:09 PM

Feb 9 2024

CDanis awarded T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) a Love token.
Feb 9 2024, 7:24 PM · Traffic, SRE
CDanis added a comment to T356661: Cross fleet runc upgrades.

All pods on k8s-aux-eqiad restarted, thanks @akosiaris for the script.

Feb 9 2024, 6:10 PM · serviceops

Feb 8 2024

CDanis added a subtask for T354435: 1.42.0-wmf.17 deployment blockers: T357050: editResponseTime's port to statslib is not actually backwards-compatible.
Feb 8 2024, 7:03 PM · User-brennen, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
CDanis added a parent task for T357050: editResponseTime's port to statslib is not actually backwards-compatible: T354435: 1.42.0-wmf.17 deployment blockers.
Feb 8 2024, 7:03 PM · MediaWiki-libs-Stats, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13)
CDanis triaged T357050: editResponseTime's port to statslib is not actually backwards-compatible as High priority.
Feb 8 2024, 7:03 PM · MediaWiki-libs-Stats, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13)

Feb 7 2024

CDanis added a comment to T356788: thanos-query probedown due to OOM of both eqiad titan frontends.

Per docs, Thanos supports logging when a query is received but before it begins execution:

Feb 7 2024, 4:19 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Sustainability (Incident Followup), SRE, observability

Feb 6 2024

Lens0021 awarded T276486: gerrit's sshd is incompatible with RSA pubkeys + Fedora 33 clients (and future versions of OpenSSH proper) a Party Time token.
Feb 6 2024, 9:49 AM · Gerrit (Gerrit 3.6), Upstream

Feb 5 2024

CDanis created P56252 (An Untitled Masterwork).
Feb 5 2024, 5:43 PM

Jan 29 2024

CDanis claimed T332024: GeoIP mapping experiments.
Jan 29 2024, 4:20 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic
CDanis claimed T342624: NetworkProbeLimit cookie should set samesite attribute.
Jan 29 2024, 4:19 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic
CDanis triaged T349807: NEL: don't alert on domains we don't control as Medium priority.
Jan 29 2024, 3:41 PM · SRE, Infrastructure-Foundations, Traffic
CDanis updated subscribers of T329331: create a puppetized abstraction for haproxy blocklist hysteresis.

@Fabfur just wanted to make sure you've seen this task; it is decent documentation of the existing mechanism and probably helpful for doing T353910.

Jan 29 2024, 3:40 PM · SRE, Traffic

Jan 24 2024

CDanis closed T266783: move tunnelencabulator's repo to a Wikimedia-owned space as Resolved.

The script was added to the wmf-sre-laptop package in May 2023 with this commit

Jan 24 2024, 5:26 PM · Infrastructure-Foundations