
CDanis (Chris Danis)
SRE @ WMF

Projects (13)

User Details

User Since
Nov 5 2018, 2:54 PM (213 w, 14 h)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF)

Recent Activity

Yesterday

CDanis added a comment to T324466: VictorOps 'escalator' did not work on 2022-12-03.

Both the processes on alert1001 and alert2001 have been stuck for a while.

Mon, Dec 5, 4:51 PM · Patch-For-Review, Observability-Alerting
CDanis created T324466: VictorOps 'escalator' did not work on 2022-12-03.
Mon, Dec 5, 2:54 PM · Patch-For-Review, Observability-Alerting

Fri, Nov 18

fgiunchedi awarded T321120: turn up 'aux' k8s cluster for o11y and other "ancillary"/"supportive" services a Like token.
Fri, Nov 18, 7:26 AM · Patch-For-Review, Observability-Tracing

Thu, Nov 17

CDanis added a comment to T299640: RIPE Atlas exporter improvements.

Would be great to get the struct in place in Puppet for ripeatlas_measurements.

Thu, Nov 17, 3:53 PM · observability, Infrastructure-Foundations

Wed, Nov 16

CDanis added a comment to T320561: Trace header propagation for service-template-node and all service-runner services.

service-template-node patch merged: https://github.com/wikimedia/service-template-node/commit/c4dc28c699190dec5f95725e454695306f80cabc

Wed, Nov 16, 4:00 PM · service-runner, Observability-Tracing

Wed, Nov 9

CDanis committed rOHMP5fb984c6c062: Add block80 (authored by CDanis).
Add block80
Wed, Nov 9, 11:05 PM

Nov 4 2022

CDanis renamed T322424: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 from Repeated swift (cascading?) failures, late 2022 to Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022.
Nov 4 2022, 4:29 PM · SRE, SRE-swift-storage
CDanis merged task T322417: FileBackendError: Iterator page I/O error. into T322424: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022.
Nov 4 2022, 4:17 PM · Patch-For-Review, Wikimedia-production-error
CDanis merged T322417: FileBackendError: Iterator page I/O error. into T322424: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022.
Nov 4 2022, 4:16 PM · SRE, SRE-swift-storage
CDanis created T322424: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022.
Nov 4 2022, 4:16 PM · SRE, SRE-swift-storage

Oct 27 2022

CDanis updated the task description for T321775: Upgrade HAProxy on cp nodes to 2.6.x LTS.
Oct 27 2022, 1:21 PM · Patch-For-Review, SRE, Traffic

Oct 24 2022

CDanis awarded T321134: eqiad: (3) VMs requested for aux-k8s-etcd a Love token.
Oct 24 2022, 1:52 PM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE

Oct 20 2022

akosiaris awarded T321120: turn up 'aux' k8s cluster for o11y and other "ancillary"/"supportive" services a Love token.
Oct 20 2022, 2:44 PM · Patch-For-Review, Observability-Tracing

Oct 19 2022

CDanis created T321212: Build our own OpenTelemetry Collector distribution.
Oct 19 2022, 3:30 PM · Observability-Tracing
CDanis created T321211: distributed tracing v1: tech debt blockers.
Oct 19 2022, 3:27 PM · Observability-Tracing, Epic
CDanis added a project to T320561: Trace header propagation for service-template-node and all service-runner services: service-runner.
Oct 19 2022, 1:27 PM · service-runner, Observability-Tracing

Oct 18 2022

CDanis updated subscribers of T321120: turn up 'aux' k8s cluster for o11y and other "ancillary"/"supportive" services.

@jhathaway have you had the opportunity to work with our Ganeti installation yet? if not please take a look at the instructions and start turning up some nodes :) You can file the provisioning tickets as sub-tasks of this one

Oct 18 2022, 7:15 PM · Patch-For-Review, Observability-Tracing
CDanis added a subtask for T320549: distributed tracing v0 [minimum viable]: T321120: turn up 'aux' k8s cluster for o11y and other "ancillary"/"supportive" services.
Oct 18 2022, 7:14 PM · Epic, Observability-Tracing
CDanis added a parent task for T321120: turn up 'aux' k8s cluster for o11y and other "ancillary"/"supportive" services: T320549: distributed tracing v0 [minimum viable].
Oct 18 2022, 7:14 PM · Patch-For-Review, Observability-Tracing
CDanis created T321120: turn up 'aux' k8s cluster for o11y and other "ancillary"/"supportive" services.
Oct 18 2022, 7:14 PM · Patch-For-Review, Observability-Tracing
CDanis removed a project from T320551: Package OpenTelemetry Collector as a .deb: Epic.
Oct 18 2022, 3:21 PM · serviceops, Observability-Tracing
CDanis removed a project from T320552: Package OpenTelemetry Collector atop our own base Docker images: Epic.
Oct 18 2022, 3:21 PM · Patch-For-Review, serviceops, Observability-Tracing
CDanis removed a project from T320553: Re-package Jaeger components atop our own base Docker images: Epic.
Oct 18 2022, 3:21 PM · Observability-Tracing
CDanis removed a project from T320555: cas-sso idp for jaeger-ui on k8s: Epic.
Oct 18 2022, 3:21 PM · Observability-Tracing
CDanis removed a project from T320554: Deploy and run Jaeger on our k8s clusters: Epic.
Oct 18 2022, 3:21 PM · Observability-Tracing
CDanis removed a project from T320556: Micro-specification for how service owners should propagate tracing headers: Epic.
Oct 18 2022, 3:21 PM · Observability-Tracing
CDanis removed a project from T320561: Trace header propagation for service-template-node and all service-runner services: Epic.
Oct 18 2022, 3:21 PM · service-runner, Observability-Tracing
CDanis removed a project from T320559: Trace header propagation for Mediawiki: Epic.
Oct 18 2022, 3:21 PM · Observability-Tracing
CDanis removed a project from T320562: OpenSearch index provisioned for Jaeger: Epic.
Oct 18 2022, 3:21 PM · Observability-Tracing
CDanis removed a project from T320563: our various Envoys are configured to report traces to local OpenTelemetry Collector: Epic.
Oct 18 2022, 3:21 PM · Observability-Tracing
CDanis removed a project from T320564: OpenTelemetry Collector running as a DaemonSet on Wikikube: Epic.
Oct 18 2022, 3:21 PM · serviceops, Observability-Tracing
CDanis removed a project from T320565: OpenTelemetry Collector puppetized and able to be deployed easily to arbitrary roles: Epic.
Oct 18 2022, 3:21 PM · serviceops, Observability-Tracing

Oct 13 2022

CDanis added a comment to T320551: Package OpenTelemetry Collector as a .deb.

Thanks Clement!

Oct 13 2022, 4:05 PM · serviceops, Observability-Tracing

Oct 11 2022

CDanis added a comment to T320561: Trace header propagation for service-template-node and all service-runner services.

It would be good to get this done before there's much further progress on T308371: Migrate node-based services in production to node16

Oct 11 2022, 8:53 PM · service-runner, Observability-Tracing
CDanis created T320565: OpenTelemetry Collector puppetized and able to be deployed easily to arbitrary roles.
Oct 11 2022, 7:09 PM · serviceops, Observability-Tracing
CDanis created T320564: OpenTelemetry Collector running as a DaemonSet on Wikikube.
Oct 11 2022, 7:08 PM · serviceops, Observability-Tracing
CDanis created T320563: our various Envoys are configured to report traces to local OpenTelemetry Collector.
Oct 11 2022, 7:07 PM · Observability-Tracing
CDanis created T320562: OpenSearch index provisioned for Jaeger.
Oct 11 2022, 7:06 PM · Observability-Tracing
CDanis created T320561: Trace header propagation for service-template-node and all service-runner services.
Oct 11 2022, 7:05 PM · service-runner, Observability-Tracing
CDanis created T320559: Trace header propagation for Mediawiki.
Oct 11 2022, 7:01 PM · Observability-Tracing
CDanis created T320556: Micro-specification for how service owners should propagate tracing headers.
Oct 11 2022, 6:53 PM · Observability-Tracing
CDanis created T320555: cas-sso idp for jaeger-ui on k8s.
Oct 11 2022, 6:51 PM · Observability-Tracing
CDanis added a parent task for T320554: Deploy and run Jaeger on our k8s clusters: T320553: Re-package Jaeger components atop our own base Docker images.
Oct 11 2022, 6:50 PM · Observability-Tracing
CDanis added a subtask for T320553: Re-package Jaeger components atop our own base Docker images: T320554: Deploy and run Jaeger on our k8s clusters.
Oct 11 2022, 6:50 PM · Observability-Tracing
CDanis created T320554: Deploy and run Jaeger on our k8s clusters.
Oct 11 2022, 6:50 PM · Observability-Tracing
CDanis created T320553: Re-package Jaeger components atop our own base Docker images.
Oct 11 2022, 6:48 PM · Observability-Tracing
CDanis created T320552: Package OpenTelemetry Collector atop our own base Docker images.
Oct 11 2022, 6:46 PM · Patch-For-Review, serviceops, Observability-Tracing
CDanis created T320551: Package OpenTelemetry Collector as a .deb.
Oct 11 2022, 6:45 PM · serviceops, Observability-Tracing
CDanis created T320549: distributed tracing v0 [minimum viable].
Oct 11 2022, 6:44 PM · Epic, Observability-Tracing
CDanis added a comment to T304373: Also intake Network Error Logging events into the Analytics Data Lake.

Happy quarterly planning season; I was wondering if there were any updated estimates on when this might happen?

Oct 11 2022, 4:21 PM · Infrastructure-Foundations, Shared-Data-Infrastructure, Data Pipelines, Data-Engineering-Planning

Oct 6 2022

CDanis updated the name of F35546836: hotlinks vs organic traffic surges, per-file cache_upload rps.pdf from "hotlinks vs organic traffic surges, per-file cache_upload rps" to "hotlinks vs organic traffic surges, per-file cache_upload rps.pdf".
Oct 6 2022, 4:57 PM

Oct 4 2022

CDanis created T319344: Add a rolled-up cache_status field to druid webrequest_sampled_128.
Oct 4 2022, 8:14 PM · Data Pipelines, Traffic, SRE, Data-Engineering-Planning
CDanis updated the name of F35546836: hotlinks vs organic traffic surges, per-file cache_upload rps.pdf from "hotlinks2 (1).pdf" to "hotlinks vs organic traffic surges, per-file cache_upload rps".
Oct 4 2022, 7:16 PM
CDanis added a comment to T317799: Rate limiting for hotlinked images.

Here's my jupyter notebook with a rough analysis of a very impactful hotlink incident (on 2022-09-13) and our biggest organic traffic surge to date (Queen Elizabeth's passing on 2022-09-08):

Oct 4 2022, 7:15 PM · Infrastructure-Foundations, Traffic, Patch-For-Review, Sustainability (Incident Followup), SRE
CDanis updated the task description for T319324: Consider adding X-Analytics subfield for 'has a session cookie'.
Oct 4 2022, 5:19 PM · Analytics-Radar, SRE, Traffic
CDanis created T319324: Consider adding X-Analytics subfield for 'has a session cookie'.
Oct 4 2022, 5:19 PM · Analytics-Radar, SRE, Traffic

Sep 30 2022

CDanis added a project to T315403: Framework for running experiments on a subset of the app server fleet: serviceops-collab.

Just pinging this task as OKR season is upon us and this might be a useful and fun thing to sneak in

Sep 30 2022, 1:06 PM · serviceops, Performance-Team (Radar), SRE, Observability-Logging, Observability-Metrics

Sep 28 2022

CDanis created T318804: ncredir redirects for status.wiki* --> status.wikimedia.org.
Sep 28 2022, 12:07 PM · Traffic, SRE-OnFire (FY2021/2022-Q4), SRE

Sep 22 2022

CDanis edited P34897 (An Untitled Masterwork).
Sep 22 2022, 8:41 AM
CDanis created P34897 (An Untitled Masterwork).
Sep 22 2022, 8:40 AM
CDanis created P34896 (An Untitled Masterwork).
Sep 22 2022, 8:39 AM
CDanis created P34895 (An Untitled Masterwork).
Sep 22 2022, 8:34 AM
CDanis updated the language for P34894 (An Untitled Masterwork) from autodetect to shell.
Sep 22 2022, 8:29 AM
CDanis created P34894 (An Untitled Masterwork).
Sep 22 2022, 8:29 AM

Sep 16 2022

CDanis added a project to T317794: requestctl can't act on cache hits: Traffic.
Sep 16 2022, 12:39 AM · Patch-For-Review, Traffic, Sustainability (Incident Followup), SRE, conftool
CDanis added a comment to T317794: requestctl can't act on cache hits.

Going to be bold and append to the task description with what we discussed in the cachebust WG meeting today (so that anyone can update it and tick boxes as we go).

Sep 16 2022, 12:22 AM · Patch-For-Review, Traffic, Sustainability (Incident Followup), SRE, conftool

Sep 12 2022

CDanis updated subscribers of T303725: Extend NEL headers to sites not fronted by CDN.

As a note, such sites also include "everything on WMCS / toolserver" and it would probably be good to extend NEL to that as well.

Sep 12 2022, 4:04 PM · Infrastructure-Foundations, SRE

Sep 8 2022

CDanis triaged T317001: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links as High priority.
Sep 8 2022, 2:05 PM · Traffic, Data-Services, SRE

Sep 7 2022

CDanis created T317240: Improve AlertManager alert titles as sent to VictorOps.
Sep 7 2022, 8:39 PM · SRE Observability (FY2022/2023-Q2), SRE-OnFire, Observability-Alerting
CDanis committed rOSCTab1f32add3af: Remove buggy comma filter footer (authored by CDanis).
Remove buggy comma filter footer
Sep 7 2022, 7:31 PM
CDanis closed T314578: Add the requestctl element of the x-analytics map to turnlio's webrequest_sampled_128 as Resolved.
Sep 7 2022, 7:11 PM · Data-Engineering-Planning
CDanis closed T314578: Add the requestctl element of the x-analytics map to turnlio's webrequest_sampled_128, a subtask of T305582: Annotate X-Analytics header with any matching actions, as Resolved.
Sep 7 2022, 7:11 PM · Patch-For-Review, SRE, conftool
CDanis added a comment to T314578: Add the requestctl element of the x-analytics map to turnlio's webrequest_sampled_128.

This was deployed yesterday, daily job restarted as of Aug 1st.

(For future reference: unfortunately, Hive behaves like MySQL in that it doesn't let you use aliases in GROUP BY. If you select something AS s, you can group by something, but you can't group by s. I always thought this was weird, especially since in college I wrote a DBMS with a query parser in a few days that did not have this limitation...)

Sep 7 2022, 2:20 PM · Data-Engineering-Planning
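
The workaround implied by the comment above is to repeat the full expression in GROUP BY rather than refer to its alias. Here is a minimal sketch of that form. It uses Python's built-in sqlite3 module only so that it runs anywhere; SQLite is a stand-in, not Hive, and the table, columns, and rows are hypothetical. SQLite itself also accepts aliases in GROUP BY, so this shows the portable spelling and does not reproduce the Hive error.

```python
import sqlite3

# Hypothetical sample rows standing in for webrequest-style data.
rows = [
    ("/wiki/A", 200),
    ("/wiki/B", 200),
    ("/wiki/C", 404),
    ("/wiki/D", 503),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (uri TEXT, http_status INTEGER)")
conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)

# The derived column is named status_class in the SELECT list, but GROUP BY
# and ORDER BY repeat the full expression instead of referring to the alias.
query = """
    SELECT http_status / 100 AS status_class, COUNT(*) AS hits
    FROM requests
    GROUP BY http_status / 100
    ORDER BY http_status / 100
"""
counts = conn.execute(query).fetchall()
print(counts)
```

Repeating the expression is more verbose, but it does not depend on how a given engine resolves SELECT-list aliases.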

Sep 6 2022

CDanis created T317159: klaxon CLI tool for seeding an oncall handoff.
Sep 6 2022, 8:23 PM · Patch-For-Review, SRE-OnFire, SRE

Sep 4 2022

CDanis created T317001: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links.
Sep 4 2022, 7:15 PM · Traffic, Data-Services, SRE

Sep 2 2022

CDanis closed T316482: Update wgLBFactoryConf for x2 to register only the local primary as Resolved.
Sep 2 2022, 12:10 PM · Patch-For-Review, Performance-Team (Radar), DBA, serviceops
CDanis closed T316482: Update wgLBFactoryConf for x2 to register only the local primary, a subtask of T312809: Avoid x2-mainstash replica connections (ChronologyProtector), as Resolved.
Sep 2 2022, 12:10 PM · Patch-For-Review, Performance-Team, MediaWiki-libs-ObjectCache

Sep 1 2022

CDanis added a comment to T316482: Update wgLBFactoryConf for x2 to register only the local primary.

This is ready, was tested by hand on cumin2002, and is now deployed to both cumin hosts.

Sep 1 2022, 12:14 PM · Patch-For-Review, Performance-Team (Radar), DBA, serviceops
CDanis committed rOSCTbce73dd7c9a8: debian/changelog (authored by CDanis).
debian/changelog
Sep 1 2022, 11:07 AM
CDanis committed rOSCT421fdedf3cb5: bump version (authored by CDanis).
bump version
Sep 1 2022, 11:04 AM
CDanis committed rOSCTe73df76e3a41: dbctl: Add omit_replicas_in_mwconfig section attribute (authored by CDanis).
dbctl: Add omit_replicas_in_mwconfig section attribute
Sep 1 2022, 10:51 AM

Aug 31 2022

CDanis added a comment to T316482: Update wgLBFactoryConf for x2 to register only the local primary.

@Marostegui I actually implemented this not as a new flavor, but instead as a boolean attribute omit_replicas_in_mwconfig on the section object. Once the patch is merged and deployed I'll let you know.

Aug 31 2022, 8:25 PM · Patch-For-Review, Performance-Team (Radar), DBA, serviceops
CDanis added a comment to T314578: Add the requestctl element of the x-analytics map to turnlio's webrequest_sampled_128.

I've written a patch, which is hopefully correct.

Aug 31 2022, 7:53 PM · Data-Engineering-Planning
CDanis committed rOSCTaf3159c24f99: dbctl: python 3.10 & x2 section (authored by CDanis).
dbctl: python 3.10 & x2 section
Aug 31 2022, 5:28 PM

Aug 29 2022

CDanis added a comment to T316482: Update wgLBFactoryConf for x2 to register only the local primary.

omit-replicas looks good to me, so we can re-use it somewhere else if needed (ideally I would like to handle parsercache with dbctl in the future, as they are the only ones still handled with mediawiki-config)

Aug 29 2022, 12:07 PM · Patch-For-Review, Performance-Team (Radar), DBA, serviceops
CDanis added a comment to T316482: Update wgLBFactoryConf for x2 to register only the local primary.

@Marostegui That looks correct to me.

Aug 29 2022, 12:02 PM · Patch-For-Review, Performance-Team (Radar), DBA, serviceops

Aug 24 2022

CDanis updated the task description for T316160: improve GeoDNS-to-edge mapping.
Aug 24 2022, 8:59 PM · SRE, Infrastructure-Foundations, netops, Traffic
CDanis triaged T316160: improve GeoDNS-to-edge mapping as Low priority.
Aug 24 2022, 8:56 PM · SRE, Infrastructure-Foundations, netops, Traffic
CDanis created T316160: improve GeoDNS-to-edge mapping.
Aug 24 2022, 8:55 PM · SRE, Infrastructure-Foundations, netops, Traffic
CDanis updated subscribers of T314578: Add the requestctl element of the x-analytics map to turnlio's webrequest_sampled_128.

also cc @EChetty

Aug 24 2022, 8:30 PM · Data-Engineering-Planning

Aug 23 2022

CDanis updated subscribers of T314578: Add the requestctl element of the x-analytics map to turnlio's webrequest_sampled_128.

ping @JAllemandou -- did I put this on the right phab tag? It'd be really awesome to have and I suspect is a pretty easy change

Aug 23 2022, 7:53 PM · Data-Engineering-Planning

Aug 19 2022

CDanis added a comment to T314972: LibreNMS seemingly not collecting data for many ports after migration to netmon1003.

Looks like this is resolved...?

Aug 19 2022, 5:44 PM · Patch-For-Review, SRE, netops, SRE Observability (FY2022/2023-Q1), Observability-Metrics, Infrastructure-Foundations

Aug 17 2022

CDanis updated subscribers of T315403: Framework for running experiments on a subset of the app server fleet.

Summary of a conversation that ori, joe, and I had on IRC today:

  • You get some of this "for free" once the appservers are on k8s -- you can add labels to your pods that will be automatically propagated to logstash/prometheus
  • However, it would be valuable to have a framework like this beyond just the appservers or k8s services
    • For instance, Traffic has done a lot of that kind of experimentation on cp nodes with ad-hoc mechanisms in the past, same for some other teams
  • Any Puppet+Prometheus plumbing should be reusable, at least
    • In Prometheus, you can have those same Puppet facts exported by node-exporter after having Puppet generate a textfile for it, and then join metrics together at query time
  • Logstash might be more difficult than Prometheus (although I don't know for sure, maybe there's an easy mechanism with a filter script)
    • Perhaps those tags could be injected via rsyslog (as configured via puppet)?
Aug 17 2022, 11:52 PM · serviceops, Performance-Team (Radar), SRE, Observability-Logging, Observability-Metrics
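
The textfile idea in the summary above could look roughly like the sketch below. It assumes that something (Puppet, in the discussion) writes a `.prom` file into the directory read by node-exporter's textfile collector. The metric name `experiment_info`, its labels, and the file name are hypothetical and not taken from the task.

```python
import os
import tempfile


def render_info_metric(name, labels):
    """Render one info-style gauge (value 1) in the Prometheus text format."""
    def escape(value):
        # Escape backslash, double quote and newline inside label values.
        return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_str = ",".join(
        f'{key}="{escape(str(val))}"' for key, val in sorted(labels.items())
    )
    return (
        f"# HELP {name} Experiment membership of this host (hypothetical metric).\n"
        f"# TYPE {name} gauge\n"
        f"{name}{{{label_str}}} 1\n"
    )


def write_textfile(directory, filename, content):
    """Write via a temp file and rename, so a reader never sees a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, os.path.join(directory, filename))


# Demo against a throwaway directory instead of a real collector directory.
text = render_info_metric("experiment_info", {"experiment": "demo", "arm": "treatment"})
with tempfile.TemporaryDirectory() as collector_dir:
    write_textfile(collector_dir, "experiment.prom", text)
    with open(os.path.join(collector_dir, "experiment.prom")) as handle:
        written = handle.read()
print(written)

# A query-time join could then look like this (some_metric_total is hypothetical):
#   rate(some_metric_total[5m])
#     * on(instance) group_left(experiment, arm) experiment_info
```

Exporting the labels on a constant-valued metric keeps them off every other series; they are attached only where a query asks for them.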

Aug 11 2022

CDanis renamed T314972: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 from LibreNMS seemingly not scraping many devices after migration to netmon1003 to LibreNMS seemingly not collecting data for many ports after migration to netmon1003.
Aug 11 2022, 3:14 AM · Patch-For-Review, SRE, netops, SRE Observability (FY2022/2023-Q1), Observability-Metrics, Infrastructure-Foundations
CDanis triaged T314972: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 as High priority.
Aug 11 2022, 3:13 AM · Patch-For-Review, SRE, netops, SRE Observability (FY2022/2023-Q1), Observability-Metrics, Infrastructure-Foundations
CDanis created T314972: LibreNMS seemingly not collecting data for many ports after migration to netmon1003.
Aug 11 2022, 3:13 AM · Patch-For-Review, SRE, netops, SRE Observability (FY2022/2023-Q1), Observability-Metrics, Infrastructure-Foundations

Aug 9 2022

Ladsgroup awarded T303725: Extend NEL headers to sites not fronted by CDN a Like token.
Aug 9 2022, 7:45 PM · Infrastructure-Foundations, SRE
CDanis added a comment to T313603: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers.

As a followup to this past weekend's misconfiguration that delayed paging, victorops.py now has a check_esc_policy_config subcommand.

Aug 9 2022, 12:53 PM · Patch-For-Review, User-fgiunchedi, SRE Observability (FY2022/2023-Q1), observability, SRE-OnFire

Aug 8 2022

herron awarded T313603: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers a Love token.
Aug 8 2022, 3:09 PM · Patch-For-Review, User-fgiunchedi, SRE Observability (FY2022/2023-Q1), observability, SRE-OnFire

Aug 4 2022

CDanis added a subtask for T305582: Annotate X-Analytics header with any matching actions: T314578: Add the requestctl element of the x-analytics map to turnlio's webrequest_sampled_128.
Aug 4 2022, 2:49 PM · Patch-For-Review, SRE, conftool