colewhite (cwhite)
User

User Details

User Since
Aug 21 2018, 6:05 PM (83 w, 4 d)
Availability
Available
LDAP User
Cwhite
MediaWiki User
Unknown

Recent Activity

Mon, Mar 23

colewhite added a comment to T240685: MediaWiki Prometheus support.

DogStatsD shows some promise here. It's a StatsD extension that statsd_exporter supports and that enables dynamic labels. In testing, the statsd proxy didn't support the extension, but translation is trivial if necessary.

Mon, Mar 23, 11:44 PM · serviceops, Operations, MediaWiki-General, observability
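
The translation mentioned in the comment above could look roughly like the following. This is a minimal sketch, not the approach taken on the task: it assumes the DogStatsD datagram layout (`name:value|type|@rate|#tag:value,...`) and folds the tag values into the dotted metric name so that a consumer without tag support can still accept the line. The function name, the example metric names, and the folding scheme are all hypothetical.

```python
def dogstatsd_to_statsd(line: str) -> str:
    """Rewrite one DogStatsD line as plain StatsD (hypothetical helper).

    Tags are removed from the datagram and their values are appended to
    the metric name, sorted by tag text so the output is deterministic.
    """
    name_value, *sections = line.strip().split("|")
    name, _, value = name_value.partition(":")

    tags = []
    kept = []  # type, sample rate, and any other non-tag sections
    for section in sections:
        if section.startswith("#"):
            tags = [t for t in section[1:].split(",") if t]
        else:
            kept.append(section)

    suffix = []
    for tag in sorted(tags):
        key, _, tag_value = tag.partition(":")
        # A bare tag such as "canary" has no value, so fall back to the key.
        suffix.append(tag_value if tag_value else key)

    full_name = ".".join([name, *suffix])
    return "|".join([f"{full_name}:{value}", *kept])


print(dogstatsd_to_statsd("requests.total:1|c|#method:GET,status:200"))
```

Tag values that themselves contain dots or colons would need escaping before being folded into the name; the sketch leaves that out.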

Fri, Mar 20

colewhite closed T248131: Prometheus jobs reduced availability alerts for Icinga exporter as Resolved.
Fri, Mar 20, 9:01 PM · observability
colewhite added a comment to T248131: Prometheus jobs reduced availability alerts for Icinga exporter.

Adjusting the timeout resolved the issue. The graphs are clean now.

Fri, Mar 20, 9:01 PM · observability

Thu, Mar 19

colewhite triaged T248131: Prometheus jobs reduced availability alerts for Icinga exporter as High priority.
Thu, Mar 19, 9:10 PM · observability
colewhite created T248131: Prometheus jobs reduced availability alerts for Icinga exporter.
Thu, Mar 19, 9:10 PM · observability
colewhite added a comment to T247820: Decide on `service-runner` aggregated prometheus metrics and use of `service` label.

And of course we can always just add a new one if we feel like it. I would propose in fact we go down that way as we fully control it. Any naming preferences?

Thu, Mar 19, 12:04 AM · Performance-Team (Radar), observability, Operations

Tue, Mar 17

colewhite triaged T247820: Decide on `service-runner` aggregated prometheus metrics and use of `service` label as Medium priority.
Tue, Mar 17, 9:45 PM · Performance-Team (Radar), observability, Operations
colewhite added a comment to T247820: Decide on `service-runner` aggregated prometheus metrics and use of `service` label.

Good idea forking the original task. Thanks for that!

Tue, Mar 17, 9:44 PM · Performance-Team (Radar), observability, Operations

Mon, Mar 16

colewhite added a comment to T246998: Enable SSO for Kibana.

That CSP works well. I think cas needs to respond with an appropriate Access-Control-Allow-Origin. https://apereo.github.io/cas/5.2.x/installation/Configuration-Properties.html#http-web-requests

Mon, Mar 16, 11:05 PM · Operations
colewhite added a comment to T238658: Migrate EventStreams to k8s deployment pipeline.

Sure it might very well be. I am fine with dropping it from statsd-exporter/service-runner itself as long as we expose it in some other way (e.g. a kubernetes label) so that we don't end up breaking all grafana dashboards.

Mon, Mar 16, 7:17 PM · Analytics-Kanban, Analytics, Patch-For-Review, Release-Engineering-Team (Pipeline), Services (watching), Release Pipeline

Fri, Mar 13

colewhite added a comment to T246860: some Prometheis not scraping the full set of targets.

Found a related issue.

Fri, Mar 13, 7:38 PM · Patch-For-Review, Traffic, observability, Operations
colewhite claimed T246860: some Prometheis not scraping the full set of targets.
Fri, Mar 13, 7:22 PM · Patch-For-Review, Traffic, observability, Operations
colewhite added a comment to T246860: some Prometheis not scraping the full set of targets.

It appears a reload does resolve the issue, but it takes some time for Prometheus to fetch and store an update. I used kill -HUP <PID> to reload.

Fri, Mar 13, 12:30 AM · Patch-For-Review, Traffic, observability, Operations

Thu, Mar 12

colewhite added a comment to T246998: Enable SSO for Kibana.

It looks like most of the issues stem from CSP blocking mixed-content. idp.wikimedia.org is redirecting to http per this changeset.

Thu, Mar 12, 10:47 PM · Operations

Fri, Mar 6

colewhite added a comment to T238658: Migrate EventStreams to k8s deployment pipeline.

I created a PR to service-runner for the updates to heapwatch metrics. Thanks for the feedback!

Fri, Mar 6, 6:00 PM · Analytics-Kanban, Analytics, Patch-For-Review, Release-Engineering-Team (Pipeline), Services (watching), Release Pipeline
colewhite added a comment to T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726".

... with 2M docs indexed it looks like the change might only be from ~800 bytes/doc to ~950 bytes/doc.

Fri, Mar 6, 2:49 AM · Patch-For-Review, Operations, Wikimedia-Logstash
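
For scale, the per-document figures quoted above can be worked through as follows. This is a back-of-the-envelope sketch that takes the approximate ~800 and ~950 bytes/doc values and the 2M-document sample at face value; the variable names are illustrative only.

```python
docs_indexed = 2_000_000      # sample size quoted in the comment
bytes_per_doc_before = 800    # approximate, from the comment
bytes_per_doc_after = 950     # approximate, from the comment

extra_bytes_per_doc = bytes_per_doc_after - bytes_per_doc_before
relative_increase = extra_bytes_per_doc / bytes_per_doc_before
extra_total_mb = docs_indexed * extra_bytes_per_doc / 1_000_000

print(f"extra per doc: {extra_bytes_per_doc} bytes")
print(f"relative increase: {relative_increase:.2%}")
print(f"extra for the sample: {extra_total_mb:.0f} MB")
```

On these numbers the measured overhead is under a fifth of the original size, well short of the doubling feared in the earlier comment on the same task.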

Thu, Mar 5

colewhite added a comment to T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726".

I have concerns about re-implementing the _all field given that it is no longer "free." This means if we use copy_to, each log will take twice the disk space, with a corresponding increase in indexing cost. With stack traces and request/response logs including response bodies, I can see this adding up quickly (unless we omit these from the new _all field).

Thu, Mar 5, 10:01 PM · Patch-For-Review, Operations, Wikimedia-Logstash
colewhite added a comment to T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726".

It looks like the issue has been run into before in the Beats family of software. There is a template setting that allows us to define an array of fields that are default query fields:

Thu, Mar 5, 8:55 PM · Patch-For-Review, Operations, Wikimedia-Logstash

Mon, Mar 2

colewhite updated the task description for T205870: Fully migrate producers off statsd.
Mon, Mar 2, 10:14 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations

Feb 28 2020

colewhite added a comment to T240685: MediaWiki Prometheus support.

In response to @Joe's concerns:

Feb 28 2020, 4:38 PM · serviceops, Operations, MediaWiki-General, observability
colewhite added a comment to T240685: MediaWiki Prometheus support.

As I think about it more, the problem is that the wire format is wholly incompatible with the Prometheus format. To make it work, StatsD requires a lot of configuration to convert adequately, and managing that configuration will be burdensome.

Feb 28 2020, 3:55 PM · serviceops, Operations, MediaWiki-General, observability

Feb 27 2020

colewhite added a comment to T240685: MediaWiki Prometheus support.

One alternative is to adopt a sidecar in the form of statsd_exporter and have it do the heavy lifting of translating MediaWiki and MW Extension metrics into a Prometheus-compatible format. I see two major pain points with this solution. The first is settling on a pattern for mapping StatsD metrics to Prometheus metrics, and the second is managing change over time.

Feb 27 2020, 9:03 PM · serviceops, Operations, MediaWiki-General, observability
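
To make the mapping problem described above concrete, here is a simplified sketch of glob-style mapping in the spirit of statsd_exporter's mapping rules, where each `*` in a pattern matches one dot-separated component and is referenced as `$1`, `$2`, and so on. It is an illustration only, not statsd_exporter's implementation; the function name and the MediaWiki metric names are hypothetical.

```python
import re


def map_metric(pattern, name_template, label_templates, statsd_name):
    """Map a dotted StatsD name to (prometheus_name, labels), or None.

    Each '*' in `pattern` captures exactly one dot-separated component.
    """
    regex = "^" + r"\.".join(
        "([^.]+)" if part == "*" else re.escape(part)
        for part in pattern.split(".")
    ) + "$"
    match = re.match(regex, statsd_name)
    if match is None:
        return None

    def expand(template):
        # Replace $1, $2, ... with the corresponding captured component.
        return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), template)

    labels = {key: expand(value) for key, value in label_templates.items()}
    return expand(name_template), labels


print(map_metric(
    "MediaWiki.*.hits",
    "mediawiki_hits_total",
    {"component": "$1"},
    "MediaWiki.api.hits",
))
```

Every distinct metric shape needs a rule like this, which is where the two pain points come from: agreeing on the naming pattern, and keeping the rules in step as producers add or rename metrics.
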
colewhite added a comment to T240685: MediaWiki Prometheus support.

Per @fgiunchedi's recommendation, I put together a very basic mockup of how DirectFileStore might look in prometheus_client_php.

Feb 27 2020, 1:01 AM · serviceops, Operations, MediaWiki-General, observability

Feb 13 2020

colewhite added a comment to T233448: Review prometheus ORES rules for completeness.

I went ahead and updated this dashboard and added the Prometheus version next to the Graphite version as an example. During the process, I amended a couple metrics that were missed or misconfigured.

Feb 13 2020, 8:27 PM · Patch-For-Review, ORES, Scoring-platform-team
colewhite added a comment to T233448: Review prometheus ORES rules for completeness.

I see the value in a refactor/cleanup if what is currently being captured is not everything we need to (at least) recreate the current dashboards.

Feb 13 2020, 1:45 AM · Patch-For-Review, ORES, Scoring-platform-team

Feb 3 2020

colewhite added a comment to T225604: log spam from mtail 3.0.0~rc19 on wezen.

@MoritzMuehlenhoff doing that shouldn't hurt anything AFAIK.

Feb 3 2020, 5:33 PM · Operations, Patch-For-Review, observability

Jan 24 2020

colewhite added a parent task for T225604: log spam from mtail 3.0.0~rc19 on wezen: T243591: varnishmtail panics on buster.
Jan 24 2020, 3:55 PM · Operations, Patch-For-Review, observability
colewhite added a subtask for T243591: varnishmtail panics on buster: T225604: log spam from mtail 3.0.0~rc19 on wezen.
Jan 24 2020, 3:55 PM · Operations, Traffic
colewhite added a comment to T225604: log spam from mtail 3.0.0~rc19 on wezen.

Today we had the same error in varnishmtail on a new buster host (cp4032).

Jan 24 2020, 12:56 AM · Operations, Patch-For-Review, observability

Jan 18 2020

colewhite added a comment to T239833: StatsD Exporter drops relayed metrics.

The latest patch appears to help a lot. There is still a discrepancy that I haven't been able to track down.

$ touch forwarded_new.txt && socat -t 0 FILE:forwarded_new.txt udp-listen:9125,fork
$ ./statsd_exporter_gerrit_554544 --statsd.mapping-config=statsd_exporter.conf --statsd.listen-udp=:8125 --statsd.relay-address=127.0.0.1:9125
$ ./udpreplay --pps 2000 --host localhost --port 8125 ores1001.pcap
Jan 18 2020, 12:14 AM · Patch-For-Review, observability, Operations
colewhite updated the task description for T239833: StatsD Exporter drops relayed metrics.
Jan 18 2020, 12:11 AM · Patch-For-Review, observability, Operations

Jan 17 2020

colewhite renamed T239833: StatsD Exporter drops relayed metrics from StatsD Exporter does not relay dropped metrics to StatsD Exporter drops relayed metrics.
Jan 17 2020, 11:43 PM · Patch-For-Review, observability, Operations

Dec 20 2019

colewhite closed T240917: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick as Resolved.
Dec 20 2019, 11:53 PM · Operations, SRE-Access-Requests
colewhite closed T240917: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick, a subtask of T240739: Onboarding Checklist for Shay Nowick, as Resolved.
Dec 20 2019, 11:53 PM · Product-Analytics (Kanban)
colewhite added a comment to T240917: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick.

Thank you!

Dec 20 2019, 11:53 PM · Operations, SRE-Access-Requests
colewhite updated the task description for T240917: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick.
Dec 20 2019, 11:47 PM · Operations, SRE-Access-Requests
colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 20 2019, 8:49 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite added a comment to T240917: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick.

I've moved ahead and added you to the wmf ldap group.

Dec 20 2019, 12:15 AM · Operations, SRE-Access-Requests

Dec 19 2019

colewhite updated the task description for T240917: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick.
Dec 19 2019, 11:40 PM · Operations, SRE-Access-Requests
colewhite updated the task description for T240917: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick.
Dec 19 2019, 11:35 PM · Operations, SRE-Access-Requests
colewhite claimed T240917: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick.
Dec 19 2019, 11:34 PM · Operations, SRE-Access-Requests
colewhite added a subtask for T205870: Fully migrate producers off statsd: T241176: Review and release service-runner 2.8.0.
Dec 19 2019, 9:06 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite added a parent task for T241176: Review and release service-runner 2.8.0: T205870: Fully migrate producers off statsd.
Dec 19 2019, 9:05 PM · Core Platform Team Workboards (Clinic Duty Team), service-runner
colewhite created T241176: Review and release service-runner 2.8.0.
Dec 19 2019, 9:04 PM · Core Platform Team Workboards (Clinic Duty Team), service-runner
colewhite updated subscribers of T240685: MediaWiki Prometheus support.

We recently had a conversation about this.

Dec 19 2019, 8:54 PM · serviceops, Operations, MediaWiki-General, observability
colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 19 2019, 7:22 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite added a comment to T236954: Hieradata yaml style checking.

Great idea. Let's raise it at the next SRE meeting.

Dec 19 2019, 3:54 PM · Patch-For-Review, Puppet, Operations, User-jbond

Dec 18 2019

colewhite added a comment to T240870: Audit the WMF LDAP group and limit its permissions.

@jcrespo that sounds bad to me. Perhaps query monitoring is a great candidate for a more specific and limited group?

Dec 18 2019, 11:40 PM · Operations

Dec 17 2019

colewhite added a comment to T236954: Hieradata yaml style checking.

The changesets look great and appear to do the right thing.

Dec 17 2019, 11:51 PM · Patch-For-Review, Puppet, Operations, User-jbond
colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 17 2019, 11:44 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 17 2019, 10:13 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite added a subtask for T205870: Fully migrate producers off statsd: T240995: AQS is not OpenAPI 3 compliant.
Dec 17 2019, 9:12 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite added a parent task for T240995: AQS is not OpenAPI 3 compliant: T205870: Fully migrate producers off statsd.
Dec 17 2019, 9:12 PM · Patch-For-Review, Analytics
colewhite created T240995: AQS is not OpenAPI 3 compliant.
Dec 17 2019, 9:08 PM · Patch-For-Review, Analytics
colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 17 2019, 12:26 AM · Performance-Team (Radar), Patch-For-Review, observability, Operations

Dec 16 2019

colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 16 2019, 10:17 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 16 2019, 9:02 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 16 2019, 7:52 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite closed T238807: Clean up ORES metrics as Resolved.
Dec 16 2019, 5:06 PM · observability, Operations
colewhite triaged T240870: Audit the WMF LDAP group and limit its permissions as Low priority.
Dec 16 2019, 4:45 PM · Operations
colewhite created T240870: Audit the WMF LDAP group and limit its permissions.
Dec 16 2019, 4:45 PM · Operations

Dec 13 2019

colewhite added a subtask for T205870: Fully migrate producers off statsd: T240685: MediaWiki Prometheus support.
Dec 13 2019, 3:41 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite added a parent task for T240685: MediaWiki Prometheus support: T205870: Fully migrate producers off statsd.
Dec 13 2019, 3:41 PM · serviceops, Operations, MediaWiki-General, observability
colewhite triaged T240685: MediaWiki Prometheus support as Medium priority.
Dec 13 2019, 3:40 PM · serviceops, Operations, MediaWiki-General, observability
colewhite created T240685: MediaWiki Prometheus support.
Dec 13 2019, 3:40 PM · serviceops, Operations, MediaWiki-General, observability

Dec 12 2019

colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 12 2019, 11:53 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 12 2019, 11:13 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations

Dec 11 2019

colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 11 2019, 5:04 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations
colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 11 2019, 4:49 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations

Dec 10 2019

colewhite updated the task description for T205870: Fully migrate producers off statsd.
Dec 10 2019, 9:42 PM · Performance-Team (Radar), Patch-For-Review, observability, Operations

Dec 9 2019

colewhite added a comment to T238807: Clean up ORES metrics.

This needs to be done in codfw as well.

Dec 9 2019, 9:45 PM · observability, Operations
colewhite reopened T238807: Clean up ORES metrics as "Open".
Dec 9 2019, 9:45 PM · observability, Operations

Dec 6 2019

colewhite closed T239881: LDAP access to the wmf group for Danny Horn as Resolved.
Dec 6 2019, 11:19 PM · Operations, LDAP-Access-Requests
colewhite triaged T239993: Decom LVS recdns as Medium priority.
Dec 6 2019, 11:19 PM · Patch-For-Review, Operations, Traffic

Dec 5 2019

colewhite moved T239881: LDAP access to the wmf group for Danny Horn from Backlog to Awaiting User Input on the LDAP-Access-Requests board.
Dec 5 2019, 11:49 PM · Operations, LDAP-Access-Requests
colewhite closed T239494: Requesting access to LogStash for rxy as Resolved.
Dec 5 2019, 7:45 PM · SRE-Access-Requests, Operations
colewhite added a comment to T239494: Requesting access to LogStash for rxy.

@Rxy I've added you to the NDA group which should grant you access to Logstash. Please let me know if you encounter any related issue.

Dec 5 2019, 7:45 PM · SRE-Access-Requests, Operations
colewhite added a comment to T239881: LDAP access to the wmf group for Danny Horn.

@DannyH I've moved ahead and added you to the wmf ldap group on the basis of your status as staff. We still need to know what you need this access for though.

Dec 5 2019, 7:44 PM · Operations, LDAP-Access-Requests
colewhite triaged T239881: LDAP access to the wmf group for Danny Horn as Medium priority.
Dec 5 2019, 6:01 PM · Operations, LDAP-Access-Requests
colewhite triaged T239805: ms-fe2007 NIC failure as Medium priority.
Dec 5 2019, 5:59 PM · User-fgiunchedi, ops-codfw, Operations
colewhite triaged T239832: Fix installation of Puppet 5/Facter 3 on new stretch installs/reimages as Medium priority.
Dec 5 2019, 5:58 PM · Operations
colewhite triaged T239880: Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 as Medium priority.
Dec 5 2019, 5:58 PM · Continuous-Integration-Infrastructure (phase-out-jessie), DC-Ops, hardware-requests, Operations
colewhite triaged T239893: BGP peering sessions with corp partially down in ulsfo as Medium priority.
Dec 5 2019, 5:58 PM · Operations, netops
colewhite triaged T239896: Facebook BGP peering links down in ulsfo as Medium priority.
Dec 5 2019, 5:55 PM · netops, Operations
colewhite triaged T239901: Disallow 'weight: 0' for MW db config in dbctl as Medium priority.
Dec 5 2019, 5:55 PM · Operations, DBA, Wikimedia-Incident
colewhite added a comment to T239874: MediaWiki: "host db1062 is unreachable" (Connection refused).

It seems clear that db1062 shouldn't be pooled anywhere. Ran the dbctl depool utility and it's gone from s7.

Dec 5 2019, 12:48 AM · DBA, Wikimedia-production-error
colewhite added a comment to T233448: Review prometheus ORES rules for completeness.

Since it's not used in dashboards, what do we do with the model? I imagine it's useful, but I'm not sure how.

Dec 5 2019, 12:17 AM · Patch-For-Review, ORES, Scoring-platform-team

Dec 4 2019

colewhite closed T239654: Requesting access to production shell for Maryum Styles, a subtask of T239300: Add Maryum to Puppet, as Resolved.
Dec 4 2019, 9:49 PM · Patch-For-Review, Operations, Discovery-Search (Current work)
colewhite closed T239654: Requesting access to production shell for Maryum Styles as Resolved.
Dec 4 2019, 9:49 PM · Discovery-Search (Current work), Operations, SRE-Access-Requests
colewhite added a comment to T239654: Requesting access to production shell for Maryum Styles.

@Mstyles is now in the wmf ldap group. Please let me know if you encounter any related issue.

Dec 4 2019, 9:49 PM · Discovery-Search (Current work), Operations, SRE-Access-Requests
colewhite triaged T239833: StatsD Exporter drops relayed metrics as Medium priority.
Dec 4 2019, 8:17 PM · Patch-For-Review, observability, Operations
colewhite added a parent task for T239833: StatsD Exporter drops relayed metrics: T233448: Review prometheus ORES rules for completeness.
Dec 4 2019, 8:17 PM · Patch-For-Review, observability, Operations
colewhite added a subtask for T233448: Review prometheus ORES rules for completeness: T239833: StatsD Exporter drops relayed metrics.
Dec 4 2019, 8:17 PM · Patch-For-Review, ORES, Scoring-platform-team
colewhite added a comment to T233448: Review prometheus ORES rules for completeness.

I did more research and found a usage pattern that didn't initially occur to me.

Dec 4 2019, 8:17 PM · Patch-For-Review, ORES, Scoring-platform-team
colewhite created T239833: StatsD Exporter drops relayed metrics.
Dec 4 2019, 4:04 PM · Patch-For-Review, observability, Operations
colewhite claimed T239654: Requesting access to production shell for Maryum Styles.
Dec 4 2019, 3:42 AM · Discovery-Search (Current work), Operations, SRE-Access-Requests
colewhite claimed T239494: Requesting access to LogStash for rxy.
Dec 4 2019, 3:41 AM · SRE-Access-Requests, Operations
colewhite triaged T239300: Add Maryum to Puppet as Medium priority.
Dec 4 2019, 3:41 AM · Patch-For-Review, Operations, Discovery-Search (Current work)
colewhite triaged T239586: Add latest jenkins debian packages to apt.wikimedia.org and upgrade jenkins to latest LTS (2.190.3) as Medium priority.
Dec 4 2019, 3:40 AM · Jenkins, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Operations
colewhite triaged T239711: Make DNS operations resilient against predictable failures as Medium priority.
Dec 4 2019, 3:39 AM · Traffic, Operations