colewhite (cwhite)
User

User Details

User Since
Aug 21 2018, 6:05 PM (139 w, 2 d)
Availability
Available
LDAP User
Cwhite
MediaWiki User
Unknown

Recent Activity

Wed, Apr 21

colewhite added a subtask for T274394: ES Curator cron jobs are not cleaned up when output no longer exists: T280805: Error in apifeatureusage curator "forcemerge" step.
Wed, Apr 21, 3:20 PM · Patch-For-Review, observability
colewhite added a parent task for T280805: Error in apifeatureusage curator "forcemerge" step: T274394: ES Curator cron jobs are not cleaned up when output no longer exists.
Wed, Apr 21, 3:20 PM · observability, Discovery-Search
colewhite moved T280805: Error in apifeatureusage curator "forcemerge" step from Inbox to Radar on the observability board.
Wed, Apr 21, 3:19 PM · observability, Discovery-Search
colewhite created T280805: Error in apifeatureusage curator "forcemerge" step.
Wed, Apr 21, 3:19 PM · observability, Discovery-Search
colewhite closed T274394: ES Curator cron jobs are not cleaned up when output no longer exists as Resolved.

Refactor complete!

Wed, Apr 21, 3:16 PM · Patch-For-Review, observability

Thu, Apr 15

colewhite added a comment to T279087: Update the logo for wikitech.wikimedia.org to mirror the new MediaWiki logo.

For me personally it would be nice if two favicons were easily distinguishable as my pile of tabs also mostly reduces the visible info to the favicon and maybe a few letters of the page title.

Thu, Apr 15, 11:10 PM · Logos, wikitech.wikimedia.org

Tue, Apr 13

colewhite created T280083: Pontoon: unable to provision role::puppetmaster::pontoon.
Tue, Apr 13, 9:48 PM · observability

Thu, Apr 8

colewhite lowered the priority of T257024: Buster elasticsearch-curator version not compatible with ELK7 from High to Medium.

I found elasticsearch-curator 5.8.1 in the thirdparty/elastic74 component and added it to the thirdparty/elastic710 component. The package is now available for Pontoon and production.

Thu, Apr 8, 7:30 PM · observability, SRE, Wikimedia-Logstash

Wed, Apr 7

colewhite added a parent task for T279601: reclaim icinga1001.wikimedia.org: T247966: Migrate role::alerting_host to Buster.
Wed, Apr 7, 10:02 PM · observability, decommission-hardware
colewhite added a parent task for T279602: reclaim icinga2001.wikimedia.org: T247966: Migrate role::alerting_host to Buster.
Wed, Apr 7, 10:02 PM · observability, decommission-hardware
colewhite added subtasks for T247966: Migrate role::alerting_host to Buster: T279602: reclaim icinga2001.wikimedia.org, T279601: reclaim icinga1001.wikimedia.org.
Wed, Apr 7, 10:02 PM · Patch-For-Review, observability
colewhite added a project to T279602: reclaim icinga2001.wikimedia.org: observability.
Wed, Apr 7, 10:01 PM · observability, decommission-hardware
colewhite created T279602: reclaim icinga2001.wikimedia.org.
Wed, Apr 7, 10:01 PM · observability, decommission-hardware
colewhite renamed T279601: reclaim icinga1001.wikimedia.org from decommission icinga1001.wikimedia.org to reclaim icinga1001.wikimedia.org.
Wed, Apr 7, 10:01 PM · observability, decommission-hardware
colewhite created T279601: reclaim icinga1001.wikimedia.org.
Wed, Apr 7, 10:00 PM · observability, decommission-hardware
colewhite raised the priority of T257024: Buster elasticsearch-curator version not compatible with ELK7 from Medium to High.

The logstash::elasticsearch7 nodes (and Pontoon) do not have a curator version available that can run against ES 7.0. A version that can (>=5.7.0) is not available in our apt repository. The logstash7 nodes have elasticsearch-curator 5.8.1, but it is unclear how it got there.

Wed, Apr 7, 5:05 PM · observability, SRE, Wikimedia-Logstash

Thu, Apr 1

colewhite claimed T274394: ES Curator cron jobs are not cleaned up when output no longer exists.
Thu, Apr 1, 10:01 PM · Patch-For-Review, observability
colewhite moved T279086: Update Scap to emit ECS-compatible structured logs from Inbox to Radar on the observability board.
Thu, Apr 1, 4:52 PM · Release-Engineering-Team (Doing), observability, Scap
colewhite added a project to T279086: Update Scap to emit ECS-compatible structured logs: observability.
Thu, Apr 1, 4:52 PM · Release-Engineering-Team (Doing), observability, Scap
colewhite added a project to T279086: Update Scap to emit ECS-compatible structured logs: Release-Engineering-Team.
Thu, Apr 1, 4:52 PM · Release-Engineering-Team (Doing), observability, Scap
colewhite created T279086: Update Scap to emit ECS-compatible structured logs.
Thu, Apr 1, 4:51 PM · Release-Engineering-Team (Doing), observability, Scap
MBinder_WMF awarded T265505: Verification email does not reach my inbox a Like token.
Thu, Apr 1, 3:45 PM · VPS-project-Phabricator

Wed, Mar 31

colewhite closed T277080: Enable and ingest the Logstash dead letter queue as Resolved.

This is completed.

Wed, Mar 31, 10:54 PM · Patch-For-Review, observability, Wikimedia-Logstash
colewhite closed T277775: Logstash dead letter queue feature does not monitor queue size as Resolved.

The DLQ max_bytes workaround script fired on March 29th and appears to have done the right thing.

Wed, Mar 31, 5:45 PM · Wikimedia-Logstash, Upstream, observability
colewhite added a comment to T278141: cxserver missing important metrics after service-runner 2.8.1 upgrade.

The whole point of using docker is to avoid all these issues right? I used node:10-buster image just like you gave in the steps.

Wed, Mar 31, 4:00 PM · Patch-For-Review, Language-Team (Language-2021-April-June), CX-cxserver

Tue, Mar 30

colewhite updated subscribers of T278891: Phatality: Invalid time value.
Tue, Mar 30, 9:14 PM · Performance-Team (Radar), Patch-For-Review, Phatality
colewhite updated the task description for T278891: Phatality: Invalid time value.
Tue, Mar 30, 9:13 PM · Performance-Team (Radar), Patch-For-Review, Phatality
colewhite created T278891: Phatality: Invalid time value.
Tue, Mar 30, 9:11 PM · Performance-Team (Radar), Patch-For-Review, Phatality
colewhite added a comment to T278141: cxserver missing important metrics after service-runner 2.8.1 upgrade.

Same steps as you followed, except, look at server logs after first curl. Here is full console output from your steps P15088 servicerunner-PR-148

Tue, Mar 30, 3:49 PM · Patch-For-Review, Language-Team (Language-2021-April-June), CX-cxserver

Fri, Mar 26

colewhite added a comment to T278141: cxserver missing important metrics after service-runner 2.8.1 upgrade.

@santhosh I could not replicate the problem. Please advise on how to replicate?

Fri, Mar 26, 4:21 PM · Patch-For-Review, Language-Team (Language-2021-April-June), CX-cxserver

Mar 23 2021

colewhite moved T277816: Improve Logstash's throttling capabilities from Inbox to Backlog on the observability board.
Mar 23 2021, 10:01 PM · Wikimedia-Logstash, observability
colewhite closed T277813: Logstash is throttling the whole ECS pipeline as Resolved.

Resolved with the rollout of https://gerrit.wikimedia.org/r/c/operations/puppet/+/633224

Mar 23 2021, 10:00 PM · Wikimedia-Logstash, observability

Mar 22 2021

colewhite moved T277080: Enable and ingest the Logstash dead letter queue from Backlog to In progress on the observability board.
Mar 22 2021, 3:52 PM · Patch-For-Review, observability, Wikimedia-Logstash
colewhite triaged T275243: Updating the revision for an index pattern should not prevent curator cleanup as Medium priority.
Mar 22 2021, 3:29 PM · observability
colewhite triaged T274394: ES Curator cron jobs are not cleaned up when output no longer exists as Medium priority.
Mar 22 2021, 3:29 PM · Patch-For-Review, observability
colewhite triaged T277813: Logstash is throttling the whole ECS pipeline as High priority.
Mar 22 2021, 3:23 PM · Wikimedia-Logstash, observability
colewhite triaged T277775: Logstash dead letter queue feature does not monitor queue size as High priority.
Mar 22 2021, 3:23 PM · Wikimedia-Logstash, Upstream, observability
colewhite moved T277775: Logstash dead letter queue feature does not monitor queue size from Inbox to In progress on the observability board.
Mar 22 2021, 3:22 PM · Wikimedia-Logstash, Upstream, observability

Mar 19 2021

colewhite closed T238795: The "logstash-*" index pattern does not contain any of the following field types: ip as Resolved.

ECS is typing these fields appropriately since https://gerrit.wikimedia.org/r/c/operations/puppet/+/647029

Mar 19 2021, 8:42 PM · SRE, observability
colewhite reopened T269676: Mediawiki logging indexing conflict on 'status' for 'authevents' as "Open".

We didn't get them all on the first round.

Mar 19 2021, 5:01 PM · MW-1.36-notes, MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), Platform Team Workboards (External Code Reviews), Patch-For-Review, MW-1.35-notes, observability, MediaWiki-General

Mar 18 2021

colewhite added a project to T277813: Logstash is throttling the whole ECS pipeline: Wikimedia-Logstash.
Mar 18 2021, 10:17 PM · Wikimedia-Logstash, observability
colewhite added a subtask for T277813: Logstash is throttling the whole ECS pipeline: T277816: Improve Logstash's throttling capabilities.
Mar 18 2021, 10:16 PM · Wikimedia-Logstash, observability
colewhite added a parent task for T277816: Improve Logstash's throttling capabilities: T277813: Logstash is throttling the whole ECS pipeline.
Mar 18 2021, 10:16 PM · Wikimedia-Logstash, observability
colewhite created T277816: Improve Logstash's throttling capabilities.
Mar 18 2021, 10:16 PM · Wikimedia-Logstash, observability
colewhite created T277813: Logstash is throttling the whole ECS pipeline.
Mar 18 2021, 10:01 PM · Wikimedia-Logstash, observability
colewhite created T277775: Logstash dead letter queue feature does not monitor queue size.
Mar 18 2021, 4:09 PM · Wikimedia-Logstash, Upstream, observability
colewhite committed rOSEC256452f16ce7: ensure host field is the correct type in late-stage ecs filter (authored by colewhite).
Mar 18 2021, 3:44 PM

Mar 16 2021

colewhite added a comment to T269676: Mediawiki logging indexing conflict on 'status' for 'authevents'.

The extent of this issue is now available to view in logstash.

Mar 16 2021, 3:50 PM · MW-1.36-notes, MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), Platform Team Workboards (External Code Reviews), Patch-For-Review, MW-1.35-notes, observability, MediaWiki-General
colewhite closed T277437: Broken kibana dashboard as Resolved.

Sorry for the confusion @ayounsi. The netdev log stream was migrated to ECS back in February. I added a note to the legacy dashboard you linked above pointing to the new dashboard here. Please let me know if I can assist you further.

Mar 16 2021, 3:45 PM · observability
colewhite closed T216611: Icinga check for ircecho should check for actual activity as Resolved.

Digging into this a bit, I found some breakdown in communication between prometheus-ircd-exporter and ircd on the new irc2001 host: T224579

Mar 16 2021, 3:41 PM · IRCecho, observability, Icinga, SRE
colewhite added a comment to T224579: Migrate irc.wikimedia.org/kraz to Buster.

Since yesterday, the Prometheus jobs reduced availability alert has been firing about ircd on irc2001. Looking at the logs, there appears to be some breakdown in communication between prometheus-ircd-exporter and ircd:

Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: ERROR:__main__:Failed to connect to IRC server
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: ERROR:__main__:Failed to close connection to IRC server
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: Traceback (most recent call last):
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 599, in process_request_thread
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     self.finish_request(request, client_address)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     self.RequestHandlerClass(request, client_address, self)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     self.handle()
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     self.handle_one_request()
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     method()
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/exposition.py", line 151, in do_GET
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     output = encoder(registry)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     for metric in registry.collect():
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/registry.py", line 75, in collect
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     for metric in collector.collect():
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/bin/prometheus-ircd-exporter", line 55, in collect
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     irc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/socket.py", line 191, in __init__
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     _sock = _realsocket(family, type, proto)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: error: [Errno 24] Too many open files
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: Traceback (most recent call last):
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 599, in process_request_thread
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     self.finish_request(request, client_address)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     self.RequestHandlerClass(request, client_address, self)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     self.handle()
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     self.handle_one_request()
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     method()
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/exposition.py", line 151, in do_GET
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     output = encoder(registry)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     for metric in registry.collect():
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/registry.py", line 75, in collect
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     for metric in collector.collect():
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/bin/prometheus-ircd-exporter", line 55, in collect
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     irc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/socket.py", line 191, in __init__
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     _sock = _realsocket(family, type, proto)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: error: [Errno 24] Too many open files
Mar 16 2021, 3:39 PM · Patch-For-Review, User-notice, Wikimedia-IRC-RC-Server, SRE

Mar 15 2021

colewhite added a comment to T277080: Enable and ingest the Logstash dead letter queue.

This has been rolled out and appears to be working correctly.

Mar 15 2021, 10:20 PM · Patch-For-Review, observability, Wikimedia-Logstash

Mar 12 2021

colewhite closed T273919: Parse logstash error messages into fields as Resolved.

Logstash logs are now JSON-formatted and indexed in the ECS indexes. There are likely still data points to extract from the logs that would help in classifying issues, but we can amend as we go.

Mar 12 2021, 4:37 PM · observability
colewhite archived P14668 (An Untitled Masterwork).
Mar 12 2021, 4:13 PM
colewhite closed T276595: Upgrade prometheus-jmx-exporter as Resolved.

prometheus-jmx-exporter 0.15.0 is deployed to our apt repo.

Mar 12 2021, 4:10 PM · Analytics-Clusters, SRE, wdwb-tech, Wikidata, Wikidata-Query-Service, CirrusSearch, observability
colewhite closed T276595: Upgrade prometheus-jmx-exporter, a subtask of T192948: Upgrade prometheus-jmx-exporter on all services using it, as Resolved.
Mar 12 2021, 4:10 PM · Analytics-Radar, Platform Team Legacy (Watching / External), User-Elukey, observability, Puppet, Services (watching), Cassandra

Mar 11 2021

colewhite added a comment to T276595: Upgrade prometheus-jmx-exporter.

Hello! Does Analytics have to upgrade too? :)

Mar 11 2021, 6:53 PM · Analytics-Clusters, SRE, wdwb-tech, Wikidata, Wikidata-Query-Service, CirrusSearch, observability

Mar 10 2021

colewhite moved T277080: Enable and ingest the Logstash dead letter queue from Inbox to Backlog on the observability board.
Mar 10 2021, 5:55 PM · Patch-For-Review, observability, Wikimedia-Logstash
colewhite triaged T277080: Enable and ingest the Logstash dead letter queue as Low priority.
Mar 10 2021, 5:54 PM · Patch-For-Review, observability, Wikimedia-Logstash
colewhite created T277080: Enable and ingest the Logstash dead letter queue.
Mar 10 2021, 5:54 PM · Patch-For-Review, observability, Wikimedia-Logstash

Mar 8 2021

colewhite added a subtask for T192948: Upgrade prometheus-jmx-exporter on all services using it: T276595: Upgrade prometheus-jmx-exporter.
Mar 8 2021, 4:53 PM · Analytics-Radar, Platform Team Legacy (Watching / External), User-Elukey, observability, Puppet, Services (watching), Cassandra
colewhite added a parent task for T276595: Upgrade prometheus-jmx-exporter: T192948: Upgrade prometheus-jmx-exporter on all services using it.
Mar 8 2021, 4:53 PM · Analytics-Clusters, SRE, wdwb-tech, Wikidata, Wikidata-Query-Service, CirrusSearch, observability
colewhite created P14668 (An Untitled Masterwork).
Mar 8 2021, 3:45 PM

Mar 5 2021

colewhite added a subtask for T234565: Standardize the logging format: T276468: Unable to exclude "error" field in Logstash.
Mar 5 2021, 10:10 PM · Patch-For-Review, Wikimedia-Logstash, observability, SRE
colewhite added a parent task for T276468: Unable to exclude "error" field in Logstash: T234565: Standardize the logging format.
Mar 5 2021, 10:10 PM · observability, Wikimedia-Logstash
colewhite renamed T276595: Upgrade prometheus-jmx-exporter from Logstash 7.10.x does not work with prometheus-jmx-exporter 0.3.0 to Upgrade prometheus-jmx-exporter.
Mar 5 2021, 5:38 PM · Analytics-Clusters, SRE, wdwb-tech, Wikidata, Wikidata-Query-Service, CirrusSearch, observability
colewhite added a project to T276595: Upgrade prometheus-jmx-exporter: SRE.
Mar 5 2021, 5:37 PM · Analytics-Clusters, SRE, wdwb-tech, Wikidata, Wikidata-Query-Service, CirrusSearch, observability
colewhite created T276595: Upgrade prometheus-jmx-exporter.
Mar 5 2021, 5:36 PM · Analytics-Clusters, SRE, wdwb-tech, Wikidata, Wikidata-Query-Service, CirrusSearch, observability

Mar 4 2021

colewhite created T276501: Pontoon enroll fails to complete.
Mar 4 2021, 8:22 PM · observability
colewhite triaged T276468: Unable to exclude "error" field in Logstash as Medium priority.
Mar 4 2021, 3:39 PM · observability, Wikimedia-Logstash
colewhite added a comment to T276468: Unable to exclude "error" field in Logstash.

The error field appears to be in a cross-index field type conflict. I don't think fields in this state can be filtered.

Mar 4 2021, 3:37 PM · observability, Wikimedia-Logstash

Mar 1 2021

colewhite added a comment to T276101: histogram bucket metrics for elasticsearch query latency.

One option could be to enable slowlog. I think the slowlog settings are part of the index template, which would necessitate an index update.

Mar 1 2021, 3:48 PM · observability
colewhite added a comment to T276095: Keep calculating latencies for MediaWiki requests that happen k8s.

Another possible solution is to extract the metrics via a sum aggregation query with prometheus-es-exporter. It's pretty easy to set up, but has the drawback that the logs must be indexed before they can be queried and exported.

Mar 1 2021, 3:44 PM · observability, SRE, serviceops, MW-on-K8s
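
The sum aggregation mentioned in the comment above could look roughly like the following. This is a minimal sketch of an Elasticsearch request body built in Python, not the configuration actually used: the field name `backend_time_ms`, the aggregation name `total`, and the one-minute window are all hypothetical, and how prometheus-es-exporter would be configured to run the query is not shown.

```python
def build_sum_query(field, since="now-1m"):
    """Build an Elasticsearch request body that sums `field` over recent logs.

    `field` and `since` are illustrative assumptions; only documents that are
    already indexed can contribute to the result.
    """
    return {
        "size": 0,  # return no hits, only the aggregation result
        "query": {"range": {"@timestamp": {"gte": since}}},
        "aggs": {"total": {"sum": {"field": field}}},
    }


def extract_sum(response):
    """Pull the summed value out of an Elasticsearch aggregation response."""
    return response["aggregations"]["total"]["value"]


if __name__ == "__main__":
    print(build_sum_query("backend_time_ms"))
```

Because the query only sees indexed documents, the exported metric lags behind the requests by at least the indexing delay, which is the drawback noted in the comment.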

Feb 26 2021

colewhite moved T275405: Logstash collector nodes hang indefinitely on reboot from Backlog to In progress on the observability board.
Feb 26 2021, 8:36 PM · Patch-For-Review, observability

Feb 25 2021

colewhite added a comment to T264665: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts.

@Jdlrobson I'd be happy to help. Ping me on IRC or elsewhere to coordinate a time and we'll make it happen.

Feb 25 2021, 10:17 PM · SRE, observability, Instrument-ClientError, MediaWiki-extensions-WikimediaEvents
colewhite moved T275618: Kubernetes 1.16 dropped deprecated cadvisor metric labels pod_name and container_name from Inbox to Radar on the observability board.
Feb 25 2021, 3:17 PM · observability, Kubernetes, Prod-Kubernetes, serviceops

Feb 24 2021

colewhite closed T269680: MediaWiki logging indexing conflict on 'session' for 'session-ip' channel as Resolved.

Since the latest MediaWiki deploy, these indexing errors have ceased.

Feb 24 2021, 10:31 PM · MW-1.35-notes, MW-1.36-notes (1.36.0-wmf.31; 2021-02-16), Patch-For-Review, Platform Team Workboards (Clinic Duty Team), MediaWiki-extensions-CentralAuth, MediaWiki-Debug-Logger, MediaWiki-Authentication-and-authorization, observability
colewhite added a comment to T275618: Kubernetes 1.16 dropped deprecated cadvisor metric labels pod_name and container_name.

The proposal I got is the following:

  • Fetch the JSON for all the dashboards under the Service/ and Kubernetes/ grafana folders.
  • Programmatically parse the JSON, and whenever a query like the above is met add an extra query that will just have pod instead of pod_name and container instead of container_name.
  • Save the new JSON
  • Upload it to grafana (possibly manually but it's not that big a deal, it's like 40 dashboards or so).
Feb 24 2021, 4:04 PM · observability, Kubernetes, Prod-Kubernetes, serviceops
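
The parsing step of the proposal above could be sketched as follows. This is an illustration under assumptions, not the script that was actually used: it assumes each dashboard panel keeps its queries in a `targets` list whose entries carry the query in an `expr` string, and the `_renamed` suffix on `refId` is a hypothetical way to keep query identifiers distinct. Fetching and uploading the JSON are left out.

```python
import copy
import re

# Deprecated cadvisor labels and their replacements, as named in the proposal.
LABEL_RENAMES = {"pod_name": "pod", "container_name": "container"}

# Match the old label names only as whole words.
_OLD_LABEL = re.compile(r"\b(pod_name|container_name)\b")


def rename_labels(expr):
    """Return expr with pod_name/container_name replaced by pod/container."""
    return _OLD_LABEL.sub(lambda m: LABEL_RENAMES[m.group(1)], expr)


def add_renamed_queries(node):
    """Walk dashboard JSON in place. For every query that uses an old label,
    append a copy that uses the new label. Returns the number of queries added."""
    added = 0
    if isinstance(node, dict):
        targets = node.get("targets")
        if isinstance(targets, list):
            existing = {t.get("expr") for t in targets if isinstance(t, dict)}
            for target in list(targets):  # iterate over a snapshot
                if not isinstance(target, dict):
                    continue
                expr = target.get("expr")
                if not isinstance(expr, str):
                    continue
                new_expr = rename_labels(expr)
                if new_expr != expr and new_expr not in existing:
                    clone = copy.deepcopy(target)
                    clone["expr"] = new_expr
                    if "refId" in clone:
                        # Hypothetical suffix to avoid duplicate query ids.
                        clone["refId"] = str(clone["refId"]) + "_renamed"
                    targets.append(clone)
                    existing.add(new_expr)
                    added += 1
        for value in node.values():
            added += add_renamed_queries(value)
    elif isinstance(node, list):
        for item in node:
            added += add_renamed_queries(item)
    return added
```

Keeping the original query alongside the renamed one means a dashboard keeps working for data recorded under either label set, and running the function a second time adds nothing.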

Feb 23 2021

colewhite added a comment to T249607: Kibana naming convention.
  • so far I've needed only visualizations essentially based on a single field. In that case I've named the visualization as FIELDNAME - AGGREGATION where aggregation is usually one of:
    • count - a data table with the summary of FIELDNAME
    • over time - area visualization (bars) broken down on FIELDNAME on y axis and aggregated by @timestamp on x axis
    • top N over time - same as above, but top N
Feb 23 2021, 11:41 PM · observability
colewhite added a comment to T264665: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts.

The process for generating an alert from logs is to export a metric based on an ES query, then alert off of the generated metric. We use prometheus-es-exporter to do the metrics export. Once there is a metric in place, we can alert on sum_over_time(metric[1h]) >= 10000 reasonably easily.

Feb 23 2021, 11:22 PM · SRE, observability, Instrument-ClientError, MediaWiki-extensions-WikimediaEvents
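
To make the hourly-sum threshold described in the comment above concrete, here is a small Python sketch of the arithmetic. It is not how Prometheus evaluates the rule internally; the sample layout `(timestamp, value)` and the half-open window are assumptions made for illustration.

```python
def sum_over_time(samples, now, window_seconds):
    """Sum the values of (timestamp, value) samples that fall in the
    window (now - window_seconds, now]."""
    start = now - window_seconds
    return sum(value for ts, value in samples if start < ts <= now)


def should_alert(samples, now, window_seconds=3600, threshold=10000):
    """Return True when the windowed sum reaches the threshold."""
    return sum_over_time(samples, now, window_seconds) >= threshold


if __name__ == "__main__":
    # Exported metric values at 30-minute intervals (timestamps in seconds).
    samples = [(0, 5000), (1800, 4000), (3600, 6000)]
    print(should_alert(samples, now=3600))
```

With these samples the value at timestamp 0 falls outside the one-hour window, so only 4000 and 6000 are summed.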
colewhite closed T265938: Create a separate logstash ElasticSearch index for schemaed events, a subtask of T234565: Standardize the logging format, as Resolved.
Feb 23 2021, 10:46 PM · Patch-For-Review, Wikimedia-Logstash, observability, SRE
colewhite closed T265938: Create a separate logstash ElasticSearch index for schemaed events, a subtask of T256173: Allow filtering of errors from logged in users, as Resolved.
Feb 23 2021, 10:46 PM · Instrument-ClientError, Better Use Of Data, MW-1.36-notes (1.36.0-wmf.13; 2020-10-12), Patch-For-Review, Readers-Web-Backlog (Tracking), MediaWiki-extensions-WikimediaEvents
colewhite closed T265938: Create a separate logstash ElasticSearch index for schemaed events as Resolved.

Please feel free to reach out if more information is needed.

Feb 23 2021, 10:46 PM · Wikimedia-Logstash, observability, Analytics, Product-Data-Infrastructure
colewhite added a subtask for T234565: Standardize the logging format: T265938: Create a separate logstash ElasticSearch index for schemaed events.
Feb 23 2021, 10:45 PM · Patch-For-Review, Wikimedia-Logstash, observability, SRE
colewhite added a parent task for T265938: Create a separate logstash ElasticSearch index for schemaed events: T234565: Standardize the logging format.
Feb 23 2021, 10:45 PM · Wikimedia-Logstash, observability, Analytics, Product-Data-Infrastructure
colewhite closed T216611: Icinga check for ircecho should check for actual activity as Resolved.

Updated monitoring has been deployed.

Feb 23 2021, 9:17 PM · IRCecho, observability, Icinga, SRE

Feb 22 2021

colewhite moved T275405: Logstash collector nodes hang indefinitely on reboot from Inbox to Backlog on the observability board.
Feb 22 2021, 10:52 PM · Patch-For-Review, observability
colewhite renamed T275405: Logstash collector nodes hang indefinitely on reboot from reshuffle order of operations for ELK reboots to Logstash collector nodes hang indefinitely on reboot.
Feb 22 2021, 10:13 PM · Patch-For-Review, observability

Feb 19 2021

colewhite moved T275243: Updating the revision for an index pattern should not prevent curator cleanup from Inbox to Backlog on the observability board.
Feb 19 2021, 8:50 PM · observability
colewhite added a parent task for T275243: Updating the revision for an index pattern should not prevent curator cleanup: T274394: ES Curator cron jobs are not cleaned up when output no longer exists.
Feb 19 2021, 8:50 PM · observability
colewhite added a subtask for T274394: ES Curator cron jobs are not cleaned up when output no longer exists: T275243: Updating the revision for an index pattern should not prevent curator cleanup.
Feb 19 2021, 8:50 PM · Patch-For-Review, observability
colewhite created T275243: Updating the revision for an index pattern should not prevent curator cleanup.
Feb 19 2021, 8:49 PM · observability

Feb 17 2021

colewhite added a comment to T272370: Indexing conflict for 'request' from linkrecommendation.

Logs of the mapping exceptions here.

Feb 17 2021, 4:01 PM · Growth-Team (Current Sprint), Add-Link

Feb 16 2021

colewhite closed T217032: Logstash syslog input grok parse failure on some network devices lines as Resolved.

logstash-syslog removed from production

Feb 16 2021, 8:50 PM · observability, Wikimedia-Logstash

Feb 12 2021

colewhite moved T274394: ES Curator cron jobs are not cleaned up when output no longer exists from Inbox to Backlog on the observability board.
Feb 12 2021, 9:09 PM · Patch-For-Review, observability
colewhite closed T274593: Logstash beta is not getting any events, a subtask of T271345: 1.36.0-wmf.31 deployment blockers, as Resolved.
Feb 12 2021, 9:08 PM · Patch-For-Review, Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), Release, Train Deployments
colewhite closed T274593: Logstash beta is not getting any events as Resolved.

Restarted logstash on deployment-logstash03 and logs appear to be flowing again.

Feb 12 2021, 9:08 PM · observability, User-DannyS712, Beta-Cluster-Infrastructure, Wikimedia-Logstash
colewhite triaged T274661: Enhance Gerrit logs ingested by Logstash as Medium priority.
Feb 12 2021, 7:52 PM · Technical-Debt, Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), Release-Engineering-Team (Development services), observability, Gerrit
colewhite claimed T274661: Enhance Gerrit logs ingested by Logstash.

ECS migration task: T234565

Feb 12 2021, 7:52 PM · Technical-Debt, Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), Release-Engineering-Team (Development services), observability, Gerrit

Feb 11 2021

colewhite added a project to T274472: Investigate and repool db1134: ops-eqiad.
Feb 11 2021, 3:28 AM · Patch-For-Review, ops-eqiad, DBA, SRE