May 23 2019

colewhite updated the task description for T219825: Update dashboards to node-exporter 0.16+ metric names.
May 23 2019, 5:18 PM · Patch-For-Review, observability

May 13 2019

colewhite updated the task description for T219825: Update dashboards to node-exporter 0.16+ metric names.
May 13 2019, 3:08 PM · Patch-For-Review, observability

May 9 2019

colewhite added a comment to T196066: Add prometheus metrics for varnishkafka instances running on caching hosts.

I agree with dropping the prefix in favor of "rdkafka".

May 9 2019, 10:02 PM · Patch-For-Review, Analytics-Kanban, Traffic, SRE, Analytics

May 8 2019

colewhite lowered the priority of T222826: Leverage Grafana annotations to show events in graphs from Medium to Low.
May 8 2019, 5:12 PM · Observability-Logging, SRE
colewhite triaged T222826: Leverage Grafana annotations to show events in graphs as Medium priority.
May 8 2019, 5:12 PM · Observability-Logging, SRE
colewhite added a subtask for T222826: Leverage Grafana annotations to show events in graphs: T174172: unused grafana-dashboard indices on elasticsearch / logstash.
May 8 2019, 5:12 PM · Observability-Logging, SRE
colewhite added a parent task for T174172: unused grafana-dashboard indices on elasticsearch / logstash: T222826: Leverage Grafana annotations to show events in graphs.
May 8 2019, 5:12 PM · Observability-Metrics, observability, Grafana, SRE
colewhite created T222826: Leverage Grafana annotations to show events in graphs.
May 8 2019, 5:11 PM · Observability-Logging, SRE

Apr 30 2019

colewhite added a comment to T219825: Update dashboards to node-exporter 0.16+ metric names.

@CDanis good catch!

Apr 30 2019, 5:13 PM · Patch-For-Review, observability

Apr 25 2019

colewhite updated the task description for T219825: Update dashboards to node-exporter 0.16+ metric names.
Apr 25 2019, 5:24 PM · Patch-For-Review, observability

Apr 24 2019

colewhite added a comment to T217142: [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors.

Copying thoughts to task:

Apr 24 2019, 3:48 PM · observability, User-fgiunchedi, Better Use Of Data, MW-1.34-notes (1.34.0-wmf.15; 2019-07-23), Patch-For-Review, User-herron, Product-Infrastructure-Team-Backlog-Deprecated, Wikimedia-Logstash

Apr 19 2019

colewhite triaged T221481: Degraded RAID on db2047 as High priority.
Apr 19 2019, 10:34 PM · DBA, SRE, ops-codfw

Apr 18 2019

colewhite closed T220084: analytics-wmde group addition for Lucas Werkmeister as Resolved.
Apr 18 2019, 10:41 PM · SRE, SRE-Access-Requests
colewhite added a comment to T220084: analytics-wmde group addition for Lucas Werkmeister.

The group membership change has been deployed.

Apr 18 2019, 10:41 PM · SRE, SRE-Access-Requests
colewhite updated the task description for T220084: analytics-wmde group addition for Lucas Werkmeister.
Apr 18 2019, 10:40 PM · SRE, SRE-Access-Requests
colewhite triaged T221288: Phabricator SPF record contains internal addressing for phab[12]001 as Medium priority.
Apr 18 2019, 7:07 PM · Patch-For-Review, Traffic, SRE, DNS, Mail
colewhite claimed T221290: wiki-mail DKIM failing.
Apr 18 2019, 7:07 PM · Patch-For-Review, Traffic, SRE, DNS, Mail

Apr 17 2019

colewhite triaged T221138: relocate/reimage cloudvirt1004 with 10G interfaces as Medium priority.
Apr 17 2019, 6:48 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)
colewhite triaged T221139: relocate/reimage cloudvirt1003 with 10G interfaces as Medium priority.
Apr 17 2019, 6:48 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)
colewhite triaged T221140: relocate/reimage cloudvirt1002 with 10G interfaces as Medium priority.
Apr 17 2019, 6:48 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)
colewhite triaged T221141: relocate/reimage cloudvirt1001 with 10G interfaces as Medium priority.
Apr 17 2019, 6:47 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)
colewhite triaged T221259: eqord - ulsfo Telia link down - IC-313592 as High priority.
Apr 17 2019, 6:47 PM · ops-eqord, SRE, netops

Apr 16 2019

colewhite triaged T220860: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) as Medium priority.
Apr 16 2019, 6:11 PM · Patch-For-Review, SRE, SRE-Access-Requests
colewhite triaged T220844: remove RT mail aliases as Medium priority.
Apr 16 2019, 6:10 PM · Mail, SRE
colewhite triaged T221125: cumin aliases not matching any hosts as Medium priority.
Apr 16 2019, 6:09 PM · Infrastructure-Foundations, Cumin, cloud-services-team, SRE
colewhite triaged T221115: labpuppetmaster logs 'cannot collect exported resources without storeconfigs being set' as Medium priority.
Apr 16 2019, 6:08 PM · cloud-services-team, SRE
colewhite triaged T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory as Medium priority.
Apr 16 2019, 6:07 PM · cloud-services-team (Kanban), SRE, ops-eqiad, DC-Ops, User-Zppix
colewhite triaged T220590: Decom ms-be101[345] as Medium priority.
Apr 16 2019, 6:06 PM · Patch-For-Review, ops-eqiad, decommission-hardware, User-fgiunchedi, SRE-swift-storage, SRE
colewhite triaged T200297: Review Jade data storage and architecture proposal [RFC] as Medium priority.
Apr 16 2019, 6:05 PM · TechCom-RFC (TechCom-RFC-Closed), MW-1.33-notes (1.33.0-wmf.14; 2019-01-22), Patch-For-Review, Machine-Learning-Team (Active Tasks), DBA, SRE, Jade
colewhite triaged T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool as Medium priority.
Apr 16 2019, 6:04 PM · SRE, Icinga, observability
colewhite triaged T220567: Wikitech page views sometimes default to MobileFrontend as Medium priority.
Apr 16 2019, 6:03 PM · Traffic, SRE, wikitech.wikimedia.org
colewhite lowered the priority of T220500: logstash1012 lock up caused central logging stuck from High to Medium.
Apr 16 2019, 6:02 PM · observability, User-herron, Wikimedia-Logstash, SRE
colewhite triaged T220500: logstash1012 lock up caused central logging stuck as High priority.
Apr 16 2019, 6:02 PM · observability, User-herron, Wikimedia-Logstash, SRE
colewhite closed T220880: Degraded RAID on analytics1039 as Resolved.
Apr 16 2019, 6:02 PM · ops-eqiad, SRE
colewhite added a comment to T220880: Degraded RAID on analytics1039.

We're pretty sure this is a false alarm.

Apr 16 2019, 6:02 PM · ops-eqiad, SRE
colewhite updated subscribers of T220880: Degraded RAID on analytics1039.
Apr 16 2019, 6:01 PM · ops-eqiad, SRE
colewhite triaged T220681: Set `enable_dl` to 0 in php.ini as Medium priority.
Apr 16 2019, 5:32 PM · Patch-For-Review, PHP 7.2 support, Performance-Team (Radar), SRE
colewhite triaged T220901: Elasticsearch nodes overloading in eqiad as High priority.
Apr 16 2019, 3:55 PM · SRE, Discovery-Search (Current work)
colewhite triaged T220907: Degraded RAID on ms-be1013 as High priority.
Apr 16 2019, 3:42 PM · ops-eqiad, SRE
colewhite triaged T220982: maps hosts have bad permissions under /srv/deployment as High priority.
Apr 16 2019, 3:41 PM · SRE
colewhite triaged T221047: relocate/reimage cloudvirt1007 with 10G interfaces as Medium priority.
Apr 16 2019, 3:40 PM · Patch-For-Review, ops-eqiad, DC-Ops, SRE, cloud-services-team (Kanban)
colewhite triaged T221048: relocate/reimage cloudvirt1006 with 10G interfaces as Medium priority.
Apr 16 2019, 3:40 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)
colewhite triaged T221049: relocate/reimage cloudvirt1005 with 10G interfaces as Medium priority.
Apr 16 2019, 3:40 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)
colewhite triaged T221052: config file change canarying for logstash as Medium priority.
Apr 16 2019, 3:39 PM · observability, SRE, Wikimedia-Logstash
colewhite triaged T221068: decom ms-be201[345] as Medium priority.
Apr 16 2019, 3:39 PM · decommission-hardware, ops-codfw, SRE-swift-storage, User-fgiunchedi, SRE

Apr 15 2019

colewhite moved T219825: Update dashboards to node-exporter 0.16+ metric names from Inbox to In progress on the observability board.
Apr 15 2019, 3:14 PM · Patch-For-Review, observability

Apr 3 2019

colewhite added a comment to T219825: Update dashboards to node-exporter 0.16+ metric names.

Fundraising dashboards cannot be updated at this time. It looks like the nodes may need upgrading or forwards-compatibility rules.

Apr 3 2019, 10:51 PM · Patch-For-Review, observability

Apr 1 2019

colewhite added a subtask for T213288: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal): T219825: Update dashboards to node-exporter 0.16+ metric names.
Apr 1 2019, 6:44 PM · User-fgiunchedi, Goal, observability, SRE
colewhite added a parent task for T219825: Update dashboards to node-exporter 0.16+ metric names: T213288: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal).
Apr 1 2019, 6:44 PM · Patch-For-Review, observability
colewhite closed T213708: Upgrade production prometheus-node-exporter to >= 0.16 as Resolved.
Apr 1 2019, 6:43 PM · Patch-For-Review, Goal, observability, SRE
colewhite closed T213708: Upgrade production prometheus-node-exporter to >= 0.16, a subtask of T213288: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal), as Resolved.
Apr 1 2019, 6:43 PM · User-fgiunchedi, Goal, observability, SRE
colewhite triaged T219825: Update dashboards to node-exporter 0.16+ metric names as Low priority.
Apr 1 2019, 6:42 PM · Patch-For-Review, observability
colewhite created T219825: Update dashboards to node-exporter 0.16+ metric names.
Apr 1 2019, 6:41 PM · Patch-For-Review, observability

Mar 28 2019

colewhite updated the task description for T213708: Upgrade production prometheus-node-exporter to >= 0.16.
Mar 28 2019, 11:13 PM · Patch-For-Review, Goal, observability, SRE

Mar 22 2019

colewhite closed T216101: LDAP access to the WMF group for Angela Muigai as Resolved.
Mar 22 2019, 6:18 PM · LDAP-Access-Requests
colewhite added a comment to T216101: LDAP access to the WMF group for Angela Muigai.

Thank you for following up!

Mar 22 2019, 6:09 PM · LDAP-Access-Requests

Mar 21 2019

colewhite added a comment to T217932: Change log routing to ELK cluster to use rsyslog->kafka rather than talking directly to the ELK cluster.

As I understand it, journald is already wired up to copy to rsyslog. The only change needed to get these logs onto Kafka is to whitelist the application in the lookup_table_output.json.

Mar 21 2019, 5:24 PM · User-bd808, cloud-services-team (Kanban), Striker

Mar 6 2019

colewhite closed T214594: node-exporter collector.diskstats.ignored-devices underescaped as Resolved.
Mar 6 2019, 6:34 PM · Patch-For-Review, observability

Mar 4 2019

colewhite claimed T214594: node-exporter collector.diskstats.ignored-devices underescaped.
Mar 4 2019, 4:10 PM · Patch-For-Review, observability

Feb 25 2019

colewhite closed T216120: LDAP access to the wmf group for Delphine Ménard (dmenard) as Resolved.
Feb 25 2019, 8:13 PM · Patch-For-Review, LDAP-Access-Requests
colewhite added a comment to T216120: LDAP access to the wmf group for Delphine Ménard (dmenard).

@Delphine_wmf is now in the wmf ldap group. Resolving task.

Feb 25 2019, 8:13 PM · Patch-For-Review, LDAP-Access-Requests

Feb 21 2019

colewhite created P8120 Smartmon Node Exporter comparison.
Feb 21 2019, 10:19 PM
colewhite placed T215940: Mailing list migration for Arbitration Committee to Google Group up for grabs.
Feb 21 2019, 6:23 PM · SRE, WMF-Office-IT, Wikimedia-Mailing-lists
colewhite updated the task description for T215940: Mailing list migration for Arbitration Committee to Google Group.
Feb 21 2019, 6:23 PM · SRE, WMF-Office-IT, Wikimedia-Mailing-lists
colewhite updated subscribers of T215940: Mailing list migration for Arbitration Committee to Google Group.

Mbox files shared with @eross.

Feb 21 2019, 6:23 PM · SRE, WMF-Office-IT, Wikimedia-Mailing-lists
colewhite closed T215576: Please add Runa Bhattacharjee to the `wmf` LDAP group as Resolved.
Feb 21 2019, 5:59 PM · Patch-For-Review, LDAP-Access-Requests
colewhite added a comment to T215576: Please add Runa Bhattacharjee to the `wmf` LDAP group.

@Arrbee is now in the wmf ldap group. Resolving task.

Feb 21 2019, 5:59 PM · Patch-For-Review, LDAP-Access-Requests
colewhite added a comment to T213708: Upgrade production prometheus-node-exporter to >= 0.16.

On further investigation, the log messages appear to be from the shebang of the ipmitool awk script.

Feb 21 2019, 4:51 PM · Patch-For-Review, Goal, observability, SRE

Feb 15 2019

colewhite added a comment to T216120: LDAP access to the wmf group for Delphine Ménard (dmenard).

I was unable to find your account in LDAP. Have you had an account created for you by OIT or created one on wikitech?

Feb 15 2019, 9:41 PM · Patch-For-Review, LDAP-Access-Requests
colewhite triaged T216235: cleanup reprepro configuration for elasticsearch-curator as Medium priority.
Feb 15 2019, 7:36 PM · Discovery-Search (Current work), Patch-For-Review, User-fgiunchedi, Elasticsearch, SRE
colewhite triaged T216226: GPU upgrade for stat1005 as Medium priority.
Feb 15 2019, 7:35 PM · Analytics, hardware-requests, SRE
colewhite triaged T216202: Disk failure on labsdb1005 as Medium priority.
Feb 15 2019, 7:34 PM · SRE, ops-eqiad
colewhite triaged T216243: cron spam for slow queries on mwmaint /usr/local/bin/foreachwiki initSiteStats.php --update > /dev/null as Medium priority.
Feb 15 2019, 7:33 PM · Performance-Team (Radar), SRE, MediaWiki-Maintenance-system
colewhite triaged T216273: New cronspam from db clusters as Medium priority.
Feb 15 2019, 7:33 PM · SRE
colewhite added a subtask for T132324: Tracking and Reducing cron-spam to root@ : T216273: New cronspam from db clusters.
Feb 15 2019, 7:32 PM · Patch-For-Review, Tracking-Neverending, SRE
colewhite added a parent task for T216273: New cronspam from db clusters: T132324: Tracking and Reducing cron-spam to root@ .
Feb 15 2019, 7:32 PM · SRE
colewhite triaged T216223: Degraded RAID on labsdb1005 as Medium priority.
Feb 15 2019, 7:31 PM · cloud-services-team (Kanban), Toolforge, ops-eqiad, SRE
colewhite created T216273: New cronspam from db clusters.
Feb 15 2019, 7:22 PM · SRE
colewhite edited projects for T216223: Degraded RAID on labsdb1005, added: cloud-services-team (Kanban); removed cloud-services-team.
Feb 15 2019, 4:53 PM · cloud-services-team (Kanban), Toolforge, ops-eqiad, SRE

Feb 14 2019

colewhite triaged T216090: ensure httpd error logs from "misc apps" (krypton) end up in logstash as Medium priority.
Feb 14 2019, 11:12 PM · collaboration-services, Observability-Logging, observability, Wikimedia-Logstash, SRE
colewhite updated subscribers of T216090: ensure httpd error logs from "misc apps" (krypton) end up in logstash.
Feb 14 2019, 11:12 PM · collaboration-services, Observability-Logging, observability, Wikimedia-Logstash, SRE
colewhite triaged T216192: Update label and switch to rename labvirt1012 to cloudvirt1012 as Medium priority.
Feb 14 2019, 11:11 PM · ops-eqiad, SRE
colewhite claimed T215940: Mailing list migration for Arbitration Committee to Google Group.
Feb 14 2019, 10:51 PM · SRE, WMF-Office-IT, Wikimedia-Mailing-lists
colewhite claimed T216101: LDAP access to the WMF group for Angela Muigai.
Feb 14 2019, 8:56 PM · LDAP-Access-Requests
colewhite claimed T216120: LDAP access to the wmf group for Delphine Ménard (dmenard).
Feb 14 2019, 8:55 PM · Patch-For-Review, LDAP-Access-Requests
colewhite closed T215830: Requesting access to analytics-privatedata for esanders as Resolved.
Feb 14 2019, 8:55 PM · Patch-For-Review, SRE, SRE-Access-Requests
colewhite added a comment to T215830: Requesting access to analytics-privatedata for esanders.

The group membership change has been deployed.

Feb 14 2019, 8:54 PM · Patch-For-Review, SRE, SRE-Access-Requests
colewhite closed T215938: Access request: Ladsgroup to analytics-wmde-users as Resolved.
Feb 14 2019, 8:54 PM · Patch-For-Review, SRE, SRE-Access-Requests
colewhite added a comment to T215938: Access request: Ladsgroup to analytics-wmde-users.

The group membership change has been deployed.

Feb 14 2019, 8:53 PM · Patch-For-Review, SRE, SRE-Access-Requests
colewhite triaged T216183: Special:ProtectedPages times out on enwiki for Module namespace as High priority.
Feb 14 2019, 8:33 PM · Performance-Team, User-Marostegui, Wikimedia-production-error, MediaWiki-libs-Rdbms, MediaWiki-Special-pages
colewhite added a comment to T216183: Special:ProtectedPages times out on enwiki for Module namespace.

The logs indicate that the request is timing out fetching data from the database.

Feb 14 2019, 8:32 PM · Performance-Team, User-Marostegui, Wikimedia-production-error, MediaWiki-libs-Rdbms, MediaWiki-Special-pages
CDanis awarded T216088: Mapping of servers to stakeholders a Like token.
Feb 14 2019, 1:07 AM · Infrastructure-Foundations, Patch-For-Review, User-jbond

Feb 13 2019

colewhite claimed T213708: Upgrade production prometheus-node-exporter to >= 0.16.
Feb 13 2019, 11:36 PM · Patch-For-Review, Goal, observability, SRE
colewhite claimed T215830: Requesting access to analytics-privatedata for esanders.
Feb 13 2019, 11:31 PM · Patch-For-Review, SRE, SRE-Access-Requests
colewhite triaged T216088: Mapping of servers to stakeholders as Medium priority.
Feb 13 2019, 11:28 PM · Infrastructure-Foundations, Patch-For-Review, User-jbond
colewhite removed a project from T215938: Access request: Ladsgroup to analytics-wmde-users: LDAP-Access-Requests.
Feb 13 2019, 8:49 PM · Patch-For-Review, SRE, SRE-Access-Requests
colewhite claimed T215938: Access request: Ladsgroup to analytics-wmde-users.
Feb 13 2019, 8:48 PM · Patch-For-Review, SRE, SRE-Access-Requests
colewhite closed T216068: Degraded RAID on cloudvirt1024, a subtask of T215892: Degraded RAID on cloudvirt1024, as Resolved.
Feb 13 2019, 8:43 PM · cloud-services-team (Kanban), ops-eqiad, SRE
colewhite closed T216068: Degraded RAID on cloudvirt1024 as Resolved.
Feb 13 2019, 8:43 PM · ops-eqiad, SRE
colewhite added a comment to T216068: Degraded RAID on cloudvirt1024.

Resolving as duplicate of parent.

Feb 13 2019, 8:42 PM · ops-eqiad, SRE