
colewhite (cwhite)
User

User Details

User Since
Aug 21 2018, 6:05 PM (178 w, 6 d)
Availability
Available
LDAP User
Cwhite
MediaWiki User
Unknown

Recent Activity

Today

colewhite updated the task description for T299168: Upgrade OpenSearch.
Tue, Jan 25, 1:02 AM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), Observability-Logging

Thu, Jan 20

lmata awarded T240685: MediaWiki Prometheus support a Love token.
Thu, Jan 20, 7:06 PM · MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), MediaWiki-libs-Metrics, Platform Team Workboards (External Code Reviews), Patch-For-Review, serviceops, SRE, MediaWiki-General, observability

Tue, Jan 18

colewhite updated the task description for T299168: Upgrade OpenSearch.
Tue, Jan 18, 11:53 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), Observability-Logging
colewhite updated the task description for T299168: Upgrade OpenSearch.
Tue, Jan 18, 10:48 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), Observability-Logging
colewhite added a comment to T236954: Hieradata yaml style checking.

Thanks for looking into this! Automatic formatting would be great as long as the output is human-oriented.

Tue, Jan 18, 6:31 PM · Infrastructure-Foundations, Patch-For-Review, Puppet, SRE, User-jbond
colewhite created T299431: Upgrade logstash-filter-verifier logstash to 7.16.
Tue, Jan 18, 5:13 PM · Patch-For-Review, Observability-Logging

Thu, Jan 13

colewhite updated the task description for T299168: Upgrade OpenSearch.
Thu, Jan 13, 8:20 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), Observability-Logging
colewhite moved T299168: Upgrade OpenSearch from Inbox to Up next on the SRE Observability (FY2021/2022-Q3) board.
Thu, Jan 13, 8:15 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), Observability-Logging
colewhite created T299168: Upgrade OpenSearch.
Thu, Jan 13, 8:15 PM · Patch-For-Review, SRE Observability (FY2021/2022-Q3), Observability-Logging

Tue, Jan 11

colewhite updated the task description for T294120: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8.
Tue, Jan 11, 8:44 PM · Infrastructure-Foundations, SRE
colewhite added a comment to T297219: CirrusSearch extension can generate too-long error messages.

Findings for today:

Tue, Jan 11, 12:04 AM · Performance-Team (Radar), SRE Observability, Discovery-Search, CirrusSearch

Fri, Jan 7

colewhite added a comment to T298619: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC .

Index curation is affected as well, because Python's datetime formatter does not handle week-year the same way. We ought to consider basing curation on field_stats or creation_date instead.
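
A minimal sketch of the mismatch, assuming the underlying problem is the difference between the calendar year and the ISO 8601 week-based year around New Year. The `index_name` helper and the `logstash-` prefix are hypothetical and only serve the illustration; they are not taken from the actual curation code.

```python
from datetime import date


def calendar_year(d: date) -> int:
    """Calendar year, as printed by strftime's %Y."""
    return d.year


def week_year(d: date) -> int:
    """ISO 8601 week-based year, which can differ from the calendar year near January 1."""
    return d.isocalendar()[0]


def index_name(d: date, use_week_year: bool) -> str:
    # Hypothetical daily index name; the "logstash-" prefix is an assumption.
    year = week_year(d) if use_week_year else calendar_year(d)
    return f"logstash-{year}.{d.month:02d}.{d.day:02d}"


if __name__ == "__main__":
    new_year = date(2022, 1, 1)
    print(index_name(new_year, use_week_year=False))
    print(index_name(new_year, use_week_year=True))
```

For 1 January 2022 the two variants disagree on the year (2022 versus 2021), so a curator that parses index names by calendar year would misjudge the age of an index named by week-year.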

Fri, Jan 7, 4:27 PM · Patch-For-Review, SRE-OnFire, Infrastructure-Foundations, SRE

Thu, Jan 6

colewhite updated the task description for T240667: Ingestion errors for production logs on ELK7.
Thu, Jan 6, 4:39 PM · Observability-Logging, observability, SRE, Wikimedia-Logstash
colewhite closed T239458: Mediawiki logging indexing conflict as Resolved.

No instances in the last two days. Will reopen if it comes back.

Thu, Jan 6, 4:39 PM · MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), Patch-For-Review, SRE Observability, observability, Wikimedia-Logstash, Editing-team (Tracking), MediaWiki-Logevents, VisualEditor, MediaWiki-General
colewhite added a comment to T282863: Upgrade Grafana to 8.x.

Grafana 8 is running on grafana-next.

Thu, Jan 6, 12:27 AM · SRE Observability (FY2021/2022-Q3), Performance-Team (Radar)

Wed, Jan 5

colewhite added a comment to T298619: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC .

Per the linked upstream issue, Logstash uses Joda which uses this pattern syntax.

Wed, Jan 5, 4:13 PM · Patch-For-Review, SRE-OnFire, Infrastructure-Foundations, SRE

Tue, Jan 4

herron awarded T297433: Upgrade firmware on graphite1004 if upgrade available. a Stroopwafel token.
Tue, Jan 4, 4:18 PM · Graphite, DC-Ops, ops-eqiad
colewhite closed T298036: OpenSearch coordinate map view thinks it is broken as Resolved.

Tentatively resolving because the root issue was resolved upstream.

Tue, Jan 4, 3:36 PM · Observability-Logging, Wikimedia-Logstash
colewhite added a comment to T298036: OpenSearch coordinate map view thinks it is broken.

From the OpenSearch team: "We are committed to providing support for OpenSearch [Dashboards], and that includes providing a maps service for all users. Interruptions of the map tile service are an oversight."

Tue, Jan 4, 3:34 PM · Observability-Logging, Wikimedia-Logstash

Mon, Jan 3

colewhite added a comment to T298036: OpenSearch coordinate map view thinks it is broken.

Checked on it today and Amazon's tile server is no longer 404'ing. I'll inquire as to expected availability at my next opportunity.

Mon, Jan 3, 9:05 PM · Observability-Logging, Wikimedia-Logstash

Dec 22 2021

colewhite changed the status of T298036: OpenSearch coordinate map view thinks it is broken from In Progress to Open.
Dec 22 2021, 2:27 AM · Observability-Logging, Wikimedia-Logstash
colewhite added a comment to T298036: OpenSearch coordinate map view thinks it is broken.

There's a near-term fix that worked on beta but it raises several questions about sustaining the feature:

Dec 22 2021, 2:26 AM · Observability-Logging, Wikimedia-Logstash

Dec 21 2021

colewhite closed T294581: Upgrade ECS to 1.11.0 as Resolved.

The template is now properly installed on the cluster.

Dec 21 2021, 4:30 PM · Patch-For-Review, Observability-Logging, SRE Observability (FY2021/2022-Q2)
colewhite closed T294581: Upgrade ECS to 1.11.0, a subtask of T292881: Mutate mmkubernetes k8s fields into ECS fields, as Resolved.
Dec 21 2021, 4:29 PM · serviceops, Observability-Logging
colewhite reopened T298036: OpenSearch coordinate map view thinks it is broken as "In Progress".
Dec 21 2021, 12:10 AM · Observability-Logging, Wikimedia-Logstash

Dec 20 2021

colewhite added a comment to T298036: OpenSearch coordinate map view thinks it is broken.

This overlay appears when a tile fails to load (in this case, they're 404'ing).

Dec 20 2021, 10:42 PM · Observability-Logging, Wikimedia-Logstash
colewhite changed the status of T298036: OpenSearch coordinate map view thinks it is broken from Open to In Progress.
Dec 20 2021, 4:24 PM · Observability-Logging, Wikimedia-Logstash

Dec 18 2021

colewhite committed rOSEC20305ff580e6: add and enable subset filters (authored by colewhite).
Dec 18 2021, 1:05 AM

Dec 16 2021

colewhite reopened T294581: Upgrade ECS to 1.11.0 as "In Progress".

It appears that ECS 1.11.0 ships with experimental fields that use licensed types, which prevents the template from being installed. We'll want to exclude those fields.
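
One way to exclude such fields can be sketched as a recursive filter over the template's `properties` tree. This is only an illustration: the `drop_fields_with_types` helper is hypothetical, and the set of excluded types in the example is an assumption about which field types are license-restricted, not a list taken from the task.

```python
from typing import Any, Dict, Set


def drop_fields_with_types(properties: Dict[str, Any], excluded: Set[str]) -> Dict[str, Any]:
    """Return a copy of a mapping 'properties' tree without fields whose type is excluded."""
    kept: Dict[str, Any] = {}
    for name, spec in properties.items():
        if spec.get("type") in excluded:
            continue  # skip the field entirely
        spec = dict(spec)  # shallow copy so the input is not mutated
        if "properties" in spec:
            # Object fields nest further properties; filter them recursively.
            spec["properties"] = drop_fields_with_types(spec["properties"], excluded)
        kept[name] = spec
    return kept


if __name__ == "__main__":
    # Assumed set of types to exclude; adjust to whatever the cluster rejects.
    excluded_types = {"wildcard", "flattened"}
    template_properties = {
        "message": {"type": "text"},
        "labels": {"type": "flattened"},
        "url": {"properties": {"full": {"type": "wildcard"}, "path": {"type": "keyword"}}},
    }
    print(drop_fields_with_types(template_properties, excluded_types))
```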

Dec 16 2021, 12:08 AM · Patch-For-Review, Observability-Logging, SRE Observability (FY2021/2022-Q2)
colewhite reopened T294581: Upgrade ECS to 1.11.0, a subtask of T292881: Mutate mmkubernetes k8s fields into ECS fields, as In Progress.
Dec 16 2021, 12:08 AM · serviceops, Observability-Logging

Dec 14 2021

colewhite added a comment to T297239: Move logstash api-feature-usage output away from v5 cluster.

I tested Logstash 7.10 writing api feature usage logs to an ES 6 instance in cloud. Somewhere in the pipeline, the api feature usage logs get assigned two types which makes ES 6 reject it:

[2021-12-14T20:56:19,352][DEBUG][o.e.a.b.TransportShardBulkAction] [4Js0n17] [apifeatureusage-2021.12.14][0] failed to execute bulk item (index) index {[apifeatureusage-2021.12.14][doc][_Ly7un0B49-t0j9CVvuQ], source[{"agent":"ChangePropagation/WMF","feature":"https-expected","@timestamp":"2021-12-14T20:56:19.118Z","type":"api-feature-usage-sanitized","@version":1}]}
java.lang.IllegalArgumentException: Rejecting mapping update to [apifeatureusage-2021.12.14] as the final mapping would have more than 1 type: [doc, api-feature-usage-sanitized]
        at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:451) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:399) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:331) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:313) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:229) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.5.4.jar:6.5.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
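
The rejection above can be illustrated with a toy model of the one-type-per-index rule. This is a sketch of the constraint only, not Elasticsearch code; the `SingleTypeIndex` class is hypothetical.

```python
class SingleTypeIndex:
    """Toy model of an index that accepts documents of at most one mapping type."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.types: set = set()

    def index(self, doc_type: str) -> None:
        proposed = self.types | {doc_type}
        if len(proposed) > 1:
            # Mirrors the spirit of the error in the log: a second type is refused.
            raise ValueError(
                f"Rejecting mapping update to [{self.name}]: "
                f"final mapping would have more than 1 type: {sorted(proposed)}"
            )
        self.types = proposed


if __name__ == "__main__":
    idx = SingleTypeIndex("apifeatureusage-2021.12.14")
    idx.index("doc")
    try:
        idx.index("api-feature-usage-sanitized")
    except ValueError as err:
        print(err)
```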
Dec 14 2021, 11:53 PM · Patch-For-Review, Observability-Logging, SRE

Dec 9 2021

colewhite updated subscribers of T297433: Upgrade firmware on graphite1004 if upgrade available..

I think it just needs to be downtimed in icinga for the maintenance window. Being the backup host, I think you can proceed when you're ready.

Dec 9 2021, 8:55 PM · Graphite, DC-Ops, ops-eqiad
colewhite added a parent task for T297433: Upgrade firmware on graphite1004 if upgrade available.: T297265: graphite1004 freezing.
Dec 9 2021, 8:37 PM · Graphite, DC-Ops, ops-eqiad
colewhite added a subtask for T297265: graphite1004 freezing: T297433: Upgrade firmware on graphite1004 if upgrade available..
Dec 9 2021, 8:37 PM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
colewhite created T297433: Upgrade firmware on graphite1004 if upgrade available..
Dec 9 2021, 8:36 PM · Graphite, DC-Ops, ops-eqiad
colewhite added a comment to T295706: Improve TransactionProfiler as replacement for tendril's slow queries.

@Ladsgroup we have moved to codfw. The fields are no longer in conflict on this cluster.

Dec 9 2021, 6:14 PM · Performance-Team-publish, MW-1.38-notes (1.38.0-wmf.9; 2021-11-16), Patch-For-Review, Performance-Team (Radar), Developer Productivity, Wikimedia-Rdbms, DBA, User-Ladsgroup
colewhite lowered the priority of T297265: graphite1004 freezing from Unbreak Now! to High.
Dec 9 2021, 4:09 AM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
colewhite added a comment to T297265: graphite1004 freezing.

We are failed over to graphite2003 for now.

Dec 9 2021, 4:09 AM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite
colewhite added a comment to T297265: graphite1004 freezing.
$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Jun-01-2018 | 10:56:02 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared
2   | Dec-19-2018 | 15:58:30 | Status           | Power Supply                | Power Supply input lost (AC/DC)
3   | Dec-19-2018 | 15:58:40 | Status           | Power Supply                | Power Supply input lost (AC/DC)
4   | Sep-10-2019 | 15:07:07 | Status           | Power Supply                | Power Supply input lost (AC/DC)
5   | Sep-10-2019 | 15:07:08 | PS Redundancy    | Power Supply                | Redundancy Lost
6   | Sep-10-2019 | 15:11:28 | PS Redundancy    | Power Supply                | Fully Redundant
7   | Sep-10-2019 | 15:11:32 | Status           | Power Supply                | Power Supply input lost (AC/DC)
8   | Oct-31-2019 | 11:48:07 | Status           | Power Supply                | Power Supply input lost (AC/DC)
9   | Oct-31-2019 | 11:48:09 | PS Redundancy    | Power Supply                | Redundancy Lost
10  | Oct-31-2019 | 11:51:27 | Status           | Power Supply                | Power Supply input lost (AC/DC)
11  | Oct-31-2019 | 11:51:29 | PS Redundancy    | Power Supply                | Fully Redundant
12  | Oct-31-2019 | 11:58:10 | Status           | Power Supply                | Power Supply input lost (AC/DC)
13  | Oct-31-2019 | 11:58:14 | PS Redundancy    | Power Supply                | Redundancy Lost
14  | Oct-31-2019 | 12:01:00 | Status           | Power Supply                | Power Supply input lost (AC/DC)
15  | Oct-31-2019 | 12:01:04 | PS Redundancy    | Power Supply                | Fully Redundant
16  | Sep-26-2021 | 00:21:40 | Mem ECC Warning  | Memory                      | transition to Non-Critical from OK
17  | Sep-26-2021 | 00:25:50 | Mem ECC Warning  | Memory                      | transition to Critical from less severe
Dec 9 2021, 12:39 AM · User-fgiunchedi, Patch-For-Review, Wikimedia-Incident, SRE, Graphite

Dec 8 2021

colewhite changed the status of T288621: Logs and events produced by the WMF are consumed using the Elastic Common Schema by OpenSearch from Open to In Progress.
Dec 8 2021, 5:22 PM · SRE Observability (FY2021/2022-Q3), Patch-For-Review, Goal
colewhite added a comment to T123243: Ability to alert when we get a sudden increase in bad passwords for privileged accounts.

I know this is asking a lot, but if we had some way to add on some detection around the most privileged accounts to detect higher risk behavior or maybe major deviations from normal activity from a particular account (e.g. a login from a place the account owner would never log in from), that would be useful.

Dec 8 2021, 5:20 PM · Security-Team, observability, Security
colewhite added a comment to T297239: Move logstash api-feature-usage output away from v5 cluster.

Also worth considering is the option of addressing points 1 & 2 within a single logstash cluster using pipeline configurations. Essentially instead of cloning the log as api-feature-usage-sanitized we could output it to a secondary pipeline which in turn would output to search.svc.

Personally I'd be inclined to explore using multiple pipelines, assuming it pans out we could repurpose and expand on that approach for other uses in the future.

Dec 8 2021, 12:20 AM · Patch-For-Review, Observability-Logging, SRE

Dec 6 2021

colewhite added a comment to T288612: Remove outdated Wikibase settings from production config.

Mentioned in SAL (#wikimedia-operations) [2021-12-06T20:14:09Z] <cwhite> begin codfw opensearch upgrade T288612

Dec 6 2021, 8:16 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)), Wikidata

Dec 3 2021

colewhite added a comment to T288621: Logs and events produced by the WMF are consumed using the Elastic Common Schema by OpenSearch.

https://gerrit.wikimedia.org/r/743049

This is the last automated step in provisioning OpenSearch.

Before merging this change, Puppet should be disabled on the whole codfw ES cluster.

Optionally, disable shard allocation for the cluster. Between each node join, shard allocation will have to be re-enabled to allow the node to allocate and update its own shards.
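
Toggling shard allocation between node joins could be scripted roughly as below. This sketch only builds the request against the cluster settings endpoint using the `cluster.routing.allocation.enable` setting; the base URL is a placeholder, and the choice of `persistent` over `transient` settings is an assumption.

```python
import json
import urllib.request


def allocation_settings(enable: bool) -> dict:
    """Body for the cluster settings API that enables or disables shard allocation."""
    return {"persistent": {"cluster.routing.allocation.enable": "all" if enable else "none"}}


def build_request(base_url: str, enable: bool) -> urllib.request.Request:
    """Prepare (but do not send) a PUT to <base_url>/_cluster/settings."""
    body = json.dumps(allocation_settings(enable)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder address; replace with a node of the cluster being migrated.
    req = build_request("http://localhost:9200", enable=False)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

Calling `build_request(..., enable=True)` after each node joins produces the request that re-enables allocation, so the node can allocate and update its shards before the next one is migrated.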

After merge, the first nodes to migrate are data nodes. Do serially:

  • Stop elasticsearch
  • Enable and Run Puppet

After Puppet applies this change, ensure OpenSearch is stopped. Prior to joining the cluster, we'll want to put the ES index data into place:

  • mv /etc/elasticsearch/production-elk7-codfw /srv/opensearch/production-elk7-codfw
  • chown -R opensearch:opensearch /srv/opensearch/production-elk7-codfw

Once data is in place, start OpenSearch and watch logs and api endpoints for a successful cluster join and shard provisioning.

After merge, the last nodes to migrate are collector nodes. Do serially:

  • Stop Logstash
  • Stop elasticsearch
  • Enable and Run Puppet

Some manual steps once complete:

  • Purge elasticsearch-oss and kibana packages
  • Disable and stop lingering services
    • sudo systemctl disable elasticsearch_7@production-elk7-codfw
    • sudo systemctl disable elasticsearch-production-elk7-codfw-gc-log-cleanup.timer
    • sudo systemctl stop elasticsearch-production-elk7-codfw-gc-log-cleanup.timer
  • Check for and possibly remove lingering files:
    • /etc/logrotate.d/elastic*
    • /etc/elasticsearch
    • /lib/systemd/system/elasticsearch*
    • /var/log/elasticsearch-production-elk7-codfw-gc-log-cleanup
    • /etc/kibana
    • /etc/default/kibana
    • /etc/sudoers.d/kibana-deploy-phatality
Dec 3 2021, 6:05 PM · SRE Observability (FY2021/2022-Q3), Patch-For-Review, Goal

Dec 2 2021

colewhite added a comment to T123243: Ability to alert when we get a sudden increase in bad passwords for privileged accounts.

I'm assuming -priv suffix means privileged accounts.

Dec 2 2021, 9:34 PM · Security-Team, observability, Security
colewhite added a comment to T295706: Improve TransactionProfiler as replacement for tendril's slow queries.

Which is the one I need :( Is there a way to fix the conflict?

Dec 2 2021, 5:00 PM · Performance-Team-publish, MW-1.38-notes (1.38.0-wmf.9; 2021-11-16), Patch-For-Review, Performance-Team (Radar), Developer Productivity, Wikimedia-Rdbms, DBA, User-Ladsgroup
colewhite added a comment to T295706: Improve TransactionProfiler as replacement for tendril's slow queries.

Change 742923 merged by Ladsgroup:

[operations/puppet@production] logstash: Add maxSeconds and actualSeconds as numeric fields

https://gerrit.wikimedia.org/r/742923

Dec 2 2021, 1:34 AM · Performance-Team-publish, MW-1.38-notes (1.38.0-wmf.9; 2021-11-16), Patch-For-Review, Performance-Team (Radar), Developer Productivity, Wikimedia-Rdbms, DBA, User-Ladsgroup

Dec 1 2021

colewhite closed T294581: Upgrade ECS to 1.11.0 as Resolved.

ECS 1.11.0 is deployed.

Dec 1 2021, 1:11 AM · Patch-For-Review, Observability-Logging, SRE Observability (FY2021/2022-Q2)
colewhite closed T294581: Upgrade ECS to 1.11.0, a subtask of T292881: Mutate mmkubernetes k8s fields into ECS fields, as Resolved.
Dec 1 2021, 1:10 AM · serviceops, Observability-Logging
colewhite added a comment to T123243: Ability to alert when we get a sudden increase in bad passwords for privileged accounts.

We have prometheus-es-exporter available that will turn the result of ES queries into Prometheus metrics. Alertmanager can easily turn these metrics into alerts.
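
The general idea of turning a query result into a metric can be sketched as follows. This is not how prometheus-es-exporter is implemented or configured; the `es_count_to_metric` helper, the metric name, and the label are hypothetical, and the response shape assumes an ES 7 style `hits.total` object (with a fallback for a plain integer).

```python
from typing import Any, Dict


def es_count_to_metric(name: str, labels: Dict[str, str], response: Dict[str, Any]) -> str:
    """Render the hit count of a search response as one Prometheus exposition line."""
    total = response["hits"]["total"]
    # ES 7 reports {"value": N, "relation": ...}; older versions report a bare integer.
    value = total["value"] if isinstance(total, dict) else total
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


if __name__ == "__main__":
    # Hypothetical response to a query counting failed logins for privileged accounts.
    response = {"hits": {"total": {"value": 7, "relation": "eq"}}}
    print(es_count_to_metric("bad_password_logins", {"account_class": "privileged"}, response))
```

An alerting rule can then compare such a metric against a threshold or a rate of change.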

Dec 1 2021, 1:08 AM · Security-Team, observability, Security

Nov 24 2021

colewhite committed rOSEC76fda87b8c8e: upgrade ecs to 1.11.0 (authored by colewhite).
Nov 24 2021, 4:40 PM
colewhite claimed T294581: Upgrade ECS to 1.11.0.
Nov 24 2021, 3:20 PM · Patch-For-Review, Observability-Logging, SRE Observability (FY2021/2022-Q2)

Nov 23 2021

colewhite added a subtask for T295628: linkrecommendation-internal logs appear to be multiline json: T296334: Evaluate usefulness of linkrecommendation printed logging on each request.
Nov 23 2021, 8:24 PM · Add-Link, Growth-Team, observability
colewhite added a parent task for T296334: Evaluate usefulness of linkrecommendation printed logging on each request: T295628: linkrecommendation-internal logs appear to be multiline json.
Nov 23 2021, 8:24 PM · Add-Link, observability, Growth-Team
colewhite moved T296334: Evaluate usefulness of linkrecommendation printed logging on each request from Inbox to Radar on the observability board.
Nov 23 2021, 8:24 PM · Add-Link, observability, Growth-Team
colewhite created T296334: Evaluate usefulness of linkrecommendation printed logging on each request.
Nov 23 2021, 8:24 PM · Add-Link, observability, Growth-Team
colewhite added a comment to T295628: linkrecommendation-internal logs appear to be multiline json.

Thanks @kostajh! I see it reduced the number of linkrecommendation events logged by 84% 🎉

Screenshot from 2021-11-23 20-11-42.png (257×1 px, 31 KB)

Nov 23 2021, 8:16 PM · Add-Link, Growth-Team, observability
colewhite committed rRMWAcff8162dd72b: remove indent parameter (authored by colewhite).
Nov 23 2021, 8:49 AM
colewhite closed T295717: Logstash Kafka Consumer Lag alert firing every hour as Resolved.

This has not recurred and the api-gateway logs are greatly reduced.

Nov 23 2021, 12:42 AM · observability
colewhite closed T295717: Logstash Kafka Consumer Lag alert firing every hour, a subtask of T295935: Average logging ingest has doubled over the last 90 days, as Resolved.
Nov 23 2021, 12:42 AM · Observability-Logging

Nov 22 2021

colewhite moved T282863: Upgrade Grafana to 8.x from Inbox to In progress on the SRE Observability (FY2021/2022-Q2) board.
Nov 22 2021, 9:40 PM · SRE Observability (FY2021/2022-Q3), Performance-Team (Radar)
colewhite renamed T282863: Upgrade Grafana to 8.x from Upgrade Grafana to 8.1 to Upgrade Grafana to 8.x.
Nov 22 2021, 9:36 PM · SRE Observability (FY2021/2022-Q3), Performance-Team (Radar)
colewhite added a comment to T288549: Indexing errors from logs generated by Activator.

Cole I am a bit confused, didn't the patch take care of the field that varies type?

Nov 22 2021, 6:24 PM · Observability-Logging, Machine-Learning-Team
colewhite reopened T288549: Indexing errors from logs generated by Activator as "Open".

failed to parse field [knative_dev/key] of type [text]

is still coming in: https://logstash.wikimedia.org/goto/375bfec5b28ed0614b653f0c49d7f6d4

Nov 22 2021, 6:16 PM · Observability-Logging, Machine-Learning-Team

Nov 20 2021

colewhite committed rOSECdcf8a1e89cd6: add stack.head field for aggregating events by stack head (authored by colewhite).
Nov 20 2021, 2:01 AM

Nov 17 2021

colewhite updated the task description for T240667: Ingestion errors for production logs on ELK7.
Nov 17 2021, 11:46 PM · Observability-Logging, observability, SRE, Wikimedia-Logstash
colewhite created T295944: Citoid: Object mapping for [err.body] tried to parse field [body] as object, but found a concrete value.
Nov 17 2021, 11:45 PM · Observability-Logging
colewhite updated the task description for T240667: Ingestion errors for production logs on ELK7.
Nov 17 2021, 11:39 PM · Observability-Logging, observability, SRE, Wikimedia-Logstash
colewhite updated the task description for T240667: Ingestion errors for production logs on ELK7.
Nov 17 2021, 11:35 PM · Observability-Logging, observability, SRE, Wikimedia-Logstash
colewhite updated the task description for T240667: Ingestion errors for production logs on ELK7.
Nov 17 2021, 11:34 PM · Observability-Logging, observability, SRE, Wikimedia-Logstash
colewhite updated the task description for T240667: Ingestion errors for production logs on ELK7.
Nov 17 2021, 11:32 PM · Observability-Logging, observability, SRE, Wikimedia-Logstash
colewhite updated the task description for T240667: Ingestion errors for production logs on ELK7.
Nov 17 2021, 10:49 PM · Observability-Logging, observability, SRE, Wikimedia-Logstash
colewhite reopened T239458: Mediawiki logging indexing conflict as "Open".
Nov 17 2021, 10:34 PM · MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), Patch-For-Review, SRE Observability, observability, Wikimedia-Logstash, Editing-team (Tracking), MediaWiki-Logevents, VisualEditor, MediaWiki-General
colewhite closed T239458: Mediawiki logging indexing conflict as Resolved.

This error is no longer manifesting. Still an issue, it appears. Waiting on 1.38.0-wmf.9 deployment.

Nov 17 2021, 10:30 PM · MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), Patch-For-Review, SRE Observability, observability, Wikimedia-Logstash, Editing-team (Tracking), MediaWiki-Logevents, VisualEditor, MediaWiki-General
colewhite moved T260667: scap on beta fails canary check: KeyError: 'aggregations' from Backlog to Up next on the Wikimedia-Logstash board.
Nov 17 2021, 10:26 PM · Release-Engineering-Team (Doing), observability, Discovery-Search, Wikimedia-Logstash, Beta-Cluster-Infrastructure
colewhite closed T241485: [_field_stats] endpoint is deprecated! Use [_field_caps] instead or run a min/max aggregations on the desired fields. as Resolved.

I think we can assume this is resolved unless otherwise indicated.

Nov 17 2021, 10:26 PM · SRE Observability, observability, Wikimedia-Logstash, Beta-Cluster-Infrastructure
colewhite added a comment to T260667: scap on beta fails canary check: KeyError: 'aggregations'.

Since we moved to the new beta logging cluster, is this issue still occurring?

Nov 17 2021, 10:25 PM · Release-Engineering-Team (Doing), observability, Discovery-Search, Wikimedia-Logstash, Beta-Cluster-Infrastructure
colewhite closed T288989: beta logstash servers run out of disk space, a subtask of T233134: logstash-beta.wmflabs.org does not receive any mediawiki events, as Resolved.
Nov 17 2021, 10:23 PM · SRE Observability, observability, Wikimedia-Logstash, Beta-Cluster-Infrastructure
colewhite closed T288989: beta logstash servers run out of disk space as Resolved.

Boldly resolving because we have migrated to a new cluster serving beta logs.

Nov 17 2021, 10:23 PM · Release-Engineering-Team (Radar), Beta-Cluster-Infrastructure
colewhite closed T211984: Logstash in beta fails periodically as Resolved.
Nov 17 2021, 10:22 PM · SRE Observability (FY2021/2022-Q2), observability, Beta-Cluster-Infrastructure, Wikimedia-Logstash
colewhite added a subtask for T277816: Improve Logstash's throttling capabilities: T295939: Logstash throttler does not apply to k8s logs.
Nov 17 2021, 10:21 PM · Observability-Logging, observability, Wikimedia-Logstash
colewhite added a parent task for T295939: Logstash throttler does not apply to k8s logs: T277816: Improve Logstash's throttling capabilities.
Nov 17 2021, 10:21 PM · Observability-Logging
colewhite added a parent task for T295939: Logstash throttler does not apply to k8s logs: T295935: Average logging ingest has doubled over the last 90 days.
Nov 17 2021, 10:21 PM · Observability-Logging
colewhite added a subtask for T295935: Average logging ingest has doubled over the last 90 days: T295939: Logstash throttler does not apply to k8s logs.
Nov 17 2021, 10:21 PM · Observability-Logging
colewhite created T295939: Logstash throttler does not apply to k8s logs.
Nov 17 2021, 10:20 PM · Observability-Logging
colewhite updated the task description for T295935: Average logging ingest has doubled over the last 90 days.
Nov 17 2021, 10:05 PM · Observability-Logging
colewhite updated the task description for T295935: Average logging ingest has doubled over the last 90 days.
Nov 17 2021, 10:02 PM · Observability-Logging
colewhite added a subtask for T295935: Average logging ingest has doubled over the last 90 days: T295627: Millions of access log entries per hour from shellbox httpd.
Nov 17 2021, 9:39 PM · Observability-Logging
colewhite added a parent task for T295627: Millions of access log entries per hour from shellbox httpd: T295935: Average logging ingest has doubled over the last 90 days.
Nov 17 2021, 9:39 PM · Shellbox, observability
colewhite added a parent task for T295628: linkrecommendation-internal logs appear to be multiline json: T295935: Average logging ingest has doubled over the last 90 days.
Nov 17 2021, 9:39 PM · Add-Link, Growth-Team, observability
colewhite added a parent task for T295717: Logstash Kafka Consumer Lag alert firing every hour: T295935: Average logging ingest has doubled over the last 90 days.
Nov 17 2021, 9:39 PM · observability
colewhite added subtasks for T295935: Average logging ingest has doubled over the last 90 days: T295628: linkrecommendation-internal logs appear to be multiline json, T295717: Logstash Kafka Consumer Lag alert firing every hour.
Nov 17 2021, 9:39 PM · Observability-Logging
colewhite created T295935: Average logging ingest has doubled over the last 90 days.
Nov 17 2021, 9:38 PM · Observability-Logging
lmata awarded T295717: Logstash Kafka Consumer Lag alert firing every hour a Like token.
Nov 17 2021, 4:17 PM · observability
colewhite added a comment to T295717: Logstash Kafka Consumer Lag alert firing every hour.

Even greater reduction seen this morning:

Screenshot from 2021-11-17 07-58-12.png (274×1 px, 35 KB)

Nov 17 2021, 3:00 PM · observability

Nov 16 2021

colewhite added a comment to T295717: Logstash Kafka Consumer Lag alert firing every hour.

That bump appears to have offloaded around 700k logs per hour?

Screenshot from 2021-11-16 14-17-58.png (258×1 px, 33 KB)

Nov 16 2021, 9:19 PM · observability
colewhite closed T295731: Gitlab Sidekiq mapper parsing exceptions since 2021-11-15@1825 as Resolved.

Reformatting the logs has eliminated the source of mapping errors.

Nov 16 2021, 4:39 PM · GitLab (Infrastructure), Release-Engineering-Team (Radar), Observability-Logging
colewhite added a comment to T295717: Logstash Kafka Consumer Lag alert firing every hour.

The growth team turned down the volume of addLink requests from deployment-prep and this reduced the amount of api-gateway logging significantly.

Screenshot from 2021-11-16 09-17-38.png (266×1 px, 39 KB)

Nov 16 2021, 4:19 PM · observability

Nov 15 2021

colewhite closed T295627: Millions of access log entries per hour from shellbox httpd as Resolved.

Applying the rsyslog filter has reduced the logging volume from ~25k/min to ~300/min.

Nov 15 2021, 11:02 PM · Shellbox, observability
colewhite added a project to T295628: linkrecommendation-internal logs appear to be multiline json: Growth-Team.
Nov 15 2021, 10:57 PM · Add-Link, Growth-Team, observability