herron (Keith Herron)
Ops Engineer

User Details

User Since
May 30 2017, 5:25 PM (147 w, 4 d)
Availability
Available
IRC Nick
herron
LDAP User
Herron
MediaWiki User
Unknown

Recent Activity

Fri, Mar 27

herron added a comment to T248689: populate puppetdb fails for unknown hosts.

Spent some time on IRC with @jbond reproducing this, and indeed puppetdb-populate fails repeatedly for new hosts until a run with an empty manifest populates their facts. Subsequent runs then succeed.

Fri, Mar 27, 6:25 PM · User-jbond, Operations, puppet-compiler
herron created P10809 (An Untitled Masterwork).
Fri, Mar 27, 6:05 PM
herron closed T247376: Logstash: add SSD tier to ELK7 cluster as Resolved.
Fri, Mar 27, 5:22 PM · Wikimedia-Logstash, observability, Operations
herron updated the task description for T247376: Logstash: add SSD tier to ELK7 cluster.
Fri, Mar 27, 5:21 PM · Wikimedia-Logstash, observability, Operations

Thu, Mar 26

herron awarded T245066: Request for creation: les sans pagEs Mailing List a Like token.
Thu, Mar 26, 5:59 PM · Operations, Wikimedia-Mailing-lists

Wed, Mar 25

herron updated the task description for T247376: Logstash: add SSD tier to ELK7 cluster.
Wed, Mar 25, 8:40 PM · Wikimedia-Logstash, observability, Operations
herron updated the task description for T247376: Logstash: add SSD tier to ELK7 cluster.
Wed, Mar 25, 8:17 PM · Wikimedia-Logstash, observability, Operations

Tue, Mar 24

herron added a comment to T248400: elk7: fields indexed without position data; cannot run PhraseQuery.

In my testing simply removing instances of "index_options":"docs" from the logstash template addresses the issue, please see https://gerrit.wikimedia.org/r/583112

Tue, Mar 24, 5:31 PM · Patch-For-Review, Operations, Wikimedia-Logstash
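
The fix described in the comment above (dropping "index_options":"docs" so that text fields are indexed with position data again) can be sketched in Python. This is only an illustration under stated assumptions: the real change is the Gerrit patch linked in the comment, and the `template` dictionary and the `strip_index_options` helper below are hypothetical, not the actual Wikimedia logstash template or tooling.

```python
import json


def strip_index_options(node, value="docs"):
    """Return a copy of `node` with every "index_options": value pair removed."""
    if isinstance(node, dict):
        return {
            key: strip_index_options(child, value)
            for key, child in node.items()
            if not (key == "index_options" and child == value)
        }
    if isinstance(node, list):
        return [strip_index_options(child, value) for child in node]
    return node


# Hypothetical template fragment, NOT the real logstash template.
template = {
    "mappings": {
        "properties": {
            "message": {"type": "text", "index_options": "docs"},
            "host": {"type": "keyword"},
        }
    }
}

cleaned = strip_index_options(template)
print(json.dumps(cleaned, indent=2))
```

The helper returns a new structure rather than editing in place, so the original template stays available for comparison. Existing indices would presumably still need to be recreated or reindexed to pick up the changed mapping.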
herron added a project to T248400: elk7: fields indexed without position data; cannot run PhraseQuery: Patch-For-Review.
Tue, Mar 24, 5:28 PM · Patch-For-Review, Operations, Wikimedia-Logstash
herron updated the task description for T248400: elk7: fields indexed without position data; cannot run PhraseQuery.
Tue, Mar 24, 5:28 PM · Patch-For-Review, Operations, Wikimedia-Logstash
herron created T248400: elk7: fields indexed without position data; cannot run PhraseQuery.
Tue, Mar 24, 5:21 PM · Patch-For-Review, Operations, Wikimedia-Logstash

Mon, Mar 23

herron closed T235550: Rename multimedia-team to structured-data-team as Resolved.

Hello! The multimedia-team list has been renamed to structured-data-team, and redirects/forwarding have been put into place. I'll transition this to resolved as a soft close, but please re-open if any follow up is needed. Thanks!

Mon, Mar 23, 5:27 PM · Patch-For-Review, Wikimedia-Mailing-lists, Operations

Wed, Mar 18

herron added a comment to T247538: Icinga latency is skyrocketing and commands ignored.

Another low-hanging fruit, I think, is to reduce the frequency of the SSH check for the mgmts. It currently runs every minute, but unavailability of the mgmt sshd has no end-user impact, so checking them hourly should be good enough? That would slash the 30087 checks from above by a lot.

Wed, Mar 18, 3:57 PM · User-fgiunchedi, Patch-For-Review, fundraising-tech-ops, observability, Operations

Tue, Mar 17

herron updated the task description for T247376: Logstash: add SSD tier to ELK7 cluster.
Tue, Mar 17, 5:30 PM · Wikimedia-Logstash, observability, Operations
herron updated the task description for T247376: Logstash: add SSD tier to ELK7 cluster.
Tue, Mar 17, 5:30 PM · Wikimedia-Logstash, observability, Operations
herron updated the task description for T247376: Logstash: add SSD tier to ELK7 cluster.
Tue, Mar 17, 5:23 PM · Wikimedia-Logstash, observability, Operations

Thu, Mar 12

herron updated the task description for T247376: Logstash: add SSD tier to ELK7 cluster.
Thu, Mar 12, 5:13 PM · Wikimedia-Logstash, observability, Operations
herron added a comment to T247538: Icinga latency is skyrocketing and commands ignored.

https://gerrit.wikimedia.org/r/579329 seems like low hanging fruit that could help reduce load

Thu, Mar 12, 5:05 PM · User-fgiunchedi, Patch-For-Review, fundraising-tech-ops, observability, Operations

Wed, Mar 11

herron closed T240881: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet as Resolved.

Thanks @Cmjohnson! Will resolve this and track service setup in T247376

Wed, Mar 11, 12:28 AM · Operations, Wikimedia-Logstash
herron updated the task description for T247376: Logstash: add SSD tier to ELK7 cluster.
Wed, Mar 11, 12:24 AM · Wikimedia-Logstash, observability, Operations
herron closed T240882: (Need by: TBD) rack/setup/install logstash202[6-9].codfw.wmnet as Resolved.

@herron is it possible to create another task to track this down and close the racking and setup task?
thanks.

Wed, Mar 11, 12:18 AM · Patch-For-Review, Operations, ops-codfw, Wikimedia-Logstash
herron triaged T247376: Logstash: add SSD tier to ELK7 cluster as Medium priority.
Wed, Mar 11, 12:16 AM · Wikimedia-Logstash, observability, Operations

Tue, Mar 10

herron added a comment to T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726".

@EBernhardson thanks!

Tue, Mar 10, 9:26 PM · Patch-For-Review, Operations, Wikimedia-Logstash

Fri, Mar 6

herron added a comment to T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726".

I hope you don't mind, I started up a test on logstash-next with logstash-2020.02.20 recreated as ebernhardson-size-test, where I think I've adjusted all text fields to have index_options: docs and copy_to: all, along with the all field defined as only a text field with standard indexing. A reindex is running to copy all the docs over so we can see how the sizes differ with a little less guessing on my part. It hasn't finished yet, but with 2M docs indexed it looks like the change might only be from ~800 bytes/doc to ~950 bytes/doc.

Fri, Mar 6, 2:19 AM · Patch-For-Review, Operations, Wikimedia-Logstash
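
To put the size estimate in the comment above into relative terms, here is a small worked calculation. It assumes only the two approximate figures quoted there (~800 and ~950 bytes/doc, taken from a reindex that had not finished), so the result is a rough indication rather than a measurement.

```python
def relative_increase(before, after):
    """Fractional growth from `before` to `after`."""
    return (after - before) / before


# Approximate per-document sizes quoted in the comment (partial reindex).
bytes_before = 800
bytes_after = 950

growth = relative_increase(bytes_before, bytes_after)
print(f"extra bytes per doc: {bytes_after - bytes_before}")
print(f"relative increase: {growth:.2%}")
```

That is 150 extra bytes per document, or about 18.75% growth in index size per document if the partial numbers hold.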

Thu, Mar 5

herron updated subscribers of T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726".

Thank you @Gehel @EBernhardson @dcausse @colewhite for looking at this!

Thu, Mar 5, 9:42 PM · Patch-For-Review, Operations, Wikimedia-Logstash
herron added a comment to T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726".

Additionally, before moving on to max_clause_count I had experimented with settings like "default_field": "*" and "default_field": "message" in Kibana config query:queryString:options but the errors persisted.

Thu, Mar 5, 6:57 PM · Patch-For-Review, Operations, Wikimedia-Logstash
herron added a comment to T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726".

In testing I was able to work around this by increasing indices.query.bool.max_clause_count to a value greater than the number of fields matched. This comes with some tradeoffs wrt resource utilization, but it does resolve the issue.

Thu, Mar 5, 6:37 PM · Patch-For-Review, Operations, Wikimedia-Logstash
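
The workaround in the comment above depends on how many fields a query expands to compared with indices.query.bool.max_clause_count. A minimal sketch of that comparison follows, assuming a hypothetical mapping and a simplified counting rule (leaf fields plus multi-fields); Elasticsearch's own field expansion may count differently, and `count_leaf_fields` and `exceeds_limit` are illustrative helpers, not part of any library.

```python
def count_leaf_fields(properties):
    """Count leaf fields and multi-fields in a mapping "properties" tree.

    Simplifying assumption: object fields are not counted themselves,
    only their children and any multi-fields declared under "fields".
    """
    total = 0
    for spec in properties.values():
        children = spec.get("properties")
        if children:
            total += count_leaf_fields(children)
        else:
            total += 1
        total += count_leaf_fields(spec.get("fields", {}))
    return total


def exceeds_limit(field_count, max_clause_count=1024):
    """True when a query expanding to `field_count` fields would hit the limit."""
    return field_count > max_clause_count


# Hypothetical mapping fragment used only for illustration.
sample_properties = {
    "message": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "host": {"type": "keyword"},
    "http": {"properties": {"status": {"type": "long"}, "url": {"type": "text"}}},
}

print(count_leaf_fields(sample_properties))
print(exceeds_limit(1726))  # figures from the task title: limit 1024, got 1726
```

With the numbers from the task title, 1726 matched fields exceed the limit of 1024, which is why raising the setting above the matched field count makes the error go away, at the resource cost the comment mentions.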
herron created T247014: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726".
Thu, Mar 5, 6:35 PM · Patch-For-Review, Operations, Wikimedia-Logstash
herron updated the task description for T234854: Upgrade ELK Stack.
Thu, Mar 5, 5:37 PM · Operations, Wikimedia-Logstash

Feb 24 2020

herron added a comment to T240881: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet.

Thank you!

Feb 24 2020, 4:04 PM · Operations, Wikimedia-Logstash
herron added a comment to T240881: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet.

@herron - is there a specific date that you need these by? We can adjust our priorities and the need by date of this task to meet that.

Feb 24 2020, 3:49 PM · Operations, Wikimedia-Logstash
herron updated subscribers of T240881: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet.

Hi @wiki_willy, do you know what the ETA is for these hosts?

Feb 24 2020, 2:30 PM · Operations, Wikimedia-Logstash

Feb 21 2020

herron added a comment to T244472: Stream a subset of mediawiki apache logs to logstash.

Looking a bit closer I think this is happening because the nodes in labs are assigned their roles/profiles/etc via the external node classifier in horizon, which isn't making the call to role() as we do in prod and so $::_role isn't set in the process.

Feb 21 2020, 8:09 PM · Patch-For-Review, Beta-Cluster-Infrastructure, Operations, serviceops, observability

Feb 20 2020

herron added a comment to T245778: Spike in "Use of ResourceLoaderSkinModule::getAvailableLogos with $wgLogoHD set instead of $wgLogos was deprecated in MediaWiki 1.35.".

Similar to recent issue T245725

Feb 20 2020, 8:52 PM · Wikimedia-production-error, Wikimedia-Logstash, MediaWiki-General
herron created T245778: Spike in "Use of ResourceLoaderSkinModule::getAvailableLogos with $wgLogoHD set instead of $wgLogos was deprecated in MediaWiki 1.35.".
Feb 20 2020, 8:51 PM · Wikimedia-production-error, Wikimedia-Logstash, MediaWiki-General

Feb 19 2020

herron added a comment to T244472: Stream a subset of mediawiki apache logs to logstash.

I've cherry picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/571239/ on deployment-puppetmaster04.deployment-prep.eqiad.wmflabs (and made a minor change in patchset 9, since logstash was complaining about the quotes). The config loads ok in logstash.

Feb 19 2020, 4:18 PM · Patch-For-Review, Beta-Cluster-Infrastructure, Operations, serviceops, observability

Feb 14 2020

herron added a comment to T244472: Stream a subset of mediawiki apache logs to logstash.

Learned today that T243226 is tracking the current beta cluster puppetmaster issues

Feb 14 2020, 5:22 PM · Patch-For-Review, Beta-Cluster-Infrastructure, Operations, serviceops, observability
herron added a comment to T243226: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster).

Turning off debug logging on puppetmaster04 (I set logdest = /dev/null in /etc/puppet/puppet.conf) has helped with the disk usage, and sluggishness issues. But sadly puppet runs are currently failing with:

Feb 14 2020, 5:20 PM · Operations, Beta-Cluster-Infrastructure

Feb 13 2020

herron added a comment to T244472: Stream a subset of mediawiki apache logs to logstash.

Fwiw I do see logs flowing into logstash-beta generally, but puppet was broken in the beta cluster because the master filled its disk. Puppet master on deployment-puppetmaster04.deployment-prep.eqiad.wmflabs seems to be logging at debug level, making puppet runs super slow and rapidly filling the disk. I don't have time at the moment, but if still broken in the morning I'll take a closer look.

Feb 13 2020, 10:30 PM · Patch-For-Review, Beta-Cluster-Infrastructure, Operations, serviceops, observability
herron added a comment to T244472: Stream a subset of mediawiki apache logs to logstash.

Hey @jijiki, usually to test/validate filters like this I'll cherry pick or live-hack the logstash config on the beta cluster and generate the desired traffic there to see how logstash behaves. There are some details at https://wikitech.wikimedia.org/wiki/Logstash#Beta_Cluster_Logstash

Feb 13 2020, 6:09 PM · Patch-For-Review, Beta-Cluster-Infrastructure, Operations, serviceops, observability

Jan 15 2020

herron reassigned T239732: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet from Papaul to jbond.

@herron thanks, in that case you can just add the server to site.pp with the role (spare::system) and assign the task to @jbond

Jan 15 2020, 9:03 PM · User-jbond, Operations, ops-codfw
herron added a comment to T239732: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet.

Hey @Papaul, I don't think there is any specific urgency to this and it can wait until he's back, but if it needs to go sooner I could work on it.

Jan 15 2020, 8:36 PM · User-jbond, Operations, ops-codfw
herron created T242885: Expand Eqiad Ganeti row_A capacity.
Jan 15 2020, 4:30 PM · hardware-requests, Operations

Jan 14 2020

herron added a comment to T242770: Logstash for MediaWiki is down in Beta Cluster.

This should be fixed now.

Jan 14 2020, 5:37 PM · observability, Wikimedia-Logstash, Beta-Cluster-Infrastructure

Jan 8 2020

herron added a comment to T240906: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2.

I'd like to rule out possible hardware issues by migrating this VM to another Ganeti host, and seeing if that makes any improvement.

Jan 8 2020, 4:17 PM · Operations, Mail
herron added a comment to T228924: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet.

The row_A ganeti group is running low on memory capacity (please see T239151#5707691). Should we allocate a few of these new hosts to expand the existing row_A ganeti group?

Jan 8 2020, 4:10 PM · Patch-For-Review, serviceops, Operations

Jan 7 2020

herron added a comment to T228099: rack/setup/install ganeti500[123].eqsin.wmnet.

The eqsin ganeti cluster is now up and running, and a first VM netflow5001 has been created.

Jan 7 2020, 10:38 PM · Operations
herron updated the task description for T228099: rack/setup/install ganeti500[123].eqsin.wmnet.
Jan 7 2020, 10:35 PM · Operations
herron added a comment to T239151: Gerrit VM to test data migration.

ganeti-test.wikimedia.org VM has been created on row_C, and I've uploaded a patch to assign it role::gerrit with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/562587/

Jan 7 2020, 7:22 PM · Gerrit, vm-requests, Operations

Jan 3 2020

herron added a comment to T240906: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2.

There have been multiple mx1001 issues lately (even if that is unreliable, it is worth noting). My suggestion would be, at least initially, to detect the same issue, if real, on icinga.

Jan 3 2020, 11:00 PM · Operations, Mail
herron triaged T240341: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org as Medium priority.
Jan 3 2020, 8:33 PM · Traffic, Operations, DNS
herron triaged T240495: investigate making 'notrack' the default on our ferm rules as Medium priority.
Jan 3 2020, 8:33 PM · Operations
herron triaged T240824: PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /var/www/php-monitoring/lib.php on line 35 as Medium priority.
Jan 3 2020, 7:45 PM · serviceops, Operations
herron triaged T240843: Track services without a native systemd unit as Medium priority.
Jan 3 2020, 7:44 PM · Operations
herron triaged T241309: Add more detailed instructions to the "sec-advice" page as Medium priority.
Jan 3 2020, 7:44 PM · Traffic, Operations
herron triaged T241494: Degraded RAID on cloudvirt1014 as High priority.
Jan 3 2020, 7:44 PM · Patch-For-Review, cloud-services-team (Hardware), ops-eqiad, Operations
herron triaged T241719: Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 as Medium priority.
Jan 3 2020, 7:43 PM · cloud-services-team (Kanban), Operations
herron triaged T241838: Requesting access to EventLogging data for knissen as Medium priority.
Jan 3 2020, 7:43 PM · SRE-Access-Requests, Operations
herron added a comment to T241096: Requesting access to analytics-privatedata-users and researchers for Aroraakhil.

Hi @Nuria, a friendly ping/bump for approval on this. Happy new year!

Jan 3 2020, 7:42 PM · Operations, SRE-Access-Requests, Research
herron moved T241838: Requesting access to EventLogging data for knissen from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Jan 3 2020, 7:39 PM · SRE-Access-Requests, Operations
herron updated the task description for T241838: Requesting access to EventLogging data for knissen.
Jan 3 2020, 7:38 PM · SRE-Access-Requests, Operations
herron added a comment to T240250: Convert the existing access request documentation into a Phab template.

I'd like to edit the form but don't currently have permission. Primarily I'd like to add the clinic duty checklist and clarify a few prerequisites for the requestor to complete. These are things that we currently do manually via back-and-forth comments. Adding them to the template should save time on every request. I'd like to update the template like so:

Jan 3 2020, 7:31 PM · WMF-CTO-Team-Backlog, Product-Analytics
herron updated the task description for T241838: Requesting access to EventLogging data for knissen.
Jan 3 2020, 7:14 PM · SRE-Access-Requests, Operations
herron added a comment to T241722: NDA for Superset Request from WMDE Employee - Kris Litson.

Thanks for the update @Kris_Litson_WMDE

Jan 3 2020, 4:36 PM · LDAP-Access-Requests, Operations
herron placed T240250: Convert the existing access request documentation into a Phab template up for grabs.
Jan 3 2020, 4:16 PM · WMF-CTO-Team-Backlog, Product-Analytics

Jan 2 2020

herron reassigned T240929: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman from herron to jcrespo.

Sounds good @jcrespo, please pass back to me when you've received the export and uploaded it to the mailman host and I'll see what I can do to import. Thanks!

Jan 2 2020, 3:46 PM · Operations, Wikimedia-Mailing-lists
herron moved T241722: NDA for Superset Request from WMDE Employee - Kris Litson from Backlog to NDA Pending on the LDAP-Access-Requests board.
Jan 2 2020, 3:39 PM · LDAP-Access-Requests, Operations
herron updated subscribers of T241722: NDA for Superset Request from WMDE Employee - Kris Litson.

Hello! Looping in @RStallman-legalteam to coordinate getting your NDA on file.

Jan 2 2020, 3:36 PM · LDAP-Access-Requests, Operations
herron removed a project from T223463: (2019-09) Create secteam groups in admin.yaml and define permissions: SRE-Access-Requests.

Removing the SRE-Access-Requests project tag for now. Please update and re-add if/when any further action is needed. Thanks!

Jan 2 2020, 2:55 PM · Operations, Security-Team, Patch-For-Review

Dec 19 2019

herron added a comment to T241166: Sync new ganeti clusters with netbox.

esams and ulsfo are online now, and eqsin should be shortly. Not sure if it's best to do all at once, or per-site, but wanted to get a task created to keep tabs on it.

Dec 19 2019, 7:04 PM · Operations, netbox
herron triaged T241166: Sync new ganeti clusters with netbox as Medium priority.
Dec 19 2019, 7:03 PM · Operations, netbox
herron updated subscribers of T236216: rack/setup/install ganeti300[123].

The esams ganeti cluster is now up and running, and netflow3001 has been created there as a first VM.

Dec 19 2019, 6:16 PM · Operations, ops-esams
herron added a comment to T236216: rack/setup/install ganeti300[123].

These hosts have been reimaged with buster, certs created, and patches uploaded to enable ganeti.

Dec 19 2019, 5:58 AM · Operations, ops-esams
herron committed rLPRI8cfae6c0f64b: add dummy esams and eqsin ganeti keys to pacify PCC (authored by herron).
add dummy esams and eqsin ganeti keys to pacify PCC
Dec 19 2019, 5:08 AM
herron added a comment to T228099: rack/setup/install ganeti500[123].eqsin.wmnet.

Hey @RobH, T229243 is encouraging. How are these hosts looking now?

Dec 19 2019, 4:56 AM · Operations
herron added a comment to T226444: rack/setup/install ganeti400[123].

Actually, since netflow4001 is not yet puppetized, the instance has been shut down. https://gerrit.wikimedia.org/r/559330 should unblock the first puppet run, and the instance can be restarted after it's merged.

Dec 19 2019, 4:50 AM · Traffic, Operations
herron added a comment to T226444: rack/setup/install ganeti400[123].

For sure, but it's a work in progress currently. Basically I'd like a sanity check that the manual steps make sense and aren't already automated, or better handled, in a way that I'm not aware of.

Dec 19 2019, 3:43 AM · Traffic, Operations
herron changed the status of T226444: rack/setup/install ganeti400[123] from Stalled to Open.

The ulsfo buster ganeti cluster is up and running now, and netflow4001 has been created there as a first VM.

Dec 19 2019, 2:44 AM · Traffic, Operations

Dec 16 2019

herron added a comment to T240906: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2.

Looked into these alerts a bit, and pulled the source IP addresses for these checks from watchmouse, but I don't see these IPs appearing in the mx logs. I think it is because the exim mx logs are not currently detailed enough. So I'll make the logs a bit more verbose and review again after more log information has been gathered.

Dec 16 2019, 9:58 PM · Operations, Mail
herron triaged T240906: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 as Medium priority.
Dec 16 2019, 9:52 PM · Operations, Mail
herron committed rLPRIe3dbfc3a2e5c: add dummy ulsfo ganeti RAPI key to pacify PCC (authored by herron).
add dummy ulsfo ganeti RAPI key to pacify PCC
Dec 16 2019, 7:57 PM
herron added a comment to T233134: logstash-beta.wmflabs.org does not receive any mediawiki events.

Yes, we will need a second logstash stretch instance, and to migrate the Kafka broker ID from deployment-logstash2 to the new host.

Dec 16 2019, 3:40 PM · Release-Engineering-Team-TODO, observability, Wikimedia-Logstash, Beta-Cluster-Infrastructure

Dec 10 2019

herron added a comment to T234854: Upgrade ELK Stack.

@elukey hey, yes that's been fixed by making a newer version of curator available to the new clusters. Haven't seen cron errors from these since Dec 5. Thanks for cleaning up the "config does not exist" entries!

Dec 10 2019, 12:27 PM · Operations, Wikimedia-Logstash

Dec 5 2019

herron added a comment to T233134: logstash-beta.wmflabs.org does not receive any mediawiki events.

Looking more closely the problem was due to a Broker: Leader not available issue in the deployment-prep kafka logging cluster. After starting deployment-logstash2 back up (the instance had been stopped) logs are flowing again. Longer term we'll likely need another logstash stretch instance and to migrate over the broker id from deployment-logstash2 to the new instance.

Dec 5 2019, 5:33 PM · Release-Engineering-Team-TODO, observability, Wikimedia-Logstash, Beta-Cluster-Infrastructure
herron added a comment to T233134: logstash-beta.wmflabs.org does not receive any mediawiki events.

Seeing rsyslog complaining about "omkafka: kafka delivery FAIL" on deployment-prep hosts.

Dec 5 2019, 2:57 PM · Release-Engineering-Team-TODO, observability, Wikimedia-Logstash, Beta-Cluster-Infrastructure

Dec 4 2019

herron updated the task description for T234854: Upgrade ELK Stack.
Dec 4 2019, 3:35 PM · Operations, Wikimedia-Logstash

Nov 26 2019

herron updated subscribers of T239121: VE edit data stopped due to statsv falling over (?) on webperf1001.
Nov 26 2019, 7:39 PM · Performance-Team (Radar), observability, Analytics, Editing-team
herron added a comment to T226444: rack/setup/install ganeti400[123].

Ok, for my own edification, how would the private only LVS model work if we wanted to stand up a public facing non HTTP(S) service in a VM at one+ of these sites?

Nov 26 2019, 3:13 PM · Traffic, Operations
herron added a comment to T226444: rack/setup/install ganeti400[123].

Will this Ganeti cluster use vlan tagged interfaces, or will separate physical interfaces connect to both public and private vlans? If tagging, are the switchports configured for that yet?

Nov 26 2019, 2:55 PM · Traffic, Operations

Nov 19 2019

herron added a comment to T237587: Determine & implement near-term method for escalating network alerts.

Friendly ping to @Volans about @fgiunchedi question above

Nov 19 2019, 9:12 PM · Patch-For-Review, Operations, netops, observability
herron added a comment to T230492: Requesting SRE permissions to create Gerrit projects under operations/debs.

Thanks for the ping, I missed the question. Sure, being added to the Gerrit Manager would work for me!

Nov 19 2019, 6:08 PM · Gerrit-Privilege-Requests

Nov 15 2019

herron added a comment to T238416: Logstash doesn't parse ulogd source and destination ports.

https://gerrit.wikimedia.org/r/551270 should do the trick for source/dest ports. I don't recall why these weren't parsed out in the first place. While we're at it, would any of the other parts of the ulogd/iptables events be useful as fields?

Nov 15 2019, 9:32 PM · Operations, observability
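
As an illustration of what parsing source and destination ports out of such events involves, here is a minimal Python sketch. It is not the Gerrit change referenced above (which lives in the Logstash configuration); the sample line assumes the common iptables-style KEY=value layout, and the output field names `source_port` and `destination_port` are hypothetical.

```python
import re

# Matches the SPT= and DPT= tokens found in iptables-style log lines.
PORT_PATTERN = re.compile(r"\b(SPT|DPT)=(\d+)\b")


def extract_ports(line):
    """Return source/destination ports found in `line`, or None where absent."""
    ports = {name: int(value) for name, value in PORT_PATTERN.findall(line)}
    return {
        "source_port": ports.get("SPT"),
        "destination_port": ports.get("DPT"),
    }


# Hypothetical sample event using documentation-range IP addresses.
sample = "IN=eth0 OUT= SRC=192.0.2.10 DST=198.51.100.7 PROTO=TCP SPT=51514 DPT=443"
print(extract_ports(sample))
```

Other KEY=value parts of the event (for example SRC, DST, or PROTO in the sample) could be pulled out the same way if they turn out to be useful as fields.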

Nov 13 2019

herron added a comment to T235891: Ingest production logs with ELK7.

re: bridging the gap with non-kafka inputs, my current thinking is to output all logs with deprecated-input tag back into kafka-logging on a separate topic and consume that from the new cluster. cc @herron @colewhite

Nov 13 2019, 3:54 PM · User-fgiunchedi, Patch-For-Review, Operations, Wikimedia-Logstash

Nov 8 2019

herron updated the task description for T230236: De-noise ipsec alerts (Reduce Icinga alert noise goal).
Nov 8 2019, 9:22 PM · User-herron, Goal, observability
herron closed T230236: De-noise ipsec alerts (Reduce Icinga alert noise goal), a subtask of T228878: Reduce Icinga alert noise, as Resolved.
Nov 8 2019, 9:22 PM · User-fgiunchedi, Goal, observability
herron closed T230236: De-noise ipsec alerts (Reduce Icinga alert noise goal) as Resolved.

https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status probably needs some cleanup (some of the graphs are empty, there's a note there to ignore icinga errors, etc). Also fix missing doc link on the alert?

Nov 8 2019, 9:22 PM · User-herron, Goal, observability

Nov 7 2019

herron added a comment to T236497: cp3056 hardware issue.

Sorry I missed that you already had a patch! But in any case, we only need commenting from cache::nodes to fix up this case (there's no good reason to e.g. churn it out of conftool or the various iptables rules defined from the other stuff).

Nov 7 2019, 4:25 PM · DC-Ops, ops-esams, Operations, Traffic
herron added a comment to T236497: cp3056 hardware issue.

Since it looks like cp3056 might be down for some time could we remove it from the config until fixed? It would be good to let the ipsec checks in icinga return to green.

Nov 7 2019, 3:25 PM · DC-Ops, ops-esams, Operations, Traffic

Nov 6 2019

herron added a comment to T237587: Determine & implement near-term method for escalating network alerts.

In terms of “what” should be escalated, so far we discussed

Nov 6 2019, 10:43 PM · Patch-For-Review, Operations, netops, observability
herron triaged T237587: Determine & implement near-term method for escalating network alerts as Medium priority.
Nov 6 2019, 10:37 PM · Patch-For-Review, Operations, netops, observability