
herron (Keith Herron)
Ops Engineer

User Details

User Since
May 30 2017, 5:25 PM (111 w, 3 d)
Availability
Available
IRC Nick
herron
LDAP User
Herron
MediaWiki User
Unknown

Recent Activity

Wed, Jul 3

herron updated the task description for T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345].
Wed, Jul 3, 3:42 PM · Patch-For-Review, Services (watching), Core Platform Team Backlog (Watching / External), EventBus, Analytics, User-herron, Operations

Tue, Jul 2

herron added a comment to T227065: Move icinga alarm for the EventStreams external endpoint to SRE.

+1 from me as well

Tue, Jul 2, 1:31 PM · Analytics-Kanban, Wikimedia-Incident, Analytics, Operations
herron awarded T227065: Move icinga alarm for the EventStreams external endpoint to SRE a Like token.
Tue, Jul 2, 1:31 PM · Analytics-Kanban, Wikimedia-Incident, Analytics, Operations

Mon, Jul 1

herron added a project to T226884: Some emails coming from Gerrit are being tagged as suspicious by Gmail: cloud-services-team.

Thanks, based on these headers it looks like an @tools.wmflabs.org alias which points to an @gmail.com address is being used as the email in gerrit.

Mon, Jul 1, 1:44 PM · cloud-services-team, Mail, Gerrit
herron added a comment to T226884: Some emails coming from Gerrit are being tagged as suspicious by Gmail.

Could you please include the headers from an affected message? Thanks in advance!

Mon, Jul 1, 1:05 PM · cloud-services-team, Mail, Gerrit

Wed, Jun 26

herron updated the task description for T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345].
Wed, Jun 26, 6:48 PM · Patch-For-Review, Services (watching), Core Platform Team Backlog (Watching / External), EventBus, Analytics, User-herron, Operations

Tue, Jun 25

herron updated the task description for T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345].
Tue, Jun 25, 9:03 PM · Patch-For-Review, Services (watching), Core Platform Team Backlog (Watching / External), EventBus, Analytics, User-herron, Operations
herron updated the task description for T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345].
Tue, Jun 25, 5:01 PM · Patch-For-Review, Services (watching), Core Platform Team Backlog (Watching / External), EventBus, Analytics, User-herron, Operations

Fri, Jun 21

herron reassigned T226274: (Need By: June 30) rack/setup/install kafka-main100[1-5] from herron to Cmjohnson.
Fri, Jun 21, 7:58 PM · User-herron, Operations
herron added a comment to T226274: (Need By: June 30) rack/setup/install kafka-main100[1-5].

These will need internal vlan/ips. Fwiw kafka-main100[1-5] will be replacing kafka100[123], so those existing hosts could be used as a template.

Fri, Jun 21, 7:58 PM · User-herron, Operations
herron assigned T226274: (Need By: June 30) rack/setup/install kafka-main100[1-5] to RobH.
Fri, Jun 21, 5:19 PM · User-herron, Operations
herron added a subtask for T226274: (Need By: June 30) rack/setup/install kafka-main100[1-5]: Unknown Object (Task).
Fri, Jun 21, 5:14 PM · User-herron, Operations
herron created T226274: (Need By: June 30) rack/setup/install kafka-main100[1-5].
Fri, Jun 21, 5:13 PM · User-herron, Operations

Thu, Jun 20

herron awarded T224128: Migrate network device syslogs to Kafka logging pipeline a Party Time token.
Thu, Jun 20, 1:55 PM · Patch-For-Review, User-herron, Operations, netops, Wikimedia-Logstash

Jun 5 2019

Restricted Application added a project to T225129: Move eventgate logs to new logging infrastructure: Analytics.
Jun 5 2019, 5:57 PM · Analytics, EventBus, Operations, Wikimedia-Logstash
herron added a subtask for T225125: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline: T211125: Move service-runner to new logging infrastructure.
Jun 5 2019, 5:56 PM · Elasticsearch, Operations, Wikimedia-Logstash, Discovery-Search
herron added a parent task for T211125: Move service-runner to new logging infrastructure: T225125: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline.
Jun 5 2019, 5:56 PM · Core Platform Team Backlog (Watching / External), Patch-For-Review, service-runner, Wikimedia-Logstash, Operations
Restricted Application added a project to T225125: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline: Discovery-Search.
Jun 5 2019, 5:33 PM · Elasticsearch, Operations, Wikimedia-Logstash, Discovery-Search
herron added a subtask for T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline: T211125: Move service-runner to new logging infrastructure.
Jun 5 2019, 5:22 PM · Operations, Wikimedia-Logstash
herron added a parent task for T211125: Move service-runner to new logging infrastructure: T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline.
Jun 5 2019, 5:22 PM · Core Platform Team Backlog (Watching / External), Patch-For-Review, service-runner, Wikimedia-Logstash, Operations
herron triaged T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline as Normal priority.
Jun 5 2019, 5:22 PM · Operations, Wikimedia-Logstash
herron added a comment to T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345].

Here's a first shot at per-host replacement steps for kafka2003 -> kafka-main2003:

Jun 5 2019, 3:35 PM · Patch-For-Review, Services (watching), Core Platform Team Backlog (Watching / External), EventBus, Analytics, User-herron, Operations

Jun 4 2019

herron added a comment to T224128: Migrate network device syslogs to Kafka logging pipeline.

Before moving production logs to this I think we should decide on some cnames for the service, so we avoid needing to reconfigure clients if/when the backend syslog hosts change.

Jun 4 2019, 7:41 PM · Patch-For-Review, User-herron, Operations, netops, Wikimedia-Logstash
herron added a comment to T224128: Migrate network device syslogs to Kafka logging pipeline.

A syslog UDP listener on port 10514 is now running on lithium/wezen, and forwarding messages received to the Kafka logging pipeline.

Jun 4 2019, 7:35 PM · Patch-For-Review, User-herron, Operations, netops, Wikimedia-Logstash
herron awarded T221212: spicerack/cookbook: add additional arguments IRC/SAL logging a Like token.
Jun 4 2019, 6:21 PM · Patch-For-Review, SRE-tools, Operations
herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

Tracking service implementation in T225005

Jun 4 2019, 5:32 PM · ops-codfw, Operations
herron added a comment to T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345].

Fwiw kafka2001 is the current controller so thinking we should start with kafka2003 -> kafka-main2003

Jun 4 2019, 5:04 PM · Patch-For-Review, Services (watching), Core Platform Team Backlog (Watching / External), EventBus, Analytics, User-herron, Operations
herron triaged T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] as Normal priority.
Jun 4 2019, 5:01 PM · Patch-For-Review, Services (watching), Core Platform Team Backlog (Watching / External), EventBus, Analytics, User-herron, Operations
herron closed T223493: rack/setup/install kafka-main200[1-5] as Resolved.
Jun 4 2019, 4:23 PM · ops-codfw, Operations
herron updated the task description for T223493: rack/setup/install kafka-main200[1-5].
Jun 4 2019, 4:23 PM · ops-codfw, Operations
herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

That did the trick! All of the new codfw kafka-main hosts are now installed and ready for service setup

Jun 4 2019, 4:23 PM · ops-codfw, Operations
herron updated the task description for T223493: rack/setup/install kafka-main200[1-5].
Jun 4 2019, 4:22 PM · ops-codfw, Operations
herron added a comment to T187147: Port mediawiki/php/wmerrors to PHP7 and deploy.

Long JSON messages to ELK are being truncated since T187147#5182892 which addresses "Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated."

Jun 4 2019, 3:32 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
herron updated the task description for T187147: Port mediawiki/php/wmerrors to PHP7 and deploy.
Jun 4 2019, 3:19 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
herron reassigned T223493: rack/setup/install kafka-main200[1-5] from herron to Papaul.

kafka-main200[123] and kafka-main2005 are installed, have had the initial puppet run applied, and are now marked "staged" in netbox.

Jun 4 2019, 2:59 AM · ops-codfw, Operations

Jun 3 2019

herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

Ok, thanks!

Jun 3 2019, 3:22 PM · ops-codfw, Operations
herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

I would expect though that the DHCP requests would make it to the install servers, with or without entries in the dhcp config file.

Jun 3 2019, 2:59 PM · ops-codfw, Operations
herron reassigned T223493: rack/setup/install kafka-main200[1-5] from herron to Papaul.

Today I tried to perform OS installs on kafka-main200[345] but was not seeing DHCP requests from these hosts make it to the installNNNN hosts yet.

Jun 3 2019, 1:55 AM · ops-codfw, Operations

May 31 2019

herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

@Papaul could you have a look at kafka-main2002? It seems to be stuck, at least I'm not able to open a console or power cycle.

May 31 2019, 5:53 PM · ops-codfw, Operations
herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

I did some testing of various software and hardware raid configurations and wrote up a summary at https://wikitech.wikimedia.org/wiki/Kafka/Kafka-main-raid-performance-testing-2019

May 31 2019, 4:30 PM · ops-codfw, Operations

May 30 2019

herron added a comment to T224692: mx1001 exim queue warning.

Added some high level troubleshooting tips at https://wikitech.wikimedia.org/wiki/Exim#Troubleshooting_"exim_queue_warning"_alerts

May 30 2019, 7:46 PM · Operations, observability
herron added a comment to T224692: mx1001 exim queue warning.

Discussed on IRC; adding here to close the loop.

May 30 2019, 7:20 PM · Operations, observability

May 29 2019

herron added a comment to T197624: Improve visibility of incoming operations tasks.

I do think that we would need to be consistent about what constitutes "Acknowledged" (or a similar column name). IMO the workboard transition action would indicate that the clinic duty task triage work was done. The task has been prioritized, relevant people/groups/tags have been added, etc.

May 29 2019, 3:15 PM · User-herron, Operations

May 24 2019

herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

Kafka-main2001 is installed. I updated the netboot config to assign the partman config to these hostnames, and switched the hardware controller to HBA mode. Then it completed the install using the 8 disk md raid10 config. Now to do some testing!

May 24 2019, 3:48 AM · ops-codfw, Operations

May 23 2019

herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

ok, no worries I'll poke at this for a bit and try to get it installed

May 23 2019, 9:23 PM · ops-codfw, Operations

May 22 2019

herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

Hey @Papaul, I added a raid10-gpt-srv-lvm-ext4-8disks.cfg for the initial installs on these.

May 22 2019, 7:10 PM · ops-codfw, Operations
herron triaged T224128: Migrate network device syslogs to Kafka logging pipeline as Normal priority.
May 22 2019, 2:52 PM · Patch-For-Review, User-herron, Operations, netops, Wikimedia-Logstash
herron added a comment to T221969: Puppet catalog compiler - increasing max concurrent jobs.

I thought about this task a little bit. The current instances have 4 vCPUs. The operations-puppet-catalog-compiler-test job runs the compiler with NUM_THREADS=2.
I would suggest:

  • Use x1.large instances (8 vCPUs / 16G RAM / 160G disk). The RAM / disk is a bit overkill since the compiler is mostly CPU bound iirc.
  • Set the jobs to use NUM_THREADS=6 (or 7? so we at least have one CPU available for the rest).
  • Add a third instance to the pool.
May 22 2019, 2:25 PM · Release-Engineering-Team-TODO (201907), puppet-compiler, Continuous-Integration-Infrastructure

May 21 2019

herron renamed T213902: Implement sensitive logstash access control from Implement sensitive log access control to Implement sensitive logstash access control.
May 21 2019, 8:38 PM · Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
herron renamed T213902: Implement sensitive logstash access control from [stretch] Implement sensitive log access control, onboard 3 sensitive log producers to Implement sensitive log access control.
May 21 2019, 8:38 PM · Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
herron added a subtask for T220103: TEC6: Logging infrastructure (Q4 2018/19 goal): T213902: Implement sensitive logstash access control.
May 21 2019, 8:37 PM · Wikimedia-Logstash, User-fgiunchedi, Operations, Goal
herron added a parent task for T213902: Implement sensitive logstash access control: T220103: TEC6: Logging infrastructure (Q4 2018/19 goal).
May 21 2019, 8:37 PM · Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
herron updated the task description for T220103: TEC6: Logging infrastructure (Q4 2018/19 goal).
May 21 2019, 8:37 PM · Wikimedia-Logstash, User-fgiunchedi, Operations, Goal

May 17 2019

herron added a comment to T223493: rack/setup/install kafka-main200[1-5].

Good point! And if we number from 200[1-5] it should simplify mapping of broker IDs between old and new hosts too. I updated the description to reflect this, but if you think it's best to keep the 200[4-8] suffix I'm happy to go that route instead.

May 17 2019, 2:49 PM · ops-codfw, Operations
herron renamed T223493: rack/setup/install kafka-main200[1-5] from rack/setup/install kafka200[4-8] to rack/setup/install kafka-main200[1-5].
May 17 2019, 2:45 PM · ops-codfw, Operations
herron updated the task description for T223493: rack/setup/install kafka-main200[1-5].
May 17 2019, 2:44 PM · ops-codfw, Operations

May 16 2019

herron triaged T223483: Logstash stops processing messages if a single output becomes blocked as Normal priority.
May 16 2019, 7:38 PM · Operations, Wikimedia-Logstash

May 14 2019

herron awarded T222800: Requesting quota increase for 'puppet-diffs' project a Party Time token.
May 14 2019, 8:59 PM · Operations, Cloud-VPS (Quota-requests), puppet-compiler

May 13 2019

herron moved T187147: Port mediawiki/php/wmerrors to PHP7 and deploy from Backlog to In Dev/Progress on the Wikimedia-Logstash board.
May 13 2019, 3:25 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
herron added a project to T187147: Port mediawiki/php/wmerrors to PHP7 and deploy: Wikimedia-Logstash.
May 13 2019, 3:22 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
herron added a comment to T221969: Puppet catalog compiler - increasing max concurrent jobs.

I deployed it on May 6th, so puppet compile jobs should hopefully be split equally between compiler1001 and compiler1002. I don't know how to verify that though :-(

May 13 2019, 1:56 PM · Release-Engineering-Team-TODO (201907), puppet-compiler, Continuous-Integration-Infrastructure

May 10 2019

herron moved T187147: Port mediawiki/php/wmerrors to PHP7 and deploy from Backlog to Working on on the User-herron board.
May 10 2019, 5:16 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
herron added projects to T187147: Port mediawiki/php/wmerrors to PHP7 and deploy: Operations, MediaWiki-Logging.
May 10 2019, 5:16 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
herron added a project to T187147: Port mediawiki/php/wmerrors to PHP7 and deploy: User-herron.
May 10 2019, 5:16 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
herron added a comment to T187147: Port mediawiki/php/wmerrors to PHP7 and deploy.

@herron Yes, I can do that to help avoid this specific instance of the problem. The problem I'd like to solve in this task, however, is to be able to detect it. That is, if there is a significant influx of errors that happen to be too large, this really should show up under type:mediawiki in some kind of channel (e.g. syslog_truncated) with a severity of "ERROR", so that they still get counted and immediately trigger the necessary alarms during a MediaWiki deployment.
For that it's totally fine if the JSON is not parsed and only stored as raw message text. It would still be picked up at least with a timestamp, type and a bit of context (e.g. which MW server it came from), and the raw text will have to suffice for a MW developer to figure out where it came from and either to fix the problem that caused the error to be reported, or to make the error message smaller.
But the immediate issue is to be able to at least index them and detect the problem.

May 10 2019, 5:15 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
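
The fallback requested in the comment above could look roughly like the following Python sketch. It is only an illustration, not the rsyslog or Logstash configuration actually in use. The function name `to_index_record` and the field names `host`, `level` and `message` are assumptions made for the example; `type:mediawiki`, the `syslog_truncated` channel and the "ERROR" severity are taken from the comment. The timestamp is assumed to be added elsewhere in the pipeline.

```python
import json


def to_index_record(raw_payload, host):
    """Return a dict to index: the parsed JSON object when the payload is
    valid, otherwise a fallback record that keeps the raw text."""
    try:
        parsed = json.loads(raw_payload)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass  # truncated or otherwise invalid JSON: fall through to the fallback
    # Hypothetical fallback shape; field names are assumptions for illustration.
    return {
        "type": "mediawiki",
        "channel": "syslog_truncated",
        "level": "ERROR",
        "host": host,
        "message": raw_payload,  # stored unparsed so it can still be counted
    }
```

With a record like this, an oversized message would still be counted under `type:mediawiki` even though its structured fields are lost.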

May 9 2019

herron closed T182819: custom fact interface_primary breaks under newer versions of facter as Resolved.

yup!

May 9 2019, 8:06 PM · User-herron, Patch-For-Review, Puppet, Operations
herron closed T182819: custom fact interface_primary breaks under newer versions of facter, a subtask of T177254: Upgrade to puppet 4 (4.8 or newer), as Resolved.
May 9 2019, 8:06 PM · cloud-services-team (FY2017-18), Puppet, User-Joe, Operations
herron moved T213902: Implement sensitive logstash access control from Backlog to Working on on the User-herron board.
May 9 2019, 8:05 PM · Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
herron moved T217359: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. from Backlog to Working on on the User-herron board.
May 9 2019, 8:05 PM · User-herron, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), Services (watching), EventBus, Analytics, Operations
herron moved T220387: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) from Backlog to Working on on the User-herron board.
May 9 2019, 8:05 PM · User-herron, Operations
herron moved T222075: Prevent puppet catalog compiler workers from running out of disk space from Backlog to Working on on the User-herron board.
May 9 2019, 8:05 PM · observability, User-herron, puppet-compiler, Operations
herron added a comment to T187147: Port mediawiki/php/wmerrors to PHP7 and deploy.

After further testing I'm seeing that these messages arrive at rsyslog with @cee formatting, but truncated. This means the msg field does not contain valid JSON, specifically within the json-in-json-in-json field msg.fatal_exception.trace. The msg field comes from rsyslog's extraction of the JSON payload prefixed by the @cee cookie in the syslog message.

May 9 2019, 5:58 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
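
The check described above can be reproduced outside rsyslog with a small Python sketch: take everything after the `@cee:` cookie and test whether it parses as JSON. This is not rsyslog's implementation. The helper names and both sample messages are hypothetical and exist only to show how a cut-off payload fails validation.

```python
import json

CEE_COOKIE = "@cee:"


def extract_cee_payload(message):
    """Return the text following the @cee: cookie, or None if there is no cookie."""
    idx = message.find(CEE_COOKIE)
    if idx == -1:
        return None
    return message[idx + len(CEE_COOKIE):].strip()


def is_valid_json(payload):
    """True if the payload string parses as JSON, False if it does not."""
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical sample messages (not real log lines).
complete = 'mediawiki: @cee: {"channel": "exception", "msg": "ok"}'
truncated = 'mediawiki: @cee: {"channel": "exception", "fatal_exception": {"trace": "#0 /srv/'

for sample in (complete, truncated):
    payload = extract_cee_payload(sample)
    print(is_valid_json(payload))
```

The truncated sample still carries the cookie, so it is recognised as an @cee message, but its payload ends mid-string and cannot be parsed.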

May 8 2019

herron added a comment to T187147: Port mediawiki/php/wmerrors to PHP7 and deploy.

Comparing (in beta) a working mediawiki log message and a log message failing with max_bytes_length_exceeded_exception I'm noticing differences in json formatting as well. For example

May 8 2019, 8:21 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
herron added a comment to T221969: Puppet catalog compiler - increasing max concurrent jobs.

@hashar while on the topic, is it possible for Jenkins to more evenly dispatch PCC jobs across the workers? Currently compiler1002 receives the bulk of the work and is at 95% disk full, while compiler1001 is at only 50% disk full.

May 8 2019, 3:08 PM · Release-Engineering-Team-TODO (201907), puppet-compiler, Continuous-Integration-Infrastructure
herron triaged T222800: Requesting quota increase for 'puppet-diffs' project as Normal priority.
May 8 2019, 3:04 PM · Operations, Cloud-VPS (Quota-requests), puppet-compiler
herron closed T221290: wiki-mail DKIM failing as Resolved.
May 8 2019, 1:57 PM · Patch-For-Review, Traffic, Operations, DNS, Mail
herron added a comment to T221288: Phabricator SPF record contains internal addressing for phab[12]001.

Do those IPv6 addresses actually send any mail?

May 8 2019, 1:21 PM · Patch-For-Review, Traffic, Operations, DNS, Mail
herron closed T221288: Phabricator SPF record contains internal addressing for phab[12]001 as Resolved.

Ready to resolve afaict!

May 8 2019, 1:13 PM · Patch-For-Review, Traffic, Operations, DNS, Mail

May 7 2019

herron updated subscribers of T187147: Port mediawiki/php/wmerrors to PHP7 and deploy.
May 7 2019, 7:45 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
herron added a comment to T187147: Port mediawiki/php/wmerrors to PHP7 and deploy.

Seeing errors like this from logstash that appear related. This one specifically originated from logstash1007 /var/log/logstash/logstash-plain.log

May 7 2019, 7:44 PM · Core Platform Team Workboards (Clinic Duty Team), serviceops, Patch-For-Review, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)

May 1 2019

herron added a comment to T222072: compiler1002.puppet-diffs.eqiad.wmflabs disk is full.

On paper this use case also would lend itself to a filesystem with transparent compression. Maybe btrfs with compression. The data stored on disk is non-critical, and there are multiple worker nodes should issues arise with one filesystem.

May 1 2019, 6:41 PM · Patch-For-Review, Operations, puppet-compiler, Jenkins
herron added a comment to T222072: compiler1002.puppet-diffs.eqiad.wmflabs disk is full.
May 1 2019, 6:29 PM · Patch-For-Review, Operations, puppet-compiler, Jenkins
herron added a comment to T221290: wiki-mail DKIM failing.

Looking better after merging the above. From a password reminder mail:

May 1 2019, 5:33 PM · Patch-For-Review, Traffic, Operations, DNS, Mail

Apr 30 2019

herron triaged T222198: Gmail - Multiple destination domains per transaction is unsupported. Please try again. as Normal priority.
Apr 30 2019, 4:19 PM · Patch-For-Review, Mail, Operations

Apr 29 2019

herron added a project to T222075: Prevent puppet catalog compiler workers from running out of disk space: observability.
Apr 29 2019, 3:10 PM · observability, User-herron, puppet-compiler, Operations
herron triaged T222075: Prevent puppet catalog compiler workers from running out of disk space as Normal priority.
Apr 29 2019, 3:10 PM · observability, User-herron, puppet-compiler, Operations
herron closed T221990: LDAP access to the (nda) wmf group for sukhe as Resolved.

uid=sukhe,ou=people,dc=wikimedia,dc=org has been added to the NDA group. Please re-open if any follow up is needed. Thanks!

Apr 29 2019, 2:52 PM · LDAP-Access-Requests

Apr 26 2019

herron triaged T221529: Frequent puppet failures as Normal priority.
Apr 26 2019, 10:44 PM · Patch-For-Review, Puppet, puppet-compiler, Operations
herron triaged T221904: swift backend decomms / rebalances are noisy as Normal priority.
Apr 26 2019, 10:44 PM · observability, media-storage, Operations
herron triaged T221939: Investigate use of hp-asrd on HPE servers as Normal priority.
Apr 26 2019, 10:42 PM · cloud-services-team, Operations
herron triaged T221985: puppet-merge shouldn't fail if `tput` doesn't grok your terminal as Normal priority.
Apr 26 2019, 10:40 PM · Puppet, Operations
herron added a comment to T189434: Fake email about @tools.wmflabs.org email.

(eg, it says it's to security@tools.wmflabs.org but I somehow got the email).

Apr 26 2019, 3:23 PM · Mail, cloud-services-team
herron created T221969: Puppet catalog compiler - increasing max concurrent jobs.
Apr 26 2019, 2:31 PM · Release-Engineering-Team-TODO (201907), puppet-compiler, Continuous-Integration-Infrastructure

Apr 25 2019

herron added a comment to T116011: ferm: Log dropped packets.

Looking at cumin1001 I noticed that the log prefix at the end of the input chain is "fw-out-drop" and the output chain is empty with an accept policy. Is "out" indeed the direction in this case? Or would dropped packets logged by the input chain be considered "in"?

Apr 25 2019, 5:54 PM · Operations
herron added a comment to T220860: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts).

Since we're approaching two weeks on this request I've proposed the above patch to move forward using the existing deployment group and trust that caution will be exercised. Happy to see another approach implemented, but at the same time would like to unblock this individual access request.

Apr 25 2019, 5:08 PM · Patch-For-Review, Operations, SRE-Access-Requests
herron added a comment to T221744: Add Progresslabs to WMF LDAP group for transparency report editing (allow 'nda' users to login on transparency-private).

Hello, I am not seeing an existing account with username Progresslabs. Could you please confirm that the account has already been created, and this is indeed the username? If you know what email was used I could try searching for that.

Apr 25 2019, 4:24 PM · LDAP-Access-Requests

Apr 24 2019

herron closed T221143: Kibana breaks during rolling upgrade as Resolved.

The Kibana lvs has been updated to use the source hash scheduler

Apr 24 2019, 3:57 PM · Patch-For-Review, User-herron, Wikimedia-Logstash, Operations
herron added a comment to T220982: maps hosts have bad permissions under /srv/deployment.

Is there anything left to do before closing this?

Apr 24 2019, 3:33 PM · Operations
herron closed T212640: logstash stuck on its persistent queue as Resolved.

I think it's safe to resolve this now since we're on logstash 5.6.15, and have disabled the logstash persistent queue.

Apr 24 2019, 3:30 PM · Operations, Wikimedia-Logstash
herron added a parent task for T221529: Frequent puppet failures: T201247: Sporadic puppet failures.
Apr 24 2019, 3:18 PM · Patch-For-Review, Puppet, puppet-compiler, Operations