Tue, Jul 2
+1 from me as well
Mon, Jul 1
Thanks, based on these headers it looks like an @tools.wmflabs.org alias that points to an @gmail.com address is being used as the email in Gerrit.
Could you please include the headers from an affected message? Thanks in advance!
Fri, Jun 21
These will need internal VLAN/IPs. FWIW kafka-main100[1-5] will be replacing kafka100, so those existing hosts could be used as a template.
Jun 5 2019
Here's a first shot at per-host replacement steps for kafka2003 -> kafka-main2003:
Jun 4 2019
A syslog UDP listener on port 10514 is now running on lithium/wezen, forwarding received messages to the Kafka logging pipeline.
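A minimal sketch of that flow in Python, purely to illustrate the UDP-in, Kafka-out path (the broker address and topic name here are made up):

    import socket
    from kafka import KafkaProducer  # kafka-python

    # Placeholder broker; the real pipeline points at the logging cluster.
    producer = KafkaProducer(bootstrap_servers=['kafka-logging1001:9092'])

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('0.0.0.0', 10514))  # same port as the lithium/wezen listener

    while True:
        data, addr = sock.recvfrom(65535)  # one syslog datagram per message
        producer.send('syslog-udp', data)  # hypothetical topic name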
Tracking service implementation in T225005
FWIW kafka2001 is the current controller, so I'm thinking we should start with kafka2003 -> kafka-main2003.
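To double-check which broker holds the controller role before starting, reading the /controller znode works (a kazoo sketch; the ZooKeeper host and chroot are placeholders):

    import json
    from kazoo.client import KazooClient

    # Placeholder ZK address/chroot for the cluster.
    zk = KazooClient(hosts='conf2001.codfw.wmnet:2181/kafka/main-codfw')
    zk.start()

    # Kafka registers the active controller as JSON in /controller.
    data, _ = zk.get('/controller')
    print(json.loads(data)['brokerid'])  # e.g. 2001 while kafka2001 is controller
    zk.stop()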
That did the trick! All of the new codfw kafka-main hosts are now installed and ready for service setup
Long JSON messages to ELK are being truncated since T187147#5182892, which addresses "Changes to rsyslog/Kafka mean that large errors are now completely lost instead of truncated."
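If the limit at play is the producer/broker max message size (an assumption on my part), the all-or-nothing behaviour is easy to reproduce with kafka-python:

    from kafka import KafkaProducer
    from kafka.errors import MessageSizeTooLargeError

    # Placeholder broker, with a deliberately tiny limit for the demo.
    producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                             max_request_size=1024)

    try:
        # A record over max_request_size is refused whole, not truncated,
        # so the entire event is lost.
        producer.send('logstash-demo', b'x' * 10000)
    except MessageSizeTooLargeError as err:
        print('dropped, not truncated:', err)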
Kafka-main200 and kafka-main2005 are installed, have had the initial Puppet run applied, and are now marked "staged" in Netbox.
Jun 3 2019
I would expect, though, that the DHCP requests would make it to the install servers, with or without entries in the DHCP config file.
Today I tried to perform OS installs on kafka-main200 but was not yet seeing DHCP requests from these hosts reach the installNNNN hosts.
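To narrow down where they're being dropped, sniffing on the install server should show whether the relayed DHCP requests arrive at all (a scapy sketch; needs root):

    from scapy.all import BOOTP, DHCP, sniff

    def show(pkt):
        if pkt.haslayer(DHCP):
            # chaddr is the client hardware address, so it survives relaying;
            # the first 6 bytes are the requesting NIC's MAC.
            print(pkt[BOOTP].chaddr[:6].hex(), pkt[DHCP].options[0])

    # BOOTP/DHCP uses UDP ports 67/68.
    sniff(filter='udp and (port 67 or 68)', prn=show)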
May 31 2019
@Papaul could you have a look at kafka-main2002? It seems to be stuck; at least I'm not able to open a console or power cycle it.
I did some testing of various software and hardware raid configurations and wrote up a summary at https://wikitech.wikimedia.org/wiki/Kafka/Kafka-main-raid-performance-testing-2019
May 30 2019
Added some high level troubleshooting tips at https://wikitech.wikimedia.org/wiki/Exim#Troubleshooting_"exim_queue_warning"_alerts
Discussed on IRC; adding here to close the loop.
May 29 2019
I do think that we would need to be consistent about what constitutes "Acknowledged" (or a similar column name). IMO the workboard transition action would indicate that the clinic duty task triage work was done. The task has been prioritized, relevant people/groups/tags have been added, etc.
May 24 2019
Kafka-main2001 is installed. I updated the netboot config to assign the partman config to these hostnames and switched the hardware controller to HBA mode. It then completed the install using the 8-disk md RAID10 config. Now to do some testing!
May 23 2019
OK, no worries. I'll poke at this for a bit and try to get it installed.
May 22 2019
Hey @Papaul, I added a raid10-gpt-srv-lvm-ext4-8disks.cfg for the initial installs on these.
May 17 2019
Good point! And if we number from 200[1-5] it should simplify mapping of broker IDs between old and new hosts too. I updated the description to reflect this, but if you think it's best to keep the 200[4-8] suffixes, I'm happy to go that route instead.
May 9 2019
After further testing I'm seeing that these messages arrive at rsyslog with @cee formatting, but truncated, meaning the msg field does not contain valid JSON, specifically within the json-in-json-in-json field msg.fatal_exception.trace. The msg field comes from rsyslog's extraction of the JSON payload prefixed by the @cee cookie in the syslog message.
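To make the failure concrete, a toy version of the extraction (the payload below is invented, cut off mid-trace the way the real ones appear to be):

    import json

    # Outer @cee JSON is intact, but the string in "msg" was truncated
    # upstream, so the json-in-json it carries is cut off mid-string.
    line = '@cee: {"msg": "{\\"fatal_exception\\": {\\"trace\\": \\"#0 /srv"}'

    payload = line.split('@cee:', 1)[1]
    event = json.loads(payload)      # the outer parse succeeds

    try:
        json.loads(event['msg'])     # the inner json-in-json does not
    except ValueError as err:
        print('msg is not valid json:', err)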
May 8 2019
Comparing (in beta) a working MediaWiki log message with one failing with max_bytes_length_exceeded_exception, I'm noticing differences in JSON formatting as well. For example:
@hashar while on the topic, is it possible for Jenkins to more evenly dispatch PCC jobs across the workers? Currently compiler1002 receives the bulk of the work and is at 95% disk usage, while compiler1001 is at only 50%.
Ready to resolve afaict!
May 7 2019
Seeing errors like this from Logstash that appear related. This one specifically originated from logstash1007's /var/log/logstash/logstash-plain.log.
May 1 2019
On paper this use case would also lend itself to a filesystem with transparent compression, maybe btrfs. The data stored on disk is non-critical, and there are multiple worker nodes should issues arise with one filesystem.
Looking better after merging the above. From a password reminder mail:
Apr 29 2019
uid=sukhe,ou=people,dc=wikimedia,dc=org has been added to the NDA group. Please re-open if any follow-up is needed. Thanks!
Apr 25 2019
Looking at cumin1001 I noticed that the log prefix at the end of the input chain is "fw-out-drop", and the output chain is empty with an accept policy. Is "out" indeed the direction in this case? Or would dropped packets logged by the input chain be considered "in"?
Since we're approaching two weeks on this request, I've proposed the above patch to move forward using the existing deployment group, trusting that caution will be exercised. I'm happy to see another approach implemented, but at the same time I'd like to unblock this individual access request.
Hello, I am not seeing an existing account with the username Progresslabs. Could you please confirm that the account has already been created and that this is indeed the username? If you know what email was used, I could try searching for that.
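For reference, this is roughly the lookup I'd run (a python-ldap sketch; the server URI and the example address are placeholders):

    import ldap

    conn = ldap.initialize('ldaps://ldap-labs.eqiad.wikimedia.org')  # placeholder
    base = 'ou=people,dc=wikimedia,dc=org'

    # Try the username first, then fall back to searching by mail attribute.
    for filt in ('(uid=Progresslabs)', '(mail=someone@example.org)'):
        for dn, attrs in conn.search_s(base, ldap.SCOPE_SUBTREE, filt):
            print(dn, attrs.get('mail'))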
Apr 24 2019
The Kibana LVS service has been updated to use the source-hash scheduler, so requests from a given client IP consistently land on the same backend.
Is there anything left to do before closing this?
I think it's safe to resolve this now, since we're on Logstash 5.6.15 and have disabled the Logstash persistent queue.