User Details
- User Since
- Aug 21 2018, 6:05 PM (275 w, 2 d)
- Availability
- Available
- LDAP User
- Cwhite
- MediaWiki User
- CWhite (WMF)
Fri, Nov 17
Definitely a different problem.
Thu, Nov 16
Logstash was crashlooping because it was attempting to load a template that did not exist on the host anymore. Now that it is using the right template, logs are flowing again.
Wed, Nov 15
Injecting statslib into the legacy statsd code path introduces significant complexity and risk.
@jcrespo roll-restarted swift proxies today using sre.swift.roll-restart-reboot-swift-ms-proxies cookbook in response to high 502s and 504s from ATS.
Tue, Nov 14
Tue, Nov 7
Option 1 seems like a good first step toward handling the memory size issues. Note that we may want to adjust logstash tuning afterwards as well.
Mon, Nov 6
Fri, Nov 3
Oct 27 2023
Other tasks have since been filed that cover what this task intended.
Oct 25 2023
Oct 19 2023
Oct 18 2023
Thanks, @aaron!
Oct 17 2023
Summarizing highlights from my IRC conversation with @fgiunchedi:
Thanks for the report!
Oct 16 2023
I have no problem keeping the 30s and 60s buckets. I was under the impression from the meeting that @Krinkle preferred to omit them?
Per meeting with @Krinkle today:
Oct 12 2023
It doesn't look like there is an option to change the timeout parameter. We'll need to patch curator. 😕
I propose we implement a mixed approach.
OpenSearch replies with HTTP 200 {"acknowledged":false}, indicating that the operation hasn't failed but has hit the 30s "explicit operation timeout". This is different from master_timeout (which curator is providing), which specifies the "timeout for connection to master".
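For reference, a minimal sketch of how a client could tell these cases apart. The function name and the returned labels are hypothetical and not part of curator or any OpenSearch client; only the HTTP 200 status and the "acknowledged" field come from the response described above.

```python
import json


def classify_ack(status_code: int, body: str) -> str:
    """Classify an acknowledgement-style response (hypothetical helper)."""
    if status_code != 200:
        # Anything other than 200 is treated as a real failure in this sketch.
        return "failed"
    payload = json.loads(body)
    if payload.get("acknowledged") is True:
        return "acknowledged"
    # HTTP 200 with acknowledged=false: the request was accepted but the
    # acknowledgement did not arrive within the operation timeout.
    return "unacknowledged"
```

Treating "unacknowledged" as retry-or-verify rather than as an error would match the behaviour seen here, where the delete is not reported as failed.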
Oct 11 2023
Oct 10 2023
Done!
Oct 4 2023
Stopped and removed units for coal, uwsgi-coal, wmf_auto_restart_coal and wmf_auto_restart_uwsgi-coal.
Delete actions took a few milliseconds over 30s today. Optimistically resolving.
Oct 3 2023
Optimistically resolving now that WAL is enabled. Will watch the logs for new instances.
I concur. Let's bump the timeout.
Oct 2 2023
There's an outstanding Icinga alert that seems related: CRITICAL - degraded: The following units failed: dispatch-scheduler.service,docker-image-prune-old.service
Sep 28 2023
Discovered some more evidence of this in logs this morning. There is another recommendation to enable WAL on the sqlite db (new in Grafana 9.4).
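For context, a small sketch of what enabling WAL means at the SQLite level, using Python's standard sqlite3 module. This is not how Grafana is configured (Grafana exposes it through its own settings); the file name and helper names below are made up for illustration.

```python
import os
import sqlite3
import tempfile


def enable_wal(db_path: str) -> str:
    """Switch a SQLite database file to write-ahead logging; return the resulting mode."""
    conn = sqlite3.connect(db_path)
    try:
        # The pragma returns the journal mode that is now in effect.
        mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    finally:
        conn.close()
    return mode


def demo() -> str:
    """Run enable_wal against a throwaway database file (hypothetical name)."""
    with tempfile.TemporaryDirectory() as tmp:
        return enable_wal(os.path.join(tmp, "example.db"))
```

WAL lets readers proceed while a write is in progress, which is why it is commonly suggested for "database is locked" style errors.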
Sep 25 2023
Error logs have gone away and the prometheus view looks good.
Sep 22 2023
21:34:16 <Krinkle> cwhite: it seems 'coal' is still running on webperf1003. I guess we didn't absent it and/or intentionally removed it simply with intention to remove by hand but haven't yet?
The group membership change has been deployed.
Restored the level of access held before the last contract expired.
Sep 21 2023
The group membership change has been deployed.
Migrated to nda ldap group.
ping: @thcipriani as approver for deployment group membership
@MGerlach is there an expiry date for this contract renewal?
The group membership change has been deployed.
mw2381:
$ ulimit -Hn
1048576
$ ulimit -Sn
1024
I see logs in logstash! \o/
Sep 20 2023
It's been more than a week and I can see no more instances of this in the logs.
Sep 19 2023
Considering how much time has passed, it's probably safe to say this is complete. If not, please reach out :)
Sep 18 2023
Interestingly, if StatsLib creates executeTiming_seconds_bucket as a counter and executeTiming_seconds as a timer and sends both to the exporter, statsd-exporter becomes inoperable until it is restarted. It appears we have to be careful not to step on the metric names that statsd-exporter generates for summaries and histograms.
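A rough sketch of a pre-flight check for this kind of collision. It assumes timers end up as Prometheus histograms or summaries, whose series conventionally carry the _bucket, _sum and _count suffixes; the function name and the input shape are hypothetical and not part of StatsLib or statsd-exporter.

```python
# Suffixes Prometheus uses for the series generated from histograms
# (_bucket, _sum, _count) and summaries (_sum, _count).
GENERATED_SUFFIXES = ("_bucket", "_sum", "_count")


def find_collisions(metrics: dict[str, str]) -> list[str]:
    """Return metric names that clash with series generated from a timer.

    `metrics` maps a metric name to its statsd type, e.g. "counter" or "timer".
    """
    generated = set()
    for name, kind in metrics.items():
        if kind == "timer":
            for suffix in GENERATED_SUFFIXES:
                generated.add(name + suffix)
    # Any explicitly declared metric that matches a generated name is a clash.
    return sorted(name for name in metrics if name in generated)
```

With the two metrics from the example above, the check flags executeTiming_seconds_bucket as the conflicting name.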
Sep 14 2023
It appears there is some problem with the kafka-jumbo nodes in deployment prep.
Sep 12 2023
Sep 11 2023
Grafana is updated and silence is removed.
Linking my comment here for visibility: T345337#9150551
Sep 7 2023
9.4.14 is live on grafana-next. Will do some testing there before rolling to production early next week. Reinstalled the silence until we can complete the upgrade.
Ran into this today trying to pip install wikimedia-spicerack (Python 3.11).