Hi @MNoorWMF, this is implemented now! Resolving the task but feel free to reopen if something is amiss.
Fri, Apr 16
The issue has been mitigated by reimaging thanos-fe2001 (the host that runs thanos-compact) with a raid0 /srv; I'll reimage the other frontends next week.
This is implemented now! @awight I've expanded https://wikitech.wikimedia.org/wiki/Graphite#Deleting_metrics a little with notes on how to delete metrics; please reach out if you have questions!
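For the common case, deleting a metric boils down to removing its whisper files on the Graphite hosts. A minimal sketch of the idea in Python (the whisper root and the metric name are assumptions based on Graphite's default layout; the wikitech page above remains the authoritative procedure):

```
# Sketch only: removes the on-disk whisper data for one metric subtree.
# Graphite maps dots in metric names to directories, e.g.
# "test.obsolete.metric" -> <whisper root>/test/obsolete/metric(.wsp)
import shutil
from pathlib import Path

WHISPER_ROOT = Path("/var/lib/carbon/whisper")  # assumed default location

def delete_metric(metric, dry_run=True):
    target = WHISPER_ROOT.joinpath(*metric.split("."))
    for path in (target.with_suffix(".wsp"), target):
        if not path.exists():
            continue
        print(("would delete " if dry_run else "deleting ") + str(path))
        if not dry_run:
            shutil.rmtree(path) if path.is_dir() else path.unlink()

delete_metric("test.obsolete.metric")  # hypothetical metric name
```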
@HMonroy you are now a member of the deployment group! Resolving the task; please reopen if something is amiss.
Since the frontends are meant to be stateless, I think I prefer #2 over #1 to avoid special-casing a data partition on one of the backends. Doubling the space for compaction should buy us quite some time.
Thu, Apr 15
This is done! (see subtask)
Wed, Apr 14
@Lena_WMDE you are now in the nda and wmde groups; please verify access and reopen the task if something is amiss!
Thank you @Papaul, all good
Tue, Apr 13
Thu, Apr 8
@Papaul I'm running into trouble with the disk that I haven't seen before (xfs crashes after a while; log below). Can we try another spare disk, just to rule out the disk itself being faulty (or just plain old)? Thank you!
Failed for the second time with "An unknown error occurred in storage backend 'local-swift-eqiad'". @fgiunchedi I'm not sure I can see the detailed logs for this one; could you either find them or help me locate them? Thanks!
Wed, Apr 7
Thank you @Papaul!
Tue, Apr 6
@Papaul please replace the failed 4TB disk; the LED should be blinking. Thank you!
With the last rebalance the hosts are now fully in service (at weight 8000); Netbox is updated.
Thu, Apr 1
@fgiunchedi Is there any process we should follow to test/make sure everything is okay if we add IPv6 DNS for ms-be and ms-fe?
Sounds good @Papaul! So in Icinga we're monitoring each phase to see if it hits 80%/85% of the 30A breaker, and in Prometheus we're collecting most of what we can via SNMP (current, voltage, sensors).
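To make the thresholds concrete (quick arithmetic, assuming the 80%/85% figures apply per phase as warning/critical):

```
BREAKER_AMPS = 30
WARN_PCT, CRIT_PCT = 0.80, 0.85
print(f"warning at {BREAKER_AMPS * WARN_PCT:.1f} A")   # 24.0 A
print(f"critical at {BREAKER_AMPS * CRIT_PCT:.1f} A")  # 25.5 A
```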
Wed, Mar 31
I think Gmail's threading logic groups messages with "similar enough" subjects that arrive "close enough" in time, which would explain the behavior above. Have you experienced counter-examples to this theory?
Tentatively resolving; this will get reopened if it happens again.
Thank you for taking care of the Python 3 migration in Puppet!
Hmm, sdc2 got kicked out of md0 but stayed in md1. I didn't see any obvious messages/failures about sdc in dmesg, so I added the disk back; let's see what happens.
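For the record, the re-add amounts to something like this (a sketch, run as root; the array/partition names match the comment above, but verify against /proc/mdstat first):

```
import subprocess

def readd(array, device):
    # Show current array state, then add the dropped partition back;
    # mdadm performs a re-add when the device's superblock still matches.
    print(open("/proc/mdstat").read())
    subprocess.run(["mdadm", array, "--add", device], check=True)

readd("/dev/md0", "/dev/sdc2")
```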
Tue, Mar 30
Mon, Mar 29
Upstream merged the PR, will be included in the next LibreNMS release \o/
Fri, Mar 26
Since we've set up task opening for AM (Alertmanager) alerts this quarter, we can definitely tackle some of these.
@fgiunchedi could you re-run your analysis to see if mw1307 (10.64.0.169) is still exhibiting the issue?
Thu, Mar 25
This is complete! Alerts will get deployed from operations/alerts to the Prometheus instances.
Wed, Mar 24
I also have shell aliases (proxy-on / proxy-off) for convenience and use them in the few cases where building packages requires internet access. +1 to having a shared alias available (in practice we already have that, just sprinkled in a few places), and -1 to having the proxy enabled by default, for the reasons already mentioned.
Thank you for the feedback! Unfortunately I think addressing some of it will need a LibreNMS patch.
This is complete! Please reopen if something is amiss.
Mar 18 2021
I picked this up again last week and ran a more substantial test: a Spark job with 50 workers downloading ~1 million Commons images (400px thumbnails). Some more questions before I run the job on the full dataset (~53M image files). Looking at the Grafana dashboard:
- what does the increase in PUT 201s in the object state-changing operations chart mean? Cache misses for the thumbnails that get filled in?
Possible, but hard to say from that graph. When did the job start/finish? I'm assuming ~23:30 to ~1:40, but best to confirm.
Something else to check for thumbnailing activity is the Thumbor dashboard (for the same timeframe):
Thanks for that dashboard, it's useful to look at. I should have clarified the period; your assumption is right: ~23:30 to ~1:40.
- client errors chart: we do expect to see a lot of 404s, since some images we query for will have been deleted. However, I also notice a high number of timeouts, even with a timeout of 5 seconds. Is this to be expected? I am doing retries and will increase the timeout, but it seems high.
Yes, some timeouts are to be expected for sure; were the timeouts concentrated on certain file types? It might be a thumb miss plus a long thumb regeneration time, or it might be Swift timing out while fetching the image. Indeed, a timeout plus exponential retries should get you basically all the way there.
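For what it's worth, the retry pattern I have in mind looks roughly like this (a minimal sketch using plain HTTP via the requests library; the URL, timeout, and backoff values are illustrative, not necessarily what your Spark job uses):

```
import time
import requests

def fetch_thumb(url, timeout=5.0, retries=4):
    # Retry timeouts with exponential backoff: the first attempt may pay
    # the full thumb-regeneration cost, later attempts usually hit the cache.
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp.content
        except requests.Timeout:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
```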
@fgiunchedi Does this dashboard and approach look OK to you from the Swift perspective? If so, I'll kick off the main job this week; it is expected to run for ~6 days.
It generally seems fine, although I'm surprised the thumbnailing activity is higher than expected: 400px thumbnails should be pre-generated at upload time by MW for most wikis ($wgThumbLimits in mediawiki-config).
Looking at the errors, I noticed that I actually used a 3s timeout. Still, there are a lot of timeouts: with no retries, almost 25% of requests fail. All errors are timeout errors, and the distribution of file types across successful and failed attempts is roughly the same. Is it possible that the servers are somehow overloaded?
After a chat with Filippo, IIUC 8.2008.0-1 is used only on centrallog nodes (hence the component), but we might want to use the 8.1901.0 provided in Buster, add the custom bits for rsyslog-kubernetes, and upload it all to main. The alternative would be to modify 8.2008.0 and use the component instead (i.e. adding rsyslog-kubernetes to it).
Any preference? :)
I have no idea why 8.2008.0-1 is used on centrallog, but if it is compatible with 8.1901 and can have Kubernetes support, I'd just upload it to main.
And connections from Prometheus kept piling up. AFAIK the service/exporter is not owned by anyone at the moment; I've restarted the exporter, but this is obviously bound to happen again.