In T94896#1567576, @BBlack wrote: Well this basically got solved along the way while doing other things. ... I think we can go ahead and close this ticket in favor of a possible new one about eventually looking at the specific upload problem.
Oct 21 2015
this sounds like something for check_graphite, right? YuvOri?
Thank you!
can we first see with cipher_list which are available?
In T111654#1737846, @jcrespo wrote:
- Recommended cipher and key length (I suppose 2048), that we use for other production services (I assume ssl_cipher=TLSv1.2,
Dzahn added a comment to T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
or anyone, can you fix https://gerrit.wikimedia.org/r/#/c/247760/2/modules/icinga/manifests/gsbmonitoring.pp with different options of check_http? i tried with -f follow , -f sticky among other things but did not find a solution
Dzahn renamed T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API) from Google Safe Browsing Monitoring turned CRIT to Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
Dzahn triaged T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API) as Medium priority.
i'll still say priority normal since this is broken monitoring (due to Google changing things on their side), not actual alarms that our sites have a problem
Dzahn removed a project from T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API): Patch-For-Review.
Dzahn added a comment to T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
"The API key format has changed. API keys are now managed in the Google Developers Console,"
12:19 < mutante> papaul: are all the cisco servers shut down?
12:20 < papaul> no
12:20 < papaul> there are stay up
12:20 < papaul> doing the wipe
12:20 < mutante> but you dont need mgmt to do that?
12:20 < papaul> no
12:20 < papaul> i don't
12:20 < mutante> alright,ok
Dzahn added a comment to T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
should this use the real API ?
In T34796#1737605, @hashar wrote: Do we really care about having status.wikimedia.org served over TLS?
Dzahn added a comment to T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
still not really working after switch to https, won't find the string
Dzahn added a comment to T115880: adding tjones to analytics-privatedata-users (hive and webrequests).
@EBernhardson Thank you for checking that. Ok, this was also confirmed by Otto on the gerrit change.
Change 247480 merged by Dzahn:
Removed mgmt DNS for virt20[0-1][1-9], pc200[1-3], labsdb200[1-3] and WMF5709
Oct 20 2015
gerritbot added a comment to T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
Change 247760 merged by Dzahn:
Switch safe browsing checks to HTTPS
gerritbot added a comment to T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
Change 247760 had a related patch set uploaded (by MaxSem):
Switch safe browsing checks to HTTPS
Racked, cabled and ILO setup. DNS completed
Dzahn added a comment to T114059: ssl expiry tracking in icinga - we don't monitor that many domains.
dumps: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=dataset1001&service=HTTPS
OTRS: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=iodine&service=HTTPS
lists: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=fermium&service=HTTPS
icinga: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=neon&service=HTTPS
gerrit: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ytterbium&service=HTTPS
RT: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=magnesium&service=HTTPS
planet: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=planet1001&service=HTTPS (BROKEN, this needs to be on the cp boxes, special case! )
librenms: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netmon1001&service=HTTPS
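For readers unfamiliar with what a certificate expiry check of this kind does, here is a minimal sketch. It is not the Icinga plugin used in the changes above. It assumes the check only has to compare a certificate's notAfter time against two thresholds. The function names and the 30/14 day thresholds are hypothetical, chosen for illustration. The notAfter string format is the one Python's `SSLSocket.getpeercert()` returns.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the whole days left before a certificate's notAfter time.

    not_after: string such as "Nov 20 00:00:00 2015 GMT", the format
               found under the "notAfter" key of getpeercert().
    now:       epoch seconds to compare against (defaults to the current time).
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    # Floor to whole days; the result is negative once the cert has expired.
    return int((expires - now) // 86400)


def classify(days_left, warn_days=30, crit_days=14):
    """Map remaining days to a monitoring-style state (hypothetical thresholds)."""
    if days_left < crit_days:
        return "CRITICAL"
    if days_left < warn_days:
        return "WARNING"
    return "OK"
```

A real check would first fetch the certificate from the host (for example over a TLS socket) and then pass its notAfter value to `days_until_expiry`; that step is left out here so the sketch runs without network access.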
gerritbot added a comment to T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
Change 247754 merged by Dzahn:
Update safe browsing checks
gerritbot added a comment to T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
Change 247754 had a related patch set uploaded (by MaxSem):
Update safe browsing checks
Dzahn added a comment to T116099: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API).
16:00 < icinga-wm> CUSTOM - Host google is UP: PING OK - Packet loss = 0%, RTA = 9.61 ms
Definitely not varnish!
gerritbot added a comment to T114059: ssl expiry tracking in icinga - we don't monitor that many domains.
Change 247744 merged by Dzahn:
planet: add ssl cert expiry check
gerritbot added a comment to T114059: ssl expiry tracking in icinga - we don't monitor that many domains.
Change 247744 had a related patch set uploaded (by Dzahn):
planet: add ssl cert expiry check
ssastry updated the task description for T116090: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests.
gerritbot added a comment to T114059: ssl expiry tracking in icinga - we don't monitor that many domains.
Change 244617 merged by Dzahn:
dumps: add cert expiry check
intracer added a comment to T106517: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404.
In T106517#1722718, @Tgr wrote: See [[ https://github.com/wikimedia/mediawiki/blob/a2d6ecc4539e60501803155990ec36575bdb4332/includes/filerepo/FileRepo.php#L1764 | FileRepo::nameForThumb() ]] for how the thumbnail file name (the part after the /) is generated. IIRC abbrvThreshold is 200 for Wikimedia sites.
gerritbot added a comment to T115760: deployment: user trebuchet gets added and removed from group wikidev on every puppet run.
Change 247721 had a related patch set uploaded (by Thcipriani):
Remove trebuchet user from wikidev group
Dzahn added a comment to T114059: ssl expiry tracking in icinga - we don't monitor that many domains.
progress tracking on etherpad now:
In T81030#1739822, @demon wrote: Apache syslog error rate, MW debug log error rates, HHVM error rates and OOMs all tracked via this dashboard now.
faidon added a comment to T115760: deployment: user trebuchet gets added and removed from group wikidev on every puppet run.
In T115760#1739925, @thcipriani wrote: It seems like the Right Thing™ would be to make wikidev the primary group for the trebuchet user.
thcipriani added a comment to T115760: deployment: user trebuchet gets added and removed from group wikidev on every puppet run.
So, currently, it doesn't matter if the trebuchet user is in the wikidev group, this has only been the case since commit acfeeefb landed.
Dzahn added a comment to T114059: ssl expiry tracking in icinga - we don't monitor that many domains.
and let's also have meta monitoring. icinga itself should have a working cert :)
Apache syslog error rate, MW debug log error rates, HHVM error rates and OOMs all tracked via this dashboard now.
Also, for the record we are now talking about beaconImpressions files, not bannerImpressions. E.g. /archive/banner_logs/2015/beaconImpressions-sampled10.tsv-20151020-184501.log.gz
Confirmed that the campaign is intact. All the pipeline does is store URLs in a file, the banner impression loader job is what's responsible for parsing these URLs and importing into the database.
RobH added a subtask for T116063: Hardware Automation Workflow - Overall Tracking: Unknown Object (Task).
RobH added a parent task for T88424: Migrate racktables to servermon: T116063: Hardware Automation Workflow - Overall Tracking.
RobH added a parent task for T84001: alternatives to racktables ?: T116063: Hardware Automation Workflow - Overall Tracking.
RobH added a parent task for T78135: Provide a pxe-bootable rescue image: T116063: Hardware Automation Workflow - Overall Tracking.
RobH added a subtask for T116063: Hardware Automation Workflow - Overall Tracking: Unknown Object (Task).
Yurik moved T116062: Deploy TileratorUI service from Backlog to Stalled/Waiting on the Maps-Sprint board.
Change 244436 had a related patch set uploaded (by Yurik):
maps: Add tileratorui service
gerritbot added a comment to T114059: ssl expiry tracking in icinga - we don't monitor that many domains.
Change 244614 merged by Dzahn:
icinga: add cert expiry check for icinga itself
Change 247613 merged by Andrew Bogott:
Logstash: track apache2 syslog error rate in statsd
pc200X is not in use, and pending to be replaced. I didn't even know labsdb200X existed.
jcrespo moved T112473: Better mysql monitoring for number of connections and processlist strange patterns from In progress to Backlog on the DBA board.
jcrespo moved T99485: implement performance_schema for mysql monitoring from Backlog to In progress on the DBA board.
gerritbot added a project to T99485: implement performance_schema for mysql monitoring: Patch-For-Review.
Change 247615 had a related patch set uploaded (by Jcrespo):
Enabling performance schema experimentally on db1018
DStrine moved T104774: Publishing translations for central notice banners fails from Sprint +3 to Q3 2021-2022 on the Fundraising-Backlog board.
fe-0/0/5 up up Transit: <! Equinix OOB {#?} [100Mbps Cu]
Change 247613 had a related patch set uploaded (by Chad):
Logstash: track apache2 syslog error rate in statsd
Dzahn removed a project from T114861: mailman check_queue recurrent alarm/recovery: Patch-For-Review.
Above commit will resolve this. Unsilenced icinga check.
Based on our experience we have good enough monitoring for both Gerrit and Phabricator. The critical bits are monitored via Icinga (e.g. process existence), and we have enough experienced users who poke us about potential failures even before monitoring sends a notification.
icinga has paged me, and opsen, on multiple occasions when phabricator was down. I'm pretty sure that it's working.
Change 247606 merged by Ottomata:
Revert previous changes to make sure file_mover has uid and gid 30001
Change 247604 merged by Dzahn:
mailman: increase out queue to 300 check
Change 247606 had a related patch set uploaded (by Ottomata):
Revert previous changes to make sure file_mover has uid and gid 30001
Change 247604 had a related patch set uploaded (by John F. Lewis):
mailman: increase out queue to 300 check
gerritbot added a project to T114861: mailman check_queue recurrent alarm/recovery: Patch-For-Review.
jcrespo updated the task description for T114752: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends.
gerritbot added a comment to T114752: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on the several monitoring backends.
Change 244651 merged by Jcrespo:
Add pt-heartbeat start & execution script to mariadb
Change 247589 merged by Ottomata:
Remove uid setting from file_mover user. enforce-users-groups-cleanup was removing this
fgiunchedi added a comment to T105218: check_graphite - "UNKNOWN: More than half of the datapoints are undefined ".
taking another look at this, I'm going to block it with T101141: UDP rcvbuferrors and inerrors on graphite hosts about fixing inbound udp errors on graphite first since it might be the root cause