User Details
- User Since
- Nov 5 2018, 2:54 PM (121 w, 1 d)
- Availability
- Available
- IRC Nick
- cdanis
- LDAP User
- CDanis
- MediaWiki User
- CDanis (WMF) [ Global Accounts ]
Today
I believe this is related to https://sal.toolforge.org/log/YZV29XcBa_6PSCT9KHrZ
I'm pretty sure the baud rate is 19200
Yesterday
Did this cause any actual issue?
Thu, Feb 25
wmf-utils is a little-used repo, with only two scripts in it.
Fri, Feb 19
Thu, Feb 18
Caches have filled, and there are no more fetch failures due to "LRU limited": https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend&from=1613576456135&to=1613678433179
Wed, Feb 17
@Ottomata just one more question for you!
Tue, Feb 16
Sun, Feb 14
Thu, Feb 11
Email address sent privately.
Tue, Feb 9
Mon, Feb 8
@hnowlan just a heads up that it looks like the depool of maps1001 left maps@eqiad underprovisioned:
https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&from=1612777514651&to=1612819057362
Fri, Feb 5
I contacted Carol on Slack; this request is approved.
Glad to hear, thanks!
Hi, just wanted to check in if anything more was needed here?
@KFrancis Can you please get an NDA signed with this WMDE staff member? Thanks!
@jrobell Can you please confirm? Thanks!
The wmf group does not require manager approval -- only verification that staff is staff :)
@AikoChou should have shell access within half an hour.
Checked in with Miriam on IRC and Turnilo/Superset access isn't needed, but Kerberos is. Doing that now.
Thank you!
Thu, Feb 4
I think the change pushed today in the puppet private repo (hash b1b32d4ab) broke the meta-monitoring validation script.
Tue, Feb 2
Jan 29 2021
looks good now, thanks!
Hi Maya,
I think something is still wrong? LibreNMS is showing the port on the asw receiving about 7kbps of errors: https://librenms.wikimedia.org/device/device=149/tab=port/port=12410/view=graphs/
The first few cycles of logs from the ulsfo side:
Jan 29 20:21:46 cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:4d 18:26 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Jan 29 20:21:46 cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 159, new state: down, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:21:46 cr4-ulsfo rpd[16292]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 29 20:21:46 cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 29 20:21:47 cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 159, new state: up, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:22:09 cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Jan 29 20:22:39 cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:00:00:52 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Jan 29 20:22:39 cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 159, new state: down, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:22:39 cr4-ulsfo rpd[16292]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 29 20:22:39 cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 29 20:22:41 cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Jan 29 20:22:43 cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 159, new state: up, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:22:50 cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:00:00:06 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Here are the overall, whole-cluster mean latencies in milliseconds, summed across all machines and then broken down by kernel version.
Jan 28 2021
Jan 27 2021
Can you please provide a complete dump of a "null response", with both the complete response headers and the raw response body?
Jan 26 2021
It seems the User-Agent being used is Peachy MediaWiki Bot API Version 2.0 (alpha 8) (which ideally should be updated to comply with the User-Agent policy).
What is the originating IP address of these requests?
Jan 22 2021
Jan 21 2021
Sorry for the truly baffling error message.
Jan 20 2021
There's one related problem: enable-puppet should check the given message both with and without the appended - $SUDO_USER, since you may have run disable-puppet from a context where that variable wasn't set. For example: https://sal.toolforge.org/log/8BlxIXcBgTbpqNOm5XlV
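A minimal sketch of that check, in Python for illustration only (the real enable-puppet script may be written differently). The function name `message_matches` and the exact ` - ` separator are assumptions taken from the wording of the comment above, not from the script itself.

```python
from typing import Optional


def message_matches(stored: str, given: str, sudo_user: Optional[str]) -> bool:
    """Return True if `given` matches the stored disable message.

    Hypothetical helper: accepts the message either exactly as given,
    or with " - <sudo_user>" appended, so a disable set from a context
    without SUDO_USER can still be re-enabled (and vice versa).
    """
    candidates = {given}
    if sudo_user:
        # Assumed suffix format: "<message> - <user>"
        candidates.add(f"{given} - {sudo_user}")
    return stored in candidates


# Example: the disable was recorded with the suffix, the enable is run without it.
print(message_matches("maintenance - cdanis", "maintenance", "cdanis"))
# Example: the disable was recorded without the suffix.
print(message_matches("maintenance", "maintenance", "cdanis"))
```

Comparing against a small set of candidate strings keeps both forms acceptable without loosening the match to a prefix check, which could accept unrelated messages.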
Jan 15 2021
John has it right -- I wanted to lower the bar, and ensure all deployers / folks with any sort of shell access have Klaxon access. I wasn't sure that set was covered by wmf/wmde/nda; thanks to John for proving that is indeed the case :)
Jan 13 2021
Hey @Ottomata, I meant to get around to this last quarter but didn't. Would very much like to get some mechanism in place soon -- do you have any ideas on the best path forward? Maybe we should set up a quick meeting to chat through options?
Thanks for the writeup with all the background! And for the cleanup patches so far :)
Jan 8 2021
Jan 5 2021
Jan 4 2021
Dec 23 2020
Thanks to observability for initial feedback, and to @Joe @jbond and especially @RLazarus for code reviews!
Now live: https://klaxon.wikimedia.org/
and successfully tested today!
Dec 22 2020
Thanks for the report.
Dec 21 2020
Dec 16 2020
I'll put this in puppet-private once I have the other puppetization ready :)
Not totally sure about the best fit here vs Beta, but you might also consider using the 'staging' k8s cluster.
So I guess we need a separate task for paging and the check_librenms deprecation?
Dec 14 2020
Does this mean we can deprecate the check_librenms Icinga integration?
Sure, always happy to help :)
Dec 9 2020
Dec 8 2020
Tagging @BBlack in the hopes he knows anything offhand about shipping new GeoLite2 or full GeoIP2 files to Beta Cluster
Dec 3 2020
Dec 2 2020
The max lifetime of any object in the Traffic CDN is 24 hours. Are you sure they're being cached there? Can you give a full example URL?