There's still an issue on Jio's side that needs to be fixed by them, but we've put a temporary workaround in place, and their users should be able to access Wikipedia and other WMF sites. Please let us know if that isn't the case!
For posterity, relevant workaround patch and deployment thereof: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/620377
Thu, Aug 13
A thing that someone daring in EUTZ might want to try: using perf probe, modifying the bpfcc-memleak script, or writing a trivial bpftrace script, attach a probe to memcg_schedule_kmem_cache_create and gather calling stack traces. That's the function that creates the work item that results in a worker thread calling memcg_create_kmem_cache, as seen in the stack traces we saw for 32-byte mallocs.
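For the daring, the bpftrace variant is a one-liner. This is a sketch, assuming memcg_schedule_kmem_cache_create is visible as a kprobe target on the running kernel; it counts the distinct kernel stacks that reach the function:

```
sudo bpftrace -e 'kprobe:memcg_schedule_kmem_cache_create { @stacks[kstack] = count(); }'
```

On Ctrl-C, bpftrace prints the @stacks map, i.e. a count per calling stack, which should show who is scheduling the kmem-cache-creation work items.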
Wed, Aug 12
There are some preliminary results from tracing 4096-byte allocations, although some of the stack traces make me think I need to make sure the script is tracing only the kmalloc codepaths.
Here's a summary of my findings so far, although keep in mind that I'm well out of my depth here and this is an amalgamation of guesswork.
sampled ~5 minutes after a fresh reboot
Tue, Aug 11
The Envoy TLS terminator is now configured to allow websocket upgrades -- however, it's misconfigured: it tries to reach the nodejs process via its IPv6 address, but the nodejs process is not listening there -- it listens only on the IPv4 address.
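One possible fix, sketched as a hypothetical Envoy v3 cluster fragment (the cluster name and port are assumptions, not the actual production values): pin the upstream endpoint to the IPv4 loopback so Envoy stops dialing an IPv6 address nodejs isn't bound to.

```yaml
# Hypothetical Envoy cluster fragment: force the upstream address to IPv4
# so Envoy stops trying ::1, where the nodejs process is not listening.
clusters:
  - name: local_nodejs          # assumed cluster name
    connect_timeout: 1s
    type: STATIC
    load_assignment:
      cluster_name: local_nodejs
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 127.0.0.1   # IPv4 only
                    port_value: 8080     # assumed nodejs port
```

Alternatively, the nodejs service could be bound to :: (dual-stack) so both address families work.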
Wed, Jul 29
The structure of the data makes this teeth-pullingly difficult to do well, and the current production configuration has an incredible breadth *and* depth of mismatches between the two views of the world held by profile::lvs::realserver::$pools and conftool.
Tue, Jul 28
With the serial console now attached, I found myself in a rescue shell.
Mon, Jul 27
We haven't seen a recurrence of this so I am optimistically resolving.
This was not a duplicate, and the issue still exists. T234450 had a misleading title: it wasn't about WMFTimeoutExceptions on Special:Contributions in general, but about a particular kind from months ago that was causing widespread database issues because of excessive traffic from a scraper.
Sat, Jul 25
Thu, Jul 23
Re-validation has been forced for ATS-BE, and a Varnish cache ban has also been put in place, so we should no longer be serving any cached 301s.
This issue has been ongoing for a while and likely merits some CPT attention.
Wed, Jul 22
First occurrence was June 17th, 15:10 UTC:
Do you think you'd have time to attempt this in the next few days? The eqiad anchor being down has some impact on our monitoring, and we have a RIPE engineer waiting on us to try rebooting the anchor.
Tue, Jul 21
Hi @Cmjohnson -- any idea of when you might be able to get around to this? Thanks!
Mon, Jul 20
Email with temporary Kerberos password sent.
Sun, Jul 19
Sat, Jul 18
Fri, Jul 17
$ host -t txt wikimediafoundation.org
wikimediafoundation.org descriptive text "facebook-domain-verification=orjhfudqyeumf4uber3bpwan5pisu0"
Thu, Jul 16
Ah -- one last thing -- you should have an email in your inbox with a temporary Kerberos password. Please follow the instructions in it and set your own password there soon.
Thanks! This will go through early next week, per the waiting period.
Are you still having trouble accessing the stat machines?
Two things pending:
- @Nuria's approval for analytics access
- three business day waiting period
Thanks, I've verified that NDA is on file.
Done; should be live across the fleet within 30 minutes.
thanks! will do so today
@spatton I'm going to optimistically close this assuming that Turnilo access has been sufficient for you; please do reopen if it's not!
@Nuria ping :)
@Jrbranaa ping? this access request is waiting on your reply re: contract end date
Jul 15 2020
I think it would be nice if there were a small helper function to set User-Agent headers per our policy.
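A minimal sketch of what such a helper might look like, in Python with only the stdlib. The function names and the exact header format are assumptions for illustration, not an existing API; the general shape (tool name, version, and a contact address) follows the spirit of the User-Agent policy:

```python
# Hypothetical helper: build a descriptive User-Agent identifying the
# tool and a contact address, then attach it to an opener.
import urllib.request


def make_user_agent(tool: str, version: str, contact: str) -> str:
    """Return a User-Agent string like 'mytool/1.0 (ops@example.org)'."""
    return f"{tool}/{version} ({contact})"


def opener_with_user_agent(tool: str, version: str, contact: str):
    """Return a urllib opener that sends the descriptive User-Agent."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", make_user_agent(tool, version, contact))]
    return opener
```

Callers would then use `opener_with_user_agent(...)` instead of the default opener, so every request carries an identifiable header.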
Jul 14 2020
Jul 12 2020
This happened again just now.
Jul 9 2020
Jul 8 2020
FTR i-->h is a single bit-flip in the LSB.
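This is easy to check: 'i' is 0x69 and 'h' is 0x68, so they differ only in bit 0.

```python
# 'i' (0x69) and 'h' (0x68) differ only in the least-significant bit,
# so a single flipped bit turns one character into the other.
i, h = ord('i'), ord('h')
assert i ^ h == 0b1  # XOR isolates the differing bits
print(f"{i:#04x} ^ {h:#04x} = {i ^ h:#x}")  # 0x69 ^ 0x68 = 0x1
```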