This particular issue is resolved for now, and the action items and other ideas spawned in the discussion of it will be tracked as sub-tasks of T263275: Capacity planning for (& optimization of) transport backhaul vs edge egress
For posterity: repooling swift@eqiad took 3.5 Gbit/s off of the codfw->eqiad path.
Thu, Sep 17
(an update: duh, we have ~3Gbit/s of codfw-->esams traffic that is traversing eqiad)
After extensive investigation by one of our network connectivity providers, we believe that the cause has been discovered and fixed as of about 15:30 UTC today.
For posterity, logstash link: https://logstash.wikimedia.org/goto/f8c9aec62cbdb9dacf931493e056196c
Wed, Sep 16
Today everything is fine. I would consider this issue closed and resolved.
Great, thank you! That was my thinking as well, but I wanted to confirm.
hey @jlinehan -- do you have any concerns about me re-routing the DNS of intake-logging.wikimedia.org to resolve not to the edge datacenter nearest to the user, but to the second-nearest edge datacenter? This would help us get Network Error Logging reports in realtime, while also not really negatively impacting the client JS error use case AFAICT. There are more details about this specific change in T261340, and the larger context is in T257527 (to which I think you're subscribed).
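To make "second-nearest" concrete, here is a toy sketch of the intended selection logic. It is purely illustrative: gdnsd's geoip plugin does the real work, and the per-client ordering shown is invented. The rationale, as I understand it, is that a report about an unreachable nearest edge shouldn't have to travel through that same edge:

```
# Toy illustration only; not gdnsd configuration or code.
# Site names are real WMF edge DCs circa 2020; the orderings are invented.

def pick_intake_logging_site(sites_nearest_first):
    """Resolve intake-logging.wikimedia.org to the second-nearest edge
    site, falling back to the nearest if it's the only one known."""
    if len(sites_nearest_first) < 2:
        return sites_nearest_first[0]
    return sites_nearest_first[1]

# A client whose nearest edge is esams would send NEL reports via eqiad:
assert pick_intake_logging_site(["esams", "eqiad", "codfw"]) == "eqiad"
```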
Tue, Sep 15
Today was another day where it would have been helpful to use this endpoint, but I couldn't :)
- For the TTL (defined by the max_age member), there seem to be two TTLs we have to think about: the TTL for the report_to endpoint group that NEL reports are sent to, and the TTL for the NEL policy itself. The endpoint group's TTL can be greater than or equal to the NEL policy's TTL. (As per the standard: "If the Reporting policy expires, NEL reports will not be delivered, even if the NEL policy has not expired.") A sketch of the two headers is below.
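To make the two max_age values concrete, here's a hedged sketch of the header pair. The endpoint path, TTLs, and sampling fraction are illustrative stand-ins, not our actual production values:

```
import json

# Illustrative values only; the real endpoint URL and TTLs are being worked
# out in T257527 / T261340. The /v1/events path is hypothetical.
report_to = {
    "group": "nel",
    "max_age": 604800,  # endpoint-group TTL (Reporting API): 7 days
    "endpoints": [{"url": "https://intake-logging.wikimedia.org/v1/events"}],
}
nel = {
    "report_to": "nel",        # names the endpoint group above
    "max_age": 86400,          # NEL policy TTL: 1 day; should not outlive the group's max_age
    "failure_fraction": 0.05,  # sampling: report ~5% of observed failures
}

# These would be served to clients as two HTTP response headers:
print("Report-To:", json.dumps(report_to))
print("NEL:", json.dumps(nel))
```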
FYI @CDanis is working on T257527: automatically collect network error reports from users' browsers (Network Error Logging API), which expects to have http.client_ip in logstash.
For anyone running into this, please follow https://www.mediawiki.org/wiki/How_to_report_a_bug#Reporting_a_connectivity_issue (but please note that this ticket is public so you may not want to post your IP and other personal data) - thanks!
Mon, Sep 14
I believe the only thing left to do is to perform a rolling restart of the eventgate-logging-external pods (or the container within them).
Today we had reports of an issue from @Andyrom75 that was happening all the time on their Wind (AS1267) mobile connection, and was happening under some circumstances on their Vodafone (AS30722) connection, but we did not get a full traceroute or an IP address, so it's very hard to say what was going on or if the issue was related.
An update on the last known status of the issue:
The referenced-by-transclusion paste is P12582
Fri, Sep 4
I just had an alternate idea, which wouldn't require any change to gdnsd.
There are three degrees of freedom to play with here (a back-of-the-envelope sketch follows the list):
- The set of domains for which we request reports
- The sampling fraction we set for all of/each of those (when a user agent sees an error, how often does it create a report for that error?)
- The TTL we set for how long user agents will persist the above
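Here is that back-of-the-envelope sketch of how the knobs interact (all figures invented):

```
# All figures hypothetical; this only illustrates the trade-offs among the
# three knobs listed above.

def expected_reports_per_sec(error_events_per_sec, failure_fraction):
    """Each user agent reports a given failure with probability
    failure_fraction, so intake volume scales linearly with it."""
    return error_events_per_sec * failure_fraction

# e.g. 10,000 network-error events/sec across all opted-in domains,
# sampled at 1%, means ~100 reports/sec at the intake endpoint:
print(expected_reports_per_sec(10_000, 0.01))  # 100.0

# The domain list scales error_events_per_sec up or down; failure_fraction
# scales the sampled share; the TTL doesn't change steady-state volume, but
# it bounds how long already-distributed policies keep generating reports
# after we lower the fraction or withdraw the policy.
```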
Example requests/responses for both the preflight and the actual request are in NDA'd paste P12494 (it has my own PII in it).
Wed, Sep 2
I think that idea could be reasonable... but is it too hard to get the original XFF header out of the user request made to Turnilo, and forward it?
Tue, Sep 1
@jbond kindly backported the buster version of rasdaemon to stretch. I'm going to attempt installing it on a few stretch hosts that are consistently reporting memory issues.
@CDanis: while reviewing the PS from Dzahn, I noticed that the backport has the wrong version number (deb8u1 vs. deb9u1). This is not a problem, but if we still plan to install this on all stretch servers it would be good to fix it. So I wondered if this is still something you want to push to the stretch machines. If not, I'll just delete the package from stretch-wikimedia and remove it from thumbor1004 (the only stretch box that currently has it).
Mon, Aug 31
FTR, in T261506 I added wikimedia.pl to our list of allowed domains.
- They're an affiliate, listed on metawiki for some time, which I think is the closest thing we have to a 'bright line' right now.
- It was a time-sensitive request.
- It was similar in nature to the already-allowed wikilovesmonuments, which seemed uncontroversial.
A fix has been merged and should take effect within the next half hour. Please re-open if you still see issues after an hour from now.
Fri, Aug 28
- T260520: maps.wikilovesmonuments.org returns an HTTP 429 error (let it access varnish maps_domains) for Commons's photo competition
- T261506: wikimedia.pl returns an HTTP 429 error (let it access varnish maps_domains) from Wikimedia Polska
- https://twitter.com/naveenpf/status/1299219712488779781 from a volunteer contributor: Apparently our tileserver is (one of the few? the only?) public server with many kinds of localized tiles.
My two cents:
Yes, it would. There are two use cases here:
- DoS attack analysis, for which real-time is essential. Here, the augmented data would be helpful, but it's not required, nor as important as real-time delivery.
- Historical analysis of our traffic flows with other networks, so we can propose peering with them. Here the augmented data would be very helpful.
Does that make sense?
Thu, Aug 27
It's critical that this data remain real-time, even if some of the fields aren't available in the real-time data.
Tue, Aug 25
Thanks for opening this! Really happy to see it (and was also talking to @wkandek just yesterday about making bpfcc generally available in the fleet).
Fri, Aug 21
As a workaround, you can add a bookmarklet to your bookmarks bar: https://edg2s.github.io/w.wiki-bookmarklet/
Aug 17 2020
Ah, yes -- and replied to them, clarifying both the cause of their outage and what contact addresses they should use for us in the future (although I haven't heard anything back yet).
Aug 14 2020
There's still an issue on Jio's side that needs to be fixed by them, but we've put a temporary workaround in place, and their users should be able to access Wikipedia and other WMF sites. Please let us know if that isn't the case!
For posterity, relevant workaround patch and deployment thereof: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/620377
Aug 13 2020
A thing that someone daring in EUTZ might want to try: using perf probe, modifying the bpfcc-memleak script, or writing a trivial bpftrace script, attach a probe to memcg_schedule_kmem_cache_create and gather calling stack traces. That's the function that creates the work item that results in a worker thread calling memcg_create_kmem_cache, as seen in the stack traces we saw for 32-byte mallocs.
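A minimal, untested sketch of the bpfcc route, using bcc's Python bindings (assumes memcg_schedule_kmem_cache_create is visible to kprobes, i.e. not inlined away, on the running kernel):

```
#!/usr/bin/env python3
# Untested sketch: count kernel stack traces leading into
# memcg_schedule_kmem_cache_create, using bcc's Python bindings.
import time
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_STACK_TRACE(stack_traces, 4096);
BPF_HASH(counts, int, u64);

int trace_create(struct pt_regs *ctx) {
    int stack_id = stack_traces.get_stackid(ctx, 0);  // kernel stack
    if (stack_id >= 0)
        counts.increment(stack_id);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="memcg_schedule_kmem_cache_create",
                fn_name="trace_create")

print("Tracing memcg_schedule_kmem_cache_create... Ctrl-C to dump.")
try:
    time.sleep(3600)
except KeyboardInterrupt:
    pass

# Dump the observed calling stacks, most frequent first.
for stack_id, count in sorted(b["counts"].items(),
                              key=lambda kv: kv[1].value, reverse=True):
    print("%d calls via:" % count.value)
    for addr in b["stack_traces"].walk(stack_id.value):
        print("    %s" % b.ksym(addr, show_offset=True).decode())
```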