Yesterday

CDanis updated the task description for T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Fri, Sep 18, 9:46 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis created T263291: experiment with a "unified" ATS-BE pool.
Fri, Sep 18, 9:08 PM · Traffic, Operations
CDanis created T263290: Turnilo: per-second rates for wmf_netflow bytes + packets.
Fri, Sep 18, 8:52 PM · Analytics, netops, Traffic, Operations
CDanis added a parent task for T263212: Consider balancing VRRP primaries to cr1/cr2: T263275: Capacity planning for (& optimization of) transport backhaul vs edge egress.
Fri, Sep 18, 8:45 PM · Operations, netops
CDanis added subtasks for T263275: Capacity planning for (& optimization of) transport backhaul vs edge egress: Unknown Object (Task), T263212: Consider balancing VRRP primaries to cr1/cr2, T263230: Set the same OSPF weight on eqiad/codfw wavelenghts.
Fri, Sep 18, 8:45 PM · netops, Traffic, Epic, Operations
CDanis added a parent task for T263230: Set the same OSPF weight on eqiad/codfw wavelenghts: T263275: Capacity planning for (& optimization of) transport backhaul vs edge egress.
Fri, Sep 18, 8:45 PM · Operations, netops
CDanis created T263288: experiment with reënabling compression between applayer's TLS terminators and edge caches.
Fri, Sep 18, 8:26 PM · netops, Traffic, Operations
CDanis triaged T263277: Collect netflow data for internal traffic as Medium priority.
Fri, Sep 18, 6:08 PM · netops, Traffic, Operations
CDanis created T263277: Collect netflow data for internal traffic.
Fri, Sep 18, 6:07 PM · netops, Traffic, Operations
CDanis closed T263206: cr1-codfw<->cr1-eqiad link saturation as Resolved.
Fri, Sep 18, 5:50 PM · netops, Operations
CDanis added a comment to T263206: cr1-codfw<->cr1-eqiad link saturation.

This particular issue is resolved for now, and the action items and other ideas spawned in the discussion of it will be tracked as sub-tasks of T263275: Capacity planning for (& optimization of) transport backhaul vs edge egress

Fri, Sep 18, 5:50 PM · netops, Operations
CDanis triaged T263275: Capacity planning for (& optimization of) transport backhaul vs edge egress as Medium priority.
Fri, Sep 18, 5:49 PM · netops, Traffic, Epic, Operations
CDanis edited P12668 simple-nel.py.
Fri, Sep 18, 5:12 PM
CDanis updated the title for P12668 simple-nel.py from Masterwork From Distant Lands to simple-nel.py.
Fri, Sep 18, 5:11 PM
CDanis edited P12668 simple-nel.py.
Fri, Sep 18, 5:11 PM
CDanis added a comment to T263206: cr1-codfw<->cr1-eqiad link saturation.

for posterity: repooling swift@eqiad took 3.5Gbit/s off of the codfw->eqiad path.

Fri, Sep 18, 2:01 PM · netops, Operations
jcrespo awarded T262869: Wikimedia projects not reachable for some Telecom Italia users a Mountain of Wealth token.
Fri, Sep 18, 8:04 AM · Traffic, Operations, netops

Thu, Sep 17

Dzahn awarded T262869: Wikimedia projects not reachable for some Telecom Italia users a Yellow Medal token.
Thu, Sep 17, 11:51 PM · Traffic, Operations, netops
CDanis added a comment to T263206: cr1-codfw<->cr1-eqiad link saturation.

(an update: duh, we have ~3Gbit/s of codfw-->esams traffic that is traversing eqiad)

Thu, Sep 17, 10:54 PM · netops, Operations
CDanis closed T262869: Wikimedia projects not reachable for some Telecom Italia users as Resolved.

After extensive investigation by one of our network connectivity providers, we believe that the cause has been discovered and fixed as of about 15:30 UTC today.

Thu, Sep 17, 10:52 PM · Traffic, Operations, netops
CDanis created T263206: cr1-codfw<->cr1-eqiad link saturation.
Thu, Sep 17, 10:20 PM · netops, Operations
CDanis added a comment to T263132: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount.

For posterity, logstash link: https://logstash.wikimedia.org/goto/f8c9aec62cbdb9dacf931493e056196c

Thu, Sep 17, 3:44 PM · Analytics-Kanban, Operations, Event-Platform, Analytics, Wikimedia-production-error
CDanis added a comment to T262869: Wikimedia projects not reachable for some Telecom Italia users.

There was another instance of it about 10 hours ago.

Also right now it seems, at least from some TIM customer in Milan (in Wikimedia Italia's office).

Thu, Sep 17, 1:34 PM · Traffic, Operations, netops

Wed, Sep 16

CDanis added a comment to T262869: Wikimedia projects not reachable for some Telecom Italia users.

Today, all is fine. I would consider this issue closed and solved.

Wed, Sep 16, 9:28 PM · Traffic, Operations, netops
CDanis added a comment to T226986: Client side error logging production launch.

Great, thank you! That was my thinking as well, but I wanted to confirm.

Wed, Sep 16, 5:43 PM · Analytics-Radar, MW-1.35-notes (1.35.0-wmf.24; 2020-03-17), Performance-Team (Radar), Desktop Improvements, Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic
CDanis added a comment to T226986: Client side error logging production launch.

hey @jlinehan -- do you have any concerns about me re-routing the DNS of intake-logging.wikimedia.org to resolve not to the nearest edge datacenter to the user, but to the second-nearest edge datacenter? This would help us get Network Error Logging reports in realtime, while also not really negatively impacting the client JS error use case AFAICT. There's some more details about this specific change in T261340, and the larger context in T257527 (to which I think you're subscribed).
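
The routing idea in this comment can be sketched as follows. This is only an illustration under assumptions: the function name, the `skip_first` parameter (borrowed from the title of T261340), and the proximity ordering in the usage example are hypothetical and do not reflect how gdnsd implements anything.

```python
def pick_datacenter(ranked_by_proximity: list[str], skip_first: bool = True) -> str:
    """Pick a datacenter from a list ordered nearest-first (hypothetical helper).

    With skip_first set, the second-nearest site is returned so that a report
    about a failure at the nearest site can still be delivered elsewhere.
    """
    if not ranked_by_proximity:
        raise ValueError("no datacenters available")
    if skip_first and len(ranked_by_proximity) > 1:
        return ranked_by_proximity[1]
    # Fall back to the nearest site when there is no alternative.
    return ranked_by_proximity[0]


# The ordering below is made up for illustration only.
print(pick_datacenter(["esams", "eqiad", "codfw"]))
```

Falling back to the only available site keeps the name resolvable even when a single datacenter remains.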

Wed, Sep 16, 5:28 PM · Analytics-Radar, MW-1.35-notes (1.35.0-wmf.24; 2020-03-17), Performance-Team (Radar), Desktop Improvements, Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic
CDanis updated the task description for T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Wed, Sep 16, 5:16 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis updated the title for P12607 RIPE Atlas traceroute data from affected Telecom Italia probes from untitled to RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 1:39 PM
CDanis edited P12607 RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 1:33 PM
CDanis edited P12607 RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 1:20 PM
CDanis edited P12607 RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 1:19 PM
CDanis edited P12607 RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 1:02 PM
CDanis edited P12607 RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 12:56 PM
CDanis edited P12607 RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 12:48 PM
CDanis edited P12607 RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 12:45 PM
CDanis edited P12607 RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 12:42 PM
CDanis created P12607 RIPE Atlas traceroute data from affected Telecom Italia probes.
Wed, Sep 16, 12:35 PM

Tue, Sep 15

CDanis added a comment to T178458: un blacklist https://integration.wikimedia.org/ci/computer/XXXX/builds.

Today was another day where it would have been helpful to use this endpoint, but I couldn't :)

Tue, Sep 15, 9:07 PM · User-Addshore, Continuous-Integration-Infrastructure
CDanis added a comment to T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
  1. For the TTL (defined by the max_age member), there seem to be two TTLs we have to think about: the TTL for report_to, which specifies the endpoint group to which the NEL reports will be sent, and the TTL for the NEL policy itself. The TTL for the endpoint can be greater than or equal to the TTL of the NEL policy. (As per the standard: "If the Reporting policy expires, NEL reports will not be delivered, even if the NEL policy has not expired.")
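
A minimal sketch of why the endpoint TTL should not be shorter than the NEL policy TTL. The helper names are hypothetical, and time is simplified to seconds since both policies were last refreshed together, which is an assumption rather than something the standard guarantees.

```python
def reports_deliverable(nel_max_age: int, report_to_max_age: int, elapsed: int) -> bool:
    """Return True if a NEL report could still be delivered after `elapsed` seconds.

    Both the NEL policy and the Reporting (endpoint group) policy must be
    unexpired; if either has expired, no report is delivered.
    """
    nel_alive = elapsed < nel_max_age
    reporting_alive = elapsed < report_to_max_age
    return nel_alive and reporting_alive


def effective_window(nel_max_age: int, report_to_max_age: int) -> int:
    """The usable reporting window is bounded by the shorter of the two TTLs."""
    return min(nel_max_age, report_to_max_age)


# A one-day NEL policy with a one-hour endpoint TTL only reports for one hour.
print(effective_window(nel_max_age=86400, report_to_max_age=3600))
```

Setting the endpoint TTL greater than or equal to the NEL TTL makes the NEL policy's own max_age the only limit.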
Tue, Sep 15, 8:05 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis added a comment to T262626: Remove http.client_ip from EventGate default schema (again).

FYI @CDanis is working on T257527: automatically collect network error reports from users' browsers (Network Error Logging API) which expects to have http.client_ip in logstash.

Tue, Sep 15, 7:28 PM · Patch-For-Review, Product-Analytics (Kanban), Product-Infrastructure-Data, observability, Privacy Engineering, Analytics, Event-Platform
CDanis edited P12591 Masterwork From Distant Lands.
Tue, Sep 15, 6:04 PM
CDanis added a comment to T262869: Wikimedia projects not reachable for some Telecom Italia users.

For anyone running into this, please follow https://www.mediawiki.org/wiki/How_to_report_a_bug#Reporting_a_connectivity_issue (but please note that this ticket is public so you may not want to post your IP and other personal data) - thanks!

Tue, Sep 15, 5:10 PM · Traffic, Operations, netops

Mon, Sep 14

CDanis updated subscribers of T262087: Deploy an updated eventgate-logging-external with NEL patches.

I believe the only thing left to do is to perform a rolling restart of the eventgate-logging-external pods (or the container within them).

Mon, Sep 14, 10:57 PM · Patch-For-Review, Analytics, Operations
CDanis added a comment to T262869: Wikimedia projects not reachable for some Telecom Italia users.

Today we had reports of an issue from @Andyrom75 that was happening all the time on their Wind (AS1267) mobile connection, and was happening under some circumstances on their Vodafone (AS30722) connection, but we did not get a full traceroute or an IP address, so it's very hard to say what was going on or if the issue was related.

Mon, Sep 14, 10:51 PM · Traffic, Operations, netops
CDanis updated subscribers of T262869: Wikimedia projects not reachable for some Telecom Italia users.

An update on my last known disposition of the issue:

Mon, Sep 14, 9:59 PM · Traffic, Operations, netops
CDanis updated the title for P12583 top websites that send Report-To/NEL response headers from Masterwork From Distant Lands to top websites that send Report-To/NEL response headers.
Mon, Sep 14, 6:56 PM
CDanis edited P12583 top websites that send Report-To/NEL response headers.
Mon, Sep 14, 6:56 PM
CDanis edited P12583 top websites that send Report-To/NEL response headers.
Mon, Sep 14, 6:52 PM
CDanis added a comment to P12583 top websites that send Report-To/NEL response headers.

The referenced-by-transclusion paste is P12582

Mon, Sep 14, 6:46 PM
CDanis edited P12583 top websites that send Report-To/NEL response headers.
Mon, Sep 14, 6:46 PM
CDanis edited P12582 top 47 websites according to https://en.wikipedia.org/wiki/List_of_most_popular_websites.
Mon, Sep 14, 6:27 PM
CDanis edited P12581 wtf.
Mon, Sep 14, 4:33 PM

Fri, Sep 4

CDanis updated the task description for T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Fri, Sep 4, 8:30 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis added a comment to T261340: 'skip_first' feature flag for gdnsd GeoIP plugin.

I just had an alternate idea, which wouldn't require any change to gdnsd.

Fri, Sep 4, 8:28 PM · Patch-For-Review, DNS, Traffic, Operations
CDanis added a comment to T257527: automatically collect network error reports from users' browsers (Network Error Logging API).

There's three degrees of freedom to play with here:

  1. The set of domains for which we request reports
  2. The sampling fraction we set for all of/each of those (when a user agent sees an error, how often does it create a report for that error?)
  3. The TTL we set for how long user agents will persist the above
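
The three knobs above map onto the response headers roughly as sketched here. The field names (`group`, `max_age`, `endpoints`, `report_to`, `failure_fraction`) come from the Reporting and Network Error Logging specifications; the function names, the group name, the endpoint path, and the domain set are hypothetical placeholders, not the deployed configuration.

```python
import json


def build_nel_headers(group: str, endpoint_url: str,
                      failure_fraction: float, max_age: int) -> dict[str, str]:
    """Build Report-To and NEL header values for one sampling fraction and TTL."""
    if not 0.0 <= failure_fraction <= 1.0:
        raise ValueError("failure_fraction must be between 0 and 1")
    report_to = {"group": group, "max_age": max_age,
                 "endpoints": [{"url": endpoint_url}]}
    nel = {"report_to": group, "max_age": max_age,
           "failure_fraction": failure_fraction}
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


def headers_for(host: str, enabled_domains: set[str],
                failure_fraction: float, max_age: int) -> dict[str, str]:
    """Knob 1: only hosts in enabled_domains request reports at all."""
    if host not in enabled_domains:
        return {}
    # Hypothetical group name and endpoint path, for illustration only.
    return build_nel_headers("nel-example",
                             "https://intake-logging.wikimedia.org/hypothetical-path",
                             failure_fraction, max_age)


# Knob 2 is failure_fraction, knob 3 is max_age (seconds).
print(headers_for("en.wikipedia.org", {"en.wikipedia.org"}, 0.05, 3600))
```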
Fri, Sep 4, 7:46 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis updated the task description for T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Fri, Sep 4, 7:42 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis added a comment to T262087: Deploy an updated eventgate-logging-external with NEL patches.

Example request/responses of both preflight and actual request are in NDA'd paste P12494 (has my own PII in it)

Fri, Sep 4, 7:39 PM · Patch-For-Review, Analytics, Operations
CDanis updated the task description for T262087: Deploy an updated eventgate-logging-external with NEL patches.
Fri, Sep 4, 7:25 PM · Patch-For-Review, Analytics, Operations
CDanis updated the task description for T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Fri, Sep 4, 7:22 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis created T262087: Deploy an updated eventgate-logging-external with NEL patches.
Fri, Sep 4, 7:21 PM · Patch-For-Review, Analytics, Operations

Wed, Sep 2

CDanis added a comment to T233336: Add urlshortener button to Turnilo.

I think that idea could be reasonable... but is it too hard to get the
original XFF header out of the user request made to Turnilo, and forward
that?

Wed, Sep 2, 7:04 PM · Patch-For-Review, Analytics

Tue, Sep 1

CDanis updated subscribers of T250104: wm-bot doesn't set charset=utf-8, which breaks (amongst other things) emoji rendering.

@RLazarus encountered this today while doing some retrospective on Datacenter-Switchover.

Tue, Sep 1, 8:35 PM · Operations, WM-Bot
CDanis added a comment to T205396: Evaluate/integrate rasdaemon as a replacement for mcelog.

@jbond kindly backported the buster version of rasdaemon to stretch. I'm going to attempt installing it on a few stretch hosts that are consistently reporting memory issues

@CDanis while reviewing the PS from Dzahn I noticed that the backport has the wrong version number, i.e. deb8u1 vs deb9u1. This is not a problem, but if we still plan to install this on all stretch servers it would be good to fix it. So I wondered if this is still something you want to push to the stretch machines. If not, I'll just delete the package from stretch-wikimedia and remove it from thumbor1004 (the only stretch box to currently have it).

Tue, Sep 1, 12:12 PM · Patch-For-Review, observability, Operations

Mon, Aug 31

CDanis closed T261531: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs as Resolved.
Mon, Aug 31, 6:07 PM · Patch-For-Review, Matrix, DNS, Operations, Traffic
CDanis updated subscribers of T261424: Limit maps serving to Wikimedia hosted sites only.

FTR, in T261506 I added wikimedia.pl to our list of allowed domains.

  • They're an affiliate, listed on metawiki for some time, which I think is the closest thing we have to a 'bright line' right now.
  • It was a time-sensitive request
  • It was similar in nature to the already-allowed wikilovesmonuments, which seemed uncontroversial.
Mon, Aug 31, 4:43 PM · Maps, Traffic, Product-Infrastructure-Team-Backlog, Operations
CDanis closed T261506: wikimedia.pl returns a HTTP 429 error (let it access varnish maps_domains) as Resolved.

A fix has been merged and should take effect within the next half hour. Please re-open if you still see issues after an hour from now.

Mon, Aug 31, 4:35 PM · Traffic, Operations, Wiki-Loves-Monuments (2020), Maps

Sun, Aug 30

CDanis updated the task description for T254646: Reconsidering how we name things.
Sun, Aug 30, 4:01 PM · MW-1.36-notes (1.36.0-wmf.10; 2020-09-22), MediaWiki-extensions-General, Wikimedia-General-or-Unknown, MediaWiki-General, MW-1.35-notes (1.35.0-wmf.39; 2020-06-30), Patch-For-Review, Voice & Tone

Fri, Aug 28

CDanis added a comment to T261424: Limit maps serving to Wikimedia hosted sites only.

@JMinor @Elitre Although some of this is already mentioned upwards in this task, here's a summary of the community objections I'm aware of so far:

Fri, Aug 28, 6:43 PM · Maps, Traffic, Product-Infrastructure-Team-Backlog, Operations
CDanis added a comment to T261424: Limit maps serving to Wikimedia hosted sites only.

My two cents:

Fri, Aug 28, 6:09 PM · Maps, Traffic, Product-Infrastructure-Team-Backlog, Operations
CDanis updated the task description for T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Fri, Aug 28, 3:19 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis added a comment to T254332: Add more dimensions in the netflow/pmacct/Druid pipeline.

Yes, it would. There's two use cases here:

  • DoS attack analysis, for which real-time is essential. Here, the augmented data would be helpful, but it's not required or as important as real-time
  • Historical analysis of our traffic flows with other networks, so we can propose peering with them. Here the augmented data would be very helpful.

Does that make sense?

Fri, Aug 28, 12:28 PM · Analytics-Kanban, Analytics, netops, Operations

Thu, Aug 27

CDanis edited P12405 Masterwork From Distant Lands.
Thu, Aug 27, 3:29 PM
CDanis added a comment to T254332: Add more dimensions in the netflow/pmacct/Druid pipeline.

It's critical that this data remain real-time, even if some of the fields aren't available in the real-time data.

Thu, Aug 27, 2:22 PM · Analytics-Kanban, Analytics, netops, Operations

Wed, Aug 26

CDanis updated the task description for T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Wed, Aug 26, 5:38 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis added a subtask for T257527: automatically collect network error reports from users' browsers (Network Error Logging API): T261340: 'skip_first' feature flag for gdnsd GeoIP plugin.
Wed, Aug 26, 5:36 PM · Patch-For-Review, Product-Infrastructure-Data, Operations, Goal, Epic
CDanis added a parent task for T261340: 'skip_first' feature flag for gdnsd GeoIP plugin: T257527: automatically collect network error reports from users' browsers (Network Error Logging API).
Wed, Aug 26, 5:36 PM · Patch-For-Review, DNS, Traffic, Operations
CDanis created T261340: 'skip_first' feature flag for gdnsd GeoIP plugin.
Wed, Aug 26, 5:36 PM · Patch-For-Review, DNS, Traffic, Operations

Tue, Aug 25

CDanis triaged T261193: Make bpfcc-tools available fleet-wide as Medium priority.

Thanks for opening this! Really happy to see it (and was also talking to @wkandek just yesterday about making bpfcc generally available in the fleet).

Tue, Aug 25, 1:06 PM · Patch-For-Review, Operations
CDanis edited P12342 Masterwork From Distant Lands.
Tue, Aug 25, 12:11 PM

Fri, Aug 21

CDanis added a comment to T233336: Add urlshortener button to Turnilo.

As a workaround, you can add a bookmarklet to your bookmarks bar: https://edg2s.github.io/w.wiki-bookmarklet/

Fri, Aug 21, 1:05 PM · Patch-For-Review, Analytics
CDanis renamed T233336: Add urlshortener button to Turnilo from Add urlshortener to Turnilo to Add urlshortener button to Turnilo.
Fri, Aug 21, 1:04 PM · Patch-For-Review, Analytics
CDanis updated subscribers of T258748: Client Developer has a cookie-free API call.
Fri, Aug 21, 4:49 AM · Operations, Traffic, Platform Team Sprints Board (Sprint 1), Platform Team Workboards (Green), Story, Platform Team Initiatives (API Gateway)

Thu, Aug 20

CDanis updated subscribers of T260943: Don't set cookies for api.wikimedia.org at the caching layer.
Thu, Aug 20, 8:56 PM · Operations, Traffic

Aug 18 2020

CDanis edited P12298 Masterwork From Distant Lands.
Aug 18 2020, 6:52 PM
CDanis edited P12297 Masterwork From Distant Lands.
Aug 18 2020, 6:49 PM

Aug 17 2020

CDanis added a comment to T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites.

Ah, yes -- and replied to them, clarifying both the cause of their outage and what contact addresses they should use for us in the future (although I haven't heard anything back yet).

Aug 17 2020, 6:09 PM · Operations, netops, Traffic
CDanis closed T260452: clean up workaround and measurements put in place during Jio RPKI error as Resolved.
Aug 17 2020, 3:40 PM · Operations, netops, Traffic
CDanis closed T260452: clean up workaround and measurements put in place during Jio RPKI error, a subtask of T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites, as Resolved.
Aug 17 2020, 3:40 PM · Operations, netops, Traffic
CDanis updated the task description for T260452: clean up workaround and measurements put in place during Jio RPKI error.
Aug 17 2020, 3:07 PM · Operations, netops, Traffic
CDanis added a member for Triagers: jlinehan.
Aug 17 2020, 12:54 PM
CDanis added a member for Triagers: CDanis.
Aug 17 2020, 12:54 PM

Aug 16 2020

CDanis added a comment to T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites.
Aug 16 2020, 1:04 PM · Operations, netops, Traffic

Aug 14 2020

CDanis created T260452: clean up workaround and measurements put in place during Jio RPKI error.
Aug 14 2020, 5:10 PM · Operations, netops, Traffic
CDanis closed T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites as Resolved.

There's still an issue on Jio's side that needs to be fixed by them, but, we've put a temporary workaround in place, and their users should be able to access Wikipedia and other WMF sites. Please let us know if that isn't the case!

Aug 14 2020, 5:02 PM · Operations, netops, Traffic
CDanis closed Restricted Task, a subtask of T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites, as Resolved.
Aug 14 2020, 5:02 PM · Operations, netops, Traffic
CDanis updated the task description for T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites.
Aug 14 2020, 4:52 PM · Operations, netops, Traffic
CDanis added a comment to T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites.

For posterity, relevant workaround patch and deployment thereof: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/620377
https://sal.toolforge.org/production?p=0&q=I9fcff8&d=

Aug 14 2020, 4:51 PM · Operations, netops, Traffic
CDanis added a subtask for T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites: Unknown Object (Task).
Aug 14 2020, 4:50 PM · Operations, netops, Traffic
CDanis created T260449: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites.
Aug 14 2020, 4:50 PM · Operations, netops, Traffic

Aug 13 2020

CDanis added a comment to T260281: mw* servers memory leaks (12 Aug).

A thing that someone daring in EUTZ might want to try: Using perf probe, or by modifying the bpfcc-memleak script, or by writing a trivial bpftrace script: attach a tracepoint to memcg_schedule_kmem_cache_create and gather calling stacktraces. That's the function that creates the work item that results in a worker thread calling memcg_create_kmem_cache, as seen in the stack traces we saw for 32-byte mallocs.
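
The "trivial bpftrace script" option might look something like the sketch below, which only builds the command and does not run it. It assumes the function is reachable as a kprobe (that is, not inlined) on the running kernel, which has not been verified here; the helper names are hypothetical.

```python
def build_bpftrace_program(function_name: str) -> str:
    """Return a bpftrace program that counts kernel stack traces per call site."""
    # kstack is the bpftrace builtin for the current kernel stack trace;
    # the map is printed automatically when bpftrace exits (Ctrl-C).
    return f"kprobe:{function_name} {{ @stacks[kstack] = count(); }}"


def build_bpftrace_argv(function_name: str) -> list[str]:
    """Return the argv for running the program with `bpftrace -e` (needs root)."""
    return ["bpftrace", "-e", build_bpftrace_program(function_name)]


argv = build_bpftrace_argv("memcg_schedule_kmem_cache_create")
print(" ".join(argv[:2]), repr(argv[2]))
```

Aggregating by stack in a map keeps the overhead low compared with printing every call, which matters on a production host.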

Aug 13 2020, 1:29 AM · Wikidata, Platform Engineering, Sustainability (Incident Followup), Operations, serviceops