
Thu, Oct 14

BBlack added a comment to T293294: setup drmrs mgmt & private prefixs - question on switch status.

We've two private subnets assigned, one for each rack/switch:

https://netbox.wikimedia.org/ipam/prefixes/405/prefixes/

LVS will be connected directly to each switch, similar to the current setup, for access to both.

Thu, Oct 14, 12:30 PM · Traffic, SRE, ops-drmrs, DNS, Infrastructure-Foundations, netops

Wed, Oct 13

BBlack updated subscribers of T293294: setup drmrs mgmt & private prefixs - question on switch status.

CC @MMandere as well once we have a decision on the IP prefixes here!

Wed, Oct 13, 7:15 PM · Traffic, SRE, ops-drmrs, DNS, Infrastructure-Foundations, netops

Tue, Oct 12

BBlack added a comment to T287584: DNS Discovery for active/passive failover within a data centre.

I think (but I'm sure it can be debated!) that, from the Traffic POV, a service's resiliency/failover within a DC shouldn't be managed via DNS automations like the discovery setup. The traffic infra does already aid the resiliency/failover of services within a DC, between distinct internal nodes/IPs, via LVS (which can have pools managed by etcd and/or healthchecks), but anything more involved than that isn't really in our bailiwick and is often cluster/service-specific (or in a future k8sy world, perhaps managed in a uniform way at that layer).

Tue, Oct 12, 2:17 PM · SRE, Traffic
ema awarded T289787: Clean up Traffic tag/workboard a Love token.
Tue, Oct 12, 1:25 PM · PM, SRE, Traffic

Fri, Oct 8

BBlack moved T274431: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 from DNS Infra to Upcoming on the Traffic board.
Fri, Oct 8, 8:19 PM · SRE, Traffic
BBlack moved T275409: Create and document Wikidough's privacy policy from DNS Infra to Upcoming on the Traffic board.
Fri, Oct 8, 8:19 PM · Privacy Engineering, SRE, Traffic
BBlack moved T283614: RIPE Atlas monitoring of reachability & latency towards anycasted Wikidough IP from DNS Infra to Upcoming on the Traffic board.
Fri, Oct 8, 8:19 PM · Traffic
BBlack moved T289536: Deploy durum: check service for Wikidough from DNS Infra to In Progress on the Traffic board.
Fri, Oct 8, 8:19 PM · Patch-For-Review, SRE, Traffic
BBlack moved T292737: Anycast: Add IPv6 support to bird and anycast-healthchecker (Puppet) from DNS Infra to In Progress on the Traffic board.
Fri, Oct 8, 8:19 PM · Traffic, Infrastructure-Foundations, SRE
BBlack moved T252132: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver from DNS Infra to In Progress on the Traffic board.
Fri, Oct 8, 8:17 PM · Patch-For-Review, SRE, Traffic
BBlack moved T292397: Improve runbooks for OCSP-related alerts from TLS to Triage on the Traffic board.
Fri, Oct 8, 8:17 PM · SRE, Traffic
BBlack moved T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic from TLS to In Progress on the Traffic board.
Fri, Oct 8, 8:17 PM · Patch-For-Review, SRE, Traffic
BBlack moved T271407: Upgrade envoyproxy to 1.16.2 from TLS to In Progress on the Traffic board.
Fri, Oct 8, 8:17 PM · SRE, serviceops, Traffic
BBlack moved T283164: Let's Encrypt issuance chains update from TLS to Active Issues on the Traffic board.
Fri, Oct 8, 8:17 PM · SRE, Traffic
BBlack moved T283165: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain from TLS to Active Issues on the Traffic board.
Fri, Oct 8, 8:16 PM · Patch-For-Review, Infrastructure-Foundations, SRE, Traffic
BBlack moved T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic from TLS to In Progress on the Traffic board.
Fri, Oct 8, 8:16 PM · Performance-Team (Radar), Patch-For-Review, SRE, Traffic
BBlack moved T292619: Implement a watchdog mechanism on acme-chief from TLS to In Progress on the Traffic board.
Fri, Oct 8, 8:16 PM · Patch-For-Review, SRE, Traffic, Acme-chief
BBlack moved T288106: Experiment with single backend CDN nodes from Caching to In Progress on the Traffic board.
Fri, Oct 8, 8:15 PM · User-ema, Patch-For-Review, SRE, Traffic
BBlack moved T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet from Caching to In Progress on the Traffic board.
Fri, Oct 8, 8:15 PM · SRE, Traffic, ops-ulsfo, DC-Ops
BBlack moved T282787: Configure dns and puppet repositories for new drmrs datacenter from Caching to In Progress on the Traffic board.
Fri, Oct 8, 8:15 PM · Patch-For-Review, SRE, Traffic
BBlack moved T282788: drmrs: primary software task from Caching to In Progress on the Traffic board.
Fri, Oct 8, 8:15 PM · Infrastructure-Foundations, procurement, netops, Traffic, SRE
BBlack moved T274228: Phabricator should cache tasks for a few minutes for logged-out users from Caching to Triage on the Traffic board.
Fri, Oct 8, 8:15 PM · SRE, Traffic, Phabricator
BBlack moved T282880: Revisit varnish dynamic backends mechanism from Caching to In Progress on the Traffic board.
Fri, Oct 8, 8:14 PM · Patch-For-Review, SRE, Traffic
BBlack moved T289787: Clean up Traffic tag/workboard from Caching to In Progress on the Traffic board.
Fri, Oct 8, 8:14 PM · PM, SRE, Traffic
BBlack moved T292290: Package and deploy Varnish 6.0.8 from Caching to In Progress on the Traffic board.
Fri, Oct 8, 8:14 PM · Performance-Team (Radar), User-ema, SRE, Traffic
BBlack moved T292817: Multiple ATS HTTP2 stats missing from Prometheus from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 7:59 PM · Traffic-Icebox, User-ema, SRE, SRE Observability
BBlack moved T292870: externally-hosted NEL report forwarders for more timely report reception from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 7:58 PM · Traffic-Icebox, Infrastructure-Foundations, netops, SRE
BBlack moved T292815: ATS should alert if the number of total or active connections reached maximum from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 7:56 PM · SRE, User-ema, Traffic
BBlack moved T292737: Anycast: Add IPv6 support to bird and anycast-healthchecker (Puppet) from Triage to DNS Infra on the Traffic board.
Fri, Oct 8, 7:56 PM · Traffic, Infrastructure-Foundations, SRE
BBlack closed T292632: Many KaiOS devices can't access WMF websites and can't use Wikipedia app, a subtask of T283164: Let's Encrypt issuance chains update, as Resolved.
Fri, Oct 8, 7:55 PM · SRE, Traffic
BBlack closed T292632: Many KaiOS devices can't access WMF websites and can't use Wikipedia app as Resolved.

Closing for now as I don't think there's anything we want to do on our end here. Thanks for the heads up!

Fri, Oct 8, 7:55 PM · KaiOS-Wikipedia-app, SRE, Inuka-Team, Traffic
BBlack moved T292506: Investigate cp5006 crash from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 7:54 PM · User-ema, SRE Observability (FY2021/2022-Q2), SRE, Traffic
BBlack moved T292291: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 7:54 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic, envoy, serviceops, CirrusSearch, WMF-JobQueue, Discovery-Search, Wikimedia-production-error
BBlack added a comment to T292291: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02.

Update on the ca-certificates end of this: Debian has a patch that will correct this at their own level at https://salsa.debian.org/debian/ca-certificates/-/commit/5b83fd984706ea03101dbb011846e60364c3a149 - but we don't yet know if this will be released for buster and/or bullseye updates. Stalling out a little on this before we move forward with the puppet-based solution.

Fri, Oct 8, 7:53 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic, envoy, serviceops, CirrusSearch, WMF-JobQueue, Discovery-Search, Wikimedia-production-error
BBlack moved T292820: Create runbook for VarnishTrafficDrop alert, change dashboard link from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 7:51 PM · User-ema, SRE, Traffic
BBlack moved T291148: VarnishTrafficDrop alert false positives due to DCs depooled from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 7:49 PM · SRE Observability (FY2021/2022-Q2), SRE, Traffic
BBlack moved T290694: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet from Triage to Caching on the Traffic board.
Fri, Oct 8, 7:49 PM · SRE, Traffic, ops-ulsfo, DC-Ops
BBlack moved T292290: Package and deploy Varnish 6.0.8 from Triage to Caching on the Traffic board.
Fri, Oct 8, 7:47 PM · Performance-Team (Radar), User-ema, SRE, Traffic
BBlack moved T290536: Serve production traffic via Kubernetes from Triage to Radar on the Traffic board.
Fri, Oct 8, 7:47 PM · Performance-Team (Radar), Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
BBlack moved T289974: Prometheus Varnish exporter alert: add runbook and link to dashboard from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 7:46 PM · User-ema, Observability-Alerting, SRE, Traffic
BBlack moved T289787: Clean up Traffic tag/workboard from Triage to Caching on the Traffic board.
Fri, Oct 8, 7:46 PM · PM, SRE, Traffic
BBlack moved T289536: Deploy durum: check service for Wikidough from Triage to DNS Infra on the Traffic board.
Fri, Oct 8, 7:45 PM · Patch-For-Review, SRE, Traffic
BBlack moved T287847: Performance implications of buffer sizes in Apache Traffic Server intercept plugins from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 7:42 PM · Traffic-Icebox, SRE
BBlack closed T287584: DNS Discovery for active/passive failover within a data centre as Declined.

Given your generous offer of declination, I think we'll take that route! :)

Fri, Oct 8, 7:42 PM · SRE, Traffic
BBlack moved T288106: Experiment with single backend CDN nodes from Triage to Caching on the Traffic board.
Fri, Oct 8, 7:39 PM · User-ema, Patch-For-Review, SRE, Traffic
BBlack added a comment to T287561: Review use of realloc in varnishkafka.

The last time I looked at the patches, I was a bit baffled and left it alone. It's not clear that there's any active issue affecting us that this will solve, and these kinds of issues and patchwork on them are *always* a lot more complex to get right (and to audit) than they seem. Given we will likely eventually move away from varnishkafka, I'm inclined to leave this alone unless there's a good reason to fix it (and then if there is a good reason - reviewing this will be non-trivial!)

Fri, Oct 8, 7:34 PM · Analytics-Kanban, Traffic, Analytics
BBlack moved T287266: Unexpected auditd service restart failure from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 7:29 PM · User-MoritzMuehlenhoff, SRE, Traffic
BBlack moved T286924: LVS should handle losing a NIC on eqiad and codfw from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 7:29 PM · Traffic-Icebox, Sustainability (Incident Followup), SRE
BBlack moved T286554: Per-country Frontend Traffic dashboards from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 7:27 PM · Traffic-Icebox, observability, SRE, Sustainability (Incident Followup)
BBlack closed T285953: cp3059 Varnish child crash: Worker Pool Queue does not move as Resolved.

We have a new varnish version coming soon, so stale crash reports are probably of little value now.

Fri, Oct 8, 7:25 PM · SRE, Traffic
BBlack moved T285926: Preserve Server response header when generating custom error page via VCL from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 7:25 PM · Traffic-Icebox, Patch-For-Review, SRE
BBlack removed a project from T285707: Services without a service IP cannot automatically be switched by the switchdc cookbook: Traffic.
Fri, Oct 8, 7:24 PM · SRE, serviceops, Datacenter-Switchover
BBlack added a comment to T284981: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change.

We chose S:BP for those queries on the assumption that, by its nature, it would be a cheap page to monitor. Is there a better option we should be using, or is this ticket more about fixing inefficiencies in it?

Fri, Oct 8, 7:21 PM · SRE, MediaWiki-General, Traffic, Pybal, wdwb-tech, Wikidata
BBlack moved T284555: Consider using BindsTo instead of Requires to declare dependencies between systemd unit from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 7:18 PM · Traffic-Icebox, SRE, serviceops, VPS-project-Codesearch
BBlack moved T284304: Create dashboard showing aggregate data transfer rates per DC/cluster from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 7:17 PM · Traffic-Icebox, SRE
BBlack moved T284292: Take response size into account in CDN HTTP requests throttling from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 7:16 PM · Traffic-Icebox, SRE
BBlack moved T283164: Let's Encrypt issuance chains update from Triage to TLS on the Traffic board.
Fri, Oct 8, 7:14 PM · SRE, Traffic
BBlack moved T283614: RIPE Atlas monitoring of reachability & latency towards anycasted Wikidough IP from Triage to DNS Infra on the Traffic board.
Fri, Oct 8, 7:14 PM · Traffic
BBlack moved T280628: Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams from Triage to Radar on the Traffic board.
Fri, Oct 8, 7:13 PM · Wikimedia Enterprise (Okapi Wikimedia Enterprise), Traffic, SRE, Platform Engineering
BBlack added a comment to T281700: PDNS in cloud can return inconsistent answers.

As noted in the description, DNS is inconsistent in general within reasonable TTL bounds, so I don't see resolving the inconsistency being shown here as a good reason to take any action.

Fri, Oct 8, 7:09 PM · Traffic, SRE, DNS, Cloud-Services
BBlack moved T282787: Configure dns and puppet repositories for new drmrs datacenter from Triage to Caching on the Traffic board.
Fri, Oct 8, 7:01 PM · Patch-For-Review, SRE, Traffic
BBlack moved T282788: drmrs: primary software task from Triage to Caching on the Traffic board.
Fri, Oct 8, 7:01 PM · Infrastructure-Foundations, procurement, netops, Traffic, SRE
BBlack shifted T282788: drmrs: primary software task from the Restricted Space space to the S1 Public space.
Fri, Oct 8, 7:00 PM · Infrastructure-Foundations, procurement, netops, Traffic, SRE
BBlack moved T279664: Decide on details of progressive Multi-DC roll out from Triage to Radar on the Traffic board.
Fri, Oct 8, 6:59 PM · SRE, Traffic, serviceops, Performance-Team
BBlack moved T278964: Separate ingress IPs and/or infrastructure for large content uploads from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:55 PM · Traffic-Icebox, SRE
BBlack moved T275809: cache_upload cache policy + large_objects_cutoff concerns from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 6:54 PM · Patch-For-Review, SRE, Traffic
BBlack moved T277553: varnishkafka / ATSkafka should support setting the kafka message timestamp from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:54 PM · Traffic-Icebox, SRE, Analytics
BBlack removed a project from T275234: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org: Traffic.

Removing Traffic as I don't think this looks actionable for our team (but might still be for netops if the conversations above are still ongoing!).

Fri, Oct 8, 6:54 PM · Infrastructure-Foundations, SRE, netops
BBlack merged Restricted Task into T274888: cp_upload @ eqsin cascading failures, February 2021.
Fri, Oct 8, 6:51 PM · Patch-For-Review, SRE, Traffic
BBlack moved T274888: cp_upload @ eqsin cascading failures, February 2021 from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 6:50 PM · Patch-For-Review, SRE, Traffic
BBlack moved T275904: Establish wikifunctions.org from Triage to Radar on the Traffic board.
Fri, Oct 8, 6:45 PM · Abstract Wikipedia team (Phase κ), SRE, Traffic, DNS
BBlack moved T273737: Get traffic team green light for Cloud NAT to wikis change from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:42 PM · Traffic-Icebox, SRE, cloud-services-team (Kanban), Cloud-VPS
BBlack closed T273248: wikireplicas last-minute infra work to discuss / resolve as Resolved.
Fri, Oct 8, 6:41 PM · Infrastructure-Foundations, SRE, netops, Traffic, Data-Services, cloud-services-team (Kanban)
BBlack moved T275409: Create and document Wikidough's privacy policy from Triage to DNS Infra on the Traffic board.
Fri, Oct 8, 6:40 PM · Privacy Engineering, SRE, Traffic
BBlack closed T273248: wikireplicas last-minute infra work to discuss / resolve, a subtask of T271476: Iron out issues in the proxy structure for multi-instance wikireplicas, as Resolved.
Fri, Oct 8, 6:40 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
BBlack removed a project from T273086: Downloading from Archiva.wikimedia.org seems slower than Maven Central: Traffic.
Fri, Oct 8, 6:38 PM · Analytics, SRE
BBlack moved T270618: Create Generalised blocking stratagy from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:38 PM · Traffic-Icebox, Infrastructure-Foundations, SRE, netops
BBlack moved T270391: varnish filtering: should we automatically update public_cloud_nets from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:38 PM · Traffic-Icebox, Infrastructure-Foundations, User-jbond, netops, SRE
BBlack moved T270034: Send HSTS header on all Wordpress VIP-hosted domains from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:37 PM · Traffic-Icebox, Technical blog, SRE, HTTPS, Diff-blog
BBlack removed a project from T269946: Enable webp thumbnails on all images for non-Commons wikis: Traffic.
Fri, Oct 8, 6:37 PM · SRE, Performance-Team
BBlack removed a project from T266331: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation: Traffic.
Fri, Oct 8, 6:35 PM · Infrastructure-Foundations, SRE, netbox, DNS, netops, cloud-services-team (Kanban)
BBlack moved T265904: Remove SLAAC IPs from Ganeti hosts from Triage to Radar on the Traffic board.
Fri, Oct 8, 6:35 PM · Patch-For-Review, Traffic, SRE
BBlack moved T264881: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC from Triage to Radar on the Traffic board.
Fri, Oct 8, 6:34 PM · SRE, Traffic, iOS-app-Bugs, Wikipedia-iOS-App-Backlog
BBlack moved T260943: Don't set cookies for api.wikimedia.org at the caching layer from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:33 PM · Traffic-Icebox, SRE
BBlack moved T269828: X-Cache-Status: distinguish between fresh and stale hits/misses from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:32 PM · Traffic-Icebox, SRE
BBlack moved T261803: Requests for /static get an invalid WMF-Last-Access cookie for wikipedia.org on non-Wikipedia requests from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 6:32 PM · Performance-Team (Radar), Analytics-Radar, Traffic, WMF-General-or-Unknown, SRE
BBlack moved T263277: Collect netflow data for internal traffic from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:31 PM · Traffic-Icebox, Infrastructure-Foundations, netops, SRE
BBlack moved T252227: Mobile redirects drop provenance parameters from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:31 PM · Traffic-Icebox, Analytics-Radar, Readers-Web-Backlog (Tracking), SRE
BBlack closed T276213: Sudden surge of requests to https://wikipedia.org/ from Telus customers as Declined.

Feel free to reopen/link if this is useful in a future investigation!

Fri, Oct 8, 6:30 PM · Traffic, SRE
BBlack moved T269825: Incorrect X-Cache-Status reported by deployment-prep caches from Triage to Active Issues on the Traffic board.
Fri, Oct 8, 6:29 PM · Patch-For-Review, Traffic, SRE
BBlack moved T271144: Some Traffic clusters apparently do not support IPv6 from Triage to Icebox-Temp on the Traffic board.
Fri, Oct 8, 6:27 PM · Traffic-Icebox, Infrastructure-Foundations, IPv6, SRE, User-crusnov, SRE-tools
BBlack removed a project from T268621: Move some of wikimediacloud.org 185.15.56.0/23 to Netbox: Traffic.
Fri, Oct 8, 6:27 PM · Infrastructure-Foundations, cloud-services-team (Kanban), SRE, DNS, netbox
BBlack moved T209785: INMARSAT geolocates to the UK, leading to requests going to esams from Bug Reports to Done on the Traffic board.
Fri, Oct 8, 6:05 PM · SRE, Traffic
BBlack moved T130904: Host rewrite for /static/ not applied to purges from Bug Reports to Done on the Traffic board.
Fri, Oct 8, 6:05 PM · Traffic, SRE
BBlack moved T256302: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error from Bug Reports to Done on the Traffic board.
Fri, Oct 8, 6:05 PM · SRE, Traffic
BBlack moved T264074: varnishkafka 1.1.0 CPU usage increase from Bug Reports to Done on the Traffic board.
Fri, Oct 8, 6:05 PM · Patch-For-Review, Analytics-Clusters, Traffic, SRE
BBlack moved T264378: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working from Bug Reports to Done on the Traffic board.
Fri, Oct 8, 6:05 PM · SRE, Traffic
BBlack moved T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) from Bug Reports to Done on the Traffic board.
Fri, Oct 8, 6:05 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
BBlack moved T265625: ats-be occasional system CPU usage increase from Bug Reports to Done on the Traffic board.
Fri, Oct 8, 6:05 PM · Performance-Team, Traffic, SRE
BBlack moved T268883: fifo-log-tailer: gracefully handle missing unix socket from Bug Reports to Done on the Traffic board.
Fri, Oct 8, 6:05 PM · SRE, Traffic