Oct 17 2019

BBlack added a comment to T234803: Provide an easy way of picking the traffic serving TLS certificate used by ATS.

Notes from IRC, etc:

Oct 17 2019, 2:09 PM · Patch-For-Review, Traffic, SRE

Oct 3 2019

BBlack added a project to T233183: Automate generation of Management DNS records from Netbox: Traffic.
Oct 3 2019, 6:51 PM · netbox, Patch-For-Review, User-jbond, SRE, Traffic, User-crusnov, Goal, SRE-tools
BBlack updated subscribers of T233183: Automate generation of Management DNS records from Netbox.

I've been pushing this to my back burner for a few days because it's complicated. My current $0.03 on all related things:

Oct 3 2019, 6:50 PM · netbox, Patch-For-Review, User-jbond, SRE, Traffic, User-crusnov, Goal, SRE-tools

Sep 30 2019

BBlack updated subscribers of T233661: Publish tls related info to webrequest via varnish.
Sep 30 2019, 4:05 PM · Patch-For-Review, Analytics-Kanban, observability, SRE, Analytics, Traffic

Sep 27 2019

Ladsgroup awarded T102099: Fix IPv6 autoconf issues once and for all, across the fleet. an Orange Medal token.
Sep 27 2019, 9:19 PM · Infrastructure-Foundations, User-jbond, netops, SRE, IPv6
BBlack added a comment to T216172: Set up basic email infra for w.wiki domain.

Awesome, thank you!

Sep 27 2019, 5:16 PM · Traffic, SRE, Mail
BBlack closed T232602: GRE MTU mitigations - Tracking as Resolved.
Sep 27 2019, 4:50 PM · SRE, Traffic
BBlack added a project to T216172: Set up basic email infra for w.wiki domain: Traffic.

Ping @herron can we move on this? Any current blockers?

Sep 27 2019, 4:34 PM · Traffic, SRE, Mail

Sep 26 2019

Ladsgroup awarded T170567: Support TLSv1.3 a Like token.
Sep 26 2019, 11:10 AM · Performance-Team (Radar), Wikimedia-Incident, Goal, Traffic, SRE

Sep 25 2019

BBlack added a comment to T232602: GRE MTU mitigations - Tracking.

@BBlack @faidon let me know when is a good time to remove that MSS hack on the routers.
To be done one router at a time with time in between for the sessions to re-establish. Will also drain NTT/Telia using BGP graceful shutdown beforehand.

Sep 25 2019, 10:04 PM · SRE, Traffic

Sep 21 2019

BBlack added a comment to T233271: 503 Backend fetch failed.

[removed - someone linked this during an ongoing incident and I assumed it was fresh. These reports are from days ago and my comment was not relevant]

Sep 21 2019, 1:18 AM · User-DannyS712, SRE

Sep 19 2019

Krinkle awarded T165765: Refactor pybal/LVS config for shared failover an Orange Medal token.
Sep 19 2019, 5:28 PM · Traffic-Icebox, Performance-Team (Radar), SRE
BBlack added a comment to T232006: LDF service does not Vary responses by Accept, sending incorrect cached responses to clients.

We'll also need to normalize the incoming Accept headers up in the edge cache layer to avoid pointless vary explosions. Ideally the normalization should exactly match the application-layer logic that chooses the output content type. Do you have some pseudo-code (or real code link is fine too) description of how accept is parsed to select content-types?
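A minimal sketch of the kind of Accept normalization described above. The list of supported types and the default are assumptions made up for illustration; the real list and tie-breaking rules would have to mirror the application's own selection logic, and an edge cache would express this in its own configuration language rather than Python.

```python
# Hypothetical set of content types the backend can emit (assumption, not from the task).
SUPPORTED = ("text/html", "application/json", "text/turtle")
DEFAULT = "text/html"


def normalize_accept(header):
    """Collapse an arbitrary Accept header to exactly one supported type.

    Simplified on purpose: "type/*" ranges and media-type parameters other
    than q are ignored in this sketch.
    """
    if not header:
        return DEFAULT
    candidates = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        # Sort key: highest q first, then original order in the header.
        candidates.append((-q, position, media_type))
    for neg_q, _, media_type in sorted(candidates):
        if neg_q == 0:
            continue  # q=0 means "not acceptable"
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return DEFAULT
    return DEFAULT
```

Because every request ends up with one of a handful of normalized values, varying the cache on the normalized header yields at most one cached variant per supported type.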

Sep 19 2019, 4:09 AM · Discovery-Search (Current work), Patch-For-Review, SRE, Traffic, Wikidata-Query-Service, Wikidata

Sep 17 2019

BBlack added a comment to T111899: Deprecate pybal SSH health checks.

Still TODO here before resolving: remove the ferm puppetization on the MW hosts that was allowing LVS ssh access

Sep 17 2019, 2:22 PM · Traffic-Icebox, SRE

Sep 13 2019

BBlack added a comment to T232795: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data.

The problem stems from the "Trust" in "Trusted Proxy". The user-agent string isn't a reliable source (can be set to anything by anyone), and ditto for the contents of X-Forwarded-For. So we can't decide to trust XFF contents in the absence of something reliable, and the UA string isn't it. This is why we need a list of source IPs / networks (and a way to keep them updated) to know who we can trust XFF data from.
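The trust decision described above could be sketched roughly as follows. The network list uses documentation-only prefixes as placeholders, and the function name is hypothetical; the point is only that the connection's source IP, not the User-Agent or the X-Forwarded-For contents, gates whether XFF is believed.

```python
import ipaddress

# Placeholder trusted-proxy ranges (documentation prefixes). A real list would
# have to come from the proxy operators and be kept up to date.
TRUSTED_PROXY_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def effective_client_ip(source_ip, xff_header):
    """Return the IP to record as the client for one request."""
    source = ipaddress.ip_address(source_ip)
    trusted = any(source in net for net in TRUSTED_PROXY_NETWORKS)
    if not trusted or not xff_header:
        # Untrusted sender: XFF can contain anything, so ignore it.
        return source_ip
    # Rightmost entry is the address the trusted proxy itself saw.
    candidate = xff_header.split(",")[-1].strip()
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return source_ip
    return candidate
```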

Sep 13 2019, 5:59 PM · Data-Engineering-Icebox, Traffic-Icebox, SRE
BBlack added a comment to T232602: GRE MTU mitigations - Tracking.

Right, that would cover cases like install1002 and archiva (and probably many other minor cases we've missed which haven't set off big alarm bells), but we'll still need direct mitigation on the hosts where it matters for inbound (the cpNNNN, gerrit, etc, which probably also has a long tail of cases we haven't really noticed yet).

Sep 13 2019, 3:53 PM · SRE, Traffic
BBlack added a comment to T232679: Images served with text/html content type.

The URL mentioned at the top isn't a media URL, it actually is HTML content and is a pageview. Try it in your browser: https://commons.wikimedia.org//wiki/File:Arm_muscles_back_numbers.png

Sep 13 2019, 10:22 AM · Traffic, Analytics, SRE

Sep 12 2019

BBlack added a comment to T165765: Refactor pybal/LVS config for shared failover.

T180069 - Ticket from the feature add for pybal itself

Sep 12 2019, 7:27 PM · Traffic-Icebox, Performance-Team (Radar), SRE
BBlack added a comment to T165765: Refactor pybal/LVS config for shared failover.

What's missing here is turning on BGP peering with all local routers, which is available in our current 1.15 pybal releases. Will fix that up here and then resolve (the rest has been live for a while for all new LVS deploys).

Sep 12 2019, 7:20 PM · Traffic-Icebox, Performance-Team (Radar), SRE

Sep 11 2019

BBlack added a comment to T128559: Enable HSTS on store.wikimedia.org for HTTPS.

@MBeat33 + @Jseddon - Thank you for the update(s)

Sep 11 2019, 6:33 PM · Traffic, SRE, Wikimedia-Shop, HTTPS
BBlack updated the task description for T232602: GRE MTU mitigations - Tracking.
Sep 11 2019, 5:33 PM · SRE, Traffic
BBlack added a comment to T128559: Enable HSTS on store.wikimedia.org for HTTPS.
Sep 11 2019, 5:11 PM · Traffic, SRE, Wikimedia-Shop, HTTPS
BBlack reopened T227408: (OoW) restbase2009 lockup as "Open".

Re-open as this isn't really complete yet, the battery came in and replacement is proceeding. Since @jijiki did this before and claims it's just a depool command, we'll go with that again :)

Sep 11 2019, 5:02 PM · serviceops, ops-codfw, SRE
BBlack updated the task description for T232602: GRE MTU mitigations - Tracking.
Sep 11 2019, 12:10 PM · SRE, Traffic
BBlack triaged T232602: GRE MTU mitigations - Tracking as Medium priority.
Sep 11 2019, 12:09 PM · SRE, Traffic
BBlack added a comment to T232491: Numerous people reporting issues saving edits and viewing previews/diffs.

I've made a temporary MTU-related fixup on the affected eqiad and esams cache hosts. Assuming we understand the issue correctly, it should resolve the issue for fresh connections (worst case, restart your browser). Can any previous reporters confirm the same continued breakage, or new success?

Sep 11 2019, 1:16 AM · netops, Traffic, WMF-General-or-Unknown, SRE

Sep 9 2019

BBlack added a comment to T231108: upload LB: retry swift 404s cross-cluster.

@ema would know better about how difficult such things are with ATS in particular. I tend not to like this idea in general, though. In the case of some failure causing lots of temporary pointless 404s, it might double up traffic, and it seems like a hacky crutch which we'd come to rely on instead of fixing the real underlying issues. If others feel strongly about it and it's feasible and reasonably-temporary, I can be convinced, though!

Sep 9 2019, 11:15 PM · Commons, MediaWiki-File-management, SRE-swift-storage, Traffic, SRE

Sep 8 2019

BBlack added a comment to T232224: September 2019 DoS attacks [Public].

Note we don't actually use phabricator for the actual incident response on something like this. There's no need to mess with priorities or send notifications here :)

Sep 8 2019, 11:43 PM · Sustainability (Incident Followup), SRE

Sep 7 2019

BBlack added a comment to T232224: September 2019 DoS attacks [Public].

It was definitely the attack, not a device failure. We won't generally release fine-grained details about an attack publicly, at least not this early and while threats and mitigations continue to be an ongoing concern. While attempting to investigate and mitigate various phases and variants of the attack during various windows of time yesterday, we did take various network engineering steps which shifted global traffic around between our edges, some of which can lead to the confusing analysis results above.

Sep 7 2019, 1:08 PM · Sustainability (Incident Followup), SRE

Aug 30 2019

BBlack added a comment to T230638: Move old transparency report pages to historical URLs and setup redirect.

I agree that is a better long-term setup and is something I can bring up with Automattic. Is it safe to say this is something that could be done relatively easily if the site were hosted internally?

Aug 30 2019, 4:33 PM · Patch-For-Review, serviceops, SRE, WMF-Legal
BBlack added a comment to T230638: Move old transparency report pages to historical URLs and setup redirect.

On the broader meta-topics: Long-lived canonical URLs are important, and I think that transparency.wikimedia.org seems like a more-natural fit for that (and to continue printing and publishing it). IMHO, the ideal end-game here* would be to move transparency.wikimedia.org to Automattic hosting completely and have it serve the new content directly, as well as the historical parts, and have the blog's links link into it. The currently-outlined (interim?) setup sends confusing social and technical signals (e.g. to search engines) about which of https://transparency.wikimedia.org/ or https://wikimediafoundation.org/about/transparency/ is the canonical location of the content.

Aug 30 2019, 1:48 PM · Patch-For-Review, serviceops, SRE, WMF-Legal
BBlack added a comment to T230638: Move old transparency report pages to historical URLs and setup redirect.

There are two separate things to do here:

Aug 30 2019, 1:30 PM · Patch-For-Review, serviceops, SRE, WMF-Legal
BBlack updated subscribers of T230638: Move old transparency report pages to historical URLs and setup redirect.

Some clarifying points:

Aug 30 2019, 1:10 PM · Patch-For-Review, serviceops, SRE, WMF-Legal

Aug 29 2019

BBlack added a comment to T101048: Policy decisions for new (and current) DNS domains registered to the WMF.

4 years later, lots of things have changed for the better, and we're starting to get near the end of this.

Aug 29 2019, 3:08 PM · Traffic-Icebox, SRE, WMF-Legal
BBlack added a comment to T128559: Enable HSTS on store.wikimedia.org for HTTPS.

Bump - Whoever's in charge of Shopify on our end, can we check if they've added support for includeSubdomains and preload now in some site setting?

Aug 29 2019, 3:03 PM · Traffic, SRE, Wikimedia-Shop, HTTPS
BBlack closed T214253: en.wikipedia.com [sic] serves an invalid certificate as Resolved.

It does work fine now, thanks to the new non-canonical redirect service!

Aug 29 2019, 2:59 PM · SRE, Traffic, HTTPS

Aug 27 2019

BBlack added a comment to T231388: Error pulling image from docker registry.

Please leave this open for now so @ema can look at a more-permanent fixup tomorrow!

Aug 27 2019, 9:59 PM · Traffic, serviceops, SRE
BBlack added a comment to T231388: Error pulling image from docker registry.

Depooled cp1075 ats-be service via confctl, can someone retry and confirm mitigated?

Aug 27 2019, 9:51 PM · Traffic, serviceops, SRE
BBlack assigned T231388: Error pulling image from docker registry to ema.

Assigning to @ema to investigate (yes, this is the live test server for ATS backends for these servers). Most likely the problem is specific to ATS<->docker-registry, probably because the underlying service TLS certificate's SAN list doesn't match the public name docker-registry.wikimedia.org.
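To illustrate the suspected failure mode only: a client that verifies the backend certificate compares the name it connected to against the certificate's SAN DNS entries. The sketch below is a simplified matcher (exact match plus a single leftmost wildcard label), not the logic ATS actually uses, and the SAN values in the usage note are hypothetical.

```python
def san_matches(hostname, san_dns_names):
    """Return True if hostname is covered by any SAN DNS entry.

    Simplified: case-insensitive exact match, or a "*." wildcard that
    covers exactly one leftmost label.
    """
    hostname = hostname.lower().rstrip(".")
    for pattern in san_dns_names:
        pattern = pattern.lower().rstrip(".")
        if pattern == hostname:
            return True
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".example.org"
            if hostname.endswith(suffix):
                left = hostname[: -len(suffix)]
                if left and "." not in left:
                    return True
    return False
```

With a hypothetical SAN list of `["registry.internal.example"]`, `san_matches("docker-registry.wikimedia.org", ...)` is false, which is the kind of mismatch that would make a verifying proxy refuse the backend connection.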

Aug 27 2019, 9:41 PM · Traffic, serviceops, SRE
BBlack changed the status of T230638: Move old transparency report pages to historical URLs and setup redirect from Stalled to Open.
Aug 27 2019, 9:30 PM · Patch-For-Review, serviceops, SRE, WMF-Legal
BBlack added a comment to T230638: Move old transparency report pages to historical URLs and setup redirect.

@Varnent: For the redirects: just the main https://transparency.wikimedia.org/ URL? Or also the sub-pages like https://transparency.wikimedia.org/content.html ? I haven't yet looked at the content for the move to /historical/, but I assume it's relatively-simple.

Aug 27 2019, 9:30 PM · Patch-For-Review, serviceops, SRE, WMF-Legal
BBlack changed the status of T231387: Updating DNS records (pr.wikimedia.org) from Open to Stalled.

Holding on this until early next week, as we have too many decision-makers on vacation this week, and there are policy and security implications to granting DKIM for @wikimedia.org to a third party via Amazon SES.

Aug 27 2019, 9:24 PM · Mail, WMF-Communications, SRE

Aug 21 2019

BBlack created T230955: Configure Layer3 hashing for router ECMP (for anycast DNS).
Aug 21 2019, 8:15 PM · SRE, Traffic

Aug 20 2019

BBlack changed the status of T230638: Move old transparency report pages to historical URLs and setup redirect from Open to Stalled.

Just stalling this so that anyone following it doesn't try to pick this up or move with it yet. There's an ongoing email thread about clarifying this task, and we're waiting for at least one person to return from a vacation and provide guidance before we move forward here.

Aug 20 2019, 3:30 PM · Patch-For-Review, serviceops, SRE, WMF-Legal

Aug 19 2019

BBlack changed the status of T230687: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users from Open to Stalled.

There's perhaps a faulty implicit assumption here that we desire to use one cert for the world and that we'd just "switch" everything to LE. We're currently using the GlobalSign cert at all edges due to various problems earlier in the year, but what we were doing in the past and would like to continue doing in the future is using two certs simultaneously from unrelated CAs, and making the split on a per-datacenter basis (with the US sites using GlobalSign, and the non-US sites using LE, in this case).

Aug 19 2019, 5:31 PM · Traffic-Icebox, SRE, Acme-chief
BBlack added a comment to T230733: Expose pooled status of gdnsd and conftool managed services as metrics.

I'd start with the conftool stuff before moving on to anything that tracks gdnsd's admin_state -driven things. That whole mechanism is likely to be replaced in the next quarter or two on the gdnsd side, and I wouldn't be surprised if we end up driving the new mechanism from conftool by default.

Aug 19 2019, 4:02 PM · User-CDanis, SRE, observability
BBlack added a comment to T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).

(Also, is the specific TMH fix actually deployed to all groups yet?)

Aug 19 2019, 3:47 PM · Traffic-Icebox, Sustainability (Incident Followup), Platform Engineering, Patch-For-Review, TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Performance-Team (Radar), MediaWiki-extensions-CentralAuth, SRE
BBlack added a comment to T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).

Is this fixed now ?

The specific issue of TMH triggering CentralAuth misbehavior is fixed. The more generic issue of CentralAuth misbehavior being easily triggerable via wrong use of Request or RequestContext is not.

Aug 19 2019, 3:46 PM · Traffic-Icebox, Sustainability (Incident Followup), Platform Engineering, Patch-For-Review, TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Performance-Team (Radar), MediaWiki-extensions-CentralAuth, SRE
BBlack placed T230638: Move old transparency report pages to historical URLs and setup redirect up for grabs.

Unassign for now. The actual ask here is unclear in terms of technical details.

Aug 19 2019, 10:50 AM · Patch-For-Review, serviceops, SRE, WMF-Legal

Aug 15 2019

BBlack added a comment to T98006: Anycast AuthDNS.

General status updates and planning, for this very old ticket which is still on the radar!

Aug 15 2019, 2:48 PM · Traffic-Icebox, Infrastructure-Foundations, Patch-For-Review, netops, SRE
BBlack added a subtask for T186550: Anycast recdns: T228190: Roll out Anycast RecDNS to more servers.
Aug 15 2019, 2:26 PM · Patch-For-Review, netops, SRE, Traffic
BBlack added a parent task for T228190: Roll out Anycast RecDNS to more servers: T186550: Anycast recdns.
Aug 15 2019, 2:26 PM · SRE, Traffic

Aug 14 2019

BBlack created P8910 bblack NewPP on 9s greece load.
Aug 14 2019, 4:57 PM
BBlack added a comment to T228190: Roll out Anycast RecDNS to more servers.

I'm not sure if it goes as a subtask here, or of T167841 and/or T227808 - but recording here so we don't forget, from an earlier IRC conversation:

Aug 14 2019, 4:22 PM · SRE, Traffic
BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

As noted in T155359 - WMDE has moved the hosting of this to some other platform, including the DNS hosting (and we never had the whois entry). So this task can resolve as Declined I think (or whatever), but we should use it to track down various revert patches first before we close it up (revert the DNS repo stuff and whatever else we've got going on in various other repos supporting the wikiba.se site).

Aug 14 2019, 3:10 PM · User-Addshore, serviceops, [DEPRECATED] wdwb-tech, Traffic, wikiba.se website, SRE, Wikidata-Sprint-2016-11-08, Wikidata

Aug 13 2019

BBlack closed T229860: SRE Onboarding for Sukhbir Singh as Resolved.

Looks like it to me :)

Aug 13 2019, 7:12 PM · SRE-Access-Requests, Traffic, SRE

Aug 5 2019

BBlack added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

May as well link in an earlier related ticket from late last year for more backstory, too: https://phabricator.wikimedia.org/T205609

Aug 5 2019, 7:50 PM · SRE, netops
BBlack added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

Again today, causing a small spike of esams-specific 503s and icinga alerts:

Aug 5 2019, 7:49 PM · SRE, netops
BBlack created T229860: SRE Onboarding for Sukhbir Singh.
Aug 5 2019, 5:19 PM · SRE-Access-Requests, Traffic, SRE
BBlack added a comment to T229621: Icinga check defined from LVS configuration for cloudelastic are borked.

So, yes, cloudelastic is correct in DNS for normal lookups. The issue is that the icinga check defines the virtual host entry for cloudelastic monitoring an explicit IP in its configuration, and that IP ends up being the IP of icinga1001, not of cloudelastic. This probably has to do with the puppet host context in which the resource is evaluated.

Aug 5 2019, 12:53 PM · Patch-For-Review, Discovery-Search (Current work), Elasticsearch, SRE, Traffic

Aug 2 2019

BBlack added a comment to T102099: Fix IPv6 autoconf issues once and for all, across the fleet..

Re: transitioning away from SLAAC for the current fleet/setup (which I think is probably a good incremental idea, and could happen ahead of the future netbox work to make that transition easier in the future). Some thoughts on accomplishing that:

Aug 2 2019, 2:11 PM · Infrastructure-Foundations, User-jbond, netops, SRE, IPv6

Aug 1 2019

BBlack created T229621: Icinga check defined from LVS configuration for cloudelastic are borked.
Aug 1 2019, 8:57 PM · Patch-For-Review, Discovery-Search (Current work), Elasticsearch, SRE, Traffic
BBlack added a comment to T229586: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099.

These are ready to go for dcops-level work!

Aug 1 2019, 5:14 PM · ops-eqiad, SRE, decommission-hardware
BBlack closed T221343: puppet fails to run in cp1008 under certain conditions, a subtask of T219803: upgrade facter and puppet across the fleet, as Declined.
Aug 1 2019, 4:57 PM · Infrastructure-Foundations, User-jbond, Patch-For-Review, Packaging, Puppet, SRE
BBlack closed T221343: puppet fails to run in cp1008 under certain conditions as Declined.

Decom in T229586

Aug 1 2019, 4:57 PM · Packaging, Puppet, SRE
BBlack created T229586: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099.
Aug 1 2019, 4:00 PM · ops-eqiad, SRE, decommission-hardware
BBlack closed T202966: Make cp1099 the new pinkunicorn, a subtask of T208734: Decommission asw-c-eqiad, as Declined.
Aug 1 2019, 3:31 PM · Infrastructure-Foundations, decommission-hardware, SRE, ops-eqiad, netops
BBlack closed T202966: Make cp1099 the new pinkunicorn as Declined.

We had a quick discussion and a small informal vote and decided we don't really need this functionality (pinkunicorn) anymore, so we're going to retire it and not replace it.

Aug 1 2019, 3:31 PM · SRE, Traffic

Jul 31 2019

BBlack added a comment to T226044: Prepare Phame to support heavy traffic for a Tech Department blog.

Heh, apparently I can't even remember things I read and said before even when they're right above me in the same ticket!

Jul 31 2019, 6:50 PM · Release-Engineering-Team-TODO, Release-Engineering-Team (Development services), SRE, Traffic, Phabricator
BBlack updated the task description for T228190: Roll out Anycast RecDNS to more servers.
Jul 31 2019, 6:32 PM · SRE, Traffic
BBlack added a comment to T228190: Roll out Anycast RecDNS to more servers.

Rollout status update: things that are using anycast recdns resolv.conf in production as of 2019-07-31:

  • All hosts in edge DCs (esams, ulsfo, eqsin)
  • All cp edge cache hosts globally
  • All LVS hosts globally
  • Canary MediaWiki API and Appserver hosts in both core DCs
  • Network devices
  • Install-time stuff (as in dhcp settings and Debian installer)
Jul 31 2019, 6:32 PM · SRE, Traffic
BBlack added a comment to T226044: Prepare Phame to support heavy traffic for a Tech Department blog.

Replying to myself earlier: apparently they're datestamped URIs beginning with /yyyy/mm/, examples being:

Jul 31 2019, 6:19 PM · Release-Engineering-Team-TODO, Release-Engineering-Team (Development services), SRE, Traffic, Phabricator
BBlack added a comment to T226044: Prepare Phame to support heavy traffic for a Tech Department blog.

I think it should be techblog.wikimedia.org, because even if that introduces a complication around redirect based on time period

Jul 31 2019, 6:13 PM · Release-Engineering-Team-TODO, Release-Engineering-Team (Development services), SRE, Traffic, Phabricator
BBlack added a comment to T226044: Prepare Phame to support heavy traffic for a Tech Department blog.

TODO list here from my POV, as best I understand things:

Jul 31 2019, 4:47 PM · Release-Engineering-Team-TODO, Release-Engineering-Team (Development services), SRE, Traffic, Phabricator
BBlack closed T207340: Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) as Resolved.

The 421 code is deployed and seems to be working correctly, with a fairly small global average rate of somewhere <1 req/sec. This is the most-legitimate thing we can do with these misdirected requests, and it may actually fix some of them if the UA's own confusion is truly at fault, but it may not be able to help if some kind of DNS or HTTPS proxy interference is causing persistent issues. Maybe it will at least reduce error reporting and debugging confusion in such cases, though, as 421 is very specific to this issue (vs generic 404).
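A rough sketch of the decision behind the 421 (Misdirected Request) response, assuming each cluster knows which hostnames it serves and which belong to a sibling cluster. The host sets and function name are illustrative assumptions; the deployed logic lives in the cache layer's own configuration.

```python
# Illustrative host sets (assumptions); a real deployment would derive these
# from its own service configuration.
TEXT_HOSTS = {"en.wikipedia.org", "commons.wikimedia.org"}
UPLOAD_HOSTS = {"upload.wikimedia.org"}


def early_status(host_header, served_here, served_elsewhere):
    """Return an HTTP status to short-circuit with, or None to proceed normally."""
    # Strip an optional port; IPv6 literals are not handled in this sketch.
    host = host_header.split(":")[0].strip().lower()
    if host in served_here:
        return None
    if host in served_elsewhere:
        # The name is valid, but this connection cannot answer for it.
        return 421
    return 404
```

A client that receives 421 may retry the request on a different connection, which is why it can fix cases where the client's own connection reuse caused the misrouting.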

Jul 31 2019, 4:24 PM · Performance-Team (Radar), Traffic, SRE

Jul 25 2019

BBlack added a comment to T228190: Roll out Anycast RecDNS to more servers.

All the LVSes are now using the anycasted recdns, which gets rid of the LVS<->recdns dependency loop and simplifies recdns server downtime processes: https://wikitech.wikimedia.org/w/index.php?title=Service_restarts&type=revision&diff=1833705&oldid=1832671

Jul 25 2019, 10:13 PM · SRE, Traffic

Jul 23 2019

BBlack created P8785 esams L3 wave recent history.
Jul 23 2019, 4:41 PM · netops, Traffic
BBlack added a comment to T228730: TLS config issue for nginx on Buster.

If we need this to work ASAP, probably the most-expedient thing to do would be to patch our puppetization to exclude the patched features from config on buster only, and use the vendor package. Traffic is in the process of moving away from nginx, hopefully by EOQ-ish, after which we won't need the problematic custom package, and the stock vendor package should work fine for other uses of the tlsproxy module (but we're not quite ready enough, yet, to mess with our current solution by removing the WMF package from stretch!).

Jul 23 2019, 12:45 PM · Traffic-Icebox, SRE

Jul 22 2019

BBlack added a comment to T227538: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC).

cp1079 and cp1080 just need normal depooling process here.

Jul 22 2019, 8:13 PM · DC-Ops, SRE, ops-eqiad
BBlack added a comment to T227542: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC).

lvs1014 here will need special care, Traffic should stop puppet and pybal and monitor failover to lvs1016 ahead of work, then revert afterwards. cp1081 and cp1082 here can be depooled as normal.

Jul 22 2019, 8:12 PM · DC-Ops, SRE, ops-eqiad
BBlack added a comment to T227143: a7-eqiad pdu refresh.

(task desc edited for correct cp nodes: this rack has 77/78, not 76/77)

Jul 22 2019, 8:09 PM · DC-Ops, SRE, ops-eqiad
BBlack updated the task description for T227143: a7-eqiad pdu refresh.
Jul 22 2019, 8:09 PM · DC-Ops, SRE, ops-eqiad
BBlack added a comment to T227143: a7-eqiad pdu refresh.

The Traffic nodes cp1077 + cp1078 can be depooled the usual way, but lvs1013 needs some special care. Someone from Traffic should handle and monitor that just in case (basically we need to manually disable puppet and stop pybal a few minutes in advance of the work, verify traffic moving correctly to lvs1016, and then put everything back to normal afterwards).

Jul 22 2019, 8:05 PM · DC-Ops, SRE, ops-eqiad
BBlack added a comment to T227141: a5-eqiad pdu refresh.

All the traffic cp and lvs nodes are decoms and not in use: T208584 T208586

Jul 22 2019, 6:14 PM · DC-Ops, SRE, ops-eqiad
BBlack updated the task description for T228678: Implement GeoDNS smooth repooling in gdnsd.
Jul 22 2019, 4:33 PM · Traffic-Icebox, SRE
BBlack moved T228678: Implement GeoDNS smooth repooling in gdnsd from Backlog to Some old column on the Traffic board.
Jul 22 2019, 4:26 PM · Traffic-Icebox, SRE
BBlack merged task T94697: implement better failure-scenario geoip mapping in gdnsd into T228678: Implement GeoDNS smooth repooling in gdnsd.
Jul 22 2019, 4:25 PM · SRE, Traffic
BBlack merged T94697: implement better failure-scenario geoip mapping in gdnsd into T228678: Implement GeoDNS smooth repooling in gdnsd.
Jul 22 2019, 4:25 PM · Traffic-Icebox, SRE
BBlack triaged T228678: Implement GeoDNS smooth repooling in gdnsd as Medium priority.
Jul 22 2019, 4:10 PM · Traffic-Icebox, SRE
BBlack updated the task description for T228671: Decommission lvs100[123456].
Jul 22 2019, 2:54 PM · Traffic, DC-Ops, SRE, decommission-hardware
BBlack created T228671: Decommission lvs100[123456].
Jul 22 2019, 2:52 PM · Traffic, DC-Ops, SRE, decommission-hardware
BBlack added a comment to T227140: a4-eqiad pdu refresh.

cp1076 - Can depool ahead of work and repool later, with the local commands "depool" and "pool"
lvs100[123] - Not in use and should be decommed, but this ticket made me realize we haven't made an lvs1001-6 decom ticket yet (will do shortly!)

Jul 22 2019, 2:46 PM · DC-Ops, SRE, ops-eqiad
BBlack closed T184293: rack/setup/install lvs101[3-6] as Resolved.

These have been in-service for a while now, closing!

Jul 22 2019, 2:43 PM · SRE, Traffic

Jul 19 2019

BBlack triaged T228533: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 as Medium priority.
Jul 19 2019, 5:22 PM · Traffic-Icebox, Analytics-Radar, User-jbond, SRE

Jul 18 2019

BBlack added a comment to T120085: RFC: Serve Main Page of Wikimedia wikis from a consistent URL.

Oh one more thing that should've been (3) on that list:

Jul 18 2019, 12:24 PM · Wikimedia-Performance-recommendation, Traffic-Icebox, Fundraising-Backlog, Editing-team, Parsing-Team--ARCHIVED, User-notice, Platform Engineering, SRE, TechCom-RFC, SEO, Wikimedia-Site-requests
BBlack added a comment to T120085: RFC: Serve Main Page of Wikimedia wikis from a consistent URL.

I like the end result here, and I don't think it's problematic from the Traffic perspective in the long view, but I think the initial rollout isn't so trivial:

Jul 18 2019, 12:17 PM · Wikimedia-Performance-recommendation, Traffic-Icebox, Fundraising-Backlog, Editing-team, Parsing-Team--ARCHIVED, User-notice, Platform Engineering, SRE, TechCom-RFC, SEO, Wikimedia-Site-requests

Jul 17 2019

BBlack moved T228190: Roll out Anycast RecDNS to more servers from Backlog to Some old column on the Traffic board.
Jul 17 2019, 3:49 PM · SRE, Traffic
BBlack closed T203194: cp1075-90 - bnxt_en transmit hangs as Resolved.

@Vgutierrez The firmware update on the NICs fixed this for good, right? Can we close this task?

Jul 17 2019, 3:48 PM · Patch-For-Review, SRE, Traffic

Jul 16 2019

BBlack added a comment to T186550: Anycast recdns.

It's my understanding that the steps necessary to restart our recursors are now reduced to a simple depool/repool and that the previous, complex approach from
https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors_(in_production_and_labservices) is now obsolete, right?

Jul 16 2019, 4:57 PM · Patch-For-Review, netops, SRE, Traffic

Jul 3 2019

BBlack added a comment to T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).

I wonder if other uses of new RequestContext() (most of which are intentional) can trigger this bug.

Looks to be exclusively in tests?

Jul 3 2019, 3:56 PM · Traffic-Icebox, Sustainability (Incident Followup), Platform Engineering, Patch-For-Review, TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Performance-Team (Radar), MediaWiki-extensions-CentralAuth, SRE