
Wed, Mar 27

BBlack closed T361046: Requesting access to analytics-privatedata-users for bblack as Resolved.
Wed, Mar 27, 3:05 PM · Patch-For-Review, SRE, SRE-Access-Requests
BBlack updated the task description for T361046: Requesting access to analytics-privatedata-users for bblack.
Wed, Mar 27, 2:34 PM · Patch-For-Review, SRE, SRE-Access-Requests
BBlack updated the task description for T361046: Requesting access to analytics-privatedata-users for bblack.
Wed, Mar 27, 1:49 AM · Patch-For-Review, SRE, SRE-Access-Requests

Tue, Mar 26

BBlack updated the task description for T361046: Requesting access to analytics-privatedata-users for bblack.
Tue, Mar 26, 7:17 PM · Patch-For-Review, SRE, SRE-Access-Requests
BBlack created T361046: Requesting access to analytics-privatedata-users for bblack.
Tue, Mar 26, 7:13 PM · Patch-For-Review, SRE, SRE-Access-Requests

Feb 12 2024

ayounsi awarded T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) a Like token.
Feb 12 2024, 7:22 AM · Traffic, SRE

Feb 9 2024

CDanis awarded T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) a Love token.
Feb 9 2024, 7:24 PM · Traffic, SRE

Jan 19 2024

BBlack added a comment to T355446: Synchronize and rotate TCP Fastopen keys for various use-cases.

We discussed this in Traffic earlier this week, and I ended up implementing what I think is a reasonable solution already, so now I've made this ticket for the paper trail and to cover the followup work to debianize and usefully-deploy it. The core code for it is published at https://github.com/blblack/tofurkey .

Jan 19 2024, 7:17 PM · Traffic
BBlack triaged T355446: Synchronize and rotate TCP Fastopen keys for various use-cases as Medium priority.
Jan 19 2024, 7:14 PM · Traffic

Dec 5 2023

BBlack added a comment to T352744: OpenSSL 3.x performance issues.

The perf issues are definitely relevant for traffic's use of haproxy (in a couple of different roles). Your option (making a libssl1.1-dev for bookworm that tracks the sec fixes that are still done for the bullseye case, and packaging our haproxy to build against it) would be the easiest path from our POV, for these cases.

Dec 5 2023, 2:21 PM · SRE-swift-storage, Traffic

Nov 29 2023

BBlack added a comment to T345939: Create metrics/monitoring of fifo-log-demux.

Followup: did a 3-minute test of the same pair of parameter changes on cp3066 for a higher-traffic case. No write failures detected via strace in this case (we don't have the error log outputs to go by in 9.1 builds). mtail CPU usage at 10ms polling interval was significantly higher than it was in ulsfo, but still seems within reason overall and not saturating anything.

Nov 29 2023, 3:33 PM · Traffic
BBlack added a comment to T345939: Create metrics/monitoring of fifo-log-demux.

I went on a different tangent with this problem, and tried to figure out why we're having ATS fail writes to the notpurge log pipe in the first place. After some hours of digging around this problem (I'll spare you endless details of temporary test configs and strace outputs of various daemons' behavior, etc), these are the basic issues I see:

Nov 29 2023, 3:19 PM · Traffic

Nov 22 2023

BBlack created P53731 screenlocker script.
Nov 22 2023, 6:59 PM
BBlack created P53730 Triggering go template errors (tpl).
Nov 22 2023, 5:46 PM · Traffic
BBlack created P53729 Triggering go template errors (script).
Nov 22 2023, 5:45 PM · Traffic

Nov 9 2023

BBlack created T350869: cr2-eqiad xe-3/2/2 has errors for the past ~week.
Nov 9 2023, 1:54 PM · Infrastructure-Foundations, netops

Nov 7 2023

BBlack added a comment to T350354: Do we need to generate aggregates for LVS service IP ranges?.

I don't suspect it serves any real purpose at present, unless it was to avoid some filtering that exists elsewhere to avoid cross-site sharing of /32 routes or something.

Nov 7 2023, 2:11 PM · netops, Infrastructure-Foundations, SRE

Oct 19 2023

BBlack edited P53018 Example grafana text panel to pick specific absolute time ranges.
Oct 19 2023, 3:45 PM · Traffic
BBlack updated the task description for T349314: cp3079 bios settings.
Oct 19 2023, 3:37 PM · DC-Ops, ops-esams, SRE, Traffic
BBlack created T349314: cp3079 bios settings.
Oct 19 2023, 3:37 PM · DC-Ops, ops-esams, SRE, Traffic
BBlack created P53018 Example grafana text panel to pick specific absolute time ranges.
Oct 19 2023, 3:25 PM · Traffic

Oct 16 2023

BBlack added a comment to T348837: Investigate IPVS IPIP encapsulation support.

One potential issue with relying solely on MSS reduction is that, obviously, it only affects TCP. For now this is fine, as long as we're only using LVS (or future liberica) for TCP traffic (I think that's currently the case for LVS anyways!), but we could add UDP-based things in the future (e.g. DNS and QUIC/HTTP3), at which point we'll have to solve these problems differently.

Oct 16 2023, 2:46 PM · Patch-For-Review, SRE, Traffic
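
As a rough illustration of the MSS reduction discussed in this comment, the sketch below computes the largest TCP payload that still fits in one packet once a tunnel header is added. It assumes fixed, option-free IPv4 (20-byte), IPv6 (40-byte) and TCP (20-byte) headers and a 1500-byte link MTU; the function name `tunneled_mss` is hypothetical and not taken from the ticket.

```python
# Header sizes in bytes, assuming no IP options, no IPv6 extension
# headers, and no TCP options.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def tunneled_mss(link_mtu: int, outer_header: int, inner_header: int) -> int:
    """Largest TCP payload that fits in one packet after encapsulation."""
    inner_mtu = link_mtu - outer_header  # room left for the inner IP packet
    return inner_mtu - inner_header - TCP_HEADER


# IPv4-in-IPv4 (IPIP) over a 1500-byte link
print(tunneled_mss(1500, IPV4_HEADER, IPV4_HEADER))  # 1440
# IPv6-in-IPv6 over a 1500-byte link
print(tunneled_mss(1500, IPV6_HEADER, IPV6_HEADER))  # 1400
```

This arithmetic only helps TCP, where the MSS is negotiated; UDP-based traffic such as DNS or QUIC has no equivalent knob, which is the limitation the comment points out.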
BBlack added a comment to T348837: Investigate IPVS IPIP encapsulation support.

The one thing you may not be able to control with mtu/advmss on a route is traffic to the local subnet, as that route is added by the kernel when the IP is added to the interface. Not sure if that can be modified to differ from interface MTU.

Oct 16 2023, 2:21 PM · Patch-For-Review, SRE, Traffic
BBlack added a comment to T348837: Investigate IPVS IPIP encapsulation support.

Could we take the opposite approach with the MTU fixup for the tunneling, and arrange the host/interface settings on both sides (the LBs and the target hosts) such that they only use a >1500 MTU on the specific unicast routes for the tunnels, but default to their current 1500 for all other traffic? If per-route MTU can usefully be set higher than base interface MTU, this seems trivial, but even if not, surely with some set of ip commands we could set the iface MTU to the higher value, while clamping it back down to 1500 for all cases except the tunnel.

Oct 16 2023, 1:09 PM · Patch-For-Review, SRE, Traffic

Oct 11 2023

BBlack created P52909 HA example.
Oct 11 2023, 4:25 PM

Oct 3 2023

BBlack added a comment to T348041: Remove static routes for ns[01] and replace their announcements with bird.

Looks about right to me!

Oct 3 2023, 7:52 PM · netops, SRE, Infrastructure-Foundations, Traffic
BBlack added a comment to T346165: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run".

We could add some normalization function at the ferm or puppet-dns-lookup layer perhaps (lowercase and do the zeros in a consistent way)?

Oct 3 2023, 3:12 PM · Patch-For-Review, cloud-services-team, Dumps-Generation, Data-Platform-SRE
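
A minimal sketch of the kind of normalization suggested in this comment, assuming it is acceptable to canonicalize addresses with Python's standard `ipaddress` module. The helper name `normalize_ip` is hypothetical; this is not the ferm or puppet code itself, only an illustration of "lowercase and do the zeros in a consistent way".

```python
import ipaddress


def normalize_ip(text: str) -> str:
    """Return a canonical text form: lowercase hex, zero runs compressed."""
    return ipaddress.ip_address(text.strip()).compressed


# Two spellings of the same IPv6 address compare equal once normalized.
a = normalize_ip("2620:0:0860:0102:0000:0000:0000:000A")
b = normalize_ip("2620:0:860:102::a")
print(a)       # 2620:0:860:102::a
print(a == b)  # True
```

Comparing normalized strings rather than raw lookup results would avoid spurious "changes" when the same address is merely spelled differently between runs.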

Sep 27 2023

BBlack renamed T342159: Q1:rack/setup/install cp11[00-15] from Q1:rack/setup/install cp1[098-113] to Q1:rack/setup/install cp11[00-15].
Sep 27 2023, 6:23 PM · SRE, ops-eqiad, Traffic, DC-Ops

Sep 25 2023

BBlack added a comment to T323723: Alert on Varnish high thread count.

To clarify and expand on my position about this thread count parameter (which is really just a side-issue related to this ticket, which is fundamentally complete):

Sep 25 2023, 3:56 PM · Patch-For-Review, SRE, Traffic

Sep 24 2023

BBlack committed rMLIP042607cfd46d: Initial commit of stand-alone IPSet library.
Sep 24 2023, 3:30 AM

Sep 22 2023

BBlack added a comment to T342159: Q1:rack/setup/install cp11[00-15].

Adding to the confusion: historically, we once used the hostname cp1099 back in 2015 for a one-off host: T96873 - therefore that name already exists in both phab and git history, confusingly.

Sep 22 2023, 1:03 PM · SRE, ops-eqiad, Traffic, DC-Ops
BBlack added a comment to T342159: Q1:rack/setup/install cp11[00-15].

Reading a little deeper on this, I think we still have a hostnames issue, if those other 8 hosts are indeed being brought from ulsfo+eqsin. Those 8 hosts, I presume, would be 1091-8, and so these hosts should start at 1099, not 1098?

Sep 22 2023, 1:00 PM · SRE, ops-eqiad, Traffic, DC-Ops
BBlack added a comment to T342159: Q1:rack/setup/install cp11[00-15].

@VRiley-WMF - Sukhbir's out right now, but I've updated the racking plan on his behalf!

Sep 22 2023, 12:48 PM · SRE, ops-eqiad, Traffic, DC-Ops
BBlack updated the task description for T342159: Q1:rack/setup/install cp11[00-15].
Sep 22 2023, 12:47 PM · SRE, ops-eqiad, Traffic, DC-Ops

Sep 18 2023

BBlack updated the task description for T346640: Traffic cache daemon restart scripts need some rework.
Sep 18 2023, 2:27 PM · SRE, Traffic
BBlack triaged T346640: Traffic cache daemon restart scripts need some rework as Medium priority.
Sep 18 2023, 2:26 PM · SRE, Traffic

Sep 15 2023

BBlack added a comment to T337446: Rebuild sanitarium hosts.

There's a followup commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924508/1/hieradata/common/service.yaml

Sep 15 2023, 7:38 PM · User-notice-archive, TaxonBot, cloud-services-team, Data-Engineering, Data-Services, DBA

Sep 14 2023

BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

https://grafana.wikimedia.org/d/000000513/ping-offload might be a good starting point (might need some updates/tweaking to get the exact data you want, though)

Sep 14 2023, 7:31 PM · Infrastructure-Foundations, Traffic, netops, SRE
BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

some sort of rate-limiting configured on the switch-side for ICMP echo, which was IP-aware and didn't count packets from our own internal systems

Sep 14 2023, 7:29 PM · Infrastructure-Foundations, Traffic, netops, SRE

Sep 8 2023

BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

Reading into the code above and the history more and self-correcting: the ratelimiter doesn't apply to PTB packets, just some other informational packets. Apparently we bumped the ratelimiter first as a short-term mitigation (for all the sites), I guess primarily to avoid what looks like ping loss to our monitoring and/or users, then deployed the ping offloader in some places as well as a better way to deal with it (and I guess at thousands per second, the pps reduction probably is useful, although I don't know to what degree).

Sep 8 2023, 12:57 PM · Infrastructure-Foundations, Traffic, netops, SRE
BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

The current puppetized tuneables are at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/8ed59718c7a7603b61d7d42e05726fd11dae5eaa/modules/lvs/manifests/kernel_config.pp#49

Sep 8 2023, 12:52 PM · Infrastructure-Foundations, Traffic, netops, SRE
BBlack added a comment to T345809: Do we need ping offload servers at all POPs?.

to reduce load on LVS hosts

Sep 8 2023, 12:48 PM · Infrastructure-Foundations, Traffic, netops, SRE

Sep 5 2023

BBlack added a comment to T345334: Cache thumbs in our caching infrastructure (e.g. ATS).

This topic probably deserves a ~hour meeting w/ Traffic to hash out some of the potential solutions and tradeoffs, but I'm gonna try to bullet-point my way through a few points for now anyways to seed further discussion:

Sep 5 2023, 5:57 PM · SRE, Thumbor, SRE-swift-storage, Traffic

Jun 12 2023

BBlack added a comment to T337535: Figure out what changes are needed in the traffic layer for having codfw be the r/w DC for half a year.

The more I've thought about this issue, the more I think we should probably stick with the (very approximate) latency mapping we have, and not try to have a second setup to optimize for the codfw-primary case. I do think we should swap the core DCs at the front of the global default entry on switchover, though, and the patch above makes that spot a little more visible. There shouldn't be any hard dependencies between this and other steps, but it could be done around the start of the switchover process asynchronously.

Jun 12 2023, 4:59 PM · SRE, Traffic, serviceops, Datacenter-Switchover

May 31 2023

BBlack added a comment to T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki).

We've got a pair of patches to review now which configure this on the pybal and safe-service-restart sides. We could especially use serviceops input on the latter. None of it's particularly pretty, but at least it's fairly succinct and seems to do the job!

May 31 2023, 2:29 PM · Patch-For-Review, SRE-OnFire, Sustainability (Incident Followup), serviceops, Traffic, conftool

May 30 2023

BBlack added a comment to T337446: Rebuild sanitarium hosts.

Note: I restored+amended https://gerrit.wikimedia.org/r/c/operations/puppet/+/924342 and merged+deployed it on lvs1018+lvs1020. This seems to work and disable the problematic monitoring that impacts LVS itself.

May 30 2023, 1:25 PM · User-notice-archive, TaxonBot, cloud-services-team, Data-Engineering, Data-Services, DBA

May 27 2023

BBlack added a comment to T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki).

As you seem to be working on this I'm bluntly assigning to you as part of the incident followup.

May 27 2023, 12:47 AM · Patch-For-Review, SRE-OnFire, Sustainability (Incident Followup), serviceops, Traffic, conftool

Apr 27 2023

BBlack added a comment to T334048: Cookbook to depool a site in AuthDNS.

I like this direction (etcd). It's not super-trivial, but we've complained a lot even internally about the lack of etcd support for depooling whole sites at the public edge.

Apr 27 2023, 7:41 PM · Traffic

Apr 26 2023

BBlack added a comment to T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki).

Probably needs subtasks for two things:

  1. Fix "safe-service-restart.py" being unsafe (either it or its caller is failing to propagate an error upstream to stop the carnage, and is also leaving a node depooled when the error happens between the depool and repool operations. At least one of those needs fixing, if not both).
  2. The whole 'template the local appservers.svc IP into the "instrumentation_ips"' thing at the pybal level, plus whatever changes are needed to use it from the scap side of things (so that it only checks one local pybal, and it's the correct one by current pooling).
Apr 26 2023, 9:08 PM · Patch-For-Review, SRE-OnFire, Sustainability (Incident Followup), serviceops, Traffic, conftool
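
The first item above describes two failure modes: an error that is not propagated upstream, and a node left depooled when the error happens between depool and repool. A minimal sketch of a pattern that avoids both is shown here. It is not the actual safe-service-restart.py; `depool`, `restart` and `repool` are hypothetical callables standing in for the real operations, and whether a node should be repooled after a failed restart is a policy choice this sketch simply assumes.

```python
from typing import Callable


def safe_restart(
    depool: Callable[[], None],
    restart: Callable[[], None],
    repool: Callable[[], None],
) -> None:
    """Depool, restart, and always attempt to repool; errors propagate."""
    depool()  # if this raises, nothing was changed and the caller sees the error
    try:
        restart()
    finally:
        # Runs whether or not restart() raised, so the node is not left
        # depooled; any exception from restart() still reaches the caller.
        repool()
```

A caller looping over many hosts would then stop on the first exception instead of continuing to the next host.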
BBlack created P47284 Something like this.
Apr 26 2023, 6:48 PM
BBlack added a comment to T334467: Can't retrieve HTML from REST API .

The patch has been rolled out everywhere for a little while at this point, should be able to confirm success

Apr 26 2023, 4:56 PM · RESTBase-API, API Platform, Wikimedia Enterprise
BBlack added a comment to T334467: Can't retrieve HTML from REST API .

We had a brief meeting on this, and I think the actual problem and immediate workaround is actually much simpler than we imagined. We're going to apply the same workaround we did for MediaWiki traffic in T238285 ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/882663/ ) to the Restbase traffic for now. Patch incoming shortly!

Apr 26 2023, 3:00 PM · RESTBase-API, API Platform, Wikimedia Enterprise

Apr 25 2023

BBlack added a comment to T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh').

I think we need to rewind a step here. We do want mh, but we want it for the current public sh cases (basically: text and upload ports 80+443), and maybe the other three sh cases (kibana + thanos), although we can start with text+upload first and then talk about those others with the respective teams. The current ticket description and patches seem to be going after the opposite: switching the current wrr services to mh via hieradata and spicerack changes. I think this would be actively harmful. sh and mh choose the destination based on hashes of the source address, which is great for public-facing traffic, but for these services it would be hashing on our very limited set of internal cache exit IPs (or other internal service clusters for internal LVS'd traffic), and so it wouldn't balance very well at all. One could potentially address that by including the source port in the hash, but it still seems like it would be more-complicated and less-optimal than just sticking with wrr for these cases.

Apr 25 2023, 1:48 PM · Traffic
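
To illustrate why hashing on the source address balances poorly when there are only a few distinct sources, here is a simplified sketch. It uses a plain SHA-256-modulo mapping, which is an assumption for illustration and is neither the IPVS sh scheduler nor Maglev hashing; the backend names and IP addresses are made up.

```python
import hashlib


def pick_backend(source_ip: str, backends: list[str]) -> str:
    """Map a source address to a backend by hashing the address only."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = [f"backend{i}" for i in range(16)]

# A handful of internal exit IPs (hypothetical) can reach at most
# that many backends, however much traffic they send.
internal_sources = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
used_internal = {pick_backend(ip, backends) for ip in internal_sources}
print(len(used_internal))  # never more than 4 of the 16 backends

# Many distinct public-style sources spread across the whole pool.
public_sources = [f"100.64.{i}.{j}" for i in range(40) for j in range(250)]
used_public = {pick_backend(ip, backends) for ip in public_sources}
print(len(used_public))
```

With only four sources, at least twelve of the sixteen backends receive nothing, which is the imbalance the comment warns about for internal traffic.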

Apr 17 2023

BBlack added a comment to T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki).

So, the solution quoted from my IRC chat above: that's about making the depool verification code actually track the currently-live "low-traffic" (applayer/internal) LVS routing, as opposed to what it's doing now (which I think checks the primary+secondary for the role as-configured in puppet, which doesn't account for any failure/depool/etc at the LVS layer).

Apr 17 2023, 2:47 PM · Patch-For-Review, SRE-OnFire, Sustainability (Incident Followup), serviceops, Traffic, conftool

Apr 14 2023

BBlack added a comment to T332024: GeoIP mapping experiments.

It's awesome to see this moving along! One minor point:

Apr 14 2023, 7:08 PM · Patch-For-Review, SRE, Infrastructure-Foundations, Traffic

Apr 13 2023

BBlack added a comment to T331356: Wikidata seems to still be utilizing insecure HTTP URIs.

Some remarks:

  • We should consider these canonical HTTP URIs to be names in the first place, which are unique worldwide and issued by the Wikidata project as the "owner" [1] of the wikidata.org domain. The purpose of these names is to identify things.
Apr 13 2023, 6:32 PM · wmde-wikidata-tech, SRE, [DEPRECATED] wdwb-tech, Traffic, Wikidata

Mar 30 2023

BBlack created P46001 Bard on FF/CRLite.
Mar 30 2023, 6:50 PM

Mar 13 2023

BBlack added a comment to T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]00[123]..

Resilient hashing indeed sounds much better (it seems like that's their codeword for some internal "consistent hashing" implementation), but it doesn't look like our current router OS has it, at least not when I looked at cr1-eqiad.

Mar 13 2023, 5:02 PM · SRE, Traffic

Mar 6 2023

BBlack closed T330906: HTTP URIs do not resolve from NL and DE? as Resolved.

The redirects are neither good nor bad, they're instead both necessary (although that necessity is waning) and insecure. We thought we had standardized on all canonical URIs being of the secure variant ~8 years ago, and this oversight has flown under the radar since then, only to be exposed recently when we intentionally (for unrelated operational reasons) partially degraded our port 80 services.

Mar 6 2023, 9:26 PM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic
BBlack triaged T331356: Wikidata seems to still be utilizing insecure HTTP URIs as High priority.
Mar 6 2023, 9:25 PM · wmde-wikidata-tech, SRE, [DEPRECATED] wdwb-tech, Traffic, Wikidata
BBlack reopened T330906: HTTP URIs do not resolve from NL and DE? as "Open".

As I already mentioned earlier, the SPARQL endpoint and the RDF serialized data all use the HTTP version as the canonical identifier. This makes sense to me and is, as far as I know, in line with other linked data best practices. But there needs to be a machine readable way to access the data.

Using a 301 to redirect to the HTTPS URL is the correct approach and in fact this is already implemented and currently working again from my end. When I run the same command as mentioned in my first report I now do get a 301 reply. I hope this will keep working in this way until HTTP URIs are no longer used within WD. I will close the issue for now.

Mar 6 2023, 5:21 PM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic

Mar 2 2023

BBlack added a comment to T128559: Enable HSTS on store.wikimedia.org for HTTPS.

Is there a reasonable shopify alternative that meets policy? That would be my question. If there isn't, we're stuck with this policy violation, but shouldn't stop calling it out as a violation. If there is, we should take a look at whether they can also meet our tech policies for our domains. Another viable resolution is simply to move the service to a novel domainname that isn't entangled with our infrastructure (wikipedia-shop.com or whatever), and manage it entirely separately from our production policies and practices.

Mar 2 2023, 7:10 PM · Traffic, SRE, Wikimedia-Shop, HTTPS
BBlack added a comment to T330906: HTTP URIs do not resolve from NL and DE?.

Thanks for the replies! Advising to use HTTPS over HTTP makes sense.

But not supporting redirection from HTTP to HTTPS will in my opinion introduce a fundamental problem for using Wikidata as a source for Linked Data. When querying Wikidata through the sparql endpoint the entities of the result set are all HTTP URIs. The RDF description of WD entities (accessed as described on https://www.wikidata.org/wiki/Wikidata:Data_access) contain many HTTP URIs for related entities and other resources.

Using the HTTP as identifier for the entity is not problematic as long as the redirection from HTTP to HTTPS can deliver access to the data itself.

Mar 2 2023, 2:44 PM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic

Mar 1 2023

BBlack added a comment to T330906: HTTP URIs do not resolve from NL and DE?.

The intermittent availability of port 80 is part of ongoing operational work, which is why it worked briefly earlier. However, the correct fix from the user POV is to not use port 80 in the first place (use HTTPS, not HTTP).

Mar 1 2023, 4:36 PM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic
BBlack added a comment to T330906: HTTP URIs do not resolve from NL and DE?.

Please use HTTPS rather than unencrypted HTTP, in any URIs referencing Wikimedia sites, e.g. https://www.wikidata.org/entity/Q42

Mar 1 2023, 4:33 PM · [DEPRECATED] wdwb-tech, Wikidata, SRE, Traffic

Feb 23 2023

BBlack added a comment to T128559: Enable HSTS on store.wikimedia.org for HTTPS.

Maybe worth pointing out (I had an old stale link to this years ago earlier in the ticket), if nothing else because it may cause whoever at shopify to actually reach out to an engineer:

Feb 23 2023, 9:09 PM · Traffic, SRE, Wikimedia-Shop, HTTPS
BBlack added a comment to T128559: Enable HSTS on store.wikimedia.org for HTTPS.

I assume "makes sense" here is probably cases where shopify knows of or has configured actual subdomains of the domain in question, or something like that. In either case, yes, probably pointing out that "preload" requires "includeSubDomains" should help with that part. preload is what we really want.

Feb 23 2023, 9:04 PM · Traffic, SRE, Wikimedia-Shop, HTTPS
BBlack added a comment to T309787: Remove IEContentAnalyzer.

Looks good to me, and appropriate at the Varnish layer in this case.

Feb 23 2023, 5:11 PM · MW-1.41-notes (1.41.0-wmf.2; 2023-03-27), MW-1.40-notes (1.40.0-wmf.27; 2023-03-13), SRE, Traffic, Technical-Debt, MediaWiki-File-management

Feb 21 2023

BBlack moved T330084: gdnsd failures when converting services from active/passive to active/active from Backlog to Traffic team actively servicing on the Traffic board.
Feb 21 2023, 3:50 PM · Traffic, Infrastructure-Foundations, netbox
BBlack moved T330165: eqiad row B switches upgrade from Backlog to Ready for work on the Traffic board.
Feb 21 2023, 3:49 PM · Patch-For-Review, Data Pipelines, Data-Engineering-Planning, DBA, Discovery-Search (Current work), SRE, serviceops, cloud-services-team, Machine-Learning-Team, Platform Engineering, SRE Observability, Infrastructure-Foundations, collaboration-services, Traffic

Feb 6 2023

BBlack closed T325797: oom killed varnish on cp4052 as Resolved.
Feb 6 2023, 6:33 PM · SRE, Traffic

Jan 25 2023

BBlack added a comment to T238285: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS).

With the merge above, I think this issue is at least mitigated for now. It's not a great long-term solution, but it should alleviate the user-facing side of this in practice.

Jan 25 2023, 4:06 PM · Traffic-Icebox, affects-Kiwix-and-openZIM, WMF-General-or-Unknown, SRE, User-DannyS712

Jan 24 2023

BBlack added a comment to T326564: codfw: Relocate servers to make space for new switches in rowA and rowB.

@Papaul - I can't make that slot for LVS, I have meetings a bit later that might get run over. @ssingh might be able to though!

Jan 24 2023, 7:15 PM · SRE, Infrastructure-Foundations, netops, ops-codfw

Jan 18 2023

BBlack added a comment to T102099: Fix IPv6 autoconf issues once and for all, across the fleet..

I fixed all these cases noted above for now. Note that in the lvs1017 case, this could've potentially caused a public service outage for IPv6 text-lb. This is because @ipaddress6 was also templated into pybal.conf as the BGP next-hop address. After the fixup, the puppet agent fixed that:

Jan 18 2023, 7:58 PM · Infrastructure-Foundations, User-jbond, netops, SRE, IPv6
BBlack added a comment to T102099: Fix IPv6 autoconf issues once and for all, across the fleet..

Bump - these issues continue to affect us sometimes. There seem to be some cases where Juniper can mis-route an RA to an interface it doesn't belong on (interface is on vlanX, but gets an RA that should only ever be seen on vlanY). During this past week/weekend's switch issues in codfw, this issue caused all hosts in rack B2 (which are all in the private1-b-codfw vlan 2620:0:860:102:) to receive RAs from the cloud-hosts1-codfw vlan 2620:0:860:118:.

Jan 18 2023, 7:44 PM · Infrastructure-Foundations, User-jbond, netops, SRE, IPv6
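
A small sketch of how an autoconfigured address could be checked against the prefix its vlan is supposed to use, based on the two prefixes named in this comment. The /64 prefix length and the host addresses are assumptions for illustration, and `in_expected_prefix` is a hypothetical helper, not existing tooling.

```python
import ipaddress

# Prefix quoted in the comment for private1-b-codfw; the /64 length is assumed.
PRIVATE1_B_CODFW = "2620:0:860:102::/64"


def in_expected_prefix(address: str, expected_prefix: str) -> bool:
    """True if the address lies inside the prefix expected for its vlan."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(expected_prefix)


# An address autoconfigured from the correct RA (hypothetical host part).
print(in_expected_prefix("2620:0:860:102::10", PRIVATE1_B_CODFW))  # True
# An address picked up from a stray cloud-hosts1-codfw RA.
print(in_expected_prefix("2620:0:860:118::10", PRIVATE1_B_CODFW))  # False
```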

Jan 17 2023

BBlack created P43171 dc-maint.sh current code.
Jan 17 2023, 1:34 PM

Jan 12 2023

BBlack added a comment to T326745: Remove IPSec/Strongswan from Puppet repository.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/875897/ ! (apparently someone was already working on this!)

Jan 12 2023, 3:15 PM · Traffic, SRE

Jan 9 2023

BBlack added a comment to T325797: oom killed varnish on cp4052.

We have the patched package on cp5032 (bullseye). Did some manual testing on it today:

Jan 9 2023, 7:45 PM · SRE, Traffic
BBlack merged T86651: Fix LVS "sh" shortcomings into T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh').
Jan 9 2023, 2:45 PM · Traffic
BBlack merged task T86651: Fix LVS "sh" shortcomings into T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh').
Jan 9 2023, 2:45 PM · Traffic, SRE

Jan 5 2023

BBlack added a comment to T325797: oom killed varnish on cp4052.

Summarizing some of the lengthy IRC discussion and investigation on this topic (most of which was @Vgutierrez !):

Jan 5 2023, 1:03 AM · SRE, Traffic

Dec 14 2022

BBlack committed rCCKBe1f194c7845a: discovery: add drmrs IP.
Dec 14 2022, 3:30 PM

Dec 9 2022

BBlack created P42664 traffic restarts.
Dec 9 2022, 3:46 PM

Dec 8 2022

BBlack closed T324336: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn' as Resolved.

This is completed now. AFAIK all relevant scripts/automations/etc were updated to match. The conftool service keys for cacheproxy nodes are now just cdn, which controls pooling of the front edge ports 80/443 towards pybal, and ats-be, which controls pooling of the varnish->ats-be for cross-node chashing (except in ulsfo and eqsin, which have switched to a single-backend model and don't do this part anymore).

Dec 8 2022, 4:34 PM · SRE, Traffic

Dec 7 2022

BBlack moved T269828: X-Cache-Status: distinguish between fresh and stale hits/misses from Minor TODO to Complicated on the Traffic-Icebox board.
Dec 7 2022, 6:46 PM · Traffic-Icebox, SRE
BBlack moved T267867: purged is not resilient to kafka main nodes going down from Minor TODO to Complicated on the Traffic-Icebox board.
Dec 7 2022, 6:45 PM · Traffic-Icebox, SRE

Dec 2 2022

BBlack closed T324334: netbox-exports git cloning perf issues as Resolved.

Confirmed the same. That was a simple and elegant fix, so I doubt there's any reason to pursue more-complex options! Thank you!

Dec 2 2022, 3:47 PM · SRE, Infrastructure-Foundations, Traffic
BBlack updated the task description for T324336: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn'.
Dec 2 2022, 2:54 PM · SRE, Traffic
BBlack updated the task description for T324336: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn'.
Dec 2 2022, 2:53 PM · SRE, Traffic
BBlack added a comment to T324334: netbox-exports git cloning perf issues.

Comparison point: operations/dns is quite a bit different: total byte size is ~1/4 the size (~6MB vs the ~22MB size of netbox-exports), but has ~4x more commit history (~6000 for ops/dns vs ~1500 for netbox-exports). A fresh clone of this from eqsin (from the gerrit server over https) takes ~6s.

Dec 2 2022, 2:32 PM · SRE, Infrastructure-Foundations, Traffic
BBlack added a comment to T324334: netbox-exports git cloning perf issues.

Another thing I failed to mention above: for cases like this (automated git clones just for functional data) we could also potentially speed up the initial fetch by doing shallow checkouts (--depth=N), however, the non-smart HTTP protocol doesn't support this.

Dec 2 2022, 2:22 PM · SRE, Infrastructure-Foundations, Traffic
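
For reference, a shallow clone as mentioned in this comment could be driven from a script roughly as follows. This is a sketch only: the helper names are hypothetical, the repository URL in the usage comment is a placeholder, and it assumes the server speaks a protocol that supports shallow fetches.

```python
import subprocess


def shallow_clone_cmd(url: str, dest: str, depth: int = 1) -> list[str]:
    """Build a git clone command that fetches only the last `depth` commits."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return ["git", "clone", f"--depth={depth}", url, dest]


def shallow_clone(url: str, dest: str, depth: int = 1) -> None:
    """Run the clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(shallow_clone_cmd(url, dest, depth), check=True)


# Example (placeholder URL, not a real repository):
# shallow_clone("https://example.org/some/repo.git", "/tmp/repo", depth=1)
print(shallow_clone_cmd("https://example.org/some/repo.git", "/tmp/repo"))
```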
BBlack updated the task description for T324336: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn'.
Dec 2 2022, 2:19 PM · SRE, Traffic
BBlack updated the task description for T324336: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn'.
Dec 2 2022, 2:16 PM · SRE, Traffic
BBlack created T324336: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn'.
Dec 2 2022, 2:15 PM · SRE, Traffic
BBlack created T324334: netbox-exports git cloning perf issues.
Dec 2 2022, 1:58 PM · SRE, Infrastructure-Foundations, Traffic

Nov 4 2022

BBlack added a comment to T322420: ATS flags origin servers as down during 60 seconds after a connect timeout.

Arguably we want this down server cache time to be very low or even disabled in the general case. It's not likely that caching the origin outage is going to help us more than hurt us (unless it's the only thing that prevents ats-be meltdown due to dead origin A having a bigger effect on unrelated origins B, C, D...)

Nov 4 2022, 3:31 PM · Traffic, SRE

Nov 3 2022

BBlack added a comment to T282880: Revisit varnish dynamic backends mechanism.

Bump - we should revisit this, but perhaps after finishing the cache role name cleanup (text vs text_envoy vs text_haproxy...).

Nov 3 2022, 12:20 PM · Patch-For-Review, Traffic
BBlack closed T282788: drmrs: primary software task as Resolved.

Should've been resolved a while back!

Nov 3 2022, 12:16 PM · Infrastructure-Foundations, Traffic, SRE

Oct 31 2022

BBlack added a comment to T288106: Experiment with single backend CDN nodes.

Update - ulsfo is repooled this morning, with all new hardware on the new configuration, and has the "single-backend" mode enabled for both clusters at ulsfo. We'll be keeping an eye on hitrates here, and then trying to follow the same pattern in the upcoming eqsin hardware transition.

Oct 31 2022, 3:50 PM · SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), User-ema, Traffic

Oct 27 2022

BBlack committed rLPRIc8c256b06046: Add fake digicert-2022 keys.
Oct 27 2022, 10:11 PM