This is approved.
Feb 20 2024
Sep 8 2023
Approved.
Sep 7 2023
Approved.
Apr 19 2023
@BTullis @odimitrijevic Given that this is an ongoing privacy leak, could we get some clarity on whether we can get this deployed soon, or how other teams may be able to help if needed?
Jul 18 2022
Jul 11 2022
This appears to be configurable in Swift 2.24.0 and later (we currently seem to be running 2.26.0 on 6 of 8 frontends...), by enabling a piece of middleware and configuring RFC-compliant ETag responses for specific Swift user accounts or containers:
https://docs.openstack.org/swift/latest/middleware.html#module-swift.common.middleware.etag_quoter
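For reference, enabling it would look roughly like the following in the proxy server configuration. This is a sketch based on the linked upstream docs (the filter/section names come from there; the exact pipeline order and our deployment specifics are not confirmed here):

```ini
# proxy-server.conf (sketch): add the middleware to the proxy pipeline
[pipeline:main]
pipeline = catch_errors ... etag-quoter proxy-server

[filter:etag-quoter]
use = egg:swift#etag_quoter
# Keep legacy behaviour as the default; opt in per account or container
rfc_compliant_etags = false
```

Per the same docs, RFC-compliant ETags can then be switched on selectively with account/container metadata such as `X-Account-Rfc-Compliant-Etags: true`.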
In T256217#7960730, @Krinkle wrote:I'm not sure since when, but based on us having <14 days ats-be storage, and based on there still being ETag headers on cached responses, I am guessing this is a pretty recent regression.
I'm finding that upload.wikimedia.org responses have neither ETag nor Last-Modified headers. This is also observed in T295556, though I'm skeptical it is the same issue given the caching described above; perhaps these headers are both kept in Swift and, for pre-existing objects, still served post-Swift-upgrade?
Feb 24 2022
Jul 26 2021
Given that the underlying problem this change might help with has already caused multiple full outages (all wikis affected) in the past year alone, and the extension is deployed on quite a few wikis, I'd like to ask for this to be looked into again for the near term. Raising priority to 'high'. Would this be in scope for PET's Clinic Duty? How can SRE help?
Feb 19 2021
In T274459#6841122, @thcipriani wrote:Whoa, catching up on scrollback overnight. My question is: is this the first anyone in SRE has heard about any of this?
Jan 22 2021
It's purely an idea I've had for a long time, to make it immediately obvious to anyone logging in what is backed up, and what isn't. That should help to:
Oct 8 2020
Hi all,
Sep 4 2020
Approved.
Sep 1 2020
Approved.
Approved.
Aug 20 2020
@wiki_willy @Papaul It seems we've had an ongoing pattern of crashes with this (rather important) backup host, which means we are not yet able to trust it. Until this is resolved, we also cannot decommission the older hosts that it replaces. At the moment the system doesn't even boot. Are there any steps we can take soon to debug this issue? Anything we can help with? Thanks!
Jul 10 2020
Approved.
May 27 2020
Apr 7 2020
Feb 21 2020
I am pretty sure there are a bunch of optics (of various kinds) in the "spare" switches at the bottom of rack OE15. Unfortunately those switches are not powered up, and certainly not configured or remotely manageable - something we should probably fix on the next visit.
Feb 18 2020
There are multiple 10G LR optics on-site for sure. Longer distance ones, less so.
Feb 13 2020
Personally I don't think PyBal should be rejecting that; it's a valid configuration from a technical standpoint, and there can be valid reasons to have it, at least temporarily. But we may decide that in our specific environment it should be avoided at all costs, so perhaps that logic should be implemented elsewhere - in the code that manages pooling state.
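A minimal sketch of what such a guard could look like in the tooling that manages pooling state (all names and the data shape are hypothetical, not PyBal's or conftool's actual model; Python is used since PyBal itself is Python):

```python
def can_depool(pool, server, threshold=1, force=False):
    """Return True if `server` may be depooled without dropping the
    number of pooled servers below `threshold`.

    `pool` maps server name -> {"pooled": bool}. This is an
    illustrative guard living in the pooling-management code, so that
    PyBal itself can keep accepting any technically valid state.
    """
    if force:
        return True
    remaining = sum(
        1 for name, state in pool.items()
        if name != server and state["pooled"]
    )
    return remaining >= threshold
```

The design point is that the policy ("never leave a service empty") lives in the code that changes pooling state, while PyBal simply acts on whatever state it is handed.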
Feb 12 2020
@wiki_willy With Chris having been ill the past few days, what's a realistic new ETA for this?
Dec 10 2019
In T238909#5727693, @akosiaris wrote:
I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balancers. Why couldn't our caching layer do that directly, and know about all the k8s proxies/nodes directly and do health checks for them?
Nov 28 2019
Nov 27 2019
In T184066#5695891, @RobH wrote:
In T184066#5694288, @Papaul wrote:
qfx5100-spare1, psu 0 {#20156} to ps2-oe15-esams:17
qfx5100-spare2, psu 0 {#20157} to ps2-oe15-esams:16
qfx5100-spare1, psu 1 {#20159} to ps1-oe15-esams:2
qfx5100-spare2, psu 1 {#20158} to ps1-oe15-esams:3
asw2-oe16-esams:psu0 {#20162} to ps2-oe16-esams:26
asw2-oe16-esams:psu1 {#20164} to ps1-oe16-esams:26
All the above are done, but NOT
scs1-oe15-esams:psu1 {#20163} to ps2-oe15-esams:34
scs1-oe15-esams:psu2 {#20164} to ps1-oe15-esams:34
as there is no scs-oe15-esams; not sure what that is. Mark's comment T184066#5694430 covers scs-oe16-esams.
Nov 26 2019
scs1-oe16-esams:psu1 {#20163} to ps2-oe16-esams:34
scs1-oe16-esams:psu2 {#20164} to ps1-oe16-esams:34
cr3-esams now has its power cables labeled:
cr2-esams now has its power cables labeled:
All duplicate ids have been fixed, labels replaced for one pair and updated in netbox.
I've filled out all red cells in the (original) bootstrap spreadsheet.
All 7 cable managers have been asset tagged and put into Netbox with the appropriate info and rack position.
All SERVER power cords have been audited in this sheet: https://docs.google.com/spreadsheets/d/1RMb6lMCc94wUj6MgSm1yYdnAC3SUsZIRj8zHLtxRx4o/edit?usp=sharing
Done.
Nov 25 2019
Nov 4 2019
CPT: please take a new look, thanks :)
Oct 24 2019
I'm a bit confused; as far as I know the old plan was always to have HA of Phabricator between eqiad and codfw, and the linked task T190572 also talks about that. So is that no longer the case, and if so, why is that? I believe there have been blockers & complications for that deployment, but are they documented anywhere? How does this task relate to those plans, why do we feel failover within eqiad is (also) needed?
Oct 22 2019
Could CPT take a look at this please? Thanks!
Sep 17 2019
What's the status of this? Is this done and working?
Sep 12 2019
In T231387#5471833, @Varnent wrote:@mark - Thank you very much for that thoughtful and helpful reply!
Talking it over, we would like to try the first option if you believe that will work.
So how do we go about getting this setup?
Anusha Alikhan
aalikhan@pr.wikimedia.org
Samantha Lien
slien@pr.wikimedia.org
Sep 5 2019
Hi Anusha, Greg,
Aug 9 2019
The EX4200 can also have any port converted to a VC port - it just won't be as fast, max 10Gbps.
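For the record, converting a port into a VC port is done from the Junos CLI; something like the following, where the pic-slot and port numbers are examples, not a specific recommendation for our switches:

```
request virtual-chassis vc-port set pic-slot 1 port 0
```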
Aug 6 2019
Approved for access.
Jul 23 2019
Because this means that right now stub dumps generation for (at least) enwiki, dewiki, and several other wikis is broken, we have only a few days to fix this before the dumps need to be run at the end of the month. Setting UBN...
Apr 16 2019
Apr 5 2019
While I agree with Daniel and others that use of the MediaWiki db connection/load balancing layer is an absolute minimum requirement, there are quite a few other potential problems that could affect the security/privacy, reliability, or maintainability of our data and services if Doctrine is to be used to access MediaWiki's existing databases in any way (it's definitely easier if done in separate, unconnected database clusters). However, this ticket is so far very sparse on details, and we don't have the information we need to make an informed decision. I requested access to the linked document yesterday, but it hasn't been granted yet. Alternatively, could it perhaps be replicated here on Phabricator so everyone involved can build an informed opinion? Thanks. :)
Apr 1 2019
There has been some concern from our DBAs that the archiving of the old policy will make it even harder for developers to find out what database-related requirements their code should fulfill, and what the processes are to get any schema or query changes deployed (such as a link to the Schema_changes page). The old information on database-related requirements, while admittedly a bit outdated, was discussed as an RFC at the time: https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2015-09-16
Mar 22 2019
Mar 21 2019
Mar 6 2019
Feb 22 2019
Feb 5 2019
Jan 23 2019
In T211254#4902340, @BBlack wrote:In T211254#4902250, @mark wrote:In T211254#4902223, @BBlack wrote:It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.
In a world where there's ample address space (such as 10/8 in our context), yes. In today's world where IPv4 address space is scarce and we can likely not get any more, not so much.
I would personally have preferred that with the renumbering of WMCS they simply acquired new public IPv4 space of their own.
That's simply not realistic, they can't "acquire" IPv4 address space of their own. They're part of this organisation, this ASN, and need to use our PI/PA space where we have it available before we collectively can get more.
I understand the basic concerns here about exhaustion and how the process works. I think it would've been possible to find a way to ask for new or acquire new space though, even in the US. It's just a process and a cost at the end of the day.
Yes, we should probably move over to prefix-limit to prevent (improving) filters from making accepted-prefix-limit ineffective.
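The difference, sketched in Junos config terms (group name and limits below are illustrative, not our actual config): accepted-prefix-limit counts prefixes remaining after import policy, while prefix-limit counts everything received from the peer, so it still trips if a peer leaks a full table that our filters happen to reject:

```
set protocols bgp group transit family inet unicast prefix-limit maximum 500000
set protocols bgp group transit family inet unicast prefix-limit teardown 90
```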
In T211254#4902223, @BBlack wrote:It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.
Have a look at https://github.com/mwiget/bgp_graceful_shutdown for a JunOS op script (SLAX) that does this fully automatically for all peers with a single command.
This was solved by fixing the original bastion, a while ago.
I really don't see the point of this. With the scarcity of IPv4 space we only need to get MORE flexible about how we use our IP space, and we will almost certainly not be able to maintain production vs others split between these address blocks in the future. Rather than spend time on renumbering I think it's much more valuable to spend that effort on better managing our ACLs and more automation.
Jan 11 2019
Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage; proper per-rack power usage data is still entirely missing. This makes it very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.
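Once per-phase readings per rack are available, the derived values are straightforward to compute. A sketch (function names, data shapes, and the fixed-voltage simplification are assumptions for illustration, not our monitoring schema):

```python
def phase_imbalance(phase_loads):
    """Imbalance of a multi-phase feed: the maximum deviation of any
    single phase from the mean, as a fraction of the mean load.

    `phase_loads` is a sequence of per-phase loads (e.g. amps) for one
    rack PDU.
    """
    mean = sum(phase_loads) / len(phase_loads)
    if mean == 0:
        return 0.0
    return max(abs(load - mean) for load in phase_loads) / mean


def rack_power(phase_loads, voltage=230):
    """Approximate total rack power in watts across all phases,
    assuming a fixed per-phase voltage (an illustrative simplification
    that ignores power factor)."""
    return sum(phase_loads) * voltage
```

For example, phases drawing 12 A, 10 A, and 8 A have a mean of 10 A and an imbalance of 0.2 (20%).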
Dec 19 2018
I am getting the impression here that some things are being rushed and finalized without time for a proper discussion between people/teams about the different possible solutions and their impact, after this new discovery. Is that because goals are due to be posted now?
Oct 12 2018
Sep 18 2018
Although we didn't manage to discuss this in our SRE meeting yesterday, I discussed it with the relevant people afterwards.
Sep 11 2018
T97368 appears to be about the same issue.
Indeed, let's go with a "proper" Debian package; imho that's the cleanest way to go, and it conforms to how we do things.
Sep 3 2018
Yes, this can be merged once Nuria approves.
Aug 14 2018
@Dzahn please get her added to this list. Thanks!
Aug 13 2018
Aug 10 2018
In T200297#4493122, @Halfak wrote:I talked to @mark today. Here's what I understood from the conversation:
- All of the following points assume that the TechCom discussion happens and there's a decision that the local-wiki JADE namespace is the only reasonable implementation strategy
- Large wikis (enwiki, wikidatawiki, and commonswiki) are where the concerns exist. All other, smaller wikis are less of a concern.
- The revision table is the only table that is a serious concern for large wikis. The page table is less of a concern.
- Our estimated growth of 0.5M new revisions per large wiki per year is acceptable growth.
- In order to account for fluctuations, a ceiling of 1M new revisions per large wiki per year is acceptable for JADE judgments.
Jul 30 2018
I am a bit confused by this RFC/proposal as it stands now, as I feel it doesn't really reflect the discussions we've been having.
Jul 25 2018
In T195923#4450204, @Cmjohnson wrote:@ayounsi I was not able to add the ports in row A to the public vlan. Can you check the following and add to public vlan. Also, adding the servers to the vlan in the other rows did not automatically enable the ports. Can you also check please.
@ema: Has this been seen again? Does this need any work in Pybal?
The eqdfw-knams link needs a lower metric than the current primary path (codfw-eqiad + eqiad-esams) so that traffic from codfw to esams prefers the direct link.
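Assuming the metrics are set per interface in the IGP, that would look something like the following Junos sketch, where the interface name, area, and value are placeholders (the only real constraint being that the metric stays below the sum of the codfw-eqiad and eqiad-esams link metrics):

```
set protocols ospf area 0.0.0.0 interface xe-0/1/0.0 metric 300
```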