Mon, Nov 4
CPT: please take a new look, thanks :)
Thu, Oct 24
I'm a bit confused; as far as I know, the plan was always to have Phabricator HA between eqiad and codfw, and the linked task T190572 also discusses that. Is that no longer the case, and if so, why? I believe there have been blockers and complications for that deployment, but are they documented anywhere? How does this task relate to those plans, and why do we feel failover within eqiad is (also) needed?
The package just shipped and will hopefully arrive tomorrow. I created IM ticket SCTASK0120754 for shipment notification.
Tue, Oct 22
Could CPT take a look at this please? Thanks!
Sep 17 2019
What's the status of this? Is this done and working?
Sep 12 2019
Sep 5 2019
Hi Anusha, Greg,
Aug 9 2019
The EX4200 can also have any port converted to a VC port - it just won't be as fast, max 10Gbps.
Aug 6 2019
Approved for access.
Jul 23 2019
Because this means that stub dump generation for (at least) enwiki, dewiki, and several other wikis is currently broken, we have only a few days to fix this before the dumps need to be completed at the end of the month. Setting UBN...
Apr 16 2019
Apr 5 2019
While I agree with Daniel and others that using the MediaWiki db connection/load balancing layer is an absolute minimum requirement, there are quite a few other potential problems that could affect the security/privacy, reliability, or maintainability of our data and services if Doctrine is used to access MediaWiki's existing databases in any way (it's definitely easier if done in separate, unconnected database clusters). However, this ticket is so far very sparse on details, and we don't have the information we need to make an informed decision. I requested access to the linked document yesterday, but it hasn't been granted yet. Alternatively, could its contents perhaps be replicated here on Phabricator so everyone involved can form an informed opinion? Thanks. :)
Apr 1 2019
There has been some concern from our DBAs that archiving the old policy will make it even harder for developers to find out which database-related requirements their code should fulfill, and what the process is to get schema or query changes deployed (such as a link to the Schema_changes page). The old information on database-related requirements, while admittedly a bit outdated, was discussed as an RFC at the time: https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2015-09-16
Mar 22 2019
Mar 21 2019
Mar 6 2019
Feb 22 2019
Feb 5 2019
Jan 23 2019
Yes, we should probably move over to prefix-limit to prevent (improving) filters from making accepted-prefix-limit ineffective.
Have a look at https://github.com/mwiget/bgp_graceful_shutdown for a JunOS op script (SLAX) that does this fully automatically for all peers with a single command.
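For reference, the difference between the two knobs could be sketched as a JunOS config fragment like the one below (the group name, limit, and teardown values are purely hypothetical, not taken from our actual config):

```
# Hypothetical example. accepted-prefix-limit only counts prefixes that
# pass the import policy, so a tightening filter can keep a session under
# the limit even when the peer leaks a full table. prefix-limit counts
# everything received, before policy.
set protocols bgp group transit-example family inet unicast prefix-limit maximum 500000 teardown 90 idle-timeout 60
delete protocols bgp group transit-example family inet unicast accepted-prefix-limit
```

That's exactly the failure mode described above: as our filters improve, accepted-prefix-limit sees fewer prefixes and stops protecting us, while prefix-limit keeps counting the raw received routes.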
This was solved by fixing the original bastion, a while ago.
I really don't see the point of this. With the scarcity of IPv4 space we only need to get MORE flexible about how we use our IP space, and we will almost certainly not be able to maintain a production-vs-other split between these address blocks in the future. Rather than spend time on renumbering, I think it's much more valuable to spend that effort on better managing our ACLs and on more automation.
Jan 11 2019
Right now I can only find a single graph with the total (aggregated) power usage for eqiad/codfw; proper per-rack power usage data is still entirely missing. This makes it very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.
Dec 19 2018
I am getting the impression here that some things are being rushed and finalized without time for a proper discussion between people/teams about the different possible solutions and their impact, after this new discovery. Is that because goals are due to be posted now?
Oct 12 2018
Sep 18 2018
Although we didn't manage to discuss this in our SRE meeting yesterday, I discussed it with the relevant people afterwards.
Sep 11 2018
T97368 appears to be about the same issue.
Indeed, let's go with a "proper" Debian package; imho that's the cleanest way to go and conforms to how we do things.
Sep 3 2018
Yes, this can be merged once Nuria approves.
Aug 14 2018
@Dzahn please get her added to this list. Thanks!
Aug 13 2018
Aug 10 2018
Jul 30 2018
I am a bit confused by this RFC/proposal as it stands now, as I feel it doesn't really reflect the discussions we've been having.
Jul 25 2018
@ema: Has this been seen again? Does this need any work in Pybal?
The eqdfw-knams link needs to have a lower metric than the current primary path (codfw-eqiad + eqiad-esams) so that traffic from codfw to esams prefers that link.
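As a sketch only (interface names and metric values are hypothetical, and this assumes OSPF as the IGP), making the eqdfw-knams path win over codfw-eqiad + eqiad-esams could look like:

```
# Hypothetical values: the eqdfw-knams metric (60) must be lower than the
# SUM of the codfw-eqiad and eqiad-esams metrics (e.g. 50 + 50 = 100) for
# codfw -> esams traffic to prefer the eqdfw-knams path.
set protocols ospf area 0.0.0.0 interface xe-0/1/0.0 metric 60
```

Since IGP path selection compares the total cost of each path, what matters is that the single-hop metric via eqdfw-knams stays below the combined metric of the two-hop primary path.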
Jul 24 2018
Jul 16 2018
Jul 11 2018
This has been addressed in acdd0ebf74e5dd9e06c3216b9a93063ab8e91574
We had a long and interesting discussion about this on IRC.
Jul 10 2018
Jul 5 2018
Nice work! :)
Jul 4 2018
All to-be-removed servers have had their cabling removed, except in rack OE12.
The last RAID array (md2) is now resyncing.
All esams racks have been audited against their Racktables counterparts, and object locations and unusable items are now recorded correctly.
This has now been corrected and verified on-site.
After consulting with Ema, and considering how long this server has been broken, that it is 1 of 4 misc varnish servers, and that the misc cluster is being folded into text anyway, we decided it's not worth repairing this server.
ms-be3003 is still connected; eth0 and eth1 are connected to ports 4 and 5 of csw2-oe11-esams (csw2-esams), respectively.
The sdb drive in bast3002 (aka hooft) has been swapped with amslvs3's sdb (both LFF, non-hotswap). The server has booted back up, and sdb is being repartitioned and added back into the RAID arrays.
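For the record, repartitioning a replacement disk and re-adding it to md RAID1 arrays typically looks like the sketch below; this is not meant to be run verbatim, and the device names, array names, and partition layout are hypothetical (the actual layout on bast3002 may differ):

```
# Hypothetical device/array names. Copy the partition table from the
# surviving disk (sda) to the fresh replacement (sdb):
sfdisk -d /dev/sda | sfdisk /dev/sdb
# Re-add each sdb partition to its RAID1 array:
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2
# Watch the resync progress:
cat /proc/mdstat
```

Once mdadm has resynced each array, /proc/mdstat shows the arrays back in the clean [UU] state.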
Because cp3048 is out of warranty and it's unlikely we can get it fixed, I've used it as a parts donor for other broken systems:
cp3043 drive 2 (sdb) has been swapped for cp3048's drive 2 (sdb), as cp3048 is out of warranty and unfixable anyway. RAID1 md0 has been restored, and the server is back up and running.