Apr 16 2019
Apr 5 2019
While I agree with Daniel and others that use of the MediaWiki db connection/load balancing layer is an absolute minimum requirement, there are quite a few other potential problems that could affect the security/privacy, reliability or maintainability of our data and services if Doctrine is used to access MediaWiki's existing databases in any way (it's definitely easier if done in separate, unconnected database clusters). However, this ticket is so far very sparse on details, and we don't have the information we need to make an informed decision. I requested access to the linked document yesterday, but it hasn't been granted yet. Alternatively, could this perhaps be replicated here on Phabricator so everyone involved can form an informed opinion? Thanks. :)
Apr 1 2019
There has been some concern from our DBAs that archiving the old policy will make it even harder for developers to find out what database-related requirements their code should fulfill, and what the processes are for getting any schema or query changes deployed (such as a link to the Schema_changes page). The old information on database-related requirements, while admittedly a bit outdated, was discussed as an RFC at the time: https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2015-09-16
Mar 22 2019
Mar 21 2019
Mar 6 2019
Feb 22 2019
Feb 5 2019
Jan 23 2019
Yes, we should probably move over to prefix-limit to prevent (improving) filters from making accepted-prefix-limit ineffective.
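As a rough sketch of what that change could look like on a Junos peer (the group name, address family and limit values below are made up for illustration, not taken from our actual config):

  protocols {
      bgp {
          group Transit4 {
              family inet {
                  unicast {
                      /* counts prefixes received before import policy runs, unlike accepted-prefix-limit */
                      prefix-limit {
                          maximum 10000;
                          /* warn at 80% of maximum, tear the session down when exceeded, stay down 60 minutes */
                          teardown 80 idle-timeout 60;
                      }
                  }
              }
          }
      }
  }

Because the count happens before our (improving) import filters reject anything, the session still gets torn down during a full-table leak, which is exactly the case accepted-prefix-limit would miss.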
Have a look at https://github.com/mwiget/bgp_graceful_shutdown for a JunOS op script (SLAX) that does this fully automatically for all peers with a single command.
This was solved by fixing the original bastion, a while ago.
I really don't see the point of this. With the scarcity of IPv4 space we only need to get MORE flexible about how we use our IP space, and we will almost certainly not be able to maintain a production-vs-other split between these address blocks in the future. Rather than spend time on renumbering, I think it's much more valuable to spend that effort on better managing our ACLs and on more automation.
Jan 11 2019
Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage, but proper per-rack power usage data is still entirely missing. This makes it very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.
Dec 19 2018
I am getting the impression that, after this new discovery, some things are being rushed and finalized without time for a proper discussion between people/teams about the different possible solutions and their impact. Is that because goals are due to be posted now?
Oct 12 2018
Sep 18 2018
Although we didn't manage to discuss this in our SRE meeting yesterday, I discussed it with the relevant people afterwards.
Sep 11 2018
T97368 appears to be about the same issue.
Indeed, let's go with a "proper" Debian package; imho that's the cleanest way to go, and it conforms to how we do things.
Sep 3 2018
Yes, this can be merged once Nuria approves.
Aug 14 2018
@Dzahn please get her added to this list. Thanks!
Aug 13 2018
Aug 10 2018
Jul 30 2018
I am a bit confused by this RFC/proposal as it stands now; I feel it doesn't really reflect the discussions we've been having.
Jul 25 2018
@ema: Has this been seen again? Does this need any work in Pybal?
The eqdfw-knams link needs to have a lower metric than the combined metric of the current primary path (the codfw-eqiad + eqiad-esams links), so that traffic from codfw to esams prefers the direct link.
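In other words, roughly: metric(eqdfw-knams) < metric(codfw-eqiad) + metric(eqiad-esams). A minimal Junos sketch of the knob involved (interface name and metric value are hypothetical, and it would normally be set symmetrically on both ends of the link):

  protocols {
      ospf {
          area 0.0.0.0 {
              /* eqdfw router's interface towards knams (hypothetical name) */
              interface xe-0/1/3.0 {
                  /* keep this below the combined codfw-eqiad + eqiad-esams cost */
                  metric 500;
              }
          }
      }
  }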
Jul 24 2018
Jul 16 2018
Jul 11 2018
This has been addressed in acdd0ebf74e5dd9e06c3216b9a93063ab8e91574
We had a long and interesting discussion about this on IRC.
Jul 10 2018
Jul 5 2018
Nice work! :)
Jul 4 2018
All to-be-removed servers have had their cabling removed, except in rack OE12.
The last RAID array (md2) is now resyncing.
All esams racks have been audited against their Racktables counterparts, and object locations and unusable items are now correct.
This has now been corrected and verified on-site.
After consultation with Ema, and considering how long this server has been broken, that it is 1 out of 4 misc varnish servers, and that the misc cluster is being folded into text anyway, we decided it's not worth repairing this server.
ms-be3003 is still connected: eth0 and eth1 are connected to ports 4 and 5 of csw2-oe11-esams (csw2-esams), respectively.
bast3002 (aka hooft) sdb has been swapped with amslvs3's sdb (both LFF, non-hotswap). The server has booted back up, and sdb is being repartitioned and added to the RAID arrays.
Because cp3048 is out of warranty and it's unlikely we can get it fixed, I've used it as a parts donor for other broken systems:
cp3043 drive 2 (sdb) has been swapped for cp3048's drive 2 (sdb), as cp3048 is out of warranty and unfixable anyway. RAID1 md0 has been restored, and the server is back up and running.
Jul 3 2018
As Chris confirmed, this is due to either the CPU or memory.
These were purchased by Rob and given to me in January.
Jun 25 2018
Jun 14 2018
It looks like cp3030-cp3039 are in OE13 (despite what Racktables says), and cp3040+ are in OE40, so Racktables has them swapped relative to reality. This is confirmed by LLDP info from the switch.