
Rack/cable/configure asw2-a-eqiad switch stack
Closed, Resolved (Public)

Description

Similar to T148506.

This is about row A only:

  • Rack and cable the switches according to diagram (blocked on T187118) [Chris]
    rows-abc-eqiad-cabling.png (73 KB)
  • Connect mgmt/serial [Chris]
  • Check via serial that switches work and that ports are configured as down [Arzhel]
  • Stack the switches, upgrade JunOS, apply the initial switch configuration (see the sketch after this list) [Arzhel]
  • Add to DNS [Arzhel]
  • Add to LibreNMS & Rancid [Arzhel]
  • Uplink ports configured [Arzhel]
  • Add to Icinga [Arzhel]
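
A rough sketch of the checks the serial/stacking steps imply once the stack is reachable; these are standard JunOS operational commands, and the exact image name and member count are intentionally left out:

show virtual-chassis
# expect all members Prsnt, with one Master and one Backup routing engine
show version
# confirm every member runs the target JunOS release after the upgrade
show interfaces terse | match "ge-|xe-"
# at this point the server-facing ports should still be down/disabled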

Thursday 22nd, noon Eastern (4pm UTC), 3h (for all 3 rows)

  • Verify cr2-eqiad is VRRP master
  • Disable interfaces from cr1-eqiad to asw-a (see the sketch after this list)
  • Move cr1 router uplinks from asw-a to asw2-a (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/44 -> cr1-eqiad:xe-3/0/0
xe-2/0/45 -> cr1-eqiad:xe-3/1/0
xe-7/0/44 -> cr1-eqiad:xe-4/0/0
xe-7/0/45 -> cr1-eqiad:xe-4/1/0
  • Connect asw2-a with asw-a with 4x10G (and document cable IDs if different) [Chris]
xe-2/0/42 -> asw-a-eqiad:xe-8/1/0
xe-7/0/42 -> asw-a-eqiad:xe-2/1/0
xe-2/0/43 -> asw-a-eqiad:xe-1/1/0
xe-7/0/43 -> asw-a-eqiad:xe-7/0/0
  • Verify traffic is properly flowing through asw2-a
  • Update interfaces descriptions on cr1
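
A minimal JunOS sketch of what the cr1 side of this block could look like, using the cr1 port names listed above; disabling the four member ports individually is just one option (disabling the aggregate bundle would also work), and the new descriptions are illustrative:

# cr1-eqiad, before the physical move:
set interfaces xe-3/0/0 disable
set interfaces xe-3/1/0 disable
set interfaces xe-4/0/0 disable
set interfaces xe-4/1/0 disable
commit
# after re-cabling to asw2-a, re-enable the ports and refresh the descriptions:
delete interfaces xe-3/0/0 disable
delete interfaces xe-3/1/0 disable
delete interfaces xe-4/0/0 disable
delete interfaces xe-4/1/0 disable
set interfaces xe-3/0/0 description "asw2-a-eqiad:xe-2/0/44"
set interfaces xe-3/1/0 description "asw2-a-eqiad:xe-2/0/45"
set interfaces xe-4/0/0 description "asw2-a-eqiad:xe-7/0/44"
set interfaces xe-4/1/0 description "asw2-a-eqiad:xe-7/0/45"
commit
# quick check that traffic is flowing through the new uplinks:
show interfaces xe-3/0/0 | match rate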

____

  • Configure switch ports to match asw-a, plus the login announcement (see the sketch after this list) [Arzhel]
  • Solve snowflakes [Chris/Arzhel]
hostname       old port   new port
labstore1006   xe-4/1/0   xe-4/0/35
cp1075         xe-4/1/2   xe-4/0/36
  • Pre-populate FPC2, FPC4 and FPC7 (QFX) with copper SFPs matching the current production servers in racks 2, 4 and 7 [Chris]
ge-2/0/1        up    up   es1011
ge-2/0/2        up    up   es1012
ge-2/0/3        up    up   ms-be1019
ge-2/0/4        up    up   db1074
ge-2/0/5        up    up   db1075
ge-2/0/6        up    up   db1079
ge-2/0/12       up    up   kafka-jumbo1002
ge-2/0/13       up    up   tungsten
ge-2/0/16       up    up   db1080
ge-2/0/17       up    up   analytics1012 no-bw-mon
ge-2/0/18       up    up   analytics1013 no-bw-mon
ge-2/0/19       up    up   notebook1002
ge-2/0/20       up    up   conf1001
ge-2/0/21       up    up   db1081
ge-2/0/22       up    up   db1082
ge-2/0/23       up    up   db1107
ge-4/0/0        up    up   druid1001
ge-4/0/1        up    up   aqs1004
ge-4/0/2        up    up   scb1001
ge-4/0/3        up    up   logstash1004
ge-4/0/4        up    up   kubestage1001
ge-4/0/5        up    up   snapshot1005
ge-4/0/6        up    up   analytics1070
ge-4/0/8        up    up   oxygen
ge-4/0/9        up    up   maps1001
ge-4/0/12       up    up   holmium
ge-4/0/15       up    up   conf1004
ge-4/0/16       up    up   db1111
ge-4/0/17       up    up   rhenium
ge-4/0/19       up    up   lvs1001
ge-4/0/20       up    up   lvs1002
ge-4/0/21       up    up   lvs1003
ge-4/0/22       up    up   netmon1002
ge-4/0/25       up    up   kafka1001
ge-4/0/26       up    up   contint1001
ge-4/0/27       up    up   ganeti1005
ge-4/0/30       up    up   restbase1007
ge-4/0/31       up    up   stat1004
ge-4/0/32       up    up   oresrdb1002
ge-4/0/34       up    up   wdqs1003
ge-4/0/43       up    up   rdb1003
ge-7/0/6        up    up   mw1267
ge-7/0/7        up    up   mw1268
ge-7/0/8        up    up   mw1269
ge-7/0/9        up    up   mw1270
ge-7/0/10       up    up   mw1271
ge-7/0/11       up    up   mw1272
ge-7/0/12       up    up   mw1273
ge-7/0/13       up    up   mw1274
ge-7/0/14       up    up   mw1275
ge-7/0/15       up    up   mw1276
ge-7/0/16       up    up   mw1277
ge-7/0/17       up    up   mw1278
ge-7/0/18       up    up   mw1279       
ge-7/0/19       up    up   mw1280
ge-7/0/20       up    up   mw1281
ge-7/0/21       up    up   mw1282
ge-7/0/22       up    up   mw1283
  • Ping service owners 30min before moving their link (see below) [Arzhel]
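
As a rough illustration of the per-port work referenced above, a sketch for a single port plus the login announcement; the ELS-style syntax, the VLAN name and the announcement text are assumptions rather than the actual configuration:

set interfaces ge-2/0/1 description es1011
set interfaces ge-2/0/1 unit 0 family ethernet-switching interface-mode access
set interfaces ge-2/0/1 unit 0 family ethernet-switching vlan members private1-a-eqiad
set system login announcement "asw2-a-eqiad (row A) - see T187960 before making changes"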

In maintenance window - 3h - Tuesday 19th 14:00 UTC - https://everytimezone.com/s/de2dcd7c

  • Move servers from asw-a to asw2-a [Chris]
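
For each server move, a quick switch-side check that the port comes back up on asw2-a and that the expected host is behind it; ge-2/0/1/es1011 is just an example taken from the lists below, and the LLDP check only works if the host runs an LLDP agent:

show interfaces ge-2/0/1 terse
# expect up/up shortly after the cable is moved
show lldp neighbors interface ge-2/0/1
# optional: confirms which host is actually plugged into the port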

No special needs, or easy depool (one at a time)

A1:
ge-1/0/6  - wdqs1006
ge-1/0/7  - kafka-jumbo1001 - ping Luca before moving host
ge-1/0/8  - dns1001
ge-1/0/9  - labsdb1009 no special need

A2:
ge-2/0/1  - es1011
ge-2/0/2  - es1012
ge-2/0/3  - ms-be1019
ge-2/0/4  - db1074
ge-2/0/5  - db1075
ge-2/0/6  - db1079
ge-2/0/12 - kafka-jumbo1002 - ping Luca before moving host, wait for 1001 to be fully up
ge-2/0/16 - db1080
ge-2/0/17 - analytics1012
ge-2/0/18 - analytics1013
ge-2/0/19 - notebook1002
ge-2/0/20 - conf1001 - spare - do at any point in time
ge-2/0/21 - db1081
ge-2/0/22 - db1082

A3:
ge-3/0/3  - dbproxy1001 spare
ge-3/0/4  - dbproxy1002 spare
ge-3/0/5  - prometheus1003 - ok to have a network blip of a few seconds, depool optional
ge-3/0/7  - cp1008 - depool first
ge-3/0/8  - elastic1032
ge-3/0/9  - elastic1033
ge-3/0/10 - elastic1034
ge-3/0/11 - elastic1035
ge-3/0/12 - dbproxy1003 spare
ge-3/0/13 - db1103
ge-3/0/14 - relforge1001
ge-3/0/16 - restbase1010 1G - pybal depool
ge-3/0/17 - restbase1011 1G - pybal depool
ge-3/0/20 - kubernetes1001
ge-3/0/21 - restbase1016 eth0 - pybal depool
ge-3/0/22 - restbase1016 eth1 - should be down, remove cable
ge-3/0/23 - restbase1016 eth2 - should be down, remove cable
ge-3/0/24 - elastic1030
ge-3/0/25 - elastic1031
ge-3/0/26 - analytics1052
ge-3/0/27 - analytics1053
ge-3/0/28 - analytics1054
ge-3/0/29 - analytics1055
ge-3/0/30 - analytics1056
ge-3/0/31 - analytics1057
ge-3/0/32 - cloudservices1004 - Ping WMCS team before
ge-3/0/33 - analytics1059
ge-3/0/34 - analytics1060


A4:
ge-4/0/0  - druid1001
ge-4/0/1  - aqs1004 - ensure aqs1007 is up
ge-4/0/2  - scb1001 - poweroff gracefully to drain traffic, can do at any time
ge-4/0/3  - logstash1004 - spare, can be done at any time
ge-4/0/4  - kubestage1001
ge-4/0/6  - analytics1070
ge-4/0/8  - oxygen - spare - do at any point in time
ge-4/0/9  - maps1001
ge-4/0/12 - labservices1002 - Ping WMCS team
ge-4/0/15 - conf1004 - etcd/zookeeper, make sure a service ops person is around
ge-4/0/16 - db1111 test host
ge-4/0/17 - rhenium
ge-4/0/21 - lvs1003
ge-4/0/22 - netmon1002
ge-4/0/25 - kafka1001 - Ping Luca to stop Kafka
ge-4/0/30 - restbase1007 - pybal depool
ge-4/0/31 - stat1004 - Analytics
ge-4/0/32 - oresrdb1002 - Fine to do at any time
ge-4/0/34 - wdqs1003
xe-4/1/2  - cp1075 - depool first

A6:
ge-6/0/0  - db1096
ge-6/0/2  - mw1307
ge-6/0/3  - mw1308
ge-6/0/4  - mw1309
ge-6/0/5  - mw1310
ge-6/0/6  - mw1311
ge-6/0/7  - mw1312
ge-6/0/8  - labcontrol1003 - Ping WMCS team
ge-6/0/9  - restbase-dev1004
ge-6/0/10 - ores1001
ge-6/0/11 - db1116 - backups host
ge-6/0/12 - druid1004
ge-6/0/13 - labmon1002 - Ping WMCS team
ge-6/0/14 - db1115 tendril master (no special need)
ge-6/0/18 - weblog1001 - make sure a serviceops person is around
ge-6/0/23 - elastic1048
ge-6/0/24 - aqs1007 - ensure aqs1004 is up
ge-6/0/25 - mc1019 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/26 - mc1020 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/27 - mc1021 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/28 - mc1022 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/29 - mc1023 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/30 - wtp1025
ge-6/0/31 - wtp1026
ge-6/0/32 - dbproxy1013 - spare
ge-6/0/34 - elastic1044
ge-6/0/35 - elastic1045
ge-6/0/36 - wtp1027
ge-6/0/37 - wdqs1004

A7:
All mw*, one at a time

A8:
ge-8/0/2  - db1118
ge-8/0/3  - torrelay1001
ge-8/0/4  - bohrium eth0 - status planned
ge-8/0/5  - bohrium eth1 - status planned
ge-8/0/8  - labstore1003 - Sync up with Brooke (needs Icinga downtime)
ge-8/0/10 - Core: mr1-eqiad:ge-0/0/1
ge-8/0/13 - helium - fine to do at any point in time, graceful poweroff

Special needs or unsorted

A1:
ge-1/0/5  - db1069 x1 master - set read only (coordinate with @Addshore to set Cognate in read only)
A2:
ge-2/0/13 - tungsten - xhgui:app - test system, check with performance team
ge-2/0/23 - db1107 - Give a 10min heads-up to elukey

A3:
ge-3/0/2  - ganeti1007 - Ensure other hosts are up, drain VMs - will probably need a different timeslot than ganeti1005, ganeti1006
ge-3/0/19 - rdb1005 - ??

A4:
ge-4/0/5  - snapshot1005 - Ariel? 
ge-4/0/19 - lvs1001 - Ensure other hosts are up, disable pybal
ge-4/0/20 - lvs1002 - Ensure other hosts are up, disable pybal
ge-4/0/26 - contint1001 - Ping hashar, need to shutdown CI
ge-4/0/27 - ganeti1005 - Ensure other hosts are up, drain VMs - will probably need a different timeslot than ganeti1004, ganeti1006
ge-4/0/43 - rdb1003 - ??
xe-4/1/0  - labstore1006 - Sync up with Brooke (need service failover)

A6:
ge-6/0/1  - ganeti1006 - Ensure other hosts are up, drain VMs - will probably need a different timeslot than ganeti1004, ganeti1005
ge-6/0/15 - an-master1001 - Ping Luca for failover before
ge-6/0/16 - db1066 - s2 master see T217441
ge-6/0/45 - lvs1004:eth1 - Ensure other hosts are up, disable pybal
ge-6/0/46 - lvs1005:eth1 - Ensure other hosts are up, disable pybal
ge-6/0/47 - lvs1006:eth1 - Ensure other hosts are up, disable pybal
  • Fail over VRRP master to cr1-eqiad and verify status + traffic shift [Arzhel]
cr2
set interfaces ae1 unit 1001 family inet address 208.80.154.3/26 vrrp-group 1 priority 70
set interfaces ae1 unit 1001 family inet6 address 2620:0:861:1:fe00::2/64 vrrp-inet6-group 1 priority 70
set interfaces ae1 unit 1017 family inet address 10.64.0.3/22 vrrp-group 17 priority 70
set interfaces ae1 unit 1017 family inet6 address 2620:0:861:101:fe00::2/64 vrrp-inet6-group 17 priority 70
set interfaces ae1 unit 1030 family inet address 10.64.5.3/24 vrrp-group 30 priority 70
set interfaces ae1 unit 1030 family inet6 address 2620:0:861:104:fe00::2/64 vrrp-inet6-group 30 priority 70
set interfaces ae1 unit 1117 family inet address 10.64.4.3/24 vrrp-group 117 priority 70
set interfaces ae1 unit 1117 family inet6 address 2620:0:861:117:fe00::2/64 vrrp-inet6-group 117 priority 70
On cr1/2:
show vrrp summary -> master/backup
  • Disable cr2-eqiad:ae1 [Arzhel]
  • Move cr2 router uplinks from asw-a to asw2-a (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/46 -> cr2-eqiad:xe-3/0/0
xe-2/0/47 -> cr2-eqiad:xe-3/1/0
xe-7/0/46 -> cr2-eqiad:xe-4/0/0
xe-7/0/47 -> cr2-eqiad:xe-4/1/0
  • Enable cr2-eqiad:ae1 [Arzhel]
  • Move VRRP master back to cr2-eqiad (see the sketch after this list) [Arzhel]
  • Update interfaces descriptions on cr2
  • Verify no more traffic on asw-a<->asw2-a link [Arzhel]
  • Disable asw-a<->asw2-a link [Arzhel]
  • Verify all servers are healthy, monitoring happy [Arzhel]
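
A sketch of how the cr2 and cleanup steps above could look, assuming the VRRP failback is simply the removal of the temporary priority-70 overrides set earlier, and that the asw-a<->asw2-a links are the four xe- ports listed in the first window:

# cr2-eqiad: disable ae1 for the uplink move, then re-enable it afterwards
set interfaces ae1 disable
commit
delete interfaces ae1 disable
commit
# cr2-eqiad: drop the temporary priority overrides so cr2 becomes VRRP master again
delete interfaces ae1 unit 1001 family inet address 208.80.154.3/26 vrrp-group 1 priority
delete interfaces ae1 unit 1001 family inet6 address 2620:0:861:1:fe00::2/64 vrrp-inet6-group 1 priority
delete interfaces ae1 unit 1017 family inet address 10.64.0.3/22 vrrp-group 17 priority
delete interfaces ae1 unit 1017 family inet6 address 2620:0:861:101:fe00::2/64 vrrp-inet6-group 17 priority
delete interfaces ae1 unit 1030 family inet address 10.64.5.3/24 vrrp-group 30 priority
delete interfaces ae1 unit 1030 family inet6 address 2620:0:861:104:fe00::2/64 vrrp-inet6-group 30 priority
delete interfaces ae1 unit 1117 family inet address 10.64.4.3/24 vrrp-group 117 priority
delete interfaces ae1 unit 1117 family inet6 address 2620:0:861:117:fe00::2/64 vrrp-inet6-group 117 priority
commit
show vrrp summary
# asw2-a: confirm the temporary links to asw-a are idle, then disable them
show interfaces xe-2/0/42 | match rate
set interfaces xe-2/0/42 disable
set interfaces xe-7/0/42 disable
set interfaces xe-2/0/43 disable
set interfaces xe-7/0/43 disable
commit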

After maintenance window
T208734

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2019-03-01T00:12:27Z] <XioNoX> pre-configure asw-a3 ports on asw2-a3-eqiad - T187960

Mentioned in SAL (#wikimedia-operations) [2019-03-01T19:17:23Z] <XioNoX> pre-configure asw-a5 ports on asw2-a5-eqiad - T187960

Mentioned in SAL (#wikimedia-operations) [2019-03-01T19:29:01Z] <XioNoX> pre-configure asw-a6 ports on asw2-a6-eqiad - T187960

Mentioned in SAL (#wikimedia-operations) [2019-03-01T19:32:37Z] <XioNoX> pre-configure asw-a7 ports on asw2-a7-eqiad - T187960

Mentioned in SAL (#wikimedia-operations) [2019-03-01T19:40:09Z] <XioNoX> pre-configure asw-a8 ports on asw2-a8-eqiad - T187960

@elukey there is a few seconds of downtime expected for db1107 (event logging master) during this maintenance.

@elukey there is a few seconds of downtime expected for db1107 (event logging master) during this maintenance.

Thanks for the heads up! I only need 10 minutes of warning beforehand so I can stop eventlogging and the related replication/sanitization scripts.

Growth-Team Language-Team Cognate: this network maintenance will affect the x1 master (which unfortunately cannot be put read-only from MediaWiki, so it will be set read_only at the MySQL level). The expected downtime for writes is around 3-5 seconds.

For Cognate this will result in DBReadOnlyErrors for page creations, redirect changes, moves, deletions, in the main namespace on wiktionaries.
We could put all wiktionaries into read only for the duration and avoid the exceptions? But also if it is only going to be 3-5 seconds then maybe the exceptions are fine?

For Cognate this will result in DBReadOnlyErrors for page creations, redirect changes, moves, deletions, in the main namespace on wiktionaries.
We could put all wiktionaries into read only for the duration and avoid the exceptions?

Can that be done on a mediawiki level?

But also if it is only going to be 3-5 seconds then maybe the exceptions are fine?

Last time we did this operation, the s6 master lost only 2 pings, but we should be ready in case it takes longer (just in case). So having a configuration change ready to be merged in case of failure could be helpful.

#reading-infrastructure-team-backlog tagging you here as this affects x1 master (T187960#4997790) which might be something you use, so trying to get your attention here :)

So read-only errors are handled nicely in Cognate: the data will never end up being written, but users won't see errors.
Failures here are not currently logged, so we wouldn't know if it happened.
Cognate actually has $wgCognateReadOnly and then it won't even attempt to write to the db.

If the db is in read only, or Cognate is in read only but writes can still happen to wiktionaries (redirections, creates, deletes, moves), then data could be missed in the Cognate tables, and the PopulateCognatePages maintenance script from the Cognate extension may have to be run in order to bring everything back in line (not that expensive a script to run).

If we wanted to avoid having to run the maint script at all then we could set https://www.mediawiki.org/wiki/Manual:$wgReadOnly on the wiktionaries for the seconds that x1 would be read only?

From where I am sat setting $wgReadOnly for the few seconds would be the best thing for cognate, and would avoid having to run the maintenance scripts.

From where I am sat setting $wgReadOnly for the few seconds would be the best thing for cognate, and would avoid having to run the maintenance scripts.

Thanks for the explanation! If you consider that is the best approach, I am fine with that! Can you be available during the maintenance window to coordinate? :)

From where I am sat setting $wgReadOnly for the few seconds would be the best thing for cognate, and would avoid having to run the maintenance scripts.

Thanks for the explanation! If you consider that is the best approach, I am fine with that! Can you be available during the maintenance window to coordinate? :)

Tuesday the 19th, I sure can, please ping me then :)

akosiaris updated the task description.

I was looking at Special needs or unsorted. @ayounsi I've updated a few, feel free to move them to other sections. Pinging:

for the rest

contint1001 hosts the CI system, so subscribing @thcipriani as well.

It is not clear to me what this operation is about. Is it just about re-cabling the server from one switch to another? If so I would expect the downtime to be rather minimal. Note that iirc contint1001 is limited to 100Mbps, I think on purpose to avoid having CI overflow the network layer somehow. So it would be nice to double check the current network speed and reuse the same when migrating.

Tuesday 19th 14:00 UTC looks fine to me. Will have to announce it ahead of time and gracefully shut down CI; I am not sure how the stack would react to a network interruption.

About Analytics nodes:

  • ge-1/0/7 - kafka-jumbo1001 -> Kafka needs to be stopped ~10/15 minutes beforehand to have a graceful shutdown (if possible)
  • ge-2/0/12 - kafka-jumbo1002 -> same thing, and it is better to have kafka-jumbo1001 up and running (with Kafka partitions sync recovered) before doing maintenance on 1002. We can survive two brokers down but since we control the maintenance I'd prefer not :)

The above can survive a network blip of some seconds, doing it one at a time, so I might be too paranoid :)

  • ge-2/0/17 - analytics1012
  • ge-2/0/18 - analytics1013
  • ge-2/0/19 - notebook1002 -> all these are decommed nodes, should we move them or just unrack them?
  • ge-3/0/26 - analytics1052
  • ge-3/0/27 - analytics1053
  • ge-3/0/28 - analytics1054
  • ge-3/0/29 - analytics1055
  • ge-3/0/30 - analytics1056
  • ge-3/0/31 - analytics1057
  • ge-3/0/33 - analytics1059
  • ge-3/0/34 - analytics1060 -> all hadoop worker nodes, one at a time would be great
  • ge-4/0/0 - druid1001 - depool via pybal --> it is not pooled in pybal (druid100[4-6] are), a network blip of some seconds is fine.
  • ge-4/0/1 - aqs1004 - depool via pybal -> a network blip of some seconds is fine, but not at the same time as aqs1007 please :)
  • ge-4/0/6 - analytics1070 -> hadoop worker node, no special needs
  • ge-4/0/25 - kafka1001 - depool from pybal --> Kafka would need to be gracefully stopped if possible beforehand; this host holds the job queues, I'd be cautious.
  • ge-6/0/12 - druid1004 - depool via pybal -> a network blip of some seconds is fine.
  • ge-6/0/15 - an-master1001 - status planned -> This is the hadoop master node, extremely important, I need to perform a failover beforehand.
  • ge-6/0/24 - aqs1007 - depool from pybal -> a network blip of some seconds is fine, but not at the same time as aqs1004 please :)
ge-6/0/25 - mc1019
ge-6/0/26 - mc1020
ge-6/0/27 - mc1021
ge-6/0/28 - mc1022
ge-6/0/29 - mc1023

The above ones hold the eqiad mediawiki object cache and are extremely important. IIUC this maintenance will not require any host shutdown (which would mean stopping memcached and wiping the cache; please @ayounsi confirm that :), so a network blip, one host at a time, will be fine. Please do it one at a time, checking with me or Joe that nothing is exploding on the mediawiki side while the maintenance is ongoing.

Thank you all for the quick replies!

Is it just about re-cabling the server from one switch to another? If so I would expect the downtime to be rather minimal.

Correct, ~5s if no issue

Note that iirc contint1001 is limited to 100Mbps, I think on purpose to avoid having CI overflow the network layer somehow.

The link is negotiated at 1000Mbps, and there is no rate limiter on the switch side. So unless there is a rate limiter on the server side, this operates at 1G.
Maybe that was true on an older infra, but now the network is able to handle 1G hosts.
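
If useful, the negotiated speed can be double-checked on the switch side; ge-4/0/26 is contint1001's port per the lists above, and the exact output fields vary by JunOS platform:

show interfaces ge-4/0/26 | match Speed
# or, for full autonegotiation details:
show interfaces ge-4/0/26 media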

Will have to announce it ahead of time and gracefully shut down CI; I am not sure how the stack would react to a network interruption.

Better figure it out during a planned maintenance than a 4am outage ;)

IIUC this maintenance will not require any host shutdown [...] (please @ayounsi confirm that :)

Confirmed. Server list updated to include your comments.

#reading-infrastructure-team-backlog tagging you here as this affects x1 master (T187960#4997790) which might be something you use, so trying to get your attention here :)

Thanks for the heads-up! I don't think we care about writes erroring out for a few seconds for reading lists. Pinging @Dbrant, @JoeWalsh just in case.

Change 496720 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Failover db1066 to db1076 on s2

https://gerrit.wikimedia.org/r/496720

Change 496721 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Failover db1066 to db1076 on s2

https://gerrit.wikimedia.org/r/496721

Change 496723 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1120 as x1 master

https://gerrit.wikimedia.org/r/496723

Change 496724 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1120 as x1 master

https://gerrit.wikimedia.org/r/496724

Change 497319 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: failing over the web only for network changes

https://gerrit.wikimedia.org/r/497319

Change 497328 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: reduce TTL on labstore1006 for failover

https://gerrit.wikimedia.org/r/497328

Change 497328 merged by Bstorm:
[operations/dns@master] dumps distribution: reduce TTL on labstore1006 for failover

https://gerrit.wikimedia.org/r/497328

Change 497420 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: fail over dumps to labstore1007

https://gerrit.wikimedia.org/r/497420

Change 497420 merged by Bstorm:
[operations/dns@master] dumps distribution: fail over dumps to labstore1007

https://gerrit.wikimedia.org/r/497420

Change 497319 merged by Bstorm:
[operations/puppet@production] dumps distribution: failing over the web only for network changes

https://gerrit.wikimedia.org/r/497319

Change 497469 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool hosts in row A

https://gerrit.wikimedia.org/r/497469

Change 497472 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set s2 on read only

https://gerrit.wikimedia.org/r/497472

Change 497469 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool hosts in row A

https://gerrit.wikimedia.org/r/497469

Mentioned in SAL (#wikimedia-operations) [2019-03-19T14:44:51Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool databases in row A - T187960 (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2019-03-19T15:12:24Z] <XioNoX> eqiad A7 servers uplink move - T187960

Change 497472 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set s2 on read only

https://gerrit.wikimedia.org/r/497472

Mentioned in SAL (#wikimedia-operations) [2019-03-19T15:21:10Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s2 database master on read only - T187960 (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2019-03-19T15:27:56Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s2 read only OFF - T187960 (duration: 00m 26s)

For the record, read only times on s2:
Read only ON: 15:21:10
Read only OFF: 15:27:56

ayounsi updated the task description.

Everything here is done, thank you all for your help!

Mentioned in SAL (#wikimedia-operations) [2019-03-20T06:09:58Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool databases in row A - T187960 (duration: 00m 49s)

Change 496724 abandoned by Marostegui:
db-eqiad.php: Promote db1120 as x1 master

Reason:
Not needed

https://gerrit.wikimedia.org/r/496724

Change 496723 abandoned by Marostegui:
mariadb: Promote db1120 as x1 master

Reason:
Not needed

https://gerrit.wikimedia.org/r/496723

Change 496721 abandoned by Marostegui:
db-eqiad.php: Failover db1066 to db1076 on s2

Reason:
Not needed

https://gerrit.wikimedia.org/r/496721

Change 496720 abandoned by Marostegui:
mariadb: Failover db1066 to db1076 on s2

Reason:
Not needed

https://gerrit.wikimedia.org/r/496720

labsdb1009.mgmt (note: the management interface) has been down according to Icinga for 14 hours (around the network maintenance), maybe a loose cable or misconfiguration? Not a huge blocker, but better to make sure it is not an intended/known state.

ipmi works from localhost but fails remotely, so we are pretty confident it is a network connectivity issue and not a hw issue.