
Rack/cable/configure asw2-a-eqiad switch stack
Closed, Resolved (Public)

Description

Similar to T148506.

This is about row A only:

  • Rack and cable the switches according to diagram (blocked on T187118) [Chris]
    rows-abc-eqiad-cabling.png (73 KB)
  • Connect mgmt/serial [Chris]
  • Check via serial that switches work and that ports are configured as down [Arzhel]
  • Stack the switches, upgrade JunOS, apply the initial switch configuration (see the sketch after this list) [Arzhel]
  • Add to DNS [Arzhel]
  • Add to LibreNMS & Rancid [Arzhel]
  • Uplink ports configured [Arzhel]
  • Add to Icinga [Arzhel]
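
A rough sketch of the checks the serial/stacking steps imply once the stack is reachable; these are standard JunOS operational commands, and the exact image name and member count are intentionally left out:

show virtual-chassis
# expect all members Prsnt, with one Master and one Backup routing engine
show version
# confirm every member runs the target JunOS release after the upgrade
show interfaces terse | match "ge-|xe-"
# at this point the server-facing ports should still be down/disabled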

Thursday 22nd, noon Eastern (4pm UTC), 3h (for all 3 rows)

  • Verify cr2-eqiad is VRRP master
  • Disable interfaces from cr1-eqiad to asw-a (see the sketch after this list)
  • Move cr1 router uplinks from asw-a to asw2-a (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/44 -> cr1-eqiad:xe-3/0/0
xe-2/0/45 -> cr1-eqiad:xe-3/1/0
xe-7/0/44 -> cr1-eqiad:xe-4/0/0
xe-7/0/45 -> cr1-eqiad:xe-4/1/0
  • Connect asw2-a with asw-a with 4x10G (and document cable IDs if different) [Chris]
xe-2/0/42 -> asw-a-eqiad:xe-8/1/0
xe-7/0/42 -> asw-a-eqiad:xe-2/1/0
xe-2/0/43 -> asw-a-eqiad:xe-1/1/0
xe-7/0/43 -> asw-a-eqiad:xe-7/0/0
  • Verify traffic is properly flowing through asw2-a
  • Update interfaces descriptions on cr1
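
A minimal JunOS sketch of what the cr1 side of this block could look like, using the cr1 port names listed above; disabling the four member ports individually is just one option (disabling the aggregate bundle would also work), and the new descriptions are illustrative:

# cr1-eqiad, before the physical move:
set interfaces xe-3/0/0 disable
set interfaces xe-3/1/0 disable
set interfaces xe-4/0/0 disable
set interfaces xe-4/1/0 disable
commit
# after re-cabling to asw2-a, re-enable the ports and refresh the descriptions:
delete interfaces xe-3/0/0 disable
delete interfaces xe-3/1/0 disable
delete interfaces xe-4/0/0 disable
delete interfaces xe-4/1/0 disable
set interfaces xe-3/0/0 description "asw2-a-eqiad:xe-2/0/44"
set interfaces xe-3/1/0 description "asw2-a-eqiad:xe-2/0/45"
set interfaces xe-4/0/0 description "asw2-a-eqiad:xe-7/0/44"
set interfaces xe-4/1/0 description "asw2-a-eqiad:xe-7/0/45"
commit
# quick check that traffic is flowing through the new uplinks:
show interfaces xe-3/0/0 | match rate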

____

  • Configure switch ports to match asw-a, plus the login announcement (see the sketch after this list) [Arzhel]
  • Solve snowflakes [Chris/Arzhel]
hostname       old port   new port
labstore1006   xe-4/1/0   xe-4/0/35
cp1075         xe-4/1/2   xe-4/0/36
  • Pre-populate FPC2, FPC4 and FPC7 (QFX) with copper SFPs matching the current production servers in racks 2, 4 and 7 [Chris]
ge-2/0/1        up    up   es1011
ge-2/0/2        up    up   es1012
ge-2/0/3        up    up   ms-be1019
ge-2/0/4        up    up   db1074
ge-2/0/5        up    up   db1075
ge-2/0/6        up    up   db1079
ge-2/0/12       up    up   kafka-jumbo1002
ge-2/0/13       up    up   tungsten
ge-2/0/16       up    up   db1080
ge-2/0/17       up    up   analytics1012 no-bw-mon
ge-2/0/18       up    up   analytics1013 no-bw-mon
ge-2/0/19       up    up   notebook1002
ge-2/0/20       up    up   conf1001
ge-2/0/21       up    up   db1081
ge-2/0/22       up    up   db1082
ge-2/0/23       up    up   db1107
ge-4/0/0        up    up   druid1001
ge-4/0/1        up    up   aqs1004
ge-4/0/2        up    up   scb1001
ge-4/0/3        up    up   logstash1004
ge-4/0/4        up    up   kubestage1001
ge-4/0/5        up    up   snapshot1005
ge-4/0/6        up    up   analytics1070
ge-4/0/8        up    up   oxygen
ge-4/0/9        up    up   maps1001
ge-4/0/12       up    up   holmium
ge-4/0/15       up    up   conf1004
ge-4/0/16       up    up   db1111
ge-4/0/17       up    up   rhenium
ge-4/0/19       up    up   lvs1001
ge-4/0/20       up    up   lvs1002
ge-4/0/21       up    up   lvs1003
ge-4/0/22       up    up   netmon1002
ge-4/0/25       up    up   kafka1001
ge-4/0/26       up    up   contint1001
ge-4/0/27       up    up   ganeti1005
ge-4/0/30       up    up   restbase1007
ge-4/0/31       up    up   stat1004
ge-4/0/32       up    up   oresrdb1002
ge-4/0/34       up    up   wdqs1003
ge-4/0/43       up    up   rdb1003
ge-7/0/6        up    up   mw1267
ge-7/0/7        up    up   mw1268
ge-7/0/8        up    up   mw1269
ge-7/0/9        up    up   mw1270
ge-7/0/10       up    up   mw1271
ge-7/0/11       up    up   mw1272
ge-7/0/12       up    up   mw1273
ge-7/0/13       up    up   mw1274
ge-7/0/14       up    up   mw1275
ge-7/0/15       up    up   mw1276
ge-7/0/16       up    up   mw1277
ge-7/0/17       up    up   mw1278
ge-7/0/18       up    up   mw1279       
ge-7/0/19       up    up   mw1280
ge-7/0/20       up    up   mw1281
ge-7/0/21       up    up   mw1282
ge-7/0/22       up    up   mw1283
  • Ping service owners 30min before moving their link (see below) [Arzhel]
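
As a rough illustration of the per-port work referenced above, a sketch for a single port plus the login announcement; the ELS-style syntax, the VLAN name and the announcement text are assumptions rather than the actual configuration:

set interfaces ge-2/0/1 description es1011
set interfaces ge-2/0/1 unit 0 family ethernet-switching interface-mode access
set interfaces ge-2/0/1 unit 0 family ethernet-switching vlan members private1-a-eqiad
set system login announcement "asw2-a-eqiad (row A) - see T187960 before making changes"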

In maintenance window - 3h - Tuesday 19th 14:00 UTC - https://everytimezone.com/s/de2dcd7c

  • Move servers from asw-a to asw2-a [Chris]
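
For each server move, a quick switch-side check that the port comes back up on asw2-a and that the expected host is behind it; ge-2/0/1/es1011 is just an example taken from the lists below, and the LLDP check only works if the host runs an LLDP agent:

show interfaces ge-2/0/1 terse
# expect up/up shortly after the cable is moved
show lldp neighbors interface ge-2/0/1
# optional: confirms which host is actually plugged into the port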

No special needs, or easy depool (one at a time)

A1:
ge-1/0/6  - wdqs1006
ge-1/0/7  - kafka-jumbo1001 - ping Luca before moving host
ge-1/0/8  - dns1001
ge-1/0/9  - labsdb1009 no special need

A2:
ge-2/0/1  - es1011
ge-2/0/2  - es1012
ge-2/0/3  - ms-be1019
ge-2/0/4  - db1074
ge-2/0/5  - db1075
ge-2/0/6  - db1079
ge-2/0/12 - kafka-jumbo1002 - ping Luca before moving host, wait for 1001 to be fully up
ge-2/0/16 - db1080
ge-2/0/17 - analytics1012
ge-2/0/18 - analytics1013
ge-2/0/19 - notebook1002
ge-2/0/20 - conf1001 - spare - do at any point in time
ge-2/0/21 - db1081
ge-2/0/22 - db1082

A3:
ge-3/0/3  - dbproxy1001 spare
ge-3/0/4  - dbproxy1002 spare
ge-3/0/5  - prometheus1003 - ok to have a network blip of a few seconds, depool optional
ge-3/0/7  - cp1008 - depool first
ge-3/0/8  - elastic1032
ge-3/0/9  - elastic1033
ge-3/0/10 - elastic1034
ge-3/0/11 - elastic1035
ge-3/0/12 - dbproxy1003 spare
ge-3/0/13 - db1103
ge-3/0/14 - relforge1001
ge-3/0/16 - restbase1010 1G - pybal depool
ge-3/0/17 - restbase1011 1G - pybal depool
ge-3/0/20 - kubernetes1001
ge-3/0/21 - restbase1016 eth0 - pybal depool
ge-3/0/22 - restbase1016 eth1 - should be down, remove cable
ge-3/0/23 - restbase1016 eth2 - should be down, remove cable
ge-3/0/24 - elastic1030
ge-3/0/25 - elastic1031
ge-3/0/26 - analytics1052
ge-3/0/27 - analytics1053
ge-3/0/28 - analytics1054
ge-3/0/29 - analytics1055
ge-3/0/30 - analytics1056
ge-3/0/31 - analytics1057
ge-3/0/32 - cloudservices1004 - Ping WMCS team before
ge-3/0/33 - analytics1059
ge-3/0/34 - analytics1060


A4:
ge-4/0/0  - druid1001
ge-4/0/1  - aqs1004 - ensure aqs1007 is up
ge-4/0/2  - scb1001 - poweroff gracefully to drain traffic, can do at any time
ge-4/0/3  - logstash1004 - spare, can be done at any time
ge-4/0/4  - kubestage1001
ge-4/0/6  - analytics1070
ge-4/0/8  - oxygen - spare - do at any point in time
ge-4/0/9  - maps1001
ge-4/0/12 - labservices1002 - Ping WMCS team
ge-4/0/15 - conf1004 - etcd/zookeeper, make sure a service ops person is around
ge-4/0/16 - db1111 test host
ge-4/0/17 - rhenium
ge-4/0/21 - lvs1003
ge-4/0/22 - netmon1002
ge-4/0/25 - kafka1001 - Ping Luca to stop Kafka
ge-4/0/30 - restbase1007 - pybal depool
ge-4/0/31 - stat1004 - Analytics
ge-4/0/32 - oresrdb1002 - Fine to do at any time
ge-4/0/34 - wdqs1003
xe-4/1/2  - cp1075 - depool first

A6:
ge-6/0/0  - db1096
ge-6/0/2  - mw1307
ge-6/0/3  - mw1308
ge-6/0/4  - mw1309
ge-6/0/5  - mw1310
ge-6/0/6  - mw1311
ge-6/0/7  - mw1312
ge-6/0/8  - labcontrol1003 - Ping WMCS team
ge-6/0/9  - restbase-dev1004
ge-6/0/10 - ores1001
ge-6/0/11 - db1116 - backups host
ge-6/0/12 - druid1004
ge-6/0/13 - labmon1002 - Ping WMCS team
ge-6/0/14 - db1115 tendril master (no special need)
ge-6/0/18 - weblog1001 - make sure a serviceops person is around
ge-6/0/23 - elastic1048
ge-6/0/24 - aqs1007 - ensure aqs1004 is up
ge-6/0/25 - mc1019 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/26 - mc1020 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/27 - mc1021 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/28 - mc1022 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/29 - mc1023 - One at a time check with Luca or Joe before proceeding to next
ge-6/0/30 - wtp1025
ge-6/0/31 - wtp1026
ge-6/0/32 - dbproxy1013 - spare
ge-6/0/34 - elastic1044
ge-6/0/35 - elastic1045
ge-6/0/36 - wtp1027
ge-6/0/37 - wdqs1004

A7:
All mw*, one at a time

A8:
ge-8/0/2  - db1118
ge-8/0/3  - torrelay1001
ge-8/0/4  - bohrium eth0 - status planned
ge-8/0/5  - bohrium eth1 - status planned
ge-8/0/8  - labstore1003 - Sync up with Brooke (needs Icinga downtime)
ge-8/0/10 - Core: mr1-eqiad:ge-0/0/1
ge-8/0/13 - helium - fine to do at any point in time, graceful poweroff

Special needs or unsorted

A1:
ge-1/0/5  - db1069 x1 master - set read only (coordinate with @Addshore to set Cognate in read only)
A2:
ge-2/0/13 - tungsten - xhgui:app - test system, check with performance team
ge-2/0/23 - db1107 - Give a 10min heads-up to elukey

A3:
ge-3/0/2  - ganeti1007 - Ensure other hosts are up, drain VMs - will probably need a different timeslot than ganeti1005, ganeti1006
ge-3/0/19 - rdb1005 - ??

A4:
ge-4/0/5  - snapshot1005 - Ariel? 
ge-4/0/19 - lvs1001 - Ensure other hosts are up, disable pybal
ge-4/0/20 - lvs1002 - Ensure other hosts are up, disable pybal
ge-4/0/26 - contint1001 - Ping hashar, need to shutdown CI
ge-4/0/27 - ganeti1005 - Ensure other hosts are up, drain VMs - will probably need a different timeslot than ganeti1004, ganeti1006
ge-4/0/43 - rdb1003 - ??
xe-4/1/0  - labstore1006 - Sync up with Brooke (need service failover)

A6:
ge-6/0/1  - ganeti1006 - Ensure other hosts are up, drain VMs - will probably need a different timeslot than ganeti1004, ganeti1005
ge-6/0/15 - an-master1001 - Ping Luca for failover before
ge-6/0/16 - db1066 - s2 master see T217441
ge-6/0/45 - lvs1004:eth1 - Ensure other hosts are up, disable pybal
ge-6/0/46 - lvs1005:eth1 - Ensure other hosts are up, disable pybal
ge-6/0/47 - lvs1006:eth1 - Ensure other hosts are up, disable pybal
  • Fail over VRRP master to cr1-eqiad and verify status + traffic shift [Arzhel]
cr2
set interfaces ae1 unit 1001 family inet address 208.80.154.3/26 vrrp-group 1 priority 70
set interfaces ae1 unit 1001 family inet6 address 2620:0:861:1:fe00::2/64 vrrp-inet6-group 1 priority 70
set interfaces ae1 unit 1017 family inet address 10.64.0.3/22 vrrp-group 17 priority 70
set interfaces ae1 unit 1017 family inet6 address 2620:0:861:101:fe00::2/64 vrrp-inet6-group 17 priority 70
set interfaces ae1 unit 1030 family inet address 10.64.5.3/24 vrrp-group 30 priority 70
set interfaces ae1 unit 1030 family inet6 address 2620:0:861:104:fe00::2/64 vrrp-inet6-group 30 priority 70
set interfaces ae1 unit 1117 family inet address 10.64.4.3/24 vrrp-group 117 priority 70
set interfaces ae1 unit 1117 family inet6 address 2620:0:861:117:fe00::2/64 vrrp-inet6-group 117 priority 70
On cr1/2:
show vrrp summary -> master/backup
  • Disable cr2-eqiad:ae1 [Arzhel]
  • Move cr2 router uplinks from asw-a to asw2-a (and document cable IDs if different) [Chris/Arzhel]
xe-2/0/46 -> cr2-eqiad:xe-3/0/0
xe-2/0/47 -> cr2-eqiad:xe-3/1/0
xe-7/0/46 -> cr2-eqiad:xe-4/0/0
xe-7/0/47 -> cr2-eqiad:xe-4/1/0
  • Enable cr2-eqiad:ae1 [Arzhel]
  • Move VRRP master back to cr2-eqiad (see the sketch after this list) [Arzhel]
  • Update interfaces descriptions on cr2
  • Verify no more traffic on asw-a<->asw2-a link [Arzhel]
  • Disable asw-a<->asw2-a link [Arzhel]
  • Verify all servers are healthy, monitoring happy [Arzhel]
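
A sketch of how the cr2 and cleanup steps above could look, assuming the VRRP failback is simply the removal of the temporary priority-70 overrides set earlier, and that the asw-a<->asw2-a links are the four xe- ports listed in the first window:

# cr2-eqiad: disable ae1 for the uplink move, then re-enable it afterwards
set interfaces ae1 disable
commit
delete interfaces ae1 disable
commit
# cr2-eqiad: drop the temporary priority overrides so cr2 becomes VRRP master again
delete interfaces ae1 unit 1001 family inet address 208.80.154.3/26 vrrp-group 1 priority
delete interfaces ae1 unit 1001 family inet6 address 2620:0:861:1:fe00::2/64 vrrp-inet6-group 1 priority
delete interfaces ae1 unit 1017 family inet address 10.64.0.3/22 vrrp-group 17 priority
delete interfaces ae1 unit 1017 family inet6 address 2620:0:861:101:fe00::2/64 vrrp-inet6-group 17 priority
delete interfaces ae1 unit 1030 family inet address 10.64.5.3/24 vrrp-group 30 priority
delete interfaces ae1 unit 1030 family inet6 address 2620:0:861:104:fe00::2/64 vrrp-inet6-group 30 priority
delete interfaces ae1 unit 1117 family inet address 10.64.4.3/24 vrrp-group 117 priority
delete interfaces ae1 unit 1117 family inet6 address 2620:0:861:117:fe00::2/64 vrrp-inet6-group 117 priority
commit
show vrrp summary
# asw2-a: confirm the temporary links to asw-a are idle, then disable them
show interfaces xe-2/0/42 | match rate
set interfaces xe-2/0/42 disable
set interfaces xe-7/0/42 disable
set interfaces xe-2/0/43 disable
set interfaces xe-7/0/43 disable
commit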

After maintenance window
T208734

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2019-03-01T00:12:27Z] <XioNoX> pre-configure asw-a3 ports on asw2-a3-eqiad - T187960

Mentioned in SAL (#wikimedia-operations) [2019-03-01T19:17:23Z] <XioNoX> pre-configure asw-a5 ports on asw2-a5-eqiad - T187960

Mentioned in SAL (#wikimedia-operations) [2019-03-01T19:29:01Z] <XioNoX> pre-configure asw-a6 ports on asw2-a6-eqiad - T187960

Mentioned in SAL (#wikimedia-operations) [2019-03-01T19:32:37Z] <XioNoX> pre-configure asw-a7 ports on asw2-a7-eqiad - T187960

Mentioned in SAL (#wikimedia-operations) [2019-03-01T19:40:09Z] <XioNoX> pre-configure asw-a8 ports on asw2-a8-eqiad - T187960

@elukey there is a few seconds of downtime expected for db1107 (event logging master) during this maintenance.

@elukey there is a few seconds of downtime expected for db1107 (event logging master) during this maintenance.

Thanks for the heads up! I only need 10 minutes of warning beforehand so I can stop eventlogging and the related replication/sanitization scripts.

Growth-Team Language-Team Cognate: this network maintenance will affect the x1 master (which unfortunately cannot be put read-only from MediaWiki, so it will be set read_only at the MySQL level). The expected downtime for writes is around 3-5 seconds.

For Cognate this will result in DBReadOnlyErrors for page creations, redirect changes, moves, deletions, in the main namespace on wiktionaries.
We could put all wiktionaries into read only for the duration and avoid the exceptions? But also if it is only going to be 3-5 seconds then maybe the exceptions are fine?

For Cognate this will result in DBReadOnlyErrors for page creations, redirect changes, moves, deletions, in the main namespace on wiktionaries.
We could put all wiktionaries into read only for the duration and avoid the exceptions?

Can that be done on a mediawiki level?

But also if it is only going to be 3-5 seconds then maybe the exceptions are fine?

Last time we did this operation, the s6 master lost only 2 pings, but we should be ready in case it takes longer (just in case). So having a configuration change ready to be merged in case of failure could be helpful.

#reading-infrastructure-team-backlog tagging you here as this affects x1 master (T187960#4997790) which might be something you use, so trying to get your attention here :)

So read-only errors are handled nicely in Cognate: the data will never end up being written, but users won't see errors.
Failures here are not currently logged, so we wouldn't know if it happened.
Cognate actually has $wgCognateReadOnly and then it won't even attempt to write to the db.

If the db is in read only, or Cognate is in read only but writes can still happen to wiktionaries (redirections, creates, deletes, moves), then data could be missed in the Cognate tables, and the PopulateCognatePages maintenance script from the Cognate extension may have to be run in order to bring everything back in line (not that expensive a script to run).

If we wanted to avoid having to run the maint script at all then we could set https://www.mediawiki.org/wiki/Manual:$wgReadOnly on the wiktionaries for the seconds that x1 would be read only?

From where I am sat setting $wgReadOnly for the few seconds would be the best thing for cognate, and would avoid having to run the maintenance scripts.

From where I am sat setting $wgReadOnly for the few seconds would be the best thing for cognate, and would avoid having to run the maintenance scripts.

Thanks for the explanation! If you consider that is the best approach, I am fine with that! Can you be available during the maintenance window to coordinate? :)

From where I am sat setting $wgReadOnly for the few seconds would be the best thing for cognate, and would avoid having to run the maintenance scripts.

Thanks for the explanation! If you consider that is the best approach, I am fine with that! Can you be available during the maintenance window to coordinate? :)

Tuesday the 19th, I sure can, please ping me then :)

akosiaris updated the task description.

I was looking at Special needs or unsorted. @ayounsi I've updated a few, feel free to move them to other sections. Pinging:

for the rest

contint1001 hosts the CI system, so subscribing @thcipriani as well.

It is not clear to me what this operation is about. Is it just about re-cabling the server from one switch to another? If so I would expect the downtime to be rather minimal. Note that iirc contint1001 is limited to 100Mbps, I think on purpose to avoid having CI overflow the network layer somehow. So it would be nice to double check the current network speed and reuse the same when migrating.

Tuesday 19th 14:00 UTC looks fine to me. Will have to announce it ahead of time and gracefully shut down CI; I am not sure how the stack would react to a network interruption.

About Analytics nodes:

  • ge-1/0/7 - kafka-jumbo1001 -> Kafka needs to be stopped ~10/15 minutes beforehand to have a graceful shutdown (if possible)
  • ge-2/0/12 - kafka-jumbo1002 -> same thing, and it is better to have kafka-jumbo1001 up and running (with Kafka partitions sync recovered) before doing maintenance on 1002. We can survive two brokers down but since we control the maintenance I'd prefer not :)

The above can survive a network blip of some seconds, doing it one at a time, so I might be too paranoid :)

  • ge-2/0/17 - analytics1012
  • ge-2/0/18 - analytics1013
  • ge-2/0/19 - notebook1002 -> all these are decommed nodes, should we move them or just unrack them?
  • ge-3/0/26 - analytics1052
  • ge-3/0/27 - analytics1053
  • ge-3/0/28 - analytics1054
  • ge-3/0/29 - analytics1055
  • ge-3/0/30 - analytics1056
  • ge-3/0/31 - analytics1057
  • ge-3/0/33 - analytics1059
  • ge-3/0/34 - analytics1060 -> all hadoop worker nodes, one at a time would be great
  • ge-4/0/0 - druid1001 - depool via pybal --> it is not pooled in pybal (druid100[4-6] are), a network blip of some seconds is fine.
  • ge-4/0/1 - aqs1004 - depool via pybal -> a network blip of some seconds is fine, but not at the same time as aqs1007 please :)
  • ge-4/0/6 - analytics1070 -> hadoop worker node, no special needs
  • ge-4/0/25 - kafka1001 - depool from pybal --> Kafka would need to be gracefully stopped if possible beforehand; this host holds the job queues, I'd be cautious.
  • ge-6/0/12 - druid1004 - depool via pybal -> a network blip of some seconds is fine.
  • ge-6/0/15 - an-master1001 - status planned -> This is the hadoop master node, extremely important, I need to perform a failover beforehand.
  • ge-6/0/24 - aqs1007 - depool from pybal -> a network blip of some seconds is fine, but not at the same time as aqs1004 please :)
ge-6/0/25 - mc1019
ge-6/0/26 - mc1020
ge-6/0/27 - mc1021
ge-6/0/28 - mc1022
ge-6/0/29 - mc1023

The above ones hold the eqiad mediawiki object cache and are extremely important. IIUC this maintenance will not require any host shutdown (which would mean stopping memcached and wiping the cache; please @ayounsi confirm that :), so a network blip, one host at a time, will be fine. Please do it one at a time, checking with me or Joe that nothing is exploding on the mediawiki side while the maintenance is ongoing.

Thank you all for the quick replies!

Is it just about re-cabling the server from one switch to another? If so I would expect the downtime to be rather minimal.

Correct, ~5s if no issue

Note that iirc contint1001 is limited to 100Mbps, I think on purpose to avoid having CI overflow the network layer somehow.

The link is negotiated at 1000Mbps, and there is no rate limiter on the switch side. So unless there is a rate limiter on the server side, this operates at 1G.
Maybe that was true on an older infra, but now the network is able to handle 1G hosts.
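
If useful, the negotiated speed can be double-checked on the switch side; ge-4/0/26 is contint1001's port per the lists above, and the exact output fields vary by JunOS platform:

show interfaces ge-4/0/26 | match Speed
# or, for full autonegotiation details:
show interfaces ge-4/0/26 media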

Will have to announce it ahead of time and gracefully shut down CI; I am not sure how the stack would react to a network interruption.

Better figure it out during a planned maintenance than a 4am outage ;)

IIUC this maintenance will not require any host shutdown [...] (please @ayounsi confirm that :)

Confirmed. Server list updated to include your comments.

#reading-infrastructure-team-backlog tagging you here as this affects x1 master (T187960#4997790) which might be something you use, so trying to get your attention here :)

Thanks for the heads-up! I don't think we care about writes erroring out for a few seconds for reading lists. Pinging @Dbrant, @JoeWalsh just in case.

Change 496720 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Failover db1066 to db1076 on s2

https://gerrit.wikimedia.org/r/496720

Change 496721 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Failover db1066 to db1076 on s2

https://gerrit.wikimedia.org/r/496721

Change 496723 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1120 as x1 master

https://gerrit.wikimedia.org/r/496723

Change 496724 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1120 as x1 master

https://gerrit.wikimedia.org/r/496724

Change 497319 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: failing over the web only for network changes

https://gerrit.wikimedia.org/r/497319

Change 497328 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: reduce TTL on labstore1006 for failover

https://gerrit.wikimedia.org/r/497328

Change 497328 merged by Bstorm:
[operations/dns@master] dumps distribution: reduce TTL on labstore1006 for failover

https://gerrit.wikimedia.org/r/497328

Change 497420 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: fail over dumps to labstore1007

https://gerrit.wikimedia.org/r/497420

Change 497420 merged by Bstorm:
[operations/dns@master] dumps distribution: fail over dumps to labstore1007

https://gerrit.wikimedia.org/r/497420

Change 497319 merged by Bstorm:
[operations/puppet@production] dumps distribution: failing over the web only for network changes

https://gerrit.wikimedia.org/r/497319

Change 497469 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool hosts in row A

https://gerrit.wikimedia.org/r/497469

Change 497472 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Set s2 on read only

https://gerrit.wikimedia.org/r/497472

Change 497469 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool hosts in row A

https://gerrit.wikimedia.org/r/497469

Mentioned in SAL (#wikimedia-operations) [2019-03-19T14:44:51Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool databases in row A - T187960 (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2019-03-19T15:12:24Z] <XioNoX> eqiad A7 servers uplink move - T187960

Change 497472 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Set s2 on read only

https://gerrit.wikimedia.org/r/497472

Mentioned in SAL (#wikimedia-operations) [2019-03-19T15:21:10Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s2 database master on read only - T187960 (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2019-03-19T15:27:56Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set s2 read only OFF - T187960 (duration: 00m 26s)

For the record, read only times on s2:
Read only ON: 15:21:10
Read only OFF: 15:27:56

ayounsi updated the task description.

Everything here is done, thank you all for your help!

Mentioned in SAL (#wikimedia-operations) [2019-03-20T06:09:58Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool databases in row A - T187960 (duration: 00m 49s)

Change 496724 abandoned by Marostegui:
db-eqiad.php: Promote db1120 as x1 master

Reason:
Not needed

https://gerrit.wikimedia.org/r/496724

Change 496723 abandoned by Marostegui:
mariadb: Promote db1120 as x1 master

Reason:
Not needed

https://gerrit.wikimedia.org/r/496723

Change 496721 abandoned by Marostegui:
db-eqiad.php: Failover db1066 to db1076 on s2

Reason:
Not needed

https://gerrit.wikimedia.org/r/496721

Change 496720 abandoned by Marostegui:
mariadb: Failover db1066 to db1076 on s2

Reason:
Not needed

https://gerrit.wikimedia.org/r/496720

labsdb1009.mgmt (note: the management interface) has been down according to Icinga for 14 hours (around the network maintenance), maybe a loose cable or misconfiguration? Not a huge blocker, but better to make sure it is not an intended/known state.

ipmi works from localhost but fails remotely, so we are pretty confident it is a network connectivity issue and not a hw issue.